Blog on UnWeb

How to Scrape JavaScript-Rendered Pages in Python

Mon, 18 May 2026 00:00:00 +0000

You find a page with the data you need, write a quick scraper, and get back something like this:

import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com/products")
soup = BeautifulSoup(response.text, "html.parser")
print(soup.get_text()) # → mostly empty. Navigation chrome. A loading spinner div.

The page looked full of content in your browser. The scraper returned a skeleton. This is the JavaScript-rendering problem, and it affects a large fraction of the modern web: every React, Vue, Angular, or Next.js site that renders its content on the client, and every SPA that fetches content via XHR after the initial page load.
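One way to catch this early is to measure how much visible text the raw HTML actually contains before trusting it. The following is a minimal sketch using only the standard library; the helper names (`visible_text`, `looks_like_skeleton`) and the 200-character threshold are illustrative assumptions, not part of any library, and the cutoff would need tuning for real sites.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collects text nodes, ignoring tags whose content is never displayed."""

    SKIPPED = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def visible_text(html: str) -> str:
    """Return the text a reader would see, joined with single spaces."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def looks_like_skeleton(html: str, min_chars: int = 200) -> bool:
    """Heuristic: very little visible text suggests an unrendered shell.

    min_chars is an assumed cutoff, not a measured one.
    """
    return len(visible_text(html)) < min_chars
```

Calling `looks_like_skeleton(response.text)` right after `requests.get(...)` gives a cheap signal that the page probably needs a rendering step. It is only a heuristic: a short but legitimate page would also be flagged.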

Web Scraping with Node.js: Clean Markdown from Any URL

Tue, 12 May 2026 00:00:00 +0000

The standard Node.js scraping stack is axios + cheerio. It works fine for static sites. It fails quietly on the modern web.

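const axios = require('axios');
const cheerio = require('cheerio');

// CommonJS modules do not allow top-level await, so wrap the calls in an async function.
async function main() {
  const response = await axios.get('https://some-react-app.com/products');
  const $ = cheerio.load(response.data);
  console.log($('body').text()); // "Loading..."
}

main().catch(console.error);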

The page looks full of content in Chrome. Your scraper gets a shell. Client-side-rendered apps built with React, Next.js, or Vue ship a nearly empty <div id="root"></div> and populate it only after the browser executes the JavaScript bundle. axios.get() downloads the initial HTML and never runs that bundle, so cheerio has nothing to parse but the shell.

Building a Web Research Agent with Python and Claude

Mon, 04 May 2026 00:00:00 +0000

The standard approach to “web research” in LLM applications is: fetch the URL, pass the HTML to the model, ask it to extract what you need. This works in demos. In production, it fails on roughly 40% of real-world URLs — the ones that are JavaScript-rendered, paywalled, login-gated, or just structurally noisy enough to confuse extraction.

The model is not to blame. When you hand Claude a page full of navigation chrome, cookie consent text, related-articles widgets, and advertising tags wrapped around three sentences of actual content, you are asking it to find a needle in a noisy haystack on every single request. Sometimes it works. Often it misses things. The failure mode is silent: the agent reports back, sounds confident, and has extracted from the wrong part of the page.
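To make the noise problem concrete, here is a rough sketch of stripping obvious page chrome before the text reaches the model. It uses only the standard library, and the tag list in `NOISE_TAGS` and the helper name `strip_noise` are assumptions chosen for illustration. Real pages often bury boilerplate in plain `<div>` elements, which this approach will not catch.

```python
from html.parser import HTMLParser

# Assumed list of elements that usually hold chrome rather than content.
NOISE_TAGS = {"nav", "header", "footer", "aside", "form", "script", "style", "noscript"}


class _MainText(HTMLParser):
    """Keeps text nodes that are not nested inside any NOISE_TAGS element."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def strip_noise(html: str) -> str:
    """Return remaining text, one text node per line."""
    parser = _MainText()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

Passing `strip_noise(html)` to the model instead of the raw HTML shrinks the haystack, though it does nothing for pages that are JavaScript-rendered or login-gated in the first place.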

Markdown as Developer Workflow Infrastructure

Tue, 28 Apr 2026 00:00:00 +0000

A few years ago, Markdown was primarily a lightweight authoring format — a way to write documentation without fighting a rich-text editor. That was the extent of its infrastructure role.

Something changed when LLMs became central to developer workflows. Markdown is now the exchange format that holds the modern AI development stack together. Prompts are written in Markdown. Context windows are assembled from Markdown. RAG pipelines store and retrieve Markdown chunks. Documentation sites are generated from Markdown. AI coding assistants communicate through Markdown. The Model Context Protocol (MCP) uses Markdown as its primary content representation.

How to Build a RAG Pipeline with Live Web Data Using Python

Mon, 27 Apr 2026 00:00:00 +0000

You built a RAG pipeline. You seeded your vector store with 800 documentation pages. Retrieval looks great in your smoke tests. Then you ship it, and three days later a user asks a perfectly reasonable question about your product’s API — and your LLM confidently answers with <div id="root"></div>.

That is not a hallucination. That is your pipeline ingesting the skeleton HTML returned by a JavaScript-rendered SPA, embedding it faithfully, and then retrieving it because, token-for-token, it is the closest thing in your index to “what does this endpoint return.” The embeddings matched. The content was garbage.
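A simple guard against this failure is to reject pages with an empty mount node before they are embedded. The sketch below is one possible heuristic, not a complete solution: the `id` values it looks for (`root`, `app`) and the function names are assumptions, and a page can be a useless shell without matching this pattern.

```python
import re

# Matches an empty <div id="root"></div> or <div id="app"></div> mount node.
# The id names are assumed conventions, not an exhaustive list.
EMPTY_MOUNT = re.compile(
    r'<div[^>]*\bid=["\'](?:root|app)["\'][^>]*>\s*</div>',
    re.IGNORECASE,
)


def has_empty_mount_node(html: str) -> bool:
    """True if the HTML contains an unpopulated SPA mount point."""
    return EMPTY_MOUNT.search(html) is not None


def filter_ingestable(pages: dict[str, str]) -> tuple[dict[str, str], list[str]]:
    """Split pages (url -> html) into those safe to embed and rejected URLs."""
    kept: dict[str, str] = {}
    rejected: list[str] = []
    for url, html in pages.items():
        if has_empty_mount_node(html):
            rejected.append(url)
        else:
            kept[url] = html
    return kept, rejected
```

Running `filter_ingestable` between the fetch step and the embedding step means skeleton pages surface as a list of rejected URLs at ingestion time rather than as bad answers in production.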

How to Convert Any Webpage to Markdown from Claude Code with UnWeb MCP

Wed, 01 Apr 2026 00:00:00 +0000

If you’re building RAG pipelines, you’ve hit this wall: you need clean text from web pages, but 30% of them are JS-rendered SPAs that return empty markup. You don’t find out until your retrieval quality craters.

The UnWeb MCP server (@mbsoftsystems/unweb-mcp) gives Claude Code, Cursor, and Windsurf five tools for web-to-markdown conversion — and every response includes a content quality score (0–100) so your agent knows immediately whether the extraction worked.