You find a page with the data you need, write a quick scraper, and get back something like this:
import requests
from bs4 import BeautifulSoup
response = requests.get("https://example.com/products")
soup = BeautifulSoup(response.text, "html.parser")
print(soup.get_text()) # → mostly empty. Navigation chrome. A loading spinner div.
The page looked full of content in your browser. The scraper returns a skeleton. This is the JavaScript-rendering problem, and it affects a large fraction of the modern web: React, Vue, and Angular apps that render on the client, Next.js pages that load their data client-side, and any SPA that fetches content via XHR after the initial page load.
This post covers why this happens, what the standard solutions look like (and where they fall short), and a pattern that works cleanly for LLM pipelines where content quality matters.
Why requests Returns Empty Content
When your browser visits a modern web page, it does several things in sequence:
- The server returns an HTML document, often just a shell: a <div id="root"> or <div id="app"> with no content inside it
- The browser downloads and executes JavaScript bundles
- The JavaScript framework makes API calls to fetch actual data
- The framework injects rendered HTML into the DOM
requests.get() only does step 1. It gets the raw HTML document that the server sends, before any JavaScript runs. For a JavaScript-rendered SPA, that document contains no content — just the scaffold that the framework fills in after execution.
What the server actually sends might look like this:
<!DOCTYPE html>
<html>
<head>
<title>Products</title>
<script src="/static/js/main.abc123.js"></script>
</head>
<body>
<div id="root"></div>
</body>
</html>
The <div id="root"> is empty. The 800KB JavaScript bundle that fills it in never runs when you use requests.
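One way to notice this situation programmatically is to measure how much visible text the raw response carries. The following is a minimal sketch using only the standard library; the helper names (`visible_text`, `looks_like_js_shell`) and the 200-character cutoff are assumptions made for illustration, not part of any library mentioned in this post.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collects text a reader would see, skipping script/style-like tags."""

    SKIP = {"script", "style", "noscript", "template"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-whitespace text that sits outside skipped tags.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> str:
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def looks_like_js_shell(html: str, min_chars: int = 200) -> bool:
    """Heuristic: scripts present but almost no visible text (cutoff is arbitrary)."""
    return "<script" in html.lower() and len(visible_text(html)) < min_chars


shell_html = """<!DOCTYPE html>
<html>
<head>
<title>Products</title>
<script src="/static/js/main.abc123.js"></script>
</head>
<body>
<div id="root"></div>
</body>
</html>"""

print(looks_like_js_shell(shell_html))  # True
```

A check like this is only a hint: a short but legitimate page will also trip it, so treat the result as a reason to look closer rather than a verdict.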
The Standard Solutions
Playwright and Selenium (Headless Browser)
The textbook fix: spin up a real browser, wait for JavaScript to execute, then extract the rendered DOM.
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/products")
    page.wait_for_load_state("networkidle")  # Wait for JS to settle
    content = page.content()  # Fully rendered HTML
    browser.close()
This works. It also has real costs:
- Speed: launching a browser and waiting for networkidle is 3–10x slower than a simple HTTP request. For pipelines processing dozens of URLs, this adds up fast.
- Resource usage: each Chromium instance uses 200–400MB of RAM. Running 10 concurrent scrapers requires 2–4GB just for the browsers.
- Anti-bot detection: headless Chromium has known fingerprinting signatures. Sites with bot detection (Cloudflare, PerimeterX) will frequently challenge or block headless browsers.
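- Infrastructure: headless Chromium does not need a display server (Xvfb is only required for headed mode) or a GPU, but CI/CD runners and Docker images still need the browser binaries and their system libraries installed, which makes images larger and builds slower.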
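The memory figure is the easiest of these costs to plan for. Here is a back-of-the-envelope sketch, assuming the 200–400MB per-instance range quoted above holds for your pages; the function name is made up for illustration.

```python
def browser_ram_estimate_mb(
    concurrency: int,
    per_instance_mb: tuple[int, int] = (200, 400),  # range quoted in the list above
) -> tuple[int, int]:
    """Return the (low, high) RAM estimate in MB for N concurrent browsers."""
    low, high = per_instance_mb
    return concurrency * low, concurrency * high


low, high = browser_ram_estimate_mb(10)
print(f"10 browsers: {low / 1000:.0f}-{high / 1000:.0f} GB")  # 10 browsers: 2-4 GB
```

Real usage varies with page weight and the number of open tabs, so measure on your own workload before sizing machines.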
requests-html
The requests-html library renders pages by driving a headless Chromium through pyppeteer, which it downloads on first use, so it is not meaningfully lighter than running a browser yourself. The project is also effectively abandoned (its last release dates back several years) and has known issues with modern JavaScript frameworks. It’s not a reliable option for production pipelines.
Scrapy + Splash
Splash is a dedicated JavaScript rendering service you run separately. Your Scrapy spider sends requests to Splash, which renders the page and returns the HTML. This works well at scale, but requires running and maintaining the Splash service, which adds operational complexity and another component to your infrastructure.
The Problem with Rendered HTML for LLM Pipelines
Even when you solve the rendering problem with Playwright or Splash, you are left with raw HTML — and raw HTML is a poor input for LLMs.
A fully rendered page might look like this in raw HTML:
- 6,000 tokens of content
- 2,500 tokens of navigation, header, footer, sidebar
- 1,500 tokens of cookie consent, newsletter popups, related articles widgets
- 800 tokens of class names, style attributes, and aria labels
Out of 10,800 tokens, only about 56% (the 6,000 tokens of content) is what you actually care about. The model has to find the signal inside the noise on every single request. The resulting errors are rarely obvious: the model produces answers that look reasonable but are drawn from the wrong part of the page.
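Totalling the illustrative breakdown above makes the overhead explicit. This is a small sketch; the numbers are the example figures from the list, not measurements.

```python
# Example token counts from the breakdown above (illustrative, not measured).
token_breakdown = {
    "content": 6_000,
    "navigation, header, footer, sidebar": 2_500,
    "cookie consent, popups, related-article widgets": 1_500,
    "class names, style attributes, aria labels": 800,
}


def content_share(breakdown: dict[str, int], key: str = "content") -> float:
    """Fraction of all tokens that belong to the given category."""
    return breakdown[key] / sum(breakdown.values())


total = sum(token_breakdown.values())
share = content_share(token_breakdown)
print(f"{total} tokens total, {share:.0%} content, {1 - share:.0%} overhead")
# 10800 tokens total, 56% content, 44% overhead
```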
There is a second failure mode: quality. A headless browser might execute the JavaScript and get back:
- A Cloudflare challenge page
- A login wall with a redirect
- A “loading…” skeleton while an XHR call is still pending
- A cookie consent modal blocking the content
All of these look like non-empty HTML to a naive scraper. Without a quality check, your pipeline processes them and produces wrong answers from garbage input.
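A rough version of such a check can be built from text heuristics alone. The sketch below is deliberately naive: the marker phrases, the labels, and the 50-word cutoff are all assumptions chosen for illustration, and a real pipeline would need to tune them against the sites it actually fetches.

```python
import re

# Hypothetical marker phrases; adjust them for the sites you scrape.
BLOCK_MARKERS = {
    "challenge": ("just a moment", "checking your browser", "verify you are human"),
    "login_wall": ("sign in to continue", "log in to continue", "please log in"),
    "loading": ("loading...", "loading…"),
    "consent": ("accept all cookies", "we use cookies"),
}


def classify_page(text: str, min_words: int = 50) -> str:
    """Label extracted page text as 'ok', 'too_short', or a suspected block type."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    if len(normalized.split()) >= min_words:
        # Long pages are assumed to hold real content, even if they mention cookies.
        return "ok"
    for label, markers in BLOCK_MARKERS.items():
        if any(marker in normalized for marker in markers):
            return label
    return "too_short"


print(classify_page("Just a moment... Checking your browser before accessing the site"))
# challenge
```

Keyword matching like this catches the obvious cases and misses the rest, which is why a single numeric quality signal is easier to gate on than a growing list of special cases.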
A Quality-Gated Approach
The pattern that works well for LLM pipelines is: convert to clean Markdown with a quality score, filter on score before passing anything to a model.
pip install unweb
import os
from unweb import UnwebClient
client = UnwebClient(api_key=os.environ["UNWEB_API_KEY"])
result = client.convert.url("https://example.com/products")
print(f"Quality score: {result.quality_score}") # 0-100
print(f"Title: {result.title}")
print(result.markdown[:500]) # Clean prose, code blocks preserved, no nav chrome
The response includes a quality_score from 0 to 100. Pages that rendered correctly and contain real content score above 60. Skeleton HTML, login walls, and Cloudflare challenges score below 30. The score gives you a gate:
def fetch_page(url: str, min_quality: int = 40) -> dict | None:
    result = client.convert.url(url)
    if result.quality_score < min_quality:
        print(f"Low quality ({result.quality_score}), skipping: {url}")
        return None
    return {
        "url": url,
        "markdown": result.markdown,
        "quality_score": result.quality_score,
        "title": result.title,
    }
What you get back is clean, dense Markdown. Navigation chrome is stripped. Code blocks are preserved with their language annotations. Tables are converted to Markdown tables. Images have their alt text. The structure of the content is retained; the structural noise of the page layout is removed.
Comparison: Approach by Approach
| Approach | JS rendering | Quality detection | LLM-ready output | Complexity |
|---|---|---|---|---|
| requests + BeautifulSoup | No | No | No (raw HTML) | Minimal |
| Playwright (headless) | Yes | No | No (raw HTML) | High (browser infra) |
| Scrapy + Splash | Yes | No | No (raw HTML) | High (Splash service) |
| UnWeb | Yes | Yes (0–100 score) | Yes (clean Markdown) | Minimal (API call) |
Bulk Processing with Async
For processing multiple URLs, use the async client to parallelize requests:
import asyncio
import os
from unweb import AsyncUnwebClient
async def fetch_pages(urls: list[str], min_quality: int = 40) -> list[dict]:
    client = AsyncUnwebClient(api_key=os.environ["UNWEB_API_KEY"])

    async def fetch_one(url: str) -> dict | None:
        result = await client.convert.url_async(url)
        if result.quality_score < min_quality:
            return None
        return {
            "url": url,
            "markdown": result.markdown,
            "quality_score": result.quality_score,
            "title": result.title,
        }

    results = await asyncio.gather(*[fetch_one(url) for url in urls])
    return [r for r in results if r is not None]

# Usage
async def main():
    urls = [
        "https://example.com/products",
        "https://docs.example.com/api",
        "https://example.com/pricing",
    ]
    pages = await fetch_pages(urls)
    print(f"Fetched {len(pages)}/{len(urls)} pages above quality threshold")
    for page in pages:
        print(f"  [{page['quality_score']}] {page['title']} - {page['url']}")

asyncio.run(main())
Concurrent requests typically reduce total fetch time by 60–80% compared to sequential processing, with no additional infrastructure required.
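With long URL lists you will probably want to cap how many requests are in flight at once, both to respect rate limits and to keep failures contained. Here is a minimal sketch using `asyncio.Semaphore`; `gather_bounded` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for whatever coroutine performs the real network call.

```python
import asyncio
from collections.abc import Awaitable, Callable


async def gather_bounded(
    urls: list[str],
    fetch: Callable[[str], Awaitable[object]],
    limit: int = 5,
) -> list[object]:
    """Run fetch(url) for every URL with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(url: str) -> object:
        async with semaphore:
            return await fetch(url)

    # return_exceptions=True returns failures in place instead of raising the first one.
    return await asyncio.gather(*(guarded(u) for u in urls), return_exceptions=True)


async def fake_fetch(url: str) -> str:
    # Stand-in for a real network call.
    await asyncio.sleep(0.01)
    return f"fetched {url}"


demo_urls = [f"https://example.com/page/{i}" for i in range(10)]
demo_results = asyncio.run(gather_bounded(demo_urls, fake_fetch, limit=3))
print(len(demo_results))  # 10
```

Results come back in the same order as the input URLs, so failed entries (exception objects) can be matched to their URL by index and retried or logged.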
Practical Quality Thresholds
The right quality threshold depends on what you do with the content:
- LLM extraction (factual): use 50+. You want high-confidence content. Borderline pages waste API calls and produce unreliable extractions.
- RAG ingestion: use 40+. For vector store pipelines, you can tolerate slightly lower quality pages as long as you are embedding the Markdown, not the raw HTML.
- General research / reading: use 30+. If you are just browsing and summarizing, a lower bar is acceptable.
- Link following / crawling: use 20+. When crawling a site to find pages worth processing, a low threshold lets you visit everything and filter the output.
Score distribution in practice: scores tend to be bimodal. Most pages land either above 65 (rendered correctly, real content) or below 30 (skeleton HTML, blocked, login wall), and the 30–65 range is relatively sparse. A score of 35–55 often means partial rendering: some content is there, but the JavaScript had not fully settled when the page was captured. Whether to include these pages depends on how much noise you can tolerate downstream.
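These thresholds are easy to keep in one place so each pipeline stage applies the same rule. Below is a small sketch; the use-case keys and band labels are made-up names, and the numbers are simply the ones suggested in this section.

```python
# Thresholds from this section; the dictionary keys are hypothetical labels.
MIN_QUALITY = {
    "extraction": 50,  # factual LLM extraction
    "rag": 40,         # vector store ingestion
    "research": 30,    # general reading and summarizing
    "crawl": 20,       # link following
}


def accept(score: int, use_case: str) -> bool:
    """True if the score clears the bar for the given use case."""
    return score >= MIN_QUALITY[use_case]


def score_band(score: int) -> str:
    """Map a score onto the three bands described above."""
    if score > 65:
        return "rendered"
    if score < 30:
        return "blocked_or_skeleton"
    return "partial"


print(accept(45, "rag"), accept(45, "extraction"), score_band(45))
# True False partial
```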
When to Use Playwright Instead
UnWeb is the right choice for most LLM and data extraction pipelines. There are cases where you specifically need Playwright:
- Interaction required: if you need to click buttons, fill forms, scroll to trigger lazy loading, or navigate pagination, you need a real browser. Conversion APIs operate on a static URL.
- Screenshot capture: if your pipeline needs visual output (screenshots for visual grounding, for example), Playwright is the right tool.
- Highly authenticated flows: if every page requires a session cookie from a complex login flow, a browser that maintains cookie state across requests is more practical.
For the common case — fetch a URL, get the text content, pass it to a model — the headless browser approach adds infrastructure complexity without improving the quality of what the model receives. The quality-scored Markdown approach is faster, lighter, and produces better model inputs.
Integration with LLM Pipelines
With clean, quality-gated Markdown, the LLM integration step is straightforward:
import asyncio
import os

import anthropic

claude = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

def extract_from_page(page: dict, question: str) -> str:
    # Only the first 6,000 characters of Markdown are sent, to bound prompt size.
    response = claude.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"""Question: {question}
Page: {page['title']} ({page['url']})
{page['markdown'][:6000]}
Answer the question based on this page. If the page doesn't contain relevant information, say so."""
        }]
    )
    return response.content[0].text

# Example (reuses fetch_pages from the bulk processing section)
async def main():
    pages = await fetch_pages([
        "https://example.com/pricing",
        "https://docs.example.com/limits",
    ])
    for page in pages:
        answer = extract_from_page(page, "What are the rate limits?")
        print(f"\n--- {page['title']} ---")
        print(answer)

asyncio.run(main())
The model gets clean prose. No DOM navigation required in the prompt. No “ignore the navigation, look only at the main content area” hedging. The quality filter already ensured the page has real content. The extraction is reliable.
Stop feeding your models skeleton HTML.
UnWeb converts any URL to clean Markdown — JavaScript-rendered or not — with a content quality score on every response. Free tier: 500 conversions per month.