How to Scrape JavaScript-Rendered Pages in Python

You find a page with the data you need, write a quick scraper, and get back something like this:

import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com/products")
soup = BeautifulSoup(response.text, "html.parser")
print(soup.get_text())  # → mostly empty. Navigation chrome. A loading spinner div.

The page looked full of content in your browser. The scraper returns a skeleton. This is the JavaScript-rendering problem, and it affects a large fraction of the modern web: React, Vue, Angular, and Next.js sites that render their content on the client, and any SPA that fetches content via XHR after the initial page load.

This post covers why this happens, what the standard solutions look like (and where they fall short), and a pattern that works cleanly for LLM pipelines where content quality matters.


Why requests Returns Empty Content

When your browser visits a modern web page, it does several things in sequence:

  1. The server returns an HTML document — often just a shell: a <div id="root"> or <div id="app"> with no content inside it
  2. The browser downloads and executes JavaScript bundles
  3. The JavaScript framework makes API calls to fetch actual data
  4. The framework injects rendered HTML into the DOM

requests.get() only does step 1. It gets the raw HTML document that the server sends, before any JavaScript runs. For a JavaScript-rendered SPA, that document contains no content — just the scaffold that the framework fills in after execution.

What the server actually sends might look like this:

<!DOCTYPE html>
<html>
  <head>
    <title>Products</title>
    <script src="/static/js/main.abc123.js"></script>
  </head>
  <body>
    <div id="root"></div>
  </body>
</html>

The <div id="root"> is empty. The 800KB JavaScript bundle that fills it in never runs when you use requests.
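
One way to confirm you are looking at a shell is to measure how much visible text the raw response contains. The sketch below uses only the standard library; the helper names and the 200-character cutoff are illustrative assumptions, not a standard.

```python
from html.parser import HTMLParser

# Tags whose text never reaches the page body.
INVISIBLE_TAGS = {"script", "style", "noscript", "template", "title"}


class VisibleTextParser(HTMLParser):
    """Collects text that sits outside script/style-like elements."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in INVISIBLE_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in INVISIBLE_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore whitespace-only runs between tags.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> str:
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def looks_like_shell(html: str, min_chars: int = 200) -> bool:
    """True when the document carries almost no readable text."""
    return len(visible_text(html)) < min_chars


SHELL_HTML = """
<!DOCTYPE html>
<html>
  <head>
    <title>Products</title>
    <script src="/static/js/main.abc123.js"></script>
  </head>
  <body>
    <div id="root"></div>
  </body>
</html>
"""

print(looks_like_shell(SHELL_HTML))  # True
```

Running a check like this on `response.text` before parsing tells you early whether `requests` alone is enough for a given site.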


The Standard Solutions

Playwright and Selenium (Headless Browser)

The textbook fix: spin up a real browser, wait for JavaScript to execute, then extract the rendered DOM.

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/products")
    page.wait_for_load_state("networkidle")  # Wait for JS to settle
    content = page.content()  # Fully rendered HTML
    browser.close()

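This works, but it has real costs: you have to install, run, and maintain browser infrastructure, which is heavy compared with a plain HTTP request.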

requests-html

The requests-html library wraps requests and renders JavaScript by driving a headless Chromium through pyppeteer, which it downloads on first use. Its API is simpler than Playwright’s, but it is still a full browser under the hood, the project is effectively abandoned (the last PyPI release dates from 2019), and it has known issues with modern JavaScript frameworks. It’s not a reliable option for production pipelines.

Scrapy + Splash

Splash is a dedicated JavaScript rendering service you run separately. Your Scrapy spider sends requests to Splash, which renders the page and returns the HTML. This works well at scale, but requires running and maintaining the Splash service, which adds operational complexity and another component to your infrastructure.
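
For reference, Splash also exposes rendering over plain HTTP, so you can try it without Scrapy. The sketch below builds a request to its `render.html` endpoint; the `localhost:8050` address and the two-second wait are assumptions about your deployment, and the helper names are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def splash_render_url(target: str,
                      splash_base: str = "http://localhost:8050",
                      wait: float = 2.0) -> str:
    """Build a render.html URL asking Splash to load `target` and wait."""
    query = urlencode({"url": target, "wait": wait})
    return f"{splash_base}/render.html?{query}"


def fetch_rendered_html(target: str, timeout: float = 30.0) -> str:
    """Fetch the rendered DOM from a running Splash instance (assumed to exist)."""
    with urlopen(splash_render_url(target), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


print(splash_render_url("https://example.com/products"))
```

Only `splash_render_url` runs here; `fetch_rendered_html` needs a live Splash service to return anything.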


The Problem with Rendered HTML for LLM Pipelines

Even when you solve the rendering problem with Playwright or Splash, you are left with raw HTML — and raw HTML is a poor input for LLMs.

A fully rendered page can easily run to 10,800 tokens of raw HTML, of which perhaps 40% is the content you actually care about. The model has to find the signal inside the noise on every single request. The extraction is rarely wrong in an obvious way; it produces answers that look reasonable but are drawn from the wrong part of the page.
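
A rough way to see this ratio on your own pages is to compare the amount of main-content text with the size of the full markup. The sketch below does this by character count using only the standard library; characters are a crude stand-in for tokens, and the set of tags treated as layout noise is an assumption to tune per site.

```python
from html.parser import HTMLParser

# Elements treated as layout chrome or non-content (an assumption).
NOISE_TAGS = {"script", "style", "nav", "header", "footer", "aside", "noscript"}


class ContentTextParser(HTMLParser):
    """Counts text characters that fall outside the noise elements."""

    def __init__(self) -> None:
        super().__init__()
        self._noise_depth = 0
        self.text_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self._noise_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self._noise_depth > 0:
            self._noise_depth -= 1

    def handle_data(self, data):
        if self._noise_depth == 0:
            self.text_chars += len(data.strip())


def content_ratio(html: str) -> float:
    """Fraction of the document's characters that are main-content text."""
    if not html:
        return 0.0
    parser = ContentTextParser()
    parser.feed(html)
    parser.close()
    return parser.text_chars / len(html)
```

A low ratio on a page you know to be content-rich is a hint that most of what the model would read is markup and navigation.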

There is a second failure mode: quality. A headless browser might execute the JavaScript and still get back skeleton HTML, a login wall, or a Cloudflare challenge page.

All of these look like non-empty HTML to a naive scraper. Without a quality check, your pipeline processes them and produces wrong answers from garbage input.
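
A naive local check for failure modes such as a blocked request, a login wall, or an empty skeleton might look like the sketch below. The marker phrases and the 200-character cutoff are illustrative assumptions rather than a complete detector; real challenge and login pages vary widely.

```python
import re

# Hypothetical marker phrases; extend these for the sites you actually crawl.
BLOCK_MARKERS = ("checking your browser", "verify you are human", "access denied")
LOGIN_MARKERS = ("sign in to continue", "log in to continue", 'type="password"')


def strip_tags(html: str) -> str:
    """Crude text extraction: drop script/style blocks, then all tags."""
    without_code = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", without_code)
    return re.sub(r"\s+", " ", text).strip()


def classify_render(html: str, min_chars: int = 200) -> str:
    """Return 'blocked', 'login_wall', 'skeleton', or 'ok'."""
    lowered = html.lower()
    if any(marker in lowered for marker in BLOCK_MARKERS):
        return "blocked"
    if any(marker in lowered for marker in LOGIN_MARKERS):
        return "login_wall"
    if len(strip_tags(html)) < min_chars:
        return "skeleton"
    return "ok"
```

Hand-written heuristics like these tend to be brittle, which is part of the motivation for a scored approach.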


A Quality-Gated Approach

The pattern that works well for LLM pipelines is to convert each page to clean Markdown with a quality score, then filter on that score before passing anything to a model.

pip install unweb

import os
from unweb import UnwebClient

client = UnwebClient(api_key=os.environ["UNWEB_API_KEY"])

result = client.convert.url("https://example.com/products")

print(f"Quality score: {result.quality_score}")  # 0-100
print(f"Title: {result.title}")
print(result.markdown[:500])  # Clean prose, code blocks preserved, no nav chrome

The response includes a quality_score from 0 to 100. Pages that rendered correctly and contain real content score above 60. Skeleton HTML, login walls, and Cloudflare challenges score below 30. The score gives you a gate:

def fetch_page(url: str, min_quality: int = 40) -> dict | None:
    result = client.convert.url(url)

    if result.quality_score < min_quality:
        print(f"Low quality ({result.quality_score}), skipping: {url}")
        return None

    return {
        "url": url,
        "markdown": result.markdown,
        "quality_score": result.quality_score,
        "title": result.title,
    }

What you get back is clean, dense Markdown. Navigation chrome is stripped. Code blocks are preserved with their language annotations. Tables are converted to Markdown tables. Images have their alt text. The structure of the content is retained; the structural noise of the page layout is removed.


Comparison: Approach by Approach

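Approach                 | JS rendering | Quality detection | LLM-ready output     | Complexity
-------------------------+--------------+-------------------+----------------------+----------------------
requests + BeautifulSoup | No           | No                | No (raw HTML)        | Minimal
Playwright (headless)    | Yes          | No                | No (raw HTML)        | High (browser infra)
Scrapy + Splash          | Yes          | No                | No (raw HTML)        | High (Splash service)
UnWeb                    | Yes          | Yes (0–100 score) | Yes (clean Markdown) | Minimal (API call)

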
Bulk Processing with Async

For processing multiple URLs, use the async client to parallelize requests:

import asyncio
import os
from unweb import AsyncUnwebClient

async def fetch_pages(urls: list[str], min_quality: int = 40) -> list[dict]:
    client = AsyncUnwebClient(api_key=os.environ["UNWEB_API_KEY"])

    async def fetch_one(url: str) -> dict | None:
        result = await client.convert.url_async(url)
        if result.quality_score < min_quality:
            return None
        return {
            "url": url,
            "markdown": result.markdown,
            "quality_score": result.quality_score,
            "title": result.title,
        }

    results = await asyncio.gather(*[fetch_one(url) for url in urls])
    return [r for r in results if r is not None]


# Usage
async def main():
    urls = [
        "https://example.com/products",
        "https://docs.example.com/api",
        "https://example.com/pricing",
    ]
    pages = await fetch_pages(urls)
    print(f"Fetched {len(pages)}/{len(urls)} pages above quality threshold")
    for page in pages:
        print(f"  [{page['quality_score']}] {page['title']} ({page['url']})")

asyncio.run(main())

Concurrent requests typically reduce total fetch time by 60–80% compared to sequential processing, with no additional infrastructure required.
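
On large URL lists, a bare `asyncio.gather` starts every request at once, which can trip rate limits. A common refinement is to cap concurrency with a semaphore. The sketch below uses a stand-in `fake_fetch` coroutine instead of a real client so that it runs anywhere; the limit of 5 is an arbitrary assumption.

```python
import asyncio


async def fake_fetch(url: str) -> dict:
    """Stand-in for a real conversion call; sleeps to simulate latency."""
    await asyncio.sleep(0.01)
    return {"url": url, "quality_score": 80}


async def fetch_bounded(urls: list[str], max_concurrency: int = 5) -> list[dict]:
    semaphore = asyncio.Semaphore(max_concurrency)

    async def fetch_one(url: str) -> dict:
        # At most max_concurrency fetches are in flight at any moment.
        async with semaphore:
            return await fake_fetch(url)

    # gather preserves input order, so results line up with urls.
    return await asyncio.gather(*(fetch_one(url) for url in urls))


if __name__ == "__main__":
    demo_urls = [f"https://example.com/page/{i}" for i in range(12)]
    pages = asyncio.run(fetch_bounded(demo_urls, max_concurrency=3))
    print(f"Fetched {len(pages)} pages")
```

Swapping `fake_fetch` for the real client call keeps the same structure.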


Practical Quality Thresholds

The right quality threshold depends on what you do with the content.

Score distribution in practice: scores are bimodal. Most pages score either above 65 (rendered correctly, real content) or below 30 (skeleton HTML, blocked, login wall), and the 30–65 range is relatively sparse. A score of 35–55 often means partial rendering: some content is there, but the JavaScript had not fully settled when the page was captured. Whether to include these pages depends on how much noise you can tolerate downstream.
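
The bands described above can be turned into a simple routing step. In this sketch the cutoffs of 30 and 65 come from that distribution, while the band names and the idea of a separate review bucket are assumptions.

```python
def score_band(quality_score: int) -> str:
    """Map a 0-100 quality score to a coarse handling decision."""
    if not 0 <= quality_score <= 100:
        raise ValueError(f"quality_score out of range: {quality_score}")
    if quality_score < 30:
        return "reject"   # skeleton HTML, blocked, or login wall
    if quality_score <= 65:
        return "review"   # possibly partial rendering
    return "accept"       # rendered correctly, real content


def partition(pages: list[dict]) -> dict[str, list[dict]]:
    """Group page dicts (each with a 'quality_score' key) by band."""
    buckets: dict[str, list[dict]] = {"accept": [], "review": [], "reject": []}
    for page in pages:
        buckets[score_band(page["quality_score"])].append(page)
    return buckets
```

Pages in the review bucket could be refetched later or passed along with a warning, depending on how tolerant the downstream step is.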


When to Use Playwright Instead

UnWeb is the right choice for most LLM and data extraction pipelines. There are cases where you specifically need Playwright:

For the common case — fetch a URL, get the text content, pass it to a model — the headless browser approach adds infrastructure complexity without improving the quality of what the model receives. The quality-scored Markdown approach is faster, lighter, and produces better model inputs.


Integration with LLM Pipelines

With clean, quality-gated Markdown, the LLM integration step is straightforward:

import asyncio
import os

import anthropic

claude = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

def extract_from_page(page: dict, question: str) -> str:
    response = claude.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"""Question: {question}

Page: {page['title']} ({page['url']})

{page['markdown'][:6000]}

Answer the question based on this page. If the page doesn't contain relevant information, say so."""
        }]
    )
    return response.content[0].text


# Example
async def main():
    pages = await fetch_pages([
        "https://example.com/pricing",
        "https://docs.example.com/limits",
    ])

    for page in pages:
        answer = extract_from_page(page, "What are the rate limits?")
        print(f"\n--- {page['title']} ---")
        print(answer)

asyncio.run(main())

The model gets clean prose. No DOM navigation required in the prompt. No “ignore the navigation, look only at the main content area” hedging. The quality filter already ensured the page has real content. The extraction is reliable.
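
One detail in the prompt above is the hard `[:6000]` slice, which can cut the page mid-sentence. A gentler option is to truncate at a paragraph boundary. The helper below is a hypothetical sketch; its 6000-character default mirrors the slice in the example.

```python
def truncate_markdown(markdown: str, max_chars: int = 6000) -> str:
    """Keep whole blank-line-separated paragraphs up to max_chars."""
    if len(markdown) <= max_chars:
        return markdown

    kept: list[str] = []
    used = 0
    for paragraph in markdown.split("\n\n"):
        # +2 accounts for the blank-line separator restored on join.
        cost = len(paragraph) + (2 if kept else 0)
        if used + cost > max_chars:
            break
        kept.append(paragraph)
        used += cost

    # Fall back to a hard cut if the first paragraph alone is too long.
    return "\n\n".join(kept) if kept else markdown[:max_chars]
```

In `extract_from_page`, this would replace the slice with `truncate_markdown(page['markdown'])`. A fenced code block that contains blank lines can still be split by this approach, so treat it as a starting point.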

Stop feeding your models skeleton HTML.

UnWeb converts any URL to clean Markdown — JavaScript-rendered or not — with a content quality score on every response. Free tier: 500 conversions per month.
