Extracting Structured Data from Any Web Page with Python and Claude

Extracting structured data from web pages is one of those problems that sounds solved until you actually do it at scale. CSS selectors work fine, until the site redesigns and every selector breaks. Regex on HTML works fine, until it doesn’t and the edge cases multiply. LLMs on raw HTML should work in theory, until you discover that well over half of a typical product page is navigation, ads, footer links, and script tags, and that a 50,000-token blob of it gives the model very little to work with.

The approach that works reliably: convert the page to clean Markdown first, then send the Markdown to Claude for structured extraction. The Markdown step strips noise and preserves content. The Claude step handles format variation, leaves fields it cannot find as null, and returns JSON that validates against a typed schema, with no custom selectors for each site layout.

This post walks through the full implementation: fetching and filtering pages, defining extraction schemas, handling batch jobs, and validating output with Pydantic.

Why Raw HTML Breaks LLM Extraction

Consider a typical e-commerce product page. You want: title, price, rating, review count, description, availability. The data is there. But when you send the raw HTML to Claude, you are also sending:

Navigation menus, mega-menus, and dropdown trees
Cookie consent banners and modal overlays
Inline JavaScript — event handlers, analytics payloads, A/B test configuration
Schema.org JSON-LD blobs (useful if the site has them, noise if they’re stale)
Recommendation carousels (“Customers also bought” — 40+ product names)
Review pagination, filters, and sort controls
Footer links, legal disclaimers, social media widgets

On a moderately complex product page, this noise accounts for 70–85% of the token count. The extraction prompt has to work harder, the model’s attention is diluted across irrelevant content, and field extraction becomes less reliable — especially for fields that appear in multiple contexts (the word “price” appears in navigation “Compare prices,” in competitor ads, in review text, and in the actual product price).

Clean Markdown removes all of that. What remains is a linear, readable representation of the actual page content, which is far closer to ordinary prose than tag soup is and much easier for the model to read field values from.
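
A rough way to check this on your own pages is to measure how much of the visible text sits inside elements that rarely carry product data. The sketch below uses only the standard library. The tag list in NOISE_TAGS and the helper name noise_ratio are assumptions for illustration, and it counts text characters only, so it is a proxy for token share rather than a measurement of it.

```python
from html.parser import HTMLParser

# Assumption: these tags rarely contain the fields you want to extract.
NOISE_TAGS = {"script", "style", "nav", "footer", "header", "aside"}


class NoiseMeter(HTMLParser):
    """Counts text characters inside and outside NOISE_TAGS."""

    def __init__(self):
        super().__init__()
        self.depth = 0            # number of noise elements currently open
        self.noise_chars = 0
        self.content_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        n = len(data.strip())
        if self.depth > 0:
            self.noise_chars += n
        else:
            self.content_chars += n


def noise_ratio(html: str) -> float:
    """Fraction of text characters that sit inside noise elements."""
    meter = NoiseMeter()
    meter.feed(html)
    meter.close()
    total = meter.noise_chars + meter.content_chars
    return meter.noise_chars / total if total else 0.0
```

Running noise_ratio over a handful of saved pages from a target site gives a quick sense of how much there is to strip before extraction.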

Setup

pip install unweb anthropic pydantic

import os
import json
import asyncio
from typing import Optional
from pydantic import BaseModel, ValidationError

import unweb
import anthropic

unweb_client = unweb.Client(api_key=os.environ["UNWEB_API_KEY"])
claude = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

Step 1: Fetch and Filter Pages

Before sending anything to Claude, use UnWeb’s quality score to skip pages that won’t produce usable content. Pages scoring below 35 are typically login walls, error pages, Cloudflare challenges, or skeleton HTML from JavaScript-heavy sites where the content never loaded.

def fetch_page(url: str, quality_threshold: int = 35) -> Optional[str]:
    """
    Fetch a URL and return clean Markdown.
    Returns None if the page is below quality threshold (login wall, error, etc).
    """
    result = unweb_client.convert.url(url)

    if result.quality_score < quality_threshold:
        print(f"Skipped {url} — quality score {result.quality_score} below threshold")
        return None

    # Cap the input size. 6000 characters is roughly 1,500 tokens of English text,
    # so raise the cap if the fields you need sit further down the page.
    return result.markdown[:6000]

Quality score guide: below 30 is skeleton HTML (JavaScript hasn’t rendered, or the request was blocked). 30–55 is thin content: navigation-heavy pages, stub pages, or low-value landing pages. 55–80 is usable but may contain noise. 80 and above is dense, content-rich pages. The default of 35 is a starting point; set your threshold based on the site type: 40 works well for e-commerce, 50 for news, and 30 for developer documentation, which tends to be content-dense.
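
One way to keep those thresholds out of call sites is a small lookup table. This is a sketch: the site-type labels, the helper names, and the choice to treat each band's lower bound as inclusive are assumptions, not part of UnWeb's API.

```python
# Thresholds from the guide above; the site-type keys are illustrative labels.
QUALITY_THRESHOLDS = {"ecommerce": 40, "news": 50, "docs": 30}
DEFAULT_THRESHOLD = 35


def threshold_for(site_type: str) -> int:
    """Return the quality threshold for a site type, or the default."""
    return QUALITY_THRESHOLDS.get(site_type, DEFAULT_THRESHOLD)


def describe_score(score: float) -> str:
    """Map a quality score to the bands described above (lower bound inclusive)."""
    if score < 30:
        return "skeleton"
    if score < 55:
        return "thin"
    if score < 80:
        return "usable"
    return "dense"
```

A call then reads fetch_page(url, quality_threshold=threshold_for("news")), and describe_score is handy for logging why a page was skipped.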

Step 2: Define Your Extraction Schema

Use Pydantic to define exactly what you want back. The schema serves two purposes: it generates the description Claude uses to understand the extraction task, and it validates the output before you use it downstream. Invalid or missing fields surface immediately instead of silently corrupting your dataset.

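from pydantic import Field  # descriptions are read back when building the prompt


class ProductData(BaseModel):
    name: str = Field(description="Product name")
    price: Optional[float] = Field(default=None, description="Numeric price; null if not found")
    currency: Optional[str] = Field(default=None, description='Currency code such as "USD" or "EUR"')
    rating: Optional[float] = Field(default=None, description="Rating from 0.0 to 5.0")
    review_count: Optional[int] = Field(default=None, description="Number of reviews")
    availability: Optional[str] = Field(default=None, description='e.g. "In stock", "Ships in 3 days"')
    description: Optional[str] = Field(default=None, description="First 300 chars of product description")
    brand: Optional[str] = Field(default=None, description="Brand name")


class ArticleData(BaseModel):
    title: str = Field(description="Article title")
    author: Optional[str] = Field(default=None, description="Author name")
    published_date: Optional[str] = Field(default=None, description="ISO 8601 if possible, verbatim otherwise")
    summary: str = Field(description="2-3 sentence summary")
    word_count_estimate: Optional[int] = Field(default=None, description="Estimated word count")
    primary_topic: Optional[str] = Field(default=None, description="Single topic label")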

Step 3: Extract with Claude

Send the Markdown to Claude with instructions to return JSON matching your schema. The system prompt is static across calls, which makes it a candidate for prompt caching when processing many pages, provided the cached prefix is long enough to meet the model’s minimum cacheable length. The user message is the page content plus a schema description generated from the Pydantic model.

def extract_structured_data(markdown: str, schema: type[BaseModel]) -> dict:
    """
    Extract structured data from Markdown using Claude.
    Returns a dict matching the schema. Raises json.JSONDecodeError if the
    response is not valid JSON, or ValidationError if it does not match the schema.
    """
    # Build schema description from Pydantic model
    fields = []
    for field_name, field_info in schema.model_fields.items():
        annotation = field_info.annotation
        description = field_info.description or ""
        fields.append(f"- {field_name} ({annotation}): {description}")
    schema_desc = "\n".join(fields)

    system_prompt = (
        "You are a data extraction assistant. Extract structured data from web page "
        "content provided as Markdown. Return only valid JSON matching the requested schema. "
        "Use null for fields you cannot find. Do not hallucinate data."
    )

    user_prompt = f"""Extract data from this web page content:

---
{markdown}
---

Return a JSON object with these fields:
{schema_desc}

Respond with only the JSON object. No explanation, no markdown fences."""

    message = claude.messages.create(
        model="claude-haiku-4-5-20251001",   # fast and cheap for extraction
        max_tokens=1024,
        system=system_prompt,
        messages=[{"role": "user", "content": user_prompt}]
    )

    raw_json = message.content[0].text.strip()

    # Strip markdown fences if the model added them anyway
    if raw_json.startswith("```"):
        raw_json = raw_json.split("```")[1]
        if raw_json.startswith("json"):
            raw_json = raw_json[4:]

    data = json.loads(raw_json)  # Raises json.JSONDecodeError on malformed output
    # model_validate also rejects non-object JSON (e.g. a list) with ValidationError
    validated = schema.model_validate(data)
    return validated.model_dump()
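
The fence-stripping above covers the common case. If you also see responses with a sentence of prose before the JSON, a slightly more defensive parser can help. This is a sketch using only the standard library; parse_json_object is a hypothetical helper name, and the first-brace-to-last-brace fallback is a heuristic that assumes the response contains a single JSON object.

```python
import json


def parse_json_object(text: str) -> dict:
    """Parse a JSON object from a model response that may include fences or prose."""
    text = text.strip()
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        # Fall back to the span between the first "{" and the last "}"
        start, end = text.find("{"), text.rfind("}")
        if start == -1 or end <= start:
            raise
        parsed = json.loads(text[start:end + 1])
    if not isinstance(parsed, dict):
        raise ValueError(f"Expected a JSON object, got {type(parsed).__name__}")
    return parsed
```

Because json.JSONDecodeError is a subclass of ValueError, callers can catch ValueError to cover both the malformed and the wrong-shape cases.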

Step 4: Putting It Together

def extract_from_url(url: str, schema: type[BaseModel]) -> Optional[dict]:
    """Fetch a URL and extract structured data in one call."""
    markdown = fetch_page(url)
    if markdown is None:
        return None

    try:
        return extract_structured_data(markdown, schema)
    except (json.JSONDecodeError, ValidationError) as e:
        print(f"Extraction failed for {url}: {e}")
        return None


# Single URL
product = extract_from_url("https://example.com/products/widget", ProductData)
if product:
    # Print the extracted currency instead of assuming dollars
    print(f"{product['name']}: {product['price']} {product['currency']} ({product['review_count']} reviews)")

# Article
article = extract_from_url("https://example.com/blog/post-123", ArticleData)
if article:
    print(f"{article['title']} — {article['published_date']}")
    print(article['summary'])

Step 5: Batch Processing

For lists of URLs, async fetching significantly improves throughput. UnWeb’s API supports concurrent requests; use asyncio to parallelize the fetch step while keeping Claude calls sequential (or use a semaphore to control concurrency).

async def fetch_page_async(url: str, quality_threshold: int = 35) -> tuple[str, Optional[str]]:
    """Async version of fetch_page — returns (url, markdown_or_None)."""
    result = await unweb_client.convert.url_async(url)
    if result.quality_score < quality_threshold:
        return url, None
    return url, result.markdown[:6000]


async def batch_extract(urls: list[str], schema: type[BaseModel], concurrency: int = 5) -> list[dict]:
    """
    Fetch all URLs concurrently, then extract structured data from each.
    Skips failed fetches and extraction errors.
    """
    sem = asyncio.Semaphore(concurrency)

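    async def fetch_with_sem(url):
        async with sem:
            try:
                return await fetch_page_async(url)
            except Exception as e:
                # One failed fetch should not abort the whole batch
                print(f"Fetch failed {url}: {e}")
                return url, None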

    # Fetch all pages concurrently
    fetch_tasks = [fetch_with_sem(url) for url in urls]
    fetch_results = await asyncio.gather(*fetch_tasks)

    results = []
    for url, markdown in fetch_results:
        if markdown is None:
            continue
        try:
            data = extract_structured_data(markdown, schema)
            data["_source_url"] = url
            results.append(data)
        except Exception as e:
            print(f"Failed {url}: {e}")

    return results


# Usage
product_urls = [
    "https://shop.example.com/products/item-1",
    "https://shop.example.com/products/item-2",
    "https://shop.example.com/products/item-3",
]

products = asyncio.run(batch_extract(product_urls, ProductData))
print(f"Extracted {len(products)} products from {len(product_urls)} URLs")
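
If the sequential Claude step becomes the bottleneck, the same semaphore pattern can bound concurrent extraction calls. The sketch below runs a blocking extract function in worker threads via asyncio.to_thread. The name extract_concurrently and its parameters are assumptions for illustration, and it assumes the client behind the extract function is safe to call from several threads at once.

```python
import asyncio
from typing import Callable, Optional


async def extract_concurrently(
    pages: list[tuple[str, str]],        # (url, markdown) pairs
    extract: Callable[[str], dict],      # blocking function taking Markdown
    concurrency: int = 3,
) -> list[dict]:
    """Run a blocking extract function over pages with bounded concurrency."""
    sem = asyncio.Semaphore(concurrency)

    async def run_one(url: str, markdown: str) -> Optional[dict]:
        async with sem:
            try:
                # Worker thread keeps the event loop free while the call blocks
                data = await asyncio.to_thread(extract, markdown)
            except Exception as e:
                print(f"Failed {url}: {e}")
                return None
        return {**data, "_source_url": url}

    results = await asyncio.gather(*(run_one(u, m) for u, m in pages))
    return [r for r in results if r is not None]
```

With the functions from this post, the extract argument could be functools.partial(extract_structured_data, schema=ProductData). Keep the concurrency value low enough to stay inside your API rate limits.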

Comparison: Selectors vs. LLM on HTML vs. UnWeb + LLM

Approach	Reliability	Maintenance	New sites	Cost
CSS selectors / XPath	Breaks on redesign	Per-site selectors	Custom work each time	Free
LLM on raw HTML	Inconsistent — noise dilutes accuracy	No selectors needed	Works out of the box	High token cost
UnWeb + LLM (this post)	Consistent — noise removed before extraction	No selectors needed	Works out of the box	UnWeb API + LLM tokens (lower than raw HTML)

The token cost point is significant: a typical product page as raw HTML runs 15,000–40,000 tokens per page. The same page as clean Markdown is typically 1,000–4,000 tokens. At scale, the UnWeb conversion cost is offset by the reduction in LLM token spend.
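
To see where the break-even sits for your own workload, the arithmetic is short. In this sketch the per-million-token price is a placeholder you must replace with your model's actual input price, and the function names are illustrative. The token counts in the usage note are picked from within the ranges quoted above.

```python
def input_token_cost(pages: int, tokens_per_page: int, usd_per_million_tokens: float) -> float:
    """Input-token spend for a batch of pages."""
    return pages * tokens_per_page * usd_per_million_tokens / 1_000_000


def breakeven_conversion_price(raw_tokens: int, md_tokens: int, usd_per_million_tokens: float) -> float:
    """Highest per-page conversion price at which converting first is still cheaper."""
    return (raw_tokens - md_tokens) * usd_per_million_tokens / 1_000_000
```

With a placeholder price of 1.00 USD per million input tokens, 10,000 pages at 20,000 raw-HTML tokens each cost 200.00 USD in input tokens, against 20.00 USD at 2,000 Markdown tokens each. Under the same placeholder, conversion pays for itself as long as it costs less than about 0.018 USD per page. The result scales linearly with the real price.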

Handling Extraction Failures

Three failure modes to handle explicitly:

Quality filter rejects. The page was a login wall, Cloudflare challenge, or 404. Handled above — fetch_page returns None and you skip it.

JSON parse error. Claude returned something that wasn’t valid JSON. This is rare with current models, but wrap the parse in a try/except and log the raw response for inspection. It usually means edge-case content confused the model, or the response hit max_tokens and was cut off mid-object.

Schema validation error. Claude returned valid JSON, but required fields are missing or the types are wrong. Pydantic surfaces this immediately. Either the page didn’t have the data (acceptable — use Optional fields broadly) or the model misidentified the field (worth logging and reviewing).

def extract_with_fallback(url: str, schema: type[BaseModel]) -> dict:
    """
    Extract with explicit error categorization.
    Returns a result dict with 'status' field: 'ok', 'low_quality', 'parse_error', 'validation_error'.
    """
    markdown = fetch_page(url)
    if markdown is None:
        return {"status": "low_quality", "url": url}

    try:
        data = extract_structured_data(markdown, schema)
        return {"status": "ok", "url": url, **data}
    except json.JSONDecodeError as e:
        return {"status": "parse_error", "url": url, "error": str(e)}
    except ValidationError as e:
        return {"status": "validation_error", "url": url, "error": str(e)}
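
Once every result carries a status, a batch can be summarized in a few lines. This is a minimal sketch over result dicts shaped like the ones returned above; summarize is a hypothetical helper name.

```python
from collections import Counter


def summarize(results: list[dict]) -> dict:
    """Tally result dicts by their 'status' field and list URLs worth reviewing."""
    counts = Counter(r["status"] for r in results)
    total = len(results)
    return {
        "total": total,
        "by_status": dict(counts),
        "success_rate": counts["ok"] / total if total else 0.0,
        # Parse and validation errors point at prompt or schema problems
        "needs_review": [
            r["url"] for r in results
            if r["status"] in ("parse_error", "validation_error")
        ],
    }
```

A falling success rate on a site that used to extract cleanly is often the first sign that its layout or bot protection changed.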

Production Considerations

Model choice. The examples above use claude-haiku-4-5-20251001, a small, fast model that suits high-volume extraction where per-page cost and latency matter. Reserve claude-sonnet-4-6 for synthesis steps where reasoning quality matters more than throughput.

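Prompt caching. The system prompt in extract_structured_data is identical across all calls, so it can be marked for Anthropic’s prompt caching and read from cache at a reduced input-token price on later calls. Caching only takes effect once the cached prefix reaches the model’s minimum cacheable length, and the short system prompt shown above is well under it. The benefit appears once you extend that prompt with detailed instructions or few-shot examples.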
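
The request shape for a cached system prompt looks roughly like this. The sketch only builds the keyword arguments, so it runs without network access. build_cached_request is a hypothetical helper name, while the cache_control block follows the format in Anthropic's Messages API.

```python
def build_cached_request(
    system_prompt: str,
    user_prompt: str,
    model: str,
    max_tokens: int = 1024,
) -> dict:
    """Build keyword arguments for messages.create with a cacheable system prompt."""
    return {
        "model": model,
        "max_tokens": max_tokens,
        # The system prompt is passed as a list of content blocks so that
        # cache_control can be attached to the static part.
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": user_prompt}],
    }
```

Pass the result as claude.messages.create(**kwargs), then inspect message.usage.cache_creation_input_tokens and message.usage.cache_read_input_tokens on the responses to confirm whether the cache is actually being written and read.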

Schema versioning. As your data needs evolve, version your Pydantic models explicitly. Records extracted under an older schema will not contain fields you added later, and nothing flags the gap unless each record carries the schema version it was extracted with. Store the schema version alongside the extracted data.
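
A lightweight way to do that is to stamp each record at extraction time. In this sketch the version numbers, the underscore-prefixed field names, and both helper names are assumptions for illustration.

```python
from datetime import datetime, timezone

# Hypothetical version numbers; bump one whenever that model's fields change.
SCHEMA_VERSIONS = {"ProductData": 2, "ArticleData": 1}


def stamp_record(data: dict, schema_name: str) -> dict:
    """Attach schema name, schema version, and extraction time to a record."""
    return {
        **data,
        "_schema": schema_name,
        "_schema_version": SCHEMA_VERSIONS[schema_name],
        "_extracted_at": datetime.now(timezone.utc).isoformat(),
    }


def needs_reextraction(record: dict) -> bool:
    """True if the record predates the current version of its schema."""
    current = SCHEMA_VERSIONS.get(record.get("_schema"))
    return current is None or record.get("_schema_version") != current
```

Filtering stored records with needs_reextraction gives you the list of URLs to rerun after a schema change, instead of re-extracting everything.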

Rate limits. UnWeb’s free tier includes 500 conversions/month. For batch jobs, watch the rate limit headers on the API response and back off on 429s. The async batch function above handles concurrency — pair it with exponential backoff on rate limit errors for production workloads.
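
A minimal backoff wrapper might look like the following. RateLimited is a placeholder exception: substitute whatever your HTTP client or SDK raises on a 429. The helper name, the default delays, and the jitter range are assumptions rather than recommended values.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Placeholder for whatever your client raises on HTTP 429."""


def with_backoff(
    call: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `call`, retrying on RateLimited with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise
            # Delay doubles each attempt up to max_delay; jitter spreads out retries
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay * random.uniform(0.5, 1.0))
    raise ValueError("max_attempts must be at least 1")
```

Usage is with_backoff(lambda: fetch_page(url)), after adapting the except clause to the real exception type. The sleep parameter exists so tests can inject a stub instead of actually waiting.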

Reusing Markdown across tasks. If you need to extract multiple schemas from the same URL (e.g., both product data and review data), fetch the Markdown once and call extract_structured_data multiple times against the same string. No need to re-fetch.
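
A sketch of that pattern, with the fetch and extract steps passed in as callables so the function stays independent of any particular client. extract_many is a hypothetical helper name.

```python
from typing import Callable, Optional


def extract_many(
    url: str,
    schemas: dict[str, type],
    fetch: Callable[[str], Optional[str]],
    extract: Callable[[str, type], dict],
) -> dict[str, Optional[dict]]:
    """Fetch a URL once, then run one extraction per schema on the same Markdown."""
    markdown = fetch(url)  # one conversion, however many schemas follow
    if markdown is None:
        return {name: None for name in schemas}

    results: dict[str, Optional[dict]] = {}
    for name, schema in schemas.items():
        try:
            results[name] = extract(markdown, schema)
        except Exception as e:
            # A failure for one schema should not discard the others
            print(f"{name} extraction failed for {url}: {e}")
            results[name] = None
    return results
```

With the functions from this post, a call could look like extract_many(url, {"product": ProductData, "article": ArticleData}, fetch_page, extract_structured_data). A review schema would slot in the same way once you define one.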

Clean Markdown in, structured JSON out.

UnWeb converts any URL to clean Markdown with a quality score — so your extraction pipeline skips noise and works from signal. Free tier includes 500 conversions per month.
