Building a Web Research Agent with Python and Claude

The standard approach to “web research” in LLM applications is: fetch the URL, pass the HTML to the model, ask it to extract what you need. This works in demos. In production, it fails on roughly 40% of real-world URLs — the ones that are JavaScript-rendered, paywalled, login-gated, or just structurally noisy enough to confuse extraction.

The model is not to blame. When you hand Claude a page full of navigation chrome, cookie consent text, related-articles widgets, and advertising tags wrapped around three sentences of actual content, you are asking it to find a needle in a noisy haystack on every single request. Sometimes it works. Often it misses things. The failure mode is silent: the agent reports back, sounds confident, and has extracted from the wrong part of the page.

This post walks through a complete web research agent that sidesteps that problem entirely. Clean the input first, then extract. The model gets structured prose, not raw DOM.

The Architecture

The agent takes a list of URLs and a research question, visits each URL, extracts clean Markdown, uses Claude to pull structured facts from each page, and then synthesizes a final report. Four layers:

Fetch + clean — UnWeb converts each URL to clean Markdown, returning a quality score
Filter — skip low-quality pages (JS-rendered, empty, login walls) before they waste a Claude API call
Extract — Claude reads clean Markdown and returns structured JSON for each page
Synthesize — Claude receives all page facts and produces a final answer

The clean/filter step is what makes the extract step reliable. When Claude sees clean, dense prose with preserved structure, extraction is far more consistent from run to run. When it sees 4,000 tokens of navigation HTML with the content buried in the middle, it is not.

Setup

pip install unweb anthropic

You need an UnWeb API key (free at app.unweb.info) and an Anthropic API key.

import os
from unweb import UnwebClient
import anthropic

unweb = UnwebClient(api_key=os.environ["UNWEB_API_KEY"])
claude = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

Step 1: Fetch and Filter

For each URL, convert to Markdown and check the quality score before doing anything else:

def fetch_clean_pages(urls: list[str], quality_threshold: int = 40) -> list[dict]:
    """Convert URLs to clean Markdown, filtering low-quality pages."""
    pages = []

    for url in urls:
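        try:
            result = unweb.convert.url(url)
        except Exception as exc:  # one failed fetch should not abort the whole run
            print(f"Failed ({exc}): {url}")
            continue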

        if result.quality_score < quality_threshold:
            print(f"Skipped (score {result.quality_score}): {url}")
            continue

        pages.append({
            "url": url,
            "markdown": result.markdown,
            "quality_score": result.quality_score,
            "title": result.title,
        })
        print(f"Fetched (score {result.quality_score}): {url}")

    return pages

The quality score check is the entire value of this step. Pages that score below 40 are consistently unreliable — skeleton HTML from SPAs, login walls that returned a redirect, nearly-empty pages. Skipping them costs nothing. Passing them to Claude wastes tokens and produces wrong answers.

On thresholds: 40 is a reasonable default for most research tasks. If you are working with technically complex sites (interactive documentation, React SPAs with server-side rendering), you may want to lower it to 35. If you need high precision and can afford to miss some pages, raise it to 60. The score distribution is bimodal in practice: most pages score either above 65 or below 30, with relatively few in the 30–65 range.
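
One way to sanity-check a threshold against your own URL set is to bucket the scores you actually observe before settling on a cutoff. The sketch below is illustrative only: `bucket_scores`, `pass_rate`, and the sample scores are made up for this example, and it assumes scores are integers on a 0 to 100 scale, which the thresholds above suggest but do not confirm.

```python
from collections import Counter


def bucket_scores(scores: list[int], width: int = 10) -> dict[str, int]:
    """Count scores per fixed-width bucket, labelled like '30-39'."""
    counts = Counter((score // width) * width for score in scores)
    return {f"{low}-{low + width - 1}": counts[low] for low in sorted(counts)}


def pass_rate(scores: list[int], threshold: int) -> float:
    """Fraction of pages that would survive a given threshold."""
    if not scores:
        return 0.0
    # Mirrors the filter above: a page is skipped only when score < threshold.
    return sum(score >= threshold for score in scores) / len(scores)


# Hypothetical scores from a trial run, not real measurements.
sample_scores = [12, 18, 25, 28, 44, 67, 71, 72, 80, 88]

for bucket, count in bucket_scores(sample_scores).items():
    print(f"{bucket:>6}: {'#' * count}")

for threshold in (35, 40, 60):
    print(f"threshold {threshold}: {pass_rate(sample_scores, threshold):.0%} pass")
```

If the histogram for your sources shows a clear gap between the two clusters, any threshold inside that gap should behave about the same.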

Step 2: Extract Structured Facts Per Page

With clean Markdown, the extraction prompt is simple. Claude does not need to navigate DOM noise — it can read the page like a human would:

import json

def extract_facts(page: dict, research_question: str) -> dict | None:
    """Extract structured facts relevant to the research question from a clean page."""

    system_prompt = """You are a research assistant. Extract facts relevant to the question from the provided article.
Return a JSON object with these fields:
- relevant: boolean (is this page relevant to the question?)
- facts: list of strings (key facts from this page, each a complete sentence)
- confidence: "high" | "medium" | "low"
- source_quality: brief note on how authoritative this source appears

Return ONLY the JSON object, no other text."""

    user_prompt = f"""Research question: {research_question}

Article from {page['url']}:

{page['markdown'][:8000]}"""  # Truncate to avoid very long pages

    response = claude.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=system_prompt,
        messages=[{"role": "user", "content": user_prompt}]
    )

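    try:
        facts = json.loads(response.content[0].text)
    except json.JSONDecodeError:
        return None  # reply was not valid JSON; treat the page as unusable

    if not isinstance(facts, dict):
        return None  # valid JSON, but not the object the prompt asked for

    facts["url"] = page["url"]
    facts["title"] = page.get("title", "")
    return facts if facts.get("relevant") else None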

The relevant flag matters. Not every page in your input list will actually address the research question — letting Claude self-filter means the synthesis step only receives signal, not noise. A page about company history does not need to go into a report about pricing strategy.
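
Even with a "return ONLY the JSON object" instruction, a model reply can arrive wrapped in a Markdown code fence, which makes a plain `json.loads` fail and silently drops the page. A minimal sketch of a more tolerant parser follows; `parse_json_object` is a hypothetical helper, not part of either SDK, and it handles only the single-fence case.

```python
import json
import re
from typing import Optional

FENCE = "`" * 3  # built at runtime so this example contains no literal fence
_FENCED = re.compile(rf"^{FENCE}(?:json)?\s*\n(.*?)\n?{FENCE}\s*$", re.DOTALL)


def parse_json_object(text: str) -> Optional[dict]:
    """Parse a model reply into a dict, tolerating one surrounding code fence."""
    cleaned = text.strip()
    match = _FENCED.match(cleaned)
    if match:
        cleaned = match.group(1).strip()
    try:
        parsed = json.loads(cleaned)
    except json.JSONDecodeError:
        return None
    # The extraction prompt asks for an object; reject lists, strings, numbers.
    return parsed if isinstance(parsed, dict) else None


fenced_reply = FENCE + 'json\n{"relevant": true, "facts": []}\n' + FENCE
print(parse_json_object(fenced_reply))
print(parse_json_object("Sorry, I could not find anything."))
```

Inside `extract_facts`, this could stand in for the direct `json.loads` call, with a `None` result treated the same way as a decode error.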

Step 3: Synthesize a Final Report

With structured facts from each relevant page, the synthesis prompt is clean:

def synthesize_report(research_question: str, page_facts: list[dict]) -> str:
    """Synthesize a research report from per-page fact extractions."""

    if not page_facts:
        return "No relevant information found across the provided sources."

    facts_summary = ""
    for i, page in enumerate(page_facts, 1):
        # extract_facts always sets "title" (possibly to ""), so fall back with `or`
        facts_summary += f"\n\n### Source {i}: {page.get('title') or page['url']}\n"
        facts_summary += f"URL: {page['url']}\n"
        facts_summary += f"Confidence: {page.get('confidence', 'unknown')}\n"
        for fact in page.get("facts", []):
            facts_summary += f"- {fact}\n"

    response = claude.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": f"""Research question: {research_question}

Facts extracted from {len(page_facts)} sources:
{facts_summary}

Write a clear, direct research report answering the question.
Cite specific sources where relevant. Note any contradictions between sources.
Keep it under 500 words."""
        }]
    )

    return response.content[0].text
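
To see what the synthesis prompt actually receives, it can help to render the facts block on its own. The sketch below rebuilds the same layout with `str.join` instead of repeated concatenation; `format_facts_block` is a hypothetical helper name and the sample records are invented for illustration.

```python
def format_facts_block(page_facts: list[dict]) -> str:
    """Render per-page facts as the Markdown block sent to the synthesis prompt."""
    sections = []
    for i, page in enumerate(page_facts, 1):
        # Fall back to the URL when the title is missing or empty.
        heading = page.get("title") or page["url"]
        lines = [
            f"### Source {i}: {heading}",
            f"URL: {page['url']}",
            f"Confidence: {page.get('confidence', 'unknown')}",
        ]
        lines.extend(f"- {fact}" for fact in page.get("facts", []))
        sections.append("\n".join(lines))
    return "\n\n".join(sections)


# Made-up sample data, shaped like the dicts extract_facts returns.
sample = [
    {
        "url": "https://example.com/product/pricing",
        "title": "Pricing",
        "confidence": "high",
        "facts": ["There are three tiers.", "Annual billing is discounted."],
    },
    {
        "url": "https://example.com/about",
        "title": "",
        "facts": ["The company targets small teams."],
    },
]
print(format_facts_block(sample))
```

Printing this block during development is a cheap way to confirm that each source carries its URL and confidence before any tokens are spent on synthesis.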

Putting It Together

def research_agent(urls: list[str], question: str) -> str:
    print(f"Researching: {question}")
    print(f"Sources: {len(urls)} URLs\n")

    # Step 1: Fetch and filter
    pages = fetch_clean_pages(urls)
    print(f"\nPassed quality filter: {len(pages)}/{len(urls)} pages")

    if not pages:
        return "No usable pages found. All sources failed the quality filter."

    # Step 2: Extract per-page facts
    all_facts = []
    for page in pages:
        facts = extract_facts(page, question)
        if facts:
            all_facts.append(facts)
            print(f"Extracted facts from: {page['url']}")

    print(f"\nRelevant pages: {len(all_facts)}/{len(pages)}")

    # Step 3: Synthesize
    return synthesize_report(question, all_facts)


# Example usage
if __name__ == "__main__":
    urls = [
        "https://example.com/product/pricing",
        "https://example.com/about",
        "https://competitor.com/pricing",
        "https://news.ycombinator.com/item?id=12345",  # May score low — filtered
    ]

    report = research_agent(
        urls=urls,
        question="What are the pricing tiers and who are the target customers?"
    )
    print("\n--- REPORT ---")
    print(report)

Real-World Use Cases

Competitive Intelligence

Feed competitor URLs and ask “What pricing model do they use and what customer segments do they target?” The quality filter handles broken competitor pages gracefully, and the structured extraction gives you comparable facts rather than a wall of prose.

Due Diligence on Vendors

Given a list of vendor documentation and public case study URLs, ask “What are the known limitations and customer complaints?” The per-page relevance filter keeps the synthesis step focused on material about problems rather than marketing copy.

Technical Research

Developer documentation sites are exactly the class of pages where clean Markdown matters most. UnWeb’s converter preserves code blocks, parameter tables, and API references, which tend to collapse into noise in raw HTML extraction. Ask Claude to extract supported parameters, version requirements, or migration notes across multiple docs pages, and the structured output is far more dependable than what you get from raw HTML.

Improvements for Production

Async fetching. The example above fetches URLs sequentially. For lists longer than 5–10 URLs, use asyncio with await unweb.convert.url_async(url) to parallelize the fetch step. UnWeb’s API is designed for concurrent requests.
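
A minimal sketch of the concurrent fetch step is shown here, under stated assumptions: `fetch_all` and `fake_convert` are hypothetical names, the concurrency cap of 5 is arbitrary, and `fake_convert` is a stand-in coroutine so the example runs without network access. In real use you would pass the client's async conversion method instead.

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def fetch_all(
    urls: list[str],
    convert: Callable[[str], Awaitable[dict]],
    max_concurrency: int = 5,
) -> list[Optional[dict]]:
    """Convert URLs concurrently; a failed URL yields None instead of aborting the batch."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def fetch_one(url: str) -> Optional[dict]:
        async with semaphore:  # cap the number of in-flight requests
            try:
                return await convert(url)
            except Exception as exc:
                print(f"Failed: {url} ({exc})")
                return None

    # gather preserves input order, so results line up with urls.
    return list(await asyncio.gather(*(fetch_one(url) for url in urls)))


async def fake_convert(url: str) -> dict:
    """Stand-in for the real conversion call (assumption: it returns a dict-like result)."""
    await asyncio.sleep(0.01)
    if "broken" in url:
        raise RuntimeError("simulated fetch error")
    return {"url": url, "markdown": f"# Page at {url}", "quality_score": 70}


if __name__ == "__main__":
    demo_urls = ["https://example.com/a", "https://example.com/broken", "https://example.com/b"]
    results = asyncio.run(fetch_all(demo_urls, fake_convert))
    print([r["url"] if r else None for r in results])
```

The semaphore matters more than the `gather` call: without a cap, a list of several hundred URLs would open several hundred requests at once and likely hit rate limits.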

Caching. If you are researching the same sources repeatedly (monitoring competitor pages weekly, for example), cache converted Markdown by URL + last-modified. UnWeb returns an etag on each response you can use for conditional re-fetching.
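
The caching idea can be sketched without committing to any particular client API. Everything named below is an assumption for illustration: the `Fetcher` contract (pass the etag you hold, get back `None` for the body when nothing changed) is hypothetical, and `fake_fetch` simulates a page that never changes. How UnWeb actually exposes conditional requests is not shown in this post.

```python
from typing import Callable, Optional

# Hypothetical contract: given a URL and the etag we already hold (or None),
# return (current_etag, markdown), where markdown is None if content is unchanged.
Fetcher = Callable[[str, Optional[str]], tuple[str, Optional[str]]]


def get_markdown(url: str, fetch: Fetcher, cache: dict[str, tuple[str, str]]) -> str:
    """Return Markdown for url, reusing the cached copy when the etag still matches."""
    cached = cache.get(url)
    known_etag = cached[0] if cached else None
    etag, markdown = fetch(url, known_etag)
    if markdown is None:
        if cached is None:
            raise ValueError(f"fetcher reported 'unchanged' for uncached URL: {url}")
        return cached[1]  # unchanged since the last fetch
    cache[url] = (etag, markdown)
    return markdown


calls: list[Optional[str]] = []


def fake_fetch(url: str, known_etag: Optional[str]) -> tuple[str, Optional[str]]:
    """Stand-in for a conditional conversion request; the page never changes."""
    calls.append(known_etag)
    current_etag = "v1"
    if known_etag == current_etag:
        return current_etag, None
    return current_etag, "# Pricing\n\nThree tiers."


cache: dict[str, tuple[str, str]] = {}
first = get_markdown("https://example.com/pricing", fake_fetch, cache)
second = get_markdown("https://example.com/pricing", fake_fetch, cache)  # served from cache
print(first == second, calls)
```

For weekly monitoring, the in-memory dict would need to be swapped for something persistent, such as a JSON file or SQLite table keyed by URL.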

Token budgeting. The example truncates page Markdown at 8,000 characters. For pages with long-form content (detailed technical documentation, lengthy blog posts), increase the limit. For a synthesis task across 20+ sources, consider a two-pass approach: first pass extracts 3–5 bullet facts per page; second pass synthesizes from the bullets rather than the full Markdown.
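
A plain slice like `[:8000]` can cut a page in the middle of a sentence, table, or code block. One possible refinement, sketched below, backs up to the last paragraph break inside the limit; `truncate_markdown` is a hypothetical helper, and the limit is still measured in characters rather than tokens.

```python
def truncate_markdown(markdown: str, max_chars: int = 8000) -> str:
    """Cut Markdown to at most max_chars, preferring a paragraph boundary."""
    if len(markdown) <= max_chars:
        return markdown
    head = markdown[:max_chars]
    # Prefer the last blank line so we do not stop mid-paragraph or mid-table.
    cut = head.rfind("\n\n")
    if cut <= 0:
        # No paragraph break in range: try the last line break instead.
        cut = head.rfind("\n")
    # With no usable break at all, fall back to the hard cut.
    return head[:cut].rstrip() if cut > 0 else head


doc = "First paragraph.\n\nSecond paragraph that is long.\n\nThird."
print(repr(truncate_markdown(doc, max_chars=30)))
```

One trade-off to be aware of: if the only paragraph break sits very early in the page, this version keeps much less than the limit allows, so a minimum-length check may be worth adding for your content.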

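On prompt caching: if you are running the same research question against many pages, the system prompt in extract_facts() is a candidate for Anthropic’s prompt caching. Move the static system instructions to a cached block and repeated calls read that prefix from cache at a reduced input-token cost. One caveat: caching applies only to prefixes above a model-specific minimum length (1,024 tokens or more), so the short system prompt shown here would not be cached by itself. It pays off once the static prefix grows with detailed instructions or few-shot examples.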
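
As a rough sketch of what the cached variant looks like, the system prompt moves from a plain string to a list of content blocks, with `cache_control` marking the static part. `build_extraction_request` is a hypothetical helper that only assembles the keyword arguments, so the example runs without an API key; the model name is the one used earlier in this post.

```python
def build_extraction_request(system_prompt: str, user_prompt: str, model: str) -> dict:
    """Build keyword arguments for messages.create with a cacheable system block."""
    return {
        "model": model,
        "max_tokens": 1024,
        # The system prompt is passed as a list of content blocks so that
        # cache_control can mark the static prefix as cacheable.
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": user_prompt}],
    }


request = build_extraction_request(
    system_prompt="You are a research assistant. Extract facts relevant to the question.",
    user_prompt="Research question: What are the pricing tiers?",
    model="claude-sonnet-4-6",
)
print(request["system"][0]["cache_control"])

# With a real client this would be sent as:
#   response = claude.messages.create(**request)
# The response's usage object reports cache_creation_input_tokens and
# cache_read_input_tokens, which show whether the prefix was actually cached.
```

Checking those usage fields on the second and later calls is the simplest way to confirm the cache is being hit rather than assuming it is.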

Clean input. Better extraction.

UnWeb converts any URL to clean Markdown with a quality score — so your agents skip garbage and extract from signal. Free tier includes 500 conversions per month.
