How to Build a RAG Pipeline with Live Web Data Using Python

You built a RAG pipeline. You seeded your vector store with 800 documentation pages. Retrieval looks great in your smoke tests. Then you ship it, and three days later a user asks a perfectly reasonable question about your product’s API — and your LLM confidently answers with <div id="root"></div>.

That is not a hallucination. That is your pipeline ingesting the skeleton HTML returned by a JavaScript-rendered SPA, embedding it faithfully, and then retrieving it because, token-for-token, it is the closest thing in your index to “what does this endpoint return.” The embeddings matched. The content was garbage.

This happens to more pipelines than anyone admits, and the insidious part is that it fails silently. Your ingestion job reported 800 successful conversions. No exceptions were raised. You just quietly built a vector store full of Loading... placeholders and empty <body> tags dressed up as documentation.

This post walks through how to solve that problem properly — from the naive approach that breaks at scale, to a complete production RAG pipeline with a pre-ingestion quality filter.


The Naive Approach (and Why It Breaks)

The standard answer to “fetch a webpage and extract its text” is BeautifulSoup:

import requests
from bs4 import BeautifulSoup

def fetch_page(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    # Remove noise
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()
    return soup.get_text(separator="\n", strip=True)

This works fine for simple, server-rendered pages. It falls apart in at least three ways as soon as you operate at scale:

You cannot detect empty content. If the page is JS-rendered and returns <div id="root"></div>, get_text() returns an empty string or a handful of navigation labels. You will not know unless you add an explicit length check — which only catches the most egregious cases. A page that returns three sentences of boilerplate “Please enable JavaScript” text will pass that check.

HTML-to-text conversion is lossy in the wrong ways. Raw get_text() collapses all structure. Code blocks become indistinguishable from prose. Tables lose their rows and columns and degrade into a run of loose cell values. Ordered lists merge into a single paragraph. For LLM consumption you want Markdown: a structured intermediate format that preserves semantic meaning without the noise of raw HTML tags.

Main content extraction is its own problem. Every page has navigation, footers, cookie banners, related-articles widgets, and newsletter popups. Stripping <nav> and <footer> is a start, but real-world pages require heuristic extraction — identifying the article body, product description, or documentation section from the surrounding chrome. Writing that reliably across arbitrary websites is a non-trivial project in itself.
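To make the first failure mode concrete, here is a minimal sketch of the kind of check teams end up writing by hand. The function name looks_like_skeleton, the 50-character threshold, and the phrase list are all illustrative assumptions, not a recommended implementation.

```python
# Hypothetical heuristic for spotting skeleton or placeholder pages.
# The threshold and the phrase list are illustrative assumptions.
BOILERPLATE_PHRASES = (
    "please enable javascript",
    "you need to enable javascript",
    "loading...",
)


def looks_like_skeleton(text: str, min_chars: int = 50) -> bool:
    """Return True when extracted text is too short or contains known boilerplate."""
    # Collapse whitespace so padding cannot inflate the length.
    normalized = " ".join(text.split()).lower()
    if len(normalized) < min_chars:
        return True  # the plain length check: catches empty shells only
    # Phrase check: catches boilerplate that is long enough to pass the length check.
    return any(phrase in normalized for phrase in BOILERPLATE_PHRASES)
```

The weakness is visible in the code itself: the phrase list only catches wording you anticipated, and a genuine page that happens to quote one of those phrases gets flagged too.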

You can assemble these pieces manually. Many teams do. But you are probably reading this because you already tried that, and you are now maintaining a fragile stack of CSS selector overrides and per-domain special cases.


UnWeb: Clean Markdown from Any URL, in Five Lines

UnWeb is an API that handles the full pipeline — fetch, extract main content, convert to CommonMark Markdown — and returns a quality score with every response. The quality score is the part that changes how you build RAG systems.

Install the Python SDK:

pip install unweb

Get a free API key at app.unweb.info. The free tier includes 500 credits per month, which is enough to build and test a full pipeline.

Your first conversion:

from unweb import UnwebClient

client = UnwebClient(api_key="unweb_...")
result = client.convert.url("https://docs.example.com/getting-started")

print(result.markdown)       # Clean CommonMark Markdown
print(result.quality_score)  # 0–100

That is the complete surface area for a single-page conversion. The markdown field contains the extracted main content, converted to clean Markdown with code blocks, headers, and lists preserved. The quality_score tells you how confident UnWeb is that the extraction succeeded.


The Quality Score: Your Pre-Ingestion Filter

The quality score is a 0–100 integer that reflects how much meaningful content was extracted relative to what was expected. A score above 70 means clean, dense content. A score below 40 almost always means one of three things: the page is JS-rendered and returned skeleton HTML, it is a login wall, or it is essentially empty (a redirect page, a 404 with a custom template).

The key insight is that you get this information before you embed anything. That changes the architecture: instead of ingesting everything and hoping retrieval quality reveals problems later, you filter at ingestion time.

from unweb import UnwebClient

client = UnwebClient(api_key="unweb_...")

result = client.convert.url("https://docs.example.com/getting-started")

if result.quality_score >= 40:
    # Safe to embed — content was extracted successfully
    chunks = chunk_markdown(result.markdown)
    vector_store.add_documents(chunks)
else:
    # Low quality — likely JS-rendered or empty
    # Log it, skip it, or queue it for headless browser fallback
    print(f"Skipped (score {result.quality_score}): {result.url}")

A threshold of 40 is a reasonable default for most pipelines. Pages that score between 40 and 70 are usually usable but may have missed some secondary content (sidebars, code examples on complex pages). Pages below 40 are reliably problematic: embedding them will hurt retrieval quality more than skipping them.
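If you want to act on all three bands rather than a single cutoff, a small helper keeps the thresholds in one place. This is a sketch: the name classify_score and the labels are hypothetical, and treating exactly 70 as "usable" is an assumption about how to read "above 70".

```python
def classify_score(score: int, skip_below: int = 40, clean_above: int = 70) -> str:
    """Map a 0-100 quality score onto the three bands described above.

    Boundary handling is an assumption: 40 counts as usable (matching the
    >= 40 filter) and only scores strictly above 70 count as clean.
    """
    if score < skip_below:
        return "skip"      # likely JS skeleton, login wall, or empty page
    if score <= clean_above:
        return "usable"    # embed, but secondary content may be missing
    return "clean"         # dense, well-extracted content
```

Logging the band alongside each URL makes it easy to see later whether a poor answer came from a borderline page.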

One honest note on limitations: UnWeb does not run a headless browser. If a page requires JavaScript execution to render its content, UnWeb will return a low quality score rather than serving you the skeleton. That is the correct behavior — you now know to handle that URL differently (queue it for Playwright, skip it, or use a JS-rendering fallback). What you do not want is a tool that silently returns <div id="root"></div> as if it succeeded.


Building a Complete RAG Pipeline

For production RAG pipelines, you typically need to crawl an entire documentation site rather than convert individual URLs. UnWeb’s crawler handles this and, critically, exports directly to LangChain’s Document format — no transformation code needed.

Step 1: Start a Crawl

from unweb import UnwebClient

client = UnwebClient(api_key="unweb_...")

job = client.crawl.start(
    "https://docs.example.com",
    allowed_paths=["/docs/"],  # Restrict to documentation pages only
)

print(f"Crawl started: {job.id}")

The allowed_paths parameter keeps the crawler within the subdirectory you care about — you do not want your RAG pipeline ingesting blog posts, marketing pages, and changelog entries unless you have a specific reason to.
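Conceptually, a path restriction like this is a prefix test on the URL path. The sketch below illustrates that idea only. It is not UnWeb's documented matching logic, and the function name is_allowed and the same-host rule are assumptions.

```python
from urllib.parse import urlparse


def is_allowed(url: str, root: str, allowed_paths: list[str]) -> bool:
    """Illustrative prefix filter: same host as root, path under an allowed prefix."""
    target, base = urlparse(url), urlparse(root)
    if target.netloc != base.netloc:
        return False  # never follow links off the documentation host
    return any(target.path.startswith(prefix) for prefix in allowed_paths)
```

Under this reading, a prefix of "/docs/" with a trailing slash would not match the bare path "/docs", so check how your site's index page is addressed before relying on it.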

Step 2: Wait for Completion

import time

while True:
    status = client.crawl.status(job.id)
    print(f"Status: {status.state}, {status.pages_crawled} pages crawled")
    if status.state in ("completed", "failed"):
        break
    time.sleep(10)
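The loop above polls until the job reaches a terminal state, however long that takes. In a scheduled job you may prefer a deadline. Here is a sketch of a generic polling helper; wait_for_terminal_state and its parameters are hypothetical names, and it takes a callable so that it does not depend on any particular SDK.

```python
import time
from typing import Callable


def wait_for_terminal_state(
    get_state: Callable[[], str],
    timeout_s: float = 1800.0,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_state() until it returns "completed" or "failed", or time out."""
    deadline = time.monotonic() + timeout_s
    while True:
        state = get_state()
        if state in ("completed", "failed"):
            return state
        # Give up if the next sleep would carry us past the deadline.
        if time.monotonic() + interval_s > deadline:
            raise TimeoutError(
                f"crawl did not finish within {timeout_s} s (last state: {state!r})"
            )
        sleep(interval_s)
```

With the client from Step 1 you might call it as wait_for_terminal_state(lambda: client.crawl.status(job.id).state), and treat a returned "failed" as an error before attempting the download.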

Step 3: Download as LangChain Documents

# Returns a list of LangChain Document objects — ready to pass to any LangChain component
docs = client.crawl.download(job.id, format="langchain")

print(f"Downloaded {len(docs)} documents")
# Each document has .page_content (Markdown) and .metadata (url, title, quality_score, etc.)

Step 4: Filter by Quality Score and Load into Your Vector Store

from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import MarkdownTextSplitter

# Filter out low-quality pages before embedding
clean_docs = [d for d in docs if d.metadata.get("quality_score", 0) >= 40]
skipped = len(docs) - len(clean_docs)
print(f"Embedding {len(clean_docs)} pages, skipped {skipped} low-quality pages")

# Split Markdown into chunks — the splitter respects Markdown structure
splitter = MarkdownTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(clean_docs)

# Load into your vector store
embeddings = OpenAIEmbeddings()
vector_store = Chroma.from_documents(chunks, embeddings)
print(f"Vector store built from {len(chunks)} chunks")

The MarkdownTextSplitter is worth calling out explicitly: because UnWeb returns clean CommonMark Markdown rather than raw text, the splitter can prefer Markdown structural boundaries (headings, fenced code blocks, horizontal rules) and fall back to paragraph and line breaks only when a section is larger than chunk_size. This tends to produce more semantically coherent chunks than cutting at arbitrary character offsets, and those chunks retrieve better.
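To show what splitting at a structural boundary means in practice, here is a deliberately simplified sketch that cuts only at heading lines and ignores anything inside a fenced code block. It is not LangChain's implementation: it applies no size limit, and split_on_headings is a hypothetical name.

```python
import re

HEADING = re.compile(r"^#{1,6} ")
FENCE = "`" * 3  # the three-backtick code fence marker


def split_on_headings(markdown: str) -> list[str]:
    """Split Markdown into sections that each begin at a heading line.

    Lines inside fenced code blocks are never treated as headings, so a
    shell comment that starts with '# ' stays with its code block.
    """
    sections: list[list[str]] = [[]]
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        elif not in_fence and HEADING.match(line) and sections[-1]:
            sections.append([])  # a heading outside a fence opens a new section
        sections[-1].append(line)
    # Drop sections that contain only blank lines.
    return ["\n".join(s).strip() for s in sections if any(ln.strip() for ln in s)]
```

Raw get_text() output has no headings or fences left to find, which is why this kind of splitting depends on the Markdown conversion happening first.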


Production Tips

Batch Conversions with Quality Reporting

When converting individual URLs rather than crawling, batch your requests and collect quality metrics:

urls = [
    "https://docs.example.com/api/overview",
    "https://docs.example.com/api/authentication",
    "https://docs.example.com/api/endpoints",
    # ...
]

results = [client.convert.url(u) for u in urls]

clean_docs = [r for r in results if r.quality_score >= 40]
dirty_urls = [urls[i] for i, r in enumerate(results) if r.quality_score < 40]

print(f"Successfully converted: {len(clean_docs)}")
print(f"Filtered out {len(dirty_urls)} low-quality pages:")
for url in dirty_urls:
    print(f"  - {url}")

Log the dirty_urls list somewhere persistent. These are pages that either need a headless browser fallback or should be removed from your source list. Do not let them silently stay in your retry queue where they will just fail again.
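One lightweight way to make that list persistent is an append-only JSON Lines file. The sketch below assumes you have (url, score) pairs to hand; the function name log_skipped and the record fields are illustrative choices.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def log_skipped(path: str, skipped: list[tuple[str, int]]) -> int:
    """Append one JSON line per skipped URL and return how many were written."""
    stamp = datetime.now(timezone.utc).isoformat()
    with Path(path).open("a", encoding="utf-8") as fh:
        for url, score in skipped:
            record = {"url": url, "quality_score": score, "skipped_at": stamp}
            fh.write(json.dumps(record) + "\n")
    return len(skipped)
```

Because the file is append-only, repeated runs build a history, and a URL that shows up week after week is a clear candidate for a headless browser fallback or removal.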

Error Handling

The SDK raises typed exceptions you can catch specifically:

import time

from unweb import UnwebClient
from unweb.exceptions import UnwebRateLimitError, UnwebAuthError, UnwebAPIError

client = UnwebClient(api_key="unweb_...")

def safe_convert(url: str, max_retries: int = 3):
    """Convert a URL; return None if it fails or scores below the threshold."""
    for attempt in range(max_retries + 1):
        try:
            result = client.convert.url(url)
            return result if result.quality_score >= 40 else None
        except UnwebRateLimitError:
            # You have hit your credit rate limit: back off, then retry
            # a bounded number of times instead of recursing forever
            if attempt == max_retries:
                print(f"Rate limit retries exhausted for {url}")
                return None
            time.sleep(60)
        except UnwebAuthError:
            raise  # Do not retry auth errors; fail fast
        except UnwebAPIError as e:
            print(f"API error for {url}: {e}")
            return None

Credit Planning

Each URL conversion costs 1 credit. A crawl job costs 1 credit per page discovered and converted. At 500 free credits per month you can ingest a 400-page documentation site once, with 100 credits to spare. The Starter plan at $12/month gives you 2,000 credits. At four to five weekly runs per month, that covers a full weekly re-ingestion of a site of roughly 400 to 500 pages.
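The budgeting arithmetic is simple enough to script. This sketch assumes the pricing stated above (1 credit per page) and that every run re-converts every page; the function names are hypothetical.

```python
def credits_needed(pages: int, runs_per_month: int) -> int:
    """Credits used per month when every run converts every page once."""
    return pages * runs_per_month


def max_pages_per_run(monthly_credits: int, runs_per_month: int) -> int:
    """Largest site (in pages) that fits the budget at the given cadence."""
    return monthly_credits // runs_per_month


print(credits_needed(400, 1))        # one full ingestion of a 400-page site
print(max_pages_per_run(2000, 4))    # weekly cadence in a four-run month
print(max_pages_per_run(2000, 5))    # weekly cadence in a five-run month
```

If your site is larger than the result, the same arithmetic tells you how far to stretch the re-ingestion interval.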

For reference, Firecrawl’s comparable tier starts at $16/month and does not include a quality score. You would need to build and maintain your own content quality detection on top of their output — which brings you back to the brittle heuristics problem this post started with.


Wrapping Up

The skeleton HTML problem is not rare, and it does not announce itself. JS-rendered pages look like successful HTTP responses. They produce zero exceptions at ingestion time. They only surface when a user asks a question your pipeline should be able to answer, and your LLM returns three lines of boilerplate.

The fix is a quality gate at ingestion time, not a debugging session after the fact. UnWeb gives you that gate as a first-class API response field — a score you can act on before you commit a single chunk to your vector store.

If you are building a pipeline and hit an edge case — a site that scores unexpectedly low, a crawler behavior that does not match what you expected — the quality score gives you exactly the information you need to triage it. That is the point. Your RAG pipeline should know what it does not know.