A few years ago, Markdown was primarily a lightweight authoring format — a way to write documentation without fighting a rich-text editor. That was the extent of its infrastructure role.
Something changed when LLMs became central to developer workflows. Markdown is now the exchange format that holds the modern AI development stack together. Prompts are written in Markdown. Context windows are assembled from Markdown. RAG pipelines store and retrieve Markdown chunks. Documentation sites are generated from Markdown. AI coding assistants communicate through Markdown. Tools exposed over MCP commonly return their results as Markdown-formatted text.
This is not accidental. Markdown has the properties this role demands: it is plain text that can be read and split on headings without a dedicated parser, it preserves semantic structure (headings, lists, code blocks), it is more token-efficient than the equivalent HTML, and it degrades gracefully when read by a model that has no formal grammar for it.
The consequence is that the boundary between “content the developer wrote” and “content the system processes” has collapsed. Your workflow infrastructure is Markdown. Which means the quality of Markdown you get from external sources — the web, APIs, documentation sites — directly determines the quality of everything downstream.
The Web-to-Markdown Problem Is Infrastructure Now
The moment you decide to ground an LLM on live web content — a documentation page, a competitor’s pricing table, a changelog, an API reference — you have a web-to-Markdown extraction problem. How you solve it determines whether your pipeline is reliable.
The naive approach is to requests.get(url) and run html2text on the response. This works for simple static pages. It fails systematically for everything else: JavaScript-rendered SPAs that return a skeleton <div id="root"></div>, pages where the navigation and cookie banners account for 60% of the extracted text, documentation sites where every page repeats the same sidebar content, and pages that serve a different response to scrapers than to browsers.
The output of the naive approach is not “slightly worse Markdown.” It is structurally corrupted Markdown that silently embeds incorrect content into whatever pipeline receives it. A RAG store populated with navigation-polluted extractions will match queries against sidebar text. An LLM prompt assembled from JS-skeleton pages will confidently produce wrong answers. The pipeline will appear to work until a user notices it is wrong.
The failure mode that matters: Web-to-Markdown extraction failures are largely invisible at ingestion time. No exceptions are raised. No error codes are returned. The content lands in your pipeline looking like data, and the corruption only surfaces when a user asks a question that should be answerable.
What Good Web-to-Markdown Extraction Actually Requires
There are a few hard requirements that distinguish production-grade extraction from hobbyist scraping.
JavaScript rendering
A growing fraction of the web requires JavaScript execution to render its actual content. React, Vue, Next.js with client-side rendering — these pages return a minimal HTML shell to the HTTP client and populate content via JavaScript that runs in the browser. A plain HTTP fetch will never see this content.
Production extraction needs a headless browser (for example, Chromium driven by Playwright) that can execute JavaScript and wait for the content to hydrate before extracting text. This is significantly more expensive than a plain HTTP request, so you do not want to run it on every URL. You want to know which pages actually require it.
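One cheap way to guess which pages need rendering is to look at the HTML a plain fetch returned. The sketch below uses only the standard library and flags a page when it loads scripts but shows almost no visible text. The function name `probably_needs_rendering` and the 50-word cutoff are assumptions made for illustration, not part of any particular tool.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Counts words a reader would see, ignoring script/style/noscript bodies."""

    SKIP = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.words = 0
        self.scripts = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.scripts += 1
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.words += len(data.split())


def probably_needs_rendering(html: str, min_words: int = 50) -> bool:
    """Heuristic (assumed cutoff): scripts present but very little visible text."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return parser.scripts > 0 and parser.words < min_words
```

A check like this will misjudge some pages (a short static page with an analytics script, for instance), so it is better used to decide what to retry than as a final verdict.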
Content isolation
Raw HTML-to-Markdown conversion does not know what is article content versus navigation, footer, cookie banner, or sidebar. The converter will faithfully turn all of it into Markdown, because it has no semantic model of what you want.
Good extraction uses content detection — identifying the main content zone of a page and extracting that specifically. This is what makes the difference between Markdown that contains a clean article and Markdown that contains the article plus 400 words of “Related posts,” “Subscribe to our newsletter,” and “Accept all cookies.”
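As a rough illustration of content isolation, the sketch below keeps text found inside `<main>` or `<article>` and drops anything nested in `<nav>`, `<aside>`, or `<footer>`. It relies on semantic HTML tags being present, which many real pages do not guarantee, and production extractors use far richer signals. The class and function names are hypothetical.

```python
from html.parser import HTMLParser

# Assumed tag lists for this sketch; real pages often need more than tag names.
BOILERPLATE_TAGS = {"nav", "aside", "footer", "script", "style", "noscript"}
MAIN_TAGS = {"main", "article"}


class MainContentExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self._boilerplate_depth = 0
        self._main_depth = 0
        self.main_parts = []  # text inside <main>/<article>
        self.all_parts = []   # all non-boilerplate text, used as a fallback

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE_TAGS:
            self._boilerplate_depth += 1
        elif tag in MAIN_TAGS:
            self._main_depth += 1

    def handle_endtag(self, tag):
        if tag in BOILERPLATE_TAGS and self._boilerplate_depth > 0:
            self._boilerplate_depth -= 1
        elif tag in MAIN_TAGS and self._main_depth > 0:
            self._main_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())
        if not text or self._boilerplate_depth > 0:
            return
        self.all_parts.append(text)
        if self._main_depth > 0:
            self.main_parts.append(text)


def extract_main_text(html: str) -> str:
    """Return main-zone text, falling back to all non-boilerplate text."""
    parser = MainContentExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.main_parts or parser.all_parts)
```

The fallback matters: when a page has no `<main>` or `<article>`, returning everything outside the boilerplate tags is usually better than returning nothing.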
Quality measurement
Perhaps the most important property is one most extraction tools do not expose: a signal about whether the extraction was actually successful. A low-quality extraction — one that caught a JS skeleton, or a mostly-navigation page, or a bot-detection interstitial — looks exactly like a successful extraction from the caller’s perspective unless the tool explicitly tells you otherwise.
A quality score on every extraction gives you the ability to gate on quality at the pipeline level. If a page scores below your threshold (the examples in this article use 0.5), do not chunk and embed it: retry with a different approach, flag it for review, or skip it. This single signal eliminates the silent corruption problem.
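To make the idea of a quality signal concrete, here is a deliberately simple heuristic that combines two measurements: how much text was extracted and how much of it is link text. This is an illustrative assumption, not how UnWeb or any other tool computes its score, and the `target_words` default is arbitrary.

```python
def quality_score(total_words: int, link_words: int, target_words: int = 200) -> float:
    """Toy score in [0, 1]: short or link-heavy extractions score low.

    total_words: words in the extracted text
    link_words:  words that sit inside links (navigation tends to be mostly links)
    """
    if total_words <= 0:
        return 0.0
    length_factor = min(total_words / target_words, 1.0)
    link_density = min(link_words / total_words, 1.0)
    return round(length_factor * (1.0 - link_density), 2)
```

A long article with few links scores near 1.0, while a 40-word page where 30 words are links scores close to 0, which is the kind of separation a quality gate needs.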
Designing a Markdown-First Pipeline
A Markdown-first pipeline treats the extracted content as the canonical representation, not a preprocessing step. This changes a few architectural decisions.
Extract once, use many times
If you are building a RAG pipeline and also feeding content to an LLM for summarization, do not extract the URL twice. Extract it once to Markdown, store the Markdown, and derive both the embeddings and the summary from the same canonical representation. This ensures consistency and makes debugging tractable — you can inspect the exact Markdown that generated a particular chunk or summary.
import unweb

client = unweb.Client()
url = "https://docs.example.com/api-reference"

# Extract once and keep the result.
# chunk_for_rag, summarize_with_llm, store_canonical, and log_low_quality
# are your own pipeline helpers.
result = client.extract(url=url)

if result.quality_score >= 0.5:
    markdown = result.markdown
    # Use for RAG chunking
    chunks = chunk_for_rag(markdown)
    # Use for LLM summarization
    summary = summarize_with_llm(markdown)
    # Store canonical source
    store_canonical(url=url, markdown=markdown, quality=result.quality_score)
else:
    log_low_quality(url=url, score=result.quality_score)
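The canonical store itself can be very small. The sketch below is one possible shape for it using SQLite from the standard library. The table layout and the `open_store` and `load_canonical` helpers are assumptions for illustration, and this `store_canonical` takes an explicit connection, unlike the call in the example above.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a store with one row per URL."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "url TEXT PRIMARY KEY, markdown TEXT NOT NULL, quality REAL NOT NULL)"
    )
    return conn


def store_canonical(conn, *, url: str, markdown: str, quality: float) -> None:
    """Insert the page, replacing any earlier extraction of the same URL."""
    conn.execute(
        "INSERT OR REPLACE INTO pages (url, markdown, quality) VALUES (?, ?, ?)",
        (url, markdown, quality),
    )
    conn.commit()


def load_canonical(conn, url: str):
    """Return (markdown, quality) for a URL, or None if it was never stored."""
    return conn.execute(
        "SELECT markdown, quality FROM pages WHERE url = ?", (url,)
    ).fetchone()
```

Keying on the URL means a re-extraction overwrites the old copy, so every chunk and summary can be traced back to exactly one stored document.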
Gate on quality, not just success
HTTP 200 means the server responded. It does not mean the response contains the content you wanted. A quality gate at the extraction layer is the correct place to catch this — before content enters any downstream store.
QUALITY_THRESHOLD = 0.5

# `client` is the unweb.Client() created in the previous example;
# embed_and_store is your own helper.
def ingest_url(url: str) -> bool:
    result = client.extract(url=url)
    if result.quality_score < QUALITY_THRESHOLD:
        print(f"Skipping {url}: quality {result.quality_score:.2f} below threshold")
        return False
    embed_and_store(result.markdown, metadata={"url": url, "quality": result.quality_score})
    return True
Propagate quality metadata downstream
Store the quality score alongside your Markdown. When a retrieval result comes back with a quality score of 0.25, you can filter it at query time rather than returning low-confidence content to the LLM. This is the same pattern as storing source confidence in traditional information retrieval systems — and it is equally important.
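Query-time filtering can be sketched in a few lines, assuming each retrieved chunk carries the quality score stored at ingestion. The `RetrievedChunk` type and its field names are hypothetical; adapt them to whatever your vector store returns.

```python
from dataclasses import dataclass


@dataclass
class RetrievedChunk:
    text: str
    url: str
    similarity: float  # retrieval score from the vector store
    quality: float     # extraction quality stored at ingestion time


def filter_by_quality(chunks, min_quality: float = 0.5):
    """Drop chunks from low-quality extractions, best similarity first."""
    kept = [c for c in chunks if c.quality >= min_quality]
    return sorted(kept, key=lambda c: c.similarity, reverse=True)
```

Filtering at query time rather than only at ingestion lets you tighten the threshold later without re-crawling anything.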
Crawl site structure, not just individual pages
Documentation sites are rarely a single page. A full reference might span 200 URLs. Crawling one page is easy; crawling a site with consistent quality control requires respecting robots.txt, staying within a domain, deduplicating discovered links, and running the quality gate on every page.
# Crawl the entire docs site; returns a list of extracted pages
pages = client.crawl(
    start_url="https://docs.example.com",
    max_pages=200,
    stay_on_domain=True
)

high_quality_pages = [p for p in pages if p.quality_score >= 0.5]
print(f"Crawled {len(pages)} pages; {len(high_quality_pages)} passed quality gate")

for page in high_quality_pages:
    # Keep the quality score in the metadata so it can be used at query time
    embed_and_store(page.markdown, metadata={"url": page.url, "quality": page.quality_score})
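If you are curious what the crawl bookkeeping involves, the sketch below covers three of the requirements named above: staying on one domain, deduplicating links, and honoring robots.txt. It uses only the standard library and does no network I/O. The `next_urls` helper and the `example-bot` user agent are assumptions for illustration, not a description of how `client.crawl` works internally.

```python
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.robotparser import RobotFileParser


def build_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that was already fetched elsewhere."""
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())
    return robots


def next_urls(page_url, hrefs, seen, robots, user_agent="example-bot"):
    """Return new same-host URLs allowed by robots.txt; updates `seen` in place."""
    host = urlparse(page_url).netloc
    fresh = []
    for href in hrefs:
        # Resolve relative links and drop #fragments so duplicates collapse.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar
        if parsed.netloc != host:
            continue  # stay on the starting domain
        if absolute in seen:
            continue  # already queued or visited
        if not robots.can_fetch(user_agent, absolute):
            continue  # disallowed by robots.txt
        seen.add(absolute)
        fresh.append(absolute)
    return fresh
```

The comparison is an exact host match, so `docs.example.com` and `www.example.com` count as different sites here; loosen that only if you mean to.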
Markdown in the Agent Context
AI agents that interact with the web have the same extraction problem, with a tighter feedback loop. When an agent decides it needs to fetch a URL to answer a question, it cannot wait for a human to notice the extraction was garbage three days later. The extraction needs to be reliable enough that the agent can act on it immediately.
This is where the MCP (Model Context Protocol) integration pattern matters. Rather than building extraction logic into your agent code, you expose a web-to-Markdown tool via MCP that the LLM can call. The tool handles rendering, quality control, and content isolation. The model gets clean Markdown and a quality signal. If the quality is low, the model can decide to try a different URL, report uncertainty, or ask for clarification — rather than confidently answering from corrupted content.
{
  "mcpServers": {
    "unweb": {
      "command": "npx",
      "args": ["-y", "@unweb/mcp-server"],
      "env": { "UNWEB_API_KEY": "your-key" }
    }
  }
}
With this in place, any tool call to unweb_extract or unweb_crawl from Claude, GPT, or any MCP-compatible agent returns structured Markdown with a quality score baked in. The agent’s decision logic can treat low-quality extractions as uncertain evidence rather than ground truth.
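On the agent side, treating a low score as uncertain evidence can be as simple as labeling the content before it enters the context window. The sketch below is one hypothetical way to do that. The `frame_evidence` name, the wording of the labels, and the 0.5 threshold are all assumptions.

```python
def frame_evidence(url: str, markdown: str, quality: float, threshold: float = 0.5) -> str:
    """Prefix extracted Markdown with a source line that states its reliability."""
    if quality >= threshold:
        header = f"Source: {url} (extraction quality {quality:.2f})"
    else:
        # Assumed wording; tune it to your own prompt conventions.
        header = (
            f"Source: {url} (extraction quality {quality:.2f}, LOW: "
            "treat as uncertain, say so in the answer, or request another source)"
        )
    return f"{header}\n\n{markdown}"
```

Putting the warning next to the content, rather than in a system prompt, keeps the caveat attached to the specific source it applies to.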
The Performance Tradeoff
There is a real cost to high-quality extraction: headless browser rendering is 2–5x slower and more memory-intensive than plain HTTP fetching. For interactive agents that need sub-second responses, this is a meaningful constraint.
The practical pattern most teams land on is a tiered approach: try plain HTTP extraction first, check the quality score, and fall back to headless rendering only when the score is below threshold. This gets you the speed of plain HTTP for the majority of static pages while correctly handling JS-heavy pages without the rendering overhead on every request.
result = client.extract(
    url=url,
    mode="auto"  # tries plain HTTP first; falls back to headless if quality is low
)
The auto mode in UnWeb handles this internally — you do not need to implement the fallback logic yourself.
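For readers who want to see the shape of the tiered pattern, here is a minimal sketch with the two fetchers passed in as functions. It is not UnWeb's implementation. The `Extraction` type and the `extract_tiered`, `fetch_plain`, and `fetch_rendered` names are hypothetical.

```python
from typing import Callable, NamedTuple


class Extraction(NamedTuple):
    markdown: str
    quality_score: float
    rendered: bool  # True if a headless browser produced this result


def extract_tiered(
    url: str,
    fetch_plain: Callable[[str], Extraction],
    fetch_rendered: Callable[[str], Extraction],
    threshold: float = 0.5,
) -> Extraction:
    """Try the cheap path first; pay for rendering only when quality is low."""
    first = fetch_plain(url)
    if first.quality_score >= threshold:
        return first
    second = fetch_rendered(url)
    # Keep whichever attempt scored higher, even if both are below threshold,
    # so the caller can still apply its own quality gate.
    return second if second.quality_score >= first.quality_score else first
```

Returning the better of the two attempts, rather than raising, leaves the final accept-or-skip decision with the pipeline's quality gate.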
What This Looks Like in Practice
The teams that have gotten this right tend to share a few patterns. They have a single extraction service that all pipelines call rather than each team rolling their own scraping logic. They store the canonical Markdown alongside the embeddings so debugging a retrieval failure means reading the source, not re-fetching and re-parsing the URL. They monitor quality score distributions across their corpus and set alerts when average quality drops — a sign that a site changed its rendering approach or started returning different content to automated clients.
The teams that have not gotten this right are the ones debugging “why does our RAG keep returning navigation text” six months after launch.
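The monitoring pattern described above needs very little machinery. The sketch below compares the mean quality of a recent batch against a baseline batch and reports when the drop exceeds a tolerance. The function name and the 0.15 default are assumptions, and a real setup would likely track this per site.

```python
from statistics import mean


def quality_drop_alert(baseline, recent, max_drop: float = 0.15) -> bool:
    """True when the recent mean quality fell by more than `max_drop`."""
    if not baseline or not recent:
        return False  # not enough data to compare
    return mean(baseline) - mean(recent) > max_drop
```

Run per domain, a check like this tends to surface a site redesign or new bot detection within one crawl cycle instead of months later.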
The Short Version
Markdown is infrastructure now, not just a file format. The quality of web content entering your pipeline, from the first extraction through to what lands in the context window, determines the quality of your application’s outputs. Treating web-to-Markdown conversion as an afterthought means accepting silent, hard-to-debug degradation in everything downstream.
The good news is that the problem is solvable. The right extraction layer — one that handles JavaScript rendering, isolates content from navigation noise, and exposes a quality signal on every response — turns web content into a reliable foundation rather than a source of corruption. Everything built on top of it gets to assume clean input.
That is what infrastructure is for.