The standard Node.js scraping stack is axios + cheerio. It works fine for static sites. It fails quietly on the modern web.
import axios from 'axios';
import * as cheerio from 'cheerio';
// axios downloads the raw HTML; it never runs the page's JavaScript
const response = await axios.get('https://some-react-app.com/products');
const $ = cheerio.load(response.data);
console.log($('body').text()); // "Loading..."
The page looks full of content in Chrome, but your scraper gets a shell. Client-side-rendered apps (typically built with React, Vue, or a similar framework) ship a nearly empty <div id="root"></div> and populate it only after the browser executes the JavaScript bundle. axios.get() downloads the HTML and never runs that JavaScript, so the content never appears in the response.
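One way to catch this early is to measure how much visible text the raw HTML actually contains. The sketch below is a rough heuristic, not a full HTML parser: the function names and the 200-character threshold are illustrative assumptions, and regex-based tag stripping will misjudge some pages.

```javascript
// Heuristic: does this HTML look like an unrendered client-side app shell?
// MIN_VISIBLE_CHARS is an illustrative threshold, not a standard value.
const MIN_VISIBLE_CHARS = 200;

function visibleText(html) {
  return html
    .replace(/<(script|style|noscript)\b[^>]*>[\s\S]*?<\/\1>/gi, ' ') // drop blocks a JS-enabled browser does not display
    .replace(/<[^>]+>/g, ' ') // drop the remaining tags
    .replace(/\s+/g, ' ')     // collapse whitespace
    .trim();
}

function looksLikeEmptyShell(html, minChars = MIN_VISIBLE_CHARS) {
  return visibleText(html).length < minChars;
}

const shell =
  '<html><body><div id="root"></div><script src="/bundle.js"></script></body></html>';
console.log(looksLikeEmptyShell(shell)); // true
```

A check like this only tells you that the response is thin. It cannot tell you what the rendered page would have contained.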
The textbook fix is Playwright:
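import { chromium } from 'playwright';
const browser = await chromium.launch();
let content;
try {
  const page = await browser.newPage();
  await page.goto('https://some-react-app.com/products');
  // Wait until the client-side render has produced the product list
  await page.waitForSelector('.product-list');
  content = await page.content(); // fully rendered HTML
} finally {
  await browser.close(); // release the browser process even if a step above throws
}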
It works. It’s also 3–5 seconds per page, requires a Chromium binary, and returns raw HTML you still have to parse and clean. For a 500-page crawl run one page at a time, that’s roughly 25 to 40 minutes plus significant infrastructure overhead.
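The arithmetic behind that estimate is simple enough to sketch. The helper below is hypothetical and deliberately naive: it assumes every page takes the same time and that pages run in fixed-size waves, which real crawls only approximate.

```javascript
// Rough wall-clock estimate for a crawl. All inputs are assumptions you supply.
function estimateCrawlMinutes(pages, secondsPerPage, concurrency = 1) {
  if (pages < 0 || secondsPerPage < 0 || concurrency < 1) {
    throw new RangeError('pages and secondsPerPage must be >= 0, concurrency >= 1');
  }
  // Pages are processed in waves of `concurrency` pages each.
  const waves = Math.ceil(pages / concurrency);
  return (waves * secondsPerPage) / 60;
}

console.log(estimateCrawlMinutes(500, 3));            // 25
console.log(estimateCrawlMinutes(500, 5).toFixed(1)); // 41.7
console.log(estimateCrawlMinutes(500, 3, 5));         // 5
```

Running several browser pages in parallel shortens the wall-clock time, but each extra page costs memory and CPU, which is where the infrastructure overhead comes from.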
For most Node.js use cases — content pipelines, RAG ingestion, monitoring, research tools — there’s a faster path.
One API Call
import { UnWebClient } from '@mbsoftsystems/unweb';
const client = new UnWebClient({ apiKey: process.env.UNWEB_API_KEY });
const result = await client.convert.url('https://example.com/article');
console.log(result.markdown); // Clean CommonMark
console.log(result.qualityScore); // 0–100
qualityScore is the useful part. It measures whether the page returned meaningful content:
- 70+: Full content extracted. Safe to use.
- 40–69: Partial content. May be a partially JS-rendered page.
- Below 40: The page is almost certainly JS-rendered. Your scraper got a skeleton.
Why this matters for LLM pipelines: A page that scores below 40 looks like valid data to your pipeline. There’s no exception, no error code — just mostly-empty markdown that gets chunked, embedded, and retrieved. The corruption surfaces when a user asks a question that should be answerable. The quality score lets you gate before that happens.
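A minimal sketch of such a gate, using the three score ranges listed above. The tier names and the `gate` helper are illustrative assumptions and not part of the UnWeb client. The only field it reads from each result is `qualityScore`.

```javascript
// Map a quality score to a tier, using the ranges described in this article.
function classifyQuality(score) {
  if (typeof score !== 'number' || Number.isNaN(score)) return 'unknown';
  if (score >= 70) return 'full';
  if (score >= 40) return 'partial';
  return 'skeleton';
}

// Gate results before chunking and embedding:
// keep 'full', set 'partial' aside for review, reject everything else.
function gate(results) {
  const accepted = [];
  const review = [];
  const rejected = [];
  for (const r of results) {
    const tier = classifyQuality(r.qualityScore);
    if (tier === 'full') accepted.push(r);
    else if (tier === 'partial') review.push(r);
    else rejected.push(r);
  }
  return { accepted, review, rejected };
}

console.log(classifyQuality(82)); // full
console.log(classifyQuality(55)); // partial
console.log(classifyQuality(12)); // skeleton
```

Whether partial pages are reviewed, retried with a headless browser, or dropped is a pipeline decision. What matters is that they never reach the embedding step unnoticed.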
Quality-Gated Batch Scraping
When scraping multiple URLs, run requests concurrently with Promise.all and gate on quality score:
import { UnWebClient } from '@mbsoftsystems/unweb';
const client = new UnWebClient({ apiKey: process.env.UNWEB_API_KEY });
const MIN_QUALITY = 40; // below this, the page is almost certainly a JS-rendered skeleton
async function scrapeAll(urls) {
  const results = await Promise.all(
    urls.map(async (url) => {
      try {
        const result = await client.convert.url(url);
        if (result.qualityScore < MIN_QUALITY) {
          console.warn(`Low quality (${result.qualityScore}): ${url}`);
          return null;
        }
        return { url, markdown: result.markdown, quality: result.qualityScore };
      } catch (err) {
        // Without this, one failed request would reject the whole Promise.all
        console.warn(`Request failed: ${url} (${err.message})`);
        return null;
      }
    })
  );
  return results.filter(Boolean); // drop low-quality and failed pages
}
const urls = [
  'https://docs.langchain.com/docs/concepts',
  'https://react.dev/learn',
  'https://expressjs.com/en/guide/routing.html',
];
const pages = await scrapeAll(urls);
console.log(`Got ${pages.length} of ${urls.length} pages with usable content`);
Each request takes roughly 1–2 seconds. Running 10 URLs concurrently completes in about 2–3 seconds total, versus 10–20 seconds when the same requests run sequentially, or 30–50 seconds with Playwright at 3–5 seconds per page.
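For larger batches, firing every request at once can run into rate limits. A common alternative is a small worker pool that caps the number of requests in flight and records failures per item. The sketch below is a generic helper, not part of the UnWeb client. The name `mapWithConcurrency` and the result shape (modeled on `Promise.allSettled`) are choices made for this example.

```javascript
// Run `worker` over `items` with at most `limit` calls in flight.
// Resolves to one { status, value | reason } entry per item, in input order,
// so a single failed item does not reject the whole batch.
async function mapWithConcurrency(items, limit, worker) {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError('limit must be a positive integer');
  }
  const results = new Array(items.length);
  let next = 0;

  async function runner() {
    while (next < items.length) {
      const i = next++; // claim the next index; no await between check and claim
      try {
        results[i] = { status: 'fulfilled', value: await worker(items[i], i) };
      } catch (reason) {
        results[i] = { status: 'rejected', reason };
      }
    }
  }

  // Start up to `limit` runners; each pulls work until the list is exhausted.
  const runners = Array.from({ length: Math.min(limit, items.length) }, () => runner());
  await Promise.all(runners);
  return results;
}
```

With the client from the example above, a call might look like `await mapWithConcurrency(urls, 5, (url) => client.convert.url(url))`. A sensible limit depends on your plan's rate limits, which this article does not specify.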
Crawling a Site for RAG Pipelines
For documentation sites or structured content you want to ingest in bulk, use the crawl API:
import { UnWebClient } from '@mbsoftsystems/unweb';
const client = new UnWebClient({ apiKey: process.env.UNWEB_API_KEY });
// Start the crawl
let job = await client.crawl.start('https://docs.example.com', {
  allowedPath: '/docs/',
  maxPages: 200,
  exportFormat: 'langchain', // also: 'llamaindex', 'raw-md'
});
console.log(`Job started: ${job.jobId}`);
// Poll until complete, with a time limit so a failed or stuck job cannot loop forever
const deadline = Date.now() + 30 * 60 * 1000; // arbitrary 30-minute cap; adjust to your crawl size
while (job.status !== 'Completed') {
  if (Date.now() > deadline) {
    throw new Error(`Crawl ${job.jobId} did not complete in time (last status: ${job.status})`);
  }
  await new Promise(r => setTimeout(r, 5000));
  job = await client.crawl.status(job.jobId);
  console.log(`${job.status}: ${job.pagesCrawled} pages`);
}
// Download results
const download = await client.crawl.download(job.jobId);
console.log(`Crawled ${job.pagesCrawled} pages`);
console.log(`LangChain JSONL: ${download.downloadUrl}`);
The LangChain export is ready to load directly into a vector store without format wrangling. The LlamaIndex export works the same way. raw-md gives you one Markdown file per page in a zip if you want to handle chunking yourself.
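If you take the raw-md route, a simple starting point is to split each page at its headings and cap the chunk size. The function below is a sketch under simplifying assumptions: it recognizes only ATX headings (lines starting with one to six `#` characters), ignores code fences, and uses an illustrative 2000-character cap. The name `chunkMarkdownByHeading` is made up for this example.

```javascript
// Split Markdown into chunks at heading lines, then cap each chunk at maxChars.
// Simplification: fenced code blocks are not tracked, so a "# comment" line
// inside a fence would be treated as a heading.
function chunkMarkdownByHeading(markdown, maxChars = 2000) {
  const sections = [];
  let current = [];
  for (const line of markdown.split('\n')) {
    // An ATX heading (1-6 '#' followed by whitespace) starts a new section
    if (/^#{1,6}\s/.test(line) && current.length > 0) {
      sections.push(current.join('\n'));
      current = [];
    }
    current.push(line);
  }
  if (current.length > 0) sections.push(current.join('\n'));

  // Hard-split any section that is still longer than maxChars
  const chunks = [];
  for (const section of sections) {
    const text = section.trim();
    for (let i = 0; i < text.length; i += maxChars) {
      chunks.push(text.slice(i, i + maxChars));
    }
  }
  return chunks;
}

const sample = '# Title\nIntro text.\n## Usage\nCall the API.';
console.log(chunkMarkdownByHeading(sample).length); // 2
```

Heading-based splitting keeps each chunk on one topic, which tends to help retrieval. The hard split in the second loop is a blunt fallback for very long sections and can cut mid-sentence.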
Processing HTML You Already Have
If you have HTML in memory rather than a URL — from a browser automation step, an email parser, or a webhook payload — use convert.paste:
const htmlString = '<article><h1>Title</h1><p>Body content here.</p></article>';
const result = await client.convert.paste(htmlString);
console.log(result.markdown);
// # Title
// Body content here.
Or upload a local HTML file:
import { readFileSync } from 'node:fs';
const buffer = readFileSync('./saved-page.html');
const result = await client.convert.upload(buffer, 'saved-page.html');
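These pieces can be combined into a fallback: try the direct URL conversion first, and only when the score suggests a skeleton, render the page in a browser and convert the resulting HTML. The sketch below takes its three steps as injected functions, so it makes no claims about any library's API. The names `convertWithFallback`, `convertUrl`, `renderHtml`, and `convertHtml` are hypothetical, and the default threshold of 40 mirrors the gate used earlier.

```javascript
// Try the fast conversion first; if the score suggests a JS-rendered skeleton,
// render the page in a browser and convert that HTML instead.
// All three steps are supplied by the caller (hypothetical names).
async function convertWithFallback(
  url,
  { convertUrl, renderHtml, convertHtml, minQuality = 40 }
) {
  const direct = await convertUrl(url);
  if (direct.qualityScore >= minQuality) {
    return { ...direct, source: 'direct' };
  }
  // The direct result was judged unusable, so pay for the slow path
  const html = await renderHtml(url);       // e.g. Playwright: goto, then page.content()
  const rendered = await convertHtml(html); // e.g. client.convert.paste(html)
  return { ...rendered, source: 'rendered' };
}
```

This keeps the 3–5 second browser cost for the pages that need it, while every other page takes the fast path.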
When to Use Each Method
| Scenario | Method |
|---|---|
| Monitor one URL | client.convert.url(url) |
| Batch scrape 10–100 URLs | Promise.all with convert.url |
| Build a RAG knowledge base from a doc site | client.crawl.start() + LangChain/LlamaIndex export |
| Process HTML from memory | client.convert.paste(html) |
| Convert a local HTML file | client.convert.upload(buffer, filename) |
Setup
npm install @mbsoftsystems/unweb
Get an API key at app.unweb.info (free tier: 500 credits/month, no credit card required).
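# .env
UNWEB_API_KEY=your_api_key_here
Node.js does not read .env files on its own. Start your script with node --env-file=.env (Node.js 20.6 or later) or load the file with a package such as dotenv, then create the client: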
import { UnWebClient } from '@mbsoftsystems/unweb';
const client = new UnWebClient({ apiKey: process.env.UNWEB_API_KEY });
Full API reference: docs.unweb.info. Community examples: github.com/mbsoft-systems/unweb-community.
The quality score is what separates this from a plain HTML-to-Markdown converter. Most tools return something regardless of whether the page had content. UnWeb tells you when the extraction is trustworthy — which is what you actually need when you’re feeding web content into an LLM pipeline where garbage-in leads to garbage-out.