Web Scraping with Node.js: Clean Markdown from Any URL

The standard Node.js scraping stack is axios + cheerio. It works fine for static sites. It fails quietly on the modern web.

// ES module syntax (.mjs file or "type": "module"), so the top-level await below is valid
import axios from 'axios';
import * as cheerio from 'cheerio';

const response = await axios.get('https://some-react-app.com/products');
const $ = cheerio.load(response.data);
console.log($('body').text()); // "Loading..."

The page looks full of content in Chrome. Your scraper gets a shell. This is because client-side-rendered apps (single-page apps built with React, Vue, and similar frameworks) ship a nearly empty <div id="root"></div> and populate it with JavaScript after the browser executes the bundle. axios.get() only downloads that initial HTML, and nothing in the axios + cheerio stack ever executes the bundle.
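
To make the empty-shell problem concrete, here is a rough heuristic sketch that measures how much visible text the raw HTML carries before any JavaScript runs. The function names and the 200-character threshold are assumptions chosen for illustration, and regex-based tag stripping is only an approximation of real HTML parsing.

```javascript
// Heuristic sketch: count the visible characters in raw, un-rendered HTML.
function visibleTextLength(html) {
  return html
    .replace(/<script[\s\S]*?<\/script>/gi, ' ') // drop script elements
    .replace(/<style[\s\S]*?<\/style>/gi, ' ')   // drop stylesheets
    .replace(/<[^>]+>/g, ' ')                    // drop remaining tags
    .replace(/\s+/g, ' ')                        // collapse whitespace
    .trim().length;
}

// minChars is an arbitrary cut-off (assumption); tune it against pages you know.
function looksLikeEmptyShell(html, minChars = 200) {
  return visibleTextLength(html) < minChars;
}

const shell =
  '<html><body><div id="root"></div><script src="/bundle.js"></script></body></html>';
console.log(looksLikeEmptyShell(shell)); // true
```

A check like this can flag suspicious responses early, but it cannot tell you what the rendered page would have contained.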

The textbook fix is Playwright:

// ES module syntax, so the top-level await below is valid
import { chromium } from 'playwright';
const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto('https://some-react-app.com/products');
await page.waitForSelector('.product-list');
const content = await page.content();
await browser.close();

It works. It’s also 3–5 seconds per page, requires a Chromium binary, and returns raw HTML you still have to parse and clean. For a 500-page crawl run one page at a time, that’s roughly 25 to 42 minutes plus significant infrastructure overhead.
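
The crawl-time figure is plain arithmetic, sketched below. The helper name and the concurrency parameter are illustrative assumptions, and the model ignores retries, rate limits, and browser startup time.

```javascript
// Back-of-the-envelope estimate: pages are processed in batches of `concurrency`,
// and every batch is assumed to take `secondsPerPage`.
function estimateCrawlMinutes(pageCount, secondsPerPage, concurrency = 1) {
  const batches = Math.ceil(pageCount / concurrency);
  return (batches * secondsPerPage) / 60;
}

console.log(estimateCrawlMinutes(500, 3));            // 25
console.log(estimateCrawlMinutes(500, 5).toFixed(1)); // "41.7"
console.log(estimateCrawlMinutes(500, 3, 5));         // 5
```

Running several browser pages in parallel shortens the wall-clock time, at the cost of more memory and CPU per worker.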

For most Node.js use cases — content pipelines, RAG ingestion, monitoring, research tools — there’s a faster path.

One API Call

import { UnWebClient } from '@mbsoftsystems/unweb';

const client = new UnWebClient({ apiKey: process.env.UNWEB_API_KEY });
const result = await client.convert.url('https://example.com/article');

console.log(result.markdown);      // Clean CommonMark
console.log(result.qualityScore);  // 0–100

qualityScore is the useful part. It measures whether the page returned meaningful content:

70+: Full content extracted. Safe to use.
40–69: Partial content. May be a partially JS-rendered page.
Below 40: The page is almost certainly JS-rendered. Your scraper got a skeleton.

Why this matters for LLM pipelines: A page that scores below 40 looks like valid data to your pipeline. There’s no exception, no error code — just mostly-empty markdown that gets chunked, embedded, and retrieved. The corruption surfaces when a user asks a question that should be answerable. The quality score lets you gate before that happens.
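
One way to apply those bands in a pipeline is sketched below. The band labels and function names are made up for this example; only the thresholds (70 and 40) and the qualityScore field come from the description above.

```javascript
// Band names ('full', 'partial', 'skeleton') are illustrative labels, not API values.
function classifyQuality(score) {
  if (score >= 70) return 'full';    // full content extracted
  if (score >= 40) return 'partial'; // possibly a partially JS-rendered page
  return 'skeleton';                 // almost certainly JS-rendered
}

// Split results so low-scoring pages never reach chunking or embedding.
function gateByQuality(results, minScore = 40) {
  const accepted = [];
  const rejected = [];
  for (const result of results) {
    if (result.qualityScore >= minScore) {
      accepted.push(result);
    } else {
      rejected.push(result);
    }
  }
  return { accepted, rejected };
}
```

Keeping the rejected list around, instead of silently dropping it, gives you a queue of URLs to retry with a headless browser or to review by hand.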

Quality-Gated Batch Scraping

When scraping multiple URLs, run requests concurrently with Promise.all and gate on quality score:

import { UnWebClient } from '@mbsoftsystems/unweb';

const client = new UnWebClient({ apiKey: process.env.UNWEB_API_KEY });
const MIN_QUALITY = 40;

async function scrapeAll(urls) {
  const results = await Promise.all(
    urls.map(async (url) => {
      const result = await client.convert.url(url);
      if (result.qualityScore < MIN_QUALITY) {
        console.warn(`Low quality (${result.qualityScore}): ${url}`);
        return null;
      }
      return { url, markdown: result.markdown, quality: result.qualityScore };
    })
  );
  return results.filter(Boolean);
}

const urls = [
  'https://docs.langchain.com/docs/concepts',
  'https://react.dev/learn',
  'https://expressjs.com/en/guide/routing.html',
];

const pages = await scrapeAll(urls);
console.log(`Got ${pages.length} of ${urls.length} pages with usable content`);

Each request takes roughly 1–2 seconds. Running 10 URLs concurrently completes in 2–3 seconds total, versus 10–20 seconds sequentially, or 30–50 seconds with Playwright at 3–5 seconds per page.
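
Two caveats apply to the Promise.all pattern: it rejects as soon as any single request throws, which discards the rest of the batch, and it starts every request at once. The sketch below is one possible alternative, a small worker pool that caps concurrency and records failures per item in the same shape as Promise.allSettled. The helper name and the limit are assumptions, not part of the UnWeb client.

```javascript
// Run `worker` over `items` with at most `limit` calls in flight.
// Each slot in the result mirrors Promise.allSettled:
//   { status: 'fulfilled', value } or { status: 'rejected', reason }
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0; // index of the next unclaimed item

  async function run() {
    while (next < items.length) {
      const index = next++; // claim an item (safe: JS is single-threaded)
      try {
        const value = await worker(items[index], index);
        results[index] = { status: 'fulfilled', value };
      } catch (reason) {
        results[index] = { status: 'rejected', reason };
      }
    }
  }

  // Start up to `limit` runners and wait for all of them to drain the queue.
  const runners = Array.from({ length: Math.min(limit, items.length) }, () => run());
  await Promise.all(runners);
  return results;
}
```

With the client from this article, a call might look like mapWithConcurrency(urls, 5, (url) => client.convert.url(url)), followed by the same quality gate on each fulfilled value.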

Crawling a Site for RAG Pipelines

For documentation sites or structured content you want to ingest in bulk, use the crawl API:

import { UnWebClient } from '@mbsoftsystems/unweb';

const client = new UnWebClient({ apiKey: process.env.UNWEB_API_KEY });

// Start the crawl
let job = await client.crawl.start('https://docs.example.com', {
  allowedPath: '/docs/',
  maxPages: 200,
  exportFormat: 'langchain',  // also: 'llamaindex', 'raw-md'
});

console.log(`Job started: ${job.jobId}`);

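// Poll until complete. The 30-minute deadline is an arbitrary safety limit so a
// stuck or failed job cannot keep this loop running forever.
const deadline = Date.now() + 30 * 60 * 1000;
while (job.status !== 'Completed') {
  if (Date.now() > deadline) {
    throw new Error(`Crawl ${job.jobId} did not complete in time (last status: ${job.status})`);
  }
  await new Promise(r => setTimeout(r, 5000));
  job = await client.crawl.status(job.jobId);
  console.log(`${job.status}: ${job.pagesCrawled} pages`);
}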

// Download results
const download = await client.crawl.download(job.jobId);
console.log(`Crawled ${job.pagesCrawled} pages`);
console.log(`LangChain JSONL: ${download.downloadUrl}`);

The LangChain export is ready to load directly into a vector store without format wrangling. The LlamaIndex export works the same way. raw-md gives you one Markdown file per page in a zip if you want to handle chunking yourself.
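
If you take the raw-md route and chunk the files yourself, a heading-based splitter is a common starting point. The following is a minimal sketch under simple assumptions: ATX-style headings (lines starting with #), fenced code blocks marked with ``` or ~~~, and no size limit per chunk. The function name is hypothetical.

```javascript
// Split Markdown into sections, starting a new chunk at every ATX heading.
// Lines inside fenced code blocks are never treated as headings.
function chunkMarkdownByHeading(markdown) {
  const sections = [];
  let current = [];
  let inFence = false;

  for (const line of markdown.split('\n')) {
    if (/^(```|~~~)/.test(line.trim())) {
      inFence = !inFence; // entering or leaving a fenced code block
    }
    const startsSection = !inFence && /^#{1,6}\s/.test(line);
    if (startsSection && current.length > 0) {
      sections.push(current.join('\n').trim());
      current = [];
    }
    current.push(line);
  }
  if (current.length > 0) {
    sections.push(current.join('\n').trim());
  }
  return sections.filter(Boolean); // drop empty chunks
}
```

Long sections may still exceed your embedding model's input limit, so a second pass that splits oversized chunks by paragraph is usually worth adding.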

Processing HTML You Already Have

If you have HTML in memory rather than a URL — from a browser automation step, an email parser, or a webhook payload — use convert.paste:

const htmlString = '<article><h1>Title</h1><p>Body content here.</p></article>';
const result = await client.convert.paste(htmlString);
console.log(result.markdown);
// # Title
// Body content here.

Or upload a local HTML file:

import { readFileSync } from 'node:fs';
const buffer = readFileSync('./saved-page.html');
const result = await client.convert.upload(buffer, 'saved-page.html');
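
The browser-automation case suggests a fallback pattern: try the URL conversion first, and only when the score is low, render the page in a headless browser and convert the resulting HTML. The sketch below shows the control flow only. The three callbacks are injected so nothing here assumes more about the UnWeb client or Playwright than the snippets above; the function name, the option names, and the source field are hypothetical.

```javascript
// Fallback sketch. Callers supply:
//   convertUrl(url)   -> { markdown, qualityScore }  e.g. (u) => client.convert.url(u)
//   renderHtml(url)   -> HTML string                 e.g. Playwright goto + page.content()
//   convertHtml(html) -> { markdown, ... }           e.g. (h) => client.convert.paste(h)
async function convertWithFallback(url, { convertUrl, renderHtml, convertHtml, minQuality = 40 }) {
  const direct = await convertUrl(url);
  if (direct.qualityScore >= minQuality) {
    return { ...direct, source: 'direct' };
  }
  // Low score: the page is probably JS-rendered, so render it before converting.
  const html = await renderHtml(url);
  const rendered = await convertHtml(html);
  return { ...rendered, source: 'rendered' };
}
```

This keeps the slow browser path for the minority of pages that need it, instead of paying 3–5 seconds on every URL.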

When to Use Each Method

Scenario	Method
Monitor one URL	`client.convert.url(url)`
Batch scrape 10–100 URLs	`Promise.all` with `convert.url`
Build a RAG knowledge base from a doc site	`client.crawl.start()` + LangChain/LlamaIndex export
Process HTML from memory	`client.convert.paste(html)`
Convert a local HTML file	`client.convert.upload(buffer, filename)`

Setup

npm install @mbsoftsystems/unweb

Get an API key at app.unweb.info (free tier: 500 credits/month, no credit card required).

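# .env
UNWEB_API_KEY=your_api_key_here

Node.js does not read .env files on its own. Load the file with node --env-file=.env (Node 20.6 or later) or with the dotenv package so that process.env.UNWEB_API_KEY is actually set.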

import { UnWebClient } from '@mbsoftsystems/unweb';
const client = new UnWebClient({ apiKey: process.env.UNWEB_API_KEY });
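
A missing key is easier to debug when the script fails at startup instead of on the first request. A small guard like the one below can help; the helper name is hypothetical and it is not part of the UnWeb package.

```javascript
// Read a required environment variable or fail fast with a clear message.
function requireEnv(name, env = process.env) {
  const value = env[name];
  if (value === undefined || value.trim() === '') {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}
```

You would then pass requireEnv('UNWEB_API_KEY') as the apiKey option when constructing the client.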

Full API reference: docs.unweb.info. Community examples: github.com/mbsoft-systems/unweb-community.

The quality score is what separates this from a plain HTML-to-Markdown converter. Most tools return something regardless of whether the page had content. UnWeb tells you when the extraction is trustworthy — which is what you actually need when you’re feeding web content into an LLM pipeline where garbage-in leads to garbage-out.
