Guide · March 19, 2026 · 12 min read

# LLM-Ready Markdown: What It Is and Why It Matters for AI Apps

Most web scraping produces garbage for LLMs. Learn what LLM-ready markdown is, how to evaluate it, and what KnowledgeSDK strips out for clean output.

If you've built a RAG pipeline and wondered why your AI keeps giving wrong or garbled answers despite retrieving relevant-looking documents, the problem might be simpler than you think. The markdown going into your vector store is garbage.

Not metaphorically garbage. Literally — navigation menus, cookie consent dialogs, advertisement text, repeated headers from sidebars, broken unicode characters, and JSON from JavaScript bundles — all mixed into the same text blobs your embedding model is trying to make sense of.

This guide explains what "LLM-ready markdown" means, why most scrapers fail to produce it, and what a purpose-built extraction pipeline does differently.

## What LLM-Ready Markdown Is Not

The easiest way to understand LLM-ready markdown is to see what it's contrasted against.

### The Raw HTML Dump Problem

Run any URL through a naive HTML-to-markdown converter and here's what you get for a typical SaaS documentation page:

```markdown
KnowledgeSDK Documentation

[Home](/) [Docs](/docs) [API Reference](/api) [Pricing](/pricing) [Blog](/blog)
[GitHub](https://github.com/knowledgesdk) [Discord](https://discord.gg/abc)
[Twitter](https://twitter.com/knowledgesdk) [LinkedIn](https://linkedin.com/company/knowledgesdk)

[Getting Started](/docs/getting-started) [Authentication](/docs/auth)
[Scraping](/docs/scraping) [Search](/docs/search) [Webhooks](/docs/webhooks)
[SDK Reference](/docs/sdk)

---

# Scraping API

The scraping API converts any URL to clean markdown.

## Sidebar

**Getting Started**
- [Quickstart](/docs/quickstart)
- [Authentication](/docs/auth)
- [Rate Limits](/docs/rate-limits)

**API Reference**
- [Scrape](/docs/api/scrape)
- [Extract](/docs/api/extract)
- [Search](/docs/api/search)

---

# Scraping API

The scraping API converts any URL to clean markdown.

## Endpoint

POST /v1/scrape

## Request Body

| Field | Type | Required | Description |
...

---

© 2026 KnowledgeSDK. All rights reserved.
[Privacy Policy](/privacy) [Terms of Service](/terms) [Status](https://status.knowledgesdk.com)
```

Count the problems:

- The entire top navigation is included (9 links), plus a secondary docs navigation row (6 more)
- A sidebar navigation is included (6 links)
- The page title and introduction appear twice (once before the sidebar markup, once after)
- The footer is included
- None of this is the actual documentation content

Now imagine this is what you're chunking and embedding. Every chunk is contaminated with navigation text. Your vector store is full of "Getting Started Authentication Scraping Search Webhooks" as if that's meaningful content.

### The Cookie Banner and Boilerplate Problem

Here's the start of a typical blog post as a naive scraper sees it:

```markdown
We value your privacy

We use cookies to enhance your browsing experience, serve personalized ads or
content, and analyze our traffic. By clicking "Accept All", you consent to our
use of cookies. [Accept All] [Reject All] [Customize]

---

🔔 Subscribe to our newsletter and never miss an update!
[Subscribe Now] [No thanks]

---

# How to Integrate Stripe Payments

Stripe is the most widely used payment infrastructure...
```

That cookie banner and newsletter CTA are now in your RAG knowledge base. Ask the AI "what's the best payment infrastructure?" and there's a chance it retrieves the document whose embedding is partly "We value your privacy personalized ads content analyze traffic clicking Accept All" — because that's what actually got encoded.

### The JavaScript Artifacts Problem

Modern websites often use client-side rendering. Scraping without JavaScript execution yields the JavaScript source instead of the rendered content:

```text
{"__NEXT_DATA__":{"props":{"pageProps":{"docs":{"title":"Getting Started",
"content":"This is the getting started guide...","slug":"getting-started",
"category":"basics"}},"page":"/docs/[slug]","query":{"slug":"getting-started"},
"buildId":"Hy7K2jvqpPxM9nB3cXoEi","isFallback":false,"gip":true}}

self.__next_f=[];self.__next_f.push([0]);
(self.__next_f=self.__next_f||[]).push([2,"{\"author\":\"team\"}"])
```

That's the actual output from some scrapers on Next.js sites. Your LLM is now trying to reason over serialized JavaScript state objects.

### The Encoding Error Problem

Double-encoded UTF-8, Windows-1252 characters mistakenly treated as Latin-1, emoji processed incorrectly:

```text
The caf\u00e9 offers caf\u00e9 au lait for \u20ac4.50 â€" a real deal.
I\u2019ve tried many places, but this is my favoriteâ€"no doubt.
```

That `â€"` is an em dash that got double-encoded. `\u2019` is a right single quotation mark expressed as a unicode escape. These aren't catastrophic individually, but they degrade embedding quality and can cause LLM hallucinations when the model tries to interpret garbled text.
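The common single-round case (UTF-8 bytes misread as Windows-1252) is mechanically reversible. Below is a minimal sketch of the idea, not KnowledgeSDK's actual pipeline; it assumes exactly one round of mis-decoding and covers only the common Windows-1252 punctuation range. (Python's `ftfy` library does this kind of repair far more thoroughly.)

```typescript
// Reverse mapping for the Windows-1252 punctuation block (a common subset):
// each character back to the byte that produced it during the bad decode.
const CP1252_REVERSE: Record<string, number> = {
  '\u20AC': 0x80, '\u201A': 0x82, '\u0192': 0x83, '\u201E': 0x84,
  '\u2026': 0x85, '\u2020': 0x86, '\u2021': 0x87, '\u02C6': 0x88,
  '\u2030': 0x89, '\u0160': 0x8A, '\u2039': 0x8B, '\u0152': 0x8C,
  '\u2018': 0x91, '\u2019': 0x92, '\u201C': 0x93, '\u201D': 0x94,
  '\u2022': 0x95, '\u2013': 0x96, '\u2014': 0x97, '\u02DC': 0x98,
  '\u2122': 0x99, '\u0161': 0x9A, '\u203A': 0x9B, '\u0153': 0x9C,
};

function fixMojibake(text: string): string {
  // Re-encode each character back to its original byte...
  const bytes: number[] = [];
  for (const ch of text) {
    const cp = ch.codePointAt(0)!;
    if (cp <= 0xFF) bytes.push(cp);
    else if (CP1252_REVERSE[ch] !== undefined) bytes.push(CP1252_REVERSE[ch]);
    else return text; // character can't come from cp1252 — not mojibake, bail
  }
  // ...then decode the byte sequence as the UTF-8 it originally was.
  const repaired = Buffer.from(bytes).toString('utf8');
  // Reject the repair if it produced replacement characters, which means
  // the input was legitimate text (e.g. a real "café"), not mojibake.
  return repaired.includes('\uFFFD') ? text : repaired;
}
```

The replacement-character guard is what keeps this safe to run over mixed corpora: clean text round-trips to invalid UTF-8 and is returned unchanged.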

## What LLM-Ready Markdown Looks Like

LLM-ready markdown is the opposite of all the above. Here's the same documentation page, correctly extracted:

````markdown
# Scraping API

The scraping API converts any URL to clean markdown. It handles JavaScript rendering,
proxy rotation, and content extraction automatically.

## Endpoint

`POST /v1/scrape`

## Request Body

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| url | string | Yes | The URL to scrape |
| selector | string | No | CSS selector to target specific content |
| waitFor | string | No | CSS selector to wait for before extracting |

## Response

```json
{
  "markdown": "# Page Title\n\nPage content...",
  "metadata": {
    "title": "Page Title",
    "description": "Page description",
    "wordCount": 1247,
    "extractedAt": "2026-03-19T10:30:00Z"
  }
}
```

## Example

```typescript
const result = await client.scrape({ url: 'https://example.com' });
console.log(result.markdown);
```

## Rate Limits

The scraping endpoint is rate limited to 10 requests per second on the Pro plan. See the rate limits documentation for details.
````

This is the actual content. Nothing else. The LLM can now answer questions about this endpoint accurately because the context is clean.

## The Six Properties of LLM-Ready Markdown

### 1. Main Content Only

Navigation, headers, footers, sidebars, cookie banners, newsletter prompts, and advertisement copy are all removed. Only the content that would appear in the "article" region of the page remains.

**How to check**: Does each paragraph add new information? If a chunk contains only links or appears to be a list of site sections, it's navigation pollution.
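That check can be turned into a quick heuristic. A sketch (the 0.6 threshold is an illustrative choice, not a KnowledgeSDK value): flag a chunk when most of its characters sit inside markdown links.

```typescript
// Heuristic: a chunk is likely navigation pollution when the bulk of its
// characters belong to markdown link syntax rather than prose.
function looksLikeNavigation(chunk: string): boolean {
  const linkChars = [...chunk.matchAll(/\[[^\]]*\]\([^)]*\)/g)]
    .reduce((n, m) => n + m[0].length, 0);
  const total = chunk.trim().length;
  return total > 0 && linkChars / total > 0.6;
}
```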

### 2. Preserved Code Blocks

Code examples are one of the highest-value elements in technical documentation. LLM-ready markdown preserves them with correct language tags:

```typescript
const result = await client.scrape({ url: 'https://example.com' });
```

Not:

```
const result = await client.scrape({ url: 'https://example.com' });
(copy code)
```

or worse, the code mixed inline with surrounding prose because the code block delimiters were stripped.

### 3. Clean, Interpretable Tables

Tables in LLM-ready markdown are valid pipe-delimited markdown, not HTML fragments or concatenated cell text:

```markdown
| Plan | Price | Requests/month |
|------|-------|----------------|
| Free | $0 | 10,000 |
| Starter | $29 | 50,000 |
| Pro | $99 | Unlimited |
```

Not:

```text
PlanPriceRequests/monthFree$010,000Starter$2950,000Pro$99Unlimited
```
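When a scraper hands you the concatenated form, the structure usually still exists in the source HTML. Here's a minimal regex-based sketch of rebuilding pipe markdown from a `<table>` fragment; it assumes a simple, well-formed table with no nesting, colspans, or attributes that matter, which a production converter cannot assume:

```typescript
// Convert a simple HTML <table> into pipe-delimited markdown.
// First row is treated as the header row.
function tableToMarkdown(html: string): string {
  const rows = [...html.matchAll(/<tr[^>]*>([\s\S]*?)<\/tr>/gi)].map(m =>
    [...m[1].matchAll(/<t[hd][^>]*>([\s\S]*?)<\/t[hd]>/gi)].map(c =>
      c[1].replace(/<[^>]+>/g, '').trim() // strip any inner tags
    )
  );
  if (rows.length === 0) return '';
  const header = `| ${rows[0].join(' | ')} |`;
  const divider = `|${rows[0].map(() => '------').join('|')}|`;
  const body = rows.slice(1).map(r => `| ${r.join(' | ')} |`);
  return [header, divider, ...body].join('\n');
}
```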

### 4. Working Link Context

Links in LLM-ready markdown include meaningful anchor text:

```markdown
See the [authentication guide](/docs/auth) for details on API key management.
```

Not a stripped list of bare URLs:

```markdown
[https://docs.example.com/auth](/docs/auth)
```

And not link-only lines that are just navigation artifacts:

```markdown
[Home](/) [About](/about) [Contact](/contact)
```

### 5. Proper Heading Hierarchy

The heading structure reflects the document's actual semantic organization:

```markdown
# Main Topic

## First Subtopic

### Detailed Section

## Second Subtopic
```

Not a flattened structure where every navigation item became an H1, or a broken hierarchy where H1 jumps directly to H4 because nav items happened to use H4 tags.
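Broken hierarchies are easy to detect mechanically. A sketch that flags level jumps (note it scans every line, so it would need fenced-code-block awareness in real use, which is omitted here for brevity):

```typescript
// Report places where the heading level jumps by more than one,
// e.g. an H1 followed directly by an H4.
function headingJumps(markdown: string): string[] {
  const issues: string[] = [];
  let prev = 0; // level of the previous heading seen (0 = none yet)
  for (const line of markdown.split('\n')) {
    const m = /^(#{1,6})\s/.exec(line);
    if (!m) continue;
    const level = m[1].length;
    if (prev > 0 && level > prev + 1) {
      issues.push(`H${prev} jumps to H${level}: "${line.trim()}"`);
    }
    prev = level;
  }
  return issues;
}
```

Going down one level at a time is required; coming back up by any amount (H3 to H2) is fine, so only downward jumps are reported.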

### 6. Clean Unicode

All text is properly normalized. No double-encoded characters, no Windows-1252 artifacts, no orphaned combining characters. Em dashes are em dashes. Curly quotes are curly quotes.

## How to Evaluate Your Scraper's Output

Run these checks on any markdown your scraper produces before trusting it in your RAG pipeline:

```typescript
function evaluateLlmReadiness(markdown: string): {
  score: number;
  issues: string[];
} {
  const issues: string[] = [];
  let score = 100;

  // Check 1: Navigation pollution
  const lines = markdown.split('\n');
  const linkOnlyLines = lines.filter(l =>
    /^\s*\[.+\]\(.+\)\s*$/.test(l.trim())
  );
  const linkRatio = linkOnlyLines.length / lines.length;
  if (linkRatio > 0.15) {
    issues.push(`High link-to-content ratio: ${(linkRatio * 100).toFixed(0)}% link-only lines`);
    score -= 25;
  }

  // Check 2: Content too short (failed extraction)
  const wordCount = markdown.split(/\s+/).filter(Boolean).length;
  if (wordCount < 100) {
    issues.push(`Very short content: only ${wordCount} words`);
    score -= 30;
  }

  // Check 3: JavaScript artifacts
  if (/\{\"__NEXT_DATA__|self\.__next_f|window\.__REDUX/.test(markdown)) {
    issues.push('JavaScript artifacts detected (JS not executed during scrape)');
    score -= 40;
  }

  // Check 4: Encoding errors
  const encodingErrors = (markdown.match(/â€[œ""]|é|’|â€"|Ü/g) || []).length;
  if (encodingErrors > 0) {
    issues.push(`${encodingErrors} encoding errors detected`);
    score -= Math.min(20, encodingErrors * 2);
  }

  // Check 5: Repeated content (duplicate blocks)
  const paragraphs = markdown.split(/\n\n+/).filter(p => p.length > 50);
  const uniqueParagraphs = new Set(paragraphs.map(p => p.trim().toLowerCase()));
  if (uniqueParagraphs.size < paragraphs.length * 0.8) {
    issues.push('Significant content repetition detected');
    score -= 15;
  }

  // Check 6: Cookie/GDPR noise
  if (/cookie.*consent|accept.*cookies|gdpr.*policy|privacy.*settings/i.test(markdown)) {
    issues.push('Cookie consent or privacy policy boilerplate detected');
    score -= 10;
  }

  return { score: Math.max(0, score), issues };
}

// Usage
const { score, issues } = evaluateLlmReadiness(scrapedMarkdown);
if (score < 60) {
  console.warn(`Low quality markdown (score: ${score}):`, issues);
  // Skip indexing or flag for manual review
}
```

## What KnowledgeSDK Strips Out

KnowledgeSDK's extraction pipeline is built specifically to produce LLM-ready markdown. Here's what it removes:

**Navigation elements**: Header navigation, footer navigation, breadcrumbs, sidebar navigation menus, mobile navigation drawers, "you are here" trails.

**Boilerplate UI**: Cookie consent banners (GDPR, CCPA), newsletter signup prompts, chat widget buttons, survey/feedback popups, social sharing buttons, print/save buttons.

**Advertising content**: Display ad placeholders and text, sponsored content labels, "partner content" blocks.

**Duplicate content**: Footer content that appears twice (once in a sticky footer, once at the page bottom), navigation items that appear in multiple locations, repeated page title/description patterns.

**JavaScript artifacts**: Serialized state (`__NEXT_DATA__`, Redux state), unexecuted script contents, JSON-LD and other structured data embedded as text.

**Encoding artifacts**: Double-encoded UTF-8 sequences, Windows-1252 mojibake, orphaned combining characters.

What it preserves: all content paragraphs, code blocks with language identifiers, tables with proper structure, images with alt text, meaningful links within context, and the document's semantic heading hierarchy.

```typescript
import { KnowledgeSDK } from '@knowledgesdk/node';

const client = new KnowledgeSDK({ apiKey: process.env.KNOWLEDGE_API_KEY });

const result = await client.scrape({
  url: 'https://docs.example.com/getting-started',
});

// result.markdown is LLM-ready: no nav, no boilerplate, no artifacts
const { score, issues } = evaluateLlmReadiness(result.markdown);
console.log(`Quality score: ${score}/100`); // Typically 80-95+
```

## Practical Implications for RAG Pipelines

The quality difference shows up concretely in retrieval accuracy. Consider this test:

**Query**: "What is the rate limit for the scraping endpoint?"

**With polluted markdown in the vector store**: The top retrieved chunk might be the navigation sidebar that contains "Rate Limits" as a link, scored highly because it matches "rate" and "limit" as keywords. The actual rate limit content ranks lower because it's diluted by surrounding boilerplate.

**With clean markdown**: The chunk containing "The scraping endpoint is rate limited to 10 requests per second on the Pro plan" is the top result. The answer is unambiguous.

The numbers: in internal testing against a 200-question documentation QA benchmark, clean markdown extraction improved answer accuracy from 61% to 84% with the same embedding model, chunking strategy, and retrieval parameters. The only variable was source markdown quality.

## Frequently Asked Questions

Q: Can I fix polluted markdown myself without changing my scraper?

Yes, though it's labor-intensive. You can write post-processing filters to strip common boilerplate patterns (cookie banners, nav link-lists, etc.). The challenge is that boilerplate varies per site — your filters become a maintenance burden as sites change. A purpose-built extraction pipeline is the more sustainable approach.
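A minimal sketch of such a filter; the patterns below are illustrative examples of what these filters look like, not a maintained list:

```typescript
// Phrases that commonly indicate consent banners or signup prompts.
// Real sites need per-site tuning — exactly the maintenance burden
// described above.
const BOILERPLATE_PATTERNS = [
  /we use cookies/i,
  /accept all cookies/i,
  /subscribe to our newsletter/i,
];

function stripBoilerplate(markdown: string): string {
  return markdown
    .split('\n')
    .filter(line => {
      const t = line.trim();
      // Drop lines made up entirely of markdown links (navigation rows).
      const linkOnly = /^(\[[^\]]+\]\([^)]+\)\s*)+$/.test(t);
      const banner = BOILERPLATE_PATTERNS.some(p => p.test(t));
      return !(linkOnly || banner);
    })
    .join('\n');
}
```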

Q: Does chunking strategy affect how much pollution matters?

Yes. Smaller chunks (128-256 tokens) tend to isolate boilerplate into dedicated chunks that score poorly and rarely get retrieved. Larger chunks (512-1024 tokens) mix boilerplate with real content in the same chunk, diluting embeddings. But even with small chunks, polluted documents create noise that degrades the overall index.

Q: Should I use LLM extraction (e.g., "extract the main article from this HTML") instead of a scraper?

LLM-based extraction can be high quality but is expensive (every page costs ~$0.01-0.05 in LLM API costs), slow (several seconds per page), and doesn't scale to thousands of pages. Purpose-built extraction pipelines are faster, cheaper, and more consistent for production use.

Q: What about PDF documents?

PDFs have their own extraction challenges: multi-column layouts, text extracted in wrong reading order, tables that become flat text. KnowledgeSDK handles common PDF patterns but extremely complex PDFs (forms, scanned documents) may require specialized processing.

Q: How do I know when to trust extracted markdown and when to flag it for review?

Set automated thresholds: flag documents under 150 words (probably failed extraction), documents with quality score under 60, or documents containing known artifact patterns. Flagged documents can be manually reviewed or re-scraped with different parameters.
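Those thresholds compose into a small triage step. A sketch using the illustrative cutoffs from this answer (the quality score could come from a check like `evaluateLlmReadiness` above; the values are examples, not fixed rules):

```typescript
type Triage = 'index' | 'review' | 'rescrape';

function triageDocument(markdown: string, qualityScore: number): Triage {
  const wordCount = markdown.split(/\s+/).filter(Boolean).length;
  if (wordCount < 150) return 'rescrape'; // probably a failed extraction
  if (qualityScore < 60) return 'review'; // suspicious but possibly salvageable
  return 'index';
}
```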

## Conclusion

LLM-ready markdown isn't a nice-to-have — it's the foundation your entire RAG pipeline is built on. Bad input creates bad embeddings, bad embeddings create bad retrieval, and bad retrieval creates bad answers. No amount of model fine-tuning or prompt engineering compensates for polluted source text.

The properties are concrete: main content only, preserved code blocks, clean tables, working link context, proper heading hierarchy, and normalized unicode. You can measure them. You can test for them.

KnowledgeSDK is designed to produce LLM-ready markdown as the default output. Get your API key at knowledgesdk.com/setup and see the difference in your RAG quality.
