Technical · March 19, 2026 · 13 min read

Why Markdown Quality Matters for LLM Web Scraping (And How to Measure It)

Bad markdown ruins RAG quality. Learn how to identify common extraction failures, measure markdown quality, and ensure clean output for LLMs.


Every RAG pipeline has a bottleneck that rarely gets discussed: the quality of the text going into the vector store. Engineers spend weeks tuning embedding models, experimenting with chunking strategies, and optimizing retrieval parameters — then wonder why their Q&A system still returns garbage.

The answer, more often than not, is that the source markdown is garbage.

This guide examines why markdown quality matters so much for LLM applications, what "bad" markdown looks like in practice, how to measure quality programmatically, and what a production extraction pipeline does to ensure clean output.

Why Markdown Quality Directly Impacts RAG Performance

When you scrape a webpage and store the result for RAG, you're making an implicit assumption: the text you stored is the text that matters. If your scraper includes navigation menus, cookie consent banners, footer links, and ad copy alongside the actual article content, you've polluted every chunk that gets embedded.

The effects compound in subtle ways:

Embedding dilution: A chunk containing 30% boilerplate and 70% real content produces an embedding that represents a mixture of both. When a user asks a relevant question, the embedding distance to that chunk is worse than it would be for a clean chunk — so it gets ranked lower and potentially never retrieved.

Token waste: Every token of navigation, ads, and boilerplate that goes into your LLM context window is a token that could have been real content. On a 128K context window this seems trivial, but at scale — thousands of documents in context — it adds up to real cost and degraded response quality.

Hallucination triggers: Inconsistent or garbled text (broken unicode, partially-rendered JavaScript, truncated sentences) can confuse LLMs. Sufficiently bad input can trigger confabulation — the model "fixes" the garbled input with plausible-sounding but fabricated content.

Chunking failures: Most RAG systems chunk on paragraph or sentence boundaries. Navigation menus and repeated headers create fake "paragraphs" that get chunked independently and embedded as if they were real content nodes.

What Bad Markdown Actually Looks Like

Let's be specific. Here are the failure modes you'll encounter when scraping HTML with naive converters.

Navigation Pollution

A typical e-commerce site might have 60-80 navigation links in the header alone. A naive HTML-to-markdown converter includes all of them:

# Example Store

[Home](/) [Products](/products) [Categories](/categories) [Electronics](/electronics)
[Computers](/computers) [Laptops](/laptops) [Desktops](/desktops) [Monitors](/monitors)
[Phones](/phones) [Tablets](/tablets) [Audio](/audio) [TV & Home](/tv) [Sale](/sale)
[New Arrivals](/new) [Best Sellers](/best) [Brands](/brands) [Apple](/brands/apple)
[Samsung](/brands/samsung) [Sony](/brands/sony) [Account](/account) [Orders](/orders)
[Wishlist](/wishlist) [Cart](/cart) [Help](/help) [Contact](/contact)

## MacBook Pro 16-inch (2025)

The MacBook Pro 16-inch delivers exceptional performance...

That navigation block becomes one or more chunks in your vector store. Query "What is the MacBook Pro 16-inch?" and your retrieval system might return the navigation chunk (which mentions "Laptops") instead of the product description chunk.
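If you are stuck with a naive converter, a cheap pre-filter helps: drop any line made up entirely of markdown links. A heuristic sketch, not a substitute for real content detection:

```typescript
// Drop lines that consist only of markdown links (navigation noise).
// This is a rough heuristic sketch; it will miss nav rendered as
// lists or mixed with short labels.
function stripLinkOnlyLines(markdown: string): string {
  const linkOnly = /^(\s*\[[^\]]*\]\([^)]*\)\s*)+$/;
  return markdown
    .split('\n')
    .filter(line => !linkOnly.test(line))
    .join('\n');
}
```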

Boilerplate and Cookie Banners

We use cookies to improve your experience. By continuing to use this site, you agree
to our use of cookies. [Accept All] [Manage Preferences] [Privacy Policy]

SPECIAL OFFER: Sign up for our newsletter and get 10% off your first order!
[SUBSCRIBE NOW]

## The Complete Guide to API Rate Limiting

Rate limiting is a critical technique...

That "Complete Guide to API Rate Limiting" article is now poisoned with cookie consent text and newsletter promotions that have zero semantic relationship to the content.

Garbled Tables

HTML tables are a particular challenge. Poor converters produce:

PlanPriceRequestsSupport
Free$010k/moEmail
Starter$2950k/moEmail + Chat
Pro$99Unlimited24/7

vs. clean output:

| Plan | Price | Requests | Support |
|------|-------|----------|---------|
| Free | $0 | 10k/mo | Email |
| Starter | $29 | 50k/mo | Email + Chat |
| Pro | $99 | Unlimited | 24/7 |

The garbled version is essentially uninterpretable by an LLM attempting to answer "What's included in the Starter plan?"

Repeated Headers and Footer Content

Many sites repeat the same content in the header, a sticky nav bar, a sidebar, and the footer. A naive scraper captures all four copies:

# Company Blog

# Company Blog

## Latest Posts

# Company Blog

[Home] [Blog] [About] [Contact]
© 2026 Company. All rights reserved. [Privacy] [Terms] [Sitemap]

This creates artificial token inflation and can confuse retrieval systems that use keyword frequency signals.

Broken Unicode and Character Encoding

Especially common with international sites or PDFs converted to HTML:

The caf\u00e9 serves caf\u00e9 au lait and caf\u00e9 mocha.
Prices start at \u20ac4.50 â€" a bargain by any measure.

That â€" is a UTF-8 em dash that got double-encoded. LLMs can usually parse through it, but it degrades the text quality and embedding fidelity.

JavaScript Artifacts

Single-page applications that use client-side rendering often produce JSON blobs, script tags, or serialized state when scraped without JavaScript execution:

{"__NEXT_DATA__":{"props":{"pageProps":{"product":{"id":"abc123","name":"Widget Pro"...

window.__REDUX_STATE__ = {"user":null,"cart":{"items":[],"total":0},...

This is meaningless noise that gets embedded alongside or instead of the actual page content.
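Detecting these artifacts is straightforward. The first two patterns below mirror the Next.js and Redux globals shown above; the third is an assumed common variant:

```typescript
// Sketch: flag serialized-state artifacts in extracted text.
const JS_ARTIFACT_PATTERNS: RegExp[] = [
  /"__NEXT_DATA__"/,          // Next.js serialized page props
  /window\.__REDUX_STATE__/,  // Redux state dump
  /window\.__INITIAL_STATE__/ // assumed common variant
];

const hasJsArtifacts = (text: string): boolean =>
  JS_ARTIFACT_PATTERNS.some(p => p.test(text));
```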

How to Measure Markdown Quality

Markdown quality is measurable. Here are the signals you should track for any extraction pipeline.

1. Content Ratio

Calculate the ratio of "content tokens" to "total tokens." This requires defining what content is (harder than it sounds), but a practical proxy is:

function estimateContentRatio(markdown: string): number {
  const lines = markdown.split('\n').filter(l => l.trim().length > 0);
  const totalLines = lines.length;
  if (totalLines === 0) return 0; // empty input: nothing to measure

  // Heuristics for non-content lines
  const boilerplatePatterns = [
    /^\[.*\]\(.*\)$/, // pure link lines
    /^#{1,6}\s+.{0,20}$/, // very short headings (nav items)
    /copyright|all rights reserved|privacy policy/i,
    /cookie|gdpr|consent/i,
    /subscribe|newsletter|sign up/i,
    /^\s*\|[-\s|:]+\|\s*$/, // table separator rows
  ];

  const boilerplateLines = lines.filter(line =>
    boilerplatePatterns.some(pattern => pattern.test(line))
  ).length;

  return 1 - (boilerplateLines / totalLines);
}

// Good: > 0.75
// Acceptable: 0.5 - 0.75
// Poor: < 0.5

2. Heading Structure Score

A well-structured document has a logical heading hierarchy. Check for:

function headingStructureScore(markdown: string): number {
  const headings = markdown.match(/^#{1,6}\s+.+$/gm) || [];
  if (headings.length === 0) return 0.5; // No headings might be okay (article prose)

  let score = 1.0;
  let prevLevel = 0;

  for (const heading of headings) {
    const level = heading.match(/^(#+)/)?.[1].length || 1;
    // Penalize skipping levels (h1 → h4 with no h2/h3 is navigation noise)
    if (level > prevLevel + 1 && prevLevel > 0) {
      score -= 0.1;
    }
    prevLevel = level;
  }

  // Penalize documents with too many headings relative to content
  const wordCount = markdown.split(/\s+/).length;
  const headingDensity = headings.length / (wordCount / 100);
  if (headingDensity > 5) score -= 0.2; // More than 5 headings per 100 words = nav pollution

  return Math.max(0, score);
}

3. Unique Content Density

Repeated content (footer links appearing 3 times, etc.) can be detected:

function uniqueContentDensity(markdown: string): number {
  const lines = markdown.split('\n').filter(l => l.trim().length > 20);
  if (lines.length === 0) return 1; // nothing substantive to compare
  const uniqueLines = new Set(lines.map(l => l.trim().toLowerCase()));
  return uniqueLines.size / lines.length;
}

// 1.0 = all unique (ideal)
// < 0.7 = significant repetition

4. Composite Quality Score

Combine signals into a single score:

interface MarkdownQualityReport {
  score: number; // 0-1
  contentRatio: number;
  headingScore: number;
  uniquenessDensity: number;
  wordCount: number;
  flags: string[];
}

function assessMarkdownQuality(markdown: string): MarkdownQualityReport {
  const flags: string[] = [];

  const contentRatio = estimateContentRatio(markdown);
  const headingScore = headingStructureScore(markdown);
  const uniquenessDensity = uniqueContentDensity(markdown);
  const wordCount = markdown.split(/\s+/).filter(Boolean).length;

  if (contentRatio < 0.5) flags.push('HIGH_BOILERPLATE');
  if (wordCount < 100) flags.push('INSUFFICIENT_CONTENT');
  if (headingScore < 0.6) flags.push('NAVIGATION_POLLUTION');
  if (uniquenessDensity < 0.7) flags.push('REPEATED_CONTENT');

  // Check for common artifacts
  if (/\{\"__NEXT_DATA__\"|window\.__REDUX/.test(markdown)) {
    flags.push('JS_ARTIFACTS');
  }
  if (/â€[œ""]|é|’/.test(markdown)) {
    flags.push('ENCODING_ERRORS');
  }

  const score = (contentRatio * 0.4) + (headingScore * 0.3) + (uniquenessDensity * 0.3);

  return { score, contentRatio, headingScore, uniquenessDensity, wordCount, flags };
}

What KnowledgeSDK's Extraction Pipeline Does

KnowledgeSDK's scraping endpoint isn't a raw HTML-to-markdown converter. It's an extraction pipeline with several processing stages designed specifically to produce LLM-ready output.

Stage 1: Intelligent Content Detection

Before conversion, the pipeline identifies the main content region of the page using a combination of:

  • Density analysis: Paragraphs with high text-to-link ratios are more likely to be main content than navigation
  • DOM position heuristics: <main>, <article>, and <section> tags signal content regions
  • Visual layout inference: Content regions tend to occupy the largest visual area
  • Boilerplate classifiers: Trained models that identify common CMS header/footer patterns

This stage eliminates 80-90% of navigation pollution before any markdown conversion happens.

Stage 2: Semantic Structure Preservation

The converter is aware of semantic HTML elements and preserves their meaning:

<!-- Input HTML -->
<figure>
  <img src="chart.png" alt="Monthly revenue growth chart showing 23% increase">
  <figcaption>Revenue grew 23% in Q4 2025</figcaption>
</figure>
<!-- Output Markdown -->
![Monthly revenue growth chart showing 23% increase](chart.png)
*Revenue grew 23% in Q4 2025*

Tables are parsed structurally, not as raw HTML, ensuring proper pipe-delimited output even for complex tables with merged cells.

Stage 3: Post-Processing Filters

After conversion, a series of filters clean up common artifacts:

  • Duplicate block removal: Identical or near-identical paragraphs (cosine similarity > 0.95) are deduplicated
  • Link-only line removal: Lines containing nothing but links are classified as navigation and removed
  • Unicode normalization: All text is normalized to NFC form, fixing common encoding double-encoding issues
  • Whitespace normalization: Multiple blank lines are collapsed; trailing whitespace is stripped
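Two of these filters are simple enough to sketch inline; the dedup and link-line passes need more context, so only whitespace normalization is shown:

```typescript
// Sketch: strip trailing whitespace per line, then collapse runs of
// blank lines down to a single blank line.
function normalizeWhitespace(markdown: string): string {
  return markdown
    .split('\n')
    .map(line => line.replace(/\s+$/, '')) // strip trailing whitespace
    .join('\n')
    .replace(/\n{3,}/g, '\n\n');           // collapse runs of blank lines
}
```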

Stage 4: Quality Verification

Every response includes quality metadata you can use to filter or flag low-quality extractions:

import { KnowledgeSDK } from '@knowledgesdk/node';

const client = new KnowledgeSDK({ apiKey: process.env.KNOWLEDGE_API_KEY });

const result = await client.scrape({
  url: 'https://example.com/article',
});

console.log(result.markdown);
// Clean, LLM-ready markdown

console.log(result.metadata);
// {
//   wordCount: 1847,
//   title: "Article Title",
//   description: "Article description...",
//   extractedAt: "2026-03-19T10:30:00Z"
// }

Comparison: Raw HTML Conversion vs. KnowledgeSDK

Here's a real-world comparison. Input: a typical SaaS documentation page.

Naive HTML-to-markdown (Turndown):

  • 4,200 words
  • Includes 3 full navigation menus
  • Cookie banner text included
  • Footer repeated twice
  • 14 near-duplicate "Learn more" link lines
  • 2 encoding errors
  • Quality score: 0.41

KnowledgeSDK extraction:

  • 1,100 words (the actual documentation content)
  • Navigation removed
  • Cookie banner removed
  • Footer removed
  • Deduplication applied
  • Unicode normalized
  • Quality score: 0.89

The downstream RAG performance difference is not subtle. With clean extraction, relevant-chunk retrieval accuracy typically improves by 30-50% in internal benchmarks run against the same question set.

Frequently Asked Questions

Q: Should I filter out low-quality pages before embedding?

Yes. Set a minimum word count threshold (we recommend 150 words) and a minimum quality score (0.6+). Pages below these thresholds often aren't worth embedding — they contribute noise without adding retrieval value.
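As a sketch, that gate might look like the function below. The report shape mirrors the MarkdownQualityReport interface earlier in this post, and the thresholds are the ones suggested above; tune both per corpus.

```typescript
// Sketch: ingestion gate using the recommended thresholds.
interface QualityGateInput {
  score: number;     // 0-1 composite quality score
  wordCount: number;
  flags: string[];
}

function shouldEmbed(report: QualityGateInput): boolean {
  if (report.wordCount < 150) return false; // below minimum word count
  if (report.score < 0.6) return false;     // below minimum quality score
  if (report.flags.includes('JS_ARTIFACTS')) return false; // failed render
  return true;
}
```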

Q: How do I handle pages that are legitimately short (like a contact page)?

Treat short pages differently based on page type. A contact page with 80 words is complete and correct. A product page with 80 words is probably a failed extraction. Use URL patterns and page title heuristics to classify pages before applying quality thresholds.
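One way to sketch that classification; the URL patterns and word counts here are illustrative assumptions, not recommendations for any particular site:

```typescript
// Sketch: page-type-aware word-count thresholds keyed on URL patterns.
const PAGE_THRESHOLDS: Array<{ pattern: RegExp; minWords: number }> = [
  { pattern: /\/(contact|imprint|legal)/i, minWords: 30 },  // short by design
  { pattern: /\/(product|docs|blog)/i, minWords: 150 },     // should be substantial
];

function minWordsFor(url: string): number {
  const match = PAGE_THRESHOLDS.find(t => t.pattern.test(url));
  return match ? match.minWords : 100; // assumed default threshold
}
```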

Q: Can I improve extraction quality by providing hints to the scraper?

Yes. KnowledgeSDK accepts CSS selectors to target specific content regions if you know the site's structure:

const result = await client.scrape({
  url: 'https://docs.example.com/guide',
  // Target only the main docs content
  selector: '.docs-content, article, main',
});

Q: What about dynamic JavaScript-rendered content?

KnowledgeSDK executes JavaScript before extraction. Content rendered by React, Vue, or other frameworks is included in the output. If you're using a naive converter, you'll get the pre-render HTML shell instead of actual content.

Q: How often should I re-scrape for freshness?

Depends on your use case. For documentation sites, weekly is usually sufficient. For news sites, hourly. KnowledgeSDK's webhook system can notify you when content changes, eliminating the need for scheduled re-scraping.

Conclusion

Markdown quality is the unsung variable in RAG performance. You can optimize every other part of your pipeline — better embeddings, smarter chunking, reranking, query expansion — and still get poor results if the source text is polluted with navigation, boilerplate, and encoding errors.

The solution is a purpose-built extraction pipeline that understands the difference between content and chrome, preserves semantic structure, and verifies output quality before returning it to your application.

KnowledgeSDK is that pipeline. Get your API key at knowledgesdk.com/setup and start scraping with @knowledgesdk/node or the knowledgesdk Python SDK.
