Markdown Extraction

Converting raw HTML web pages into clean, structured Markdown text, removing navigation, ads, and boilerplate.

What Is Markdown Extraction?

Markdown extraction is the process of taking a raw HTML web page and converting its meaningful content into clean, structured Markdown text. The goal is to strip away everything that is not useful — navigation menus, cookie banners, advertisements, footers, sidebars, and tracking scripts — and preserve only the readable content: headings, body text, lists, code blocks, and links.

The result is a format that is both human-readable and ideal for feeding into large language models (LLMs), search indexes, and knowledge bases.

Why Markdown Instead of Raw HTML?

Raw HTML is noisy. A typical web page contains thousands of characters of markup for every hundred characters of actual content. Feeding raw HTML to an LLM wastes tokens, increases cost, and degrades extraction quality because the model must reason about HTML structure rather than content meaning.

Markdown solves this by:

- Removing boilerplate: menus, ads, footers, and scripts are discarded
- Preserving structure: headings (##), lists (-), code blocks (```), and links remain intact
- Reducing token count: clean Markdown is typically 80-95% smaller than equivalent HTML
- Improving LLM accuracy: models trained on Markdown understand its structure natively
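To make the size claim concrete, here is a minimal sketch that compares an HTML snippet with its Markdown equivalent. The sample page and the `size_reduction` helper are hypothetical, and character count is used only as a rough proxy for token count; real savings depend on the page and on the tokenizer.

```python
def size_reduction(html: str, markdown: str) -> float:
    """Return how much smaller `markdown` is than `html`, in percent of characters."""
    if not html:
        raise ValueError("html must be non-empty")
    return 100.0 * (1 - len(markdown) / len(html))


# Hypothetical sample page: one heading and one paragraph wrapped in boilerplate.
html = (
    '<html><head><script src="/tracker.js"></script></head><body>'
    '<nav class="menu"><a href="/">Home</a><a href="/blog">Blog</a></nav>'
    "<main><h1>Deploying Node.js</h1><p>Use a process manager.</p></main>"
    '<footer><div class="cookie-banner">We use cookies.</div></footer>'
    "</body></html>"
)
markdown = "# Deploying Node.js\n\nUse a process manager."

print(f"{size_reduction(html, markdown):.0f}% smaller")
```

For a token-accurate figure, the same comparison can be run on the output of whichever tokenizer the target model uses.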

How Markdown Extraction Works

The extraction pipeline typically involves three stages:

1. Fetch: retrieve the raw HTML via HTTP, or via a headless browser when the page renders its content with JavaScript
2. Readability pass: identify the main content area, as Mozilla's Readability library (the engine behind Firefox's Reader View) does
3. Convert: transform the remaining HTML elements into their Markdown equivalents:
   - <h1> → # Heading
   - <ul><li> → - item
   - <code> → `code`
   - <a href> → [text](url)
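The convert stage can be sketched with nothing but Python's standard-library `html.parser`. This is a toy illustration, not how KnowledgeSDK or Readability work: the `MarkdownConverter` class and `html_to_markdown` function are hypothetical names, the readability pass is replaced by a simple tag blocklist, and only the mappings listed above (plus paragraphs) are handled. The fetch stage is left out so the example runs offline.

```python
from html.parser import HTMLParser


class MarkdownConverter(HTMLParser):
    """Toy converter: handles h1-h6, p, ul/li, code, and a[href] only."""

    # Stand-in for a real readability pass: drop these subtrees entirely.
    SKIP = {"script", "style", "nav", "footer", "aside"}
    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a blocklisted element
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        if self.skip_depth:
            return
        if tag in self.HEADINGS:
            self.out.append("\n\n" + "#" * int(tag[1]) + " ")
        elif tag == "p":
            self.out.append("\n\n")
        elif tag == "ul":
            self.out.append("\n")
        elif tag == "li":
            self.out.append("\n- ")
        elif tag == "code":
            self.out.append("`")
        elif tag == "a":
            self.href = dict(attrs).get("href")
            self.out.append("[")

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth:
            return
        if tag == "code":
            self.out.append("`")
        elif tag == "a":
            self.out.append(f"]({self.href or ''})")
            self.href = None
        elif tag == "ul":
            self.out.append("\n")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)


def html_to_markdown(html: str) -> str:
    parser = MarkdownConverter()
    parser.feed(html)
    parser.close()
    return "".join(parser.out).strip()


sample = (
    '<nav><a href="/">Home</a></nav>'
    "<h1>Title</h1>"
    '<p>See <a href="https://example.com">docs</a> and <code>npm start</code>.</p>'
    "<ul><li>one</li><li>two</li></ul>"
)
print(html_to_markdown(sample))
```

A production converter also has to deal with nested lists, tables, fenced code blocks, relative URLs, and whitespace between tags, which is why the content-detection and conversion steps are usually delegated to dedicated libraries.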

Markdown Extraction with KnowledgeSDK

KnowledgeSDK's POST /v1/scrape endpoint returns clean Markdown from any URL in a single API call:

POST /v1/scrape
Authorization: Bearer knowledgesdk_live_...

{
  "url": "https://blog.example.com/how-to-deploy-node"
}

Response:

{
  "markdown": "# How to Deploy a Node.js App\n\nDeploying Node.js to production requires...",
  "title": "How to Deploy a Node.js App",
  "url": "https://blog.example.com/how-to-deploy-node"
}
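The same call can be made from Python with the standard library. Treat this as a sketch under stated assumptions: the text above gives only the path, the bearer-token header, and the response fields, so the `BASE_URL` host below is a placeholder, the `Content-Type` header is assumed, and the function names are hypothetical.

```python
import json
import urllib.request

# Assumption: the API host is not given above, so this value is a placeholder.
BASE_URL = "https://api.example.com"


def build_scrape_request(url: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the POST /v1/scrape request."""
    body = json.dumps({"url": url}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/v1/scrape",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            # Assumption: a JSON body is declared with this header.
            "Content-Type": "application/json",
        },
    )


def scrape_markdown(url: str, api_key: str) -> str:
    """Send the request and return the `markdown` field of the response."""
    request = build_scrape_request(url, api_key)
    with urllib.request.urlopen(request, timeout=30) as response:
        payload = json.load(response)
    return payload["markdown"]


# Usage (performs a network call, so it is left commented out):
# print(scrape_markdown("https://blog.example.com/how-to-deploy-node", "knowledgesdk_live_..."))
```

Separating request construction from sending keeps the request easy to inspect and test without network access.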

For richer output — including structured metadata, links, and category classification — use POST /v1/extract.

Common Use Cases

- RAG pipelines: convert web pages into context chunks for retrieval-augmented generation
- Documentation ingestion: index third-party docs into your own search engine
- Newsletter summarization: extract article bodies for AI summarization
- Competitive intelligence: convert competitor blog posts into searchable knowledge
- Content migration: move web content into a CMS or database
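For the RAG use case, one simple way to turn extracted Markdown into context chunks is to split it at headings. The sketch below is one possible approach rather than a prescribed one: `chunk_by_heading` is a hypothetical helper, it recognizes only `#`-style headings, and it ignores heading-like lines inside fenced code blocks.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_heading(markdown: str) -> list[dict]:
    """Split Markdown into one chunk per heading section."""
    chunks = []
    title, lines = None, []
    in_fence = False

    def flush():
        # Emit a chunk only if there is a heading or some non-blank text.
        if title is not None or any(line.strip() for line in lines):
            chunks.append({"heading": title, "text": "\n".join(lines).strip()})

    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            title, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    flush()
    return chunks


sample_md = "# Deploy\n\nIntro text.\n\n## Steps\n\n```\n# not a heading\n```\n\nDone."
for chunk in chunk_by_heading(sample_md):
    print(chunk["heading"], "->", len(chunk["text"]), "chars")
```

Heading-based chunks keep each section's text together with its title, which tends to give retrieval more context than fixed-size windows. Long sections may still need a secondary split by length.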