What Is Markdown Extraction?
Markdown extraction is the process of taking a raw HTML web page and converting its meaningful content into clean, structured Markdown text. The goal is to strip away everything that is not useful — navigation menus, cookie banners, advertisements, footers, sidebars, and tracking scripts — and preserve only the readable content: headings, body text, lists, code blocks, and links.
The result is a format that is both human-readable and ideal for feeding into large language models (LLMs), search indexes, and knowledge bases.
Why Markdown Instead of Raw HTML?
Raw HTML is noisy. On many web pages, markup, scripts, and inline styles outweigh the readable content many times over. Feeding raw HTML to an LLM wastes tokens, increases cost, and degrades extraction quality because the model must reason about HTML structure rather than the meaning of the content.
Markdown solves this by:
- Removing boilerplate: menus, ads, footers, and scripts are discarded
- Preserving structure: headings (##), lists (-), code blocks (```), and links remain intact
- Reducing token count: clean Markdown is typically 80-95% smaller than the equivalent HTML
- Improving LLM accuracy: models trained on Markdown understand its structure natively
How Markdown Extraction Works
The extraction pipeline typically involves three stages:
- Fetch — retrieve the raw HTML via HTTP or a headless browser
- Readability pass — identify the main content area (similar to Firefox's Reader Mode or Mozilla's Readability library)
- Convert — transform remaining HTML elements into their Markdown equivalents:
  - <h1> → # Heading
  - <ul><li> → - item
  - <code> → `code`
  - <a href> → [text](url)
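The convert stage can be sketched with Python's standard-library `html.parser`. This is a minimal illustration rather than a production converter: the `MarkdownConverter` class and `html_to_markdown` function are hypothetical names, only a handful of tags are handled, and the readability pass is reduced to skipping a fixed set of boilerplate tags, which is far cruder than the content scoring a library such as Readability performs.

```python
from html.parser import HTMLParser


class MarkdownConverter(HTMLParser):
    """Convert a small subset of HTML (h1-h6, p, ul/li, code, a) to Markdown."""

    # Crude stand-in for a readability pass: drop these elements entirely.
    SKIPPED = {"script", "style", "nav", "footer", "aside"}
    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.out = []        # Markdown fragments, joined at the end
        self.skip_depth = 0  # > 0 while inside a boilerplate element
        self.href = None     # target of the currently open <a>, if any

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1
        elif self.skip_depth:
            return
        elif tag in self.HEADINGS:
            self.out.append("\n\n" + "#" * int(tag[1]) + " ")
        elif tag == "p":
            self.out.append("\n\n")
        elif tag == "ul":
            self.out.append("\n")
        elif tag == "li":
            self.out.append("\n- ")
        elif tag == "code":
            self.out.append("`")
        elif tag == "a":
            self.href = dict(attrs).get("href")
            self.out.append("[")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth:
            return
        elif tag == "code":
            self.out.append("`")
        elif tag == "a":
            self.out.append(f"]({self.href or ''})")
            self.href = None

    def handle_data(self, data):
        # Text inside skipped elements never reaches the output.
        if not self.skip_depth:
            self.out.append(data)


def html_to_markdown(html: str) -> str:
    """Run the converter over an HTML string and return the Markdown."""
    parser = MarkdownConverter()
    parser.feed(html)
    parser.close()
    return "".join(parser.out).strip()
```

Calling `html_to_markdown` on a page fragment that contains a `<nav>`, an `<h1>`, a paragraph, and a `<footer>` returns only the heading and paragraph. Real pages need more care, for example whitespace normalization, nested lists, and fenced blocks for `<pre><code>`.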
Markdown Extraction with KnowledgeSDK
KnowledgeSDK's POST /v1/scrape endpoint returns clean Markdown from any URL in a single API call:
POST /v1/scrape
Authorization: Bearer knowledgesdk_live_...

{
  "url": "https://blog.example.com/how-to-deploy-node"
}
Response:
{
  "markdown": "# How to Deploy a Node.js App\n\nDeploying Node.js to production requires...",
  "title": "How to Deploy a Node.js App",
  "url": "https://blog.example.com/how-to-deploy-node"
}
For richer output — including structured metadata, links, and category classification — use POST /v1/extract.
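As a rough sketch, the /v1/scrape call shown above could be issued from Python using only the standard library. Several details here are assumptions rather than documented facts: the source does not state the API host, so `BASE_URL` is a placeholder, the `Content-Type` header is a guess at what a JSON endpoint expects, and the function names are illustrative.

```python
import json
import urllib.request

# Placeholder only: the API host is not given in this article.
BASE_URL = "https://api.example.com"


def build_scrape_request(page_url: str, api_key: str,
                         base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST /v1/scrape request without sending it."""
    body = json.dumps({"url": page_url}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/scrape",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            # Assumed header; not shown in the request example above.
            "Content-Type": "application/json",
        },
    )


def scrape_markdown(page_url: str, api_key: str,
                    base_url: str = BASE_URL, timeout: float = 30.0) -> str:
    """Send the request and return the "markdown" field of the response."""
    request = build_scrape_request(page_url, api_key, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return payload["markdown"]
```

Keeping request construction separate from sending makes the call easy to inspect or unit-test without network access. In practice the key would be read from an environment variable or secret store rather than hard-coded.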
Common Use Cases
- RAG pipelines — convert web pages into context chunks for retrieval-augmented generation
- Documentation ingestion — index third-party docs into your own search engine
- Newsletter summarization — extract article bodies for AI summarization
- Competitive intelligence — convert competitor blog posts into searchable knowledge
- Content migration — move web content into a CMS or database
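For the RAG use case, one plausible way to turn extracted Markdown into context chunks is to split on headings. The sketch below is an illustration under simple assumptions (ATX-style `#` headings, triple-backtick fences) and the `chunk_by_heading` name is hypothetical; production pipelines usually also cap chunk size and add overlap.

```python
import re

# An ATX heading: one to six "#" characters followed by a space.
HEADING = re.compile(r"^#{1,6} ")


def chunk_by_heading(markdown: str) -> list[str]:
    """Split Markdown into chunks, starting a new chunk at each heading."""
    chunks: list[str] = []
    current: list[str] = []
    in_fence = False
    for line in markdown.splitlines():
        if line.startswith("```"):
            in_fence = not in_fence
        # Lines starting with "#" inside a code fence are not headings.
        if not in_fence and HEADING.match(line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    # Drop chunks that contained only blank lines.
    return [chunk for chunk in chunks if chunk]
```

Splitting at headings keeps each chunk topically coherent, which is one reason preserved heading structure matters for retrieval.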
Quality Signals for Good Markdown Extraction
A high-quality extraction should:
- Preserve heading hierarchy (#, ##, ###)
- Keep code blocks fenced with correct language tags
- Retain meaningful links with their anchor text
- Drop repeated navigation text that appears on every page
- Handle multi-column layouts by linearizing content in reading order
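Some of these signals can be checked automatically. The sketch below covers only two of them, heading hierarchy and code fences, and its rules are assumptions made for illustration: a heading that skips a level is flagged, as is an opening fence without a language tag or a fence that is never closed. The `quality_issues` name is hypothetical.

```python
import re


def quality_issues(markdown: str) -> list[str]:
    """Return human-readable warnings about heading and code-fence quality."""
    issues: list[str] = []
    in_fence = False
    previous_level = 0  # 0 means no heading seen yet
    for number, line in enumerate(markdown.splitlines(), start=1):
        if line.startswith("```"):
            # A bare opening fence carries no language tag.
            if not in_fence and line.strip() == "```":
                issues.append(f"line {number}: code fence has no language tag")
            in_fence = not in_fence
            continue
        if in_fence:
            continue  # ignore "#" lines inside code
        match = re.match(r"^(#{1,6}) ", line)
        if match:
            level = len(match.group(1))
            if previous_level and level > previous_level + 1:
                issues.append(
                    f"line {number}: heading jumps from h{previous_level} to h{level}"
                )
            previous_level = level
    if in_fence:
        issues.append("unclosed code fence")
    return issues
```

An empty list means neither check fired; it does not prove the extraction is good. Signals such as repeated navigation text or reading order need comparison across pages or against the rendered layout.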