Markdown Extraction API: How to Get Clean Text from Any URL
Markdown extraction APIs solve a specific, important problem: given a URL, return the content in a format that LLMs can actually use efficiently. They handle everything between "URL" and "clean text" — JavaScript rendering, anti-bot bypass, boilerplate removal, and format conversion — so your application code deals only with the output.
This matters because the naïve alternative — fetching a URL and piping raw HTML to an LLM — is wasteful and unreliable. HTML is bloated with scripts, styles, navigation, and markup that consumes tokens without contributing information. A 50KB HTML page often reduces to 5KB of meaningful content in markdown. That's a 10x reduction in token cost and a meaningful improvement in context quality.
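To see where the 10x figure comes from, a quick back-of-envelope using the common ~4-characters-per-token heuristic (the exact ratio varies by tokenizer and content):

```python
def estimate_tokens(num_bytes: int, chars_per_token: float = 4.0) -> int:
    """Rough token estimate for ASCII-heavy text (~4 chars per token)."""
    return round(num_bytes / chars_per_token)

html_tokens = estimate_tokens(50_000)     # raw HTML page: ~12,500 tokens
markdown_tokens = estimate_tokens(5_000)  # extracted markdown: ~1,250 tokens
print(html_tokens, markdown_tokens, html_tokens // markdown_tokens)
# 12500 1250 10
```

Most of those 12,500 HTML tokens are markup, not information, which is why the quality improvement matters as much as the cost reduction.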
Why Markdown Is the Right Format for LLMs
Not HTML. Not JSON. Not plain text. Markdown.
HTML carries too much noise. Even after stripping tags, you're left with attribute values, class names, and structural markup that pollutes the context. A typical news article in HTML is 60-80% non-content.
Plain text loses structure. Headings flatten to prose, tables become unreadable columns, code blocks lose their boundaries. An LLM processing plain text has to infer structure that was explicit in the original page.
JSON requires schema — and web pages don't have a consistent schema. You'd need to define and maintain extraction rules for every domain you care about.
Markdown preserves structure with minimal overhead. Headings (#, ##) indicate document hierarchy. Code blocks (```) denote technical content. Tables stay readable. Links are preserved as [text](url). Bold and italic emphasis survive. The result is information-dense, structure-preserving, and token-efficient.
For RAG pipelines, this means better chunking (chunk by heading, not by character count), better retrieval (semantic search on meaningful chunks), and better generation (LLMs produce more accurate answers when given structured context).
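Heading-based chunking is simple to implement once the content is markdown. A minimal sketch (the function below is illustrative, not part of any SDK):

```python
import re

def chunk_by_heading(markdown: str, max_level: int = 2) -> list[str]:
    """Split markdown into chunks at each heading of level <= max_level."""
    pattern = re.compile(rf"^(#{{1,{max_level}}})\s", re.MULTILINE)
    starts = [m.start() for m in pattern.finditer(markdown)]
    if not starts:
        return [markdown]
    # Keep any preamble before the first heading as its own chunk.
    bounds = ([0] if starts[0] > 0 else []) + starts + [len(markdown)]
    return [markdown[a:b].strip() for a, b in zip(bounds, bounds[1:]) if markdown[a:b].strip()]

doc = "# Intro\nHello.\n## Setup\nSteps here.\n### Detail\nFine print."
print(chunk_by_heading(doc))
```

With `max_level=2`, subsections (`###` and deeper) stay attached to their parent section, so each chunk carries a coherent unit of meaning rather than an arbitrary character window.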
How Extraction Works Under the Hood
A quality markdown extraction API does several things in sequence:
1. Headless browser execution. The URL is loaded in a headless Chromium instance. JavaScript executes, dynamic content loads, lazy-loaded images trigger. This step is what separates real extraction from simple HTTP fetches — most modern web pages require JavaScript to render their actual content.
2. Anti-bot bypass. Before even loading the page, the service routes through residential proxies and applies browser fingerprint patching to avoid bot detection. Cloudflare, DataDome, and Akamai protections are handled transparently.
3. DOM cleaning. Scripts, styles, iframes, ads, navigation menus, cookie banners, and other boilerplate are identified and removed. This is harder than it sounds — "main content" detection requires heuristics about element placement, size, content density, and semantic role.
4. Markdown conversion. The cleaned DOM is converted to markdown. Good converters handle headings, lists, tables, code blocks, links, and images with alt text. Bad ones produce garbled output or lose structure entirely.
5. Metadata extraction. Title, description, canonical URL, author, and publish date are extracted from meta tags and structured data (JSON-LD, OpenGraph).
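To make steps 3 and 4 concrete, here is a deliberately tiny stdlib-only sketch of DOM cleaning plus markdown conversion. Production services use far richer heuristics (content density, element placement, semantic role); this only shows the shape of the transformation:

```python
from html.parser import HTMLParser

BOILERPLATE = {"script", "style", "nav", "header", "footer", "aside", "iframe"}
HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}

class MiniMarkdown(HTMLParser):
    """Toy cleaner/converter: drops boilerplate subtrees, maps headings to #."""
    def __init__(self):
        super().__init__()
        self.out, self.skip_depth, self.prefix = [], 0, ""

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE:
            self.skip_depth += 1          # everything inside is boilerplate
        elif tag in HEADINGS:
            self.prefix = HEADINGS[tag]

    def handle_endtag(self, tag):
        if tag in BOILERPLATE and self.skip_depth:
            self.skip_depth -= 1
        elif tag in HEADINGS or tag == "p":
            self.prefix = ""

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.out.append(self.prefix + text)
            self.prefix = ""

def to_markdown(html: str) -> str:
    parser = MiniMarkdown()
    parser.feed(html)
    return "\n\n".join(parser.out)

html = "<nav>Home | About</nav><h1>Title</h1><p>Body text.</p><script>x()</script>"
print(to_markdown(html))  # "# Title\n\nBody text."
```

Note what is missing: no JavaScript execution, no anti-bot handling, no table or code-block support. Those gaps are precisely what the hosted APIs exist to close.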
The Major APIs and How They Differ
Jina Reader
Jina Reader (r.jina.ai) is a free URL-to-markdown service. You prefix any URL with https://r.jina.ai/ and get markdown back. It's genuinely useful for quick experimentation.
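The prefix pattern in code, stdlib only (`reader_url` and `fetch_markdown` are illustrative names, not part of any client library):

```python
from urllib.request import urlopen

READER_PREFIX = "https://r.jina.ai/"

def reader_url(target: str) -> str:
    """Jina Reader is a URL prefix, not an API client."""
    return READER_PREFIX + target

def fetch_markdown(target: str, timeout: float = 30.0) -> str:
    """Fetch any public URL as markdown via the Reader prefix (needs network)."""
    with urlopen(reader_url(target), timeout=timeout) as resp:
        return resp.read().decode("utf-8")

print(reader_url("https://example.com/post"))
# https://r.jina.ai/https://example.com/post
```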
The limitations: rate limits are aggressive on the free tier. Quality varies on JS-heavy pages. No anti-bot handling for protected sites. No indexing or search capabilities. Good for prototyping; not production-ready for AI pipelines.
Firecrawl
Firecrawl is purpose-built for LLM use cases and its markdown quality is excellent. It handles JavaScript rendering, has solid anti-bot coverage, and produces clean, well-structured output. The crawl endpoint lets you extract entire documentation sites recursively.
The trade-off is cost — Firecrawl is among the more expensive options per page, which adds up at scale.
ScrapingBee
ScrapingBee's primary use case is raw HTML extraction with proxy rotation. It does offer a return_page_markdown parameter, but markdown quality is a secondary concern rather than a core product focus. It works, but output quality isn't optimized for LLM consumption.
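A sketch of that call, assuming ScrapingBee's standard `GET /api/v1/` endpoint; verify the `return_page_markdown` parameter name against their current docs before relying on it:

```python
from urllib.parse import urlencode

def scrapingbee_url(api_key: str, target: str) -> str:
    """Build a ScrapingBee request URL asking for markdown output."""
    params = {
        "api_key": api_key,
        "url": target,
        "return_page_markdown": "true",  # markdown instead of raw HTML
    }
    return "https://app.scrapingbee.com/api/v1/?" + urlencode(params)

print(scrapingbee_url("YOUR_KEY", "https://example.com"))
```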
KnowledgeSDK
KnowledgeSDK is built specifically for AI developers who need web content as structured knowledge. Beyond markdown extraction, it adds:
- Semantic search (POST /v1/search) over all extracted content — hybrid keyword + vector search per API key collection.
- Webhooks for change detection — monitor URLs and receive notifications when content changes.
- Async extraction (POST /v1/extract/async) for large sites, with job status polling (GET /v1/jobs/{jobId}).
- MCP server (@knowledgesdk/mcp) for direct Claude/agent integration.
The extraction output includes the markdown content, title, summary, categories, and structured knowledge — not just raw markdown.
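The async flow can be sketched against the endpoints above. The `jobId` and `status` field names here are assumptions for illustration, not documented response shapes:

```python
import json
import time
from urllib.request import Request, urlopen

BASE_URL = "https://api.knowledgesdk.com"

def job_url(job_id: str) -> str:
    return f"{BASE_URL}/v1/jobs/{job_id}"

def post_json(url: str, api_key: str, payload: dict) -> dict:
    req = Request(url, data=json.dumps(payload).encode(),
                  headers={"x-api-key": api_key, "Content-Type": "application/json"})
    with urlopen(req, timeout=30) as resp:
        return json.load(resp)

def extract_async(api_key: str, target: str, poll_seconds: float = 5.0) -> dict:
    """Start an async extraction, then poll the job until it finishes."""
    job = post_json(f"{BASE_URL}/v1/extract/async", api_key, {"url": target})
    while True:
        req = Request(job_url(job["jobId"]), headers={"x-api-key": api_key})
        with urlopen(req, timeout=30) as resp:
            status = json.load(resp)
        if status.get("status") in ("completed", "failed"):
            return status
        time.sleep(poll_seconds)

print(job_url("abc123"))  # https://api.knowledgesdk.com/v1/jobs/abc123
```

In production you would add a timeout ceiling and backoff to the polling loop, or use the webhook option instead of polling at all.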
Quality Factors to Evaluate
When comparing markdown extraction APIs, test against these specific dimensions:
Heading preservation. Do h1-h6 tags become #-######? Does the hierarchy make sense in the output?
Code block handling. Does ```python appear around code samples with the correct language tag? Are inline code spans preserved?
Table formatting. Markdown tables (| col | col |) should preserve the original data. Many extractors flatten tables to prose or lose them entirely.
Link retention. Internal and external links should survive as [text](url) — useful for crawling and for giving LLMs verifiable sources.
Boilerplate removal. Navigation menus, cookie banners, footers, and sidebar widgets should be stripped. The output should be article content, not site chrome.
Image handling. Images should become ![alt text](url) references with meaningful alt text when available.
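These checks are easy to script. A rough scorecard that flags missing structure in an extractor's output (simple presence heuristics, not a benchmark):

```python
import re

def quality_report(markdown: str) -> dict[str, bool]:
    """Heuristic presence checks for the quality dimensions above."""
    return {
        "headings":    bool(re.search(r"^#{1,6}\s", markdown, re.MULTILINE)),
        "code_blocks": markdown.count("```") >= 2,   # at least one fenced pair
        "tables":      bool(re.search(r"^\|.+\|\s*$", markdown, re.MULTILINE)),
        "links":       bool(re.search(r"\[[^\]]+\]\([^)]+\)", markdown)),
        "images":      bool(re.search(r"!\[[^\]]*\]\([^)]+\)", markdown)),
    }

sample = "# Title\n\nSee [docs](https://example.com).\n\n```python\nprint(1)\n```\n"
print(quality_report(sample))
```

Run the same source page through each candidate API and compare reports; pages you actually care about are a better test set than any vendor demo.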
Code Examples
TypeScript (KnowledgeSDK)
import KnowledgeSDK from '@knowledgesdk/node';
const ks = new KnowledgeSDK({ apiKey: 'knowledgesdk_live_...' });
// Simple URL to markdown
const result = await ks.extract('https://docs.example.com/getting-started');
console.log(result.markdown);
// # Getting Started
//
// This guide will walk you through setting up your first...
console.log(result.title); // "Getting Started | ExampleDocs"
console.log(result.description); // "Official getting started guide..."
// Full knowledge extraction (structured output)
const knowledge = await ks.extract('https://example.com/about');
console.log(knowledge.title);
console.log(knowledge.summary);
console.log(knowledge.markdown);
Python
import requests
API_KEY = "knowledgesdk_live_..."
BASE_URL = "https://api.knowledgesdk.com"
response = requests.post(
    f"{BASE_URL}/v1/extract",
    headers={"x-api-key": API_KEY},
    json={"url": "https://docs.example.com/getting-started"},
)
data = response.json()
markdown = data["markdown"]
title = data["title"]
print(f"Title: {title}")
print(f"Content length: {len(markdown)} characters")
print(markdown[:500]) # First 500 characters
Feeding to an LLM
import KnowledgeSDK from '@knowledgesdk/node';
import OpenAI from 'openai';
const ks = new KnowledgeSDK({ apiKey: 'knowledgesdk_live_...' });
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
async function answerFromURL(url: string, question: string) {
  // Extract clean markdown
  const { markdown } = await ks.extract(url);

  // Feed to LLM
  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      { role: 'system', content: 'Answer questions based on the provided content.' },
      { role: 'user', content: `Content:\n${markdown}\n\nQuestion: ${question}` }
    ]
  });

  return response.choices[0].message.content;
}

const answer = await answerFromURL(
  'https://docs.example.com/pricing',
  'What is included in the Pro plan?'
);
console.log(answer);
Edge Cases to Know About
Paywalls. If content is behind a paywall, the extraction will return the paywall page, not the article. There's no legitimate way to bypass this — and you shouldn't try.
Login-required content. Some pages require authentication. Scraping APIs can't handle this without credentials. Use Playwright with session management for authenticated scraping.
Infinite scroll. Pages that load content as you scroll won't fully render in a single page load. Quality extractors handle common infinite scroll patterns, but complex implementations may require manual pagination.
PDF and non-HTML content. Markdown extraction is designed for web pages. PDFs, Word documents, and other binary formats require dedicated document parsing libraries.
For the vast majority of publicly accessible web content, a quality markdown extraction API handles everything transparently. Start with KnowledgeSDK's 1,000 free requests and validate quality against your actual target URLs before committing to a plan.