What Is Full-Page Extraction?
Full-page extraction is the process of capturing all meaningful content from a web page — its body text, headings, links, images, metadata, and structured data — in a single operation. Unlike targeted extraction (which pulls specific fields like a price or a title), full-page extraction aims to preserve the complete informational content of a page in a clean, structured format.
The result is a comprehensive document representation of the page that can be stored in a knowledge base, fed to a search engine, passed to an LLM as context, or used as the basis for more targeted downstream extraction.
What Full-Page Extraction Captures
A complete full-page extraction typically includes:
- Title: the page's `<title>` tag and primary `<h1>`
- Body content: all readable text, converted to clean Markdown
- Headings hierarchy: `<h1>` through `<h6>` preserved as `#` through `######`
- Lists: ordered and unordered lists
- Code blocks: inline code and multi-line code fenced with language tags
- Links: all `<a href>` links with their anchor text
- Images: `src` URLs and `alt` text
- Metadata: Open Graph tags, meta description, author, published date, canonical URL
- Structured data: JSON-LD schema markup (Product, Article, FAQ, etc.) if present
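To make the list above concrete, here is a minimal sketch that pulls a few of those fields (title, headings, links, images) out of raw HTML using only Python's standard library. The `PageExtract` and `SimpleExtractor` names are hypothetical, and this is not how KnowledgeSDK implements extraction. It skips body-to-Markdown conversion, metadata, and JSON-LD, and it does not handle links nested inside headings.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class PageExtract:
    """Hypothetical container for a few of the fields a full-page extraction returns."""
    title: str = ""
    headings: list = field(default_factory=list)  # (level, text) pairs
    links: list = field(default_factory=list)     # {"text", "href"} dicts
    images: list = field(default_factory=list)    # {"src", "alt"} dicts


class SimpleExtractor(HTMLParser):
    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.page = PageExtract()
        self._capture = None  # tag whose text is currently being collected
        self._buffer = []
        self._href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title" or tag in self.HEADINGS:
            self._capture, self._buffer = tag, []
        elif tag == "a" and attrs.get("href"):
            self._capture, self._buffer, self._href = "a", [], attrs["href"]
        elif tag == "img" and attrs.get("src"):
            self.page.images.append({"src": attrs["src"], "alt": attrs.get("alt") or ""})

    def handle_data(self, data):
        if self._capture:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._capture:
            return
        # Collapse runs of whitespace so multi-line text becomes one clean string.
        text = " ".join("".join(self._buffer).split())
        if tag == "title":
            self.page.title = text
        elif tag == "a":
            self.page.links.append({"text": text, "href": self._href})
        else:
            self.page.headings.append((int(tag[1]), text))
        self._capture = None


def extract(html: str) -> PageExtract:
    parser = SimpleExtractor()
    parser.feed(html)
    parser.close()
    return parser.page
```

Calling `extract(html)` on a page's source returns one `PageExtract` holding everything the parser found, with no per-site selectors involved.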
Full-Page Extraction vs. Targeted Extraction
| Aspect | Full-Page Extraction | Targeted Extraction |
|---|---|---|
| Output | Complete document | Specific fields (price, author, etc.) |
| Schema required | No | Yes |
| Use case | Knowledge bases, RAG, search | Databases, analytics, structured datasets |
| Maintenance | Low (no selectors) | Medium (schema may need updates) |
Full-page extraction is the right choice when you want to index all of a page's content for future querying; targeted extraction is better when you need specific, database-ready fields.
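The two approaches can also be layered: a stored full-page record needs no schema, and a targeted view can be derived from it later. The sketch below assumes the record is a nested dictionary with a `metadata` object. The `get_path` and `targeted` helpers are hypothetical names for illustration, not part of any SDK.

```python
def get_path(doc, path):
    """Follow a dotted path such as 'metadata.author'; return None if any key is absent."""
    current = doc
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def targeted(doc, schema):
    """Build a flat, database-ready row from a full-page record.

    `schema` maps output column names to dotted paths in the record.
    """
    return {name: get_path(doc, path) for name, path in schema.items()}


full_page = {
    "title": "Getting Started Guide",
    "metadata": {"author": "Example Team", "published_at": "2025-09-01"},
}
row = targeted(full_page, {"author": "metadata.author", "published": "metadata.published_at"})
```

The schema lives only in the targeted step, so changing it means re-running `targeted` over stored records rather than re-fetching pages.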
KnowledgeSDK Full-Page Extraction
KnowledgeSDK's POST /v1/extract endpoint performs full-page extraction with a single API call, handling JavaScript rendering, Markdown conversion, and metadata parsing automatically:
    POST /v1/extract
    Authorization: Bearer knowledgesdk_live_...

    {
      "url": "https://docs.example.com/guides/getting-started"
    }
Response:
    {
      "url": "https://docs.example.com/guides/getting-started",
      "title": "Getting Started Guide",
      "markdown": "# Getting Started\n\nWelcome to Example...",
      "metadata": {
        "description": "Learn how to get started with Example in under 5 minutes.",
        "author": "Example Team",
        "published_at": "2025-09-01",
        "canonical_url": "https://docs.example.com/guides/getting-started"
      },
      "links": [
        { "text": "Installation", "href": "/guides/installation" },
        { "text": "API Reference", "href": "/api" }
      ]
    }
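The request above could be issued from Python roughly as follows, using only the standard library. This is a sketch under stated assumptions: the text gives only the path, so the base URL here is a placeholder, and the `KNOWLEDGESDK_API_KEY` environment variable name and the `Content-Type` header are guesses rather than documented requirements.

```python
import json
import os
import urllib.request

# Hypothetical base URL: the text names only the path POST /v1/extract.
BASE_URL = "https://api.knowledgesdk.example"


def build_extract_request(page_url, api_key, base_url=BASE_URL):
    """Build the POST /v1/extract request without sending it."""
    body = json.dumps({"url": page_url}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/extract",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",  # assumed, not stated in the text
        },
        method="POST",
    )


def extract_page(page_url, api_key=None, timeout=30):
    """Send the request and return the parsed JSON response as a dict."""
    api_key = api_key or os.environ["KNOWLEDGESDK_API_KEY"]  # assumed variable name
    request = build_extract_request(page_url, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Keeping request construction separate from sending makes the payload easy to inspect or test without network access. Given the response shape shown above, `extract_page(...)["markdown"]` would hold the page body.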
For visual capture alongside text extraction, combine with POST /v1/screenshot.
Use Cases for Full-Page Extraction
- RAG knowledge bases — index entire documentation sites, wikis, and help centers for retrieval-augmented generation
- Enterprise search — extract and index all intranet or partner site pages for internal search
- Competitive intelligence — capture all content from competitor sites for analysis
- AI training datasets — collect high-quality web text at scale for LLM fine-tuning
- Content migration — extract and preserve content from legacy websites before they go offline
- Legal and compliance archiving — preserve complete page snapshots with metadata for audit trails
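For the RAG use case in particular, extracted Markdown is usually split into smaller chunks before indexing. One common approach, sketched below, is to split at headings so each chunk keeps its section title. The `split_by_headings` function is a hypothetical helper, not a KnowledgeSDK feature, and it deliberately ignores edge cases such as `#` lines inside fenced code blocks.

```python
import re

# ATX-style Markdown heading: one to six '#' characters, whitespace, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.+?)\s*$")


def split_by_headings(markdown):
    """Split Markdown into chunks, one per heading, keeping the heading text and level."""
    chunks = []
    current = {"heading": "", "level": 0, "lines": []}

    def flush():
        # Keep a chunk only if it has a heading or some non-blank text.
        if current["heading"] or any(line.strip() for line in current["lines"]):
            chunks.append({
                "heading": current["heading"],
                "level": current["level"],
                "text": "\n".join(current["lines"]).strip(),
            })

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            current = {"heading": match.group(2), "level": len(match.group(1)), "lines": []}
        else:
            current["lines"].append(line)
    flush()
    return chunks
```

Each chunk can then be embedded and stored alongside the page URL and metadata, so retrieved passages can be traced back to their source section.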
Handling JavaScript-Rendered Pages
Full-page extraction must account for pages where content is generated by client-side JavaScript. KnowledgeSDK's extraction API uses a managed headless browser to fully render the page before extraction, ensuring that content from React, Vue, Angular, and Next.js applications is captured completely — not just the initial HTML shell.
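To see why rendering matters, it helps to check how little visible text a client-rendered page contains before JavaScript runs. The sketch below is a rough heuristic, not KnowledgeSDK's detection logic: it counts visible characters in the static HTML and flags the page as a probable shell when the count is below a threshold. The `looks_like_js_shell` name and the 200-character default are arbitrary choices for illustration.

```python
from html.parser import HTMLParser


class _TextCounter(HTMLParser):
    """Count characters of text outside script, style, and similar non-visible tags."""
    SKIP = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chars += len(data.strip())


def looks_like_js_shell(html, min_chars=200):
    """Return True when the static HTML has too little visible text to be the real page."""
    counter = _TextCounter()
    counter.feed(html)
    counter.close()
    return counter.chars < min_chars
```

A crawler that fetches static HTML first could use a check like this to decide which pages need a headless browser, although a managed rendering service makes that decision unnecessary.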