Full-Page Extraction

Capturing all visible and structured content from a web page — text, links, metadata, and media references — in a single API call.

What Is Full-Page Extraction?

Full-page extraction is the process of capturing all meaningful content from a web page — its body text, headings, links, images, metadata, and structured data — in a single operation. Unlike targeted extraction (which pulls specific fields like a price or a title), full-page extraction aims to preserve the complete informational content of a page in a clean, structured format.

The result is a complete, structured representation of the page that can be stored in a knowledge base, fed to a search engine, passed to an LLM as context, or used as the basis for more targeted downstream extraction.

What Full-Page Extraction Captures

A complete full-page extraction typically includes:

Title — the page's <title> tag and primary <h1>
Body content — all readable text, converted to clean Markdown
Headings hierarchy — <h1> through <h6> preserved as # through ######
Lists — ordered and unordered lists
Code blocks — inline code and multi-line code fenced with language tags
Links — all <a href> links with their anchor text
Images — src URLs and alt text
Metadata — Open Graph tags, meta description, author, published date, canonical URL
Structured data — JSON-LD schema markup (Product, Article, FAQ, etc.) if present
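
To make the list above concrete, here is a minimal sketch of a collector built on Python's standard-library html.parser. It gathers the title, headings, links, images, meta tags, and JSON-LD blocks into one dictionary. The class name, function name, and output keys are illustrative assumptions, not KnowledgeSDK's implementation, and body-text-to-Markdown conversion is deliberately left out.

```python
import json
from html.parser import HTMLParser


class PageExtractor(HTMLParser):
    """Collects title, headings, links, images, meta tags, and JSON-LD blocks."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.doc = {"title": "", "headings": [], "links": [],
                    "images": [], "metadata": {}, "structured_data": []}
        # Stack of (kind, text_buffer, href) for elements whose text we collect.
        self._open = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta":
            key = a.get("property") or a.get("name")
            if key and a.get("content") is not None:
                self.doc["metadata"][key] = a["content"]
        elif tag == "link" and a.get("rel") == "canonical":
            self.doc["metadata"]["canonical_url"] = a.get("href")
        elif tag == "img":
            self.doc["images"].append({"src": a.get("src"), "alt": a.get("alt") or ""})
        elif tag == "script" and a.get("type") == "application/ld+json":
            self._open.append(("jsonld", [], None))
        elif tag == "title" or tag in self.HEADINGS:
            self._open.append((tag, [], None))
        elif tag == "a" and a.get("href"):
            self._open.append(("a", [], a["href"]))

    def handle_data(self, data):
        # Text belongs to every open element, so a link inside a heading
        # contributes to both the link text and the heading text.
        for _, buffer, _ in self._open:
            buffer.append(data)

    def handle_endtag(self, tag):
        if not self._open:
            return
        kind, buffer, href = self._open[-1]
        closing_tag = "script" if kind == "jsonld" else kind
        if tag != closing_tag:
            return
        self._open.pop()
        text = "".join(buffer).strip()
        if kind == "title":
            self.doc["title"] = text
        elif kind == "a":
            self.doc["links"].append({"text": text, "href": href})
        elif kind == "jsonld":
            try:
                self.doc["structured_data"].append(json.loads(text))
            except json.JSONDecodeError:
                pass  # skip malformed JSON-LD instead of failing the whole page
        else:
            self.doc["headings"].append({"level": int(kind[1]), "text": text})


def extract_page(html):
    """Parse an HTML string and return the collected document dictionary."""
    parser = PageExtractor()
    parser.feed(html)
    parser.close()
    return parser.doc
```

Calling extract_page(html) on a fetched page returns a dictionary with the keys shown in __init__. The sketch assumes reasonably well-formed markup; a production extractor would also need to handle unclosed tags and strip navigation and other boilerplate.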

Full-Page Extraction vs. Targeted Extraction

Aspect	Full-Page Extraction	Targeted Extraction
Output	Complete document	Specific fields (price, author, etc.)
Schema required	No	Yes
Use case	Knowledge bases, RAG, search	Databases, analytics, structured datasets
Maintenance	Low (no selectors)	Medium (schema may need updates)

Full-page extraction is the right choice when you want to index all of a page's content for future querying; targeted extraction is better when you need specific, database-ready fields.
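
The two approaches can also be combined: a stored full-page document can feed a later targeted step. The sketch below assumes a hypothetical schema format in which each output field maps to a dotted path in the full-page result. That format and the function name are illustrative only and are not part of any KnowledgeSDK API.

```python
from typing import Any


def pick_fields(document: dict, schema: dict) -> dict:
    """Resolve each dotted path in `schema` against a full-page document.

    Paths that do not exist resolve to None, so every row has the same columns.
    """
    row = {}
    for field, path in schema.items():
        value: Any = document
        for key in path.split("."):
            value = value.get(key) if isinstance(value, dict) else None
        row[field] = value
    return row


# A trimmed full-page result and a hypothetical schema for a database row.
page = {
    "title": "Getting Started Guide",
    "metadata": {"author": "Example Team", "published_at": "2025-09-01"},
}
schema = {
    "title": "title",
    "author": "metadata.author",
    "published": "metadata.published_at",
    "price": "metadata.price",  # absent on this page, so it resolves to None
}
row = pick_fields(page, schema)
```

The schema is the part that needs maintenance when fields change, whereas the stored full-page document does not need to be re-extracted.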

KnowledgeSDK Full-Page Extraction

KnowledgeSDK's POST /v1/extract endpoint performs full-page extraction with a single API call, handling JavaScript rendering, Markdown conversion, and metadata parsing automatically:

POST /v1/extract
Authorization: Bearer knowledgesdk_live_...

{
  "url": "https://docs.example.com/guides/getting-started"
}

Response:

{
  "url": "https://docs.example.com/guides/getting-started",
  "title": "Getting Started Guide",
  "markdown": "# Getting Started\n\nWelcome to Example...",
  "metadata": {
    "description": "Learn how to get started with Example in under 5 minutes.",
    "author": "Example Team",
    "published_at": "2025-09-01",
    "canonical_url": "https://docs.example.com/guides/getting-started"
  },
  "links": [
    { "text": "Installation", "href": "/guides/installation" },
    { "text": "API Reference", "href": "/api" }
  ]
}
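
As a rough illustration, the request above could be issued from Python with only the standard library. The API host is not given here, so BASE_URL below is a placeholder assumption. The Content-Type header is likewise an assumption based on common practice for JSON bodies. The helper names are hypothetical.

```python
import json
import urllib.request
from urllib.parse import urljoin

# Assumption: placeholder host; substitute the real API base URL.
BASE_URL = "https://api.example.invalid"


def build_extract_request(page_url, api_key, base_url=BASE_URL):
    """Build (but do not send) the POST /v1/extract request."""
    body = json.dumps({"url": page_url}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/extract",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            # Assumption: not shown in the example request above.
            "Content-Type": "application/json",
        },
    )


def extract(page_url, api_key, base_url=BASE_URL, timeout=30):
    """Send the request and return the decoded JSON response."""
    request = build_extract_request(page_url, api_key, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def absolute_links(result):
    """Resolve relative hrefs in a response against the page URL."""
    return [
        {"text": link["text"], "href": urljoin(result["url"], link["href"])}
        for link in result.get("links", [])
    ]
```

The hrefs in the sample response are relative, so a step like absolute_links is useful before storing or crawling them.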

To capture a visual snapshot alongside the extracted text, also call POST /v1/screenshot.

Use Cases for Full-Page Extraction

RAG knowledge bases — index entire documentation sites, wikis, and help centers for retrieval-augmented generation
Enterprise search — extract and index all intranet or partner site pages for internal search
Competitive intelligence — capture all content from competitor sites for analysis
AI training datasets — collect high-quality web text at scale for LLM fine-tuning
Content migration — extract and preserve content from legacy websites before they go offline
Legal and compliance archiving — preserve complete page snapshots with metadata for audit trails
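
For the RAG and search use cases, the extracted Markdown is usually split into smaller sections before indexing. The following is one possible sketch, assuming ATX-style headings (# through ######) as described earlier. It splits on headings, ignores # lines inside fenced code, and tags each section with its heading path. The function name and output shape are illustrative.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
FENCE = "`" * 3  # Markdown code-fence marker


def split_by_heading(markdown):
    """Split Markdown into sections, each tagged with its heading path."""
    sections, path, lines = [], [], []
    in_fence = False

    def flush():
        # Emit the text gathered under the current heading path, if any.
        text = "\n".join(lines).strip()
        if text:
            sections.append({"path": " > ".join(path), "text": text})
        lines.clear()

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        # Lines inside a code fence are never treated as headings.
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            del path[level - 1:]  # drop headings at this depth or deeper
            path.append(match.group(2))
        else:
            lines.append(line)
    flush()
    return sections
```

Keeping the heading path with each section gives a retriever some context about where the text came from, which a fixed-size character split would lose.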

Handling JavaScript-Rendered Pages

Full-page extraction must account for pages where content is generated by client-side JavaScript. KnowledgeSDK's extraction API uses a managed headless browser to fully render the page before extraction, so that content rendered client-side by React, Vue, Angular, or Next.js applications is captured, not just the initial HTML shell.
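
To see why rendering matters, consider a simple heuristic for spotting an "HTML shell": count the visible text in the raw, unrendered HTML and flag the page when there is almost none. This is only a sketch of the problem, not a description of how KnowledgeSDK decides when to render. The 200-character threshold is an arbitrary assumption.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Counts text characters outside non-visible elements."""

    SKIP = {"script", "style", "noscript", "template", "title"}

    def __init__(self):
        super().__init__()
        self.chars = 0
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chars += len(data.strip())


def looks_like_js_shell(raw_html, min_chars=200):
    """Return True when the raw HTML carries almost no visible text."""
    parser = _VisibleText()
    parser.feed(raw_html)
    parser.close()
    return parser.chars < min_chars
```

A single-page app that ships only a root element and script bundles would be flagged, while a server-rendered article would not. Flagged pages are the ones where a plain HTTP fetch would miss the content and a headless browser is needed.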

Related Terms

Web Scraping & Extraction (beginner)

Web Scraping

The automated extraction of data from websites by programmatically fetching and parsing HTML content.

Web Scraping & Extraction (beginner)

Markdown Extraction

Converting raw HTML web pages into clean, structured Markdown text, removing navigation, ads, and boilerplate.

Web Scraping & Extraction (intermediate)

JavaScript Rendering

The process of executing a page's JavaScript in a real or headless browser to capture the fully rendered DOM before extraction.
