knowledgesdk.com/glossary/full-page-extraction
Web Scraping & Extraction · beginner

Also known as: full page scrape

Full-Page Extraction

Capturing all visible and structured content from a web page — text, links, metadata, and media references — in a single API call.

What Is Full-Page Extraction?

Full-page extraction is the process of capturing all meaningful content from a web page — its body text, headings, links, images, metadata, and structured data — in a single operation. Unlike targeted extraction (which pulls specific fields like a price or a title), full-page extraction aims to preserve the complete informational content of a page in a clean, structured format.

The result is a comprehensive document representation of the page that can be stored in a knowledge base, fed to a search engine, passed to an LLM as context, or used as the basis for more targeted downstream extraction.

What Full-Page Extraction Captures

A complete full-page extraction typically includes:

  • Title — the page's <title> tag and primary <h1>
  • Body content — all readable text, converted to clean Markdown
  • Headings hierarchy — <h1> through <h6> preserved as # through ######
  • Lists — ordered and unordered lists
  • Code blocks — inline code and multi-line code fenced with language tags
  • Links — all <a href> links with their anchor text
  • Images — src URLs and alt text
  • Metadata — Open Graph tags, meta description, author, published date, canonical URL
  • Structured data — JSON-LD schema markup (Product, Article, FAQ, etc.) if present
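
As a rough illustration of how several of these items can be pulled from raw HTML, here is a minimal sketch built on Python's standard-library html.parser. The PageCapture class, its attribute names, and the SAMPLE page are hypothetical and exist only for this example; it is not how KnowledgeSDK implements extraction, and it skips body-text conversion, lists, code blocks, and JSON-LD.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class PageCapture(HTMLParser):
    """Hypothetical collector for title, headings, links, images, and meta tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.headings = []   # Markdown-style heading lines, e.g. "## Next steps"
        self.links = []      # {"text": ..., "href": ...}
        self.images = []     # {"src": ..., "alt": ...}
        self.metadata = {}   # meta name/property -> content
        self._open = []      # stack of (tag, attrs, text_parts) still being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title" or tag in HEADINGS or (tag == "a" and "href" in attrs):
            self._open.append((tag, attrs, []))
        elif tag == "img" and "src" in attrs:
            self.images.append({"src": attrs["src"], "alt": attrs.get("alt", "")})
        elif tag == "meta":
            key = attrs.get("name") or attrs.get("property")
            if key and "content" in attrs:
                self.metadata[key] = attrs["content"]

    def handle_data(self, data):
        # Text belongs to every element currently open on the stack.
        for _, _, parts in self._open:
            parts.append(data)

    def handle_endtag(self, tag):
        if not self._open or self._open[-1][0] != tag:
            return
        tag, attrs, parts = self._open.pop()
        text = " ".join("".join(parts).split())  # collapse whitespace
        if tag == "title":
            self.title = text
        elif tag in HEADINGS:
            self.headings.append("#" * int(tag[1]) + " " + text)
        else:
            self.links.append({"text": text, "href": attrs["href"]})


SAMPLE = """
<html><head>
<title>Getting Started Guide</title>
<meta name="description" content="Learn how to get started.">
<meta property="og:title" content="Getting Started">
</head><body>
<h1>Getting Started</h1>
<p>Read the <a href="/guides/installation">Installation</a> guide.</p>
<h2>Next steps</h2>
<img src="/img/setup.png" alt="Setup screen">
</body></html>
"""


def capture(html):
    parser = PageCapture()
    parser.feed(html)
    parser.close()
    return parser


page = capture(SAMPLE)
print(page.title, page.headings, page.links, page.images, page.metadata)
```

A production extractor would also need to handle malformed markup, relative URL resolution, and boilerplate removal, none of which this sketch attempts.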

Full-Page Extraction vs. Targeted Extraction

| Aspect | Full-Page Extraction | Targeted Extraction |
| --- | --- | --- |
| Output | Complete document | Specific fields (price, author, etc.) |
| Schema required | No | Yes |
| Use case | Knowledge bases, RAG, search | Databases, analytics, structured datasets |
| Maintenance | Low (no selectors) | Medium (schema may need updates) |

Full-page extraction is the right choice when you want to index all of a page's content for future querying; targeted extraction is better when you need specific, database-ready fields.
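
The two approaches can also be layered: a full-page result is stored once, and a schema is applied later to produce database-ready rows. The sketch below assumes the full-page result is a plain dict with "title" and "metadata" keys; the pick_fields helper, the schema format, and the sample values are hypothetical and only illustrate why targeted extraction needs a schema while full-page extraction does not.

```python
def pick_fields(document, schema):
    """Apply a field -> key-path schema to a full-page document dict.

    Missing paths yield None instead of raising, so a schema that is out
    of date shows up as empty columns rather than a crash.
    """
    row = {}
    for field, path in schema.items():
        value = document
        for key in path:
            value = value.get(key) if isinstance(value, dict) else None
            if value is None:
                break
        row[field] = value
    return row


# Hypothetical full-page result; no schema was needed to produce it.
document = {
    "title": "Getting Started Guide",
    "markdown": "# Getting Started\n\nWelcome...",
    "metadata": {"author": "Example Team", "published_at": "2025-09-01"},
}

# The schema is what makes the extraction "targeted".
schema = {
    "title": ("title",),
    "author": ("metadata", "author"),
    "published": ("metadata", "published_at"),
    "price": ("metadata", "price"),  # absent on this page
}

row = pick_fields(document, schema)
print(row)
```

The "price" column comes back as None here, which is the maintenance cost the table refers to: when pages change, the schema has to be revisited, while the stored full-page document stays usable.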

KnowledgeSDK Full-Page Extraction

KnowledgeSDK's POST /v1/extract endpoint performs full-page extraction with a single API call, handling JavaScript rendering, Markdown conversion, and metadata parsing automatically:

POST /v1/extract
Authorization: Bearer knowledgesdk_live_...

{
  "url": "https://docs.example.com/guides/getting-started"
}

Response:

{
  "url": "https://docs.example.com/guides/getting-started",
  "title": "Getting Started Guide",
  "markdown": "# Getting Started\n\nWelcome to Example...",
  "metadata": {
    "description": "Learn how to get started with Example in under 5 minutes.",
    "author": "Example Team",
    "published_at": "2025-09-01",
    "canonical_url": "https://docs.example.com/guides/getting-started"
  },
  "links": [
    { "text": "Installation", "href": "/guides/installation" },
    { "text": "API Reference", "href": "/api" }
  ]
}
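
A minimal Python sketch of this call using only the standard library might look like the following. The request above shows only the path, so BASE_URL is a placeholder assumption, as is the Content-Type header; the function names are hypothetical, and summarize simply reads fields shaped like the example response.

```python
import json
import urllib.request

# Assumption: the host is not given in the example request, so this is a
# placeholder that must be replaced with the real API base URL.
BASE_URL = "https://api.knowledgesdk.example"


def build_extract_request(page_url, api_key, base_url=BASE_URL):
    """Build (but do not send) a POST /v1/extract request."""
    body = json.dumps({"url": page_url}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/v1/extract",
        data=body,
        method="POST",
        headers={
            "Authorization": "Bearer " + api_key,
            # Assumption: JSON bodies are sent with this content type.
            "Content-Type": "application/json",
        },
    )


def extract_page(page_url, api_key, base_url=BASE_URL, timeout=30):
    """Send the request and decode the JSON response body."""
    request = build_extract_request(page_url, api_key, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def summarize(result):
    """Read a few fields from a response shaped like the example above."""
    return {
        "title": result.get("title", ""),
        "author": result.get("metadata", {}).get("author"),
        "link_count": len(result.get("links", [])),
    }
```

Usage would be along the lines of summarize(extract_page("https://docs.example.com/guides/getting-started", api_key)), with the key read from an environment variable rather than hard-coded.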

For visual capture alongside text extraction, combine with POST /v1/screenshot.

Use Cases for Full-Page Extraction

  • RAG knowledge bases — index entire documentation sites, wikis, and help centers for retrieval-augmented generation
  • Enterprise search — extract and index all intranet or partner site pages for internal search
  • Competitive intelligence — capture all content from competitor sites for analysis
  • AI training datasets — collect high-quality web text at scale for LLM fine-tuning
  • Content migration — extract and preserve content from legacy websites before they go offline
  • Legal and compliance archiving — preserve complete page snapshots with metadata for audit trails
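
For the RAG use case in particular, the Markdown from a full-page extraction is usually split into smaller passages before indexing. One simple way to do that, sketched here under the assumption that the Markdown uses # through ###### headings, is to start a new chunk at each heading. The chunk_by_heading function and its output shape are hypothetical; real pipelines often add size limits and overlap on top of this.

```python
import re

FENCE = "`" * 3  # Markdown code-fence marker
HEADING = re.compile(r"#{1,6} ")


def chunk_by_heading(markdown):
    """Split Markdown into {"heading", "text"} chunks, one per heading.

    Lines inside fenced code blocks are never treated as headings, so a
    "# comment" in a code sample does not start a new chunk.
    """
    chunks = []
    heading, lines = "", []
    in_fence = False

    def flush():
        text = "\n".join(lines).strip()
        if heading or text:
            chunks.append({"heading": heading, "text": text})

    for line in markdown.splitlines():
        if line.startswith(FENCE):
            in_fence = not in_fence
        if not in_fence and HEADING.match(line):
            flush()
            heading, lines = line.lstrip("#").strip(), []
        else:
            lines.append(line)
    flush()
    return chunks


doc = "# Getting Started\n\nWelcome to Example.\n\n## Install\n\nRun the installer.\n"
chunks = chunk_by_heading(doc)
print(chunks)
```

Each chunk can then be embedded and stored alongside the page URL and metadata so that retrieved passages can be traced back to their source.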

Handling JavaScript-Rendered Pages

Full-page extraction must account for pages where content is generated by client-side JavaScript. KnowledgeSDK's extraction API uses a managed headless browser to fully render the page before extraction, ensuring that content from React, Vue, Angular, and Next.js applications is captured completely — not just the initial HTML shell.
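
To see why rendering matters, it helps to look at what a plain HTTP fetch returns for a client-rendered app: often little more than an empty container and script tags. The heuristic below is a sketch of how a fetcher could flag such a shell by measuring visible text. The looks_like_js_shell function and the 200-character threshold are arbitrary assumptions for illustration, not a description of how KnowledgeSDK decides when to render.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collects text outside script/style/noscript/template/title elements."""

    SKIP = {"script", "style", "noscript", "template", "title"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.script_count = 0
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_count += 1
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def looks_like_js_shell(html, min_chars=200):
    """Heuristic: scripts are present but almost no visible text is."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    text = " ".join("".join(parser.parts).split())
    return parser.script_count > 0 and len(text) < min_chars


SHELL = (
    '<html><head><title>App</title></head><body>'
    '<div id="root"></div><script src="/static/app.js"></script>'
    '</body></html>'
)
STATIC = "<html><body><h1>Guide</h1><p>" + "Readable text. " * 30 + "</p></body></html>"

print(looks_like_js_shell(SHELL), looks_like_js_shell(STATIC))
```

A page flagged this way would need to be loaded in a headless browser before its content can be extracted; server-rendered pages can usually be parsed directly from the initial response.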

Related Terms

Web Scraping & Extraction · beginner
Web Scraping
The automated extraction of data from websites by programmatically fetching and parsing HTML content.
Web Scraping & Extraction · beginner
Markdown Extraction
Converting raw HTML web pages into clean, structured Markdown text, removing navigation, ads, and boilerplate.
Web Scraping & Extraction · intermediate
JavaScript Rendering
The process of executing a page's JavaScript in a real or headless browser to capture the fully rendered DOM before extraction.
