What Is Intelligent Extraction?
Intelligent extraction is the application of large language models (LLMs) or other AI techniques to understand, interpret, and extract meaningful information from web pages — without requiring developers to write CSS selectors, XPath expressions, or regular expressions for each target site.
Instead of teaching a scraper "the price is in the element with class product-price__amount," you tell the AI "find the price" and let it reason about the page's content to locate and return the value.
Why Traditional Extraction Falls Short
Traditional DOM-parsing scrapers have a fundamental weakness: they are tightly coupled to the HTML structure of a specific page. When a site redesigns its frontend:
- Class names change or get hashed
- Element hierarchies shift
- Data moves to different DOM nodes
The result is a broken scraper that silently returns empty or incorrect data until a developer notices and rewrites the selectors.
Intelligent extraction sidesteps this problem by operating at the semantic level — understanding what the content means rather than where it lives in the DOM.
How Intelligent Extraction Works
1. Fetch and render — the target URL is fetched and JavaScript is executed if needed
2. Convert to clean text — the HTML is converted to Markdown, dramatically reducing token count and noise
3. Define the schema — the developer specifies the fields to extract, either as a JSON Schema or a natural-language description
4. Run LLM inference — the model reads the Markdown and populates the schema by reasoning about the content
5. Validate — the output is checked against the schema and returned as structured JSON
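The final validation step can be sketched as follows. This is an illustrative check against the doc's informal schema style (`"string"`, `"array of strings"`), not a full JSON Schema validator:

```python
import json

# Sketch of the validation step: confirm every schema field exists in the
# LLM output and has roughly the declared type.
def validate(data: dict, schema: dict) -> dict:
    for field, kind in schema.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        value = data[field]
        if kind == "string" and not isinstance(value, str):
            raise ValueError(f"{field} must be a string")
        if kind == "array of strings" and not (
            isinstance(value, list) and all(isinstance(v, str) for v in value)
        ):
            raise ValueError(f"{field} must be a list of strings")
    return data

schema = {"headline": "string", "topics": "array of strings"}
output = {
    "headline": "Researchers Achieve New AI Benchmark",
    "topics": ["artificial intelligence", "benchmarks"],
}
print(json.dumps(validate(output, schema), indent=2))
```

Validation like this is also the first line of defense against hallucinated fields: output that doesn't match the declared shape is rejected rather than passed downstream.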
KnowledgeSDK's Intelligent Extraction
KnowledgeSDK's POST /v1/extract combines all of these steps into a single API call:
```
POST /v1/extract
Authorization: Bearer knowledgesdk_live_...

{
  "url": "https://news.example.com/article/ai-breakthrough",
  "schema": {
    "headline": "string",
    "author": "string",
    "published_date": "string",
    "summary": "string",
    "topics": "array of strings"
  }
}
```

Response:

```json
{
  "headline": "Researchers Achieve New AI Benchmark",
  "author": "Jane Smith",
  "published_date": "2025-11-20",
  "summary": "A team at MIT has demonstrated...",
  "topics": ["artificial intelligence", "machine learning", "benchmarks"]
}
```
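From client code, the call above reduces to one POST with a bearer token. Here is a sketch using only the standard library; the base URL `api.knowledgesdk.com` is an assumption (substitute your actual endpoint), and the request is built but not sent:

```python
import json
import urllib.request

API_KEY = "knowledgesdk_live_..."  # placeholder key, as shown in the docs

def build_extract_request(url: str, schema: dict) -> urllib.request.Request:
    """Build (but do not send) a POST /v1/extract request."""
    body = json.dumps({"url": url, "schema": schema}).encode("utf-8")
    return urllib.request.Request(
        "https://api.knowledgesdk.com/v1/extract",  # base URL is an assumption
        data=body,
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_extract_request(
    "https://news.example.com/article/ai-breakthrough",
    {"headline": "string", "topics": "array of strings"},
)
print(req.get_method(), req.full_url)
```

Sending it is then a single `urllib.request.urlopen(req)` call; the response body is the structured JSON shown above.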
No selectors. No maintenance. No breakage when the site redesigns.
Advantages Over Rule-Based Extraction
- Resilience to layout changes — the model adapts to new HTML structures automatically
- Handles ambiguity — can infer a field even when it is phrased differently on different pages
- Cross-site consistency — the same schema works across many sites without per-site configuration
- Natural language schemas — define fields in plain English rather than technical selector syntax
- Handles edge cases — missing fields, merged fields, and unusual formatting are handled gracefully
Use Cases
- Multi-site price monitoring — extract prices from hundreds of e-commerce sites with one schema
- News and research aggregation — collect structured article data without a parser per publisher
- Lead enrichment — extract company details from any website without site-specific scrapers
- Knowledge base construction — turn arbitrary web pages into structured knowledge items for RAG
- Regulatory compliance monitoring — extract key terms from legal and government pages
Limitations
- Latency — LLM inference adds 1-5 seconds compared to sub-100ms CSS selector extraction
- Cost — token usage accumulates at scale; optimize by pre-cleaning with Markdown extraction
- Hallucination risk — LLMs can occasionally invent data not present on the page; schema validation and confidence scores mitigate this
- Token limits — very long pages must be chunked or summarized before extraction
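For the token-limit case, a naive word-count chunker is often enough as a first pass. This sketch uses word count as a rough proxy for tokens (an assumption — real token counts depend on the model's tokenizer):

```python
# Split long Markdown into chunks that fit a model's context window.
# max_words is a rough token proxy; real limits depend on the tokenizer.
def chunk_markdown(text: str, max_words: int = 500) -> list[str]:
    words = text.split()
    return [
        " ".join(words[i : i + max_words])
        for i in range(0, len(words), max_words)
    ]

chunks = chunk_markdown("word " * 1200, max_words=500)
print(len(chunks))  # -> 3 chunks of 500, 500, and 200 words
```

Each chunk can then be extracted separately and the partial results merged, or the chunks summarized first and the summary extracted in one call.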