What Is Structured Data Extraction?
Structured data extraction is the process of identifying and pulling specific, named fields from unstructured or semi-structured web pages and outputting them in a machine-readable format — typically JSON or CSV. Rather than capturing an entire page as free text, you define the schema of what you want and the extractor fills it in.
Examples of structured extraction targets:
- E-commerce product pages → { name, price, sku, rating, reviews_count }
- Job listings → { title, company, location, salary_range, posted_date }
- Real estate listings → { address, price, bedrooms, bathrooms, sqft }
- News articles → { headline, author, published_at, body, tags }
Traditional Approaches
Historically, structured extraction relied on writing CSS selectors or XPath expressions that pinpoint specific DOM nodes:
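// CSS selector approach: each lookup is tied to specific tag and class names.
// Optional chaining yields null instead of throwing when a node is missing.
const price = document.querySelector('.product-price .amount')?.innerText.trim() ?? null;
const title = document.querySelector('h1.product-title')?.innerText.trim() ?? null;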
This works well for consistent, well-structured pages but breaks whenever the site's HTML changes, which happens often on sites with frequent frontend redesigns.
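One common way to soften this brittleness is to try several selectors in order, so that a renamed class does not immediately break the extractor. The following is a minimal sketch of that idea: the `firstText` helper and the extra fallback selectors (`[itemprop="price"]`, `[data-price]`, plain `h1`) are illustrative assumptions, not taken from any particular site.

```javascript
// Try each selector in order and return the trimmed text of the first match.
// `root` is anything with a querySelector method (document, an element, or a test stub).
function firstText(root, selectors) {
  for (const selector of selectors) {
    const node = root.querySelector(selector);
    const text = node?.textContent?.trim();
    if (text) return text;
  }
  return null; // no selector matched: report a missing field rather than throw
}

// Hypothetical fallback chains; every selector after the first is an assumption.
const PRICE_SELECTORS = ['.product-price .amount', '[itemprop="price"]', '[data-price]'];
const TITLE_SELECTORS = ['h1.product-title', 'h1'];

// In a browser: firstText(document, PRICE_SELECTORS)
```

Fallback chains delay breakage rather than prevent it: once every selector in the list stops matching, the field comes back as `null` and the extractor still needs maintenance.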
AI-Powered Structured Extraction
Modern extraction pipelines use LLMs to understand page content semantically and populate a JSON schema without hardcoded selectors. You describe the fields you want in natural language or as a JSON Schema, and the model fills them in:
{
  "schema": {
    "name": "string",
    "price": "number",
    "currency": "string",
    "in_stock": "boolean"
  }
}
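This approach is generally more resilient to HTML changes because the model keys on the meaning of the content rather than its position in the DOM. The output is still model-generated, so it should be checked against the schema before use.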
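Because the model fills in the schema rather than reading fixed DOM nodes, it can return a field with the wrong type or leave one out. A minimal sketch of checking a result against the flat type map shown above; `validateExtraction` is a hypothetical helper, and the sample object is made up for illustration.

```javascript
// Check an extracted object against a flat { field: "string" | "number" | "boolean" } schema.
// Returns a list of human-readable problems; an empty list means the object conforms.
function validateExtraction(schema, data) {
  const problems = [];
  for (const [field, expectedType] of Object.entries(schema)) {
    if (!(field in data) || data[field] === null) {
      problems.push(`${field}: missing`);
    } else if (typeof data[field] !== expectedType) {
      problems.push(`${field}: expected ${expectedType}, got ${typeof data[field]}`);
    } else if (expectedType === 'number' && !Number.isFinite(data[field])) {
      problems.push(`${field}: not a finite number`);
    }
  }
  return problems;
}

const schema = { name: 'string', price: 'number', currency: 'string', in_stock: 'boolean' };

// A model might return the price as a string and omit in_stock; validation surfaces both.
const problems = validateExtraction(schema, { name: 'Widget Pro', price: '49.00', currency: 'USD' });
// problems: ['price: expected number, got string', 'in_stock: missing']
```

For nested or constrained schemas, a full JSON Schema validator is a better fit than this hand-rolled check.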
Structured Extraction with KnowledgeSDK
KnowledgeSDK's POST /v1/extract endpoint combines JavaScript rendering, Markdown extraction, and AI-powered field parsing into a single call:
POST /v1/extract
Authorization: Bearer knowledgesdk_live_...

{
  "url": "https://shop.example.com/products/widget-pro",
  "schema": {
    "product_name": "string",
    "price_usd": "number",
    "availability": "string",
    "description": "string"
  }
}
The API returns a structured JSON object matching your schema, ready to insert into a database or pass to downstream processing.
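The request above could be issued from JavaScript roughly as follows. This is a sketch under stated assumptions: the base URL is a placeholder (the source gives only the path `/v1/extract`), the `Content-Type` header is assumed because the body is JSON, and the helper names `buildExtractRequest` and `extract` are hypothetical. The response shape is not assumed beyond being JSON.

```javascript
// ASSUMPTION: placeholder base URL; substitute the one from your KnowledgeSDK account or docs.
const BASE_URL = 'https://api.knowledgesdk.example';

// Build the fetch() arguments for POST /v1/extract without sending anything.
function buildExtractRequest(apiKey, url, schema) {
  return {
    endpoint: `${BASE_URL}/v1/extract`,
    options: {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${apiKey}`,
        'Content-Type': 'application/json', // assumed, since the body is JSON
      },
      body: JSON.stringify({ url, schema }),
    },
  };
}

// Send the request and return the parsed JSON body.
// Uses the global fetch available in browsers and in Node.js 18+.
async function extract(apiKey, url, schema) {
  const { endpoint, options } = buildExtractRequest(apiKey, url, schema);
  const response = await fetch(endpoint, options);
  if (!response.ok) {
    throw new Error(`Extract request failed with HTTP ${response.status}`);
  }
  return response.json();
}
```

Keeping request construction separate from the network call makes the payload easy to inspect and test without contacting the API.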
Output Formats
| Format | Best For |
|---|---|
| JSON | APIs, databases, LLM context |
| CSV | Spreadsheets, bulk analytics |
| Markdown | LLM ingestion, knowledge bases |
| Parquet | Large-scale data warehousing |
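Moving between these formats is usually a small serialization step once the records are flat. As an illustration, here is a minimal sketch that turns extracted JSON records into CSV using RFC 4180 quoting; `csvCell` and `toCsv` are hypothetical helpers and the sample records are invented.

```javascript
// Quote a value per RFC 4180: wrap it in double quotes if it contains a comma,
// a double quote, or a line break, and double any embedded quotes.
function csvCell(value) {
  const text = value === null || value === undefined ? '' : String(value);
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Serialize flat records to CSV with a header row; missing fields become empty cells.
function toCsv(records, fields) {
  const header = fields.map(csvCell).join(',');
  const rows = records.map((record) => fields.map((f) => csvCell(record[f])).join(','));
  return [header, ...rows].join('\r\n');
}

const csv = toCsv(
  [
    { name: 'Widget Pro', price: 49, sku: 'WP-1' },
    { name: 'Cable, 2 m', price: 9.5 }, // no sku: the last cell stays empty
  ],
  ['name', 'price', 'sku'],
);
```

Nested fields such as variants or review threads do not map cleanly onto rows, which is one reason JSON remains the more natural default for extraction output.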
Challenges in Structured Extraction
- Schema drift — target sites change their layout, breaking field mappings
- Missing fields — some pages omit optional fields entirely
- Ambiguous values — prices in mixed currencies, dates in different formats
- Nested structures — product variants, review threads, and comment trees require recursive extraction
- Pagination — a full product catalog may span hundreds of pages
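For production pipelines, a robust pattern is to combine AI extraction with schema validation, which catches missing or mistyped fields, and fallback defaults, which keep downstream code working when optional fields are absent.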
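Two of the challenges listed above, missing fields and ambiguous values, can be handled in a small post-processing step. The sketch below is illustrative only: `parsePrice` and `withDefaults` are hypothetical helpers, the default values are invented, and the price parser assumes US-style formatting where "," is a thousands separator and "." is the decimal point.

```javascript
// Parse a price such as "$1,299.00" or "EUR 49" into a number, or null if none is found.
// ASSUMPTION: "," is a thousands separator and "." is the decimal point.
function parsePrice(raw) {
  if (typeof raw === 'number') return Number.isFinite(raw) ? raw : null;
  if (typeof raw !== 'string') return null;
  const match = raw.replace(/,/g, '').match(/\d+(\.\d+)?/);
  return match ? Number(match[0]) : null;
}

// Fill in fallback defaults for fields the extractor left out (undefined or null).
function withDefaults(record, defaults) {
  const result = { ...defaults };
  for (const [key, value] of Object.entries(record)) {
    if (value !== undefined && value !== null) result[key] = value;
  }
  return result;
}

const cleaned = withDefaults(
  { name: 'Widget Pro', price: parsePrice('$1,299.00'), rating: null },
  { currency: 'USD', rating: 0, in_stock: false }, // hypothetical defaults
);
// cleaned: { currency: 'USD', rating: 0, in_stock: false, name: 'Widget Pro', price: 1299 }
```

Defaults are best reserved for genuinely optional fields. For required fields such as price, a `null` that fails validation is usually safer than a silently substituted value.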