Structured Data Extraction

Pulling specific fields — prices, names, dates — from web pages into structured formats like JSON or CSV.

What Is Structured Data Extraction?

Structured data extraction is the process of identifying and pulling specific, named fields from unstructured or semi-structured web pages and outputting them in a machine-readable format — typically JSON or CSV. Rather than capturing an entire page as free text, you define the schema of what you want and the extractor fills it in.

Examples of structured extraction targets:

E-commerce product pages → { name, price, sku, rating, reviews_count }
Job listings → { title, company, location, salary_range, posted_date }
Real estate listings → { address, price, bedrooms, bathrooms, sqft }
News articles → { headline, author, published_at, body, tags }

Traditional Approaches

Historically, structured extraction relied on writing CSS selectors or XPath expressions that pinpoint specific DOM nodes:

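// CSS selector approach. Optional chaining yields null instead of throwing
// a TypeError when a selector matches nothing.
const price = document.querySelector('.product-price .amount')?.innerText.trim() ?? null;
const title = document.querySelector('h1.product-title')?.innerText.trim() ?? null;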

This works well for consistent, well-structured pages but breaks whenever a change to the site's HTML renames or moves the targeted elements, which is common on sites with frequent frontend redesigns.
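
One common way to soften this brittleness is to try several selectors in order and return null when none of them match. The following is a minimal sketch, not a prescribed method: the helper name firstText is hypothetical, and the fallback selectors ([data-price], .price) are illustrative assumptions rather than selectors from any particular site.

```javascript
// Hypothetical helper: returns the trimmed text of the first selector that
// matches a non-empty node, or null if none do. `root` is anything that
// exposes querySelector (for example `document` in a browser).
function firstText(root, selectors) {
  for (const selector of selectors) {
    const node = root.querySelector(selector);
    const text = node?.textContent?.trim();
    if (text) return text;
  }
  return null;
}

// Illustrative fallback chain; the second and third selectors are assumptions.
const PRICE_SELECTORS = ['.product-price .amount', '[data-price]', '.price'];

// In a browser: const price = firstText(document, PRICE_SELECTORS);
```

A fallback chain only delays breakage; it still depends on someone anticipating how the markup might change.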

AI-Powered Structured Extraction

Modern extraction pipelines use LLMs to understand page content semantically and fill in a schema without hardcoded selectors. You describe the fields you want in natural language or as a schema (a simple field-to-type map, as here, or a full JSON Schema), and the model fills them in:

{
  "schema": {
    "name": "string",
    "price": "number",
    "currency": "string",
    "in_stock": "boolean"
  }
}

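This approach is generally more resilient to HTML changes because the model works from the meaning of the page content rather than from fixed DOM paths. Its output can still be wrong or incomplete, so it should be checked against the schema.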
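
To make the idea concrete, here is a rough sketch of how a field-to-type map might be turned into model instructions, and how the reply might be reduced to the requested fields. Both function names are hypothetical, the prompt wording is an assumption, and the call to the model itself is left out because it depends on the provider.

```javascript
// Hypothetical helper: turns a field-to-type map into instructions for a model.
function buildExtractionPrompt(schema, pageText) {
  const fieldLines = Object.entries(schema)
    .map(([field, type]) => `- ${field} (${type})`)
    .join('\n');
  return [
    'Extract the following fields from the page content below.',
    'Respond with a single JSON object and use null for any field you cannot find.',
    '',
    'Fields:',
    fieldLines,
    '',
    'Page content:',
    pageText,
  ].join('\n');
}

// Parses the model's reply and keeps only the keys named in the schema.
// Fields absent from the reply become null. Returns null if the reply is
// not a JSON object.
function parseModelReply(schema, replyText) {
  let parsed;
  try {
    parsed = JSON.parse(replyText);
  } catch {
    return null;
  }
  if (parsed === null || typeof parsed !== 'object' || Array.isArray(parsed)) {
    return null;
  }
  const result = {};
  for (const field of Object.keys(schema)) {
    result[field] = field in parsed ? parsed[field] : null;
  }
  return result;
}
```

Dropping keys that are not in the schema keeps unexpected model output from leaking into downstream records.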

Structured Extraction with KnowledgeSDK

KnowledgeSDK's POST /v1/extract endpoint combines JavaScript rendering, Markdown extraction, and AI-powered field parsing into a single call:

POST /v1/extract
Authorization: Bearer knowledgesdk_live_...

{
  "url": "https://shop.example.com/products/widget-pro",
  "schema": {
    "product_name": "string",
    "price_usd": "number",
    "availability": "string",
    "description": "string"
  }
}

The API returns a structured JSON object matching your schema, ready to insert into a database or pass to downstream processing.
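
The request above could be issued from JavaScript roughly as follows. This is a sketch under stated assumptions: the base URL is a placeholder (the host is not given here), the Content-Type header is assumed because the body is JSON, and the function names are hypothetical. Only the path, the bearer-token header, and the body fields come from the example above.

```javascript
// Placeholder host: replace with the base URL from the KnowledgeSDK docs.
const BASE_URL = 'https://api.example.invalid';

// Builds the endpoint and fetch options for POST /v1/extract.
function buildExtractRequest(apiKey, pageUrl, schema) {
  return {
    endpoint: `${BASE_URL}/v1/extract`,
    init: {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${apiKey}`,
        'Content-Type': 'application/json', // assumed, since the body is JSON
      },
      body: JSON.stringify({ url: pageUrl, schema }),
    },
  };
}

// Sends the request with the standard fetch API (browsers, Node 18+).
async function extract(apiKey, pageUrl, schema) {
  const { endpoint, init } = buildExtractRequest(apiKey, pageUrl, schema);
  const response = await fetch(endpoint, init);
  if (!response.ok) {
    throw new Error(`Extract request failed with status ${response.status}`);
  }
  // The exact response envelope is not described here, so it is returned as-is.
  return response.json();
}
```

Keeping request construction separate from sending makes the payload easy to inspect or test without network access.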

Output Formats

Format	Best For
JSON	APIs, databases, LLM context
CSV	Spreadsheets, bulk analytics
Markdown	LLM ingestion, knowledge bases
Parquet	Large-scale data warehousing
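
Moving between these formats is usually mechanical once the records are flat. As an illustration, the sketch below converts an array of extracted JSON records to CSV text, following the quoting conventions of RFC 4180. The function names are hypothetical, and nested values would need flattening first.

```javascript
// Quotes a cell when it contains a comma, double quote, or line break,
// doubling any embedded double quotes. null and undefined become empty cells.
function csvCell(value) {
  if (value === null || value === undefined) return '';
  const text = String(value);
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Converts an array of flat records into CSV text with a header row.
// `columns` fixes the column order and selects which fields to include.
function toCsv(records, columns) {
  const header = columns.map(csvCell).join(',');
  const rows = records.map((record) =>
    columns.map((column) => csvCell(record[column])).join(',')
  );
  return [header, ...rows].join('\r\n');
}
```

Passing the column list explicitly keeps the output stable even when some records omit optional fields.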

Challenges in Structured Extraction

Schema drift — target sites change their layout, breaking field mappings
Missing fields — some pages omit optional fields entirely
Ambiguous values — prices in mixed currencies, dates in different formats
Nested structures — product variants, review threads, and comment trees require recursive extraction
Pagination — a full product catalog may span hundreds of pages
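
Ambiguous values are often handled with a small normalization step after extraction. The sketch below parses a price string into an amount and a currency code. It is deliberately narrow: the symbol table is illustrative, the function name is hypothetical, and it assumes "," is a thousands separator and "." the decimal point, so a value such as "1.299,99" would be misread.

```javascript
// Illustrative symbol table; extend it for the currencies you actually see.
const CURRENCY_SYMBOLS = { '$': 'USD', '€': 'EUR', '£': 'GBP' };

// Parses strings such as "$1,299.99" or "EUR 45" into { amount, currency }.
// Returns null when no number can be found. `currency` is null when neither
// a three-letter uppercase code nor a known symbol is present.
function parsePrice(raw) {
  const text = String(raw).trim();
  const numberMatch = text.match(/\d[\d,]*(?:\.\d+)?/);
  if (!numberMatch) return null;
  const amount = Number(numberMatch[0].replace(/,/g, ''));

  let currency = null;
  const codeMatch = text.match(/\b[A-Z]{3}\b/);
  if (codeMatch) {
    currency = codeMatch[0];
  } else {
    for (const [symbol, code] of Object.entries(CURRENCY_SYMBOLS)) {
      if (text.includes(symbol)) {
        currency = code;
        break;
      }
    }
  }
  return { amount, currency };
}
```

Storing the amount and the currency separately avoids comparing prices across currencies by accident.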

Combining AI extraction with schema validation and fallback defaults is a robust pattern for production pipelines: validation catches wrong or missing fields, and defaults keep downstream code from failing on gaps.
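
A minimal sketch of that pattern, assuming the simple field-to-type maps used earlier on this page ("string", "number", "boolean"). The function name and the shape of its return value are hypothetical choices for illustration.

```javascript
// Checks each field of `raw` against a field-to-type map. A missing value or
// a value of the wrong type is replaced by the default for that field, or by
// null when no default is given. Problems are collected rather than thrown.
function validateWithDefaults(schema, raw, defaults = {}) {
  const record = {};
  const issues = [];
  for (const [field, expectedType] of Object.entries(schema)) {
    const value = raw?.[field];
    const valid =
      typeof value === expectedType &&
      !(expectedType === 'number' && Number.isNaN(value));
    if (valid) {
      record[field] = value;
      continue;
    }
    record[field] = field in defaults ? defaults[field] : null;
    issues.push(
      value === undefined || value === null
        ? `${field}: missing`
        : `${field}: expected ${expectedType}, got ${typeof value}`
    );
  }
  return { record, issues };
}
```

Returning the issues alongside the record lets a pipeline decide per field whether to accept the default, retry the extraction, or flag the page for review.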