Web Scraping & Extraction (intermediate)

Also known as: data extraction, web data parsing

Structured Data Extraction

Pulling specific fields — prices, names, dates — from web pages into structured formats like JSON or CSV.

What Is Structured Data Extraction?

Structured data extraction is the process of identifying and pulling specific, named fields from unstructured or semi-structured web pages and outputting them in a machine-readable format — typically JSON or CSV. Rather than capturing an entire page as free text, you define the schema of what you want and the extractor fills it in.

Examples of structured extraction targets:

  • E-commerce product pages: { name, price, sku, rating, reviews_count }
  • Job listings: { title, company, location, salary_range, posted_date }
  • Real estate listings: { address, price, bedrooms, bathrooms, sqft }
  • News articles: { headline, author, published_at, body, tags }

Traditional Approaches

Historically, structured extraction relied on writing CSS selectors or XPath expressions that pinpoint specific DOM nodes:

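// CSS selector approach (runs in a browser context)
// Optional chaining yields undefined instead of throwing when a selector matches nothing.
const price = document.querySelector('.product-price .amount')?.textContent.trim();
const title = document.querySelector('h1.product-title')?.textContent.trim();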

This works well for consistent, well-structured pages but breaks when the class names or nesting that the selectors depend on change, which is common when sites redesign their frontends.
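One common way to soften that fragility is to try several selectors in order. The sketch below is a minimal illustration, not a complete scraper: the helper name firstText and the extra selectors in PRICE_SELECTORS are hypothetical, chosen only to show the pattern.

```javascript
// Try each selector in order and return the first non-empty text match.
// `root` is anything exposing querySelector (for example `document` in a browser).
function firstText(root, selectors) {
  for (const selector of selectors) {
    const node = root.querySelector(selector);
    const text = node?.textContent?.trim();
    if (text) return text;
  }
  return null; // No selector matched: report the miss instead of throwing.
}

// Hypothetical selector list: the first entry is the selector from the
// example above, the rest are guesses at alternative layouts.
const PRICE_SELECTORS = ['.product-price .amount', '[itemprop="price"]', '.price'];
```

In a browser this would be called as firstText(document, PRICE_SELECTORS). Fallbacks only delay breakage; they still depend on someone anticipating the alternative markup.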

AI-Powered Structured Extraction

Modern extraction pipelines use LLMs to understand page content semantically and populate a JSON schema without hardcoded selectors. You describe the fields you want in natural language or as a JSON Schema, and the model fills them in:

{
  "schema": {
    "name": "string",
    "price": "number",
    "currency": "string",
    "in_stock": "boolean"
  }
}

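This approach tends to be more resilient to HTML changes because the model works from the page's content rather than from fixed DOM paths. The trade-off is that model output can be incomplete or malformed, so it still needs to be checked against the schema.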
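As a rough sketch of how a schema like the one above might drive a model, the code below turns the field list into a prompt and then parses the reply. The function names and the prompt wording are assumptions made for illustration, and the model call itself is left out because it depends on whichever LLM client is in use.

```javascript
// Turn a flat { field: type } schema into an instruction for a model.
function buildExtractionPrompt(schema, pageText) {
  const fields = Object.entries(schema)
    .map(([name, type]) => `- ${name} (${type})`)
    .join('\n');
  return [
    'Extract the following fields from the page and reply with JSON only.',
    'Use null for any field that is not present.',
    fields,
    '--- PAGE ---',
    pageText,
  ].join('\n');
}

// Parse the model's reply and keep only the keys named in the schema.
function parseModelReply(schema, replyText) {
  const parsed = JSON.parse(replyText); // Throws on non-JSON replies.
  if (parsed === null || typeof parsed !== 'object' || Array.isArray(parsed)) {
    throw new TypeError('Expected a JSON object in the model reply');
  }
  const result = {};
  for (const key of Object.keys(schema)) {
    result[key] = key in parsed ? parsed[key] : null;
  }
  return result;
}
```

Dropping keys that are not in the schema keeps stray model output from leaking into downstream records.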

Structured Extraction with KnowledgeSDK

KnowledgeSDK's POST /v1/extract endpoint combines JavaScript rendering, Markdown extraction, and AI-powered field parsing into a single call:

POST /v1/extract
Authorization: Bearer knowledgesdk_live_...

{
  "url": "https://shop.example.com/products/widget-pro",
  "schema": {
    "product_name": "string",
    "price_usd": "number",
    "availability": "string",
    "description": "string"
  }
}

The API returns a structured JSON object matching your schema, ready to insert into a database or pass to downstream processing.
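From JavaScript, the request above could be issued roughly as follows. This is a sketch under stated assumptions: the source does not give the API host, so baseUrl is a placeholder the caller must supply, and the Content-Type header, the error handling, and the function names are additions for illustration rather than documented behavior.

```javascript
// Build fetch() arguments for the POST /v1/extract request shown above.
// `baseUrl` is a placeholder: the API host is not stated in this entry.
function buildExtractRequest(baseUrl, apiKey, pageUrl, schema) {
  return {
    url: new URL('/v1/extract', baseUrl).toString(),
    options: {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${apiKey}`,
        'Content-Type': 'application/json', // Assumed, since the body is JSON.
      },
      body: JSON.stringify({ url: pageUrl, schema }),
    },
  };
}

// Send the request with the global fetch (browsers, Node 18+).
async function extract(baseUrl, apiKey, pageUrl, schema) {
  const { url, options } = buildExtractRequest(baseUrl, apiKey, pageUrl, schema);
  const response = await fetch(url, options);
  if (!response.ok) throw new Error(`Extract failed: HTTP ${response.status}`);
  return response.json();
}
```

Separating request construction from the network call makes the request shape easy to inspect or unit-test without a live API key.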

Output Formats

| Format   | Best For                         |
|----------|----------------------------------|
| JSON     | APIs, databases, LLM context     |
| CSV      | Spreadsheets, bulk analytics     |
| Markdown | LLM ingestion, knowledge bases   |
| Parquet  | Large-scale data warehousing     |
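Moving between these formats is usually mechanical. As one illustration, the sketch below flattens an array of extracted JSON records into CSV with RFC 4180 style quoting; the function names are hypothetical, and nested values are simply stringified rather than expanded.

```javascript
// Quote a cell when it contains a comma, double quote, or line break,
// doubling any embedded quotes. Null and undefined become empty cells.
function csvCell(value) {
  if (value === null || value === undefined) return '';
  const text = String(value);
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Emit a header row followed by one row per record, in column order.
function toCsv(records, columns) {
  const header = columns.map(csvCell).join(',');
  const rows = records.map((record) =>
    columns.map((column) => csvCell(record[column])).join(',')
  );
  return [header, ...rows].join('\r\n');
}
```

Passing the column list explicitly keeps the output stable even when some records omit optional fields.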

Challenges in Structured Extraction

  • Schema drift — target sites change their layout, breaking field mappings
  • Missing fields — some pages omit optional fields entirely
  • Ambiguous values — prices in mixed currencies, dates in different formats
  • Nested structures — product variants, review threads, and comment trees require recursive extraction
  • Pagination — a full product catalog may span hundreds of pages

Combining AI extraction with schema validation and fallback defaults is a robust pattern for production pipelines: validation catches missing or malformed fields, and defaults keep downstream code from failing on them.
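A minimal sketch of that validation step, assuming the flat { field: type } schema style used earlier on this page. The function name, the defaults argument, and the issue message format are illustrative choices, not part of any particular library.

```javascript
// Check each field against its declared type and substitute a default
// (or null) when the value is missing or has the wrong type.
function validateWithDefaults(schema, extracted, defaults = {}) {
  const record = {};
  const issues = [];
  for (const [field, type] of Object.entries(schema)) {
    const value = extracted?.[field];
    // Numbers must be finite, which also rejects NaN and numeric strings.
    const ok = type === 'number' ? Number.isFinite(value) : typeof value === type;
    if (ok) {
      record[field] = value;
    } else {
      record[field] = field in defaults ? defaults[field] : null;
      const got = value === null ? 'null' : typeof value;
      issues.push(`${field}: expected ${type}, got ${got}`);
    }
  }
  return { record, issues };
}
```

Returning the issues alongside the record lets a pipeline store the row while still logging or alerting on fields that needed a fallback.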

Related Terms

Web Scraping & Extraction (beginner)
Web Scraping
The automated extraction of data from websites by programmatically fetching and parsing HTML content.

Web Scraping & Extraction (intermediate)
Intelligent Extraction
Using AI or LLMs to understand and extract meaningful content from web pages without manually writing CSS selectors or XPath rules.

Web Scraping & Extraction (intermediate)
DOM Parsing
Traversing and extracting content from a browser's Document Object Model tree using selectors like CSS or XPath.

LLMs (beginner)
JSON Schema
A vocabulary for describing and validating the structure of JSON data, widely used to define the expected output format for LLM function calls.