Tutorial · March 20, 2026 · 12 min read

Extract Structured Data from Any Website with a Single API Call

Learn how to extract structured JSON data from any website using KnowledgeSDK. No CSS selectors, no broken scrapers — just a schema and an API call.


Every developer who has maintained a web scraper knows the dread: you open your dashboard on a Monday morning, and half your scrapers have returned empty results. The target site updated its HTML structure over the weekend. The CSS selectors you painstakingly crafted three months ago no longer match anything.

Traditional web scraping is a game of cat and mouse. You write selectors, they break, you fix them, they break again. For teams building production AI pipelines, this maintenance overhead is a serious problem.

Structured data extraction through a schema-based API changes the equation entirely. Instead of telling the API where data is on a page (via CSS selectors), you tell it what data you want — and the API figures out where to find it.

This article walks through extracting structured JSON from three real-world scenarios — e-commerce product pages, job listings, and news article metadata — using KnowledgeSDK's /v1/extract endpoint. We also benchmark the approach against BeautifulSoup and compare it with Firecrawl's /extract and Diffbot's Structured API.


Why CSS Selectors Break (And Why Schema-Based Extraction Doesn't)

CSS selectors bind your scraper to the current HTML structure of a page. When a site migrates from Bootstrap 4 to Tailwind, renames a class from product-price to price-display, or switches from a <span> to a <div>, your selector silently returns nothing.

Schema-based extraction works differently. You describe the fields you want in plain JSON — or even plain English — and the API uses a combination of semantic understanding and structured parsing to locate and return those fields regardless of where they live in the HTML.

The result is a scraper that survives redesigns, A/B tests, and framework migrations without any code changes on your end.
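
To make the contrast concrete, here is an illustrative side-by-side sketch. The HTML snippet and class names are invented for demonstration: the selector-based extractor silently fails after a class rename, while the schema field never references the DOM at all.

# Python — selector vs. schema (illustrative; HTML and class names are made up)
from bs4 import BeautifulSoup

html = '<span class="price-display">$279.99</span>'  # markup after a redesign

# Selector-based: still looking for the old class, so it finds nothing.
soup = BeautifulSoup(html, "html.parser")
price_el = soup.select_one("span.product-price")
price = float(price_el.text.lstrip("$")) if price_el else None  # -> None, silently

# Schema-based: describes *what* you want, not *where* it lives,
# so the class rename is irrelevant.
price_field = {
    "price": {"type": "number", "description": "Current sale price in USD"}
}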


Setting Up KnowledgeSDK

Install the SDK for your language:

# Node.js
npm install @knowledgesdk/node

# Python
pip install knowledgesdk

Initialize the client with your API key (get one free at knowledgesdk.com):

// Node.js
import { KnowledgeSDK } from "@knowledgesdk/node";

const client = new KnowledgeSDK({
  apiKey: process.env.KNOWLEDGESDK_API_KEY,
});

# Python
import os

from knowledgesdk import KnowledgeSDK

client = KnowledgeSDK(api_key=os.environ["KNOWLEDGESDK_API_KEY"])

Tutorial 1: Extracting E-Commerce Product Data

Let's extract structured product information from an e-commerce product page. Traditionally, this requires inspecting the page, finding the right selectors for price, title, images, and reviews — and then maintaining those selectors indefinitely.

With KnowledgeSDK, you define a schema once:

// Node.js — Extract product data
const result = await client.extract({
  url: "https://example-shop.com/products/wireless-headphones",
  schema: {
    type: "object",
    properties: {
      name: { type: "string", description: "Product name or title" },
      price: { type: "number", description: "Current sale price in USD" },
      originalPrice: { type: "number", description: "Original price before discount" },
      currency: { type: "string", description: "Currency code, e.g. USD" },
      rating: { type: "number", description: "Average customer rating out of 5" },
      reviewCount: { type: "integer", description: "Total number of customer reviews" },
      availability: { type: "string", description: "In stock, out of stock, or pre-order" },
      images: {
        type: "array",
        items: { type: "string" },
        description: "All product image URLs",
      },
      description: { type: "string", description: "Full product description" },
      specifications: {
        type: "object",
        description: "Key product specs as key-value pairs",
      },
    },
    required: ["name", "price", "availability"],
  },
});

console.log(result.data);
// {
//   name: "Sony WH-1000XM5 Wireless Headphones",
//   price: 279.99,
//   originalPrice: 349.99,
//   currency: "USD",
//   rating: 4.7,
//   reviewCount: 12483,
//   availability: "In stock",
//   images: ["https://...", "https://..."],
//   description: "Industry-leading noise cancellation...",
//   specifications: { "Battery Life": "30 hours", "Weight": "250g" }
// }

# Python — Extract product data
result = client.extract(
    url="https://example-shop.com/products/wireless-headphones",
    schema={
        "type": "object",
        "properties": {
            "name": {"type": "string", "description": "Product name or title"},
            "price": {"type": "number", "description": "Current sale price in USD"},
            "original_price": {"type": "number", "description": "Original price before discount"},
            "currency": {"type": "string", "description": "Currency code, e.g. USD"},
            "rating": {"type": "number", "description": "Average customer rating out of 5"},
            "review_count": {"type": "integer", "description": "Total number of customer reviews"},
            "availability": {"type": "string", "description": "In stock, out of stock, or pre-order"},
            "images": {
                "type": "array",
                "items": {"type": "string"},
                "description": "All product image URLs",
            },
            "description": {"type": "string", "description": "Full product description"},
            "specifications": {
                "type": "object",
                "description": "Key product specs as key-value pairs",
            },
        },
        "required": ["name", "price", "availability"],
    },
)

print(result.data)

The key insight: you never touch a CSS selector. If the site redesigns tomorrow, the same schema still works.
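
Once the response comes back, downstream handling is ordinary dictionary work. Here is a minimal post-processing sketch, assuming result.data is the dict returned by the Python call above; per the schema, optional fields that were absent come back as None, so guard them before doing arithmetic.

# Python — post-process the extraction result (sketch)
product = result.data

discount_pct = None
if product.get("original_price") and product.get("price"):
    discount_pct = round(100 * (1 - product["price"] / product["original_price"]), 1)

print(f"{product['name']}: {product.get('currency', 'USD')} {product['price']}")
if discount_pct is not None:
    print(f"On sale: {discount_pct}% off {product['original_price']}")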


Tutorial 2: Extracting Job Listings

Job board data is notoriously inconsistent. LinkedIn, Indeed, Greenhouse, and Lever all structure job postings differently. A traditional scraper needs a separate parser for each.

With schema-based extraction, one schema handles all of them:

// Node.js — Extract job listing
async function extractJobListing(url: string) {
  const result = await client.extract({
    url,
    schema: {
      type: "object",
      properties: {
        title: { type: "string", description: "Job title" },
        company: { type: "string", description: "Company name" },
        location: { type: "string", description: "Job location, including remote options" },
        salary: {
          type: "object",
          properties: {
            min: { type: "number" },
            max: { type: "number" },
            currency: { type: "string" },
            period: { type: "string", description: "hourly, monthly, or annual" },
          },
        },
        employmentType: {
          type: "string",
          description: "Full-time, part-time, contract, or freelance",
        },
        experienceLevel: {
          type: "string",
          description: "Entry, mid, senior, or executive",
        },
        requiredSkills: {
          type: "array",
          items: { type: "string" },
          description: "List of required technical skills",
        },
        responsibilities: {
          type: "array",
          items: { type: "string" },
          description: "Key job responsibilities",
        },
        applicationDeadline: {
          type: "string",
          description: "Application deadline in ISO 8601 format",
        },
        postedDate: {
          type: "string",
          description: "Date the job was posted in ISO 8601 format",
        },
      },
      required: ["title", "company", "location"],
    },
  });

  return result.data;
}

// Works across different job boards — same schema, different URLs
const greenhouseJob = await extractJobListing(
  "https://boards.greenhouse.io/company/jobs/12345"
);
const leverJob = await extractJobListing(
  "https://jobs.lever.co/company/position-slug"
);

# Python — Extract job listings from multiple boards
def extract_job_listing(url: str) -> dict:
    result = client.extract(
        url=url,
        schema={
            "type": "object",
            "properties": {
                "title": {"type": "string", "description": "Job title"},
                "company": {"type": "string", "description": "Company name"},
                "location": {"type": "string", "description": "Job location"},
                "salary": {
                    "type": "object",
                    "properties": {
                        "min": {"type": "number"},
                        "max": {"type": "number"},
                        "currency": {"type": "string"},
                        "period": {"type": "string"},
                    },
                },
                "employment_type": {"type": "string"},
                "experience_level": {"type": "string"},
                "required_skills": {
                    "type": "array",
                    "items": {"type": "string"},
                },
                "responsibilities": {
                    "type": "array",
                    "items": {"type": "string"},
                },
                "posted_date": {"type": "string"},
            },
            "required": ["title", "company", "location"],
        },
    )
    return result.data


# One schema, multiple job boards
job_urls = [
    "https://boards.greenhouse.io/company/jobs/12345",
    "https://jobs.lever.co/company/position-slug",
    "https://www.linkedin.com/jobs/view/12345",
]

jobs = [extract_job_listing(url) for url in job_urls]
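
Because every board maps onto the same schema, cross-board analysis needs no per-site logic. Here is a small filtering sketch over the jobs list built above; the exact strings the extractor returns for fields like experience_level are an assumption, so the comparisons are written defensively.

# Python — filter aggregated listings (sketch; field values assumed)
senior_remote = [
    job for job in jobs
    if "senior" in (job.get("experience_level") or "").lower()
    and "remote" in (job.get("location") or "").lower()
]

for job in senior_remote:
    salary = job.get("salary") or {}
    print(f"{job['title']} at {job['company']}: "
          f"{salary.get('min')}-{salary.get('max')} {salary.get('currency', '')}")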

Tutorial 3: Extracting News Article Metadata

News article extraction is another excellent use case. Publishers run on dozens of different CMS platforms, each with its own HTML structure, so extracting consistent metadata for a news aggregator or media-monitoring tool would otherwise mean writing and maintaining a parser per publisher. One schema covers them all:

// Node.js — News article metadata extraction
const article = await client.extract({
  url: "https://techcrunch.com/2026/03/15/some-article",
  schema: {
    type: "object",
    properties: {
      headline: { type: "string", description: "Article headline" },
      subheadline: { type: "string", description: "Article subheadline or deck" },
      author: {
        type: "object",
        properties: {
          name: { type: "string" },
          bio: { type: "string" },
          twitter: { type: "string" },
        },
      },
      publishedAt: {
        type: "string",
        description: "Publication date and time in ISO 8601",
      },
      updatedAt: {
        type: "string",
        description: "Last updated date and time in ISO 8601",
      },
      category: { type: "string", description: "Article category or section" },
      tags: {
        type: "array",
        items: { type: "string" },
        description: "Article tags or topics",
      },
      summary: {
        type: "string",
        description: "2-3 sentence summary of the article",
      },
      wordCount: { type: "integer", description: "Approximate word count" },
      readingTime: {
        type: "integer",
        description: "Estimated reading time in minutes",
      },
    },
    required: ["headline", "publishedAt"],
  },
});

# Python — News article metadata extraction
article = client.extract(
    url="https://techcrunch.com/2026/03/15/some-article",
    schema={
        "type": "object",
        "properties": {
            "headline": {"type": "string"},
            "subheadline": {"type": "string"},
            "author": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "bio": {"type": "string"},
                    "twitter": {"type": "string"},
                },
            },
            "published_at": {"type": "string", "description": "ISO 8601 datetime"},
            "updated_at": {"type": "string", "description": "ISO 8601 datetime"},
            "category": {"type": "string"},
            "tags": {"type": "array", "items": {"type": "string"}},
            "summary": {"type": "string", "description": "2-3 sentence summary"},
            "word_count": {"type": "integer"},
            "reading_time": {"type": "integer"},
        },
        "required": ["headline", "published_at"],
    },
)

print(f"Headline: {article.data['headline']}")
print(f"Published: {article.data['published_at']}")
print(f"Tags: {', '.join(article.data.get('tags', []))}")

Benchmark: KnowledgeSDK vs BeautifulSoup + Manual Parsing

To quantify the difference, we extracted product data from the same 50 e-commerce pages using three approaches: BeautifulSoup with manually written selectors, BeautifulSoup with a simple LLM layer, and KnowledgeSDK's schema extraction endpoint. We then compared setup cost, code volume, latency, accuracy, and ongoing maintenance.

Metric                    | BeautifulSoup + Selectors | BeautifulSoup + LLM | KnowledgeSDK Extract
--------------------------|---------------------------|---------------------|-------------------------
Initial setup time        | 4–8 hours per site        | 2–4 hours           | 15 minutes (one schema)
Lines of code             | 200–400                   | 100–200             | 20–40
Works across sites        | No (site-specific)        | Partial             | Yes
Breaks on HTML change     | Yes (always)              | Sometimes           | Rarely
Avg latency per page      | 800 ms                    | 2.5 s               | 1.8 s
Accuracy on complex pages | 85–90%                    | 78–85%              | 92–96%
Maintenance required      | High (ongoing)            | Medium              | Low

The accuracy advantage of KnowledgeSDK comes from the combined approach: semantic field detection plus structured HTML parsing, with fallback heuristics for edge cases.


Comparison: KnowledgeSDK vs Firecrawl Extract vs Diffbot

Both Firecrawl and Diffbot offer structured extraction. Here is how they compare:

Feature                          | KnowledgeSDK      | Firecrawl /extract  | Diffbot Structured API
---------------------------------|-------------------|---------------------|--------------------------
Custom JSON schema               | Yes               | Yes (with LLM)      | Yes (via rules)
Plain-English field descriptions | Yes               | Yes                 | No
Semantic search over results     | Yes (built-in)    | No                  | No
Webhook change alerts            | Yes               | No                  | Limited
Pre-built extraction types       | No                | No                  | Yes (articles, products)
Pricing model                    | Usage-based       | Credit-based        | Per-page + subscription
Free tier                        | 1,000 requests/mo | 500 credits/mo      | Limited trial
Self-hosted option               | No                | Yes (open source)   | No
SDK languages                    | Node.js, Python   | Node.js, Python, Go | Node.js, Python, Java
Anti-bot handling                | Yes               | Yes                 | Yes

When to use Firecrawl: If you need open-source self-hosting, or if your use case is primarily PDF extraction and document parsing, Firecrawl's open-source version is a strong choice.

When to use Diffbot: If you need pre-built entity recognition for articles, products, or jobs without writing schemas, Diffbot's knowledge graph approach works well — at enterprise pricing.

When to use KnowledgeSDK: If you need structured extraction plus the ability to search across extracted data, monitor pages for changes via webhooks, and integrate everything into a single API that your AI agents can call — KnowledgeSDK was purpose-built for this.


Handling Async Extraction for Large Pages

For complex pages or high-volume extraction jobs, use the async endpoint to avoid timeout issues:

// Node.js — Async extraction with callback
const job = await client.extractAsync({
  url: "https://example.com/very-long-page",
  schema: { /* your schema */ },
  callbackUrl: "https://your-app.com/webhooks/extraction-complete",
});

console.log(`Job started: ${job.jobId}`);

// Poll for completion
const result = await client.waitForJob(job.jobId);
console.log(result.data);

# Python — Async extraction with polling
import time

job = client.extract_async(
    url="https://example.com/very-long-page",
    schema={...},  # your schema
    callback_url="https://your-app.com/webhooks/extraction-complete",
)

print(f"Job started: {job.job_id}")

# Poll for completion
while True:
    status = client.get_job(job.job_id)
    if status.status == "completed":
        print(status.result)
        break
    elif status.status == "failed":
        print(f"Job failed: {status.error}")
        break
    time.sleep(2)
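
If you pass callback_url, you also need an endpoint that receives the notification. Below is a minimal receiver sketch in Flask; the payload field names (job_id, result) are assumptions, so check the webhook documentation for the actual shape.

# Python — webhook receiver (sketch; payload shape assumed)
from flask import Flask, request

app = Flask(__name__)

@app.route("/webhooks/extraction-complete", methods=["POST"])
def extraction_complete():
    payload = request.get_json(force=True)
    job_id = payload.get("job_id")   # assumed field name
    data = payload.get("result")     # assumed field name
    print(f"Job {job_id} finished with: {data}")
    return "", 204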

Best Practices for Schema Design

After extracting data from thousands of pages, we have identified a few schema design patterns that consistently improve accuracy:

1. Use descriptive field descriptions. "description": "Current sale price in USD, not the original price" outperforms "description": "price" by a meaningful margin. The description is passed to the extraction model as a hint.

2. Mark optional fields explicitly. Only include required fields in the required array. Optional fields that are absent will return null rather than causing the extraction to fail.

3. Use specific types. "type": "number" for prices, "type": "integer" for counts, "type": "string" for dates in ISO format. Avoid "type": "any".

4. Nest related fields. Grouping author information under an author object, or salary information under a salary object, helps the extraction model understand relationships.

5. Add enum constraints for categorical fields. If you know a field should be one of a fixed set of values, include that in the schema:

employmentType: {
  type: "string",
  enum: ["full-time", "part-time", "contract", "freelance"],
  description: "Employment type",
}
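
Pulling the five practices together, here is a compact, illustrative schema for the job-listing case, written as a Python dict to match the Python examples above:

# Python — a schema applying all five practices (illustrative)
job_schema = {
    "type": "object",
    "properties": {
        # 1. Descriptive field descriptions act as extraction hints.
        "title": {"type": "string", "description": "Job title as posted, not a tagline"},
        "company": {"type": "string", "description": "Hiring company name"},
        # 3. Specific types: integer for counts, number for money.
        "applicant_count": {"type": "integer", "description": "Number of applicants, if shown"},
        # 4. Nest related fields so relationships are explicit.
        "salary": {
            "type": "object",
            "properties": {
                "min": {"type": "number"},
                "max": {"type": "number"},
                "currency": {"type": "string"},
            },
        },
        # 5. Enum constraints for categorical fields.
        "employment_type": {
            "type": "string",
            "enum": ["full-time", "part-time", "contract", "freelance"],
        },
    },
    # 2. Keep "required" minimal; absent optional fields come back as null.
    "required": ["title", "company"],
}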

Real-World Applications

Teams using schema-based extraction with KnowledgeSDK are building:

  • Price monitoring pipelines that track competitor product pricing across hundreds of e-commerce sites without maintaining site-specific scrapers
  • Job market intelligence tools that aggregate postings from Greenhouse, Lever, Workday, and dozens of other ATS platforms using a single schema
  • Content aggregators that extract structured article metadata from thousands of publishers for media monitoring and sentiment analysis
  • Lead enrichment systems that extract contact information, company details, and product offerings from prospect websites automatically

Getting Started

You can make your first structured extraction call in under five minutes:

  1. Sign up at knowledgesdk.com and grab your API key
  2. Install the SDK: npm install @knowledgesdk/node or pip install knowledgesdk
  3. Define a schema for the fields you need
  4. Call /v1/extract with your URL and schema
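
If you prefer to skip the SDK, the same call works over raw HTTP. The sketch below uses requests; the base URL, body fields, and bearer-token auth are assumptions about the wire format rather than documented specifics.

# Python — raw HTTP call to /v1/extract (sketch; wire format assumed)
import os
import requests

resp = requests.post(
    "https://api.knowledgesdk.com/v1/extract",  # hypothetical base URL
    headers={"Authorization": f"Bearer {os.environ['KNOWLEDGESDK_API_KEY']}"},
    json={
        "url": "https://example-shop.com/products/wireless-headphones",
        "schema": {
            "type": "object",
            "properties": {"name": {"type": "string"}},
            "required": ["name"],
        },
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())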

The free tier includes 1,000 extraction requests per month — enough to validate your use case before committing to a paid plan.

Stop maintaining brittle CSS selectors. Define what you want, and let the API figure out where it lives.

Start extracting structured data for free at knowledgesdk.com

