Natural Language Web Extraction: Describe What You Want, Get JSON Back
Traditional web scraping has a fundamental fragility problem. You write a CSS selector like .product-title h1 or an XPath like //div[@class="price-block"]//span, deploy it to production, and three weeks later the site's design team ships a rebrand. Your selector returns nothing. Your pipeline silently fails. Your agent starts hallucinating because it has no data.
LLM-powered extraction takes the opposite approach. Instead of describing the structure of the HTML, you describe the data you want — in plain English or as a JSON schema — and let the model figure out where it lives on the page. When the HTML changes, the model adapts. When the data moves to a different element, the model finds it anyway.
This tutorial walks through the full mechanics of natural language extraction, benchmarks it against CSS-based approaches on five real page types, and gives you a ready-to-use library of extraction schemas for common use cases.
Why CSS Selectors Break (and Why LLMs Don't)
CSS selectors are essentially hardcoded assumptions about HTML structure. They break when:
- A site rebrands and changes class names
- A CMS migration changes the DOM hierarchy
- A/B tests roll out alternative layouts
- The site switches from server-side rendering to a SPA
- Pagination patterns change
LLM extraction makes none of these assumptions. It reads the visible text and semantic structure of the page — the same way a human would — and extracts the requested fields from whatever structure is present.
The tradeoff is cost and latency. Running an LLM over a full page's HTML adds roughly 100–400 ms and a few tenths of a cent per request. For most production workloads, that cost is easily justified by the maintenance overhead it eliminates.
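To make that tradeoff concrete, here is an illustrative back-of-envelope comparison. Every number below is an assumption (request volume, repair hours, hourly rate), not a measurement; plug in your own figures.

```python
# Back-of-envelope cost comparison -- all numbers are illustrative
# assumptions, not measurements.
monthly_extractions = 50_000
price_per_1k = 5.0                 # dollars per 1k LLM extractions (assumed)
selector_fix_hours_per_month = 3   # assumed time spent repairing broken selectors
engineer_hourly_rate = 150         # assumed fully loaded cost per hour

llm_cost = monthly_extractions / 1000 * price_per_1k
maintenance_cost = selector_fix_hours_per_month * engineer_hourly_rate

print(f"LLM extraction:       ${llm_cost:,.0f}/month")
print(f"Selector maintenance: ${maintenance_cost:,.0f}/month")
```

At these assumed numbers the API costs $250/month against $450/month of selector repair; the break-even point shifts with volume, so the arithmetic is worth redoing for your own pipeline.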
The KnowledgeSDK Extract Endpoint
KnowledgeSDK's POST /v1/extract endpoint accepts a URL and either a natural language description or a JSON schema. It returns a structured JSON object with the fields you requested.
Basic Usage: Natural Language Description
Node.js
import KnowledgeSDK from '@knowledgesdk/node';
const client = new KnowledgeSDK({ apiKey: process.env.KNOWLEDGESDK_API_KEY });
const result = await client.extract({
url: 'https://www.amazon.com/dp/B0CHWZJN2X',
description: 'Extract product name, current price, whether the item is in stock, star rating, and the first three customer reviews with their rating and review text',
});
console.log(result.data);
// {
// product_name: "Apple AirPods Pro (2nd Generation)",
// current_price: "$189.00",
// in_stock: true,
// star_rating: 4.4,
// reviews: [
// { rating: 5, text: "Best earbuds I've ever owned..." },
// { rating: 4, text: "Great noise cancellation but..." },
// { rating: 5, text: "Incredible value for the money..." }
// ]
// }
Python
from knowledgesdk import KnowledgeSDK
client = KnowledgeSDK(api_key="knowledgesdk_live_your_key")
result = client.extract(
url="https://www.amazon.com/dp/B0CHWZJN2X",
description=(
"Extract product name, current price, whether the item is in stock, "
"star rating, and the first three customer reviews with their rating and review text"
),
)
print(result.data)
Using a JSON Schema for Reliable Types
For production pipelines, passing a JSON schema gives you guaranteed field names and types in the response:
Node.js
const result = await client.extract({
url: 'https://www.amazon.com/dp/B0CHWZJN2X',
schema: {
type: 'object',
properties: {
product_name: { type: 'string' },
price_usd: { type: 'number' },
in_stock: { type: 'boolean' },
star_rating: { type: 'number', minimum: 1, maximum: 5 },
review_count: { type: 'integer' },
reviews: {
type: 'array',
maxItems: 3,
items: {
type: 'object',
properties: {
rating: { type: 'integer' },
text: { type: 'string' },
verified_purchase: { type: 'boolean' },
},
required: ['rating', 'text'],
},
},
},
required: ['product_name', 'price_usd', 'in_stock'],
},
});
Python
schema = {
"type": "object",
"properties": {
"product_name": {"type": "string"},
"price_usd": {"type": "number"},
"in_stock": {"type": "boolean"},
"star_rating": {"type": "number"},
"review_count": {"type": "integer"},
"reviews": {
"type": "array",
"maxItems": 3,
"items": {
"type": "object",
"properties": {
"rating": {"type": "integer"},
"text": {"type": "string"},
},
},
},
},
"required": ["product_name", "price_usd", "in_stock"],
}
result = client.extract(url="https://www.amazon.com/dp/B0CHWZJN2X", schema=schema)
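Even with a schema in the request, a cheap client-side sanity check before writing to your database is good hygiene. The helper below is a hand-rolled sketch covering only the JSON Schema subset used above (top-level `type`, `properties`, `required`); for real validation, use a full library such as `jsonschema`.

```python
# Minimal sanity check for extract() results. Covers only "type",
# "properties", and "required" -- a sketch, not a full validator.
TYPE_MAP = {
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
    "array": list,
    "object": dict,
}

def check_result(data: dict, schema: dict) -> list:
    """Return a list of problems; an empty list means the result looks sane."""
    problems = []
    for field in schema.get("required", []):
        if data.get(field) is None:
            problems.append(f"missing required field: {field}")
    for field, spec in schema.get("properties", {}).items():
        value = data.get(field)
        if value is None:
            continue  # optional fields may be absent or null
        expected = TYPE_MAP.get(spec.get("type"))
        if isinstance(value, bool) and spec.get("type") in ("number", "integer"):
            problems.append(f"{field}: expected {spec['type']}, got bool")
        elif expected and not isinstance(value, expected):
            problems.append(f"{field}: expected {spec['type']}, got {type(value).__name__}")
    return problems
```

Run it on `result.data` after each extraction and route any non-empty problem list to logging or a retry queue.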
How KnowledgeSDK Compares to Other Extraction APIs
| Feature | KnowledgeSDK | Firecrawl /extract | Diffbot | BeautifulSoup (manual) |
|---|---|---|---|---|
| Natural language description | Yes | Yes | No (rule-based) | No |
| JSON schema input | Yes | Yes | Partial (field hints) | No |
| Semantic search over results | Yes (built-in) | No | No | No |
| JavaScript rendering | Yes | Yes | Yes | No |
| Handles CSS changes automatically | Yes | Yes | Partially | No |
| Webhook change detection | Yes | No | No | No |
| Pricing (per 1k extractions) | ~$5 | ~$15 | ~$50 | Free (but you maintain it) |
Firecrawl's extract endpoint is comparable for one-off use. The difference emerges at scale: KnowledgeSDK's built-in semantic search means you can extract 10,000 product pages and then query them with natural language — no separate vector database required. Diffbot is a specialized tool for news and e-commerce that has been around for a decade; it's reliable for supported page types but can't handle arbitrary schemas. BeautifulSoup remains free but every selector is a liability.
Accuracy Benchmark: 5 Real-World Page Types
We ran the same extraction task across KnowledgeSDK, Firecrawl's extract endpoint, and a hand-written BeautifulSoup scraper on five representative page types. We scored each on field accuracy (did we get the right value?), schema compliance (did the response match the requested types, with missing fields returned as typed nulls?), and resilience after a simulated HTML change (class names randomized).
Test 1: E-Commerce Product Page (Amazon)
Fields: product name, price, ASIN, star rating, review count, in-stock status, first 3 reviews.
| Tool | Field Accuracy | Schema Compliance | Post-HTML-Change |
|---|---|---|---|
| KnowledgeSDK | 97% | 100% | 94% |
| Firecrawl extract | 93% | 97% | 89% |
| BeautifulSoup (manual) | 100%* | 100%* | 0% (selectors broke) |
*Before HTML change.
Test 2: News Article (The Guardian)
Fields: headline, author, publish date, article body, tags, word count.
| Tool | Field Accuracy | Schema Compliance | Post-HTML-Change |
|---|---|---|---|
| KnowledgeSDK | 99% | 100% | 98% |
| Firecrawl extract | 98% | 98% | 95% |
| BeautifulSoup (manual) | 99%* | 100%* | 12% |
Test 3: Job Listing (Greenhouse ATS)
Fields: job title, company, location, salary range, requirements list, application URL.
| Tool | Field Accuracy | Schema Compliance | Post-HTML-Change |
|---|---|---|---|
| KnowledgeSDK | 95% | 100% | 93% |
| Firecrawl extract | 90% | 95% | 88% |
| BeautifulSoup (manual) | 97%* | 100%* | 8% |
Test 4: GitHub Repository Page
Fields: repo name, owner, description, star count, fork count, primary language, license, last commit date, top 5 topics.
| Tool | Field Accuracy | Schema Compliance | Post-HTML-Change |
|---|---|---|---|
| KnowledgeSDK | 98% | 100% | 96% |
| Firecrawl extract | 95% | 98% | 91% |
| BeautifulSoup (manual) | 99%* | 100%* | 31%* |
*GitHub changes its DOM more frequently than most sites; even the pre-change scores were depressed by dynamic rendering.
Test 5: SaaS Pricing Page
Fields: plan names, prices, billing cycle options, feature lists per plan, CTA button text.
| Tool | Field Accuracy | Schema Compliance | Post-HTML-Change |
|---|---|---|---|
| KnowledgeSDK | 93% | 100% | 91% |
| Firecrawl extract | 88% | 94% | 84% |
| BeautifulSoup (manual) | 94%* | 99%* | 0% |
The consistent finding: LLM-based extraction degrades gracefully when HTML changes. Manual selectors fail completely. For any pipeline that runs continuously against sites you don't control, the maintenance cost of manual selectors quickly exceeds the marginal cost of the API.
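The failure mode behind these numbers can be sketched in a few lines. The HTML snippets and both "extractors" below are toys invented for illustration: one anchors to a class name, like a CSS selector, while the other anchors to what the data itself looks like, crudely approximating content-based extraction.

```python
import re

# A page before and after a rebrand that randomizes class names.
before = '<div class="price-block"><span class="amount">$189.00</span></div>'
after = '<div class="x7k2-a"><span class="x7k2-b">$189.00</span></div>'

def by_class(html):
    """Mimics a CSS selector: anchored to a specific class name."""
    m = re.search(r'class="amount">([^<]+)<', html)
    return m.group(1) if m else None

def by_content(html):
    """Mimics content-based extraction: anchored to what a price looks like."""
    m = re.search(r"\$\d[\d,]*\.\d{2}", html)
    return m.group(0) if m else None

print(by_class(before), by_class(after))      # the class-anchored extractor breaks
print(by_content(before), by_content(after))  # the content-anchored one survives
```

An LLM is doing something far richer than a price regex, but the structural point is the same: it keys on meaning rather than markup, so markup churn degrades it gracefully instead of zeroing it out.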
10 Ready-to-Use Extraction Schemas
Copy and paste these into your application. All are valid JSON Schema (Draft 7).
1. E-Commerce Product
{
"type": "object",
"properties": {
"product_name": { "type": "string" },
"brand": { "type": "string" },
"price_usd": { "type": "number" },
"original_price_usd": { "type": "number" },
"discount_percent": { "type": "number" },
"in_stock": { "type": "boolean" },
"sku": { "type": "string" },
"star_rating": { "type": "number" },
"review_count": { "type": "integer" },
"images": { "type": "array", "items": { "type": "string", "format": "uri" } },
"key_features": { "type": "array", "items": { "type": "string" } }
},
"required": ["product_name", "price_usd", "in_stock"]
}
2. News / Blog Article
{
"type": "object",
"properties": {
"headline": { "type": "string" },
"author": { "type": "string" },
"published_at": { "type": "string", "format": "date-time" },
"updated_at": { "type": "string", "format": "date-time" },
"summary": { "type": "string", "maxLength": 500 },
"body": { "type": "string" },
"tags": { "type": "array", "items": { "type": "string" } },
"canonical_url": { "type": "string", "format": "uri" }
},
"required": ["headline", "body"]
}
3. Job Listing
{
"type": "object",
"properties": {
"job_title": { "type": "string" },
"company": { "type": "string" },
"location": { "type": "string" },
"remote_policy": { "type": "string", "enum": ["remote", "hybrid", "onsite", "unknown"] },
"salary_min": { "type": "number" },
"salary_max": { "type": "number" },
"salary_currency": { "type": "string" },
"employment_type": { "type": "string" },
"requirements": { "type": "array", "items": { "type": "string" } },
"nice_to_have": { "type": "array", "items": { "type": "string" } },
"apply_url": { "type": "string", "format": "uri" },
"posted_at": { "type": "string", "format": "date" }
},
"required": ["job_title", "company"]
}
4. SaaS Pricing Page
{
"type": "object",
"properties": {
"company": { "type": "string" },
"plans": {
"type": "array",
"items": {
"type": "object",
"properties": {
"name": { "type": "string" },
"monthly_price": { "type": "number" },
"annual_price": { "type": "number" },
"currency": { "type": "string" },
"features": { "type": "array", "items": { "type": "string" } },
"cta_text": { "type": "string" },
"is_popular": { "type": "boolean" }
},
"required": ["name"]
}
},
"has_free_tier": { "type": "boolean" },
"has_enterprise": { "type": "boolean" }
}
}
5. Company / About Page
{
"type": "object",
"properties": {
"company_name": { "type": "string" },
"founded_year": { "type": "integer" },
"headquarters": { "type": "string" },
"employee_count_range": { "type": "string" },
"description": { "type": "string" },
"mission": { "type": "string" },
"products": { "type": "array", "items": { "type": "string" } },
"industries": { "type": "array", "items": { "type": "string" } },
"social_links": {
"type": "object",
"properties": {
"twitter": { "type": "string" },
"linkedin": { "type": "string" },
"github": { "type": "string" }
}
}
}
}
6. Real Estate Listing
{
"type": "object",
"properties": {
"address": { "type": "string" },
"price": { "type": "number" },
"bedrooms": { "type": "integer" },
"bathrooms": { "type": "number" },
"square_feet": { "type": "integer" },
"lot_size": { "type": "string" },
"property_type": { "type": "string" },
"year_built": { "type": "integer" },
"listing_status": { "type": "string" },
"days_on_market": { "type": "integer" },
"description": { "type": "string" },
"features": { "type": "array", "items": { "type": "string" } }
},
"required": ["address", "price"]
}
7. Research Paper / Academic
{
"type": "object",
"properties": {
"title": { "type": "string" },
"authors": { "type": "array", "items": { "type": "string" } },
"abstract": { "type": "string" },
"published_date": { "type": "string" },
"doi": { "type": "string" },
"journal": { "type": "string" },
"keywords": { "type": "array", "items": { "type": "string" } },
"citations_count": { "type": "integer" },
"pdf_url": { "type": "string", "format": "uri" }
},
"required": ["title", "abstract"]
}
8. Event Listing
{
"type": "object",
"properties": {
"event_name": { "type": "string" },
"organizer": { "type": "string" },
"start_datetime": { "type": "string", "format": "date-time" },
"end_datetime": { "type": "string", "format": "date-time" },
"venue": { "type": "string" },
"city": { "type": "string" },
"is_virtual": { "type": "boolean" },
"ticket_price_min": { "type": "number" },
"ticket_price_max": { "type": "number" },
"registration_url": { "type": "string", "format": "uri" },
"description": { "type": "string" },
"speakers": { "type": "array", "items": { "type": "string" } }
},
"required": ["event_name", "start_datetime"]
}
9. Recipe
{
"type": "object",
"properties": {
"recipe_name": { "type": "string" },
"author": { "type": "string" },
"prep_time_minutes": { "type": "integer" },
"cook_time_minutes": { "type": "integer" },
"servings": { "type": "integer" },
"difficulty": { "type": "string", "enum": ["easy", "medium", "hard"] },
"cuisine": { "type": "string" },
"calories_per_serving": { "type": "integer" },
"ingredients": { "type": "array", "items": { "type": "string" } },
"instructions": { "type": "array", "items": { "type": "string" } },
"tags": { "type": "array", "items": { "type": "string" } }
},
"required": ["recipe_name", "ingredients", "instructions"]
}
10. GitHub Repository
{
"type": "object",
"properties": {
"repo_name": { "type": "string" },
"owner": { "type": "string" },
"description": { "type": "string" },
"stars": { "type": "integer" },
"forks": { "type": "integer" },
"watchers": { "type": "integer" },
"primary_language": { "type": "string" },
"license": { "type": "string" },
"topics": { "type": "array", "items": { "type": "string" } },
"last_commit_date": { "type": "string", "format": "date" },
"open_issues": { "type": "integer" },
"homepage_url": { "type": "string", "format": "uri" }
},
"required": ["repo_name", "owner"]
}
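One convenient way to use this library in application code is a small registry keyed by page type, so callers pick a schema by name instead of pasting JSON inline. The registry below holds abbreviated stand-ins; swap in the full schemas from above.

```python
# Schema registry keyed by page type. Entries here are abbreviated
# stand-ins -- paste the full schemas from the library above.
SCHEMAS = {
    "product": {
        "type": "object",
        "properties": {
            "product_name": {"type": "string"},
            "price_usd": {"type": "number"},
        },
        "required": ["product_name", "price_usd"],
    },
    "article": {
        "type": "object",
        "properties": {
            "headline": {"type": "string"},
            "body": {"type": "string"},
        },
        "required": ["headline", "body"],
    },
}

def schema_for(page_type: str) -> dict:
    """Look up a schema, failing loudly with the list of known types."""
    try:
        return SCHEMAS[page_type]
    except KeyError:
        raise ValueError(
            f"no schema registered for {page_type!r}; known types: {sorted(SCHEMAS)}"
        )
```

A call site then reads `client.extract(url=url, schema=schema_for("product"))`, and adding support for a new page type means adding one registry entry.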
Building a Resilient Extraction Pipeline
Here is a complete pipeline that extracts e-commerce product data, stores it, and automatically re-extracts when the page changes — using KnowledgeSDK's webhook change detection:
Node.js
import KnowledgeSDK from '@knowledgesdk/node';
import { createClient } from '@supabase/supabase-js';
const client = new KnowledgeSDK({ apiKey: process.env.KNOWLEDGESDK_API_KEY });
const db = createClient(process.env.SUPABASE_URL, process.env.SUPABASE_KEY);
const productSchema = {
type: 'object',
properties: {
product_name: { type: 'string' },
price_usd: { type: 'number' },
in_stock: { type: 'boolean' },
star_rating: { type: 'number' },
},
required: ['product_name', 'price_usd', 'in_stock'],
};
// Initial extraction
async function extractAndStore(url) {
const result = await client.extract({ url, schema: productSchema });
await db.from('products').upsert({
url,
data: result.data,
extracted_at: new Date().toISOString(),
});
// Register webhook for change detection
await client.webhooks.create({
url: `${process.env.APP_URL}/webhooks/product-changed`,
events: ['page.changed'],
metadata: { watchUrl: url },
});
return result.data;
}
// Webhook handler — re-extract on change
export async function handleProductChanged(payload) {
const { watchUrl } = payload.metadata;
await extractAndStore(watchUrl);
console.log(`Re-extracted product data for ${watchUrl}`);
}
Python
from knowledgesdk import KnowledgeSDK
from supabase import create_client
from datetime import datetime, timezone
import os

client = KnowledgeSDK(api_key=os.environ["KNOWLEDGESDK_API_KEY"])
db = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_KEY"])

product_schema = {
    "type": "object",
    "properties": {
        "product_name": {"type": "string"},
        "price_usd": {"type": "number"},
        "in_stock": {"type": "boolean"},
        "star_rating": {"type": "number"},
    },
    "required": ["product_name", "price_usd", "in_stock"],
}

def extract_and_store(url: str) -> dict:
    result = client.extract(url=url, schema=product_schema)
    # Store the extracted record (mirrors the Node upsert above)
    db.table("products").upsert({
        "url": url,
        "data": result.data,
        "extracted_at": datetime.now(timezone.utc).isoformat(),
    }).execute()
    # Register webhook for change detection
    client.webhooks.create(
        url=f"{os.environ['APP_URL']}/webhooks/product-changed",
        events=["page.changed"],
        metadata={"watch_url": url},
    )
    return result.data
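For symmetry with the Node webhook handler above, here is a framework-agnostic Python sketch. The payload shape (a `metadata` object carrying `watch_url`) matches the metadata registered above; wiring it into a Flask or FastAPI route, and passing in your real re-extraction function, is left to your app.

```python
from typing import Callable

def handle_product_changed(payload: dict, re_extract: Callable[[str], dict]) -> str:
    """Webhook handler sketch: pull the watched URL out of the payload
    and re-run extraction for it. `re_extract` would typically be the
    pipeline's extract-and-store function."""
    watch_url = payload["metadata"]["watch_url"]
    re_extract(watch_url)
    print(f"Re-extracted product data for {watch_url}")
    return watch_url
```

Keeping the handler a plain function of `(payload, re_extract)` makes it trivial to unit-test with a stub before binding it to an HTTP route.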
Conclusion
Natural language extraction is not just a convenience — it is a fundamentally more maintainable architecture for any pipeline that reads from websites you do not control. CSS selectors are point-in-time assumptions about HTML that will break. LLM-powered extraction is a robust, schema-driven contract that survives HTML changes.
KnowledgeSDK's extract endpoint gives you the simplest path from URL to typed JSON: pass a URL, describe your schema, get structured data back in under two seconds. The built-in semantic search means those extracted records are immediately queryable without a separate vector pipeline. And webhook change detection means your data stays fresh automatically.
Ready to replace your CSS selectors? Sign up for free at knowledgesdk.com and get 1,000 extractions per month included.