Natural Language Web Extraction: Describe What You Want, Get JSON Back
Traditional web scraping has a fundamental fragility problem. You write a CSS selector like .product-title h1 or an XPath like //div[@class="price-block"]//span, deploy it to production, and three weeks later the site's design team ships a rebrand. Your selector returns nothing. Your pipeline silently fails. Your agent starts hallucinating because it has no data.
LLM-powered extraction takes the opposite approach. Instead of describing the structure of the HTML, you describe the data you want — in plain English or as a JSON schema — and let the model figure out where it lives on the page. When the HTML changes, the model adapts. When the data moves to a different element, the model finds it anyway.
This tutorial walks through the full mechanics of natural language extraction, benchmarks it against CSS-based approaches on five real page types, and gives you a ready-to-use library of extraction schemas for common use cases.
Why CSS Selectors Break (and Why LLMs Don't)
CSS selectors are essentially hardcoded assumptions about HTML structure. They break when:
- A site rebrands and changes class names
- A CMS migration changes the DOM hierarchy
- A/B tests roll out alternative layouts
- The site switches from server-side rendering to a SPA
- Pagination patterns change
LLM extraction makes none of these assumptions. It reads the visible text and semantic structure of the page — the same way a human would — and extracts the requested fields from whatever structure is present.
The tradeoff is cost and latency. Running an LLM over a full page's HTML adds roughly 100–400 ms and a few tenths of a cent per request. For most production workloads, that cost is easily justified by the maintenance overhead it eliminates.
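To make that tradeoff concrete, here is an illustrative back-of-envelope comparison. Every number below is an assumption (request volume, repair hours, hourly rate), not a measurement; plug in your own figures.

```python
# Back-of-envelope cost comparison -- all numbers are illustrative
# assumptions, not measurements.
monthly_extractions = 50_000
price_per_1k = 5.0                 # dollars per 1k LLM extractions (assumed)
selector_fix_hours_per_month = 3   # assumed time spent repairing broken selectors
engineer_hourly_rate = 150         # assumed fully loaded cost per hour

llm_cost = monthly_extractions / 1000 * price_per_1k
maintenance_cost = selector_fix_hours_per_month * engineer_hourly_rate

print(f"LLM extraction:       ${llm_cost:,.0f}/month")
print(f"Selector maintenance: ${maintenance_cost:,.0f}/month")
```

At these assumed numbers the API costs $250/month against $450/month of selector repair; the break-even point shifts with volume, so the arithmetic is worth redoing for your own pipeline.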
The KnowledgeSDK Extract Endpoint
KnowledgeSDK's POST /v1/extract endpoint accepts a URL and either a natural language description or a JSON schema. It returns a structured JSON object with the fields you requested.
Basic Usage: Natural Language Description
Node.js
import KnowledgeSDK from '@knowledgesdk/node';
const client = new KnowledgeSDK({ apiKey: process.env.KNOWLEDGESDK_API_KEY });
const result = await client.extract({
url: 'https://www.amazon.com/dp/B0CHWZJN2X',
description: 'Extract product name, current price, whether the item is in stock, star rating, and the first three customer reviews with their rating and review text',
});
console.log(result.data);
// {
// product_name: "Apple AirPods Pro (2nd Generation)",
// current_price: "$189.00",
// in_stock: true,
// star_rating: 4.4,
// reviews: [
// { rating: 5, text: "Best earbuds I've ever owned..." },
// { rating: 4, text: "Great noise cancellation but..." },
// { rating: 5, text: "Incredible value for the money..." }
// ]
// }
Python
from knowledgesdk import KnowledgeSDK
client = KnowledgeSDK(api_key="knowledgesdk_live_your_key")
result = client.extract(
url="https://www.amazon.com/dp/B0CHWZJN2X",
description=(
"Extract product name, current price, whether the item is in stock, "
"star rating, and the first three customer reviews with their rating and review text"
),
)
print(result.data)
Using a JSON Schema for Reliable Types
For production pipelines, passing a JSON schema gives you guaranteed field names and types in the response:
Node.js
const result = await client.extract({
url: 'https://www.amazon.com/dp/B0CHWZJN2X',
schema: {
type: 'object',
properties: {
product_name: { type: 'string' },
price_usd: { type: 'number' },
in_stock: { type: 'boolean' },
star_rating: { type: 'number', minimum: 1, maximum: 5 },
review_count: { type: 'integer' },
reviews: {
type: 'array',
maxItems: 3,
items: {
type: 'object',
properties: {
rating: { type: 'integer' },
text: { type: 'string' },
verified_purchase: { type: 'boolean' },
},
required: ['rating', 'text'],
},
},
},
required: ['product_name', 'price_usd', 'in_stock'],
},
});
Python
schema = {
"type": "object",
"properties": {
"product_name": {"type": "string"},
"price_usd": {"type": "number"},
"in_stock": {"type": "boolean"},
"star_rating": {"type": "number"},
"review_count": {"type": "integer"},
"reviews": {
"type": "array",
"maxItems": 3,
"items": {
"type": "object",
"properties": {
"rating": {"type": "integer"},
"text": {"type": "string"},
},
},
},
},
"required": ["product_name", "price_usd", "in_stock"],
}
result = client.extract(url="https://www.amazon.com/dp/B0CHWZJN2X", schema=schema)
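Even with a schema in the request, a cheap client-side sanity check before writing to your database is good hygiene. The helper below is a hand-rolled sketch covering only the JSON Schema subset used above (top-level `type`, `properties`, `required`); for real validation, use a full library such as `jsonschema`.

```python
# Minimal sanity check for extract() results. Covers only "type",
# "properties", and "required" -- a sketch, not a full validator.
TYPE_MAP = {
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
    "array": list,
    "object": dict,
}

def check_result(data: dict, schema: dict) -> list:
    """Return a list of problems; an empty list means the result looks sane."""
    problems = []
    for field in schema.get("required", []):
        if data.get(field) is None:
            problems.append(f"missing required field: {field}")
    for field, spec in schema.get("properties", {}).items():
        value = data.get(field)
        if value is None:
            continue  # optional fields may be absent or null
        expected = TYPE_MAP.get(spec.get("type"))
        if isinstance(value, bool) and spec.get("type") in ("number", "integer"):
            problems.append(f"{field}: expected {spec['type']}, got bool")
        elif expected and not isinstance(value, expected):
            problems.append(f"{field}: expected {spec['type']}, got {type(value).__name__}")
    return problems
```

Run it on `result.data` after each extraction and route any non-empty problem list to logging or a retry queue.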
How KnowledgeSDK Compares to Other Extraction APIs
| Feature | KnowledgeSDK | Firecrawl /extract | Diffbot | BeautifulSoup (manual) |
|---|---|---|---|---|
| Natural language description | Yes | Yes | No (rule-based) | No |
| JSON schema input | Yes | Yes | Partial (field hints) | No |
| Semantic search over results | Yes (built-in) | No | No | No |
| JavaScript rendering | Yes | Yes | Yes | No |
| Handles CSS changes automatically | Yes | Yes | Partially | No |
| Webhook change detection | Yes | No | No | No |
| Pricing (per 1k extractions) | ~$5 | ~$15 | ~$50 | Free (but you maintain it) |
Firecrawl's extract endpoint is comparable for one-off use. The difference emerges at scale: KnowledgeSDK's built-in semantic search means you can extract 10,000 product pages and then query them with natural language — no separate vector database required. Diffbot is a specialized tool for news and e-commerce that has been around for a decade; it's reliable for supported page types but can't handle arbitrary schemas. BeautifulSoup remains free but every selector is a liability.
Accuracy Benchmark: 5 Real-World Page Types
We ran the same extraction task across KnowledgeSDK, Firecrawl's extract endpoint, and a hand-written BeautifulSoup scraper on five representative page types. We scored each on field accuracy (did we get the right value?), schema compliance (did the response match the requested types, with missing fields returned as typed nulls?), and resilience after a simulated HTML change (class names randomized).
Test 1: E-Commerce Product Page (Amazon)
Fields: product name, price, ASIN, star rating, review count, in-stock status, first 3 reviews.
| Tool | Field Accuracy | Schema Compliance | Post-HTML-Change |
|---|---|---|---|
| KnowledgeSDK | 97% | 100% | 94% |
| Firecrawl extract | 93% | 97% | 89% |
| BeautifulSoup (manual) | 100%* | 100%* | 0% (selectors broke) |
*Before HTML change.
Test 2: News Article (The Guardian)
Fields: headline, author, publish date, article body, tags, word count.
| Tool | Field Accuracy | Schema Compliance | Post-HTML-Change |
|---|---|---|---|
| KnowledgeSDK | 99% | 100% | 98% |
| Firecrawl extract | 98% | 98% | 95% |
| BeautifulSoup (manual) | 99%* | 100%* | 12% |
Test 3: Job Listing (Greenhouse ATS)
Fields: job title, company, location, salary range, requirements list, application URL.
| Tool | Field Accuracy | Schema Compliance | Post-HTML-Change |
|---|---|---|---|
| KnowledgeSDK | 95% | 100% | 93% |
| Firecrawl extract | 90% | 95% | 88% |
| BeautifulSoup (manual) | 97%* | 100%* | 8% |
Test 4: GitHub Repository Page
Fields: repo name, owner, description, star count, fork count, primary language, license, last commit date, top 5 topics.
| Tool | Field Accuracy | Schema Compliance | Post-HTML-Change |
|---|---|---|---|
| KnowledgeSDK | 98% | 100% | 96% |
| Firecrawl extract | 95% | 98% | 91% |
| BeautifulSoup (manual) | 99%* | 100%* | 31%* |
*GitHub changes its DOM more frequently than most sites; even the pre-change scores were depressed by dynamic rendering.
Test 5: SaaS Pricing Page
Fields: plan names, prices, billing cycle options, feature lists per plan, CTA button text.
| Tool | Field Accuracy | Schema Compliance | Post-HTML-Change |
|---|---|---|---|
| KnowledgeSDK | 93% | 100% | 91% |
| Firecrawl extract | 88% | 94% | 84% |
| BeautifulSoup (manual) | 94%* | 99%* | 0% |
The consistent finding: LLM-based extraction degrades gracefully when HTML changes. Manual selectors fail completely. For any pipeline that runs continuously against sites you don't control, the maintenance cost of manual selectors quickly exceeds the marginal cost of the API.
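The failure mode behind these numbers can be sketched in a few lines. The HTML snippets and both "extractors" below are toys invented for illustration: one anchors to a class name, like a CSS selector, while the other anchors to what the data itself looks like, crudely approximating content-based extraction.

```python
import re

# A page before and after a rebrand that randomizes class names.
before = '<div class="price-block"><span class="amount">$189.00</span></div>'
after = '<div class="x7k2-a"><span class="x7k2-b">$189.00</span></div>'

def by_class(html):
    """Mimics a CSS selector: anchored to a specific class name."""
    m = re.search(r'class="amount">([^<]+)<', html)
    return m.group(1) if m else None

def by_content(html):
    """Mimics content-based extraction: anchored to what a price looks like."""
    m = re.search(r"\$\d[\d,]*\.\d{2}", html)
    return m.group(0) if m else None

print(by_class(before), by_class(after))      # the class-anchored extractor breaks
print(by_content(before), by_content(after))  # the content-anchored one survives
```

An LLM is doing something far richer than a price regex, but the structural point is the same: it keys on meaning rather than markup, so markup churn degrades it gracefully instead of zeroing it out.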
10 Ready-to-Use Extraction Schemas
Copy and paste these into your application. All are valid JSON Schema (Draft 7).
1. E-Commerce Product
{
"type": "object",
"properties": {
"product_name": { "type": "string" },
"brand": { "type": "string" },
"price_usd": { "type": "number" },
"original_price_usd": { "type": "number" },
"discount_percent": { "type": "number" },
"in_stock": { "type": "boolean" },
"sku": { "type": "string" },
"star_rating": { "type": "number" },
"review_count": { "type": "integer" },
"images": { "type": "array", "items": { "type": "string", "format": "uri" } },
"key_features": { "type": "array", "items": { "type": "string" } }
},
"required": ["product_name", "price_usd", "in_stock"]
}
2. News / Blog Article
{
"type": "object",
"properties": {
"headline": { "type": "string" },
"author": { "type": "string" },
"published_at": { "type": "string", "format": "date-time" },
"updated_at": { "type": "string", "format": "date-time" },
"summary": { "type": "string", "maxLength": 500 },
"body": { "type": "string" },
"tags": { "type": "array", "items": { "type": "string" } },
"canonical_url": { "type": "string", "format": "uri" }
},
"required": ["headline", "body"]
}
3. Job Listing
{
"type": "object",
"properties": {
"job_title": { "type": "string" },
"company": { "type": "string" },
"location": { "type": "string" },
"remote_policy": { "type": "string", "enum": ["remote", "hybrid", "onsite", "unknown"] },
"salary_min": { "type": "number" },
"salary_max": { "type": "number" },
"salary_currency": { "type": "string" },
"employment_type": { "type": "string" },
"requirements": { "type": "array", "items": { "type": "string" } },
"nice_to_have": { "type": "array", "items": { "type": "string" } },
"apply_url": { "type": "string", "format": "uri" },
"posted_at": { "type": "string", "format": "date" }
},
"required": ["job_title", "company"]
}
4. SaaS Pricing Page
{
"type": "object",
"properties": {
"company": { "type": "string" },
"plans": {
"type": "array",
"items": {
"type": "object",
"properties": {
"name": { "type": "string" },
"monthly_price": { "type": "number" },
"annual_price": { "type": "number" },
"currency": { "type": "string" },
"features": { "type": "array", "items": { "type": "string" } },
"cta_text": { "type": "string" },
"is_popular": { "type": "boolean" }
},
"required": ["name"]
}
},
"has_free_tier": { "type": "boolean" },
"has_enterprise": { "type": "boolean" }
}
}
5. Company / About Page
{
"type": "object",
"properties": {
"company_name": { "type": "string" },
"founded_year": { "type": "integer" },
"headquarters": { "type": "string" },
"employee_count_range": { "type": "string" },
"description": { "type": "string" },
"mission": { "type": "string" },
"products": { "type": "array", "items": { "type": "string" } },
"industries": { "type": "array", "items": { "type": "string" } },
"social_links": {
"type": "object",
"properties": {
"twitter": { "type": "string" },
"linkedin": { "type": "string" },
"github": { "type": "string" }
}
}
}
}
6. Real Estate Listing
{
"type": "object",
"properties": {
"address": { "type": "string" },
"price": { "type": "number" },
"bedrooms": { "type": "integer" },
"bathrooms": { "type": "number" },
"square_feet": { "type": "integer" },
"lot_size": { "type": "string" },
"property_type": { "type": "string" },
"year_built": { "type": "integer" },
"listing_status": { "type": "string" },
"days_on_market": { "type": "integer" },
"description": { "type": "string" },
"features": { "type": "array", "items": { "type": "string" } }
},
"required": ["address", "price"]
}
7. Research Paper / Academic
{
"type": "object",
"properties": {
"title": { "type": "string" },
"authors": { "type": "array", "items": { "type": "string" } },
"abstract": { "type": "string" },
"published_date": { "type": "string" },
"doi": { "type": "string" },
"journal": { "type": "string" },
"keywords": { "type": "array", "items": { "type": "string" } },
"citations_count": { "type": "integer" },
"pdf_url": { "type": "string", "format": "uri" }
},
"required": ["title", "abstract"]
}
8. Event Listing
{
"type": "object",
"properties": {
"event_name": { "type": "string" },
"organizer": { "type": "string" },
"start_datetime": { "type": "string", "format": "date-time" },
"end_datetime": { "type": "string", "format": "date-time" },
"venue": { "type": "string" },
"city": { "type": "string" },
"is_virtual": { "type": "boolean" },
"ticket_price_min": { "type": "number" },
"ticket_price_max": { "type": "number" },
"registration_url": { "type": "string", "format": "uri" },
"description": { "type": "string" },
"speakers": { "type": "array", "items": { "type": "string" } }
},
"required": ["event_name", "start_datetime"]
}
9. Recipe
{
"type": "object",
"properties": {
"recipe_name": { "type": "string" },
"author": { "type": "string" },
"prep_time_minutes": { "type": "integer" },
"cook_time_minutes": { "type": "integer" },
"servings": { "type": "integer" },
"difficulty": { "type": "string", "enum": ["easy", "medium", "hard"] },
"cuisine": { "type": "string" },
"calories_per_serving": { "type": "integer" },
"ingredients": { "type": "array", "items": { "type": "string" } },
"instructions": { "type": "array", "items": { "type": "string" } },
"tags": { "type": "array", "items": { "type": "string" } }
},
"required": ["recipe_name", "ingredients", "instructions"]
}
10. GitHub Repository
{
"type": "object",
"properties": {
"repo_name": { "type": "string" },
"owner": { "type": "string" },
"description": { "type": "string" },
"stars": { "type": "integer" },
"forks": { "type": "integer" },
"watchers": { "type": "integer" },
"primary_language": { "type": "string" },
"license": { "type": "string" },
"topics": { "type": "array", "items": { "type": "string" } },
"last_commit_date": { "type": "string", "format": "date" },
"open_issues": { "type": "integer" },
"homepage_url": { "type": "string", "format": "uri" }
},
"required": ["repo_name", "owner"]
}
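One convenient way to use this library in application code is a small registry keyed by page type, so callers pick a schema by name instead of pasting JSON inline. The registry below holds abbreviated stand-ins; swap in the full schemas from above.

```python
# Schema registry keyed by page type. Entries here are abbreviated
# stand-ins -- paste the full schemas from the library above.
SCHEMAS = {
    "product": {
        "type": "object",
        "properties": {
            "product_name": {"type": "string"},
            "price_usd": {"type": "number"},
        },
        "required": ["product_name", "price_usd"],
    },
    "article": {
        "type": "object",
        "properties": {
            "headline": {"type": "string"},
            "body": {"type": "string"},
        },
        "required": ["headline", "body"],
    },
}

def schema_for(page_type: str) -> dict:
    """Look up a schema, failing loudly with the list of known types."""
    try:
        return SCHEMAS[page_type]
    except KeyError:
        raise ValueError(
            f"no schema registered for {page_type!r}; known types: {sorted(SCHEMAS)}"
        )
```

A call site then reads `client.extract(url=url, schema=schema_for("product"))`, and adding support for a new page type means adding one registry entry.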
Building a Resilient Extraction Pipeline
Here is a complete pipeline that extracts e-commerce product data, stores it, and automatically re-extracts when the page changes — using KnowledgeSDK's webhook change detection:
Node.js
import KnowledgeSDK from '@knowledgesdk/node';
import { createClient } from '@supabase/supabase-js';
const client = new KnowledgeSDK({ apiKey: process.env.KNOWLEDGESDK_API_KEY });
const db = createClient(process.env.SUPABASE_URL, process.env.SUPABASE_KEY);
const productSchema = {
type: 'object',
properties: {
product_name: { type: 'string' },
price_usd: { type: 'number' },
in_stock: { type: 'boolean' },
star_rating: { type: 'number' },
},
required: ['product_name', 'price_usd', 'in_stock'],
};
// Initial extraction
async function extractAndStore(url) {
const result = await client.extract({ url, schema: productSchema });
await db.from('products').upsert({
url,
data: result.data,
extracted_at: new Date().toISOString(),
});
// Register webhook for change detection
await client.webhooks.create({
url: `${process.env.APP_URL}/webhooks/product-changed`,
events: ['page.changed'],
metadata: { watchUrl: url },
});
return result.data;
}
// Webhook handler — re-extract on change
export async function handleProductChanged(payload) {
const { watchUrl } = payload.metadata;
await extractAndStore(watchUrl);
console.log(`Re-extracted product data for ${watchUrl}`);
}
Python
from knowledgesdk import KnowledgeSDK
from supabase import create_client
from datetime import datetime, timezone
import os

client = KnowledgeSDK(api_key=os.environ["KNOWLEDGESDK_API_KEY"])
db = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_KEY"])

product_schema = {
    "type": "object",
    "properties": {
        "product_name": {"type": "string"},
        "price_usd": {"type": "number"},
        "in_stock": {"type": "boolean"},
        "star_rating": {"type": "number"},
    },
    "required": ["product_name", "price_usd", "in_stock"],
}

def extract_and_store(url: str) -> dict:
    result = client.extract(url=url, schema=product_schema)
    # Store the extracted record (mirrors the Node upsert above)
    db.table("products").upsert({
        "url": url,
        "data": result.data,
        "extracted_at": datetime.now(timezone.utc).isoformat(),
    }).execute()
    # Register webhook for change detection
    client.webhooks.create(
        url=f"{os.environ['APP_URL']}/webhooks/product-changed",
        events=["page.changed"],
        metadata={"watch_url": url},
    )
    return result.data
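For symmetry with the Node webhook handler above, here is a framework-agnostic Python sketch. The payload shape (a `metadata` object carrying `watch_url`) matches the metadata registered above; wiring it into a Flask or FastAPI route, and passing in your real re-extraction function, is left to your app.

```python
from typing import Callable

def handle_product_changed(payload: dict, re_extract: Callable[[str], dict]) -> str:
    """Webhook handler sketch: pull the watched URL out of the payload
    and re-run extraction for it. `re_extract` would typically be the
    pipeline's extract-and-store function."""
    watch_url = payload["metadata"]["watch_url"]
    re_extract(watch_url)
    print(f"Re-extracted product data for {watch_url}")
    return watch_url
```

Keeping the handler a plain function of `(payload, re_extract)` makes it trivial to unit-test with a stub before binding it to an HTTP route.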
Conclusion
Natural language extraction is not just a convenience — it is a fundamentally more maintainable architecture for any pipeline that reads from websites you do not control. CSS selectors are point-in-time assumptions about HTML that will break. LLM-powered extraction is a robust, schema-driven contract that survives HTML changes.
KnowledgeSDK's extract endpoint gives you the simplest path from URL to typed JSON: pass a URL, describe your schema, get structured data back in under two seconds. The built-in semantic search means those extracted records are immediately queryable without a separate vector pipeline. And webhook change detection means your data stays fresh automatically.
Ready to replace your CSS selectors? Sign up for free at knowledgesdk.com and get 1,000 extractions per month included.