Semantic Scraping: Beyond Raw HTML Extraction for AI Applications
Raw web scraping extracts text. Semantic scraping extracts meaning. The difference sounds abstract until you see it in practice.
Traditional scraping takes a URL and returns the DOM content — sometimes cleaned of navigation, ads, and boilerplate, but ultimately still raw text that requires further processing before it's useful to an AI system. Semantic scraping applies understanding on top of that text: it identifies what the content is about, what entities it contains, what claims it makes, and how it relates to other content in your knowledge base. The output isn't just "here's the text from that URL" — it's "here's what you can now know and query about that URL."
The shift matters because LLMs are capable of doing that semantic work, and the tools to automate it at scale now exist. For teams building AI agents, RAG pipelines, or knowledge management systems, semantic scraping is the difference between a data pipeline and an intelligence pipeline.
Three Levels of Extraction
It helps to think about web extraction in three distinct levels. Most teams start at Level 1. Most AI applications need Level 3.
Level 1: Raw HTML extraction
The output is DOM content — sometimes stripped of obvious navigation, sometimes not. This is what you get from basic curl or a simple headless browser setup.
<!-- What you get -->
<nav>Home | About | Contact</nav>
<h1>Q1 2026 Earnings Report</h1>
<p>Revenue grew 34% year-over-year to $2.4B...</p>
<div class="ad-unit">Advertisement</div>
Noise is high. Structure is HTML-specific, not content-specific. LLMs can work with this but spend tokens on boilerplate.
Level 2: Clean markdown extraction
The output is cleaned, normalized text. Navigation is removed. Ads are stripped. Content structure is preserved as markdown (headers, lists, code blocks, tables). This is what tools like Firecrawl, ScrapingBee, and KnowledgeSDK's /v1/extract endpoint produce.
# Q1 2026 Earnings Report
Revenue grew 34% year-over-year to $2.4B, driven by cloud product adoption.
Operating margin improved to 23% from 19% in Q1 2025.
## Segment Performance
| Segment | Revenue | YoY Growth |
|---------|---------|------------|
| Cloud | $1.6B | +52% |
| Enterprise | $0.5B | +18% |
| Consumer | $0.3B | -4% |
LLMs work well with this. It's token-efficient and structurally meaningful. For many use cases, Level 2 is sufficient.
Level 3: Semantic knowledge extraction
The output is structured knowledge — entities, facts, relationships, and summaries extracted from the content. The page is understood, not just cleaned.
{
  "type": "earnings_report",
  "company": "Example Corp",
  "period": "Q1 2026",
  "keyMetrics": {
    "revenue": { "value": 2.4, "unit": "billion USD", "yoyGrowth": 0.34 },
    "operatingMargin": { "value": 0.23, "prev": 0.19 }
  },
  "segments": [
    { "name": "Cloud", "revenue": 1.6, "growth": 0.52 },
    { "name": "Enterprise", "revenue": 0.5, "growth": 0.18 },
    { "name": "Consumer", "revenue": 0.3, "growth": -0.04 }
  ],
  "summary": "Strong quarter driven by cloud adoption. Consumer segment declining.",
  "sentiment": "positive",
  "topics": ["cloud computing", "enterprise software", "Q1 earnings"]
}
This output is immediately useful to downstream systems. No further processing needed. You can aggregate it, query it, track it over time, and surface it directly to AI agents.
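As a concrete illustration of that aggregation step: once extractions shaped like the one above are collected (the extractions array below is a hypothetical in-memory collection), flagging every declining segment across a portfolio of companies takes a few lines, with no re-reading of the source pages.

// Hypothetical: `extractions` is an array of earnings extractions shaped like the example above
const decliningSegments = extractions.flatMap((report) =>
  report.segments
    .filter((segment) => segment.growth < 0)
    .map((segment) => ({ company: report.company, period: report.period, segment: segment.name }))
);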
How LLMs Enable Semantic Extraction
The reason semantic scraping is practical in 2026 is that LLMs are genuinely good at structured extraction from unstructured text — and they're now cheap enough to run on every document in a large corpus.
The pattern is simple: Level 2 (clean markdown) → LLM extraction prompt → Level 3 (structured knowledge).
import { KnowledgeSDK } from '@knowledgesdk/node';
import OpenAI from 'openai';

const client = new KnowledgeSDK({ apiKey: process.env.KNOWLEDGESDK_API_KEY });
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

async function semanticScrape(url, schema) {
  // Level 2: Clean markdown
  const { markdown } = await client.scrape({ url });

  // Level 3: Semantic extraction
  const extraction = await openai.chat.completions.create({
    model: 'gpt-4o-mini', // Fast + cheap for extraction tasks
    messages: [{
      role: 'user',
      content: `Extract the following information from this web page.
Return valid JSON matching this schema: ${JSON.stringify(schema)}
If a field is not present in the content, use null.
Web page content:
${markdown.slice(0, 6000)}`,
    }],
    response_format: { type: 'json_object' },
  });

  return JSON.parse(extraction.choices[0].message.content);
}

// Example: Extract company information from a homepage
const companySchema = {
  companyName: 'string',
  description: 'string — one sentence',
  founded: 'number | null',
  headquarters: 'string | null',
  products: 'array of strings',
  targetCustomers: 'string',
  keyDifferentiators: 'array of strings',
};

const companyInfo = await semanticScrape('https://stripe.com', companySchema);
KnowledgeSDK's /v1/extract endpoint automates this pattern — you provide a URL and a schema, and it returns structured JSON by combining Level 2 scraping with LLM extraction internally.
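In sketch form, that single call might look like the snippet below; the method name and option shape are illustrative assumptions, so check the API reference for the exact signature.

// Illustrative only: the exact method and option names may differ in the real SDK
const extracted = await client.extract({
  url: 'https://stripe.com',
  schema: companySchema,
});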
Vector Embeddings: Searching by Meaning
Semantic scraping doesn't end at structured extraction. The other half is making content searchable by meaning rather than keywords.
Traditional search is keyword-based: you find documents that contain the words in your query. This fails for AI applications where users ask questions in natural language and don't know the exact terminology used in the source documents.
Vector embeddings convert text into high-dimensional numerical representations where semantically similar content is geometrically close. "Machine learning model deployment" and "MLOps pipeline infrastructure" end up near each other in embedding space even with no shared keywords.
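To make that concrete, here is a small sketch that embeds those two phrases with OpenAI's embeddings API and compares them with cosine similarity (the model choice is just an example); despite sharing no keywords, they come back with a high similarity score.

// Sketch: comparing two phrases by meaning (embedding model choice is illustrative)
const { data } = await openai.embeddings.create({
  model: 'text-embedding-3-small',
  input: ['Machine learning model deployment', 'MLOps pipeline infrastructure'],
});

const cosine = (a, b) => {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
};

// High similarity despite zero shared keywords
console.log(cosine(data[0].embedding, data[1].embedding));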
KnowledgeSDK handles this automatically: every page you scrape is embedded and indexed for hybrid search (vector similarity + BM25 keyword). You don't configure embedding models or manage a vector database.
// All scraped content is automatically searchable
await client.scrape({ url: 'https://docs.example.com/api-reference' });
await client.scrape({ url: 'https://docs.example.com/getting-started' });
await client.scrape({ url: 'https://docs.example.com/authentication' });
// Search by meaning — no exact keyword matching needed
const results = await client.search({
  query: 'how do I set up my API credentials?',
  limit: 5,
});
// Returns the authentication docs even though "API credentials"
// doesn't appear verbatim in the content
The combination of semantic extraction (understanding what a document means) and vector search (finding relevant documents by meaning) is what makes Level 3 extraction powerful for AI agents.
Practical Examples
Product intelligence. Extracting competitor product pages to track feature changes, pricing updates, and positioning shifts.
const productSchema = {
  productName: 'string',
  tagline: 'string',
  pricingPlans: [{
    name: 'string',
    price: 'number | null',
    billingPeriod: '"monthly" | "annual" | null',
    features: 'array of strings',
  }],
  integrations: 'array of strings',
  targetAudience: 'string',
  cta: 'string — primary call to action',
};
const product = await semanticScrape('https://competitor.com/pricing', productSchema);
// Track these extractions over time to detect pricing and feature changes
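One way to act on that comment is to snapshot each extraction with a timestamp and diff the two most recent runs. A minimal sketch, assuming a hypothetical loadSnapshots helper over your own datastore (not part of the SDK):

// `loadSnapshots(url, { limit })` is a hypothetical helper that returns stored extractions, newest first
async function detectPricingChanges(url) {
  const [latest, previous] = await loadSnapshots(url, { limit: 2 });
  if (!previous) return [];

  const changes = [];
  for (const plan of latest.pricingPlans) {
    const before = previous.pricingPlans.find((p) => p.name === plan.name);
    if (!before) {
      changes.push({ plan: plan.name, change: 'plan added' });
    } else if (before.price !== plan.price) {
      changes.push({ plan: plan.name, change: `price changed from ${before.price} to ${plan.price}` });
    }
  }
  return changes;
}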
Company research. Building a database of company facts for sales, investment, or competitive research.
const companyResearchSchema = {
  name: 'string',
  businessModel: 'string',
  keyProducts: 'array of strings',
  customers: 'array of notable customer names or types',
  techStack: 'array of strings mentioned',
  fundingStage: 'string | null',
  employeeCount: 'string | null',
  recentNews: 'array of strings — recent announcements or events',
};
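The semanticScrape helper from earlier works unchanged here; only the schema and target URLs differ (the URLs below are placeholders):

// Placeholder URLs: swap in the companies you're researching
const targets = ['https://example-startup.com', 'https://example-competitor.com'];

const profiles = await Promise.all(
  targets.map((url) => semanticScrape(url, companyResearchSchema))
);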
Content summarization at scale. Processing large content corpora — industry reports, research papers, news archives — into indexed, searchable summaries.
async function processContentCorpus(urls) {
  for (const url of urls) {
    // Scrape + auto-index (Level 2, searchable immediately)
    const { markdown } = await client.scrape({ url });

    // Semantic enrichment (Level 3)
    const summary = await openai.chat.completions.create({
      model: 'gpt-4o-mini',
      messages: [{
        role: 'user',
        content: `Summarize this content in 3 bullet points. Focus on key facts and insights.\n\n${markdown.slice(0, 3000)}`,
      }],
    });

    // Persist to your own datastore (`db` stands in for e.g. a Postgres client)
    await db.insert({
      url,
      summary: summary.choices[0].message.content,
      indexedAt: new Date().toISOString(),
    });
  }
}
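Because every scrape in that loop is auto-indexed, the corpus is immediately queryable through the same search endpoint used earlier; the report URLs below are placeholders.

// Placeholder URLs for a corpus of industry reports
await processContentCorpus([
  'https://example.com/reports/2026-cloud-market',
  'https://example.com/reports/2026-ai-infrastructure',
]);

const hits = await client.search({
  query: 'which vendors are gaining cloud market share?',
  limit: 5,
});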
Knowledge Graph Extraction
The most ambitious form of semantic scraping is knowledge graph construction: identifying entities and the relationships between them across a corpus.
Rather than treating each page as an independent document, you're building a connected graph: Company A was founded by Person B, who previously worked at Company C, which was acquired by Company D in Year E.
async function extractKnowledgeGraph(url) {
  const { markdown } = await client.scrape({ url });

  const graphData = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [{
      role: 'user',
      content: `Extract entities and relationships from this content.
Return JSON with:
- entities: [{ id, type, name, attributes }]
- relationships: [{ from, to, type, attributes }]
Entity types: Person, Company, Product, Event, Location, Technology
Relationship types: founded_by, acquired_by, works_at, competes_with, built_with, located_in, occurred_at
Content:
${markdown.slice(0, 5000)}`,
    }],
    response_format: { type: 'json_object' },
  });

  return JSON.parse(graphData.choices[0].message.content);
}
At scale, this graph connects across documents: the "OpenAI" entity extracted from a TechCrunch article is the same node as "OpenAI" extracted from a research paper, building a progressively richer representation of the entity over time.
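Cross-document linking can start with simple name-based merging before graduating to real entity resolution. A minimal sketch that folds per-page extractKnowledgeGraph results into one in-memory graph (the normalization is deliberately naive):

// Sketch: naive cross-document entity merging keyed on normalized name + type
function mergeGraphs(pages) {
  const entities = new Map();
  const relationships = [];

  for (const page of pages) {
    const idToKey = new Map();

    for (const entity of page.entities) {
      const key = `${entity.type}:${entity.name.trim().toLowerCase()}`;
      idToKey.set(entity.id, key);
      const existing = entities.get(key);
      // Later pages enrich the same node rather than creating a duplicate
      entities.set(key, {
        ...entity,
        id: key,
        attributes: { ...(existing?.attributes ?? {}), ...entity.attributes },
      });
    }

    // Re-point relationship endpoints at the merged entity keys
    for (const rel of page.relationships) {
      relationships.push({
        ...rel,
        from: idToKey.get(rel.from) ?? rel.from,
        to: idToKey.get(rel.to) ?? rel.to,
      });
    }
  }

  return { entities: [...entities.values()], relationships };
}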
When Raw Extraction Is Enough
Semantic scraping is overkill for some use cases. Raw markdown (Level 2) is sufficient when:
- The downstream consumer is an LLM that will do its own reasoning over the text
- You're building a simple Q&A system where keyword + semantic search on the full text is sufficient
- The content structure is consistent enough that LLMs can interpret it from markdown alone
- You're processing at a scale where LLM extraction costs are a meaningful constraint
Level 3 extraction pays off when you need: structured databases of extracted facts, tracking of specific fields over time, aggregation across many documents (trend analysis), or downstream systems that can't work with freeform text.
The Full Pipeline
The semantic scraping pipeline for a production AI application:
URL List
→ KnowledgeSDK /v1/extract (Level 2: clean markdown, auto-indexed)
→ LLM extraction with schema (Level 3: structured JSON)
→ Store in Postgres (for structured queries + trend tracking)
→ KnowledgeSDK /v1/search (semantic search across the indexed corpus)
→ AI agent context injection
The KnowledgeSDK steps handle the heavy lifting: rendering JavaScript-heavy pages, circumventing bot detection, normalizing content, embedding, and indexing. The LLM extraction step is where your domain-specific schema lives. The result is a knowledge base that your AI agents can query by meaning, not just keyword.
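Stitched together, the pipeline is only a handful of calls. A condensed sketch reusing the helpers above: saveExtraction stands in for your Postgres write, and the final search call is what feeds agent context.

// Condensed end-to-end sketch; `saveExtraction` is a placeholder for your Postgres write
async function runPipeline(urls, schema) {
  for (const url of urls) {
    const record = await semanticScrape(url, schema); // Level 2 + Level 3
    await saveExtraction(url, record);                // structured queries + trend tracking
  }
}

// At question-answering time, pull agent context by meaning rather than keywords
const context = await client.search({
  query: 'recent pricing and positioning changes among competitors',
  limit: 5,
});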
Get started with semantic scraping on KnowledgeSDK's free tier — 1,000 requests per month at knowledgesdk.com/setup.