Incremental Web Crawling: Only Scrape What Changed (With Webhooks)
The naive approach to keeping a web-scraped knowledge base current is to re-crawl everything every day. If your corpus has 100 pages and you re-scrape all of them daily for 30 days, you make 3,000 API calls — regardless of how many pages actually changed.
In practice, most content does not change that frequently. A documentation site might update 5 pages per day on average. A product catalog might change pricing on 10% of SKUs per week. A news site publishes new articles but rarely changes old ones. Re-crawling the entire corpus to capture a 5% change rate wastes 95% of your API budget.
The incremental crawling pattern solves this: perform a baseline crawl once, then subscribe to change notifications via webhooks, and re-scrape only the URLs that actually changed. For a 100-page site with 5 changes per day, that is 100 initial calls plus 5 per day — 250 total over 30 days instead of 3,000. 12x cheaper, with the same up-to-date knowledge base.
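The arithmetic above can be sketched as a tiny cost model (the function name and shape are ours, not part of any SDK):

```typescript
// Back-of-envelope model for the two strategies: full daily re-crawl vs.
// one baseline crawl plus webhook-driven re-scrapes of changed pages.
function monthlyCalls(pages: number, changesPerDay: number, days = 30) {
  const fullRecrawl = pages * days;
  const incremental = pages + changesPerDay * days; // baseline + re-scrapes
  return { fullRecrawl, incremental, savings: fullRecrawl / incremental };
}

console.log(monthlyCalls(100, 5)); // { fullRecrawl: 3000, incremental: 250, savings: 12 }
```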
This article shows how to implement this pattern from scratch using KnowledgeSDK's webhook system.
The Architecture
The incremental crawling system has three components:
1. Baseline Crawler — runs once to populate the knowledge base with all pages.
2. Change Detection Webhook — KnowledgeSDK monitors registered URLs and fires an HTTP event when the content of a page changes. Your server receives the webhook, identifies which URL changed, and re-scrapes it.
3. Vector Store Updater — when a page is re-scraped, the old vector chunks are deleted and replaced with fresh ones. The knowledge base stays current without a full rebuild.
Initial setup:
Crawl all URLs → extract + chunk → upsert to vector DB
↓
Register URLs for change monitoring with KnowledgeSDK
Ongoing:
KnowledgeSDK detects change → fires webhook → your handler
↓
Re-scrape changed URL → delete old chunks → upsert new chunks
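The change event your handler receives can be modeled roughly as follows — the field names match how the handlers later in this article consume the payload, but the authoritative schema is KnowledgeSDK's, so treat this shape as an assumption:

```typescript
// Sketch of the "page.changed" webhook payload as consumed in this article.
interface PageChangedEvent {
  event: "page.changed";
  data: {
    url: string; // the monitored URL that changed
    changeType: "updated" | "deleted";
  };
}
```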
Step 1: Baseline Crawl
Start by crawling all pages in the corpus and storing the extracted content. We will use a sitemap to enumerate the URLs.
Node.js
import crypto from "crypto";
import KnowledgeSDK from "@knowledgesdk/node";
import { OpenAI } from "openai";
import { createClient } from "@supabase/supabase-js";

const ks = new KnowledgeSDK({ apiKey: process.env.KNOWLEDGESDK_API_KEY! });
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY! });
const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_KEY!
);

function hashContent(content: string): string {
  return crypto.createHash("sha256").update(content).digest("hex").slice(0, 16);
}

async function baselineCrawl(sitemapUrl: string) {
  // 1. Fetch all URLs from the sitemap
  const sitemapResult = await ks.sitemap(sitemapUrl);
  const urls = sitemapResult.urls;
  console.log(`Baseline crawl: ${urls.length} URLs to process`);

  // 2. Process in batches to avoid overwhelming the API
  const BATCH_SIZE = 10;
  for (let i = 0; i < urls.length; i += BATCH_SIZE) {
    const batch = urls.slice(i, i + BATCH_SIZE);
    await Promise.all(
      batch.map(async (url: string) => {
        try {
          const extracted = await ks.extract(url);

          // Generate embedding for semantic search
          const embedding = await openai.embeddings.create({
            model: "text-embedding-3-small",
            input: extracted.markdown.slice(0, 8000), // truncate to stay under the token limit
          });

          // Store in vector database
          await supabase.from("knowledge_items").upsert({
            url,
            title: extracted.title,
            content: extracted.markdown,
            embedding: embedding.data[0].embedding,
            scraped_at: new Date().toISOString(),
            content_hash: hashContent(extracted.markdown),
          });
          console.log(`Processed: ${url}`);
        } catch (err) {
          console.error(`Failed to process ${url}:`, err);
        }
      })
    );
    // Brief pause between batches
    await new Promise((r) => setTimeout(r, 500));
  }

  console.log("Baseline crawl complete");
  return urls;
}
Python
import os
import hashlib
import asyncio
from datetime import datetime

import knowledgesdk
from openai import AsyncOpenAI
from supabase import create_client

ks = knowledgesdk.Client(api_key=os.environ["KNOWLEDGESDK_API_KEY"])
openai = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])
supabase = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_SERVICE_KEY"])

def hash_content(content: str) -> str:
    return hashlib.sha256(content.encode()).hexdigest()[:16]

async def process_url(url: str) -> None:
    try:
        # ks.extract is a blocking call; run it in a thread so gather() overlaps requests
        extracted = await asyncio.to_thread(ks.extract, url)
        embedding_response = await openai.embeddings.create(
            model="text-embedding-3-small",
            input=extracted["markdown"][:8000]
        )
        supabase.table("knowledge_items").upsert({
            "url": url,
            "title": extracted.get("title", ""),
            "content": extracted["markdown"],
            "embedding": embedding_response.data[0].embedding,
            "scraped_at": datetime.utcnow().isoformat(),
            "content_hash": hash_content(extracted["markdown"]),
        }).execute()
        print(f"Processed: {url}")
    except Exception as e:
        print(f"Failed to process {url}: {e}")

async def baseline_crawl(sitemap_url: str) -> list[str]:
    sitemap_result = ks.sitemap(sitemap_url)
    urls = sitemap_result["urls"]
    print(f"Baseline crawl: {len(urls)} URLs to process")

    # Process in batches of 10
    batch_size = 10
    for i in range(0, len(urls), batch_size):
        batch = urls[i:i + batch_size]
        await asyncio.gather(*[process_url(url) for url in batch])
        await asyncio.sleep(0.5)

    print("Baseline crawl complete")
    return urls
Step 2: Register URLs for Change Monitoring
After the baseline crawl, register the URLs with KnowledgeSDK's webhook system. KnowledgeSDK will periodically check each URL for content changes and fire a webhook to your endpoint when something updates.
Node.js
async function registerWebhooks(urls: string[], webhookUrl: string) {
  // Register your webhook endpoint
  const webhook = await ks.webhooks.create({
    url: webhookUrl,
    events: ["page.changed"],
    secret: process.env.WEBHOOK_SECRET!, // for signature verification
  });
  console.log(`Webhook created: ${webhook.id}`);

  // Register each URL for monitoring
  // KnowledgeSDK batches these internally
  const registrations = await Promise.all(
    urls.map((url) =>
      ks.monitor.register({
        url,
        webhookId: webhook.id,
        checkInterval: "1h", // check every hour
      })
    )
  );
  console.log(`Registered ${registrations.length} URLs for monitoring`);
  return webhook;
}

// Example usage
const urls = await baselineCrawl("https://docs.example.com/sitemap.xml");
await registerWebhooks(urls, "https://your-app.com/webhooks/knowledge-update");
Python
async def register_webhooks(urls: list[str], webhook_url: str) -> dict:
    # Create the webhook endpoint registration
    webhook = ks.webhooks.create(
        url=webhook_url,
        events=["page.changed"],
        secret=os.environ["WEBHOOK_SECRET"]
    )
    print(f"Webhook created: {webhook['id']}")

    # Register URLs for monitoring
    for url in urls:
        ks.monitor.register(
            url=url,
            webhook_id=webhook["id"],
            check_interval="1h"
        )
    print(f"Registered {len(urls)} URLs for monitoring")
    return webhook
Step 3: Handle Incoming Webhooks
When KnowledgeSDK detects a change, it sends an HTTP POST to your webhook URL. Your handler verifies the signature, re-scrapes the changed URL, and updates the vector store.
Node.js (Express)
import express from "express";
import crypto from "crypto";

const app = express();

app.post("/webhooks/knowledge-update", express.raw({ type: "application/json" }), async (req, res) => {
  // Verify the webhook signature using a constant-time comparison
  const signature = (req.headers["x-knowledgesdk-signature"] as string) ?? "";
  const expectedSig = `sha256=${crypto
    .createHmac("sha256", process.env.WEBHOOK_SECRET!)
    .update(req.body)
    .digest("hex")}`;
  const valid =
    signature.length === expectedSig.length &&
    crypto.timingSafeEqual(Buffer.from(signature), Buffer.from(expectedSig));
  if (!valid) {
    return res.status(401).json({ error: "Invalid signature" });
  }

  const payload = JSON.parse(req.body.toString());

  // Acknowledge receipt immediately (KnowledgeSDK retries if no 2xx within 30s)
  res.status(200).json({ received: true });

  // Process after responding; errors are logged rather than returned to the sender
  if (payload.event === "page.changed") {
    handlePageChange(payload.data.url, payload.data.changeType).catch((err) =>
      console.error(`Failed to process change for ${payload.data.url}:`, err)
    );
  }
});

async function handlePageChange(url: string, changeType: "updated" | "deleted") {
  console.log(`Page changed: ${url} (${changeType})`);

  if (changeType === "deleted") {
    // Remove from vector store
    await supabase
      .from("knowledge_items")
      .delete()
      .eq("url", url);
    console.log(`Deleted: ${url}`);
    return;
  }

  // Re-scrape the changed page
  const extracted = await ks.extract(url);

  // Generate new embedding
  const embedding = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: extracted.markdown.slice(0, 8000),
  });

  // Upsert to replace old content
  await supabase.from("knowledge_items").upsert({
    url,
    title: extracted.title,
    content: extracted.markdown,
    embedding: embedding.data[0].embedding,
    scraped_at: new Date().toISOString(),
    content_hash: hashContent(extracted.markdown),
  });
  console.log(`Updated vector store: ${url}`);
}

app.listen(3000, () => console.log("Webhook server running on port 3000"));
Python (FastAPI)
import hmac
import hashlib
import json

from fastapi import FastAPI, Request, HTTPException, BackgroundTasks
from openai import AsyncOpenAI

# Reuses os, ks, supabase, hash_content, and datetime from the Step 1 module
app = FastAPI()
openai = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

def verify_signature(payload: bytes, signature: str, secret: str) -> bool:
    expected = hmac.new(
        secret.encode(),
        payload,
        hashlib.sha256
    ).hexdigest()
    return hmac.compare_digest(f"sha256={expected}", signature)

async def handle_page_change(url: str, change_type: str) -> None:
    print(f"Page changed: {url} ({change_type})")

    if change_type == "deleted":
        supabase.table("knowledge_items").delete().eq("url", url).execute()
        print(f"Deleted: {url}")
        return

    # Re-scrape the changed URL
    extracted = ks.extract(url)
    embedding_response = await openai.embeddings.create(
        model="text-embedding-3-small",
        input=extracted["markdown"][:8000]
    )
    supabase.table("knowledge_items").upsert({
        "url": url,
        "title": extracted.get("title", ""),
        "content": extracted["markdown"],
        "embedding": embedding_response.data[0].embedding,
        "scraped_at": datetime.utcnow().isoformat(),
        "content_hash": hash_content(extracted["markdown"]),
    }).execute()
    print(f"Updated vector store: {url}")

@app.post("/webhooks/knowledge-update")
async def webhook_handler(request: Request, background_tasks: BackgroundTasks):
    body = await request.body()
    signature = request.headers.get("x-knowledgesdk-signature", "")
    if not verify_signature(body, signature, os.environ["WEBHOOK_SECRET"]):
        raise HTTPException(status_code=401, detail="Invalid signature")

    payload = json.loads(body)

    # Respond immediately, process in the background
    if payload.get("event") == "page.changed":
        background_tasks.add_task(
            handle_page_change,
            payload["data"]["url"],
            payload["data"]["change_type"]
        )
    return {"received": True}
Cost Analysis: Full Re-Crawl vs. Incremental
Let us work through a realistic scenario: a 100-page documentation site that averages 5 page updates per day.
Full Daily Re-Crawl
| Item | Calculation | Total |
|---|---|---|
| Pages per crawl | 100 | — |
| Crawls per month | 30 | — |
| Total API calls | 100 × 30 | 3,000 |
| Cost at $0.002/call | 3,000 × $0.002 | $6.00/month |
| Embedding API calls | 3,000 × $0.0001 | $0.30/month |
| Total | $6.30/month |
Incremental Crawl with Webhooks
| Item | Calculation | Total |
|---|---|---|
| Baseline crawl | 100 (once) | 100 |
| Changes per day | 5 | — |
| Days per month | 30 | — |
| Re-scrape calls | 5 × 30 | 150 |
| Total API calls | 100 + 150 | 250 |
| Cost at $0.002/call | 250 × $0.002 | $0.50/month |
| Embedding API calls | 250 × $0.0001 | $0.025/month |
| Total | $0.525/month |
The incremental approach is 12x cheaper, and it also keeps the knowledge base fresher: changes are processed as they happen, rather than up to 24 hours later as with a nightly batch crawl.
Scaling This Up
| Corpus Size | Daily Changes (5%) | Full Re-Crawl/month | Incremental/month | Savings |
|---|---|---|---|---|
| 100 pages | 5 | 3,000 calls | 250 calls | 12x |
| 1,000 pages | 50 | 30,000 calls | 2,500 calls | 12x |
| 10,000 pages | 500 | 300,000 calls | 25,000 calls | 12x |
With a constant 5% daily change rate, the monthly ratio stays at 12x regardless of corpus size, because both costs scale linearly with page count. The savings do grow over longer horizons, since the one-time baseline crawl amortizes: over three months, the 100-page site makes 100 + 450 = 550 incremental calls against 9,000 for daily full re-crawls, roughly 16x cheaper.
Handling Edge Cases
New Pages (Not in Baseline)
Sites continuously add new pages. You need a periodic full sitemap diff to catch pages that were not in the original baseline:
async function syncNewPages(sitemapUrl: string) {
  const current = await ks.sitemap(sitemapUrl);
  const currentUrls = new Set(current.urls);

  const { data: existing } = await supabase
    .from("knowledge_items")
    .select("url");
  const existingUrls = new Set(existing!.map((r: any) => r.url));

  const newUrls = [...currentUrls].filter((url) => !existingUrls.has(url));
  console.log(`Found ${newUrls.length} new pages since baseline`);

  // Process new pages (remember to also register them for monitoring, as in Step 2)
  for (const url of newUrls) {
    await handlePageChange(url, "updated");
  }
}

// Run the sitemap sync once per day to catch new pages
setInterval(
  () => syncNewPages("https://docs.example.com/sitemap.xml"),
  24 * 60 * 60 * 1000
);
Webhook Delivery Failures
KnowledgeSDK retries failed webhook deliveries with exponential backoff (1 min, 5 min, 30 min, 2 hours). If all retries fail, the change event is logged in your dashboard for manual review. Design your webhook handler to be idempotent — processing the same change event twice should produce the same result as processing it once.
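One way to get idempotency is to deduplicate on a per-event identifier before doing any work. The `eventId` parameter here is a hypothetical field (check the actual payload for a delivery or event id); the in-memory `Set` is for illustration only — in production you would persist processed ids, e.g. with a unique-constraint insert in Postgres:

```typescript
// Dedupe webhook deliveries so a retried event is processed at most once.
const processedEvents = new Set<string>();

async function handleEventOnce(eventId: string, process: () => Promise<void>) {
  if (processedEvents.has(eventId)) {
    console.log(`Skipping duplicate delivery: ${eventId}`);
    return;
  }
  processedEvents.add(eventId); // mark before processing to block concurrent retries
  await process();
}
```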
Detecting False Positives
Some pages have dynamic content (timestamps, ad blocks, user counters) that changes on every request without meaningful content changes. KnowledgeSDK's change detection compares a normalized content hash that strips common dynamic elements, but for your specific corpus you may want to add an additional check:
async function handlePageChange(url: string, changeType: string) {
  const extracted = await ks.extract(url);
  const newHash = hashContent(extracted.markdown);

  // Check against the stored hash before updating
  const { data: existing } = await supabase
    .from("knowledge_items")
    .select("content_hash")
    .eq("url", url)
    .single();

  if (existing?.content_hash === newHash) {
    console.log(`No meaningful change detected for ${url}, skipping update`);
    return;
  }

  // Proceed with update...
}
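If KnowledgeSDK's built-in normalization is not enough for your corpus, you can also normalize the markdown yourself before hashing. The patterns below are illustrative (tune them to the dynamic elements your pages actually contain):

```typescript
// Strip common dynamic elements so cosmetic changes hash identically.
function normalizeForHashing(markdown: string): string {
  return markdown
    .replace(/\d{4}-\d{2}-\d{2}T[\d:.]+Z?/g, "") // ISO timestamps
    .replace(/^Last updated:.*$/gim, "")         // "last updated" footer lines
    .replace(/\s+/g, " ")                        // collapse whitespace
    .trim();
}
```

Hash the normalized string instead of the raw markdown, and two scrapes that differ only in a footer timestamp will produce the same `content_hash`.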
Monitoring Your Incremental Crawl
Track the health of your incremental system with a simple metrics table:
-- PostgreSQL: add to your schema
CREATE TABLE crawl_metrics (
  date DATE NOT NULL,
  urls_checked INTEGER DEFAULT 0,
  urls_changed INTEGER DEFAULT 0,
  urls_failed INTEGER DEFAULT 0,
  api_calls_saved INTEGER DEFAULT 0,
  PRIMARY KEY (date)
);
// Log metrics after each webhook batch.
// Note: upsert replaces the day's row, so pass cumulative counts for the day
// (or use a Postgres function that increments the counters instead).
async function logMetrics(checked: number, changed: number, failed: number, corpusSize: number) {
  const fullCrawlCost = corpusSize; // API calls if we re-crawled everything today
  const actualCost = changed + failed;
  const saved = fullCrawlCost - actualCost;

  await supabase.from("crawl_metrics").upsert({
    date: new Date().toISOString().split("T")[0],
    urls_checked: checked,
    urls_changed: changed,
    urls_failed: failed,
    api_calls_saved: saved,
  });
}
Start Building Your Incremental Crawler
KnowledgeSDK is one of the few managed web scraping APIs that supports change detection webhooks natively — most competitors require you to build and host your own change detection layer. You get the baseline crawl, the monitoring, and the webhook delivery all from a single API.
Get started with a free API key at knowledgesdk.com. The free tier includes 1,000 API calls and webhook monitoring for up to 50 URLs — enough to prototype the full incremental pipeline against your target corpus before committing to production.