Incremental Web Crawling: Only Scrape What Changed (With Webhooks)
The naive approach to keeping a web-scraped knowledge base current is to re-crawl everything every day. If your corpus has 100 pages and you re-scrape all of them daily for 30 days, you make 3,000 API calls — regardless of how many pages actually changed.
In practice, most content does not change that frequently. A documentation site might update 5 pages per day on average. A product catalog might change pricing on 10% of SKUs per week. A news site publishes new articles but rarely changes old ones. Re-crawling the entire corpus to capture a 5% change rate wastes 95% of your API budget.
The incremental crawling pattern solves this: perform a baseline crawl once, then subscribe to change notifications via webhooks, and re-scrape only the URLs that actually changed. For a 100-page site with 5 changes per day, that is 100 initial calls plus 5 per day — 250 total over 30 days instead of 3,000. 12x cheaper, with the same up-to-date knowledge base.
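The arithmetic above can be sketched as a tiny cost model (the function name and shape are ours, not part of any SDK):

```typescript
// Back-of-envelope model for the two strategies: full daily re-crawl vs.
// one baseline crawl plus webhook-driven re-scrapes of changed pages.
function monthlyCalls(pages: number, changesPerDay: number, days = 30) {
  const fullRecrawl = pages * days;
  const incremental = pages + changesPerDay * days; // baseline + re-scrapes
  return { fullRecrawl, incremental, savings: fullRecrawl / incremental };
}

console.log(monthlyCalls(100, 5)); // { fullRecrawl: 3000, incremental: 250, savings: 12 }
```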
This article shows how to implement this pattern from scratch using KnowledgeSDK's webhook system.
The Architecture
The incremental crawling system has three components:
1. Baseline Crawler — runs once to populate the knowledge base with all pages.
2. Change Detection Webhook — KnowledgeSDK monitors registered URLs and fires an HTTP event when the content of a page changes. Your server receives the webhook, identifies which URL changed, and re-scrapes it.
3. Vector Store Updater — when a page is re-scraped, the old vector chunks are deleted and replaced with fresh ones. The knowledge base stays current without a full rebuild.
Initial setup:
Crawl all URLs → extract + chunk → upsert to vector DB
↓
Register URLs for change monitoring with KnowledgeSDK
Ongoing:
KnowledgeSDK detects change → fires webhook → your handler
↓
Re-scrape changed URL → delete old chunks → upsert new chunks
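The change event your handler receives can be modeled roughly as follows — the field names match how the handlers later in this article consume the payload, but the authoritative schema is KnowledgeSDK's, so treat this shape as an assumption:

```typescript
// Sketch of the "page.changed" webhook payload as consumed in this article.
interface PageChangedEvent {
  event: "page.changed";
  data: {
    url: string; // the monitored URL that changed
    changeType: "updated" | "deleted";
  };
}
```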
Step 1: Baseline Crawl
Start by crawling all pages in the corpus and storing the extracted content. We will use a sitemap to enumerate the URLs.
Node.js
import crypto from "crypto";
import KnowledgeSDK from "@knowledgesdk/node";
import { OpenAI } from "openai";
import { createClient } from "@supabase/supabase-js";

const ks = new KnowledgeSDK({ apiKey: process.env.KNOWLEDGESDK_API_KEY! });
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY! });
const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_KEY!
);

function hashContent(content: string): string {
  return crypto.createHash("sha256").update(content).digest("hex").slice(0, 16);
}

async function baselineCrawl(sitemapUrl: string) {
  // 1. Fetch all URLs from the sitemap
  const sitemapResult = await ks.sitemap(sitemapUrl);
  const urls = sitemapResult.urls;
  console.log(`Baseline crawl: ${urls.length} URLs to process`);

  // 2. Process in batches to avoid overwhelming the API
  const BATCH_SIZE = 10;
  for (let i = 0; i < urls.length; i += BATCH_SIZE) {
    const batch = urls.slice(i, i + BATCH_SIZE);
    await Promise.all(
      batch.map(async (url: string) => {
        try {
          const extracted = await ks.extract(url);

          // Generate embedding for semantic search
          const embedding = await openai.embeddings.create({
            model: "text-embedding-3-small",
            input: extracted.markdown.slice(0, 8000), // truncate to stay under the token limit
          });

          // Store in vector database
          await supabase.from("knowledge_items").upsert({
            url,
            title: extracted.title,
            content: extracted.markdown,
            embedding: embedding.data[0].embedding,
            scraped_at: new Date().toISOString(),
            content_hash: hashContent(extracted.markdown),
          });
          console.log(`Processed: ${url}`);
        } catch (err) {
          console.error(`Failed to process ${url}:`, err);
        }
      })
    );
    // Brief pause between batches
    await new Promise((r) => setTimeout(r, 500));
  }

  console.log("Baseline crawl complete");
  return urls;
}
Python
import os
import hashlib
import asyncio
from datetime import datetime

import knowledgesdk
from openai import AsyncOpenAI
from supabase import create_client

ks = knowledgesdk.Client(api_key=os.environ["KNOWLEDGESDK_API_KEY"])
openai = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])
supabase = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_SERVICE_KEY"])

def hash_content(content: str) -> str:
    return hashlib.sha256(content.encode()).hexdigest()[:16]

async def process_url(url: str) -> None:
    try:
        # ks.extract is a blocking call; run it in a thread so gather() overlaps requests
        extracted = await asyncio.to_thread(ks.extract, url)
        embedding_response = await openai.embeddings.create(
            model="text-embedding-3-small",
            input=extracted["markdown"][:8000]
        )
        supabase.table("knowledge_items").upsert({
            "url": url,
            "title": extracted.get("title", ""),
            "content": extracted["markdown"],
            "embedding": embedding_response.data[0].embedding,
            "scraped_at": datetime.utcnow().isoformat(),
            "content_hash": hash_content(extracted["markdown"]),
        }).execute()
        print(f"Processed: {url}")
    except Exception as e:
        print(f"Failed to process {url}: {e}")

async def baseline_crawl(sitemap_url: str) -> list[str]:
    sitemap_result = ks.sitemap(sitemap_url)
    urls = sitemap_result["urls"]
    print(f"Baseline crawl: {len(urls)} URLs to process")

    # Process in batches of 10
    batch_size = 10
    for i in range(0, len(urls), batch_size):
        batch = urls[i:i + batch_size]
        await asyncio.gather(*[process_url(url) for url in batch])
        await asyncio.sleep(0.5)

    print("Baseline crawl complete")
    return urls
Step 2: Register URLs for Change Monitoring
After the baseline crawl, register the URLs with KnowledgeSDK's webhook system. KnowledgeSDK will periodically check each URL for content changes and fire a webhook to your endpoint when something updates.
Node.js
async function registerWebhooks(urls: string[], webhookUrl: string) {
  // Register your webhook endpoint
  const webhook = await ks.webhooks.create({
    url: webhookUrl,
    events: ["page.changed"],
    secret: process.env.WEBHOOK_SECRET!, // for signature verification
  });
  console.log(`Webhook created: ${webhook.id}`);

  // Register each URL for monitoring
  // KnowledgeSDK batches these internally
  const registrations = await Promise.all(
    urls.map((url) =>
      ks.monitor.register({
        url,
        webhookId: webhook.id,
        checkInterval: "1h", // check every hour
      })
    )
  );
  console.log(`Registered ${registrations.length} URLs for monitoring`);
  return webhook;
}

// Example usage
const urls = await baselineCrawl("https://docs.example.com/sitemap.xml");
await registerWebhooks(urls, "https://your-app.com/webhooks/knowledge-update");
Python
async def register_webhooks(urls: list[str], webhook_url: str) -> dict:
    # Create the webhook endpoint registration
    webhook = ks.webhooks.create(
        url=webhook_url,
        events=["page.changed"],
        secret=os.environ["WEBHOOK_SECRET"]
    )
    print(f"Webhook created: {webhook['id']}")

    # Register URLs for monitoring
    for url in urls:
        ks.monitor.register(
            url=url,
            webhook_id=webhook["id"],
            check_interval="1h"
        )
    print(f"Registered {len(urls)} URLs for monitoring")
    return webhook
Step 3: Handle Incoming Webhooks
When KnowledgeSDK detects a change, it sends an HTTP POST to your webhook URL. Your handler verifies the signature, re-scrapes the changed URL, and updates the vector store.
Node.js (Express)
import express from "express";
import crypto from "crypto";

const app = express();

app.post("/webhooks/knowledge-update", express.raw({ type: "application/json" }), async (req, res) => {
  // Verify the webhook signature using a constant-time comparison
  const signature = (req.headers["x-knowledgesdk-signature"] as string) ?? "";
  const expectedSig = `sha256=${crypto
    .createHmac("sha256", process.env.WEBHOOK_SECRET!)
    .update(req.body)
    .digest("hex")}`;
  const valid =
    signature.length === expectedSig.length &&
    crypto.timingSafeEqual(Buffer.from(signature), Buffer.from(expectedSig));
  if (!valid) {
    return res.status(401).json({ error: "Invalid signature" });
  }

  const payload = JSON.parse(req.body.toString());

  // Acknowledge receipt immediately (KnowledgeSDK retries if no 2xx within 30s)
  res.status(200).json({ received: true });

  // Process after responding; errors are logged rather than returned to the sender
  if (payload.event === "page.changed") {
    handlePageChange(payload.data.url, payload.data.changeType).catch((err) =>
      console.error(`Failed to process change for ${payload.data.url}:`, err)
    );
  }
});

async function handlePageChange(url: string, changeType: "updated" | "deleted") {
  console.log(`Page changed: ${url} (${changeType})`);

  if (changeType === "deleted") {
    // Remove from vector store
    await supabase
      .from("knowledge_items")
      .delete()
      .eq("url", url);
    console.log(`Deleted: ${url}`);
    return;
  }

  // Re-scrape the changed page
  const extracted = await ks.extract(url);

  // Generate new embedding
  const embedding = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: extracted.markdown.slice(0, 8000),
  });

  // Upsert to replace old content
  await supabase.from("knowledge_items").upsert({
    url,
    title: extracted.title,
    content: extracted.markdown,
    embedding: embedding.data[0].embedding,
    scraped_at: new Date().toISOString(),
    content_hash: hashContent(extracted.markdown),
  });
  console.log(`Updated vector store: ${url}`);
}

app.listen(3000, () => console.log("Webhook server running on port 3000"));
Python (FastAPI)
import hmac
import hashlib
import json

from fastapi import FastAPI, Request, HTTPException, BackgroundTasks
from openai import AsyncOpenAI

# Reuses os, ks, supabase, hash_content, and datetime from the Step 1 module
app = FastAPI()
openai = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

def verify_signature(payload: bytes, signature: str, secret: str) -> bool:
    expected = hmac.new(
        secret.encode(),
        payload,
        hashlib.sha256
    ).hexdigest()
    return hmac.compare_digest(f"sha256={expected}", signature)

async def handle_page_change(url: str, change_type: str) -> None:
    print(f"Page changed: {url} ({change_type})")

    if change_type == "deleted":
        supabase.table("knowledge_items").delete().eq("url", url).execute()
        print(f"Deleted: {url}")
        return

    # Re-scrape the changed URL
    extracted = ks.extract(url)
    embedding_response = await openai.embeddings.create(
        model="text-embedding-3-small",
        input=extracted["markdown"][:8000]
    )
    supabase.table("knowledge_items").upsert({
        "url": url,
        "title": extracted.get("title", ""),
        "content": extracted["markdown"],
        "embedding": embedding_response.data[0].embedding,
        "scraped_at": datetime.utcnow().isoformat(),
        "content_hash": hash_content(extracted["markdown"]),
    }).execute()
    print(f"Updated vector store: {url}")

@app.post("/webhooks/knowledge-update")
async def webhook_handler(request: Request, background_tasks: BackgroundTasks):
    body = await request.body()
    signature = request.headers.get("x-knowledgesdk-signature", "")
    if not verify_signature(body, signature, os.environ["WEBHOOK_SECRET"]):
        raise HTTPException(status_code=401, detail="Invalid signature")

    payload = json.loads(body)

    # Respond immediately, process in the background
    if payload.get("event") == "page.changed":
        background_tasks.add_task(
            handle_page_change,
            payload["data"]["url"],
            payload["data"]["change_type"]
        )
    return {"received": True}
Cost Analysis: Full Re-Crawl vs. Incremental
Let us work through a realistic scenario: a 100-page documentation site that averages 5 page updates per day.
Full Daily Re-Crawl
| Item | Calculation | Total |
|---|---|---|
| Pages per crawl | 100 | — |
| Crawls per month | 30 | — |
| Total API calls | 100 × 30 | 3,000 |
| Cost at $0.002/call | 3,000 × $0.002 | $6.00/month |
| Embedding API calls | 3,000 × $0.0001 | $0.30/month |
| Total | $6.30/month |
Incremental Crawl with Webhooks
| Item | Calculation | Total |
|---|---|---|
| Baseline crawl | 100 (once) | 100 |
| Changes per day | 5 | — |
| Days per month | 30 | — |
| Re-scrape calls | 5 × 30 | 150 |
| Total API calls | 100 + 150 | 250 |
| Cost at $0.002/call | 250 × $0.002 | $0.50/month |
| Embedding API calls | 250 × $0.0001 | $0.025/month |
| Total | $0.525/month |
The incremental approach is 12x cheaper, and it also keeps the knowledge base fresher: changes are processed as they happen, rather than up to 24 hours later as with a nightly batch crawl.
Scaling This Up
| Corpus Size | Daily Changes (5%) | Full Re-Crawl/month | Incremental/month | Savings |
|---|---|---|---|---|
| 100 pages | 5 | 3,000 calls | 250 calls | 12x |
| 1,000 pages | 50 | 30,000 calls | 2,500 calls | 12x |
| 10,000 pages | 500 | 300,000 calls | 25,000 calls | 12x |
With a constant 5% daily change rate, the monthly ratio stays at 12x regardless of corpus size, because both costs scale linearly with page count. The savings do grow over longer horizons, since the one-time baseline crawl amortizes: over three months, the 100-page site makes 100 + 450 = 550 incremental calls against 9,000 for daily full re-crawls, roughly 16x cheaper.
Handling Edge Cases
New Pages (Not in Baseline)
Sites continuously add new pages. You need a periodic full sitemap diff to catch pages that were not in the original baseline:
async function syncNewPages(sitemapUrl: string) {
  const current = await ks.sitemap(sitemapUrl);
  const currentUrls = new Set(current.urls);

  const { data: existing } = await supabase
    .from("knowledge_items")
    .select("url");
  const existingUrls = new Set(existing!.map((r: any) => r.url));

  const newUrls = [...currentUrls].filter((url) => !existingUrls.has(url));
  console.log(`Found ${newUrls.length} new pages since baseline`);

  // Process new pages (remember to also register them for monitoring, as in Step 2)
  for (const url of newUrls) {
    await handlePageChange(url, "updated");
  }
}

// Run the sitemap sync once per day to catch new pages
setInterval(
  () => syncNewPages("https://docs.example.com/sitemap.xml"),
  24 * 60 * 60 * 1000
);
Webhook Delivery Failures
KnowledgeSDK retries failed webhook deliveries with exponential backoff (1 min, 5 min, 30 min, 2 hours). If all retries fail, the change event is logged in your dashboard for manual review. Design your webhook handler to be idempotent — processing the same change event twice should produce the same result as processing it once.
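One way to get idempotency is to deduplicate on a per-event identifier before doing any work. The `eventId` parameter here is a hypothetical field (check the actual payload for a delivery or event id); the in-memory `Set` is for illustration only — in production you would persist processed ids, e.g. with a unique-constraint insert in Postgres:

```typescript
// Dedupe webhook deliveries so a retried event is processed at most once.
const processedEvents = new Set<string>();

async function handleEventOnce(eventId: string, process: () => Promise<void>) {
  if (processedEvents.has(eventId)) {
    console.log(`Skipping duplicate delivery: ${eventId}`);
    return;
  }
  processedEvents.add(eventId); // mark before processing to block concurrent retries
  await process();
}
```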
Detecting False Positives
Some pages have dynamic content (timestamps, ad blocks, user counters) that changes on every request without meaningful content changes. KnowledgeSDK's change detection compares a normalized content hash that strips common dynamic elements, but for your specific corpus you may want to add an additional check:
async function handlePageChange(url: string, changeType: string) {
  const extracted = await ks.extract(url);
  const newHash = hashContent(extracted.markdown);

  // Check against the stored hash before updating
  const { data: existing } = await supabase
    .from("knowledge_items")
    .select("content_hash")
    .eq("url", url)
    .single();

  if (existing?.content_hash === newHash) {
    console.log(`No meaningful change detected for ${url}, skipping update`);
    return;
  }

  // Proceed with update...
}
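If KnowledgeSDK's built-in normalization is not enough for your corpus, you can also normalize the markdown yourself before hashing. The patterns below are illustrative (tune them to the dynamic elements your pages actually contain):

```typescript
// Strip common dynamic elements so cosmetic changes hash identically.
function normalizeForHashing(markdown: string): string {
  return markdown
    .replace(/\d{4}-\d{2}-\d{2}T[\d:.]+Z?/g, "") // ISO timestamps
    .replace(/^Last updated:.*$/gim, "")         // "last updated" footer lines
    .replace(/\s+/g, " ")                        // collapse whitespace
    .trim();
}
```

Hash the normalized string instead of the raw markdown, and two scrapes that differ only in a footer timestamp will produce the same `content_hash`.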
Monitoring Your Incremental Crawl
Track the health of your incremental system with a simple metrics table:
-- PostgreSQL: add to your schema
CREATE TABLE crawl_metrics (
  date DATE NOT NULL,
  urls_checked INTEGER DEFAULT 0,
  urls_changed INTEGER DEFAULT 0,
  urls_failed INTEGER DEFAULT 0,
  api_calls_saved INTEGER DEFAULT 0,
  PRIMARY KEY (date)
);
// Log metrics after each webhook batch.
// Note: upsert replaces the day's row, so pass cumulative counts for the day
// (or use a Postgres function that increments the counters instead).
async function logMetrics(checked: number, changed: number, failed: number, corpusSize: number) {
  const fullCrawlCost = corpusSize; // API calls if we re-crawled everything today
  const actualCost = changed + failed;
  const saved = fullCrawlCost - actualCost;

  await supabase.from("crawl_metrics").upsert({
    date: new Date().toISOString().split("T")[0],
    urls_checked: checked,
    urls_changed: changed,
    urls_failed: failed,
    api_calls_saved: saved,
  });
}
Start Building Your Incremental Crawler
KnowledgeSDK is one of the few managed web scraping APIs that supports change detection webhooks natively — most competitors require you to build and host your own change detection layer. You get the baseline crawl, the monitoring, and the webhook delivery all from a single API.
Get started with a free API key at knowledgesdk.com. The free tier includes 1,000 API calls and webhook monitoring for up to 50 URLs — enough to prototype the full incremental pipeline against your target corpus before committing to production.