What Is Change Detection?
Change detection is the practice of periodically fetching a web page and comparing the newly fetched version to a previously stored snapshot to identify what has changed — new content added, existing content modified, or content removed. It is the foundation of web monitoring systems, price alert tools, regulatory compliance trackers, and AI knowledge base refresh pipelines.
Why Change Detection Matters
The web is not static. Pages change constantly:
- E-commerce — prices, stock levels, and promotions update daily or hourly
- News and media — new articles are published continuously; existing articles are edited
- Regulatory sites — legislation, guidance documents, and official notices are updated without fanfare
- Competitor sites — feature pages, pricing tables, and job listings change in response to market conditions
- Documentation — API docs and guides are versioned and updated with each software release
Without change detection, a scraped dataset becomes stale immediately after extraction.
Change Detection Techniques
Hash Comparison
The simplest approach: compute a hash (MD5, SHA-256) of the entire page content and compare it to the previous hash. Any difference triggers a "changed" flag. Fast, but does not tell you what changed.
Diff-Based Comparison
Compute a line-by-line or token-by-token diff of the new and old content (similar to git diff). This produces a precise view of additions and deletions. More expensive to compute but highly informative.
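A simplistic line-level diff can be sketched as below; it reports lines unique to each version rather than computing a positional LCS the way `git diff` or the npm `diff` package would, but it illustrates the idea:

```javascript
// Naive line-level diff: report lines that appear only in one version.
// A production implementation would compute a longest-common-subsequence
// to preserve ordering and context; this set-based sketch only captures
// which lines were added and which were removed.
function lineDiff(oldText, newText) {
  const oldLines = new Set(oldText.split('\n'));
  const newLines = new Set(newText.split('\n'));
  return {
    added: [...newLines].filter((line) => !oldLines.has(line)),
    removed: [...oldLines].filter((line) => !newLines.has(line)),
  };
}
```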
Structural Diffing
Compare the extracted data fields rather than raw text — detect when a price field changes from $49.99 to $39.99 without being confused by timestamp updates or unrelated sidebar changes.
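One way to sketch this in Node.js, assuming the page has already been reduced to a flat object of extracted fields (the function and field names here are illustrative):

```javascript
// Compare extracted fields rather than raw text, skipping noisy keys
// (fetch timestamps, view counters) so only meaningful changes surface.
function fieldDiff(oldFields, newFields, ignore = ['fetchedAt']) {
  const changes = {};
  const keys = new Set([...Object.keys(oldFields), ...Object.keys(newFields)]);
  for (const key of keys) {
    if (ignore.includes(key)) continue;
    if (oldFields[key] !== newFields[key]) {
      changes[key] = { from: oldFields[key], to: newFields[key] };
    }
  }
  return changes;
}
```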
Semantic Diffing
Use an LLM to describe what changed in natural language: "The pricing section was updated to add a new Enterprise tier at $299/month." Most powerful, but adds latency and cost.
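The model call itself depends on whichever LLM client your stack uses, but the prompt construction can be sketched as a pure function (everything here is an assumption, not a prescribed API):

```javascript
// Build a prompt asking an LLM to summarize what changed in plain English.
// The actual model invocation is omitted; pass this string to whatever
// chat-completion client (OpenAI, Anthropic, local model) you run.
function buildSemanticDiffPrompt(oldMarkdown, newMarkdown) {
  return [
    'Compare the two versions of this web page and describe, in one or two',
    'sentences, what meaningfully changed. Ignore timestamps and boilerplate.',
    '--- OLD VERSION ---',
    oldMarkdown,
    '--- NEW VERSION ---',
    newMarkdown,
  ].join('\n');
}
```

Keeping prompt construction separate from the model call makes the expensive step easy to gate: run it only after a cheap hash or diff check has already confirmed the page changed.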
A Practical Change Detection Pipeline
1. Schedule: every N minutes/hours, for each monitored URL
2. Fetch: GET the current page content (via /v1/scrape)
3. Compare: diff the new Markdown against the stored snapshot
4. Threshold: if change exceeds threshold, trigger alert
5. Store: update the snapshot in the database
6. Notify: send webhook, email, or Slack notification
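A single polling cycle covering steps 2 through 6 might look like the sketch below; the `deps` object bundles a fetcher, snapshot store, differ, and notifier so the pipeline can be tested with stubs. All of these names are illustrative, not a fixed interface:

```javascript
// One polling cycle for a single URL. `diff` is assumed to return an
// array of changes; an empty array means nothing changed.
async function pollOnce(url, deps, threshold = 0) {
  const { fetchMarkdown, getSnapshot, saveSnapshot, notify, diff } = deps;
  const markdown = await fetchMarkdown(url);        // 2. Fetch
  const prev = await getSnapshot(url);              // snapshot from last run
  if (prev !== null) {
    const changes = diff(prev, markdown);           // 3. Compare
    if (changes.length > threshold) {               // 4. Threshold
      await notify({ url, changes });               // 6. Notify
    }
  }
  await saveSnapshot(url, markdown);                // 5. Store
}
```

Step 1 (scheduling) would wrap this in a cron job or `setInterval` loop over the list of monitored URLs.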
Integrating with KnowledgeSDK
KnowledgeSDK's POST /v1/scrape and POST /v1/extract endpoints return consistent, clean Markdown on every call — making hash and diff comparisons reliable because you are comparing content rather than noisy HTML with dynamic ad tokens or session IDs embedded in it.
A minimal change detector in Node.js:
```javascript
const { createHash } = require('node:crypto');
const hash = (text) => createHash('sha256').update(text).digest('hex');

const prev = await db.getSnapshot(url);        // null on the first run
const { markdown } = await knowledgesdk.scrape({ url });
if (!prev || hash(markdown) !== hash(prev.markdown)) {
  await db.saveSnapshot(url, markdown);        // store the new snapshot
  if (prev) {
    // Only notify when there is an old version to diff against
    await notify({ url, diff: diff(prev.markdown, markdown) });
  }
}
```
Common Pitfalls
- False positives from dynamic content — timestamps, ad banners, and session tokens change on every load; compare only the meaningful content region
- Redirect loops — a URL may start redirecting to a different page; track canonical URLs
- Too-frequent polling — scraping a page every second is abusive; use reasonable intervals and respect `Crawl-delay` in robots.txt
- Snapshot storage costs — storing full page snapshots for thousands of URLs can consume significant storage; consider storing only extracted fields or diffs
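The first pitfall is usually handled by normalizing content before comparison. A sketch of such a normalizer is below; the regex patterns are illustrative and should be tuned to the pages you actually monitor:

```javascript
// Strip volatile fragments before comparing, so timestamps and session
// tokens don't trigger false "changed" alerts.
function normalize(markdown) {
  return markdown
    .replace(/\b\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}(:\d{2})?\b/g, '') // ISO-ish timestamps
    .replace(/[?&](sessionid|sid|utm_[a-z]+)=[^&\s)]+/gi, '')      // tracking params
    .replace(/[ \t]+$/gm, '')                                      // trailing whitespace
    .trim();
}
```

Hash or diff the output of `normalize` rather than the raw Markdown, and two fetches that differ only in a timestamp will compare as identical.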