What Is Change Detection?
Change detection is the practice of periodically fetching a web page and comparing the newly fetched version to a previously stored snapshot to identify what has changed — new content added, existing content modified, or content removed. It is the foundation of web monitoring systems, price alert tools, regulatory compliance trackers, and AI knowledge base refresh pipelines.
Why Change Detection Matters
The web is not static. Pages change constantly:
- E-commerce — prices, stock levels, and promotions update daily or hourly
- News and media — new articles are published continuously; existing articles are edited
- Regulatory sites — legislation, guidance documents, and official notices are updated without fanfare
- Competitor sites — feature pages, pricing tables, and job listings change in response to market conditions
- Documentation — API docs and guides are versioned and updated with each software release
Without change detection, a scraped dataset becomes stale immediately after extraction.
Change Detection Techniques
Hash Comparison
The simplest approach: compute a hash (MD5, SHA-256) of the entire page content and compare it to the previous hash. Any difference triggers a "changed" flag. Fast, but does not tell you what changed.
Diff-Based Comparison
Compute a line-by-line or token-by-token diff of the new and old content (similar to git diff). This produces a precise view of additions and deletions. More expensive to compute but highly informative.
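A simplistic line-level diff can be sketched as below; it reports lines unique to each version rather than computing a positional LCS the way `git diff` or the npm `diff` package would, but it illustrates the idea:

```javascript
// Naive line-level diff: report lines that appear only in one version.
// A production implementation would compute a longest-common-subsequence
// to preserve ordering and context; this set-based sketch only captures
// which lines were added and which were removed.
function lineDiff(oldText, newText) {
  const oldLines = new Set(oldText.split('\n'));
  const newLines = new Set(newText.split('\n'));
  return {
    added: [...newLines].filter((line) => !oldLines.has(line)),
    removed: [...oldLines].filter((line) => !newLines.has(line)),
  };
}
```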
Structural Diffing
Compare the extracted data fields rather than raw text — detect when a price field changes from $49.99 to $39.99 without being confused by timestamp updates or unrelated sidebar changes.
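One way to sketch this in Node.js, assuming the page has already been reduced to a flat object of extracted fields (the function and field names here are illustrative):

```javascript
// Compare extracted fields rather than raw text, skipping noisy keys
// (fetch timestamps, view counters) so only meaningful changes surface.
function fieldDiff(oldFields, newFields, ignore = ['fetchedAt']) {
  const changes = {};
  const keys = new Set([...Object.keys(oldFields), ...Object.keys(newFields)]);
  for (const key of keys) {
    if (ignore.includes(key)) continue;
    if (oldFields[key] !== newFields[key]) {
      changes[key] = { from: oldFields[key], to: newFields[key] };
    }
  }
  return changes;
}
```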
Semantic Diffing
Use an LLM to describe what changed in natural language: "The pricing section was updated to add a new Enterprise tier at $299/month." Most powerful, but adds latency and cost.
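The model call itself depends on whichever LLM client your stack uses, but the prompt construction can be sketched as a pure function (everything here is an assumption, not a prescribed API):

```javascript
// Build a prompt asking an LLM to summarize what changed in plain English.
// The actual model invocation is omitted; pass this string to whatever
// chat-completion client (OpenAI, Anthropic, local model) you run.
function buildSemanticDiffPrompt(oldMarkdown, newMarkdown) {
  return [
    'Compare the two versions of this web page and describe, in one or two',
    'sentences, what meaningfully changed. Ignore timestamps and boilerplate.',
    '--- OLD VERSION ---',
    oldMarkdown,
    '--- NEW VERSION ---',
    newMarkdown,
  ].join('\n');
}
```

Keeping prompt construction separate from the model call makes the expensive step easy to gate: run it only after a cheap hash or diff check has already confirmed the page changed.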
A Practical Change Detection Pipeline
1. Schedule: every N minutes/hours, for each monitored URL
2. Fetch: GET the current page content (via /v1/scrape)
3. Compare: diff the new Markdown against the stored snapshot
4. Threshold: if change exceeds threshold, trigger alert
5. Store: update the snapshot in the database
6. Notify: send webhook, email, or Slack notification
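A single polling cycle covering steps 2 through 6 might look like the sketch below; the `deps` object bundles a fetcher, snapshot store, differ, and notifier so the pipeline can be tested with stubs. All of these names are illustrative, not a fixed interface:

```javascript
// One polling cycle for a single URL. `diff` is assumed to return an
// array of changes; an empty array means nothing changed.
async function pollOnce(url, deps, threshold = 0) {
  const { fetchMarkdown, getSnapshot, saveSnapshot, notify, diff } = deps;
  const markdown = await fetchMarkdown(url);        // 2. Fetch
  const prev = await getSnapshot(url);              // snapshot from last run
  if (prev !== null) {
    const changes = diff(prev, markdown);           // 3. Compare
    if (changes.length > threshold) {               // 4. Threshold
      await notify({ url, changes });               // 6. Notify
    }
  }
  await saveSnapshot(url, markdown);                // 5. Store
}
```

Step 1 (scheduling) would wrap this in a cron job or `setInterval` loop over the list of monitored URLs.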
Integrating with KnowledgeSDK
KnowledgeSDK's POST /v1/scrape and POST /v1/extract endpoints return consistent, clean Markdown on every call — making hash and diff comparisons reliable because you are comparing content rather than noisy HTML with dynamic ad tokens or session IDs embedded in it.
A minimal change detector in Node.js:
```javascript
const { createHash } = require('node:crypto');
const hash = (text) => createHash('sha256').update(text).digest('hex');

const prev = await db.getSnapshot(url);        // null on the first run
const { markdown } = await knowledgesdk.scrape({ url });
if (!prev || hash(markdown) !== hash(prev.markdown)) {
  await db.saveSnapshot(url, markdown);        // store the new snapshot
  if (prev) {
    // Only notify when there is an old version to diff against
    await notify({ url, diff: diff(prev.markdown, markdown) });
  }
}
```
Common Pitfalls
- False positives from dynamic content — timestamps, ad banners, and session tokens change on every load; compare only the meaningful content region
- Redirect loops — a URL may start redirecting to a different page; track canonical URLs
- Too-frequent polling — scraping a page every second is abusive; use reasonable intervals and respect `Crawl-delay` in robots.txt
- Snapshot storage costs — storing full page snapshots for thousands of URLs can consume significant storage; consider storing only extracted fields or diffs
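The first pitfall is usually handled by normalizing content before comparison. A sketch of such a normalizer is below; the regex patterns are illustrative and should be tuned to the pages you actually monitor:

```javascript
// Strip volatile fragments before comparing, so timestamps and session
// tokens don't trigger false "changed" alerts.
function normalize(markdown) {
  return markdown
    .replace(/\b\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}(:\d{2})?\b/g, '') // ISO-ish timestamps
    .replace(/[?&](sessionid|sid|utm_[a-z]+)=[^&\s)]+/gi, '')      // tracking params
    .replace(/[ \t]+$/gm, '')                                      // trailing whitespace
    .trim();
}
```

Hash or diff the output of `normalize` rather than the raw Markdown, and two fetches that differ only in a timestamp will compare as identical.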