You've built an AI chatbot. Users are asking questions. And slowly, a problem emerges: the answers are getting stale.
Your chatbot was trained — or your RAG pipeline was populated — at a specific point in time. Since then, your product has shipped new features. Your docs have been updated. Your pricing has changed. Industry regulations have evolved. But your chatbot is still confidently answering based on the old information.
This is the stale knowledge problem, and it affects every production AI system eventually.
This guide builds a complete freshness pipeline: define URLs to monitor, scrape weekly, diff against previous versions, update your vector store, and use KnowledgeSDK webhooks to notify your application when content changes so it can invalidate its cache and re-embed the updated content.
The Stale Knowledge Problem
AI chatbots get knowledge from three sources:
1. Training data — frozen at the model's training cutoff, typically 6-18 months behind current reality
2. Initial RAG ingestion — the documents and URLs you loaded when you built the chatbot
3. Live retrieval — real-time lookups that happen at query time
Most chatbots rely heavily on sources 1 and 2 and rarely implement source 3. The result is a chatbot that confidently answers questions about pricing tiers that no longer exist, features that were sunset, or policies that were updated.
The fix is a freshness pipeline that keeps your RAG knowledge base synchronized with live web content.
Architecture
┌─────────────────────────────────────────────────────────┐
│                   FRESHNESS PIPELINE                    │
│                                                         │
│  1. URL Registry          2. Weekly Scrape Job          │
│  ┌─────────────┐          ┌─────────────────────┐       │
│  │ Monitored   │ ──────→  │ KnowledgeSDK Scrape │       │
│  │ URLs list   │          │ POST /v1/scrape     │       │
│  └─────────────┘          └──────────┬──────────┘       │
│                                      │                  │
│  3. Diff Detection        4. Re-embed & Store           │
│  ┌─────────────┐          ┌─────────────────────┐       │
│  │ Compare vs  │ ──────→  │ OpenAI Embeddings   │       │
│  │ baseline    │ changed  │ → Vector DB upsert  │       │
│  └─────────────┘          └──────────┬──────────┘       │
│                                      │                  │
│  5. Webhook Trigger       6. Cache Invalidation         │
│  ┌─────────────┐          ┌─────────────────────┐       │
│  │ KnowledgeSDK│ ──────→  │ App notified to     │       │
│  │ webhook     │          │ clear stale cache   │       │
│  └─────────────┘          └─────────────────────┘       │
└─────────────────────────────────────────────────────────┘
Two complementary mechanisms work together:
- Scheduled scraping catches gradual content drift (weekly re-scrape and re-embed)
- KnowledgeSDK webhooks catch immediate significant changes (within hours)
Prerequisites
mkdir chatbot-freshness && cd chatbot-freshness
npm install @knowledgesdk/node openai pg node-cron dotenv
npm install -D typescript tsx @types/node
.env:
KNOWLEDGESDK_API_KEY=sk_ks_your_key
OPENAI_API_KEY=sk-...
DATABASE_URL=postgresql://...
APP_WEBHOOK_SECRET=your_secret
SERVER_URL=https://your-app.com
Step 1: Define Your URL Registry
// src/urlRegistry.ts
export interface MonitoredUrl {
id: string;
url: string;
label: string; // Human-readable name
category: string; // docs | pricing | product | legal
refreshIntervalDays: number;
priority: 'high' | 'medium' | 'low';
}
export const MONITORED_URLS: MonitoredUrl[] = [
{
id: "pricing",
url: "https://yourapp.com/pricing",
label: "Pricing Page",
category: "pricing",
refreshIntervalDays: 1, // Daily for pricing
priority: "high",
},
{
id: "docs-getting-started",
url: "https://docs.yourapp.com/getting-started",
label: "Getting Started Docs",
category: "docs",
refreshIntervalDays: 7,
priority: "high",
},
{
id: "docs-api-reference",
url: "https://docs.yourapp.com/api",
label: "API Reference",
category: "docs",
refreshIntervalDays: 7,
priority: "medium",
},
{
id: "changelog",
url: "https://yourapp.com/changelog",
label: "Changelog",
category: "product",
refreshIntervalDays: 1,
priority: "high",
},
{
id: "tos",
url: "https://yourapp.com/terms",
label: "Terms of Service",
category: "legal",
refreshIntervalDays: 30,
priority: "medium",
},
];
Step 2: Database Schema
CREATE EXTENSION IF NOT EXISTS vector; -- pgvector, required for the embedding column
CREATE TABLE knowledge_items (
id TEXT PRIMARY KEY, -- matches MonitoredUrl.id
url TEXT NOT NULL UNIQUE,
label TEXT NOT NULL,
category TEXT NOT NULL,
markdown TEXT NOT NULL,
embedding vector(1536), -- pgvector extension
content_hash TEXT NOT NULL, -- SHA256 of markdown for quick change detection
scraped_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
embedded_at TIMESTAMPTZ,
version INT NOT NULL DEFAULT 1
);
CREATE TABLE content_history (
id SERIAL PRIMARY KEY,
url TEXT NOT NULL,
old_markdown TEXT NOT NULL,
new_markdown TEXT NOT NULL,
changed_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
-- Enable vector similarity search
CREATE INDEX ON knowledge_items USING ivfflat (embedding vector_cosine_ops);
Step 3: The Scraping and Embedding Pipeline
// src/pipeline.ts
import KnowledgeSDK from "@knowledgesdk/node";
import OpenAI from "openai";
import { Pool } from "pg";
import crypto from "crypto";
import { MONITORED_URLS, MonitoredUrl } from "./urlRegistry";
import "dotenv/config";
const ks = new KnowledgeSDK({ apiKey: process.env.KNOWLEDGESDK_API_KEY! });
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY! });
const db = new Pool({ connectionString: process.env.DATABASE_URL });
function hashContent(content: string): string {
return crypto.createHash("sha256").update(content).digest("hex");
}
async function getEmbedding(text: string): Promise<number[]> {
// Truncate overly long content to stay under the model's input limit.
// For long pages, chunking by section heading works better (see the FAQ).
const MAX_CHARS = 8000;
const truncated = text.slice(0, MAX_CHARS);
const response = await openai.embeddings.create({
model: "text-embedding-3-small",
input: truncated,
});
return response.data[0].embedding;
}
export async function processUrl(item: MonitoredUrl): Promise<boolean> {
console.log(`Processing: ${item.label} (${item.url})`);
// Scrape the URL
const scraped = await ks.scrape({ url: item.url });
const newHash = hashContent(scraped.markdown);
// Check if content changed
const { rows } = await db.query(
"SELECT content_hash, markdown, version FROM knowledge_items WHERE id = $1",
[item.id]
);
const existing = rows[0];
const isNew = !existing;
const hasChanged = existing && existing.content_hash !== newHash;
if (!isNew && !hasChanged) {
console.log(` No change detected, skipping re-embed`);
return false;
}
// Store change history
if (hasChanged) {
await db.query(
`INSERT INTO content_history (url, old_markdown, new_markdown)
VALUES ($1, $2, $3)`,
[item.url, existing.markdown, scraped.markdown]
);
console.log(` Content changed, re-embedding...`);
} else {
console.log(` New content, embedding for first time...`);
}
// Generate embedding
const embedding = await getEmbedding(scraped.markdown);
// Upsert into knowledge base
await db.query(
`INSERT INTO knowledge_items
(id, url, label, category, markdown, embedding, content_hash, scraped_at, embedded_at, version)
VALUES ($1, $2, $3, $4, $5, $6::vector, $7, NOW(), NOW(), 1)
ON CONFLICT (id) DO UPDATE SET
markdown = EXCLUDED.markdown,
embedding = EXCLUDED.embedding,
content_hash = EXCLUDED.content_hash,
scraped_at = NOW(),
embedded_at = NOW(),
version = knowledge_items.version + 1`,
[
item.id,
item.url,
item.label,
item.category,
scraped.markdown,
JSON.stringify(embedding),
newHash,
]
);
console.log(` Updated: v${(existing?.version ?? 0) + 1}`);
return true; // Content was updated
}
export async function runFullRefresh(): Promise<void> {
console.log(`Starting full knowledge base refresh (${MONITORED_URLS.length} URLs)`);
let updated = 0;
let unchanged = 0;
let failed = 0;
for (const item of MONITORED_URLS) {
try {
const wasUpdated = await processUrl(item);
if (wasUpdated) updated++;
else unchanged++;
// Rate limit: 1 second between scrapes
await new Promise((r) => setTimeout(r, 1000));
} catch (error) {
console.error(` FAILED: ${item.label}`, error);
failed++;
}
}
console.log(`Refresh complete: ${updated} updated, ${unchanged} unchanged, ${failed} failed`);
if (updated > 0) {
await notifyAppOfUpdates();
}
}
async function notifyAppOfUpdates(): Promise<void> {
// Notify your application that the knowledge base was updated
// so it can invalidate its query cache
try {
await fetch(`${process.env.SERVER_URL}/api/knowledge/invalidate-cache`, {
method: "POST",
headers: {
"Content-Type": "application/json",
"x-internal-secret": process.env.APP_WEBHOOK_SECRET!,
},
body: JSON.stringify({ updatedAt: new Date().toISOString() }),
});
console.log("App notified of knowledge base update");
} catch (e) {
console.error("Failed to notify app:", e);
}
}
Step 4: Scheduled Refresh
// src/scheduler.ts
import cron from "node-cron";
import { runFullRefresh, processUrl } from "./pipeline";
import { MONITORED_URLS } from "./urlRegistry";
// Run every Sunday at 2 AM
cron.schedule("0 2 * * 0", async () => {
console.log("Weekly knowledge base refresh starting...");
await runFullRefresh();
});
// High-priority items run daily at 6 AM
cron.schedule("0 6 * * *", async () => {
const highPriorityItems = MONITORED_URLS.filter((u) => u.priority === "high");
console.log(`Daily high-priority refresh: ${highPriorityItems.length} items`);
for (const item of highPriorityItems) {
await processUrl(item).catch(console.error);
await new Promise((r) => setTimeout(r, 1000));
}
});
console.log("Scheduler started");
Step 5: Webhook-Triggered Immediate Updates
Scheduled scraping handles gradual drift. For immediate, significant changes — a pricing update, a new product announcement — register KnowledgeSDK webhooks:
// src/webhooks.ts
import KnowledgeSDK from "@knowledgesdk/node";
import { MONITORED_URLS } from "./urlRegistry";
const ks = new KnowledgeSDK({ apiKey: process.env.KNOWLEDGESDK_API_KEY! });
export async function registerWebhooks(): Promise<void> {
const highPriorityUrls = MONITORED_URLS
.filter((u) => u.priority === "high")
.map((u) => u.url);
const webhook = await ks.webhooks.create({
url: `${process.env.SERVER_URL}/webhooks/content-changed`,
watchUrls: highPriorityUrls,
events: ["content.changed"],
secret: process.env.APP_WEBHOOK_SECRET,
});
console.log(`Webhook registered: ${webhook.id}`);
console.log(`Monitoring ${highPriorityUrls.length} high-priority URLs for immediate changes`);
}
The webhook handler in your Express server (it reuses getEmbedding, hashContent, db, and notifyAppOfUpdates, so export those from pipeline.ts as well):
// In your Express app
app.post("/webhooks/content-changed", express.raw({ type: "application/json" }), async (req, res) => {
// Verify the webhook signature against the raw body before trusting it...
// express.raw() gives you a Buffer, so parse the JSON yourself after verifying
const { url, newContent } = JSON.parse(req.body.toString("utf8"));
// Acknowledge immediately so the sender doesn't retry while we process
res.status(200).json({ received: true });
// Find the monitored item for this URL
const item = MONITORED_URLS.find((u) => u.url === url);
if (!item) return;
// Process immediately
try {
const embedding = await getEmbedding(newContent);
const hash = hashContent(newContent);
await db.query(
`UPDATE knowledge_items
SET markdown = $1, embedding = $2::vector, content_hash = $3,
scraped_at = NOW(), embedded_at = NOW(), version = version + 1
WHERE id = $4`,
[newContent, JSON.stringify(embedding), hash, item.id]
);
await notifyAppOfUpdates();
console.log(`Webhook update processed for: ${item.label}`);
} catch (error) {
console.error(`Webhook update failed for ${url}:`, error);
}
});
Step 6: Semantic Search for the Chatbot
Now your chatbot retrieves fresh knowledge at query time:
// src/retrieval.ts
import { Pool } from "pg";
import OpenAI from "openai";
import "dotenv/config";
const db = new Pool({ connectionString: process.env.DATABASE_URL });
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY! });
export interface RetrievalResult {
label: string;
url: string;
content: string;
similarity: number;
scrapedAt: Date;
}
export async function retrieveRelevantContent(
query: string,
limit: number = 5,
maxAgeHours: number = 168 // 1 week default
): Promise<RetrievalResult[]> {
// Get query embedding
const response = await openai.embeddings.create({
model: "text-embedding-3-small",
input: query,
});
const queryEmbedding = response.data[0].embedding;
// Vector similarity search with recency filter
const { rows } = await db.query(
`SELECT
label,
url,
LEFT(markdown, 1500) AS content,
1 - (embedding <=> $1::vector) AS similarity,
scraped_at
FROM knowledge_items
WHERE scraped_at > NOW() - make_interval(hours => $3::int)
ORDER BY embedding <=> $1::vector
LIMIT $2`,
[JSON.stringify(queryEmbedding), limit, maxAgeHours]
);
return rows.map((r) => ({
label: r.label,
url: r.url,
content: r.content,
similarity: r.similarity,
scrapedAt: r.scraped_at,
}));
}
export async function answerWithFreshContext(question: string): Promise<string> {
const results = await retrieveRelevantContent(question);
const context = results
.map((r) => `Source: ${r.label} (${r.url})\nLast updated: ${r.scrapedAt.toISOString()}\n\n${r.content}`)
.join("\n\n---\n\n");
const response = await openai.chat.completions.create({
model: "gpt-4o",
messages: [
{
role: "system",
content: `You are a helpful assistant. Use the provided context to answer questions accurately.
Always cite which source you're drawing from. If the context doesn't contain the answer, say so clearly.
Context was scraped from live web pages — treat it as current information.`,
},
{
role: "user",
content: `Context:\n\n${context}\n\nQuestion: ${question}`,
},
],
});
return response.choices[0].message.content!;
}
Step 7: Cache Invalidation Handler
When content updates, your chatbot's response cache should be cleared to prevent serving stale cached responses:
// In your Next.js API route or Express server
app.post("/api/knowledge/invalidate-cache", async (req, res) => {
const secret = req.headers["x-internal-secret"];
if (secret !== process.env.APP_WEBHOOK_SECRET) {
return res.status(401).json({ error: "Unauthorized" });
}
// Clear your application's response cache. How depends on your caching layer:
//
//   Redis:       delete keys under your cache prefix, e.g. iterate
//                'chatbot:cache:*' with SCAN and DEL (DEL takes exact
//                keys, not patterns; flushDb() wipes the whole database)
//   In-memory:   responseCache.clear();
//   Next.js ISR: res.revalidate('/path') from a pages API route
await clearResponseCache(); // your implementation of one of the above
console.log(`Cache invalidated at ${req.body.updatedAt}`);
return res.json({ cleared: true });
});
Putting It All Together
// src/index.ts
import { runFullRefresh } from "./pipeline";
import { registerWebhooks } from "./webhooks";
import "./scheduler";
async function main() {
// Initial setup
await runFullRefresh();
await registerWebhooks();
console.log("Freshness pipeline running");
}
main().catch(console.error);
Measuring Freshness
Add a simple freshness dashboard to your admin panel:
export async function getFreshnessReport() {
const { rows } = await db.query(`
SELECT
label,
url,
scraped_at,
version,
NOW() - scraped_at AS age,
CASE
WHEN NOW() - scraped_at < INTERVAL '1 day' THEN 'fresh'
WHEN NOW() - scraped_at < INTERVAL '7 days' THEN 'aging'
ELSE 'stale'
END AS freshness_status
FROM knowledge_items
ORDER BY scraped_at ASC
`);
return rows;
}
This shows you at a glance which items are fresh, aging, or stale — and confirms that your freshness pipeline is actually running.
Comparison: Static RAG vs. Fresh RAG
| Dimension | Static RAG | Fresh RAG with KnowledgeSDK |
|---|---|---|
| Knowledge cutoff | Ingestion date | Rolling (24h-7d) |
| Pricing accuracy | Drifts over time | Always current |
| Feature accuracy | Stale after releases | Updated on changelog change |
| Infrastructure | Vector DB only | Vector DB + scraping pipeline |
| Maintenance | None | Minimal (monitor the scheduler) |
| Cost | One-time ingestion | Small ongoing scraping cost |
| User trust | Erodes over time | Maintained |
FAQ
How do I handle documents that shouldn't be re-scraped (PDFs, internal wikis)?
Segment your URL registry by type. Only URLs pointing to public web pages need KnowledgeSDK scraping. For PDFs and internal wikis, use a separate ingestion pipeline and update them manually or on a different trigger.
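A minimal sketch of that segmentation, assuming a hypothetical sourceType field on each registry entry (not part of the MonitoredUrl interface above):

```typescript
// Hypothetical registry extension: tag each entry with a sourceType so
// only public web pages go through the automated scrape path.
type SourceType = "web" | "pdf" | "internal";

interface RegistryEntry {
  id: string;
  url: string;
  sourceType: SourceType;
}

// Only public web pages are candidates for the weekly scrape job;
// everything else falls through to a manual or separate pipeline.
function scrapeTargets(registry: RegistryEntry[]): RegistryEntry[] {
  return registry.filter((e) => e.sourceType === "web");
}
```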
What's a good chunk size for embedding long documents?
For text-embedding-3-small, the limit is 8191 tokens (~6,000 words). For longer pages, chunk by section heading rather than arbitrary character count. KnowledgeSDK's markdown output uses headings consistently, making it easy to split on \n## or \n###.
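A heading-based chunker over the scraped markdown might look like this sketch (chunkByHeading and the 8,000-character fallback limit are illustrative, not part of any SDK):

```typescript
// Split markdown into heading-delimited chunks, keeping each H2/H3
// heading attached to its section so embeddings stay semantically whole.
export function chunkByHeading(markdown: string, maxChars = 8000): string[] {
  // Split at newlines that are immediately followed by "## " or "### "
  const sections = markdown.split(/\n(?=#{2,3} )/);
  const chunks: string[] = [];
  for (const section of sections) {
    if (section.length <= maxChars) {
      chunks.push(section);
    } else {
      // Fall back to hard splits for a single oversized section
      for (let i = 0; i < section.length; i += maxChars) {
        chunks.push(section.slice(i, i + maxChars));
      }
    }
  }
  return chunks.filter((c) => c.trim().length > 0);
}
```

Each chunk would then go through getEmbedding individually, stored as its own row keyed by `${item.id}#${chunkIndex}`.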
Can I run this without pgvector?
Yes. Use KnowledgeSDK's own search endpoint (POST /v1/search) as your vector search layer instead of maintaining your own pgvector. This removes the embedding step entirely — KnowledgeSDK stores and searches the content for you.
How do I handle content that's only partially updated?
KnowledgeSDK's webhook diff shows exactly which lines changed. If only one section of a large page changed, you can re-embed only that section rather than the entire document — though for simplicity, re-embedding the whole page is usually fast enough.
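Section-level change detection can be sketched with the same SHA-256 hashing the pipeline already uses (changedSections is a hypothetical helper; splitting on H2/H3 headings assumes the markdown uses them consistently, as described above):

```typescript
import crypto from "crypto";

// Compare old and new markdown section by section and return only the
// sections whose content changed, so you can re-embed just those.
export function changedSections(oldMd: string, newMd: string): string[] {
  const split = (md: string) => md.split(/\n(?=#{2,3} )/);
  const sha256 = (s: string) =>
    crypto.createHash("sha256").update(s).digest("hex");
  const oldHashes = new Set(split(oldMd).map(sha256));
  // A section is "changed" if its exact content didn't exist before
  return split(newMd).filter((s) => !oldHashes.has(sha256(s)));
}
```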
What if my chatbot is serving many users and can't afford the latency of retrieval?
Use a two-tier approach: a fast in-memory cache of the most recent retrieval results per query pattern, and the full vector search for cache misses. Invalidate the cache when the knowledge base updates.
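A minimal in-memory tier might look like this sketch (AnswerCache and its 10-minute TTL are illustrative; clear() is what your /api/knowledge/invalidate-cache handler would call):

```typescript
// Tier 1: in-memory answer cache keyed by normalized query text.
// Tier 2 (not shown): full vector search on cache miss.
class AnswerCache {
  private entries = new Map<string, { answer: string; cachedAt: number }>();

  constructor(private ttlMs: number = 10 * 60 * 1000) {}

  get(query: string): string | undefined {
    const key = query.trim().toLowerCase();
    const hit = this.entries.get(key);
    if (!hit) return undefined;
    if (Date.now() - hit.cachedAt > this.ttlMs) {
      this.entries.delete(key); // expired entry
      return undefined;
    }
    return hit.answer;
  }

  set(query: string, answer: string): void {
    this.entries.set(query.trim().toLowerCase(), {
      answer,
      cachedAt: Date.now(),
    });
  }

  // Called when the knowledge base updates, so no stale answers survive
  clear(): void {
    this.entries.clear();
  }
}
```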
Stop letting your chatbot give outdated answers. Build a live freshness pipeline today at knowledgesdk.com/setup.