Use case · March 19, 2026 · 13 min read

How to Keep Your AI Chatbot's Knowledge Base Fresh with Web Scraping

Solve the stale knowledge problem: build a pipeline that scrapes URLs weekly, diffs against previous versions, updates your vector store, and notifies your app.


You've built an AI chatbot. Users are asking questions. And slowly, a problem emerges: the answers are getting stale.

Your chatbot was trained — or your RAG pipeline was populated — at a specific point in time. Since then, your product has shipped new features. Your docs have been updated. Your pricing has changed. Industry regulations have evolved. But your chatbot is still confidently answering based on the old information.

This is the stale knowledge problem, and it affects every production AI system eventually.

This guide builds a complete freshness pipeline: define URLs to monitor, scrape weekly, diff against previous versions, update your vector store, and use KnowledgeSDK webhooks to notify your application when content changes so it can invalidate its cache and re-embed the updated content.

The Stale Knowledge Problem

AI chatbots get knowledge from three sources:

  1. Training data — frozen at the model's training cutoff, typically 6-18 months behind current reality
  2. Initial RAG ingestion — the documents and URLs you loaded when you built the chatbot
  3. Live retrieval — real-time lookups that happen at query time

Most chatbots rely heavily on sources 1 and 2 and rarely implement source 3. The result is a chatbot that confidently answers questions about pricing tiers that no longer exist, features that were sunset, or policies that were updated.

The fix is a freshness pipeline that keeps your RAG knowledge base synchronized with live web content.

Architecture

┌─────────────────────────────────────────────────────────┐
│                    FRESHNESS PIPELINE                    │
│                                                         │
│  1. URL Registry          2. Weekly Scrape Job          │
│  ┌─────────────┐          ┌─────────────────────┐       │
│  │ Monitored   │ ──────→  │ KnowledgeSDK Scrape │       │
│  │ URLs list   │          │ POST /v1/scrape     │       │
│  └─────────────┘          └──────────┬──────────┘       │
│                                      │                  │
│  3. Diff Detection        4. Re-embed & Store           │
│  ┌─────────────┐          ┌─────────────────────┐       │
│  │ Compare vs  │ ──────→  │ OpenAI Embeddings   │       │
│  │ baseline    │  changed │ → Vector DB upsert  │       │
│  └─────────────┘          └──────────┬──────────┘       │
│                                      │                  │
│  5. Webhook Trigger       6. Cache Invalidation         │
│  ┌─────────────┐          ┌─────────────────────┐       │
│  │ KnowledgeSDK│ ──────→  │ App notified to     │       │
│  │ webhook     │          │ clear stale cache   │       │
│  └─────────────┘          └─────────────────────┘       │
└─────────────────────────────────────────────────────────┘

Two complementary mechanisms work together:

  • Scheduled scraping catches gradual content drift (weekly re-scrape and re-embed)
  • KnowledgeSDK webhooks catch immediate significant changes (within hours)

Prerequisites

mkdir chatbot-freshness && cd chatbot-freshness
npm install @knowledgesdk/node openai pg node-cron dotenv
npm install -D typescript tsx @types/node

.env:

KNOWLEDGESDK_API_KEY=sk_ks_your_key
OPENAI_API_KEY=sk-...
DATABASE_URL=postgresql://...
APP_WEBHOOK_SECRET=your_secret
SERVER_URL=https://your-app.com

Step 1: Define Your URL Registry

// src/urlRegistry.ts

export interface MonitoredUrl {
  id: string;
  url: string;
  label: string;           // Human-readable name
  category: string;        // docs | pricing | product | blog
  refreshIntervalDays: number;
  priority: 'high' | 'medium' | 'low';
}

export const MONITORED_URLS: MonitoredUrl[] = [
  {
    id: "pricing",
    url: "https://yourapp.com/pricing",
    label: "Pricing Page",
    category: "pricing",
    refreshIntervalDays: 1,  // Daily for pricing
    priority: "high",
  },
  {
    id: "docs-getting-started",
    url: "https://docs.yourapp.com/getting-started",
    label: "Getting Started Docs",
    category: "docs",
    refreshIntervalDays: 7,
    priority: "high",
  },
  {
    id: "docs-api-reference",
    url: "https://docs.yourapp.com/api",
    label: "API Reference",
    category: "docs",
    refreshIntervalDays: 7,
    priority: "medium",
  },
  {
    id: "changelog",
    url: "https://yourapp.com/changelog",
    label: "Changelog",
    category: "product",
    refreshIntervalDays: 1,
    priority: "high",
  },
  {
    id: "tos",
    url: "https://yourapp.com/terms",
    label: "Terms of Service",
    category: "legal",
    refreshIntervalDays: 30,
    priority: "medium",
  },
];

Step 2: Database Schema

CREATE TABLE knowledge_items (
  id TEXT PRIMARY KEY,           -- matches MonitoredUrl.id
  url TEXT NOT NULL UNIQUE,
  label TEXT NOT NULL,
  category TEXT NOT NULL,
  markdown TEXT NOT NULL,
  embedding vector(1536),        -- pgvector extension
  content_hash TEXT NOT NULL,    -- SHA256 of markdown for quick change detection
  scraped_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  embedded_at TIMESTAMPTZ,
  version INT NOT NULL DEFAULT 1
);

CREATE TABLE content_history (
  id SERIAL PRIMARY KEY,
  url TEXT NOT NULL,
  old_markdown TEXT NOT NULL,
  new_markdown TEXT NOT NULL,
  changed_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- Enable vector similarity search
CREATE INDEX ON knowledge_items USING ivfflat (embedding vector_cosine_ops);

Step 3: The Scraping and Embedding Pipeline

// src/pipeline.ts
import KnowledgeSDK from "@knowledgesdk/node";
import OpenAI from "openai";
import { Pool } from "pg";
import crypto from "crypto";
import { MONITORED_URLS, MonitoredUrl } from "./urlRegistry";
import "dotenv/config";

const ks = new KnowledgeSDK({ apiKey: process.env.KNOWLEDGESDK_API_KEY! });
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY! });
const db = new Pool({ connectionString: process.env.DATABASE_URL });

function hashContent(content: string): string {
  return crypto.createHash("sha256").update(content).digest("hex");
}

async function getEmbedding(text: string): Promise<number[]> {
  // Truncate long content to stay under the embedding input limit
  // (a production pipeline should chunk by section instead of truncating)
  const MAX_CHARS = 8000;
  const truncated = text.slice(0, MAX_CHARS);

  const response = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: truncated,
  });

  return response.data[0].embedding;
}

export async function processUrl(item: MonitoredUrl): Promise<boolean> {
  console.log(`Processing: ${item.label} (${item.url})`);

  // Scrape the URL
  const scraped = await ks.scrape({ url: item.url });
  const newHash = hashContent(scraped.markdown);

  // Check if content changed
  const { rows } = await db.query(
    "SELECT content_hash, markdown, version FROM knowledge_items WHERE id = $1",
    [item.id]
  );

  const existing = rows[0];
  const isNew = !existing;
  const hasChanged = existing && existing.content_hash !== newHash;

  if (!isNew && !hasChanged) {
    console.log(`  No change detected, skipping re-embed`);
    return false;
  }

  // Store change history
  if (hasChanged) {
    await db.query(
      `INSERT INTO content_history (url, old_markdown, new_markdown)
       VALUES ($1, $2, $3)`,
      [item.url, existing.markdown, scraped.markdown]
    );
    console.log(`  Content changed, re-embedding...`);
  } else {
    console.log(`  New content, embedding for first time...`);
  }

  // Generate embedding
  const embedding = await getEmbedding(scraped.markdown);

  // Upsert into knowledge base
  await db.query(
    `INSERT INTO knowledge_items
       (id, url, label, category, markdown, embedding, content_hash, scraped_at, embedded_at, version)
     VALUES ($1, $2, $3, $4, $5, $6::vector, $7, NOW(), NOW(), 1)
     ON CONFLICT (id) DO UPDATE SET
       markdown = EXCLUDED.markdown,
       embedding = EXCLUDED.embedding,
       content_hash = EXCLUDED.content_hash,
       scraped_at = NOW(),
       embedded_at = NOW(),
       version = knowledge_items.version + 1`,
    [
      item.id,
      item.url,
      item.label,
      item.category,
      scraped.markdown,
      JSON.stringify(embedding),
      newHash,
    ]
  );

  console.log(`  Updated: v${(existing?.version ?? 0) + 1}`);
  return true; // Content was updated
}

export async function runFullRefresh(): Promise<void> {
  console.log(`Starting full knowledge base refresh (${MONITORED_URLS.length} URLs)`);
  let updated = 0;
  let unchanged = 0;
  let failed = 0;

  for (const item of MONITORED_URLS) {
    try {
      const wasUpdated = await processUrl(item);
      if (wasUpdated) updated++;
      else unchanged++;

      // Rate limit: 1 second between scrapes
      await new Promise((r) => setTimeout(r, 1000));
    } catch (error) {
      console.error(`  FAILED: ${item.label}`, error);
      failed++;
    }
  }

  console.log(`Refresh complete: ${updated} updated, ${unchanged} unchanged, ${failed} failed`);

  if (updated > 0) {
    await notifyAppOfUpdates();
  }
}

async function notifyAppOfUpdates(): Promise<void> {
  // Notify your application that the knowledge base was updated
  // so it can invalidate its query cache
  try {
    await fetch(`${process.env.SERVER_URL}/api/knowledge/invalidate-cache`, {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "x-internal-secret": process.env.APP_WEBHOOK_SECRET!,
      },
      body: JSON.stringify({ updatedAt: new Date().toISOString() }),
    });
    console.log("App notified of knowledge base update");
  } catch (e) {
    console.error("Failed to notify app:", e);
  }
}

Step 4: Scheduled Refresh

// src/scheduler.ts
import cron from "node-cron";
import { runFullRefresh, processUrl } from "./pipeline"; // processUrl must be exported from pipeline.ts
import { MONITORED_URLS } from "./urlRegistry";

// Run every Sunday at 2 AM
cron.schedule("0 2 * * 0", async () => {
  console.log("Weekly knowledge base refresh starting...");
  await runFullRefresh();
});

// High-priority items run daily at 6 AM
cron.schedule("0 6 * * *", async () => {
  const highPriorityItems = MONITORED_URLS.filter((u) => u.priority === "high");
  console.log(`Daily high-priority refresh: ${highPriorityItems.length} items`);
  for (const item of highPriorityItems) {
    await processUrl(item).catch(console.error);
    await new Promise((r) => setTimeout(r, 1000));
  }
});

console.log("Scheduler started");

Step 5: Webhook-Triggered Immediate Updates

Scheduled scraping handles gradual drift. For immediate, significant changes — a pricing update, a new product announcement — register KnowledgeSDK webhooks:

// src/webhooks.ts
import KnowledgeSDK from "@knowledgesdk/node";
import { MONITORED_URLS } from "./urlRegistry";
import "dotenv/config";

const ks = new KnowledgeSDK({ apiKey: process.env.KNOWLEDGESDK_API_KEY! });

export async function registerWebhooks(): Promise<void> {
  const highPriorityUrls = MONITORED_URLS
    .filter((u) => u.priority === "high")
    .map((u) => u.url);

  const webhook = await ks.webhooks.create({
    url: `${process.env.SERVER_URL}/webhooks/content-changed`,
    watchUrls: highPriorityUrls,
    events: ["content.changed"],
    secret: process.env.APP_WEBHOOK_SECRET,
  });

  console.log(`Webhook registered: ${webhook.id}`);
  console.log(`Monitoring ${highPriorityUrls.length} high-priority URLs for immediate changes`);
}

The webhook handler in your Express server:

// In your Express app
app.post("/webhooks/content-changed", express.raw({ type: "application/json" }), async (req, res) => {
  // Verify signature...
  res.status(200).json({ received: true });

  const { url, newContent } = req.body as any;

  // Find the monitored item for this URL
  const item = MONITORED_URLS.find((u) => u.url === url);
  if (!item) return;

  // Process immediately
  try {
    const embedding = await getEmbedding(newContent);
    const hash = hashContent(newContent);

    await db.query(
      `UPDATE knowledge_items
       SET markdown = $1, embedding = $2::vector, content_hash = $3,
           scraped_at = NOW(), embedded_at = NOW(), version = version + 1
       WHERE id = $4`,
      [newContent, JSON.stringify(embedding), hash, item.id]
    );

    await notifyAppOfUpdates();
    console.log(`Webhook update processed for: ${item.label}`);
  } catch (error) {
    console.error(`Webhook update failed for ${url}:`, error);
  }
});
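The signature check elided above might look like the following. This is a sketch that assumes KnowledgeSDK signs the raw request body with HMAC-SHA256 and sends a hex digest in a header — the header name and signing scheme are assumptions, so confirm them against the KnowledgeSDK webhook docs:

```typescript
import crypto from "node:crypto";

// Verify a webhook payload against an HMAC-SHA256 hex signature.
// timingSafeEqual avoids leaking information through comparison timing.
export function verifyWebhookSignature(
  rawBody: Buffer | string,
  signatureHex: string,
  secret: string
): boolean {
  const expected = crypto
    .createHmac("sha256", secret)
    .update(rawBody)
    .digest("hex");
  const a = Buffer.from(expected, "hex");
  const b = Buffer.from(signatureHex, "hex");
  return a.length === b.length && crypto.timingSafeEqual(a, b);
}
```

In the handler, call this with the raw Buffer from express.raw() and the signature header, and return 401 before doing any work if it fails.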

Step 6: Semantic Search for the Chatbot

Now your chatbot retrieves fresh knowledge at query time:

// src/retrieval.ts
import { Pool } from "pg";
import OpenAI from "openai";
import "dotenv/config";

const db = new Pool({ connectionString: process.env.DATABASE_URL });
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY! });

export interface RetrievalResult {
  label: string;
  url: string;
  content: string;
  similarity: number;
  scrapedAt: Date;
}

export async function retrieveRelevantContent(
  query: string,
  limit: number = 5,
  maxAgeHours: number = 168 // 1 week default
): Promise<RetrievalResult[]> {
  // Get query embedding
  const response = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: query,
  });
  const queryEmbedding = response.data[0].embedding;

  // Vector similarity search with recency filter.
  // The interval is parameterized via make_interval rather than
  // string-interpolated, which keeps the query safe from SQL injection.
  const { rows } = await db.query(
    `SELECT
       label,
       url,
       LEFT(markdown, 1500) AS content,
       1 - (embedding <=> $1::vector) AS similarity,
       scraped_at
     FROM knowledge_items
     WHERE scraped_at > NOW() - make_interval(hours => $3)
     ORDER BY embedding <=> $1::vector
     LIMIT $2`,
    [JSON.stringify(queryEmbedding), limit, maxAgeHours]
  );

  return rows.map((r) => ({
    label: r.label,
    url: r.url,
    content: r.content,
    similarity: r.similarity,
    scrapedAt: r.scraped_at,
  }));
}

export async function answerWithFreshContext(question: string): Promise<string> {
  const results = await retrieveRelevantContent(question);

  const context = results
    .map((r) => `Source: ${r.label} (${r.url})\nLast updated: ${r.scrapedAt.toISOString()}\n\n${r.content}`)
    .join("\n\n---\n\n");

  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
      {
        role: "system",
        content: `You are a helpful assistant. Use the provided context to answer questions accurately.
Always cite which source you're drawing from. If the context doesn't contain the answer, say so clearly.
Context was scraped from live web pages — treat it as current information.`,
      },
      {
        role: "user",
        content: `Context:\n\n${context}\n\nQuestion: ${question}`,
      },
    ],
  });

  return response.choices[0].message.content!;
}

Step 7: Cache Invalidation Handler

When content updates, your chatbot's response cache should be cleared to prevent serving stale cached responses:

// In your Next.js API route or Express server
app.post("/api/knowledge/invalidate-cache", async (req, res) => {
  const secret = req.headers["x-internal-secret"];
  if (secret !== process.env.APP_WEBHOOK_SECRET) {
    return res.status(401).json({ error: "Unauthorized" });
  }

  // Clear your application's response cache.
  // Pick the branch that matches your caching layer:

  // For Redis (flushDb clears the whole database; to clear only chatbot keys,
  // iterate with SCAN — redis.del does not accept wildcard patterns):
  // for await (const key of redis.scanIterator({ MATCH: "chatbot:cache:*" })) {
  //   await redis.del(key);
  // }

  // For an in-memory cache:
  // responseCache.clear();

  // For Next.js ISR revalidation (pages router):
  // await res.revalidate("/");

  console.log(`Cache invalidated at ${req.body.updatedAt}`);
  return res.json({ cleared: true });
});

Putting It All Together

// src/index.ts
import { runFullRefresh } from "./pipeline";
import { registerWebhooks } from "./webhooks";
import "./scheduler";

async function main() {
  // Initial setup
  await runFullRefresh();
  await registerWebhooks();
  console.log("Freshness pipeline running");
}

main().catch(console.error);

Measuring Freshness

Add a simple freshness dashboard to your admin panel:

// src/freshness.ts
import { Pool } from "pg";
import "dotenv/config";

const db = new Pool({ connectionString: process.env.DATABASE_URL });

export async function getFreshnessReport() {
  const { rows } = await db.query(`
    SELECT
      label,
      url,
      scraped_at,
      version,
      NOW() - scraped_at AS age,
      CASE
        WHEN NOW() - scraped_at < INTERVAL '1 day' THEN 'fresh'
        WHEN NOW() - scraped_at < INTERVAL '7 days' THEN 'aging'
        ELSE 'stale'
      END AS freshness_status
    FROM knowledge_items
    ORDER BY scraped_at ASC
  `);

  return rows;
}

This shows you at a glance which items are fresh, aging, or stale — and ensures your freshness pipeline is working.

Comparison: Static RAG vs. Fresh RAG

| Dimension        | Static RAG           | Fresh RAG with KnowledgeSDK     |
|------------------|----------------------|---------------------------------|
| Knowledge cutoff | Ingestion date       | Rolling (24h-7d)                |
| Pricing accuracy | Drifts over time     | Always current                  |
| Feature accuracy | Stale after releases | Updated on changelog change     |
| Infrastructure   | Vector DB only       | Vector DB + scraping pipeline   |
| Maintenance      | None                 | Minimal (monitor the scheduler) |
| Cost             | One-time ingestion   | Small ongoing scraping cost     |
| User trust       | Erodes over time     | Maintained                      |

FAQ

How do I handle documents that shouldn't be re-scraped (PDFs, internal wikis)? Segment your URL registry by type. Only URLs pointing to public web pages need KnowledgeSDK scraping. For PDFs and internal wikis, use a separate ingestion pipeline and update them manually or on a different trigger.

What's a good chunk size for embedding long documents? For text-embedding-3-small, the limit is 8191 tokens (roughly 6,000 words). For longer pages, chunk by section heading rather than by arbitrary character count. KnowledgeSDK's markdown output uses headings consistently, making it easy to split on \n## or \n###.
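
A heading-based splitter can be sketched as follows — a minimal version where the regex and the hard-split fallback for oversized sections are illustrative choices, not the only way to do it:

```typescript
// Split a markdown document into chunks at ## / ### headings so each
// chunk stays under the embedding model's input limit. Headings stay
// attached to the section they introduce.
export function chunkByHeading(markdown: string, maxChars = 8000): string[] {
  // Split before every line starting with ## or ### (lookahead keeps the heading)
  const sections = markdown.split(/\n(?=#{2,3} )/);
  const chunks: string[] = [];
  for (const section of sections) {
    if (section.length <= maxChars) {
      chunks.push(section);
    } else {
      // Fall back to hard character splits for a single oversized section
      for (let i = 0; i < section.length; i += maxChars) {
        chunks.push(section.slice(i, i + maxChars));
      }
    }
  }
  return chunks.filter((c) => c.trim().length > 0);
}
```

Each chunk can then be embedded and upserted individually, keyed by `${item.id}#${chunkIndex}`.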

Can I run this without pgvector? Yes. Use KnowledgeSDK's own search endpoint (POST /v1/search) as your vector search layer instead of maintaining your own pgvector. This removes the embedding step entirely — KnowledgeSDK stores and searches the content for you.

How do I handle content that's only partially updated? KnowledgeSDK's webhook diff shows exactly which lines changed. If only one section of a large page changed, you can re-embed only that section rather than the entire document — though for simplicity, re-embedding the whole page is usually fast enough.
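The section-level re-embed described above can be sketched by hashing each ## section and comparing hashes across versions. Function names here are illustrative:

```typescript
import crypto from "node:crypto";

// Hash each ## section of a markdown page so only changed sections
// need re-embedding.
function sectionHashes(markdown: string): Map<string, string> {
  const hashes = new Map<string, string>();
  for (const section of markdown.split(/\n(?=## )/)) {
    const heading = section.split("\n")[0]; // first line identifies the section
    hashes.set(heading, crypto.createHash("sha256").update(section).digest("hex"));
  }
  return hashes;
}

// Return the headings whose content differs between two versions
// (new or changed sections; deleted sections need a separate pass)
export function changedSections(oldMd: string, newMd: string): string[] {
  const oldHashes = sectionHashes(oldMd);
  return [...sectionHashes(newMd)]
    .filter(([heading, hash]) => oldHashes.get(heading) !== hash)
    .map(([heading]) => heading);
}
```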

What if my chatbot is serving many users and can't afford the latency of retrieval? Use a two-tier approach: a fast in-memory cache of the most recent retrieval results per query pattern, and the full vector search for cache misses. Invalidate the cache when the knowledge base updates.
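The first tier of that approach can be sketched with a minimal in-memory cache — the class name, TTL, and key normalization are illustrative, and a multi-instance deployment would likely use Redis so all instances share one cache:

```typescript
interface CacheEntry {
  answer: string;
  cachedAt: number;
}

// In-memory response cache with a TTL, cleared whenever the knowledge
// base updates (e.g. from the /api/knowledge/invalidate-cache handler).
export class ResponseCache {
  private entries = new Map<string, CacheEntry>();
  constructor(private ttlMs: number = 5 * 60 * 1000) {}

  // Normalize the question so trivially different phrasings share an entry
  private key(question: string): string {
    return question.trim().toLowerCase().replace(/\s+/g, " ");
  }

  get(question: string): string | undefined {
    const k = this.key(question);
    const entry = this.entries.get(k);
    if (!entry) return undefined;
    if (Date.now() - entry.cachedAt > this.ttlMs) {
      this.entries.delete(k); // expired
      return undefined;
    }
    return entry.answer;
  }

  set(question: string, answer: string): void {
    this.entries.set(this.key(question), { answer, cachedAt: Date.now() });
  }

  clear(): void {
    this.entries.clear();
  }
}
```

On a cache miss, fall through to retrieveRelevantContent and store the generated answer.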


Stop letting your chatbot give outdated answers. Build a live freshness pipeline today at knowledgesdk.com/setup.
