technical · March 20, 2026 · 8 min read

Keeping Your RAG Knowledge Base Fresh: Automated Re-indexing Strategies

Stale RAG is worse than no RAG — it confidently returns outdated answers. Here are five strategies to keep your knowledge base current automatically.


Here is a failure mode that does not announce itself. You build a RAG agent, index your competitor's pricing page in January, and ship it. In March, they restructure their pricing — new tiers, new limits, new numbers. Your agent keeps quoting January prices, with the full confidence of a grounded retrieval system. A user acts on that information. They are wrong. They blame your product.

Stale RAG is worse than no RAG because it answers confidently. This post covers five strategies for keeping your knowledge base current automatically, with implementation patterns for each.

The Staleness Problem in Practice

When you index a page, you are capturing a snapshot of that content at a point in time. Unlike a live web request, that snapshot does not update itself. If you never re-index, your knowledge base ages while the web keeps changing.

The problem compounds because retrieval gives answers a false credibility. A user who asks "what does Competitor X charge for their Pro plan?" and gets a confident, grounded answer assumes that answer is current. The retrieval system has no way to signal "by the way, I last checked this in January."

The solution is not to rebuild from scratch periodically — it is to implement intelligent re-indexing that keeps each piece of knowledge fresh relative to how quickly that type of content changes.

Strategy 1: Time-Based Re-indexing

The simplest approach: crawl every URL in your corpus every X days, regardless of whether anything changed.

How it works: Maintain a list of source URLs and a last_crawled_at timestamp for each. A cron job runs daily and re-extracts any URL where last_crawled_at is older than your threshold.

Pros: Dead simple to implement. No external dependencies. Every URL stays fresh within your defined window.

Cons: Wasteful. If 90% of your pages have not changed, you are spending extraction budget re-indexing unchanged content. At scale, this adds up.

Best for: Small corpora (under 1,000 URLs), or when you cannot distinguish which content changes frequently.
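The selection step of that cron job is a small pure function. A minimal sketch, assuming a hypothetical record shape with url and lastCrawledAt fields (adapt the names to your schema):

```typescript
// Hypothetical row shape for the tracking table described above.
interface IndexedUrl {
  url: string;
  lastCrawledAt: Date;
}

// Pure selection step of the daily cron: return every URL whose last
// crawl is older than the freshness threshold.
function selectStale(
  records: IndexedUrl[],
  maxAgeDays: number,
  now: Date = new Date(),
): string[] {
  const cutoff = now.getTime() - maxAgeDays * 24 * 60 * 60 * 1000;
  return records
    .filter((r) => r.lastCrawledAt.getTime() < cutoff)
    .map((r) => r.url);
}
```

The cron job would then pass each returned URL to the extract endpoint and stamp last_crawled_at on success.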

Strategy 2: Change-Detection Webhooks

Instead of polling on a schedule, receive a notification when content actually changes.

How it works: Some web monitoring services (Visualping, Distill, or custom implementations using headless browsers) can watch a URL and fire a webhook when the rendered content changes. You register your callback URL, and they call you when something changes.

Pros: Maximum efficiency — you only re-index when there is actually something to re-index. Near-real-time freshness for high-priority pages.

Cons: Requires external monitoring infrastructure. Most web monitoring services are not free at scale. Does not work for pages behind authentication.

Implementation with KnowledgeSDK:

// Your webhook handler (Express/Hapi endpoint)
app.post('/webhooks/content-changed', async (req, res) => {
  const { url } = req.body;

  // Re-extract the changed page — replaces existing chunks for this URL
  await fetch('https://api.knowledgesdk.com/v1/extract', {
    method: 'POST',
    headers: {
      'x-api-key': process.env.KNOWLEDGE_API_KEY,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({ url }),
  });

  res.json({ ok: true });
});

KnowledgeSDK automatically replaces existing indexed content for the same URL, so re-extraction does not create duplicates — it updates in place.

Strategy 3: TTL Per Document

Assign different freshness windows to different types of content, then run separate cron schedules for each tier.

How it works: When you index a URL, tag it with a freshness_tier metadata field. Your re-indexing jobs check this tier and apply the appropriate TTL.

Pros: Smart resource allocation. You spend extraction budget where it matters most. Pricing pages get daily refreshes; about pages get monthly ones.

Cons: Requires upfront classification of your URL corpus. Misclassified URLs get the wrong freshness treatment.

Content freshness tiers:

Tier         | Content Types                                                    | Re-index Frequency
High-churn   | Pricing pages, news articles, job boards, stock data, changelogs | Daily
Medium-churn | Product docs, feature pages, blog posts, API references          | Weekly
Low-churn    | About pages, terms of service, legal, mission statements         | Monthly

Implementation:

import KnowledgeSDK from '@knowledgesdk/node';

const ks = new KnowledgeSDK({ apiKey: process.env.KNOWLEDGE_API_KEY });

const FRESHNESS_TIERS = {
  high: { maxAgeMs: 24 * 60 * 60 * 1000 },    // 1 day
  medium: { maxAgeMs: 7 * 24 * 60 * 60 * 1000 }, // 7 days
  low: { maxAgeMs: 30 * 24 * 60 * 60 * 1000 },   // 30 days
};

async function reindexStaleUrls(tier: 'high' | 'medium' | 'low') {
  const { maxAgeMs } = FRESHNESS_TIERS[tier];
  const cutoff = new Date(Date.now() - maxAgeMs);

  // Fetch URLs in this tier that haven't been crawled since cutoff
  const staleUrls = await db('indexed_urls')
    .where({ freshness_tier: tier })
    .where('last_crawled_at', '<', cutoff)
    .select('url');

  for (const { url } of staleUrls) {
    await ks.extract({ url });
    await db('indexed_urls')
      .where({ url })
      .update({ last_crawled_at: new Date() });
  }
}

// Run from separate cron jobs:
// Daily:   reindexStaleUrls('high')
// Weekly:  reindexStaleUrls('medium')
// Monthly: reindexStaleUrls('low')

Strategy 4: User-Triggered Re-indexing

When a user reports a stale answer, trigger re-indexing of the source automatically.

How it works: Add a "flag as outdated" action to your agent's UI. When triggered, extract the source URL that produced the stale answer and queue it for immediate re-indexing.

Pros: Highly targeted and efficient. Users are the best detectors of staleness: they know when an answer does not match what they are seeing on the actual website. Flags are also a free signal about which pages matter most to your users.

Cons: Reactive by nature. Users have to encounter the stale answer before you fix it. Does not work for pages users never query.

Best for: Complementing other strategies, not replacing them. Treat user-triggered re-indexing as a fast lane for high-priority corrections.
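The fast lane can be as small as an in-memory queue with per-URL deduplication, so repeated flags on the same answer trigger only one re-extraction. A minimal sketch (the extract callback is an assumption standing in for your KnowledgeSDK client's extract call):

```typescript
// Fast lane for user-flagged URLs: dedupe concurrent flags per URL so a
// burst of "flag as outdated" clicks causes a single re-extraction.
class ReindexQueue {
  private pending = new Set<string>();

  constructor(private extract: (url: string) => Promise<void>) {}

  // Returns true if this call triggered a re-extraction, false if one
  // for the same URL was already in flight.
  async flagOutdated(sourceUrl: string): Promise<boolean> {
    if (this.pending.has(sourceUrl)) return false;
    this.pending.add(sourceUrl);
    try {
      await this.extract(sourceUrl); // replaces existing chunks for this URL
    } finally {
      this.pending.delete(sourceUrl);
    }
    return true;
  }
}
```

In your UI's "flag as outdated" route handler, call flagOutdated with the source URL that produced the stale answer and return immediately; the extraction runs in the background.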

Strategy 5: Sitemap Diff

Compare the sitemap from today against the sitemap from yesterday. Re-index new URLs and remove deleted ones.

How it works: Call POST /v1/sitemap to retrieve all URLs from a domain each day. Diff today's list against yesterday's. New URLs get extracted immediately; removed URLs get purged from your index.

Pros: Excellent for large content sites (news, documentation, e-commerce) where new pages appear constantly and old pages get removed. No wasted re-indexing of unchanged content.

Cons: Only catches page additions and removals, not updates to existing pages. Best used alongside TTL-based re-indexing.

Implementation:

async function syncSitemap(domain: string) {
  // Fetch the current sitemap via the API; `ks` below is the client from Strategy 3
  const response = await fetch('https://api.knowledgesdk.com/v1/sitemap', {
    method: 'POST',
    headers: { 'x-api-key': process.env.KNOWLEDGE_API_KEY, 'Content-Type': 'application/json' },
    body: JSON.stringify({ url: domain }),
  });
  const { urls: currentUrls } = await response.json();

  const previousUrls = await db('indexed_urls')
    .where({ domain })
    .pluck('url');

  const previousSet = new Set(previousUrls);
  const currentSet = new Set(currentUrls);

  // Extract new URLs
  const newUrls = currentUrls.filter(url => !previousSet.has(url));
  for (const url of newUrls) {
    await ks.extract({ url });
  }

  // Remove deleted URLs from tracking
  const removedUrls = previousUrls.filter(url => !currentSet.has(url));
  await db('indexed_urls').whereIn('url', removedUrls).delete();
}

The Deduplication Problem

One concern that comes up immediately: if you re-extract a URL, do you end up with two copies of the same content in your index?

With KnowledgeSDK, no. Re-extracting a URL automatically replaces the existing indexed chunks for that URL. The deduplication is handled at the extraction layer — you do not need to manually purge and re-insert. This matters because building deduplication logic yourself is non-trivial when content has partially changed (some chunks are the same, some are new).

Monitoring Freshness

Track last_crawled_at per URL in your database. Set up alerts for any URL that has exceeded its freshness tier TTL by more than a defined buffer. A simple daily job that queries for WHERE freshness_tier = 'high' AND last_crawled_at < NOW() - INTERVAL '2 days' gives you a freshness health dashboard.
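The same check can live in application code if you prefer alerting from a worker rather than SQL. A minimal sketch, assuming the tier TTLs from Strategy 3 and a hypothetical TrackedUrl row shape:

```typescript
// Tier TTLs matching the freshness table in Strategy 3.
const TIER_TTL_DAYS = { high: 1, medium: 7, low: 30 } as const;
type Tier = keyof typeof TIER_TTL_DAYS;

interface TrackedUrl {
  url: string;
  tier: Tier;
  lastCrawledAt: Date;
}

// URLs that have exceeded their tier TTL by more than `bufferDays`:
// these should fire an alert, not just be re-queued.
function overdueUrls(
  rows: TrackedUrl[],
  bufferDays: number,
  now: Date = new Date(),
): string[] {
  const DAY = 24 * 60 * 60 * 1000;
  return rows
    .filter(
      (r) =>
        now.getTime() - r.lastCrawledAt.getTime() >
        (TIER_TTL_DAYS[r.tier] + bufferDays) * DAY,
    )
    .map((r) => r.url);
}
```

A daily job that pages someone whenever overdueUrls returns a non-empty list is a serviceable first version of the freshness health dashboard.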

The goal is never to be surprised by stale answers. Build observability into your re-indexing pipeline from day one, and treat knowledge freshness as an operational metric alongside uptime and latency.

A knowledge base that stays fresh is an asset. One that goes stale is a liability — and the more confidently your agent answers, the more expensive that liability becomes.

