# Temporal RAG: Building Systems That Know When Knowledge Goes Stale
RAG pipelines are built to retrieve relevant information. Most of them are not built to track how old that information is.
This gap causes real problems. An agent recommends a product that was discontinued six months ago. It quotes pricing that changed last quarter. It cites documentation from a library version that's two major releases behind. From the user's perspective, the AI confidently said something wrong — and that's worse than admitting uncertainty.
The solution is temporal RAG: retrieval systems that understand time, track when content was last seen, and actively manage knowledge freshness.
## The Freshness Problem in Practice
When a RAG pipeline is first set up, it indexes content and the world is in sync. Then time passes.
Here's what stale knowledge looks like in production:
- A competitive intelligence agent quotes a competitor's pricing from 8 months ago, before they changed plans
- A developer assistant recommends a deprecated API endpoint that was removed in a library update
- A support bot references a help article that was updated after a product change, giving users incorrect instructions
- A market research agent cites funding data for a startup that has since been acquired
In each case, the content was correct when indexed. The index just wasn't updated when reality changed. The RAG system had no way to know that its knowledge had drifted from truth.
## The Three Temporal Dimensions

Not all content ages the same way. To build a temporally aware system, you need to track three distinct dimensions:
### 1. Document creation date (`published_at`)
When was this content originally created? A pricing page published today is inherently more trustworthy than one created three years ago and never updated. News articles from 2022 about a company's strategy may be completely outdated. Published dates give you a baseline for how fresh the original content was.
### 2. Last crawled date (`crawled_at`)
When did your system last retrieve this content from the source? This is the date your index actually reflects. Even if a page was published years ago, if you crawled it yesterday, you have a recent snapshot. This is the most operationally important timestamp.
### 3. Content change frequency

How often does this type of content change? A pricing page for an active SaaS product changes frequently — new plans get added, prices adjust, features shift. A company "about" page might not change for years. Understanding change frequency per content type lets you prioritize re-crawl efforts.
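To make the third dimension concrete, here's a minimal staleness check that combines crawl age with a per-type change-frequency class. The `ChangeFrequency` type and the threshold values are illustrative assumptions, not part of any SDK:

```typescript
type ChangeFrequency = "high" | "medium" | "low";

// Illustrative maximum acceptable crawl age (in days) per change-frequency class
const MAX_CRAWL_AGE_DAYS: Record<ChangeFrequency, number> = {
  high: 1, // pricing pages, changelogs, news feeds
  medium: 7, // feature pages, docs
  low: 30, // about pages, evergreen content
};

// A chunk is stale when its last crawl is older than its class allows
function isStale(
  crawledAt: string,
  changeFrequency: ChangeFrequency,
  now: number = Date.now()
): boolean {
  const ageDays = (now - new Date(crawledAt).getTime()) / (1000 * 60 * 60 * 24);
  return ageDays > MAX_CRAWL_AGE_DAYS[changeFrequency];
}
```

With this in place, the same crawl timestamp can be fresh for an "about" page and stale for a pricing page — which is exactly the behavior the three dimensions are meant to capture.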
## Strategies for Temporal Awareness

### Store timestamps on every indexed chunk
Every piece of content in your index should have temporal metadata attached:
```typescript
interface KnowledgeChunk {
  id: string;
  content: string;
  url: string;
  title: string;
  publishedAt?: string; // when the page was originally published
  crawledAt: string; // when your system last fetched this
  domain: string;
  topic: string;
}
```
When KnowledgeSDK indexes a URL via `POST /v1/extract`, it stores `crawled_at` automatically. You can query for recency at search time.
### Time-decay scoring
Give fresher content a scoring advantage. A result that scores 0.82 on relevance but was crawled 6 months ago should rank below a result that scores 0.78 but was crawled yesterday, for most use cases.
A simple time-decay function:
```typescript
function applyTimeDecay(
  score: number,
  crawledAt: string,
  halfLifeDays: number = 30
): number {
  const ageMs = Date.now() - new Date(crawledAt).getTime();
  const ageDays = ageMs / (1000 * 60 * 60 * 24);
  // Exponential decay: the effective score halves every halfLifeDays
  const decayFactor = Math.pow(0.5, ageDays / halfLifeDays);
  return score * decayFactor;
}
```
A 30-day half-life means content crawled 30 days ago has half the effective score of equivalent freshly crawled content. Tune `halfLifeDays` per content type — news might use 3 days, documentation might use 90.
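Per-type tuning can be sketched as a topic lookup feeding the decay function. The topics, half-life values, and result shape below are illustrative assumptions (the decay function is repeated so the sketch is self-contained):

```typescript
// Same exponential decay as above, repeated for self-containment
function applyTimeDecay(
  score: number,
  crawledAt: string,
  halfLifeDays: number = 30
): number {
  const ageDays = (Date.now() - new Date(crawledAt).getTime()) / (1000 * 60 * 60 * 24);
  return score * Math.pow(0.5, ageDays / halfLifeDays);
}

// Illustrative per-topic half-lives, in days
const HALF_LIFE_BY_TOPIC: Record<string, number> = {
  news: 3,
  pricing: 14,
  documentation: 90,
};

interface ScoredResult {
  title: string;
  topic: string;
  score: number;
  crawledAt: string;
}

// Apply topic-specific decay, then re-sort by the decayed score
function rerankWithDecay(results: ScoredResult[]): ScoredResult[] {
  return results
    .map((r) => ({
      ...r,
      score: applyTimeDecay(r.score, r.crawledAt, HALF_LIFE_BY_TOPIC[r.topic] ?? 30),
    }))
    .sort((a, b) => b.score - a.score);
}
```

This reproduces the trade-off described above: a 0.82-relevance result crawled months ago ends up ranked below a 0.78 result crawled today.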
### Filter by recency at search time
For use cases where freshness is critical, filter results to only return chunks crawled within a time window:
```typescript
import KnowledgeSDK from "@knowledgesdk/node";

const ks = new KnowledgeSDK({ apiKey: process.env.KNOWLEDGESDK_API_KEY! });

async function searchFreshKnowledge(query: string, maxAgeDays: number = 7) {
  const results = await ks.search({ query, limit: 10 });

  const cutoff = new Date();
  cutoff.setDate(cutoff.getDate() - maxAgeDays);

  const freshResults = results.results.filter((r) => {
    if (!r.crawledAt) return true; // include if no timestamp
    return new Date(r.crawledAt) >= cutoff;
  });

  if (freshResults.length === 0) {
    // Trigger re-extraction and return a freshness warning
    return {
      results: [],
      warning: `No results crawled within the last ${maxAgeDays} days. Consider re-indexing.`,
    };
  }

  return { results: freshResults };
}
```
This gives users (and your agent) a clear signal when the knowledge base needs refreshing.
### TTL-based scheduled re-extraction
The simplest freshness strategy: re-crawl everything on a schedule. The schedule should be calibrated to content change frequency:
```typescript
import cron from "node-cron";
import KnowledgeSDK from "@knowledgesdk/node";

const ks = new KnowledgeSDK({ apiKey: process.env.KNOWLEDGESDK_API_KEY! });

// High-change content: re-crawl daily
const highChangeUrls = [
  "https://competitor.com/pricing",
  "https://competitor.com/changelog",
  "https://news-source.com/category/ai",
];

// Stable content: re-crawl weekly
const stableUrls = [
  "https://competitor.com/about",
  "https://competitor.com/team",
  "https://docs.library.com/api-reference",
];

// Daily at 06:00
cron.schedule("0 6 * * *", async () => {
  console.log("Re-extracting high-change URLs...");
  for (const url of highChangeUrls) {
    try {
      await ks.extract({ url });
      console.log(`Re-indexed: ${url}`);
    } catch (err) {
      console.error(`Failed to re-index ${url}:`, err);
    }
  }
});

// Weekly on Mondays at 06:00
cron.schedule("0 6 * * 1", async () => {
  console.log("Re-extracting stable URLs...");
  for (const url of stableUrls) {
    try {
      await ks.extract({ url });
      console.log(`Re-indexed: ${url}`);
    } catch (err) {
      console.error(`Failed to re-index ${url}:`, err);
    }
  }
});
```
When `POST /v1/extract` is called for a URL that's already indexed, it overwrites the previous version. Search will return the updated content immediately after re-extraction.
### Change-detection webhooks
Scheduled re-crawling is simple but inefficient — you re-extract content that hasn't changed, and you might miss changes that happen between scheduled runs.
A more precise approach: monitor pages for changes using a webhook or change-detection service. When a change is detected, trigger re-extraction immediately.
```typescript
import express from "express";
import KnowledgeSDK from "@knowledgesdk/node";

const ks = new KnowledgeSDK({ apiKey: process.env.KNOWLEDGESDK_API_KEY! });
const app = express();
app.use(express.json());

// Webhook handler — called when a monitored page changes
app.post("/webhooks/page-changed", async (req, res) => {
  const { url, changeType, detectedAt } = req.body;
  console.log(`Change detected on ${url} at ${detectedAt}: ${changeType}`);

  // Re-extract the changed URL
  try {
    const result = await ks.extract({ url });
    console.log(`Re-indexed after change: ${url} → ${result.title}`);
    res.status(200).json({ success: true });
  } catch (err) {
    console.error(`Re-extraction failed for ${url}:`, err);
    res.status(500).json({ error: "Re-extraction failed" });
  }
});
```
This approach gives you near-real-time freshness without the overhead of re-crawling unchanged content.
## Production Pattern: Freshness Tiers
In practice, a tiered freshness strategy works well:
| Tier | Content Type | Re-crawl Frequency | Example Pages |
|---|---|---|---|
| Critical | Pricing, availability, current events | Every 6-24 hours | /pricing, /status, news feeds |
| Standard | Feature pages, docs, blog posts | Every 3-7 days | /features, /docs/*, /blog/* |
| Archive | About, team, evergreen content | Monthly | /about, /team, old blog posts |
Assign each URL to a tier when you first index it, and run separate cron schedules per tier. This balances freshness against API call volume.
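A tier registry for this can be sketched as follows — the URLs, tier assignments, and cron expressions are placeholders, not a prescribed layout:

```typescript
type Tier = "critical" | "standard" | "archive";

// Placeholder cron expressions per tier (node-cron syntax)
const CRON_BY_TIER: Record<Tier, string> = {
  critical: "0 */6 * * *", // every 6 hours
  standard: "0 6 * * 1,4", // Mondays and Thursdays at 06:00
  archive: "0 6 1 * *", // first day of each month at 06:00
};

// Tier assigned to each URL when it was first indexed
const urlTiers: Record<string, Tier> = {
  "https://competitor.com/pricing": "critical",
  "https://competitor.com/features": "standard",
  "https://competitor.com/about": "archive",
};

// Collect the URLs belonging to one tier for its cron job
function urlsForTier(tier: Tier): string[] {
  return Object.entries(urlTiers)
    .filter(([, t]) => t === tier)
    .map(([url]) => url);
}
```

Each tier then gets one `cron.schedule(CRON_BY_TIER[tier], ...)` job that loops over `urlsForTier(tier)` and calls `ks.extract`, mirroring the scheduled re-extraction pattern shown earlier.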
## Communicating Freshness to the LLM
Beyond retrieval logic, tell the LLM when content was last crawled so it can reason about staleness:
```typescript
const webContext = results.results
  .map((r) => {
    // Age of this chunk in whole days since it was last crawled
    const age = Math.round(
      (Date.now() - new Date(r.crawledAt).getTime()) / (1000 * 60 * 60 * 24)
    );
    return `## ${r.title}\nSource: ${r.url} (last crawled ${age} days ago)\n\n${r.content}`;
  })
  .join("\n\n---\n\n");
```
Now when the model sees a chunk that was crawled 45 days ago, it can appropriately hedge: "As of 45 days ago, the pricing was..." rather than presenting potentially outdated information as current fact.
## The Metadata Strategy

Structure your indexed content with rich temporal metadata from the start:

- `crawled_at` — timestamp of last extraction (required)
- `published_at` — when the page was originally published, if extractable
- `source_domain` — the root domain, for domain-level freshness rules
- `topic` — content category, for topic-specific decay rates
- `change_frequency` — your classification of how often this type of content changes
This metadata becomes the foundation for all your temporal retrieval logic. Add it at indexing time; retrofitting it later is painful.
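One possible shape for that metadata, sketched in TypeScript (the field names follow the list above; the `sourceDomain` helper is an assumption about how you might derive the domain at indexing time):

```typescript
type ChangeFrequency = "high" | "medium" | "low";

// Temporal metadata attached to every indexed chunk
interface TemporalMetadata {
  crawled_at: string; // ISO timestamp of last extraction (required)
  published_at?: string; // original publish date, if extractable
  source_domain: string; // root domain, for domain-level freshness rules
  topic: string; // content category, for topic-specific decay rates
  change_frequency: ChangeFrequency; // how often this content type changes
}

// Derive source_domain from the page URL at indexing time
function sourceDomain(url: string): string {
  return new URL(url).hostname;
}
```

Populating these fields in the extraction pipeline keeps every downstream feature — decay scoring, recency filters, tiered re-crawls — working off the same record.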
## Freshness Is a Product Feature
Users of AI agents implicitly trust that the information they receive is current. When it isn't, they lose trust — not just in that one wrong answer, but in the system overall.
Temporal RAG reframes freshness from an afterthought to a first-class system property. By tracking when knowledge was last seen, applying time-aware scoring, and maintaining active re-extraction pipelines, you build an agent that doesn't just retrieve relevant information — it retrieves information you can trust to be accurate right now.
The web changes constantly. Your knowledge base should too.