# Temporal RAG: Building Systems That Know When Knowledge Goes Stale
RAG pipelines are built to retrieve relevant information. Most of them are not built to track how old that information is.
This gap causes real problems. An agent recommends a product that was discontinued six months ago. It quotes pricing that changed last quarter. It cites documentation from a library version that's two major releases behind. From the user's perspective, the AI confidently said something wrong — and that's worse than admitting uncertainty.
The solution is temporal RAG: retrieval systems that understand time, track when content was last seen, and actively manage knowledge freshness.
## The Freshness Problem in Practice
When a RAG pipeline is first set up, it indexes content and the world is in sync. Then time passes.
Here's what stale knowledge looks like in production:
- A competitive intelligence agent quotes a competitor's pricing from 8 months ago, before they changed plans
- A developer assistant recommends a deprecated API endpoint that was removed in a library update
- A support bot references a help article that was updated after a product change, giving users incorrect instructions
- A market research agent cites funding data for a startup that has since been acquired
In each case, the content was correct when indexed. The index just wasn't updated when reality changed. The RAG system had no way to know that its knowledge had drifted from truth.
## The Three Temporal Dimensions

Not all content ages the same way. To build a temporally aware system, you need to track three distinct dimensions:
### 1. Document creation date (`published_at`)
When was this content originally created? A pricing page published today is inherently more trustworthy than one created three years ago and never updated. News articles from 2022 about a company's strategy may be completely outdated. Published dates give you a baseline for how fresh the original content was.
### 2. Last crawled date (`crawled_at`)
When did your system last retrieve this content from the source? This is the date your index actually reflects. Even if a page was published years ago, if you crawled it yesterday, you have a recent snapshot. This is the most operationally important timestamp.
### 3. Content change frequency

How often does this type of content change? A pricing page for an active SaaS product changes frequently — new plans get added, prices adjust, features shift. A company "about" page might not change for years. Understanding change frequency per content type lets you prioritize re-crawl efforts.
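To make the third dimension concrete, here's a minimal staleness check that combines crawl age with a per-type change-frequency class. The `ChangeFrequency` type and the threshold values are illustrative assumptions, not part of any SDK:

```typescript
type ChangeFrequency = "high" | "medium" | "low";

// Illustrative maximum acceptable crawl age (in days) per change-frequency class
const MAX_CRAWL_AGE_DAYS: Record<ChangeFrequency, number> = {
  high: 1, // pricing pages, changelogs, news feeds
  medium: 7, // feature pages, docs
  low: 30, // about pages, evergreen content
};

// A chunk is stale when its last crawl is older than its class allows
function isStale(
  crawledAt: string,
  changeFrequency: ChangeFrequency,
  now: number = Date.now()
): boolean {
  const ageDays = (now - new Date(crawledAt).getTime()) / (1000 * 60 * 60 * 24);
  return ageDays > MAX_CRAWL_AGE_DAYS[changeFrequency];
}
```

With this in place, the same crawl timestamp can be fresh for an "about" page and stale for a pricing page — which is exactly the behavior the three dimensions are meant to capture.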
## Strategies for Temporal Awareness

### Store timestamps on every indexed chunk
Every piece of content in your index should have temporal metadata attached:
```typescript
interface KnowledgeChunk {
  id: string;
  content: string;
  url: string;
  title: string;
  publishedAt?: string; // when the page was originally published
  crawledAt: string; // when your system last fetched this
  domain: string;
  topic: string;
}
```
When KnowledgeSDK indexes a URL via `POST /v1/extract`, it stores `crawled_at` automatically. You can query for recency at search time.
### Time-decay scoring
Give fresher content a scoring advantage. A result that scores 0.82 on relevance but was crawled 6 months ago should rank below a result that scores 0.78 but was crawled yesterday, for most use cases.
A simple time-decay function:
```typescript
function applyTimeDecay(
  score: number,
  crawledAt: string,
  halfLifeDays: number = 30
): number {
  const ageMs = Date.now() - new Date(crawledAt).getTime();
  const ageDays = ageMs / (1000 * 60 * 60 * 24);
  // Exponential decay: the effective score halves every halfLifeDays
  const decayFactor = Math.pow(0.5, ageDays / halfLifeDays);
  return score * decayFactor;
}
```
A 30-day half-life means content crawled 30 days ago has half the effective score of equivalent freshly crawled content. Tune `halfLifeDays` per content type — news might use 3 days, documentation might use 90.
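Per-type tuning can be sketched as a topic lookup feeding the decay function. The topics, half-life values, and result shape below are illustrative assumptions (the decay function is repeated so the sketch is self-contained):

```typescript
// Same exponential decay as above, repeated for self-containment
function applyTimeDecay(
  score: number,
  crawledAt: string,
  halfLifeDays: number = 30
): number {
  const ageDays = (Date.now() - new Date(crawledAt).getTime()) / (1000 * 60 * 60 * 24);
  return score * Math.pow(0.5, ageDays / halfLifeDays);
}

// Illustrative per-topic half-lives, in days
const HALF_LIFE_BY_TOPIC: Record<string, number> = {
  news: 3,
  pricing: 14,
  documentation: 90,
};

interface ScoredResult {
  title: string;
  topic: string;
  score: number;
  crawledAt: string;
}

// Apply topic-specific decay, then re-sort by the decayed score
function rerankWithDecay(results: ScoredResult[]): ScoredResult[] {
  return results
    .map((r) => ({
      ...r,
      score: applyTimeDecay(r.score, r.crawledAt, HALF_LIFE_BY_TOPIC[r.topic] ?? 30),
    }))
    .sort((a, b) => b.score - a.score);
}
```

This reproduces the trade-off described above: a 0.82-relevance result crawled months ago ends up ranked below a 0.78 result crawled today.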
### Filter by recency at search time
For use cases where freshness is critical, filter results to only return chunks crawled within a time window:
```typescript
import KnowledgeSDK from "@knowledgesdk/node";

const ks = new KnowledgeSDK({ apiKey: process.env.KNOWLEDGESDK_API_KEY! });

async function searchFreshKnowledge(query: string, maxAgeDays: number = 7) {
  const results = await ks.search({ query, limit: 10 });

  const cutoff = new Date();
  cutoff.setDate(cutoff.getDate() - maxAgeDays);

  const freshResults = results.results.filter((r) => {
    if (!r.crawledAt) return true; // include if no timestamp
    return new Date(r.crawledAt) >= cutoff;
  });

  if (freshResults.length === 0) {
    // Trigger re-extraction and return a freshness warning
    return {
      results: [],
      warning: `No results crawled within the last ${maxAgeDays} days. Consider re-indexing.`,
    };
  }

  return { results: freshResults };
}
```
This gives users (and your agent) a clear signal when the knowledge base needs refreshing.
### TTL-based scheduled re-extraction
The simplest freshness strategy: re-crawl everything on a schedule. The schedule should be calibrated to content change frequency:
```typescript
import cron from "node-cron";
import KnowledgeSDK from "@knowledgesdk/node";

const ks = new KnowledgeSDK({ apiKey: process.env.KNOWLEDGESDK_API_KEY! });

// High-change content: re-crawl daily
const highChangeUrls = [
  "https://competitor.com/pricing",
  "https://competitor.com/changelog",
  "https://news-source.com/category/ai",
];

// Stable content: re-crawl weekly
const stableUrls = [
  "https://competitor.com/about",
  "https://competitor.com/team",
  "https://docs.library.com/api-reference",
];

// Daily at 06:00
cron.schedule("0 6 * * *", async () => {
  console.log("Re-extracting high-change URLs...");
  for (const url of highChangeUrls) {
    try {
      await ks.extract({ url });
      console.log(`Re-indexed: ${url}`);
    } catch (err) {
      console.error(`Failed to re-index ${url}:`, err);
    }
  }
});

// Weekly on Mondays at 06:00
cron.schedule("0 6 * * 1", async () => {
  console.log("Re-extracting stable URLs...");
  for (const url of stableUrls) {
    try {
      await ks.extract({ url });
      console.log(`Re-indexed: ${url}`);
    } catch (err) {
      console.error(`Failed to re-index ${url}:`, err);
    }
  }
});
```
When `POST /v1/extract` is called for a URL that's already indexed, it overwrites the previous version. Search will return the updated content immediately after re-extraction.
### Change-detection webhooks
Scheduled re-crawling is simple but inefficient — you re-extract content that hasn't changed, and you might miss changes that happen between scheduled runs.
A more precise approach: monitor pages for changes using a webhook or change-detection service. When a change is detected, trigger re-extraction immediately.
```typescript
import express from "express";
import KnowledgeSDK from "@knowledgesdk/node";

const ks = new KnowledgeSDK({ apiKey: process.env.KNOWLEDGESDK_API_KEY! });
const app = express();
app.use(express.json());

// Webhook handler — called when a monitored page changes
app.post("/webhooks/page-changed", async (req, res) => {
  const { url, changeType, detectedAt } = req.body;
  console.log(`Change detected on ${url} at ${detectedAt}: ${changeType}`);

  // Re-extract the changed URL
  try {
    const result = await ks.extract({ url });
    console.log(`Re-indexed after change: ${url} → ${result.title}`);
    res.status(200).json({ success: true });
  } catch (err) {
    console.error(`Re-extraction failed for ${url}:`, err);
    res.status(500).json({ error: "Re-extraction failed" });
  }
});
```
This approach gives you near-real-time freshness without the overhead of re-crawling unchanged content.
## Production Pattern: Freshness Tiers
In practice, a tiered freshness strategy works well:
| Tier | Content Type | Re-crawl Frequency | Example Pages |
|---|---|---|---|
| Critical | Pricing, availability, current events | Every 6-24 hours | /pricing, /status, news feeds |
| Standard | Feature pages, docs, blog posts | Every 3-7 days | /features, /docs/*, /blog/* |
| Archive | About, team, evergreen content | Monthly | /about, /team, old blog posts |
Assign each URL to a tier when you first index it, and run separate cron schedules per tier. This balances freshness against API call volume.
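A tier registry for this can be sketched as follows — the URLs, tier assignments, and cron expressions are placeholders, not a prescribed layout:

```typescript
type Tier = "critical" | "standard" | "archive";

// Placeholder cron expressions per tier (node-cron syntax)
const CRON_BY_TIER: Record<Tier, string> = {
  critical: "0 */6 * * *", // every 6 hours
  standard: "0 6 * * 1,4", // Mondays and Thursdays at 06:00
  archive: "0 6 1 * *", // first day of each month at 06:00
};

// Tier assigned to each URL when it was first indexed
const urlTiers: Record<string, Tier> = {
  "https://competitor.com/pricing": "critical",
  "https://competitor.com/features": "standard",
  "https://competitor.com/about": "archive",
};

// Collect the URLs belonging to one tier for its cron job
function urlsForTier(tier: Tier): string[] {
  return Object.entries(urlTiers)
    .filter(([, t]) => t === tier)
    .map(([url]) => url);
}
```

Each tier then gets one `cron.schedule(CRON_BY_TIER[tier], ...)` job that loops over `urlsForTier(tier)` and calls `ks.extract`, mirroring the scheduled re-extraction pattern shown earlier.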
## Communicating Freshness to the LLM
Beyond retrieval logic, tell the LLM when content was last crawled so it can reason about staleness:
```typescript
const webContext = results.results
  .map((r) => {
    // Age of this chunk in whole days since it was last crawled
    const age = Math.round(
      (Date.now() - new Date(r.crawledAt).getTime()) / (1000 * 60 * 60 * 24)
    );
    return `## ${r.title}\nSource: ${r.url} (last crawled ${age} days ago)\n\n${r.content}`;
  })
  .join("\n\n---\n\n");
```
Now when the model sees a chunk that was crawled 45 days ago, it can appropriately hedge: "As of 45 days ago, the pricing was..." rather than presenting potentially outdated information as current fact.
## The Metadata Strategy

Structure your indexed content with rich temporal metadata from the start:

- `crawled_at` — timestamp of last extraction (required)
- `published_at` — when the page was originally published, if extractable
- `source_domain` — the root domain, for domain-level freshness rules
- `topic` — content category, for topic-specific decay rates
- `change_frequency` — your classification of how often this type of content changes
This metadata becomes the foundation for all your temporal retrieval logic. Add it at indexing time; retrofitting it later is painful.
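One possible shape for that metadata, sketched in TypeScript (the field names follow the list above; the `sourceDomain` helper is an assumption about how you might derive the domain at indexing time):

```typescript
type ChangeFrequency = "high" | "medium" | "low";

// Temporal metadata attached to every indexed chunk
interface TemporalMetadata {
  crawled_at: string; // ISO timestamp of last extraction (required)
  published_at?: string; // original publish date, if extractable
  source_domain: string; // root domain, for domain-level freshness rules
  topic: string; // content category, for topic-specific decay rates
  change_frequency: ChangeFrequency; // how often this content type changes
}

// Derive source_domain from the page URL at indexing time
function sourceDomain(url: string): string {
  return new URL(url).hostname;
}
```

Populating these fields in the extraction pipeline keeps every downstream feature — decay scoring, recency filters, tiered re-crawls — working off the same record.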
## Freshness Is a Product Feature
Users of AI agents implicitly trust that the information they receive is current. When it isn't, they lose trust — not just in that one wrong answer, but in the system overall.
Temporal RAG reframes freshness from an afterthought to a first-class system property. By tracking when knowledge was last seen, applying time-aware scoring, and maintaining active re-extraction pipelines, you build an agent that doesn't just retrieve relevant information — it retrieves information you can trust to be accurate right now.
The web changes constantly. Your knowledge base should too.