Retrieval-Augmented Generation promised to solve the knowledge cutoff problem. Instead of relying on a model's training data, you retrieve relevant documents at query time and include them in the prompt. The model answers from current information, not stale training data.
But here is the problem most teams discover six months into production: their RAG pipeline is also frozen.
The documents they ingested when they launched are still the documents they're retrieving today. The pricing page they scraped in October now reflects old tiers. The documentation they indexed in Q1 doesn't have the new API endpoints. The competitor analysis articles are six months behind reality.
RAG solved the training cutoff problem by shifting it earlier in the pipeline. Most teams just moved the staleness problem from the model to the vector database.
This guide explains the staleness problem clearly, proposes a two-layer architecture that actually solves it, and shows production-ready code for hybrid retrieval using KnowledgeSDK as the live web layer.
The Staleness Taxonomy
Not all knowledge goes stale at the same rate. Understanding the taxonomy helps you decide which data needs live retrieval and which is fine in a vector database.
Evergreen content — rarely changes, and staleness has low business impact. Mathematical explanations, historical facts, foundational concepts. Suitable for training data and static vector DBs.
Slow-drift content — changes occasionally but predictably. API documentation, product feature descriptions, company positioning. Should be re-indexed regularly (weekly or monthly).
Fast-drift content — changes frequently, and staleness has high business impact. Pricing, availability, news, market data, job postings. Needs live retrieval or very frequent re-indexing (daily or near-real-time).
Volatile content — changes constantly. Stock prices, live scores, real-time availability. Should never be cached — always retrieved live.
Most RAG systems treat everything as evergreen. That's the mistake.
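The taxonomy maps naturally onto a re-indexing policy. A minimal sketch, assuming the class names above; the TTL thresholds are illustrative, not prescriptive:

```typescript
// Map each staleness class to a re-index TTL in seconds.
// `null` means "index once, never expire"; 0 means "never cache".
type StalenessClass = "evergreen" | "slow-drift" | "fast-drift" | "volatile";

const REINDEX_TTL_SECONDS: Record<StalenessClass, number | null> = {
  evergreen: null,          // index once, never expire
  "slow-drift": 7 * 86400,  // re-index weekly
  "fast-drift": 86400,      // re-index daily
  volatile: 0,              // never cache: always retrieve live
};

// Decide whether an indexed document is due for a refresh.
function isStale(cls: StalenessClass, ageSeconds: number): boolean {
  const ttl = REINDEX_TTL_SECONDS[cls];
  if (ttl === null) return false; // evergreen never goes stale
  return ageSeconds >= ttl;
}
```

Running this check at query time (or in a nightly job) is what turns the taxonomy from a mental model into an enforceable policy.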
The Two-Layer Architecture
The solution is a pipeline with two complementary retrieval layers:
USER QUERY
│
▼
┌─────────────────┐
│ Query Router │
│ (classify query)│
└────────┬────────┘
│
┌───────────┴───────────┐
▼ ▼
┌────────────────┐ ┌──────────────────────┐
│ LAYER 1 │ │ LAYER 2 │
│ Vector DB │ │ KnowledgeSDK │
│ (Long-term │ │ (Live web layer) │
│ memory) │ │ │
│ │ │ POST /v1/scrape │
│ Stable docs │ │ POST /v1/search │
│ Ingested once │ │ POST /v1/extract │
└───────┬────────┘ └────────┬─────────────┘
│ │
└───────────┬───────────┘
▼
┌────────────────────┐
│ Context Merger │
│ + Deduplication │
└──────────┬─────────┘
▼
┌────────────────────┐
│ LLM Synthesis │
│ with citations │
└──────────┬─────────┘
▼
ANSWER
Layer 1 — The Vector Database (long-term memory)
Your existing vector database stores documents that have been carefully cleaned, chunked, and embedded. These are your stable, high-quality sources: internal documentation, processed reports, curated articles. Retrieval is fast (sub-10ms) and predictable.
Layer 2 — KnowledgeSDK (live web layer)
For queries that touch fast-drift or volatile content, KnowledgeSDK retrieves the current version of relevant web pages. Results are returned in clean markdown, ready for embedding or direct inclusion in the prompt. KnowledgeSDK's own search index handles sub-100ms semantic search across previously scraped content.
Implementing the Two-Layer Pipeline
// src/hybridRag.ts
import KnowledgeSDK from "@knowledgesdk/node";
import { OpenAI } from "openai";
import { Pool } from "pg";
import "dotenv/config";
const ks = new KnowledgeSDK({ apiKey: process.env.KNOWLEDGESDK_API_KEY! });
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY! });
const db = new Pool({ connectionString: process.env.DATABASE_URL });
// Query classification: does this need live web data?
async function classifyQuery(query: string): Promise<{
needsLiveData: boolean;
queryType: "stable" | "slow-drift" | "fast-drift";
reason: string;
}> {
const response = await openai.chat.completions.create({
model: "gpt-4o-mini",
response_format: { type: "json_object" },
messages: [
{
role: "system",
content: `Classify whether a query needs live web data or can be answered from a static knowledge base.
Return JSON: {
"needsLiveData": boolean,
"queryType": "stable" | "slow-drift" | "fast-drift",
"reason": "brief explanation"
}
fast-drift queries (needsLiveData: true): pricing, availability, current news, recent announcements, "latest" anything
slow-drift queries (needsLiveData: false, usually): API docs, feature descriptions, company info
stable queries (needsLiveData: false): concepts, history, how-to explanations`,
},
{ role: "user", content: query },
],
});
return JSON.parse(response.choices[0].message.content!);
}
// Layer 1: Vector DB retrieval (your existing RAG)
async function retrieveFromVectorDb(
query: string,
limit: number = 5
): Promise<Array<{ content: string; source: string; score: number }>> {
// Get query embedding
const embeddingResponse = await openai.embeddings.create({
model: "text-embedding-3-small",
input: query,
});
const queryEmbedding = embeddingResponse.data[0].embedding;
// Vector similarity search (using pgvector)
const { rows } = await db.query(
    `SELECT content, source_url AS source,
            1 - (embedding <=> $1::vector) AS score
FROM documents
ORDER BY embedding <=> $1::vector
LIMIT $2`,
[JSON.stringify(queryEmbedding), limit]
);
return rows;
}
// Layer 2: KnowledgeSDK live retrieval
async function retrieveFromLiveWeb(
query: string,
targetUrls?: string[]
): Promise<Array<{ content: string; source: string; score: number }>> {
// If we have specific URLs to check, scrape them directly
if (targetUrls && targetUrls.length > 0) {
const scrapeResults = await Promise.allSettled(
targetUrls.map((url) => ks.scrape({ url }))
);
    // flatMap before filtering keeps index i aligned with targetUrls;
    // filtering first would shift indices and mislabel sources.
    return scrapeResults.flatMap((r, i) =>
      r.status === "fulfilled"
        ? [
            {
              content: r.value.markdown.slice(0, 2000),
              source: targetUrls[i],
              score: 0.9, // Direct scrape is highly relevant
            },
          ]
        : []
    );
}
// Otherwise, search across previously indexed content
const searchResult = await ks.search({
query,
limit: 5,
hybrid: true,
});
return searchResult.hits.map((hit) => ({
content: hit.content,
source: hit.url,
score: hit.score,
}));
}
// Context merger: combine and deduplicate results from both layers
function mergeResults(
vectorResults: Array<{ content: string; source: string; score: number }>,
liveResults: Array<{ content: string; source: string; score: number }>,
maxTokens: number = 6000
): Array<{ content: string; source: string; score: number; layer: string }> {
// Tag results by layer
const tagged = [
...vectorResults.map((r) => ({ ...r, layer: "vector-db" })),
...liveResults.map((r) => ({ ...r, layer: "live-web" })),
];
// Deduplicate by source URL (prefer live web for same URL)
const deduped = new Map<
string,
{ content: string; source: string; score: number; layer: string }
>();
for (const result of tagged) {
const existing = deduped.get(result.source);
if (!existing || result.layer === "live-web") {
deduped.set(result.source, result);
}
}
// Sort by score and fit within token budget
let totalChars = 0;
const MAX_CHARS = maxTokens * 4; // Rough chars-to-tokens ratio
  const results: Array<{ content: string; source: string; score: number; layer: string }> = [];
for (const result of [...deduped.values()].sort((a, b) => b.score - a.score)) {
if (totalChars + result.content.length > MAX_CHARS) break;
results.push(result);
totalChars += result.content.length;
}
return results;
}
// Main hybrid retrieval function
export async function hybridRetrieve(
query: string,
options: {
targetUrls?: string[];
skipClassification?: boolean;
forceLayer?: "vector" | "live" | "both";
} = {}
): Promise<{
results: Array<{ content: string; source: string; score: number; layer: string }>;
queryClassification: { needsLiveData: boolean; queryType: string };
}> {
let classification = { needsLiveData: false, queryType: "stable", reason: "" };
if (!options.skipClassification && !options.forceLayer) {
classification = await classifyQuery(query);
}
  // The vector DB is always consulted unless the caller forces live-only.
  const useVectorDb = options.forceLayer !== "live";
  const useLiveWeb =
    options.forceLayer === "live" ||
    options.forceLayer === "both" ||
    (!options.forceLayer &&
      (classification.needsLiveData || (options.targetUrls?.length ?? 0) > 0));
const [vectorResults, liveResults] = await Promise.all([
useVectorDb ? retrieveFromVectorDb(query) : Promise.resolve([]),
useLiveWeb ? retrieveFromLiveWeb(query, options.targetUrls) : Promise.resolve([]),
]);
const merged = mergeResults(vectorResults, liveResults);
return {
results: merged,
queryClassification: classification,
};
}
// Full RAG pipeline
export async function ragAnswer(
query: string,
options: { targetUrls?: string[] } = {}
): Promise<{
answer: string;
sources: Array<{ url: string; layer: string }>;
usedLiveData: boolean;
}> {
const { results, queryClassification } = await hybridRetrieve(query, options);
if (results.length === 0) {
return {
answer: "I don't have enough information to answer this question.",
sources: [],
usedLiveData: false,
};
}
const context = results
.map(
(r, i) =>
`[${i + 1}] (${r.layer === "live-web" ? "Live Web" : "Knowledge Base"}) Source: ${r.source}\n\n${r.content}`
)
.join("\n\n---\n\n");
const usedLiveData = results.some((r) => r.layer === "live-web");
const response = await openai.chat.completions.create({
model: "gpt-4o",
messages: [
{
role: "system",
content: `You are a helpful assistant. Answer questions using the provided context.
Use inline citations [1], [2], etc. to reference sources.
${usedLiveData ? "Some sources are from live web scraping and reflect current information." : ""}
If you cannot answer from the provided context, say so clearly.`,
},
{
role: "user",
content: `Context:\n\n${context}\n\nQuestion: ${query}`,
},
],
});
return {
answer: response.choices[0].message.content!,
sources: results.map((r) => ({ url: r.source, layer: r.layer })),
usedLiveData,
};
}
Python Implementation
# hybrid_rag.py
import os
import asyncio
from openai import AsyncOpenAI
from knowledgesdk import AsyncKnowledgeSDK
import asyncpg
import json
ks = AsyncKnowledgeSDK(api_key=os.environ["KNOWLEDGESDK_API_KEY"])
openai = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])
async def classify_query(query: str) -> dict:
response = await openai.chat.completions.create(
model="gpt-4o-mini",
response_format={"type": "json_object"},
messages=[
{
"role": "system",
"content": """Classify whether a query needs live web data.
Return JSON: {"needsLiveData": bool, "queryType": "stable|slow-drift|fast-drift"}
fast-drift: pricing, news, availability, "latest" anything → needsLiveData: true"""
},
{"role": "user", "content": query}
]
)
return json.loads(response.choices[0].message.content)
async def retrieve_from_vector_db(
query: str,
conn: asyncpg.Connection,
limit: int = 5
) -> list[dict]:
# Get embedding
resp = await openai.embeddings.create(
model="text-embedding-3-small",
input=query
)
embedding = resp.data[0].embedding
rows = await conn.fetch(
"""SELECT content, source_url AS source,
1 - (embedding <=> $1::vector) AS score
FROM documents
ORDER BY embedding <=> $1::vector
LIMIT $2""",
json.dumps(embedding), limit
)
return [{"content": r["content"], "source": r["source"],
"score": float(r["score"]), "layer": "vector-db"} for r in rows]
async def retrieve_from_live_web(
query: str, target_urls: list[str] | None = None
) -> list[dict]:
if target_urls:
        tasks = [ks.scrape(url=url) for url in target_urls]
results = await asyncio.gather(*tasks, return_exceptions=True)
return [
{"content": r.markdown[:2000], "source": target_urls[i],
"score": 0.9, "layer": "live-web"}
for i, r in enumerate(results)
if not isinstance(r, Exception)
]
search_result = await ks.search(query=query, limit=5, hybrid=True)
return [
{"content": hit.content, "source": hit.url,
"score": hit.score, "layer": "live-web"}
for hit in search_result.hits
]
async def hybrid_rag(
query: str,
db_conn: asyncpg.Connection,
target_urls: list[str] | None = None
) -> dict:
classification = await classify_query(query)
needs_live = classification.get("needsLiveData", False) or bool(target_urls)
vector_task = retrieve_from_vector_db(query, db_conn)
    live_task = (retrieve_from_live_web(query, target_urls)
                 if needs_live
                 else asyncio.sleep(0, result=[]))  # no-op awaitable that yields []
vector_results, live_results = await asyncio.gather(vector_task, live_task)
# Merge: deduplicate, prefer live-web for same URL
seen = {}
for r in vector_results + live_results:
if r["source"] not in seen or r["layer"] == "live-web":
seen[r["source"]] = r
merged = sorted(seen.values(), key=lambda x: x["score"], reverse=True)[:8]
# Build context
context = "\n\n---\n\n".join(
f"[{i+1}] ({r['layer']}) {r['source']}\n\n{r['content']}"
for i, r in enumerate(merged)
)
used_live = any(r["layer"] == "live-web" for r in merged)
response = await openai.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": "Answer using provided context with citations [1], [2], etc."},
{"role": "user", "content": f"Context:\n\n{context}\n\nQuestion: {query}"}
]
)
return {
"answer": response.choices[0].message.content,
"sources": [{"url": r["source"], "layer": r["layer"]} for r in merged],
"used_live_data": used_live,
"query_type": classification.get("queryType")
}
When to Trigger Live Retrieval
The query classifier handles automatic routing, but you can also build explicit rules:
// Rule-based routing (faster than LLM classification for known patterns)
function shouldUseLiveData(query: string): boolean {
const liveDataKeywords = [
"current", "latest", "now", "today", "price", "pricing",
"cost", "how much", "available", "status", "version",
"recent", "new", "updated", "2025", "2026",
];
const queryLower = query.toLowerCase();
return liveDataKeywords.some((kw) => queryLower.includes(kw));
}
This rule-based check is cheaper than an LLM call for high-traffic applications. Use it as a fast path before the LLM classifier.
Keeping the Live Layer Warm
Cold scrapes add 2-5 seconds of latency. You can warm the live layer by pre-scraping high-traffic URLs during off-peak hours:
// Pre-scrape the URLs your users ask about most often
import cron from "node-cron";
const HIGH_TRAFFIC_URLS = [
"https://yourapp.com/pricing",
"https://yourapp.com/changelog",
"https://docs.yourapp.com/api",
];
async function warmLiveLayer(): Promise<void> {
for (const url of HIGH_TRAFFIC_URLS) {
await ks.scrape({ url }); // Results are cached by KnowledgeSDK
await new Promise((r) => setTimeout(r, 500));
}
console.log("Live layer warmed");
}
// Run daily at 5 AM before peak traffic
cron.schedule("0 5 * * *", warmLiveLayer);
KnowledgeSDK caches scrape results. Pre-warming ensures the cache is hot during peak hours, reducing latency to near-zero for frequently accessed URLs.
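If your high-traffic list grows beyond a handful of URLs, the sequential loop with a fixed 500ms sleep becomes the bottleneck. A sketch of a bounded-concurrency warmer; scrapeOne is a stand-in for a call like ks.scrape, and the default concurrency of 3 is illustrative:

```typescript
// Warm many URLs with a fixed number of concurrent workers instead of
// scraping strictly one at a time.
async function warmWithConcurrency(
  urls: string[],
  scrapeOne: (url: string) => Promise<void>,
  concurrency = 3
): Promise<void> {
  const queue = [...urls];
  // Each worker pulls from the shared queue until it is empty.
  // Array.shift() is synchronous, so no URL is scraped twice.
  const workers = Array.from({ length: concurrency }, async () => {
    while (queue.length > 0) {
      const url = queue.shift();
      if (url) await scrapeOne(url);
    }
  });
  await Promise.all(workers);
}
```

Keep the concurrency modest: the goal is a warm cache before peak hours, not load on the target sites.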
Measuring the Impact
Track a freshness score across your retrieved results:
function calculateFreshnessScore(
results: Array<{ layer: string; source: string }>
): {
score: number;
breakdown: { vectorDb: number; liveWeb: number };
} {
const total = results.length;
const liveCount = results.filter((r) => r.layer === "live-web").length;
const vectorCount = total - liveCount;
return {
score: total > 0 ? liveCount / total : 0,
breakdown: { vectorDb: vectorCount, liveWeb: liveCount },
};
}
Log this per query. If you see a drop in the live web fraction for fast-drift queries, your URL corpus is missing coverage — add more URLs to monitor.
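To act on that signal, aggregate the logged scores by query type. An illustrative aggregation, assuming each log record carries the classifier's query type alongside the freshness score computed above:

```typescript
// Assumed log record shape: one entry per answered query.
type FreshnessLog = { queryType: string; freshnessScore: number };

// Average live-web fraction per query type. A low average for
// "fast-drift" signals missing URL coverage.
function averageFreshnessByType(logs: FreshnessLog[]): Record<string, number> {
  const sums: Record<string, { total: number; count: number }> = {};
  for (const log of logs) {
    const entry = (sums[log.queryType] ??= { total: 0, count: 0 });
    entry.total += log.freshnessScore;
    entry.count += 1;
  }
  const averages: Record<string, number> = {};
  for (const [type, { total, count }] of Object.entries(sums)) {
    averages[type] = total / count;
  }
  return averages;
}
```

Run this over a rolling window (say, the last 7 days) and alert when the fast-drift average dips below a threshold you choose.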
Comparison: Static RAG vs. Two-Layer Hybrid RAG
| Dimension | Static RAG Only | Two-Layer Hybrid |
|---|---|---|
| Long-term memory | Excellent | Excellent |
| Pricing accuracy | Drifts | Always current |
| News and announcements | Stale | Live |
| Latency (stable queries) | Fast (~50ms) | Fast (~50ms, hits vector DB) |
| Latency (fast-drift queries) | Fast but wrong | Slightly slower but correct |
| Infrastructure complexity | Medium | Medium + KnowledgeSDK |
| Knowledge cutoff problem | Shifted to ingestion time | Solved for live queries |
| Hallucination on current events | High risk | Low risk |
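The latency gap on fast-drift queries can be bounded rather than simply accepted. A sketch of a deadline guard, with illustrative names: race the live retrieval against a timer, and fall back to vector-only results when the scrape runs long.

```typescript
// Resolve with `fallback` if `work` does not settle within `ms` milliseconds.
async function withDeadline<T>(
  work: Promise<T>,
  ms: number,
  fallback: T
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<T>((resolve) => {
    timer = setTimeout(() => resolve(fallback), ms);
  });
  try {
    return await Promise.race([work, timeout]);
  } finally {
    if (timer) clearTimeout(timer); // avoid a dangling timer on the fast path
  }
}
```

Wrapping only the live layer, for example withDeadline(liveRetrievalPromise, 3000, []), keeps worst-case latency at roughly vector-DB speed plus the deadline, at the cost of occasionally answering from the static layer alone.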
FAQ
Does the two-layer architecture increase latency significantly? For stable queries that only hit the vector database, there is zero added latency. For fast-drift queries that also trigger KnowledgeSDK, expect 100-500ms of additional latency for cached content and 2-5s for cold scrapes. This is usually acceptable given that users are asking about current pricing, availability, or news — where accuracy matters more than a 2-second wait.
How do I decide which documents go in the vector DB vs. which are retrieved live? Use the staleness taxonomy: stable content (concepts, historical information, thoroughly reviewed documentation) goes in the vector DB. Fast-drift content (pricing, changelogs, current news) should be retrieved live or from a frequently refreshed KnowledgeSDK index.
Can I skip the vector DB entirely and just use KnowledgeSDK? Yes. If all your knowledge comes from web URLs and you're comfortable with the latency of live retrieval, you can use only KnowledgeSDK's search endpoint. The two-layer architecture is for teams that have a mix of internal documents (better suited for a private vector DB) and web content.
How does KnowledgeSDK handle duplicate content from the same site? KnowledgeSDK deduplicates scraped content at the URL level. If you scrape the same URL twice within the cache window, the second call returns cached results. Different pages on the same domain are stored and searched independently.
What's the best embedding model to use alongside KnowledgeSDK? For the vector DB layer, text-embedding-3-small is fast and cost-effective. For maximum recall on technical queries, text-embedding-3-large improves results by about 5-10%. KnowledgeSDK's own search uses hybrid retrieval internally — you don't control the embedding model it uses, but it's tuned for web content.
Is there a way to preview what the two-layer retrieval would return before building the full pipeline? Yes. Use KnowledgeSDK's search endpoint directly from the command line to test queries against your indexed content before integrating it into your application.
Stop serving stale answers. Add a live web layer to your RAG pipeline today at knowledgesdk.com/setup.