Retrieval-Augmented Generation promised to solve the knowledge cutoff problem. Instead of relying on a model's training data, you retrieve relevant documents at query time and include them in the prompt. The model answers from current information, not stale training data.
But here is the problem most teams discover six months into production: their RAG pipeline is also frozen.
The documents they ingested when they launched are still the documents they're retrieving today. The pricing page they scraped in October now reflects old tiers. The documentation they indexed in Q1 doesn't have the new API endpoints. The competitor analysis articles are six months behind reality.
RAG solved the training cutoff problem by shifting it earlier in the pipeline. Most teams just moved the staleness problem from the model to the vector database.
This guide explains the staleness problem clearly, proposes a two-layer architecture that actually solves it, and shows production-ready code for hybrid retrieval using KnowledgeSDK as the live web layer.
The Staleness Taxonomy
Not all knowledge goes stale at the same rate. Understanding the taxonomy helps you decide which data needs live retrieval and which is fine in a vector database.
Evergreen content — rarely changes, and staleness has low business impact. Mathematical explanations, historical facts, foundational concepts. Suitable for training data and static vector DBs.
Slow-drift content — changes occasionally but predictably. API documentation, product feature descriptions, company positioning. Should be re-indexed regularly (weekly or monthly).
Fast-drift content — changes frequently, and staleness has high business impact. Pricing, availability, news, market data, job postings. Needs live retrieval or very frequent re-indexing (daily or near-real-time).
Volatile content — changes constantly. Stock prices, live scores, real-time availability. Should never be cached — always retrieved live.
Most RAG systems treat everything as evergreen. That's the mistake.
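The taxonomy maps naturally onto a re-indexing policy. A minimal sketch, assuming the class names above; the TTL thresholds are illustrative, not prescriptive:

```typescript
// Map each staleness class to a re-index TTL in seconds.
// `null` means "index once, never expire"; 0 means "never cache".
type StalenessClass = "evergreen" | "slow-drift" | "fast-drift" | "volatile";

const REINDEX_TTL_SECONDS: Record<StalenessClass, number | null> = {
  evergreen: null,          // index once, never expire
  "slow-drift": 7 * 86400,  // re-index weekly
  "fast-drift": 86400,      // re-index daily
  volatile: 0,              // never cache: always retrieve live
};

// Decide whether an indexed document is due for a refresh.
function isStale(cls: StalenessClass, ageSeconds: number): boolean {
  const ttl = REINDEX_TTL_SECONDS[cls];
  if (ttl === null) return false; // evergreen never goes stale
  return ageSeconds >= ttl;
}
```

Running this check at query time (or in a nightly job) is what turns the taxonomy from a mental model into an enforceable policy.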
The Two-Layer Architecture
The solution is a pipeline with two complementary retrieval layers:
USER QUERY
│
▼
┌─────────────────┐
│ Query Router │
│ (classify query)│
└────────┬────────┘
│
┌───────────┴───────────┐
▼ ▼
┌────────────────┐ ┌──────────────────────┐
│ LAYER 1 │ │ LAYER 2 │
│ Vector DB │ │ KnowledgeSDK │
│ (Long-term │ │ (Live web layer) │
│ memory) │ │ │
│ │ │ POST /v1/scrape │
│ Stable docs │ │ POST /v1/search │
│ Ingested once │ │ POST /v1/extract │
└───────┬────────┘ └────────┬─────────────┘
│ │
└───────────┬───────────┘
▼
┌────────────────────┐
│ Context Merger │
│ + Deduplication │
└──────────┬─────────┘
▼
┌────────────────────┐
│ LLM Synthesis │
│ with citations │
└──────────┬─────────┘
▼
ANSWER
Layer 1 — The Vector Database (long-term memory)
Your existing vector database stores documents that have been carefully cleaned, chunked, and embedded. These are your stable, high-quality sources: internal documentation, processed reports, curated articles. Retrieval is fast (sub-10ms) and predictable.
Layer 2 — KnowledgeSDK (live web layer)
For queries that touch fast-drift or volatile content, KnowledgeSDK retrieves the current version of relevant web pages. Results are returned in clean markdown, ready for embedding or direct inclusion in the prompt. KnowledgeSDK's own search index handles sub-100ms semantic search across previously scraped content.
Implementing the Two-Layer Pipeline
// src/hybridRag.ts
import KnowledgeSDK from "@knowledgesdk/node";
import { OpenAI } from "openai";
import { Pool } from "pg";
import "dotenv/config";
const ks = new KnowledgeSDK({ apiKey: process.env.KNOWLEDGESDK_API_KEY! });
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY! });
const db = new Pool({ connectionString: process.env.DATABASE_URL });
// Query classification: does this need live web data?
async function classifyQuery(query: string): Promise<{
needsLiveData: boolean;
queryType: "stable" | "slow-drift" | "fast-drift";
reason: string;
}> {
const response = await openai.chat.completions.create({
model: "gpt-4o-mini",
response_format: { type: "json_object" },
messages: [
{
role: "system",
content: `Classify whether a query needs live web data or can be answered from a static knowledge base.
Return JSON: {
"needsLiveData": boolean,
"queryType": "stable" | "slow-drift" | "fast-drift",
"reason": "brief explanation"
}
fast-drift queries (needsLiveData: true): pricing, availability, current news, recent announcements, "latest" anything
slow-drift queries (needsLiveData: false, usually): API docs, feature descriptions, company info
stable queries (needsLiveData: false): concepts, history, how-to explanations`,
},
{ role: "user", content: query },
],
});
return JSON.parse(response.choices[0].message.content!);
}
// Layer 1: Vector DB retrieval (your existing RAG)
async function retrieveFromVectorDb(
query: string,
limit: number = 5
): Promise<Array<{ content: string; source: string; score: number }>> {
// Get query embedding
const embeddingResponse = await openai.embeddings.create({
model: "text-embedding-3-small",
input: query,
});
const queryEmbedding = embeddingResponse.data[0].embedding;
// Vector similarity search (using pgvector)
const { rows } = await db.query(
    `SELECT content, source_url AS source,
            1 - (embedding <=> $1::vector) AS score
FROM documents
ORDER BY embedding <=> $1::vector
LIMIT $2`,
[JSON.stringify(queryEmbedding), limit]
);
return rows;
}
// Layer 2: KnowledgeSDK live retrieval
async function retrieveFromLiveWeb(
query: string,
targetUrls?: string[]
): Promise<Array<{ content: string; source: string; score: number }>> {
// If we have specific URLs to check, scrape them directly
if (targetUrls && targetUrls.length > 0) {
const scrapeResults = await Promise.allSettled(
targetUrls.map((url) => ks.scrape({ url }))
);
    // flatMap before filtering keeps index i aligned with targetUrls;
    // filtering first would shift indices and mislabel sources.
    return scrapeResults.flatMap((r, i) =>
      r.status === "fulfilled"
        ? [
            {
              content: r.value.markdown.slice(0, 2000),
              source: targetUrls[i],
              score: 0.9, // Direct scrape is highly relevant
            },
          ]
        : []
    );
}
// Otherwise, search across previously indexed content
const searchResult = await ks.search({
query,
limit: 5,
hybrid: true,
});
return searchResult.hits.map((hit) => ({
content: hit.content,
source: hit.url,
score: hit.score,
}));
}
// Context merger: combine and deduplicate results from both layers
function mergeResults(
vectorResults: Array<{ content: string; source: string; score: number }>,
liveResults: Array<{ content: string; source: string; score: number }>,
maxTokens: number = 6000
): Array<{ content: string; source: string; score: number; layer: string }> {
// Tag results by layer
const tagged = [
...vectorResults.map((r) => ({ ...r, layer: "vector-db" })),
...liveResults.map((r) => ({ ...r, layer: "live-web" })),
];
// Deduplicate by source URL (prefer live web for same URL)
const deduped = new Map<
string,
{ content: string; source: string; score: number; layer: string }
>();
for (const result of tagged) {
const existing = deduped.get(result.source);
if (!existing || result.layer === "live-web") {
deduped.set(result.source, result);
}
}
// Sort by score and fit within token budget
let totalChars = 0;
const MAX_CHARS = maxTokens * 4; // Rough chars-to-tokens ratio
  const results: Array<{ content: string; source: string; score: number; layer: string }> = [];
for (const result of [...deduped.values()].sort((a, b) => b.score - a.score)) {
if (totalChars + result.content.length > MAX_CHARS) break;
results.push(result);
totalChars += result.content.length;
}
return results;
}
// Main hybrid retrieval function
export async function hybridRetrieve(
query: string,
options: {
targetUrls?: string[];
skipClassification?: boolean;
forceLayer?: "vector" | "live" | "both";
} = {}
): Promise<{
results: Array<{ content: string; source: string; score: number; layer: string }>;
queryClassification: { needsLiveData: boolean; queryType: string };
}> {
let classification = { needsLiveData: false, queryType: "stable", reason: "" };
if (!options.skipClassification && !options.forceLayer) {
classification = await classifyQuery(query);
}
  // The vector DB is always consulted unless the caller forces live-only.
  const useVectorDb = options.forceLayer !== "live";
  const useLiveWeb =
    options.forceLayer === "live" ||
    options.forceLayer === "both" ||
    (!options.forceLayer &&
      (classification.needsLiveData || (options.targetUrls?.length ?? 0) > 0));
const [vectorResults, liveResults] = await Promise.all([
useVectorDb ? retrieveFromVectorDb(query) : Promise.resolve([]),
useLiveWeb ? retrieveFromLiveWeb(query, options.targetUrls) : Promise.resolve([]),
]);
const merged = mergeResults(vectorResults, liveResults);
return {
results: merged,
queryClassification: classification,
};
}
// Full RAG pipeline
export async function ragAnswer(
query: string,
options: { targetUrls?: string[] } = {}
): Promise<{
answer: string;
sources: Array<{ url: string; layer: string }>;
usedLiveData: boolean;
}> {
const { results, queryClassification } = await hybridRetrieve(query, options);
if (results.length === 0) {
return {
answer: "I don't have enough information to answer this question.",
sources: [],
usedLiveData: false,
};
}
const context = results
.map(
(r, i) =>
`[${i + 1}] (${r.layer === "live-web" ? "Live Web" : "Knowledge Base"}) Source: ${r.source}\n\n${r.content}`
)
.join("\n\n---\n\n");
const usedLiveData = results.some((r) => r.layer === "live-web");
const response = await openai.chat.completions.create({
model: "gpt-4o",
messages: [
{
role: "system",
content: `You are a helpful assistant. Answer questions using the provided context.
Use inline citations [1], [2], etc. to reference sources.
${usedLiveData ? "Some sources are from live web scraping and reflect current information." : ""}
If you cannot answer from the provided context, say so clearly.`,
},
{
role: "user",
content: `Context:\n\n${context}\n\nQuestion: ${query}`,
},
],
});
return {
answer: response.choices[0].message.content!,
sources: results.map((r) => ({ url: r.source, layer: r.layer })),
usedLiveData,
};
}
Python Implementation
# hybrid_rag.py
import os
import asyncio
from openai import AsyncOpenAI
from knowledgesdk import AsyncKnowledgeSDK
import asyncpg
import json
ks = AsyncKnowledgeSDK(api_key=os.environ["KNOWLEDGESDK_API_KEY"])
openai = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])
async def classify_query(query: str) -> dict:
response = await openai.chat.completions.create(
model="gpt-4o-mini",
response_format={"type": "json_object"},
messages=[
{
"role": "system",
"content": """Classify whether a query needs live web data.
Return JSON: {"needsLiveData": bool, "queryType": "stable|slow-drift|fast-drift"}
fast-drift: pricing, news, availability, "latest" anything → needsLiveData: true"""
},
{"role": "user", "content": query}
]
)
return json.loads(response.choices[0].message.content)
async def retrieve_from_vector_db(
query: str,
conn: asyncpg.Connection,
limit: int = 5
) -> list[dict]:
# Get embedding
resp = await openai.embeddings.create(
model="text-embedding-3-small",
input=query
)
embedding = resp.data[0].embedding
rows = await conn.fetch(
"""SELECT content, source_url AS source,
1 - (embedding <=> $1::vector) AS score
FROM documents
ORDER BY embedding <=> $1::vector
LIMIT $2""",
json.dumps(embedding), limit
)
return [{"content": r["content"], "source": r["source"],
"score": float(r["score"]), "layer": "vector-db"} for r in rows]
async def retrieve_from_live_web(
query: str, target_urls: list[str] | None = None
) -> list[dict]:
if target_urls:
        tasks = [ks.scrape(url=url) for url in target_urls]
results = await asyncio.gather(*tasks, return_exceptions=True)
return [
{"content": r.markdown[:2000], "source": target_urls[i],
"score": 0.9, "layer": "live-web"}
for i, r in enumerate(results)
if not isinstance(r, Exception)
]
search_result = await ks.search(query=query, limit=5, hybrid=True)
return [
{"content": hit.content, "source": hit.url,
"score": hit.score, "layer": "live-web"}
for hit in search_result.hits
]
async def hybrid_rag(
query: str,
db_conn: asyncpg.Connection,
target_urls: list[str] | None = None
) -> dict:
classification = await classify_query(query)
needs_live = classification.get("needsLiveData", False) or bool(target_urls)
vector_task = retrieve_from_vector_db(query, db_conn)
    live_task = (retrieve_from_live_web(query, target_urls)
                 if needs_live
                 else asyncio.sleep(0, result=[]))  # no-op awaitable that yields []
vector_results, live_results = await asyncio.gather(vector_task, live_task)
# Merge: deduplicate, prefer live-web for same URL
seen = {}
for r in vector_results + live_results:
if r["source"] not in seen or r["layer"] == "live-web":
seen[r["source"]] = r
merged = sorted(seen.values(), key=lambda x: x["score"], reverse=True)[:8]
# Build context
context = "\n\n---\n\n".join(
f"[{i+1}] ({r['layer']}) {r['source']}\n\n{r['content']}"
for i, r in enumerate(merged)
)
used_live = any(r["layer"] == "live-web" for r in merged)
response = await openai.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": "Answer using provided context with citations [1], [2], etc."},
{"role": "user", "content": f"Context:\n\n{context}\n\nQuestion: {query}"}
]
)
return {
"answer": response.choices[0].message.content,
"sources": [{"url": r["source"], "layer": r["layer"]} for r in merged],
"used_live_data": used_live,
"query_type": classification.get("queryType")
}
When to Trigger Live Retrieval
The query classifier handles automatic routing, but you can also build explicit rules:
// Rule-based routing (faster than LLM classification for known patterns)
function shouldUseLiveData(query: string): boolean {
const liveDataKeywords = [
"current", "latest", "now", "today", "price", "pricing",
"cost", "how much", "available", "status", "version",
"recent", "new", "updated", "2025", "2026",
];
const queryLower = query.toLowerCase();
return liveDataKeywords.some((kw) => queryLower.includes(kw));
}
This rule-based check is cheaper than an LLM call for high-traffic applications. Use it as a fast path before the LLM classifier.
Keeping the Live Layer Warm
Cold scrapes add 2-5 seconds of latency. You can warm the live layer by pre-scraping high-traffic URLs during off-peak hours:
// Pre-scrape the URLs your users ask about most often
import cron from "node-cron";
const HIGH_TRAFFIC_URLS = [
"https://yourapp.com/pricing",
"https://yourapp.com/changelog",
"https://docs.yourapp.com/api",
];
async function warmLiveLayer(): Promise<void> {
for (const url of HIGH_TRAFFIC_URLS) {
await ks.scrape({ url }); // Results are cached by KnowledgeSDK
await new Promise((r) => setTimeout(r, 500));
}
console.log("Live layer warmed");
}
// Run daily at 5 AM before peak traffic
cron.schedule("0 5 * * *", warmLiveLayer);
KnowledgeSDK caches scrape results. Pre-warming ensures the cache is hot during peak hours, reducing latency to near-zero for frequently accessed URLs.
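If your high-traffic list grows beyond a handful of URLs, the sequential loop with a fixed 500ms sleep becomes the bottleneck. A sketch of a bounded-concurrency warmer; scrapeOne is a stand-in for a call like ks.scrape, and the default concurrency of 3 is illustrative:

```typescript
// Warm many URLs with a fixed number of concurrent workers instead of
// scraping strictly one at a time.
async function warmWithConcurrency(
  urls: string[],
  scrapeOne: (url: string) => Promise<void>,
  concurrency = 3
): Promise<void> {
  const queue = [...urls];
  // Each worker pulls from the shared queue until it is empty.
  // Array.shift() is synchronous, so no URL is scraped twice.
  const workers = Array.from({ length: concurrency }, async () => {
    while (queue.length > 0) {
      const url = queue.shift();
      if (url) await scrapeOne(url);
    }
  });
  await Promise.all(workers);
}
```

Keep the concurrency modest: the goal is a warm cache before peak hours, not load on the target sites.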
Measuring the Impact
Track a freshness score across your retrieved results:
function calculateFreshnessScore(
results: Array<{ layer: string; source: string }>
): {
score: number;
breakdown: { vectorDb: number; liveWeb: number };
} {
const total = results.length;
const liveCount = results.filter((r) => r.layer === "live-web").length;
const vectorCount = total - liveCount;
return {
score: total > 0 ? liveCount / total : 0,
breakdown: { vectorDb: vectorCount, liveWeb: liveCount },
};
}
Log this per query. If you see a drop in the live web fraction for fast-drift queries, your URL corpus is missing coverage — add more URLs to monitor.
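To act on that signal, aggregate the logged scores by query type. An illustrative aggregation, assuming each log record carries the classifier's query type alongside the freshness score computed above:

```typescript
// Assumed log record shape: one entry per answered query.
type FreshnessLog = { queryType: string; freshnessScore: number };

// Average live-web fraction per query type. A low average for
// "fast-drift" signals missing URL coverage.
function averageFreshnessByType(logs: FreshnessLog[]): Record<string, number> {
  const sums: Record<string, { total: number; count: number }> = {};
  for (const log of logs) {
    const entry = (sums[log.queryType] ??= { total: 0, count: 0 });
    entry.total += log.freshnessScore;
    entry.count += 1;
  }
  const averages: Record<string, number> = {};
  for (const [type, { total, count }] of Object.entries(sums)) {
    averages[type] = total / count;
  }
  return averages;
}
```

Run this over a rolling window (say, the last 7 days) and alert when the fast-drift average dips below a threshold you choose.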
Comparison: Static RAG vs. Two-Layer Hybrid RAG
| Dimension | Static RAG Only | Two-Layer Hybrid |
|---|---|---|
| Long-term memory | Excellent | Excellent |
| Pricing accuracy | Drifts | Always current |
| News and announcements | Stale | Live |
| Latency (stable queries) | Fast (~50ms) | Fast (~50ms, hits vector DB) |
| Latency (fast-drift queries) | Fast but wrong | Slightly slower but correct |
| Infrastructure complexity | Medium | Medium + KnowledgeSDK |
| Knowledge cutoff problem | Shifted to ingestion time | Solved for live queries |
| Hallucination on current events | High risk | Low risk |
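The latency gap on fast-drift queries can be bounded rather than simply accepted. A sketch of a deadline guard, with illustrative names: race the live retrieval against a timer, and fall back to vector-only results when the scrape runs long.

```typescript
// Resolve with `fallback` if `work` does not settle within `ms` milliseconds.
async function withDeadline<T>(
  work: Promise<T>,
  ms: number,
  fallback: T
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<T>((resolve) => {
    timer = setTimeout(() => resolve(fallback), ms);
  });
  try {
    return await Promise.race([work, timeout]);
  } finally {
    if (timer) clearTimeout(timer); // avoid a dangling timer on the fast path
  }
}
```

Wrapping only the live layer, for example withDeadline(liveRetrievalPromise, 3000, []), keeps worst-case latency at roughly vector-DB speed plus the deadline, at the cost of occasionally answering from the static layer alone.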
FAQ
Does the two-layer architecture increase latency significantly? For stable queries that only hit the vector database, there is zero added latency. For fast-drift queries that also trigger KnowledgeSDK, expect 100-500ms of additional latency for cached content and 2-5s for cold scrapes. This is usually acceptable given that users are asking about current pricing, availability, or news — where accuracy matters more than a 2-second wait.
How do I decide which documents go in the vector DB vs. which are retrieved live? Use the staleness taxonomy: stable content (concepts, historical information, thoroughly reviewed documentation) goes in the vector DB. Fast-drift content (pricing, changelogs, current news) should be retrieved live or from a frequently refreshed KnowledgeSDK index.
Can I skip the vector DB entirely and just use KnowledgeSDK? Yes. If all your knowledge comes from web URLs and you're comfortable with the latency of live retrieval, you can use only KnowledgeSDK's search endpoint. The two-layer architecture is for teams that have a mix of internal documents (better suited for a private vector DB) and web content.
How does KnowledgeSDK handle duplicate content from the same site? KnowledgeSDK deduplicates scraped content at the URL level. If you scrape the same URL twice within the cache window, the second call returns cached results. Different pages on the same domain are stored and searched independently.
What's the best embedding model to use alongside KnowledgeSDK? For the vector DB layer, text-embedding-3-small is fast and cost-effective. For maximum recall on technical queries, text-embedding-3-large improves results by about 5-10%. KnowledgeSDK's own search uses hybrid retrieval internally — you don't control the embedding model it uses, but it's tuned for web content.
Is there a way to preview what the two-layer retrieval would return before building the full pipeline? Yes. Use KnowledgeSDK's search endpoint directly from the command line to test queries against your indexed content before integrating it into your application.
Stop serving stale answers. Add a live web layer to your RAG pipeline today at knowledgesdk.com/setup.