Web RAG Pipeline: Architecture Guide for Live Web Retrieval in 2026
Standard RAG (Retrieval-Augmented Generation) is well understood: embed your documents, store them in a vector database, retrieve relevant chunks at query time, and inject them into your LLM prompt. It works exceptionally well for stable corpora — internal documentation, product manuals, historical data.
But what happens when the information your agent needs doesn't exist in your vector database yet? What happens when you need to answer questions about a competitor's latest pricing, a breaking news story, or a documentation page that was updated yesterday?
This is where web RAG — using the live public web as your retrieval source — becomes essential.
This guide covers the full architecture of a web RAG pipeline: when to use it, how it differs from static RAG, the three functional layers, and how to implement it in Python with LangChain and TypeScript with the Vercel AI SDK.
Static RAG vs Web RAG: Choosing the Right Architecture
The fundamental difference comes down to data freshness and control.
Static RAG gives you full control over what's in your retrieval corpus. You ingest documents, process them, embed them, and store them. Retrieval is fast (sub-100ms) because it's just a vector similarity search against an index you control. The limitation: your knowledge is frozen at ingestion time.
Web RAG treats the live web as your retrieval corpus. Instead of searching a pre-built index, you search the web in real time, fetch the relevant pages, normalize them into LLM-ready format, and inject them into your prompt. Retrieval is slower (500ms–3s) but the information is always current.
| Dimension | Static RAG | Web RAG |
|---|---|---|
| Data freshness | Frozen at ingestion | Always current |
| Retrieval latency | 10–100ms | 500ms–3000ms |
| Control over sources | Full control | Limited (public web) |
| Setup complexity | High (ingestion pipeline) | Low (API call) |
| Maintenance overhead | High (re-ingestion) | Low |
| Cost per query | Low (vector search) | Medium (scraping + LLM) |
| Best for | Internal docs, stable data | News, competitors, live docs |
When Web RAG Wins
Use web RAG when:
- The information changes frequently (news, stock prices, sports scores)
- The information source is external and you don't control it (competitor sites, public documentation)
- You need to answer questions about URLs the user provides at query time
- You're building a general-purpose research agent that needs to access any public URL
- The query domain is unpredictable — you cannot know in advance what to ingest
When Static RAG Wins
Use static RAG when:
- You own the data and it changes infrequently (internal knowledge bases, product manuals)
- You need sub-100ms retrieval latency
- You want fine-grained control over what the model can and cannot access
- You're operating at scale where per-query web scraping costs are prohibitive
Many production systems combine both: a static RAG index for owned content, with a web RAG fallback for queries that exceed the index's coverage (a hybrid sketch appears at the end of this guide).
The Three Layers of a Web RAG Pipeline
Every web RAG pipeline has three functional layers, regardless of implementation:
┌─────────────────────────────────────────────────────────┐
│ USER QUERY │
└─────────────────────┬───────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ LAYER 1: SEARCH │
│ Find which URLs on the web are relevant to the query │
│ (Google, Bing, Tavily, domain-specific search) │
└─────────────────────┬───────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ LAYER 2: FETCH │
│ Retrieve the HTML content of each URL │
│ Handle JavaScript rendering, anti-bot, rate limits │
└─────────────────────┬───────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ LAYER 3: NORMALIZE │
│ Convert raw HTML to clean LLM-ready markdown │
│ Remove navigation, ads, boilerplate │
│ Chunk and embed for retrieval │
└─────────────────────┬───────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ LLM GENERATION │
│ Inject normalized content into prompt │
│ Generate grounded answer with citations │
└─────────────────────────────────────────────────────────┘
Building these three layers from scratch is substantial engineering work. You need to integrate a search API, set up a headless browser fleet for rendering, implement content extraction and cleaning, and manage the orchestration between all of them.
KnowledgeSDK collapses all three layers into a single API call via /v1/extract, which handles search, fetch, and normalize in one request. For domain-specific search over previously extracted content, /v1/search provides semantic retrieval.
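To make that concrete, here is a minimal sketch of the fetch and normalize layers collapsed into a single call, using the same client.scrape method that appears in the full implementations below (the example URL is illustrative; see the API reference for the full parameter set):
# Python — fetch + normalize in a single call (minimal sketch)
import os
from knowledgesdk import KnowledgeSDK

client = KnowledgeSDK(api_key=os.environ["KNOWLEDGESDK_API_KEY"])

# One request: the page is fetched, rendered, stripped of boilerplate,
# and returned as clean markdown ready for an LLM prompt
result = client.scrape(url="https://example.com/docs/pricing")
print(result.markdown[:500])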
Architecture Pattern 1: Just-in-Time Web RAG
The simplest web RAG pattern fetches web content at query time for every request. This is ideal for low-volume, high-freshness use cases.
User Query
│
▼
Search Web → [URL1, URL2, URL3]
│
▼
Fetch + Normalize each URL
│
▼
Rank by relevance to query
│
▼
Inject top chunks into LLM prompt
│
▼
LLM generates grounded response
Architecture Pattern 2: Crawl-Then-Search RAG
For domain-specific use cases (e.g., "answer questions using only Stripe's documentation"), you crawl and index a set of URLs first, then search that index at query time. This gives you the freshness control of web RAG with the latency of static RAG.
[Setup Phase]
Target URLs → KnowledgeSDK Extract → Indexed Knowledge
│
[Query Phase] │
User Query → /v1/search ──────────────────┘
│
▼
Relevant Chunks
│
▼
LLM Response
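Here is a minimal sketch of the setup phase, assuming the SDK exposes the /v1/extract endpoint as client.extract — the method name and parameters are assumptions, and the target URLs are illustrative; only client.search is shown verbatim in the query-phase code later in this guide:
# Python — crawl-then-search setup phase (hypothetical sketch)
# NOTE: client.extract is an assumed binding for /v1/extract; consult
# the SDK reference for the exact method name and parameters.
import os
from knowledgesdk import KnowledgeSDK

client = KnowledgeSDK(api_key=os.environ["KNOWLEDGESDK_API_KEY"])

TARGET_URLS = [
    "https://docs.stripe.com/payments",
    "https://docs.stripe.com/billing",
]

# Extract + normalize each page so it lands in the searchable index;
# the query phase then hits /v1/search (see domain_rag_query below)
for url in TARGET_URLS:
    client.extract(url=url)  # assumed method name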
Full Implementation: Python with LangChain
Here is a complete web RAG implementation using LangChain and KnowledgeSDK:
# Python — Web RAG with LangChain
import os
from typing import List
from knowledgesdk import KnowledgeSDK
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
client = KnowledgeSDK(api_key=os.environ["KNOWLEDGESDK_API_KEY"])
llm = ChatOpenAI(model="gpt-4o", temperature=0)
RAG_PROMPT = ChatPromptTemplate.from_messages([
("system", """You are a helpful research assistant. Answer the user's question
using ONLY the web content provided below. If the content doesn't contain
enough information to answer the question, say so clearly.
Always cite your sources by referencing the URL where you found information.
Web Content:
{context}
"""),
("human", "{question}"),
])
def fetch_web_context(urls: List[str]) -> str:
"""Fetch and normalize multiple URLs, return as a single context string."""
documents = []
for url in urls:
try:
result = client.scrape(url=url)
if result.markdown:
documents.append(f"Source: {url}\n\n{result.markdown[:3000]}")
except Exception as e:
print(f"Failed to fetch {url}: {e}")
return "\n\n---\n\n".join(documents)
def web_rag_query(question: str, urls: List[str]) -> str:
"""Answer a question using live web content from the provided URLs."""
print(f"Fetching {len(urls)} URLs...")
    context = fetch_web_context(urls)
if not context:
return "Could not retrieve any web content to answer your question."
chain = RAG_PROMPT | llm
response = chain.invoke({"context": context, "question": question})
return response.content
# Domain-specific RAG: search a pre-indexed knowledge base
def domain_rag_query(question: str) -> str:
"""Answer using pre-indexed knowledge (crawl-then-search pattern)."""
search_results = client.search(
query=question,
limit=5,
)
context_parts = []
for result in search_results.results:
context_parts.append(
f"Source: {result.url}\nTitle: {result.title}\n\n{result.content}"
)
context = "\n\n---\n\n".join(context_parts)
chain = RAG_PROMPT | llm
response = chain.invoke({"context": context, "question": question})
return response.content
# Example: Just-in-time web RAG
answer = web_rag_query(
question="What are the key changes in GPT-4o's latest update?",
urls=[
"https://openai.com/blog/gpt-4o",
"https://openai.com/index/hello-gpt-4o/",
],
)
print(answer)
Adding Parallel Fetch for Lower Latency
Fetching URLs sequentially adds latency proportional to the number of sources. Use asyncio to parallelize:
# Python — Parallel URL fetching with asyncio
import asyncio
from knowledgesdk import AsyncKnowledgeSDK
async_client = AsyncKnowledgeSDK(api_key=os.environ["KNOWLEDGESDK_API_KEY"])
async def fetch_url_async(url: str) -> tuple[str, str]:
"""Fetch a single URL asynchronously."""
try:
result = await async_client.scrape(url=url)
return url, result.markdown or ""
    except Exception:
        # Treat any fetch failure as empty content for this URL
        return url, ""
async def fetch_web_context_parallel(urls: List[str]) -> str:
"""Fetch multiple URLs in parallel."""
tasks = [fetch_url_async(url) for url in urls]
results = await asyncio.gather(*tasks)
documents = []
for url, content in results:
if content:
documents.append(f"Source: {url}\n\n{content[:3000]}")
return "\n\n---\n\n".join(documents)
# This reduces latency from (n * avg_fetch_time) to max(individual_fetch_times)
# e.g., 5 URLs averaging 1.5s each: ~7.5s sequential → roughly the slowest
# single fetch (~1.8s) in parallel
context = asyncio.run(fetch_web_context_parallel([
"https://example.com/page1",
"https://example.com/page2",
"https://example.com/page3",
]))
Full Implementation: TypeScript with Vercel AI SDK
// TypeScript — Web RAG with Vercel AI SDK
import { KnowledgeSDK } from "@knowledgesdk/node";
import { openai } from "@ai-sdk/openai";
import { generateText } from "ai";
const client = new KnowledgeSDK({
apiKey: process.env.KNOWLEDGESDK_API_KEY!,
});
async function fetchWebContext(urls: string[]): Promise<string> {
const results = await Promise.all(
urls.map(async (url) => {
try {
const result = await client.scrape({ url });
return `Source: ${url}\n\n${result.markdown?.slice(0, 3000) ?? ""}`;
} catch {
return "";
}
})
);
return results.filter(Boolean).join("\n\n---\n\n");
}
async function webRagQuery(question: string, urls: string[]): Promise<string> {
const context = await fetchWebContext(urls);
const { text } = await generateText({
model: openai("gpt-4o"),
system: `You are a helpful research assistant. Answer the user's question
using ONLY the web content provided below. Cite your sources by referencing
the URL where you found information.
Web Content:
${context}`,
prompt: question,
});
return text;
}
// Domain-specific RAG using the search endpoint
async function domainRagQuery(question: string): Promise<string> {
const searchResults = await client.search({
query: question,
limit: 5,
});
const context = searchResults.results
.map((r) => `Source: ${r.url}\nTitle: ${r.title}\n\n${r.content}`)
.join("\n\n---\n\n");
const { text } = await generateText({
model: openai("gpt-4o"),
system: `You are a helpful assistant. Answer using only the content below.
${context}`,
prompt: question,
});
return text;
}
// Next.js API route example
export async function POST(req: Request) {
const { question, urls } = await req.json();
// If specific URLs provided, use just-in-time web RAG
// Otherwise, search the pre-indexed knowledge base
const answer = urls?.length > 0
? await webRagQuery(question, urls)
: await domainRagQuery(question);
return Response.json({ answer });
}
Latency Benchmarks
We measured end-to-end latency for a three-source web RAG query (fetch 3 URLs, generate answer):
| Configuration | P50 Latency | P95 Latency | Notes |
|---|---|---|---|
| Sequential fetch + GPT-4o | 6.2s | 11.4s | Baseline |
| Parallel fetch + GPT-4o | 2.8s | 5.1s | 55% faster |
| Parallel fetch + GPT-4o mini | 1.9s | 3.6s | Cheaper model |
| Domain RAG (search index) + GPT-4o | 1.4s | 2.3s | Pre-indexed |
| Domain RAG (search index) + streaming | 0.8s TTFT | — | First token fast |
Key observations:
- Parallelizing URL fetches is the single highest-leverage optimization
- Pre-indexing domains and using /v1/search cuts latency by more than half versus just-in-time fetching
- Streaming responses dramatically improve perceived latency for chat interfaces (see the sketch below)
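For example, here is a minimal streaming sketch reusing RAG_PROMPT, llm, and fetch_web_context from the LangChain implementation above (LCEL chains expose .stream, which yields partial message chunks):
# Python — stream the generation step for faster perceived latency
chain = RAG_PROMPT | llm

context = fetch_web_context(["https://openai.com/blog/gpt-4o"])
for chunk in chain.stream({"context": context, "question": "Summarize the key points."}):
    # Each chunk is a partial AIMessageChunk; emit tokens as they arrive
    print(chunk.content, end="", flush=True)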
Handling Retrieval Quality
Web RAG pipelines can return irrelevant content if the source URLs are poorly chosen. A few strategies to improve retrieval quality:
Relevance filtering: Score each fetched document against the query before injecting into the prompt. Discard documents below a relevance threshold.
Chunk-level retrieval: Instead of injecting full page content, chunk each page into 500-token segments and retrieve only the most relevant chunks. This reduces prompt length and improves signal-to-noise.
Source ranking: Weight results from authoritative sources higher. A Wikipedia article about a topic is more reliable than a random blog post.
Recency weighting: For time-sensitive queries, bias toward recently published pages. KnowledgeSDK's metadata includes page publish dates when available.
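Here is a minimal sketch combining the chunk-level retrieval and relevance-filtering strategies above, assuming OpenAI embeddings (the chunk size, threshold, and model name are illustrative, not prescriptive):
# Python — chunk-level retrieval with relevance filtering (minimal sketch)
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

def chunk_text(text: str, chunk_chars: int = 2000) -> list[str]:
    """Split page markdown into fixed-size chunks (~500 tokens at ~4 chars/token)."""
    return [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]

def top_chunks(query: str, pages: list[str], k: int = 5,
               threshold: float = 0.3) -> list[str]:
    """Score every chunk against the query; keep the k most relevant."""
    chunks = [c for page in pages for c in chunk_text(page)]
    q_vec = embeddings.embed_query(query)
    c_vecs = embeddings.embed_documents(chunks)

    def cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm = (sum(x * x for x in a) ** 0.5) * (sum(x * x for x in b) ** 0.5)
        return dot / norm if norm else 0.0

    scored = sorted(
        zip((cosine(q_vec, v) for v in c_vecs), chunks),
        key=lambda pair: pair[0],
        reverse=True,
    )
    # Relevance filtering: drop chunks scoring below the threshold
    return [chunk for score, chunk in scored[:k] if score >= threshold]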
When to Combine Static and Web RAG
The most robust production architectures combine both approaches. A typical pattern:
- First, search your static knowledge index (fast, controlled)
- If confidence is below threshold, fall back to web RAG (slow, fresh)
- Cache web RAG results briefly (e.g., 15 minutes) to reduce redundant fetches (a caching sketch follows the code below)
- Use webhooks to invalidate cache when source pages change
# Python — Hybrid static + web RAG
# (format_results and generate_answer are assumed helpers: one formats
# search hits into a context string, the other calls the LLM)
async def hybrid_rag_query(question: str, urls: List[str] | None = None) -> str:
# Try static index first
static_results = client.search(query=question, limit=3)
if static_results.results and static_results.results[0].score > 0.85:
# High confidence static result — use it
context = format_results(static_results.results)
elif urls:
# Low confidence — fall back to just-in-time web fetch
context = await fetch_web_context_parallel(urls)
else:
# No URLs provided — use static results anyway
context = format_results(static_results.results)
return await generate_answer(question, context)
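And a minimal sketch of step 3, caching web fetches for 15 minutes (the TTL is illustrative; fetch_url_async comes from the asyncio section above):
# Python — short-lived cache for web RAG fetches (minimal sketch)
import time

_cache: dict[str, tuple[float, str]] = {}
CACHE_TTL_SECONDS = 15 * 60  # matches the 15-minute guidance above

async def cached_fetch(url: str) -> str:
    """Return cached markdown when still fresh; otherwise re-fetch."""
    now = time.monotonic()
    hit = _cache.get(url)
    if hit and now - hit[0] < CACHE_TTL_SECONDS:
        return hit[1]
    _, content = await fetch_url_async(url)  # defined in the asyncio section
    _cache[url] = (now, content)
    return content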
Summary
Web RAG is not a replacement for static RAG — it is a complementary pattern for use cases that require live, external, or unpredictable data sources.
The three-layer architecture (search → fetch → normalize) is well established, but building it from scratch is significant work. KnowledgeSDK provides this infrastructure as a single API, letting you focus on the retrieval logic and LLM orchestration that differentiates your product.
For production systems, the crawl-then-search pattern (index specific domains, search at query time) gives you the freshness benefits of web RAG with latency closer to static RAG.