Web RAG Pipeline: Architecture Guide for Live Web Retrieval in 2026
Standard RAG (Retrieval-Augmented Generation) is well understood: embed your documents, store them in a vector database, retrieve relevant chunks at query time, and inject them into your LLM prompt. It works exceptionally well for stable corpora — internal documentation, product manuals, historical data.
But what happens when the information your agent needs doesn't exist in your vector database yet? What happens when you need to answer questions about a competitor's latest pricing, a breaking news story, or a documentation page that was updated yesterday?
This is where web RAG — using the live public web as your retrieval source — becomes essential.
This guide covers the full architecture of a web RAG pipeline: when to use it, how it differs from static RAG, the three functional layers, and how to implement it in Python with LangChain and TypeScript with the Vercel AI SDK.
Static RAG vs Web RAG: Choosing the Right Architecture
The fundamental difference comes down to data freshness and control.
Static RAG gives you full control over what's in your retrieval corpus. You ingest documents, process them, embed them, and store them. Retrieval is fast (sub-100ms) because it's just a vector similarity search against an index you control. The limitation: your knowledge is frozen at ingestion time.
Web RAG treats the live web as your retrieval corpus. Instead of searching a pre-built index, you search the web in real time, fetch the relevant pages, normalize them into LLM-ready format, and inject them into your prompt. Retrieval is slower (500ms–3s) but the information is always current.
| Dimension | Static RAG | Web RAG |
|---|---|---|
| Data freshness | Frozen at ingestion | Always current |
| Retrieval latency | 10–100ms | 500ms–3000ms |
| Control over sources | Full control | Limited (public web) |
| Setup complexity | High (ingestion pipeline) | Low (API call) |
| Maintenance overhead | High (re-ingestion) | Low |
| Cost per query | Low (vector search) | Medium (scraping + LLM) |
| Best for | Internal docs, stable data | News, competitors, live docs |
When Web RAG Wins
Use web RAG when:
- The information changes frequently (news, stock prices, sports scores)
- The information source is external and you don't control it (competitor sites, public documentation)
- You need to answer questions about URLs the user provides at query time
- You're building a general-purpose research agent that needs to access any public URL
- The query domain is unpredictable — you cannot know in advance what to ingest
When Static RAG Wins
Use static RAG when:
- You own the data and it changes infrequently (internal knowledge bases, product manuals)
- You need sub-100ms retrieval latency
- You want fine-grained control over what the model can and cannot access
- You're operating at scale where per-query web scraping costs are prohibitive
Many production systems combine both: a static RAG index for owned content, with a web RAG fallback for queries that exceed the index's coverage (a hybrid sketch appears at the end of this guide).
The Three Layers of a Web RAG Pipeline
Every web RAG pipeline has three functional layers, regardless of implementation:
┌─────────────────────────────────────────────────────────┐
│ USER QUERY │
└─────────────────────┬───────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ LAYER 1: SEARCH │
│ Find which URLs on the web are relevant to the query │
│ (Google, Bing, Tavily, domain-specific search) │
└─────────────────────┬───────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ LAYER 2: FETCH │
│ Retrieve the HTML content of each URL │
│ Handle JavaScript rendering, anti-bot, rate limits │
└─────────────────────┬───────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ LAYER 3: NORMALIZE │
│ Convert raw HTML to clean LLM-ready markdown │
│ Remove navigation, ads, boilerplate │
│ Chunk and embed for retrieval │
└─────────────────────┬───────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ LLM GENERATION │
│ Inject normalized content into prompt │
│ Generate grounded answer with citations │
└─────────────────────────────────────────────────────────┘
Building these three layers from scratch is substantial engineering work. You need to integrate a search API, set up a headless browser fleet for rendering, implement content extraction and cleaning, and manage the orchestration between all of them.
KnowledgeSDK collapses all three layers into a single API call via /v1/extract, which handles search, fetch, and normalize in one request. For domain-specific search over previously extracted content, /v1/search provides semantic retrieval.
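To make that concrete, here is a minimal sketch of the fetch and normalize layers collapsed into a single call, using the same client.scrape method that appears in the full implementations below (the example URL is illustrative; see the API reference for the full parameter set):
# Python — fetch + normalize in a single call (minimal sketch)
import os
from knowledgesdk import KnowledgeSDK

client = KnowledgeSDK(api_key=os.environ["KNOWLEDGESDK_API_KEY"])

# One request: the page is fetched, rendered, stripped of boilerplate,
# and returned as clean markdown ready for an LLM prompt
result = client.scrape(url="https://example.com/docs/pricing")
print(result.markdown[:500])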
Architecture Pattern 1: Just-in-Time Web RAG
The simplest web RAG pattern fetches web content at query time for every request. This is ideal for low-volume, high-freshness use cases.
User Query
│
▼
Search Web → [URL1, URL2, URL3]
│
▼
Fetch + Normalize each URL
│
▼
Rank by relevance to query
│
▼
Inject top chunks into LLM prompt
│
▼
LLM generates grounded response
Architecture Pattern 2: Crawl-Then-Search RAG
For domain-specific use cases (e.g., "answer questions using only Stripe's documentation"), you crawl and index a set of URLs first, then search that index at query time. This gives you the freshness control of web RAG with the latency of static RAG.
[Setup Phase]
Target URLs → KnowledgeSDK Extract → Indexed Knowledge
│
[Query Phase] │
User Query → /v1/search ──────────────────┘
│
▼
Relevant Chunks
│
▼
LLM Response
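Here is a minimal sketch of the setup phase, assuming the SDK exposes the /v1/extract endpoint as client.extract — the method name and parameters are assumptions, and the target URLs are illustrative; only client.search is shown verbatim in the query-phase code later in this guide:
# Python — crawl-then-search setup phase (hypothetical sketch)
# NOTE: client.extract is an assumed binding for /v1/extract; consult
# the SDK reference for the exact method name and parameters.
import os
from knowledgesdk import KnowledgeSDK

client = KnowledgeSDK(api_key=os.environ["KNOWLEDGESDK_API_KEY"])

TARGET_URLS = [
    "https://docs.stripe.com/payments",
    "https://docs.stripe.com/billing",
]

# Extract + normalize each page so it lands in the searchable index;
# the query phase then hits /v1/search (see domain_rag_query below)
for url in TARGET_URLS:
    client.extract(url=url)  # assumed method name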
Full Implementation: Python with LangChain
Here is a complete web RAG implementation using LangChain and KnowledgeSDK:
# Python — Web RAG with LangChain
import os
from typing import List
from knowledgesdk import KnowledgeSDK
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
client = KnowledgeSDK(api_key=os.environ["KNOWLEDGESDK_API_KEY"])
llm = ChatOpenAI(model="gpt-4o", temperature=0)
RAG_PROMPT = ChatPromptTemplate.from_messages([
("system", """You are a helpful research assistant. Answer the user's question
using ONLY the web content provided below. If the content doesn't contain
enough information to answer the question, say so clearly.
Always cite your sources by referencing the URL where you found information.
Web Content:
{context}
"""),
("human", "{question}"),
])
def fetch_web_context(urls: List[str]) -> str:
"""Fetch and normalize multiple URLs, return as a single context string."""
documents = []
for url in urls:
try:
result = client.scrape(url=url)
if result.markdown:
documents.append(f"Source: {url}\n\n{result.markdown[:3000]}")
except Exception as e:
print(f"Failed to fetch {url}: {e}")
return "\n\n---\n\n".join(documents)
def web_rag_query(question: str, urls: List[str]) -> str:
"""Answer a question using live web content from the provided URLs."""
print(f"Fetching {len(urls)} URLs...")
    context = fetch_web_context(urls)
if not context:
return "Could not retrieve any web content to answer your question."
chain = RAG_PROMPT | llm
response = chain.invoke({"context": context, "question": question})
return response.content
# Domain-specific RAG: search a pre-indexed knowledge base
def domain_rag_query(question: str) -> str:
"""Answer using pre-indexed knowledge (crawl-then-search pattern)."""
search_results = client.search(
query=question,
limit=5,
)
context_parts = []
for result in search_results.results:
context_parts.append(
f"Source: {result.url}\nTitle: {result.title}\n\n{result.content}"
)
context = "\n\n---\n\n".join(context_parts)
chain = RAG_PROMPT | llm
response = chain.invoke({"context": context, "question": question})
return response.content
# Example: Just-in-time web RAG
answer = web_rag_query(
question="What are the key changes in GPT-4o's latest update?",
urls=[
"https://openai.com/blog/gpt-4o",
"https://openai.com/index/hello-gpt-4o/",
],
)
print(answer)
Adding Parallel Fetch for Lower Latency
Fetching URLs sequentially adds latency proportional to the number of sources. Use asyncio to parallelize:
# Python — Parallel URL fetching with asyncio
import asyncio
from knowledgesdk import AsyncKnowledgeSDK
async_client = AsyncKnowledgeSDK(api_key=os.environ["KNOWLEDGESDK_API_KEY"])
async def fetch_url_async(url: str) -> tuple[str, str]:
"""Fetch a single URL asynchronously."""
try:
result = await async_client.scrape(url=url)
return url, result.markdown or ""
    except Exception:
        # Treat any fetch failure as empty content for this URL
        return url, ""
async def fetch_web_context_parallel(urls: List[str]) -> str:
"""Fetch multiple URLs in parallel."""
tasks = [fetch_url_async(url) for url in urls]
results = await asyncio.gather(*tasks)
documents = []
for url, content in results:
if content:
documents.append(f"Source: {url}\n\n{content[:3000]}")
return "\n\n---\n\n".join(documents)
# This reduces latency from (n * avg_fetch_time) to max(individual_fetch_times)
# e.g., 5 URLs averaging 1.5s each: ~7.5s sequential → roughly the slowest
# single fetch (~1.8s) in parallel
context = asyncio.run(fetch_web_context_parallel([
"https://example.com/page1",
"https://example.com/page2",
"https://example.com/page3",
]))
Full Implementation: TypeScript with Vercel AI SDK
// TypeScript — Web RAG with Vercel AI SDK
import { KnowledgeSDK } from "@knowledgesdk/node";
import { openai } from "@ai-sdk/openai";
import { generateText } from "ai";
const client = new KnowledgeSDK({
apiKey: process.env.KNOWLEDGESDK_API_KEY!,
});
async function fetchWebContext(urls: string[]): Promise<string> {
const results = await Promise.all(
urls.map(async (url) => {
try {
const result = await client.scrape({ url });
return `Source: ${url}\n\n${result.markdown?.slice(0, 3000) ?? ""}`;
} catch {
return "";
}
})
);
return results.filter(Boolean).join("\n\n---\n\n");
}
async function webRagQuery(question: string, urls: string[]): Promise<string> {
const context = await fetchWebContext(urls);
const { text } = await generateText({
model: openai("gpt-4o"),
system: `You are a helpful research assistant. Answer the user's question
using ONLY the web content provided below. Cite your sources by referencing
the URL where you found information.
Web Content:
${context}`,
prompt: question,
});
return text;
}
// Domain-specific RAG using the search endpoint
async function domainRagQuery(question: string): Promise<string> {
const searchResults = await client.search({
query: question,
limit: 5,
});
const context = searchResults.results
.map((r) => `Source: ${r.url}\nTitle: ${r.title}\n\n${r.content}`)
.join("\n\n---\n\n");
const { text } = await generateText({
model: openai("gpt-4o"),
system: `You are a helpful assistant. Answer using only the content below.
${context}`,
prompt: question,
});
return text;
}
// Next.js API route example
export async function POST(req: Request) {
const { question, urls } = await req.json();
// If specific URLs provided, use just-in-time web RAG
// Otherwise, search the pre-indexed knowledge base
const answer = urls?.length > 0
? await webRagQuery(question, urls)
: await domainRagQuery(question);
return Response.json({ answer });
}
Latency Benchmarks
We measured end-to-end latency for a three-source web RAG query (fetch 3 URLs, generate answer):
| Configuration | P50 Latency | P95 Latency | Notes |
|---|---|---|---|
| Sequential fetch + GPT-4o | 6.2s | 11.4s | Baseline |
| Parallel fetch + GPT-4o | 2.8s | 5.1s | 55% faster |
| Parallel fetch + GPT-4o mini | 1.9s | 3.6s | Cheaper model |
| Domain RAG (search index) + GPT-4o | 1.4s | 2.3s | Pre-indexed |
| Domain RAG (search index) + streaming | 0.8s TTFT | — | First token fast |
Key observations:
- Parallelizing URL fetches is the single highest-leverage optimization
- Pre-indexing domains and using /v1/search cuts latency by more than half versus just-in-time fetching
- Streaming responses dramatically improve perceived latency for chat interfaces (see the sketch below)
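For example, here is a minimal streaming sketch reusing RAG_PROMPT, llm, and fetch_web_context from the LangChain implementation above (LCEL chains expose .stream, which yields partial message chunks):
# Python — stream the generation step for faster perceived latency
chain = RAG_PROMPT | llm

context = fetch_web_context(["https://openai.com/blog/gpt-4o"])
for chunk in chain.stream({"context": context, "question": "Summarize the key points."}):
    # Each chunk is a partial AIMessageChunk; emit tokens as they arrive
    print(chunk.content, end="", flush=True)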
Handling Retrieval Quality
Web RAG pipelines can return irrelevant content if the source URLs are poorly chosen. A few strategies to improve retrieval quality:
Relevance filtering: Score each fetched document against the query before injecting into the prompt. Discard documents below a relevance threshold.
Chunk-level retrieval: Instead of injecting full page content, chunk each page into 500-token segments and retrieve only the most relevant chunks. This reduces prompt length and improves signal-to-noise.
Source ranking: Weight results from authoritative sources higher. A Wikipedia article about a topic is more reliable than a random blog post.
Recency weighting: For time-sensitive queries, bias toward recently published pages. KnowledgeSDK's metadata includes page publish dates when available.
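Here is a minimal sketch combining the chunk-level retrieval and relevance-filtering strategies above, assuming OpenAI embeddings (the chunk size, threshold, and model name are illustrative, not prescriptive):
# Python — chunk-level retrieval with relevance filtering (minimal sketch)
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

def chunk_text(text: str, chunk_chars: int = 2000) -> list[str]:
    """Split page markdown into fixed-size chunks (~500 tokens at ~4 chars/token)."""
    return [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]

def top_chunks(query: str, pages: list[str], k: int = 5,
               threshold: float = 0.3) -> list[str]:
    """Score every chunk against the query; keep the k most relevant."""
    chunks = [c for page in pages for c in chunk_text(page)]
    q_vec = embeddings.embed_query(query)
    c_vecs = embeddings.embed_documents(chunks)

    def cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm = (sum(x * x for x in a) ** 0.5) * (sum(x * x for x in b) ** 0.5)
        return dot / norm if norm else 0.0

    scored = sorted(
        zip((cosine(q_vec, v) for v in c_vecs), chunks),
        key=lambda pair: pair[0],
        reverse=True,
    )
    # Relevance filtering: drop chunks scoring below the threshold
    return [chunk for score, chunk in scored[:k] if score >= threshold]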
When to Combine Static and Web RAG
The most robust production architectures combine both approaches. A typical pattern:
- First, search your static knowledge index (fast, controlled)
- If confidence is below threshold, fall back to web RAG (slow, fresh)
- Cache web RAG results briefly (e.g., 15 minutes) to reduce redundant fetches (a caching sketch follows the code below)
- Use webhooks to invalidate cache when source pages change
# Python — Hybrid static + web RAG
# (format_results and generate_answer are assumed helpers: one formats
# search hits into a context string, the other calls the LLM)
async def hybrid_rag_query(question: str, urls: List[str] | None = None) -> str:
# Try static index first
static_results = client.search(query=question, limit=3)
if static_results.results and static_results.results[0].score > 0.85:
# High confidence static result — use it
context = format_results(static_results.results)
elif urls:
# Low confidence — fall back to just-in-time web fetch
context = await fetch_web_context_parallel(urls)
else:
# No URLs provided — use static results anyway
context = format_results(static_results.results)
return await generate_answer(question, context)
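And a minimal sketch of step 3, caching web fetches for 15 minutes (the TTL is illustrative; fetch_url_async comes from the asyncio section above):
# Python — short-lived cache for web RAG fetches (minimal sketch)
import time

_cache: dict[str, tuple[float, str]] = {}
CACHE_TTL_SECONDS = 15 * 60  # matches the 15-minute guidance above

async def cached_fetch(url: str) -> str:
    """Return cached markdown when still fresh; otherwise re-fetch."""
    now = time.monotonic()
    hit = _cache.get(url)
    if hit and now - hit[0] < CACHE_TTL_SECONDS:
        return hit[1]
    _, content = await fetch_url_async(url)  # defined in the asyncio section
    _cache[url] = (now, content)
    return content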
Summary
Web RAG is not a replacement for static RAG — it is a complementary pattern for use cases that require live, external, or unpredictable data sources.
The three-layer architecture (search → fetch → normalize) is well established, but building it from scratch is significant work. KnowledgeSDK provides this infrastructure as a single API, letting you focus on the retrieval logic and LLM orchestration that differentiates your product.
For production systems, the crawl-then-search pattern (index specific domains, search at query time) gives you the freshness benefits of web RAG with latency closer to static RAG.