Private Corpus Search vs Public Web Search: Which Does Your AI Agent Need?

Tavily and Exa search the public internet. KnowledgeSDK searches your indexed content. Here's when to use each — and why the distinction matters for AI agents.

When developers add web search to AI agents, they face a choice that is rarely explained clearly: should the agent search the public internet, or should it search a private corpus of content you control? These are not the same thing, and picking the wrong one creates real problems in production.

This article breaks down the distinction, explains when each approach is right, and shows you how to implement private corpus search with KnowledgeSDK.

Two Types of Search for AI Agents

Public web search means your agent queries a search engine (Tavily, Exa, Google, or similar) and gets back results from whatever that engine has indexed. The content could come from any site on the internet. You have no control over what is in the index.

Private corpus search means your agent queries a knowledge base you built. You decided which URLs to include. You extracted the content. You control when it updates. The agent only sees what you put in.

Both are legitimate tools. The problem is that developers often default to public web search when private corpus search would serve them better, and vice versa.

When to Use Public Web Search

Public web search is the right call when your agent needs to answer questions about things you do not know ahead of time, using content from across the open internet.

Good fits:

Live news: "What happened with [company] today?" — requires real-time indexing of news sites you did not pre-select
General research: "What are the leading frameworks for multi-agent systems?" — you want broad coverage from many sources
Fact verification: "Did [claim] get reported by reliable sources?" — requires diversity of sources
Open-ended discovery: finding relevant pages your team has not seen yet

In these cases, Tavily and Exa are genuinely the better tools. They maintain massive crawl indexes, run search pipelines you do not have to build, and return results fast enough for interactive use (exact latency depends on the provider and the search depth you request).

When to Use Private Corpus Search

Private corpus search wins when you have a defined set of content sources that your agent needs to reason over reliably. The classic situations:

Competitor monitoring. You want your agent to answer "does Competitor X offer a free tier?" by searching their pricing page, not a third-party review article about them. With public web search, you get whatever third-party sources happen to rank for the competitor's name. With a private corpus, you search the actual competitor page you extracted.

Documentation site indexing. A customer support agent grounded in your own help center docs. Public web search might return community forums, Reddit threads, or outdated blog posts about your product. Private corpus search returns only your canonical documentation.

Domain-specific accuracy. Any time a hallucination from a mismatched source is worse than returning no result, private corpus search is safer. You know exactly what knowledge your agent can access.

Cost at scale. At high search volumes against a fixed set of documents, private corpus search is significantly cheaper. You pay to extract once, then search many times. Public web search is typically billed per query, so cost grows linearly with volume even when the same few pages keep coming back.
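The cost argument is plain arithmetic, and it helps to see where the break-even point falls. The sketch below is illustrative only: the prices are made-up values in abstract credits, not real pricing from any vendor, and the function names are hypothetical.

```python
def corpus_cost(pages: int, extract_price: int, queries: int, search_price: int) -> int:
    """Private corpus: pay once per extracted page, then per search."""
    return pages * extract_price + queries * search_price


def web_cost(queries: int, query_price: int) -> int:
    """Public web search: pay the per-query price every time."""
    return queries * query_price


def break_even_queries(pages: int, extract_price: int, search_price: int, query_price: int):
    """Smallest query count at which the corpus costs no more than web search.

    Returns None when a corpus search is not cheaper than a web query,
    because the upfront extraction cost is then never recovered.
    """
    saving_per_query = query_price - search_price
    if saving_per_query <= 0:
        return None
    upfront = pages * extract_price
    return -(-upfront // saving_per_query)  # ceiling division on integers


# Hypothetical prices in abstract credits, not real vendor pricing.
n = break_even_queries(pages=50, extract_price=10, search_price=1, query_price=8)
print(n, corpus_cost(50, 10, n, 1), web_cost(n, 8))
```

With these placeholder numbers the corpus pays for itself after 72 queries (572 credits versus 576). Plug in your own extraction and query prices to see whether your volume clears the threshold.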

The Architecture Difference

The implementation difference is fundamental, not just a configuration option.

Public web search flow:

Agent receives query → API call to Tavily/Exa → Retrieval + ranking across the provider's web index → Results returned → Agent responds

Each query is a round trip to a third-party index that you did not build, and some providers also fetch page content live. The content source is the internet at large.

Private corpus flow:

You: extract URLs → content indexed in vector store
Agent receives query → API call to KnowledgeSDK → Semantic search over your index → Results returned → Agent responds

Extraction is a one-time or scheduled operation. Queries hit your stored index. You control every document in the index.
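To make the "index once, query many times" shape concrete, here is a toy in-memory version of the private corpus flow. It is a sketch, not how KnowledgeSDK works internally: bag-of-words counts stand in for real embeddings, a dictionary stands in for the vector store, and `ToyCorpus` is a hypothetical name.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyCorpus:
    """In-memory stand-in for a private index."""

    def __init__(self) -> None:
        self._docs: dict[str, Counter] = {}

    def index(self, url: str, text: str) -> None:
        """Indexing step: runs once per page (or on a schedule), not per query."""
        self._docs[url] = Counter(tokenize(text))

    def search(self, query: str, limit: int = 5) -> list[tuple[str, float]]:
        """Query step: only documents that were explicitly indexed can be returned."""
        q = Counter(tokenize(query))
        scored = [(url, cosine(q, vec)) for url, vec in self._docs.items()]
        hits = [(url, score) for url, score in scored if score > 0]
        return sorted(hits, key=lambda pair: pair[1], reverse=True)[:limit]


corpus = ToyCorpus()
corpus.index("https://competitor.com/pricing", "Pricing plans: free tier, pro plan, enterprise plan")
corpus.index("https://competitor.com/docs/api", "API rate limits: 100 requests per minute")

for url, score in corpus.search("what are the API rate limits?"):
    print(f"[{score:.2f}] {url}")
```

The point of the sketch is the boundary: `search` can never return a page that `index` was not called on, which is the property the rest of this article relies on.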

Why Private Corpus Search Wins for Production Agents

Control. You know exactly what your agent can cite. No surprises from SEO-optimized content that ranks well but contains bad information.

Speed. A pre-built vector index over a small corpus can answer in tens of milliseconds, not counting network overhead. Public web search does more work per query, and that latency compounds across agent turns.

Relevance. A narrow corpus means every search result comes from a source you chose, so there is far less noise to filter. You are not sorting through 20 web pages to find 2 useful ones.

Freshness on your terms. You decide when content updates. Pair with webhooks to re-index only when pages actually change, rather than re-fetching on every query.
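One simple way to re-index only when a page actually changes is to keep a hash of the last indexed content and compare it against each new version. The sketch below assumes the new page content reaches you somehow (a webhook payload, a scheduled fetch); that delivery mechanism is out of scope here, and `ReindexTracker` is a hypothetical helper, not part of any SDK.

```python
import hashlib


class ReindexTracker:
    """Remembers a content hash per URL so unchanged pages can be skipped."""

    def __init__(self) -> None:
        self._hashes: dict[str, str] = {}

    def needs_reindex(self, url: str, content: str) -> bool:
        """Return True (and record the new hash) only if the content changed."""
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if self._hashes.get(url) == digest:
            return False
        self._hashes[url] = digest
        return True


tracker = ReindexTracker()
page = "https://competitor.com/pricing"

print(tracker.needs_reindex(page, "Pro plan: $49/month"))  # first sighting
print(tracker.needs_reindex(page, "Pro plan: $49/month"))  # unchanged content
print(tracker.needs_reindex(page, "Pro plan: $59/month"))  # price changed
```

Only the first and third calls would trigger a re-extraction. In practice you would likely hash the extracted main content, not raw HTML, so that rotating ads or timestamps do not register as changes.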

Building a Private Corpus with KnowledgeSDK

Here is the full flow: extract URLs into your knowledge base, then search them.

import KnowledgeSDK from "@knowledgesdk/node";

const client = new KnowledgeSDK({ apiKey: process.env.KNOWLEDGESDK_API_KEY });

// Step 1: Extract and index specific URLs
const urls = [
  "https://competitor.com/pricing",
  "https://competitor.com/features",
  "https://competitor.com/docs/api",
];

for (const url of urls) {
  await client.extract(url);
  console.log(`Indexed: ${url}`);
}

// Step 2: Search your private corpus
const results = await client.search("what are the rate limits on their API?", {
  limit: 5,
});

for (const item of results.items) {
  console.log(`[${item.score.toFixed(2)}] ${item.title}`);
  console.log(item.snippet);
}

The same flow in Python:

import os

from knowledgesdk import KnowledgeSDK

client = KnowledgeSDK(api_key=os.environ["KNOWLEDGESDK_API_KEY"])

# Step 1: Extract and index specific URLs
urls = [
    "https://competitor.com/pricing",
    "https://competitor.com/features",
    "https://competitor.com/docs/api",
]

for url in urls:
    client.extract(url)
    print(f"Indexed: {url}")

# Step 2: Search your private corpus
results = client.search("what are the rate limits on their API?", limit=5)

for item in results.items:
    print(f"[{item.score:.2f}] {item.title}")
    print(item.snippet)

The search uses hybrid retrieval — vector similarity via pgvector plus keyword fallback — so it handles both semantic queries ("what are the authentication options?") and specific term lookups ("OAuth2") reliably.
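To illustrate the general idea of vector search with a keyword fallback, here is a toy version in plain Python. This is an assumption-laden sketch, not KnowledgeSDK's actual implementation: the two-dimensional vectors are hand-made instead of coming from an embedding model, the 0.8 cutoff is arbitrary, and `hybrid_search` is a hypothetical function.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_search(query_vec, query_text, docs, min_score=0.8, limit=5):
    """Return (doc_id, method) pairs: vector hits first, keyword matches as fallback."""
    scored = sorted(
        ((cosine(query_vec, d["vec"]), d["id"]) for d in docs),
        reverse=True,
    )
    vector_hits = [(doc_id, "vector") for score, doc_id in scored if score >= min_score]
    if vector_hits:
        return vector_hits[:limit]
    # Nothing was semantically close enough: fall back to a literal term match.
    needle = query_text.lower()
    return [(d["id"], "keyword") for d in docs if needle in d["text"].lower()][:limit]


# Hand-made vectors for illustration; real ones would come from an embedding model.
docs = [
    {"id": "auth", "text": "We support OAuth2 and API keys", "vec": [1.0, 0.0]},
    {"id": "pricing", "text": "Free tier and pro plan", "vec": [0.0, 1.0]},
]

print(hybrid_search([0.9, 0.1], "what are the authentication options?", docs))
print(hybrid_search([0.5, 0.5], "OAuth2", docs))
```

The first query lands close to the `auth` document in vector space and is served by similarity. The second query's vector is not close to anything, so the exact term "OAuth2" is matched against the raw text instead.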

Using Both in One Agent

Most production AI systems benefit from both approaches in a layered architecture:

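import KnowledgeSDK from "@knowledgesdk/node";
import { tavily } from "@tavily/core";

const ksClient = new KnowledgeSDK({ apiKey: process.env.KNOWLEDGESDK_API_KEY });
const tavilyClient = tavily({ apiKey: process.env.TAVILY_API_KEY });

// Tunable cutoff: calibrate against your own queries and corpus.
const MIN_CORPUS_SCORE = 0.78;

async function search(query: string) {
  // First: check private corpus (fast, high precision)
  const privateResults = await ksClient.search(query, { limit: 3 });
  const topHit = privateResults.items[0];

  if (topHit && topHit.score > MIN_CORPUS_SCORE) {
    return { source: "corpus", results: privateResults.items };
  }

  // Fallback: public web search for out-of-corpus queries
  const webResults = await tavilyClient.search(query, { maxResults: 3 });
  return { source: "web", results: webResults.results };
}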

Your private corpus handles everything you have explicitly indexed with high precision. Tavily or Exa covers the long tail of general questions that go beyond your curated content.
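The routing rule is easier to test if the two search backends are passed in as plain functions, so each path can be exercised without network calls. The following is a minimal sketch of that idea; `layered_search` and the two stub backends are hypothetical, and the result shapes (a list of dicts with a `score` key for corpus hits) are assumptions made for the example.

```python
from typing import Callable


def layered_search(
    query: str,
    corpus_search: Callable[[str], list[dict]],
    web_search: Callable[[str], list[dict]],
    min_score: float = 0.78,
) -> dict:
    """Prefer the private corpus; use web search when the top hit is weak or missing."""
    hits = corpus_search(query)
    if hits and hits[0]["score"] > min_score:
        return {"source": "corpus", "results": hits}
    return {"source": "web", "results": web_search(query)}


# Stub backends standing in for real clients.
def fake_corpus(query: str) -> list[dict]:
    if "rate limit" in query:
        return [{"title": "API docs", "score": 0.91}]
    return [{"title": "Pricing", "score": 0.40}]


def fake_web(query: str) -> list[dict]:
    return [{"title": "Some web page", "url": "https://example.com"}]


print(layered_search("rate limit for the API", fake_corpus, fake_web)["source"])
print(layered_search("latest industry news", fake_corpus, fake_web)["source"])
```

The first query is answered from the corpus because its top hit clears the cutoff; the second falls through to the web stub. Keeping `min_score` as a parameter makes it straightforward to tune the cutoff against a set of real queries.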

Summary

The distinction is straightforward once you see it:

Public web search (Tavily, Exa): use when the question requires broad internet coverage and you do not know the sources ahead of time
Private corpus search (KnowledgeSDK): use when you have a defined set of URLs, need control over what your agent knows, and want consistent, auditable retrieval

For most production AI agents that operate in a specific domain — customer support, competitive intelligence, documentation Q&A — private corpus search is the correct default. Public web search is a useful fallback, not the primary retrieval layer.