Tavily is a clean API: send a query, get back relevant content from the public web. For developers building AI agents, the experience is as simple as it gets.
The problem is Tavily only searches the internet. If you have 50 specific URLs — competitor sites, internal documentation, monitored domains — that you want your agent to search, Tavily cannot help. It does not let you add content to its index.
This tutorial walks through building a Tavily-equivalent search experience for your own content using KnowledgeSDK.
What Tavily Gives You (and What It Cannot)
A Tavily search call looks like this:
import { tavily } from "@tavily/core";

const client = tavily({ apiKey: process.env.TAVILY_API_KEY });

const results = await client.search("what does Stripe charge for international cards?", {
  searchDepth: "advanced",
  maxResults: 5,
});

// Returns content from across the web: could be Stripe's site, blog posts, Reddit, etc.
console.log(results.results.map((r) => ({ url: r.url, content: r.content })));
One call. Straightforward. But note what determines the output: Tavily's internet index, not your choices. You cannot say "only search these 10 URLs I trust."
For private content, you need to build the equivalent yourself. Here is how to do it in three steps.
Step 1: Extract URLs Into Your Knowledge Base
The equivalent of Tavily's indexing phase is KnowledgeSDK's extraction step. For each URL you want to be searchable, call POST /v1/extract. This fetches the page (with JavaScript rendering and anti-bot bypass), converts it to clean markdown, chunks it, generates embeddings, and stores it in your private index.
import KnowledgeSDK from "@knowledgesdk/node";

const client = new KnowledgeSDK({ apiKey: process.env.KNOWLEDGESDK_API_KEY });

const urls = [
  "https://stripe.com/pricing",
  "https://stripe.com/docs/api",
  "https://stripe.com/docs/payments",
  "https://competitor.com/pricing",
  "https://competitor.com/docs",
  // ... up to 50+ URLs
];

// Extract all URLs: this indexes them in your private corpus
async function buildCorpus(urls: string[]) {
  const results = await Promise.allSettled(
    urls.map(async (url) => {
      const job = await client.extractAsync(url, {
        callbackUrl: "https://your-app.com/webhooks/indexed",
      });
      console.log(`Queued: ${url} (job: ${job.jobId})`);
      return job;
    })
  );
  const failed = results.filter((r) => r.status === "rejected");
  if (failed.length > 0) {
    console.warn(`${failed.length} extractions failed`);
  }
}

await buildCorpus(urls);
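To make the chunking step in the pipeline above concrete: extracted markdown is split into overlapping windows before embedding. Here is a minimal sketch of fixed-size chunking with overlap; the window size, overlap, and strategy KnowledgeSDK actually uses are not documented here, so these numbers are purely illustrative:

```typescript
// Split text into overlapping chunks so that context spanning a chunk
// boundary still appears intact in at least one chunk.
function chunkText(text: string, chunkSize = 1000, overlap = 200): string[] {
  if (overlap >= chunkSize) throw new Error("overlap must be smaller than chunkSize");
  const chunks: string[] = [];
  const stride = chunkSize - overlap;
  for (let start = 0; start < text.length; start += stride) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // final window reached the end
  }
  return chunks;
}
```

Each chunk is then embedded and stored individually, which is why search results come back as snippets rather than whole pages.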
For smaller batches where you want to wait for completion before searching:
async function extractAndWait(urls: string[]) {
  for (const url of urls) {
    await client.extract(url); // sync: waits for indexing to finish
    console.log(`Indexed: ${url}`);
  }
}
Step 2: Search Your Private Corpus
Once URLs are indexed, searching them is as simple as Tavily:
async function privateSearch(query: string, limit = 5) {
  const results = await client.search(query, { limit });
  return results.items.map((item) => ({
    url: item.sourceUrl,
    title: item.title,
    content: item.snippet,
    score: item.score,
  }));
}

// Usage: identical interface to Tavily, but over your private corpus
const results = await privateSearch("what does Stripe charge for international cards?");
console.log(results);
The same function in Python:

import os

from knowledgesdk import KnowledgeSDK

client = KnowledgeSDK(api_key=os.environ["KNOWLEDGESDK_API_KEY"])

def private_search(query: str, limit: int = 5):
    results = client.search(query, limit=limit)
    return [
        {
            "url": item.source_url,
            "title": item.title,
            "content": item.snippet,
            "score": item.score,
        }
        for item in results.items
    ]

results = private_search("what does Stripe charge for international cards?")
print(results)
The search uses hybrid retrieval — vector similarity via pgvector plus ILIKE keyword fallback — which handles both semantic ("what are the authentication requirements?") and exact-term queries ("OAuth2 PKCE") reliably.
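Conceptually, hybrid ranking combines a vector-similarity score with a keyword-match boost. An in-memory sketch of the idea follows; the cosine metric, the keyword heuristic, and the 0.7/0.3 weighting are illustrative stand-ins, not KnowledgeSDK's actual internals:

```typescript
interface Doc {
  url: string;
  text: string;
  embedding: number[];
}

// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] ** 2;
    nb += b[i] ** 2;
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

// Hybrid score: weighted vector similarity plus a keyword boost,
// mirroring the pgvector + ILIKE combination described above.
function hybridRank(docs: Doc[], queryEmbedding: number[], queryTerms: string[]) {
  return docs
    .map((doc) => {
      const vecScore = cosine(doc.embedding, queryEmbedding);
      const matched = queryTerms.filter((t) => doc.text.toLowerCase().includes(t.toLowerCase()));
      const kwScore = matched.length / Math.max(queryTerms.length, 1);
      return { url: doc.url, score: 0.7 * vecScore + 0.3 * kwScore };
    })
    .sort((a, b) => b.score - a.score);
}
```

The keyword component is what keeps exact-term queries like "OAuth2 PKCE" from being drowned out when their embeddings land in a crowded region of vector space.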
Step 3: Set Webhooks for Freshness
Tavily searches the live web, so its results are inherently fresh. Your private corpus becomes stale unless you refresh it when content changes.
KnowledgeSDK's webhook system handles this: register a URL for monitoring, receive a webhook payload when the content changes, then re-extract to update the index.
// Register all corpus URLs for change monitoring
async function registerWebhooks(urls: string[]) {
  const webhooks = await Promise.all(
    urls.map((url) =>
      client.webhooks.create({
        url,
        callbackUrl: "https://your-app.com/webhooks/content-changed",
        events: ["content.changed"],
      })
    )
  );
  console.log(`Registered ${webhooks.length} webhooks`);
}

await registerWebhooks(urls);
Your webhook handler re-extracts the changed URL:
// Express.js webhook handler
import express from "express";

const app = express();
app.use(express.json());

app.post("/webhooks/content-changed", async (req, res) => {
  const { url, event } = req.body;
  if (event === "content.changed") {
    console.log(`Content changed: ${url}, re-indexing`);
    await client.extract(url);
  }
  res.status(200).json({ received: true });
});
With this setup, your private corpus stays fresh without polling — you only re-index when content actually changes.
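A production webhook endpoint should also verify that the payload actually came from the indexing service. Assuming payloads are signed with a shared secret via an HMAC-SHA256 header (the header name and signing scheme here are assumptions; check KnowledgeSDK's webhook documentation for the real ones), verification looks like:

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// What the sender side would compute over the raw body (same scheme).
function sign(rawBody: string, secret: string): string {
  return createHmac("sha256", secret).update(rawBody).digest("hex");
}

// Verify an HMAC-SHA256 signature over the raw request body.
// Returns true only on a match, using a constant-time compare.
function verifySignature(rawBody: string, signatureHex: string, secret: string): boolean {
  const expected = createHmac("sha256", secret).update(rawBody).digest();
  const received = Buffer.from(signatureHex, "hex");
  return received.length === expected.length && timingSafeEqual(received, expected);
}
```

Wire it in before the handler logic: read the raw body (e.g. with express.raw()), check the signature header, and respond 401 on mismatch so forged "content.changed" events cannot trigger re-extraction.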
Full Example: Private Search Function
Here is a drop-in replacement for a Tavily search call, searching your private corpus instead:
import KnowledgeSDK from "@knowledgesdk/node";

const client = new KnowledgeSDK({ apiKey: process.env.KNOWLEDGESDK_API_KEY });

interface SearchResult {
  url: string;
  title: string;
  content: string;
  score: number;
}

/**
 * Search your private corpus: equivalent to Tavily search over your indexed URLs.
 */
async function search(query: string, options: { limit?: number } = {}): Promise<SearchResult[]> {
  const { limit = 5 } = options;
  const response = await client.search(query, { limit });
  return response.items.map((item) => ({
    url: item.sourceUrl,
    title: item.title,
    content: item.snippet,
    score: item.score,
  }));
}

// Use in your LangChain or LlamaIndex agent as a tool
const agentTool = {
  name: "search_knowledge_base",
  description: "Search the private knowledge base of indexed web pages",
  execute: async (query: string) => {
    const results = await search(query);
    return results.map((r) => `[${r.title}] ${r.content} (source: ${r.url})`).join("\n\n");
  },
};
Production Tips
Batch large extractions. For 50+ URLs, use extractAsync with a callback URL rather than synchronous extraction. This avoids timeouts and lets you process results as they complete.
Handle async job status. If you need to poll instead of using webhooks, use GET /v1/jobs/{jobId} to check extraction status before searching.
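A generic poll-with-backoff helper covers this case. The commented usage below assumes a client.jobs.get(jobId) method and "completed"/"failed" status values, which are guesses at the jobs endpoint's response shape; adapt them to the real API:

```typescript
// Poll an async check until it returns a non-null result, backing off
// linearly between attempts. Throws if the job never finishes.
async function pollUntil<T>(
  check: () => Promise<T | null>,
  { intervalMs = 1000, maxAttempts = 30 } = {}
): Promise<T> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const result = await check();
    if (result !== null) return result;
    await new Promise((resolve) => setTimeout(resolve, intervalMs * attempt));
  }
  throw new Error(`gave up after ${maxAttempts} attempts`);
}

// Hypothetical usage against GET /v1/jobs/{jobId}:
// const job = await pollUntil(async () => {
//   const j = await client.jobs.get(jobId);
//   if (j.status === "failed") throw new Error(`extraction failed: ${jobId}`);
//   return j.status === "completed" ? j : null;
// });
```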
Rate limit awareness. Extract URLs in batches of 5-10 with a short delay between batches for large corpora:
async function batchExtract(urls: string[], batchSize = 10, delayMs = 1000) {
  for (let i = 0; i < urls.length; i += batchSize) {
    const batch = urls.slice(i, i + batchSize);
    await Promise.all(batch.map((url) => client.extractAsync(url)));
    if (i + batchSize < urls.length) {
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
Filter by project or category. If you are managing multiple corpora (documentation vs. competitor sites), use the projectId or category filter in search to scope results:
const docsResults = await client.search(query, {
  limit: 5,
  filter: { category: "documentation" },
});
The Result
What you end up with is a search system that:
- Returns results only from URLs you explicitly chose
- Runs hybrid semantic + keyword search over your indexed content
- Stays fresh via webhook-triggered re-indexing
- Has a nearly identical API surface to Tavily
The key difference from Tavily is control. Every document in your corpus is one you reviewed and approved. No noise from unrelated web sources. No dependency on a third-party index that changes without notice.
To get started, install the SDK for your language:

npm install @knowledgesdk/node
pip install knowledgesdk