Tutorial · March 20, 2026 · 8 min read

Build a Searchable Knowledge Base from Any Website in Minutes

Step-by-step tutorial: extract any website into a searchable knowledge base using KnowledgeSDK — no infrastructure, no vector DB setup, just a few API calls.

Every team building an AI agent eventually hits the same wall: you need to make a website searchable for your agent, and the traditional approach turns out to be a two-week project.

The standard path looks something like this:

  1. Set up a scraper that handles JavaScript rendering
  2. Figure out how to bypass anti-bot measures on the target site
  3. Implement chunking logic (what is the right chunk size? overlap?)
  4. Choose and deploy an embedding model
  5. Provision a vector database (Pinecone, Chroma, Weaviate, Qdrant...)
  6. Write the indexing pipeline that connects all of the above
  7. Implement search logic with appropriate retrieval strategies
  8. Set up re-crawling for content freshness
  9. Handle errors, retries, and partial failures throughout

Before you have written a single line of your actual agent, you have built a substantial data pipeline. And maintaining it becomes a permanent background task.

KnowledgeSDK reduces this to three API calls.

The Core Idea

KnowledgeSDK is a knowledge extraction API. You point it at a URL, it returns structured markdown and metadata, and the content is automatically indexed in your private knowledge collection. You can then search that collection using hybrid semantic and keyword retrieval — no vector database to manage, no embedding infrastructure to operate.

Your problem shifts from "how do I index this website?" to "what do I want my agent to know?"

Step 1: Extract a Page

The entry point is POST /v1/extract. Pass it a URL, and it returns clean structured markdown along with metadata: title, description, headings, author information where available, and structured content sections.

import Knowledgesdk from "@knowledgesdk/node";

const client = new Knowledgesdk({
  apiKey: "knowledgesdk_live_...",
});

const result = await client.extract({
  url: "https://docs.example.com/api-reference/authentication",
});

console.log(result.title);
// "Authentication — Example API Reference"

console.log(result.content.slice(0, 500));
// Clean markdown content from the page

console.log(result.metadata);
// { description: "...", headings: [...], wordCount: 1240 }

The extraction handles JavaScript-rendered pages — single-page apps, React-based documentation sites, dynamically loaded content. It also handles common anti-bot measures, so you are not writing custom headers and retry logic for each target site.

The extracted content is automatically indexed in your private collection. Every API key gets its own isolated knowledge collection — your competitor's docs are not mixed with anyone else's.

Step 2: Search It

Once pages are extracted, you can search across all indexed content with POST /v1/search. The search uses hybrid retrieval — combining semantic vector search with keyword matching — so it handles both conceptual queries ("how does auth work?") and specific keyword lookups ("oauth2 scope parameter").

const results = await client.search({
  query: "how do I authenticate API requests?",
  limit: 5,
});

for (const item of results.items) {
  console.log(item.title);
  console.log(item.url);
  console.log(item.score);
  console.log(item.content.slice(0, 400));
  console.log("---");
}

The search returns ranked results with relevance scores, titles, source URLs, and content excerpts. These are ready to pass directly to your LLM as retrieved context.
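Assembling those excerpts into a single prompt string is a few lines of plain TypeScript. The `SearchItem` interface below is an assumption that mirrors the fields used in the loop above, not a type exported by the SDK:

```typescript
// Assumed shape of one search result, based on the fields accessed above.
interface SearchItem {
  title: string;
  url: string;
  score: number;
  content: string;
}

// Join ranked results into one context block for an LLM prompt,
// truncating each excerpt so a few results stay within a token budget.
function buildContext(items: SearchItem[], excerptLength = 400): string {
  return items
    .map((item) => `Source: ${item.title} (${item.url})\n${item.content.slice(0, excerptLength)}`)
    .join("\n\n---\n\n");
}
```

Keeping the source title and URL in each block lets the LLM cite where an answer came from.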

Full Example: Index a Competitor's Docs, Then Search Them

Here is a complete workflow — enumerate a documentation site, extract every page, and build a searchable knowledge base in one script:

import Knowledgesdk from "@knowledgesdk/node";

const client = new Knowledgesdk({ apiKey: "knowledgesdk_live_..." });

async function buildKnowledgeBase(baseUrl: string) {
  console.log(`Fetching sitemap for ${baseUrl}...`);

  // Step 1: Get all URLs from the site
  const { urls } = await client.sitemap({ url: baseUrl });
  console.log(`Found ${urls.length} URLs`);

  // Step 2: Extract each page (filter to relevant paths)
  const docUrls = urls.filter((url) => url.includes("/docs/") || url.includes("/api/"));
  console.log(`Extracting ${docUrls.length} documentation pages...`);

  for (const url of docUrls) {
    try {
      const result = await client.extract({ url });
      console.log(`Extracted: ${result.title}`);
    } catch (err) {
      console.warn(`Failed to extract ${url}:`, err instanceof Error ? err.message : err);
    }
    }
  }

  console.log("Knowledge base ready.");
}

async function searchKnowledgeBase(query: string) {
  const results = await client.search({ query, limit: 3 });

  return results.items.map((item) => ({
    title: item.title,
    url: item.url,
    excerpt: item.content.slice(0, 600),
  }));
}

// Build the knowledge base
await buildKnowledgeBase("https://docs.competitor.com");

// Search it
const chunks = await searchKnowledgeBase("rate limits and throttling");

// Pass to your LLM
const context = chunks.map((c) => `Source: ${c.title}\n${c.excerpt}`).join("\n\n---\n\n");
console.log(context);

This script goes from zero to a searchable knowledge base over an entire documentation site. The sitemap call gives you the full URL list; the extract calls index each page; the search call retrieves relevant chunks for any query.
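The loop above extracts pages strictly one at a time. A small generic helper can cap concurrency instead, which is usually faster without flooding the API. This is an illustrative utility of my own, not part of the SDK:

```typescript
// Run an async mapper over items with at most `limit` calls in flight.
// Workers pull the next index from a shared counter until items run out.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  }
  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, worker));
  return results;
}

// Usage with the SDK client from the script above:
// await mapWithConcurrency(docUrls, 5, (url) => client.extract({ url }));
```

A limit of 3–5 is a reasonable starting point; check your plan's rate limits before going higher.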

Handling Large Sites: Async Extraction

For sites with hundreds or thousands of pages, synchronous extraction can time out. Use POST /v1/extract/async to submit extraction jobs and poll for completion:

// Submit async extraction job
const job = await client.extract.async({
  url: "https://docs.large-site.com/reference/endpoints",
});

console.log(`Job ID: ${job.jobId}`);

// Poll until complete
let status = "pending";
while (status === "pending" || status === "running") {
  await new Promise((resolve) => setTimeout(resolve, 3000));
  const jobStatus = await client.jobs.get(job.jobId);
  status = jobStatus.status;
  console.log(`Status: ${status}`);
}

console.log("Extraction complete.");

For bulk extraction over a full sitemap, submit all jobs asynchronously and poll them in parallel rather than sequentially.
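One way to sketch that submit-all-then-poll pattern: factor the polling loop above into a reusable function that takes the status fetcher as a parameter. The helper itself is my own sketch; only `client.extract.async` and `client.jobs.get` come from the docs above:

```typescript
// Poll a job until its status leaves "pending"/"running", waiting
// `delayMs` between polls. The status fetcher is injected so this
// works with client.jobs.get or any other source.
async function waitForJob(
  jobId: string,
  getStatus: (id: string) => Promise<{ status: string }>,
  delayMs = 3000,
): Promise<string> {
  while (true) {
    const { status } = await getStatus(jobId);
    if (status !== "pending" && status !== "running") return status;
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
}

// Submit every job first, then await all of them in parallel:
// const jobs = await Promise.all(docUrls.map((url) => client.extract.async({ url })));
// const statuses = await Promise.all(
//   jobs.map((job) => waitForJob(job.jobId, (id) => client.jobs.get(id))),
// );
```

Because submission and polling are decoupled, a thousand-page site finishes in roughly the time of the slowest job rather than the sum of all of them.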

Keeping Your Knowledge Base Fresh

Static knowledge bases go stale. The competitor's pricing page changes. The documentation gets updated. New endpoints are added.

KnowledgeSDK webhooks let you react to change signals. You can configure a webhook to receive a notification when a re-extraction detects content changes, triggering automatic re-indexing:

// Re-extract a page on a schedule or via webhook trigger
async function refreshPage(url: string) {
  const result = await client.extract({ url });
  console.log(`Re-indexed: ${result.title} (${result.metadata.wordCount} words)`);
}

// Example: refresh all pages weekly
const { urls } = await client.sitemap({ url: "https://docs.competitor.com" });
for (const url of urls) {
  await refreshPage(url);
}

For production systems, a simple cron job that re-extracts high-priority pages on a schedule is often sufficient. For change detection at scale, pair it with webhook notifications.
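If you want the schedule in-process rather than in cron, a minimal sketch looks like the following. The scheduler is a generic utility I am assuming here, not an SDK feature; in practice a real cron job or task queue is more robust:

```typescript
// Re-run `refresh` over a fixed set of high-priority pages every
// `intervalMs` milliseconds, logging failures without stopping the loop.
function scheduleRefresh(
  urls: string[],
  refresh: (url: string) => Promise<void>,
  intervalMs: number,
): ReturnType<typeof setInterval> {
  return setInterval(async () => {
    for (const url of urls) {
      try {
        await refresh(url);
      } catch (err) {
        console.warn(`Refresh failed for ${url}:`, err);
      }
    }
  }, intervalMs);
}

// Daily refresh of priority pages, reusing refreshPage from above:
// scheduleRefresh(priorityUrls, refreshPage, 24 * 60 * 60 * 1000);
```

Call `clearInterval` on the returned handle to stop the schedule during shutdown.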

Real Use Cases

Competitor monitoring. Extract competitor documentation and marketing pages. Search them with your product-related queries. Surface the gaps and differences automatically.

Documentation chatbot. Index your own product docs. Wire the search API into your support chatbot. Your users ask questions in natural language; the chatbot retrieves the relevant documentation sections and generates answers grounded in current docs.

Research agent. Give your research agent a list of authoritative URLs. Extract and index them. The agent can now answer questions grounded in current, reliable sources rather than LLM training data.

Internal knowledge bases. Extract and index internal wikis, runbooks, or documentation sites. Make your internal knowledge searchable via natural language for your team's agents and tools.

Production Tips

  • Filter URLs before extracting. Use the sitemap output and filter to paths that contain actual documentation (/docs/, /guide/, /api/). Skip navigation pages, changelog entries, and marketing pages unless they are relevant to your use case.

  • Use async for anything over 10 pages. The synchronous extract endpoint has a generous timeout, but large sites benefit from the async flow with job polling.

  • Search before you prompt. Retrieve 3–5 chunks per query, not 20. More chunks means more tokens, higher cost, and often worse LLM output quality due to context dilution. Let the search ranking do its job.

  • Re-extract on a schedule, not just on demand. For knowledge bases that need to stay current, a weekly or daily re-extraction of key pages is cheaper than discovering stale data in production.
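The first tip above can be made concrete with a small filter over the sitemap output. The path prefixes and normalization rules here are examples to adapt, not anything prescribed by the API:

```typescript
// Keep only documentation-like URLs from a sitemap, dropping fragments,
// trailing slashes, and duplicates before extraction.
function selectDocUrls(
  urls: string[],
  prefixes: string[] = ["/docs/", "/guide/", "/api/"],
): string[] {
  const seen = new Set<string>();
  const selected: string[] = [];
  for (const url of urls) {
    const normalized = url.replace(/#.*$/, "").replace(/\/$/, "");
    if (seen.has(normalized)) continue;
    if (!prefixes.some((prefix) => normalized.includes(prefix))) continue;
    seen.add(normalized);
    selected.push(normalized);
  }
  return selected;
}
```

Deduplicating before extraction avoids paying for the same page twice when a sitemap lists both slash and non-slash variants.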

The infrastructure complexity that used to take two weeks now takes an afternoon. The knowledge base is the easy part — your agent logic is where the value is.

Try it now

Scrape, search, and monitor any website with one API.

Get your API key in 30 seconds. First 1,000 requests free.
