If you have built a RAG pipeline from scratch, you know the step count. A realistic implementation involves at minimum: an HTTP client to fetch the page, an HTML parser to extract content, a markdown converter, a chunking strategy, an embedding model call, a vector database write, and a search endpoint. Seven distinct operations across at least three services.
This article explains what each step does, why it exists, and how KnowledgeSDK collapses the entire pipeline into two API calls.
The Traditional RAG Pipeline
Here is what a minimal do-it-yourself pipeline looks like in practice:
```typescript
import axios from "axios";
import TurndownService from "turndown";
import OpenAI from "openai";
import { Pinecone } from "@pinecone-database/pinecone";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });

async function indexUrl(url: string) {
  // Step 1: Fetch the page
  const response = await axios.get(url, {
    headers: { "User-Agent": "Mozilla/5.0" },
    timeout: 10000,
  });

  // Step 2: Strip HTML and convert to markdown
  const turndown = new TurndownService();
  const markdown = turndown.turndown(response.data);

  // Step 3: Chunk the content
  const chunks = chunkText(markdown, { maxTokens: 512, overlap: 64 });

  // Step 4: Generate embeddings for each chunk
  const embeddingResponse = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: chunks,
  });
  const embeddings = embeddingResponse.data.map((d) => d.embedding);

  // Step 5: Store in vector database
  await pinecone.index("knowledge").upsert(
    chunks.map((chunk, i) => ({
      id: `${url}-chunk-${i}`,
      values: embeddings[i],
      metadata: { url, text: chunk },
    }))
  );

  console.log(`Indexed ${chunks.length} chunks from ${url}`);
}

function chunkText(text: string, options: { maxTokens: number; overlap: number }): string[] {
  // Naive word-count approximation. A production version counts real tokens
  // (e.g. with tiktoken) and typically runs 40-80 lines.
  const words = text.split(/\s+/);
  const chunks: string[] = [];
  for (let i = 0; i < words.length; i += options.maxTokens - options.overlap) {
    chunks.push(words.slice(i, i + options.maxTokens).join(" "));
  }
  return chunks;
}
```
This is the simplified version. A production-quality pipeline also handles: JavaScript-rendered pages (requires a headless browser like Playwright), anti-bot bypass, retry logic, error handling for malformed HTML, token counting for accurate chunking, and rate limiting on the embedding API.
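The retry and rate-limiting concerns above can be sketched with a small backoff wrapper. This is a minimal illustration, not tied to any particular API client; a production version would also distinguish retryable errors (429, 5xx) from permanent ones and add jitter:

```typescript
// Retry a flaky async call with exponential backoff: base, 2x, 4x, ...
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 500
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt === maxAttempts - 1) break; // out of attempts
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** attempt));
    }
  }
  throw lastError;
}
```

Wrapping the embedding call as `withRetry(() => openai.embeddings.create(...))` absorbs transient rate-limit failures without complicating the pipeline logic.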
Dependency count: an HTTP client, an HTML-to-markdown converter, the OpenAI API, and Pinecone (or equivalent): four external dependencies, each with its own configuration, pricing, and failure modes.
What KnowledgeSDK Collapses
POST /v1/extract does everything in the DIY pipeline as a single API call:
- Fetches the URL with a headless browser (handles JavaScript rendering)
- Applies anti-bot bypass where needed
- Converts the rendered HTML to clean markdown
- Chunks the content using a token-aware strategy
- Generates embeddings via text-embedding-3-small (1536 dimensions)
- Stores vectors in pgvector with HNSW indexing
- Makes the content immediately searchable
POST /v1/search runs hybrid retrieval — vector similarity search plus ILIKE keyword fallback — over your indexed content.
The full pipeline that required 7 steps and 3 services becomes:
```typescript
import KnowledgeSDK from "@knowledgesdk/node";

const client = new KnowledgeSDK({ apiKey: process.env.KNOWLEDGESDK_API_KEY });

// Index a URL — all 7 steps happen server-side
await client.extract("https://docs.stripe.com/api/payment_intents");

// Search your indexed content
const results = await client.search("how do I handle payment confirmation?", {
  limit: 5,
});

for (const item of results.items) {
  console.log(`[${item.score.toFixed(2)}] ${item.title}`);
  console.log(item.snippet);
  console.log(`Source: ${item.sourceUrl}\n`);
}
```
The same flow in Python:

```python
import os

from knowledgesdk import KnowledgeSDK

client = KnowledgeSDK(api_key=os.environ["KNOWLEDGESDK_API_KEY"])

# Index the URL
client.extract("https://docs.stripe.com/api/payment_intents")

# Search
results = client.search("how do I handle payment confirmation?", limit=5)

for item in results.items:
    print(f"[{item.score:.2f}] {item.title}")
    print(item.snippet)
    print(f"Source: {item.source_url}\n")
```
What Happens Under the Hood
Understanding the pipeline internals helps you make informed decisions about when to use a managed solution versus building your own.
pgvector with HNSW indexing. KnowledgeSDK uses PostgreSQL with the pgvector extension and HNSW (Hierarchical Navigable Small World) indexing for approximate nearest neighbor search. In benchmarks against same-region deployments, this achieves roughly 10-20 ms search latency at ~94% recall (the standard accuracy measure for approximate nearest neighbor search). No separate vector database service is required.
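Conceptually, the distance pgvector computes for cosine search is simple; here is a plain TypeScript rendering for illustration (HNSW just makes the nearest-neighbor lookup over these distances approximate and fast):

```typescript
// Cosine distance = 1 - cosine similarity, the quantity pgvector's
// cosine-distance operator computes between a query vector and each
// stored embedding.
function cosineDistance(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```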
text-embedding-3-small. Embeddings are generated via OpenAI's text-embedding-3-small model (1536 dimensions), routed through the Vercel AI gateway. Embedding vectors for repeated queries are cached in Redis with a 24-hour TTL — search queries you run frequently skip the embedding API call on subsequent requests.
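The query-embedding cache can be sketched with a simple TTL map. The real service uses Redis, but the logic is the same; the function and parameter names here are illustrative, not KnowledgeSDK's internals:

```typescript
// TTL cache for query embeddings: identical queries within the window
// skip the embedding API call entirely.
const TTL_MS = 24 * 60 * 60 * 1000; // 24-hour TTL
const embeddingCache = new Map<string, { embedding: number[]; expiresAt: number }>();

async function embedWithCache(
  query: string,
  embed: (q: string) => Promise<number[]>, // the actual embedding API call
  now: () => number = Date.now
): Promise<number[]> {
  const hit = embeddingCache.get(query);
  if (hit && hit.expiresAt > now()) return hit.embedding; // cache hit
  const embedding = await embed(query);
  embeddingCache.set(query, { embedding, expiresAt: now() + TTL_MS });
  return embedding;
}
```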
Hybrid search. The search endpoint runs both vector similarity (embedding <=> ?::vector cosine distance) and keyword fallback (ILIKE pattern matching). This handles edge cases where a very specific technical term does not have a strong vector match — keyword fallback catches it.
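The merge step can be sketched as a pure function: vector hits keep their ranking, keyword-only hits are appended as the fallback, de-duplicated by id. The actual ranking logic is internal to the service; this is just the shape of the idea:

```typescript
interface Hit {
  id: string;
  score: number;
}

// Merge vector-similarity hits with keyword-fallback hits. Vector results
// lead; keyword results that did not already match by vector are appended.
function mergeHybrid(vectorHits: Hit[], keywordHits: Hit[], limit: number): Hit[] {
  const seen = new Set(vectorHits.map((h) => h.id));
  const merged = [...vectorHits];
  for (const hit of keywordHits) {
    if (!seen.has(hit.id)) {
      seen.add(hit.id);
      merged.push(hit);
    }
  }
  return merged.slice(0, limit);
}
```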
A Complete Production Example
Here is an end-to-end implementation for indexing a competitor's documentation site and searching it:
```typescript
import KnowledgeSDK from "@knowledgesdk/node";

const client = new KnowledgeSDK({ apiKey: process.env.KNOWLEDGESDK_API_KEY });

// Index a documentation site
async function indexDocumentation(baseUrl: string) {
  // First, discover all URLs on the site
  const sitemap = await client.sitemap(baseUrl);
  const docUrls = sitemap.urls.filter((url) => url.includes("/docs/"));
  console.log(`Found ${docUrls.length} documentation pages`);

  // Extract each page asynchronously
  const jobs = await Promise.all(
    docUrls.slice(0, 50).map((url) =>
      client.extractAsync(url, {
        callbackUrl: `${process.env.YOUR_APP_URL}/webhooks/indexed`,
      })
    )
  );
  console.log(`Queued ${jobs.length} extraction jobs`);
}

// Search the indexed documentation
async function searchDocs(query: string) {
  const results = await client.search(query, {
    limit: 5,
    filter: { domain: "docs.competitor.com" },
  });

  if (results.items.length === 0) {
    return "No results found in the indexed documentation.";
  }

  return results.items
    .map((item) => `## ${item.title}\n${item.snippet}\nSource: ${item.sourceUrl}`)
    .join("\n\n");
}

// Example usage
await indexDocumentation("https://docs.competitor.com");

// Later — after extraction completes
const answer = await searchDocs("how do they handle webhook retry logic?");
console.log(answer);
```
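The callbackUrl in the indexing function implies a webhook receiver on your side. The payload shape below is an assumption for illustration, not the documented schema; the handler is written as a pure function so it can sit behind any HTTP framework:

```typescript
// Hypothetical webhook payload — field names are assumptions, not the
// documented KnowledgeSDK schema.
interface IndexedWebhookPayload {
  jobId: string;
  url: string;
  status: "completed" | "failed";
  chunkCount?: number;
}

// Returns a log line describing the outcome of one extraction job.
function handleIndexedWebhook(payload: IndexedWebhookPayload): string {
  if (payload.status === "failed") {
    return `Extraction failed for ${payload.url} (job ${payload.jobId})`;
  }
  return `Indexed ${payload.chunkCount ?? 0} chunks from ${payload.url}`;
}
```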
DIY Pipeline Cost Estimate
For a team running 1,000 URL extractions per month and 10,000 searches:
DIY pipeline:
- Playwright/headless browser infrastructure: ~$50-100/month (self-hosted) or $200+ (managed)
- OpenAI embeddings (1,000 docs × ~4K tokens each, about 4M tokens): well under $1 at current text-embedding-3-small pricing
- Pinecone starter: $70/month
- Postgres hosting (if separate from your main DB): $15-25/month
- Development and maintenance time: significant
KnowledgeSDK:
- Starter plan: $29/month (includes extractions, embeddings, search, webhooks, MCP)
- No infrastructure to configure or maintain
The cost difference is most obvious in developer time. The DIY pipeline is maintainable once built, but building it correctly — handling JS rendering, anti-bot, proper chunking, embedding caching — takes days, not hours.
When You Would Still Build DIY
There are legitimate reasons to build the pipeline yourself:
Custom chunking strategies. If your content requires domain-specific chunking (code-aware chunking for a developer docs corpus, section-aware chunking for legal documents), the managed extraction may not match your requirements.
Specific embedding models. If your retrieval accuracy depends on a particular model (Voyage 3.5 for code, Gemini embedding-001 for multilingual content), you need control over the embedding step.
Full data ownership. If your compliance requirements prohibit sending content to a third-party API, a fully self-hosted pipeline is necessary.
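As a taste of what domain-specific chunking means in practice, here is a minimal section-aware splitter that breaks markdown at heading boundaries rather than at arbitrary token windows. It is a sketch of the strategy, not a production chunker:

```typescript
// Split markdown into chunks at heading boundaries, so each chunk is a
// coherent section. A real implementation would further split oversized
// sections against a token budget.
function chunkByHeadings(markdown: string): string[] {
  const lines = markdown.split("\n");
  const chunks: string[] = [];
  let current: string[] = [];
  for (const line of lines) {
    // A new heading (#, ##, ...) closes the previous section.
    if (/^#{1,6}\s/.test(line) && current.length > 0) {
      chunks.push(current.join("\n").trim());
      current = [];
    }
    current.push(line);
  }
  if (current.length > 0) chunks.push(current.join("\n").trim());
  return chunks.filter((c) => c.length > 0);
}
```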
For most developers building AI agents that need web knowledge retrieval, none of these conditions apply. The managed pipeline is faster to build, easier to maintain, and cheaper at early-to-mid scale.
Summary
The seven-step DIY pipeline (fetch, parse, convert, chunk, embed, store, search) is a reasonable engineering exercise and provides maximum flexibility. It is also significant infrastructure to build and maintain.
For the common use case — "I need URLs to be searchable for my AI agent" — KnowledgeSDK collapses that pipeline into two API calls: extract and search. The tradeoff is control for simplicity, which is the right tradeoff for most production AI agents.
```shell
npm install @knowledgesdk/node
pip install knowledgesdk
```