How to Keep Your RAG Pipeline Fresh Without Re-Indexing Everything
Every RAG tutorial covers the same ground: chunk your documents, embed them, store in a vector database, retrieve at query time. There are hundreds of guides showing you how to build the pipeline once.
Almost none of them tell you what to do on day two.
Your knowledge base goes stale. Documentation changes. Pricing pages update. Product features get added. If your RAG pipeline was built from a crawl three weeks ago, it's already wrong. And the naive fix — re-crawl everything every 24 hours — is expensive and slow.
This tutorial covers the right pattern: incremental updates using webhooks. You'll build a system that:
- Does a full initial crawl with KnowledgeSDK
- Registers webhook subscriptions for each URL
- Re-scrapes only the changed URL when a page updates, and replaces only that page's chunks in your vector store
We'll implement this with both Pinecone and Weaviate. At the end, we'll compare the cost of this approach against naive daily re-crawling.
The Problem with "Re-Crawl Everything Every 24h"
Suppose your knowledge base covers 500 documentation pages. On any given day, maybe 5-10 of them actually change. But with a daily re-crawl:
- You make 500 API calls to KnowledgeSDK
- You generate embeddings for ~25,000 chunks (500 pages × ~50 chunks each)
- You upsert 25,000 vectors into Pinecone or Weaviate
- You pay for all of it, every day
At $0.02 per million tokens for embedding generation and $0.005 per KnowledgeSDK extract call, a 500-page daily re-crawl costs roughly $3.25/day, or about $1,186/year, for content that changes 1-2% per day.
The webhook approach: you scrape the 5-10 pages that actually changed, when they change. Same data quality, roughly 20x cheaper.
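To make that arithmetic concrete, here is a minimal back-of-the-envelope sketch in Python. It assumes the unit prices above, ~50 chunks per page (as in the estimate above), and roughly 1,000 tokens per chunk; plug in your own numbers.
# Back-of-the-envelope cost model (assumed unit prices; adjust to your providers)
EXTRACT_COST = 0.005             # per KnowledgeSDK extract call (assumed)
EMBED_COST_PER_M_TOKENS = 0.02   # per 1M tokens, text-embedding-3-small
UPSERT_COST_PER_1K = 0.01        # per 1,000 vector upserts (assumed)

def daily_cost(pages_scraped: int, chunks_per_page: int = 50, tokens_per_chunk: int = 1_000) -> float:
    chunks = pages_scraped * chunks_per_page
    extract = pages_scraped * EXTRACT_COST
    embed = chunks * tokens_per_chunk / 1_000_000 * EMBED_COST_PER_M_TOKENS
    upsert = chunks / 1_000 * UPSERT_COST_PER_1K
    return extract + embed + upsert

print(f"Daily re-crawl: ${daily_cost(500):.2f}/day")    # ~$3.25
print(f"Webhook-driven: ${daily_cost(10):.3f}/day")     # ~$0.065, plus webhook monitoring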
Architecture
Initial crawl
↓
KnowledgeSDK /v1/extract (all URLs)
↓
Chunk + embed + upsert to Pinecone/Weaviate
↓
Register webhook per URL → KnowledgeSDK
↓
[page changes]
↓
KnowledgeSDK fires webhook
↓
Your handler re-scrapes URL
↓
Delete old chunks for that URL
↓
Embed + upsert new chunks
The key insight: every chunk in your vector store should be tagged with its source URL. When a webhook fires for URL X, you delete all chunks tagged with URL X and replace them. You never touch the other 490 pages.
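Concretely, each stored vector carries metadata shaped like the sketch below. The field names are simply the ones this tutorial uses; only source_url is essential, because it is the deletion key.
# Illustrative shape of one stored chunk (values truncated)
chunk_record = {
    "id": "https%3A%2F%2Fdocs.example.com%2Fpricing-chunk-3",   # URL-derived, stable across re-indexes
    "values": [0.0123, -0.0456],                                 # the embedding vector
    "metadata": {
        "source_url": "https://docs.example.com/pricing",        # deletion key when a webhook fires
        "title": "Pricing",
        "chunk_index": 3,
        "text": "...chunk text...",
    },
}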
Step 1: Initial Full Crawl
Install the SDK:
Node.js:
npm install @knowledgesdk/node openai @pinecone-database/pinecone
Python:
pip install knowledgesdk openai pinecone-client weaviate-client
Full Crawl and Index to Pinecone
Node.js:
import KnowledgeSDK from '@knowledgesdk/node';
import OpenAI from 'openai';
import { Pinecone } from '@pinecone-database/pinecone';
const knowledge = new KnowledgeSDK({ apiKey: process.env.KNOWLEDGESDK_API_KEY });
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY });
const index = pinecone.index('rag-knowledge-base');
function chunkMarkdown(markdown, chunkSize = 800, overlap = 100) {
  const words = markdown.split(' ');
  const chunks = [];
  let i = 0;
  while (i < words.length) {
    const chunk = words.slice(i, i + chunkSize).join(' ');
    chunks.push(chunk);
    i += chunkSize - overlap;
  }
  return chunks;
}

async function embedText(text) {
  const response = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: text,
  });
  return response.data[0].embedding;
}

async function indexPage(url) {
  // Scrape the page
  const result = await knowledge.extract(url, { includeMarkdown: true });

  // Chunk the markdown
  const chunks = chunkMarkdown(result.markdown);

  // Delete existing chunks for this URL so updates replace stale vectors
  await index.deleteMany({ source_url: { $eq: url } });

  // Embed and upsert new chunks
  const vectors = await Promise.all(
    chunks.map(async (chunk, i) => ({
      id: `${encodeURIComponent(url)}-chunk-${i}`,
      values: await embedText(chunk),
      metadata: {
        source_url: url,
        title: result.title,
        chunk_index: i,
        text: chunk,
      },
    }))
  );
  await index.upsert(vectors);

  console.log(`Indexed ${chunks.length} chunks from ${url}`);
}
// Initial crawl
const urls = [
  'https://docs.example.com/getting-started',
  'https://docs.example.com/api-reference',
  'https://docs.example.com/pricing',
  // ... more URLs
];

for (const url of urls) {
  await indexPage(url);
}
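One assumption in the code above: the rag-knowledge-base index already exists. If it doesn't, a one-time setup sketch using the Python pinecone client installed earlier might look like this (the serverless cloud and region values are placeholders):
# One-time index setup (sketch). Dimension 1536 matches text-embedding-3-small.
import os
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
if "rag-knowledge-base" not in pc.list_indexes().names():
    pc.create_index(
        name="rag-knowledge-base",
        dimension=1536,        # output size of text-embedding-3-small
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),  # placeholder values
    )
One caveat worth checking in the Pinecone docs: metadata-filtered deletes (used in indexPage above) are not supported on every index type. If yours doesn't support them, the usual workaround is to delete by ID prefix, which the URL-derived IDs above already make possible.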
Python (with Weaviate):
import os

import weaviate
from weaviate.classes.query import Filter
from knowledgesdk import KnowledgeSDK
from openai import OpenAI

knowledge = KnowledgeSDK(api_key=os.environ["KNOWLEDGESDK_API_KEY"])
openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
weaviate_client = weaviate.connect_to_local()
collection = weaviate_client.collections.get("KnowledgeBase")
def chunk_markdown(markdown: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    words = markdown.split()
    chunks = []
    i = 0
    while i < len(words):
        chunk = " ".join(words[i:i + chunk_size])
        chunks.append(chunk)
        i += chunk_size - overlap
    return chunks

def embed_text(text: str) -> list[float]:
    response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return response.data[0].embedding

def index_page(url: str):
    # Scrape and extract
    result = knowledge.extract(url, include_markdown=True)
    chunks = chunk_markdown(result.markdown)

    # Delete existing objects for this URL so updates replace stale chunks
    collection.data.delete_many(
        where=Filter.by_property("source_url").equal(url)
    )

    # Insert the new chunks in a batch
    with collection.batch.dynamic() as batch:
        for i, chunk in enumerate(chunks):
            embedding = embed_text(chunk)
            batch.add_object(
                properties={
                    "source_url": url,
                    "title": result.title,
                    "chunk_index": i,
                    "text": chunk,
                },
                vector=embedding,
            )

    print(f"Indexed {len(chunks)} chunks from {url}")
# Initial crawl
urls = [
    "https://docs.example.com/getting-started",
    "https://docs.example.com/api-reference",
    "https://docs.example.com/pricing",
]

for url in urls:
    index_page(url)
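The Weaviate code assumes a KnowledgeBase collection already exists. A minimal one-time setup sketch, with the vectorizer disabled because we supply our own embeddings (property names match the code above):
# One-time collection setup (sketch); run before the initial crawl
import weaviate
from weaviate.classes.config import Configure, DataType, Property

client = weaviate.connect_to_local()
if not client.collections.exists("KnowledgeBase"):
    client.collections.create(
        "KnowledgeBase",
        vectorizer_config=Configure.Vectorizer.none(),  # vectors are provided client-side
        properties=[
            Property(name="source_url", data_type=DataType.TEXT),
            Property(name="title", data_type=DataType.TEXT),
            Property(name="chunk_index", data_type=DataType.INT),
            Property(name="text", data_type=DataType.TEXT),
        ],
    )
client.close()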
Step 2: Register Webhook Subscriptions
After the initial crawl, register a webhook for each URL with KnowledgeSDK. When the content changes, KnowledgeSDK will POST to your callback URL.
Node.js:
async function registerWebhooks(urls, callbackUrl) {
  const results = [];
  for (const url of urls) {
    const response = await fetch('https://api.knowledgesdk.com/v1/webhooks', {
      method: 'POST',
      headers: {
        'x-api-key': process.env.KNOWLEDGESDK_API_KEY,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({
        url,
        callbackUrl,
        events: ['content_changed'],
      }),
    });
    const data = await response.json();
    results.push(data);
    console.log(`Registered webhook for: ${url} → ${data.webhookId}`);
  }
  return results;
}
await registerWebhooks(urls, 'https://your-app.com/webhooks/rag-update');
Python:
import httpx
def register_webhooks(urls: list[str], callback_url: str) -> list[dict]:
    results = []
    with httpx.Client() as http:
        for url in urls:
            response = http.post(
                "https://api.knowledgesdk.com/v1/webhooks",
                headers={"x-api-key": os.environ["KNOWLEDGESDK_API_KEY"]},
                json={
                    "url": url,
                    "callbackUrl": callback_url,
                    "events": ["content_changed"],
                },
            )
            data = response.json()
            results.append(data)
            print(f"Registered webhook for: {url} → {data['webhookId']}")
    return results
register_webhooks(urls, "https://your-app.com/webhooks/rag-update")
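The webhookId values KnowledgeSDK returns are worth persisting: you'll want them later to deregister pages you stop tracking or to audit which URLs are monitored. A quick sketch, assuming the response echoes back the url alongside webhookId (the JSON file is just an illustration; use whatever store you already have):
import json

def save_webhook_registry(registrations: list[dict], path: str = "webhook_registry.json") -> None:
    # Map each monitored URL to its webhook ID for later deregistration or auditing
    registry = {r["url"]: r["webhookId"] for r in registrations}
    with open(path, "w") as f:
        json.dump(registry, f, indent=2)

registrations = register_webhooks(urls, "https://your-app.com/webhooks/rag-update")
save_webhook_registry(registrations)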
Step 3: Handle Webhook Events
When KnowledgeSDK detects a change, it fires a POST to your callback URL. Your handler re-scrapes just that one URL and updates the vector store.
FastAPI Webhook Handler (Python + Weaviate)
import hashlib
import hmac
import os

from fastapi import FastAPI, HTTPException, Request

app = FastAPI()
WEBHOOK_SECRET = os.environ["KNOWLEDGESDK_WEBHOOK_SECRET"]

def verify_webhook_signature(payload: bytes, signature: str) -> bool:
    expected = hmac.new(
        WEBHOOK_SECRET.encode(),
        payload,
        hashlib.sha256,
    ).hexdigest()
    return hmac.compare_digest(f"sha256={expected}", signature)

@app.post("/webhooks/rag-update")
async def handle_rag_update(request: Request):
    body = await request.body()
    signature = request.headers.get("x-knowledgesdk-signature", "")
    if not verify_webhook_signature(body, signature):
        raise HTTPException(status_code=401, detail="Invalid signature")

    payload = await request.json()
    if payload.get("event") != "content_changed":
        return {"status": "ignored"}

    changed_url = payload["url"]
    print(f"Content changed at: {changed_url}, re-indexing...")

    # Re-index only this URL (index_page is the function from Step 1)
    index_page(changed_url)

    return {"status": "updated", "url": changed_url}
Express Webhook Handler (Node.js + Pinecone)
import express from 'express';
import crypto from 'crypto';
const app = express();
app.use(express.raw({ type: 'application/json' }));
function verifySignature(payload, signature, secret) {
  if (!signature) return false;
  const expected = `sha256=${crypto
    .createHmac('sha256', secret)
    .update(payload)
    .digest('hex')}`;
  const expectedBuf = Buffer.from(expected);
  const signatureBuf = Buffer.from(signature);
  // timingSafeEqual throws if lengths differ, so check that first
  if (expectedBuf.length !== signatureBuf.length) return false;
  return crypto.timingSafeEqual(expectedBuf, signatureBuf);
}

app.post('/webhooks/rag-update', async (req, res) => {
  const signature = req.headers['x-knowledgesdk-signature'];
  if (!verifySignature(req.body, signature, process.env.KNOWLEDGESDK_WEBHOOK_SECRET)) {
    return res.status(401).json({ error: 'Invalid signature' });
  }

  const payload = JSON.parse(req.body);
  if (payload.event !== 'content_changed') {
    return res.json({ status: 'ignored' });
  }

  const changedUrl = payload.url;
  console.log(`Content changed at: ${changedUrl}, re-indexing...`);

  // Re-index only this URL
  await indexPage(changedUrl);

  res.json({ status: 'updated', url: changedUrl });
});
app.listen(3000);
Step 4: Query Your RAG Pipeline
With the index built and webhooks keeping it fresh, querying is standard:
Python:
def query_rag(question: str, top_k: int = 5) -> str:
    # Embed the question
    q_embedding = embed_text(question)

    # Search Weaviate
    results = collection.query.near_vector(
        near_vector=q_embedding,
        limit=top_k,
        return_properties=["text", "source_url", "title"],
    )

    # Build context
    context_parts = []
    for obj in results.objects:
        context_parts.append(
            f"Source: {obj.properties['source_url']}\n{obj.properties['text']}"
        )
    context = "\n\n---\n\n".join(context_parts)

    # Generate answer
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer based on the provided context. Cite sources."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
# Example queries
answer = query_rag("What changed in the latest API version?")
print(answer)
Cost Comparison: Webhooks vs. Daily Re-Crawl
For a 500-page knowledge base where ~10 pages change per day:
Daily Re-Crawl Approach
| Operation | Quantity | Cost |
|---|---|---|
| KnowledgeSDK extracts | 500/day | $2.50/day |
| Embedding generation | 25,000 chunks/day | $0.50/day |
| Pinecone upserts | 25,000 vectors/day | $0.25/day |
| Total | | $3.25/day ($1,186/year) |
Webhook-Driven Approach
| Operation | Quantity | Cost |
|---|---|---|
| KnowledgeSDK extracts | 10/day (only changed) | $0.05/day |
| Embedding generation | 500 chunks/day | $0.01/day |
| Pinecone upserts | 500 vectors/day | $0.005/day |
| Webhook monitoring | 500 URLs | $0.10/day |
| Total | | $0.165/day ($60/year) |
The webhook approach is ~20x cheaper and produces a more accurate knowledge base because updates are applied within minutes of a page changing, not up to 24 hours later.
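The gap narrows as more of your corpus changes each day, so it's worth knowing where the break-even point sits. A quick sketch using the same assumed unit prices as earlier, including the $0.10/day webhook monitoring fee for 500 URLs:
# Per-page cost: one extract + ~50 chunks embedded (~1,000 tokens each) + 50 upserts
PER_PAGE_COST = 0.005 + 50 * 1_000 / 1_000_000 * 0.02 + 50 / 1_000 * 0.01   # = $0.0065
MONITORING_PER_DAY = 0.10   # assumed webhook monitoring fee for 500 URLs

full_recrawl = 500 * PER_PAGE_COST   # ~$3.25/day
for changed_pages in (10, 50, 100, 250, 485, 500):
    webhook = changed_pages * PER_PAGE_COST + MONITORING_PER_DAY
    print(f"{changed_pages:3d} changed pages/day -> webhooks ${webhook:.2f} vs full re-crawl ${full_recrawl:.2f}")
With these numbers, webhooks stay cheaper unless close to the entire corpus (around 485 of the 500 pages) changes every single day.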
Handling Edge Cases
Pages That Are Removed
When a page is removed from the site, KnowledgeSDK fires a page_removed event. Handle it by deleting those chunks:
Python:
@app.post("/webhooks/rag-update")
async def handle_rag_update(request: Request):
    # Signature verification omitted here for brevity; reuse verify_webhook_signature from Step 3
    payload = await request.json()
    if payload["event"] == "content_changed":
        index_page(payload["url"])
    elif payload["event"] == "page_removed":
        # Delete all chunks for this URL
        collection.data.delete_many(
            where=Filter.by_property("source_url").equal(payload["url"])
        )
        print(f"Removed chunks for deleted page: {payload['url']}")
    return {"status": "ok"}
Webhook Delivery Failures
If your webhook handler is down, KnowledgeSDK retries with exponential backoff. But for extra safety, run a lightweight daily reconciliation job that checks a few high-value URLs manually:
Node.js:
// Run as a cron job once a day — only for critical pages
const criticalUrls = [
  'https://docs.example.com/pricing',
  'https://docs.example.com/api-reference',
];

for (const url of criticalUrls) {
  await indexPage(url);
  console.log(`Reconciled: ${url}`);
}
Freshness Comparison Table
| Approach | Update latency | Cost (500 pages, 10 changes/day) | Accuracy |
|---|---|---|---|
| Daily full re-crawl | Up to 24h | $3.25/day | Good |
| Hourly full re-crawl | Up to 1h | $78/day | Good |
| Webhook-driven | 1-5 minutes | $0.17/day | Excellent |
| Manual updates | Days/weeks | $0 API cost (high staleness risk) | Poor |
Conclusion
The initial RAG build is the easy part. Keeping it fresh is the operational challenge most tutorials skip. Naive daily re-crawling wastes roughly 98% of your compute on pages that haven't changed, and still leaves you with knowledge that can be up to 24 hours stale.
The webhook pattern solves both problems: lower cost and lower latency. KnowledgeSDK handles the change detection; your pipeline re-processes only what actually changed.
Build a RAG pipeline that stays fresh automatically — start your free KnowledgeSDK trial at knowledgesdk.com.