Tutorial · March 20, 2026 · 15 min read

How to Keep Your RAG Pipeline Fresh Without Re-Indexing Everything

Stop re-crawling your entire knowledge base every 24 hours. Use KnowledgeSDK webhooks to update only changed pages in Pinecone or Weaviate, at roughly 20x lower cost.


Every RAG tutorial covers the same ground: chunk your documents, embed them, store in a vector database, retrieve at query time. There are hundreds of guides showing you how to build the pipeline once.

Almost none of them tell you what to do on day two.

Your knowledge base goes stale. Documentation changes. Pricing pages update. Product features get added. If your RAG pipeline was built from a crawl three weeks ago, it's already wrong. And the naive fix — re-crawl everything every 24 hours — is expensive and slow.

This tutorial covers the right pattern: incremental updates using webhooks. You'll build a system that:

  1. Does a full initial crawl with KnowledgeSDK
  2. Registers webhook subscriptions for each URL
  3. When a page changes, re-scrapes only that URL and updates only those chunks in your vector store

We'll implement this with both Pinecone and Weaviate. At the end, we'll compare the cost of this approach against naive daily re-crawling.


The Problem with "Re-Crawl Everything Every 24h"

Suppose your knowledge base covers 500 documentation pages. On any given day, maybe 5-10 of them actually change. But with a daily re-crawl:

  • You make 500 API calls to KnowledgeSDK
  • You generate embeddings for ~25,000 chunks (500 pages × ~50 chunks each)
  • You upsert 25,000 vectors into Pinecone or Weaviate
  • You pay for all of it, every day

At roughly $0.02 per million tokens for embedding generation and $0.005 per KnowledgeSDK extract call, a 500-page daily re-crawl costs about $3.25/day, or roughly $1,186/year, for content that changes 1-2% per day.

The webhook approach: you scrape only the 5-10 pages that actually changed. Same data quality, roughly 20x cheaper.
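The arithmetic can be sanity-checked in a few lines of Python. The per-unit rates below are the illustrative figures used in this post, not published pricing:

```python
# Back-of-envelope cost model for the two refresh strategies.
# All rates are illustrative assumptions, not real price sheets.
EXTRACT_COST = 0.005            # $ per KnowledgeSDK extract call (assumed)
EMBED_COST_PER_M_TOKENS = 0.02  # $ per 1M embedding tokens (assumed)
TOKENS_PER_CHUNK = 1_000        # ~50 chunks/page at ~1,000 tokens each
CHUNKS_PER_PAGE = 50

def daily_cost(pages_scraped: int, upsert_cost: float, monitoring: float = 0.0) -> float:
    # Cost = extraction + embedding + vector upserts (+ optional monitoring fee)
    chunks = pages_scraped * CHUNKS_PER_PAGE
    extract = pages_scraped * EXTRACT_COST
    embed = chunks * TOKENS_PER_CHUNK / 1_000_000 * EMBED_COST_PER_M_TOKENS
    return round(extract + embed + upsert_cost + monitoring, 3)

print(daily_cost(500, upsert_cost=0.25))                   # naive daily re-crawl: 3.25
print(daily_cost(10, upsert_cost=0.005, monitoring=0.10))  # webhook-driven: 0.165
```

Plugging in your own page counts and change rates tells you quickly whether the webhook plumbing is worth it for your corpus.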


Architecture

Initial crawl
    ↓
KnowledgeSDK /v1/extract (all URLs)
    ↓
Chunk + embed + upsert to Pinecone/Weaviate
    ↓
Register webhook per URL → KnowledgeSDK
    ↓
                    [page changes]
                         ↓
             KnowledgeSDK fires webhook
                         ↓
             Your handler re-scrapes URL
                         ↓
             Delete old chunks for that URL
                         ↓
             Embed + upsert new chunks

The key insight: every chunk in your vector store should be tagged with its source URL. When a webhook fires for URL X, you delete all chunks tagged with URL X and replace them. You never touch the other 490 pages.


Step 1: Initial Full Crawl

Install the SDK:

Node.js:

npm install @knowledgesdk/node openai @pinecone-database/pinecone

Python:

pip install knowledgesdk openai pinecone-client weaviate-client

Full Crawl and Index to Pinecone

Node.js:

import KnowledgeSDK from '@knowledgesdk/node';
import OpenAI from 'openai';
import { Pinecone } from '@pinecone-database/pinecone';

const knowledge = new KnowledgeSDK({ apiKey: process.env.KNOWLEDGESDK_API_KEY });
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY });
const index = pinecone.index('rag-knowledge-base');

function chunkMarkdown(markdown, chunkSize = 800, overlap = 100) {
  const words = markdown.split(' ');
  const chunks = [];
  let i = 0;

  while (i < words.length) {
    const chunk = words.slice(i, i + chunkSize).join(' ');
    chunks.push(chunk);
    i += chunkSize - overlap;
  }

  return chunks;
}

async function embedText(text) {
  const response = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: text,
  });
  return response.data[0].embedding;
}

async function indexPage(url) {
  // Scrape the page
  const result = await knowledge.extract(url, { includeMarkdown: true });

  // Chunk the markdown
  const chunks = chunkMarkdown(result.markdown);

  // Delete existing chunks for this URL so an update fully replaces them
  // (note: metadata-filter deletes require a pod-based Pinecone index)
  await index.deleteMany({ source_url: { $eq: url } });

  // Embed and upsert new chunks
  const vectors = await Promise.all(
    chunks.map(async (chunk, i) => ({
      id: `${encodeURIComponent(url)}-chunk-${i}`,
      values: await embedText(chunk),
      metadata: {
        source_url: url,
        title: result.title,
        chunk_index: i,
        text: chunk,
      },
    }))
  );

  await index.upsert(vectors);
  console.log(`Indexed ${chunks.length} chunks from ${url}`);
}

// Initial crawl
const urls = [
  'https://docs.example.com/getting-started',
  'https://docs.example.com/api-reference',
  'https://docs.example.com/pricing',
  // ... more URLs
];

for (const url of urls) {
  await indexPage(url);
}

Python (with Weaviate):

import os
import weaviate
from knowledgesdk import KnowledgeSDK
from openai import OpenAI

knowledge = KnowledgeSDK(api_key=os.environ["KNOWLEDGESDK_API_KEY"])
openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
weaviate_client = weaviate.connect_to_local()

collection = weaviate_client.collections.get("KnowledgeBase")

def chunk_markdown(markdown: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    words = markdown.split()
    chunks = []
    i = 0
    while i < len(words):
        chunk = " ".join(words[i:i + chunk_size])
        chunks.append(chunk)
        i += chunk_size - overlap
    return chunks

def embed_text(text: str) -> list[float]:
    response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return response.data[0].embedding

def index_page(url: str):
    # Scrape and extract
    result = knowledge.extract(url, include_markdown=True)
    chunks = chunk_markdown(result.markdown)

    # Delete existing objects for this URL
    collection.data.delete_many(
        where=weaviate.classes.query.Filter.by_property("source_url").equal(url)
    )

    # Upsert new chunks
    with collection.batch.dynamic() as batch:
        for i, chunk in enumerate(chunks):
            embedding = embed_text(chunk)
            batch.add_object(
                properties={
                    "source_url": url,
                    "title": result.title,
                    "chunk_index": i,
                    "text": chunk,
                },
                vector=embedding,
            )

    print(f"Indexed {len(chunks)} chunks from {url}")

# Initial crawl
urls = [
    "https://docs.example.com/getting-started",
    "https://docs.example.com/api-reference",
    "https://docs.example.com/pricing",
]

for url in urls:
    index_page(url)
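Before kicking off a full 500-page crawl, note that `embed_text` above makes one API request per chunk. The OpenAI embeddings endpoint also accepts a list of inputs, so a batched variant cuts round trips roughly 100-fold (a sketch: the batch size of 100 is an arbitrary choice, and the client is passed in explicitly so the helper stands alone):

```python
def embed_batch(client, texts: list[str], batch_size: int = 100) -> list[list[float]]:
    # One request per batch instead of one per chunk; the endpoint
    # returns embeddings in the same order as the inputs were given.
    vectors: list[list[float]] = []
    for start in range(0, len(texts), batch_size):
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=texts[start:start + batch_size],
        )
        vectors.extend(item.embedding for item in response.data)
    return vectors
```

In `index_page`, you would call `embed_batch(openai_client, chunks)` once instead of `embed_text` per chunk.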

Step 2: Register Webhook Subscriptions

After the initial crawl, register a webhook for each URL with KnowledgeSDK. When the content changes, KnowledgeSDK will POST to your callback URL.

Node.js:

async function registerWebhooks(urls, callbackUrl) {
  const results = [];

  for (const url of urls) {
    const response = await fetch('https://api.knowledgesdk.com/v1/webhooks', {
      method: 'POST',
      headers: {
        'x-api-key': process.env.KNOWLEDGESDK_API_KEY,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({
        url,
        callbackUrl,
        events: ['content_changed'],
      }),
    });

    const data = await response.json();
    results.push(data);
    console.log(`Registered webhook for ${url}: ${data.webhookId}`);
  }

  return results;
}

await registerWebhooks(urls, 'https://your-app.com/webhooks/rag-update');

Python:

import os

import httpx

def register_webhooks(urls: list[str], callback_url: str) -> list[dict]:
    results = []

    with httpx.Client() as http:
        for url in urls:
            response = http.post(
                "https://api.knowledgesdk.com/v1/webhooks",
                headers={"x-api-key": os.environ["KNOWLEDGESDK_API_KEY"]},
                json={
                    "url": url,
                    "callbackUrl": callback_url,
                    "events": ["content_changed"],
                },
            )
            data = response.json()
            results.append(data)
            print(f"Registered webhook for {url}: {data['webhookId']}")

    return results

register_webhooks(urls, "https://your-app.com/webhooks/rag-update")

Step 3: Handle Webhook Events

When KnowledgeSDK detects a change, it fires a POST to your callback URL. Your handler re-scrapes just that one URL and updates the vector store.

FastAPI Webhook Handler (Python + Weaviate)

from fastapi import FastAPI, Request, HTTPException
import hmac
import hashlib

app = FastAPI()

WEBHOOK_SECRET = os.environ["KNOWLEDGESDK_WEBHOOK_SECRET"]

def verify_webhook_signature(payload: bytes, signature: str) -> bool:
    expected = hmac.new(
        WEBHOOK_SECRET.encode(),
        payload,
        hashlib.sha256,
    ).hexdigest()
    return hmac.compare_digest(f"sha256={expected}", signature)

@app.post("/webhooks/rag-update")
async def handle_rag_update(request: Request):
    body = await request.body()
    signature = request.headers.get("x-knowledgesdk-signature", "")

    if not verify_webhook_signature(body, signature):
        raise HTTPException(status_code=401, detail="Invalid signature")

    payload = await request.json()

    if payload.get("event") != "content_changed":
        return {"status": "ignored"}

    changed_url = payload["url"]
    print(f"Content changed at: {changed_url}, re-indexing...")

    # Re-index only this URL
    index_page(changed_url)

    return {"status": "updated", "url": changed_url}
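One practical wrinkle: a page edited several times in quick succession may generate several content_changed events. A small in-process guard keeps the handler from re-indexing the same URL concurrently (a sketch assuming a single worker process; `ReindexDeduper` is a name made up for this post, and multi-worker deployments would want a shared queue instead):

```python
import threading

class ReindexDeduper:
    """Guard so bursts of change events for one URL don't trigger
    overlapping re-indexes (in-process only; a hypothetical helper)."""

    def __init__(self) -> None:
        self._in_flight: set[str] = set()
        self._lock = threading.Lock()

    def try_acquire(self, url: str) -> bool:
        # True if the caller should re-index this URL now,
        # False if another request is already doing so.
        with self._lock:
            if url in self._in_flight:
                return False
            self._in_flight.add(url)
            return True

    def release(self, url: str) -> None:
        with self._lock:
            self._in_flight.discard(url)
```

In the handler, call `try_acquire(changed_url)` before `index_page` and `release` in a `finally` block; events that lose the race can simply be acknowledged and left to the sender's retry mechanism.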

Express Webhook Handler (Node.js + Pinecone)

import express from 'express';
import crypto from 'crypto';

const app = express();
app.use(express.raw({ type: 'application/json' }));

function verifySignature(payload, signature, secret) {
  const expected = `sha256=${crypto
    .createHmac('sha256', secret)
    .update(payload)
    .digest('hex')}`;
  const sigBuf = Buffer.from(signature || '');
  const expBuf = Buffer.from(expected);
  // timingSafeEqual throws when lengths differ, so bail out early
  if (sigBuf.length !== expBuf.length) return false;
  return crypto.timingSafeEqual(expBuf, sigBuf);
}

app.post('/webhooks/rag-update', async (req, res) => {
  const signature = req.headers['x-knowledgesdk-signature'];

  if (!verifySignature(req.body, signature, process.env.KNOWLEDGESDK_WEBHOOK_SECRET)) {
    return res.status(401).json({ error: 'Invalid signature' });
  }

  const payload = JSON.parse(req.body);

  if (payload.event !== 'content_changed') {
    return res.json({ status: 'ignored' });
  }

  const changedUrl = payload.url;
  console.log(`Content changed at: ${changedUrl}, re-indexing...`);

  // Re-index only this URL
  await indexPage(changedUrl);

  res.json({ status: 'updated', url: changedUrl });
});

app.listen(3000);

Step 4: Query Your RAG Pipeline

With the index built and webhooks keeping it fresh, querying is standard:

Python:

def query_rag(question: str, top_k: int = 5) -> str:
    # Embed the question
    q_embedding = embed_text(question)

    # Search Weaviate
    results = collection.query.near_vector(
        near_vector=q_embedding,
        limit=top_k,
        return_properties=["text", "source_url", "title"],
    )

    # Build context
    context_parts = []
    for obj in results.objects:
        context_parts.append(
            f"Source: {obj.properties['source_url']}\n{obj.properties['text']}"
        )
    context = "\n\n---\n\n".join(context_parts)

    # Generate answer
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer based on the provided context. Cite sources."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )

    return response.choices[0].message.content

# Example queries
answer = query_rag("What changed in the latest API version?")
print(answer)

Cost Comparison: Webhooks vs. Daily Re-Crawl

For a 500-page knowledge base where ~10 pages change per day:

Daily Re-Crawl Approach

Operation              Quantity            Cost
KnowledgeSDK extracts  500/day             $2.50/day
Embedding generation   25,000 chunks/day   $0.50/day
Pinecone upserts       25,000 vectors/day  $0.25/day
Total                                      $3.25/day ($1,186/year)

Webhook-Driven Approach

Operation              Quantity               Cost
KnowledgeSDK extracts  10/day (only changed)  $0.05/day
Embedding generation   500 chunks/day         $0.01/day
Pinecone upserts       500 vectors/day        $0.005/day
Webhook monitoring     500 URLs               $0.10/day
Total                                         $0.165/day ($60/year)

The webhook approach is ~20x cheaper and produces a more accurate knowledge base because updates are applied within minutes of a page changing, not up to 24 hours later.


Handling Edge Cases

Pages That Are Removed

When a page is removed from the site, KnowledgeSDK fires a page_removed event. Handle it by deleting those chunks:

Python:

@app.post("/webhooks/rag-update")
async def handle_rag_update(request: Request):
    payload = await request.json()

    if payload["event"] == "content_changed":
        index_page(payload["url"])
    elif payload["event"] == "page_removed":
        # Delete all chunks for this URL
        collection.data.delete_many(
            where=weaviate.classes.query.Filter.by_property("source_url").equal(payload["url"])
        )
        print(f"Removed chunks for deleted page: {payload['url']}")

    return {"status": "ok"}

Webhook Delivery Failures

If your webhook handler is down, KnowledgeSDK retries with exponential backoff. But for extra safety, run a lightweight daily reconciliation job that checks a few high-value URLs manually:

Node.js:

// Run as a cron job once a day — only for critical pages
const criticalUrls = [
  'https://docs.example.com/pricing',
  'https://docs.example.com/api-reference',
];

for (const url of criticalUrls) {
  await indexPage(url);
  console.log(`Reconciled: ${url}`);
}
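For broader coverage than a hand-picked critical list, a daily job can store a content hash per URL and re-embed only on drift (a Python sketch: `fetch_markdown` and `reindex` are stand-ins for `knowledge.extract` and `index_page`, and `stored_hashes` would live in a small database):

```python
import hashlib

def content_hash(markdown: str) -> str:
    # Stable fingerprint of a page's extracted markdown
    return hashlib.sha256(markdown.encode("utf-8")).hexdigest()

def reconcile(urls, fetch_markdown, stored_hashes, reindex):
    """Re-embed only URLs whose scraped content actually changed.

    stored_hashes maps url -> last-indexed content hash; fetch_markdown
    and reindex are injected so this sketch stays backend-agnostic.
    """
    changed = []
    for url in urls:
        h = content_hash(fetch_markdown(url))
        if stored_hashes.get(url) != h:
            reindex(url)
            stored_hashes[url] = h
            changed.append(url)
    return changed
```

This still pays for one extract call per page, but skips embedding and upsert costs for pages that haven't changed, which is where most of the daily spend goes.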

Freshness Comparison Table

Approach              Update latency  Cost (500 pages, 10 changes/day)  Accuracy
Daily full re-crawl   Up to 24h       $3.25/day                         Good
Hourly full re-crawl  Up to 1h        $78/day                           Good
Webhook-driven        1-5 minutes     $0.17/day                         Excellent
Manual updates        Days/weeks      $0 (but risky)                    Poor

Conclusion

The initial RAG build is the easy part. Keeping it fresh is the operational challenge most tutorials skip. Naive daily re-crawling wastes 95% of your compute on pages that haven't changed, and still leaves you with knowledge that's up to 24 hours stale.

The webhook pattern solves both problems: lower cost and lower latency. KnowledgeSDK handles the change detection so your stack handles only what actually changed.

Build a RAG pipeline that stays fresh automatically — start your free KnowledgeSDK trial at knowledgesdk.com.
