How to Keep Your RAG Pipeline Fresh Without Re-Indexing Everything
Every RAG tutorial covers the same ground: chunk your documents, embed them, store in a vector database, retrieve at query time. There are hundreds of guides showing you how to build the pipeline once.
Almost none of them tell you what to do on day two.
Your knowledge base goes stale. Documentation changes. Pricing pages update. Product features get added. If your RAG pipeline was built from a crawl three weeks ago, it's already wrong. And the naive fix — re-crawl everything every 24 hours — is expensive and slow.
This tutorial covers the right pattern: incremental updates using webhooks. You'll build a system that:
- Does a full initial crawl with KnowledgeSDK
- Registers webhook subscriptions for each URL
- Re-scrapes only the changed URL when a page updates, and replaces only that page's chunks in your vector store
We'll implement this with both Pinecone and Weaviate. At the end, we'll compare the cost of this approach against naive daily re-crawling.
The Problem with "Re-Crawl Everything Every 24h"
Suppose your knowledge base covers 500 documentation pages. On any given day, maybe 5-10 of them actually change. But with a daily re-crawl:
- You make 500 API calls to KnowledgeSDK
- You generate embeddings for ~25,000 chunks (500 pages × ~50 chunks each)
- You upsert 25,000 vectors into Pinecone or Weaviate
- You pay for all of it, every day
At $0.02 per million tokens for embedding generation and $0.005 per KnowledgeSDK extract call, a 500-page daily re-crawl costs roughly $3.25/day, or about $1,186/year, for content that changes 1-2% per day.
The webhook approach: you scrape the 5-10 pages that actually changed, when they change. Same data quality, roughly 20x cheaper.
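To make that arithmetic concrete, here is a minimal back-of-the-envelope sketch in Python. It assumes the unit prices above, ~50 chunks per page (as in the estimate above), and roughly 1,000 tokens per chunk; plug in your own numbers.
# Back-of-the-envelope cost model (assumed unit prices; adjust to your providers)
EXTRACT_COST = 0.005             # per KnowledgeSDK extract call (assumed)
EMBED_COST_PER_M_TOKENS = 0.02   # per 1M tokens, text-embedding-3-small
UPSERT_COST_PER_1K = 0.01        # per 1,000 vector upserts (assumed)

def daily_cost(pages_scraped: int, chunks_per_page: int = 50, tokens_per_chunk: int = 1_000) -> float:
    chunks = pages_scraped * chunks_per_page
    extract = pages_scraped * EXTRACT_COST
    embed = chunks * tokens_per_chunk / 1_000_000 * EMBED_COST_PER_M_TOKENS
    upsert = chunks / 1_000 * UPSERT_COST_PER_1K
    return extract + embed + upsert

print(f"Daily re-crawl: ${daily_cost(500):.2f}/day")    # ~$3.25
print(f"Webhook-driven: ${daily_cost(10):.3f}/day")     # ~$0.065, plus webhook monitoring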
Architecture
Initial crawl
↓
KnowledgeSDK /v1/extract (all URLs)
↓
Chunk + embed + upsert to Pinecone/Weaviate
↓
Register webhook per URL → KnowledgeSDK
↓
[page changes]
↓
KnowledgeSDK fires webhook
↓
Your handler re-scrapes URL
↓
Delete old chunks for that URL
↓
Embed + upsert new chunks
The key insight: every chunk in your vector store should be tagged with its source URL. When a webhook fires for URL X, you delete all chunks tagged with URL X and replace them. You never touch the other 490 pages.
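Concretely, each stored vector carries metadata shaped like the sketch below. The field names are simply the ones this tutorial uses; only source_url is essential, because it is the deletion key.
# Illustrative shape of one stored chunk (values truncated)
chunk_record = {
    "id": "https%3A%2F%2Fdocs.example.com%2Fpricing-chunk-3",   # URL-derived, stable across re-indexes
    "values": [0.0123, -0.0456],                                 # the embedding vector
    "metadata": {
        "source_url": "https://docs.example.com/pricing",        # deletion key when a webhook fires
        "title": "Pricing",
        "chunk_index": 3,
        "text": "...chunk text...",
    },
}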
Step 1: Initial Full Crawl
Install the SDK:
Node.js:
npm install @knowledgesdk/node openai @pinecone-database/pinecone
Python:
pip install knowledgesdk openai pinecone-client weaviate-client
Full Crawl and Index to Pinecone
Node.js:
import KnowledgeSDK from '@knowledgesdk/node';
import OpenAI from 'openai';
import { Pinecone } from '@pinecone-database/pinecone';
const knowledge = new KnowledgeSDK({ apiKey: process.env.KNOWLEDGESDK_API_KEY });
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY });
const index = pinecone.index('rag-knowledge-base');
function chunkMarkdown(markdown, chunkSize = 800, overlap = 100) {
  const words = markdown.split(' ');
  const chunks = [];
  let i = 0;
  while (i < words.length) {
    const chunk = words.slice(i, i + chunkSize).join(' ');
    chunks.push(chunk);
    i += chunkSize - overlap;
  }
  return chunks;
}

async function embedText(text) {
  const response = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: text,
  });
  return response.data[0].embedding;
}

async function indexPage(url) {
  // Scrape the page
  const result = await knowledge.extract(url, { includeMarkdown: true });

  // Chunk the markdown
  const chunks = chunkMarkdown(result.markdown);

  // Delete existing chunks for this URL so updates replace stale vectors
  await index.deleteMany({ source_url: { $eq: url } });

  // Embed and upsert new chunks
  const vectors = await Promise.all(
    chunks.map(async (chunk, i) => ({
      id: `${encodeURIComponent(url)}-chunk-${i}`,
      values: await embedText(chunk),
      metadata: {
        source_url: url,
        title: result.title,
        chunk_index: i,
        text: chunk,
      },
    }))
  );
  await index.upsert(vectors);

  console.log(`Indexed ${chunks.length} chunks from ${url}`);
}
// Initial crawl
const urls = [
  'https://docs.example.com/getting-started',
  'https://docs.example.com/api-reference',
  'https://docs.example.com/pricing',
  // ... more URLs
];

for (const url of urls) {
  await indexPage(url);
}
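One assumption in the code above: the rag-knowledge-base index already exists. If it doesn't, a one-time setup sketch using the Python pinecone client installed earlier might look like this (the serverless cloud and region values are placeholders):
# One-time index setup (sketch). Dimension 1536 matches text-embedding-3-small.
import os
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
if "rag-knowledge-base" not in pc.list_indexes().names():
    pc.create_index(
        name="rag-knowledge-base",
        dimension=1536,        # output size of text-embedding-3-small
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),  # placeholder values
    )
One caveat worth checking in the Pinecone docs: metadata-filtered deletes (used in indexPage above) are not supported on every index type. If yours doesn't support them, the usual workaround is to delete by ID prefix, which the URL-derived IDs above already make possible.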
Python (with Weaviate):
import os

import weaviate
from weaviate.classes.query import Filter
from knowledgesdk import KnowledgeSDK
from openai import OpenAI

knowledge = KnowledgeSDK(api_key=os.environ["KNOWLEDGESDK_API_KEY"])
openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
weaviate_client = weaviate.connect_to_local()
collection = weaviate_client.collections.get("KnowledgeBase")
def chunk_markdown(markdown: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    words = markdown.split()
    chunks = []
    i = 0
    while i < len(words):
        chunk = " ".join(words[i:i + chunk_size])
        chunks.append(chunk)
        i += chunk_size - overlap
    return chunks

def embed_text(text: str) -> list[float]:
    response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return response.data[0].embedding

def index_page(url: str):
    # Scrape and extract
    result = knowledge.extract(url, include_markdown=True)
    chunks = chunk_markdown(result.markdown)

    # Delete existing objects for this URL so updates replace stale chunks
    collection.data.delete_many(
        where=Filter.by_property("source_url").equal(url)
    )

    # Insert the new chunks in a batch
    with collection.batch.dynamic() as batch:
        for i, chunk in enumerate(chunks):
            embedding = embed_text(chunk)
            batch.add_object(
                properties={
                    "source_url": url,
                    "title": result.title,
                    "chunk_index": i,
                    "text": chunk,
                },
                vector=embedding,
            )

    print(f"Indexed {len(chunks)} chunks from {url}")
# Initial crawl
urls = [
    "https://docs.example.com/getting-started",
    "https://docs.example.com/api-reference",
    "https://docs.example.com/pricing",
]

for url in urls:
    index_page(url)
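The Weaviate code assumes a KnowledgeBase collection already exists. A minimal one-time setup sketch, with the vectorizer disabled because we supply our own embeddings (property names match the code above):
# One-time collection setup (sketch); run before the initial crawl
import weaviate
from weaviate.classes.config import Configure, DataType, Property

client = weaviate.connect_to_local()
if not client.collections.exists("KnowledgeBase"):
    client.collections.create(
        "KnowledgeBase",
        vectorizer_config=Configure.Vectorizer.none(),  # vectors are provided client-side
        properties=[
            Property(name="source_url", data_type=DataType.TEXT),
            Property(name="title", data_type=DataType.TEXT),
            Property(name="chunk_index", data_type=DataType.INT),
            Property(name="text", data_type=DataType.TEXT),
        ],
    )
client.close()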
Step 2: Register Webhook Subscriptions
After the initial crawl, register a webhook for each URL with KnowledgeSDK. When the content changes, KnowledgeSDK will POST to your callback URL.
Node.js:
async function registerWebhooks(urls, callbackUrl) {
  const results = [];
  for (const url of urls) {
    const response = await fetch('https://api.knowledgesdk.com/v1/webhooks', {
      method: 'POST',
      headers: {
        'x-api-key': process.env.KNOWLEDGESDK_API_KEY,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({
        url,
        callbackUrl,
        events: ['content_changed'],
      }),
    });
    const data = await response.json();
    results.push(data);
    console.log(`Registered webhook for: ${url} → ${data.webhookId}`);
  }
  return results;
}
await registerWebhooks(urls, 'https://your-app.com/webhooks/rag-update');
Python:
import httpx
def register_webhooks(urls: list[str], callback_url: str) -> list[dict]:
    results = []
    with httpx.Client() as http:
        for url in urls:
            response = http.post(
                "https://api.knowledgesdk.com/v1/webhooks",
                headers={"x-api-key": os.environ["KNOWLEDGESDK_API_KEY"]},
                json={
                    "url": url,
                    "callbackUrl": callback_url,
                    "events": ["content_changed"],
                },
            )
            data = response.json()
            results.append(data)
            print(f"Registered webhook for: {url} → {data['webhookId']}")
    return results
register_webhooks(urls, "https://your-app.com/webhooks/rag-update")
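The webhookId values KnowledgeSDK returns are worth persisting: you'll want them later to deregister pages you stop tracking or to audit which URLs are monitored. A quick sketch, assuming the response echoes back the url alongside webhookId (the JSON file is just an illustration; use whatever store you already have):
import json

def save_webhook_registry(registrations: list[dict], path: str = "webhook_registry.json") -> None:
    # Map each monitored URL to its webhook ID for later deregistration or auditing
    registry = {r["url"]: r["webhookId"] for r in registrations}
    with open(path, "w") as f:
        json.dump(registry, f, indent=2)

registrations = register_webhooks(urls, "https://your-app.com/webhooks/rag-update")
save_webhook_registry(registrations)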
Step 3: Handle Webhook Events
When KnowledgeSDK detects a change, it fires a POST to your callback URL. Your handler re-scrapes just that one URL and updates the vector store.
FastAPI Webhook Handler (Python + Weaviate)
import hashlib
import hmac
import os

from fastapi import FastAPI, HTTPException, Request

app = FastAPI()
WEBHOOK_SECRET = os.environ["KNOWLEDGESDK_WEBHOOK_SECRET"]

def verify_webhook_signature(payload: bytes, signature: str) -> bool:
    expected = hmac.new(
        WEBHOOK_SECRET.encode(),
        payload,
        hashlib.sha256,
    ).hexdigest()
    return hmac.compare_digest(f"sha256={expected}", signature)

@app.post("/webhooks/rag-update")
async def handle_rag_update(request: Request):
    body = await request.body()
    signature = request.headers.get("x-knowledgesdk-signature", "")
    if not verify_webhook_signature(body, signature):
        raise HTTPException(status_code=401, detail="Invalid signature")

    payload = await request.json()
    if payload.get("event") != "content_changed":
        return {"status": "ignored"}

    changed_url = payload["url"]
    print(f"Content changed at: {changed_url}, re-indexing...")

    # Re-index only this URL (index_page is the function from Step 1)
    index_page(changed_url)

    return {"status": "updated", "url": changed_url}
Express Webhook Handler (Node.js + Pinecone)
import express from 'express';
import crypto from 'crypto';
const app = express();
app.use(express.raw({ type: 'application/json' }));
function verifySignature(payload, signature, secret) {
  if (!signature) return false;
  const expected = `sha256=${crypto
    .createHmac('sha256', secret)
    .update(payload)
    .digest('hex')}`;
  const expectedBuf = Buffer.from(expected);
  const signatureBuf = Buffer.from(signature);
  // timingSafeEqual throws if lengths differ, so check that first
  if (expectedBuf.length !== signatureBuf.length) return false;
  return crypto.timingSafeEqual(expectedBuf, signatureBuf);
}

app.post('/webhooks/rag-update', async (req, res) => {
  const signature = req.headers['x-knowledgesdk-signature'];
  if (!verifySignature(req.body, signature, process.env.KNOWLEDGESDK_WEBHOOK_SECRET)) {
    return res.status(401).json({ error: 'Invalid signature' });
  }

  const payload = JSON.parse(req.body);
  if (payload.event !== 'content_changed') {
    return res.json({ status: 'ignored' });
  }

  const changedUrl = payload.url;
  console.log(`Content changed at: ${changedUrl}, re-indexing...`);

  // Re-index only this URL
  await indexPage(changedUrl);

  res.json({ status: 'updated', url: changedUrl });
});
app.listen(3000);
Step 4: Query Your RAG Pipeline
With the index built and webhooks keeping it fresh, querying is standard:
Python:
def query_rag(question: str, top_k: int = 5) -> str:
    # Embed the question
    q_embedding = embed_text(question)

    # Search Weaviate
    results = collection.query.near_vector(
        near_vector=q_embedding,
        limit=top_k,
        return_properties=["text", "source_url", "title"],
    )

    # Build context
    context_parts = []
    for obj in results.objects:
        context_parts.append(
            f"Source: {obj.properties['source_url']}\n{obj.properties['text']}"
        )
    context = "\n\n---\n\n".join(context_parts)

    # Generate answer
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer based on the provided context. Cite sources."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
# Example queries
answer = query_rag("What changed in the latest API version?")
print(answer)
Cost Comparison: Webhooks vs. Daily Re-Crawl
For a 500-page knowledge base where ~10 pages change per day:
Daily Re-Crawl Approach
| Operation | Quantity | Cost |
|---|---|---|
| KnowledgeSDK extracts | 500/day | $2.50/day |
| Embedding generation | 25,000 chunks/day | $0.50/day |
| Pinecone upserts | 25,000 vectors/day | $0.25/day |
| Total | | $3.25/day ($1,186/year) |
Webhook-Driven Approach
| Operation | Quantity | Cost |
|---|---|---|
| KnowledgeSDK extracts | 10/day (only changed) | $0.05/day |
| Embedding generation | 500 chunks/day | $0.01/day |
| Pinecone upserts | 500 vectors/day | $0.005/day |
| Webhook monitoring | 500 URLs | $0.10/day |
| Total | | $0.165/day ($60/year) |
The webhook approach is ~20x cheaper and produces a more accurate knowledge base because updates are applied within minutes of a page changing, not up to 24 hours later.
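The gap narrows as more of your corpus changes each day, so it's worth knowing where the break-even point sits. A quick sketch using the same assumed unit prices as earlier, including the $0.10/day webhook monitoring fee for 500 URLs:
# Per-page cost: one extract + ~50 chunks embedded (~1,000 tokens each) + 50 upserts
PER_PAGE_COST = 0.005 + 50 * 1_000 / 1_000_000 * 0.02 + 50 / 1_000 * 0.01   # = $0.0065
MONITORING_PER_DAY = 0.10   # assumed webhook monitoring fee for 500 URLs

full_recrawl = 500 * PER_PAGE_COST   # ~$3.25/day
for changed_pages in (10, 50, 100, 250, 485, 500):
    webhook = changed_pages * PER_PAGE_COST + MONITORING_PER_DAY
    print(f"{changed_pages:3d} changed pages/day -> webhooks ${webhook:.2f} vs full re-crawl ${full_recrawl:.2f}")
With these numbers, webhooks stay cheaper unless close to the entire corpus (around 485 of the 500 pages) changes every single day.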
Handling Edge Cases
Pages That Are Removed
When a page is removed from the site, KnowledgeSDK fires a page_removed event. Handle it by deleting those chunks:
Python:
@app.post("/webhooks/rag-update")
async def handle_rag_update(request: Request):
    # Signature verification omitted here for brevity; reuse verify_webhook_signature from Step 3
    payload = await request.json()
    if payload["event"] == "content_changed":
        index_page(payload["url"])
    elif payload["event"] == "page_removed":
        # Delete all chunks for this URL
        collection.data.delete_many(
            where=Filter.by_property("source_url").equal(payload["url"])
        )
        print(f"Removed chunks for deleted page: {payload['url']}")
    return {"status": "ok"}
Webhook Delivery Failures
If your webhook handler is down, KnowledgeSDK retries with exponential backoff. But for extra safety, run a lightweight daily reconciliation job that checks a few high-value URLs manually:
Node.js:
// Run as a cron job once a day — only for critical pages
const criticalUrls = [
  'https://docs.example.com/pricing',
  'https://docs.example.com/api-reference',
];

for (const url of criticalUrls) {
  await indexPage(url);
  console.log(`Reconciled: ${url}`);
}
Freshness Comparison Table
| Approach | Update latency | Cost (500 pages, 10 changes/day) | Accuracy |
|---|---|---|---|
| Daily full re-crawl | Up to 24h | $3.25/day | Good |
| Hourly full re-crawl | Up to 1h | $78/day | Good |
| Webhook-driven | 1-5 minutes | $0.17/day | Excellent |
| Manual updates | Days/weeks | $0 API cost (high staleness risk) | Poor |
Conclusion
The initial RAG build is the easy part. Keeping it fresh is the operational challenge most tutorials skip. Naive daily re-crawling wastes roughly 98% of your compute on pages that haven't changed, and still leaves you with knowledge that can be up to 24 hours stale.
The webhook pattern solves both problems: lower cost and lower latency. KnowledgeSDK handles the change detection; your pipeline re-processes only what actually changed.
Build a RAG pipeline that stays fresh automatically — start your free KnowledgeSDK trial at knowledgesdk.com.