Web Scraping for RAG: Keep Your Knowledge Base Fresh (2026)
Retrieval-Augmented Generation (RAG) is the standard architecture for giving LLMs access to private or up-to-date information. Most RAG tutorials cover the "index a PDF" use case. But the harder problem — and the one that matters most in production — is keeping your knowledge base current with live web content.
This tutorial walks through the full lifecycle of web-scraped RAG:
- Scrape competitor documentation, news sites, or any web source
- Index the content into a searchable knowledge base
- Run semantic search to retrieve relevant context
- Pipe the context into GPT-4o to generate accurate, grounded responses
We'll show two approaches side by side: DIY with Firecrawl + Pinecone + cron (the standard stack most teams build) and the knowledgeSDK built-in approach (scraping + indexing + search in one API).
Why Fresh Data Is the Hard Part of RAG
Most RAG tutorials stop at "index your documents once." The reality is that web content changes constantly:
- API documentation gets updated (breaking changes, new endpoints)
- Competitor pricing pages change monthly
- News and blog content is only valuable when recent
- Product specs get updated silently
A RAG pipeline that indexes content once and never updates is just a slower, more expensive version of a static knowledge base. The real value of RAG over fine-tuning is the ability to keep knowledge current — and that requires a proper refresh strategy.
The Three Core Problems in Web RAG
- Extraction: Getting clean, LLM-ready text from arbitrary URLs (JS rendering, anti-bot, pagination)
- Indexing: Embedding and storing content so it's semantically searchable
- Freshness: Detecting changes and updating the index when content changes
Most teams solve #1 and #2, then patch #3 with a cron job that re-scrapes everything nightly. This is expensive (you're re-processing unchanged content) and slow (by the time you detect a change, it may be hours old).
Architecture Overview
DIY Stack
[Firecrawl API] → [S3 / PostgreSQL storage] → [OpenAI Embeddings API] → [Pinecone vector DB]
[Cron scheduler] → [Change detection (DIY diff)] → re-index changed pages
[Semantic search query] → [Pinecone vector DB] → [GPT-4o with context]
Components:
- Firecrawl for scraping (or Jina Reader for low-volume)
- S3 or PostgreSQL for raw content storage
- OpenAI `text-embedding-3-small` for embeddings
- Pinecone for vector storage and search
- A cron job (GitHub Actions, Inngest, etc.) for refresh scheduling
- Custom diff logic to detect changes
Approximate monthly cost at 10K pages:
- Firecrawl: ~$59/mo
- Pinecone Starter: $25/mo
- OpenAI embeddings: ~$1/mo (for 10K pages × ~2K tokens avg)
- Compute for cron + diff: ~$10-20/mo
Total: ~$95-105/mo + 3-4 weeks of initial engineering time
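The embeddings line item can be sanity-checked with quick arithmetic. The calculation below assumes OpenAI's published rate of $0.02 per 1M tokens for text-embedding-3-small — verify current pricing before relying on it:

```python
# Rough embedding-cost estimate for the DIY stack (illustrative assumptions).
PRICE_PER_MILLION_TOKENS = 0.02  # text-embedding-3-small, USD (assumed rate)

def embedding_cost(pages: int, avg_tokens_per_page: int) -> float:
    """Cost in USD of embedding every page once."""
    total_tokens = pages * avg_tokens_per_page
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS

# 10K pages at ~2K tokens each is 20M tokens, i.e. about $0.40 per full pass.
# Note: a nightly full re-embed multiplies this by ~30x per month.
print(f"${embedding_cost(10_000, 2_000):.2f}")
```

The takeaway is that embeddings themselves are cheap; the real DIY cost is the scraping bill and the engineering time around it.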
knowledgeSDK Built-In Approach
[knowledgeSDK /v1/scrape] → [Auto-indexed vector store]
[Webhook: content.changed] → [Auto re-index on change]
[knowledgeSDK /v1/search] → [GPT-4o with context]
Components:
- knowledgeSDK (handles scraping, indexing, search, and change detection)
- GPT-4o for generation
Approximate monthly cost at 10K pages:
- knowledgeSDK: $29/mo (Starter) or $99/mo (Pro)
- Total: $29-99/mo + a few hours of integration
Part 1: DIY RAG Pipeline with Firecrawl + Pinecone
Let's build the full pipeline so you understand what's involved.
Step 1: Scrape Content with Firecrawl
// Node.js
import Firecrawl from '@mendable/firecrawl-js';
import { Pinecone } from '@pinecone-database/pinecone';
import OpenAI from 'openai';
const firecrawl = new Firecrawl({ apiKey: process.env.FIRECRAWL_API_KEY });
const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY });
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
async function scrapeAndStore(url) {
console.log(`Scraping: ${url}`);
const result = await firecrawl.scrapeUrl(url, {
formats: ['markdown'],
});
if (!result.success) {
throw new Error(`Scraping failed: ${result.error}`);
}
return {
url,
markdown: result.markdown,
title: result.metadata?.title,
scrapedAt: new Date().toISOString(),
};
}
# Python equivalent
import os
from datetime import datetime, timezone

import firecrawl
import openai
from pinecone import Pinecone

fc = firecrawl.FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
openai_client = openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def scrape_and_store(url: str) -> dict:
    print(f"Scraping: {url}")
    result = fc.scrape_url(url, params={"formats": ["markdown"]})
    return {
        "url": url,
        "markdown": result.get("markdown", ""),
        "title": result.get("metadata", {}).get("title"),
        "scraped_at": datetime.now(timezone.utc).isoformat(),
    }
Step 2: Chunk and Embed Content
This is the part most tutorials gloss over. Raw markdown is often too long for a single embedding. You need to chunk it intelligently.
function chunkMarkdown(markdown, maxTokens = 800) {
// Split on H2 headers first for semantic chunks
const sections = markdown.split(/\n## /g);
const chunks = [];
for (const section of sections) {
// If section is too long, split further
if (section.split(' ').length > maxTokens) {
const paragraphs = section.split('\n\n');
let currentChunk = '';
for (const paragraph of paragraphs) {
if ((currentChunk + paragraph).split(' ').length > maxTokens) {
if (currentChunk) chunks.push(currentChunk.trim());
currentChunk = paragraph;
} else {
currentChunk += '\n\n' + paragraph;
}
}
if (currentChunk) chunks.push(currentChunk.trim());
} else {
chunks.push(section.trim());
}
}
return chunks.filter(c => c.length > 50); // Filter tiny chunks
}
async function embedChunks(chunks) {
const response = await openai.embeddings.create({
model: 'text-embedding-3-small',
input: chunks,
});
return response.data.map((item, i) => ({
embedding: item.embedding,
text: chunks[i],
}));
}
def chunk_markdown(markdown: str, max_tokens: int = 800) -> list[str]:
sections = markdown.split("\n## ")
chunks = []
for section in sections:
words = section.split()
if len(words) > max_tokens:
paragraphs = section.split("\n\n")
current_chunk = ""
for para in paragraphs:
if len((current_chunk + para).split()) > max_tokens:
if current_chunk:
chunks.append(current_chunk.strip())
current_chunk = para
else:
current_chunk += "\n\n" + para
if current_chunk:
chunks.append(current_chunk.strip())
else:
chunks.append(section.strip())
return [c for c in chunks if len(c) > 50]
def embed_chunks(chunks: list[str]) -> list[dict]:
response = openai_client.embeddings.create(
model="text-embedding-3-small",
input=chunks,
)
return [
{"embedding": item.embedding, "text": chunks[i]}
for i, item in enumerate(response.data)
]
Step 3: Index into Pinecone
async function indexInPinecone(url, chunks, embeddings) {
const index = pinecone.index('knowledge-base');
const vectors = embeddings.map((item, i) => ({
id: `${Buffer.from(url).toString('base64url')}_${i}`, // base64url avoids '/' and '+' in IDs, matching the Python version
values: item.embedding,
metadata: {
url,
text: item.text,
chunkIndex: i,
},
}));
await index.upsert(vectors);
console.log(`Indexed ${vectors.length} chunks from ${url}`);
}
def index_in_pinecone(url: str, chunks: list[str], embeddings: list[dict]):
import base64
index = pc.Index("knowledge-base")
url_id = base64.urlsafe_b64encode(url.encode()).decode()
vectors = [
{
"id": f"{url_id}_{i}",
"values": emb["embedding"],
"metadata": {"url": url, "text": chunk, "chunk_index": i},
}
for i, (chunk, emb) in enumerate(zip(chunks, embeddings))
]
index.upsert(vectors=vectors)
print(f"Indexed {len(vectors)} chunks from {url}")
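The architecture diagram also includes a semantic search step, which is easy to forget when building the DIY stack. A sketch of that retrieval half, assuming the `pc` and `openai_client` clients configured earlier — `build_context` is a hypothetical helper for prompt assembly, and the Pinecone response shape may vary slightly across client versions:

```python
# DIY retrieval: embed the question, query Pinecone, format the context.
# Sketch only — `pc` and `openai_client` are the clients set up earlier.
def search_diy(question: str, top_k: int = 5) -> list[dict]:
    index = pc.Index("knowledge-base")
    query_embedding = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=[question],
    ).data[0].embedding
    results = index.query(
        vector=query_embedding, top_k=top_k, include_metadata=True
    )
    # Each match carries the metadata we stored at index time (url, text).
    return [match["metadata"] for match in results["matches"]]

def build_context(matches: list[dict], separator: str = "\n\n---\n\n") -> str:
    """Join retrieved chunks into one prompt context, with source URLs."""
    return separator.join(
        f"Source: {m['url']}\n\n{m['text']}" for m in matches
    )
```

The context string then goes into the system prompt for generation, exactly as in the GPT-4o examples later in this post.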
Step 4: Change Detection (The Hard Part)
This is where most DIY pipelines get messy. You need to:
- Store a hash or snapshot of the current content
- Re-scrape on a schedule
- Compare new content to old content
- Re-index only changed chunks
import crypto from 'crypto';
async function checkForChanges(url, db) {
const { markdown } = await scrapeAndStore(url);
const newHash = crypto.createHash('sha256').update(markdown).digest('hex');
const stored = await db.query(
'SELECT content_hash, scraped_at FROM pages WHERE url = $1',
[url]
);
if (stored.rows.length === 0 || stored.rows[0].content_hash !== newHash) {
console.log(`Change detected at ${url}`);
// Re-index content
const chunks = chunkMarkdown(markdown);
const embeddings = await embedChunks(chunks);
await indexInPinecone(url, chunks, embeddings);
// Update the stored hash
await db.query(
`INSERT INTO pages (url, content_hash, scraped_at)
VALUES ($1, $2, NOW())
ON CONFLICT (url) DO UPDATE
SET content_hash = $2, scraped_at = NOW()`,
[url, newHash]
);
return true; // Changed
}
return false; // No change
}
This is functional but has real problems in production:
- You're re-scraping every URL on every run, even unchanged ones (costs money)
- Hash-based diffing doesn't tell you what changed
- You need to manage the database, the scheduler, the retry logic...
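For parity with the rest of the tutorial, here is a Python sketch of the same hash-and-compare step. It assumes the `scrape_and_store`, `chunk_markdown`, `embed_chunks`, and `index_in_pinecone` functions from the earlier steps, plus a psycopg-style database connection:

```python
import hashlib

def content_hash(markdown: str) -> str:
    """Stable fingerprint of the scraped content."""
    return hashlib.sha256(markdown.encode("utf-8")).hexdigest()

def check_for_changes(url: str, conn) -> bool:
    # Sketch — relies on scrape_and_store / chunk_markdown / embed_chunks /
    # index_in_pinecone from the earlier steps and a psycopg connection.
    markdown = scrape_and_store(url)["markdown"]
    new_hash = content_hash(markdown)
    with conn.cursor() as cur:
        cur.execute("SELECT content_hash FROM pages WHERE url = %s", (url,))
        row = cur.fetchone()
        if row is not None and row[0] == new_hash:
            return False  # no change
        chunks = chunk_markdown(markdown)
        index_in_pinecone(url, chunks, embed_chunks(chunks))
        cur.execute(
            """INSERT INTO pages (url, content_hash, scraped_at)
               VALUES (%s, %s, NOW())
               ON CONFLICT (url) DO UPDATE
               SET content_hash = EXCLUDED.content_hash, scraped_at = NOW()""",
            (url, new_hash),
        )
        conn.commit()
    return True  # changed
```

Note that the expensive scrape happens before the hash comparison — the hash only saves you the re-embedding and re-indexing, not the scraping cost.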
Part 2: The knowledgeSDK Approach
Now let's look at the same pipeline with knowledgeSDK. The entire Part 1 above collapses into a few API calls.
Step 1: Scrape and Auto-Index
import { KnowledgeSDK } from '@knowledgesdk/node';
const client = new KnowledgeSDK({ apiKey: process.env.KNOWLEDGESDK_API_KEY });
// Scrape a URL — content is automatically indexed for search
const page = await client.scrape({ url: 'https://stripe.com/docs/api' });
console.log(page.markdown);
// Content is now searchable immediately — no embedding, no Pinecone, no extra steps
import os

from knowledgesdk import KnowledgeSDK

client = KnowledgeSDK(api_key=os.environ["KNOWLEDGESDK_API_KEY"])
# Scrape and auto-index in one call
page = client.scrape(url="https://stripe.com/docs/api")
print(page.markdown)
# Already indexed — ready to search
Step 2: Subscribe to Changes (No Polling)
// Subscribe once — receive webhook when content changes
await client.webhooks.subscribe({
url: 'https://stripe.com/docs/api',
callbackUrl: 'https://your-app.com/webhooks/knowledge-update',
events: ['content.changed'],
});
// Your webhook handler
app.post('/webhooks/knowledge-update', async (req, res) => {
const { url, diff, changedAt } = req.body;
console.log(`Content changed at ${url} on ${changedAt}`);
console.log(`Changes:`, diff);
// Content is already re-indexed automatically
// You might want to notify users or trigger a re-generation
await notifyTeamSlack(`Docs updated: ${url}`);
res.sendStatus(200);
});
# Flask webhook handler
import os

from flask import Flask, request, jsonify
from knowledgesdk import KnowledgeSDK
app = Flask(__name__)
client = KnowledgeSDK(api_key=os.environ["KNOWLEDGESDK_API_KEY"])
# Subscribe to changes
client.webhooks.subscribe(
url="https://stripe.com/docs/api",
callback_url="https://your-app.com/webhooks/knowledge-update",
events=["content.changed"]
)
@app.post("/webhooks/knowledge-update")
def handle_update():
data = request.json
print(f"Content changed at {data['url']} on {data['changedAt']}")
# Content is already re-indexed — no action needed
notify_slack(f"Docs updated: {data['url']}")
return jsonify({"ok": True})
Step 3: Semantic Search and GPT-4o Generation
async function answerQuestion(question) {
// Search across all indexed content
const searchResults = await client.search({
query: question,
limit: 5,
});
// Build context from search results
const context = searchResults.results
.map(r => `Source: ${r.url}\n\n${r.content}`)
.join('\n\n---\n\n');
// Generate answer with GPT-4o
const completion = await openai.chat.completions.create({
model: 'gpt-4o',
messages: [
{
role: 'system',
content: `You are a helpful assistant. Answer questions based only on the provided context.
If the context doesn't contain the answer, say so.
Context:
${context}`,
},
{
role: 'user',
content: question,
},
],
});
return {
answer: completion.choices[0].message.content,
sources: searchResults.results.map(r => r.url),
};
}
// Usage
const { answer, sources } = await answerQuestion(
'How do I verify Stripe webhook signatures?'
);
console.log(answer);
console.log('Sources:', sources);
import os

from knowledgesdk import KnowledgeSDK
import openai
client = KnowledgeSDK(api_key=os.environ["KNOWLEDGESDK_API_KEY"])
openai_client = openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"])
def answer_question(question: str) -> dict:
# Search across all indexed content
results = client.search(query=question, limit=5)
# Build context
context = "\n\n---\n\n".join(
f"Source: {r.url}\n\n{r.content}"
for r in results.results
)
# Generate with GPT-4o
completion = openai_client.chat.completions.create(
model="gpt-4o",
messages=[
{
"role": "system",
"content": f"""You are a helpful assistant. Answer based only on the provided context.
If the context doesn't contain the answer, say so.
Context:
{context}""",
},
{"role": "user", "content": question},
],
)
return {
"answer": completion.choices[0].message.content,
"sources": [r.url for r in results.results],
}
# Usage
result = answer_question("How do I verify Stripe webhook signatures?")
print(result["answer"])
print("Sources:", result["sources"])
Part 3: Scraping a Full Site for RAG
For documentation sites, you usually want to index the entire site, not just one URL. knowledgeSDK's extract endpoint handles this.
// Extract an entire site — crawls all pages, returns structured knowledge
const extraction = await client.extract({
url: 'https://stripe.com/docs',
options: {
maxPages: 200,
includeSubdomains: false,
}
});
console.log(`Extracted ${extraction.pageCount} pages`);
console.log(`All content is now indexed and searchable`);
// Extraction runs asynchronously — check the job status (or poll until done)
const job = await client.jobs.get(extraction.jobId);
console.log(job.status); // e.g. 'processing' or 'completed'
# Full site extraction
extraction = client.extract(
url="https://stripe.com/docs",
options={"max_pages": 200, "include_subdomains": False}
)
print(f"Job ID: {extraction.job_id}")
# Poll for completion (or use webhook)
import time
while True:
job = client.jobs.get(extraction.job_id)
print(f"Status: {job.status} — {job.pages_processed} pages processed")
if job.status in ("completed", "failed"):
break
time.sleep(5)
print("All content indexed and searchable")
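If you do poll instead of using a webhook, exponential backoff is kinder to the API than a fixed five-second sleep. A small generic helper (a pure sketch — `fetch_status` is whatever callable returns the job object; the injectable `sleep`/`clock` parameters exist so the helper is testable):

```python
import time

def poll_until_done(fetch_status, *, base_delay=2.0, max_delay=60.0,
                    timeout=600.0, sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status() with exponential backoff until the job finishes.

    fetch_status must return an object with a .status attribute; terminal
    states are 'completed' and 'failed'.
    """
    deadline = clock() + timeout
    delay = base_delay
    while True:
        job = fetch_status()
        if job.status in ("completed", "failed"):
            return job
        if clock() > deadline:
            raise TimeoutError("job did not finish before the timeout")
        sleep(delay)
        delay = min(delay * 2, max_delay)

# Usage with the extraction job above:
# job = poll_until_done(lambda: client.jobs.get(extraction.job_id))
```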
Comparison: DIY vs knowledgeSDK at a Glance
| Factor | DIY (Firecrawl + Pinecone) | knowledgeSDK |
|---|---|---|
| Initial setup time | 3-4 weeks | 1-2 hours |
| Components to manage | 4+ (scraper, embedder, vector DB, scheduler) | 1 |
| Change detection | Manual polling (expensive, slow) | Webhooks (instant, free) |
| Search quality | Depends on your chunking strategy | Hybrid semantic + keyword built-in |
| Cost at 10K pages/mo | ~$95-105/mo | $29-99/mo |
| Vendor lock-in | Lower (portable data) | Higher |
| Customization | Full control | API constraints |
| PDF support | Yes (Firecrawl) | Roadmap |
The DIY approach gives you more control and is worth it if you have specific requirements around data residency, custom embedding models, or unusual retrieval patterns. For most teams building AI applications, knowledgeSDK eliminates infrastructure that doesn't differentiate your product.
Advanced: Multi-Source RAG
Real knowledge bases often combine multiple sources. Here's how to index several sources and maintain freshness across all of them:
const sources = [
'https://stripe.com/docs/api',
'https://docs.github.com/en/rest',
'https://developers.notion.com/reference',
];
// Index all sources
await Promise.all(
sources.map(url => client.scrape({ url }))
);
// Subscribe to all sources for change detection
await Promise.all(
sources.map(url =>
client.webhooks.subscribe({
url,
callbackUrl: 'https://your-app.com/webhooks/knowledge-update',
events: ['content.changed'],
})
)
);
// Now search across all sources simultaneously
const results = await client.search({
query: 'rate limiting best practices',
limit: 10,
});
// Results will include content from Stripe, GitHub, and Notion docs
// ranked by relevance across all sources
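The same multi-source setup in Python can be sketched with a thread pool, mirroring the `Promise.all` pattern above. This assumes the knowledgeSDK Python client from earlier — any client object exposing `.scrape()` and `.webhooks.subscribe()` with these signatures would work:

```python
from concurrent.futures import ThreadPoolExecutor

def index_sources(client, sources, callback_url, max_workers=4):
    """Scrape (auto-index) each source and subscribe it to change webhooks.

    Sketch assuming the knowledgeSDK Python client shown earlier.
    """
    def ingest(url):
        client.scrape(url=url)
        client.webhooks.subscribe(
            url=url,
            callback_url=callback_url,
            events=["content.changed"],
        )
        return url

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(ingest, sources))

# index_sources(client,
#               ["https://stripe.com/docs/api", "https://docs.github.com/en/rest"],
#               "https://your-app.com/webhooks/knowledge-update")
```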
FAQ
What's the difference between semantic search and keyword search for RAG? Keyword search matches exact words. Semantic search matches meaning — so a query for "authentication failure" returns results about "login errors" or "access denied" even without those exact words. knowledgeSDK's hybrid search uses both simultaneously, which outperforms either method alone.
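Hybrid retrieval is typically implemented by running both searches and fusing the two ranked lists. A common fusion technique is reciprocal rank fusion (RRF) — shown here as a generic sketch, not knowledgeSDK's actual internals:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked result lists: each doc scores sum(1 / (k + rank))."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["doc_a", "doc_b", "doc_c"]  # ranked by embedding similarity
keyword = ["doc_b", "doc_d", "doc_a"]   # ranked by BM25
print(reciprocal_rank_fusion([semantic, keyword]))
# → ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Documents that rank well in both lists (like doc_b here) float to the top, which is why hybrid search tends to beat either method alone.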
How fresh is "fresh enough" for a RAG pipeline? It depends on your use case. For pricing pages, hours-old data is dangerous. For documentation, daily updates are usually fine. knowledgeSDK's webhooks can notify you within minutes of a content change, so you can set your freshness SLA based on business requirements.
Do I need to re-embed content when the page changes? With the DIY approach, yes — you re-embed and re-index changed chunks. With knowledgeSDK, the re-indexing happens automatically when a change is detected.
What embedding model does knowledgeSDK use? knowledgeSDK uses OpenAI's text-embedding-3-small (1536 dimensions) for semantic search, combined with BM25 for keyword search in a hybrid retrieval architecture. You don't need to manage embedding model selection.
How does chunking work in knowledgeSDK? knowledgeSDK automatically chunks content by semantic section (using header structure) and applies sliding window chunking for sections without clear structure. The default chunk size is optimized for RAG retrieval. You don't need to implement your own chunking logic.
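Sliding-window chunking for unstructured text can be sketched like this — illustrative parameters only, not knowledgeSDK's exact defaults:

```python
def sliding_window_chunks(text: str, window: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping word windows so no boundary loses context."""
    words = text.split()
    if len(words) <= window:
        return [text.strip()] if words else []
    step = window - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + window]))
        if start + window >= len(words):
            break  # the final window already covers the tail
    return chunks
```

The overlap means a sentence falling on a chunk boundary still appears intact in at least one chunk, which matters for retrieval quality.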
Can I export the indexed data to use with a different vector database? Yes — knowledgeSDK provides an export API that returns all indexed chunks with their embeddings, so you can migrate to Pinecone or Weaviate if needed.
What happens to my search index when a page is re-indexed after a change? Old chunks are replaced with new ones. The search index reflects the current state of the page, not historical snapshots. If you need versioning, you can store the raw markdown in your own storage alongside knowledgeSDK.
Conclusion
Web scraping for RAG is not just about getting markdown out of URLs — it's about keeping that knowledge current and making it searchable. The DIY stack works, but it requires 3-4 weeks of engineering time and 4+ ongoing services to maintain.
knowledgeSDK's approach of combining scraping, indexing, search, and change detection eliminates most of that infrastructure. For teams focused on building AI applications rather than building scraping infrastructure, it's the faster path to a production RAG pipeline.
For related reading, see our guides on LangChain web scraping and website change detection with webhooks.
Try knowledgeSDK free — get your API key at knowledgesdk.com/setup