Web Scraping for RAG: Keep Your Knowledge Base Fresh (2026)
Retrieval-Augmented Generation (RAG) is the standard architecture for giving LLMs access to private or up-to-date information. Most RAG tutorials cover the "index a PDF" use case. But the harder problem — and the one that matters most in production — is keeping your knowledge base current with live web content.
This tutorial walks through the full lifecycle of web-scraped RAG:
- Scrape competitor documentation, news sites, or any web source
- Index the content into a searchable knowledge base
- Run semantic search to retrieve relevant context
- Pipe the context into GPT-4o to generate accurate, grounded responses
We'll show two approaches side by side: DIY with Firecrawl + Pinecone + cron (the standard stack most teams build) and the knowledgeSDK built-in approach (scraping + indexing + search in one API).
Why Fresh Data Is the Hard Part of RAG
Most RAG tutorials stop at "index your documents once." The reality is that web content changes constantly:
- API documentation gets updated (breaking changes, new endpoints)
- Competitor pricing pages change monthly
- News and blog content is only valuable when recent
- Product specs get updated silently
A RAG pipeline that indexes content once and never updates is just a slower, more expensive version of a static knowledge base. The real value of RAG over fine-tuning is the ability to keep knowledge current — and that requires a proper refresh strategy.
The Three Core Problems in Web RAG
- Extraction: Getting clean, LLM-ready text from arbitrary URLs (JS rendering, anti-bot, pagination)
- Indexing: Embedding and storing content so it's semantically searchable
- Freshness: Detecting changes and updating the index when content changes
Most teams solve #1 and #2, then patch #3 with a cron job that re-scrapes everything nightly. This is expensive (you're re-processing unchanged content) and slow (by the time you detect a change, it may be hours old).
Architecture Overview
DIY Stack
[Firecrawl API] → [S3 / PostgreSQL storage] → [OpenAI Embeddings API] → [Pinecone vector DB]
[Cron scheduler] → [Change detection (DIY diff)] → re-index changed pages
[Semantic search query] → [Pinecone vector DB] → [GPT-4o with context]
Components:
- Firecrawl for scraping (or Jina Reader for low-volume)
- S3 or PostgreSQL for raw content storage
- OpenAI `text-embedding-3-small` for embeddings
- Pinecone for vector storage and search
- A cron job (GitHub Actions, Inngest, etc.) for refresh scheduling
- Custom diff logic to detect changes
Approximate monthly cost at 10K pages:
- Firecrawl: ~$59/mo
- Pinecone Starter: $25/mo
- OpenAI embeddings: ~$1/mo (for 10K pages × ~2K tokens avg)
- Compute for cron + diff: ~$10-20/mo
Total: ~$95-105/mo + 3-4 weeks of initial engineering time
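The embeddings line item can be sanity-checked with quick arithmetic. The calculation below assumes OpenAI's published rate of $0.02 per 1M tokens for text-embedding-3-small — verify current pricing before relying on it:

```python
# Rough embedding-cost estimate for the DIY stack (illustrative assumptions).
PRICE_PER_MILLION_TOKENS = 0.02  # text-embedding-3-small, USD (assumed rate)

def embedding_cost(pages: int, avg_tokens_per_page: int) -> float:
    """Cost in USD of embedding every page once."""
    total_tokens = pages * avg_tokens_per_page
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS

# 10K pages at ~2K tokens each is 20M tokens, i.e. about $0.40 per full pass.
# Note: a nightly full re-embed multiplies this by ~30x per month.
print(f"${embedding_cost(10_000, 2_000):.2f}")
```

The takeaway is that embeddings themselves are cheap; the real DIY cost is the scraping bill and the engineering time around it.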
knowledgeSDK Built-In Approach
[knowledgeSDK /v1/scrape] → [Auto-indexed vector store]
[Webhook: content.changed] → [Auto re-index on change]
[knowledgeSDK /v1/search] → [GPT-4o with context]
Components:
- knowledgeSDK (handles scraping, indexing, search, and change detection)
- GPT-4o for generation
Approximate monthly cost at 10K pages:
- knowledgeSDK: $29/mo (Starter) or $99/mo (Pro)
- Total: $29-99/mo + a few hours of integration
Part 1: DIY RAG Pipeline with Firecrawl + Pinecone
Let's build the full pipeline so you understand what's involved.
Step 1: Scrape Content with Firecrawl
// Node.js
import Firecrawl from '@mendable/firecrawl-js';
import { Pinecone } from '@pinecone-database/pinecone';
import OpenAI from 'openai';
const firecrawl = new Firecrawl({ apiKey: process.env.FIRECRAWL_API_KEY });
const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY });
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
async function scrapeAndStore(url) {
console.log(`Scraping: ${url}`);
const result = await firecrawl.scrapeUrl(url, {
formats: ['markdown'],
});
if (!result.success) {
throw new Error(`Scraping failed: ${result.error}`);
}
return {
url,
markdown: result.markdown,
title: result.metadata?.title,
scrapedAt: new Date().toISOString(),
};
}
# Python equivalent
import os
from datetime import datetime, timezone

import firecrawl
import openai
from pinecone import Pinecone

fc = firecrawl.FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
openai_client = openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def scrape_and_store(url: str) -> dict:
    print(f"Scraping: {url}")
    result = fc.scrape_url(url, params={"formats": ["markdown"]})
    return {
        "url": url,
        "markdown": result.get("markdown", ""),
        "title": result.get("metadata", {}).get("title"),
        "scraped_at": datetime.now(timezone.utc).isoformat(),
    }
Step 2: Chunk and Embed Content
This is the part most tutorials gloss over. Raw markdown is often too long for a single embedding. You need to chunk it intelligently.
function chunkMarkdown(markdown, maxTokens = 800) {
// Split on H2 headers first for semantic chunks
const sections = markdown.split(/\n## /g);
const chunks = [];
for (const section of sections) {
// If section is too long, split further
if (section.split(' ').length > maxTokens) {
const paragraphs = section.split('\n\n');
let currentChunk = '';
for (const paragraph of paragraphs) {
if ((currentChunk + paragraph).split(' ').length > maxTokens) {
if (currentChunk) chunks.push(currentChunk.trim());
currentChunk = paragraph;
} else {
currentChunk += '\n\n' + paragraph;
}
}
if (currentChunk) chunks.push(currentChunk.trim());
} else {
chunks.push(section.trim());
}
}
return chunks.filter(c => c.length > 50); // Filter tiny chunks
}
async function embedChunks(chunks) {
const response = await openai.embeddings.create({
model: 'text-embedding-3-small',
input: chunks,
});
return response.data.map((item, i) => ({
embedding: item.embedding,
text: chunks[i],
}));
}
def chunk_markdown(markdown: str, max_tokens: int = 800) -> list[str]:
sections = markdown.split("\n## ")
chunks = []
for section in sections:
words = section.split()
if len(words) > max_tokens:
paragraphs = section.split("\n\n")
current_chunk = ""
for para in paragraphs:
if len((current_chunk + para).split()) > max_tokens:
if current_chunk:
chunks.append(current_chunk.strip())
current_chunk = para
else:
current_chunk += "\n\n" + para
if current_chunk:
chunks.append(current_chunk.strip())
else:
chunks.append(section.strip())
return [c for c in chunks if len(c) > 50]
def embed_chunks(chunks: list[str]) -> list[dict]:
response = openai_client.embeddings.create(
model="text-embedding-3-small",
input=chunks,
)
return [
{"embedding": item.embedding, "text": chunks[i]}
for i, item in enumerate(response.data)
]
Step 3: Index into Pinecone
async function indexInPinecone(url, chunks, embeddings) {
const index = pinecone.index('knowledge-base');
const vectors = embeddings.map((item, i) => ({
id: `${Buffer.from(url).toString('base64url')}_${i}`, // base64url avoids '/' and '+' in IDs, matching the Python version
values: item.embedding,
metadata: {
url,
text: item.text,
chunkIndex: i,
},
}));
await index.upsert(vectors);
console.log(`Indexed ${vectors.length} chunks from ${url}`);
}
def index_in_pinecone(url: str, chunks: list[str], embeddings: list[dict]):
import base64
index = pc.Index("knowledge-base")
url_id = base64.urlsafe_b64encode(url.encode()).decode()
vectors = [
{
"id": f"{url_id}_{i}",
"values": emb["embedding"],
"metadata": {"url": url, "text": chunk, "chunk_index": i},
}
for i, (chunk, emb) in enumerate(zip(chunks, embeddings))
]
index.upsert(vectors=vectors)
print(f"Indexed {len(vectors)} chunks from {url}")
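The architecture diagram also includes a semantic search step, which is easy to forget when building the DIY stack. A sketch of that retrieval half, assuming the `pc` and `openai_client` clients configured earlier — `build_context` is a hypothetical helper for prompt assembly, and the Pinecone response shape may vary slightly across client versions:

```python
# DIY retrieval: embed the question, query Pinecone, format the context.
# Sketch only — `pc` and `openai_client` are the clients set up earlier.
def search_diy(question: str, top_k: int = 5) -> list[dict]:
    index = pc.Index("knowledge-base")
    query_embedding = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=[question],
    ).data[0].embedding
    results = index.query(
        vector=query_embedding, top_k=top_k, include_metadata=True
    )
    # Each match carries the metadata we stored at index time (url, text).
    return [match["metadata"] for match in results["matches"]]

def build_context(matches: list[dict], separator: str = "\n\n---\n\n") -> str:
    """Join retrieved chunks into one prompt context, with source URLs."""
    return separator.join(
        f"Source: {m['url']}\n\n{m['text']}" for m in matches
    )
```

The context string then goes into the system prompt for generation, exactly as in the GPT-4o examples later in this post.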
Step 4: Change Detection (The Hard Part)
This is where most DIY pipelines get messy. You need to:
- Store a hash or snapshot of the current content
- Re-scrape on a schedule
- Compare new content to old content
- Re-index only changed chunks
import crypto from 'crypto';
async function checkForChanges(url, db) {
const { markdown } = await scrapeAndStore(url);
const newHash = crypto.createHash('sha256').update(markdown).digest('hex');
const stored = await db.query(
'SELECT content_hash, scraped_at FROM pages WHERE url = $1',
[url]
);
if (stored.rows.length === 0 || stored.rows[0].content_hash !== newHash) {
console.log(`Change detected at ${url}`);
// Re-index content
const chunks = chunkMarkdown(markdown);
const embeddings = await embedChunks(chunks);
await indexInPinecone(url, chunks, embeddings);
// Update the stored hash
await db.query(
`INSERT INTO pages (url, content_hash, scraped_at)
VALUES ($1, $2, NOW())
ON CONFLICT (url) DO UPDATE
SET content_hash = $2, scraped_at = NOW()`,
[url, newHash]
);
return true; // Changed
}
return false; // No change
}
This is functional but has real problems in production:
- You're re-scraping every URL on every run, even unchanged ones (costs money)
- Hash-based diffing doesn't tell you what changed
- You need to manage the database, the scheduler, the retry logic...
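For parity with the rest of the tutorial, here is a Python sketch of the same hash-and-compare step. It assumes the `scrape_and_store`, `chunk_markdown`, `embed_chunks`, and `index_in_pinecone` functions from the earlier steps, plus a psycopg-style database connection:

```python
import hashlib

def content_hash(markdown: str) -> str:
    """Stable fingerprint of the scraped content."""
    return hashlib.sha256(markdown.encode("utf-8")).hexdigest()

def check_for_changes(url: str, conn) -> bool:
    # Sketch — relies on scrape_and_store / chunk_markdown / embed_chunks /
    # index_in_pinecone from the earlier steps and a psycopg connection.
    markdown = scrape_and_store(url)["markdown"]
    new_hash = content_hash(markdown)
    with conn.cursor() as cur:
        cur.execute("SELECT content_hash FROM pages WHERE url = %s", (url,))
        row = cur.fetchone()
        if row is not None and row[0] == new_hash:
            return False  # no change
        chunks = chunk_markdown(markdown)
        index_in_pinecone(url, chunks, embed_chunks(chunks))
        cur.execute(
            """INSERT INTO pages (url, content_hash, scraped_at)
               VALUES (%s, %s, NOW())
               ON CONFLICT (url) DO UPDATE
               SET content_hash = EXCLUDED.content_hash, scraped_at = NOW()""",
            (url, new_hash),
        )
        conn.commit()
    return True  # changed
```

Note that the expensive scrape happens before the hash comparison — the hash only saves you the re-embedding and re-indexing, not the scraping cost.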
Part 2: The knowledgeSDK Approach
Now let's look at the same pipeline with knowledgeSDK. The entire Part 1 above collapses into a few API calls.
Step 1: Scrape and Auto-Index
import { KnowledgeSDK } from '@knowledgesdk/node';
const client = new KnowledgeSDK({ apiKey: process.env.KNOWLEDGESDK_API_KEY });
// Scrape a URL — content is automatically indexed for search
const page = await client.scrape({ url: 'https://stripe.com/docs/api' });
console.log(page.markdown);
// Content is now searchable immediately — no embedding, no Pinecone, no extra steps
import os

from knowledgesdk import KnowledgeSDK

client = KnowledgeSDK(api_key=os.environ["KNOWLEDGESDK_API_KEY"])
# Scrape and auto-index in one call
page = client.scrape(url="https://stripe.com/docs/api")
print(page.markdown)
# Already indexed — ready to search
Step 2: Subscribe to Changes (No Polling)
// Subscribe once — receive webhook when content changes
await client.webhooks.subscribe({
url: 'https://stripe.com/docs/api',
callbackUrl: 'https://your-app.com/webhooks/knowledge-update',
events: ['content.changed'],
});
// Your webhook handler
app.post('/webhooks/knowledge-update', async (req, res) => {
const { url, diff, changedAt } = req.body;
console.log(`Content changed at ${url} on ${changedAt}`);
console.log(`Changes:`, diff);
// Content is already re-indexed automatically
// You might want to notify users or trigger a re-generation
await notifyTeamSlack(`Docs updated: ${url}`);
res.sendStatus(200);
});
# Flask webhook handler
import os

from flask import Flask, request, jsonify
from knowledgesdk import KnowledgeSDK
app = Flask(__name__)
client = KnowledgeSDK(api_key=os.environ["KNOWLEDGESDK_API_KEY"])
# Subscribe to changes
client.webhooks.subscribe(
url="https://stripe.com/docs/api",
callback_url="https://your-app.com/webhooks/knowledge-update",
events=["content.changed"]
)
@app.post("/webhooks/knowledge-update")
def handle_update():
data = request.json
print(f"Content changed at {data['url']} on {data['changedAt']}")
# Content is already re-indexed — no action needed
notify_slack(f"Docs updated: {data['url']}")
return jsonify({"ok": True})
Step 3: Semantic Search and GPT-4o Generation
async function answerQuestion(question) {
// Search across all indexed content
const searchResults = await client.search({
query: question,
limit: 5,
});
// Build context from search results
const context = searchResults.results
.map(r => `Source: ${r.url}\n\n${r.content}`)
.join('\n\n---\n\n');
// Generate answer with GPT-4o
const completion = await openai.chat.completions.create({
model: 'gpt-4o',
messages: [
{
role: 'system',
content: `You are a helpful assistant. Answer questions based only on the provided context.
If the context doesn't contain the answer, say so.
Context:
${context}`,
},
{
role: 'user',
content: question,
},
],
});
return {
answer: completion.choices[0].message.content,
sources: searchResults.results.map(r => r.url),
};
}
// Usage
const { answer, sources } = await answerQuestion(
'How do I verify Stripe webhook signatures?'
);
console.log(answer);
console.log('Sources:', sources);
import os

from knowledgesdk import KnowledgeSDK
import openai
client = KnowledgeSDK(api_key=os.environ["KNOWLEDGESDK_API_KEY"])
openai_client = openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"])
def answer_question(question: str) -> dict:
# Search across all indexed content
results = client.search(query=question, limit=5)
# Build context
context = "\n\n---\n\n".join(
f"Source: {r.url}\n\n{r.content}"
for r in results.results
)
# Generate with GPT-4o
completion = openai_client.chat.completions.create(
model="gpt-4o",
messages=[
{
"role": "system",
"content": f"""You are a helpful assistant. Answer based only on the provided context.
If the context doesn't contain the answer, say so.
Context:
{context}""",
},
{"role": "user", "content": question},
],
)
return {
"answer": completion.choices[0].message.content,
"sources": [r.url for r in results.results],
}
# Usage
result = answer_question("How do I verify Stripe webhook signatures?")
print(result["answer"])
print("Sources:", result["sources"])
Part 3: Scraping a Full Site for RAG
For documentation sites, you usually want to index the entire site, not just one URL. knowledgeSDK's extract endpoint handles this.
// Extract an entire site — crawls all pages, returns structured knowledge
const extraction = await client.extract({
url: 'https://stripe.com/docs',
options: {
maxPages: 200,
includeSubdomains: false,
}
});
console.log(`Extracted ${extraction.pageCount} pages`);
console.log(`All content is now indexed and searchable`);
// Extraction runs asynchronously — check the job status (or poll until done)
const job = await client.jobs.get(extraction.jobId);
console.log(job.status); // e.g. 'processing' or 'completed'
# Full site extraction
extraction = client.extract(
url="https://stripe.com/docs",
options={"max_pages": 200, "include_subdomains": False}
)
print(f"Job ID: {extraction.job_id}")
# Poll for completion (or use webhook)
import time
while True:
job = client.jobs.get(extraction.job_id)
print(f"Status: {job.status} — {job.pages_processed} pages processed")
if job.status in ("completed", "failed"):
break
time.sleep(5)
print("All content indexed and searchable")
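If you do poll instead of using a webhook, exponential backoff is kinder to the API than a fixed five-second sleep. A small generic helper (a pure sketch — `fetch_status` is whatever callable returns the job object; the injectable `sleep`/`clock` parameters exist so the helper is testable):

```python
import time

def poll_until_done(fetch_status, *, base_delay=2.0, max_delay=60.0,
                    timeout=600.0, sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status() with exponential backoff until the job finishes.

    fetch_status must return an object with a .status attribute; terminal
    states are 'completed' and 'failed'.
    """
    deadline = clock() + timeout
    delay = base_delay
    while True:
        job = fetch_status()
        if job.status in ("completed", "failed"):
            return job
        if clock() > deadline:
            raise TimeoutError("job did not finish before the timeout")
        sleep(delay)
        delay = min(delay * 2, max_delay)

# Usage with the extraction job above:
# job = poll_until_done(lambda: client.jobs.get(extraction.job_id))
```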
Comparison: DIY vs knowledgeSDK at a Glance
| Factor | DIY (Firecrawl + Pinecone) | knowledgeSDK |
|---|---|---|
| Initial setup time | 3-4 weeks | 1-2 hours |
| Components to manage | 4+ (scraper, embedder, vector DB, scheduler) | 1 |
| Change detection | Manual polling (expensive, slow) | Webhooks (instant, free) |
| Search quality | Depends on your chunking strategy | Hybrid semantic + keyword built-in |
| Cost at 10K pages/mo | ~$95-105/mo | $29-99/mo |
| Vendor lock-in | Lower (portable data) | Higher |
| Customization | Full control | API constraints |
| PDF support | Yes (Firecrawl) | Roadmap |
The DIY approach gives you more control and is worth it if you have specific requirements around data residency, custom embedding models, or unusual retrieval patterns. For most teams building AI applications, knowledgeSDK eliminates infrastructure that doesn't differentiate your product.
Advanced: Multi-Source RAG
Real knowledge bases often combine multiple sources. Here's how to index several sources and maintain freshness across all of them:
const sources = [
'https://stripe.com/docs/api',
'https://docs.github.com/en/rest',
'https://developers.notion.com/reference',
];
// Index all sources
await Promise.all(
sources.map(url => client.scrape({ url }))
);
// Subscribe to all sources for change detection
await Promise.all(
sources.map(url =>
client.webhooks.subscribe({
url,
callbackUrl: 'https://your-app.com/webhooks/knowledge-update',
events: ['content.changed'],
})
)
);
// Now search across all sources simultaneously
const results = await client.search({
query: 'rate limiting best practices',
limit: 10,
});
// Results will include content from Stripe, GitHub, and Notion docs
// ranked by relevance across all sources
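The same multi-source setup in Python can be sketched with a thread pool, mirroring the `Promise.all` pattern above. This assumes the knowledgeSDK Python client from earlier — any client object exposing `.scrape()` and `.webhooks.subscribe()` with these signatures would work:

```python
from concurrent.futures import ThreadPoolExecutor

def index_sources(client, sources, callback_url, max_workers=4):
    """Scrape (auto-index) each source and subscribe it to change webhooks.

    Sketch assuming the knowledgeSDK Python client shown earlier.
    """
    def ingest(url):
        client.scrape(url=url)
        client.webhooks.subscribe(
            url=url,
            callback_url=callback_url,
            events=["content.changed"],
        )
        return url

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(ingest, sources))

# index_sources(client,
#               ["https://stripe.com/docs/api", "https://docs.github.com/en/rest"],
#               "https://your-app.com/webhooks/knowledge-update")
```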
FAQ
What's the difference between semantic search and keyword search for RAG? Keyword search matches exact words. Semantic search matches meaning — so a query for "authentication failure" returns results about "login errors" or "access denied" even without those exact words. knowledgeSDK's hybrid search uses both simultaneously, which outperforms either method alone.
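Hybrid retrieval is typically implemented by running both searches and fusing the two ranked lists. A common fusion technique is reciprocal rank fusion (RRF) — shown here as a generic sketch, not knowledgeSDK's actual internals:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked result lists: each doc scores sum(1 / (k + rank))."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["doc_a", "doc_b", "doc_c"]  # ranked by embedding similarity
keyword = ["doc_b", "doc_d", "doc_a"]   # ranked by BM25
print(reciprocal_rank_fusion([semantic, keyword]))
# → ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Documents that rank well in both lists (like doc_b here) float to the top, which is why hybrid search tends to beat either method alone.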
How fresh is "fresh enough" for a RAG pipeline? It depends on your use case. For pricing pages, hours-old data is dangerous. For documentation, daily updates are usually fine. knowledgeSDK's webhooks can notify you within minutes of a content change, so you can set your freshness SLA based on business requirements.
Do I need to re-embed content when the page changes? With the DIY approach, yes — you re-embed and re-index changed chunks. With knowledgeSDK, the re-indexing happens automatically when a change is detected.
What embedding model does knowledgeSDK use? knowledgeSDK uses OpenAI's text-embedding-3-small (1536 dimensions) for semantic search, combined with BM25 for keyword search in a hybrid retrieval architecture. You don't need to manage embedding model selection.
How does chunking work in knowledgeSDK? knowledgeSDK automatically chunks content by semantic section (using header structure) and applies sliding window chunking for sections without clear structure. The default chunk size is optimized for RAG retrieval. You don't need to implement your own chunking logic.
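Sliding-window chunking for unstructured text can be sketched like this — illustrative parameters only, not knowledgeSDK's exact defaults:

```python
def sliding_window_chunks(text: str, window: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping word windows so no boundary loses context."""
    words = text.split()
    if len(words) <= window:
        return [text.strip()] if words else []
    step = window - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + window]))
        if start + window >= len(words):
            break  # the final window already covers the tail
    return chunks
```

The overlap means a sentence falling on a chunk boundary still appears intact in at least one chunk, which matters for retrieval quality.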
Can I export the indexed data to use with a different vector database? Yes — knowledgeSDK provides an export API that returns all indexed chunks with their embeddings, so you can migrate to Pinecone or Weaviate if needed.
What happens to my search index when a page is re-indexed after a change? Old chunks are replaced with new ones. The search index reflects the current state of the page, not historical snapshots. If you need versioning, you can store the raw markdown in your own storage alongside knowledgeSDK.
Conclusion
Web scraping for RAG is not just about getting markdown out of URLs — it's about keeping that knowledge current and making it searchable. The DIY stack works, but it requires 3-4 weeks of engineering time and 4+ ongoing services to maintain.
knowledgeSDK's approach of combining scraping, indexing, search, and change detection eliminates most of that infrastructure. For teams focused on building AI applications rather than building scraping infrastructure, it's the faster path to a production RAG pipeline.
For related reading, see our guides on LangChain web scraping and website change detection with webhooks.
Try knowledgeSDK free — get your API key at knowledgesdk.com/setup