Every team building an LLM-powered product eventually faces this question: should we fine-tune a model on our data, or should we use retrieval-augmented generation?
The wrong answer — usually defaulting to fine-tuning because it sounds more "AI-native" — leads to expensive models that go stale immediately and can't be updated without another training run. The right answer depends on what problem you're actually trying to solve.
This guide explains the fundamental difference between the two approaches, when each is appropriate, and why web scraping is a critical input for RAG-based systems.
## What Fine-Tuning Actually Does
Fine-tuning takes a pre-trained model and continues training it on a new dataset. The result is a model whose weights have been adjusted to better reflect the patterns in your fine-tuning data.
Think of it as updating the model's "memory" — but that memory is static. Once training is complete, the model knows what was in the training data at training time and nothing more.
**What fine-tuning is good at:**
- Style and tone: Making a model consistently respond in your brand voice, use specific terminology, or follow particular formatting conventions
- Task specialization: Teaching a model to do a specific type of task (code completion in a proprietary DSL, medical entity extraction, legal document classification)
- Behavior modification: Adjusting how the model responds (more concise, more conversational, in a specific language)
**What fine-tuning is bad at:**
- Keeping up with changing information: A fine-tuned model that knew your product docs in January doesn't know about the features you shipped in February
- Long-tail facts: Rare, specific facts (exact API parameters, current pricing, customer-specific configurations) often don't get reliably encoded in weights even after fine-tuning
- Cost efficiency at scale: Fine-tuning GPT-4 on 100,000 documents costs thousands of dollars and takes days. Doing it monthly to stay current is prohibitively expensive for most teams
## What RAG Actually Does
Retrieval-augmented generation is a pattern where the LLM retrieves relevant documents at inference time and uses them as context. The model's weights aren't changed — instead, you give it the information it needs as part of each prompt.
```
User query: "What's the rate limit for the /v1/search endpoint?"
        ↓
Retrieval: Search knowledge base for relevant docs
        ↓
Top result: "The search endpoint is rate limited to 100 requests per minute
on the Starter plan and 1,000 per minute on the Pro plan."
        ↓
LLM with context: Accurate answer using retrieved information
```
The model doesn't need to "know" the rate limit — it reads the relevant document and tells the user what it says. Update the document, and the answer updates automatically.
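The retrieval step can be sketched in a few lines. This toy version scores documents by keyword overlap with the query (a real pipeline would use embeddings and a vector index); the documents and query here are made up for illustration.

```typescript
type Doc = { title: string; content: string };

// Toy knowledge base (illustrative content, not real docs)
const docs: Doc[] = [
  { title: 'Rate limits', content: 'The search endpoint is rate limited to 100 requests per minute on the Starter plan.' },
  { title: 'Authentication', content: 'Authenticate by passing your API key in the Authorization header.' },
];

// Score each doc by how many query words appear in it; return the best match
function retrieve(query: string, corpus: Doc[]): Doc {
  const words = query.toLowerCase().split(/\W+/).filter(w => w.length > 2);
  const scored = corpus.map(doc => ({
    doc,
    score: words.filter(w => doc.content.toLowerCase().includes(w)).length,
  }));
  scored.sort((a, b) => b.score - a.score);
  return scored[0].doc;
}

// The retrieved text becomes context in the LLM prompt
const best = retrieve('What is the rate limit for the search endpoint?', docs);
console.log(best.title); // "Rate limits"
```

The point is that correctness lives in the documents, not the model: swapping in an updated "Rate limits" document changes the answer with no retraining.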
**What RAG is good at:**
- Factual accuracy on current information: Up-to-date content is retrieved, not remembered from stale training
- Long-tail and specific facts: Exact product specs, current pricing, specific customer configurations
- Transparency and citation: You can show which documents were retrieved, enabling users to verify answers
- Cost-effective updates: Update the knowledge base (cheap), not the model (expensive)
**What RAG is bad at:**
- Style and behavior: Retrieval doesn't change how the model talks, only what it knows
- Deep domain reasoning: Complex reasoning that requires integrated domain knowledge (not just facts) can be better served by fine-tuning
- Latency-critical applications: Every RAG request requires a retrieval step (10-100ms latency) plus the LLM call. For extremely latency-sensitive applications, this adds up
## The Role of Web Scraping in RAG
For RAG to work, you need a knowledge base. That knowledge base has to come from somewhere. This is where web scraping enters the picture.
The most common RAG knowledge sources fall into three categories:
### Internal Documents
Company wikis, Notion pages, Google Docs, Confluence. These usually have proper APIs or can be exported directly. Web scraping isn't typically needed here.
### Public Web Content
Documentation sites, blog posts, competitor pages, product pages, news articles, research papers, regulatory filings. This is the domain of web scraping — there's no official API, just HTML.
### Dynamic/Changing Content
Pricing pages that update quarterly, documentation that ships with new releases, regulatory databases that update when regulations change. This requires ongoing scraping with change detection.
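One simple way to implement that change detection is to hash each page's extracted content and re-index only when the hash differs from the stored one. A minimal sketch — the in-memory map is a stand-in for whatever persistent store you use:

```typescript
import { createHash } from 'node:crypto';

// In-memory stand-in for a persistent store of url -> content hash
const seen = new Map<string, string>();

function contentHash(markdown: string): string {
  return createHash('sha256').update(markdown).digest('hex');
}

// Returns true if the page content changed since the last crawl
function hasChanged(url: string, markdown: string): boolean {
  const hash = contentHash(markdown);
  if (seen.get(url) === hash) return false;
  seen.set(url, hash); // record the new version
  return true;
}

console.log(hasChanged('https://example.com/pricing', 'Pro plan: $49/mo')); // true (first crawl)
console.log(hasChanged('https://example.com/pricing', 'Pro plan: $49/mo')); // false (unchanged)
console.log(hasChanged('https://example.com/pricing', 'Pro plan: $59/mo')); // true (price changed)
```

Hashing the cleaned markdown rather than the raw HTML avoids false positives from rotating ads, timestamps, or session tokens in the page source.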
Web scraping is how RAG pipelines stay current with public web content. Without it, your knowledge base is either static (based on a one-time data dump) or limited to sources with official APIs.
```typescript
import { KnowledgeSDK } from '@knowledgesdk/node';

const client = new KnowledgeSDK({ apiKey: process.env.KNOWLEDGE_API_KEY });

// Build a RAG knowledge base from your docs site
const extraction = await client.extract({
  url: 'https://docs.yourproduct.com',
  crawlSubpages: true,
  maxPages: 1000,
});

// Now search it for RAG context
const context = await client.search({
  query: 'How do I authenticate with the API?',
  limit: 5,
});

// Use context in your LLM prompt
const prompt = `
Answer the user's question using only the following documentation:

${context.items.map(item => `## ${item.title}\n${item.content}`).join('\n\n')}

Question: How do I authenticate with the API?
`;
```
## The Decision Framework
Here's how to think about the RAG vs. fine-tuning decision:
### Use Fine-Tuning When:
The task requires consistent behavioral patterns, not just facts. Example: a customer service bot that must always respond in a specific format, follow specific escalation patterns, or use company-specific terminology. Fine-tuning makes these behaviors consistent and reliable.
The domain has specialized language the base model doesn't understand well. Example: medical coding (ICD-10), legal terminology in a specific jurisdiction, a proprietary DSL. Fine-tuning teaches the model the vocabulary and usage patterns.
Latency is critical and RAG's retrieval overhead is unacceptable. Example: autocomplete systems where even 50ms additional latency degrades user experience. A fine-tuned model with baked-in knowledge can respond faster than a RAG pipeline.
The information is stable and won't change. Example: historical facts, scientific constants, established procedures that don't change over time.
### Use RAG (with Web Scraping) When:
The information changes over time. If your knowledge base updates more than once a quarter, fine-tuning to keep up is uneconomical. RAG lets you update the knowledge base without touching the model.
You need exact, specific facts. Fine-tuning doesn't reliably memorize specific facts (product IDs, exact prices, exact API parameters). RAG retrieves the exact document and puts it in context.
You need to cite sources. If your users need to verify answers, RAG provides the retrieved documents. Fine-tuned models can't tell you where they learned something.
The data is on the public web. Your competitor's documentation, industry news, regulatory databases, academic papers — this data is only accessible via web scraping, making it RAG territory by default.
You're building a product with a large, varied knowledge domain. A customer support bot covering thousands of products, a research assistant covering millions of documents — these don't fit in a model's context window and can't be fully fine-tuned into weights efficiently.
### The Decision Matrix
| Factor | Fine-Tuning | RAG + Scraping |
|---|---|---|
| Information changes frequently | Poor fit | Excellent fit |
| Exact facts matter | Poor fit | Excellent fit |
| Style/tone consistency | Excellent fit | Poor fit |
| Proprietary language/tasks | Excellent fit | Poor fit |
| Source citation needed | Poor fit | Excellent fit |
| Budget for updates | High ongoing cost | Low ongoing cost |
| Data is on the public web | Not applicable | Natural fit |
| Latency is critical | Better | Worse |
| Domain: current events | Poor fit | Excellent fit |
| Domain: specialized reasoning | Excellent fit | Poor fit |
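The matrix can also be encoded as a lookup, which is handy when weighing a new use case against several factors at once. A sketch, with fit labels taken from a few rows of the table above:

```typescript
type Fit = 'excellent' | 'poor';

// Fit of each approach per factor, mirroring rows of the matrix above
const matrix: Record<string, { fineTuning: Fit; rag: Fit }> = {
  'information changes frequently': { fineTuning: 'poor', rag: 'excellent' },
  'exact facts matter': { fineTuning: 'poor', rag: 'excellent' },
  'style/tone consistency': { fineTuning: 'excellent', rag: 'poor' },
  'source citation needed': { fineTuning: 'poor', rag: 'excellent' },
};

// Count which approach fits more of the factors that apply to your use case
function recommend(factors: string[]): 'fine-tuning' | 'rag' {
  let ft = 0;
  let rag = 0;
  for (const f of factors) {
    const row = matrix[f];
    if (!row) continue;
    if (row.fineTuning === 'excellent') ft++;
    if (row.rag === 'excellent') rag++;
  }
  return ft > rag ? 'fine-tuning' : 'rag';
}

console.log(recommend(['exact facts matter', 'source citation needed'])); // "rag"
```

A simple tally like this is no substitute for judgment, but it forces the team to name which factors actually apply before arguing about the answer.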
## When to Use Both
The RAG vs. fine-tuning framing can be misleading because in practice, many production systems use both.
A common pattern:
- Fine-tune for behavior: Train the model to respond in your brand voice, follow your specific format, use your domain's terminology
- RAG for current knowledge: Use retrieval to provide up-to-date facts the fine-tuned model can cite accurately
This gives you the best of both worlds: consistent behavior (fine-tuning) with current, accurate facts (RAG). The web scraping pipeline feeds the RAG half.
```typescript
// Fine-tuned model + RAG context
import OpenAI from 'openai';

const openai = new OpenAI();

const context = await client.search({
  query: userQuestion,
  limit: 5,
});

const systemPrompt = `You are a support agent for Acme Corp. Always respond in a
professional, friendly tone. Address customers by name when known. Cite your sources
when providing technical information.

Relevant documentation:
${context.items.map(item => `Source: ${item.url}\n${item.content}`).join('\n\n')}`;

const response = await openai.chat.completions.create({
  model: 'ft:gpt-4o-mini:your-org:acme-support-v2:abc123', // fine-tuned model
  messages: [
    { role: 'system', content: systemPrompt },
    { role: 'user', content: userQuestion },
  ],
});
```
## The Economics of Each Approach
Understanding the cost structures helps make the right call.
### Fine-Tuning Costs
- One-time training: $500-$5,000 depending on dataset size and model
- Re-training cadence: Every time information changes significantly
- Ongoing inference: Similar to base model inference (no retrieval overhead)
- Total cost for 100k documents, updated monthly: $2,000-$10,000/month
### RAG Costs
- Knowledge base construction: Web scraping API cost + embedding cost (~$0.01-0.05 per page initially)
- Knowledge base updates: Scraping only changed pages (often 5-20% of total per update cycle)
- Ongoing inference: Base model inference + retrieval (embedding query + vector search, ~5-20ms and <$0.001 per query)
- Total cost for 100k documents, updated monthly: $100-$500/month
For most knowledge-intensive applications, RAG is 10-50x cheaper to operate than fine-tuning for the same knowledge coverage.
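The arithmetic behind that multiple is straightforward. Using midpoints of the illustrative ranges above (not vendor quotes), a 100k-document knowledge base works out to roughly:

```typescript
// Illustrative monthly costs for a 100k-document knowledge base,
// using midpoints of the ranges above -- not vendor quotes
const fineTuneMonthly = 6000; // re-train each month: ~$2k-$10k

const pages = 100_000;
const initialPerPage = 0.03;  // scrape + embed: ~$0.01-$0.05/page
const changedFraction = 0.1;  // ~5-20% of pages change per update cycle

const ragInitial = Math.round(pages * initialPerPage);                // one-time build
const ragMonthly = Math.round(pages * changedFraction * initialPerPage); // updates only

console.log(`RAG initial build: $${ragInitial}`);   // $3000
console.log(`RAG monthly updates: $${ragMonthly}`); // $300
console.log(`Fine-tuning vs RAG, monthly: ${fineTuneMonthly / ragMonthly}x`); // 20x
```

Per-query retrieval costs add a little on top, but at under $0.001 per query they rarely change the conclusion.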
## Building a Web Scraping Pipeline for RAG
Here's a production-ready pattern for building and maintaining a RAG knowledge base from web content:
```typescript
import { KnowledgeSDK } from '@knowledgesdk/node';
import OpenAI from 'openai';

const client = new KnowledgeSDK({ apiKey: process.env.KNOWLEDGE_API_KEY });
const openai = new OpenAI();

class RagKnowledgeBase {
  // Initial build: scrape all pages
  async initialBuild(urls: string[]): Promise<void> {
    for (const url of urls) {
      await client.extract({
        url,
        crawlSubpages: true,
        maxPages: 500,
      });
    }
    console.log('Knowledge base built');
  }

  // Ongoing updates: re-scrape and detect changes
  async update(urls: string[]): Promise<{ updated: number; unchanged: number }> {
    let updated = 0;
    let unchanged = 0;
    for (const url of urls) {
      const result = await client.scrape({ url });
      // Re-index only content that has actually changed
      const changed = await this.hasChanged(url, result.markdown);
      if (changed) {
        updated++;
      } else {
        unchanged++;
      }
    }
    return { updated, unchanged };
  }

  // Query for RAG context
  async getContext(query: string, limit = 5): Promise<string> {
    const results = await client.search({ query, limit });
    return results.items
      .map(item => `## ${item.title}\nSource: ${item.url}\n\n${item.content}`)
      .join('\n\n---\n\n');
  }

  private async hasChanged(url: string, newMarkdown: string): Promise<boolean> {
    // Implementation depends on your storage layer (e.g. compare content hashes)
    return true; // simplified
  }
}

// Usage in a RAG query handler
async function handleUserQuery(question: string): Promise<string> {
  const kb = new RagKnowledgeBase();
  const context = await kb.getContext(question);

  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      {
        role: 'system',
        content: `Answer questions using only the provided documentation.
Always cite the source URL when providing specific facts.\n\n${context}`,
      },
      { role: 'user', content: question },
    ],
  });

  return response.choices[0].message.content ?? '';
}
```
## Frequently Asked Questions
**Q: Can fine-tuning improve retrieval quality in a RAG pipeline?**
Yes, but in a specific way. You can fine-tune the embedding model (not the generative model) to better represent your domain's concepts in vector space. This improves retrieval without changing the generative model. It's an advanced technique worth exploring if you have domain-specific vocabulary that general embedding models handle poorly.
**Q: How frequently should I update my RAG knowledge base?**
Depends on content volatility. Documentation sites: re-scrape weekly or use webhooks for change detection. News sites: hourly. Regulatory databases: whenever they publish updates. Pricing pages: daily. The beauty of RAG is you can tune the freshness per source based on your needs.
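That per-source tuning is easy to express as configuration. A sketch using the intervals suggested above (a scheduler or webhooks would drive the actual re-scrapes; the source types and defaults here are illustrative):

```typescript
// Re-scrape interval per source type, in hours (from the guidance above)
const freshness: Record<string, number> = {
  documentation: 7 * 24, // weekly
  news: 1,               // hourly
  pricing: 24,           // daily
};

// Decide whether a source is due for a re-scrape
function isDue(sourceType: string, lastScraped: Date, now: Date): boolean {
  const hours = freshness[sourceType] ?? 24; // default: daily
  return (now.getTime() - lastScraped.getTime()) / 3_600_000 >= hours;
}

const now = new Date('2025-01-10T00:00:00Z');
console.log(isDue('news', new Date('2025-01-09T22:00:00Z'), now));          // true
console.log(isDue('documentation', new Date('2025-01-08T00:00:00Z'), now)); // false
```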
**Q: What if I need both current knowledge AND proprietary task behavior?**
Use the combined approach: fine-tune a smaller base model for your behavioral requirements, then use it as the generative model in your RAG pipeline. The fine-tuned model follows your format and style; the RAG context provides current facts.
**Q: Is there a minimum viable dataset size for fine-tuning?**
OpenAI recommends at least 50-100 examples for meaningful behavioral changes, with quality mattering more than quantity. However, for consistent style and behavior changes, you often need 500-2000 high-quality examples to see reliable improvement.
**Q: My team argues we should fine-tune so we don't need to build a retrieval system. Is that valid?**
It's a valid trade-off for simple, stable domains. If your knowledge is truly static and you don't need citations, fine-tuning is simpler to operate (no vector database, no retrieval pipeline). But "truly static" is rarer than people think — if there's any chance information will change, RAG's update economics win quickly.
**Q: Does KnowledgeSDK work well as the data layer for RAG?**
Yes. KnowledgeSDK's extraction endpoint builds a clean, searchable knowledge base from any set of URLs. The /v1/search endpoint provides hybrid semantic search over all your scraped content. You can use this directly as the retrieval layer in your RAG pipeline, or export the content to your own vector database.
## Conclusion
Fine-tuning and RAG solve different problems. Fine-tuning changes how a model behaves and reasons. RAG changes what a model knows at the moment of inference. For most knowledge-intensive applications — especially those that depend on public web content — RAG with web scraping is the right foundation: cheaper to operate, easier to update, and more accurate on specific facts.
The web scraping layer is the unsung hero of RAG pipelines. It's how current, public information gets into your knowledge base in the first place. KnowledgeSDK provides that layer: scrape any URL, index the content, search it semantically, and keep it fresh with webhooks.
Get your API key at knowledgesdk.com/setup and start building your RAG knowledge base with @knowledgesdk/node or the knowledgesdk Python SDK.