How Web Extraction Cuts Your LLM Costs by 60%
Long context windows are one of the most impressive capabilities in modern LLMs — and one of the most expensive ways to use them. The ability to stuff a million tokens into a single prompt is powerful, but the cost math only works if you actually need all of that context for every query.
For most production workloads, you do not. And the cost difference between the naive approach and a well-architected RAG pipeline is not marginal. It is dramatic.
The Context Problem
GPT-4o is priced at $5 per million input tokens. Claude Sonnet is $3 per million input tokens. These are two of the models most widely used in production applications today.
A medium-sized documentation site (say, 200 pages of technical reference) weighs in at roughly 500,000 tokens when converted to clean text. That is a number that fits within the million-token context windows now available on some frontier models.
So here is the tempting but expensive approach: every time a user asks a question, stuff the entire 500,000 tokens of documentation into the prompt. The LLM will definitely have the answer somewhere in its context. Simple, effective, and catastrophically expensive at scale.
Scenario A: Stuffing Context into Every Query
Let us do the math precisely.
500,000 tokens per query at $5 per million input tokens (GPT-4o's rate, used here as a representative frontier-model price):
Cost per query = 500,000 / 1,000,000 × $5 = $2.50
At 1,000 queries per day — a modest number for any production application — that is:
Daily cost = $2,500
Monthly cost (30 days) = $75,000
This is not hypothetical. Teams that ship naive context-stuffing pipelines hit bills like this and do not understand why until someone does the math.
Scenario B: Extract, Index, and Retrieve
The alternative: extract the website content once, index it as a searchable knowledge base, and at query time retrieve only the relevant chunks.
With well-tuned hybrid search retrieval, the average query returns 2,000–4,000 tokens of relevant context. Let us use 3,000 tokens as a working average.
Cost per query = 3,000 / 1,000,000 × $5 = $0.015
At 1,000 queries per day:
Daily cost = $15
Monthly cost (30 days) = $450
The cost difference: $75,000/month vs $450/month. That is not a 60% reduction; it is roughly a 167x reduction. The 60% figure in the title substantially understates the savings for most workloads.
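If you want to sanity-check these numbers yourself, the arithmetic fits in a few lines of TypeScript. A minimal sketch, using the prices and volumes assumed above:

```typescript
// Recompute the scenario math. Prices are illustrative; check current rates.
const PRICE_PER_MILLION_INPUT_USD = 5; // GPT-4o input rate assumed above
const QUERIES_PER_DAY = 1_000;
const DAYS_PER_MONTH = 30;

function monthlyCostUsd(tokensPerQuery: number): number {
  const costPerQuery = (tokensPerQuery / 1_000_000) * PRICE_PER_MILLION_INPUT_USD;
  return costPerQuery * QUERIES_PER_DAY * DAYS_PER_MONTH;
}

console.log(monthlyCostUsd(500_000)); // 75000 (full context stuffing)
console.log(monthlyCostUsd(3_000)); // 450 (RAG retrieval)
console.log(monthlyCostUsd(500_000) / monthlyCostUsd(3_000)); // ~166.7
```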
The Full Cost Comparison
| Approach | Tokens per Query | Cost per Query | Cost per 1,000 Queries |
|---|---|---|---|
| Full context stuffing (500K tokens) | 500,000 | $2.50 | $2,500 |
| Moderate context (50K tokens) | 50,000 | $0.25 | $250 |
| RAG with hybrid retrieval (3K tokens) | 3,000 | $0.015 | $15 |
| Fine-tuned model (0 retrieval tokens) | ~500 (query only) | ~$0.003 | $3 |
Fine-tuning looks cheapest at the bottom of this table, but it comes with hidden costs that the token price does not capture — which brings us to why fine-tuning is not the answer either.
Why Fine-Tuning Is Not the Answer
The obvious question when looking at that table: if fine-tuning is cheapest per query, why not fine-tune?
Three reasons:
Training cost. Fine-tuning a frontier model on a substantial knowledge base costs thousands to tens of thousands of dollars in compute. For a documentation site that changes monthly, that training cost recurs every time the content updates.
Staleness. The moment you fine-tune, the model's knowledge is frozen. Product documentation, pricing pages, API references — this content changes frequently. A fine-tuned model does not know about changes made after its training run. A RAG pipeline knows about changes the moment you re-index.
Labeled data requirements. Good fine-tuning requires labeled examples, not just raw content. Creating and maintaining those examples is expensive and slow.
Fine-tuning is a strong technique for teaching a model how to behave — not for keeping it up to date on what is currently true.
Why RAG Beats Long Context Beyond Cost
The cost argument is compelling on its own, but RAG has other advantages over long context stuffing:
Freshness. When the documentation changes, you re-extract and re-index that page. The next query immediately gets the updated content. With long context stuffing, you would need to re-generate your context blob every time content changes.
Latency. Shorter context windows process faster. Inference time scales with context length — a 3,000-token context processes significantly faster than a 500,000-token context. For user-facing applications, this difference is felt.
Relevance. Counterintuitively, shorter context often produces better answers. When you give an LLM 500,000 tokens of documentation, it may struggle to focus on the relevant section. When you give it the 3 most relevant chunks, it generates a more precise response. Long context dilution is a real phenomenon in production.
Scalability. Your documentation is not fixed at 200 pages. As it grows, long context stuffing becomes increasingly expensive and eventually impractical. A RAG pipeline scales independently of documentation size.
The Math with KnowledgeSDK
Here is a complete cost-optimized flow using KnowledgeSDK's extraction and search APIs, with the total token consumption calculated:
```typescript
import Knowledgesdk from "@knowledgesdk/node";
import OpenAI from "openai";

const ks = new Knowledgesdk({ apiKey: "knowledgesdk_live_..." });
const openai = new OpenAI({ apiKey: "..." });

async function answerQuestion(question: string): Promise<string> {
  // Step 1: Retrieve relevant chunks (~3K tokens total)
  const results = await ks.search({ query: question, limit: 3 });
  const context = results.items
    .map((item) => `## ${item.title}\n\n${item.content}`)
    .join("\n\n---\n\n");

  // Step 2: Build prompt (~3.5K tokens including question + instructions)
  const prompt = `You are a helpful assistant. Answer the user's question based on the following documentation excerpts.

Documentation:
${context}

Question: ${question}

Answer:`;

  // Step 3: Query LLM with minimal context
  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [{ role: "user", content: prompt }],
    max_tokens: 500,
  });

  return response.choices[0].message.content ?? "";
}

// Total input tokens per query: ~3,500
// Cost: 3,500 / 1,000,000 × $5 = $0.0175 per query
const answer = await answerQuestion("How do I implement webhook signature verification?");
console.log(answer);
```
The search call retrieves 3 chunks averaging about 1,000 tokens each. Add the question, system instructions, and prompt wrapper, and you are at roughly 3,500 input tokens per query — compared to 500,000 for the naive approach.
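If you want to verify that budget at runtime, a rough character-count heuristic (about 4 characters per token for English prose) is enough for monitoring; a real tokenizer such as tiktoken would be more precise. A sketch:

```typescript
// Rough token estimate: ~4 characters per token for English prose.
// Good enough for budget monitoring; use a real tokenizer for precise counts.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Inside answerQuestion, after building the prompt:
// console.log(`Prompt is ~${estimateTokens(prompt)} tokens`);
```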
Re-Indexing Only What Changed
The ongoing extraction cost is also minimal when handled intelligently. Rather than re-extracting your entire knowledge base on a schedule, use KnowledgeSDK's async extraction (with polling or webhooks to track completion) to re-index pages only when their content changes:
```typescript
// Re-extract high-priority pages on a schedule.
// Only pages that have changed will meaningfully update the index.
async function refreshKnowledge(urls: string[]) {
  const jobs = await Promise.all(
    urls.map((url) => ks.extract.async({ url }))
  );
  console.log(`Submitted ${jobs.length} re-extraction jobs`);
  // Poll GET /v1/jobs/{jobId} for completion
}

// Run weekly for your core documentation pages
await refreshKnowledge([
  "https://docs.example.com/pricing",
  "https://docs.example.com/api-reference",
  "https://docs.example.com/changelog",
]);
```
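If you prefer webhooks to polling, a receiver might look like the sketch below. The endpoint path and the event payload shape (`event`, `jobId`, `url`) are assumptions for illustration, not KnowledgeSDK's documented schema; in production you would also verify the webhook signature before trusting the payload.

```typescript
import express from "express";

const app = express();
app.use(express.json());

// Hypothetical payload shape; consult KnowledgeSDK's webhook docs for the real one.
app.post("/webhooks/knowledgesdk", (req, res) => {
  const { event, jobId, url } = req.body;
  if (event === "job.completed") {
    console.log(`Re-extraction finished for ${url} (job ${jobId})`);
    // The updated page is now in the index; the next query sees fresh content.
  }
  res.sendStatus(200);
});

app.listen(3000);
```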
The extraction cost is a one-time investment per page update, not a per-query cost. Once extracted and indexed, the knowledge is queried at negligible cost.
Summary
The economics of LLM-powered applications heavily favor retrieval-augmented approaches over large context stuffing:
- Long context: $2.50 per query, scales linearly with documentation size, stale on content change
- RAG with web extraction: ~$0.015 per query, scales independently, fresh on re-extraction
The infrastructure investment is the extraction and indexing pipeline — which KnowledgeSDK handles for you. The ongoing cost is retrieval queries, which are fast and cheap.
At any meaningful query volume, the choice is not really a tradeoff. It is the difference between a product that is economically viable and one that is not.