What Is Long Context?
Long context refers to the ability of a large language model to accept and coherently process very large amounts of text in a single inference call. While early Transformer models were limited to 512 or 2,048 tokens, modern frontier models have expanded this dramatically:
| Model | Context window |
|---|---|
| GPT-4o | 128,000 tokens (~96,000 words) |
| Claude 3.5 Sonnet | 200,000 tokens (~150,000 words) |
| Gemini 1.5 Pro | 1,000,000 tokens (~750,000 words) |
| Gemini 1.5 Flash | 1,000,000 tokens (~750,000 words) |
A 1 million token context window holds roughly 750,000 words: most of the Harry Potter series, or thousands of pages of documentation, in a single prompt.
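The word counts in the table above come from a rough rule of thumb for English text: about 0.75 words per token (roughly four characters per token). A minimal sketch, assuming that heuristic; actual counts vary by tokenizer and content:

```typescript
// Rough heuristic for English prose: ~0.75 words per token.
// Real tokenizer output varies by model, language, and content type
// (code and non-English text usually tokenize less efficiently).
const WORDS_PER_TOKEN = 0.75;

function approxWords(tokens: number): number {
  return Math.round(tokens * WORDS_PER_TOKEN);
}

console.log(approxWords(128_000));   // ≈ 96,000 words
console.log(approxWords(1_000_000)); // ≈ 750,000 words
```

For budgeting real prompts, use the model provider's own token-counting endpoint or tokenizer rather than this estimate.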
Why Long Context Matters
Long context unlocks use cases that were previously impossible or required complex workarounds:
- Document-level understanding — Analyze an entire legal contract, annual report, or codebase in one shot.
- Whole-book summarization — Pass the entire source material rather than chunking and aggregating.
- Long-running conversations — Maintain coherent context across extended customer interactions.
- Multi-document reasoning — Compare multiple sources simultaneously without retrieval tricks.
- Code repository analysis — Feed an entire codebase to reason about architecture, bugs, and dependencies.
Long Context vs. RAG
Long context and Retrieval-Augmented Generation (RAG) are complementary, not competing approaches:
| Dimension | Long Context | RAG |
|---|---|---|
| Latency | Higher (more tokens to process) | Lower (only relevant chunks) |
| Cost | Higher (proportional to input tokens) | Lower (small context) |
| Precision | Model must attend across the entire input; relevant details can be diluted | Retrieval pre-filters the context to relevant chunks |
| Freshness | Still limited by model training cutoff | Dynamic, real-time data |
| Best for | Bounded, known document sets | Large or dynamic knowledge bases |
For most production systems with large or frequently updated knowledge bases, RAG remains more cost-efficient. Long context shines for bounded tasks where you know exactly what documents are relevant upfront.
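The trade-offs in the table can be condensed into a first-pass routing rule. A minimal sketch, assuming a hypothetical `chooseApproach` helper and treating only corpus size and freshness as inputs; real systems would also weigh latency and cost budgets:

```typescript
// Illustrative heuristic only, not a production policy.
interface CorpusProfile {
  totalTokens: number;        // approximate size of the candidate document set
  updatesFrequently: boolean; // does the content change between calls?
}

function chooseApproach(
  profile: CorpusProfile,
  contextWindow = 200_000, // e.g. Claude 3.5 Sonnet's window
): "long-context" | "rag" {
  // Bounded, static corpora that fit in the window favor long context;
  // large or frequently updated corpora favor retrieval.
  if (!profile.updatesFrequently && profile.totalTokens <= contextWindow) {
    return "long-context";
  }
  return "rag";
}
```

A corpus of 50,000 static tokens would route to `"long-context"`, while a multi-million-token or frequently updated knowledge base routes to `"rag"`.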
The "Lost in the Middle" Problem
Despite impressive context lengths, research (Liu et al., 2023, "Lost in the Middle") has shown that LLMs perform worse on information placed in the middle of a very long context than on information at the beginning or end. Key findings:
- Models reliably find information in the first ~20% and last ~20% of the context.
- Performance degrades significantly for information buried in the middle of long contexts.
- This means naively concatenating 100 documents does not give you "100-document understanding."
Mitigation strategies:
- Place the most important information at the beginning or end of the context.
- Use RAG to pre-select the most relevant chunks rather than including everything.
- Apply re-ranking to put the highest-relevance content near the edges.
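The third mitigation can be sketched as a small ordering function. A minimal sketch, assuming a hypothetical `placeAtEdges` helper and pre-computed relevance scores; it alternates ranked chunks between the front and the back of the prompt so the least relevant material lands in the middle:

```typescript
interface Chunk {
  text: string;
  relevance: number; // higher is more relevant (e.g. a re-ranker score)
}

// Order chunks so the highest-relevance content sits at the edges of the
// prompt, where long-context recall is strongest.
function placeAtEdges(chunks: Chunk[]): Chunk[] {
  const ranked = [...chunks].sort((a, b) => b.relevance - a.relevance);
  const front: Chunk[] = [];
  const back: Chunk[] = [];
  // Deal ranked chunks alternately to the start and the end;
  // the least relevant chunks end up buried in the middle.
  ranked.forEach((chunk, i) =>
    i % 2 === 0 ? front.push(chunk) : back.unshift(chunk),
  );
  return [...front, ...back];
}
```

For example, chunks with relevance scores 5, 4, 3, 2, 1 come out ordered 5, 3, 1, 2, 4: the top two scores sit at the very start and very end of the assembled context.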
Long Context in Practice
```javascript
// Feeding an entire scraped documentation site into a long-context model
import KnowledgeSDK from "@knowledgesdk/node";

const sdk = new KnowledgeSDK({ apiKey: "knowledgesdk_live_..." });

// Get all URLs from a documentation site
const { urls } = await sdk.sitemap("https://docs.example.com");

// Scrape and concatenate (feasible for small doc sites with long-context models)
const pages = await Promise.all(urls.slice(0, 20).map(url => sdk.scrape(url)));
const fullContext = pages.map(p => p.content).join("\n\n---\n\n");

// fullContext is clean markdown — efficient token usage vs. raw HTML
```
For larger sites, combine KnowledgeSDK's /v1/search for RAG-style retrieval with /v1/extract for per-page cleaning, reserving long-context calls for final synthesis steps where the full picture matters.
Cost Considerations
Long context calls are expensive. A 200,000 token input to Claude Opus 4 costs approximately $3.00 per call at current pricing. Design your architecture to use the minimum context required — let RAG narrow the field, then use long context for final reasoning over the retrieved subset.
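The arithmetic behind that figure is simple. A minimal sketch, assuming a hypothetical `inputCostUSD` helper and the $15 per million input tokens list price for Claude Opus 4 at the time of writing; check current pricing before relying on it:

```typescript
// Input-token cost at a given per-million-token rate.
function inputCostUSD(tokens: number, pricePerMTok: number): number {
  return (tokens / 1_000_000) * pricePerMTok;
}

// 200,000 input tokens at $15/MTok (assumed Claude Opus 4 input rate)
console.log(inputCostUSD(200_000, 15)); // ~$3.00 per call
```

Note that this covers input tokens only; output tokens are billed separately, usually at a higher per-token rate.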