What Is Long Context?
Long context refers to the ability of a large language model to accept and coherently process very large amounts of text in a single inference call. While early Transformer models were limited to 512 or 2,048 tokens, modern frontier models have expanded this dramatically:
| Model | Context window |
|---|---|
| GPT-4o | 128,000 tokens (~96,000 words) |
| Claude 3.5 Sonnet | 200,000 tokens (~150,000 words) |
| Gemini 1.5 Pro | 1,000,000 tokens (~750,000 words) |
| Gemini 1.5 Flash | 1,000,000 tokens (~750,000 words) |
A 1 million token context window holds roughly 750,000 words: most of the Harry Potter series, or thousands of pages of documentation, in a single prompt.
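The word counts in the table above come from a rough rule of thumb for English text: about 0.75 words per token (roughly four characters per token). A minimal sketch, assuming that heuristic; actual counts vary by tokenizer and content:

```typescript
// Rough heuristic for English prose: ~0.75 words per token.
// Real tokenizer output varies by model, language, and content type
// (code and non-English text usually tokenize less efficiently).
const WORDS_PER_TOKEN = 0.75;

function approxWords(tokens: number): number {
  return Math.round(tokens * WORDS_PER_TOKEN);
}

console.log(approxWords(128_000));   // ≈ 96,000 words
console.log(approxWords(1_000_000)); // ≈ 750,000 words
```

For budgeting real prompts, use the model provider's own token-counting endpoint or tokenizer rather than this estimate.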
Why Long Context Matters
Long context unlocks use cases that were previously impossible or required complex workarounds:
- Document-level understanding — Analyze an entire legal contract, annual report, or codebase in one shot.
- Whole-book summarization — Pass the entire source material rather than chunking and aggregating.
- Long-running conversations — Maintain coherent context across extended customer interactions.
- Multi-document reasoning — Compare multiple sources simultaneously without retrieval tricks.
- Code repository analysis — Feed an entire codebase to reason about architecture, bugs, and dependencies.
Long Context vs. RAG
Long context and Retrieval-Augmented Generation (RAG) are complementary, not competing approaches:
| Dimension | Long Context | RAG |
|---|---|---|
| Latency | Higher (more tokens to process) | Lower (only relevant chunks) |
| Cost | Higher (proportional to input tokens) | Lower (small context) |
| Precision | Model must attend across the entire input; relevant details can be diluted | Retrieval pre-filters the context to relevant chunks |
| Freshness | Still limited by model training cutoff | Dynamic, real-time data |
| Best for | Bounded, known document sets | Large or dynamic knowledge bases |
For most production systems with large or frequently updated knowledge bases, RAG remains more cost-efficient. Long context shines for bounded tasks where you know exactly what documents are relevant upfront.
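The trade-offs in the table can be condensed into a first-pass routing rule. A minimal sketch, assuming a hypothetical `chooseApproach` helper and treating only corpus size and freshness as inputs; real systems would also weigh latency and cost budgets:

```typescript
// Illustrative heuristic only, not a production policy.
interface CorpusProfile {
  totalTokens: number;        // approximate size of the candidate document set
  updatesFrequently: boolean; // does the content change between calls?
}

function chooseApproach(
  profile: CorpusProfile,
  contextWindow = 200_000, // e.g. Claude 3.5 Sonnet's window
): "long-context" | "rag" {
  // Bounded, static corpora that fit in the window favor long context;
  // large or frequently updated corpora favor retrieval.
  if (!profile.updatesFrequently && profile.totalTokens <= contextWindow) {
    return "long-context";
  }
  return "rag";
}
```

A corpus of 50,000 static tokens would route to `"long-context"`, while a multi-million-token or frequently updated knowledge base routes to `"rag"`.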
The "Lost in the Middle" Problem
Despite impressive context lengths, research (Liu et al., 2023, "Lost in the Middle") has shown that LLMs perform worse on information placed in the middle of a very long context than on information at the beginning or end. Key findings:
- Models reliably find information in the first ~20% and last ~20% of the context.
- Performance degrades significantly for information buried in the middle of long contexts.
- This means naively concatenating 100 documents does not give you "100-document understanding."
Mitigation strategies:
- Place the most important information at the beginning or end of the context.
- Use RAG to pre-select the most relevant chunks rather than including everything.
- Apply re-ranking to put the highest-relevance content near the edges.
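The third mitigation can be sketched as a small ordering function. A minimal sketch, assuming a hypothetical `placeAtEdges` helper and pre-computed relevance scores; it alternates ranked chunks between the front and the back of the prompt so the least relevant material lands in the middle:

```typescript
interface Chunk {
  text: string;
  relevance: number; // higher is more relevant (e.g. a re-ranker score)
}

// Order chunks so the highest-relevance content sits at the edges of the
// prompt, where long-context recall is strongest.
function placeAtEdges(chunks: Chunk[]): Chunk[] {
  const ranked = [...chunks].sort((a, b) => b.relevance - a.relevance);
  const front: Chunk[] = [];
  const back: Chunk[] = [];
  // Deal ranked chunks alternately to the start and the end;
  // the least relevant chunks end up buried in the middle.
  ranked.forEach((chunk, i) =>
    i % 2 === 0 ? front.push(chunk) : back.unshift(chunk),
  );
  return [...front, ...back];
}
```

For example, chunks with relevance scores 5, 4, 3, 2, 1 come out ordered 5, 3, 1, 2, 4: the top two scores sit at the very start and very end of the assembled context.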
Long Context in Practice
```javascript
// Feeding an entire scraped documentation site into a long-context model
import KnowledgeSDK from "@knowledgesdk/node";

const sdk = new KnowledgeSDK({ apiKey: "knowledgesdk_live_..." });

// Get all URLs from a documentation site
const { urls } = await sdk.sitemap("https://docs.example.com");

// Scrape and concatenate (feasible for small doc sites with long-context models)
const pages = await Promise.all(urls.slice(0, 20).map(url => sdk.scrape(url)));
const fullContext = pages.map(p => p.content).join("\n\n---\n\n");

// fullContext is clean markdown — efficient token usage vs. raw HTML
```
For larger sites, combine KnowledgeSDK's /v1/search for RAG-style retrieval with /v1/extract for per-page cleaning, reserving long-context calls for final synthesis steps where the full picture matters.
Cost Considerations
Long context calls are expensive. A 200,000 token input to Claude Opus 4 costs approximately $3.00 per call at current pricing. Design your architecture to use the minimum context required — let RAG narrow the field, then use long context for final reasoning over the retrieved subset.
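The arithmetic behind that figure is simple. A minimal sketch, assuming a hypothetical `inputCostUSD` helper and the $15 per million input tokens list price for Claude Opus 4 at the time of writing; check current pricing before relying on it:

```typescript
// Input-token cost at a given per-million-token rate.
function inputCostUSD(tokens: number, pricePerMTok: number): number {
  return (tokens / 1_000_000) * pricePerMTok;
}

// 200,000 input tokens at $15/MTok (assumed Claude Opus 4 input rate)
console.log(inputCostUSD(200_000, 15)); // ~$3.00 per call
```

Note that this covers input tokens only; output tokens are billed separately, usually at a higher per-token rate.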