What Is a Context Window?
The context window is the maximum number of tokens — the basic units of text a model processes — that a large language model can handle in a single inference call. It includes everything: the system prompt, retrieved documents, conversation history, the user's message, and the generated response.
If the total token count exceeds the context window, the model either errors out or silently truncates the input, leading to lost information and degraded responses.
Token Count vs Word Count
Tokens are not words. Most tokenizers split text into sub-word units:
- "unsubscribe" → 2–3 tokens
- "hello world" → 2 tokens
- Code and punctuation → often 1 token each
- A rough rule: 1 token ≈ 0.75 words in English
A 128,000-token context window holds roughly 96,000 words — about the length of a novel.
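The 0.75 rule above can be turned into a quick budget estimator. This is a rough heuristic, not a real tokenizer — for exact counts, use the tokenizer library that matches your model:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~0.75 words-per-token rule."""
    words = len(text.split())
    return round(words / 0.75)  # ~4/3 tokens per word

print(estimate_tokens("hello world"))  # 3 — heuristics overshoot on short text
```

Real tokenizers diverge from this estimate on code, punctuation-heavy text, and non-English languages, so treat it as a planning tool, not a guardrail.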
Context Windows by Model (as of 2025)
| Model | Context Window |
|---|---|
| GPT-4o | 128,000 tokens |
| Claude 3.5 Sonnet | 200,000 tokens |
| Gemini 1.5 Pro | 1,000,000 tokens |
| Llama 3.1 70B | 128,000 tokens |
| GPT-3.5 Turbo | 16,385 tokens |
Larger context windows have expanded what is possible, but they also introduce new challenges (see "Lost in the Middle" below).
Context Window and RAG
In a RAG pipeline, the context window constrains how much retrieved content you can inject into the prompt. With a typical setup:
```
System prompt:           ~500 tokens
Conversation history:  ~2,000 tokens
Retrieved chunks (5×): ~2,500 tokens
User query:               ~50 tokens
Reserve for output:    ~1,000 tokens
─────────────────────────────────────
Total needed:          ~6,050 tokens
```
Even a 16K-token model comfortably fits this payload, but you must budget deliberately: longer conversation histories or larger chunks can push the total past the limit and cause truncation.
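The budget arithmetic above can be sketched as a simple check against a 16K window. The numbers are the illustrative figures from this section, not measurements:

```python
CONTEXT_WINDOW = 16_385  # GPT-3.5 Turbo

budget = {
    "system_prompt": 500,
    "history": 2_000,
    "retrieved_chunks": 2_500,  # 5 chunks x ~500 tokens
    "user_query": 50,
    "output_reserve": 1_000,
}

total = sum(budget.values())
assert total <= CONTEXT_WINDOW, f"over budget by {total - CONTEXT_WINDOW} tokens"
print(total)  # 6050 — leaves ~10K tokens of headroom
```

Running a check like this before every model call turns silent truncation into an explicit, debuggable error.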
The "Lost in the Middle" Problem
Research has shown that LLMs pay less attention to information placed in the middle of a long context. Performance is highest when relevant information is at the beginning or end of the context.
Implications for RAG:
- Place the most relevant chunks first or last in your prompt
- Do not fill the context window with marginally relevant material
- Re-rank retrieved chunks before prompt assembly and drop low-scoring ones
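One way to act on the first recommendation is to alternate ranked chunks between the front and back of the prompt, so the weakest material lands in the middle. This ordering strategy is an illustration, not a prescribed algorithm:

```python
def order_for_attention(ranked):
    """Place the best chunks at the edges, weaker ones in the middle.

    Input is ranked best-first; output alternates between the two ends:
    [1, 2, 3, 4, 5] -> [1, 3, 5, 4, 2]
    """
    front, back = [], []
    for i, chunk in enumerate(ranked):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]
```

The top-ranked chunk opens the context and the second-ranked chunk closes it — the two positions where attention is strongest.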
Practical Context Management
```python
MAX_CONTEXT_TOKENS = 4000  # budget for retrieved chunks

def build_context(chunks, max_tokens):
    """Greedily pack the highest-ranked chunks that fit the token budget."""
    context = []
    used = 0
    for chunk in chunks:  # already ranked by relevance
        chunk_tokens = count_tokens(chunk.content)  # tokenizer-specific helper
        if used + chunk_tokens > max_tokens:
            break
        context.append(chunk.content)
        used += chunk_tokens
    return "\n\n".join(context)
```
Context Window and KnowledgeSDK
POST /v1/search returns scored, relevance-ranked chunks that are ready to be assembled into a context window. By controlling the limit parameter and filtering by score threshold, you can precisely control how many tokens of retrieved context you inject:
```bash
curl -X POST https://api.knowledgesdk.com/v1/search \
  -H "x-api-key: knowledgesdk_live_..." \
  -H "Content-Type: application/json" \
  -d '{"query": "how does billing work?", "limit": 3}'
```
Limiting to 3 high-quality chunks keeps your prompt compact and avoids the lost-in-the-middle problem.
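Post-processing the response might look like the following sketch. The response shape here — a list of objects with `content` and `score` fields — is an assumption for illustration; check the actual /v1/search schema:

```python
def select_chunks(results, score_threshold=0.7, limit=3):
    """Keep only high-scoring chunks, capped at `limit`.

    Assumes each result is a dict with "content" and "score" keys.
    """
    kept = [r for r in results if r["score"] >= score_threshold]
    return [r["content"] for r in kept[:limit]]

# Example with mock results:
mock = [
    {"content": "Billing runs monthly.", "score": 0.92},
    {"content": "Invoices are emailed.", "score": 0.81},
    {"content": "Unrelated FAQ entry.", "score": 0.40},
]
print(select_chunks(mock))  # ['Billing runs monthly.', 'Invoices are emailed.']
```

Combining a score threshold with a low `limit` keeps the prompt compact even when the search returns many marginal matches.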
Key Takeaways
- Context window = hard limit on all tokens in a single model call
- Budget your context: system prompt + history + retrieved chunks + output reserve
- More is not always better — fewer, higher-quality chunks outperform many mediocre ones
- Always place the most critical information at the beginning or end of your context