RAG & Retrieval · Beginner

Also known as: context length, input context

Context Window

The maximum number of tokens an LLM can process in a single inference call, including both input and output.

What Is a Context Window?

The context window is the maximum number of tokens — the basic units of text a model processes — that a large language model can handle in a single inference call. It includes everything: the system prompt, retrieved documents, conversation history, the user's message, and the generated response.

If the total token count exceeds the context window, the model either errors out or silently truncates the input, leading to lost information and degraded responses.

Token Count vs Word Count

Tokens are not words. Most tokenizers split text into sub-word units:

  • "unsubscribe" → 2–3 tokens
  • "hello world" → 2 tokens
  • Code and punctuation → often 1 token each
  • A rough rule: 1 token ≈ 0.75 words in English

A 128,000-token context window holds roughly 96,000 words — about the length of a novel.
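
You can see the gap between words and tokens by running a tokenizer directly. A minimal sketch using the open-source tiktoken library with the cl100k_base encoding (the encoding choice is an assumption; match it to your model):

import tiktoken

# cl100k_base is the encoding used by several recent OpenAI chat models
enc = tiktoken.get_encoding("cl100k_base")

text = "Context windows constrain how much a model can read at once."
tokens = enc.encode(text)

print(f"{len(text.split())} words -> {len(tokens)} tokens")
print([enc.decode([t]) for t in tokens])  # inspect the sub-word pieces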

Context Windows by Model (as of 2025)

Model               Context Window
GPT-4o              128,000 tokens
Claude 3.5 Sonnet   200,000 tokens
Gemini 1.5 Pro      1,000,000 tokens
Llama 3.1 70B       128,000 tokens
GPT-3.5 Turbo       16,385 tokens

Larger context windows have expanded what is possible, but they also introduce new challenges (see "Lost in the Middle" below).

Context Window and RAG

In a RAG pipeline, the context window constrains how much retrieved content you can inject into the prompt. With a typical setup:

System prompt:          ~500 tokens
Conversation history:   ~2,000 tokens
Retrieved chunks (5×):  ~2,500 tokens
User query:             ~50 tokens
Reserve for output:     ~1,000 tokens
─────────────────────────────────────
Total needed:           ~6,050 tokens

Even a 16K-token model fits this payload comfortably, but the margin shrinks quickly as conversation history grows or chunk counts rise, so budget deliberately to avoid truncation.
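
The same arithmetic makes a cheap preflight check before each model call. A minimal sketch (the numbers mirror the budget above; set CONTEXT_WINDOW to whatever your model supports):

CONTEXT_WINDOW = 16_385  # e.g. GPT-3.5 Turbo

budget = {
    "system_prompt": 500,
    "history": 2_000,
    "retrieved_chunks": 2_500,
    "user_query": 50,
    "output_reserve": 1_000,
}

total = sum(budget.values())
assert total <= CONTEXT_WINDOW, f"budget {total} exceeds window {CONTEXT_WINDOW}"

headroom = CONTEXT_WINDOW - total  # room left for extra retrieved context
print(f"total: {total} tokens, headroom: {headroom}")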

The "Lost in the Middle" Problem

Research has shown that LLMs pay less attention to information placed in the middle of a long context (Liu et al., 2023, "Lost in the Middle: How Language Models Use Long Contexts"). Performance is highest when relevant information sits at the beginning or end of the context.

Implications for RAG:

  • Place the most relevant chunks first or last in your prompt (see the ordering sketch after this list)
  • Do not fill the context window with marginally relevant material
  • Re-rank retrieved chunks before prompt assembly and drop low-scoring ones
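
One common mitigation is to interleave ranked chunks so the strongest material sits at both ends of the context. A minimal sketch, assuming the input list is already sorted best-first (for example, by a re-ranker):

def order_for_context(chunks):
    """Alternate ranked chunks between the front and the back.

    chunks is sorted best-first. Rank 1 ends up first, rank 2 last,
    rank 3 second, rank 4 second-to-last, and so on, leaving the
    weakest material in the middle.
    """
    front, back = [], []
    for i, chunk in enumerate(chunks):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

print(order_for_context([1, 2, 3, 4, 5]))  # -> [1, 3, 5, 4, 2]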

Practical Context Management
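
The simplest strategy is a greedy fill: walk the ranked chunks in order and stop once the token budget is spent. In the sketch below, count_tokens relies on tiktoken as an assumed tokenizer; swap in whatever matches your model.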

import tiktoken

MAX_CONTEXT_TOKENS = 4000  # budget for retrieved chunks

# Assumed tokenizer; use the encoding that matches your model
_enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(text):
    return len(_enc.encode(text))

def build_context(chunks, max_tokens=MAX_CONTEXT_TOKENS):
    """Greedily pack ranked chunks until the token budget is spent."""
    context = []
    used = 0
    for chunk in chunks:  # already ranked by relevance
        chunk_tokens = count_tokens(chunk.content)
        if used + chunk_tokens > max_tokens:
            break  # drop the rest rather than truncate mid-chunk
        context.append(chunk.content)
        used += chunk_tokens
    return "\n\n".join(context)
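
Note the break: truncating a chunk mid-sentence tends to do more harm than dropping it, and because the list is ranked, everything past the cutoff is the least relevant material anyway. The result can also be passed through order_for_context above before final assembly.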

Context Window and KnowledgeSDK

POST /v1/search returns scored, relevance-ranked chunks that are ready to be assembled into a context window. The limit parameter and a score threshold let you control how much retrieved context reaches the prompt:

curl -X POST https://api.knowledgesdk.com/v1/search \
  -H "x-api-key: knowledgesdk_live_..." \
  -H "Content-Type: application/json" \
  -d '{"query": "how does billing work?", "limit": 3}'

Limiting to 3 high-quality chunks keeps your prompt compact and avoids the lost-in-the-middle problem.
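
Tying it together, here is a sketch that calls the endpoint and assembles the surviving chunks into prompt context. The response shape (a results array with content and score fields) and the 0.5 threshold are assumptions; check them against the API reference:

import requests

resp = requests.post(
    "https://api.knowledgesdk.com/v1/search",
    headers={
        "x-api-key": "knowledgesdk_live_...",
        "Content-Type": "application/json",
    },
    json={"query": "how does billing work?", "limit": 3},
)
resp.raise_for_status()

# Assumed response shape: {"results": [{"content": "...", "score": 0.87}, ...]}
chunks = [r for r in resp.json()["results"] if r["score"] >= 0.5]
context = "\n\n".join(r["content"] for r in chunks)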

Key Takeaways

  • Context window = hard limit on all tokens in a single model call
  • Budget your context: system prompt + history + retrieved chunks + output reserve
  • More is not always better — fewer, higher-quality chunks outperform many mediocre ones
  • Always place the most critical information at the beginning or end of your context

Related Terms

  • Retrieval-Augmented Generation (RAG & Retrieval · Beginner): A technique that grounds LLM responses by retrieving relevant documents from an external knowledge base before generation.
  • Chunking (RAG & Retrieval · Beginner): The process of splitting long documents into smaller, overlapping or non-overlapping segments before embedding and indexing.
  • Token (LLMs · Beginner): The basic unit of text processed by an LLM — roughly 3/4 of a word in English — that models use to read and generate language.