Context Engineering: The Developer's Complete Guide (2026)
In mid-2025, Andrej Karpathy posted a tweet that quietly reframed how the best AI developers think about their work. The gist: "prompt engineering" was never really the skill. The real skill is context engineering — deliberately deciding what information goes into an LLM's context window at every step.
That idea has since become the organizing principle behind most production AI systems worth talking about. If you're still thinking about AI development primarily in terms of prompt wording, this guide is for you.
What Context Engineering Actually Is
Prompt engineering focuses on how you phrase a question. Context engineering focuses on everything the model can see when it answers.
The distinction matters because LLMs don't have opinions — they have attention. They process whatever is in their context window and generate a response based on that. The quality of that response is almost entirely determined by what you put in front of them.
Context engineering is the discipline of designing that input deliberately. It covers:
- What information sources to include
- How much of each source to include
- In what order and format
- What to exclude entirely
- How to keep it all fresh
The Four Context Sources
Every production LLM system draws from some combination of four source types:
1. System prompt / instructions. This is where you define the model's role, constraints, and behavior rules. Most developers treat this as boilerplate. Context engineers treat it as prime real estate — kept tight, versioned, and tested.
2. Conversation history. The running thread of what the user and assistant have said. In a multi-turn application, this grows fast. Left unmanaged, it fills the context window with old, low-value turns before the model ever sees the current question.
3. Retrieved knowledge (RAG). Information fetched at runtime from a data store — vector database, search index, document store — based on the current query. This is where most context engineering complexity lives.
4. Tool outputs. Results from function calls made mid-conversation. A web search, a calculator, a database lookup — the outputs get injected back into the context for the model to reason over.
Great context engineers know which sources to activate for a given query and how much to pull from each.
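If it helps to see the four sources side by side, here is a minimal sketch of how they might land in a single chat-style request. The roles, placeholder contents, and the choice to inject retrieved chunks and tool output as extra system messages are assumptions for illustration, not a specific vendor's API:

```typescript
// Simplified sketch: the four context sources flattened into one message list.
// Every string below is an illustrative placeholder, not real data.
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

const messages: ChatMessage[] = [
  // 1. System prompt: role, constraints, behavior rules
  { role: "system", content: "You are a research assistant. Cite sources. Keep answers under 200 words." },
  // 2. Conversation history: prior turns, pruned or summarized
  { role: "user", content: "Compare our pricing against Competitor X." },
  { role: "assistant", content: "Competitor X lists three tiers. Which segment matters most to you?" },
  // 3. Retrieved knowledge: chunks fetched at runtime for the current query
  { role: "system", content: "## Retrieved context\n## Competitor Pricing\n$29/month starter plan" },
  // 4. Tool outputs: function-call results injected back for the model to reason over
  { role: "system", content: "## Tool output (web_search)\nCompetitor X updated its starter pricing this quarter." },
  // The current user message goes last
  { role: "user", content: "What does their starter plan cost now?" },
];
```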
The Context Budget Problem
128K tokens sounds generous. 1 million tokens sounds almost unlimited. Neither holds up against a real production application.
Do the math for a typical research agent:
- System prompt: 800 tokens
- Last 10 conversation turns: 4,000 tokens
- 5 retrieved knowledge chunks (500 tokens each): 2,500 tokens
- Tool call outputs: 1,500 tokens
- Current user message: 200 tokens
That's roughly 9,000 tokens per turn. Fine. But then consider:
- Users who want 20-turn conversations
- Agents that make 8-10 tool calls before answering
- Retrieved chunks that are 2,000 tokens each because you didn't chunk properly
- System prompts that grew to 3,000 tokens because nobody cleaned them
The budget pressure is real, and it forces precision. Every token you include is a trade-off against something else the model could have seen.
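One way to keep that pressure visible is to account for tokens per source before every call. Here is a rough sketch that uses a characters-divided-by-four estimate in place of a real tokenizer; the source names and limits are illustrative, not prescribed anywhere in this guide:

```typescript
// Rough per-source token budgeting. Swap estimateTokens for a real tokenizer
// in production; chars/4 is only a ballpark for English text.
const BUDGETS = {
  system: 1_000,
  history: 4_000,
  retrieved: 3_000,
  tools: 2_000,
  user: 500,
} as const;

type Source = keyof typeof BUDGETS;

function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

function enforceBudget(source: Source, text: string): string {
  const tokens = estimateTokens(text);
  if (tokens <= BUDGETS[source]) return text;
  // Hard cut as a last resort; prefer summarizing or retrieving less upstream.
  console.warn(`${source} over budget: ${tokens} > ${BUDGETS[source]} tokens`);
  return text.slice(0, BUDGETS[source] * 4);
}
```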
Strategies for Managing Context
Selective retrieval. Don't retrieve everything — retrieve what's relevant to the current query. A well-tuned retrieval step should return 2-4 focused chunks, not 20 loosely related ones. Hybrid search (keyword + semantic) dramatically improves precision.
Summarization. Long conversation histories can be summarized into a compact memory object. Instead of 30 raw turns, inject a 300-token summary of what's been established, plus the last 3 turns verbatim.
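A sketch of that pattern, assuming a summarize() helper that wraps whatever model you already use; the helper and the Turn shape are assumptions, not part of any SDK mentioned in this guide:

```typescript
type Turn = { role: "user" | "assistant"; content: string };

// Compress everything except the most recent turns into one compact memory block.
async function compactHistory(
  turns: Turn[],
  summarize: (text: string) => Promise<string>, // e.g. a cheap LLM call capped at ~300 tokens
  keepLast = 3
): Promise<{ memory: string; recent: Turn[] }> {
  const older = turns.slice(0, -keepLast);
  const recent = turns.slice(-keepLast);
  if (older.length === 0) return { memory: "", recent };

  const transcript = older.map(t => `${t.role}: ${t.content}`).join("\n");
  const memory = await summarize(
    `Summarize what has been established so far, in under 300 tokens:\n${transcript}`
  );
  return { memory, recent };
}
```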
Sliding windows. For very long sessions, only include the N most recent turns plus any turns explicitly marked as important (e.g., the user's initial goal statement).
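A minimal sketch of that selection logic, assuming each turn carries a pinned flag set when it's marked as important:

```typescript
type SessionTurn = { role: "user" | "assistant"; content: string; pinned?: boolean };

// Keep the N most recent turns plus anything explicitly pinned (e.g. the initial goal statement).
function slidingWindow(turns: SessionTurn[], n = 8): SessionTurn[] {
  const recent = new Set(turns.slice(-n));
  return turns.filter(t => t.pinned || recent.has(t));
}
```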
Tool call caching. If a tool call returned a result 2 turns ago and nothing has changed, don't re-run it. Cache the output and re-inject it directly. This is especially important for web extraction calls that take 30+ seconds.
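A simple version is a cache keyed on the tool name plus its arguments, with a TTL so results eventually refresh. The runTool signature and the 10-minute TTL below are assumptions for illustration:

```typescript
type ToolRunner = (name: string, args: Record<string, unknown>) => Promise<string>;

const toolCache = new Map<string, { output: string; fetchedAt: number }>();
const TTL_MS = 10 * 60 * 1000; // re-run after 10 minutes; tune per tool

async function cachedToolCall(
  runTool: ToolRunner,
  name: string,
  args: Record<string, unknown>
): Promise<string> {
  const key = `${name}:${JSON.stringify(args)}`;
  const hit = toolCache.get(key);
  if (hit && Date.now() - hit.fetchedAt < TTL_MS) {
    return hit.output; // re-inject the cached result instead of waiting 30+ seconds again
  }
  const output = await runTool(name, args);
  toolCache.set(key, { output, fetchedAt: Date.now() });
  return output;
}
```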
Structured injection. Format matters. A model reading a clean chunk like `## Competitor Pricing\n$29/month starter plan` will extract that fact more reliably than the same information buried in raw HTML.
Web Data as a Context Source
One of the most underused context sources is live web content — and it's also one of the most valuable.
When your agent needs to answer questions about competitors, current documentation, recent news, or any content that changes over time, you have two options:
- Hope the model's training data is recent enough
- Fetch the live content and inject it
Option 1 fails constantly. Training data has a cutoff date, and even before that date it wasn't exhaustive. Option 2 works, but it introduces complexity: how do you fetch web content cleanly enough to inject it into a context window without blowing your budget?
Raw HTML is a disaster — a typical webpage is 50,000+ tokens of tags, scripts, and navigation noise. You need extraction that returns clean, structured, chunked content.
Where KnowledgeSDK Fits
This is the problem KnowledgeSDK solves. You point it at a URL, and instead of a wall of HTML, you get back indexed, searchable chunks of that page's actual content.
The workflow looks like this:
```typescript
import KnowledgeSDK from "@knowledgesdk/node";

const ks = new KnowledgeSDK({ apiKey: "knowledgesdk_live_..." });

// Index competitor's pricing page
await ks.extract({ url: "https://competitor.com/pricing" });

// Later, when a user asks about pricing, retrieve the relevant chunk
const context = await ks.search({ query: "competitor starter plan pricing" });
// Returns: 2-3 focused chunks, ~600 tokens total
```
The POST /v1/search endpoint returns hybrid keyword + semantic results. Instead of stuffing an entire website into context, you get the 2-4 chunks that actually answer the current question.
Assembling Context from Multiple Sources
A production context assembly pattern might look like this:
```typescript
async function buildContext(userMessage: string, history: Message[]) {
  // 1. System prompt (static, versioned)
  const systemPrompt = loadSystemPrompt();

  // 2. Summarized conversation history
  const memory = summarizeHistory(history);

  // 3. Retrieved web knowledge
  const webChunks = await ks.search({ query: userMessage, limit: 3 });
  const webContext = webChunks.results
    .map(r => `## ${r.title}\n${r.content}`)
    .join("\n\n");

  // 4. Assemble in priority order
  return {
    system: systemPrompt,
    context: `${memory}\n\n## Relevant Knowledge\n${webContext}`,
    userMessage,
  };
}
```
Notice the deliberate ordering: system instructions first, then memory, then retrieved knowledge, then the user's message. This isn't arbitrary — models are sensitive to where information sits in the window, and the structure signals what's ground truth vs. what's supplementary.
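To close the loop, here is a sketch of how that assembled object might be handed to a chat-completions-style API. The model name, the OpenAI client, and the choice to carry memory and retrieved knowledge in a second system message are illustrative assumptions, not something the pattern above requires:

```typescript
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function answer(userMessage: string, history: Message[]) {
  const { system, context, userMessage: current } = await buildContext(userMessage, history);

  const completion = await openai.chat.completions.create({
    model: "gpt-4o-mini", // illustrative model choice
    messages: [
      { role: "system", content: system },
      { role: "system", content: context }, // memory + retrieved knowledge as supplementary context
      { role: "user", content: current },
    ],
  });

  return completion.choices[0].message.content;
}
```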
Anti-Patterns to Avoid
Stuffing the entire website into context. A 200-page documentation site is not a context source. It's a data dump. Extract and index it first, then search for what's relevant at query time.
Not updating stale knowledge. If you extracted a competitor's pricing page 6 months ago and haven't re-crawled it since, your agent is giving users outdated information. Set up re-extraction on a schedule or on webhook triggers.
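A bare-bones way to keep an indexed page fresh is a plain timer around the extract call from earlier; the 24-hour interval and the URL list are assumptions, and a real deployment would more likely use a job scheduler or webhook triggers:

```typescript
// Re-extract tracked pages once a day so retrieved chunks never go stale.
// Assumes `ks` is the KnowledgeSDK client created earlier in this guide.
const trackedUrls = ["https://competitor.com/pricing"];

setInterval(async () => {
  for (const url of trackedUrls) {
    try {
      await ks.extract({ url }); // re-indexes the page; subsequent searches see the new content
    } catch (err) {
      console.error(`Re-extraction failed for ${url}`, err);
    }
  }
}, 24 * 60 * 60 * 1000);
```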
Relying on a single retrieval strategy. Using pure semantic search for a query like "what is the exact price of the pro plan" will return poor results. Hybrid search handles both fact retrieval and conceptual queries. Don't pick one strategy and apply it everywhere.
No context budget accounting. If you don't measure how many tokens each source consumes, you'll hit limits in production unexpectedly. Track token counts per source and set hard limits per category.
Treating conversation history as append-only. Long histories degrade performance. Summarize aggressively, prune low-value turns, and always test what happens at turn 50, not just turn 5.
Conclusion
The shift from "prompt engineering" to "context engineering" is really a shift from thinking about AI as magic words to thinking about AI as a reasoning system that's only as capable as the information it can see.
Karpathy was right. Phrasing your prompt better is a rounding error. Deciding what your model can see — and managing that precisely — is the actual work.
The developers building the most capable AI systems in 2026 are the ones who treat context as a first-class engineering problem: measured, optimized, versioned, and continuously improved. Start treating your context window like the scarce resource it is, and your agents will start behaving like they actually know things.