What Is Working Memory?
Working memory is the cognitive science term for the small amount of information an agent holds "in mind" and actively manipulates during a task. In AI systems, working memory maps directly to the context window — everything the model can "see" at the moment it generates a response.
Unlike long-term or episodic memory, which persists across sessions and is stored externally, working memory is ephemeral. When the conversation ends or the context is cleared, the working memory is gone.
What Lives in Working Memory
During a typical agent interaction, the context window holds:
- The system prompt (agent instructions, persona, tools)
- The full conversation history so far
- Retrieved documents injected by RAG
- Tool call results returned from external APIs
- The user's current message
Everything the model "knows" in the moment it generates a reply comes from this window. There is no other mechanism for an LLM to access information during inference.
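The assembly of these pieces can be sketched in a few lines. This is a minimal illustration, not any particular framework's API; `build_context` and all field names are made up for the example.

```python
# Minimal sketch: everything the model "sees" is concatenated into one
# prompt string before each generation call. All names are illustrative.

def build_context(system_prompt, history, retrieved_docs, tool_results, user_message):
    """Concatenate every source of working memory into a single context."""
    parts = [system_prompt]
    parts += [f"{turn['role']}: {turn['content']}" for turn in history]
    parts += [f"[retrieved] {doc}" for doc in retrieved_docs]
    parts += [f"[tool result] {result}" for result in tool_results]
    parts.append(f"user: {user_message}")
    return "\n\n".join(parts)

context = build_context(
    system_prompt="You are a helpful coding agent.",
    history=[{"role": "user", "content": "Hi"},
             {"role": "assistant", "content": "Hello!"}],
    retrieved_docs=["README excerpt..."],
    tool_results=["All 12 tests passed."],
    user_message="Why did the build fail?",
)
```

Whatever is not in `context` when the model is called simply does not exist for that response.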
The Working Memory Bottleneck
Context windows have grown dramatically, from 2K tokens in the original GPT-3 to 200K or more in modern models, but they remain finite. This creates practical constraints:
- Long conversations: Eventually the conversation history exceeds the window and must be truncated or summarized.
- Large documents: A 300-page PDF typically cannot fit in context; it must be chunked and retrieved selectively.
- Multi-step tasks: A complex agent task with many tool calls accumulates context quickly.
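The finite-window constraint can be made concrete with a quick budget check. This is a rough sketch: real systems count tokens with a model-specific tokenizer, while here tokens are crudely approximated as about four characters each, and the 8,192-token limit is a hypothetical example.

```python
# Illustrative check of the finite-window constraint. A real system would
# use the model's tokenizer; ~4 characters per token is an assumption.

CONTEXT_LIMIT = 8192  # hypothetical model limit, in tokens

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def fits_in_window(*segments: str) -> bool:
    return sum(estimate_tokens(s) for s in segments) <= CONTEXT_LIMIT

# A 300-page PDF at roughly 2,000 characters per page blows the budget:
pdf_text = "x" * (300 * 2000)
print(fits_in_window(pdf_text))  # False: must be chunked and retrieved selectively
```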
Working Memory Management Strategies
Since working memory is scarce, sophisticated agent systems manage it deliberately:
- Summarization: Compress older conversation turns into a summary to reclaim tokens while preserving meaning.
- Selective retrieval: Use RAG to pull only the most relevant chunks into context rather than loading everything.
- Memory offloading: Move completed sub-task results out of context and into external storage, referencing them by ID if needed again.
- Sliding window: Keep only the N most recent turns in context and let older turns roll off.
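Two of these strategies, the sliding window and summarization, compose naturally and can be sketched together. The `summarize` function below is a placeholder for what would be an LLM summarization call in a real system; the 10-turn window size is an arbitrary choice for the example.

```python
# Sketch combining a sliding window with summarization: keep the last
# `window` turns verbatim, fold everything older into a running summary.

def summarize(turns):
    # Placeholder: a real system would ask the model to compress these turns.
    return f"Summary of {len(turns)} earlier turns."

def manage_history(history, window=10):
    """Keep the last `window` turns; compress the rest to reclaim tokens."""
    if len(history) <= window:
        return history
    older, recent = history[:-window], history[-window:]
    summary_turn = {"role": "system", "content": summarize(older)}
    return [summary_turn] + recent

history = [{"role": "user", "content": f"turn {i}"} for i in range(25)]
managed = manage_history(history)
print(len(managed))  # 11: one summary turn plus the 10 most recent
```

The trade-off is lossy compression: the summary preserves gist, not detail, which is why offloading full results to external storage (retrievable by ID) is often paired with it.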
Practical Example
Imagine an agent helping a developer debug a large codebase. The entire codebase does not fit in context. The working memory at any given moment contains:
- The agent's system prompt
- The last 10 conversation turns
- The 3 most relevant code files retrieved by semantic search
- The output of the last tool call (e.g., a test run result)
The agent reasons over this slice of information, then decides which files to retrieve next, updating its working memory iteratively.
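The loop described above can be sketched as follows. Everything here is a stand-in: `retrieve`, `run_tests`, and `agent_step` are hypothetical placeholders for a semantic-search index, a test runner, and a model call, and the loop stops after one pass for the example.

```python
# Hedged sketch of the iterative debugging loop: working memory is
# rebuilt each iteration from the prompt, recent turns, retrieved files,
# and the latest tool output. All functions are illustrative stand-ins.

def retrieve(query, k=3):
    return [f"file_{i}.py (matched '{query}')" for i in range(k)]

def run_tests():
    return "2 failures in test_parser.py"

def agent_step(context):
    # A real agent would call the model here; this stub finishes at once.
    return {"done": True, "next_query": None}

def debug_loop(system_prompt, user_request, max_iters=5):
    history = [("user", user_request)]
    query = user_request
    for _ in range(max_iters):
        context = {
            "system": system_prompt,
            "turns": history[-10:],          # last 10 conversation turns
            "files": retrieve(query),        # 3 most relevant code files
            "tool_output": run_tests(),      # latest tool call result
        }
        step = agent_step(context)
        if step["done"]:
            return context
        query = step["next_query"]

ctx = debug_loop("You are a debugging agent.", "Why does the parser test fail?")
print(len(ctx["files"]))  # 3 retrieved files in working memory
```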
Working Memory vs. Long-Term Memory
| Dimension | Working Memory | Long-Term Memory |
|---|---|---|
| Location | Context window (in-model) | External database |
| Persistence | Single session | Across sessions |
| Capacity | Tokens (finite) | Effectively unlimited |
| Access speed | Instant | Query latency |
| Update | Append to context | Write to database |
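The contrast in the table can be illustrated with a toy two-tier memory: a bounded in-context list next to an external keyed store. In practice the long-term side would be a database or vector index; the class, names, and capacity here are invented for the example.

```python
# Toy illustration of the table's contrast: working memory is bounded
# and ephemeral; long-term memory is an external, persistent store.

class AgentMemory:
    def __init__(self, working_capacity=5):
        self.working = []       # context window: finite, append-only until full
        self.long_term = {}     # external store: effectively unlimited
        self.capacity = working_capacity

    def remember(self, key, item):
        self.working.append(item)
        if len(self.working) > self.capacity:
            self.working.pop(0)            # oldest item falls out of context...
        self.long_term[key] = item         # ...but persists externally

    def recall(self, key):
        return self.long_term.get(key)     # a query, not instant attention

mem = AgentMemory(working_capacity=2)
for i in range(4):
    mem.remember(f"fact{i}", f"value {i}")
print(len(mem.working), mem.recall("fact0"))  # 2 value 0
```

The evicted `fact0` is no longer "in mind," but it can be re-entered into working memory later by querying the store, which is exactly the retrieval pattern from the strategies above.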
Understanding working memory constraints is essential for designing agents that handle complex, long-horizon tasks without losing coherence or hitting context limits.