What Is Working Memory?
Working memory is the cognitive science term for the small amount of information an agent holds "in mind" and actively manipulates during a task. In AI systems, working memory maps directly to the context window — everything the model can "see" at the moment it generates a response.
Unlike long-term or episodic memory, which persists across sessions and is stored externally, working memory is ephemeral. When the conversation ends or the context is cleared, the working memory is gone.
What Lives in Working Memory
During a typical agent interaction, the context window holds:
- The system prompt (agent instructions, persona, tools)
- The full conversation history so far
- Retrieved documents injected by RAG
- Tool call results returned from external APIs
- The user's current message
Everything the model "knows" in the moment it generates a reply comes from this window. There is no other mechanism for an LLM to access information during inference.
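The assembly of these pieces can be sketched in a few lines. This is a minimal illustration, not any particular framework's API; `build_context` and all field names are made up for the example.

```python
# Minimal sketch: everything the model "sees" is concatenated into one
# prompt string before each generation call. All names are illustrative.

def build_context(system_prompt, history, retrieved_docs, tool_results, user_message):
    """Concatenate every source of working memory into a single context."""
    parts = [system_prompt]
    parts += [f"{turn['role']}: {turn['content']}" for turn in history]
    parts += [f"[retrieved] {doc}" for doc in retrieved_docs]
    parts += [f"[tool result] {result}" for result in tool_results]
    parts.append(f"user: {user_message}")
    return "\n\n".join(parts)

context = build_context(
    system_prompt="You are a helpful coding agent.",
    history=[{"role": "user", "content": "Hi"},
             {"role": "assistant", "content": "Hello!"}],
    retrieved_docs=["README excerpt..."],
    tool_results=["All 12 tests passed."],
    user_message="Why did the build fail?",
)
```

Whatever is not in `context` when the model is called simply does not exist for that response.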
The Working Memory Bottleneck
Context windows have grown dramatically, from 2K tokens in the original GPT-3 to 200K or more in modern models, but they remain finite. This creates practical constraints:
- Long conversations: Eventually the conversation history exceeds the window and must be truncated or summarized.
- Large documents: A 300-page PDF typically cannot fit in context; it must be chunked and retrieved selectively.
- Multi-step tasks: A complex agent task with many tool calls accumulates context quickly.
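The finite-window constraint can be made concrete with a quick budget check. This is a rough sketch: real systems count tokens with a model-specific tokenizer, while here tokens are crudely approximated as about four characters each, and the 8,192-token limit is a hypothetical example.

```python
# Illustrative check of the finite-window constraint. A real system would
# use the model's tokenizer; ~4 characters per token is an assumption.

CONTEXT_LIMIT = 8192  # hypothetical model limit, in tokens

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def fits_in_window(*segments: str) -> bool:
    return sum(estimate_tokens(s) for s in segments) <= CONTEXT_LIMIT

# A 300-page PDF at roughly 2,000 characters per page blows the budget:
pdf_text = "x" * (300 * 2000)
print(fits_in_window(pdf_text))  # False: must be chunked and retrieved selectively
```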
Working Memory Management Strategies
Since working memory is scarce, sophisticated agent systems manage it deliberately:
- Summarization: Compress older conversation turns into a summary to reclaim tokens while preserving meaning.
- Selective retrieval: Use RAG to pull only the most relevant chunks into context rather than loading everything.
- Memory offloading: Move completed sub-task results out of context and into external storage, referencing them by ID if needed again.
- Sliding window: Keep only the N most recent turns in context and let older turns roll off.
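Two of these strategies, the sliding window and summarization, compose naturally and can be sketched together. The `summarize` function below is a placeholder for what would be an LLM summarization call in a real system; the 10-turn window size is an arbitrary choice for the example.

```python
# Sketch combining a sliding window with summarization: keep the last
# `window` turns verbatim, fold everything older into a running summary.

def summarize(turns):
    # Placeholder: a real system would ask the model to compress these turns.
    return f"Summary of {len(turns)} earlier turns."

def manage_history(history, window=10):
    """Keep the last `window` turns; compress the rest to reclaim tokens."""
    if len(history) <= window:
        return history
    older, recent = history[:-window], history[-window:]
    summary_turn = {"role": "system", "content": summarize(older)}
    return [summary_turn] + recent

history = [{"role": "user", "content": f"turn {i}"} for i in range(25)]
managed = manage_history(history)
print(len(managed))  # 11: one summary turn plus the 10 most recent
```

The trade-off is lossy compression: the summary preserves gist, not detail, which is why offloading full results to external storage (retrievable by ID) is often paired with it.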
Practical Example
Imagine an agent helping a developer debug a large codebase. The entire codebase does not fit in context. The working memory at any given moment contains:
- The agent's system prompt
- The last 10 conversation turns
- The 3 most relevant code files retrieved by semantic search
- The output of the last tool call (e.g., a test run result)
The agent reasons over this slice of information, then decides which files to retrieve next, updating its working memory iteratively.
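The loop described above can be sketched as follows. Everything here is a stand-in: `retrieve`, `run_tests`, and `agent_step` are hypothetical placeholders for a semantic-search index, a test runner, and a model call, and the loop stops after one pass for the example.

```python
# Hedged sketch of the iterative debugging loop: working memory is
# rebuilt each iteration from the prompt, recent turns, retrieved files,
# and the latest tool output. All functions are illustrative stand-ins.

def retrieve(query, k=3):
    return [f"file_{i}.py (matched '{query}')" for i in range(k)]

def run_tests():
    return "2 failures in test_parser.py"

def agent_step(context):
    # A real agent would call the model here; this stub finishes at once.
    return {"done": True, "next_query": None}

def debug_loop(system_prompt, user_request, max_iters=5):
    history = [("user", user_request)]
    query = user_request
    for _ in range(max_iters):
        context = {
            "system": system_prompt,
            "turns": history[-10:],          # last 10 conversation turns
            "files": retrieve(query),        # 3 most relevant code files
            "tool_output": run_tests(),      # latest tool call result
        }
        step = agent_step(context)
        if step["done"]:
            return context
        query = step["next_query"]

ctx = debug_loop("You are a debugging agent.", "Why does the parser test fail?")
print(len(ctx["files"]))  # 3 retrieved files in working memory
```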
Working Memory vs. Long-Term Memory
| Dimension | Working Memory | Long-Term Memory |
|---|---|---|
| Location | Context window (in-model) | External database |
| Persistence | Single session | Across sessions |
| Capacity | Tokens (finite) | Effectively unlimited |
| Access speed | Instant | Query latency |
| Update | Append to context | Write to database |
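The contrast in the table can be illustrated with a toy two-tier memory: a bounded in-context list next to an external keyed store. In practice the long-term side would be a database or vector index; the class, names, and capacity here are invented for the example.

```python
# Toy illustration of the table's contrast: working memory is bounded
# and ephemeral; long-term memory is an external, persistent store.

class AgentMemory:
    def __init__(self, working_capacity=5):
        self.working = []       # context window: finite, append-only until full
        self.long_term = {}     # external store: effectively unlimited
        self.capacity = working_capacity

    def remember(self, key, item):
        self.working.append(item)
        if len(self.working) > self.capacity:
            self.working.pop(0)            # oldest item falls out of context...
        self.long_term[key] = item         # ...but persists externally

    def recall(self, key):
        return self.long_term.get(key)     # a query, not instant attention

mem = AgentMemory(working_capacity=2)
for i in range(4):
    mem.remember(f"fact{i}", f"value {i}")
print(len(mem.working), mem.recall("fact0"))  # 2 value 0
```

The evicted `fact0` is no longer "in mind," but it can be re-entered into working memory later by querying the store, which is exactly the retrieval pattern from the strategies above.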
Understanding working memory constraints is essential for designing agents that handle complex, long-horizon tasks without losing coherence or hitting context limits.