
Memory — How Agents Build Continuity

Context windows forget. Production agents don't. The difference is a layered architecture: working memory, session memory, and long-term memory split into facts, events, and skills. Here's how real agents remember — and why forgetting on purpose matters.

Tin Dang

An agent that forgets everything at the end of each session is not an assistant. It is a very clever stranger that you meet, brief, and never see again. The chat baseline pretends to remember within a single conversation by re-sending the transcript — we covered that in post 2 — but the moment you close the tab, the costume comes off. Nothing persists.

A real agent, in the sense we mean on this rung, persists. It remembers your name, your preferences, what you were working on last week, which of its past attempts succeeded, which failed, and what it learned in the process. It uses that memory the next time you show up.

This post is about how that actually works. Not the marketing version (“our AI has memory!”) — the architectural version. What gets stored where, what gets retrieved when, and — the part nobody mentions but matters most — what gets deliberately forgotten so the whole system doesn’t collapse under its own weight.

Hand-drawn four-tier memory pyramid — working memory at top, session memory, long-term memory (semantic, episodic, procedural), and retrieval layer at bottom
Four tiers of agent memory. Forgetting on purpose is as important as remembering.

Memory is architecture, not magic

The model itself has no memory. This is worth repeating until it’s muscle memory: every rung above this one is the harness pretending to have a memory, by storing information externally and handing relevant pieces back to the model through the context window each turn.

So when we talk about “an agent’s memory,” we are really talking about four things in combination:

  1. A store where memory is kept across time.
  2. A writer that decides what to put in the store.
  3. A retriever that decides what to pull out at the right moment.
  4. A budget — because everything retrieved costs context window space.

Get any one of these wrong and the whole scheme falls apart. A store with no retriever is a drawer of notes nobody reads. A retriever with no writer has nothing to find. A writer and retriever with no budget discipline blow up the context window and push out the user’s actual message.

Real agent memory is an engineered composition of all four.
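As a rough illustration of how the four parts fit together, here is a minimal sketch. The class name MemorySystem, the word-overlap ranking, and the character-based budget are all assumptions made for the example; a real harness would use an embedding-based retriever and count tokens rather than characters.

```python
from dataclasses import dataclass, field


@dataclass
class MemorySystem:
    """Toy composition of store, writer, retriever, and budget (hypothetical)."""

    budget_chars: int = 200                          # budget: max characters injected per turn
    store: list[str] = field(default_factory=list)   # store: survives across turns

    def write(self, statement: str) -> bool:
        """Writer: keep only non-empty statements that are not already stored."""
        statement = statement.strip()
        if not statement or statement in self.store:
            return False
        self.store.append(statement)
        return True

    def retrieve(self, query: str) -> list[str]:
        """Retriever: rank by shared words, then stop once the budget is spent."""
        query_words = set(query.lower().split())
        scored = [(len(query_words & set(s.lower().split())), s) for s in self.store]
        ranked = [s for score, s in sorted(scored, key=lambda p: -p[0]) if score > 0]
        picked: list[str] = []
        used = 0
        for statement in ranked:
            if used + len(statement) > self.budget_chars:
                break
            picked.append(statement)
            used += len(statement)
        return picked
```

With a small budget, the retriever returns the best match and drops the rest, which is the budget discipline described above in its simplest form.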

The four tiers of memory

Think of memory in an agent as four stacked time horizons, each with a different purpose and cost.

Tier 1 — Working memory (seconds to minutes)

This is just the current context window. The model’s “short-term” memory is whatever happens to be inside the text it’s reading right now. The most recent user message, the last few tool calls, the result of the last retrieval — all of it lives here, and all of it disappears the moment the request ends.

Working memory needs no extra infrastructure (it’s what the model already has), but it is limited and volatile. Nothing survives without being explicitly saved somewhere else.

Tier 2 — Session memory (one conversation)

This is the state kept alive across turns within a single conversation, but not beyond it. In practice it includes:

  • The running transcript.
  • A rolling summary of earlier turns once the transcript gets too long — the harness periodically asks the model to summarize the first half of the conversation so it can be compacted before it falls off the context window.
  • Session-scoped variables the agent has accumulated: “the user is looking for a flight to Tokyo.”

Session memory is the cheap, stateful layer that makes a conversation feel coherent. It typically lives in a cache or a database row keyed by conversation ID, and is cleared or expired when the session ends.
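The rolling summary is easier to see in code. The sketch below assumes a hypothetical SessionMemory class and takes the summarizer as a plain function argument; in a real harness that function would be a model call. The max_turns threshold is an illustrative number, not a recommendation.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class SessionMemory:
    """Session state for one conversation (hypothetical sketch)."""

    summarize: Callable[[list[str]], str]   # stand-in for a model call
    max_turns: int = 6                      # compact once the transcript exceeds this
    summary: str = ""
    transcript: list[str] = field(default_factory=list)
    variables: dict[str, str] = field(default_factory=dict)  # session-scoped facts

    def add_turn(self, turn: str) -> None:
        self.transcript.append(turn)
        if len(self.transcript) > self.max_turns:
            half = len(self.transcript) // 2
            older, self.transcript = self.transcript[:half], self.transcript[half:]
            # Fold the previous summary in so earlier compactions are not lost.
            prior = [self.summary] if self.summary else []
            self.summary = self.summarize(prior + older)

    def context(self) -> list[str]:
        """What the harness would send to the model on the next turn."""
        head = [f"Summary of earlier turns: {self.summary}"] if self.summary else []
        return head + self.transcript


def naive_summarize(turns: list[str]) -> str:
    """Placeholder summarizer: truncates and joins instead of calling a model."""
    return " | ".join(t[:20] for t in turns)
```

The oldest half of the transcript is replaced by one summary line, so the context sent to the model stays bounded while the recent turns remain verbatim.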

Tier 3 — Long-term memory (across sessions)

This is where the real work is. Long-term memory survives across sessions, conversations, weeks, months. It is what makes an agent feel like a continuing presence rather than a series of disconnected strangers.

Crucially, long-term memory is not one thing. Psychologists commonly distinguish semantic, episodic, and procedural memory in humans, and the same taxonomy turns out to be a useful design guide for agents.

Semantic memory — facts the agent has learned. Your name. Your role. Your project’s architecture. Your code style. Your preferred response length. These are static-ish facts that don’t belong to any one conversation but apply across many. They’re usually stored as short natural-language statements (“the user prefers TypeScript over JavaScript”) with metadata (when learned, confidence, source).

Episodic memory — events that happened. “On April 12, the agent and the user debugged the checkout bug together. The root cause was a stale cache.” These are narrative records of what happened, often including the agent’s own actions and observations. Episodic memory is what lets an agent say, correctly, “we looked at this last week — do you want to pick up where we left off?”

Procedural memory — how-to skills. “When the user asks to deploy, run these three tools in this order, wait for CI, then report back.” Procedural memory is the agent’s accumulated repertoire of routines — patterns that worked well and should be repeated. It is the slowest-growing, longest-lived kind of memory, and often the most valuable.

Production systems often store these three kinds separately, with different writers, different retrievers, and different retention policies. Mixing them is possible but usually makes retrieval worse.
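One way to picture the separation is a single record type tagged with its kind, plus a retention policy per kind. Everything here is an assumption for illustration: the names Kind, MemoryRecord, and RETENTION are hypothetical, and the retention periods are made-up values rather than anything recommended by the post.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class Kind(Enum):
    SEMANTIC = "semantic"      # facts the agent has learned
    EPISODIC = "episodic"      # events that happened
    PROCEDURAL = "procedural"  # how-to routines


@dataclass(frozen=True)
class MemoryRecord:
    kind: Kind
    text: str              # short natural-language statement
    learned_at: datetime   # when it was learned
    confidence: float      # 0.0 to 1.0
    source: str            # where it came from, e.g. a session ID


# Illustrative retention periods only. None means "keep until overwritten".
RETENTION: dict[Kind, Optional[timedelta]] = {
    Kind.SEMANTIC: timedelta(days=365),
    Kind.EPISODIC: timedelta(days=90),
    Kind.PROCEDURAL: None,
}


def is_expired(record: MemoryRecord, now: datetime) -> bool:
    """Apply the retention policy for this record's kind."""
    limit = RETENTION[record.kind]
    return limit is not None and now - record.learned_at > limit
```

Keeping the kind on every record lets each kind have its own writer, retriever, and pruning rule, even if they share one physical database.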

Tier 4 — Memory retrieval at inference

All the memory in the world is useless if it can’t be pulled into the context window at the right moment. This tier is the retriever — and mechanically, it is exactly the same idea as RAG from post 4.

When a new user message arrives, the harness:

  1. Embeds the message (and maybe the recent conversation) into a vector.
  2. Searches the long-term memory store for the most relevant facts, events, and skills.
  3. Injects the top matches into the system prompt or an early message, as a compact block of “things the agent remembers.”

This is RAG for the agent’s own past. Everything you know about retrieval — chunking, hybrid search, re-ranking, failure modes — applies here too.
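The three steps above can be sketched end to end. To keep the example runnable without external services, embed() here is a toy word-count vector, which is an assumption and not how a production system would do it; a real harness would call an embedding model and a vector index. The function names and the wording of the injected block are also hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(message: str, memories: list[str], k: int = 2) -> list[str]:
    """Steps 1 and 2: embed the message, then return the top-k matching memories."""
    query = embed(message)
    scored = sorted(
        ((cosine(query, embed(m)), m) for m in memories),
        key=lambda pair: pair[0],
        reverse=True,
    )
    return [m for score, m in scored[:k] if score > 0]


def build_system_prompt(base: str, recalled: list[str]) -> str:
    """Step 3: inject the matches as a compact block in the system prompt."""
    if not recalled:
        return base
    block = "\n".join(f"- {m}" for m in recalled)
    return f"{base}\n\nThings you remember about this user:\n{block}"
```

Note that the top-k cutoff is the only budget control in this sketch; a production retriever would also cap the total size of the injected block.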

What to remember — and what to forget

This is the part of the design that’s easy to underweight. You cannot simply save everything. You have a budget (context window and storage), and you have a cost (irrelevant memories distract the model). Every production memory system is, fundamentally, a deliberate forgetting system.

A few common policies, in rough order of sophistication:

1. Write-everything, retrieve-on-relevance. Save every turn’s raw content to the store, and rely on the retriever to pull only what’s relevant. This works at small scale and breaks at larger scale, because retrieval noise grows as the store fills with near-duplicate and low-value entries.

2. Summarize-then-save. After each session, have a model distill what happened into a compact note (a few sentences per fact, per event, per learned routine) and save those rather than the raw transcript. Storage stays small, retrieval stays sharp.

3. Explicit memory tools. Give the model a tool it can call to save something: remember(statement). The model decides what’s memorable. This sounds fragile and in practice works surprisingly well — models are reasonable judges of what’s worth saving when asked directly.

4. Time decay. Older memories gradually weigh less in retrieval, and eventually get pruned. “Active last month” beats “active two years ago” for most purposes.

5. Consolidation jobs. Periodically, an offline process re-reads the memory store and collapses duplicates, resolves contradictions, and abstracts patterns from episodic memories into procedural ones. This is the closest thing to a “sleep cycle” in agent design, and it’s where the most durable long-term memory comes from.
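Time decay (policy 4) is simple enough to write down. One common choice, assumed here for illustration, is exponential decay with a half-life: score = similarity × 0.5^(age / half-life). The 30-day half-life and the two-year pruning cutoff below are hypothetical values, and the function names are made up for the sketch.

```python
from datetime import datetime, timedelta


def decayed_score(similarity: float, last_active: datetime, now: datetime,
                  half_life_days: float = 30.0) -> float:
    """Halve a memory's retrieval weight for every half-life of inactivity."""
    age_days = max((now - last_active).total_seconds() / 86400.0, 0.0)
    return similarity * 0.5 ** (age_days / half_life_days)


def prune(memories: list[tuple[str, datetime]], now: datetime,
          max_age_days: float = 730.0) -> list[tuple[str, datetime]]:
    """Drop memories that have not been active within the cutoff window."""
    cutoff = now - timedelta(days=max_age_days)
    return [(text, seen) for text, seen in memories if seen >= cutoff]
```

Under these assumptions, a memory that was a strong match but last active two years ago scores far below a weaker match from last month, which is the behavior the policy describes.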

Small agent products can skip most of this. Large ones that stay useful for years have all of it.
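For the explicit memory tool (policy 3), a sketch of what the harness side might look like follows. The tool description uses a generic JSON-Schema-style shape and is not any particular vendor's tool format; REMEMBER_TOOL, handle_tool_call, and the 200-character limit are all assumptions for the example.

```python
import json

MAX_STATEMENT_CHARS = 200  # illustrative limit to keep memories compact

# Generic tool description the harness could show to the model (hypothetical shape).
REMEMBER_TOOL = {
    "name": "remember",
    "description": "Save one short, durable fact about the user or the task.",
    "parameters": {
        "type": "object",
        "properties": {"statement": {"type": "string"}},
        "required": ["statement"],
    },
}


def handle_tool_call(name: str, arguments_json: str, store: list[str]) -> str:
    """Validate a remember(statement) call and write it to the store."""
    if name != REMEMBER_TOOL["name"]:
        return f"error: unknown tool {name!r}"
    try:
        args = json.loads(arguments_json)
    except json.JSONDecodeError:
        return "error: arguments are not valid JSON"
    if not isinstance(args, dict):
        return "error: arguments must be an object"
    statement = str(args.get("statement", "")).strip()
    if not statement:
        return "error: statement is required"
    if len(statement) > MAX_STATEMENT_CHARS:
        return "error: statement is too long"
    store.append(statement)
    return "saved"
```

The model decides what is memorable, but the harness still validates the write, so a malformed or oversized call never reaches the store.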

The subtle danger of remembering too much

Every item stored in long-term memory is an item that might be retrieved and injected into a future prompt. Sometimes that’s what you want. Often it isn’t.

Three real failure modes:

1. Stale memory. The agent learned last year that you were a Python developer. You’ve been writing Rust for six months. The stale fact is still being retrieved and is subtly shaping every reply. Fix: time decay, explicit overwriting, or periodic re-validation (“is this still true?”).

2. Contradictory memory. The agent remembers that your preferred deploy command is ./scripts/deploy.sh. It also remembers, from a later session, that you now use make deploy. Which gets retrieved? Both, sometimes. The model gets confused. Fix: store timestamps, prefer newer, consolidate.

3. Privacy surface. Everything saved to long-term memory is something that can, in principle, leak back out in a future response. “Write me a cover letter” is a safe thing to remember. Medical details, financial credentials, or personal content from another conversation are not. Fix: explicit categories for what’s memorable (never save PII by default), and a clear path for the user to inspect and delete.
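The fix for contradictory memory ("store timestamps, prefer newer, consolidate") can be sketched in a few lines. This assumes each fact carries a topic key such as "deploy_command", which is a hypothetical design choice; deciding that two free-text memories are about the same topic is the hard part in practice and is not shown here.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Fact:
    key: str              # hypothetical topic key, e.g. "deploy_command"
    value: str
    learned_at: datetime  # timestamp used to break contradictions


def consolidate(facts: list[Fact]) -> dict[str, Fact]:
    """Keep only the most recently learned fact for each key."""
    newest: dict[str, Fact] = {}
    for fact in facts:
        current = newest.get(fact.key)
        if current is None or fact.learned_at > current.learned_at:
            newest[fact.key] = fact
    return newest
```

Run against the deploy example above, only make deploy survives, so the retriever can no longer return both commands at once.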

Memory for tools and skills, not just chat

It’s easy to think about memory as being about chat history. The more interesting kind, for agents, is memory about the agent’s own actions.

  • “Last time I tried this approach, the tool returned X.” (Prevents re-doing dead ends.)
  • “The deploy tool often returns a spurious timeout; retry once before reporting failure.” (Learned operational knowledge.)
  • “The user has approved edits in this file before without re-asking.” (Reduces friction.)

This kind of memory is what turns an agent from “a chatty worker with a journal” into something that actually gets better at its job over time. It’s also the least common kind in shipping products, because it requires clear write paths and a principled retrieval story, which are exactly the parts that are easy to skip in the first version of a product.
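A minimal sketch of a write path and a lookup for action memory might look like the following. The ActionMemory class, its method names, and the two-attempt threshold are all hypothetical; the point is only that outcomes are recorded per tool and approach, then consulted before the agent repeats itself.

```python
from collections import defaultdict


class ActionMemory:
    """Records tool outcomes so the agent can avoid known dead ends (sketch)."""

    def __init__(self) -> None:
        # (tool, approach) -> list of outcomes, True for success
        self._outcomes: dict[tuple[str, str], list[bool]] = defaultdict(list)

    def record(self, tool: str, approach: str, succeeded: bool) -> None:
        """Write path: called by the harness after every tool call."""
        self._outcomes[(tool, approach)].append(succeeded)

    def known_dead_end(self, tool: str, approach: str, min_attempts: int = 2) -> bool:
        """True if this approach was tried at least min_attempts times and never worked."""
        results = self._outcomes.get((tool, approach), [])
        return len(results) >= min_attempts and not any(results)

    def success_rate(self, tool: str, approach: str) -> float:
        """Fraction of recorded attempts that succeeded, or 0.0 if never tried."""
        results = self._outcomes.get((tool, approach), [])
        return sum(results) / len(results) if results else 0.0
```

A harness could check known_dead_end() before planning a step and surface the result to the model as one line of remembered context.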

Where this rung ends

With working, session, and long-term memory in place, a single agent can carry out a multi-step job, remember across sessions, and improve over time. That’s already a remarkable thing. It is also, roughly, where most well-engineered products in 2026 live.

But there is a ceiling. A single agent can only hold so many skills at once before it becomes a jack of all trades and master of none. For big, differentiated work, the next move is to split the problem — give it to a team of agents with different specialties, supervised and monitored by a platform around them.

That’s the finale.

Read next: Multi-Agent Systems & Production Platforms.
