Open any AI product in 2026 and you’ll find a chat box. Type a question, read a reply, move on. The interaction feels complete — until the day you ask it to do something instead of explain something, and the cracks appear.
“Book the flight.” “Reconcile these invoices.” “Ship the PR once CI is green.” “Plan the trip and buy the tickets.”
The chat box handles none of these well. Not because the model underneath is weak — the same model will happily describe, in crisp paragraphs, exactly how each of those tasks should be done. The gap is not in the model. The gap is in everything around it.
That gap has a name now. People call the thing on the far side of it an agent. And if you’ve been following the field, you’ve probably been told — many times, by many different product pages — that you’re now using one. Sometimes you actually are. Often you aren’t. The word has been stretched to cover everything from a better autocomplete to a full autonomous worker, and most people are left with a vague feeling that something changed without a clear picture of what.
This series is that picture.
What a chatbot actually is
At the bottom of the ladder is the thing almost everyone already has a mental model for: a chatbot.
You type a question. A large language model reads your question (along with a bit of history, so it seems to remember). It produces an answer, one word at a time. Then it waits for you to type again. When you close the tab, it forgets you existed.
That’s the floor. It is genuinely useful — genuinely a new thing in the world — and also genuinely limited:
- It answers. It does not act.
- It forgets. Each new conversation starts fresh.
- It hallucinates when it doesn’t know. Nothing is checking its claims against reality.
- It can’t reach beyond its own knowledge. It has never read your documents. It has no idea what happened yesterday.
Every product built only on this floor — and many, many are — inherits all four of those limits. No amount of prompt engineering fixes them. They are properties of the substrate.
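To make the floor concrete, here is a minimal sketch of that loop in Python. It is an illustration, not any vendor's API: `fake_model` is a hypothetical stand-in for a real LLM call, and `ChatSession` is a name invented for this example. The point is where the "memory" lives: in a list that disappears with the session.

```python
from typing import Callable, List

# Assumption for this sketch: a "model" is just prompt text in, reply text out.
Model = Callable[[str], str]


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; it echoes the last user line."""
    user_lines = [line for line in prompt.splitlines() if line.startswith("User: ")]
    return "You said: " + user_lines[-1][len("User: "):]


class ChatSession:
    """One conversation. The history exists only as long as this object does."""

    def __init__(self, model: Model) -> None:
        self.model = model
        self.history: List[str] = []

    def ask(self, question: str) -> str:
        # The model "remembers" only because we resend the history every turn.
        self.history.append(f"User: {question}")
        reply = self.model("\n".join(self.history))
        self.history.append(f"Assistant: {reply}")
        return reply


first = ChatSession(fake_model)
first.ask("My name is Ada")

# Closing the tab and opening a new one is a new object with an empty history.
second = ChatSession(fake_model)
```

Nothing in this loop acts, checks a claim, or reads a document. Everything the later rungs add has to be built around it.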
What an agent actually is
At the top of the ladder is something that looks superficially similar but behaves in a different category.
An agent is given a goal, not just a question. It plans. It calls tools to do real work. It looks at what came back. If the result is wrong, it tries something else. It remembers what happened yesterday, last week, last conversation. If the job is big, it delegates pieces to other agents with different skills. When it’s done — or genuinely stuck — it tells you so, with receipts.
That’s the ceiling. The gap from “chat” to this is not a single feature. It is not “GPT but better.” It is not a prompt.
It is a stack of seven distinct capabilities, each of which has to be engineered on top of the raw model. Skip any rung and the tower above it falls over.
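The core of that behavior (act, look at the result, try something else, report back with receipts) can be sketched in a few lines. This is a toy under stated assumptions: `run_agent`, `scripted_decide`, and `flaky_search` are hypothetical names invented for this example, and `scripted_decide` stands in for the decision a real model would make at each step.

```python
from typing import Callable, Dict, List, Tuple

Tool = Callable[[str], str]
# A policy looks at the goal and the observations so far, then picks
# (action, argument). In a real system this decision comes from the model.
Policy = Callable[[str, List[str]], Tuple[str, str]]


def run_agent(goal: str, decide: Policy, tools: Dict[str, Tool],
              max_steps: int = 5) -> Tuple[str, List[str]]:
    """Loop until the policy says "finish" or the step budget runs out."""
    observations: List[str] = []
    for _ in range(max_steps):
        action, argument = decide(goal, observations)
        if action == "finish":
            return argument, observations
        tool = tools.get(action)
        result = tool(argument) if tool else f"error: unknown tool {action!r}"
        # Each observation is a receipt: what was tried and what came back.
        observations.append(f"{action}({argument}) -> {result}")
    return "stuck: step budget exhausted", observations


def flaky_search(query: str) -> str:
    """Hypothetical tool that fails unless the query names a destination."""
    return "3 flights found" if "Lisbon" in query else "error: no destination"


def scripted_decide(goal: str, observations: List[str]) -> Tuple[str, str]:
    """Hypothetical stand-in for the model's choice of next step."""
    if not observations:
        return ("search", "flights")            # first attempt, too vague
    if observations[-1].endswith("error: no destination"):
        return ("search", "flights to Lisbon")  # react to the failure
    return ("finish", f"done: {observations[-1]}")


answer, receipts = run_agent(
    "Find flights to Lisbon", scripted_decide, {"search": flaky_search}
)
```

Here the first search fails, the second succeeds, and `receipts` holds both attempts. The step budget is what lets the loop say it is stuck instead of spinning forever.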
The seven rungs
Here is the map of the ladder. It is the map this entire series follows:
- LLM Core — The brain. Guesses the next word, one at a time. Nothing more.
- System Prompt — The instructions. Tells the model how to behave, what voice to use, what rules to follow.
- RAG (Retrieval) — The library. Lets the model open your documents and read them before answering.
- Tool Use — The hands. Lets the model press real buttons — send an email, run a search, update a file.
- Agent Loop — The thinking pattern. Try something, look at the result, try again — until the job is done.
- Memory — The journal. Remembers what happened yesterday, last week, last conversation.
- Multi-Agent & Platform — The team. Many workers with a manager, plus the scaffolding (evals, tracing, guardrails, cost control) that makes the whole thing safe to run.
Each rung adds exactly one thing the rung below cannot do. Each rung assumes the rungs below it are solid. You can, in principle, skip any rung above the first, and you can see the wreckage in products that did:
| Rung skipped | Symptom |
|---|---|
| System prompt | Inconsistent tone, surprising refusals, runs over its own guardrails |
| RAG | Confident-sounding answers that are wrong about your own data |
| Tool use | Beautiful summaries of work that never got done |
| Agent loop | One-shot attempts that give up after the first wrong step |
| Memory | A “personal assistant” that can’t remember your name |
| Platform scaffolding | A demo that breaks on day two and nobody can tell why |
Most products in the wild are somewhere between rungs 2 and 5. A handful of well-engineered systems reach 6 or 7. The distribution is not because the top is impossible — it’s because each rung is work.
Why “just a better model” doesn’t close the gap
A natural response to this ladder is: “But models keep getting better. Won’t the next generation just handle all of this natively?”
The short answer is no — and the reason is architectural, not about model capability.
A model, no matter how strong, is a pure function: tokens in, tokens out. It has no hands, no persistent memory, no ability to start a loop or call a tool on its own. Those are properties of the harness — the system wrapped around the model. You can swap in a stronger model and get smarter decisions inside each turn, but you don’t get tool use, persistent memory, or a planning loop for free. Someone has to build them, and those systems look roughly the same regardless of which model sits in the middle.
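A small sketch shows the split. Everything here is hypothetical and invented for illustration: `small_model` and `large_model` are toy stand-ins for two different LLMs, and `Harness` is a deliberately tiny wrapper. What matters is that memory belongs to the harness, so it survives a model swap unchanged.

```python
from typing import Callable, List

# Assumption for this sketch: the model is a pure function, text in, text out.
Model = Callable[[str], str]


def small_model(prompt: str) -> str:
    """Hypothetical weaker model: reports how many lines it was shown."""
    return f"small model saw {len(prompt.splitlines())} lines"


def large_model(prompt: str) -> str:
    """Hypothetical stronger model with the same tokens-in, tokens-out shape."""
    return f"large model saw {len(prompt.splitlines())} lines"


class Harness:
    """Wraps any model. Memory is owned here, not inside the model."""

    def __init__(self, model: Model, memory: List[str]) -> None:
        self.model = model
        self.memory = memory  # outlives any single call, and any single model

    def run(self, request: str) -> str:
        # The harness decides what the model gets to see on each call.
        prompt = "\n".join(self.memory + [f"User: {request}"])
        reply = self.model(prompt)
        self.memory.append(f"User: {request}")
        self.memory.append(f"Assistant: {reply}")
        return reply


memory: List[str] = []
first_reply = Harness(small_model, memory).run("hi")
# Swap the model; the harness and its memory stay exactly the same.
second_reply = Harness(large_model, memory).run("again")
```

The second call sees three lines (the earlier exchange plus the new request) even though a different model answers it. The remembering was never the model's job.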
The counterintuitive consequence: the quality of an AI product tends to depend more on the harness than on the model. A mediocre model with a thoughtful harness will often beat a state-of-the-art model in a naive chat loop on tasks that require acting, remembering, or checking results. This is one reason coding assistants, research assistants, and customer-support bots can differ so dramatically in quality even when they’re all running on the same underlying API.
The rungs are where the real engineering lives.
How to read this series
The rest of this series is seven posts, each roughly twenty minutes to read. Each post takes one rung and explains it from first principles, with plain-English analogies and hand-drawn diagrams designed so a non-technical reader can follow along. Engineers will find enough depth to build; non-engineers will find enough clarity to evaluate.
Here’s the running order:
- Post 2 — The Chat Baseline. What every AI system starts with: a stateless function that turns tokens into tokens. Understanding this substrate is the precondition for everything above it.
- Post 3 — System Prompts & Personas. The cheapest, most underused control surface. A well-written system prompt still out-performs most of the fancy tricks people reach for first.
- Post 4 — RAG (Library Lookup). How to hand the model the right page, right before it answers — without retraining.
- Post 5 — Tool Use. The hinge rung. The moment a chatbot stops explaining and starts doing.
- Post 6 — The Agent Loop. One tool call is an API. A loop of tool calls with reasoning in between is an agent.
- Post 7 — Memory. Context windows forget. Production agents don’t. The difference is a layered architecture of working, session, and long-term memory.
- Post 8 — Multi-Agent Systems & Production Platforms. When one agent becomes a team, and what scaffolding (evals, tracing, guardrails, cost control) turns that team into a platform you can actually ship.
You don’t have to read them in order, but the dependencies go upward — post 6 will make more sense if you’ve read post 5, and post 8 assumes you’ve met the rungs below.
A rough map of where products live
Before we start climbing, one calibration. When someone tells you a product is “an AI agent,” a useful first question is: which rung?
- Rungs 2–3 — ChatGPT’s free tier, most “AI-powered” search boxes, simple chatbot widgets. Useful, but chatty.
- Rungs 4–5 — Coding assistants like Claude Code or Cursor, AI search with citations, most copilots. Genuinely doing work, one turn at a time.
- Rungs 6–7 — Autonomous research assistants, long-running coding agents, multi-agent support systems. Still young, still brittle at the edges, but the direction of travel.
None of these rungs are optional for the top of the stack. None are sufficient on their own. They compose.
Where we go next
In the next post, we start at the floor: the LLM itself. Not the model that you’ve heard about — the actual small, stateless, somewhat strange function that every product in this series is built on top of. Understanding exactly what it is and isn’t will make every rung above it click.
If you find this useful, the rest of the series is already being published — post 2 drops alongside this one. Read it next: The Chat Baseline: What You’re Starting With.