We’ve climbed six rungs together. At the bottom, a stateless function that predicts the next word. At the top, the rung we just left: a single agent with tools, a reasoning loop, and persistent memory. That agent can do a remarkable amount. It is, for most products, already enough.
This final rung is where the ladder tips over from “a working agent” into “a production platform.” Two things happen at this level. First, some problems are genuinely too big for one agent — they want a team. Second, no matter how many agents you have, running them in front of real users requires scaffolding that the rungs below mostly glossed over: evaluations, tracing, guardrails, and cost control.
Both are this rung. Both are this post.
When one agent stops being enough
Before reaching for multi-agent, ask an uncomfortable question: does your problem actually have sub-problems?
The default answer is no. Most “we need multiple agents” impulses turn out, on inspection, to be “we need one good agent with the right tools.” A single agent with a clear role, a thoughtful system prompt, and a well-chosen toolkit will outperform a three-agent team on most tasks, while being faster, cheaper, and an order of magnitude easier to debug.
The problems that do benefit from multiple agents tend to share three properties:
1. The work has distinct phases that need distinct skills. A research agent reads widely; a coding agent writes precisely; a writing agent produces polished prose. Mixing all three capabilities into one agent often produces a jack-of-all-trades that is OK at each but excellent at none.
2. The phases benefit from different contexts. A researcher wants to pull in many sources; a coder wants a tight, code-focused context; a writer wants to see the final goal and the research summary but not the noise. Separate agents keep separate contexts.
3. The team structure is stable. If the division of labor would change every task, you don’t want a fixed team; you want one flexible agent. Multi-agent works best when the roles are recurring — the same specialist patterns showing up across many user goals.
If any of those don’t hold, stay with one agent. You’ll ship faster and sleep better.
The simplest useful multi-agent pattern
When multi-agent does make sense, the pattern that covers 80% of real needs is supervisor + specialists, and it is small enough to sketch in one paragraph.
A supervisor agent receives the user’s goal. Its job is not to do the work — it is to plan the work, delegate it to the right specialist, and combine the specialists’ outputs into a coherent answer. The supervisor has tools that look like other agents: ask_researcher(question), ask_coder(task), ask_writer(draft).
Each specialist agent has its own system prompt, its own set of tools, and a tightly scoped job. The researcher has web search and document retrieval. The coder has a code execution tool and a file editor. The writer has no tools at all: it just takes the notes it is given and produces prose.
The supervisor orchestrates. The specialists execute. The user never talks to the specialists directly.
USER GOAL
    │
    ▼
SUPERVISOR ──┬──► RESEARCHER
    │        ├──► CODER
    │        └──► WRITER
    ▼
FINAL ANSWER

This is not the only pattern (there are peer-to-peer, swarm, and hierarchical variants), but supervisor + specialists is the one to start with. It’s the cleanest to reason about, the easiest to debug, and it composes: a specialist can, itself, be a small supervisor team.
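To make the shape concrete, here is a minimal sketch of supervisor + specialists in Python. Everything in it is an assumption for illustration: the specialists are stub functions standing in for real agents, and the supervisor's plan is hard-coded where a real one would ask a model to produce it.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# A specialist is modeled as a function from a task string to a result string.
# In a real system each would wrap its own model call, system prompt, and tools.
Specialist = Callable[[str], str]


def researcher(question: str) -> str:
    return f"[research notes on: {question}]"


def coder(task: str) -> str:
    return f"[code for: {task}]"


def writer(draft: str) -> str:
    return f"[polished prose from: {draft}]"


@dataclass
class Supervisor:
    specialists: Dict[str, Specialist]
    log: List[Tuple[str, str]] = field(default_factory=list)

    def plan(self, goal: str) -> List[Tuple[str, str]]:
        # Hypothetical fixed plan; a real supervisor would generate this.
        return [
            ("researcher", f"background for: {goal}"),
            ("coder", f"prototype for: {goal}"),
        ]

    def delegate(self, name: str, task: str) -> str:
        # Every delegation is logged so the run can be inspected afterwards.
        self.log.append((name, task))
        return self.specialists[name](task)

    def run(self, goal: str) -> str:
        outputs = [self.delegate(name, task) for name, task in self.plan(goal)]
        # Combine the specialists' outputs by handing them to the writer.
        return self.delegate("writer", " | ".join(outputs))


supervisor = Supervisor(
    {"researcher": researcher, "coder": coder, "writer": writer}
)
print(supervisor.run("compare two sorting algorithms"))
```

Note that the user only ever calls `supervisor.run`; the specialists are reachable only through `delegate`, which is also the natural place to hang logging and budgets.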
The new failure modes
Multi-agent introduces a category of failure that single-agent products don’t have: communication failures between agents.
- The supervisor asks the researcher a vague question and gets a vague answer.
- The researcher returns information the coder misinterprets.
- The writer is given a draft with unresolved disagreements between sources and produces confidently wrong prose.
- Two specialists, not knowing about each other, both try to solve the same sub-problem and waste effort.
These failures are hard to catch because each agent, viewed individually, is behaving reasonably. The bug is in the protocol between them.
Practical mitigations:
- Structured handoffs. Define the schema of what each specialist accepts and returns. Treat agent-to-agent calls the way you’d treat API contracts.
- A shared scratchpad. A small, visible document (often just a JSON object) that all agents can read and append to, carrying the current state of the task across specialists.
- The supervisor as arbiter. When specialists disagree, route the decision back to the supervisor rather than letting one specialist override another.
These are old lessons from distributed systems, showing up again in new clothes.
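A structured handoff and a shared scratchpad can both be sketched in a few lines of Python. The schema fields, the `Scratchpad` class, and the "too vague" rule below are all hypothetical, chosen only to show the idea of treating an agent-to-agent call like an API contract.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass(frozen=True)
class ResearchRequest:
    question: str
    max_sources: int = 3


@dataclass(frozen=True)
class ResearchResult:
    question: str
    findings: List[str]
    open_disagreements: List[str]  # surfaced explicitly, not hidden in prose


@dataclass
class Scratchpad:
    """Append-only shared state that every agent can read."""
    entries: List[Dict[str, Any]] = field(default_factory=list)

    def append(self, author: str, kind: str, payload: Dict[str, Any]) -> None:
        self.entries.append({"author": author, "kind": kind, "payload": payload})

    def by_kind(self, kind: str) -> List[Dict[str, Any]]:
        return [e for e in self.entries if e["kind"] == kind]


def validate_request(req: ResearchRequest) -> None:
    # Illustrative contract check: reject handoffs too vague to act on.
    if len(req.question.split()) < 3:
        raise ValueError("question too vague to delegate")
    if req.max_sources < 1:
        raise ValueError("max_sources must be at least 1")


def run_researcher(req: ResearchRequest, pad: Scratchpad) -> ResearchResult:
    validate_request(req)
    # Stub result; a real researcher would call its model and tools here.
    result = ResearchResult(
        question=req.question,
        findings=["placeholder finding"],
        open_disagreements=[],
    )
    pad.append("researcher", "research_result", asdict(result))
    return result


pad = Scratchpad()
run_researcher(ResearchRequest("how does quicksort choose a pivot"), pad)
print(pad.by_kind("research_result"))
```

Making `open_disagreements` a required field is one way to stop the writer from receiving unresolved conflicts silently: the supervisor can inspect it and arbitrate before the draft is written.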
The production scaffolding — four pieces
Whether you have one agent or ten, running them in front of real users demands four pieces of infrastructure that none of the previous rungs made explicit. None of this is glamorous. All of it is the difference between a demo and a product.
1. Evaluations (evals)
An eval is a reproducible test that grades your agent’s output against a known good answer or a specified rubric. You run it whenever you change something — a prompt, a model, a tool, a piece of retrieval — to see whether the change made things better or worse. Without evals, you are flying blind. With them, you have version control for behavior.
A good eval suite covers:
- Golden-path tasks the agent should definitely get right.
- Edge cases — weird inputs, long inputs, empty inputs, inputs with trick phrasing.
- Known failure modes — bugs you’ve fixed, to make sure they stay fixed.
- Regression tasks — random samples of real user traffic, replayed.
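Evals can be graded by humans, by programmatic checks, or by another model acting as a judge. Programmatic checks are the cheapest and most repeatable, but they only cover what can be asserted in code. Model-as-judge is faster than human grading but less reliable; human grading is the other way around. A healthy suite mixes all three.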
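As a rough illustration, here is a tiny eval harness using programmatic checks only. The `toy_agent` and the three cases are made up for the example; a real suite would call your actual agent and add human or model-judged rubrics alongside these checks.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalCase:
    name: str
    prompt: str
    check: Callable[[str], bool]  # programmatic grader for the output


def run_evals(agent: Callable[[str], str], cases: List[EvalCase]) -> Dict[str, bool]:
    results: Dict[str, bool] = {}
    for case in cases:
        try:
            results[case.name] = bool(case.check(agent(case.prompt)))
        except Exception:
            # A crash counts as a failed case, not a failed eval run.
            results[case.name] = False
    return results


def pass_rate(results: Dict[str, bool]) -> float:
    return sum(results.values()) / len(results) if results else 0.0


def toy_agent(prompt: str) -> str:
    # Stand-in for a real agent call.
    if not prompt.strip():
        return "Please provide a question."
    if prompt.strip() == "What is 2 + 2?":
        return "4"
    return "I don't know."


cases = [
    EvalCase("golden_addition", "What is 2 + 2?", lambda out: out.strip() == "4"),
    EvalCase("edge_empty_input", "", lambda out: "provide" in out.lower()),
    EvalCase("edge_unknown", "What is the capital of Atlantis?",
             lambda out: "don't know" in out),
]

results = run_evals(toy_agent, cases)
print(results, pass_rate(results))
```

Run the same `cases` before and after every prompt, model, or tool change and compare the pass rates; that comparison is the "version control for behavior."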
2. Tracing
A trace is the full, ordered record of one agent run: every model call, every tool invocation, every retrieved passage, every intermediate output, with timestamps and costs. Think of it as a stack trace for AI — what happened, in what order, with what inputs and outputs.
When something goes wrong in production (and it will), the trace is how you figure out where. Without it, you have a user screenshot of a bad answer and no way to reproduce the failure. With it, you can replay the exact sequence, identify the tool call that returned garbage, and fix the thing that caused it.
Every serious production AI platform ships with a trace viewer. If you’re building one and don’t have traces, stop and add them before you add anything else.
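If you have no tracing at all yet, a minimal in-process version is not much code. The sketch below is an assumption about what such a recorder could look like, not any particular platform's API: `Trace`, `Span`, and the field names are invented here, and a real system would also persist the spans somewhere.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator, List


@dataclass
class Span:
    kind: str                 # e.g. "model", "tool", "retrieval"
    name: str
    inputs: Dict[str, Any]
    output: Any = None
    error: str = ""
    started_at: float = 0.0   # wall-clock timestamp
    duration_s: float = 0.0
    cost_usd: float = 0.0


@dataclass
class Trace:
    run_id: str
    spans: List[Span] = field(default_factory=list)

    @contextmanager
    def span(self, kind: str, name: str, **inputs: Any) -> Iterator[Span]:
        s = Span(kind=kind, name=name, inputs=inputs, started_at=time.time())
        self.spans.append(s)  # appended at start, so order is start order
        start = time.perf_counter()
        try:
            yield s
        except Exception as exc:
            s.error = repr(exc)  # record the failure, then let it propagate
            raise
        finally:
            s.duration_s = time.perf_counter() - start

    def total_cost(self) -> float:
        return sum(s.cost_usd for s in self.spans)


trace = Trace(run_id="run-001")
with trace.span("model", "plan", prompt="find sources") as s:
    s.output = "call web_search"
    s.cost_usd = 0.002
with trace.span("tool", "web_search", query="agent tracing") as s:
    s.output = ["result A", "result B"]

for s in trace.spans:
    print(s.kind, s.name, s.inputs, s.output, f"{s.duration_s:.6f}s")
```

Because each span keeps its inputs and output, the tool call that "returned garbage" is visible directly in the record rather than inferred from the final answer.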
3. Guardrails
A guardrail is a check that runs around the agent, not inside it, and can block unsafe inputs or outputs. Examples:
- Input filters. Reject prompts that ask the agent to reveal its system prompt, or that contain known jailbreak patterns.
- Output filters. Check the agent’s reply before it reaches the user — for unsafe content, for PII leakage, for claims outside the agent’s domain.
- Tool-use filters. Require explicit user confirmation before any tool call that’s destructive (sending an email, running a shell command, moving money).
The key property of a guardrail is that it lives outside the model’s prompt. It runs in code, deterministically, and cannot be argued with. Rules you want the model to follow go in the system prompt. Rules you need to enforce go here.
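Here is a small sketch of all three filter types as plain Python wrapped around an agent call. The regex patterns, the tool names in `DESTRUCTIVE_TOOLS`, and the email-only PII check are deliberately simplistic assumptions; production guardrails need far broader coverage, but the structure (code outside the prompt, deterministic, raising on violation) is the point.

```python
import re
from typing import Callable

# Illustrative patterns only; real jailbreak detection needs much more.
JAILBREAK_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.I),
    re.compile(r"(reveal|show|print).{0,40}system prompt", re.I),
]
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+")
DESTRUCTIVE_TOOLS = {"send_email", "run_shell", "transfer_money"}  # hypothetical names


class GuardrailViolation(Exception):
    pass


def check_input(prompt: str) -> str:
    for pattern in JAILBREAK_PATTERNS:
        if pattern.search(prompt):
            raise GuardrailViolation("input blocked by pattern filter")
    return prompt


def check_output(reply: str) -> str:
    # Redact email addresses as a stand-in for a fuller PII filter.
    return EMAIL_PATTERN.sub("[redacted email]", reply)


def check_tool_call(tool_name: str, confirmed_by_user: bool) -> None:
    if tool_name in DESTRUCTIVE_TOOLS and not confirmed_by_user:
        raise GuardrailViolation(f"{tool_name} requires user confirmation")


def guarded_run(agent: Callable[[str], str], prompt: str) -> str:
    # The agent never sees a blocked prompt; the user never sees raw output.
    return check_output(agent(check_input(prompt)))


def echo_agent(prompt: str) -> str:
    return "Contact alice@example.com for details."


print(guarded_run(echo_agent, "Who should I contact?"))
```

None of these checks can be talked out of their decision by a clever prompt, which is exactly the property the system prompt cannot offer.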
4. Cost and latency control
Agents are expensive and slow by default. Every iteration of the loop costs money and time. A production platform controls both:
- Per-run budgets. Hard caps on tokens, tool calls, and wall-clock time per user request. If the budget is hit, the agent surfaces partial progress and exits.
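- Model routing. Cheap models for routine steps, expensive models only when the work needs them. A good router can save as much as 70% of spend with little or no quality loss, though that is a claim to confirm against your own evals rather than assume.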
- Caching. Tool calls with identical inputs return the cached result. Embedding the same document twice is a waste. Caching at multiple layers compounds.
- Concurrency limits. Limits on how many agent runs can be in flight at once, so a traffic spike doesn’t bankrupt you or rate-limit your upstream providers.
If the previous rungs were about making the agent smarter, this rung is about making it affordable.
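Of the four controls, the per-run budget is the easiest to sketch. The version below assumes each step comes with an estimated token and tool-call cost that is reserved before the step runs; the `RunBudget` class, its limits, and the step tuples are all hypothetical.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Iterable, List, Tuple


class BudgetExceeded(Exception):
    pass


@dataclass
class RunBudget:
    max_tokens: int
    max_tool_calls: int
    max_seconds: float
    tokens_used: int = 0
    tool_calls_used: int = 0
    started: float = field(default_factory=time.monotonic)

    def charge(self, tokens: int = 0, tool_calls: int = 0) -> None:
        # Check every cap first; only commit the charge if all of them pass.
        if time.monotonic() - self.started > self.max_seconds:
            raise BudgetExceeded("time budget exhausted")
        if self.tokens_used + tokens > self.max_tokens:
            raise BudgetExceeded("token budget exhausted")
        if self.tool_calls_used + tool_calls > self.max_tool_calls:
            raise BudgetExceeded("tool-call budget exhausted")
        self.tokens_used += tokens
        self.tool_calls_used += tool_calls


Step = Tuple[int, int, Callable[[], Any]]  # (est. tokens, tool calls, work)


def run_with_budget(steps: Iterable[Step], budget: RunBudget) -> Dict[str, Any]:
    progress: List[Any] = []
    try:
        for est_tokens, tool_calls, work in steps:
            budget.charge(est_tokens, tool_calls)  # reserve before running
            progress.append(work())
    except BudgetExceeded as exc:
        # Surface partial progress instead of failing silently.
        return {"status": "partial", "reason": str(exc), "progress": progress}
    return {"status": "complete", "reason": "", "progress": progress}


steps = [
    (400, 1, lambda: "outline"),
    (500, 1, lambda: "draft"),
    (300, 0, lambda: "polish"),
]
budget = RunBudget(max_tokens=1000, max_tool_calls=5, max_seconds=30.0)
print(run_with_budget(steps, budget))
```

With these numbers the third step would push the run past 1,000 tokens, so it is never started and the caller gets the outline and draft back with a reason attached.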
A last honest take on multi-agent
Two years into the multi-agent boom, the returns have been mixed. Impressive demos, fewer landed products. Part of the reason is that the interesting problems that genuinely split across specialists are rarer than the hype suggests. Part of it is that the four pieces of scaffolding above are harder to get right when the agents are talking to each other.
The products that have succeeded with multi-agent — in research, in coding, in customer support — share a common recipe:
- Start with one agent. Instrument it. Evaluate it. Find where it consistently struggles.
- Split only when the bottleneck is the agent’s generality. If one prompt and one toolkit can’t cover the work, split — but along the lines where the work actually breaks, not where the org chart says it should.
- Keep the team small. Two or three specialists are enough for almost every case. Five is probably already too many.
- Invest heavily in the four pieces of scaffolding. Multi-agent without evals, traces, guardrails, and cost control is a ticking time bomb.
The failures you avoid by following that recipe are worth more than the clever architectures you build instead.
What you can do with the full ladder
If you’ve read this far, you’ve walked the full map:
- An LLM is a stateless function that predicts the next word.
- A system prompt shapes how it answers, at almost zero cost.
- RAG hands it the right page before it answers.
- Tools let it press real buttons.
- The agent loop strings those actions into real work.
- Memory gives it continuity across sessions.
- A team, wrapped in production scaffolding, scales the whole thing.
A product that reaches rung 7 in the right way — not by piling on features but by building each rung solidly before climbing — is, genuinely, a different kind of software. It has an author; it has judgment; it operates on behalf of a user. It’s not magic. It’s engineering. But the engineering rewards the craft.
We are very early in this. The rungs will change. New ones will be added on top (continuous learning, multi-modal reasoning, real physical embodiment) and some of the current ones will merge as models get smarter. But the shape of the ladder — prediction → instruction → knowledge → action → loop → memory → team — is likely to remain recognizable for years.
If this series helped you see where a product you use actually lives on the ladder — or where the one you’re building needs to go next — it did its job.
Where to go from here
If you want to keep climbing:
- Go back and build rung 4 (tool use) if you haven’t. It’s the single highest-leverage thing you can do to a chat baseline. The rest of the ladder opens up after it.
- Read the companion series on this blog: Anatomy of an AI Harness — it takes this same material and dissects a real production system (Claude Code) to show how each rung is implemented in practice. Where this series is the map, that one is the field guide.
- Start small. Build a one-rung-at-a-time product and instrument it properly. An agent you can debug always beats an agent you can’t.
- One pattern worth learning once you’ve climbed the ladder: skills — reusable, lazy-loaded bundles of instructions + tools that let an agent grow a playbook without bloating its system prompt. Covered in the bonus appendix to this series: Skills — Giving an Agent a Playbook.
Thank you for reading.