Ai Ml

Steer Before You Sprint: Maximizing Opus in a Large Codebase

In a large codebase the bottleneck isn't how smart Opus is — it's how precisely you point it and how tightly you verify. Six portable moves to steer a fast agent so its speed lands on the right target instead of becoming faster waste.

Tin Dang avatar
Tin Dang
Warm-paper editorial title card reading 'Steer before you sprint' — 'Steer' in teal, 'sprint' in amber — over the line 'Maximizing Opus in a large codebase: direction beats raw speed'

You point Opus at a task in a 600,000-line repo. Thirty seconds later there’s a diff — clean, well-named, plausible. It compiles. The tests it wrote are green. You skim it, it reads right, you ship it. Three weeks later a payment reconciles wrong, and the cause is a single assumption the model made about a function it never actually read.

Nothing in that story was a model failure. Opus did exactly what a fast, capable agent does: it moved confidently in the direction you pointed it. The problem is that you didn’t point it anywhere precise — and in a large codebase, “anywhere precise” is the whole game.

Most people, asked how to get more out of Opus on a big codebase, reach for the wrong levers: a bigger context window, a cleverer system prompt, a longer wishlist in the task description. Those help at the margins. But they treat the model’s intelligence as the constraint, and in a large codebase the model’s intelligence is rarely the constraint. Opus is already fast enough and smart enough to write the change. What it lacks is direction — a correct picture of the code it’s touching — and what you lack is verification capacity — a cheap way to know the change is right before it reaches production.

There’s a line from the AI-Driven Development method that names the trap exactly:

Speed in the wrong direction is not progress; it is faster waste.

That is the entire thesis of this post. Maximizing Opus is not about squeezing more tokens of output per prompt. It’s about maximizing direction × verification so that the model’s speed lands on the right target. A fast agent pointed wrong just produces wrong answers faster — and a big codebase is a machine for pointing agents wrong.

What follows is six concrete moves you make from the operator’s seat. They’re portable: they work the same in Claude Code, Cursor, Codex, or any agent you drive through a terminal or an editor. Each move leads with the principle, names the tool second, and gives a fallback for when you don’t have that tool — so the idea lands whether or not your harness has subagents or a fancy index. None of them is “use a smarter model.” The model is the constant. You are the variable.

Why a big codebase punishes a fast agent

Before the moves, it’s worth being precise about why scale changes the problem. If you understand the three structural reasons, the moves stop looking like rituals and start looking like the obvious countermeasures they are.

1. The window can’t hold the repo. Even a million-token context window cannot hold a multi-million-line codebase. The model always reasons over a slice of the code — whatever made it into context this turn. In a tiny project the slice is the whole thing, so the model effectively sees everything. In a large one, the slice is a tiny fraction, and which fraction it sees is a choice — yours or the harness’s. Miss the one file that holds the real invariant, and the model will reason flawlessly from an incomplete picture and hand you a confident, wrong edit. The intelligence was fine. The inputs were partial.

2. Blast radius. A large codebase is a dense web of invariants, callers, and implicit contracts. A change that is locally, obviously correct can break something three modules away that nothing in the prompt ever mentioned. The bigger the system, the more places a “looks right here” edit quietly violates a “must hold there” rule. Small repos forgive sloppy direction because there’s nowhere for a mistake to hide. Big repos have endless hiding places.

3. Plausibility is the default failure mode. A language model is, at its core, a generator of plausible continuations. Most of the time plausible and correct coincide, which is why this works at all. But in a large codebase, plausible-but-wrong is the expected output, not the exception — there are simply too many real constraints for a plausible guess to satisfy them all. And plausible-but-wrong is the most expensive kind of wrong: it compiles, it reads well, it survives a skim, and it passes review by anyone who is also just reading for plausibility. It fails later, in production, far from the diff.

Now layer speed on top. A fast agent generates wrong-direction work faster than you can review it. Without a structural way to point the speed and catch the drift, raw velocity becomes a liability — you’re not shipping features faster, you’re accumulating plausible mistakes faster. The six moves are how you convert that velocity back into progress.

Move 1 — Aim at reality before you prompt

Principle: retrieve, don’t dump. Ground the model in the real code it will touch before it designs against it — the actual files, symbols, and signatures, pulled from the codebase, not reconstructed from the model’s memory of “codebases like this.”

Do this. Before you ask for a change, make the agent produce a small grounding map: the exact files and functions the task reads or modifies, their real signatures, and the one or two invariants they must preserve. Fetch those with symbol-aware retrieval — a language-server lookup, a code index, an MCP retrieval tool, “find definition / find references” — so you pull the precise processPayment out of the five that exist, not a paste of the whole 4,000-line module it lives in.

There’s a test for whether the grounding is real: if you (or the agent) cannot name the files and symbols the task touches, you do not yet understand the work. Producing that map is the first chunk of the task, not a warm-up for it.

Why it works in a big repo. The model’s core failure here is a wrong mental picture. Grounding replaces assumption with the real slice — so the design is built against what exists instead of what’s likely. And because retrieval pulls symbols instead of files, you spend your context budget on signal (the three definitions that matter) instead of noise (the 3,900 lines around them).

The anti-pattern it fixes. Prompting from a vague English description and letting the model guess which function you meant, which table is the source of truth, which of two auth paths is live. Or the opposite extreme: pasting an entire file “so it has context,” and burning half the window on code the task never touches.

No-fancy-tools fallback. You don’t need a semantic index. Open the two files yourself, copy the three exact function signatures and the one invariant comment, and hand the agent those instead of the whole module. The discipline is “give it the real anchors,” not “own tool X.”

Move 2 — Budget the context window like the scarce resource it is

Principle: load the working state; reference the rest. The context window is a budget, not a bucket. Everything you put in it competes for the model’s attention with everything else — so curate it on purpose. (The AI-Driven Development method names these two layers directly: the State you load into the window, and the Story — the longer audit trail — you only point at and reference.)

Do this — four habits that pay off immediately:

  • Keep a standing brief the agent reads every session. A short project-memory file — AGENTS.md, an agent-loaded README, a .cursorrules, a CLAUDE.md, whatever your tool reads on startup — that holds the architecture in one screen, the conventions, the load-bearing invariants, and the explicit “don’t do this here” list. This is the context you never want to rediscover; pin it once so every session starts already oriented.
  • Point at paths, don’t paste. Tell the agent where things live and let it open files on demand. Front-loading ten files “to be safe” just dilutes attention before the work starts.
  • Fresh session per task. A window that has accumulated three unrelated tasks is polluted. The model starts contradicting decisions it made an hour ago because they’ve scrolled past the range where it actually attends. A clean window holding the right five files beats a bloated window holding fifty.
  • Compact deliberately, at boundaries. Summarize and reset between tasks, not in the middle of a delicate change where the details you’re compacting away are the ones that matter.

Why it works. The model can only reason over what’s actually in front of it; junk crowds out signal, and “context rot” — stale, half-relevant history — quietly drags every subsequent answer toward mediocrity. It’s a common practitioner observation that a top model fed a bloated, polluted window can underperform a weaker model given a clean one — not a benchmark you should quote as fact, but a pattern worth taking seriously. Protecting the window is one of the highest-leverage things you do, and it costs nothing but discipline.

The anti-pattern it fixes. The mega-session: one endless conversation that spans a whole day and ten loosely related tasks, where by the afternoon the model is confidently undoing what it did in the morning. It feels efficient — “I don’t have to re-explain everything” — but you’re paying for that history with degraded reasoning on every turn.

No-fancy-tools fallback. Even in a plain chat UI, start a new thread per task and paste a ten-line brief at the top: stack, the relevant module, the one rule that must hold. You’ve reconstructed the standing brief by hand.

Move 3 — Fix direction before code: spec, frozen contract, red test

This is the heart of direction before speed, and the single move that most separates getting lucky from getting reliable.

Principle: decide what it must do and reject, freeze the external shape, and write a test that fails — all before a line of implementation exists. Spec and contract come before any code, every time.

Do this, in order:

  1. State the spec in plain language. What must the change do? What must it reject — with named error cases, not a vague “handle bad input”? What is true once it succeeds? The goal is to leave no ambiguity for the model to resolve by guessing. Every requirement you don’t state is a decision you’ve silently delegated to a plausible guess.
  2. Freeze the contract. Fix the external shape — the function signatures, the data structures, the names, the error cases — and treat it as frozen. This is the one decision that’s expensive to change after code exists, so make it consciously and once. In the method this step is called, plainly, the decision point of the whole method. Everything downstream flows from it.
  3. Write the test red first. Turn the spec into an automated check and confirm it fails before any code exists — and fails for the right reason (the feature is missing), not a typo in the test. “Red for the right reason.” A test that was never red proves nothing; it might be asserting something that was already true.

Only now do you let Opus build — and here you let it run at full speed, because its speed is finally aimed. The contract can’t move under it. The red test names the exact finish line. This is the step where the model needs the least supervision, precisely because the direction is fixed.

Why it works in a big repo. This is where the plausible-wrong diff dies. A frozen contract plus a red test converts the question “does this look right?” — which plausibility games effortlessly — into “is the bar green?” — which it cannot. Just as importantly, it relocates your judgment to the cheap moment (a few minutes deciding the contract) instead of the expensive one (squinting at a 600-line diff hunting for subtle drift after the fact). You front-load the one thing only a human can do well — deciding what right means — and let the machine do the rest.

The anti-pattern it fixes. Vibe-coding the whole feature: describe it, accept the diff, ship it. Everything nobody spelled out — what a threshold is, who’s authorized, what “reconcile” means at the edges — gets guessed by the model and discovered by your users.

No-fancy-tools fallback. No test runner handy? Write the acceptance check as a runnable script, or even a precise manual checklist the change must satisfy, and confirm it currently fails. The point isn’t the framework — it’s that a fixed, falsifiable target exists before the code, so “done” is a fact and not a feeling.

Trust through evidence, not inspection: the change is trusted because the test passes, not because the diff reads plausibly.

Move 4 — Fan out to subagents to keep the main context clean

Principle: spend a cheap context to protect the expensive one. Your main session — the one holding the contract, the plan, and the clean window for the actual decision — is your most valuable context. Don’t pollute it with grunt work.

Do this. When a step needs a broad, noisy sweep — “find every caller of this function across the repo,” “map how authentication flows end to end,” “list which modules still use the deprecated client” — delegate it to a subagent. The subagent burns its own window reading forty files and returns a compact, distilled map. Your main thread never sees the forty files; it sees the one-paragraph answer.

Why it works in a big repo. Broad search is exactly the work that pollutes a window: dozens of file reads, most of them irrelevant to the final decision, all of them sitting in your history degrading every later turn. Isolating that work means the orchestrator stays focused and clean. And because subagents run in parallel, you also cover far more ground per minute of wall-clock — three sweeps at once instead of three in a row.

A useful rule of thumb: reach for a subagent when the task is a broad fan-out search where you only need the conclusion, not the file dumps; when you have independent work that can run in parallel; or when answering would mean reading across many files and you’d rather keep that reading out of your main context. That same isolation is what makes the skeptic in the next move trustworthy.

The anti-pattern it fixes. Running the repo-wide grep in your main session and letting four thousand lines of search output marinate in the window for the rest of the task — silently taxing every subsequent step.

No-subagents fallback. No agent-spawning in your harness? Do the sweep in a separate, throwaway session, then paste only the distilled answer back into your working session. Same principle: quarantine the noise, keep the conclusion.

Move 5 — Trust evidence, not inspection

Principle: a change is trusted because its tests pass and the real risks were checked — not because the diff reads plausibly. “Not by inspection” does not mean “don’t read the code.” It means reading is not enough, because plausibility is precisely the trap, and a big codebase is full of changes that read perfectly and are wrong.

Do this — three checks the green must survive before you believe it:

  1. Trace the wiring. Is the new code actually called? In a large repo it’s easy to produce a correct, well-tested function that nothing invokes — a green test sitting on a dead path. Follow the call from a real entry point and confirm the new symbol is genuinely reached.
  2. Refute the green adversarially. Ask not “does this look right?” but “how is this wrong?” Hunt the three classic cheats: overfit (the code special-cases the exact fixture), vacuous asserts (a test that passes without actually checking the behavior), and stubbed-away logic (the hard part mocked out so the test never exercises it). This is where a subagent earns its keep — an independent skeptic, fresh window, told to try to break the result, defaulting to “refuted” when unsure.
  3. Check the non-functional residue — concurrency, security, architecture. None of these show up in a passing unit test. Race conditions, an injection opening, a layering violation that mortgages next quarter — these are caught by review, not by green. And security is always a hard stop: a security finding is never quietly accepted to keep a build moving.

Why it works in a big repo. The expensive bugs are the ones that survive a skim. Evidence you watched go from red to green, plus a wiring trace, plus a skeptic actively trying to refute it, catches what plausibility hides. You’re not trusting the diff; you’re trusting the proof.

You cannot move faster than you can verify.

That line is the real ceiling. Verification capacity — not model speed — is what limits how fast you can safely ship. Which is the deeper reason every other move exists: they all serve to make the right thing cheap to verify.

The anti-pattern it fixes. Accepting the first green and the plausible diff. The model announces “done, all tests pass,” you nod at a diff that reads fine, and you ship a vacuous test guarding code that was never wired in.

No-fancy-tools fallback. Even working solo, run the test without the new code and watch it fail, then with it and watch it pass — so you know the test has teeth. Then read the diff like a critic, not a fan: your job in that read is to find the flaw, not to confirm the vibe.

Move 6 — Keep the lessons, throw away the code

Principle: the artifacts survive; the code is disposable. The spec, the contract, the tests, and the lessons learned are the durable assets. The implementation is regenerable — and treating it as precious is a mistake.

Do this. Whenever a task teaches you something the model didn’t know — a convention it kept violating, an invariant it broke, a non-obvious gotcha in this subsystem — fold that lesson back into the standing brief from Move 2. Next session, the model starts already knowing it. Do this consistently and the brief becomes a dense, hard-won map of your codebase’s actual rules, and the model’s first draft creeps steadily closer to correct.

Why it works in a big repo. The single most expensive thing the model does, over and over, is rediscover context — relearning the same conventions, re-deriving the same architecture, re-stepping on the same landmine. Every lesson you persist is a rediscovery you never pay for again. And because you’re capturing the decisions (the spec and contract) rather than just the code, you can regenerate, refactor, or rewrite the implementation freely without losing the thinking behind it. That’s what “throw away the code” actually buys you: the freedom to redo the how because the what is safely written down.

This is also the move that reframes the whole question. “Maximizing Opus” is not a per-prompt optimization — it’s a compounding one. A team that folds its lessons gets a sharper agent every week. A team that lets each session’s hard-won knowledge evaporate at merge is still on day one, every day.

The anti-pattern it fixes. The lesson that dies at merge: you correct the model five times in one session, ship, and next week it makes the identical mistake — because nothing was written down, so nothing was learned.

No-fancy-tools fallback. A plain NOTES.md the agent reads at the start of each session. It doesn’t have to be elegant. It has to be loaded.

The anti-patterns: what quietly wastes Opus

Put the six moves down and look at their shadows — the everyday habits that bleed Opus’s value in a large codebase. Each line is a waste on the left and its fix on the right:

What quietly wastes OpusThe move that fixes it
Vibe-prompting the whole feature and accepting the diffFreeze a contract and write a red test first (Move 3)
Pasting whole files to “give it context”Retrieve the exact symbols (Move 1)
One mega-session for everythingOne clean session per task (Move 2)
Accepting the first greenTrace the wiring, refute the green (Move 5)
Trusting a diff because it reads plausiblyTrust evidence you watched turn green (Move 5)
Letting it touch code before a target existsSpec and red test before build (Move 3)
Re-explaining your conventions every sessionA standing brief that compounds (Move 6)
Running the noisy repo-wide sweep inlineFan it out to a subagent (Move 4)

Notice the through-line: not one of these fixes is “use a better model.” Every single one is an operator move. The model in the wasteful column and the model in the fixed column are the same model. The only thing that changed is how it was driven — which is the whole point.

Conclusion: steer before you sprint

You don’t need a smarter model for your large codebase. You need a tighter loop around the fast one you already have. The capability is there; the wins are in pointing it and in catching its drift — direction and verification, the two things a big codebase quietly strips away and the two things only you can put back.

The method these moves add up to compresses to a mantra worth keeping on a sticky note:

Steer before you accelerate. Trust evidence, not impressions. Keep the decisions, throw away the code.

If you do one thing this week, do this: take a single real change you’d normally vibe-prompt, and instead — (1) make the agent produce a grounding map first, (2) freeze a one-paragraph contract, (3) write the check that is red before the code. Ship it that way once, and feel the difference between fast and aimed.

And when you look at any change in a large codebase, ask it three questions:

  1. Where is its frozen contract? Did you decide what it must do, or did the model guess?
  2. Where is the test that was red before the code? Or did you trust a green that was never earned?
  3. What did shipping it teach the brief? Or did the lesson evaporate at merge?

If you can’t answer all three, you’re sprinting without steering — moving fast in a direction nobody chose. The discipline that answers all three has a name — it’s one expression of AI-Driven Development — but you don’t have to adopt a framework to start. Pick one move. Run it on one task. Today.

Rent the model. Own the loop.

0