Abstract. Two AI coding workflows — GSD (a planning-first, fan-out method) and ADD (a single-task, contract-first method) — were each given one identical specification and asked to build the same forty-line utility function. Both produced correct, fully-tested code. GSD recorded the work in five files totalling 807 lines, of which 644 were research, plan, and review prose; ADD recorded it in three files totalling 186 lines, with the entire specification in a 22-line contract. Token usage, summed from each agent’s own session log, was 3,008 output tokens for ADD versus 15,594 for GSD — a 5.2× difference — with no measured quality cost on this task. This paper lays out the method, the recovered artifacts, the per-agent numbers, and the threats to validity, then asks the only question that matters: why the cheaper method was not the less careful one.
A note on trust. Every figure here is measured unless it carries a ~, which marks an estimate, and each section names its source. Line and file counts are exact wc -l over artifacts recovered to tmp/experiment/recovered/; token counts are exact, summed from the agents’ usage logs.
1. The problem: re-teaching is the real cost
The expensive part of agentic software work is not the model call. It is re-teaching the model the project every time work resumes — what is already settled, which files matter, what “done” is allowed to mean. Every workflow pays that tax; they differ only in how much each has to reload to pay it.
That reload cost is invisible on a per-call invoice and decisive on a per-feature one. A method that records a feature across a dozen documents must, on resume, load those documents back into a fresh context. A method that records it in one tracked file loads one file. This paper measures that difference on a single controlled task and traces where it comes from.
The short version. GSD pays the reload cost once per sub-agent, and each sub-agent leaves a document the next one must read back in. ADD keeps the project’s state in files on disk and walks one task through a single gate, so it reloads almost nothing. Fewer tokens, same safety: ~5× fewer output tokens here, for an equivalent, fully tested result.
2. The two systems compared
Both methods pursue the same two safeties: not building the wrong thing, and not shipping a broken thing. They buy that safety differently.
GSD (Get-Shit-Done) buys it with breadth. A phase fans out specialist sub-agents — a researcher, a planner, a reviewer, a verifier — each a fresh context that loads the project, does one job, and writes a document the next agent consumes. Independent perspectives catch what a single pass misses.
ADD (AI-Driven Development, the subject of this series) buys it with structure. One task walks a fixed sequence — specify → freeze a contract → write failing tests → build → verify — recorded as sections of a single file the tool tracks by a phase marker. Direction is fixed before the build; the proof is a passing test suite, not a second read of the diff.
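The fixed sequence described above can be sketched as a small state machine. This is an illustration only, not ADD's actual implementation: the phase names mirror the steps listed in the text, while the `phase:` marker syntax and the names `readPhaseMarker` and `nextPhase` are hypothetical.

```ts
// Hypothetical sketch: phase names follow the sequence in the text;
// the marker format and function names are assumptions, not ADD's real API.
const PHASES = ["specify", "contract", "tests", "build", "verify"] as const;
type Phase = (typeof PHASES)[number];

// Read a marker line such as "phase: tests" from the text of a task file.
function readPhaseMarker(taskFile: string): Phase | null {
  const match = /^phase:\s*(\w+)\s*$/m.exec(taskFile);
  if (!match) return null;
  for (const p of PHASES) {
    if (p === match[1]) return p;
  }
  return null; // marker present but not a known phase
}

// Advance one step; the final phase has no successor.
function nextPhase(current: Phase): Phase | null {
  return PHASES[PHASES.indexOf(current) + 1] ?? null;
}
```

Under these assumptions a resume only needs the task file: read the marker, load that phase's section, and continue from there.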
The comparison below puts those two bets on the same function and meters both.
3. The system under test
The artifacts are not sandbox toys — they are the build records of InkPaper, the blog you are reading now. It was specified and shipped as a run of milestones under a GSD-style workflow, then handed to ADD for ongoing work, so its repository carries both systems’ real files. The milestone target, verbatim from the roadmap:
A beautifully typeset, performant reading experience where rich interactive content enhances — never gates — the writing.
That target shipped as nineteen features across the v1.0 → v2.0 milestones, every one marked [x] done:
- Reading core — Foundation & Design System · Content Spine & Static MDX · Routing, SEO & Discovery · Interactive MDX Tier · Keystatic CMS · Edge AI, /write & CI/CD
- Series & engagement — Blog Series · Reading List · Email Notifications · Backend Clean-Code Series · Tech-Debt Cleanup · Series Banners
- Social platform — Editorial Homepage · Supabase Auth · Cloud Bookmarks · Post Reactions · Threaded Comments · Reader Profiles · Social Follows & Personalized Feed
Each feature left a GSD planning folder behind. The experiment isolates one helper of the kind the Blog Series feature is built from — getSeriesNeighbors, which finds a post’s previous and next part — and rebuilds that single unit under each method, so the same work can be metered both ways.
Measured — the feature list is the - [x] lines of .planning/milestones/v2.0-ROADMAP.md (plus phases 10–12 from the v1.1 roadmap); the target is its Core Value: line, verbatim. The theme groupings are descriptive, not exact milestone boundaries.
4. Method
Task. Implement getSeriesNeighbors(posts, currentSlug) — a pure function returning the previous and next post within a series. About forty lines; six required test cases covering ordering, the two series ends, no-series, unknown slug, and a single-post series.
Controls. Both agents received the same written specification, the same definition of done (all required tests green), and the same isolated git worktree. Neither could see the other’s work. The ADD agent ran the lean loop solo. The GSD agent ran the full fan-out, spawning its own researcher, planner, and reviewer sub-agents.
Measurement. Token usage was read from each agent’s session transcript — every assistant turn carries a usage record (input, output, cache-creation, cache-read). Output tokens are reported as the cleanest “work generated” signal; totals include context re-feed and cache reads. GSD’s three nested sub-agents are counted in its totals. Document footprint is wc -l / wc -c over the files each agent wrote, recovered from the transcripts to tmp/experiment/recovered/. Both implementations were re-run from the recovered files to confirm the tests pass — so the pass/fail column is observed here, not self-reported by the agents.
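The summation described above can be sketched as follows. The record shape is an assumption: the text names four usage fields (input, output, cache-creation, cache-read), but the real transcript field names may differ, and treating "total" as the sum of all four is also an assumption.

```ts
// Assumed shape: one usage record per assistant turn, with the four
// fields the text names. Real transcript field names may differ.
interface Usage {
  input: number;
  output: number;
  cacheCreation: number;
  cacheRead: number;
}

interface Totals {
  output: number;
  total: number;
  turns: number;
}

// Sum output tokens and all-in tokens across every turn of one agent.
function sumUsage(turns: Usage[]): Totals {
  let output = 0;
  let total = 0;
  for (const u of turns) {
    output += u.output;
    total += u.input + u.output + u.cacheCreation + u.cacheRead;
  }
  return { output, total, turns: turns.length };
}

// A fan-out run is the sum over the orchestrator and every nested sub-agent.
function sumAgents(agents: Usage[][]): Totals {
  const acc: Totals = { output: 0, total: 0, turns: 0 };
  for (const turns of agents) {
    const t = sumUsage(turns);
    acc.output += t.output;
    acc.total += t.total;
    acc.turns += t.turns;
  }
  return acc;
}
```

Counting nested sub-agents through `sumAgents` is what keeps the comparison fair: a fan-out run is charged for every context it spawns, not only the orchestrator.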
5. Results
5.1 What each method wrote
The spec was identical. Here is everything each method produced for it.
GSD — three prose documents, three sub-agents. A researcher first studied the repo’s conventions:
## 1. Utils Directory Conventions

Patterns observed:
- Header comment: each file begins with `// src/utils/<name>.ts` + a one-line purpose
- Named exports only: no default exports
- TypeScript interfaces exported at the top

A planner turned that into a 371-line implementation plan:

## 1. Files to Create

| File | Purpose |
|------|---------|
| `src/utils/series-neighbors.ts` | The utility (named exports, pure function) |
| `tmp/experiment/gsd/series-neighbors.test.mjs` | Test script (node:assert, zero deps) |

And after the build, an independent reviewer signed it off in a third file:

# Code Review: getSeriesNeighbors

## VERDICT: PASS

| Requirement | Status | Notes |
|---|---|---|
| Function signature matches spec exactly | PASS | named `PostLike` alias — semantically equivalent |

RESEARCH.md (173 lines) + PLAN.md (371) + REVIEW.md (100) = 644 lines of prose about a forty-line function, each written by a fresh context and read by the next.
ADD — one contract, then a test. ADD wrote no research document and no plan document. It froze a single contract, wrote a failing test, then the code. The contract in full — its complete specification of the same function:
# Contract: getSeriesNeighbors

## Function signature (frozen)
```ts
export interface SeriesNeighbor { slug: string; title: string; order: number }
export function getSeriesNeighbors(
  posts: Array<{ slug: string; data: { title: string; series?: { slug: string; order: number } } }>,
  currentSlug: string
): { prev: SeriesNeighbor | null; next: SeriesNeighbor | null }
```

## Invariants
1. Input posts may be in any order.
2. Only posts whose `data.series.slug` matches the current post's are considered.
3. Within a series, sort by `series.order` ascending; ties broken by `slug` ascending.
4. `prev` = neighbour immediately before current (null if current is first).
5. `next` = neighbour immediately after current (null if current is last).

## Reject / edge cases (always return `{ prev: null, next: null }`)
- `currentSlug` not found in `posts`
- Found post has no `data.series`
- Single-post series (no neighbours exist)

Signature, invariants, edge cases: 22 lines. No separate plan for an executor to reload, no separate review, because the executor is the same chain that wrote the contract and the proof is the red-then-green suite.
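To make the contract concrete, here is one function that would satisfy it. This is a sketch written from the contract alone, not the recovered ADD implementation; the `Post` alias and the `demoPosts` data are assumptions introduced for illustration.

```ts
export interface SeriesNeighbor { slug: string; title: string; order: number }

// Hypothetical alias for the contract's inline post type.
type Post = { slug: string; data: { title: string; series?: { slug: string; order: number } } };

export function getSeriesNeighbors(
  posts: Array<Post>,
  currentSlug: string
): { prev: SeriesNeighbor | null; next: SeriesNeighbor | null } {
  // Edge cases: unknown slug, or the found post has no series.
  const current = posts.filter((p) => p.slug === currentSlug)[0];
  const series = current?.data.series;
  if (!series) return { prev: null, next: null };

  // Invariant 2: only posts in the current post's series are considered.
  const ordered: SeriesNeighbor[] = [];
  for (const p of posts) {
    const s = p.data.series;
    if (s && s.slug === series.slug) {
      ordered.push({ slug: p.slug, title: p.data.title, order: s.order });
    }
  }

  // Invariant 3: order ascending, ties broken by slug ascending.
  ordered.sort((a, b) => a.order - b.order || (a.slug < b.slug ? -1 : a.slug > b.slug ? 1 : 0));

  // Invariants 4 and 5: out-of-range neighbours fall back to null,
  // which also covers the single-post series.
  const i = ordered.map((n) => n.slug).indexOf(currentSlug);
  return { prev: ordered[i - 1] ?? null, next: ordered[i + 1] ?? null };
}

// Illustrative data only: input order is deliberately shuffled.
const demoPosts: Post[] = [
  { slug: "part-3", data: { title: "Part 3", series: { slug: "intro", order: 3 } } },
  { slug: "part-1", data: { title: "Part 1", series: { slug: "intro", order: 1 } } },
  { slug: "standalone", data: { title: "Standalone" } },
  { slug: "part-2", data: { title: "Part 2", series: { slug: "intro", order: 2 } } },
];
const middle = getSeriesNeighbors(demoPosts, "part-2"); // prev is part-1, next is part-3
```

Every branch maps back to a numbered invariant or a listed edge case, which is what lets a short contract stand in for a separate plan document.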
5.2 Document footprint
| For getSeriesNeighbors | GSD | ADD |
|---|---|---|
| Files written | 5 | 3 |
| Total lines | 807 | 186 |
| Spec / plan / review prose | 644 (RESEARCH + PLAN + REVIEW) | 22 (one CONTRACT) |
| Code | 65 | 42 |
| Tests | 98 (7 cases) | 122 (6 cases) |
| Result | PASS | PASS |
Measured — both file sets recovered from the experiment’s session transcripts to tmp/experiment/recovered/{gsd,add}/; counts are wc -l there. The contract and the three GSD excerpts above are verbatim.
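The footprint totals can be cross-checked from the table's own rows. The sketch below uses only the line counts reported above; the `Footprint` shape and function names are assumptions, and `wcLines` mirrors the fact that `wc -l` counts newline characters.

```ts
// Line counts copied from the footprint table (wc -l over the recovered files).
interface Footprint {
  prose: number;
  code: number;
  tests: number;
}

const gsdFootprint: Footprint = { prose: 173 + 371 + 100, code: 65, tests: 98 };
const addFootprint: Footprint = { prose: 22, code: 42, tests: 122 };

function totalLines(f: Footprint): number {
  return f.prose + f.code + f.tests;
}

// Fraction of a method's written lines that is prose rather than code or tests.
function proseShare(f: Footprint): number {
  return f.prose / totalLines(f);
}

// wc -l counts newline characters, so an unterminated final line is not counted.
function wcLines(text: string): number {
  return text.split("\n").length - 1;
}

console.log(totalLines(gsdFootprint), totalLines(addFootprint)); // 807 186
```

On these numbers prose is about 80% of GSD's written lines and about 12% of ADD's.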
5.3 Token cost, per agent
The documents predict the tokens, and the meter agrees. Summed from each agent’s session log:
| Agent | Role | Output tokens | Total tokens | Turns |
|---|---|---|---|---|
| ADD | solo loop | 3,008 | 938,161 | 28 |
| GSD orchestrator | drives the fan-out | 4,482 | 1,645,388 | 37 |
| GSD researcher | sub-agent → RESEARCH.md | 3,333 | 645,804 | 17 |
| GSD planner | sub-agent → PLAN.md | 5,209 | 259,135 | 8 |
| GSD reviewer | sub-agent → REVIEW.md | 2,570 | 192,719 | 7 |
| GSD total | 4 agents | 15,594 | 2,743,046 | 69 |

| Headline | ADD | GSD | Ratio |
|---|---|---|---|
| Output tokens | 3,008 | 15,594 | 5.2× |
| Total tokens | 938K | 2.74M | 2.9× |
| Agents | 1 | 4 | — |
| Wall-clock | ~110s | ~410s | 3.7× |
| Tests | 6 / 6 green | 7 / 7 green | — |
Roughly five times the output tokens and nearly four times the wall-clock, for the same working function. Note the per-agent rows: no single GSD agent is wasteful; the cost is structural, the sum of four contexts each paying to load and emit. Most of GSD’s surplus maps directly to the 644 lines of research, plan, and review from §5.1: the three sub-agents that wrote them emitted 11,112 output tokens, about 88% of the 12,586-token gap. The documents are the token bill.
Measured — token counts summed per agent from session-transcript usage records, nested sub-agents included. Both builds were re-run from the recovered files; the 6/6 and 7/7 green results are observed, not self-reported.
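The totals and ratios in this section follow arithmetically from the per-agent rows. The sketch below reproduces them from the table's numbers; the `AgentRow` shape and variable names are assumptions made for illustration.

```ts
interface AgentRow {
  agent: string;
  output: number;
  total: number;
  turns: number;
}

// Rows copied from the per-agent table.
const addRun: AgentRow = { agent: "ADD", output: 3008, total: 938161, turns: 28 };
const orchestrator: AgentRow = { agent: "GSD orchestrator", output: 4482, total: 1645388, turns: 37 };
const subAgents: AgentRow[] = [
  { agent: "GSD researcher", output: 3333, total: 645804, turns: 17 },
  { agent: "GSD planner", output: 5209, total: 259135, turns: 8 },
  { agent: "GSD reviewer", output: 2570, total: 192719, turns: 7 },
];

function sumRows(label: string, rows: AgentRow[]): AgentRow {
  const acc: AgentRow = { agent: label, output: 0, total: 0, turns: 0 };
  for (const r of rows) {
    acc.output += r.output;
    acc.total += r.total;
    acc.turns += r.turns;
  }
  return acc;
}

const gsdTotal = sumRows("GSD total", [orchestrator].concat(subAgents));
const outputRatio = gsdTotal.output / addRun.output;
const totalRatio = gsdTotal.total / addRun.total;

// Share of GSD's extra output tokens emitted by the document-writing sub-agents.
const surplus = gsdTotal.output - addRun.output;
const subAgentOutput = sumRows("sub-agents", subAgents).output;
const subAgentShare = subAgentOutput / surplus;

console.log(outputRatio.toFixed(1), totalRatio.toFixed(1)); // 5.2 2.9
```

The last three constants quantify the structural claim: the three sub-agents account for 11,112 of the 12,586 extra output tokens.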
6. Analysis: why the cheaper method stays safe
The instinct is that fewer tokens must mean less care. The result inverts it, and the inversion is the point: ADD moves the rigor rather than removing it. Four mechanisms account for the gap, each removing a whole category of spend rather than trimming within it:
- State lives in a file. A resume reloads one task plus a pointer, never the repository or a folder of phase documents.
- Progressive disclosure. Only the current phase’s guide is resident, not the whole method.
- One bundle, one approval. Spec, scenarios, contract, and tests freeze together — collapsing the human round-trips that each re-summarize state back into context.
- Evidence over inspection. The gate trusts a red-then-green suite over re-reading the diff — cheaper to produce and a stronger signal than an agent talking itself into believing the code looks plausible.
None of that is a smaller model or a skipped step. The careful work simply lands where it is cheapest — on a 22-line frozen contract instead of 644 lines of prose. The decisive evidence is GSD’s own output: its REVIEW.md returned VERDICT: PASS with no blocking findings on the same function ADD also got right. On this task, the extra 644 lines of documents found nothing a frozen contract and a red suite had missed — which is exactly what “same safety, fewer tokens” has to look like to be true.
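The "evidence over inspection" mechanism can be pictured as a gate over two suite runs, one before the build and one after. This is a hypothetical sketch of the idea, not ADD's actual gate; `SuiteRun`, `GateResult`, and `redThenGreenGate` are names invented here.

```ts
// Hypothetical gate: names and result shape are assumptions for illustration.
interface SuiteRun {
  passed: number;
  failed: number;
}

type GateResult = { ok: true } | { ok: false; reason: string };

function redThenGreenGate(beforeBuild: SuiteRun, afterBuild: SuiteRun): GateResult {
  // Red first: a suite that never failed proves nothing about the new code.
  if (beforeBuild.failed === 0) {
    return { ok: false, reason: "suite was never red before the build" };
  }
  // Green after: every test must pass once the code exists.
  if (afterBuild.failed > 0) {
    return { ok: false, reason: `${afterBuild.failed} test(s) still failing` };
  }
  // Guard against reaching green by deleting tests.
  const before = beforeBuild.passed + beforeBuild.failed;
  const after = afterBuild.passed + afterBuild.failed;
  if (after < before) {
    return { ok: false, reason: "tests were removed between runs" };
  }
  return { ok: true };
}
```

A gate of this shape never reads the diff. It accepts or rejects on observed test outcomes alone, which is the sense in which the proof is the suite rather than a second inspection.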
7. Threats to validity
A single controlled task is evidence, not a law. The honest limits:
- Construct — overhead, not ceiling. This measures GSD’s overhead on a small, well-understood function, where its breadth had nothing to catch. On a large or unfamiliar feature, that same research-and-review breadth is the thing that earns its tokens back. The result does not generalize to all tasks; it generalizes to small, well-scoped ones.
- Configuration — full vs lean GSD. I ran GSD’s full fan-out. Its “quick” path skips the optional sub-agents and would close much of the gap — converging toward what ADD does by default. The 5.2× is the upper end of the difference, not the only operating point.
- External validity — n = 1, one domain. One function, one repository, one model session. The mechanism (§6) is general; the multiplier is specific. A different task or codebase would move the number.
- Measurement — tokens vs cost. Output tokens are the cleanest signal but not the whole bill; totals are dominated by cache reads, which are cheaper than fresh input. The 5.2× output ratio is the conservative, most defensible figure; the 2.9× total ratio mixes in differently-priced tokens.
- Quality — no defect injected. “Same safety” here means both implementations passed suites built from the same six required cases (GSD wrote seven cases, ADD six) and GSD’s own review found nothing extra. It does not prove ADD would catch a subtle defect that GSD’s redundancy would, only that, on this task, the redundancy had nothing to catch.
8. When GSD is the right instrument
The fan-out earns its tokens when its breadth has something to catch:
- Unfamiliar domains — parallel research front-loads understanding that ADD’s single-threaded ground-and-specify discovers more slowly.
- High blast radius — several independent reviewers catch failure modes one evidence gate can rationalize past; redundancy is cheap insurance against an expensive mistake.
- Many moving phases — coordinating dependencies and contributors across a roadmap is what GSD’s orchestration is built for.
ADD’s bet is that these are the minority of real tasks, not that they don’t exist. When one arrives, you opt back into rigor deliberately — lower the autonomy for a human gate, spawn an advisor sub-agent — rather than paying for fan-out on everything.
| If the work is… | Reach for | Because |
|---|---|---|
| A focused, well-bounded feature or fix | ADD | direction-before-speed kills rework; state-on-disk keeps context flat |
| Mapping an unfamiliar domain | GSD | parallel research breadth front-loads understanding |
| A high-blast-radius change | GSD, or ADD at lowered autonomy | independent assurance is cheap against the downside |
| Broad multi-phase governance | GSD | orchestration across phases is what it is built for |
Stated as judgment, not data — §8 is reasoning about trade-offs, not a number I metered. I mark it as such rather than dress it up as a finding.
9. Conclusion
Two records of one function. They encode the same care (the same signature, the same invariants, a fully green test suite) yet differ 5.2× in output tokens. GSD spread the work across five files and 807 lines a fresh agent must reload; ADD kept it to three files and 186, with the spec in a 22-line contract. The cost difference is not effort and not quality; it is reload. ADD spends fewer tokens by reloading and rebuilding less, moving cost from the expensive places (re-establishing context, re-inspecting diffs, recovering from wrong direction) to the cheap ones. On a small, well-scoped task that is the whole story: same safety, far fewer moving parts. On a large or unfamiliar one, GSD’s breadth is worth its bill. The skill is knowing which task is in front of you, and not paying for fan-out you do not need.