
Measuring ADD Across the SDLC: What ai-proxy Cost, by Phase and by Role

A data-driven companion to the ai-proxy field study. Mining the transcripts and the .add foundation to measure an AI-driven build across the full SDLC and every role, against modeled human baselines of 6 person-months to 32 person-years, with the caveats that keep the comparison honest.

Tin Dang · June 17, 2026 · 23 min read

Warm-paper editorial illustration of a balance scale: one small figure beside a thin bundle of papers on the left pan, weighed against a faded crowd and a tall stack of papers on the right, over a faint software-lifecycle timeline

Part eight of this series told you what building ai-proxy with ADD felt like. This part puts numbers on it — not just on the coding, but on the whole lifecycle: capturing requirements, designing, building, verifying, releasing to production, and operating. And it does the comparison the way the rest of the industry secretly avoids: role by role, against what a conventional team would have spent.

The temptation, once a thing is built, is to reach for a flattering multiple. “10× engineer.” “A month of work in a day.” Those claims are cheap because they are unfalsifiable: no one measured the counterfactual. What follows is the opposite exercise. Every figure on the AI side is counted, not estimated, from the project’s own exhaust — the Claude Code session transcripts and the .add/ foundation. Every figure on the human side is modeled, with assumptions stated so you can reject them. The honest finding lives in the gap between the two, and in the caveats that keep the gap from becoming a slogan.

The one number, stated conservatively: across the full lifecycle and all nine roles, ~7 person-days of human supervision — about 54 active hours, ~$12K of API cost — produced what the most human-favorable estimate prices at six person-months. That is a ~17× reduction in human engagement time, and every less generous model widens it. Everything below is either counted from the project’s own exhaust or explicitly labeled as a model.

What was measured, and from where

Three data sources, all already lying on disk:

The session transcripts — 286 MB across 205 JSONL files: four top-level operator↔agent conversations plus 201 subagent sidechains. Each assistant turn carries a usage record (tokens) and a model id; each tool call is a tool_use block; every entry is timestamped. This is the effort-and-cost ledger.
The .add/ foundation — the lifecycle ledger: 116 append-only Key Decisions in PROJECT.md, 23 milestone RETRO.md files with their gate records, a frozen-contract trail, an 88 KB PROJECT.md, a 60 KB CONVENTIONS.md, and a domain GLOSSARY.md.
The repository — the delivered artifact: production code, tests, migrations, commits.

Nothing here is self-reported. The numbers are recovered by aggregation, and the aggregation is reproducible — which matters, because a measurement you can’t re-run is just a nicer kind of anecdote.

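# Active operator time: merge every main-transcript timestamp, sort,
# and sum the gaps between consecutive events. Any gap over 5 minutes
# counts as "away" and adds nothing, so a session left open overnight
# doesn't bill as 14 hours of work. DIR is the directory holding the
# top-level transcripts.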
find "$DIR" -maxdepth 1 -name '*.jsonl' -exec cat {} + \
  | jq -rn 'def ep: sub("\\.[0-9]+Z$";"Z")|fromdateiso8601;
            inputs | (.timestamp // empty) | ep' \
  | sort -n \
  | awk 'NR==1{p=$1;next}{d=$1-p; if(d<=300)active+=d; p=$1}
         END{printf "active hours: %.1f\n", active/3600}'

That capping step is the whole reason the figure is defensible. The raw span between the first and last transcript event is 151 hours — but one session sat open across nearly the entire six days, so the span measures the calendar, not the labor. Gap-capped, the picture is very different.
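To see how sensitive the figure is to the idle threshold, the same gap-summing logic can be wrapped in a function that takes the cap as a parameter. This is a minimal sketch, not the script used for the post: `active_hours` and `sample` are hypothetical helper names, and the function assumes epoch-second timestamps arrive one per line on stdin (for example, the output of the jq stage above).

```shell
# active_hours CAP_SECONDS
# Reads epoch-second timestamps (one per line) on stdin, sorts them, and
# sums only the gaps that are <= CAP_SECONDS. Longer gaps count as "away".
# Hypothetical helper for illustration; not part of the original pipeline.
active_hours() {
  sort -n | awk -v cap="$1" '
    NR == 1 { prev = $1; next }
    { gap = $1 - prev; if (gap <= cap) active += gap; prev = $1 }
    END { printf "%.2f\n", active / 3600 }'
}

# Synthetic data: three events one minute apart, an 8-minute pause,
# then one more event a minute later.
sample() { printf '%s\n' 0 60 120 600 660; }

sample | active_hours 300   # 5-minute cap: the 480 s pause is excluded -> 0.05
sample | active_hours 600   # 10-minute cap: the 480 s pause is included -> 0.18
```

Running the real timestamps through two or three caps is a quick way to bracket the estimate, which is how a ≤5-minute and a ≤10-minute figure can sit side by side in one table.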

The build, counted

Dimension	Measure
Calendar span	6.3 days (2026-06-10 → 2026-06-16)
Active operator time	~54 hours (≤5-min idle cap; ~65 h at a ≤10-min cap)
→ as person-days	~7 person-days of human-in-the-loop time
Agent turns	35.1K (18.8K main agent + 16.3K subagent)
Subagents spawned	201, across 7 types (general, Explore, python, backend, frontend, test, security)
Tool calls	20,385
Output tokens	25.9M (3.5M from subagents)
Cache-read tokens	5.04B
Estimated API cost	~$12K
Production code	~49.2K LOC (Python 40.8K + frontend 8.4K)
Test code	~60.8K LOC (1.24× the production code)
Lifecycle records	116 decisions · 23 milestones · 118 tasks · 279 commits

The ~54 hours is supervision, not typing — the wall-clock window during which a human was in the loop reading a spec flag, approving a contract, resolving a HARD-STOP, watching a build go green. The genuine cognitive cost is lower; ~54 hours is the generous upper bound, which is why it’s the right number to compare against.

And the cost was made possible by caching, not frugality: the agents read 5.04 billion cached input tokens. Billed as fresh input those would run past $80K; at cache-read rates the whole project lands near $12K. The foundation that kills context rot — the PROJECT.md, the contracts, the conventions, read from cache thousands of times — is also what makes the bill survivable.
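The caching effect is easy to reproduce for your own bill. The sketch below prices the same token volume two ways. The post does not state its per-token rates, so both rates here are placeholders chosen only to show the mechanics, and `token_cost` is a hypothetical helper. The outputs will not match the post's totals, which also include output, fresh-input, and cache-write tokens.

```shell
# token_cost TOKENS USD_PER_MILLION -> dollars, two decimals.
# Hypothetical helper for illustration.
token_cost() {
  awk -v t="$1" -v r="$2" 'BEGIN { printf "%.2f\n", t / 1000000 * r }'
}

# Measured in the post: 5.04 billion cache-read tokens.
CACHE_READ_TOKENS=5040000000

# ASSUMED rates (USD per million tokens), for illustration only. Replace
# them with the current price sheet for the model you actually ran.
FRESH_INPUT_RATE=15.00
CACHE_READ_RATE=1.50

token_cost "$CACHE_READ_TOKENS" "$FRESH_INPUT_RATE"   # billed as fresh input
token_cost "$CACHE_READ_TOKENS" "$CACHE_READ_RATE"    # billed as cache reads
```

Whatever rates you plug in, the ratio between the two lines is the number to watch: it is the discount the stable foundation earns every time it is re-read.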

(Cost uses standard Opus list rates and is an estimate, not a billing export.)

The full SDLC, phase by phase

A conventional lifecycle has six stages, and ADD’s eight steps map onto them cleanly. The interesting column is the last one — what moved. For each phase, the ADD+AI figures are measured from ai-proxy; the human column is the typical shape of that phase, not a measured control.

SDLC stage	ADD step	Human-manual, typical	ADD + AI on ai-proxy (measured)	What moved
Requirements capture	Specify	BAs interview, write docs; ambiguity surfaces in QA or production	116 decisions captured append-only; lowest-confidence assumption flagged first; contradictions (argon2 vs SHA-256, a 429 status range, an unmeasurable rate aggregate) killed before any code existed	ambiguity surfaced at spec, not in prod
Design / architecture	Scenarios + Contract	design docs that drift from the code within weeks	frozen, versioned, checksummed contracts; a domain `GLOSSARY` (one name per concept); an 88 KB + 60 KB foundation kept current every session	design became a durable asset, not a stale doc
Implementation	Build	engineers type every line — historically the bottleneck	AI authored ~49.2K production LOC while the operator directed in ~54 hours	typing stopped being the bottleneck
Verification / QA	Tests + Verify	QA writes tests after the fact; review trusts a plausible diff	~60.8K test LOC, red-first (1.24:1); 23/23 milestones PASS, all exit criteria met; live verify caught defects the green suite waved through	trust by evidence replaced trust by inspection
Release	(gate)	release engineering, sign-off meetings, runbooks	graduated to production in 6 days; gates enforced mechanically, with no silent skips	release gated on evidence, not on a meeting
Operate / learn	Observe	incidents land in a backlog; lessons live in people’s heads	production signal → spec delta → next loop; lessons fold into the foundation so later milestones reuse patterns by name	the method improved itself across loops

The shape of that shift is the whole method in miniature: the human concentrated at the decision points, the agent doing the authoring in between.

Two phases deserve emphasis because they are where the human time actually went.

Requirements and design front-loaded the thinking. The classic SDLC effort distribution puts ~10% on requirements and ~15% on design — and treats them as overhead before the “real work” of coding. ADD inverts the weighting. The single human decision point is the frozen contract; the highest-leverage human act is reading the lowest-confidence assumption flag and confirming it in one sentence. On ai-proxy that habit killed at least three contract-level contradictions before a line of code existed, at zero rework cost. In the human-manual world those same contradictions surface in integration or production, where they are an order of magnitude more expensive to fix.

Verification stopped being a trust exercise. The 1.24:1 test-to-production ratio is not a number a tired team produces by exhortation; it is what the failing-tests-first gate looks like when an agent writes the red suite before the build. And the verify step proved its own necessity: green suites are necessary but never sufficient, which is exactly why the method makes live verification and an adversarial refute-read load-bearing on top of the suite.

The loop is fractal

It would be easy to read all of this as a claim about big projects — that ADD pays off once you have a six-day, twenty-three-milestone system to build. It isn’t. The eight-step loop is not the shape of the whole project; it is the shape of every unit of work, at every grain.

ai-proxy makes the recursion literal. The project ran one loop across 23 milestones; each milestone ran the same loop across its own tasks (118 in total); and a single task ran it around one frozen contract. Nothing special happens at the top. A five-minute change to one function follows the same path as the whole gateway did: name it, freeze the shape, make “done” checkable, build to green, verify the residue. The measured trail is 23 milestone loops nesting 118 task loops, each closing on its own recorded gate. That is why the method feels identical whether you are shipping a release or fixing a typo with teeth.

And the primitives aren’t even specific to code. Name the thing precisely · freeze what “done” means · check it by evidence, not by a plausible read — those are domain-general. They apply to a finance close, a contract review, an HR policy, a support macro: anywhere an agent will otherwise sprint confidently in whatever direction it was pointed. (This series’ companion, ADD Across the Org, walks those non-code domains in detail.)

The consequence for every number above: the compression is not a one-time, whole-project trick that amortizes a heavy setup. It is the same cheap loop running wherever a piece of work has a definable “done” — which is exactly why it scales down to a sub-task and out past software without changing shape.

Across every role

ADD’s own book ships a role × phase responsibility matrix. Here it is, lightly recast; the value is in seeing how ownership shifts when the AI does the authoring (R = leads/responsible, A = accountable, C = consulted, I = informed):

Role	Specify	Contract	Tests	Build	Verify	Operate
Product / Domain	R	I	I	I	I	R
Architect / Lead	C	R/A	C	A	A	C
Engineer (Senior)	I	R	R	R	R	C
QA / Test	C	C	R	C	C	C
Designer	R	C	I	I	I	I
DevOps / SRE	I	C	C	R	R	R
Security	C	C	C	R	R	C
EM / Delivery	C	C	C	C	C	C

What the matrix doesn’t show — and what ai-proxy makes concrete — is that the authoring work of every one of these roles collapsed into agent execution, while the judgment work of every role concentrated into one operator’s ~54 hours.

Here is the per-role view — how each seat actually uses ADD, and the value it bought on ai-proxy, drawn from the project’s own record:

Role	How they apply ADD	Value realized on ai-proxy (real)
Product / Domain	Leads Specify; reads the lowest-confidence flag first and confirms the load-bearing assumption before any build	The argon2-vs-SHA-256 conflict was settled in one sentence before code existed — one of 116 decisions kept on the record
Architect / Lead	Owns the contract freeze (a one-way door), `CONVENTIONS.md`, and the architecture residue check	A frozen contract caught a 429 status-range contradiction pre-build; the 60 KB conventions stayed current across all 23 milestones
Senior Engineer	Directs Build; refuses to weaken a test; checks the residue tests can’t — concurrency, architecture	No test was weakened to make a build pass across 118 tasks; correctness rested on the residue checks, not the green bar alone
Junior Engineer	Enters at the Build end against handed-down contracts; raises a flag when a spec is ambiguous	A safe on-ramp: turn red tests green without touching the contract, and grow toward specification
QA / Test	Leads Tests; co-authors scenarios; owns the red-first suite and the coverage line	Owns the gate that catches what a green suite can’t; 60.8K test LOC, red before the build
Designer	Leads the design slice; the agent prototypes, the person owns the experience and every screen state	The enterprise dashboard ran through the design slice up front, rather than being retrofitted after the build
DevOps / SRE	Wires gate outcomes into the pipeline; owns telemetry, rollback, and the cost budget	A `HARD-STOP` became automatic rather than a meeting; the system graduated to production in 6 days
Security	Owns the security thread; every finding is a `HARD-STOP`, never a waiver	One unverified JWT decode escalated into a 13-path secret-leak sweep — zero security waivers shipped
EM / Delivery	Sets the autonomy level to match review capacity; tracks the scarce metrics, not code volume	Zero waivers across 23 milestones — autonomy never outran what verification could sustain

Read this as nine jobs that did not disappear but changed verb — from author to director-and-verifier. The Product Owner stopped writing tickets nobody reads and started killing the single assumption most likely to be wrong. QA stopped chasing coverage after the fact and started owning the gate the agent’s green suite must pass through. Security stopped scanning diffs for plausibility and started owning a stop button that fires on its own. Each kept the part only a human can do — judgment — and shed the part the agent now does faster.

A conventional team that ships a production multi-tenant gateway carries most of these as distinct people — call it eight to ten across product, architecture, two to four engineers, QA, design, platform, and security, coordinated by a manager. On ai-proxy, one operator occupied all of those seats, with the agent and its 201 subagents doing the authoring underneath. That is the cross-role shape of the result: not “the AI replaced the engineer,” but “every role’s typing moved to the agent, and every role’s judgment stayed with the human — and there was far less of it to do, concentrated into one week.”

The human baseline — modeled

Here is the part that requires honesty rather than arithmetic. No human built this system, so there is no measured human cost to compare against. There are only models. I ran three, deliberately spanning from “most generous to the human team” to “textbook.”

Model	Estimated human effort	Basis
Feature / milestone	6–24 person-months	a production LiteLLM-class gateway is what 2–4 engineers ship to v1 in 3–6 months
LOC throughput	22–75 person-months	49.2K production LOC ÷ 30–100 net LOC per engineer-day
COCOMO (Basic)	143–384 person-months	49 KLOC through the organic→embedded coefficients

Ranked lowest-confidence first, in the ADD habit:

⚠ The LOC-throughput and COCOMO models are the least trustworthy, and they produce the largest numbers. Lines of code measure typing — precisely the activity that was never the bottleneck — and COCOMO was calibrated on waterfall-era, from-scratch projects that overstate modern framework-based work. They are here because they are the currency organizations still estimate in, not because they are accurate. Even discounted heavily, they don’t change the conclusion’s direction.

The feature/milestone model is the most grounded — it reasons from what the system is, not from how many characters it contains — and is also the most generous to the human side. So it is the one to anchor on.
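The two formula-driven rows can be re-derived in a few lines. This is a sketch under stated assumptions: the 143–384 range matches the Basic COCOMO effort equation, effort = a × KLOC^b, with the organic (a = 2.4, b = 1.05) and embedded (a = 3.6, b = 1.20) coefficients. The 22 working days per month is my assumption, since the post does not state the figure it used, and the function names are hypothetical.

```shell
# loc_throughput_pm LOC LOC_PER_DAY DAYS_PER_MONTH -> person-months
loc_throughput_pm() {
  awk -v loc="$1" -v rate="$2" -v dpm="$3" \
    'BEGIN { printf "%.1f\n", loc / rate / dpm }'
}

# basic_cocomo_pm KLOC A B -> person-months, effort = A * KLOC^B
# (computed as A * exp(B * ln KLOC) to stay within portable awk)
basic_cocomo_pm() {
  awk -v kloc="$1" -v a="$2" -v b="$3" \
    'BEGIN { printf "%.0f\n", a * exp(b * log(kloc)) }'
}

# 22 working days per month is an ASSUMPTION, not a figure from the post.
loc_throughput_pm 49200 100 22   # fast end of 30-100 LOC/day -> 22.4
loc_throughput_pm 49200 30 22    # slow end                   -> 74.5
basic_cocomo_pm 49 2.4 1.05      # organic mode               -> 143
basic_cocomo_pm 49 3.6 1.20      # embedded mode              -> 384
```

Both models take lines of code as their only real input, which is why they inherit every weakness of LOC as a measure and why the feature model stays the anchor.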

The comparison

Converting each estimate to hours (~150 working hours per person-month) and setting it against the measured ~54 active operator-hours:

Human baseline	Human hours	Compression vs. ~54 h
Feature-based, low (6 PM)	~900	~17×
Feature-based, high (24 PM)	~3,600	~67×
LOC throughput, mid (~50 PM)	~7,500	~140×
COCOMO, embedded (384 PM)	~57,600	~1,000×

State it the way that survives scrutiny: take the estimate most favorable to the human team — six person-months — and the build still represents roughly a 17× reduction in human engagement time. Seven person-days of supervision produced, across the full lifecycle and every role, what conventional estimation prices at six person-months to thirty-two person-years. Every less-generous model only widens the gap. The conservative end is the headline precisely because a skeptic can’t easily knock it down.
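The table's ratios follow from two inputs, so they are easy to recompute with your own assumptions. A minimal sketch, where `compression` is a hypothetical helper and the 150 hours per person-month and 54 operator-hours come from the text above:

```shell
# compression PERSON_MONTHS HOURS_PER_MONTH OPERATOR_HOURS
# Prints the modeled human hours and their ratio to measured operator hours.
compression() {
  awk -v pm="$1" -v hpm="$2" -v op="$3" 'BEGIN {
    human = pm * hpm
    printf "%d human hours, %.1fx\n", human, human / op
  }'
}

compression 6 150 54     # feature model, low   -> 900 human hours, 16.7x
compression 24 150 54    # feature model, high  -> 3600 human hours, 66.7x
compression 50 150 54    # LOC throughput, mid  -> 7500 human hours, 138.9x
compression 384 150 54   # COCOMO, embedded     -> 57600 human hours, 1066.7x
```

The unrounded values show how much rounding sits in the table: 138.9 is reported as ~140× and 1066.7 as ~1,000×. Swapping 54 for the ~65 hours of the 10-minute idle cap drops the conservative figure to about 13.8×, still an order of magnitude.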

Speed of development

Effort and dollars are one face of the comparison; raw speed is the one people actually feel. But “speed” splits in two the moment an agent is involved: speed to something that looks finished, and speed to something you can trust. The three methods rank differently on each, and the method that wins the first loses the second.

Speed dimension	Human-manual	AI manual-prompting	ADD + AI
Calendar to production	3–6 months	days, to “looks done”	6.3 days, verified
Time to first working build	weeks	hours	~1 day
Time to trustworthy ship	months	unbounded — the rework tail	6.3 days
Authoring throughput	30–100 prod LOC / engineer-day	high, but unverified	~7,800 prod LOC / calendar day (≈910 / active operator-hour)
Delivery cadence	~1 feature / weeks	bursty, then stalls in rework	~3.7 milestones / day
What gates the speed	typing + coordination	nothing — until production	direction + verification

Manual prompting wins exactly one row — time to a first working build — and loses the only one that ships: time to a build you can deploy without fear. ADD is the fastest to trustworthy production by a wide margin: against a human team’s 3–6 months, six days is a ~15–30× speed-up to the same verified state — and unlike manual prompting, that speed is real, because every milestone closed on evidence rather than on a plausible diff. The agent’s raw authoring rate — roughly 910 production lines per active operator-hour — is not the interesting figure. The interesting one is that none of those lines had to be re-litigated in production. (The ADD column is measured; the other two are modeled, with the same honesty as the rest of this post.)

What it cost in dollars

Time is the honest comparison; money is the one that gets budgeted. Converting both sides to USD needs one modeled input — a fully-loaded engineer rate — which you should adjust to your own market. Everything else carries forward from the measured figures.

Input	Value	Basis
Fully-loaded engineer	$125/hr (~$20K/person-month)	mid-range; varies by region — state it, don’t hide it
Operator time (ADD)	54 hours	measured
API cost (ADD)	~$12K	token-derived
Human effort	6–24 person-months	feature model — the most generous to humans

The ADD + AI build, totaled:

Line item	Cost
API (Opus list rates applied to measured tokens)	~$12,000
Operator labor (54 h × $125)	~$6,750
Total project cost	~$19,000

The human-manual build, totaled (feature model, the generous end): 6 person-months × $20K ≈ $120,000; at the high end (24 PM) ≈ $480,000.

Side by side, on the estimate most favorable to the human team:

Metric	ADD + AI	Human-manual (low)	Ratio
Total project cost	~$19K	~$120K	~6.3× cheaper
Cost per production LOC	$0.39	$2.44	~6×
Cost per milestone (23)	~$826	~$5,200	~6×

Six times cheaper is the conservative figure — the human end is the most generous model, and the API portion already absorbs the ~$68K that prompt caching saved. On the LOC or COCOMO models the dollar gap widens exactly as the time gap did. The shape is the same in either currency: the expensive resource stopped being typing and became judgment, and there was far less of it to buy.
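Because every dollar figure hangs on the loaded hourly rate, it helps to keep the arithmetic in a form you can re-run with your own number. A minimal sketch using the inputs from the tables above; `project_cost` and `unit_cost` are hypothetical helpers, and the unit costs are computed from the rounded $19K total, as the post's table appears to do.

```shell
# project_cost API_USD OPERATOR_HOURS RATE_USD_PER_HOUR -> total dollars
project_cost() {
  awk -v api="$1" -v h="$2" -v r="$3" 'BEGIN { printf "%d\n", api + h * r }'
}

# unit_cost TOTAL_USD UNITS -> dollars per unit, two decimals
unit_cost() {
  awk -v total="$1" -v n="$2" 'BEGIN { printf "%.2f\n", total / n }'
}

add_total=$(project_cost 12000 54 125)   # 18750, rounded to ~$19K in the post
echo "ADD + AI total: \$$add_total"
echo "ADD per production LOC: \$$(unit_cost 19000 49200)"     # 0.39
echo "ADD per milestone: \$$(unit_cost 19000 23)"             # 826.09
echo "Human (low) per production LOC: \$$(unit_cost 120000 49200)"   # 2.44
```

Changing the third argument of `project_cost` moves only the labor line on the ADD side, while the human-manual total scales with the rate in full, so a higher loaded rate widens the ratio and a lower one narrows it.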

ADD vs. manual prompting

The fairer contest isn’t ADD against a human team — it’s ADD against the other way people use an agent: manual prompting. Open a chat, describe the feature, accept the plausible result, move on. It is faster than ADD at the start, and cheaper at the start. The question is what it costs by the end.

Manual prompting trusts the plausible diff and the green suite — the failure mode this series opened with. On ai-proxy, trusting the green suite would have shipped, at minimum, the defects the gates actually caught: a PII marker silently dropped (invisible to 326 tests), two production-dead code paths (invisible to 399 tests), a hidden coverage regression, a fail-open identity bypass, and 13 secret-bearing error paths. Two of those are security-grade in a multi-tenant billing gateway.

Pricing the three modes at the same $125/hr — the manual-prompting column is modeled (no one vibe-coded this system to measure it, but the defects it would have shipped are documented, not invented):

Dimension	Human-manual	AI manual-prompting	ADD + AI
Trust basis	inspection	the plausible diff	evidence
Upfront human time	months	hours	~54 hours
Upfront cost	~$120K–480K	~$12K	~$19K
Defect-escape rate	moderate	high	low (caught pre-ship)
Rework + incident cost	moderate	~$15K+, security tail uncapped	~$0 (caught)
Total expected cost	high	~$27K+ and unbounded	~$19K, trustworthy

Read the manual-prompting column carefully, because it is the seductive one. It wins on upfront cost, about $12K against ADD’s $19K, and that ~$7K gap is the entire price of the spec, the frozen contract, and the verify discipline. But it ships the fast waste. Model only the engineering rework for the escaped defects, roughly five classes at ~20 hours each, and that is ~$12.5K at $125/hr; add even modest incident handling and you reach the table’s ~$15K, pushing the total above ADD’s. That is before the two security escapes fire, and a fail-open auth bypass or a leaked API key in a multi-tenant gateway has no upper bound on cost.

So the economic claim is narrow and defensible: ADD’s extra ~$7K of upfront direction and verification is insurance bought below the expected loss it prevents. Manual prompting is cheaper exactly until the first escaped defect — and the defects are not hypothetical. They are the ones in the log.
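One way to test the insurance framing is to ask how much rework it takes for manual prompting to lose its upfront advantage. The sketch below computes that break-even from the figures above (~$19K vs. ~$12K upfront, $125/hr); the helper names are hypothetical, and the per-class rework hours are the post's modeled ~20, not a measurement.

```shell
# breakeven_hours ADD_UPFRONT MANUAL_UPFRONT RATE
# Rework hours at which manual prompting's total equals ADD's upfront cost.
breakeven_hours() {
  awk -v a="$1" -v m="$2" -v r="$3" 'BEGIN { printf "%.0f\n", (a - m) / r }'
}

# rework_cost DEFECT_CLASSES HOURS_EACH RATE -> dollars
rework_cost() {
  awk -v n="$1" -v h="$2" -v r="$3" 'BEGIN { printf "%d\n", n * h * r }'
}

breakeven_hours 19000 12000 125   # -> 56 hours of rework closes the gap
rework_cost 3 20 125              # -> 7500: three defect classes already exceed $7,000
```

At these inputs the ~$7K gap is worth 56 engineer-hours, so fewer than three escaped defect classes at ~20 hours each erase it, before any incident or security cost is counted.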

More builds that ran ADD

The multiplier above is one system measured one way, and the honest objection — it’s a single project — is the first caveat below. But ai-proxy is not the only build that ran the method end to end, and the gates caught the same class of problem each time.

What the gates actually caught on ai-proxy. The discipline is most legible in the specific things it stopped:

A contract contradiction killed before any code existed — the domain glossary asked for argon2 on all key material, but argon2 on a per-request API key is gratuitous. The spec flag surfaced it; the decision (SHA-256 for key secrets, argon2 for user passwords) was made in one sentence, at zero rework cost.
A HARD-STOP on an unverified session-JWT decode generalized into a project-wide sweep — 13 secret-bearing error paths hardened so a crash reporter couldn’t walk an exception chain back to an API key.
The adversarial refute-read caught a coverage regression a --no-coverage run had hidden, and a fail-open identity bypass where a followed redirect could chain to a trusted response — neither visible to the green suite.

ADD building ADD. The method’s own toolkit and book — shipped as pilotspace-add — were themselves built through ADD, which makes that repository a second real case with its own measured trail:

Dimension	Measure (from the book’s own `.add/`)
Milestones	36, each a full eight-step loop
Foundation versions	35 consolidation cycles
Test suite	1,158 tests, green
Waivers	none, across every milestone
Foundation compaction	`PROJECT.md` 399→215, `CONVENTIONS.md` 689→360 — 1088→575 lines (−47%), refute-verified against git with zero data loss

That last row is the self-improvement loop made measurable: the durable context was compressed by nearly half without losing a decision, and the change was trusted because an adversarial read checked it against git history — not because it looked right. Even the documentation gates were dogfooded; a recent milestone’s verify step caught a real scope-token bug and routed it back through tests→build before it could ship.

Neither case is a controlled trial. Both are audit trails specific enough — named defects, line counts, commit hashes — to be credible rather than aspirational. That is the bar the next section holds the headline number to as well.

The caveats that matter as much as the number

A multiple this large should make you suspicious, including when you are the one making it. Five things keep it honest.

It is not a controlled trial. One project, one operator, one system. There was no A/B against the same gateway built without ADD, and there couldn’t be — you can’t build the same novel system twice and call the second a control. This is a rich field study, not an experiment. The human numbers are models you are invited to reject.

LOC is a weak proxy, and the method says so. ADD’s governance chapter lists lines-of-code and reuse percentage among the anti-metrics — the cheap things that measure volume instead of value. I used LOC only because it is the language human estimation speaks, and I flagged it as the least trustworthy input.

Effort is not quality. This is the one that actually matters. The compression figure would be dishonest if the output were plausible-looking and wrong — the default failure mode of fast AI. The field study recorded the counter-evidence directly: a live run caught a defect 326 passing tests missed, and a later one caught two production-dead paths 399 passing tests waved through. Across all 23 milestones every gate ended in a recorded PASS (no waivers), and security findings always stopped for a human. Strip out that discipline and you don’t get a 17× engineer; you get 49K lines of confident, unreviewed risk.

The bottleneck moved; it didn’t vanish. The ~54 hours weren’t free time. They were spent on the two things the agent can’t do alone — setting direction and verifying results — across every role at once. That is the real shape of the win: the expensive human hours went to direction and verification instead of typing, and there were far fewer of them.

The dollar figures rest on one adjustable rate. Every cost here scales with the $125/hr loaded-engineer assumption; pick your own number and the ratios move, but not their direction. And the manual-prompting column is a model — its escaped defects are real and documented, its remediation cost is an estimate, and its security tail is deliberately left uncapped rather than guessed at.

How to measure this on your own builds

If you want to run this exercise rather than take this post’s word for it, measure the scarce things and ignore the cheap ones.

Measure:

Human-engagement-time compression — gap-capped active operator hours from your transcripts, against your team’s honest estimate for the same scope, per phase so you can see where the time actually went.
Defect-escape-caught — defects a green suite missed and the verify step caught. This is your evidence that the speed is trustworthy.
Cost per shipped feature — token-derived API cost ÷ features delivered; the cache-read share tells you how much the foundation is paying for itself.
Contract stability and autonomy ratio — how rarely frozen contracts churn, and the share of tasks that auto-gated on evidence versus those that needed a human.
Role coverage by one operator — how many traditional seats one director actually occupied, end to end.

Ignore: lines of code generated, reuse percentage, prompt counts, velocity measured in code volume.

The data is already on your disk. The transcripts hold the effort-and-cost ledger; the .add/ foundation holds the lifecycle and the evidence trail. The aggregation above is a dozen lines of jq and awk — which means the claim is reproducible, and a reproducible claim is the only kind worth making about a method whose entire premise is trust evidence, not inspection.

The code got cheap; six person-months of full-lifecycle work — requirements through production, across nine roles — arrived in seven person-days of supervision for around twelve thousand dollars. Direction and verification didn’t get cheap. They got concentrated into the hours that remained. That’s the trade ADD is actually offering. If you’ve run the same measurement on your own build and the numbers came out differently, that’s the more interesting post — bring the transcripts.

Next in this series

Why ADD Spends Fewer Tokens Than GSD — Without Getting Less Safe