
ADD in Production: The ai-proxy Field Study

A field study of ai-proxy, a multi-tenant AI gateway built end-to-end with ADD: 23 milestones, ~120 tasks, six days, zero waivers. What the method felt like in production and how the LLM behaved when scope was clamped.

Tin Dang

The previous seven parts of this series made an argument. This one is the receipt.

ai-proxy is a LiteLLM-class multi-tenant AI gateway: a FastAPI data plane behind an Envoy edge, Postgres and Redis, and a Next.js dashboard. Over six days it grew to cover six upstream providers (OpenRouter, Anthropic, Google Gemini, AWS Bedrock, Azure OpenAI, and a generic OpenAI-compatible adapter) and an OpenAI-compatible /v1 surface supporting chat, streaming, tool/function-calling, JSON-mode, embeddings, images, and audio. Around that core it added usage metering, per-tenant markup billing, budget enforcement, team governance, a load-balancing router with cooldown and circuit-breaking, response and semantic caching, SSO/OIDC, and a full enterprise dashboard. It was built entirely through ADD. This part examines what the method produced and what happened when the eight steps were applied to a production-sized system without exception.

The numbers and what they mean

| Dimension | Measure |
| --- | --- |
| Milestones | 23 (v1 → v24), each a versioned foundation bump |
| Tasks | ~120 across the milestones |
| Calendar span | 6 days (2026-06-10 → 2026-06-16) |
| Stage | graduated to production |
| Evidence base | 140+ append-only Key Decisions, state.json, a 53 KB CONVENTIONS.md, and 262 MB / 185 session transcripts |

Each term carries a precise meaning in ADD.

A milestone is not a sprint. It is one completed pass through the full eight-step loop — Ground through Observe — ending in a versioned foundation bump. Each milestone has an exit criterion; the loop does not close until that criterion is satisfied, which means milestones can reopen tasks and run multiple build/verify cycles before closing.

A task is a single spec-to-verify unit: one spec, one contract, one red suite, one green build, one recorded verify outcome. Tasks nest under milestones. The ~120 tasks across 23 milestones average roughly five per milestone, though the distribution was uneven: hardening milestones carried more tasks, and early milestones that established domain patterns were more expensive per task.

Zero waivers means every task’s verify gate produced exactly one of three recorded outcomes — PASS, RISK-ACCEPTED (a signed, named exception, never used for security findings), or HARD-STOP (a blocker requiring human resolution before build resumes). No gate was silently skipped. No green suite was treated as sufficient evidence on its own. The evidence base is the project’s own append-only Key Decisions log in PROJECT.md, and the specificity of that log — named defects, test counts, file paths — is what makes it credible rather than aspirational.
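The three-outcome gate can be modeled as a small record type. This is a minimal sketch under stated assumptions: the names VerifyOutcome, VerifyRecord, and may_proceed are hypothetical and do not come from the ai-proxy repository. Only the three outcome labels and the rule that security findings are never RISK-ACCEPTED come from the text above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class VerifyOutcome(Enum):
    PASS = "PASS"
    RISK_ACCEPTED = "RISK-ACCEPTED"
    HARD_STOP = "HARD-STOP"


@dataclass(frozen=True)
class VerifyRecord:
    """Hypothetical shape of one recorded verify outcome."""
    task: str
    outcome: VerifyOutcome
    security_finding: bool = False
    signed_by: Optional[str] = None  # a RISK-ACCEPTED exception must be named

    def __post_init__(self) -> None:
        if self.outcome is VerifyOutcome.RISK_ACCEPTED:
            if self.security_finding:
                raise ValueError("security findings cannot be RISK-ACCEPTED")
            if not self.signed_by:
                raise ValueError("RISK-ACCEPTED needs a named signer")


def may_proceed(record: Optional[VerifyRecord]) -> bool:
    """A missing record counts as a skipped gate, so it blocks like HARD-STOP."""
    return record is not None and record.outcome is not VerifyOutcome.HARD_STOP
```

Treating a missing record as a blocker is one way to encode "no gate was silently skipped": the absence of evidence fails the check instead of passing it.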

The eight steps in the field

Every milestone ran the same loop. Walking one representative example through each step shows what “the method” looks like when it contacts real code.

Milestone v1 established the foundational auth pattern: API key storage, argon2 hashing, the SigV4 signer for AWS Bedrock, and the tenant-scoped billing invariants. It is thoroughly documented in the fold log (the append-only Key Decisions log in PROJECT.md) and useful here because it exercised every step.

Ground. Before specifying, the agent mapped the real files, symbols, and conventions the task would touch. This is what keeps the spec aimed at reality rather than assumption. In v1, Ground confirmed that no upstream provider adapters existed yet — so the spec could not assume shared infrastructure and had to define the invariants from scratch.

Specify. The spec named what each feature must do, what it must reject (each rejection paired with a named error code), and the after-state once it succeeds. Critically, it ranked its own assumptions lowest-confidence first. In v1, the flagged assumption was:

# SPEC.md — api-key-storage
⚠ argon2 for API key secrets adds 50–200 ms to the hot auth path —
acceptable for password hashing but a latency concern for per-request
key verification. Confirm: slow KDF or HMAC-SHA256?

That single flag surfaced a real conflict: the domain glossary said argon2 for all keys, but argon2 on a high-entropy secret that must be checked on every proxied request is gratuitous. The decision — SHA-256 for API key secrets, argon2 for user passwords — was made in one sentence, before any code existed. The fold log records it as: “the freeze flag caught this spec/GLOSSARY conflict before code existed.”
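The reasoning behind that decision is that a slow KDF protects low-entropy secrets such as passwords, while a randomly generated API key is already too large to brute-force, so a fast digest is enough on the hot path. The following is a minimal sketch of that split using only the standard library. The function names and the "sk-" prefix are hypothetical, not taken from ai-proxy.

```python
import hashlib
import hmac
import secrets


def new_api_key() -> str:
    # 32 random bytes give a high-entropy secret; the prefix is illustrative.
    return "sk-" + secrets.token_urlsafe(32)


def digest_api_key(key: str) -> str:
    # Fast SHA-256 digest: only the digest is stored, never the key itself.
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def verify_api_key(presented: str, stored_digest: str) -> bool:
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(digest_api_key(presented), stored_digest)
```

User passwords would still go through argon2, as the decision states; that part is omitted here because it needs a third-party library.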

Scenarios. The spec rules became concrete pass-or-fail cases. For the SigV4 signer, scenario SV1 was the AWS published canonical vector (byte-for-byte match required); scenario SV5 asserted that the secret access key must not appear in either the returned headers or the credential object’s repr. Scenarios are what make “correct” checkable before a line of code is written.
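The repr half of scenario SV5 can be written as an executable check. This sketch assumes the credential object is a dataclass, which the source does not confirm; the field names follow the contract excerpt in this article, and the helper check_sv5 is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Mapping


@dataclass(frozen=True)
class AwsCredentials:
    access_key_id: str
    # repr=False keeps the secret out of logs and tracebacks that print the object.
    secret_access_key: str = field(repr=False)


def check_sv5(creds: AwsCredentials, headers: Mapping[str, str]) -> None:
    """Fail if the secret leaks through the repr or any returned header value."""
    assert creds.secret_access_key not in repr(creds)
    for value in headers.values():
        assert creds.secret_access_key not in value
```

A signer that derives its signature from the secret passes this check, because the Authorization header carries the signature and the access key ID, not the secret itself.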

Contract (frozen). The contract fixed the interface, data shapes, and error codes — and was immediately checksummed. For the SigV4 signer:

# contracts/bedrock-sigv4-auth.md — Status: FROZEN @ v1
sign_request(*, method, url, body, service, region, credentials, timestamp)
-> { "x-amz-date", "x-amz-content-sha256", "Authorization": "AWS4-HMAC-SHA256 ..." }
# PURE · TOTAL · DETERMINISTIC (timestamp injected, no IO, no globals)
AwsCredentials(access_key_id, secret_access_key) # secret_access_key: repr=False

The md5 tripwire over the frozen body meant that even editing a pseudocode comment in the frozen section after the snapshot would trip the alarm. The agent never altered a frozen contract on its own initiative — refinements went into the post-freeze sections (§6/§7), never into §3.
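A tripwire of this kind needs very little machinery. The sketch below is an assumption about how such a check could work, not the project's actual tooling: it hashes the frozen section's text at freeze time and compares the stored checksum on every later run. The function names are hypothetical.

```python
import hashlib


def frozen_checksum(frozen_body: str) -> str:
    # md5 is used as a change detector here, not as a security control.
    return hashlib.md5(frozen_body.encode("utf-8")).hexdigest()


def tripwire_ok(frozen_body: str, recorded_checksum: str) -> bool:
    """True only if the frozen section is byte-identical to the snapshot."""
    return frozen_checksum(frozen_body) == recorded_checksum
```

Because the digest covers every byte, a reworded comment changes the checksum just as an altered signature would, which is why even a comment edit requires a deliberate re-freeze.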

Tests (red). The suite was generated from the scenarios and contract, run to confirm failure, and the red was verified to be for the right reason. The fold log records: “Red confirmed for the RIGHT reason before any build line” (v1). In v6, absence-of-behavior tests were explicitly marked GREEN-BY-DESIGN so a green-before-build did not get misread as a test bug.

Build (green). The build instruction was simple: make every test pass, do not change any test, do not change the contract. The how was the agent’s to invent. In v11, for the JSON-mode feature, nobody prescribed the implementation. The agent reused the existing tool-coercion mechanism wholesale — emitting one synthetic forced tool whose output schema was the requested JSON schema, then unwrapping the returned tool-use block back into message.content. The contract fixed the observable behavior; the agent found the elegant path.
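The shape of that translation can be sketched with plain dictionaries. The payload layout below is loosely modeled on an Anthropic-style tools API (input_schema, a forced tool_choice, tool_use content blocks) and is an assumption about the upstream format; the tool name emit_json and both function names are hypothetical.

```python
import json
from typing import Any, Dict, List

SYNTHETIC_TOOL_NAME = "emit_json"  # hypothetical name for the forced tool


def json_mode_to_forced_tool(schema: Dict[str, Any]) -> Dict[str, Any]:
    """Express a JSON-mode request as one forced tool whose input is the schema."""
    return {
        "tools": [{
            "name": SYNTHETIC_TOOL_NAME,
            "description": "Return the answer as JSON matching the schema.",
            "input_schema": schema,
        }],
        "tool_choice": {"type": "tool", "name": SYNTHETIC_TOOL_NAME},
    }


def unwrap_tool_use(content_blocks: List[Dict[str, Any]]) -> str:
    """Turn the forced tool call back into a plain message.content string."""
    for block in content_blocks:
        if block.get("type") == "tool_use" and block.get("name") == SYNTHETIC_TOOL_NAME:
            return json.dumps(block["input"])
    raise ValueError("upstream did not return the forced tool call")
```

The caller sees ordinary JSON text in message.content and never learns that a tool call carried it, which is what lets the feature reuse the tool-calling path instead of adding a parallel one.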

Verify. Not inspection of the diff, but evidence. In v1, every test used the path / — the AWS canonical vector. The refute-read added a test using a real Bedrock model ID containing a colon (...sonnet-20241022-v1:0). AWS canonicalizes : to %3A; the signer was passing the raw path. The new test went red against the “green” code. Every versioned-model call would have returned 403 in production. A suite of seven passing tests would have shipped it.
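The encoding step at the center of that defect is small. The sketch below covers only the percent-encoding of the path, using a made-up model path because the real model ID is elided above. AWS documents further per-service canonicalization rules that this sketch does not attempt to cover, and canonical_path is a hypothetical helper name.

```python
from urllib.parse import quote


def canonical_path(path: str) -> str:
    """Percent-encode a request path, leaving only '/' and unreserved characters."""
    # quote() always keeps letters, digits, and "_.-~"; safe="/" preserves the
    # segment separators. An empty path canonicalizes to "/".
    return quote(path, safe="/") or "/"


# The canonical test vector's path survives unchanged, so it cannot expose the bug.
print(canonical_path("/"))
# A path containing a colon does change, which is what the refute-read test hit.
print(canonical_path("/model/example-model-v1:0/invoke"))
```

A signer that feeds the raw path into the canonical request computes a signature over a different string than the service does, which is consistent with the 403 described above.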

Observe. Production behavior becomes the next spec. A recurring boot failure — an empty upstream key producing an opaque Bearer '' 500 — surfaced across milestones v7 and v8. Observe turned it into a spec delta: milestone v12 added a boot guard that raises EmptyUpstreamKeyError at startup, before any adapter, converting the runtime mystery into a clear startup failure. The pattern folded into CONVENTIONS.md so later milestones inherited it by name.
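A boot guard of this kind might look like the following. Only the error name EmptyUpstreamKeyError comes from the text above; the function name, its signature, and the environment variable names in the usage note are hypothetical.

```python
from typing import Dict, List, Mapping


class EmptyUpstreamKeyError(RuntimeError):
    """Raised at startup when a required upstream key is missing or blank."""


def require_upstream_keys(env: Mapping[str, str], names: List[str]) -> Dict[str, str]:
    """Validate every required key before any adapter is constructed."""
    keys: Dict[str, str] = {}
    for name in names:
        value = env.get(name, "").strip()
        if not value:
            # Fail at boot with the variable name, instead of sending Bearer '' later.
            raise EmptyUpstreamKeyError(f"{name} is unset or empty; refusing to start")
        keys[name] = value
    return keys
```

Called as require_upstream_keys(os.environ, ["OPENROUTER_API_KEY"]) during application startup, a blank value stops the process with a message that names the variable.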

How the LLM behaved when scope was clamped

The most interesting finding of this build is behavioral. ADD does not merely constrain the output — it reshapes how the agent operates. Across 185 session transcripts, eight patterns repeated.

Where the clamped scope let the model shine. Given a frozen contract and a red suite, the agent was free to choose how to satisfy them, and that freedom produced genuine design quality. The “freeze-first SHARED-SEAM pattern” was discovered in v9 (the first provider extension), reused in v10 (tool-calling), and reused again in v11, where response_format composed with the v10 tools rather than spawning a parallel mechanism. No provider extension required a “rewrite the core” milestone; six providers and dozens of endpoints landed additively against a byte-identical regression net. The agent built this pattern; it was not prescribed.

Where the frozen contract and red tests caught plausible-and-wrong output. The clearest example is already documented above — the SigV4 path-encoding defect that seven green tests missed. A second example: in v2, mock-shaped fixtures produced a billing test suite that passed (0/0 recorded) while live billing upstream showed 24/73 — the fixtures were shaped to the mock, not to the reality. Verbatim live-captured fixtures became mandatory after that; the lesson folded into the foundation and held for every subsequent billing task.

| | Agent behavior without clamped scope | Agent behavior under ADD |
| --- | --- | --- |
| Ambiguity at spec time | Guesses confidently, sprints past it | Flags lowest-confidence assumption, waits for confirmation |
| Contract stability | Quietly alters interface to make tests pass | Routes all refinements to post-freeze sections; tripwire prevents edits to §3 |
| Red test confirmation | Starts build on the first red run | Verifies red is for the right reason before any build line |
| Green suite | Treats passing tests as sufficient evidence | Treats suite as necessary, not sufficient; runs live verify and refute-read on top |
| Security findings | Auto-resolves to continue the build | Every security finding escalates to the human as HARD-STOP |
| Lessons across milestones | Re-derives patterns each session | Reuses folded patterns by name from CONVENTIONS.md |

The refute-read produced some of the build’s most consequential findings. In v14, running with --no-coverage hid a real 78.14% coverage regression below the 80% floor; the adversarial subagent returned EARNED-WITH-GAPS and surfaced three real coverage gaps the green run had concealed. In v17, the same mechanism traced two memory leaks dismissed as “benign” back to forgotten in-file handlers and drove them to zero — a misdiagnosis caught, not just a cheat caught. In v18, it found a fail-open identity bypass where a followed redirect could chain to a trusted 200 response, forcing redirect:"manual" and a redirect→503 test to close the gap.
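The v18 fix is a fail-closed rule: a redirect from the identity endpoint is reported, not followed, and becomes a 503. The sketch below expresses that rule as a pure function over the status code. The names are hypothetical, and the handling of other non-200 answers is an assumption of this sketch that the source does not describe.

```python
class IdentityUnavailableError(Exception):
    """Hypothetical error that the gateway would translate into an HTTP 503."""
    status_code = 503


def accept_identity_response(status: int) -> None:
    """Accept only a direct 200 from the identity endpoint; fail closed otherwise."""
    if 300 <= status < 400:
        # The redirect is surfaced instead of followed, so it can never
        # chain to a 200 served by some other host.
        raise IdentityUnavailableError(f"identity endpoint redirected ({status})")
    if status != 200:
        # Assumption: any other non-200 answer also fails closed here.
        raise IdentityUnavailableError(f"identity endpoint returned {status}")
```

The matching regression test is the redirect→503 case: feed the function a 302 and assert that the error carries status_code 503.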

What it felt like: the human’s role

The devlog for this build is the PROJECT.md fold log — 140+ append-only Key Decisions, each dated, each recording what was caught, where, and why.

Reading across it, the human operator’s role had a distinct shape. The operator confirmed or corrected flagged spec assumptions — one sentence, before any code. The operator gave one approval at the contract freeze — one human gate per task. The operator received HARD-STOP escalations for every security finding and discharged them with a dedicated remediation task. Everything else — design, data structures, algorithm, refactoring — was the agent’s.

What got faster was everything inside the clamped scope: implementation, regeneration after a verify catch, test generation, and documentation of decisions (the fold log is agent-written). What stayed hard — and did not get faster — was verification itself. Live verification, the adversarial refute-read, and the security subagent each consumed real time and could not be parallelized below a certain floor without sacrificing the independence that makes adversarial review meaningful.

The human’s job shifted from writing code to owning direction and verification — the two things the method protects explicitly because the agent cannot reliably supply them.

The foundation compounded with each loop. Later milestones were faster not because the method relaxed, but because earlier milestones had folded their lessons: a new session re-oriented on the foundation instead of re-deriving what was settled. Debt was never silently dropped. The open item “v7 OPEN: empty-key boot guard” was resolved as a named task in v12, and “v21 OPEN: secret-chain sweep” was resolved in v22. Two whole milestones, v12 and v17, existed solely to pay down carried technical debt, on the record.

Honest costs and caveats

The case study’s caveats are worth naming directly.

One project, one operator. This is a rich field study, not a controlled experiment. There is no A/B comparison against the same system built without ADD, and the six-day calendar span does not measure operator hours or AI compute.

The evidence base is the project’s own audit trail. The fold log’s credibility rests on its specificity — named defects, test counts, file paths. One claim was spot-checked against the live code: the v22 decision states that rg "from exc" infrastructure/ → zero. The code confirms it: from None appears across exactly the provider adapters and the shared upstream_retry.py seam named in the decision, and from exc in the infrastructure source is zero. The log matched reality where it was checked.
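For readers unfamiliar with the idiom being counted, the difference between from exc and from None is whether the original exception stays attached as the explicit cause. The source does not state why the sweep preferred from None; a plausible reading, assumed here, is that a chained low-level error can carry credentials into logs. The error class and function below are hypothetical.

```python
class UpstreamAuthError(Exception):
    """Hypothetical sanitized error surfaced by a provider adapter."""


def call_upstream(api_key: str) -> None:
    try:
        # Stand-in for a client-library failure whose message embeds the key.
        raise ConnectionError(f"rejected Authorization: Bearer {api_key}")
    except ConnectionError:
        # "from None" clears __cause__ and sets __suppress_context__, so the
        # default traceback no longer prints the original, key-bearing error.
        raise UpstreamAuthError("upstream rejected the configured credentials") from None
```

With from exc, the original exception would be stored as __cause__ and printed in the traceback under "The above exception was the direct cause of the following exception".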

Some overhead is real. The contract-freeze ritual adds friction at the beginning of every task. The md5 tripwire means that even a comment edit in the frozen body requires a deliberate re-freeze. For a simple feature where the interface is obvious and stable, that overhead can feel like ceremony. The method earns that cost on complex features — where “obvious” interfaces turn out to contain a status_class label that cannot express a required 402, or a status 400-499 RANGE that contradicts a 429 already retry-handled in the same spec.

Not every discipline is unique to ADD. A strong human team with rigorous TDD and contract-first design could reach many of the same outcomes. ADD’s claim is narrower: the method makes that discipline the default path for a fast, eager agent, and records the evidence that it held.

The one-line finding from the case study, quoted here exactly:

Pointed at a real production system for six days, ADD turned a fast, plausible-but-fallible agent into one that surfaced its doubts before building, froze and respected its contracts, attacked its own green, stopped hard on security, and compounded every lesson into the next milestone — and the defects it caught were the ones a green test suite, read and found plausible, would have shipped.

That outcome is what the previous seven parts argued for. This is what it looked like when it ran.


Next in the series: The final part steps back to ask where ADD sits relative to the broader landscape of methods and tools — what it inherits, what it adds, and where it is and is not the right choice. Read it at ADD: Where It Sits.
