Observe and Fold: How ADD Improves Itself

The step that closes the loop: Observe production behavior and fold the spec delta back into the next Specify. How ADD keeps documents living while old-school artifacts rot — and how the method improves itself.

Tin Dang

Every prior step in ADD runs forward: Ground orients the agent, Specify clamps the scope, Scenarios make it concrete, Contract freezes the shape, Tests make it red, Build makes it green, Verify earns the trust. Then the feature ships. On most teams, that is where the story ends.

Observe is the backward arrow. It is the one step that points upstream — from production back to Specify — and that single arrow is what turns ADD’s eight steps from a tidy waterfall into a loop. Without it, you have a disciplined one-shot method. With it, the method improves itself.

What to observe: reality versus the spec

The spec — the Must / Reject / After that drove every step before Build — described expected behavior. Production reports actual behavior. The gap between them is what Observe watches for.

Three categories of gap matter.

Defects: a Must that turned out to be incomplete, a Reject that fires at a rate nobody designed for, an After-state that only holds under the narrow inputs the tests used. In ai-proxy, a recurring boot failure — an empty provider key producing a client-side protocol error — kept resurfacing across milestones. Each occurrence was handled; the pattern was not. Observe is what promoted it from repeated incident to formal spec entry.

Surprises: behaviors that are not wrong but were not anticipated. A user path nobody designed for, a latency distribution nobody measured, a rejection code firing on inputs the spec author never imagined. These describe real demand the spec did not capture.

New needs: what users do next, once the feature exists. A working auth layer surfaces a demand for audit logs; a working budget system surfaces a demand for budget inheritance. Each is better described as a spec delta than as a ticket no one traces back to a rule.

The scenarios from Step 2 have a second life here. They described the behavior you expected; in production, they become the monitors that flag when reality diverges. The same definition of “correct” that drove the tests now drives the alerts — a rate spike in one named rejection code (amount_invalid, forbidden, provider_absent) is a signal, not noise, precisely because the scenario named it.
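
One way a named rejection code could double as a monitor is sketched below. Only the three code names come from the text above; the baseline rates, the spike factor, and the function name `spiking_codes` are assumptions made for illustration, not ai-proxy's actual alerting.

```python
from collections import Counter

# Hypothetical expected share of requests rejected with each named code.
# These baselines are illustrative, not measured ai-proxy figures.
BASELINE_RATE = {
    "amount_invalid": 0.02,
    "forbidden": 0.01,
    "provider_absent": 0.001,
}


def spiking_codes(rejections, total_requests, factor=3.0):
    """Return the named codes whose observed rate exceeds factor x baseline.

    rejections: iterable of rejection-code strings seen in the window.
    total_requests: number of requests handled in the same window.
    """
    if total_requests <= 0:
        return []
    counts = Counter(rejections)
    spikes = []
    for code, baseline in BASELINE_RATE.items():
        observed = counts[code] / total_requests
        if observed > factor * baseline:
            spikes.append(code)
    return spikes
```

Because the monitor only knows the codes the scenarios named, an alert points straight back to a specific Reject rule rather than to an anonymous error count.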

This is what the spec delta is: a concrete change to the Must / Reject / After, grounded in a pointer to the evidence. Not a vague intention, not a ticket title. A rule.

Folding the delta: how the loop closes

A spec delta is only useful if it re-enters the flow at the right place. That place is Step 1 — Specify. The delta becomes the next loop’s opening material: a new Must, a corrected Reject, a revised After-state. The agent drafts it; a person confirms it. Then the loop proceeds — Scenarios, Contract, Tests, Build, Verify — against the updated spec.

This is the sense in which ADD improves itself. The method does not assume the spec is correct on the first pass. It assumes the spec will be corrected by what production teaches, and it builds the correction path into the process rather than leaving it to heroics or memory.

Each tagged learning follows a strict shape:

- [SDD · open] boot-guard must reject provider configuration with empty key at startup
(evidence: recurring boot failure across milestones 14, 17, 21 — empty provider key
producing a client-side protocol error; pattern promoted from incident to spec entry)

The tag marks which competency the learning sharpens — DDD if the domain model was wrong, SDD if a requirement was missing, UDD if a user-facing flow misled, TDD if a test scenario was absent, ADD if a build convention helped or hurt. The evidence is required; without it, the delta is an opinion, not a learning. The agent emits these as open; it never consolidates its own. Consolidation is judgment, and judgment belongs to the person who owns the review.
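
The shape above is strict enough to check mechanically. The following is a minimal sketch of such a check, assuming the header and the evidence arrive as two separate strings; `SpecDelta` and `parse_delta` are hypothetical names, not part of any ADD tooling described here.

```python
import re
from dataclasses import dataclass

COMPETENCIES = {"DDD", "SDD", "UDD", "TDD", "ADD"}
STATUSES = {"open", "folded", "rejected"}

# Matches a header line such as "- [SDD · open] boot-guard must reject ..."
HEADER = re.compile(r"^- \[(\w+) · (\w+)\] (.+)$")


@dataclass(frozen=True)
class SpecDelta:
    tag: str
    status: str
    rule: str
    evidence: str


def parse_delta(header: str, evidence: str) -> SpecDelta:
    """Parse one tagged learning, refusing any entry without evidence."""
    match = HEADER.match(header.strip())
    if match is None:
        raise ValueError("header does not follow '- [TAG · status] rule'")
    tag, status, rule = match.groups()
    if tag not in COMPETENCIES:
        raise ValueError(f"unknown competency tag: {tag}")
    if status not in STATUSES:
        raise ValueError(f"unknown status: {status}")
    if not evidence.strip():
        raise ValueError("evidence is required; without it the delta is an opinion")
    return SpecDelta(tag, status, rule, evidence.strip())
```

Rejecting an empty evidence string at parse time is one way to make the "no evidence, no learning" rule hard to skip.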

At milestone close, a person runs the retrospective consolidation: gather every open delta, group by competency, propose the exact edit to the foundation, confirm one by one, then write — append-only, newest first, flipping each delta from open to folded or rejected. The foundation version bumps. Later milestones inherit the learning by name, not by re-deriving it.
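
The consolidation steps above can be sketched roughly as follows. This is an illustration under assumptions: deltas are plain dicts, the `confirm` callback stands in for the person's one-by-one decision, the foundation version is a simple integer, and the version bumps only when at least one delta is folded.

```python
from collections import defaultdict


def consolidate(deltas, confirm, foundation_version):
    """Group open deltas by competency and resolve each via a human decision.

    deltas: list of dicts with "tag", "status", "rule", "evidence" keys.
    confirm: callable standing in for the reviewer; returns True to fold.
    Returns (resolved deltas, most recently resolved first; new version).
    """
    grouped = defaultdict(list)
    for delta in deltas:
        if delta["status"] == "open":
            grouped[delta["tag"]].append(delta)

    resolved = []
    for tag in sorted(grouped):
        for delta in grouped[tag]:
            outcome = "folded" if confirm(delta) else "rejected"
            # Copy rather than mutate, so the original open record stays intact.
            resolved.insert(0, {**delta, "status": outcome})

    # Assumption: only a folded learning changes the foundation.
    folded_any = any(d["status"] == "folded" for d in resolved)
    return resolved, foundation_version + (1 if folded_any else 0)
```

The function never decides anything itself: every fold or rejection comes from `confirm`, which keeps the judgment with the reviewer.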

In ai-proxy, this compounding was measurable. A 600-plus-line conventions document accumulated patterns that could not have been written at project start. Two whole milestones existed only to pay down tracked debt, on the record. The method got faster as it ran.

The most valuable information about a feature arrives after it ships. Observe is what makes that information useful — not an incident to survive, but a spec delta to absorb.

Living documents: what does not rot

Every old-school methodology produces documents. Every one shares the same fate.

A PRD is signed once and is a little less true every sprint. An architecture diagram describes the system as someone imagined it the day the diagram was drawn, not as it is built today. The decision that explains why the system uses argon2 instead of SHA-256 for API keys lives in a closed thread, a departed teammate’s memory, or nowhere. By sprint 3, the doc is fiction — authoritative-looking fiction that will mislead the next person who reads it as fact.

ADD’s answer is not “write better docs.” Discipline always loses to deadlines. The answer is to make a small set of documents living — kept current by the loop itself, not by anyone’s good intentions.

[Figure: side-by-side comparison. Left: an old-school document signed once that drifts from the code and becomes fiction. Right: ADD's living foundation (PROJECT.md, an append-only dated decision log, and Observe-to-fold), which remains accurate because the loop updates it.]
Caption: Old-school artifacts rot in silence. ADD's documents are kept accurate by the loop itself, and wear their state openly.

The contrast is structural, not aspirational:

| | Dead document | Living document |
|---|---|---|
| What it captures | The plan as imagined at signing | The spec as refined by each loop |
| How it ages | Silently diverges from the code | Ground re-checks it against real code every task |
| The 'why' of a decision | Buried in a thread or a memory | Append-only dated entry in the decision log |
| Lessons learned | A retro nobody reads twice | Folded into CONVENTIONS.md, tagged, inherited by name |
| How you spot decay | You cannot: it looks authoritative | State is worn openly: FROZEN at a version, dated, append-only |
| Onboarding | Read the stale wiki, then ask around | The foundation is the onboarding; a cold session re-orients itself |

The living foundation has a specific shape. PROJECT.md is the single document every task reads first — domain language, active spec, UI/UX constraints, key decisions. It is short by design: one screen, not a binder. The decision log is append-only and dated, so the why of every settled choice is recoverable. CONVENTIONS.md grows a tagged entry each time the loop learns something about how to build — patterns that would otherwise repeat as incidents. The spec marks what is settled versus still open. The contract is stamped FROZEN at a version.

Two examples from ai-proxy’s six days show what this means in practice.

The decision nobody could explain. Why argon2 and not SHA-256 for API keys? On most teams that answer lives in a closed thread or a former teammate’s head, and six months later someone reverts the decision and reintroduces the bug. On ai-proxy, each settled choice is an append-only, dated entry — 140-plus decisions logged across six days. A cold session, or a new team member, re-orients from the log instead of re-guessing.

The spec production rewrote. The recurring boot failure — an empty provider key — kept resurfacing across milestones. Old-school, each occurrence becomes a ticket, the spec is never updated, and the incident recurs. In ADD, Observe turned the pattern into a [SDD · open] delta, the delta re-entered at Specify as a dedicated boot-guard task, and the lesson folded into CONVENTIONS.md so every later milestone inherited the guard by name.
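
For a sense of what a guard of this kind might look like, here is a small sketch. It is not ai-proxy's boot-guard: the mapping of provider name to key, the `BootGuardError` type, and the function name are all hypothetical.

```python
class BootGuardError(Exception):
    """Raised at startup when provider configuration is unusable."""


def check_provider_keys(providers):
    """Reject any provider whose key is missing, empty, or whitespace only.

    providers: mapping of provider name -> API key (hypothetical shape).
    Raises BootGuardError naming every offending provider.
    """
    bad = sorted(
        name
        for name, key in providers.items()
        if key is None or not key.strip()
    )
    if bad:
        raise BootGuardError("empty provider key for: " + ", ".join(bad))
```

Failing at boot with the provider named is the point of the guard: the empty key surfaces as a clear startup error instead of a client-side protocol error later.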

The deeper value is transparency. Old-school documents hide their own decay — they look authoritative until they mislead. ADD’s documents wear their state on their sleeve. You can always see what is true, what changed, and why.

Governance and scale: keeping the loop honest

A single feature loops through Observe back to Specify. A milestone has the same shape at a larger scale — and a gate to match.

[Figure: three-level hierarchy. Project at the top (the living foundation and the versioned decision log), Milestones in the middle (each with a goal, exit criteria, and gate), and Tasks at the bottom (each with a spec, frozen contract, tests, and an explicit autonomy level).]
Caption: Project → Milestone → Task: the three levels of ADD's governance hierarchy. The loop runs at every level; each level has its own gate.

A task is one pass through the eight steps. It carries an explicit autonomy level — manual, conservative, or auto — declared at the freeze and reviewed at every gate. At auto, the build auto-gates on evidence; at conservative, a person stands at the Verify gate; at manual, the person owns every decision point. High-risk scope refuses an unguarded auto; it must be lowered deliberately. The autonomy level is not a policy written once — it is a per-task decision, made with knowledge of the risk.
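
The per-task autonomy decision could be modeled along these lines. This is a simplified sketch: the names are hypothetical, and because the text does not spell out what a guarded auto looks like, the sketch refuses any auto on high-risk scope.

```python
from enum import Enum


class Autonomy(Enum):
    MANUAL = "manual"
    CONSERVATIVE = "conservative"
    AUTO = "auto"


def declare_autonomy(requested: Autonomy, high_risk: bool) -> Autonomy:
    """Validate the autonomy level declared for a task at the freeze."""
    if high_risk and requested is Autonomy.AUTO:
        # Refuse rather than silently downgrade: lowering must be deliberate.
        raise ValueError("high-risk scope refuses auto; lower the level explicitly")
    return requested


def human_at_verify(level: Autonomy) -> bool:
    """True when a person must stand at the Verify gate."""
    return level in (Autonomy.MANUAL, Autonomy.CONSERVATIVE)
```

Raising an error instead of quietly falling back to conservative keeps the lowering a visible decision made by a person.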

A milestone groups tasks toward a goal expressed as exit criteria. The critical governance rule: a milestone is not done when its tasks are done. It is done when its goal is met, and only then. milestone-done is goal-gated — it refuses to close while any exit criterion remains unchecked. Those checkboxes are the human’s affirmation that the goal is genuinely met; the engine reads the tally and never judges the goal itself. While the milestone is held open, items discovered but out of scope become its next tasks, confirmed by the human, so the loop continues until the goal is reached.
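
A goal-gated close can be sketched as a checkbox tally. The sketch assumes exit criteria are written as `- [ ]` and `- [x]` lines, which is a guess about the file format, and the function `milestone_done` below is an illustration rather than the real command.

```python
import re

# Assumed format: one exit criterion per "- [ ] text" or "- [x] text" line.
CHECKBOX = re.compile(r"^\s*- \[( |x|X)\] (.+)$")


def unchecked_criteria(milestone_text: str):
    """Return the text of every exit criterion still unchecked."""
    pending = []
    for line in milestone_text.splitlines():
        match = CHECKBOX.match(line)
        if match and match.group(1) == " ":
            pending.append(match.group(2).strip())
    return pending


def milestone_done(milestone_text: str) -> bool:
    """Allow the close only when criteria exist and every one is checked.

    The function reads the tally; it never judges whether the goal is met.
    """
    has_criteria = any(CHECKBOX.match(line) for line in milestone_text.splitlines())
    return has_criteria and not unchecked_criteria(milestone_text)
```

Treating a milestone with no criteria at all as not done is a design choice of this sketch: a goal that was never written down cannot be affirmed.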

A project carries the living foundation — PROJECT.md, the decision log, CONVENTIONS.md — that every milestone reads and every loop extends. The foundation is the asset that compounds; the code is what satisfies it.

Each gate resolves to exactly one of three outcomes. PASS — criteria met, proceed. RISK-ACCEPTED — proceed with a signed waiver: a named owner, a linked ticket, an expiry date. Allowed for non-security gaps only. HARD-STOP — cannot proceed. Triggered by any failing test or any security finding; overridable only by the most senior accountable owner, and never for security. There are no silent skips. A report nobody is accountable for approving is a document. An outcome with a named owner is governance.
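
The three outcomes can be expressed as a small decision function. The sketch below makes several assumptions: the inputs are simple counts, a gap without a complete and unexpired waiver resolves to HARD-STOP, and the senior-owner override for non-security stops is left out. All names are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class Outcome(Enum):
    PASS = "PASS"
    RISK_ACCEPTED = "RISK-ACCEPTED"
    HARD_STOP = "HARD-STOP"


@dataclass(frozen=True)
class Waiver:
    owner: str     # named owner
    ticket: str    # linked ticket
    expires: date  # expiry date


def resolve_gate(failing_tests: int, security_findings: int,
                 open_gaps: int, waiver: Optional[Waiver],
                 today: date) -> Outcome:
    """Resolve a gate to exactly one outcome; there is no silent skip."""
    # Any failing test or security finding stops the gate; no waiver applies.
    if failing_tests > 0 or security_findings > 0:
        return Outcome.HARD_STOP
    if open_gaps == 0:
        return Outcome.PASS
    # Non-security gaps may proceed only under a complete, unexpired waiver.
    if waiver and waiver.owner and waiver.ticket and waiver.expires >= today:
        return Outcome.RISK_ACCEPTED
    return Outcome.HARD_STOP
```

Checking security findings before looking at the waiver mirrors the rule that a waiver is never an answer to a security gap.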

In ai-proxy, this discipline held across 23 milestones and roughly 120 tasks over six days. Zero risk-waivers. Every security finding escalated to a human and resolved as a HARD-STOP. Two milestones existed only to address tracked debt — not discovered retrospectively, but entered on the record as soon as Observe named them. The governance hierarchy did not slow the build; it made the speed trustworthy.

The compound

The first loop through ADD is an investment: Specify before you build, Tests before you run, Verify before you trust. It costs more than prompting without discipline. What the discipline buys is that the second loop starts from a stronger foundation than the first, and the third from a stronger foundation than the second.

The spec delta from Observe becomes the next Specify’s opening material. The lessons folded into CONVENTIONS.md are patterns the next milestone inherits instead of re-deriving. The decision log is the institutional memory that survives cold sessions, team changes, and six-month gaps.

The artifacts compound. A binder of signed-once documents cannot do this. Living documents updated by the loop can.

The loop is not the overhead. The loop is the return.


Next in the series: ADD in Production: the ai-proxy Field Study — a full field report on the 23-milestone, 120-task, six-day build: what the governance log shows, how the verify gate caught defects a 326-test suite missed, and what the method looks like from the inside.
