Ai Ml

Verify by Evidence: ADD's Earned-Green Refute-Read

Verify by evidence, not by reading the diff. ADD's earned-green: an adversarial refute-read that tries to break the result, because AI code is frequently plausible and wrong. Trust through proof.

Tin Dang June 16, 2026 11 min read

Series hero on warm paper: 'Verify by Evidence: ADD's Earned-Green Refute-Read'

A green suite is a claim. It says: “no case I wrote caught a failure.” It does not say “the implementation is correct.” Those two things are not the same, and the distance between them is where production defects live.

Every previous step in ADD has been about constraining what gets built and proving the contract is executable before code exists. Verify is where you close the remaining gap — not by reading the diff and finding it plausible, but by marshaling evidence. Green plus survived refutation is earned green. Green alone is not.

Why inspection fails on AI code

Inspection — reading the diff, finding it reasonable, approving — is the default verification instinct, and it is exactly wrong for AI-generated code.

The problem is not that the model writes bad code. The problem is that it writes plausible code. A model trained on billions of tokens of high-quality software produces output that reads like good code, follows idiomatic patterns, passes a casual scan, and is structured exactly the way a competent human would structure it. That plausibility is the model’s strength when the spec is right. It is the trap when the spec is subtly wrong, or when the implementation quietly satisfies the test fixtures without satisfying the behavior those fixtures were trying to encode.

AI code is frequently plausible and wrong. Plausibility is not evidence of correctness — it is evidence that the model is good at looking correct.

The human reading a diff is doing pattern matching: “this looks right.” But the model that produced the diff was also doing pattern matching — and it was optimizing for coherent-looking output, not for the specific invariants your spec required. Reading one pattern-match result with another pattern-match does not produce verification. It produces two plausibility checks stacked on top of each other.

ADD’s answer is to make the basis of trust explicit. Trust comes from evidence — passing tests, boundary probes that survived, a wiring trace that shows the code actually runs on the production path — not from the code reading well.

The refute-read

The refute-read is the adversarial move at the core of ADD’s Verify step. The framing matters enormously: you are not a reviewer hoping the result is correct. You are an adversary trying to prove it is wrong. Every case you probe without breaking it is a piece of evidence. A case that breaks it is a finding — which is exactly what the step exists to produce before the code ships.

What you probe: the boundaries the tests did not cover, the must-Reject paths and their named error codes, the unstated assumptions the spec left implicit, and the structural cheats that make a suite go green without the implementation earning it.

Three mechanical cheats pass an unchanged suite without earning it. The first is src overfit to fixtures: the implementation special-cases the literal test inputs rather than implementing the general behavior the spec asked for. It returns the right answer for user_id = 42 but fails for user_id = 43. The second is vacuous assertions: the test’s assert statement is tautological — it would be green against an empty implementation because it is not actually checking the thing it names. The third is real logic stubbed away: the function returns a constant the test fixtures happen to accept, and the real behavior is simply absent. All three are invisible to a tamper tripwire that checks whether test files were edited. None of them show up on a casual diff read. The refute-read is how you catch them.

Here is the structure of a refute-read probe session — the adversarial checklist that defines the check:

# Refute-read checklist — Step 6 Verify

PREMISE: The suite is green. Argue that the green was NOT earned.

1. FIXTURE OVERFIT
   - Pick an input not in the test fixtures. Does the behavior hold?
   - If the impl hard-codes a value that matches the fixture literal, flag it.

2. VACUOUS ASSERTIONS
   - For each assert: would it pass against an empty / stub implementation?
   - If yes, the test does not protect the behavior it claims to protect.

3. STUBBED LOGIC
   - Is the function body returning a constant or a pre-computed fixture value?
   - Trace the return path under a varied input; confirm it executes real logic.

4. BOUNDARY PROBES
   - Probe at-limit, just-over-limit, and just-under-limit for every numeric
     boundary named in the spec.
   - Probe the empty input, the maximal input, and any input the spec says to
     Reject — confirm the named error code is returned, not a generic error.

5. MUST-REJECT PATHS
   - For each Reject rule in the spec: trigger it deliberately.
   - Confirm the exact named error code — not a cousin, not a 500.

6. WIRING TRACE
   - For every new symbol, trace from the production entry point to the call site.
   - A symbol reachable only from a test helper is not wired; the feature is absent
     in the running program even though the tests pass.

7. CONCURRENCY
   - Is the risky operation atomic? Simulate two simultaneous calls.
   - Tests run serially; this check is always manual.

8. SECURITY RESIDUE
   - Any exposed secret, any injection surface, any dependency pulled by a
     plausible-but-wrong package name?
   - Security is always a HARD-STOP. Never auto-passed, never RISK-ACCEPTED.

An adversarial reviewer — a subagent in autonomy: auto mode, or a human under conservative — works through this list and records what each probe returned. The output is not an opinion (“looks fine”) but a trace (“boundary probe at limit+1 returned QUOTA_EXCEEDED as specified; boundary probe at limit-1 returned 200 OK with correct remaining balance”).

Evidence over opinion

What counts as evidence is precise. An opinion is “the implementation looks correct.” Evidence is a probe that went in at a boundary, produced a result, and the result matched or did not match the spec.

The three forms of evidence ADD treats as load-bearing:

A probe that fails, then is fixed. This is the strongest form. You designed a case the tests did not cover, it went red against the green code, and after the fix it went green. That is not a testing accident — it is a real defect caught before it reached production. The probe, the failure, and the fix are all on the record.

An independent re-derivation. For a computation with a known-answer vector — a cryptographic signature, a checksum, a date calculation — compute the expected result independently of the implementation and compare. The implementation’s output either matches or it does not. This is especially important for anything security-adjacent, where a plausible-looking implementation can be wrong in ways that look completely normal on a read.

A wiring trace. For every new function, endpoint, or hook: trace from the production entry point to the call site and record it — symbol, file, line. A function that nothing calls is present but absent in behavior. Tests pass in isolation; the feature, in practice, does not exist. This is a recurring failure class in production AI-built systems: the agent writes a well-formed function, tests exercise it directly, and the wiring step that connects it to the actual request handler simply never happened.

Earned green = green + the refutation attempts survived. The “earned” qualifier is not rhetorical. It is the record of what was tried and failed to break it.

The opposite of earned green has a name in ADD: a shallow verify. A shallow verify is one where the reviewer looked at the code, found it plausible, and recorded a pass without the boundary probes, the must-Reject checks, or the wiring trace. It looks like verification. It produces the same artifact — a recorded PASS. But it has not done the work that gives that PASS its meaning.

Verification is the ceiling

Part 1 named the fourth AI-era SDLC failure: verification is the real ceiling. When an agent produces more output than your team can verify, the excess is not speed — it is unreviewed risk accumulating. The ceiling is not how fast the agent can build. It is how fast you can establish trust in what it built.

Most teams invest in the code-generation half: better prompts, faster build loops, richer context. The verification half stays what it always was — a human reading a diff. That half does not get faster, so throughput stays bounded by the slow half while the fast half accumulates debt.

ADD’s answer is to make verification systematic enough that parts of it can run automatically, concentrating human attention on the parts that cannot.

	Trust-by-inspection	Earned-green Verify
Basis of trust	The code reads plausible	Boundary probes, wiring trace, re-derivation
Catches fixture overfit	Rarely — it looks correct	Yes — adversarial input outside the fixtures
Catches vacuous assertions	No — the test has a name and an assert	Yes — tautology check on each assertion
Catches stubbed logic	No — stub code reads like real code	Yes — trace the return path under varied input
Catches dead code	No — the symbol is there	Yes — wiring trace from production entry point
Security handling	"Looks fine" is a common outcome	HARD-STOP, always, regardless of autonomy level
Scales with output volume	No — bounded by human reading speed	Partially — probe suites, wiring scans, coverage gates can auto-run

Systematic verification produces automatable gates: a boundary probe suite runs in CI, a wiring trace can be scripted, a coverage gate fails a build. These do not replace the adversarial refute-read — they free it to focus on what machines cannot check: whether the behavior makes sense, whether the architecture is coherent, whether a security surface is a genuine exposure.

You will not outrun the ceiling by generating code faster. You raise it by investing in verification — which is where the return on attention is highest.

The ai-proxy catch: what the refute-read found

Abstract principles earn their credibility from concrete examples. Here is the specific catch from ai-proxy’s SigV4 signer that Part 1 named but the present chapter must ground fully.

The signer was built and tested through the full ADD loop: spec, scenarios, frozen contract, a red suite of seven tests against the AWS published known-answer vectors, build to green. The suite passed clean. The contract was correct. Coverage was complete against everything in the fixture set.

The refute-read asked one adversarial question: what happens when the URL path contains a colon?

AWS’s SigV4 algorithm canonicalizes the URI path before signing — colons in path segments must be percent-encoded to %3A. The seven test fixtures all used / as the path (the canonical AWS test vector). The implementation passed the raw path through without encoding. Against the fixture set, this was invisible: / contains no colon, so the signer’s output matched the expected value byte-for-byte.

The refute-read added one boundary probe: sign a request to a real Bedrock model ID — anthropic.claude-sonnet-20241022-v1:0 — where the model ID appears in the path. The probe went red against the green code. The colon was passed unencoded; AWS would have returned 403 SignatureDoesNotMatch on every call to a versioned model. The seven-test green suite would have shipped it.

The fix was a single encoding pass on the path before canonicalization. The boundary probe went green. The outcome was recorded as PASS — earned this time, because the adversarial probe had tried and failed to break it.

This is the refute-read working as designed. The suite was not wrong; it tested what it said it tested. The refute-read found the case the suite had not thought to test. That is not a failure of TDD — it is the reason Verify is a separate step.

Outcomes and the record

Every verification in ADD ends with exactly one of three recorded outcomes. There are no silent passes, no ambiguous states, no “we will clean this up later.”

Outcome	Meaning	Conditions
`PASS`	All checks met, all probes survived	The normal path
`RISK-ACCEPTED`	Proceed with a signed waiver: named owner, linked ticket, expiry date	A non-security gap only
`HARD-STOP`	Cannot proceed	Any failing test, any earned-green failure, any security finding

RISK-ACCEPTED is not a way to skip the check. It is a deliberate, documented decision — owner, ticket, expiry — to ship a known non-security limitation. Security is categorically different: a security finding is always a HARD-STOP, never a waiver, never auto-passed at any autonomy level. A shipped security gap compounds; the rule enforces the one place where no autonomy expansion is permitted.

Without a recorded outcome, a pass is indistinguishable from a skip. The record is what gives the outcome its meaning: these probes were run, this is what they returned.

A verified result can ship

When Verify records a PASS, the result can ship — not because the code is perfect, but because the trust is earned. The boundary probes survived. The wiring trace is complete. The refutation attempts failed to break it. The security surface was checked. That is the evidence that gives the release its ground.

Production will then teach you what the spec missed: a usage pattern nobody anticipated, a load profile that exposes a race, a constraint from a downstream system that was never in scope. That is not a failure of the method — it is Step 7. Observe takes what production reports and folds it back into the next Specify, closing the loop.

The method does not promise that a passing Verify catches everything. It promises that trust is earned — that real attempts were made to break the result, and their outcome is on the record. What production discovers afterward becomes the next iteration’s input at Specify. The loop is what makes the method improve.

Next in the series: Observe and Fold: Closing the ADD Loop — how production signal becomes the next spec, what it means to fold a lesson into the foundation, and why the backward arrow from Observe to Specify is what turns the method from a line into a loop.

Next in this series

Observe and Fold: How ADD Improves Itself

Why inspection fails on AI code

The refute-read

Evidence over opinion

Verification is the ceiling

The ai-proxy catch: what the refute-read found

Outcomes and the record

A verified result can ship

Related Posts

Normal AI Use vs Pro AI Use: The Same Feature, Built Two Ways

Steer Before You Sprint: Maximizing Opus in a Large Codebase

From Enterprise SDLC to Solo Vibe-Code with ADD