Ai Ml

Red Tests and the Build Loop: Tests-First for AI Agents

Tests-first, pointed at an AI agent: write a red suite that asserts observable behavior, then turn the agent loose with one instruction — make every test pass, change nothing in the tests or the contract. Red to green.

Tin Dang June 16, 2026 10 min read

Series hero on warm paper: 'Red Tests and the Build Loop: Tests-First for AI Agents'

Classic TDD invented tests-first to design the code — the red test is a design act, a statement of intent that forces the API into existence before the implementation. AI-Driven Development borrows the same sequencing for a different reason. It writes tests first to constrain an agent. The goal is not primarily to shape an interface — the frozen contract already fixed that. The goal is to give the agent a fixed, machine-checkable definition of done, so that “done” is not whatever the agent decides it means.

The tests are the immovable point. The agent’s how — its data structures, its algorithm, its choice of library — pivots freely around that point. Red to green. That is the whole motion of Steps 4 and 5.

Writing red tests: the executable definition of done

The test suite enters existence after the scenarios and the frozen contract. It does not enter existence to describe what code will be written. It enters existence to describe what behavior must be present in the world once the build is over.

That framing matters. The agent generates the suite from the scenarios and the contract — one test per scenario, plus contract-conformance assertions, plus the boundary values implied by every “Reject” rule. Then it runs the suite. And the suite must fail. Every test. Every one.

A test that passes before any implementation exists is testing nothing. It is a false reassurance — a green light wired to always show green, waiting to wave bad code through.

Confirming the suite is “red for the right reason” is what makes it genuinely protective. The right reason is a missing implementation: the function is not there, the route does not exist, the module imports nothing. Red for the wrong reason — a syntax error in the test itself, a broken harness, an assertion that can never evaluate — must be fixed before the build starts. You cannot judge a build against a broken net.

Why tests assert observable behavior, not internals

This is the constraint that makes the how disposable. If a test reaches inside the implementation — checking a private field, asserting the exact cache structure chosen, verifying that a particular class was instantiated — it couples itself to one version of the code. Regenerate the code underneath and the test breaks, not because the behavior changed, but because the internal arrangement changed.

Tests that assert only what comes out of the system — the response status, the transformed value, the side-effect state of another record — are indifferent to internals. They say: given this input, this observable result must be present. The code under them can be thrown away and rebuilt entirely, and if the new build satisfies the assertions, it is correct.

This is what makes the agent’s implementation genuinely free. It is not free in a vague, aspirational sense. It is free in a specific, mechanical sense: the test harness will not notice whether the agent chose a hash table or a balanced tree, whether it used a library you would have chosen or one you would not have, whether it decomposed the logic into three functions or thirteen. The observable result is the only judge.

A red suite from ai-proxy

In ai-proxy — the multi-tenant AI gateway built end to end through ADD across 23 milestones in six days — the SigV4 request-signing feature produced a suite that looked, before any implementation existed, like this:

def test_sv1_canonical_vector():
    """AWS published known-answer vector — must match byte-for-byte."""
    result = sign_request(
        method="GET",
        url="https://s3.amazonaws.com/examplebucket/test.txt",
        body=b"",
        service="s3",
        region="us-east-1",
        credentials=AwsCredentials(
            access_key_id="AKIAIOSFODNN7EXAMPLE",
            secret_access_key="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
        ),
        timestamp=datetime(2013, 5, 24, 0, 0, 0, tzinfo=timezone.utc),
    )
    assert result["Authorization"].startswith("AWS4-HMAC-SHA256 ")
    assert "Credential=AKIAIOSFODNN7EXAMPLE/20130524/us-east-1/s3/aws4_request" in result["Authorization"]
    assert result["x-amz-date"] == "20130524T000000Z"

def test_sv5_secret_never_leaks():
    """The secret access key must not appear in returned headers or repr."""
    creds = AwsCredentials(
        access_key_id="AKIAIOSFODNN7EXAMPLE",
        secret_access_key="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
    )
    result = sign_request(
        method="POST", url="https://bedrock.us-east-1.amazonaws.com/v1/chat",
        body=b'{"messages":[]}', service="bedrock", region="us-east-1",
        credentials=creds, timestamp=datetime(2024, 1, 1, tzinfo=timezone.utc),
    )
    secret = "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
    assert secret not in str(result)
    assert secret not in repr(creds)

def test_sv3_missing_region_returns_none():
    """Missing region → provider absent; no exception, no partial header."""
    result = sign_request(
        method="GET", url="https://s3.amazonaws.com/bucket/key",
        body=b"", service="s3", region=None,
        credentials=AwsCredentials("key", "secret"),
        timestamp=datetime(2024, 1, 1, tzinfo=timezone.utc),
    )
    assert result is None

Run these three tests with no sign_request function defined. They fail: NameError: name 'sign_request' is not defined. That is the correct, honest starting point. The suite is red for the right reason. The contract is proven executable — the test harness can reach it, load it, assert against it — before a single line of implementation exists. Now the build can begin.

The build instruction: the how set free

Once the suite is confirmed red for the right reason, the build starts. The build instruction is explicit about constraints, because the constraints are what keep the speed safe. Here is the literal form:

Read SPEC.md, contracts/bedrock-sigv4-auth.md, and tests/test_sigv4.py.
Implement the feature so that EVERY test passes.
Constraints:
  - Do NOT change any test.
  - Do NOT change the contract.
  - The secret access key must never appear in logs, repr, or headers.
  - Stop and ask if any requirement is unclear — do not guess.
  - Use only packages listed in dependencies.allowlist.
Report which tests pass and exactly what you changed.

Read it carefully. It says nothing about data structures. Nothing about which cryptographic library to use. Nothing about how to decompose the signing logic into functions, whether to cache intermediate values, whether to use hmac.new directly or wrap it. The implementation is the agent’s to invent.

The constraints all concern the what: do not change the observable definition of done; do not change the locked interface; surface secrets nowhere in the observable surface; ask rather than guess. Within those walls, the space is entirely open.

This is the sentence that makes ADD different from ordinary prompting: “make these specific failing tests pass without changing them.” The agent that produces confident nonsense from “build me a request signer” produces correct, bounded code from that instruction. The agent did not change. The direction did.

The red-to-green loop with an agent

The loop has a simple shape:

agent writes code → pipeline runs tests → some still fail
  → agent reads failure output → agent adjusts → ...
  → all green → hand to Verify

Inside a task, this loop is largely autonomous. The agent runs the tests, reads what failed, and adjusts. You are not watching each iteration. Your attention belongs at the boundaries: set the task clearly going in, review the result coming out. The loop in between is the agent’s domain — it is fast, it does not get bored, and it holds the full failure output in context.

The cardinal rule, stated once and enforced always: a test changed to fit the code inverts the entire method. An agent under pressure to turn a suite green has an available shortcut — weaken the assertion, stub the failing branch, delete the offending test. If you find a test was altered during the build, reject the change outright and re-prompt with the constraint restated. A test that was changed to pass is no longer a test. It is a record of what the code happened to do.

The same applies to the contract. The build implements against the frozen contract and may not edit it. A genuine need to change either is a change request that returns to an earlier step — Specify, then Scenarios, then a new contract freeze, then a new test suite. The loop has a backward arrow for exactly this reason.

Work in small batches. A single enormous change that turns the whole suite green at once is not a triumph. It is an unreviewable blob. Small batches keep the next step — Verify — tractable. You cannot move faster than you can verify, and an unreviewable batch is not a fast batch. It is an unverified one.

Behavior versus internals: the regenerability test

The sharpest way to check whether a test is asserting behavior or internals is to ask: if I deleted the implementation entirely and had the agent rewrite it from scratch — different file structure, different internal abstractions, same spec — would this test still be valid? If yes, it is a behavior test. If it would break purely from the internal rearrangement, it is an internals test and needs to be repaired before the build starts.

	Tests asserting internals	Tests asserting behavior
What they check	Private fields, class names, internal call sequences	Return values, status codes, side-effect states of other records
Coupling	Tightly coupled to one implementation	Indifferent to implementation
On regeneration	Break when internals change, even if behavior is unchanged	Survive full rewrites — only behavior matters
Agent freedom	Forces the agent toward one internal arrangement	Leaves the how entirely open
Example (SigV4)	assert signer._hmac_key == b'...' (private field)	assert result['Authorization'].startswith('AWS4-HMAC-SHA256 ')

The left column defeats disposability. It does not verify that the signing is correct; it verifies that one particular implementation is in place. The right column verifies the claim the spec actually made: the Authorization header must have a specific format. Everything underneath that observable output is free.

ai-proxy: where red caught what the tests themselves got wrong

There is one more thing the “confirm red for the right reason” gate is protecting against: bugs in the tests. This sounds paradoxical — the tests are supposed to catch bugs, not carry them — but a test with a wrong assertion, a test with a misnamed attribute, a test that can never reach the right state, is a test that will mislead the entire build.

In ai-proxy, the agent repeatedly found and fixed its own test bugs at this gate. One assertion read .status_code on an error type that only exposed .status. The test would have raised AttributeError during the build rather than a clean assertion failure — red for the wrong reason, misleading the agent about what was broken. Caught and corrected before the build started. Another test had a fixture that could mathematically never reach green: the fixture’s byte budget was smaller than the constant in the contract it was testing against. The arithmetic was checked at the red gate; the fixture was adjusted.

These are not exotic failures. They are the ordinary consequence of tests being generated, like all generated artifacts, with some probability of subtle error. The red-confirmation gate is the one systematic opportunity to catch them before they cost time in the build loop.

Green is necessary, not sufficient

The build ends when every test passes and the exit check clears: all tests green, coverage not decreased, no test and no contract modified, no dependency outside the allow-list added, the change small enough to review in full.

That is necessary. It is not sufficient.

Green can be gamed. Overfit to the fixtures, vacuous assertions, real logic stubbed away — a suite can pass while proving nothing. Passing tests are the prerequisite for trust, not trust itself.

The next step — Verify — is what earns the green. An adversarial refute-read argues that the green was not earned, hunting for exactly those failure modes. On the SigV4 feature in ai-proxy, the original seven tests all used the path / (the AWS canonical vector). The implementation passed clean. The refute-read added a test using a real Bedrock model ID containing a colon — ...sonnet-20241022-v1:0. AWS canonicalizes : to %3A; the implementation was passing the raw path. The new test went red against the “green” code. Every versioned-model call would have returned 403 in production. A green suite of seven tests would have shipped it.

Green is the entry to the verification step, not the exit from it.

Next in the series: Verify by Evidence — how ADD turns a passing test suite into earned trust through adversarial refute-reads, non-functional checks, and a recorded outcome that replaces “it looks right” with receipts.

Next in this series

Verify by Evidence: ADD's Earned-Green Refute-Read

Writing red tests: the executable definition of done

Why tests assert observable behavior, not internals

A red suite from ai-proxy

The build instruction: the how set free

The red-to-green loop with an agent

Behavior versus internals: the regenerability test

ai-proxy: where red caught what the tests themselves got wrong

Green is necessary, not sufficient

Related Posts

Normal AI Use vs Pro AI Use: The Same Feature, Built Two Ways

Steer Before You Sprint: Maximizing Opus in a Large Codebase

From Enterprise SDLC to Solo Vibe-Code with ADD