ADD for Machine Learning: The Eval Set Is the Frozen Contract

ML is AI building AI, and a high headline metric can hide leakage, gamed proxies, and slice regressions. ADD locks the eval set, thresholds, and model card as the frozen contract, makes the held-out suite the red tests, and verifies by slices and online evidence, not by the leaderboard number.

Tin Dang | June 16, 2026 | 13 min read

The model ships at 94% accuracy. The leaderboard number is real — the test set confirms it. Three weeks later, support tickets accumulate around a particular demographic slice that the headline number never mentioned. An audit reveals the evaluation set shared temporal structure with the training data. The “94%” was measuring a different thing than the thing that mattered, and the metric was never under suspicion because it was high.

This is the ML version of fast waste — not a sprint toward the wrong feature, but a training run toward the wrong objective, confirmed by a metric that was always a proxy and never the goal. In machine learning, AI is the producer, and the failures it generates look authoritative: a number, a curve, a confusion matrix. The appearance of rigor is the danger.

ADD (AI-Driven Development) was developed in a software-native context where the producer is a coding agent. But ML is the domain where the producer is literally AI building AI: an automated pipeline that searches hyperparameter space, fits weights, and returns a frozen artifact. The method translates with unusual sharpness here, because the ML workflow already has all the right nouns (eval sets, thresholds, model cards, slice analysis) and routinely treats them the wrong way. ADD says: lock them first, as the frozen contract, before training begins.

The four failures in ML clothing

Fast waste in ML is a model that looks finished because the headline metric is high, but was optimized for the wrong thing. The pipeline sprinted confidently past an ambiguous objective — accuracy when you needed calibrated recall on the minority class — and the metric never revealed the error.

Context rot is the problem space drifting after the eval was designed. The data distribution shifts, a regulatory constraint changes, a key business rule is updated — but the frozen eval from the original project spec is never revisited, and the model is still being measured against a definition of success that no longer applies.

Trust by inspection is reading a training curve and concluding the model is good. A smooth loss curve and a clean confusion matrix look like correctness. They are not evidence of correctness. A proxy metric optimized to a high value tells you the proxy was optimized, not that the underlying objective was served.

Verification ceiling is the volume problem: an automated pipeline can train hundreds of variants overnight. Each produces numbers. The team cannot meaningfully evaluate all of them, so high numbers pass silently. Output beyond your capacity to verify is not throughput — it is unreviewed risk, and in ML it ships as a production model.

The eval set is the frozen contract

The central idea of ADD in ML is this: the eval set, the metric thresholds, and the model card together are the frozen contract. They are the one human gate. They are locked and checksummed before a single training run begins, so you cannot move the goalposts midway through a project when the numbers come back inconvenient.

Constrain the what — the eval, the thresholds, the model card. Free the how — the architecture, the features, the hyperparameters. Verify by slices and online evidence, not by the leaderboard number.

The leaderboard number is the result of optimizing a proxy. The eval is the instrument you built to detect whether the proxy was a good one. If you are allowed to change the instrument after you see the result, the instrument means nothing. Locking the eval is what gives the number its meaning.

The loop, translated

Step 0 — Ground

Before writing any objective, load what is actually true: the data schema and known quality issues, the prior model card if one exists, the business constraints (latency, memory budget, regulatory obligations), and the deployment environment. In ML, ground is often where you discover that the label definition in the annotation guide does not match the label definition in the training data — a mismatch that will silently corrupt everything downstream.
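One cheap ground check is to compare the label vocabulary the annotation guide defines against the labels that actually occur in the training data. The sketch below is a minimal illustration, not a prescribed ADD tool: the function name `label_mismatches` and the label strings are hypothetical, and a vocabulary diff will not catch two sources that use the same label name with different meanings.

```python
from collections import Counter


def label_mismatches(guide_labels, training_labels):
    """Diff the annotation guide's label vocabulary against observed training labels."""
    guide = set(guide_labels)
    observed = Counter(training_labels)
    return {
        # Labels present in the data that the guide never defines.
        "undocumented": sorted(set(observed) - guide),
        # Labels the guide defines that never occur in the data.
        "unused": sorted(guide - set(observed)),
        "counts": dict(observed),
    }


# Hypothetical label names, for illustration only.
report = label_mismatches(
    guide_labels=["harassment", "self_harm", "graphic_violence", "benign"],
    training_labels=["harassment", "benign", "benign", "violence", "self_harm"],
)
print(report["undocumented"])  # ['violence']
print(report["unused"])        # ['graphic_violence']
```

Either list being non-empty is a reason to stop and reconcile the definitions before any objective is written.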

If a prior model exists, ground also includes its failure modes: which slices it underperformed on, what the online A/B showed versus the offline eval, what production drift was observed. The next model’s spec should be aimed at what actually failed, not at what the headline number suggests.

Step 1 — Specify

The spec states the true objective, the proxy metric chosen to approximate it, and the explicit list of things the model must not do — each paired with a named refusal code. Here is a realistic example for a content moderation classifier:

# SPEC — content-moderation-v3

objective: >
  Surface harmful content (harassment, self-harm, graphic violence) for human review
  before it reaches the user. Minimize false negatives on high-severity categories;
  tolerate a higher false-positive rate before escalating to the low-severity queue.

proxy_metric: recall@precision=0.90 on the held-out eval set

must_do:
  - Achieve recall >= 0.88 on high-severity labels at precision >= 0.90
  - Achieve recall >= 0.72 on high-severity labels within the "non-English" slice
  - Return a calibrated probability score, not just a binary label
  - Serve inference in <= 80 ms p99 under production traffic profile

must_not:
  - TRAIN-TEST-LEAK: Any sample from the eval set must not appear in training data.
    Verified by hash intersection of example IDs before training begins.
  - SLICE-REGRESSION: Recall on any predefined demographic or language slice must not
    regress more than 3 percentage points below the prior model's slice performance.
  - OBJECTIVE-PROXY-GAMED: A model that achieves the recall threshold by
    over-predicting positive on ambiguous inputs will be rejected even if it
    passes the aggregate threshold. Checked via calibration error and false-positive
    rate on the "clearly benign" control slice.
  - OFFLINE-ONLY-CLAIM: Offline eval results alone are not sufficient to claim
    production readiness. An online A/B with pre-registered success criteria is
    required before full rollout.

after_state: >
  A model that meets all thresholds on the locked, held-out eval AND on all predefined
  slices; passes leakage checks; is calibrated; and has been validated by an online A/B
  against pre-registered criteria before rollout begins.

assumptions — lowest-confidence first:
  ⚠ The "non-English" slice threshold (0.72 recall) is derived from the prior model's
    performance. If the new model's architecture changes the tradeoff surface, this
    threshold may need renegotiation — but only via a formal spec amendment before training.
  ⚠ The 80ms p99 latency target assumes the current serving infrastructure. Hardware
    changes in the next quarter may alter this constraint.

The refusal codes — TRAIN-TEST-LEAK, SLICE-REGRESSION, OBJECTIVE-PROXY-GAMED, OFFLINE-ONLY-CLAIM — are named reasons, not vague concerns. Each one points at a specific, checkable failure mode. When a model is rejected, the rejection cites a code. This is the discipline that prevents “the number looked high so we shipped it.”
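To make "the rejection cites a code" concrete, the refusal codes can be represented as an enumeration and attached to every rejection. This is a sketch under stated assumptions: the report keys (`leaked_ids`, `slice_drops_pp`, `ece`, `benign_fpr`, `online_ab_passed`) are hypothetical names, while the numeric limits are the ones given in the spec and contract for this example.

```python
from dataclasses import dataclass
from enum import Enum


class RefusalCode(Enum):
    TRAIN_TEST_LEAK = "TRAIN-TEST-LEAK"
    SLICE_REGRESSION = "SLICE-REGRESSION"
    OBJECTIVE_PROXY_GAMED = "OBJECTIVE-PROXY-GAMED"
    OFFLINE_ONLY_CLAIM = "OFFLINE-ONLY-CLAIM"


@dataclass(frozen=True)
class Rejection:
    code: RefusalCode
    detail: str


def review_candidate(report):
    """Return every refusal the candidate triggers; an empty list means none."""
    rejections = []
    if report["leaked_ids"] > 0:
        rejections.append(Rejection(
            RefusalCode.TRAIN_TEST_LEAK,
            f"{report['leaked_ids']} eval IDs found in training data"))
    worst_drop = max(report["slice_drops_pp"].values(), default=0.0)
    if worst_drop > 3.0:  # 3 percentage points, per the spec
        rejections.append(Rejection(
            RefusalCode.SLICE_REGRESSION,
            f"worst slice dropped {worst_drop:.1f}pp vs prior model"))
    if report["ece"] > 0.05 or report["benign_fpr"] > 0.04:
        rejections.append(Rejection(
            RefusalCode.OBJECTIVE_PROXY_GAMED,
            "calibration or benign-control false-positive limit exceeded"))
    if not report["online_ab_passed"]:
        rejections.append(Rejection(
            RefusalCode.OFFLINE_ONLY_CLAIM,
            "no online A/B result against pre-registered criteria"))
    return rejections


# Hypothetical candidate report, for illustration only.
rejections = review_candidate({
    "leaked_ids": 0,
    "slice_drops_pp": {"non_english": 4.5, "adversarial": 1.0},
    "ece": 0.03,
    "benign_fpr": 0.02,
    "online_ab_passed": False,
})
print([r.code.value for r in rejections])
# ['SLICE-REGRESSION', 'OFFLINE-ONLY-CLAIM']
```

Returning all triggered codes, rather than stopping at the first, keeps the rejection record complete for the model card.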

Step 2 — Scenarios

Three scenarios make the spec concrete: the model case, the edge case, and the failure case.

Model case — a high-severity harassment example in English, where the model should return probability > 0.90 and be routed to immediate human review.

Edge case — a non-English post in a low-resource language that represents a real deployment slice. The model must hit the slice threshold even though training data density is lower. This is the scenario most often omitted from unit tests and most often where slice regressions hide.

Failure case — an adversarial input: a harassment message with deliberate spelling variations or code-switching designed to evade the classifier. The expected behavior is explicit: either flag with high confidence, or return low confidence and route to review. Shipping a model that returns high-confidence “benign” on known adversarial inputs is a hard stop.
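The failure case can be phrased as a check over the known adversarial examples. The sketch below assumes each example is harmful by construction and that the model returns a probability of "harmful"; the 0.90 cutoff for "high-confidence benign" and the example IDs are assumptions for illustration, since the scenario itself does not fix a number.

```python
def adversarial_hard_stops(scored_examples, benign_confidence=0.90):
    """Return IDs of known-adversarial (harmful) examples scored as confidently benign.

    scored_examples: iterable of (example_id, p_harmful) pairs.
    benign_confidence: assumed cutoff for what counts as "high-confidence benign".
    """
    stops = []
    for example_id, p_harmful in scored_examples:
        p_benign = 1.0 - p_harmful
        if p_benign >= benign_confidence:
            stops.append(example_id)
    return stops


# Hypothetical scores for three adversarial examples.
stops = adversarial_hard_stops([
    ("adv-001", 0.97),  # flagged with high confidence: acceptable
    ("adv-002", 0.45),  # low confidence: routed to review, acceptable
    ("adv-003", 0.04),  # confidently benign on a harmful input: hard stop
])
print(stops)  # ['adv-003']
```

Any non-empty result blocks the candidate regardless of its aggregate score.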

Step 3 — The frozen contract

The contract is the eval set specification, the thresholds, and the model card template — locked before training begins. It is the single human gate.

# CONTRACT — content-moderation-v3 — FROZEN before training

eval_set:
  source: s3://ml-evals/content-mod/v3-holdout/
  checksum_sha256: "a7f3c2..."   # computed at freeze time; verified before every eval run
  composition:
    total_examples: 12000
    high_severity_positive: 3200
    low_severity_positive: 4100
    benign_control: 4700
    slices:
      - name: non_english
        size: 2400
        languages: [es, fr, ar, zh, hi, pt, id, de]
      - name: adversarial
        size: 600
        description: "Known evasion patterns — spelling variation, code-switching"
      - name: clearly_benign_control
        size: 1200
        description: "Human-verified benign content; used to check false-positive gating"

thresholds:
  aggregate_recall_at_p90: 0.88
  non_english_slice_recall_at_p90: 0.72
  adversarial_slice_recall_at_p90: 0.65
  calibration_ece: <= 0.05
  false_positive_rate_on_benign_control: <= 0.04
  latency_p99_ms: <= 80

model_card_template: docs/model-card-v3-template.md

status: FROZEN @ v3.0
amendment_protocol: >
  Any threshold or eval set change requires a written spec amendment reviewed and
  approved before any training runs that would be measured against the new criteria.
  Retroactive amendments are not permitted.

The checksum is what makes this a real frozen contract rather than a policy intention. If anyone modifies the eval set after this file is committed — adds examples, removes hard ones, rebalances classes — the checksum fails before the eval run begins. You cannot silently move the goalposts because the goalposts are hashed.
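One way to compute and verify such a checksum is sketched below. It assumes the eval set has been synced to a local directory (the contract above points at an S3 prefix) and hashes every file's relative path and bytes in sorted order; the function names `eval_set_checksum` and `verify_frozen` are hypothetical, and the demo uses a throwaway temporary directory rather than real eval data.

```python
import hashlib
import tempfile
from pathlib import Path


def eval_set_checksum(root: Path) -> str:
    """SHA-256 over every file under root, in sorted relative-path order."""
    digest = hashlib.sha256()
    files = sorted(
        (p.relative_to(root).as_posix(), p) for p in root.rglob("*") if p.is_file()
    )
    for rel, path in files:
        name = rel.encode("utf-8")
        data = path.read_bytes()  # fine for a sketch; stream large files in practice
        # Length prefixes keep (name, content) boundaries unambiguous.
        digest.update(len(name).to_bytes(8, "big"))
        digest.update(name)
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()


def verify_frozen(root: Path, expected: str) -> None:
    """Raise before an eval run if the eval set no longer matches the contract."""
    actual = eval_set_checksum(root)
    if actual != expected:
        raise RuntimeError(f"eval set changed since freeze: {actual} != {expected}")


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "examples.jsonl").write_text('{"id": "e1", "label": "benign"}\n')
    frozen = eval_set_checksum(root)   # recorded in the contract at freeze time
    verify_frozen(root, frozen)        # passes: nothing has changed

    # Someone relabels an example after the freeze.
    (root / "examples.jsonl").write_text('{"id": "e1", "label": "harassment"}\n')
    try:
        verify_frozen(root, frozen)
        tamper_detected = False
    except RuntimeError:
        tamper_detected = True

print(tamper_detected)  # True
```

Hashing paths as well as contents means renaming, adding, or removing a file changes the digest, not only editing one.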

Step 4 — Acceptance checks (the red tests)

The held-out eval suite, the slice checks, and the leakage audits are the red tests. They are written and failing before training begins, because no model exists yet.

ACCEPTANCE CHECKLIST — content-moderation-v3
Run before any model is considered for production review.

[ ] LEAKAGE CHECK
    - Hash-intersect all training example IDs against eval IDs: intersection must be empty.
    - Verify eval set checksum matches FROZEN contract: sha256(eval_set) == "a7f3c2..."
    - Check temporal structure: no eval example timestamp falls within training window.
    Status: RED (training data not yet assembled; must be green before training starts)

[ ] AGGREGATE THRESHOLD
    - recall@precision=0.90 on full held-out eval >= 0.88
    Status: RED

[ ] SLICE THRESHOLDS
    - non_english slice: recall@p90 >= 0.72
    - adversarial slice: recall@p90 >= 0.65
    - clearly_benign_control: false-positive rate <= 0.04
    Status: RED

[ ] CALIBRATION
    - Expected Calibration Error (ECE) <= 0.05
    - Reliability diagram reviewed by a human (not auto-passed)
    Status: RED

[ ] SLICE REGRESSION VS PRIOR MODEL
    - No slice regresses more than 3pp below prior model performance on any named slice
    - Prior model baseline scores: [loaded from model-card-v2.md]
    Status: RED

[ ] LATENCY
    - p99 inference latency <= 80ms under production traffic profile in staging environment
    Status: RED

[ ] ONLINE A/B PRE-REGISTRATION
    - Success criteria, traffic allocation, and minimum runtime registered before rollout
    - Primary metric: harm-reaching-users rate (not the offline proxy)
    Status: RED (required before full rollout — not before offline eval)

These checks are failing because nothing has been trained. They should fail. A check that passes before any model exists is not protecting anything.
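The leakage items in the checklist reduce to two small checks: an ID intersection and a timestamp window test. The following is a minimal sketch, assuming example IDs are stable strings and every eval example carries a timestamp; the function names and sample values are hypothetical. Note that ID matching alone will not catch near-duplicate content stored under different IDs.

```python
from datetime import datetime


def leaked_ids(train_ids, eval_ids):
    """IDs present in both training and eval data; must be empty."""
    return set(train_ids) & set(eval_ids)


def temporal_overlaps(eval_timestamps, train_start, train_end):
    """Eval example IDs whose timestamp falls inside the training window (inclusive)."""
    return sorted(
        example_id
        for example_id, ts in eval_timestamps.items()
        if train_start <= ts <= train_end
    )


# Hypothetical IDs and dates, for illustration only.
leaks = leaked_ids(train_ids=["t1", "t2", "e7"], eval_ids=["e7", "e8", "e9"])
overlaps = temporal_overlaps(
    eval_timestamps={
        "e8": datetime(2026, 1, 15),
        "e9": datetime(2026, 3, 2),
    },
    train_start=datetime(2025, 6, 1),
    train_end=datetime(2026, 1, 31),
)
print(leaks)     # {'e7'}
print(overlaps)  # ['e8']
```

In this example both checks would report TRAIN-TEST-LEAK: one shared ID and one eval example drawn from inside the training window.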

Step 5 — Produce (make the tests green)

Now training begins. Architecture, feature engineering, data augmentation strategy, hyperparameter search — all of it is wide open. The only constraint is: make every acceptance check pass against the locked eval. The training pipeline does not touch the eval set. The hyperparameter search is measured against a validation split, not the held-out eval. The held-out eval is used exactly once per candidate model.
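The "exactly once per candidate" rule can be enforced mechanically rather than by convention. Below is a minimal in-memory sketch; the class name `HeldOutEval` and the candidate IDs are hypothetical, and a real setup would presumably persist the record of scored candidates somewhere the training pipeline cannot rewrite.

```python
class HeldOutEval:
    """Wraps the held-out scoring function so each candidate is scored once."""

    def __init__(self, score_fn):
        self._score_fn = score_fn
        self._results = {}

    def evaluate(self, candidate_id, model):
        if candidate_id in self._results:
            raise RuntimeError(
                f"candidate {candidate_id!r} was already scored on the held-out eval"
            )
        self._results[candidate_id] = self._score_fn(model)
        return self._results[candidate_id]


# Stand-in scoring function and model, for illustration only.
holdout = HeldOutEval(score_fn=lambda model: model["recall_at_p90"])
first_score = holdout.evaluate("cand-001", {"recall_at_p90": 0.89})

try:
    holdout.evaluate("cand-001", {"recall_at_p90": 0.91})
    second_run_blocked = False
except RuntimeError:
    second_run_blocked = True

print(first_score, second_run_blocked)  # 0.89 True
```

Hyperparameter search stays on the validation split; only a finished candidate is passed through this gate.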

The how — whether to use a fine-tuned large model or a smaller distilled one, whether to use SMOTE for class imbalance or threshold calibration, whether to use cross-lingual transfer or train language-specific heads — is entirely the team’s to decide. ADD does not prescribe the architecture. It prescribes the gate.

Step 6 — Verify by evidence, not by the leaderboard number

A model that passes all acceptance checks is a candidate, not a product. Verification is the human act of checking that the green was earned.

| | Chasing the leaderboard | Locked-eval ADD |
|---|---|---|
| Eval defined when | After seeing early results | Before training begins, checksummed |
| Metric treated as | The goal | A proxy for the goal |
| Slice analysis | Optional, post-hoc | Required, pre-registered, gates the build |
| Leakage check | Rarely formalized | Mandatory, hash-verified, blocks training |
| Adversarial testing | Ad hoc if at all | A named slice with a threshold |
| Online validation | Optional | Required before full rollout (OFFLINE-ONLY-CLAIM) |
| Goalposts | Can shift when results disappoint | Locked; amendments require written process |
| Verification standard | "The number is high" | Slices + calibration + online A/B + refute-read |

The adversarial move — the refute-read in ML — is to actively hunt for the failing slice before declaring success. Hunt for the demographic where recall drops. Hunt for the domain shift that degrades the aggregate. Hunt for the training-distribution artifact the model is actually keying on. The work is not “show me the accuracy” but “show me where this model is still wrong, and whether that matters.”
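Hunting for the failing slice starts with computing the metric per slice instead of once over the whole eval. The sketch below is one possible reading of recall@precision: the best recall over all score cutoffs whose precision meets the target, computed independently per slice. That per-slice cutoff choice, the function names, and the toy data are assumptions; a team might instead fix one global cutoff and report recall per slice at that cutoff.

```python
def recall_at_precision(scores, labels, min_precision=0.90):
    """Best recall over all score cutoffs whose precision is >= min_precision."""
    total_pos = sum(labels)
    if total_pos == 0:
        return 0.0
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    best, tp, fp = 0.0, 0, 0
    for i, (score, label) in enumerate(ranked):
        tp += label
        fp += 1 - label
        # Only treat the end of a tied-score group as a valid cutoff.
        group_end = i + 1 == len(ranked) or ranked[i + 1][0] != score
        if group_end and tp / (tp + fp) >= min_precision:
            best = max(best, tp / total_pos)
    return best


def recall_by_slice(rows, min_precision=0.90):
    """rows: iterable of (slice_name, score, label) with label in {0, 1}."""
    grouped = {}
    for name, score, label in rows:
        scores, labels = grouped.setdefault(name, ([], []))
        scores.append(score)
        labels.append(label)
    return {n: recall_at_precision(s, l, min_precision) for n, (s, l) in grouped.items()}


def regressed_slices(current, prior, max_drop_pp=3.0):
    """Slices whose recall fell more than max_drop_pp points below the prior model."""
    return sorted(
        n for n in prior if (prior[n] - current.get(n, 0.0)) * 100 > max_drop_pp
    )


# Toy data, for illustration only.
rows = [
    ("english", 0.95, 1), ("english", 0.90, 1), ("english", 0.80, 1),
    ("english", 0.70, 0), ("english", 0.20, 0),
    ("non_english", 0.90, 1), ("non_english", 0.85, 0), ("non_english", 0.60, 1),
    ("non_english", 0.40, 1), ("non_english", 0.10, 0),
]
current = recall_by_slice(rows)
regressed = regressed_slices(current, prior={"english": 1.0, "non_english": 0.72})
print(regressed)  # ['non_english']
```

Here the aggregate could still look acceptable while the non-English slice recovers only one of its three positives at the required precision, which is exactly the pattern the refute-read is meant to surface.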

The online A/B is where the proxy is finally confronted with the objective. The offline metric was recall at a fixed precision of 0.90 on the held-out eval. The online metric is harm actually reaching users. These two numbers will not move in perfect lockstep, and the gap between them is real information about distributional shift, annotation disagreement, and proxy quality. The A/B is not a formality; it is the first moment where evidence about the true objective is available.
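Pre-registration means the decision rule exists before the data does. As a sketch only, the rule below uses a one-sided two-proportion z-test on the harm-reaching-users rate; the choice of test, the `PreRegistration` fields, and the traffic numbers are all assumptions made for illustration, not something the method prescribes.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PreRegistration:
    """Decision rule fixed before the experiment starts (hypothetical fields)."""
    min_users_per_arm: int
    alpha: float


def harm_rate_reduced(control_harm, control_n, treat_harm, treat_n, prereg):
    """Return (passed, p_value) for H1: treatment harm rate < control harm rate."""
    if min(control_n, treat_n) < prereg.min_users_per_arm:
        return False, None  # not enough traffic yet; no claim either way
    p_control = control_harm / control_n
    p_treat = treat_harm / treat_n
    pooled = (control_harm + treat_harm) / (control_n + treat_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / control_n + 1 / treat_n))
    if se == 0:
        return False, None
    z = (p_control - p_treat) / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # one-sided tail P(Z >= z)
    return p_value < prereg.alpha, p_value


# Hypothetical experiment counts.
prereg = PreRegistration(min_users_per_arm=50_000, alpha=0.05)
passed, p_value = harm_rate_reduced(
    control_harm=400, control_n=100_000,
    treat_harm=320, treat_n=100_000,
    prereg=prereg,
)
print(passed)  # True
```

The important property is not the particular statistic but that `prereg` is committed before rollout, so the success criterion cannot be adjusted after the numbers arrive.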

Step 7 — Observe and fold

Production is a distribution, and distributions drift. The eval set, however carefully curated, was a sample from a past distribution. Over time, new failure patterns emerge — new evasion tactics, new content modalities, new languages in the user base — that the original eval did not capture.

In ADD, these observations fold back into the eval set. A newly discovered failure case becomes a new eval example, properly labeled and checksummed into the next version of the contract. The slice list grows. The thresholds are revisited with fresh evidence. The model card is updated to record what was observed in production.
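Folding can be sketched as building the next eval version from the current one plus newly labeled production failures, then computing a fresh checksum for the next contract. The function names, record fields, and example records below are hypothetical; the point is that the current frozen version is never edited in place.

```python
import hashlib
import json


def checksum(examples):
    """SHA-256 of the examples in a canonical (ID-sorted, key-sorted) JSON form."""
    canonical = json.dumps(
        sorted(examples, key=lambda e: e["id"]), sort_keys=True
    ).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def fold_failures(current_examples, new_failures):
    """Return (next_examples, next_checksum) without mutating the current version."""
    known = {e["id"] for e in current_examples}
    additions = []
    for failure in new_failures:
        if not failure.get("label"):
            raise ValueError(f"failure {failure['id']!r} has no label yet")
        if failure["id"] not in known:  # skip cases already in the eval
            additions.append(failure)
            known.add(failure["id"])
    next_examples = list(current_examples) + additions
    return next_examples, checksum(next_examples)


# Hypothetical records, for illustration only.
v3 = [{"id": "e1", "text": "ordinary greeting", "label": "benign"}]
observed = [
    {"id": "p1", "text": "new evasion pattern", "label": "harassment"},
    {"id": "e1", "text": "ordinary greeting", "label": "benign"},  # already present
]
v4, v4_checksum = fold_failures(v3, observed)
print(len(v4), v4_checksum != checksum(v3))  # 2 True
```

The new checksum belongs to the next contract version, which is frozen before the next training cycle begins.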

The eval is living. It grows over time in the direction the real world is pulling it. What does not change is the principle: the eval is locked before training, not after.

Constrain the what, free the how

A team that locks the eval and thresholds before training is free to be genuinely creative in every other dimension. They can try a radically different architecture without worrying that the goalposts will shift if it underperforms. They can run an aggressive hyperparameter search without the eval contaminating it. They can swap the feature pipeline entirely and trust that the acceptance checks will tell them whether the swap helped.

The discipline is counterintuitive: you get more freedom in execution by accepting more constraint on the outcome. The locked contract is not a ceiling on ambition. It is the instrument that makes ambition measurable.

What does not transfer

Problem framing is irreducibly human. Deciding what to optimize, which slices matter, what counts as a harmful output, what the acceptable tradeoff is between false positives and false negatives — these are value judgments, not engineering decisions. An eval set operationalizes those judgments; it does not make them. The ADD loop can tell you whether you built the model you specified. It cannot tell you whether you specified the right model.

A metric is a proxy, and no checklist changes that. ADD reduces the risk that you optimize for a proxy while losing the objective, but it does not eliminate proxy risk. A team that locks a bad eval and trains a model that passes it has moved the problem upstream, not solved it. The work of designing a good eval — choosing examples, defining slices, setting thresholds that correspond to real-world stakes — is not automatable. It is the hardest part of the job.

Ethics and impact judgment cannot be delegated to the optimizer. A model that passes every threshold on a well-designed eval can still cause harm if the deployment context changes, if the user population is different from the annotation population, or if the use case is misaligned with the model’s capabilities. ADD handles the engineering discipline around training and evaluation. The question of whether to build and ship a given model at all is one the method explicitly does not answer.

Next in the series

Part 10 applies ADD to Finance and FP&A — where the fast waste is a forecast that looks authoritative because a spreadsheet produced it, and the frozen contract is the model assumptions locked before the numbers are run. Read it at ADD for Finance.