The model ships at 94% accuracy. The leaderboard number is real — the test set confirms it. Three weeks later, support tickets accumulate around a particular demographic slice that the headline number never mentioned. An audit reveals the evaluation set shared temporal structure with the training data. The “94%” was measuring a different thing than the thing that mattered, and the metric was never under suspicion because it was high.
This is the ML version of fast waste — not a sprint toward the wrong feature, but a training run toward the wrong objective, confirmed by a metric that was always a proxy and never the goal. In machine learning, AI is the producer, and the failures it generates look authoritative: a number, a curve, a confusion matrix. The appearance of rigor is the danger.
ADD (AI-Driven Development) was developed in a software-native context where the producer is a coding agent. But ML is the domain where the producer is literally AI building AI: an automated pipeline that searches hyperparameter space, fits weights, and returns a frozen artifact. The method translates with unusual sharpness here, because the ML workflow already has all the right nouns (eval sets, thresholds, model cards, slice analysis) and routinely treats them the wrong way. ADD says: lock them first, as the frozen contract, before training begins.
The four failures in ML clothing
Fast waste in ML is a model that looks finished because the headline metric is high, but was optimized for the wrong thing. The pipeline sprinted confidently past an ambiguous objective — accuracy when you needed calibrated recall on the minority class — and the metric never revealed the error.
Context rot is the problem space drifting after the eval was designed. The data distribution shifts, a regulatory constraint changes, a key business rule is updated — but the frozen eval from the original project spec is never revisited, and the model is still being measured against a definition of success that no longer applies.
Trust by inspection is reading a training curve and concluding the model is good. A smooth loss curve and a clean confusion matrix look like correctness. They are not evidence of correctness. A proxy metric optimized to a high value tells you the proxy was optimized, not that the underlying objective was served.
Verification ceiling is the volume problem: an automated pipeline can train hundreds of variants overnight. Each produces numbers. The team cannot meaningfully evaluate all of them, so high numbers pass silently. Output beyond your capacity to verify is not throughput — it is unreviewed risk, and in ML it ships as a production model.
The eval set is the frozen contract
The central idea of ADD in ML is this: the eval set, the metric thresholds, and the model card together are the frozen contract. They are the one human gate. They are locked and checksummed before a single training run begins, so you cannot move the goalposts midway through a project when the numbers come back inconvenient.
Constrain the what — the eval, the thresholds, the model card. Free the how — the architecture, the features, the hyperparameters. Verify by slices and online evidence, not by the leaderboard number.
The leaderboard number is the result of optimizing a proxy. The eval is the instrument you built to detect whether the proxy was a good one. If you are allowed to change the instrument after you see the result, the instrument means nothing. Locking the eval is what gives the number its meaning.
The loop, translated
Step 0 — Ground
Before writing any objective, load what is actually true: the data schema and known quality issues, the prior model card if one exists, the business constraints (latency, memory budget, regulatory obligations), and the deployment environment. In ML, ground is often where you discover that the label definition in the annotation guide does not match the label definition in the training data — a mismatch that will silently corrupt everything downstream.
If a prior model exists, ground also includes its failure modes: which slices it underperformed on, what the online A/B showed versus the offline eval, what production drift was observed. The next model’s spec should be aimed at what actually failed, not at what the headline number suggests.
Step 1 — Specify
The spec states the true objective, the proxy metric chosen to approximate it, and the explicit list of things the model must not do — each paired with a named refusal code. Here is a realistic example for a content moderation classifier:
# SPEC: content-moderation-v3

objective: >
  Surface harmful content (harassment, self-harm, graphic violence) for human
  review before it reaches the user. Minimize false negatives on high-severity
  categories; tolerate a higher false-positive rate before escalating to the
  low-severity queue.

proxy_metric: recall@precision=0.90 on the held-out eval set

must_do:
  - Achieve recall >= 0.88 on high-severity labels at precision >= 0.90
  - Achieve recall >= 0.72 on high-severity labels within the "non-English" slice
  - Return a calibrated probability score, not just a binary label
  - Serve inference in < 80ms p99 under production traffic profile

must_not:
  - TRAIN-TEST-LEAK: Any sample from the eval set must not appear in training
    data. Verified by hash intersection of example IDs before training begins.
  - SLICE-REGRESSION: Recall on any predefined demographic or language slice
    must not regress more than 3 percentage points below the prior model's
    slice performance.
  - OBJECTIVE-PROXY-GAMED: A model that achieves the recall threshold by
    over-predicting positive on ambiguous inputs will be rejected even if it
    passes the aggregate threshold. Checked via calibration error and
    false-positive rate on the "clearly benign" control slice.
  - OFFLINE-ONLY-CLAIM: Offline eval results alone are not sufficient to claim
    production readiness. An online A/B with pre-registered success criteria
    is required before full rollout.

after_state: >
  A model that meets all thresholds on the locked, held-out eval AND on all
  predefined slices; passes leakage checks; is calibrated; and has been
  validated by an online A/B against pre-registered criteria before rollout
  begins.

assumptions (lowest-confidence first):
  ⚠ The "non-English" slice threshold (0.72 recall) is derived from the prior
    model's performance. If the new model's architecture changes the tradeoff
    surface, this threshold may need renegotiation, but only via a formal spec
    amendment before training.
  ⚠ The 80ms p99 latency target assumes the current serving infrastructure.
    Hardware changes in the next quarter may alter this constraint.

The refusal codes (TRAIN-TEST-LEAK, SLICE-REGRESSION, OBJECTIVE-PROXY-GAMED, OFFLINE-ONLY-CLAIM) are named reasons, not vague concerns. Each one points at a specific, checkable failure mode. When a model is rejected, the rejection cites a code. This is the discipline that prevents “the number looked high so we shipped it.”
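The proxy metric named in the spec, recall@precision=0.90, is easy to state and easy to compute slightly wrong. The following is a minimal sketch rather than a reference implementation: the function name `recall_at_precision` and its handling of tied scores are assumptions made for illustration. It takes binary labels and model scores, and returns the best recall reachable at any score threshold whose precision meets the floor.

```python
from typing import Sequence


def recall_at_precision(
    labels: Sequence[int],
    scores: Sequence[float],
    min_precision: float = 0.90,
) -> float:
    """Best recall over all score thresholds with precision >= min_precision.

    Returns 0.0 if no threshold reaches the required precision.
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("no positive examples: recall is undefined")

    # Sort by descending score, so each prefix of the list is the set of
    # examples that would be flagged at some threshold.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    best_recall = 0.0
    true_positives = 0
    for i, (score, label) in enumerate(ranked):
        true_positives += label
        # Only evaluate between distinct score values: examples with the
        # same score are always flagged (or not flagged) together.
        at_boundary = i + 1 == len(ranked) or ranked[i + 1][0] != score
        if not at_boundary:
            continue
        precision = true_positives / (i + 1)
        if precision >= min_precision:
            best_recall = max(best_recall, true_positives / total_positives)
    return best_recall
```

Running the same function on only the examples of one slice gives the per-slice figure that the spec's slice thresholds refer to.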
Step 2 — Scenarios
Three scenarios make the spec concrete: the model case, the edge case, and the failure case.
Model case — a high-severity harassment example in English, where the model should return probability > 0.90 and be routed to immediate human review.
Edge case — a non-English post in a low-resource language that represents a real deployment slice. The model must hit the slice threshold even though training data density is lower. This is the scenario most often omitted from unit tests and most often where slice regressions hide.
Failure case — an adversarial input: a harassment message with deliberate spelling variations or code-switching designed to evade the classifier. The expected behavior is explicit: either flag with high confidence, or return low confidence and route to review. Shipping a model that returns high-confidence “benign” on known adversarial inputs is a hard stop.
Step 3 — The frozen contract
The contract is the eval set specification, the thresholds, and the model card template — locked before training begins. It is the single human gate.
# CONTRACT: content-moderation-v3 (FROZEN before training)

eval_set:
  source: s3://ml-evals/content-mod/v3-holdout/
  checksum_sha256: "a7f3c2..."  # computed at freeze time; verified before every eval run
  composition:
    total_examples: 12000
    high_severity_positive: 3200
    low_severity_positive: 4100
    benign_control: 4700
  slices:
    - name: non_english
      size: 2400
      languages: [es, fr, ar, zh, hi, pt, id, de]
    - name: adversarial
      size: 600
      description: "Known evasion patterns: spelling variation, code-switching"
    - name: clearly_benign_control
      size: 1200
      description: "Human-verified benign content; used to check false-positive gating"

thresholds:
  aggregate_recall_at_p90: 0.88
  non_english_slice_recall_at_p90: 0.72
  adversarial_slice_recall_at_p90: 0.65
  calibration_ece: <= 0.05
  false_positive_rate_on_benign_control: <= 0.04
  latency_p99_ms: <= 80

model_card_template: docs/model-card-v3-template.md

status: FROZEN @ v3.0
amendment_protocol: >
  Any threshold or eval set change requires a written spec amendment reviewed
  and approved before any training runs that would be measured against the
  new criteria. Retroactive amendments are not permitted.

The checksum is what makes this a real frozen contract rather than a policy intention. If anyone modifies the eval set after this file is committed (adds examples, removes hard ones, rebalances classes), the checksum fails before the eval run begins. You cannot silently move the goalposts because the goalposts are hashed.
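To make the checksum step concrete, here is a minimal sketch of how a freeze-time digest might be computed and re-verified. It assumes the eval set has been synced from object storage to a local directory. The function names `eval_set_checksum` and `verify_frozen` are hypothetical, and the hashing scheme (relative path, file size, then file bytes, in sorted order) is one reasonable choice, not the one the contract necessarily uses.

```python
import hashlib
from pathlib import Path


def eval_set_checksum(root: Path) -> str:
    """Return one SHA-256 hex digest covering every file under `root`."""
    digest = hashlib.sha256()
    files = [p for p in root.rglob("*") if p.is_file()]
    # Sort by relative POSIX path so the digest does not depend on the
    # filesystem's listing order or on the operating system.
    files.sort(key=lambda p: p.relative_to(root).as_posix())
    for path in files:
        rel = path.relative_to(root).as_posix().encode("utf-8")
        size = str(path.stat().st_size).encode("ascii")
        # Hash the name and size before the bytes, so renaming a file or
        # moving bytes from one file to another changes the digest.
        digest.update(rel + b"\0" + size + b"\0")
        with path.open("rb") as fh:
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                digest.update(chunk)
    return digest.hexdigest()


def verify_frozen(root: Path, expected_sha256: str) -> None:
    """Raise if the eval set no longer matches the frozen contract."""
    actual = eval_set_checksum(root)
    if actual != expected_sha256:
        raise RuntimeError(
            f"eval set checksum mismatch: expected {expected_sha256}, got {actual}"
        )
```

Calling `verify_frozen` as the first step of every eval run is what turns the checksum from a recorded value into a gate: a modified eval set stops the run before any metric is computed.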
Step 4 — Acceptance checks (the red tests)
The held-out eval suite, the slice checks, and the leakage audits are the red tests. They are written and failing before training begins, because no model exists yet.
ACCEPTANCE CHECKLIST: content-moderation-v3
Run before any model is considered for production review.

[ ] LEAKAGE CHECK
    - Hash-intersect all training example IDs against eval IDs: intersection must be empty.
    - Verify eval set checksum matches FROZEN contract: sha256(eval_set) == "a7f3c2..."
    - Check temporal structure: no eval example timestamp falls within training window.
    Status: RED (no model trained yet; expected)

[ ] AGGREGATE THRESHOLD
    - recall@precision=0.90 on full held-out eval >= 0.88
    Status: RED

[ ] SLICE THRESHOLDS
    - non_english slice: recall@p90 >= 0.72
    - adversarial slice: recall@p90 >= 0.65
    - clearly_benign_control: false-positive rate <= 0.04
    Status: RED

[ ] CALIBRATION
    - Expected Calibration Error (ECE) <= 0.05
    - Reliability diagram reviewed by a human (not auto-passed)
    Status: RED

[ ] SLICE REGRESSION VS PRIOR MODEL
    - No slice regresses more than 3pp below prior model performance on any named slice
    - Prior model baseline scores: [loaded from model-card-v2.md]
    Status: RED

[ ] LATENCY
    - p99 inference latency <= 80ms under production traffic profile in staging environment
    Status: RED

[ ] ONLINE A/B PRE-REGISTRATION
    - Success criteria, traffic allocation, and minimum runtime registered before rollout
    - Primary metric: harm-reaching-users rate (not the offline proxy)
    Status: RED (required before full rollout, not before offline eval)

These checks are failing because nothing has been trained. They should fail. A check that passes before any model exists is not protecting anything.
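The leakage item in the checklist can be sketched in a few lines. This is an illustration under stated assumptions: example IDs are plain strings, every eval example carries a timestamp, and the training window is a closed interval. The function names are hypothetical. Python's `set` is hash-based, so the intersection here is the "hash-intersect" the checklist describes. Catching near-duplicates that carry different IDs would need a content hash on top of this.

```python
from datetime import datetime
from typing import Iterable, Mapping


def overlapping_ids(train_ids: Iterable[str], eval_ids: Iterable[str]) -> set[str]:
    """IDs present in both training and eval data. Must be empty."""
    return set(train_ids) & set(eval_ids)


def eval_ids_in_training_window(
    eval_timestamps: Mapping[str, datetime],
    window_start: datetime,
    window_end: datetime,
) -> list[str]:
    """Eval examples timestamped inside the training window, sorted by ID."""
    return sorted(
        ex_id
        for ex_id, ts in eval_timestamps.items()
        if window_start <= ts <= window_end
    )


def leakage_violations(
    train_ids: Iterable[str],
    eval_timestamps: Mapping[str, datetime],
    window_start: datetime,
    window_end: datetime,
) -> list[str]:
    """Return refusal messages citing TRAIN-TEST-LEAK; empty means pass."""
    violations = []
    shared = overlapping_ids(train_ids, eval_timestamps.keys())
    if shared:
        violations.append(
            f"TRAIN-TEST-LEAK: {len(shared)} eval IDs also appear in training data"
        )
    inside = eval_ids_in_training_window(eval_timestamps, window_start, window_end)
    if inside:
        violations.append(
            f"TRAIN-TEST-LEAK: {len(inside)} eval examples fall inside the training window"
        )
    return violations
```

All timestamps should be either timezone-aware or naive; Python raises a `TypeError` when the two kinds are compared.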
Step 5 — Produce (make the tests green)
Now training begins. Architecture, feature engineering, data augmentation strategy, hyperparameter search — all of it is wide open. The only constraint is: make every acceptance check pass against the locked eval. The training pipeline does not touch the eval set. The hyperparameter search is measured against a validation split, not the held-out eval. The held-out eval is used exactly once per candidate model.
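The "exactly once per candidate" rule is easy to state and easy to erode under deadline pressure. One possible way to enforce it mechanically is a small ledger that refuses a second held-out score for the same candidate. The class below is a hypothetical sketch: it keeps state in memory, and it assumes each candidate has a stable identifier such as a hash of its weights. A real pipeline would persist the ledger somewhere the training code cannot rewrite.

```python
class HeldOutEvalLedger:
    """Records which candidates have been scored on the held-out eval."""

    def __init__(self) -> None:
        self._scored: dict[str, dict[str, float]] = {}

    def record(self, candidate_id: str, metrics: dict[str, float]) -> None:
        """Store held-out metrics once; refuse any repeat for the same ID."""
        if candidate_id in self._scored:
            raise RuntimeError(
                f"{candidate_id} was already scored on the held-out eval; "
                "iterate against the validation split instead"
            )
        self._scored[candidate_id] = dict(metrics)

    def results(self, candidate_id: str) -> dict[str, float]:
        """Return a copy of the recorded metrics for one candidate."""
        return dict(self._scored[candidate_id])
```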
The how — whether to use a fine-tuned large model or a smaller distilled one, whether to use SMOTE for class imbalance or threshold calibration, whether to use cross-lingual transfer or train language-specific heads — is entirely the team’s to decide. ADD does not prescribe the architecture. It prescribes the gate.
Step 6 — Verify by evidence, not by the leaderboard number
A model that passes all acceptance checks is a candidate, not a product. Verification is the human act of checking that the green was earned.
| | Chasing the leaderboard | Locked-eval ADD |
|---|---|---|
| Eval defined when | After seeing early results | Before training begins, checksummed |
| Metric treated as | The goal | A proxy for the goal |
| Slice analysis | Optional, post-hoc | Required, pre-registered, gates the build |
| Leakage check | Rarely formalized | Mandatory, hash-verified, blocks training |
| Adversarial testing | Ad hoc if at all | A named slice with a threshold |
| Online validation | Optional | Required before full rollout (OFFLINE-ONLY-CLAIM) |
| Goalposts | Can shift when results disappoint | Locked; amendments require written process |
| Verification standard | "The number is high" | Slices + calibration + online A/B + refute-read |
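The calibration part of that verification standard rests on Expected Calibration Error, which the contract caps at 0.05. As a rough sketch of what that number measures, the function below bins predictions by predicted probability and compares, within each bin, the mean predicted probability of the positive class against the observed positive rate. The equal-width binning, the default of 10 bins, and the function name are assumptions; other ECE variants exist and give somewhat different values.

```python
from typing import Sequence


def expected_calibration_error(
    labels: Sequence[int],
    probs: Sequence[float],
    n_bins: int = 10,
) -> float:
    """Binary ECE: weighted mean of |observed positive rate - mean probability|."""
    if len(labels) != len(probs) or len(labels) == 0:
        raise ValueError("labels and probs must be non-empty and equal length")

    prob_sums = [0.0] * n_bins
    label_sums = [0.0] * n_bins
    counts = [0] * n_bins
    for label, prob in zip(labels, probs):
        # Equal-width bins over [0, 1]; a probability of exactly 1.0
        # goes into the last bin.
        index = min(int(prob * n_bins), n_bins - 1)
        prob_sums[index] += prob
        label_sums[index] += label
        counts[index] += 1

    total = len(labels)
    ece = 0.0
    for index in range(n_bins):
        if counts[index] == 0:
            continue
        mean_prob = prob_sums[index] / counts[index]
        positive_rate = label_sums[index] / counts[index]
        ece += (counts[index] / total) * abs(positive_rate - mean_prob)
    return ece
```

A low ECE does not replace the human review of the reliability diagram that the checklist requires: a single number can hide a badly calibrated bin that holds few examples.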
The adversarial move — the refute-read in ML — is to actively hunt for the failing slice before declaring success. Hunt for the demographic where recall drops. Hunt for the domain shift that degrades the aggregate. Hunt for the training-distribution artifact the model is actually keying on. The work is not “show me the accuracy” but “show me where this model is still wrong, and whether that matters.”
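A refute-read can start from a very small tool: compute recall per slice, sort ascending, and read from the bottom. The sketch below also checks the SLICE-REGRESSION rule of no more than 3 percentage points below the prior model. It is a simplification under stated assumptions: predictions are already binarized at a chosen threshold (the contract itself measures recall at precision 0.90), and the `Example` type and function names are hypothetical.

```python
from collections import defaultdict
from typing import Iterable, Mapping, NamedTuple


class Example(NamedTuple):
    label: int                # 1 = harmful, 0 = benign
    predicted: int            # model decision at the chosen threshold
    slices: tuple[str, ...]   # names of the slices this example belongs to


def recall_by_slice(examples: Iterable[Example]) -> dict[str, float]:
    """Recall on positive examples, computed separately for each slice."""
    positives: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for ex in examples:
        if ex.label != 1:
            continue
        for name in ex.slices:
            positives[name] += 1
            hits[name] += ex.predicted
    return {name: hits[name] / positives[name] for name in positives}


def worst_slices(recalls: Mapping[str, float], k: int = 3) -> list[tuple[str, float]]:
    """The k slices with the lowest recall, worst first."""
    return sorted(recalls.items(), key=lambda item: item[1])[:k]


def slice_regressions(
    new: Mapping[str, float],
    prior: Mapping[str, float],
    max_drop: float = 0.03,
) -> dict[str, float]:
    """Slices whose recall fell more than max_drop below the prior model.

    A slice missing from `new` is treated as recall 0.0, so it fails.
    """
    regressions = {}
    for name, prior_recall in prior.items():
        drop = prior_recall - new.get(name, 0.0)
        # Small tolerance so a drop of exactly max_drop is not flagged
        # because of floating-point rounding.
        if drop > max_drop + 1e-9:
            regressions[name] = drop
    return regressions
```

The predefined slices are only the floor. The same functions work on any ad hoc grouping a reviewer wants to test, which is where previously unnamed failing slices tend to turn up.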
The online A/B is where the proxy is finally confronted with the objective. The offline metric was recall at a fixed precision of 0.90. The online metric is harm actually reaching users. These two numbers will not move in perfect lockstep, and the gap between them is real information about distributional shift, annotation disagreement, and proxy quality. The A/B is not a formality; it is the first moment where evidence about the true objective is available.
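Pre-registration means the decision rule is fixed before the experiment produces data. As one illustration of what such a rule could look like, the sketch below applies a one-sided two-proportion z-test to the harm-reaching-users rate, asking whether the treatment arm's rate is lower than the control arm's. The choice of test, the significance level of 0.05, and the function names are all assumptions. The method requires that criteria be registered in advance, not that they take this particular form.

```python
import math


def one_sided_two_proportion_test(
    harm_control: int,
    n_control: int,
    harm_treatment: int,
    n_treatment: int,
) -> tuple[float, float]:
    """Return (z, p_value) for H1: treatment harm rate < control harm rate."""
    if n_control <= 0 or n_treatment <= 0:
        raise ValueError("both arms need at least one observation")
    rate_control = harm_control / n_control
    rate_treatment = harm_treatment / n_treatment
    pooled = (harm_control + harm_treatment) / (n_control + n_treatment)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_treatment))
    if std_err == 0.0:
        raise ValueError("no variation in outcomes: the test is undefined")
    z = (rate_treatment - rate_control) / std_err
    # Standard normal CDF at z: the probability of seeing a difference at
    # least this favorable to treatment if the true rates were equal.
    p_value = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return z, p_value


def meets_preregistered_criterion(
    harm_control: int,
    n_control: int,
    harm_treatment: int,
    n_treatment: int,
    alpha: float = 0.05,
) -> bool:
    """True if the treatment arm shows a significantly lower harm rate."""
    _, p_value = one_sided_two_proportion_test(
        harm_control, n_control, harm_treatment, n_treatment
    )
    return p_value < alpha
```

The traffic allocation and minimum runtime that the checklist also asks for belong in the same registered record, because a test evaluated repeatedly until it passes no longer has the error rate its `alpha` suggests.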
Step 7 — Observe and fold
Production is a distribution, and distributions drift. The eval set, however carefully curated, was a sample from a past distribution. Over time, new failure patterns emerge — new evasion tactics, new content modalities, new languages in the user base — that the original eval did not capture.
In ADD, these observations fold back into the eval set. A newly discovered failure case becomes a new eval example, properly labeled and checksummed into the next version of the contract. The slice list grows. The thresholds are revisited with fresh evidence. The model card is updated to record what was observed in production.
The eval is living. It grows over time in the direction the real world is pulling it. What does not change is the principle: the eval is locked before training, not after.
Constrain the what, free the how
A team that locks the eval and thresholds before training is free to be genuinely creative in every other dimension. They can try a radically different architecture without worrying that the goalposts will shift if it underperforms. They can run an aggressive hyperparameter search without the eval contaminating it. They can swap the feature pipeline entirely and trust that the acceptance checks will tell them whether the swap helped.
The discipline is counterintuitive: you get more freedom in execution by accepting more constraint on the outcome. The locked contract is not a ceiling on ambition. It is the instrument that makes ambition measurable.
What does not transfer
Problem framing is irreducibly human. Deciding what to optimize, which slices matter, what counts as a harmful output, what the acceptable tradeoff is between false positives and false negatives — these are value judgments, not engineering decisions. An eval set operationalizes those judgments; it does not make them. The ADD loop can tell you whether you built the model you specified. It cannot tell you whether you specified the right model.
A metric is a proxy, and no checklist changes that. ADD reduces the risk that you optimize for a proxy while losing the objective, but it does not eliminate proxy risk. A team that locks a bad eval and trains a model that passes it has moved the problem upstream, not solved it. The work of designing a good eval — choosing examples, defining slices, setting thresholds that correspond to real-world stakes — is not automatable. It is the hardest part of the job.
Ethics and impact judgment cannot be delegated to the optimizer. A model that passes every threshold on a well-designed eval can still cause harm if the deployment context changes, if the user population is different from the annotation population, or if the use case is misaligned with the model’s capabilities. ADD handles the engineering discipline around training and evaluation. The question of whether to build and ship a given model at all is one the method explicitly does not answer.
Next in the series
Part 10 applies ADD to Finance and FP&A — where the fast waste is a forecast that looks authoritative because a spreadsheet produced it, and the frozen contract is the model assumptions locked before the numbers are run. Read it at ADD for Finance.