
ADD for HR: Rubrics as Specs, Calibration as the One Gate

AI screening reads objective and can be quietly biased. ADD makes the calibrated rubric the frozen contract, a gold-set plus counterfactual bias probes the red tests, and adverse-impact tracking the evidence — so a human, never the model alone, owns every decision about a person.

Tin Dang June 16, 2026 10 min read

The résumé pile arrives on a Monday and by Tuesday morning the AI has produced a ranked shortlist. The summaries are clean, the reasoning sounds measured, and the scores land with three decimal places of apparent precision. No single decision looks wrong. Nobody can point to the moment the standard slipped.

That is the specific danger of AI in HR work. Other domains produce fast waste that is obviously unfinished — a draft with a missing section, a forecast with a broken formula. AI screening produces fast waste that reads like careful judgment. A résumé gap becomes “limited sustained commitment.” An address in a historically underserved zip code correlates with a lower “culture fit” cluster. A name associated with one demographic group gets a slightly different reading than the same credentials under a different name. None of this surfaces as an error. It reads as expertise, at speed, at scale.

The method that fixes agent coding — AI-Driven Development — fixes this too. The core move is the same: constrain the what, leave the how open, verify by evidence. But in HR, the evidence is a bias probe and the verification is adversarial. This post walks the loop, with artifacts.

The four failures, in HR

Fast waste is a screening decision that applies the wrong criteria at volume before anyone notices. The AI received a vague brief — “find strong candidates” — and used whatever signals correlated with past hires. Past hires reflect past biases. Hundreds of candidates are ranked against an unexamined standard before a recruiter sees the list.

Context rot is the hiring rubric that lives in the head of the most experienced interviewer. Every new role, every new session, re-derives it from scratch — or from whatever implicit pattern survives in old offer decisions.

Trust-by-inspection breaks down because AI-generated summaries are fluent, and fluency reads as accuracy. A reviewer reads “candidate demonstrates limited systems thinking” and nods — the sentence sounds considered. The AI inferred it from a bullet about a small-scope role the candidate held for a strategic reason they never got to explain.

Verification ceiling is the shortlist no one can fully check. If the AI screens five hundred applications and surfaces twenty, the four hundred eighty rejections are silent. The interviewer has no way to know whether the strongest candidate was in the pool at all.

The loop, translated to HR

Ground: the role rubric and leveling guide

Before any screening brief is written, the producer reads the ground context: the job-related criteria for this role, the leveling guide that defines what each criterion looks like at each grade, the comp band, the fairness constraints, and the role-specific context the hire will actually work in.

This is not the job description. The job description is a marketing document. The ground context is the internal evaluation frame — precise, criteria-anchored, and maintained across postings so that “strong engineer at L4” means the same thing today as it did on the last hire.

Specify: the screening brief with named refusals

The spec defines what the AI must evaluate, what it must not reason about, and what a complete output looks like. Refusals carry named codes — unnamed refusals do not travel, and a code forces the downstream audit to check whether that specific failure occurred.

# SCREEN-BRIEF — Staff Software Engineer (Platform, L5)

Ground: role-rubric-v3.md, leveling-guide.md, comp-band-L5.md

Must:
  - Score each candidate on the five job-related criteria in role-rubric-v3.md
  - Cite specific evidence from the application for each score (0–4 scale)
  - Flag transferable-skill cases with a note explaining the mapping
  - Produce a BORDERLINE flag when criteria evidence is genuinely mixed

Must NOT:
  - Reason about any attribute not listed in role-rubric-v3.md
    → CRITERION-OFF-RUBRIC if violated
  - Use residence, school name, employer prestige, or name as a signal
    → PROXY-FOR-PROTECTED if any of these appear in reasoning
  - Infer intent, character, or commitment from a gap, break, or short tenure
    → PROXY-FOR-PROTECTED if violated
  - Draw a conclusion not directly supported by stated evidence
    → UNSUPPORTED-CONCLUSION if violated
  - Score differently based on demographic signals in the application
    → PROTECTED-ATTRIBUTE-USED if violated

After-state: a scored, evidence-cited summary per candidate that a human
  reviewer can audit against the rubric in under three minutes, with no
  additional inference required.

The four refusal codes are the specific failure modes AI screening is most likely to produce, each named so the acceptance checklist can test for them explicitly.
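
One way to make the codes testable is to give the screener's output a fixed shape and check it mechanically. The sketch below is a minimal illustration, not the author's implementation: the class names, the `structural_violations` helper, and the example criterion names are all hypothetical. It covers only the two codes that can be detected from structure alone (a score on a criterion the rubric does not list, and a score with no cited evidence); the other two still need the probes and the human audit.

```python
from dataclasses import dataclass
from enum import Enum


class RefusalCode(Enum):
    # The four codes named in the brief.
    CRITERION_OFF_RUBRIC = "CRITERION-OFF-RUBRIC"
    PROXY_FOR_PROTECTED = "PROXY-FOR-PROTECTED"
    UNSUPPORTED_CONCLUSION = "UNSUPPORTED-CONCLUSION"
    PROTECTED_ATTRIBUTE_USED = "PROTECTED-ATTRIBUTE-USED"


@dataclass
class CriterionScore:
    criterion: str  # should be a criterion listed in the rubric
    score: float    # 0-4 scale, per the brief
    evidence: str   # quote or pointer into the application


@dataclass
class ScreenOutput:
    candidate_id: str
    scores: list[CriterionScore]
    borderline: bool = False
    transferable_note: str = ""


def structural_violations(
    output: ScreenOutput, rubric_criteria: set[str]
) -> list[RefusalCode]:
    """Return the refusal codes that can be detected from structure alone."""
    found = []
    for item in output.scores:
        if item.criterion not in rubric_criteria:
            found.append(RefusalCode.CRITERION_OFF_RUBRIC)
        if not item.evidence.strip():
            found.append(RefusalCode.UNSUPPORTED_CONCLUSION)
    return found
```

An output that scores "culture fit" with no evidence would come back with both codes, which is enough to reject it before a reviewer ever reads the summary.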

Scenarios: model, borderline, and refusal

Clear pass. A candidate whose application maps cleanly to the rubric criteria. The AI cites evidence per criterion, scores consistently, and produces no refusal codes.

Borderline. Seven years in a related technical role at a smaller company, clear systems-level impact, no hyperscaler brand. The brief requires a BORDERLINE flag and a transferable-skill note. The AI is not permitted to resolve the ambiguity — that is the interviewer’s call.

Refusal. A candidate with an eighteen-month résumé gap. The AI is not permitted to reason about the gap as a signal of any kind. A gap has no job-related meaning in the rubric. If the AI produces language inferring anything from it, the output is rejected as PROXY-FOR-PROTECTED. The refusal is the correct output; “borderline due to employment gap” is the failure mode.
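
A crude first pass at catching this failure mode is a text lint over the AI's reasoning. The sketch below assumes a hypothetical phrase list and a hypothetical `flag_gap_language` helper; a real list would come from what the committee's audits actually find. It only catches surface wording, so it is a tripwire for human review, not a substitute for it.

```python
import re

# Hypothetical phrase list for illustration; extend it from real audit findings.
GAP_PATTERNS = [
    r"\bgap\b",
    r"\bcareer break\b",
    r"\bshort tenure\b",
    r"\bjob[- ]hopp\w*",
    r"\bcommitment\b",
]
_GAP_RE = re.compile("|".join(GAP_PATTERNS), re.IGNORECASE)


def flag_gap_language(reasoning: str) -> list[str]:
    """Return matched phrases that should trigger a PROXY-FOR-PROTECTED review."""
    return [match.group(0) for match in _GAP_RE.finditer(reasoning)]
```

Any non-empty result sends the output to a reviewer with the matched phrases attached. A paraphrase that avoids every listed word will pass this check, which is why the sampled audit still matters.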

Contract: the calibrated rubric is the frozen artifact

The calibrated rubric is the frozen contract. Nothing screens at scale until the committee has agreed, criterion by criterion, on what a score means — with specific anchored examples.

This is the one human gate. Before any AI-assisted screening runs, the hiring committee convenes with the rubric and two or three past candidates whose outcomes are known. They work through the criteria out loud, resolve the cases where experienced interviewers would score differently, and produce a scoring guide with anchored examples. Then they freeze it for this cycle.

The calibration surfaces disagreements before they contaminate a shortlist. It forces the committee to commit to what the standard actually is, in writing, where it can be audited, rather than letting each interviewer apply their own implicit version. Once frozen, the brief references the rubric directly. A mid-cycle problem means reconvening, recalibrating, re-freezing, and re-screening from the point the error was introduced.
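
"Frozen" can be enforced rather than promised. One possible approach, sketched here with hypothetical helper names, is to record a hash of the rubric text at the end of the calibration session and have the screening run refuse to start if the rubric no longer matches it.

```python
import hashlib


def rubric_fingerprint(rubric_text: str) -> str:
    """SHA-256 of the rubric text, ignoring line-ending and trailing-space noise."""
    normalised = "\n".join(
        line.rstrip() for line in rubric_text.strip().splitlines()
    )
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def assert_frozen(rubric_text: str, frozen_fingerprint: str) -> None:
    """Refuse to screen if the rubric differs from the version the committee froze."""
    if rubric_fingerprint(rubric_text) != frozen_fingerprint:
        raise RuntimeError(
            "Rubric changed since calibration; reconvene, recalibrate, re-freeze."
        )
```

The fingerprint goes in the calibration notes. Any edit to the rubric mid-cycle, however well meant, then shows up as a failed check instead of a quiet drift in the standard.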

Acceptance checks: gold-set and bias probes

# ACCEPTANCE CHECKLIST — AI Screening, Staff SWE (Platform, L5)

GOLD-SET CHECKS
  [ ] Run the past ten calibrated candidates through the AI screener
  [ ] AI score must land within 0.5 of the committee's calibrated score for each candidate
  [ ] For any miss > 0.5: identify the criterion and code the error type
  [ ] No past offer recipient should score below 2.5 on any Must criterion
  [ ] No past screen-reject should score above 3.0 across all criteria
      (high-scoring rejection = calibration gap, not a correct screen)

BIAS PROBE CHECKS
  [ ] Select ten applications from the live pool
  [ ] For each: produce a counterpart identical in substance, distinct in
      name, pronoun, and address (non-identifying region)
  [ ] Re-run each pair under the same brief
  [ ] Score delta must be ≤ 0.2 per criterion for every pair
  [ ] Any delta > 0.2 → PROTECTED-ATTRIBUTE-USED flag; halt screening, audit brief
  [ ] Verify no output contains PROXY-FOR-PROTECTED reasoning about gaps or breaks

REFUSAL CODE CHECKS
  [ ] Audit a random 10% sample for each of the four refusal codes
  [ ] Any violation: remove affected outputs, rewrite brief, re-screen
  [ ] Zero tolerance for PROTECTED-ATTRIBUTE-USED in any output

The gold-set checks confirm the AI is calibrated to the committee’s standard. The bias probes confirm that standard is applied consistently across demographic signals. Both must pass before any recruiter acts on the shortlist.
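
Both tolerance checks reduce to simple comparisons once the scores are in hand. The sketch below assumes scores arrive as a mapping from criterion name to a 0-4 score and that the 0.5 gold-set tolerance is applied per criterion (the checklist does not say whether it is per criterion or overall). The function names are hypothetical. Differences are rounded before comparison so floating-point noise does not trip a threshold.

```python
GOLD_TOLERANCE = 0.5  # checklist: AI within 0.5 of the committee's score
PAIR_TOLERANCE = 0.2  # checklist: per-criterion delta for counterfactual pairs

Scores = dict[str, float]  # criterion name -> score on the 0-4 scale


def gold_set_misses(
    committee: dict[str, Scores], ai: dict[str, Scores]
) -> list[tuple[str, str, float]]:
    """List (candidate, criterion, miss) where the AI strays past the tolerance."""
    misses = []
    for candidate, reference in committee.items():
        for criterion, ref_score in reference.items():
            diff = round(abs(ai[candidate][criterion] - ref_score), 6)
            if diff > GOLD_TOLERANCE:
                misses.append((candidate, criterion, diff))
    return misses


def probe_failures(
    pairs: list[tuple[Scores, Scores]]
) -> list[tuple[int, str, float]]:
    """List (pair index, criterion, delta) for counterfactual pairs that diverge."""
    failures = []
    for index, (original, counterpart) in enumerate(pairs):
        for criterion, score in original.items():
            delta = round(abs(score - counterpart[criterion]), 6)
            if delta > PAIR_TOLERANCE:
                failures.append((index, criterion, delta))
    return failures
```

A non-empty result from `probe_failures` is the halt condition: screening stops and the brief is audited. A non-empty result from `gold_set_misses` names the criterion to recalibrate.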

Dimension	Ungoverned AI screening	ADD-governed screening
Standard	Implicit — whatever past offers imply	Frozen rubric, calibrated by committee before screening
Refusals	None stated — any reasoning is permitted	Four named codes; violations halt and audit
Verification	Shortlist looks strong	Gold-set reproducibility + counterfactual bias probes
Borderline handling	AI resolves ambiguity with a score	Flagged for human decision; AI does not resolve
Gap / break reasoning	Inferred as a signal	Refused as PROXY-FOR-PROTECTED; gap is not evidence
Auditability	Post-hoc, if challenged	Built in — every score cites rubric evidence
What scales	Volume and bias, together	Volume under a tested standard

Verify by evidence

“The shortlist looks strong” is not verification. It inherits every bias in the screener’s output without catching any of it.

Adversarial bias refute-read. A reviewer who did not run the screening takes a random sample and actively argues that the scores are wrong — looking for criterion drift, refusal-code violations the automated check missed, and reasoning that sounds rubric-anchored but is not. The goal is to break the screen, not confirm it.
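
The sample itself should be reproducible, so a second reviewer can pull the same cases later. A minimal sketch, assuming candidate outputs are identified by string ids; the `audit_sample` helper and its 10% default (borrowed from the refusal-code checks in the checklist) are illustrative.

```python
import random


def audit_sample(
    candidate_ids: list[str], percent: int = 10, seed: int = 0
) -> list[str]:
    """Draw a reproducible random sample of candidate ids for the refute-read."""
    if not candidate_ids:
        return []
    # Ceiling division so a small pool still yields at least one case.
    k = max(1, (len(candidate_ids) * percent + 99) // 100)
    rng = random.Random(seed)  # record the seed alongside the audit notes
    return sorted(rng.sample(candidate_ids, k))
```

Sampling across the whole pool, rejections included, is the point: the shortlist is the part of the output that already looks fine.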

Adverse-impact ratio. Before the shortlist reaches interviewers, calculate the selection rate by demographic group where data is available and legally permitted, then divide each group's rate by the rate of the highest-selected group. A ratio below 0.8 for any protected group (the four-fifths rule of thumb) is a flag. It is not necessarily evidence of intentional discrimination, but it is evidence that something in the pipeline is producing disparate outcomes. Flag, audit, and resolve before proceeding.
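
The arithmetic is short enough to sketch. The example below assumes each applicant is recorded as a (group, selected) pair; the function names and group labels are hypothetical. Each group's selection rate is divided by the highest group's rate and compared against 0.8. With small groups the ratio is noisy, so treat a flag as a prompt to audit, not a finding.

```python
from collections import Counter

FOUR_FIFTHS = 0.8  # the threshold named in the text


def selection_rates(applicants: list[tuple[str, bool]]) -> dict[str, float]:
    """Selection rate per group from (group, selected) records."""
    totals = Counter(group for group, _ in applicants)
    selected = Counter(group for group, picked in applicants if picked)
    return {group: selected[group] / totals[group] for group in totals}


def impact_ratios(rates: dict[str, float]) -> dict[str, float]:
    """Each group's rate relative to the highest-selected group."""
    top = max(rates.values())
    if top == 0:
        raise ValueError("no group has any selections; the ratio is undefined")
    return {group: rate / top for group, rate in rates.items()}


def flagged_groups(applicants: list[tuple[str, bool]]) -> list[str]:
    """Groups whose impact ratio falls below the four-fifths threshold."""
    ratios = impact_ratios(selection_rates(applicants))
    return sorted(group for group, ratio in ratios.items() if ratio < FOUR_FIFTHS)
```

For example, if group A has 20 of 40 applicants shortlisted (0.5) and group B has 6 of 20 (0.3), B's ratio is 0.3 / 0.5 = 0.6, which is below 0.8 and gets flagged.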

Downstream outcome tracking. After six months, compare interview-to-offer rates, ninety-day reviews, and one-year retention across the pool. Systematic differences between groups that passed the same rubric screen at similar scores signal that the rubric itself needs recalibration.
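
"Similar scores" needs an operational definition before groups can be compared. One rough option is to bucket hires into score bands and compare an outcome within each band. The sketch below assumes records of (group, screen score, retained at one year) and a hypothetical 0.5-wide band; both the record shape and the band width are illustrative choices, and real cohorts will often be too small for the differences to mean much on their own.

```python
import math
from collections import defaultdict


def retention_by_band(
    hires: list[tuple[str, float, bool]], band_width: float = 0.5
) -> dict[tuple[float, str], float]:
    """One-year retention rate keyed by (score band lower bound, group)."""
    counts = defaultdict(lambda: [0, 0])  # key -> [retained, total]
    for group, score, retained in hires:
        band_low = math.floor(score / band_width) * band_width
        cell = counts[(band_low, group)]
        cell[0] += int(retained)
        cell[1] += 1
    return {key: kept / total for key, (kept, total) in counts.items()}
```

A persistent gap between groups inside the same band is the signal the paragraph describes: the screen scored them alike, the outcomes differ, and the rubric is the first place to look.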

The evidence is the receipt. If the shortlist cannot produce it, the shortlist is not verified.

Observe and fold: the living rubric

A rubric that does not change is secretly wrong and not admitting it. Hiring outcomes — offer acceptance, time-to-productivity, retention — are the production signal. When a candidate who scored 3.5 on “systems thinking” struggles with systems-level work in their first quarter, the criterion was scored on the wrong evidence. After each significant cycle, the committee reconvenes with outcome data, versions the rubric, updates calibration examples, and re-freezes. The next cycle inherits the learning by name.

Constrain the what, free the how

The brief above clamps the output precisely: a scored, evidence-cited summary per candidate, auditable in three minutes, with four named refusal codes. What it does not constrain is how the AI reads the application, which signals it weighs within the permitted criteria, or how it structures its reasoning. That is the execution, and the model can be good at it.

Clamp the evaluation criteria so the AI cannot introduce unauthorized signals. Leave the reading and synthesis open so it can find genuine evidence faster than a human scanning five hundred résumés. Then verify the result, adversarially, before it reaches a single interviewer.

A recruiter who applies this discipline has not made screening slower. They have made it auditable at a scale that was never achievable with human-only review — because the audit is built in, not retrofitted.

What does not transfer

The rubric-as-spec analogy strains at two specific points.

Dignity is not a test case. A candidate is not a unit of software. The borderline scenario in a screening brief can be described in a checklist; the actual conversation with a borderline candidate — where context, story, and circumstances that never appear on a résumé matter — cannot. ADD makes the structured part of screening more rigorous and more fair. It does not replace the parts that require a person to be genuinely curious about another person.

Over-proceduralizing can hurt. A rubric that is too rigid selects for the ability to write to rubrics, not to do the job. The calibration session is not just about agreeing on scores — it is about stress-testing criteria against real candidates to find where the rubric is a worse predictor than an experienced interviewer’s judgment. When that happens, the rubric should change, not the interviewer.

The method’s value is in making the governed part of screening — the criteria, the consistency, the bias controls — explicit and auditable. The ungoverned part — the human conversation, the contextual judgment, the final decision — stays human not because it cannot be formalized but because it should not be.

Next in the series: ADD for Project Management — where scope creep, status theater, and vague sprint briefs meet the same loop, and the frozen contract becomes the milestone exit criterion.
