A support agent drafts a reply. Polite, structured, confident — and wrong. The refund window it cites does not match the policy document. The troubleshooting steps apply to version 2.3; the customer is on 3.1, where that menu no longer exists. The reply reads helpful. It ships. The customer hits a dead end and reopens the ticket angrier than before.
This is the defining failure mode of AI in customer support: not rudeness, but confident wrong answers. The error is invisible to a quick reviewer because reading a confident, well-phrased reply and finding it plausible is not the same as checking it against the actual policy. The reviewer looks for tone, not factuality — because factuality requires checking the source, and nobody has time to check every draft against the KB.
AI-Driven Development was built around exactly this problem. It separates what the output must do from how it does it, and it verifies by evidence rather than by inspection. This post translates the method for customer support, where trust-by-inspection fails hardest.
The four AI-era failures, in a support queue
Fast waste is a deflection that closes the ticket without solving it — a plausible reply that takes five minutes to generate and twenty minutes to undo when the customer confirms the steps don’t exist.
Context rot is the policy Notion page written eighteen months ago, partially updated after one pricing change, not updated after the next plan restructure. The AI drafts from a stale scrape; nobody catches it because the team lead who knew the old policy moved to a different pod.
Trust-by-inspection is the fluency trap. A hallucinated step or an entitlement boundary the model silently expanded passes the quick-read review precisely because it is phrased with the same confidence as correct information.
The verification ceiling arrives when deflection volume scales before the verification system does. At 200 replies per day, a sample of ten catches the obvious errors. At 2,000, the sample stays the same size. The risk scales; the checking does not.
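The arithmetic behind that ceiling is easy to make concrete. The sketch below is illustrative only: the function names and the 2% error rate are assumptions made for the example, not figures from a real queue.

```python
def audit_coverage(sample_size: int, daily_volume: int) -> float:
    """Fraction of the day's replies that a fixed-size sample reviews."""
    return sample_size / daily_volume


def unreviewed_errors(sample_size: int, daily_volume: int, error_rate: float) -> float:
    """Expected number of wrong replies that ship without anyone checking them."""
    return (daily_volume - sample_size) * error_rate


ASSUMED_ERROR_RATE = 0.02  # hypothetical: 2% of drafts contain a wrong claim

for volume in (200, 2000):
    coverage = audit_coverage(10, volume)
    missed = unreviewed_errors(10, volume, ASSUMED_ERROR_RATE)
    print(f"{volume} replies/day: {coverage:.1%} audited, ~{missed:.1f} wrong replies unreviewed")
```

Under these assumed numbers, coverage falls from 5% to 0.5% while the expected count of unreviewed wrong replies grows roughly tenfold.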
The loop, translated
The framework post (ADD Beyond Code) lays out eight steps. Here is what each one means when the producer is an AI drafting support replies.
Step 0 — Ground: load the source-of-truth
Every support AI session begins by loading the authoritative sources the answers will be drawn from: the knowledge base, the current policy document, the entitlement and SLA matrix for the customer’s plan tier, and any locked macros or templates. These are not context — they are the contract the answer must be consistent with.
Most support AI deployments inject a system prompt with a summary of policy — someone’s paraphrase, already context rot. Ground requires the actual documents, versioned and dated.
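One way to hold "versioned and dated" to account is to record both fields for every source and refuse to ground on anything past a review deadline. This is a minimal sketch under assumptions: the `SourceDocument` shape, the document names, and the 90-day threshold are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class SourceDocument:
    name: str
    version: str
    last_reviewed: date


def stale_sources(sources, today, max_age_days=90):
    """Return the sources whose last review is older than max_age_days."""
    return [s for s in sources if (today - s.last_reviewed).days > max_age_days]


# Hypothetical source-of-truth set for one session.
sources = [
    SourceDocument("refund-policy", "v7", date(2026, 4, 30)),
    SourceDocument("entitlement-matrix", "v3", date(2025, 9, 1)),
]

stale = stale_sources(sources, today=date(2026, 5, 14))
for doc in stale:
    print(f"Do not ground on {doc.name} {doc.version}: last reviewed {doc.last_reviewed}")
```

Whether a stale source blocks the session or only raises a flag is a policy decision for the team, not something the sketch settles.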
Step 1 — Specify: what the answer must do, and what it must refuse
A support answer spec is short. It names what the draft must do, what it must not do — each refusal paired with a code — and the after-state that constitutes a resolved ticket.
# ANSWER-SPEC — Tier 2 Subscription Billing Reply
Must:
- Resolve the customer's stated issue within current policy boundaries.
- Cite the specific KB article, policy section, or approved macro it relies on.
- Use the customer's plan tier (from ticket metadata) to bound any entitlement claim.
- Match the approved tone: clear, direct, warm; not deferential, not apologetic by default.

Must Not (refusal codes):
- POLICY-INVENTED: state a policy, window, or exception not in the policy document
- ENTITLEMENT-EXCEEDED: offer a feature, credit, or extension beyond the customer's plan
- PII-LEAK: include account details, emails, or identifiers not already in the thread
- PROMISE-UNAPPROVED: commit to a timeline, refund, or outcome not authorized by this tier

After-state: The customer has a clear, accurate next step OR the ticket is flagged for escalation with a documented reason. No open question is answered with an invented answer.

The refusal codes make failure modes nameable (an agent reviewing a draft tags POLICY-INVENTED rather than writing a freeform note) and create shared vocabulary for the checklist and the audit log.
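The value of fixed codes over freeform notes shows up as soon as review tags are stored as data. A minimal sketch, assuming the four codes from the spec are kept as an enumeration; `parse_tags` and the ticket ID are hypothetical names for illustration.

```python
from enum import Enum


class RefusalCode(Enum):
    POLICY_INVENTED = "POLICY-INVENTED"
    ENTITLEMENT_EXCEEDED = "ENTITLEMENT-EXCEEDED"
    PII_LEAK = "PII-LEAK"
    PROMISE_UNAPPROVED = "PROMISE-UNAPPROVED"


def parse_tags(raw_tags):
    """Convert reviewer tags to RefusalCode; a freeform note raises ValueError."""
    return [RefusalCode(tag) for tag in raw_tags]


# A valid tag becomes a countable audit-log entry.
audit_entry = {"ticket": "T-2041", "tags": parse_tags(["POLICY-INVENTED"])}

# A freeform note is rejected instead of silently entering the log.
try:
    parse_tags(["sounds a bit off"])
except ValueError as err:
    print(f"Rejected tag: {err}")
```

Because every tag resolves to one of four values, the audit log can be counted and compared across weeks.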
Step 2 — Scenarios: the model, the edge, and the failure case
Three scenario types cover the queue.
The model case is a routine how-to: the KB has a clear article. The draft cites it, matches the steps exactly, and does not volunteer features that belong to a higher tier.
The edge case is an entitlement boundary question — can the customer add a second workspace when their plan allows one? The correct answer acknowledges the limit and explains the upgrade path. ENTITLEMENT-EXCEEDED is the relevant refusal; this is where AI drafts most commonly err, because “just this once” sounds like good service.
The failure case is the most important: no KB answer exists. The product team changed a workflow and the article was not updated. The correct behavior is to escalate — never to invent. A draft that produces plausible steps from general reasoning fails POLICY-INVENTED, even if the steps happen to work.
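The three scenarios reduce to a small decision rule that can be written down and checked. The sketch below is one possible encoding, not a prescribed implementation: the function name and the outcome labels are assumptions.

```python
def expected_behavior(kb_article_found: bool, within_plan: bool) -> str:
    """Map a ticket's situation to the behavior the spec requires."""
    if not kb_article_found:
        # Failure case: no source exists, so any answer would be invented.
        return "escalate"
    if not within_plan:
        # Edge case: acknowledge the limit rather than grant an exception.
        return "explain_limit_and_upgrade_path"
    # Model case: cite the article and match its steps.
    return "answer_from_kb"


# (kb_article_found, within_plan, expected outcome) for each scenario type.
SCENARIOS = {
    "model": (True, True, "answer_from_kb"),
    "edge": (True, False, "explain_limit_and_upgrade_path"),
    "failure": (False, True, "escalate"),
}

for name, (kb_found, in_plan, expected) in SCENARIOS.items():
    assert expected_behavior(kb_found, in_plan) == expected, name
```

The missing-article check comes first on purpose: without a source, the entitlement question cannot be answered safely either.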
Step 3 — Contract: the frozen gate
The frozen contract is the approved answer policy and escalation rules, signed off by support leadership before the AI workflow goes live — specifying which question types are autonomous, which require review, and which are human-owned end-to-end.
# SUPPORT-CONTRACT v1.2 — Tier 2 Billing & Subscriptions
# Status: FROZEN, support leadership sign-off 2026-05-14

AUTONOMOUS (AI drafts, agent reviews before send):
- How-to questions answered by an exact KB article reference
- Account status lookups with KB citation and plan-tier confirmation

REVIEW-REQUIRED (AI drafts, team lead approval before send):
- Entitlement boundary questions (any claim about plan limits)
- Refund or credit requests within the approved policy window

HUMAN-OWNED (AI may summarize thread; human writes and sends):
- Complaints involving PII concern
- Goodwill exceptions or anything outside approved policy
- Any ticket where no KB answer exists and escalation is warranted

Change process: contract changes require support leadership sign-off and a version increment. AI workflow may not operate against a contract under revision.

One human gate. One sign-off. The contract changes by deliberate decision, recorded and dated, never by drift.
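A contract like this can be turned into a routing table that the workflow consults before drafting anything. The sketch below makes several assumptions: the question-type keys are hypothetical labels, and sending unknown question types to HUMAN-OWNED is a conservative default chosen for the example rather than something the contract states.

```python
from enum import Enum


class Tier(Enum):
    AUTONOMOUS = "AUTONOMOUS"
    REVIEW_REQUIRED = "REVIEW-REQUIRED"
    HUMAN_OWNED = "HUMAN-OWNED"


CONTRACT = {
    "version": "1.2",
    "status": "FROZEN",
    "routes": {
        "how_to": Tier.AUTONOMOUS,
        "account_status": Tier.AUTONOMOUS,
        "entitlement_boundary": Tier.REVIEW_REQUIRED,
        "refund_in_window": Tier.REVIEW_REQUIRED,
        "pii_complaint": Tier.HUMAN_OWNED,
        "goodwill_exception": Tier.HUMAN_OWNED,
        "no_kb_answer": Tier.HUMAN_OWNED,
    },
}


def route(question_type, contract=CONTRACT):
    """Return the ownership tier for a question type under a frozen contract."""
    if contract["status"] != "FROZEN":
        # The workflow may not operate against a contract under revision.
        raise RuntimeError("contract is not frozen; AI workflow must not run")
    # Assumed default: anything the contract does not name goes to a human.
    return contract["routes"].get(question_type, Tier.HUMAN_OWNED)


print(route("how_to").value)              # AUTONOMOUS
print(route("unlisted_question").value)   # HUMAN-OWNED
```

Keeping the version and status inside the same structure means a contract under revision stops the workflow instead of being read half-updated.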
Step 4 — Acceptance checks: the red tests
The acceptance checklist runs against every draft before send. It is checkable, not impressionistic.
# FACTUALITY & POLICY ACCEPTANCE CHECKLIST
Factuality (vs source-of-truth):
- [ ] Every factual claim in the draft cites a specific KB article, policy section, or approved macro; no unsourced assertions
- [ ] Steps listed in the draft match the current KB article for the customer's product version (version confirmed from ticket metadata)
- [ ] Entitlement claims (features, limits, windows) match the customer's plan tier from the entitlement matrix, not a general description of the product

Policy compliance:
- [ ] No refusal code triggered: POLICY-INVENTED, ENTITLEMENT-EXCEEDED, PII-LEAK, PROMISE-UNAPPROVED
- [ ] Any commitment (timeline, credit, outcome) falls within the tier-authorized range listed in the contract

Escalation:
- [ ] If no KB article covers the issue, the draft routes to HUMAN-OWNED and does not attempt an answer
- [ ] Escalation reason is documented in the ticket, not left implicit

Tone:
- [ ] Reply matches approved tone guidelines: clear, direct, warm
- [ ] No deferential or apologetic framing not warranted by the situation

Final gate:
- [ ] All boxes checked → eligible for send (per contract tier)
- [ ] Any unchecked box → return to agent for revision or escalation

A draft that reads well but fails a factuality check does not pass. “Reads well” is not the bar.
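Part of that checklist can be expressed as a gate that returns the reasons a draft fails. This is a sketch covering only a subset of the boxes, and it assumes the inputs (which claims carry a source, which refusal codes a reviewer tagged, whether a KB article matched) have already been gathered; the `Claim` shape and `acceptance_gate` name are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Claim:
    text: str
    source: Optional[str] = None  # KB article, policy section, or macro ID


def acceptance_gate(claims, triggered_codes, kb_match, tone_ok):
    """Return the names of failed checks; an empty list means eligible for send."""
    failed = []
    if not kb_match:
        failed.append("no KB article: route to HUMAN-OWNED")
    if any(c.source is None for c in claims):
        failed.append("unsourced factual claim")
    if triggered_codes:
        failed.append("refusal code triggered: " + ", ".join(sorted(triggered_codes)))
    if not tone_ok:
        failed.append("tone off-guideline")
    return failed


# A draft that reads well (tone passes) but carries one unsourced claim.
draft_claims = [
    Claim("Refunds are available within the policy window.", source="policy-doc section 4"),
    Claim("You can find the export option under Settings."),  # no source cited
]
print(acceptance_gate(draft_claims, triggered_codes=set(), kb_match=True, tone_ok=True))
```

The example draft passes tone and still comes back with a failure, which is the behavior the checklist asks for.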
Step 5 — Produce: constrain the what, free the how
The draft’s factual claims are clamped to the source-of-truth. How it phrases them — tone, structure, the empathy in the opening line — is the AI’s to invent.
Governing AI output means fixing what must be true — the facts, the policy, the entitlement — while leaving the expression open. Two drafts can cite the same KB article in different registers; both pass if the factual spine is correct.
| | Ungoverned AI deflection | ADD-governed support draft |
|---|---|---|
| Source of answer | Model's training data + summary prompt | Current KB article, policy doc, entitlement matrix |
| Refusal behavior | Invent a plausible answer rather than escalate | POLICY-INVENTED triggers escalation — no invented answer ships |
| Entitlement claims | Based on general product knowledge, often generous | Bounded to customer's plan tier from entitlement matrix |
| Correctness check | Reads helpfully → approved | Every factual claim cites a source → checked before send |
| Escalation | Implicit, inconsistent, often skipped | Explicit: no KB match → HUMAN-OWNED, documented reason |
| Evidence of quality | CSAT after the fact | Factuality audit + hallucination rate + reopen rate |
Step 6 — Verify by evidence
The support team’s refute-read is a factuality audit against source: pull the KB articles a sample of drafts cited, and check each claim sentence by sentence. Assume the draft is wrong and hunt for the fabrication — the invented step, the paraphrased policy that shifted meaning, the entitlement claim that rounded up. Not “the replies looked helpful.”
The metrics that constitute evidence:
- Hallucination rate — the fraction of audited drafts containing a claim not supported by the cited source. This is the primary signal. A team that does not measure it has no idea whether its AI workflow is safe.
- Reopen rate — the lagging signal that answers were wrong or incomplete.
- CSAT — useful, but not a substitute for factuality audit. A customer can rate an answer well despite a policy error. CSAT measures satisfaction, not accuracy.
- Escalation rate — unusually low escalation against a complex queue signals the system is inventing rather than routing.
Measure the hallucination rate. Without it, you are managing a queue, not a reliability system.
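Computing these rates takes very little machinery once the audit records exist. The sketch below assumes a simple record shape; the field names, ticket IDs, and sample values are invented for illustration and are not real data.

```python
def fraction(records, predicate):
    """Share of records for which predicate is true; 0.0 for an empty list."""
    if not records:
        return 0.0
    return sum(1 for r in records if predicate(r)) / len(records)


# Hypothetical audit sample: each draft was checked claim by claim against its cited source.
audited_drafts = [
    {"ticket": "T-101", "unsupported_claims": 0},
    {"ticket": "T-102", "unsupported_claims": 2},
    {"ticket": "T-103", "unsupported_claims": 0},
    {"ticket": "T-104", "unsupported_claims": 0},
]

# Hypothetical closed tickets for the same period.
closed_tickets = [
    {"ticket": "T-101", "reopened": False, "escalated": False},
    {"ticket": "T-102", "reopened": True, "escalated": False},
    {"ticket": "T-103", "reopened": False, "escalated": True},
    {"ticket": "T-104", "reopened": False, "escalated": False},
    {"ticket": "T-105", "reopened": False, "escalated": True},
]

hallucination_rate = fraction(audited_drafts, lambda d: d["unsupported_claims"] > 0)
reopen_rate = fraction(closed_tickets, lambda t: t["reopened"])
escalation_rate = fraction(closed_tickets, lambda t: t["escalated"])

print(f"hallucination {hallucination_rate:.0%}, reopen {reopen_rate:.0%}, escalation {escalation_rate:.0%}")
```

Note that a draft counts toward the hallucination rate if it contains any unsupported claim, matching the per-draft definition above rather than a per-claim one.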
Step 7 — Observe and fold: KB gaps become the next brief
Every escalated ticket with no KB answer is a documented gap. Every recurring refusal-code trigger is a pattern. These fold back: KB updates close the gaps; spec deltas sharpen the rules when the audit surfaces a failure the current spec does not handle. The playbook stays living — unlike the policy document signed once and consulted under pressure, which is the document that generates the confident wrong answers you are trying to stop.
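Spotting what should fold back can start as a simple tally. A minimal sketch, assuming escalations are logged with a topic label and refusal-code triggers are logged by code; the topic names and the threshold of two occurrences are hypothetical.

```python
from collections import Counter


def recurring(events, min_count=2):
    """Return (item, count) pairs seen at least min_count times, most frequent first."""
    return [(item, n) for item, n in Counter(events).most_common() if n >= min_count]


# Hypothetical log of escalations where no KB answer existed.
kb_gaps = ["export-workflow-3.1", "seat-transfer", "export-workflow-3.1", "export-workflow-3.1"]

# Hypothetical log of refusal codes tagged during review.
refusal_triggers = ["ENTITLEMENT-EXCEEDED", "POLICY-INVENTED", "ENTITLEMENT-EXCEEDED"]

kb_backlog = recurring(kb_gaps)            # candidates for a KB update
spec_review = recurring(refusal_triggers)  # candidates for a spec delta

print("KB backlog:", kb_backlog)
print("Spec review:", spec_review)
```

One-off items stay in the log; only patterns that repeat are promoted to the next brief.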
What doesn’t transfer
The loop governs factual correctness and policy compliance. It does not govern judgment.
Empathy in hard escalations. A distressed customer needs a human who responds to the person, not the ticket. The AI summarizes and routes; it does not write the reply.
Goodwill exceptions are decisions, not policy applications. The risk of AI here is not that it refuses — it is that it says yes, confidently, to something requiring authorization it does not have. The contract puts these in HUMAN-OWNED for this reason.
KB boundary cases require escalation, not synthesis. A draft reasoning across two partial matches produces an answer from neither source — the failure mode that most resembles expertise and is most likely to be wrong.
The loop governs the tractable space. Know where it ends.
Next in the series: ADD for HR and People Operations: When the Candidate Is the Stakeholder — how the same method applies to job descriptions, screening, and offer communications, where inconsistency creates legal exposure and bias compounds quietly.