
ADD for DevOps and SRE: Policy as Contract, Evidence as the Gate

AI can emit Terraform and runbooks that read correct and provision the wrong, insecure, or irreversible thing. ADD makes policy-as-code the frozen contract enforced in the pipeline, chaos and rollback tests the red tests, and SLO evidence — not a clean apply — the proof a change is safe.

Tin Dang June 16, 2026 10 min read


A Terraform plan applies cleanly. The diff reads tidy — a security group rule, an IAM policy attachment, a new subnet. The engineer skims it, approves. Three days later the on-call rotation discovers that the security group allows unrestricted outbound on port 443, the IAM policy grants s3:* to a role that previously had object-level scope, and there is no rollback path because the subnet already has running instances attached.

The plan applied cleanly. The change was wrong.

This is the defining failure mode when AI generates infrastructure: plausible, apply-able output that is insecure, over-scoped, or irreversible. Unlike application code — where a bad function throws an exception and surfaces itself — a bad IaC change may sit silent for days until an incident or audit exposes it, by which point reversing it is non-trivial.

AI-Driven Development (ADD) was built around this exact trap. It separates what the output must achieve from how it achieves it, enforces that separation in automated gates, and verifies by evidence rather than by inspection. This post translates the method for DevOps and SRE, a domain that maps almost literally onto ADD because the discipline already thinks in contracts, gates, and measured outcomes.

The four AI-era failures, for infra and reliability

Fast waste is a runbook that reads correct and fails mid-incident, or a pipeline definition that deploys to the wrong environment because the AI inferred a variable name instead of reading the environment matrix. It merges because the YAML is valid and CI turns green.

Context rot is the security baseline from fourteen months ago (approved AMI IDs, IAM permission boundaries, egress rules), with nothing updated since the last two architectural changes. The AI generates IaC from stale ground.

Trust-by-inspection is the Terraform plan review. A * in an IAM statement is easy to miss on line 47 of a 200-line plan. A missing prevent_destroy looks like an ordinary omission, not the absence of a rollback guard.

The verification ceiling arrives when AI-assisted platforms generate dozens of IaC modules per sprint. If the only gate is a human plan review, it gets thinner under deadline pressure — not thicker.

The loop, translated

The framework post (ADD Beyond Code) lays out eight steps. Here is what each one means when the producer is an AI authoring Terraform, pipeline definitions, runbooks, and dashboards.

Step 0 — Ground: load the real state of the platform

Before specifying any change, the agent loads the live infra inventory (what is actually deployed, not what was planned), the SLO and error-budget dashboards, the security baseline, the org’s change-management policy, and any locked architectural decisions. It reads the current state file — not the module README.

Most AI-assisted IaC fails here: the agent reasons from a template and the user’s description, never reading what is actually running. Ground requires the real state, not the intended state.
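As a minimal sketch of what "reading the real state" could look like, the snippet below walks a state document shaped like the output of `terraform show -json` and builds an inventory of what is actually deployed. The function names and the sample state are hypothetical, chosen only for illustration; a real Ground step would also pull in the SLO dashboards and security baseline.

```python
import json


def iter_resources(module):
    """Yield every resource in a module and, recursively, in its child modules."""
    yield from module.get("resources", [])
    for child in module.get("child_modules", []):
        yield from iter_resources(child)


def live_inventory(state):
    """Map resource address -> resource type for everything recorded in the state."""
    root = state.get("values", {}).get("root_module", {})
    return {r["address"]: r["type"] for r in iter_resources(root)}


# Hypothetical sample, shaped like `terraform show -json` output.
SAMPLE_STATE = json.loads("""
{
  "format_version": "1.0",
  "values": {
    "root_module": {
      "resources": [
        {"address": "aws_db_instance.primary", "type": "aws_db_instance", "values": {}}
      ],
      "child_modules": [
        {
          "address": "module.network",
          "resources": [
            {"address": "module.network.aws_subnet.private", "type": "aws_subnet", "values": {}}
          ]
        }
      ]
    }
  }
}
""")

inventory = live_inventory(SAMPLE_STATE)
for address, rtype in sorted(inventory.items()):
    print(f"{rtype:20} {address}")
```

The point of the sketch is the input, not the traversal: the agent's context comes from the state the platform reports, so a resource that exists only in a README never enters the inventory.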

Step 1 — Specify: the change contract, with refusal codes

A change spec in DevOps is short and precise. It names what the change must do, what it must not do — each refusal paired with a named code — and the after-state that defines a complete, safe deployment.

# CHANGE-SPEC — us-east-1 RDS failover routing update

Must:
  - Route read traffic to the standby replica within 30 s of a primary failure.
  - Leave write-path SLO unaffected (P99 < 50 ms under steady-state load).
  - Be fully reversible: the original routing must be restorable in a single apply.
  - All IAM changes scope to the specific resource ARN — no wildcard resource.

Must Not (refusal codes):
  - POLICY-VIOLATION    : any resource config that violates the approved security baseline
  - PUBLIC-EXPOSURE     : a security group rule or bucket ACL reachable from 0.0.0.0/0
  - NO-ROLLBACK         : any resource with prevent_destroy removed, or no prior state snapshot
  - UNBOUNDED-BLAST-RADIUS : an IAM statement with Action or Resource wildcards

After-state:
  A routing change that passes policy-as-code checks, preserves the write-path SLO,
  and has a tested, documented rollback procedure confirmed in staging.

Lowest-confidence assumption:
  ⚠ The standby replica's replication lag at failover time — if lag exceeds the
    RPO target, routing fast does not help. Measure replication lag in the
    failover test before approving the spec.

The human reads the lowest-confidence flag first. If the replication lag is not measured, the change is not ready to approve — no matter how clean the Terraform plan will look.
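One way to make that rule mechanical is to hold the spec as data and refuse approval while the flagged assumption is unmeasured. The sketch below is an assumption about how such a record might look; `ChangeSpec`, `Assumption`, and `ready_for_approval` are hypothetical names, not part of any ADD tooling.

```python
from dataclasses import dataclass


@dataclass
class Assumption:
    text: str
    measured: bool = False  # set to True only once a real measurement exists


@dataclass
class ChangeSpec:
    title: str
    must: list              # required behaviors
    must_not: dict          # refusal code -> description
    after_state: str
    lowest_confidence: Assumption


def ready_for_approval(spec):
    """Return (ok, reasons). The spec is not approvable until its
    lowest-confidence assumption has been measured."""
    reasons = []
    if not spec.must:
        reasons.append("no Must clauses")
    if not spec.must_not:
        reasons.append("no refusal codes")
    if not spec.lowest_confidence.measured:
        reasons.append("lowest-confidence assumption not measured")
    return (not reasons, reasons)


spec = ChangeSpec(
    title="us-east-1 RDS failover routing update",
    must=["Route read traffic to the standby replica within 30 s of a primary failure"],
    must_not={"NO-ROLLBACK": "any resource with prevent_destroy removed"},
    after_state="Passes policy checks, preserves write-path SLO, rollback tested in staging",
    lowest_confidence=Assumption("Standby replication lag at failover time"),
)

ok_before, reasons_before = ready_for_approval(spec)

# After the failover test has actually measured replication lag:
spec.lowest_confidence.measured = True
ok_after, reasons_after = ready_for_approval(spec)
```

Before the measurement, `ready_for_approval` reports the unmeasured assumption as the blocking reason; afterwards the same spec passes.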

Step 2 — Scenarios: the model case, the edge, and the failure

Concrete scenarios written in infra language before any code is generated:

Standard deploy: a rolling update to a stateless service; all replicas healthy; SLO holds; no rollback needed.
Edge — region failover: primary AZ becomes unavailable; standby takes writes within the SLO window; no data loss beyond RPO.
Failure — no rollback path: a change attempts to replace a stateful resource (database replacement instead of modification); the NO-ROLLBACK code fires; the pipeline stops; no apply occurs.

Scenarios make the failure mode concrete before any Terraform is written. The failure scenario is especially important: it specifies what must not happen, not just what must succeed.
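The failure scenario can be pinned down as an executable check before any Terraform exists. The sketch below assumes the plan is available as JSON (the `resource_changes[].change.actions` shape that `terraform show -json` emits) and that the team maintains its own list of stateful resource types; `STATEFUL_TYPES` and the two sample plans are hypothetical.

```python
# Assumption: the platform team decides which resource types count as stateful.
STATEFUL_TYPES = {"aws_db_instance", "aws_rds_cluster", "aws_ebs_volume"}


def no_rollback_violations(plan, stateful_types=STATEFUL_TYPES):
    """Flag any stateful resource the plan would delete or replace.

    A replacement shows up in the plan JSON as actions containing both
    "delete" and "create", so checking for "delete" covers both cases.
    """
    violations = []
    for rc in plan.get("resource_changes", []):
        actions = set(rc.get("change", {}).get("actions", []))
        if rc.get("type") in stateful_types and "delete" in actions:
            violations.append(
                f"NO-ROLLBACK: {rc['address']} would be destroyed or replaced"
            )
    return violations


# Scenario: standard deploy, in-place update only.
standard_plan = {"resource_changes": [
    {"address": "aws_db_instance.primary", "type": "aws_db_instance",
     "change": {"actions": ["update"]}},
]}

# Scenario: failure, the database is replaced instead of modified.
replacement_plan = {"resource_changes": [
    {"address": "aws_db_instance.primary", "type": "aws_db_instance",
     "change": {"actions": ["delete", "create"]}},
]}

standard_result = no_rollback_violations(standard_plan)
failure_result = no_rollback_violations(replacement_plan)
```

Writing both fixtures first gives the scenario a concrete pass and a concrete refusal to be judged against.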

Step 3 — Contract: policy-as-code is the frozen gate

This is where DevOps maps most directly onto ADD. The frozen contract is the approved policy set — expressed as code, enforced in the pipeline, and owned by the platform team. It is not a checklist a human runs through before apply; it is a set of machine-checkable rules that either pass or block.

The gate is enforced in the pipeline, not in the agent. The AI that generates the Terraform does not decide whether its output passes policy. A separate, automated gate does — every time, without exception, regardless of which agent or engineer authored the change.

This is the portability principle from the software-native version of this method restated for infrastructure: the correctness check is decoupled from the producer. Swap the AI model, the IaC tool, or the engineer — the policy gate runs regardless.

The contract freeze means the approved policy set is versioned in source control, and changes to it are themselves a change request — they re-enter the loop at Specify. A policy change is not an emergency patch to make a failing gate pass.

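# Illustrative: policy/no_public_exposure.rego (OPA/Conftest)
# Input is the plan JSON from `terraform show -json`. Conftest evaluates
# package `main` by default, so run with `--namespace terraform.deny`.
package terraform.deny

import rego.v1

# PUBLIC-EXPOSURE: an ingress rule open to the whole internet.
deny contains msg if {
  some r in input.resource_changes
  r.type == "aws_security_group_rule"
  r.change.after.type == "ingress"
  "0.0.0.0/0" in r.change.after.cidr_blocks
  msg := sprintf("PUBLIC-EXPOSURE: ingress rule open to 0.0.0.0/0 in %s", [r.address])
}

# UNBOUNDED-BLAST-RADIUS: a wildcard Action in an Allow statement.
# aws_iam_policy stores its document as a JSON string, so decode it first.
deny contains msg if {
  some r in input.resource_changes
  r.type == "aws_iam_policy"
  doc := json.unmarshal(r.change.after.policy)
  some stmt in as_list(doc.Statement)
  stmt.Effect == "Allow"
  "*" in as_list(stmt.Action)
  msg := sprintf("UNBOUNDED-BLAST-RADIUS: wildcard Action in IAM statement in %s", [r.address])
}

# IAM allows Statement and Action to be a single value or a list.
as_list(x) := x if is_array(x)

as_list(x) := [x] if not is_array(x)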
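The contract freeze itself can also be checked by machine. A minimal sketch, assuming the pipeline pins a digest of the approved policy set and compares it before running the gate; the function names, the pinned-digest mechanism, and the temporary directory used for the demo are all hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def policy_digest(policy_dir):
    """SHA-256 over the relative path and bytes of every .rego file, in sorted order."""
    digest = hashlib.sha256()
    root = Path(policy_dir)
    for path in sorted(root.rglob("*.rego")):
        digest.update(path.relative_to(root).as_posix().encode())
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()


def contract_is_frozen(policy_dir, pinned_digest):
    """True only if the policy set on disk matches the approved, pinned digest."""
    return policy_digest(policy_dir) == pinned_digest


# Demo in a throwaway directory: pin the digest, then quietly loosen a rule.
with tempfile.TemporaryDirectory() as tmp:
    rule = Path(tmp) / "no_public_exposure.rego"
    rule.write_text("package terraform.deny\n")
    pinned = policy_digest(tmp)
    unchanged_ok = contract_is_frozen(tmp, pinned)

    rule.write_text("package terraform.deny\n# rule loosened to unblock a deploy\n")
    edited_ok = contract_is_frozen(tmp, pinned)
```

In this sketch an edit to any rule changes the digest, so a loosened policy cannot ride along with the change it was loosened for; updating the pin would be its own change request.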

Step 4 — Acceptance checks: the pre-apply red tests

Before any terraform apply runs, the following checks must pass — and critically, they must be written and confirmed to fail on a non-compliant change before the change is authored. A gate that never fires is not a gate.

# pre-apply acceptance checklist (illustrative)
pre_apply_checks:
  - name: policy-as-code
    tool: conftest
    command: conftest test plan.json --policy policy/ --namespace terraform.deny
    must_pass: true
    blocks_on: POLICY-VIOLATION, PUBLIC-EXPOSURE, NO-ROLLBACK, UNBOUNDED-BLAST-RADIUS

  - name: security-scan
    tool: checkov
    command: checkov -d . --compact --quiet
    must_pass: true
    severity_threshold: HIGH

  - name: rollback-verification
    description: >
      Apply the change in staging, confirm the SLO holds, then execute
      the rollback procedure and confirm the prior state is restored cleanly.
    must_pass: true
    blocks_on: any state that cannot be reverted in a single apply

  - name: chaos-or-failover-test
    description: >
      For changes that affect the failure path (routing, failover, DR),
      run the failover scenario from Step 2. Measure actual recovery time
      and replication lag — not estimated.
    must_pass: true
    required_for: changes that touch the failure path (routing, failover, DR)

A change that fails any of these does not proceed. A HARD-STOP is automatic — not a meeting, not a waiver negotiation.
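The "confirmed to fail first" requirement can be sketched as a test of the gate itself. Below, a pure-Python stand-in for the PUBLIC-EXPOSURE rule is run against a known-bad and a known-good fixture; the stand-in, the fixtures, and `gate_is_live` are hypothetical and only illustrate the idea, since the real gate would be the policy engine in the pipeline.

```python
def public_exposure_findings(plan):
    """Illustrative stand-in for the PUBLIC-EXPOSURE policy rule."""
    findings = []
    for rc in plan.get("resource_changes", []):
        after = (rc.get("change") or {}).get("after") or {}
        if (rc.get("type") == "aws_security_group_rule"
                and after.get("type") == "ingress"
                and "0.0.0.0/0" in (after.get("cidr_blocks") or [])):
            findings.append(f"PUBLIC-EXPOSURE: {rc['address']}")
    return findings


def gate_is_live(gate, known_bad, known_good):
    """Trust a gate only if it fires on the bad fixture and stays quiet on the good one."""
    return bool(gate(known_bad)) and not gate(known_good)


# Hypothetical fixtures, shaped like Terraform plan JSON.
BAD_PLAN = {"resource_changes": [
    {"address": "aws_security_group_rule.open", "type": "aws_security_group_rule",
     "change": {"after": {"type": "ingress", "cidr_blocks": ["0.0.0.0/0"]}}},
]}
GOOD_PLAN = {"resource_changes": [
    {"address": "aws_security_group_rule.internal", "type": "aws_security_group_rule",
     "change": {"after": {"type": "ingress", "cidr_blocks": ["10.0.0.0/16"]}}},
]}

real_gate_live = gate_is_live(public_exposure_findings, BAD_PLAN, GOOD_PLAN)

# A gate that never fires fails the same test.
silent_gate_live = gate_is_live(lambda plan: [], BAD_PLAN, GOOD_PLAN)
```

Running this check before the change is authored is what makes the acceptance checks red tests rather than decoration.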

Step 5 — Produce: the AI authors IaC that makes every check pass

Now, and only now, the agent writes Terraform, pipeline YAML, or runbook content — under one rule: do not weaken the checks to fit the output.

The how is unconstrained. The agent chooses module structure, variable naming, resource ordering. It may extend an existing module rather than authoring from scratch. The contract fixes the behavior; the agent finds the path. This is where AI delivers genuine leverage in DevOps — not deciding what a change should accomplish, but authoring correct, well-structured IaC against a precise specification.

Step 6 — Verify by evidence: SLO hold, not “applied cleanly”

A clean apply is not evidence of a safe change. It is evidence that Terraform did not encounter a state conflict. The proof that a change is safe comes from observability.

After deploy — staged, behind a feature flag, or as a canary — the evidence required is:

SLO burn rate: is the error budget being consumed faster than the pre-change baseline?
Latency distribution: is P99 within the spec threshold?
Rollback test: was the rollback procedure executed in staging and confirmed clean?
Security posture: did the automated scanner report zero new HIGH/CRITICAL findings post-apply?
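The burn-rate question reduces to simple arithmetic. A minimal sketch, using the common definition of burn rate as the observed error ratio divided by the error budget (1 minus the SLO target); the request counts, the 99.9% target, and the `tolerance` parameter are made-up assumptions for illustration.

```python
def burn_rate(errors, total, slo_target):
    """Observed error ratio divided by the error budget (1 - SLO target).

    A value of 1.0 means the budget is being spent exactly at the rate the
    SLO allows; above 1.0 it runs out before the SLO window ends.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return (errors / total) / budget


def burns_faster(baseline, post_change, tolerance=1.0):
    """True if the post-change burn rate exceeds the baseline by more than `tolerance`x."""
    return post_change > baseline * tolerance


SLO_TARGET = 0.999  # hypothetical 99.9% availability SLO

baseline = burn_rate(errors=5, total=10_000, slo_target=SLO_TARGET)    # roughly 0.5
canary = burn_rate(errors=30, total=10_000, slo_target=SLO_TARGET)     # roughly 3.0

regressed = burns_faster(baseline, canary)
```

With these made-up numbers the canary spends budget about six times faster than the baseline, which is the kind of evidence a clean apply never provides.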

The adversarial move — the infra equivalent of ADD’s refute-read — is a “try to break it” review in staging: run the failure scenario from Step 2 and try to induce the failure mode the change was designed to prevent. If the failover spec says “standby takes writes within 30 seconds,” kill the primary and measure. Not estimate. Measure.
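Measuring rather than estimating can be as plain as timestamped health probes taken while the primary is killed. The sketch below assumes probes arrive as `(timestamp_seconds, ok)` pairs in time order; the probe data and function names are hypothetical, and the result is only as precise as the probe interval.

```python
def recovery_seconds(probes):
    """Seconds from the first failed probe to the first successful probe after it.

    Returns None if no failure was observed (the test never exercised the
    failover path) or if the service never recovered during the run.
    """
    first_fail = None
    for ts, ok in probes:
        if first_fail is None:
            if not ok:
                first_fail = ts
        elif ok:
            return ts - first_fail
    return None


def meets_target(measured, target_seconds):
    """A missing measurement is never treated as a pass."""
    return measured is not None and measured <= target_seconds


# Hypothetical probe log, one probe every 5 s; primary killed just before t=10.
PROBES = [(0, True), (5, True), (10, False), (15, False),
          (20, False), (25, False), (30, True), (35, True)]

measured = recovery_seconds(PROBES)
passed = meets_target(measured, target_seconds=30)

# A run in which nothing failed proves nothing about failover.
no_failure = recovery_seconds([(0, True), (5, True)])
```

Treating "no failure observed" as a non-result keeps a test that never actually killed the primary from counting as evidence.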

	Ungated AI change	ADD-gated change
Ground	Module README + user description	Live infra state + security baseline + SLO dashboard
Spec	Prompt in the chat window	Written change spec with refusal codes and after-state
Gate	Human plan review (reads the diff)	Policy-as-code in pipeline (machine-checkable, every time)
Rollback	Assumed recoverable	Rollback procedure tested in staging before apply
Evidence of safety	Plan applied cleanly	SLO holds post-deploy; rollback confirmed; chaos test passed
Failure mode	Silent misconfiguration or NO-ROLLBACK discovered in an incident	Blocked at policy gate before apply; failure surfaced in staging

Step 7 — Observe and fold: incidents become policy updates

The loop does not end at a clean deploy. A post-deploy SLO burn spike becomes a spec delta: a new acceptance check targeting the metric that degraded. A blameless postmortem that uncovers a rollback gap updates the change spec template and pre-apply checklist. A recurring PUBLIC-EXPOSURE finding becomes a stricter policy rule, versioned in source control.

The playbook is updated by evidence, not intention. The next change inherits the tighter gate; no one has to remember to apply the lesson.

Constrain the what, free the how

In DevOps, the temptation is to constrain the how — prescribe the module structure, variable naming, resource ordering — because those are the things visible in a plan review. ADD inverts the instinct.

Clamp the what: the policy the change must satisfy, the SLO it must preserve, the rollback it must enable. Leave the how — module decomposition, dependency ordering, helper script design — to the agent. The pipeline gate confirms correctness; no human needs to read every line of a Terraform plan to trust the result.

What doesn’t transfer

The analogy is unusually tight in DevOps — but three things resist it.

Incident command under pressure is not a procedure. An on-call engineer diagnosing a novel cascading failure at 2 a.m. is reasoning under uncertainty with incomplete signal, making reversible bets, adjusting as data arrives. A runbook covers known failure modes. The incident commander owns judgment for the unknown ones. Automation cannot carry the accountability that comes with the pager.

Novel failure diagnosis requires systems intuition. Evidence gates confirm expected properties; they do not replace the instinct that asks why a metric looks wrong when everything else looks fine. A latent bug — a subtle race condition, a metric-reporting gap masking a real error rate — surfaces through operational pattern recognition, not a checklist.

Over-gating kills incident response. Applying the full ADD loop to every change — including the three-line emergency DNS fix at 3 a.m. — would make incidents worse. The method is calibrated to planned changes. Emergency procedures are a different protocol with a different risk tolerance, and should be documented as such.

Next in the series

DevOps and SRE sit at the intersection of automation and accountability, and ADD maps cleanly because the discipline was already thinking in contracts and evidence. The next domain stretches the method further, into work where the artifact itself is an AI system rather than code an AI wrote.

Next: ADD for Machine Learning — where the artifact being verified is a model, the “test” is an evaluation harness, and the frozen contract is a behavior specification that holds across distribution shift.
