LLMOps Runbook (Incident + Rollback)


Doc type: Runbook + incident response

Primary users: Platform ops, SRE, ML engineers

Success metric: Safe rollback in under 30 minutes

Artifacts: Incident checklist, alerts map

0. Why this guide exists

AI incidents compound quickly. This runbook standardizes response for Azure OpenAI + Azure Monitor so teams can contain issues fast.

Problem

Incidents are handled ad hoc, with slow rollback and unclear ownership.

Outcome

Restore stability in under 30 minutes with clear escalation paths.

Goal

Fast containment and traceable decisions.

1. Azure Monitor mental model (Signal -> Triage -> Action)

Signal

Azure Monitor alerts on latency, cost, and refusal spikes.

Triage

Confirm impacted model, scope, and user impact.

Action

Rollback, failover, or throttling based on severity.

Incident response flows from signal to postmortem with clear ownership.

Figure: Azure Monitor alert overview for latency, refusal, and cost.
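As one way to make the Action step concrete, the sketch below maps a triaged incident to one of the three responses. The rule ordering is an assumption pieced together from the triage and rollback sections of this runbook: a guardrail or policy breach leads to rollback, and high severity leads to failover to the fallback model. Treating throttling as the default for lower severities is a guess, and `Severity`, `Action`, and `choose_action` are hypothetical names, not part of any Azure API.

```python
from enum import Enum


class Severity(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


class Action(Enum):
    THROTTLE = "throttle"
    FAILOVER = "failover"
    ROLLBACK = "rollback"


def choose_action(severity: Severity, guardrail_breach: bool) -> Action:
    """Pick a response for a triaged incident (assumed policy, not Azure behavior)."""
    # A policy or guardrail breach is the rollback trigger named in this runbook.
    if guardrail_breach:
        return Action.ROLLBACK
    # High severity without a breach: move traffic to the safe fallback model.
    if severity is Severity.HIGH:
        return Action.FAILOVER
    # Assumption: lower severities are contained by throttling.
    return Action.THROTTLE
```

Keeping the decision in a single function makes it easy to review and to record in the incident timeline alongside the signal that prompted it.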

2. Trigger conditions (governance first)

Outcome: Clear thresholds that trigger paging before customer impact expands.

Latency exceeds 2x baseline for 5 consecutive minutes.
Guardrail failure rate exceeds 1 percent.
Cost anomalies push spend past the daily budget threshold.
Policy refusal rate exceeds the defined threshold.

Figure: Trigger thresholds for paging and review.
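The trigger conditions above can be sketched as a single evaluation function. This is a minimal illustration, assuming the metrics have already been pulled out of Azure Monitor into plain numbers. The `Metrics` fields and `fired_triggers` are hypothetical names, the latency window is assumed to be one sample per minute, and the cost trigger is read as "daily spend exceeds the daily budget".

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    """Hypothetical snapshot of the signals named in the trigger list."""
    latency_ms_per_minute: list[float]  # most recent samples, one per minute
    baseline_latency_ms: float
    guardrail_failure_rate: float       # fraction: 0.01 means 1 percent
    daily_cost: float
    daily_budget: float
    refusal_rate: float
    refusal_threshold: float            # team-defined; no value given in the runbook


def fired_triggers(m: Metrics) -> list[str]:
    """Return the names of every trigger condition that currently holds."""
    fired = []
    window = m.latency_ms_per_minute[-5:]
    # Latency must stay above 2x baseline for the full 5-minute window.
    if len(window) == 5 and all(s > 2 * m.baseline_latency_ms for s in window):
        fired.append("latency")
    if m.guardrail_failure_rate > 0.01:
        fired.append("guardrail")
    if m.daily_cost > m.daily_budget:
        fired.append("cost")
    if m.refusal_rate > m.refusal_threshold:
        fired.append("refusal")
    return fired
```

Requiring every sample in the window to exceed the limit avoids paging on a single slow minute, at the cost of reacting one window later.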

3. Triage steps (isolation and safety)

Outcome: Confirm blast radius and stabilize before changes go live.

Confirm affected model deployment and user impact.
Snapshot logs, traces, and recent prompt changes.
Switch to safe fallback model if severity is high.
Notify support and leadership channels.
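The triage steps above can be sketched as a small, ordered routine that records each step as it happens. This is an illustration only: `TriageRecord`, `run_triage`, and the three injected callables are hypothetical stand-ins for whatever tooling a team actually uses to snapshot evidence, switch models, and send notifications.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable


@dataclass
class TriageRecord:
    """Hypothetical incident record; field names are illustrative."""
    deployment: str
    impacted_users: int
    severity: str  # "low", "medium", or "high"
    evidence: dict[str, str] = field(default_factory=dict)
    timeline: list[str] = field(default_factory=list)

    def log(self, step: str) -> None:
        # Timestamp every step so decisions stay traceable in the postmortem.
        stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
        self.timeline.append(f"{stamp} {step}")


def run_triage(
    record: TriageRecord,
    snapshot: Callable[[], dict[str, str]],
    switch_to_fallback: Callable[[], None],
    notify: Callable[[str], None],
) -> TriageRecord:
    record.log(f"confirmed deployment={record.deployment} users={record.impacted_users}")
    # Capture evidence before changing anything, so the failing state is preserved.
    record.evidence = snapshot()
    record.log("snapshot captured")
    if record.severity == "high":
        switch_to_fallback()
        record.log("switched to fallback model")
    notify(f"Incident on {record.deployment}: severity {record.severity}")
    record.log("support and leadership notified")
    return record
```

Taking the snapshot before the fallback switch is a deliberate ordering choice: once traffic moves, the logs and traces of the failing deployment become harder to correlate with user impact.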

4. Rollback procedure (learning before building)

Trigger

Policy breach or guardrail failure above threshold.

Action

Shift traffic to last known stable model deployment.

Verify

Run smoke tests and confirm error rate recovery.

Figure: Deployment swap and health check after rollback.
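The Action and Verify steps of the rollback can be sketched as one function. This is a sketch under stated assumptions, not a definitive implementation: the traffic switch, smoke tests, and error-rate lookup are injected as callables because the runbook does not say which tooling performs them, and the 1 percent recovery limit, five checks, and 60-second wait are illustrative defaults.

```python
import time
from typing import Callable


def rollback(
    route_traffic_to: Callable[[str], None],
    run_smoke_tests: Callable[[], bool],
    get_error_rate: Callable[[], float],
    stable_deployment: str,
    max_error_rate: float = 0.01,   # assumed recovery limit
    checks: int = 5,                # assumed number of verification polls
    wait_seconds: float = 60.0,     # assumed wait between polls
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once the rollback is verified, False if it needs escalation."""
    # Action: shift traffic to the last known stable deployment.
    route_traffic_to(stable_deployment)

    # Verify, part 1: smoke tests must pass before trusting the metrics.
    if not run_smoke_tests():
        return False

    # Verify, part 2: poll until the error rate recovers or checks run out.
    for attempt in range(checks):
        if get_error_rate() <= max_error_rate:
            return True
        if attempt < checks - 1:
            sleep(wait_seconds)
    return False
```

With the defaults, verification takes at most about four minutes of waiting, which leaves room inside the 30-minute rollback target for triage and the traffic switch itself. A `False` result means the rollback did not restore health and the incident should be escalated.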

5. Communication checklist (proof of access)

Notify on-call and platform owner.
Update leadership channel with impact summary.
Post status updates every 30 minutes.

6. Guardrails and limits (preventing early failures)

Reduce incident frequency by enforcing guardrails and budgets:

Guardrails

Content filters and policy checks on every call.

Budget alerts

Daily anomaly alerts for cost spikes.

Prompt review

Change approval for high-risk workflows.

Figure: Guardrails and budget alerts configured.
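To illustrate what a daily cost anomaly check could look like, the sketch below flags a day whose spend sits well above the recent average. The mean-plus-k-standard-deviations rule is a stand-in chosen for simplicity, not the method Azure's own cost anomaly detection uses, and `is_cost_anomaly` and its parameters are hypothetical.

```python
from statistics import mean, stdev


def is_cost_anomaly(history: list[float], today: float, k: float = 3.0) -> bool:
    """Flag today's cost if it exceeds the trailing mean by k standard deviations.

    `history` holds daily costs for the preceding days; k=3.0 is an assumed
    sensitivity that a team would tune against its own spend pattern.
    """
    if len(history) < 2:
        # Not enough data to estimate spread; do not alert on guesswork.
        return False
    limit = mean(history) + k * stdev(history)
    return today > limit
```

For example, with a trailing history of `[100, 102, 98, 101, 99]`, a day costing 130 is flagged while a day costing 103 is not. A check like this complements a fixed daily budget threshold because it can fire while spend is still under budget but rising abnormally.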

7. Common failure modes (what breaks in real orgs)

Silent drift

Quality changes without prompt/version control.

Cost spikes

No alerts before budgets are exceeded.

Slow rollback

No documented fallback model or playbook.

8. What "ready" actually means

Monitoring: Alerts wired to on-call channels.
Rollback: Fallback deployment tested quarterly.
Ownership: Incident commander and comms owner defined.
Audit: Postmortems stored with remediation owners.
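The four readiness criteria above can be expressed as a small self-check that returns whatever is missing. This is a hedged sketch: `readiness_gaps` and its inputs are hypothetical, and "quarterly" is read as "within the last 92 days", which is an assumption.

```python
from datetime import date, timedelta
from typing import Optional


def readiness_gaps(
    alerts_wired_to_on_call: bool,
    last_fallback_test: Optional[date],
    incident_commander: Optional[str],
    comms_owner: Optional[str],
    postmortems_without_owner: int,
    today: date,
    max_test_age: timedelta = timedelta(days=92),  # assumed reading of "quarterly"
) -> list[str]:
    """Return the readiness areas that currently fail; an empty list means ready."""
    gaps = []
    if not alerts_wired_to_on_call:
        gaps.append("monitoring")
    # The fallback counts as tested only if a test ran within the last quarter.
    if last_fallback_test is None or today - last_fallback_test > max_test_age:
        gaps.append("rollback")
    if not incident_commander or not comms_owner:
        gaps.append("ownership")
    if postmortems_without_owner > 0:
        gaps.append("audit")
    return gaps
```

Running a check like this on a schedule turns "ready" from a one-time sign-off into something that is re-verified as tests age and owners change.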

Business impact: Lower MTTR, fewer escalations, and higher reliability.

Author note

Runbooks work when they reduce ambiguity. I tie steps to signals and make rollback explicit.