LLMOps Runbook (Incident + Rollback)
An Azure OpenAI runbook for incident response, rollback, and postmortems with Azure Monitor signals.
0. Why this guide exists
AI incidents compound quickly. This runbook standardizes response for Azure OpenAI + Azure Monitor so teams can contain issues fast.
Without a runbook, incidents are handled ad hoc, with slow rollback and unclear ownership.
The target is to restore stability in under 30 minutes with clear escalation paths.
The result is fast containment and traceable decisions.
1. Azure Monitor mental model (Signal -> Triage -> Action)
Signal: Azure Monitor alerts on latency, cost, and refusal spikes.
Triage: Confirm the impacted model, scope, and user impact.
Action: Roll back, fail over, or throttle, depending on severity.
2. Trigger conditions (governance first)
Outcome: Clear thresholds that trigger paging before customer impact expands.
- Latency exceeds 2x baseline for 5 minutes.
- Guardrail failure rate exceeds 1 percent.
- Daily cost exceeds the budget threshold.
- Policy refusal rate exceeds the defined threshold.
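The four conditions above can be written as a single check. This is a minimal sketch, not an Azure Monitor API: the `Metrics` fields and the `fired_triggers` function are hypothetical names, and the refusal threshold is left as an input because the runbook does not fix a value for it.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    """Hypothetical snapshot of the signals this runbook pages on."""
    latency_ms: float               # current latency
    baseline_latency_ms: float      # normal latency for this deployment
    latency_breach_minutes: float   # how long latency has stayed above 2x baseline
    guardrail_failure_rate: float   # fraction of calls, so 0.01 means 1 percent
    daily_cost: float
    daily_budget: float
    refusal_rate: float
    refusal_threshold: float        # team-defined; not specified by the runbook


def fired_triggers(m: Metrics) -> list[str]:
    """Return the names of every trigger condition that currently holds."""
    fired = []
    # Latency exceeds 2x baseline, sustained for 5 minutes.
    if m.latency_ms > 2 * m.baseline_latency_ms and m.latency_breach_minutes >= 5:
        fired.append("latency")
    # Guardrail failure rate above 1 percent.
    if m.guardrail_failure_rate > 0.01:
        fired.append("guardrail")
    # Daily cost exceeds the budget threshold.
    if m.daily_cost > m.daily_budget:
        fired.append("cost")
    # Refusal rate beyond the defined threshold.
    if m.refusal_rate > m.refusal_threshold:
        fired.append("refusal")
    return fired
```

Returning every fired trigger, rather than stopping at the first, lets the pager message show the full picture when several signals degrade together.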
3. Triage steps (isolation and safety)
Outcome: Confirm blast radius and stabilize before changes go live.
- Confirm affected model deployment and user impact.
- Snapshot logs, traces, and recent prompt changes.
- Switch to safe fallback model if severity is high.
- Notify support and leadership channels.
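As an illustration, the triage steps above can be expressed as an ordered checklist in which only the fallback switch is conditional. The `TriageRecord` type, its three-level severity scale, and `triage_steps` are assumptions made for this sketch; the runbook itself only says the fallback applies when severity is high.

```python
from dataclasses import dataclass


@dataclass
class TriageRecord:
    deployment: str       # affected model deployment name
    affected_users: int   # estimated user impact
    severity: str         # "low", "medium", or "high" (assumed scale)


def triage_steps(record: TriageRecord) -> list[str]:
    """Return the triage steps in order; the fallback switch is added only for high severity."""
    steps = [
        f"confirm deployment {record.deployment} and impact on {record.affected_users} users",
        "snapshot logs, traces, and recent prompt changes",
    ]
    if record.severity == "high":
        steps.append("switch to safe fallback model")
    steps.append("notify support and leadership channels")
    return steps
```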
4. Rollback procedure
Trigger: Policy breach or guardrail failure rate above threshold.
Action: Shift traffic to the last known stable model deployment.
Verify: Run smoke tests and confirm that the error rate recovers.
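The rollback sequence (shift traffic, smoke test, confirm error rate) might look like the sketch below. It makes several assumptions: traffic is modeled as a map from deployment name to percentage, and `apply_weights`, `smoke_test`, and `error_rate` are hypothetical callables that the team would supply, since how traffic is actually routed depends on the gateway or load balancer in front of the deployments. The 1 percent default for `max_error_rate` is likewise a placeholder.

```python
from typing import Callable


def rollback(
    weights: dict[str, int],
    stable: str,
    apply_weights: Callable[[dict[str, int]], None],
    smoke_test: Callable[[str], bool],
    error_rate: Callable[[], float],
    max_error_rate: float = 0.01,
) -> dict[str, int]:
    """Send all traffic to the last known stable deployment, then verify recovery."""
    if stable not in weights:
        raise KeyError(f"unknown deployment: {stable}")

    # Step 1: shift 100 percent of traffic to the stable deployment.
    new_weights = {name: (100 if name == stable else 0) for name in weights}
    apply_weights(new_weights)

    # Step 2: run smoke tests against the stable deployment.
    if not smoke_test(stable):
        raise RuntimeError("smoke test failed on stable deployment; escalate")

    # Step 3: confirm the error rate has recovered.
    if error_rate() > max_error_rate:
        raise RuntimeError("error rate has not recovered; escalate")

    return new_weights
```

Raising on a failed verification, instead of returning a status flag, makes it hard for an automated caller to treat an unverified rollback as complete.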
5. Communication checklist
- Notify on-call and platform owner.
- Update leadership channel with impact summary.
- Post status updates every 30 minutes.
6. Guardrails and limits (preventing early failures)
Reduce incident frequency by enforcing guardrails and budgets:
- Content filters and policy checks on every call.
- Daily anomaly alerts for cost spikes.
- Change approval for high-risk workflows.
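The runbook does not say how a cost spike is detected. One simple option, shown here purely as an assumption, is to flag a day whose cost sits more than `k` standard deviations above the trailing mean. The function name, the default `k = 3.0`, and the minimum history length are all illustrative choices.

```python
from statistics import mean, pstdev


def is_cost_anomaly(history: list[float], today: float, k: float = 3.0) -> bool:
    """Flag today's cost if it exceeds the trailing mean by more than k standard deviations.

    `history` holds the daily cost for previous days. With fewer than two
    data points there is no meaningful spread, so nothing is flagged.
    """
    if len(history) < 2:
        return False
    mu = mean(history)
    sigma = pstdev(history)
    return today > mu + k * sigma
```

If the history is perfectly flat, the standard deviation is zero and any increase is flagged, so teams with very stable spend may want to add a minimum absolute margin.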
7. Common failure modes (what breaks in real orgs)
- Output quality changes because prompts are not under version control.
- No alerts fire before budgets are exceeded.
- No documented fallback model or playbook exists.
8. What "ready" actually means
- Monitoring: Alerts wired to on-call channels.
- Rollback: Fallback deployment tested quarterly.
- Ownership: Incident commander and comms owner defined.
- Audit: Postmortems stored with remediation owners.
Business impact: Lower MTTR, fewer escalations, and higher reliability.
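The readiness criteria in this section can be checked mechanically. The sketch below assumes a hypothetical `Readiness` record maintained by the team and treats "quarterly" as a maximum of 92 days since the last fallback test; both are illustrative choices, not part of the runbook.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Readiness:
    alerts_wired_to_on_call: bool
    fallback_last_tested: date
    incident_commander: str       # empty string means unassigned
    comms_owner: str              # empty string means unassigned
    postmortems_have_owners: bool


def readiness_gaps(r: Readiness, today: date, max_test_age_days: int = 92) -> list[str]:
    """Return the readiness areas (monitoring, rollback, ownership, audit) that are not met."""
    gaps = []
    if not r.alerts_wired_to_on_call:
        gaps.append("monitoring")
    # "Tested quarterly" is approximated as tested within the last 92 days.
    if (today - r.fallback_last_tested).days > max_test_age_days:
        gaps.append("rollback")
    if not r.incident_commander or not r.comms_owner:
        gaps.append("ownership")
    if not r.postmortems_have_owners:
        gaps.append("audit")
    return gaps
```

An empty list means all four criteria hold on the given date.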
Author note
Runbooks work when they reduce ambiguity. I tie steps to signals and make rollback explicit.