Why this is hard to get right
The Real Cost of a Runbook Written Under Pressure
Marcus is an SRE manager at a mid-sized fintech company. His team runs 24/7 on-call rotations across three services. Six months ago, a payment processing outage stretched to four hours — not because the fix was hard, but because nobody agreed on who owned communications, three engineers duplicated investigation steps, and the incident commander spent 20 minutes drafting a customer-facing status update from scratch.
After the post-mortem, leadership mandated runbooks for every critical service. Marcus knew what a good runbook looked like in theory: severity levels, RACI tables, step-by-step mitigation, rollback procedures, comms templates. But translating that knowledge into a clean, usable document — for an audience ranging from junior on-call engineers to a VP of Engineering receiving executive briefings — was harder than it looked.
His first attempt asked an AI assistant something like: "Write an incident response runbook for API outages." The output was generic. It gave him five high-level phases with bullet points lifted from an ITIL textbook. No tool references, no Slack message templates, no severity thresholds tied to his actual SLAs. His on-call engineers took one look and said it wasn't usable.
Marcus tried again with more detail, but he kept forgetting things. One draft had no escalation paths. Another had no rollback steps. A third had comms templates but no decision tree for when to escalate to a third-party vendor. Each revision cost him another hour — time he didn't have between incidents.
The core problem wasn't the AI. It was the prompt. Marcus was asking for a document without telling the AI what "done" looked like for his team specifically — which tools they used, what SLAs were non-negotiable, what failure modes they'd already seen, and who needed to receive what information at each stage.
When Marcus built a structured prompt that named the incident type, spelled out the full toolchain (AWS, Kubernetes, Datadog, PagerDuty, Slack), defined SLA thresholds, and requested specific outputs like a RACI table, copy-ready Slack messages, and a decision tree, the AI produced a near-final draft in a single pass.
His team reviewed it in one 30-minute session. They added two edge cases and adjusted one escalation threshold. Then it was done. The next P1 incident was resolved in 38 minutes. The comms lead sent the first customer update within five minutes of detection, using the template word for word.
That's what a well-structured prompt actually does: it compresses weeks of documentation work into a focused, expert-reviewed output that holds up when everything else is on fire.
Common mistakes to avoid
Skipping the Incident Type Entirely
Asking for a generic 'incident response plan' forces the AI to invent a scenario. Runbooks are scenario-specific by design — a database failover and a DDoS attack require completely different investigation steps, tools, and escalation chains. Name the exact incident type upfront so every step maps to a real situation your team will face.
Omitting Tools and Monitoring Stack
A runbook that says 'check your monitoring dashboard' is useless at 2 AM. Specify your actual tools — Datadog, PagerDuty, Grafana, AWS CloudWatch — so the AI can reference real dashboards, alert names, and CLI commands. Generic steps require mental translation under pressure, which introduces errors and delays.
Forgetting to Define SLAs and Severity Thresholds
Without SLA anchors, the AI produces vague guidance like 'respond quickly.' Define detect, mitigate, and resolve targets in minutes — these numbers drive urgency, trigger escalation, and create accountability. A runbook without thresholds gives engineers no signal for when to escalate versus continue investigating.
Leaving Out Communication Templates
During an active incident, drafting Slack updates or status page messages from scratch wastes critical minutes. Request copy-ready templates for internal channels, executive briefings, and customer-facing communications. Engineers shouldn't be writing prose under pressure — they should be filling in blanks.
Ignoring Edge Cases and Failure Modes
Most runbooks document the happy path: the incident is exactly what you expect and the fix works first time. Include 2-3 realistic edge cases — partial region outages, third-party dependency failures, noisy neighbors — so engineers aren't stranded when the main playbook doesn't fit the actual situation.
Requesting a Document Without a Role Structure
A runbook read by everyone is owned by no one. Define the audience explicitly: on-call engineer, incident commander, comms lead, executive sponsor. Each role needs a different information layer. Without role-specific steps and a RACI table, engineers overlap, skip tasks, or wait for each other to act.
The transformation
Before:
Make an incident response plan for outages with steps and best practices.
After:
You are an experienced SRE lead. Create a step-by-step incident response runbook for “Production API latency spike.”
1) Audience: on-call engineers, incident commander, comms lead
2) Environment: AWS, Kubernetes, Datadog, PagerDuty, Slack
3) SLAs: detect <2 min, mitigate <15 min, resolve <60 min
4) Structure: overview, severity matrix, roles, trigger conditions, investigation checklist, mitigation playbook, rollback steps, comms templates (internal/external), decision tree, escalation paths, post-incident actions
5) Constraints: plain language, numbered steps, timestamps, checkboxes, copy-ready Slack/Status Page text, link placeholders
6) Edge cases: partial region outage, noisy neighbor pod, third-party dependency degradation
7) Output: 2 pages max, include RACI table and handoff protocol
Why this works
Named Scenario Anchors All Steps
The after prompt specifies "Production API latency spike" rather than a generic outage. This single detail forces the AI to generate investigation steps, tool queries, and mitigation actions that are relevant to that failure mode, not to a database crash or a network partition. Specificity eliminates generic filler.
Toolchain Context Produces Usable Commands
By listing AWS, Kubernetes, Datadog, PagerDuty, and Slack, the prompt lets the AI reference real dashboards, alert configurations, and notification channels. Engineers following the runbook don't need to mentally translate 'check logs' into 'open Datadog APM and filter by service:api, status:error.'
SLA Targets Create Built-In Urgency
The prompt defines detect under 2 minutes, mitigate under 15, resolve under 60. These numbers give the AI concrete thresholds to embed into every phase — triggering escalation steps, defining handoff moments, and making the runbook self-enforcing rather than advisory.
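To make the idea of thresholds that trigger escalation concrete, here is a minimal Python sketch. It assumes the three targets from the example prompt and measures all of them from the start of the incident; the function name `overdue_phases` and the phase labels are hypothetical, not part of any tool.

```python
# SLA targets in minutes, taken from the example prompt.
# Assumption: every target is measured from the start of the incident.
SLA_MINUTES = {"detect": 2, "mitigate": 15, "resolve": 60}


def overdue_phases(minutes_since_start: float, completed: set[str]) -> list[str]:
    """Return phases whose SLA target has passed without being completed."""
    return [
        phase
        for phase, target in SLA_MINUTES.items()
        if phase not in completed and minutes_since_start > target
    ]


# 18 minutes in, detection is done but mitigation is not: escalate mitigation.
print(overdue_phases(18, {"detect"}))  # ['mitigate']
```

A runbook generated with explicit targets can state the same rule in words, for example "if mitigation is not complete at 15 minutes, page the secondary on-call."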
Structured Output Requirements Prevent Gaps
The numbered structure section demands overview, severity matrix, RACI table, decision tree, rollback steps, and comms templates in a single request. This forces completeness — the AI can't deliver a partial runbook and call it done. Every critical section is explicitly required.
Edge Cases Eliminate Dead Ends
Including partial region outage, noisy neighbor pod, and third-party dependency degradation as explicit edge cases means the AI writes branching logic for situations where the primary playbook breaks down. Engineers get decision trees, not dead ends, when reality deviates from the expected failure mode.
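The branching logic described above can be pictured as a small decision function. This is only an illustrative sketch: the branch names are placeholders, and the order in which the edge cases are checked is an assumption that a real team would set for itself.

```python
def next_step(affected_regions: int, total_regions: int,
              vendor_degraded: bool, noisy_neighbor: bool) -> str:
    """Pick a playbook branch; every branch name here is a placeholder."""
    if vendor_degraded:
        # Third-party dependency degradation: nothing local will fix it.
        return "escalate-to-vendor"
    if 0 < affected_regions < total_regions:
        # Partial region outage: only some regions are unhealthy.
        return "partial-region-playbook"
    if noisy_neighbor:
        # Noisy neighbor pod: isolate the offending workload.
        return "isolate-noisy-pod"
    # None of the edge cases apply: follow the main mitigation steps.
    return "primary-mitigation-playbook"


print(next_step(1, 3, vendor_degraded=False, noisy_neighbor=False))
```

Each return value corresponds to a section the runbook should contain, so an engineer always lands on a concrete next step.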
The framework behind the prompt
The Theory Behind Incident Response Documentation
Incident response runbooks sit at the intersection of two well-established disciplines: Site Reliability Engineering (SRE) and cognitive performance under stress.
The SRE model, formalized by Google in the Site Reliability Engineering handbook, treats operations as a software engineering problem. Runbooks are a foundational artifact of that model — they encode operational knowledge into repeatable, testable procedures that reduce reliance on individual expertise. The goal is what SREs call toil reduction: converting reactive, manual, high-stress work into systematic, documented processes.
From a cognitive science perspective, runbooks address a well-documented limitation of human performance under stress: working memory degrades under high arousal. Research by Sian Beilock, summarized in her 2010 book Choke, indicates that performance pressure consumes the working memory people need to sequence complex tasks, which is exactly the situation engineers face during a P1 incident at 3 AM. Well-structured runbooks externalize that sequencing, reducing cognitive load at precisely the moment it matters most.
The ITIL (Information Technology Infrastructure Library) framework provides another lens. ITIL's incident management process is commonly described as a sequence of steps that includes identification, logging, categorization, prioritization, and resolution. Effective runbooks operationalize each of these, with explicit entry criteria, severity classification matrices, and resolution definitions. Without this structure, teams skip steps under pressure, most commonly logging and categorization, which creates gaps in post-incident analysis.
RACI matrices (Responsible, Accountable, Consulted, Informed) come from project management but solve a critical incident management problem: role ambiguity. When multiple engineers receive the same alert, diffusion of responsibility is the default — everyone assumes someone else is leading. A RACI table embedded in the runbook eliminates that ambiguity in the first 60 seconds.
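One way to see how a RACI table removes ambiguity is to treat it as data and check it. The sketch below assumes the common convention that each task has exactly one Accountable role and at least one Responsible role; the tasks, roles, and the `raci_problems` helper are hypothetical examples.

```python
# Hypothetical RACI table: task -> {role: "R" | "A" | "C" | "I"}
RACI = {
    "Declare severity": {
        "incident commander": "A", "on-call engineer": "R", "comms lead": "I",
    },
    "Post status update": {
        "incident commander": "A", "comms lead": "R", "on-call engineer": "C",
    },
    "Roll back deploy": {
        "incident commander": "A", "on-call engineer": "R", "comms lead": "I",
    },
}


def raci_problems(table: dict[str, dict[str, str]]) -> list[str]:
    """Return human-readable problems found in a RACI table."""
    problems = []
    for task, assignments in table.items():
        codes = list(assignments.values())
        if codes.count("A") != 1:
            problems.append(
                f"{task}: needs exactly one Accountable, found {codes.count('A')}"
            )
        if codes.count("R") < 1:
            problems.append(f"{task}: no Responsible role assigned")
    return problems


print(raci_problems(RACI))  # []
```

Running the same check on a generated runbook's RACI table is a quick way to spot tasks that nobody owns.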
Finally, Chaos Engineering principles (pioneered by Netflix) suggest that runbooks should be validated through game days and tabletop exercises, not just written and filed. A runbook that hasn't been walked through under simulated conditions will fail in unexpected ways during a real incident — usually at the decision tree branches and edge cases.
Prompt variations
You are a senior database reliability engineer. Create a step-by-step incident response runbook for a PostgreSQL primary database failure on Google Cloud Platform.
Audience: On-call DBA, application team lead, incident commander
Environment: GCP Cloud SQL, PostgreSQL 15, Cloud Monitoring, PagerDuty, Google Chat
SLAs: Detect under 3 minutes, promote replica under 10 minutes, restore full read-write traffic under 30 minutes
Required sections:
- Severity matrix with connection loss thresholds
- Replica promotion checklist (step-by-step with GCP Console and gcloud CLI commands)
- Application reconnection verification steps
- Rollback procedure if promotion fails
- Internal and customer-facing communication templates
- Post-incident data integrity checks
Edge cases: Replication lag exceeding 60 seconds, promotion blocked by active long-running transactions, application connection pooler not releasing stale connections
Format: Plain language, numbered steps, checkboxes, max 2 pages, include RACI table
You are a customer success operations lead. Create an incident response runbook for a full SaaS application outage affecting paying customers.
Audience: Customer success managers, support team leads, VP of Customer Experience — not engineers
Tools: Salesforce, Zendesk, Statuspage.io, Slack, email
SLAs: First customer communication within 5 minutes of P1 confirmation, status page update every 15 minutes, executive briefing within 30 minutes
Required sections:
- Role assignments (CSM lead, support triage, executive liaison)
- Customer communication templates for email, Zendesk macro, and Statuspage by severity level
- High-value account escalation checklist (accounts over $50K ARR)
- Internal Slack update cadence and channel list
- Post-incident customer follow-up email template
- Decision tree: when to offer SLA credits proactively
Constraints: No technical jargon, empathetic tone, all templates copy-ready, numbered steps
Edge cases: Partial outage affecting only one region, outage during peak business hours, repeat outage within 30 days
You are a senior information security engineer. Create a step-by-step incident response runbook for a suspected unauthorized access event involving customer PII.
Audience: Security analyst (first responder), CISO, legal counsel, DPO
Environment: AWS, Okta, Splunk SIEM, Jira, Slack (security channel)
Regulatory constraints: GDPR 72-hour breach notification, SOC 2 Type II evidence preservation
Required sections:
- Containment steps (account isolation, session revocation, access key rotation)
- Evidence preservation checklist (Splunk queries, CloudTrail export, chain of custody log)
- Severity classification matrix (suspected vs. confirmed breach, number of records affected)
- Legal and regulatory notification decision tree
- Internal communication templates (CISO brief, board update)
- External notification templates (regulatory body, affected users)
- Post-incident forensic action items
Edge cases: Insider threat scenario, compromised third-party OAuth token, breach discovered by external researcher
Format: Numbered steps, checkboxes, timestamps, evidence log template included, max 3 pages
You are an experienced site reliability engineer. Create a simple, easy-to-follow incident response runbook for a small engineering team (3-5 people) managing a Node.js web application hosted on Heroku.
Audience: Junior and mid-level engineers with no dedicated on-call role
Tools: Heroku, Papertrail logs, UptimeRobot, Slack, GitHub
SLAs: Acknowledge within 10 minutes, restore service within 45 minutes
Required sections:
- How to detect the incident (UptimeRobot alert, Slack notification)
- First 5 steps any engineer should take immediately
- Common causes with matching fixes (dyno crash, memory limit, bad deploy)
- How to roll back a bad deploy on Heroku (exact commands)
- Who to notify and what to say (Slack template)
- What to document after the incident resolves
Constraints: Plain language, assume no prior on-call experience, keep it under 1 page, use numbered steps and checkboxes
Format: Beginner-friendly, no acronyms without definitions, practical over comprehensive
When to use this prompt
SRE Managers
Standardize runbooks across services so on-call engineers follow consistent steps with clear SLAs and escalation paths.
Product Managers
Document incident communication flows and customer impact updates for major feature launches with minimal downtime risk.
Customer Success Leaders
Prepare ready-to-send outage updates and FAQs to reduce support volume during high-severity incidents.
Engineers On Call
Use a clear investigation and mitigation checklist that matches your monitoring and deployment tools.
IT Operations Directors
Create audit-ready, repeatable processes for compliance reviews and post-incident reporting.
Pro tips
1. Define measurable SLAs to anchor urgency and guide escalation timing.
2. List the exact tools and integrations so the runbook references real commands and dashboards.
3. Specify communication templates and channels to cut delays and keep messaging consistent.
4. Include 2-3 probable edge cases to avoid stalls when the main path doesn’t fit.
Once you've validated one runbook with your team, you can systematically expand your library without starting from scratch each time.
Use a master prompt template. Keep the structure section constant across all runbooks — severity matrix, RACI, investigation checklist, comms templates, rollback steps. Only swap the incident type, tools, and SLAs. This produces consistent documentation your engineers can navigate quickly because the format is always familiar.
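A master template like the one described above might be sketched in Python as follows. The wording of the template, the `build_prompt` helper, and its parameters are assumptions for illustration; the point is only that the structure stays fixed while the incident type, tools, and SLAs are swapped in.

```python
from string import Template

# The structure section stays constant across every runbook prompt.
STRUCTURE = (
    "severity matrix, RACI table, investigation checklist, "
    "comms templates, rollback steps"
)

MASTER_PROMPT = Template(
    "You are an experienced SRE lead. Create a step-by-step incident "
    "response runbook for '$incident_type'.\n"
    "Environment: $tools\n"
    "SLAs: $slas\n"
    "Structure: $structure\n"
)


def build_prompt(incident_type: str, tools: list[str], slas: dict[str, int]) -> str:
    """Fill the fixed template with the parts that vary per runbook."""
    sla_text = ", ".join(
        f"{phase} under {minutes} min" for phase, minutes in slas.items()
    )
    return MASTER_PROMPT.substitute(
        incident_type=incident_type,
        tools=", ".join(tools),
        slas=sla_text,
        structure=STRUCTURE,
    )


print(build_prompt(
    "Production API latency spike",
    ["AWS", "Kubernetes", "Datadog", "PagerDuty", "Slack"],
    {"detect": 2, "mitigate": 15, "resolve": 60},
))
```

Keeping the template in one place also makes it easy to store under version control next to the generated runbooks.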
Layer in learned edge cases. After every post-incident review, extract the unexpected failure mode and add it to the corresponding runbook prompt as a new edge case. Over six to twelve months, your prompts accumulate institutional memory that makes each regenerated runbook more accurate than the last.
Version-control your prompts, not just your runbooks. Store your runbook prompts in your team's wiki or Git repository alongside the output documents. When your toolchain changes (for example, you migrate from PagerDuty to Opsgenie, or from AWS to GCP), update the prompt and regenerate. You get a current runbook in minutes, not days.
Generate test scenarios alongside the runbook. Add a final section to your prompt: 'Include a 10-question tabletop exercise based on the edge cases above.' Use these in quarterly game days to validate that your team can actually follow the steps under simulated pressure. Runbooks that have been walked through at least once tend to hold up better in real incidents.
The core structure of an incident runbook travels across industries, but the constraints, communication layers, and compliance requirements change significantly by sector.
Financial services: Regulatory reporting windows are non-negotiable. Runbooks must embed FINRA, PCI-DSS, or FCA notification timelines directly into the escalation path. Include a legal review checkpoint before any external communication is sent. Evidence preservation steps must follow chain-of-custody standards.
Healthcare: HIPAA breach notification rules must appear as explicit steps in security incident runbooks: affected individuals must be notified without unreasonable delay and no later than 60 days after discovery, and breaches affecting 500 or more individuals must also be reported to HHS within that same window. Patient safety implications should trigger immediate escalation to clinical leadership, separate from the technical resolution path.
E-commerce: Revenue impact per minute of downtime should anchor severity levels. A checkout outage at $10,000 per minute of lost revenue warrants a different escalation path than a product catalog slowdown. Build revenue thresholds into your severity matrix and tie them to on-call escalation triggers.
SaaS B2B: Multi-tenant impact assessment is a required first step. Before mitigating, determine which customer segments are affected and whether any enterprise accounts with SLA guarantees are impacted. This drives both the technical priority and the customer communication sequence.
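For the e-commerce case above, tying revenue thresholds to severity levels can be as simple as a lookup. The sketch below is illustrative only: the $10,000 figure comes from the example in the text, while the other thresholds and the `severity_for` helper are hypothetical and would need tuning to a real business.

```python
# Hypothetical thresholds: minimum lost revenue per minute (USD) per severity,
# ordered from most to least severe.
SEVERITY_THRESHOLDS = [
    ("P1", 10_000),
    ("P2", 1_000),
    ("P3", 0),
]


def severity_for(revenue_loss_per_min: float) -> str:
    """Map estimated revenue loss per minute to a severity level."""
    for level, floor in SEVERITY_THRESHOLDS:
        if revenue_loss_per_min >= floor:
            return level
    raise ValueError("revenue loss must be non-negative")


print(severity_for(10_000))  # P1: checkout outage
print(severity_for(2_500))   # P2
```

Putting the table in the prompt lets the AI write escalation triggers in the same terms the business uses.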
The same prompt discipline that produces a strong runbook can also generate the post-incident review documentation that turns incidents into improvements.
Generate a post-mortem template alongside the runbook. Add a final section to your prompt: 'Include a post-incident review template with sections for timeline reconstruction, contributing factors, contributing systems, customer impact summary, and action items with owners and due dates.' This ensures your after-action review follows the same structure as the runbook itself.
Automate the timeline from your incident log. After an incident, paste your Slack thread, PagerDuty timeline, or incident log into an AI assistant with this prompt: 'Reconstruct a chronological incident timeline from the following log. Group events into: detection, investigation, mitigation, resolution, and communication. Note any SLA breaches and flag contributing factors.' This transforms a raw log into a structured timeline in minutes.
Extract runbook improvements automatically. End every post-mortem session with this prompt: 'Based on this post-incident review, identify three specific improvements to the runbook: one investigation step that was missing, one edge case that should be added, and one communication template that needs revision.' This creates a direct feedback loop between incidents and documentation quality, so your runbooks compound in accuracy over time.
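Once a timeline has been reconstructed, noting SLA breaches is simple arithmetic on timestamps. The following sketch assumes the targets from the example prompt and assumes each one is measured from the start of the incident; the phase names and the `sla_breaches` helper are hypothetical.

```python
from datetime import datetime

# SLA targets in minutes from the example prompt.
# Assumption: each target is measured from the start of the incident.
SLA_MINUTES = {"detection": 2, "mitigation": 15, "resolution": 60}


def sla_breaches(started_at: datetime,
                 milestones: dict[str, datetime]) -> dict[str, float]:
    """Map each breached phase to the minutes it actually took."""
    breaches = {}
    for phase, reached_at in milestones.items():
        target = SLA_MINUTES.get(phase)
        if target is None:
            continue  # ignore milestones that have no SLA target
        elapsed = (reached_at - started_at).total_seconds() / 60
        if elapsed > target:
            breaches[phase] = round(elapsed, 1)
    return breaches


start = datetime(2024, 1, 1, 3, 0)
timeline = {
    "detection": datetime(2024, 1, 1, 3, 1),
    "mitigation": datetime(2024, 1, 1, 3, 20),
    "resolution": datetime(2024, 1, 1, 3, 45),
}
print(sla_breaches(start, timeline))  # {'mitigation': 20.0}
```

The resulting breaches can be pasted into the post-incident review as evidence for which runbook step needs tightening.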
When not to use this prompt
When This Prompt Pattern Doesn't Fit
Don't use a single AI-generated runbook for incidents that span organizational boundaries — mergers, cross-company SLA disputes, or multi-vendor failures require legal review and negotiated escalation paths that AI cannot determine for you. Use this prompt to generate the technical sections, then engage legal and vendor management separately.
Avoid this approach for novel, never-before-seen failure modes in highly regulated environments. If your organization has no prior experience with a specific incident type — a novel ransomware variant, a zero-day in production infrastructure — the AI will generate plausible-sounding steps based on general knowledge, not your specific environment. Treat that output as a first draft requiring expert security review, not a deployable runbook.
Don't rely on AI-generated runbooks as your only preparation for high-consequence scenarios (data breaches involving patient records, financial system failures with regulatory implications). These require human review, tabletop exercises, and sign-off from legal and compliance stakeholders.
If your team has fewer than three engineers, a formal runbook may add overhead without value. A shared Notion checklist reviewed monthly may serve you better than a structured AI-generated document with RACI tables your team is too small to fill.
Troubleshooting
The runbook is too generic and doesn't reference our actual tools
Your prompt is missing toolchain specificity. List every tool by exact product name in the environment section: monitoring platform, alerting system, deployment tool, communication channel, log aggregation platform. Include the names of specific dashboards, alert policies, or Slack channels you use. The AI produces output at the specificity level you provide — vague tools produce vague steps.
The AI skipped the RACI table and communication templates I requested
When prompts have long structure lists, AI models sometimes omit lower-priority-looking sections. Number every required section explicitly and add a constraint like: 'Every section listed below is required. Do not skip or combine any item.' If a section is still missing, follow up with: 'You omitted the RACI table and communication templates. Generate those two sections now, formatted consistently with the runbook above.'
The investigation steps are too high-level — they say 'check logs' instead of giving actual steps
Add a specificity constraint to your prompt: 'For each investigation step, include the exact tool, the specific query or command, and the expected output that confirms or rules out that cause.' For example: 'Check Datadog APM for p99 latency above 2000ms on the payment-service in the last 15 minutes.' This forces the AI to write executable steps, not conceptual ones.
The runbook is too long — engineers won't read it during an active incident
Set an explicit page limit in the prompt (e.g., 'maximum 2 pages') and add a format constraint: 'Use numbered steps and checkboxes throughout. No prose paragraphs in the investigation or mitigation sections — bullet points and one-line steps only.' If the output is still too long, ask the AI to 'condense the investigation section to a maximum of 10 checklist items, prioritized by most common cause first.'
The rollback procedure doesn't match our actual deployment process
The prompt didn't specify your deployment toolchain. Add your CI/CD platform, deployment method, and rollback mechanism to the environment section: 'Deployments use GitHub Actions, Helm charts on Kubernetes, with rollback via helm rollback [release] [revision].' Include your typical release cadence and whether rollbacks require approval. The AI will then write rollback steps that match your actual workflow.
How to measure success
How to Evaluate Your Runbook Output
A well-generated incident response runbook should pass a practical stress test before it goes live. Check for these signals:
Completeness indicators:
- Every section specified in your prompt is present (no skipped sections)
- RACI table names real roles, not generic titles like "Team Member"
- Communication templates are copy-ready — no blanks labeled "[insert message here]"
- Decision tree covers at least the three main branching points: issue confirmed, issue unconfirmed, issue partially resolved
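Some of the completeness checks above can be approximated with a short script before a human review. This is a rough sketch that assumes section names appear verbatim in the output; the section list and both helper names are hypothetical and should mirror whatever your own prompt requires.

```python
import re

# Hypothetical list; mirror the sections your own prompt demands.
REQUIRED_SECTIONS = [
    "Severity matrix",
    "RACI",
    "Investigation checklist",
    "Rollback",
    "Communication templates",
    "Decision tree",
]


def missing_sections(runbook_text: str, required=REQUIRED_SECTIONS) -> list[str]:
    """Return required section names that never appear in the runbook text."""
    lowered = runbook_text.lower()
    return [name for name in required if name.lower() not in lowered]


def unfilled_placeholders(runbook_text: str) -> list[str]:
    """Find leftover '[insert ...]' blanks that keep a template from being copy-ready."""
    return re.findall(r"\[insert[^\]]*\]", runbook_text, flags=re.IGNORECASE)


draft = "Severity matrix\nRACI table\nRollback steps\nStatus: [insert message here]\n"
print(missing_sections(draft))
print(unfilled_placeholders(draft))
```

A substring match is crude, so treat a clean result as a prompt for review rather than proof that the section is good.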
Specificity indicators:
- Investigation steps reference your actual tools by name
- SLA thresholds appear in the severity matrix and trigger escalation steps
- Rollback procedure matches your deployment method
- Edge cases result in different steps, not the same steps repeated
Usability indicators:
- An engineer unfamiliar with the service can follow step 1 without asking a question
- The document fits on two pages or fewer
- Steps use numbered lists and checkboxes, not prose paragraphs
- The first five steps take under three minutes to execute
Red flags in AI output:
- Vague verbs like "investigate," "review," or "assess" without a defined tool and output criterion
- Missing rollback steps or steps that say "contact your deployment team"
- Communication templates addressed to no specific audience
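The first red flag, vague verbs with no tool attached, lends itself to a simple lint pass. The sketch below is a heuristic under stated assumptions: the verb list comes from the text above, while the tool list and the `flag_vague_steps` helper are hypothetical and should be replaced with the tools named in your own prompt.

```python
import re

VAGUE_VERBS = ("investigate", "review", "assess")
# Hypothetical tool list; replace with the tools named in your own prompt.
KNOWN_TOOLS = ("datadog", "pagerduty", "kubectl", "slack", "aws")


def flag_vague_steps(steps: list[str]) -> list[str]:
    """Return steps that use a vague verb but name no known tool."""
    flagged = []
    for step in steps:
        text = step.lower()
        has_vague_verb = any(re.search(rf"\b{verb}", text) for verb in VAGUE_VERBS)
        names_tool = any(tool in text for tool in KNOWN_TOOLS)
        if has_vague_verb and not names_tool:
            flagged.append(step)
    return flagged


steps = [
    "Investigate the issue",
    "Review Datadog APM p99 latency for service:api",
    "Run kubectl get pods -n prod",
]
print(flag_vague_steps(steps))  # ['Investigate the issue']
```

Flagged steps are good candidates for a follow-up prompt asking for the exact tool, query, and expected output.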
Frequently asked questions
Make the incident type as specific as possible. 'API latency spike' produces a better runbook than 'outage,' and 'payment service timeout during checkout' produces a better runbook than 'API latency spike.' The more precisely you name the failure mode, the more the AI can tailor investigation steps, tool queries, and mitigation actions to your actual environment rather than writing generic guidance.
A single prompt does not cover multiple services effectively. Runbooks are most useful when they're scenario-specific and tool-specific. If you try to cover five services in one prompt, the AI generalizes and omits the exact steps engineers need. Run the prompt once per critical service or incident type. You'll invest more time upfront but produce runbooks your team will actually use under pressure.
If your team uses a different stack, substitute your actual tools in the environment section. Replace Datadog with Grafana or New Relic, PagerDuty with Opsgenie or VictorOps, Slack with Teams or Google Chat. The AI will reference the tools you specify. The more accurately you name your stack, the more the runbook reflects real workflows rather than theoretical ones.
Generic steps usually mean the prompt lacked tool and environment specificity. Add: exact tool names, dashboard names, CLI commands you use, and alert names from your monitoring setup. Also add 2-3 real edge cases your team has encountered. Specificity in the prompt produces specificity in the output — the AI mirrors the level of detail you provide.
If you have an existing runbook format, include it. Paste the section headers and any required fields into the prompt under 'Structure.' This forces the AI to follow your org's documentation standard rather than inventing a new one. You'll get output that drops directly into your existing wiki or runbook library with minimal reformatting.
To get usable communication templates, provide context: who sends the message, which channel or platform, what tone is expected, and what the audience already knows. For example: 'Write a Statuspage update for paying customers who don't understand technical details, acknowledging the issue without specifying root cause, with an estimated resolution time.' Constrained templates produce copy-ready text. Open-ended requests produce drafts.
The AI can produce runbooks that include compliance-relevant elements — evidence preservation steps, chain-of-custody logs, notification timelines, and audit trail checkpoints — if you specify the regulation by name and list the required artifacts. Always have your compliance team or legal counsel review the output before treating it as audit-ready documentation.
Aim for one to two pages for most operational incidents. Longer runbooks don't get read during active incidents. Specify a page limit in your prompt. If your incident is complex (multi-team, regulatory implications, multi-region), allow up to three pages but demand that each section use numbered steps and checkboxes so engineers can scan rather than read.