Why this is hard to get right
The 2 AM Postmortem Problem
Maria is a senior SRE at a mid-sized fintech company. Her team just resolved a four-hour database outage — a primary node failure that knocked out payment processing for customers across three regions until a replica was promoted. It's now 6 AM. She has a Slack thread with 200 messages, a PagerDuty timeline, three competing theories about root cause, and a standing 9 AM meeting with the VP of Engineering.
She needs a postmortem. A real one — not a wall of bullet points or a blame session dressed up as a report.
Maria's first attempt is what most engineers write under pressure:
"Write a postmortem for our database failover on Friday night. It caused payment failures. We fixed it by promoting a replica."
She pastes this into an AI assistant. The output she gets is a generic five-paragraph template. It has headings like "What Happened" and "Next Steps," but no timeline, no contributing factors, and action items like "improve monitoring" with no owner, no date, and no way to track completion. Leadership will ask three follow-up questions. Her on-call engineer will ask four more. Nothing is actually actionable.
The problem isn't the AI. It's the prompt.
Postmortems are hard to write well because they require synthesizing four different types of information simultaneously: a factual chronology, a causal chain, a stakeholder-facing narrative, and a prevention plan with real accountability. Most engineers are strong at one or two of these — not all four under time pressure.
A generic prompt produces generic output because it gives the AI nothing to work with. No timeline anchors. No impact numbers. No distinction between the root cause and the contributing factors. No clarity on who owns what.
When Maria restructures her prompt — specifying her role as the author, naming the audience (internal teams and leadership), pasting in the relevant timeline and alert data, defining the exact output sections she needs, and setting a word limit with a blame-free tone requirement — the AI produces something she can actually use. The executive summary is in plain language. The timeline table has timestamps and evidence. The five action items each have an owner, a priority level, a due date, and a success metric her team can verify.
She walks into the 9 AM meeting with a document that answers questions before they're asked.
That's the difference a well-structured prompt makes. It doesn't replace Maria's engineering judgment. It forces her to organize what she already knows into the inputs the AI needs to do its job. The result is a postmortem that teams trust, leadership can act on, and future on-call engineers can learn from — written in one pass instead of three drafts.
Common mistakes to avoid
Omitting Impact Numbers and Regions
Vague impact descriptions like "some customers were affected" force the AI to write equally vague summaries. Always provide specific numbers — affected users, transaction failure rate, regions, revenue exposure. Concrete inputs produce measurable executive summaries that leadership can reference in future planning.
Skipping the Contributing Factors Instruction
Asking only for a root cause produces a one-dimensional analysis. Real incidents have a root cause plus 2-3 contributing factors — a missed alert, a config drift, a capacity assumption. Explicitly request contributing factors or the AI will collapse everything into a single cause and oversimplify prevention.
Leaving Action Items Without Owners or Dates
Generic prompts return action items like "improve alerting" with no owner, no due date, and no success metric. These are untrackable. Specify in your prompt that each action item must include an owner, a priority level, a due date, and a measurable success criterion. Otherwise follow-ups stall in the next sprint.
Ignoring Detection and Response Gaps
Most postmortems focus on what broke, not on how long it took to detect and why. Detection gaps are where your monitoring and alerting strategy actually lives. If your prompt doesn't ask for detection method, time-to-alert, and escalation path, the AI skips the section that prevents the next incident from lasting just as long.
Not Setting a Blameless Tone Constraint
Without explicit tone guidance, AI output can drift toward assigning fault to individuals or teams. Blameless postmortems are a cultural practice, not just a style preference. Add a direct instruction — 'avoid blame, focus on systems and processes' — or you risk producing a document that damages team trust instead of building it.
Pasting Raw Logs Without Structure
Dumping an unformatted wall of logs into the prompt overloads the AI's context and produces low-quality timelines. Pre-process your inputs: extract the 8-12 most relevant log lines, annotate them with timestamps, and summarize what each one means. Structured inputs produce structured outputs — raw noise produces noise.
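That pre-processing step is easy to script. The sketch below shows one way to do it, assuming a hypothetical log format with leading ISO timestamps; the keyword list and log lines are illustrative, not from any real system.

```python
import re

# Hypothetical example: filter raw log lines down to the handful worth
# pasting into the prompt, keeping only lines that carry a timestamp
# and match incident-relevant keywords.
KEYWORDS = ("ERROR", "FATAL", "failover", "replica", "timeout")
TS_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}")

def extract_relevant_lines(raw_log: str, max_lines: int = 12) -> list[str]:
    """Return up to max_lines timestamped lines matching incident keywords."""
    relevant = []
    for line in raw_log.splitlines():
        line = line.strip()
        if TS_PATTERN.match(line) and any(k in line for k in KEYWORDS):
            relevant.append(line)
    return relevant[:max_lines]

raw = """2024-03-14T09:12:03 ERROR db-primary: connection refused
2024-03-14T09:12:05 INFO health-check: retrying
2024-03-14T09:14:41 FATAL db-primary: node unresponsive, initiating failover
noise without a timestamp
2024-03-14T09:19:02 INFO replica promoted to primary"""

for line in extract_relevant_lines(raw):
    print(line)
```

After filtering, annotate each surviving line with a one-sentence note about what it means before pasting it into the prompt.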
The transformation
Write a postmortem for our outage yesterday and include what happened and how to prevent it.
You’re a senior SRE writing an incident postmortem for internal teams and leadership. Use these inputs: **incident notes:** [paste timeline, alerts, key logs], **service:** [name], **date/time:** [UTC], **customer impact:** [numbers, regions], **duration:** [mins], **root cause hypothesis:** [text], **fixes already shipped:** [list]. 1. Write a 1-paragraph executive summary in plain language. 2. Build a timeline table (time, event, evidence). 3. Identify root cause and 2 contributing factors. 4. List **5 action items** with **owner, priority, due date, and success metric**. Keep it under **900 words**. Avoid blame. Use concise headings.
Why this works
Role and Audience Anchoring
The After Prompt opens with 'You're a senior SRE writing an incident postmortem for internal teams and leadership.' This dual-audience framing forces the AI to calibrate tone — technical enough for engineers, plain enough for leadership — without you having to write two separate documents.
Structured Input Slots
The After Prompt defines exactly what data to paste in: timeline, alerts, key logs, service name, date, customer impact numbers, duration, root cause hypothesis, and fixes already shipped. This prevents the AI from hallucinating facts when evidence is missing and anchors every claim in the document to your actual incident data.
Enumerated Output Sections
The numbered list — executive summary, timeline table, root cause plus contributing factors, five action items — removes all ambiguity about structure. The AI doesn't decide what sections matter. You do. This produces consistent postmortems across incidents, which makes quarter-over-quarter pattern analysis possible.
Action Item Accountability Schema
Requiring owner, priority, due date, and success metric for every action item transforms the postmortem from a narrative into a project plan. This specific schema is embedded directly in the After Prompt, so the AI knows it cannot write a vague follow-up without violating the prompt's explicit constraints.
Hard Constraints Prevent Scope Creep
The After Prompt sets a 900-word cap, requires concise headings, and mandates a blame-free style. These constraints force compression, clarity, and cultural alignment in a single instruction. Without them, AI output balloons into unreadable documents that no one in leadership actually reads.
The framework behind the prompt
The Theory Behind Blameless Postmortems
The blameless postmortem is one of the most important cultural and analytical practices in modern software reliability engineering. Its roots trace to safety science — specifically the work of Dr. Sidney Dekker and Charles Perrow on complex system failures. Their research showed that in high-reliability industries (aviation, nuclear power, surgery), blaming individuals for failures actually made systems less safe by suppressing reporting, hiding near-misses, and ignoring the systemic conditions that made failure predictable.
Google's SRE Book formalized this approach for software engineering, introducing the concept of error budgets and postmortem-driven reliability improvement as a systematic practice. The core principle: every incident is a systems failure, not an individual failure. The goal of a postmortem is not to assign blame but to understand the causal chain well enough to interrupt it.
The OODA Loop (Observe, Orient, Decide, Act) — originally a military decision framework — maps well to incident response and postmortem structure. The timeline section covers Observe. The root cause and contributing factors analysis covers Orient. Action items cover Decide and Act. When postmortem prompts produce weak output, it's usually because one of these phases is underspecified.
The Five Whys method, developed in the Toyota Production System, provides the iterative causal analysis technique most useful for the root cause section. By asking "why" recursively until you reach an organizational or process-level factor, you avoid the common trap of stopping at the proximate technical cause and writing action items that fix symptoms instead of systems.
For prompt engineering specifically, postmortem prompts benefit from schema-constrained structure: giving the AI explicit output schemas (timeline table format, action item fields) rather than asking it to invent structure. Schema-constrained prompts produce consistent, comparable documents — which is critical when your team needs to analyze postmortem patterns across incidents over time.
Prompt variations
You are a Customer Success lead drafting an external incident summary for affected business customers.
Incident context:
- Service affected: payment processing API
- Date and duration: March 14, 09:12–13:47 UTC (4 hours 35 minutes)
- Customer impact: ~2,400 enterprise accounts experienced failed transactions
- Root cause: database primary node failure during scheduled maintenance window
- Resolution: failover to replica completed, all services restored
- Preventive action already taken: maintenance window now requires replica health check before any primary node changes
Write the following:
- A 2-sentence subject line and opening that acknowledges the impact without minimizing it.
- A plain-language explanation of what happened (no technical jargon, under 100 words).
- A clear statement of what was fixed and when service was restored.
- Three specific steps you are taking to prevent recurrence.
- A closing sentence that invites customers to reach your support team.
Keep total length under 300 words. Use a professional, empathetic tone. Do not use phrases like 'we apologize for any inconvenience.'
You are facilitating a blameless postmortem retrospective meeting for an engineering team of 8 people.
Incident summary:
- A misconfigured Kubernetes resource limit caused a memory exhaustion cascade across three microservices
- Duration: 2 hours 10 minutes
- Detection method: customer complaint, not internal alerting
- Recovery: manual pod restart and config rollback
- Contributing factors: config change was not peer-reviewed, staging environment did not replicate production memory constraints
Generate the following meeting materials:
- A 5-question agenda that guides the team from timeline review to prevention planning (45-minute session format).
- Three open-ended discussion prompts that encourage psychological safety and systemic thinking.
- A facilitation note for each agenda item suggesting what the facilitator should watch for.
- A template for capturing action items during the meeting, with fields for owner, priority, due date, and success metric.
Keep language direct and non-accusatory throughout. Format for easy screen sharing during the meeting.
You are a senior SRE preparing a quarterly reliability review for engineering leadership.
Data for this quarter:
- Total incidents: 11
- P1 incidents: 3 (database failover, CDN misconfiguration, auth service timeout)
- P2 incidents: 8
- Average time to detect: 18 minutes
- Average time to resolve: 94 minutes
- Most common root cause category: configuration changes without staged rollout (5 of 11 incidents)
- Repeat incident pattern: auth service timeouts occurred in January and again in March despite a January action item marked 'complete'
Write the following:
- A one-paragraph executive summary suitable for a VP-level audience.
- A root cause category breakdown with percentage of incidents per category.
- An analysis of the repeat auth service incident — why the January fix was insufficient and what a more durable solution would require.
- Four prioritized recommendations for Q3 with rationale, estimated engineering effort (S/M/L), and expected impact on MTTR.
Keep the tone analytical and forward-looking. Avoid re-litigating individual incidents. Under 700 words.
When to use this prompt
Engineering Leaders
Share a leadership-ready incident summary with risks, decisions, and clear follow-ups after an outage.
Customer Success Teams
Create an internal version of the incident narrative that helps you align support messaging and reduce ticket churn.
Product Managers
Translate technical incident details into user impact, prioritization tradeoffs, and roadmap adjustments.
Platform and SRE Teams
Standardize postmortems across services so you can compare causes, response gaps, and repeat patterns quarter to quarter.
Pro tips
1. Define what counts as customer impact so your summary stays measurable and consistent.
2. Add detection and alerting details so you can surface gaps in monitoring, not just the root cause.
3. Specify your action item priorities so the AI doesn’t treat minor fixes like major risks.
4. Include a success metric for each follow-up so you can validate prevention, not just completion.
The Five Whys is a root cause analysis method developed at Toyota and widely adopted in SRE practice. It works by asking 'why' iteratively until you reach a systemic cause rather than a proximate one.
To integrate Five Whys into your postmortem prompt, add this instruction after the root cause section:
'Perform a Five Whys analysis starting from the customer-visible symptom. Present each Why as a numbered step with a one-sentence answer. Stop when you reach an organizational, process, or systemic factor — not a technical component.'
This produces a causal chain that connects a database failure to, for example, a deployment process that skips pre-production validation — which is far more actionable than 'the replica wasn't promoted fast enough.'
A few important constraints to add:
- Cap the analysis at 5-7 steps. Longer chains lose reliability.
- Flag each step as 'confirmed' or 'hypothesized' based on your evidence.
- End with the systemic fix, not the technical patch.
Teams that use Five Whys consistently in postmortems reduce repeat incident categories by surfacing process gaps that technical fixes alone can't address. The technique pairs well with the contributing factors section already built into the After Prompt.
Postmortem methodology originated in aviation safety and surgical teams before it reached software engineering. The blameless, evidence-first format translates well outside pure SRE work.
Product teams use postmortem structure for failed feature launches: what was the rollout plan, what user behavior surprised us, what assumptions were wrong, what would we change in the go/no-go process?
Customer Success teams use incident summaries to align internal messaging after an outage affects key accounts. The format mirrors the engineering postmortem but replaces timeline tables with a customer impact narrative and replaces action items with account recovery steps.
Security teams adapt postmortem structure for breach response documentation, replacing root cause with attack vector analysis and adding a containment and disclosure timeline.
Data engineering teams use postmortem format for pipeline failures: what data was affected, what downstream dashboards or models consumed incorrect data, how far back did the problem reach, and what data quality checks would have caught it earlier?
In all these contexts, the core principles hold: define what happened, establish causality, identify prevention actions with owners and dates, and write for both a technical and a leadership audience.
Individual postmortems are valuable. But the compounding value comes from treating postmortems as a data source — not just a document.
High-performing SRE teams conduct a monthly or quarterly reliability review that aggregates postmortems to identify patterns:
- Which root cause categories are recurring?
- Which services generate disproportionate incidents?
- Which action items were marked complete but didn't prevent recurrence?
- Is MTTR (mean time to resolution) improving or plateauing?
To support this practice, your postmortem prompt should use consistent taxonomy for root cause categories. Define 5-8 category labels — configuration change, dependency failure, capacity limit, code defect, process gap, external provider — and instruct the AI to tag each postmortem with the appropriate category.
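Once postmortems carry consistent category tags, aggregating them is a few lines of code. Here is a minimal sketch, assuming each postmortem is represented as a record with a `category` field; the incident IDs and taxonomy values are illustrative.

```python
from collections import Counter

# Hypothetical taxonomy and postmortem records; in practice the category
# tag would come from the AI-tagged postmortem document itself.
CATEGORIES = {
    "configuration change", "dependency failure", "capacity limit",
    "code defect", "process gap", "external provider",
}

postmortems = [
    {"id": "INC-101", "category": "configuration change"},
    {"id": "INC-102", "category": "configuration change"},
    {"id": "INC-103", "category": "dependency failure"},
    {"id": "INC-104", "category": "process gap"},
]

def category_breakdown(records):
    """Count incidents per root cause category, rejecting off-taxonomy tags."""
    counts = Counter()
    for pm in records:
        cat = pm["category"]
        if cat not in CATEGORIES:
            raise ValueError(f"{pm['id']}: unknown category {cat!r}")
        counts[cat] += 1
    return counts

for cat, n in category_breakdown(postmortems).most_common():
    print(f"{cat}: {n} of {len(postmortems)}")
```

Rejecting off-taxonomy tags at ingest time is what keeps quarter-over-quarter comparisons honest — a drifting taxonomy makes the counts meaningless.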
Over 6-12 months, this taxonomy lets you run a quarterly incident pattern prompt: paste in your category counts and repeat incident list, and ask the AI to identify systemic gaps and prioritize your reliability investment areas.
The prompt on this page produces individual postmortems. The pattern review prompt turns them into organizational learning. Both are necessary for a mature reliability practice.
When not to use this prompt
Don't use this prompt pattern when:
- The incident is still active. During an active incident, you need a runbook or decision tree — not a postmortem. Run the postmortem after service is fully restored and the team has had at least a few hours of recovery time.
- You have no actual incident data. This prompt requires a real timeline, real impact numbers, and a working root cause hypothesis. If you're writing a hypothetical postmortem for training purposes, use a scenario-based prompt instead and label the output clearly as a drill.
- Legal or regulatory proceedings are involved. If the incident has triggered a breach notification requirement, SLA penalty dispute, or legal investigation, do not use AI to draft the postmortem without legal review. AI-generated causal analysis could be discoverable and may conflict with counsel's strategy.
- The incident affected only internal tooling with no customer impact. For minor internal disruptions, a brief Slack summary and a one-line entry in your incident log is sufficient. A full postmortem consumes team time that has real opportunity cost — reserve it for customer-affecting or system-revealing incidents.
For compliance-adjacent incidents, consult your legal team before publishing any AI-drafted incident analysis externally.
Troubleshooting
The AI writes a generic timeline with no specific timestamps or evidence
Your input timeline is too vague. Reformat your timeline before pasting it: each row should have an exact timestamp (UTC), a one-sentence event description, and a source (alert ID, log line, Slack message). Then add this instruction: 'Use only the timestamps and evidence I've provided — do not infer or estimate times not in the input.'
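The reformatting itself can be a tiny script. This sketch turns loose incident notes into the timestamp / event / source rows the prompt expects; the event data and source IDs here are made up for illustration.

```python
# Hypothetical pre-formatting step: turn loose incident notes into the
# timestamp / event / source rows the prompt expects.
events = [
    ("2024-03-14T09:12:03Z", "Primary DB stopped accepting connections", "alert PD-4417"),
    ("2024-03-14T09:14:41Z", "Failover initiated by on-call", "Slack #incident-431"),
    ("2024-03-14T09:19:02Z", "Replica promoted to primary", "log line db-primary-2"),
]

def to_timeline_rows(evts):
    """Render events as markdown table rows ready to paste into the prompt."""
    header = "| Time (UTC) | Event | Source |\n|---|---|---|"
    rows = [f"| {t} | {desc} | {src} |" for t, desc, src in evts]
    return "\n".join([header, *rows])

print(to_timeline_rows(events))
```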
Action items are too vague to track — no real owners or measurable outcomes
Add an explicit schema instruction to your prompt: 'Format each action item as: Owner (specific role or name), Priority (P1/P2/P3), Due Date (specific date, not relative), Success Metric (what measurable condition confirms this is done).' Then add: 'Do not write an action item that cannot be assigned to a single owner.' This forces specificity the AI won't produce without direct constraint.
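To make the schema concrete, here is a minimal sketch of those four required fields as a validated record. The field names mirror the prompt's instruction; the example values and validation rules are illustrative, not tied to any particular tracking tool.

```python
from dataclasses import dataclass
from datetime import date

# Illustrative sketch of the action item schema the prompt enforces;
# field names mirror the prompt's instruction, not any specific tool.
@dataclass
class ActionItem:
    owner: str           # specific role or name, never "the team"
    priority: str        # P1 / P2 / P3
    due_date: date       # a specific date, not "next sprint"
    success_metric: str  # measurable condition that confirms completion
    description: str

    def __post_init__(self):
        if self.priority not in ("P1", "P2", "P3"):
            raise ValueError(f"invalid priority: {self.priority}")
        if not self.owner.strip():
            raise ValueError("action item must have a single named owner")

item = ActionItem(
    owner="on-call lead (Maria)",
    priority="P1",
    due_date=date(2024, 3, 28),
    success_metric="replica health check blocks maintenance when lag > 30s",
    description="Add replica health gate to the maintenance runbook",
)
```

An action item that cannot be expressed in this shape is the one the AI should not be allowed to write.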
The executive summary is too technical for a non-engineering leadership audience
Add a second audience instruction: 'Write the executive summary for a VP or C-level reader with no engineering background. Use no acronyms. Translate all technical terms into business impact.' Then add: 'The summary should answer three questions: what broke, how many customers were affected, and what you changed to prevent recurrence.'
The root cause analysis conflates the root cause with contributing factors
Define the distinction explicitly in your prompt: 'Root cause is the single condition that, if absent, would have prevented the incident. Contributing factors are conditions that worsened severity or extended duration but were not the primary cause.' Ask the AI to label each factor clearly. Without this definition, AI output treats all causes as equal, which produces a muddled analysis.
The postmortem is too long — leadership won't read past the first section
Add a strict structure instruction: 'Write a two-part document: Part 1 is the leadership summary (executive summary plus action items only, under 300 words). Part 2 is the full technical postmortem for engineering. Label each part clearly.' This lets different audiences read at the depth they need without losing any content.
How to measure success
How to Evaluate Your Postmortem Output
Before sharing your AI-generated postmortem, check it against these quality signals:
Executive summary
- A non-engineer can read it and answer: what broke, who was affected, and what changed?
- No unexplained acronyms or technical terms
- Impact is quantified (not 'some customers' — specific numbers)
Timeline table
- Every row has an exact timestamp and an evidence source
- The table covers detection, escalation, diagnosis, mitigation, and resolution
- No gaps longer than 30 minutes without an explanatory note
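The gap check above is mechanical enough to automate. A minimal sketch, assuming the timeline's timestamps are available as ISO 8601 strings; the example timeline is illustrative.

```python
from datetime import datetime, timedelta

# Hypothetical helper: scan a timeline for gaps longer than 30 minutes
# that would need an explanatory note in the postmortem.
def find_gaps(timestamps: list[str], max_gap_minutes: int = 30):
    """Return (start, end) pairs of consecutive events more than max_gap apart."""
    times = sorted(datetime.fromisoformat(t) for t in timestamps)
    limit = timedelta(minutes=max_gap_minutes)
    return [
        (a.isoformat(), b.isoformat())
        for a, b in zip(times, times[1:])
        if b - a > limit
    ]

timeline = [
    "2024-03-14T09:12:00",  # first alert
    "2024-03-14T09:19:00",  # replica promoted
    "2024-03-14T10:05:00",  # next recorded event: a 46-minute gap
    "2024-03-14T10:20:00",  # service restored
]
print(find_gaps(timeline))
```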
Root cause and contributing factors
- Root cause is a single, specific condition — not a category
- Contributing factors are labeled separately and each adds causal information
- The analysis uses evidence from the input, not general assumptions
Action items
- Every action item has a named owner, a priority level, a specific due date, and a measurable success metric
- No item is vague enough to be interpreted multiple ways
- At least one item addresses detection or alerting — not just the fix
Overall
- Total length is within your specified word limit
- Tone is systemic and blameless throughout
- The document answers the question: 'What would you tell a future on-call engineer about this incident?'
Now try it on something of your own
Reading about the framework is one thing. Watching it sharpen your own prompt is another — takes 90 seconds, no signup.
Turn your incident timeline and raw notes into a leadership-ready postmortem with action items, owners, and due dates — in one pass.
Frequently asked questions
What's the minimum data I need before using this prompt?
You need at minimum: a rough timeline with 4-6 key events, the service name, the approximate customer impact, a working hypothesis about root cause, and the fix you shipped. You don't need perfect data. Mark uncertain items as 'hypothesis' or 'approximate' in your inputs — the AI will reflect that uncertainty in the output rather than fabricating precision.
Can I use this prompt for security incidents?
Yes, with two modifications. First, add a confidentiality constraint — specify which sections are internal-only vs. shareable with external auditors. Second, replace 'root cause and contributing factors' with a section on attack vector, detection gap, and containment timeline, since security incidents require a different causal framework than infrastructure failures.
What if we haven't confirmed the root cause yet?
Change the root cause input to a hypothesis statement and add an explicit instruction: 'If root cause is uncertain, present 2-3 competing hypotheses with supporting and contradicting evidence for each.' This keeps the postmortem honest, prevents premature closure, and gives your team a structured way to continue the investigation in parallel with prevention work.
How long should the postmortem be?
900 words works well for most P1 incidents. For P2 incidents, aim for 500-600 words. For a major outage affecting many customers or with regulatory implications, you may need 1,200-1,500 words — but split it into a leadership summary and a technical appendix. Longer doesn't mean more useful: every word your on-call team skips is a missed lesson.
Does this prompt handle action item tracking?
The prompt generates action items with owner, priority, due date, and success metric — but tracking is your job. Copy the action item table directly into your project management tool (Jira, Linear, Notion) immediately after the meeting. Assign tickets the same day. Schedule a 2-week check-in. Postmortems fail when action items live only inside the postmortem document.
Should I write postmortems for staging or pre-production incidents?
Yes. Adjust the executive summary instruction to read 'internal engineering audience only' and reduce the action item count to 2-3. Staging incidents are worth documenting when they reveal production risks, monitoring gaps, or process failures — but they don't need the full leadership-facing treatment.
Can the AI use my team's existing postmortem template?
Paste your existing template structure into the prompt as the output format. Replace the numbered output list in the After Prompt with your team's section headers. The AI will fill your existing template rather than inventing a new structure. This is especially useful for teams with compliance requirements or audit trails that mandate a specific document format.