After the dust settles from an incident, teams face a choice: move on quickly and hope it doesn't happen again, or pause to learn from what happened. Organizations that consistently improve choose learning. That learning happens through postmortems.
What Is a Postmortem?
A postmortem is a documented analysis of an incident, created after the incident is resolved. It examines what happened, why it happened, how the team responded, and what should change to prevent similar incidents or improve future response.
The term comes from medicine—an examination after death to understand what caused it. In technology, we examine "dead" incidents to understand what killed our service, even temporarily.
An incident that costs your business money and damages customer trust becomes worthwhile only if it makes your systems and processes better. That's what postmortems do: transform expensive failures into valuable learning.
When to Conduct Postmortems
Not every incident deserves a full postmortem. Creating one requires time and effort that should be invested where it provides the most value.
Always conduct postmortems for:
- Severity 1 incidents (complete outages or critical failures)
- Incidents that violated SLAs or caused customer escalations
- Incidents that resulted in data loss or security breaches
- Any incident that affected customers for more than an hour
- Incidents that revealed systemic issues, even if impact was minor
Consider postmortems for:
- Repeated incidents of the same type
- Near-misses that almost caused major problems
- Incidents where response was particularly chaotic or slow
- Incidents that surfaced interesting technical or process issues
Skip postmortems for:
- Very minor incidents with obvious causes and fixes
- Incidents identical to recent incidents already reviewed
- Transient issues caused by external factors completely outside your control
The goal isn't to create postmortems for everything—it's to learn from incidents that provide valuable lessons.
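The criteria above could be encoded as a small triage helper in incident tooling. The following is a minimal sketch, not a standard: the `Incident` record, its field names, and the `postmortem_decision` function are all hypothetical, and judgments such as what counts as a "systemic issue" are assumed to be made by a person before the record is filled in.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative only."""
    severity: int                          # 1 = complete outage or critical failure
    customer_impact_minutes: int = 0
    violated_sla: bool = False
    customer_escalation: bool = False
    data_loss_or_breach: bool = False
    systemic_issue: bool = False
    repeat_of_same_type: bool = False
    near_miss: bool = False
    chaotic_or_slow_response: bool = False
    duplicate_of_reviewed_incident: bool = False


def postmortem_decision(incident: Incident) -> str:
    """Return 'always', 'consider', or 'skip', following the lists above."""
    always = (
        incident.severity == 1
        or incident.violated_sla
        or incident.customer_escalation
        or incident.data_loss_or_breach
        or incident.customer_impact_minutes > 60
        or incident.systemic_issue
    )
    if always:
        return "always"
    # Assumption: an incident identical to one already reviewed is skipped
    # even if it would otherwise count as a repeat worth considering.
    if incident.duplicate_of_reviewed_incident:
        return "skip"
    consider = (
        incident.repeat_of_same_type
        or incident.near_miss
        or incident.chaotic_or_slow_response
    )
    return "consider" if consider else "skip"


print(postmortem_decision(Incident(severity=3, near_miss=True)))
```

A helper like this only makes the default explicit. A team can still choose to review an incident the rules would skip.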
The Blameless Principle
The most important postmortem principle: postmortems are blameless.
You're not investigating to find out whose fault it was. You're not looking for someone to punish. You're trying to understand how your systems, processes, and practices allowed this incident to occur.
This distinction is critical. If people fear blame, they'll hide the very information you need to prevent the next incident. They'll deflect responsibility, protect themselves, minimize their involvement. You'll learn nothing useful.
Blameless doesn't mean no accountability. It means recognizing that incidents result from complex interactions between systems, processes, and people making reasonable decisions with the information they had at the time. Finding someone to blame doesn't prevent future incidents. Understanding systemic factors and improving them does.
When someone made a mistake that contributed to an incident, a blameless postmortem examines why that mistake was possible: Was documentation unclear? Was the interface confusing? Was there time pressure? Was training inadequate? These are fixable system problems. "Don't make that mistake again" isn't a fix—it's a wish.
Postmortem Structure
Most effective postmortems follow a common structure.
Summary
A brief overview of what happened: "On December 11th at 14:23 UTC, our primary database crashed, causing a complete service outage that affected all users for 47 minutes."
The summary gives readers immediate context without requiring them to read the entire document.
Impact
Quantify the incident's effects:
- How many users were affected?
- For how long?
- What functionality was unavailable?
- What was the business impact (revenue lost, SLA violations, customer escalations)?
- Were there any data integrity issues?
Concrete numbers make impact clear and help prioritize future prevention work.
Timeline
A chronological sequence of events from incident start through resolution.
The timeline reconstructs what happened based on incident documentation, logs, and monitoring data—not memory. Memory is unreliable. Timestamps aren't.
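One way to build a timeline from recorded data is to merge timestamped events from each source and sort them. The sketch below assumes events are available as `(timestamp, source, description)` tuples. The `build_timeline` and `utc` helpers, the sample events, and the year are invented for illustration; only the 14:23 UTC start and 47-minute duration echo the summary example above.

```python
from datetime import datetime, timezone


def build_timeline(*sources):
    """Merge (timestamp, source, description) events and sort them by time."""
    events = [event for source in sources for event in source]
    return sorted(events, key=lambda event: event[0])


def utc(hour, minute):
    # Placeholder date: the year is invented for this example.
    return datetime(2024, 12, 11, hour, minute, tzinfo=timezone.utc)


# Invented sample events from two hypothetical sources.
monitoring = [
    (utc(14, 23), "monitoring", "Primary database stops responding"),
    (utc(14, 25), "monitoring", "Alert fires"),
]
chat_log = [
    (utc(14, 31), "chat", "On-call engineer acknowledges the page"),
    (utc(15, 10), "chat", "Service restored"),
]

timeline = build_timeline(chat_log, monitoring)
for timestamp, source, description in timeline:
    print(f"{timestamp:%H:%M} UTC [{source}] {description}")

# Duration from the first recorded event to the last one.
outage_minutes = int((timeline[-1][0] - timeline[0][0]).total_seconds() // 60)
print(f"Outage duration: {outage_minutes} minutes")
```

Keeping every timestamp in UTC avoids reconciling time zones when the sources come from different systems.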
Root Cause
What fundamentally caused this incident?
"The database crashed" isn't a root cause—it's the corpse. Why did it crash? Memory leak in a recent code change? Query load exceeded capacity? Hardware failure? Configuration error?
Good root cause analysis often reveals multiple contributing factors. Real incidents rarely have a single cause—they result from combinations of circumstances that aligned badly.
Contributing Factors
What conditions allowed the root cause to create an incident?
- Lack of automated failover
- Insufficient monitoring of memory usage
- No load testing before deploying the problematic code
- Alerts that didn't wake the on-call person
- Documentation that was outdated
Contributing factors are often more valuable than root causes because they represent opportunities for systemic improvement. You might not be able to prevent every bug, but you can catch them before production, fail over gracefully when they hit, and detect problems faster.
What Went Well
Not everything during an incident goes wrong. Noting what worked reinforces good practices and recognizes effective responses:
- Monitoring detected the issue within 2 minutes
- Team coordination was effective
- Failover process worked as designed
- Customer communication was timely and clear
Highlighting successes provides balance and acknowledges that your systems and processes have strengths worth maintaining.
What Went Wrong
Beyond the technical failure, what about your response or systems could have been better?
- Initial diagnosis took longer than it should have
- Runbook for this scenario was outdated
- Communication about expected resolution time was overly optimistic
- We lacked test coverage that would have caught the bug
Action Items
Specific, actionable changes to prevent recurrence or improve future response. Each action item should have:
- Clear description of what will be done
- Owner responsible for completing it
- Priority level
- Target completion date
Action items turn learning into improvement. A postmortem without action items is just documentation—valuable for understanding but not for preventing future incidents.
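The four fields listed above map naturally onto a small record, which also makes follow-up checks easy to automate. This is a minimal sketch under assumptions: the `ActionItem` class, the `overdue` helper, the priority labels, and the sample items and owners are all hypothetical and not tied to any particular tracker.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    description: str
    owner: str
    priority: str        # assumed scale: "high", "medium", "low"
    due: date
    done: bool = False


def overdue(items, today):
    """Return open action items whose target date has passed."""
    return [item for item in items if not item.done and item.due < today]


# Invented sample items for illustration.
items = [
    ActionItem("Add memory usage alert for the primary database", "alice", "high", date(2024, 12, 20)),
    ActionItem("Update the failover runbook", "bob", "medium", date(2025, 1, 10)),
    ActionItem("Add load test to the release checklist", "carol", "high", date(2024, 12, 18), done=True),
]

for item in overdue(items, today=date(2024, 12, 31)):
    print(f"OVERDUE [{item.priority}] {item.description} (owner: {item.owner}, due {item.due})")
```

In practice these records would usually live in the team's existing issue tracker. The point is that each item carries an owner and a date that can be checked.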
Lessons Learned
Broader insights beyond specific action items. What did this incident teach you about your systems, processes, or assumptions?
"We assumed failover would be faster than it was." "Load testing didn't account for real-world traffic patterns." "Our monitoring had blind spots in critical areas."
These lessons inform future decision-making even when you can't create specific action items for them.
Creating the Postmortem
The creation process matters as much as the document.
Timing
Schedule the postmortem meeting within a few days of incident resolution: soon enough that details are still fresh, but late enough that people have had time to rest and gather data.
Participants
Include people from across the incident response:
- Technical responders who investigated and fixed the issue
- Incident commander who coordinated response
- Customer support representatives who handled user inquiries
- Product or business stakeholders affected by the incident
Diverse perspectives provide a more complete picture.
Facilitation
Designate a facilitator to guide the discussion, keep it on track, and ensure everyone participates. The facilitator should prevent the discussion from becoming a blame session—which can happen subtly, through tone and phrasing, even when everyone intends to be blameless.
Good facilitators ask open questions: "What made that step challenging?" "What information would have helped?" "What assumptions turned out to be wrong?"
Documentation
One person should document the discussion in real time, capturing the timeline, causes, contributing factors, and action items as they're discussed. This scribe role is crucial: you want to capture insights while they're fresh.
Review and Publication
After the meeting, someone (often the incident commander) writes up the formal postmortem document based on notes and further investigation.
The document should be reviewed by participants for accuracy, then shared widely. Good organizations make postmortems accessible to the entire company, not just the incident response team. Transparency builds trust and spreads learning across teams.
Common Postmortem Mistakes
Blame Creep: Despite intentions to be blameless, discussions subtly shift toward assigning fault. "Why did you deploy on Friday?" quickly becomes an accusation rather than a systems question about deployment practices.
Surface-Level Analysis: Stopping at immediate causes without digging into underlying factors. "The database crashed because of a code bug" doesn't explain why the bug wasn't caught in testing, why there was no graceful degradation, or why failover took so long.
Too Many Action Items: Creating dozens of action items dilutes focus and ensures most won't get done. Better to identify 3-5 high-impact improvements that will actually be implemented.
No Follow-Through: Creating action items but never tracking completion makes postmortems performative. Action items need owners and accountability.
Skipping "What Went Well": Focusing only on problems creates a discouraging atmosphere. Recognizing what worked provides balance.
Writing for Audit, Not Learning: Some postmortems read like legal documents written to deflect liability rather than share learning. This happens when organizations punish people for mistakes, making honest analysis unsafe.
No Sharing: Keeping postmortems private within small teams wastes their learning value. Others in your organization might prevent similar incidents if they learned from your experience.
Postmortem Culture
Effective postmortems require an organizational culture that supports them.
People must feel safe sharing mistakes, admitting confusion, or acknowledging they made wrong decisions during incidents. When someone admits "I misread the monitoring graph and initially investigated the wrong service," the response should be "Thank you for sharing that—should we make the graphs clearer?" not criticism.
Organizations should treat postmortems as learning opportunities, not performance evaluations. The person whose code change caused an incident shouldn't fear for their job—they should be treated as someone with unique insight into what could be improved.
Postmortems create value only when action items get completed. Some teams dedicate a percentage of engineering time specifically to postmortem action items, ensuring they don't get perpetually deprioritized for feature work.
Organizations that view incidents as normal (they are) and as opportunities to improve (they are) get more value from postmortems than organizations that view incidents as aberrations or failures.
Beyond Individual Incidents
Over time, postmortems accumulate into organizational knowledge.
Pattern Recognition: Reviewing multiple postmortems reveals patterns. If database issues appear in 40% of postmortems, that's a clear signal about where to invest in reliability improvements.
Training Material: Postmortems provide case studies for training new team members—real examples of how systems fail and how teams respond.
Cultural Artifacts: A library of honest, blameless postmortems signals to new employees that this organization values learning and transparency.
Decision Making: When deciding between architectural approaches or process changes, past postmortems provide evidence about what kinds of issues your systems actually encounter.
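The pattern recognition described above can start as a simple tally. The sketch below assumes each postmortem has been tagged with one or more categories. The tags, the sample data, and the `category_share` function are invented for illustration.

```python
from collections import Counter

# Invented sample data: one list of category tags per postmortem.
postmortem_tags = [
    ["database", "monitoring"],
    ["deploy"],
    ["database"],
    ["network", "monitoring"],
    ["database", "deploy"],
]


def category_share(tag_lists):
    """Fraction of postmortems in which each category appears at least once."""
    # set(tags) ensures a category counts once per postmortem.
    counts = Counter(tag for tags in tag_lists for tag in set(tags))
    total = len(tag_lists)
    return {tag: count / total for tag, count in counts.most_common()}


for tag, share in category_share(postmortem_tags).items():
    print(f"{tag}: {share:.0%}")
```

A tally like this depends on consistent tagging, so it is a prompt for discussion about where to invest, not a precise measurement.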
Key Takeaways
- Postmortems transform costly incidents into valuable organizational learning by systematically analyzing what happened, why it happened, and how to improve
- Effective postmortems are blameless—if people fear blame, they'll hide the information you need to prevent the next incident
- "Don't make that mistake again" isn't a fix. Understanding why the mistake was possible, and changing the system so it's harder to make, is a fix
- A good postmortem includes summary, impact, timeline, root cause, contributing factors, what went well, what went wrong, action items, and lessons learned
- Action items must have clear owners and deadlines—a postmortem without follow-through is just documentation
- Contributing factors are often more valuable than root causes because they represent systemic improvement opportunities
- Over time, postmortems accumulate into organizational knowledge that reveals patterns and informs decisions