Postmortems

After the dust settles from an incident, teams face a choice: move on quickly and hope it doesn't happen again, or pause to learn from what happened. Organizations that consistently improve choose learning. That learning happens through postmortems.

What Is a Postmortem?

A postmortem is a documented analysis of an incident, created after the incident is resolved. It examines what happened, why it happened, how the team responded, and what should change to prevent similar incidents or improve future response.

The term comes from medicine—an examination after death to understand what caused it. In technology, we examine "dead" incidents to understand what killed our service, even temporarily.

An incident that costs your business money and damages customer trust becomes worthwhile only if it makes your systems and processes better. That's what postmortems do: transform expensive failures into valuable learning.

When to Conduct Postmortems

Not every incident deserves a full postmortem. Creating one requires time and effort that should be invested where it provides the most value.

Always conduct postmortems for:

Severity 1 incidents (complete outages or critical failures)
Incidents that violated SLAs or caused customer escalations
Incidents that resulted in data loss or security breaches
Any incident that affected customers for more than an hour
Incidents that revealed systemic issues, even if impact was minor

Consider postmortems for:

Repeated incidents of the same type
Near-misses that almost caused major problems
Incidents where response was particularly chaotic or slow
Incidents that surfaced interesting technical or process issues

Skip postmortems for:

Very minor incidents with obvious causes and fixes
Incidents identical to recent incidents already reviewed
Transient issues caused by external factors completely outside your control

The goal isn't to create postmortems for everything—it's to learn from incidents that provide valuable lessons.
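The three lists above can be read as a simple triage rule. The following is a minimal sketch in Python, not a definitive policy: the `Incident` fields, the numeric severity scale, and the function name are all hypothetical, and the judgment calls (for example, what counts as a "systemic issue") still have to be made by a person before the flags are set.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    # All field names are illustrative assumptions, not a real tool's schema.
    severity: int                          # 1 = most severe (assumed scale)
    customer_impact_minutes: int = 0
    sla_violated: bool = False
    customer_escalation: bool = False
    data_loss_or_breach: bool = False
    systemic_issue: bool = False
    repeat_of_same_type: bool = False
    near_miss: bool = False
    chaotic_response: bool = False
    interesting_issue: bool = False
    duplicate_of_reviewed: bool = False    # identical to a recently reviewed incident


def postmortem_decision(incident: Incident) -> str:
    """Return 'always', 'consider', or 'skip', following the lists above."""
    if (
        incident.severity == 1
        or incident.sla_violated
        or incident.customer_escalation
        or incident.data_loss_or_breach
        or incident.customer_impact_minutes > 60
        or incident.systemic_issue
    ):
        return "always"
    # An incident already covered by a recent review adds little new learning.
    if incident.duplicate_of_reviewed:
        return "skip"
    if (
        incident.repeat_of_same_type
        or incident.near_miss
        or incident.chaotic_response
        or incident.interesting_issue
    ):
        return "consider"
    return "skip"


print(postmortem_decision(Incident(severity=3, customer_impact_minutes=90)))
```

The "always" checks run first so that a serious incident is never skipped just because it resembles an earlier one.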

The Blameless Principle

The most important postmortem principle: postmortems are blameless.

You're not investigating to find out whose fault it was. You're not looking for someone to punish. You're trying to understand how your systems, processes, and practices allowed this incident to occur.

This distinction is critical. If people fear blame, they'll hide the very information you need to prevent the next incident. They'll deflect responsibility, protect themselves, minimize their involvement. You'll learn nothing useful.

Blameless doesn't mean no accountability. It means recognizing that incidents result from complex interactions between systems, processes, and people making reasonable decisions with the information they had at the time. Finding someone to blame doesn't prevent future incidents. Understanding systemic factors and improving them does.

When someone made a mistake that contributed to an incident, a blameless postmortem examines why that mistake was possible: Was documentation unclear? Was the interface confusing? Was there time pressure? Was training inadequate? These are fixable system problems. "Don't make that mistake again" isn't a fix—it's a wish.

Postmortem Structure

Most effective postmortems follow a common structure.

Summary

A brief overview of what happened: "On December 11th at 14:23 UTC, our primary database crashed, causing a complete service outage that lasted 47 minutes and affected all users."

The summary gives readers immediate context without requiring them to read the entire document.

Impact

Quantify the incident's effects:

How many users were affected?
For how long?
What functionality was unavailable?
What was the business impact (revenue lost, SLA violations, customer escalations)?
Were there any data integrity issues?

Concrete numbers make impact clear and help prioritize future prevention work.

Timeline

A chronological sequence of events from incident start through resolution:

14:23 UTC - Monitoring alerts for database connection failures
14:25 UTC - On-call engineer acknowledges alert, begins investigation
14:28 UTC - Incident declared, additional responders paged
14:31 UTC - Database confirmed unresponsive, failover initiated
14:45 UTC - Failover to secondary database completed
15:03 UTC - Application services restored, monitoring for stability
15:10 UTC - Incident resolved, all systems operational

The timeline reconstructs what happened based on incident documentation, logs, and monitoring data—not memory. Memory is unreliable. Timestamps aren't.
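Once the timeline is built from timestamps, the response metrics fall out of simple arithmetic. The sketch below uses the sample timeline above; the helper names are hypothetical, and it assumes all events fall on the same day.

```python
from datetime import datetime

# Events copied from the sample timeline above (all times UTC, same day).
TIMELINE = [
    ("14:23", "Monitoring alerts for database connection failures"),
    ("14:25", "On-call engineer acknowledges alert, begins investigation"),
    ("14:28", "Incident declared, additional responders paged"),
    ("14:31", "Database confirmed unresponsive, failover initiated"),
    ("14:45", "Failover to secondary database completed"),
    ("15:03", "Application services restored, monitoring for stability"),
    ("15:10", "Incident resolved, all systems operational"),
]


def minutes_between(start: str, end: str) -> int:
    """Minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def step_durations(timeline):
    """Yield (minutes since previous event, timestamp, description)."""
    previous = timeline[0][0]
    for stamp, description in timeline:
        yield minutes_between(previous, stamp), stamp, description
        previous = stamp


time_to_acknowledge = minutes_between("14:23", "14:25")
time_to_resolve = minutes_between(TIMELINE[0][0], TIMELINE[-1][0])

print(f"Time to acknowledge: {time_to_acknowledge} min")
print(f"Time to resolve: {time_to_resolve} min")
for gap, stamp, description in step_durations(TIMELINE):
    print(f"+{gap:>2} min  {stamp}  {description}")
```

Printing the gap before each event makes the slowest stretches of the response easy to spot, which is often where the discussion should focus.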

Root Cause

What fundamentally caused this incident?

"The database crashed" isn't a root cause—it's the corpse. Why did it crash? Memory leak in a recent code change? Query load exceeded capacity? Hardware failure? Configuration error?

Good root cause analysis often reveals multiple contributing factors. Real incidents rarely have a single cause—they result from combinations of circumstances that aligned badly.

Contributing Factors

What conditions allowed the root cause to create an incident?

Lack of automated failover
Insufficient monitoring of memory usage
No load testing before deploying the problematic code
Alerts that didn't wake the on-call person
Documentation that was outdated

Contributing factors are often more valuable than root causes because they represent opportunities for systemic improvement. You might not be able to prevent every bug, but you can catch them before production, fail over gracefully when they hit, and detect problems faster.

What Went Well

Not everything during an incident goes wrong. Noting what worked reinforces good practices and recognizes effective responses:

Monitoring detected the issue within 2 minutes
Team coordination was effective
Failover process worked as designed
Customer communication was timely and clear

Highlighting successes provides balance and acknowledges that your systems and processes have strengths worth maintaining.

What Went Wrong

Beyond the technical failure itself, which parts of your response or your systems fell short?

Initial diagnosis took longer than it should have
Runbook for this scenario was outdated
Communication about expected resolution time was overly optimistic
We lacked test coverage that would have caught the bug

Action Items

Specific, actionable changes to prevent recurrence or improve future response. Each action item should have:

Clear description of what will be done
Owner responsible for completing it
Priority level
Target completion date

Action items turn learning into improvement. A postmortem without action items is just documentation—valuable for understanding but not for preventing future incidents.
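One way to keep action items honest is to record them as structured data so that missing owners and slipped dates are easy to spot. This is a minimal sketch under assumptions: the `ActionItem` type, the priority labels, the team names, and the dates are all hypothetical, and most teams would track these in their existing issue tracker instead.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class ActionItem:
    description: str   # what will be done
    owner: str         # person or team responsible
    priority: str      # e.g. "high", "medium", "low" (labels are an assumption)
    due: date          # target completion date
    done: bool = False


def missing_fields(item: ActionItem) -> List[str]:
    """Names of required text fields that were left blank."""
    return [
        name
        for name in ("description", "owner", "priority")
        if not getattr(item, name).strip()
    ]


def overdue(items: List[ActionItem], today: date) -> List[ActionItem]:
    """Open items whose target date has passed."""
    return [item for item in items if not item.done and item.due < today]


# Hypothetical items for illustration only.
items = [
    ActionItem("Add memory usage alerting for the primary database",
               "db-team", "high", date(2025, 1, 10)),
    ActionItem("Update the failover runbook", "", "medium", date(2025, 1, 20)),
]

for item in items:
    print(item.description, "-> missing:", missing_fields(item))
print("Overdue:", [i.description for i in overdue(items, date(2025, 1, 15))])
```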

Lessons Learned

Broader insights beyond specific action items. What did this incident teach you about your systems, processes, or assumptions?

"We assumed failover would be faster than it was." "Load testing didn't account for real-world traffic patterns." "Our monitoring had blind spots in critical areas."

These lessons inform future decision-making even when you can't create specific action items for them.

Creating the Postmortem

The creation process matters as much as the document.

Timing

Schedule the postmortem meeting within a few days of incident resolution: soon enough that details are still fresh, but late enough that people have had time to rest and gather data.

Participants

Include people from across the incident response:

Technical responders who investigated and fixed the issue
The incident commander who coordinated the response
Customer support representatives who handled user inquiries
Product or business stakeholders affected by the incident

Diverse perspectives provide a more complete picture.

Facilitation

Designate a facilitator to guide the discussion, keep it on track, and ensure everyone participates. The facilitator should prevent the discussion from becoming a blame session—which can happen subtly, through tone and phrasing, even when everyone intends to be blameless.

Good facilitators ask open questions: "What made that step challenging?" "What information would have helped?" "What assumptions turned out to be wrong?"

Documentation

One person should document the discussion in real time, capturing the timeline, causes, factors, and action items as they're discussed. This scribe role is crucial: you want to capture insights while they're fresh.

Review and Publication

After the meeting, someone (often the incident commander) writes up the formal postmortem document based on notes and further investigation.

The document should be reviewed by participants for accuracy, then shared widely. Good organizations make postmortems accessible to the entire company, not just the incident response team. Transparency builds trust and spreads learning across teams.

Common Postmortem Mistakes

Blame Creep: Despite intentions to be blameless, discussions subtly shift toward assigning fault. "Why did you deploy on Friday?" quickly becomes an accusation rather than a systems question about deployment practices.

Surface-Level Analysis: Stopping at immediate causes without digging into underlying factors. "The database crashed because of a code bug" doesn't explain why the bug wasn't caught in testing, why there was no graceful degradation, or why failover took so long.

Too Many Action Items: Creating dozens of action items dilutes focus and ensures most won't get done. Better to identify 3-5 high-impact improvements that will actually be implemented.

No Follow-Through: Creating action items but never tracking completion makes postmortems performative. Action items need owners and accountability.

Skipping "What Went Well": Focusing only on problems creates a discouraging atmosphere. Recognizing what worked provides balance.

Writing for Audit, Not Learning: Some postmortems read like legal documents written to deflect liability rather than share learning. This happens when organizations punish people for mistakes, making honest analysis unsafe.

No Sharing: Keeping postmortems private within small teams wastes their learning value. Others in your organization might prevent similar incidents if they learned from your experience.

Postmortem Culture

Effective postmortems require an organizational culture that supports them.

People must feel safe sharing mistakes, admitting confusion, or acknowledging they made wrong decisions during incidents. When someone admits "I misread the monitoring graph and initially investigated the wrong service," the response should be "Thank you for sharing that—should we make the graphs clearer?" not criticism.

Organizations should treat postmortems as learning opportunities, not performance evaluations. The person whose code change caused an incident shouldn't fear for their job—they should be treated as someone with unique insight into what could be improved.

Postmortems create value only when action items get completed. Some teams dedicate a percentage of engineering time specifically to postmortem action items, ensuring they don't get perpetually deprioritized for feature work.

Organizations that view incidents as normal (they are) and as opportunities to improve (they are) get more value from postmortems than organizations that view incidents as aberrations or failures.

Beyond Individual Incidents

Over time, postmortems accumulate into organizational knowledge.

Pattern Recognition: Reviewing multiple postmortems reveals patterns. If database issues appear in 40% of postmortems, that's a clear signal about where to invest in reliability improvements.

Training Material: Postmortems provide case studies for training new team members—real examples of how systems fail and how teams respond.

Cultural Artifacts: A library of honest, blameless postmortems signals to new employees that this organization values learning and transparency.

Decision Making: When deciding between architectural approaches or process changes, past postmortems provide evidence about what kinds of issues your systems actually encounter.
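The pattern-recognition point above comes down to counting. As a rough sketch, assuming each postmortem has been tagged with one or more categories (the `categories` key and the sample tags are hypothetical), the share of postmortems touching each area can be computed like this:

```python
from collections import Counter


def category_shares(postmortems):
    """Map each category to the fraction of postmortems tagged with it."""
    total = len(postmortems)
    if total == 0:
        return {}
    # set() ensures a category counts once per postmortem, even if repeated.
    counts = Counter(
        category for pm in postmortems for category in set(pm["categories"])
    )
    return {category: n / total for category, n in counts.most_common()}


# Hypothetical tags for five past postmortems.
postmortems = [
    {"title": "Primary database crash", "categories": ["database"]},
    {"title": "Slow queries after deploy", "categories": ["database", "deploy"]},
    {"title": "Expired certificate", "categories": ["configuration"]},
    {"title": "Bad config push", "categories": ["configuration", "deploy"]},
    {"title": "Queue backlog", "categories": ["capacity"]},
]

shares = category_shares(postmortems)
for category, share in shares.items():
    print(f"{category}: {share:.0%} of postmortems")
```

The numbers are only as good as the tagging, so it helps to agree on a small, fixed set of categories before comparing periods.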

Key Takeaways

Postmortems transform costly incidents into valuable organizational learning by systematically analyzing what happened, why it happened, and how to improve
Effective postmortems are blameless—if people fear blame, they'll hide the information you need to prevent the next incident
"Don't make that mistake again" isn't a fix. Understanding why the mistake was possible, and changing the system so it's harder to make, is a fix
A good postmortem includes a summary, impact, timeline, root cause, contributing factors, what went well, what went wrong, action items, and lessons learned
Action items must have clear owners and deadlines—a postmortem without follow-through is just documentation
Contributing factors are often more valuable than root causes because they represent systemic improvement opportunities
Over time, postmortems accumulate into organizational knowledge that reveals patterns and informs decisions
