When incidents occur, the immediate question is "how do we fix this?" But after restoration comes a deeper question: "why did this happen?"
Root cause analysis is how you answer that question. Not by finding what broke—that's usually obvious—but by finding where a small change would have prevented the whole thing.
What Root Cause Analysis Actually Is
Root cause analysis (RCA) is a structured way to move from "what happened" to "what should we change."
If an incident is like a disease, the proximate cause is the fever—what you notice first. The root cause is the infection. Treating fever makes you feel better temporarily. Treating the infection actually cures you.
The technique is simple: keep asking "why" until you reach something you can change that would prevent similar incidents.
The Trap of Proximate Causes
The most common mistake is stopping at the obvious trigger.
"The database crashed because of a bug in the new code."
True. But not useful beyond "fix that specific bug." You'll fix it, ship another bug next month, and crash again.
Root cause analysis digs deeper:
- Why was this bug in the code? → Code review missed it
- Why didn't testing catch it? → No tests for this edge case
- Why was deployment done when on-call coverage was thin? → No deployment calendar
- Why did one bug take down the entire database? → No isolation between components
Each question reveals a leverage point—a place where a small change prevents not just this incident but entire categories of future incidents.
The Five Whys
The classic technique is deceptively simple: ask "why" five times.
Problem: Website went down
- Why? Database server crashed
- Why? A query caused an out-of-memory error
- Why? The query loaded an entire table into memory
- Why? No query review process before deployment
- Why? Team prioritized shipping speed over review thoroughness
Watch what happened. You started with a server crash—a technical problem. You ended with team priorities—a human problem. That's not a bug in the technique. That's the technique working.
The number five isn't magical. Sometimes you reach root causes in three whys. Sometimes it takes seven. The point is to keep going until you reach something actionable and systemic.
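To make the second and third whys concrete, here's a minimal sketch of the kind of query problem described above, written against Python's built-in sqlite3 module. The events table name and the batch size are made up for illustration; the point is the contrast between materializing an entire table in memory and iterating over it in bounded batches.

```python
import sqlite3

def load_entire_table(conn):
    # The pattern behind the outage: fetchall() pulls every row of the
    # (hypothetical) events table into memory at once, so memory use
    # grows with the size of the table.
    cursor = conn.execute("SELECT * FROM events")
    return cursor.fetchall()

def stream_in_batches(conn, batch_size=1000):
    # A bounded alternative: fetchmany() holds at most batch_size rows
    # in memory no matter how large the table grows.
    cursor = conn.execute("SELECT * FROM events")
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        yield from rows

# Usage sketch: rows = stream_in_batches(sqlite3.connect("app.db"))
```

The fourth and fifth whys are about why nothing forced the second shape before the query reached production.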
Multiple Factors Always
Real incidents never have a single root cause. They result from combinations of factors that happened to align.
Consider a service outage:
- A deployment introduced a memory leak
- Monitoring didn't alert on memory usage
- The deployment happened right before a traffic spike
- The on-call engineer was in a meeting when alerts fired
- Failover failed because the backup was misconfigured
- Documentation for manual failover was outdated
Each factor contributed. Address only one and you're still vulnerable to similar combinations causing future incidents.
The Swiss Cheese Model
James Reason's "Swiss Cheese Model" explains why incidents happen despite multiple safeguards.
Imagine each defensive layer (testing, code review, monitoring, redundancy) as a slice of Swiss cheese. Each slice has holes (weaknesses), but the holes sit in different places, so a problem that slips past one layer is usually caught by another.
Incidents occur when holes in multiple layers happen to line up. A problem slips through every defense.
Every incident is a story of holes lining up. Your job is to move the cheese—change where the holes are, add more slices, make the holes smaller.
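A rough way to see why layered defenses usually work, and why they still sometimes fail, is to treat each layer as missing some fraction of problems. If the layers were independent, the chance of a problem slipping through all of them would be the product of those fractions. The miss rates below are invented, illustrative numbers, and real layers are rarely independent, which is exactly why holes line up more often than this naive estimate suggests.

```python
from math import prod

# Illustrative miss rates: the fraction of problems each defensive layer
# fails to catch. These numbers are invented for the example.
miss_rates = {
    "code review": 0.20,
    "automated tests": 0.10,
    "staging environment": 0.15,
    "monitoring and alerting": 0.05,
}

# Under an independence assumption, a problem must find the hole in
# every slice at once to become an incident.
p_incident = prod(miss_rates.values())
print(f"Chance a problem slips through every layer: {p_incident:.4%}")
# Prints 0.0150% with these numbers; correlated weaknesses (shared
# assumptions, shared time pressure) push the real figure higher.
```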
Common Root Cause Categories
While each incident is unique, root causes cluster into recognizable patterns.
Technical Debt: Systems with accumulated shortcuts are fragile. Many incidents trace back to "we knew this was brittle but hadn't prioritized fixing it."
Inadequate Testing: Missing test coverage, tests that don't reflect production behavior, time pressure that led to skipped tests, test environments that don't match production.
Insufficient Monitoring: Gaps in what's monitored, thresholds that miss real problems, alerts that don't reach people, missing monitoring for dependencies.
Documentation Gaps: Runbooks that don't match current systems, undocumented tribal knowledge, missing context about why systems work certain ways.
Process Failures: Deployment processes that allow risky changes, on-call rotations that leave people unprepared, communication breakdowns between teams.
Capacity Issues: Databases exceeding designed load, networks hitting bandwidth limits, storage filling unexpectedly, APIs exceeding rate limits.
Human Factors: Unclear interfaces, time pressure encouraging shortcuts, inadequate training, cognitive load leading to mistakes. Note: "human error" is never a root cause. It's a starting point for asking why that error was possible.
How to Conduct Root Cause Analysis
Gather Everything
Collect comprehensive information:
- Timeline of what happened
- Logs, metrics, and traces
- Changes made before the incident
- Actions taken during response
- Observations from everyone involved
More information enables deeper analysis.
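One lightweight way to keep that collection organized is a single structured record per incident. The field names below are an assumption rather than any standard; the point is that each kind of artifact listed above gets an explicit home instead of living in scattered threads.

```python
from dataclasses import dataclass, field

@dataclass
class IncidentRecord:
    """Everything gathered for one incident's root cause analysis."""
    incident_id: str
    timeline: list[str] = field(default_factory=list)           # timestamped narrative of events
    telemetry: list[str] = field(default_factory=list)          # links to logs, metrics, traces
    preceding_changes: list[str] = field(default_factory=list)  # deploys, config edits, migrations
    response_actions: list[str] = field(default_factory=list)   # what responders did, and when
    observations: list[str] = field(default_factory=list)       # notes from everyone involved

# Hypothetical usage:
record = IncidentRecord(incident_id="2024-06-03-checkout-outage")
record.preceding_changes.append("14:02 UTC - billing-service v2.3.1 deployed")
```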
Reconstruct the Sequence
Build a clear picture of how the incident unfolded:
- What was the first indication of problems?
- How did issues spread or escalate?
- What factors amplified the impact?
- What finally resolved the situation?
This reconstruction often reveals contributing factors invisible during the chaos of incident response.
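Much of that reconstruction is mechanical: merging events from different sources (alerts, deploy logs, chat) into one chronological view. A minimal sketch, assuming each source yields (timestamp, source, message) tuples; the entries below are hypothetical.

```python
from datetime import datetime

# Hypothetical event streams, each entry as (timestamp, source, message).
deploys = [(datetime(2024, 6, 3, 14, 2), "deploys", "billing-service v2.3.1 rolled out")]
alerts = [(datetime(2024, 6, 3, 14, 7), "alerting", "p99 latency breached threshold")]
chat = [(datetime(2024, 6, 3, 14, 12), "chat", "on-call acknowledges the page")]

def build_timeline(*sources):
    """Merge event streams into a single list ordered by timestamp."""
    events = [event for source in sources for event in source]
    return sorted(events, key=lambda event: event[0])

for when, source, message in build_timeline(deploys, alerts, chat):
    print(f"{when:%H:%M}  [{source}]  {message}")
```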
List All Contributing Factors
Capture everything that played a role. Don't filter yet—comprehensiveness matters more than prioritization at this stage.
Apply the Whys
For each contributing factor, ask why it occurred. Look for common causes connecting multiple factors. Identify where defensive layers failed.
Distinguish Causes from Consequences
A server running out of memory is a consequence. The root cause is whatever made that exhaustion possible and allowed it to affect the service.
Focus on What You Control
You can't change that a third-party service went down. You can change how your system handles that failure.
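For example, if the trigger was a third-party API outage, the controllable part is usually a timeout and a fallback around the call. A minimal sketch using the requests library; the URL, function name, and fallback value are assumptions for illustration, not a prescription.

```python
import requests  # assumes the requests library is available

FALLBACK_RATES = {"USD": 1.0}  # hypothetical cached or default response

def fetch_exchange_rates(url="https://api.example.com/rates", timeout=2.0):
    """Call a third-party API, but bound how long we wait and degrade
    gracefully instead of letting the failure cascade through our system."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()
        return response.json()
    except requests.RequestException:
        # We can't control the provider's availability; we can control
        # what our system does while it is unavailable.
        return FALLBACK_RATES
```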
Prioritize
Not all root causes deserve equal attention. For each candidate, consider the following; a rough scoring sketch follows the list:
- How likely is this factor to contribute to future incidents?
- How severe would those incidents be?
- How difficult is it to address?
- What's the return on investment?
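One rough way to compare candidates is an expected-impact score: likelihood times severity, divided by the effort to address. The 1-5 scales and the example numbers below are assumptions; the arithmetic just makes the trade-off explicit.

```python
# Each candidate root cause scored 1-5 for likelihood of recurring,
# severity if it does, and effort to address. Numbers are illustrative.
candidates = [
    {"cause": "no memory alerts on database replicas", "likelihood": 4, "severity": 4, "effort": 1},
    {"cause": "no deployment calendar", "likelihood": 3, "severity": 3, "effort": 2},
    {"cause": "no isolation between components", "likelihood": 2, "severity": 5, "effort": 5},
]

def priority(candidate):
    # Higher likelihood and severity raise the score; higher effort lowers it.
    return candidate["likelihood"] * candidate["severity"] / candidate["effort"]

for candidate in sorted(candidates, key=priority, reverse=True):
    print(f"{priority(candidate):5.1f}  {candidate['cause']}")
```

A score like this is a conversation starter, not a verdict; it mainly keeps cheap, high-leverage fixes from losing out to whatever broke most recently.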
Common Pitfalls
Stopping too soon: Accepting surface-level causes misses opportunities for systemic improvement.
Blame focus: Finding who was responsible instead of what factors contributed destroys learning. Blameless analysis examines systems, not individuals.
Single cause assumption: Most incidents require multiple factors to align. Missing some means incomplete understanding.
Hindsight bias: After an incident, causes seem obvious. But analysis should examine what people knew at the time, not what's clear afterward.
Analysis paralysis: Spending weeks on detailed analysis delays implementing improvements. Sometimes "good enough" understanding enables action.
No follow-through: Identifying root causes without implementing improvements makes the analysis worthless. Action items need owners and deadlines.
Communicating Results
Be specific: "Insufficient monitoring" doesn't help. "Memory usage wasn't monitored on database replicas, only the primary" is actionable.
Explain systemic factors: Help people understand how the incident was a product of systems and processes, not just individual decisions.
Connect to actions: Link each root cause to specific improvements. Make it clear how addressing these causes prevents recurrence.
Write for everyone: Technical precision matters, but so does accessibility. Engineers, managers, and executives all need to understand.
Building a Learning Organization
Organizations that excel at root cause analysis treat it as core to how they operate.
They analyze near-misses—situations that almost became incidents—to learn without paying the cost of actual incidents.
They share analyses widely, so learning spreads across teams and prevents similar issues in different parts of the organization.
They track recurring root causes to identify systemic problems requiring focused investment.
They celebrate thorough analysis, recognizing that understanding failures is how you build systems that don't fail.