Once you detect an incident, the clock starts ticking. Mean Time to Resolution (MTTR) measures how long it takes to fix problems and restore normal service. It's one of the most watched metrics in incident management.
But here's what MTTR actually measures: how long your users suffer. Every minute on that timer is someone staring at an error page, a transaction failing, trust eroding. MTTR isn't about operational efficiency. It's about limiting the damage.
What Is MTTR?
Mean Time to Resolution is the average duration from when you detect an incident until it's fully resolved and normal service is restored. Calculate it by adding up resolution times for all incidents in a period and dividing by the number of incidents.
Three incidents in a month:
- Incident 1: Detected at 14:03, resolved at 14:28 (25 minutes)
- Incident 2: Detected at 02:45, resolved at 04:15 (90 minutes)
- Incident 3: Detected at 11:21, resolved at 11:35 (14 minutes)
MTTR = (25 + 90 + 14) / 3 = 43 minutes
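The same calculation as a minimal Python sketch, assuming each incident record carries a detection and a resolution timestamp (the dates below are made up to match the example):

```python
from datetime import datetime

# Each incident is a (detected_at, resolved_at) pair; timestamps are illustrative.
incidents = [
    (datetime(2024, 5, 1, 14, 3), datetime(2024, 5, 1, 14, 28)),   # 25 minutes
    (datetime(2024, 5, 2, 2, 45), datetime(2024, 5, 2, 4, 15)),    # 90 minutes
    (datetime(2024, 5, 3, 11, 21), datetime(2024, 5, 3, 11, 35)),  # 14 minutes
]

def mttr_minutes(incidents):
    """Sum of resolution durations divided by the number of incidents."""
    durations = [(resolved - detected).total_seconds() / 60
                 for detected, resolved in incidents]
    return sum(durations) / len(durations)

print(mttr_minutes(incidents))  # 43.0
```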
Simple math. What's not simple is everything that happens between detection and resolution.
Why MTTR Matters
MTTR directly correlates with user impact. Faster resolution means less time users spend experiencing problems, less revenue lost, less damage to trust.
MTTR also reveals operational maturity. Organizations with low MTTR typically have:
- Systems designed with failure recovery mechanisms
- Effective incident response processes
- Good documentation and runbooks
- Experienced, well-coordinated teams
- Tools that support fast diagnosis
Improving MTTR makes your service more reliable not by preventing failures but by limiting their impact when they occur.
The Anatomy of Resolution Time
Resolution time breaks down into phases. Each represents an opportunity to move faster—or a place where time silently drains away.
Investigation
After detection, responders must understand what's wrong:
- Examining monitoring data and alerts
- Reading logs and traces
- Forming hypotheses about causes
- Testing those hypotheses
Investigation time varies dramatically based on incident clarity. "Database replica 3 crashed" takes seconds to understand. "Users report things are slow sometimes" takes hours.
Coordination
Incidents often require multiple people. Coordination time includes:
- Paging additional responders
- Getting people into communication channels
- Sharing context about what's known
- Dividing work across team members
Good incident management processes minimize coordination overhead without sacrificing necessary collaboration. Bad processes turn incidents into meetings.
Implementation
Once you understand the problem and know the fix:
- Deploying code changes
- Updating configurations
- Restarting services
- Failing over to backup systems
- Scaling infrastructure
Implementation speed depends on your deployment practices and infrastructure automation.
Verification
After implementing fixes, you must confirm service is actually restored:
- Monitoring error rates returning to normal
- Checking that user-facing functionality works
- Verifying no new issues were introduced
Declaring resolution too early leads to recurring incidents. You're lying to yourself, and the real clock of user impact keeps running whether you admit it or not.
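As a rough sketch of what verification can look like in practice, the loop below refuses to declare resolution until the error rate has stayed at baseline for several consecutive checks. The get_error_rate helper, the baseline, and the intervals are placeholders for whatever your monitoring stack provides:

```python
import time

BASELINE_ERROR_RATE = 0.01    # assumed "normal" error rate (1%)
CHECK_INTERVAL_SECONDS = 30
REQUIRED_CLEAN_CHECKS = 5     # consecutive healthy readings before declaring resolution

def get_error_rate() -> float:
    """Placeholder: query your monitoring system for the current error rate."""
    raise NotImplementedError

def verify_resolution() -> None:
    """Block until the error rate has been at baseline for several checks in a row."""
    clean_checks = 0
    while clean_checks < REQUIRED_CLEAN_CHECKS:
        if get_error_rate() <= BASELINE_ERROR_RATE:
            clean_checks += 1
        else:
            clean_checks = 0  # a regression resets the count rather than being ignored
        time.sleep(CHECK_INTERVAL_SECONDS)
    print("Error rate stable; safe to declare resolution.")
```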
What Determines MTTR
Many factors influence how quickly teams resolve incidents.
System Design
Systems designed with failure in mind resolve faster. Redundancy allows failing over to working components. Isolated failures prevent cascades. Graceful degradation maintains partial functionality.
Systems not designed for failure require more complex, time-consuming fixes—rebuilding components, recovering data, manually restoring service piece by piece.
Observability
Good observability dramatically reduces investigation time. When logs, metrics, and traces clearly show what's wrong, responders quickly identify issues.
Poor observability forces responders to guess, try random fixes, or add instrumentation during incidents to understand what's happening. You're debugging in the dark while users wait.
Runbooks and Documentation
Well-maintained runbooks provide step-by-step incident response procedures. When database performance degrades, responders follow the documented procedure rather than figuring it out from scratch.
Missing or outdated documentation means solving problems anew every time, even problems you've solved before.
Team Experience
Experienced teams resolve incidents faster. They've seen similar issues, know where to look, understand failure modes, and work effectively under pressure.
New team members contribute less effectively until they build this institutional knowledge. This isn't a criticism—it's why knowledge transfer matters.
On-Call Readiness
Responders who are prepared—with access to systems, familiarity with runbooks, understanding of escalation procedures—resolve incidents faster than those who must first figure out how to access logs.
Deployment Practices
Organizations with fast, safe deployment can quickly deploy fixes. Those with slow, risky deployments face longer MTTR because each deployment adds risk and delay.
Good deployment practices:
- Automated deployment pipelines
- Easy rollback mechanisms
- Gradual rollout capabilities
- Deployment frequency that keeps deployments routine
Incident Response Process
Clear processes reduce coordination overhead. Everyone knows their role, communication channels are established, escalation paths are clear.
Ad-hoc incident response wastes time figuring out who should be involved and how to coordinate.
Improving MTTR
Design for Failure
Build systems that fail gracefully and recover automatically:
- Implement redundancy and automatic failover
- Use circuit breakers to prevent cascade failures
- Design for degraded operation rather than complete failure
- Build idempotent operations that can be safely retried
These patterns often resolve incidents without human intervention. The best MTTR is zero—the incident that fixes itself before anyone notices.
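To make one of these patterns concrete, here is a minimal circuit-breaker sketch. It is not a production implementation; the thresholds are arbitrary and real services usually reach for a library instead:

```python
import time

class CircuitBreaker:
    """Fail fast against a broken dependency so its failures don't cascade."""

    def __init__(self, failure_threshold: int = 5, reset_timeout_seconds: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout_seconds = reset_timeout_seconds
        self.failure_count = 0
        self.opened_at = None  # None means the circuit is closed (calls allowed)

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout_seconds:
                raise RuntimeError("circuit open: failing fast instead of calling the dependency")
            # Cooldown elapsed: allow a trial call through.
            self.opened_at = None
            self.failure_count = 0
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failure_count = 0
        return result
```

Wrapped this way, a dependency that starts timing out stops dragging every request down with it while you work on the fix.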
Invest in Observability
Improve your ability to understand system state during incidents:
- Add comprehensive logging
- Implement distributed tracing
- Create dashboards for common failure modes
- Ensure logs include context needed for debugging
Better observability reduces investigation time, often the longest phase of incident response.
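One inexpensive improvement is making sure every log line carries the context a responder will need. A sketch using Python's standard logging module; the service name and context fields are just examples:

```python
import json
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("checkout")  # hypothetical service

def charge(order_id: str) -> None:
    """Placeholder for the real payment call."""

def handle_payment(order_id: str, user_id: str, region: str) -> None:
    # Attach the identifiers responders need to correlate this line
    # with traces, metrics, and other services' logs.
    context = json.dumps({"order_id": order_id, "user_id": user_id, "region": region})
    try:
        charge(order_id)
        logger.info("payment succeeded %s", context)
    except Exception:
        logger.exception("payment failed %s", context)
        raise
```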
Create and Maintain Runbooks
Document procedures for common incidents:
- Step-by-step resolution procedures
- Where to find relevant logs and metrics
- Who to escalate to for specialized knowledge
- Known issues and their solutions
Keep runbooks updated as systems change. Outdated runbooks waste time and create confusion.
Improve Deployment Speed
Make deploying fixes fast and safe:
- Automate deployment pipelines
- Implement feature flags for instant rollback
- Use canary or blue-green deployments
- Practice deployments regularly so they're routine
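A sketch of the feature-flag idea: the risky code path sits behind a flag that can be flipped without a deploy. The in-memory dict stands in for a real flag service, and the function names are made up:

```python
# Stand-in for a real flag service or config store; flipping a value there
# takes effect without shipping new code.
feature_flags = {"new_pricing_engine": True}

def is_enabled(flag_name: str) -> bool:
    return feature_flags.get(flag_name, False)

def legacy_pricing(order: dict) -> float:
    return order["subtotal"]          # known-good path

def new_pricing(order: dict) -> float:
    return order["subtotal"] * 0.9    # risky new path

def calculate_price(order: dict) -> float:
    if is_enabled("new_pricing_engine"):
        return new_pricing(order)
    return legacy_pricing(order)

# During an incident, "rollback" is one flag flip, not a redeploy:
# feature_flags["new_pricing_engine"] = False
```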
Regular Training and Practice
Conduct incident response training:
- Tabletop exercises walking through hypothetical incidents
- Game day events simulating real failures
- Postmortem reviews sharing learning across teams
- Rotation of on-call responsibilities to spread knowledge
Practice builds the muscle memory that enables fast response.
Reduce Coordination Overhead
Streamline how teams work together:
- Establish clear incident commander role
- Use standard communication channels
- Define escalation procedures
- Create templates for incident declarations
Automate Common Fixes
Some incidents recur with the same resolution. Automate these:
- Automatic restart of crashed services
- Auto-scaling in response to load
- Automatic failover to healthy instances
- Self-healing infrastructure
Automation can reduce MTTR from minutes to seconds.
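As a simple illustration, here is a watchdog loop that restarts a service when its health check fails. The URL and restart command are placeholders; in most environments this job belongs to the orchestrator (for example, Kubernetes liveness probes) rather than a hand-rolled script:

```python
import subprocess
import time
import urllib.request

HEALTH_URL = "http://localhost:8080/healthz"          # placeholder health endpoint
RESTART_CMD = ["systemctl", "restart", "myservice"]   # placeholder restart command

def is_healthy() -> bool:
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=5) as response:
            return response.status == 200
    except Exception:
        return False

def watchdog(check_interval_seconds: int = 30) -> None:
    """Restart the service whenever the health check fails, then keep watching."""
    while True:
        if not is_healthy():
            subprocess.run(RESTART_CMD, check=False)
        time.sleep(check_interval_seconds)
```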
The Dark Side of MTTR
Aggressively optimizing for MTTR has potential downsides.
Hasty Fixes: Extreme pressure to resolve quickly leads to band-aid fixes that don't address root causes. The incident closes, but it will be back.
Incomplete Understanding: Rushing to resolution without understanding what happened makes postmortem analysis harder. You've stopped the bleeding but learned nothing.
Risk Taking: Desperation to resolve fast might lead to risky changes that make situations worse.
The goal is sustainably fast resolution, not resolution at any cost.
MTTR Targets
Appropriate targets depend on your service level requirements.
Critical Production Services: Major services might target 30 minutes or less for critical incidents.
Internal Services: Internal applications might target a few hours, balancing against other priorities.
Different Severity Levels: MTTR targets should vary by severity. Critical outages demand faster resolution than minor issues.
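One lightweight way to encode severity-based targets is a simple mapping that incident reporting checks against. The severity names and minute values below are illustrative, not recommendations:

```python
# Example values only; set targets that match your own service level requirements.
MTTR_TARGETS_MINUTES = {
    "sev1": 30,    # critical, user-facing outage
    "sev2": 120,   # degraded functionality
    "sev3": 480,   # minor issue or internal tooling
}

def within_target(severity: str, resolution_minutes: float) -> bool:
    target = MTTR_TARGETS_MINUTES.get(severity)
    return target is not None and resolution_minutes <= target
```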
Whatever targets you set, track trends. Is MTTR improving or degrading? The direction of that trend signals whether your operational effectiveness is improving or slipping.
Beyond the Mean
Examining MTTR distribution reveals insights the mean doesn't capture.
If nine incidents resolve in 15 minutes each but one takes 6 hours, your MTTR is nearly 50 minutes. The mean hides that you're actually quite good at resolution except for specific incident types.
Consider tracking:
- Median time to resolution: Less affected by outliers
- 95th percentile: Your worst-case resolution times
- MTTR by severity: Different expectations for different severities
- MTTR by incident type: Which incidents take longest
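A small sketch of those distribution views using Python's statistics module; the durations are made up to resemble the example above:

```python
import statistics

# Resolution times in minutes for one month (illustrative data with one outlier).
durations = [15, 12, 18, 14, 16, 13, 15, 17, 14, 360]

mean_mttr = statistics.mean(durations)              # dragged upward by the outlier
median_mttr = statistics.median(durations)          # what a typical incident looks like
p95 = statistics.quantiles(durations, n=20)[-1]     # near-worst-case resolution time

print(f"mean={mean_mttr:.0f}m median={median_mttr:.0f}m p95={p95:.0f}m")
```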
MTTR in Context
MTTR doesn't exist in isolation.
MTTD (Mean Time to Detect): How quickly you notice incidents. MTTD + MTTR = total incident duration from the user's perspective.
MTBF (Mean Time Between Failures): How often incidents occur. Making incidents rare might matter more than resolving them quickly.
Impact Metrics: A short incident affecting millions of users can do more damage than a long incident affecting a handful.
The most reliable services combine all these—rare incidents, quickly detected, quickly resolved.
The Limits of MTTR
MTTR has limitations as a metric.
Gaming: Teams might rush to declare resolution before fully verifying fixes. MTTR improves on paper while actual reliability gets worse: a metric meant to drive reliability ends up undermining it.
Impact Blindness: A 1-hour incident affecting 10 users has the same MTTR as a 1-hour incident affecting 10 million users.
Prevention Blindness: MTTR measures resolution speed but doesn't reflect efforts to prevent incidents in the first place.
Use MTTR alongside other metrics for a complete picture.
MTTR as a Learning Tool
Beyond tracking performance, MTTR reveals learning opportunities.
Incident Comparison: Why did similar incidents have vastly different resolution times? What went better or worse?
Team Effectiveness: How does MTTR vary across different on-call rotations? What can high-performing teams teach others?
System Complexity: Which services consistently have high MTTR? These might need architectural improvements.
Process Gaps: Do incidents with high MTTR reveal missing documentation, tooling, or skills?
Examining MTTR patterns reveals opportunities for systemic improvement. The number itself matters less than what it teaches you about your systems and your team.