
Once an incident is detected, the clock starts ticking. Mean Time to Resolution (MTTR) measures how long it takes to fix problems and restore normal service. It's one of the most watched metrics in incident management.

But here's what MTTR actually measures: how long your users suffer. Every minute on that timer is someone staring at an error page, a transaction failing, trust eroding. MTTR isn't about operational efficiency. It's about limiting the damage.

What Is MTTR?

Mean Time to Resolution is the average duration from when you detect an incident until it's fully resolved and normal service is restored. Calculate it by adding up resolution times for all incidents in a period and dividing by the number of incidents.

Three incidents in a month:

  • Incident 1: Detected at 14:03, resolved at 14:28 (25 minutes)
  • Incident 2: Detected at 02:45, resolved at 04:15 (90 minutes)
  • Incident 3: Detected at 11:21, resolved at 11:35 (14 minutes)

MTTR = (25 + 90 + 14) / 3 = 43 minutes

Simple math. What's not simple is everything that happens between detection and resolution.
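The arithmetic is easy to script if you want to track it automatically. Here's a minimal sketch in Python using the three incidents above (the dates are made up for illustration):

  from datetime import datetime

  # (detected, resolved) timestamps for the three incidents above; dates are illustrative.
  incidents = [
      ("2025-03-01 14:03", "2025-03-01 14:28"),
      ("2025-03-09 02:45", "2025-03-09 04:15"),
      ("2025-03-21 11:21", "2025-03-21 11:35"),
  ]

  def mttr_minutes(incidents):
      """Average minutes from detection to resolution."""
      fmt = "%Y-%m-%d %H:%M"
      durations = [
          (datetime.strptime(resolved, fmt) - datetime.strptime(detected, fmt)).total_seconds() / 60
          for detected, resolved in incidents
      ]
      return sum(durations) / len(durations)

  print(mttr_minutes(incidents))  # 43.0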

Why MTTR Matters

MTTR directly correlates with user impact. Faster resolution means less time users spend experiencing problems, less revenue lost, less damage to trust.

MTTR also reveals operational maturity. Organizations with low MTTR typically have:

  • Systems designed with failure recovery mechanisms
  • Effective incident response processes
  • Good documentation and runbooks
  • Experienced, well-coordinated teams
  • Tools that support fast diagnosis

Improving MTTR makes your service more reliable not by preventing failures but by limiting their impact when they occur.

The Anatomy of Resolution Time

Resolution time breaks down into phases. Each represents an opportunity to move faster—or a place where time silently drains away.

Investigation

After detection, responders must understand what's wrong:

  • Examining monitoring data and alerts
  • Reading logs and traces
  • Forming hypotheses about causes
  • Testing those hypotheses

Investigation time varies dramatically based on incident clarity. "Database replica 3 crashed" takes seconds to understand. "Users report things are slow sometimes" takes hours.

Coordination

Incidents often require multiple people. Coordination time includes:

  • Paging additional responders
  • Getting people into communication channels
  • Sharing context about what's known
  • Dividing work across team members

Good incident management processes minimize coordination overhead without sacrificing necessary collaboration. Bad processes turn incidents into meetings.

Implementation

Once you understand the problem and know the fix:

  • Deploying code changes
  • Updating configurations
  • Restarting services
  • Failing over to backup systems
  • Scaling infrastructure

Implementation speed depends on your deployment practices and infrastructure automation.

Verification

After implementing fixes, you must confirm service is actually restored:

  • Monitoring error rates returning to normal
  • Checking that user-facing functionality works
  • Verifying no new issues were introduced

Declaring resolution too early leads to recurring incidents. You're lying to yourself—and the timer keeps running whether you admit it or not.
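One way to guard against declaring victory too early is to script the verification itself. A minimal sketch, assuming a hypothetical health endpoint and treating several consecutive healthy checks as the bar for "restored":

  import time
  import urllib.request

  # Hypothetical health endpoint; swap in whatever signal tells you users are actually OK.
  HEALTH_URL = "https://status.example.com/healthz"

  def verified_recovered(checks: int = 5, interval_seconds: int = 30) -> bool:
      """Declare recovery only after several consecutive healthy checks."""
      consecutive = 0
      for _ in range(checks * 10):  # give up eventually rather than poll forever
          try:
              with urllib.request.urlopen(HEALTH_URL, timeout=5) as response:
                  healthy = response.status == 200
          except OSError:
              healthy = False
          consecutive = consecutive + 1 if healthy else 0
          if consecutive >= checks:
              return True
          time.sleep(interval_seconds)
      return False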

What Determines MTTR

Many factors influence how quickly teams resolve incidents.

System Design

Systems designed with failure in mind resolve faster. Redundancy allows failing over to working components. Isolated failures prevent cascades. Graceful degradation maintains partial functionality.

Systems not designed for failure require more complex, time-consuming fixes—rebuilding components, recovering data, manually restoring service piece by piece.

Observability

Good observability dramatically reduces investigation time. When logs, metrics, and traces clearly show what's wrong, responders quickly identify issues.

Poor observability forces responders to guess, try random fixes, or add instrumentation during incidents to understand what's happening. You're debugging in the dark while users wait.

Runbooks and Documentation

Well-maintained runbooks provide step-by-step incident response procedures. When database performance degrades, responders follow the documented procedure rather than figuring it out from scratch.

Missing or outdated documentation means solving problems anew every time, even problems you've solved before.

Team Experience

Experienced teams resolve incidents faster. They've seen similar issues, know where to look, understand failure modes, and work effectively under pressure.

New team members contribute less effectively until they build this institutional knowledge. This isn't a criticism—it's why knowledge transfer matters.

On-Call Readiness

Responders who are prepared—with access to systems, familiarity with runbooks, understanding of escalation procedures—resolve incidents faster than those who must first figure out how to access logs.

Deployment Practices

Organizations with fast, safe deployment can quickly deploy fixes. Those with slow, risky deployments face longer MTTR because each deployment adds risk and delay.

Good deployment practices:

  • Automated deployment pipelines
  • Easy rollback mechanisms
  • Gradual rollout capabilities
  • Deployment frequency that keeps deployments routine

Incident Response Process

Clear processes reduce coordination overhead. Everyone knows their role, communication channels are established, escalation paths are clear.

Ad-hoc incident response wastes time figuring out who should be involved and how to coordinate.

Improving MTTR

Design for Failure

Build systems that fail gracefully and recover automatically:

  • Implement redundancy and automatic failover
  • Use circuit breakers to prevent cascade failures
  • Design for degraded operation rather than complete failure
  • Build idempotent operations that can be safely retried

These patterns often resolve incidents without human intervention. The best MTTR is zero—the incident that fixes itself before anyone notices.
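To make the circuit-breaker idea concrete, here's a minimal sketch in Python. Real implementations (and the libraries that provide them) add timeouts, metrics, and per-dependency state, but the core behavior is just this: stop hammering a failing dependency, then cautiously try again after a cooldown.

  import time

  class CircuitBreaker:
      """Stop calling a failing dependency; try again after a cooldown."""

      def __init__(self, max_failures: int = 5, reset_after_seconds: float = 30.0):
          self.max_failures = max_failures
          self.reset_after = reset_after_seconds
          self.failures = 0
          self.opened_at = None

      def call(self, func, *args, **kwargs):
          # While open, fail fast instead of piling more load on a sick dependency.
          if self.opened_at is not None:
              if time.monotonic() - self.opened_at < self.reset_after:
                  raise RuntimeError("circuit open: dependency still considered unhealthy")
              self.opened_at = None  # half-open: let one trial call through
          try:
              result = func(*args, **kwargs)
          except Exception:
              self.failures += 1
              if self.failures >= self.max_failures:
                  self.opened_at = time.monotonic()  # trip the breaker
              raise
          self.failures = 0  # a success resets the count
          return result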

Invest in Observability

Improve your ability to understand system state during incidents:

  • Add comprehensive logging
  • Implement distributed tracing
  • Create dashboards for common failure modes
  • Ensure logs include context needed for debugging

Better observability reduces investigation time, often the longest phase of incident response.
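As an example of "logs with context", here's a sketch of structured JSON logging using Python's standard logging module. The field names (request_id, user_id, endpoint) are placeholders; the point is that every log line carries the identifiers responders will search for during an incident.

  import json
  import logging
  import sys

  class JsonFormatter(logging.Formatter):
      # Emit one JSON object per line so a log aggregator can index every field.
      def format(self, record):
          payload = {
              "level": record.levelname,
              "logger": record.name,
              "message": record.getMessage(),
          }
          # Context passed via `extra=` ends up as attributes on the record.
          for field in ("request_id", "user_id", "endpoint"):
              if hasattr(record, field):
                  payload[field] = getattr(record, field)
          return json.dumps(payload)

  handler = logging.StreamHandler(sys.stdout)
  handler.setFormatter(JsonFormatter())
  logger = logging.getLogger("checkout")
  logger.addHandler(handler)
  logger.setLevel(logging.INFO)

  # During an incident, responders can filter on request_id instead of grepping prose.
  logger.error("payment provider timed out",
               extra={"request_id": "req-8f2c", "user_id": "u-42", "endpoint": "/checkout"})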

Create and Maintain Runbooks

Document procedures for common incidents:

  • Step-by-step resolution procedures
  • Where to find relevant logs and metrics
  • Who to escalate to for specialized knowledge
  • Known issues and their solutions

Keep runbooks updated as systems change. Outdated runbooks waste time and create confusion.

Improve Deployment Speed

Make deploying fixes fast and safe:

  • Automate deployment pipelines
  • Implement feature flags for instant rollback
  • Use canary or blue-green deployments
  • Practice deployments regularly so they're routine
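A feature flag is just a runtime condition guarding the risky code path, so turning it off restores the old behavior without a deploy. A minimal sketch, using an environment variable as a stand-in for a real flag service:

  import os

  def flag_enabled(name: str) -> bool:
      # Stand-in flag source: an environment variable. Real setups usually read
      # from a flag service so a change takes effect without a restart or deploy.
      return os.environ.get(f"FLAG_{name.upper()}", "off") == "on"

  def legacy_shipping_cost(order: dict) -> float:
      return 5.0  # known-good path

  def new_shipping_cost(order: dict) -> float:
      return 0.0 if order.get("total", 0) > 50 else 5.0  # new behavior being rolled out

  def shipping_cost(order: dict) -> float:
      # Flipping the flag off instantly restores the old behavior: no deploy, no rollback.
      if flag_enabled("new_shipping_engine"):
          return new_shipping_cost(order)
      return legacy_shipping_cost(order)

  print(shipping_cost({"total": 80}))  # 5.0 unless FLAG_NEW_SHIPPING_ENGINE=on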

Regular Training and Practice

Conduct incident response training:

  • Tabletop exercises walking through hypothetical incidents
  • Game day events simulating real failures
  • Postmortem reviews sharing learning across teams
  • Rotation of on-call responsibilities to spread knowledge

Practice builds the muscle memory that enables fast response.

Reduce Coordination Overhead

Streamline how teams work together:

  • Establish clear incident commander role
  • Use standard communication channels
  • Define escalation procedures
  • Create templates for incident declarations

Automate Common Fixes

Some incidents recur with the same resolution. Automate these:

  • Automatic restart of crashed services
  • Auto-scaling in response to load
  • Automatic failover to healthy instances
  • Self-healing infrastructure

Automation can reduce MTTR from minutes to seconds.
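As a sketch of what "automatic restart" can look like, here's a deliberately simple watchdog in Python. The unit name and health URL are hypothetical, and in practice you'd lean on your platform (systemd's Restart=on-failure, a Kubernetes liveness probe, an autoscaler) rather than hand-rolling this, but the shape is the same: detect the failure condition and apply the known fix without waking anyone up.

  import subprocess
  import time
  import urllib.request

  SERVICE = "my-app.service"                    # hypothetical systemd unit
  HEALTH_URL = "http://localhost:8080/healthz"  # hypothetical health endpoint

  def healthy() -> bool:
      try:
          with urllib.request.urlopen(HEALTH_URL, timeout=3) as response:
              return response.status == 200
      except OSError:
          return False

  # Check every minute; restart the unit whenever the health check fails.
  while True:
      if not healthy():
          subprocess.run(["systemctl", "restart", SERVICE], check=False)
      time.sleep(60)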

The Dark Side of MTTR

Aggressively optimizing for MTTR has potential downsides.

Hasty Fixes: Extreme pressure to resolve quickly leads to band-aid fixes that don't address root causes. The incident closes, but it will be back.

Incomplete Understanding: Rushing to resolution without understanding what happened makes postmortem analysis harder. You've stopped the bleeding but learned nothing.

Risk Taking: Desperation to resolve fast might lead to risky changes that make situations worse.

The goal is sustainably fast resolution, not resolution at any cost.

MTTR Targets

Appropriate targets depend on your service level requirements.

Critical Production Services: Major services might target 30 minutes or less for critical incidents.

Internal Services: Internal applications might target a few hours, balancing against other priorities.

Different Severity Levels: MTTR targets should vary by severity. Critical outages demand faster resolution than minor issues.

Whatever targets you set, track trends. Is MTTR improving or degrading? A shift in either direction signals a change in operational effectiveness.

Beyond the Mean

Examining MTTR distribution reveals insights the mean doesn't capture.

If ten of your incidents resolve in 15 minutes but one took 6 hours, your MTTR is roughly 46 minutes. The mean hides that you're actually quite good at resolution except for specific incident types.

Consider tracking:

  • Median time to resolution: Less affected by outliers
  • 95th percentile: Your worst-case resolution times
  • MTTR by severity: Different expectations for different severities
  • MTTR by incident type: Which incidents take longest
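These are all easy to compute once you have resolution times. A quick sketch with Python's statistics module and illustrative numbers, showing how a single six-hour outlier drags the mean far above the median:

  import statistics

  # Resolution times in minutes for one month's incidents (illustrative numbers).
  resolution_minutes = [15, 12, 18, 14, 16, 13, 15, 17, 11, 360]

  mean_mttr = statistics.mean(resolution_minutes)            # ~49 minutes
  median_mttr = statistics.median(resolution_minutes)        # 15 minutes
  p95 = statistics.quantiles(resolution_minutes, n=20)[-1]   # 95th percentile

  print(f"mean={mean_mttr:.1f}m  median={median_mttr:.1f}m  p95={p95:.1f}m")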

MTTR in Context

MTTR doesn't exist in isolation.

MTTD (Mean Time to Detect): How quickly you notice incidents. MTTD + MTTR = total incident duration from the user's perspective.

MTBF (Mean Time Between Failures): How often incidents occur. Making incidents rare can matter more than resolving them quickly.

Impact Metrics: A short incident affecting millions of users can do more damage than a long one affecting a handful; MTTR alone doesn't show that.

The most reliable services combine all these—rare incidents, quickly detected, quickly resolved.

The Limits of MTTR

MTTR has limitations as a metric.

Gaming: Teams might rush to declare resolution before fully verifying fixes. MTTR improves on paper while reliability gets worse, a dark irony for a metric meant to improve it.

Impact Blindness: A 1-hour incident affecting 10 users has the same MTTR as a 1-hour incident affecting 10 million users.

Prevention Blindness: MTTR measures resolution speed but doesn't reflect efforts to prevent incidents in the first place.

Use MTTR alongside other metrics for a complete picture.

MTTR as a Learning Tool

Beyond tracking performance, MTTR reveals learning opportunities.

Incident Comparison: Why did similar incidents have vastly different resolution times? What went better or worse?

Team Effectiveness: How does MTTR vary across different on-call rotations? What can high-performing teams teach others?

System Complexity: Which services consistently have high MTTR? These might need architectural improvements.

Process Gaps: Do incidents with high MTTR reveal missing documentation, tooling, or skills?

Examining MTTR patterns reveals opportunities for systemic improvement. The number itself matters less than what it teaches you about your systems and your team.

