Every minute of MTTD is a minute your users know something you don't.
Mean Time to Detect measures the average duration from when an incident begins affecting your systems until your team becomes aware of it. It's the embarrassment gap—the window where problems compound while you remain oblivious.
The Calculation
Add up detection times for all incidents in a period. Divide by the number of incidents.
Three incidents in a month:
- Incident 1: Started at 14:00, detected at 14:03 (3 minutes)
- Incident 2: Started at 02:15, detected at 02:45 (30 minutes)
- Incident 3: Started at 11:20, detected at 11:21 (1 minute)
MTTD = (3 + 30 + 1) / 3 ≈ 11.3 minutes
Simple math. The hard part is knowing when incidents actually started.
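In code, the same arithmetic is a few lines of Python — a minimal sketch using the three incidents above, with made-up dates:

```python
from datetime import datetime

# The three incidents above, as (started, detected) pairs. Dates are made up.
incidents = [
    (datetime(2024, 5, 1, 14, 0), datetime(2024, 5, 1, 14, 3)),
    (datetime(2024, 5, 2, 2, 15), datetime(2024, 5, 2, 2, 45)),
    (datetime(2024, 5, 3, 11, 20), datetime(2024, 5, 3, 11, 21)),
]

# Detection time per incident, in minutes.
detection_minutes = [
    (detected - started).total_seconds() / 60
    for started, detected in incidents
]

mttd = sum(detection_minutes) / len(detection_minutes)
print(f"MTTD: {mttd:.1f} minutes")  # MTTD: 11.3 minutes
```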
Why Detection Speed Matters
Fast detection doesn't fix anything. But nothing gets fixed until detection happens.
While you're unaware:
- Errors cascade into other systems
- Data gets corrupted
- Customers get frustrated
- Trust erodes
Customers notice when you're working on a problem versus when you're blindsided by it. Learning about outages from Twitter instead of your monitoring systems tells customers you don't know your own service.
The Silent Failure Problem
Some incidents are invisible to your dashboards while being very visible to your users.
A bug in your mobile app doesn't generate website errors. An issue affecting only European users disappears in metrics dominated by US traffic. A problem with one payment provider looks like normal conversion variance.
These silent failures can persist for days. Your dashboards show green. Your users experience red. Your monitoring lies to you by omission.
Good MTTD requires monitoring that catches edge cases, not just total service failures.
Measuring MTTD
Two questions must be answered precisely: when did the incident start, and when was it detected?
When Did It Start?
The incident starts when users are meaningfully affected—not when the first error occurs, and not when total failure happens.
A database slowly degrading over 20 minutes before failing completely: the incident started when response times reached levels that frustrated users, not at the first slow query and not at complete failure.
When Was It Detected?
Detection occurs when a human becomes aware that something needs response:
- Monitoring alert fires
- On-call engineer acknowledges
- Customer support ticket arrives
- Engineer notices something wrong
For consistency, most teams use "when monitoring alerts fired." This standardizes measurement and focuses improvement efforts on monitoring systems rather than human response time.
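If your incident records capture several of these signals, pick one consistently. A small sketch of that choice — the field names here are assumptions, not a standard schema:

```python
def detected_at(incident: dict):
    """Return the timestamp to treat as 'detected' for MTTD.

    Standardizes on the first monitoring alert; falls back to the
    earliest human signal when no alert ever fired. Field names are
    illustrative, not a standard schema.
    """
    if incident.get("first_alert_at") is not None:
        return incident["first_alert_at"]
    human_signals = [
        incident.get("acknowledged_at"),
        incident.get("first_support_ticket_at"),
        incident.get("noticed_by_engineer_at"),
    ]
    human_signals = [t for t in human_signals if t is not None]
    return min(human_signals) if human_signals else None
```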
What Affects Detection Speed
Monitoring coverage: You detect only what you monitor. No database latency checks means no early warning on database problems. Every gap in monitoring is a place where incidents hide.
Alert quality: Monitoring that alerts on everything trains teams to ignore alerts. When your monitoring cries wolf constantly, real wolves get lost in the noise. Good alerts balance sensitivity to real problems against that noise.
Check frequency: A check running every 10 minutes can't detect incidents faster than 10 minutes. More frequent checks mean faster maximum detection time.
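Back-of-the-envelope, assuming a fixed polling interval and a rule that N consecutive checks must fail before alerting:

```python
def worst_case_detection_minutes(check_interval_min: float,
                                 failures_to_alert: int = 1) -> float:
    """Upper bound on detection delay from polling alone.

    An incident can begin just after a successful check, so the first
    failing check lands up to one full interval later; the alert only
    fires after `failures_to_alert` consecutive failures.
    """
    return check_interval_min * failures_to_alert

print(worst_case_detection_minutes(10))                      # 10.0
print(worst_case_detection_minutes(1, failures_to_alert=3))  # 3.0
```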
Synthetic vs. real user monitoring: Synthetic checks (automated tests) detect problems proactively. Real user monitoring detects problems reactively, after users encounter them. Synthetic catches issues faster; RUM catches issues synthetic tests miss. Use both.
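A bare-bones synthetic check can be this simple — the endpoint URL below is hypothetical, and a real probe would page someone instead of printing:

```python
import time
import urllib.request

CHECK_URL = "https://example.com/healthz"  # hypothetical health endpoint
CHECK_INTERVAL_SECONDS = 60

def check_once(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:  # covers URLError, HTTPError, and timeouts
        return False

while True:
    if not check_once(CHECK_URL):
        print("Synthetic check failed:", CHECK_URL)  # a real probe would page here
    time.sleep(CHECK_INTERVAL_SECONDS)
```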
Geographic distribution: Monitoring only from US locations means European-only incidents might run for hours undetected.
Off-hours coverage: Incidents at 3 AM persist until morning if you rely on human observation. Automated monitoring works 24/7/365.
Reducing MTTD
Find your blind spots: What critical paths aren't monitored? What external dependencies could fail silently? What user experiences happen without any checks?
Fix your noisy alerts: Which alerts fire constantly but don't indicate real problems? Remove them. Which incidents were detected by customers instead of monitoring? Those are missing alerts.
Increase check frequency: If you're checking every 5 minutes, try every minute. The cost increase is usually negligible; the detection improvement is not.
Add anomaly detection: Manual thresholds miss gradual degradation. Response time creeping from 100ms to 500ms over a week might never trigger a threshold alert. Anomaly detection catches the drift.
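One lightweight way to catch that drift, sketched here under the assumption that you can sample a recent response-time series: compare the latest samples against a rolling baseline instead of a fixed threshold.

```python
from statistics import mean, stdev

def latency_drift_detected(latencies_ms: list[float],
                           baseline_size: int = 1000,
                           recent_size: int = 50,
                           sigma: float = 3.0) -> bool:
    """Flag gradual degradation that a static threshold would miss.

    Compares the mean of the most recent samples against a rolling
    baseline; returns True when it drifts more than `sigma` standard
    deviations above the baseline mean.
    """
    if len(latencies_ms) < baseline_size + recent_size:
        return False  # not enough history to judge
    baseline = latencies_ms[-(baseline_size + recent_size):-recent_size]
    recent = latencies_ms[-recent_size:]
    return mean(recent) > mean(baseline) + sigma * stdev(baseline)
```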
Ensure alerts reach people: If the on-call engineer's phone is dead, alerts should escalate. If the primary channel fails, backups should activate. Detection isn't complete until a human knows.
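The escalation logic itself is simple; the sketch below uses placeholder notify and ack functions standing in for whatever paging provider you use:

```python
import time

ACK_TIMEOUT_SECONDS = 300  # escalate if nobody acknowledges within 5 minutes

# Placeholders; in practice these call your paging provider's API.
def notify_primary(alert):      print("paging primary on-call:", alert)
def notify_secondary(alert):    print("paging secondary on-call:", alert)
def notify_team_channel(alert): print("posting to team channel:", alert)

def acknowledged(alert) -> bool:
    return False  # placeholder; query your paging provider for an ack

def page(alert: str) -> None:
    """Walk the escalation chain until someone acknowledges the alert."""
    for notify in (notify_primary, notify_secondary, notify_team_channel):
        notify(alert)
        deadline = time.time() + ACK_TIMEOUT_SECONDS
        while time.time() < deadline:
            if acknowledged(alert):
                return
            time.sleep(10)
```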
Detection Method Breakdown
Breaking down MTTD by how incidents were detected reveals where to focus:
Automated monitoring: Usually fastest—seconds to minutes with good configuration.
Customer reports: Slowest. By the time customers notice, complain, and their complaints reach engineers, significant damage is done.
Manual observation: Unpredictable. An engineer might catch something immediately, or it might persist until someone happens to look.
High rates of customer-reported incidents indicate monitoring gaps. Those gaps are where to invest.
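A sketch of that breakdown, using made-up incident records tagged with how each was detected:

```python
from collections import defaultdict

# Made-up records: (detection_method, minutes_to_detect)
incidents = [
    ("monitoring", 2), ("monitoring", 4), ("monitoring", 1),
    ("customer_report", 95), ("customer_report", 140),
    ("manual", 25),
]

by_method = defaultdict(list)
for method, minutes in incidents:
    by_method[method].append(minutes)

for method, times in by_method.items():
    share = len(times) / len(incidents)
    print(f"{method}: MTTD {sum(times) / len(times):.0f} min "
          f"({share:.0%} of incidents)")
```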
The Costs of Faster Detection
Pushing toward zero MTTD has diminishing returns:
Monitoring costs: More coverage, higher frequency, multiple geographic locations all cost money.
Alert fatigue: Over-sensitive monitoring creates so many alerts that people tune them out, paradoxically increasing detection time.
System load: Very frequent health checks add load to production systems. At some point, the monitoring itself impacts performance.
The goal isn't zero MTTD at any cost. It's appropriately fast detection given your service's criticality.
Reasonable Targets
Critical services (payment processing, authentication, core product): Under 5 minutes. Major cloud providers aim for seconds.
Business services (internal tools, secondary features): 10-30 minutes.
Lower-criticality services: An hour or more might be acceptable.
Whatever your target, track the trend. Degrading MTTD means your monitoring isn't keeping pace with your system's complexity.
Beyond the Average
The mean hides important information.
If you detect most incidents in 2 minutes but one took 8 hours, your MTTD might be 30 minutes. That average hides the fact that you're excellent at detection apart from rare blind spots.
Track these alongside MTTD:
- Median time to detect: Less skewed by outliers
- 95th percentile: How long your worst cases take
- Detection method distribution: What percentage caught by monitoring versus customer reports
The distribution tells you whether you have a monitoring problem or just occasional blind spots.
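The percentile views are a few lines of standard-library Python, shown here with made-up detection times:

```python
import statistics

detection_minutes = [2, 1, 3, 2, 4, 2, 480, 3, 2, 1]  # made-up data

print(f"mean:   {statistics.mean(detection_minutes):.1f} min")    # dragged up by the 8-hour outlier
print(f"median: {statistics.median(detection_minutes):.1f} min")  # closer to the typical case
# 95th percentile: statistics.quantiles with n=20 gives 5% steps.
p95 = statistics.quantiles(detection_minutes, n=20)[-1]
print(f"p95:    {p95:.1f} min")                                   # how bad the worst cases get
```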
MTTD in Context
MTTD is one piece of total incident impact. An incident with 1-minute MTTD but 4-hour resolution causes far more damage than 30-minute MTTD with 5-minute resolution.
But MTTD is often the easiest metric to improve. Better monitoring requires no architectural changes. Faster resolution often requires rebuilding systems.
Many teams focus on MTTD first because improvements are measurable, achievable, and lay groundwork for everything else.