Every minute of MTTD is a minute your users know something you don't.
Mean Time to Detect measures the average duration from when an incident begins affecting your systems until your team becomes aware of it. It's the embarrassment gap—the window where problems compound while you remain oblivious.
The Calculation
Add up detection times for all incidents in a period. Divide by the number of incidents.
Three incidents in a month:
- Incident 1: Started at 14:00, detected at 14:03 (3 minutes)
- Incident 2: Started at 02:15, detected at 02:45 (30 minutes)
- Incident 3: Started at 11:20, detected at 11:21 (1 minute)
MTTD = (3 + 30 + 1) / 3 ≈ 11.3 minutes
Simple math. The hard part is knowing when incidents actually started.
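In code, the same arithmetic is a few lines of Python — a minimal sketch using the three incidents above, with made-up dates:

```python
from datetime import datetime

# The three incidents above, as (started, detected) pairs. Dates are made up.
incidents = [
    (datetime(2024, 5, 1, 14, 0), datetime(2024, 5, 1, 14, 3)),
    (datetime(2024, 5, 2, 2, 15), datetime(2024, 5, 2, 2, 45)),
    (datetime(2024, 5, 3, 11, 20), datetime(2024, 5, 3, 11, 21)),
]

# Detection time per incident, in minutes.
detection_minutes = [
    (detected - started).total_seconds() / 60
    for started, detected in incidents
]

mttd = sum(detection_minutes) / len(detection_minutes)
print(f"MTTD: {mttd:.1f} minutes")  # MTTD: 11.3 minutes
```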
Why Detection Speed Matters
Fast detection doesn't fix anything. But nothing gets fixed until detection happens.
While you're unaware:
- Errors cascade into other systems
- Data gets corrupted
- Customers get frustrated
- Trust erodes
Customers notice when you're working on a problem versus when you're blindsided by it. Learning about outages from Twitter instead of your monitoring systems tells customers you don't know your own service.
The Silent Failure Problem
Some incidents are invisible to your dashboards while being very visible to your users.
A bug in your mobile app doesn't generate website errors. An issue affecting only European users disappears in metrics dominated by US traffic. A problem with one payment provider looks like normal conversion variance.
These silent failures can persist for days. Your dashboards show green. Your users experience red. Your monitoring lies to you by omission.
Good MTTD requires monitoring that catches edge cases, not just total service failures.
Measuring MTTD
Two questions must be answered precisely: when did the incident start, and when was it detected?
When Did It Start?
The incident starts when users are meaningfully affected—not when the first error occurs, and not when total failure happens.
A database slowly degrading over 20 minutes before failing completely: the incident started when response times reached levels that frustrated users, not at the first slow query and not at complete failure.
When Was It Detected?
Detection occurs when a human becomes aware that something needs response:
- Monitoring alert fires
- On-call engineer acknowledges
- Customer support ticket arrives
- Engineer notices something wrong
For consistency, most teams use "when monitoring alerts fired." This standardizes measurement and focuses improvement efforts on monitoring systems rather than human response time.
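If your incident records capture several of these signals, pick one consistently. A small sketch of that choice — the field names here are assumptions, not a standard schema:

```python
def detected_at(incident: dict):
    """Return the timestamp to treat as 'detected' for MTTD.

    Standardizes on the first monitoring alert; falls back to the
    earliest human signal when no alert ever fired. Field names are
    illustrative, not a standard schema.
    """
    if incident.get("first_alert_at") is not None:
        return incident["first_alert_at"]
    human_signals = [
        incident.get("acknowledged_at"),
        incident.get("first_support_ticket_at"),
        incident.get("noticed_by_engineer_at"),
    ]
    human_signals = [t for t in human_signals if t is not None]
    return min(human_signals) if human_signals else None
```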
What Affects Detection Speed
Monitoring coverage: You detect only what you monitor. No database latency checks means no early warning on database problems. Every gap in monitoring is a place where incidents hide.
Alert quality: Monitoring that alerts on everything trains teams to ignore alerts. When your monitoring cries wolf constantly, real wolves get lost in the noise. Good alerts balance sensitivity to real problems against that noise.
Check frequency: A check running every 10 minutes can't detect incidents faster than 10 minutes. More frequent checks mean faster maximum detection time.
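Back-of-the-envelope, assuming a fixed polling interval and a rule that N consecutive checks must fail before alerting:

```python
def worst_case_detection_minutes(check_interval_min: float,
                                 failures_to_alert: int = 1) -> float:
    """Upper bound on detection delay from polling alone.

    An incident can begin just after a successful check, so the first
    failing check lands up to one full interval later; the alert only
    fires after `failures_to_alert` consecutive failures.
    """
    return check_interval_min * failures_to_alert

print(worst_case_detection_minutes(10))                      # 10.0
print(worst_case_detection_minutes(1, failures_to_alert=3))  # 3.0
```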
Synthetic vs. real user monitoring: Synthetic checks (automated tests) detect problems proactively. Real user monitoring detects problems reactively, after users encounter them. Synthetic catches issues faster; RUM catches issues synthetic tests miss. Use both.
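A bare-bones synthetic check can be this simple — the endpoint URL below is hypothetical, and a real probe would page someone instead of printing:

```python
import time
import urllib.request

CHECK_URL = "https://example.com/healthz"  # hypothetical health endpoint
CHECK_INTERVAL_SECONDS = 60

def check_once(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:  # covers URLError, HTTPError, and timeouts
        return False

while True:
    if not check_once(CHECK_URL):
        print("Synthetic check failed:", CHECK_URL)  # a real probe would page here
    time.sleep(CHECK_INTERVAL_SECONDS)
```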
Geographic distribution: Monitoring only from US locations means European-only incidents might run for hours undetected.
Off-hours coverage: Incidents at 3 AM persist until morning if you rely on human observation. Automated monitoring works 24/7/365.
Reducing MTTD
Find your blind spots: What critical paths aren't monitored? What external dependencies could fail silently? What user experiences happen without any checks?
Fix your noisy alerts: Which alerts fire constantly but don't indicate real problems? Remove them. Which incidents were detected by customers instead of monitoring? Those are missing alerts.
Increase check frequency: If you're checking every 5 minutes, try every minute. The cost increase is usually negligible; the detection improvement is not.
Add anomaly detection: Manual thresholds miss gradual degradation. Response time creeping from 100ms to 500ms over a week might never trigger a threshold alert. Anomaly detection catches the drift.
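One lightweight way to catch that drift, sketched here under the assumption that you can sample a recent response-time series: compare the latest samples against a rolling baseline instead of a fixed threshold.

```python
from statistics import mean, stdev

def latency_drift_detected(latencies_ms: list[float],
                           baseline_size: int = 1000,
                           recent_size: int = 50,
                           sigma: float = 3.0) -> bool:
    """Flag gradual degradation that a static threshold would miss.

    Compares the mean of the most recent samples against a rolling
    baseline; returns True when it drifts more than `sigma` standard
    deviations above the baseline mean.
    """
    if len(latencies_ms) < baseline_size + recent_size:
        return False  # not enough history to judge
    baseline = latencies_ms[-(baseline_size + recent_size):-recent_size]
    recent = latencies_ms[-recent_size:]
    return mean(recent) > mean(baseline) + sigma * stdev(baseline)
```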
Ensure alerts reach people: If the on-call engineer's phone is dead, alerts should escalate. If the primary channel fails, backups should activate. Detection isn't complete until a human knows.
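The escalation logic itself is simple; the sketch below uses placeholder notify and ack functions standing in for whatever paging provider you use:

```python
import time

ACK_TIMEOUT_SECONDS = 300  # escalate if nobody acknowledges within 5 minutes

# Placeholders; in practice these call your paging provider's API.
def notify_primary(alert):      print("paging primary on-call:", alert)
def notify_secondary(alert):    print("paging secondary on-call:", alert)
def notify_team_channel(alert): print("posting to team channel:", alert)

def acknowledged(alert) -> bool:
    return False  # placeholder; query your paging provider for an ack

def page(alert: str) -> None:
    """Walk the escalation chain until someone acknowledges the alert."""
    for notify in (notify_primary, notify_secondary, notify_team_channel):
        notify(alert)
        deadline = time.time() + ACK_TIMEOUT_SECONDS
        while time.time() < deadline:
            if acknowledged(alert):
                return
            time.sleep(10)
```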
Detection Method Breakdown
Breaking down MTTD by how incidents were detected reveals where to focus:
Automated monitoring: Usually fastest—seconds to minutes with good configuration.
Customer reports: Slowest. By the time customers notice, complain, and their complaints reach engineers, significant damage is done.
Manual observation: Unpredictable. An engineer might catch something immediately, or it might persist until someone happens to look.
High rates of customer-reported incidents indicate monitoring gaps. Those gaps are where to invest.
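A sketch of that breakdown, using made-up incident records tagged with how each was detected:

```python
from collections import defaultdict

# Made-up records: (detection_method, minutes_to_detect)
incidents = [
    ("monitoring", 2), ("monitoring", 4), ("monitoring", 1),
    ("customer_report", 95), ("customer_report", 140),
    ("manual", 25),
]

by_method = defaultdict(list)
for method, minutes in incidents:
    by_method[method].append(minutes)

for method, times in by_method.items():
    share = len(times) / len(incidents)
    print(f"{method}: MTTD {sum(times) / len(times):.0f} min "
          f"({share:.0%} of incidents)")
```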
The Costs of Faster Detection
Pushing toward zero MTTD has diminishing returns:
Monitoring costs: More coverage, higher frequency, multiple geographic locations all cost money.
Alert fatigue: Over-sensitive monitoring creates so many alerts that people tune them out, paradoxically increasing detection time.
System load: Very frequent health checks add load to production systems. At some point, the monitoring itself impacts performance.
The goal isn't zero MTTD at any cost. It's appropriately fast detection given your service's criticality.
Reasonable Targets
Critical services (payment processing, authentication, core product): Under 5 minutes. Major cloud providers aim for seconds.
Business services (internal tools, secondary features): 10-30 minutes.
Lower-criticality services: An hour or more might be acceptable.
Whatever your target, track the trend. Degrading MTTD means your monitoring isn't keeping pace with your system's complexity.
Beyond the Average
The mean hides important information.
If you detect most incidents in 2 minutes but one took 8 hours, your MTTD might be 30 minutes. That average hides the fact that you're excellent at detection apart from rare blind spots.
Track these alongside MTTD:
- Median time to detect: Less skewed by outliers
- 95th percentile: How long your worst cases take
- Detection method distribution: What percentage caught by monitoring versus customer reports
The distribution tells you whether you have a monitoring problem or just occasional blind spots.
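The percentile views are a few lines of standard-library Python, shown here with made-up detection times:

```python
import statistics

detection_minutes = [2, 1, 3, 2, 4, 2, 480, 3, 2, 1]  # made-up data

print(f"mean:   {statistics.mean(detection_minutes):.1f} min")    # dragged up by the 8-hour outlier
print(f"median: {statistics.median(detection_minutes):.1f} min")  # closer to the typical case
# 95th percentile: statistics.quantiles with n=20 gives 5% steps.
p95 = statistics.quantiles(detection_minutes, n=20)[-1]
print(f"p95:    {p95:.1f} min")                                   # how bad the worst cases get
```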
MTTD in Context
MTTD is one piece of total incident impact. An incident with 1-minute MTTD but 4-hour resolution causes far more damage than 30-minute MTTD with 5-minute resolution.
But MTTD is often the easiest metric to improve. Better monitoring requires no architectural changes. Faster resolution often requires rebuilding systems.
Many teams focus on MTTD first because improvements are measurable, achievable, and lay groundwork for everything else.