Every threshold is a bet: what's worth waking someone up at 3 AM? Set it wrong in one direction and real problems slip past. Set it wrong in the other and you're crying wolf until nobody listens.
That's the actual problem with alert thresholds. Not the math. The math is easy. The hard part is that every threshold represents a guess about the future—which deviations matter, which don't, and how much human attention you're willing to spend finding out.
The Tradeoff You Can't Escape
Sensitive thresholds catch problems early, before users notice. They also fire on normal variance, training your team to dismiss alerts reflexively. Conservative thresholds only fire on obvious disasters, which means they fire after users are already complaining.
There's no correct answer. Only the right balance for your systems, your users' tolerance for degradation, and your team's capacity to respond.
Measure Before You Guess
Without baseline data, you're not setting thresholds—you're making wishes.
Collect metrics for weeks, preferably months, before setting thresholds. You need to understand:
- What's normal at 2 PM on Tuesday versus 2 AM on Saturday
- How much values fluctuate even when nothing's wrong
- Whether metrics correlate with traffic (response times often do)
- Seasonal patterns—retail looks different in December than July
Calculate not just averages but percentiles and standard deviations. The average hides everything interesting. If your average response time is 200ms but the 99th percentile is 3 seconds, you have unhappy users that the average doesn't show.
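To make that concrete, here's a minimal sketch in plain Python, using a made-up sample, of the kind of summary worth computing before you pick any numbers:

```python
import statistics

def summarize(latencies_ms):
    """Mean plus tail percentiles; the mean alone hides the slow requests."""
    cuts = statistics.quantiles(latencies_ms, n=100)  # cut points at 1%, 2%, ... 99%
    return {
        "mean_ms": round(statistics.mean(latencies_ms), 1),
        "stdev_ms": round(statistics.stdev(latencies_ms), 1),
        "p95_ms": round(cuts[94], 1),
        "p99_ms": round(cuts[98], 1),
    }

# 98% of requests around 200 ms, 2% stuck at 3 seconds:
sample = [200] * 980 + [3000] * 20
print(summarize(sample))
# The mean (~256 ms) looks healthy; the p99 (~2970 ms) shows what the mean hides.
```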
Why Percentiles Beat Averages
Averages lie. A service with 99% of requests at 100ms and 1% at 10 seconds has an average of 199ms—which sounds fine while 1% of your users are staring at loading spinners.
Use percentiles for thresholds:
- 95th percentile: 95% of requests are faster than this. Catches sustained degradation while tolerating occasional slow requests.
- 99th percentile: Stricter. Catches more subtle problems but more sensitive to outliers.
"Alert when 95th percentile response time exceeds 2 seconds for 5 consecutive minutes" is a useful threshold. "Alert when average response time exceeds 1 second" will either fire constantly or miss real problems.
Rates, Not Counts
100 errors sounds bad. But 100 errors out of a million requests? That's 0.01%—excellent. 100 errors out of 200 requests? That's 50%—catastrophic.
Always threshold on rates: errors per request, failures per attempt, timeouts as a percentage of calls. Absolute counts grow with traffic, so a count-based alert fires simply because you're serving more requests, and success starts to look like failure.
The same logic applies to growth rates. Disk usage at 40% seems fine. Disk usage at 40% and growing a percentage point per day means you have about two months before it's full.
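A back-of-the-envelope sketch of both ideas, with made-up numbers:

```python
def error_rate(errors, requests):
    """Threshold on the rate, not the raw count."""
    return errors / requests if requests else 0.0

print(error_rate(100, 1_000_000))  # 0.0001 -> 0.01%, excellent
print(error_rate(100, 200))        # 0.5    -> 50%, catastrophic

def days_until_full(used_pct, growth_points_per_day):
    """Runway left on a disk whose usage grows linearly."""
    if growth_points_per_day <= 0:
        return float("inf")
    return (100 - used_pct) / growth_points_per_day

print(days_until_full(40, 1))  # 60.0 days: not an emergency, but worth a ticket now
```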
Dynamic Thresholds
Static thresholds assume the world doesn't change. It does.
Response times are higher during peak hours. Traffic patterns differ on weekends. Black Friday isn't like any other day. A fixed threshold either fires constantly during peaks or misses problems during valleys.
Dynamic thresholds adjust:
- Different values for business hours versus overnight
- Different baselines for weekdays versus weekends
- Seasonal adjustments for predictable traffic changes
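A minimal version is just a lookup table keyed on day of week and time of day; the hours and values below are placeholders:

```python
from datetime import datetime

# Placeholder values: looser latency thresholds during peak traffic,
# tighter ones overnight when the system should be quiet.
P95_THRESHOLD_S = {
    ("weekday", "business_hours"): 2.0,
    ("weekday", "overnight"): 1.0,
    ("weekend", "business_hours"): 1.5,
    ("weekend", "overnight"): 1.0,
}

def current_threshold(now: datetime) -> float:
    day = "weekend" if now.weekday() >= 5 else "weekday"
    period = "business_hours" if 8 <= now.hour < 20 else "overnight"
    return P95_THRESHOLD_S[(day, period)]
```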
More sophisticated systems use anomaly detection—machine learning that establishes baselines and alerts when values deviate significantly from what the model expected. These reduce false positives from predictable variance while catching genuinely unusual behavior.
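Full-blown anomaly detection is a product category of its own, but the core idea of "compare against a learned baseline" can be sketched with a rolling window and a z-score check. This is far cruder than the machine-learning systems mentioned above, and the window and cutoff here are arbitrary:

```python
from collections import deque
import statistics

class RollingBaseline:
    """Flag values more than `z_max` standard deviations away from a
    rolling baseline of recent observations."""

    def __init__(self, window=288, z_max=3.0):  # 288 = 24h of 5-minute samples
        self.history = deque(maxlen=window)
        self.z_max = z_max

    def is_anomalous(self, value):
        anomalous = False
        if len(self.history) >= 30:  # wait for enough history to form a baseline
            mean = statistics.mean(self.history)
            stdev = statistics.stdev(self.history) or 1e-9  # avoid dividing by zero
            anomalous = abs(value - mean) / stdev > self.z_max
        self.history.append(value)
        return anomalous
```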
Compound Conditions
Single-metric thresholds are often too simple.
High CPU usage might be normal and fine. High CPU usage combined with elevated response times indicates actual problems. Alert when CPU > 80% AND response time > 2 seconds, not on CPU alone.
Duration requirements filter noise. "CPU > 90% for 10 consecutive minutes" catches sustained problems while ignoring momentary spikes during garbage collection or batch jobs.
Rate of change sometimes matters more than absolute values. CPU jumping from 20% to 60% in five minutes might indicate a problem even though 60% CPU is normally fine.
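Both ideas fit in a few lines. The specific numbers below are placeholders, and the duration requirement from above would wrap this in the same consecutive-window pattern used in the earlier p95 sketch:

```python
def should_alert(cpu_pct_now, cpu_pct_5min_ago, p95_s):
    """Compound condition: CPU and latency degraded together, or CPU
    climbing unusually fast even if the absolute level still looks fine."""
    cpu_and_latency = cpu_pct_now > 80 and p95_s > 2.0
    rapid_climb = (cpu_pct_now - cpu_pct_5min_ago) > 40  # e.g. 20% -> 60% in 5 minutes
    return cpu_and_latency or rapid_climb
```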
Matching Thresholds to Metrics
Resource utilization (CPU, memory, disk): Percentage thresholds work, but remember that performance often degrades before hitting 100%. Memory paging thrashes performance long before memory is exhausted. 80% warning, 90% critical is a common starting point, but measure where your systems actually degrade.
Response times: Percentile-based, with duration requirements. Alert on sustained degradation, not single slow requests.
Error rates: Usually low thresholds—even 1% is often unacceptable for user-facing services. But batch jobs might tolerate more. Context matters.
Availability: Modern expectations are high. 99.9% warning, 99% critical is typical. But the measurement window matters enormously: 99% over five minutes allows only 3 seconds of downtime, while 99% over a month allows about 7 hours.
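The window arithmetic is simple but easy to get backwards, so it's worth writing down (availability as a percentage, window in seconds):

```python
def allowed_downtime_seconds(availability_pct, window_seconds):
    """Downtime budget a given availability target allows over a window."""
    return window_seconds * (1 - availability_pct / 100)

print(allowed_downtime_seconds(99.0, 5 * 60))          # 3 seconds over 5 minutes
print(allowed_downtime_seconds(99.0, 30 * 24 * 3600))  # ~25,920 s, about 7.2 hours a month
print(allowed_downtime_seconds(99.9, 30 * 24 * 3600))  # ~2,592 s, about 43 minutes a month
```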
Validate Before You Trust
Test thresholds before relying on them:
- Synthetic failures: Intentionally degrade systems and verify alerts fire at appropriate points
- Historical replay: Apply new thresholds to old data—would they have caught past incidents? How many false positives?
- Shadow mode: Log what would have alerted without actually sending notifications, then review
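Historical replay in particular is cheap to script. A sketch, assuming you can export per-minute p95 values and a list of minutes with known incidents:

```python
def replay(threshold_s, minute_p95s, incident_minutes):
    """Replay a candidate threshold against historical per-minute p95 values.
    `incident_minutes` is a set of minute indexes with known real incidents."""
    fired = {i for i, p95 in enumerate(minute_p95s) if p95 > threshold_s}
    return {
        "incidents_caught": len(fired & incident_minutes),
        "incidents_missed": len(incident_minutes - fired),
        "false_positive_firings": len(fired - incident_minutes),
    }

# Adjust the candidate threshold until past incidents are caught
# without drowning in false positives.
print(replay(2.0, minute_p95s=[0.9, 1.1, 2.4, 2.6, 1.0], incident_minutes={2, 3}))
```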
Start conservative. It's easier to make thresholds more sensitive than to recover from alert fatigue.
Document Your Reasoning
Thresholds without context become mysterious incantations nobody dares change.
Record why each threshold exists: "95th percentile response time > 2s based on 6 months of data showing normal range 0.8-1.2s." Note the business impact: "Response time > 2s causes 15% increase in cart abandonment." Reference incidents: "Set after incident #47 where 1.8s response time spiked support tickets."
Assign owners. Someone needs to understand each threshold well enough to adjust it when systems change.
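One lightweight way to keep that context attached to the number is to store thresholds as structured records rather than bare values. The fields here are illustrative, and the example reuses the rationale, impact, and incident reference from above:

```python
from dataclasses import dataclass

@dataclass
class ThresholdRecord:
    """Keep the 'why' next to the number so it never becomes folklore."""
    metric: str
    condition: str
    rationale: str
    business_impact: str
    related_incident: str
    owner: str

checkout_latency = ThresholdRecord(
    metric="checkout p95 response time",
    condition="> 2s for 5 consecutive minutes",
    rationale="6 months of data show a normal range of 0.8-1.2s",
    business_impact="response times over 2s correlate with ~15% more cart abandonment",
    related_incident="incident #47: 1.8s response times spiked support tickets",
    owner="checkout-team",
)
```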
Common Mistakes
Round numbers without justification: 80%, 90%, 100% are nice numbers but mean nothing if they don't reflect actual system behavior or business impact.
Same thresholds everywhere: Different services have different characteristics. Search is slower than homepage. Batch jobs tolerate more errors than checkout.
Set and forget: Thresholds go stale. Traffic patterns change. Infrastructure improves. What was appropriate last year might be wrong now.
Alerting on both symptom and cause: If slow disk causes slow responses, alert on response time (what users experience), not disk performance (the underlying cause). Alerting on both creates noise.
Keep Tuning
Threshold setting isn't a one-time task. It's ongoing maintenance.
After incidents, ask: Did thresholds detect the problem at the right time? Should they be more sensitive? Were false positives obscuring the real issue?
Track false positive rates. If a threshold fires falsely more than 10-20% of the time, fix it. Every false positive trains your team to ignore alerts.
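Tracking that rate only requires recording, for each firing, whether the responder found it actionable. A toy sketch, with a hypothetical review-entry shape:

```python
def false_positive_rate(firings):
    """`firings` is a list of post-alert review entries, e.g.
    {"alert": "checkout_p95", "actionable": False}."""
    if not firings:
        return 0.0
    noise = sum(1 for f in firings if not f["actionable"])
    return noise / len(firings)

# Anything much above 0.1-0.2 for a single alert is a threshold worth revisiting.
```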
When systems change—new infrastructure, performance optimizations, traffic growth—revisit thresholds. The baseline moved.
Talk to the engineers who respond to alerts. If they consistently dismiss certain alerts as noise, the threshold is wrong.