At 3 AM, your phone buzzes. Is the entire platform down, or did a batch job run slightly slow? Without severity levels, you can't know without looking. And if you have to look at everything, you'll eventually stop looking at anything.
When everything screams, nobody can hear. Severity levels exist to protect human attention—the scarcest resource in incident response.
What Breaks Without Severity
Without classification, every alert competes equally for attention:
Prioritization becomes impossible. Engineers can't distinguish "the site is down" from "a background job is slow." Both arrive as urgent notifications demanding the same response.
Alert fatigue sets in. When every alert feels urgent, none feels urgent. Engineers become desensitized, and eventually that 3 AM buzz gets ignored—even when it shouldn't be.
Escalation misfires. Minor issues reach senior engineers or executives because there's no way to route based on actual impact.
Communication fails. Without severity assessment, it's unclear which issues warrant status page updates or customer communication. Everything feels like it might be a crisis.
The Four Levels
Most organizations converge on four severity levels. The names vary (P1-P4, Sev1-Sev4, Critical/High/Medium/Low), but the structure is consistent.
Critical
Complete outages or severe degradation affecting most users. Revenue loss, data loss, or security breaches.
The entire website is down. Payment processing is failing. There's an active security breach. Authentication doesn't work for anyone.
Response: Immediate. Phone calls to on-call engineers. Acknowledgment within 5 minutes, active response within 15. All hands until resolved. If not acknowledged in 5 minutes, escalate to backup. If not resolved in 30 minutes, escalate to management.
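To make that timing concrete, here is a minimal sketch of the Critical escalation chain expressed as data. The class and step descriptions are illustrative assumptions, not the configuration of any particular paging tool.

```python
from dataclasses import dataclass

@dataclass
class EscalationStep:
    """One rung of the Critical escalation ladder (illustrative)."""
    after_minutes: int   # minutes since the alert fired
    action: str          # what happens if the issue is still open

# Mirrors the Critical response described above.
CRITICAL_ESCALATION = [
    EscalationStep(0, "call the primary on-call engineer"),
    EscalationStep(5, "if unacknowledged, call the backup on-call engineer"),
    EscalationStep(15, "active response expected; all hands until resolved"),
    EscalationStep(30, "if unresolved, escalate to engineering management"),
]
```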
High
Significant issues affecting a substantial share of users or critical features. Degraded performance or partial outages.
A key feature is broken—users can't add items to cart, or search returns nothing. Response times are 10x normal. One region is down in a multi-region setup. Disk is 95% full and climbing.
Response: Urgent but not immediate. Push notifications and Slack. Acknowledgment within 30 minutes, active response within an hour. No phone calls unless it escalates.
Medium
Issues that require attention, but not immediate action. Limited user impact or internal system problems.
A non-critical feature is degraded. A batch job failed but can be rerun. Performance is slow in a low-traffic feature. Resource usage is elevated but not critical.
Response: Business hours. Email, Slack without mentions, ticket creation. No expectation of response outside working hours.
Low
Informational events or very minor issues. Worth knowing about, rarely requiring action.
Metrics crossed a threshold but self-corrected. A maintenance task completed. Performance is outside optimal but within acceptable ranges.
Response: None required. Dashboard indicators, log entries, weekly summaries. Reviewed in batches during planning meetings.
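Pulling the four levels together, a routing table like the following sketch can encode who gets interrupted, how, and how fast. The field names and channel labels are illustrative assumptions, not a specific tool's configuration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SeverityPolicy:
    """Illustrative routing policy for one severity level."""
    channels: list[str]                    # how the alert reaches people
    ack_within_minutes: Optional[int]      # None = no acknowledgment expected
    respond_within_minutes: Optional[int]  # None = no active-response deadline
    wakes_people_up: bool                  # notify outside business hours?

# Mirrors the levels described above.
POLICIES = {
    "critical": SeverityPolicy(["phone", "push", "slack"], 5, 15, wakes_people_up=True),
    "high":     SeverityPolicy(["push", "slack"], 30, 60, wakes_people_up=True),
    "medium":   SeverityPolicy(["email", "slack", "ticket"], None, None, wakes_people_up=False),
    "low":      SeverityPolicy(["dashboard", "weekly-summary"], None, None, wakes_people_up=False),
}
```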
How to Assess Severity
Severity comes down to one question: How many users can't do what they came to do?
- All users blocked → Critical
- Most users affected or critical journeys broken → High
- Specific features, segments, or regions affected → Medium
- Minimal or no user impact → Low
Other factors matter—revenue loss, data integrity, security—but user impact is the primary signal. A security breach is Critical even with zero current user impact because the potential impact is catastrophic.
Workarounds change severity. If checkout is broken on web but works on mobile, that's High, not Critical. Users can still accomplish their goal.
Time changes severity. A B2B SaaS outage at 3 AM might be High; the same outage at 10 AM is Critical. But be careful—a security breach is Critical at any hour.
Duration changes severity. A performance degradation that starts as Medium might escalate to High after four hours of persistence.
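A minimal sketch of that assessment logic, with an illustrative enum and hypothetical parameter names, might look like this:

```python
from enum import Enum

class Severity(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4

def assess_severity(
    users_blocked: str,            # "all", "most", "some", or "none"
    security_breach: bool = False,
    workaround_exists: bool = False,
    off_peak_hours: bool = False,
    hours_ongoing: float = 0.0,
) -> Severity:
    """Illustrative mapping from user impact (plus modifiers) to a severity level."""
    # A security breach is Critical at any hour, regardless of current user impact.
    if security_breach:
        return Severity.CRITICAL

    base = {
        "all":  Severity.CRITICAL,  # all users blocked
        "most": Severity.HIGH,      # most users, or a critical journey broken
        "some": Severity.MEDIUM,    # specific features, segments, or regions
        "none": Severity.LOW,       # minimal or no user impact
    }[users_blocked]

    # A workaround lowers severity: broken on web but working on mobile is High, not Critical.
    if workaround_exists and base is Severity.CRITICAL:
        base = Severity.HIGH

    # Time of day can lower a Critical B2B outage to High off-peak (never a breach).
    if off_peak_hours and base is Severity.CRITICAL:
        base = Severity.HIGH

    # Persistence raises severity: a Medium degradation that drags on becomes High.
    if hours_ongoing >= 4 and base is Severity.MEDIUM:
        base = Severity.HIGH

    return base
```

With these rules, checkout broken on web but working on mobile comes back as High rather than Critical, and a Medium slowdown that persists past four hours is promoted to High.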
Severity Gaming
Here's what actually happens in organizations:
Everything becomes Critical. Teams learn that Critical gets faster response, so they mark everything Critical. Soon Critical means nothing.
Political pressure corrupts classification. An executive's pet project has issues? Suddenly it's Critical regardless of actual user impact.
Teams compete for resources. Engineering bandwidth is finite. Inflating severity becomes a strategy to get attention.
Crying wolf trains people to ignore you. After the third false Critical from the same team, engineers start deprioritizing that team's alerts.
The only defense is culture: value accurate assessment over speed of response. Audit severity classifications regularly. Call out inflation when you see it. Make it clear that accurate severity helps everyone, including the teams that might be tempted to inflate.
Making It Work
Document with examples. Abstract definitions invite interpretation. Concrete examples from your actual system make classification obvious. "Critical: the checkout flow from the October 15th incident" is clearer than "Critical: complete outages affecting most users."
Automate the obvious. If the error rate exceeds 50%, that's Critical. If disk is over 95% and climbing, that's High. Don't make humans decide things that have clear thresholds.
Allow human override. Automated assignment makes a good default, but context matters. The on-call engineer might know something the rules don't.
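As a sketch of that default-plus-override pattern (reusing the hypothetical Severity enum from the assessment sketch above; the thresholds come from the examples in this section):

```python
from typing import Optional

# Severity is the hypothetical enum defined in the assessment sketch above.

def auto_severity(error_rate: float, disk_used_fraction: float) -> Optional[Severity]:
    """Classify automatically only where a threshold makes the call obvious."""
    if error_rate > 0.50:
        return Severity.CRITICAL   # more than half of requests are failing
    if disk_used_fraction > 0.95:
        return Severity.HIGH       # disk over 95% full and climbing
    return None                    # no obvious answer: a human decides

def classify(error_rate: float, disk_used_fraction: float,
             override: Optional[Severity] = None) -> Optional[Severity]:
    """Automated assignment is the default; the on-call engineer's override wins."""
    return override if override is not None else auto_severity(error_rate, disk_used_fraction)
```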
Match response to severity. Critical invokes incident command, assigns a commander, establishes communication channels. Medium creates a ticket for business hours. If you treat Medium like Critical, you'll burn out your team. If you treat Critical like Medium, you'll lose customers.
Severity levels are a promise: we will protect your attention, so that when we ask for it urgently, you know it matters. Break that promise too often, and the system collapses. Keep it, and your team can respond to real emergencies without drowning in noise.