
Updated 10 hours ago

When your phone buzzes at 2 AM with an alert, you're not thinking about taxonomy. You're asking one question: Is something broken that I promised wouldn't break?

That question is the heart of incident classification. Everything else is detail.

What Makes Something an Incident

An incident is an unplanned interruption to service or reduction in service quality. Two words matter here.

Unplanned. Scheduled maintenance isn't an incident. A database migration that goes according to plan isn't an incident. An incident is something that shouldn't be happening—reality diverging from expectations.

Reduction. An incident doesn't require complete failure. If your website normally loads in 2 seconds but suddenly takes 30, that's an incident. If 10% of users see errors while 90% work fine, that's an incident. The service still "works" but you're breaking a promise about quality.
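
As a rough illustration, the two-word test above can be written as a small check. This is a minimal sketch, not a definitive rule: the `Observation` fields, the promised targets, and the 2x latency tolerance are all assumptions chosen to match the examples in this section.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    planned: bool          # True if inside a communicated maintenance window
    latency_s: float       # observed page load time, in seconds
    error_fraction: float  # share of users seeing errors, 0.0 to 1.0


# Hypothetical quality promises; real targets come from your own commitments.
PROMISED_LATENCY_S = 2.0
PROMISED_ERROR_FRACTION = 0.01
LATENCY_TOLERANCE = 2.0  # assumed: flag latency above 2x the promise


def is_incident(obs: Observation) -> bool:
    """Unplanned AND (interrupted or reduced in quality)."""
    if obs.planned:
        return False
    too_slow = obs.latency_s > PROMISED_LATENCY_S * LATENCY_TOLERANCE
    too_many_errors = obs.error_fraction > PROMISED_ERROR_FRACTION
    return too_slow or too_many_errors
```

With these assumed numbers, a 30-second page load or a 10% error rate counts as an incident, while the same symptoms inside a planned window do not.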

The Forms Incidents Take

Outages are complete failures. Your website returns 500 errors. Your API is unreachable. Nobody can use your service. These are obvious.

Degradations are subtler. Database queries run slowly, tripling page load times. The API responds, but with painful delays. Users can accomplish tasks, but they're frustrated. Still an incident.

Partial failures affect specific features or users. Checkout works everywhere except Germany. Mobile crashes while web works fine. Premium features fail while basic functionality continues.

Security incidents involve unauthorized access or exploited vulnerabilities. These demand an immediate response, even before they affect availability.

Data incidents involve loss, corruption, or integrity problems. Backups failing silently. Sync breaking. Users reporting missing or wrong information.

What Isn't an Incident

Known limitations working as designed. If your API rate-limits at 1000 requests per minute and someone hits that limit, the system is working correctly.

User errors. Forgotten passwords, invalid input—these are support issues, not operational incidents. Unless your password reset system is broken. Then it's an incident.

Feature requests. A user wanting new functionality isn't an operational incident, no matter how unhappy that user is.

Planned maintenance. If you communicated the window and followed change management, it's expected downtime.

Third-Party Failures Are Your Incidents

This trips people up. If a service you depend on goes down and your users suffer, that's your incident. You didn't cause it. You can't fix it directly. But your users are impacted, so you own the response.

Your customers don't care whose fault it is. They care that the thing they're paying for doesn't work.

The Operating Principle

Here's the rule that matters: If you're not sure whether it's an incident, it's an incident.

This isn't about being paranoid. It's about asymmetric costs. Responding to a non-incident wastes some time. Ignoring a real incident prolongs user impact, risks cascade failures, and damages trust.

The cost of false positives is low. The cost of false negatives is high. Bias toward action.
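
The asymmetry can be made concrete with a small expected-cost comparison. This is a sketch under stated assumptions: the cost units and the 1-versus-20 ratio are invented for illustration, and the model ignores everything except the probability that the problem is real.

```python
def should_respond(p_real: float, false_alarm_cost: float,
                   missed_incident_cost: float) -> bool:
    """Respond when the expected cost of ignoring exceeds that of responding.

    Responding costs something only when the problem turns out not to be real;
    ignoring costs something only when it is real.
    """
    cost_of_responding = (1 - p_real) * false_alarm_cost
    cost_of_ignoring = p_real * missed_incident_cost
    return cost_of_ignoring > cost_of_responding


def break_even_probability(false_alarm_cost: float,
                           missed_incident_cost: float) -> float:
    """Probability above which responding is the cheaper choice."""
    return false_alarm_cost / (false_alarm_cost + missed_incident_cost)


# Hypothetical costs: a false alarm wastes 1 unit, a missed incident costs 20.
threshold = break_even_probability(1, 20)  # 1/21, a little under 5%
```

Under these assumed costs, responding is the cheaper choice whenever you think there is more than about a 5% chance the problem is real, which is the arithmetic behind "bias toward action."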

Declaring an Incident

Many teams use formal declaration. When someone suspects a problem, they don't wait for certainty. They declare an incident and mobilize a response.

"We're going to incident status" creates immediate clarity. Everyone knows this situation gets urgent attention. If it turns out to be minor, you stand down. That is better than a delayed response to something major because nobody was sure it qualified.

Why Classification Matters

Resource allocation. Clear definitions tell everyone when to drop everything versus when to file a ticket for Monday.

Metrics. "Mean time to resolution" is meaningless if people classify events differently. You need consistency to measure improvement.
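
To see why consistency matters, consider the same set of events measured under two classification rules. The event data and the `mttr_minutes` helper below are hypothetical, made up purely to illustrate the effect.

```python
from statistics import mean

# Hypothetical events: (kind, minutes from detection to resolution).
events = [
    ("outage", 120),
    ("degradation", 30),
    ("degradation", 15),
    ("outage", 60),
]


def mttr_minutes(events, counted_kinds):
    """Mean time to resolution over the events a team counts as incidents."""
    durations = [minutes for kind, minutes in events if kind in counted_kinds]
    if not durations:
        return None  # nothing was classified as an incident
    return mean(durations)


outages_only = mttr_minutes(events, {"outage"})
everything = mttr_minutes(events, {"outage", "degradation"})
```

Here a team that counts only outages reports 90 minutes, while a team that counts degradations too reports 56.25 minutes, from identical events. Neither number is wrong, but they cannot be compared.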

Learning. If you only review outages and ignore degradations, you miss chances to learn from incidents that didn't cause complete failures.

Customer communication. Incident definitions determine when to update your status page, send notifications, or reach out proactively.

SLA accounting. Your commitments depend on incident definitions. Disagreements about classification can have contractual implications.

Thresholds That Trigger Incidents

Many organizations define explicit criteria:

  • Any customer-reported outage
  • Any alert persisting beyond a time threshold
  • Error rates above a defined percentage
  • Response times degraded beyond acceptable bounds
  • Security alerts from critical systems
  • Any data loss or corruption

These remove ambiguity. They also give you something to calibrate: if a threshold fires too often without real user impact, adjust it.
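
Criteria like the ones listed above could be encoded roughly as follows. This is a minimal sketch: the `Signals` fields and every threshold value are assumptions for illustration, not recommended defaults.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Signals:
    customer_reported_outage: bool = False
    alert_age_minutes: float = 0.0   # how long the oldest alert has persisted
    error_rate: float = 0.0          # fraction of failed requests, 0.0 to 1.0
    p95_latency_ms: float = 0.0
    critical_security_alert: bool = False
    data_loss_detected: bool = False


# Hypothetical thresholds; tune these against your own alert history.
MAX_ALERT_AGE_MINUTES = 10
MAX_ERROR_RATE = 0.05
MAX_P95_LATENCY_MS = 1500


def triggered_criteria(s: Signals) -> List[str]:
    """Return the names of all criteria that fired; any hit means an incident."""
    checks = [
        ("customer-reported outage", s.customer_reported_outage),
        ("alert persisting too long", s.alert_age_minutes > MAX_ALERT_AGE_MINUTES),
        ("error rate above threshold", s.error_rate > MAX_ERROR_RATE),
        ("response time degraded", s.p95_latency_ms > MAX_P95_LATENCY_MS),
        ("critical security alert", s.critical_security_alert),
        ("data loss or corruption", s.data_loss_detected),
    ]
    return [name for name, fired in checks if fired]
```

Returning the list of fired criteria, rather than a bare true or false, keeps a record of why an incident was opened. That record is what you would review when deciding whether a threshold fires too often.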

Human Judgment

Automated systems detect anomalies and fire alerts. Humans evaluate context, assess impact, and decide response levels.

This judgment improves with experience. Teams learn which alerts represent real impact and which are noise. They develop intuition about which warning signs escalate and which resolve naturally.

Good monitoring supports this judgment by providing context, historical trends, and correlation with other metrics. It helps humans decide better; it does not replace human decision-making.

