
When an alert goes to everyone, it goes to no one.

That's the core problem routing solves. A payment system failure needs the payments team—not the frontend team, not the data science team, not the entire engineering department. Database issues need the database team. Weekend issues might route differently than weekday issues. Without routing, you get one of two failure modes: either alerts go to everyone (creating noise so dense that people tune out) or alerts go to no one specific (and everyone assumes someone else will handle it).

Routing creates accountability. It answers the question: whose problem is this?

Service Ownership: The Foundation

The most fundamental routing principle is simple: every service has an owner, and alerts for that service go to that owner.

This requires a service catalog—a mapping from services to teams. When an alert fires for "Payment API," the routing system knows to notify the Payments team. When the authentication service fails, the Auth team gets paged. No ambiguity, no bouncing between teams trying to find the right owner.

Some services complicate this. Databases support multiple teams—the database team owns the infrastructure, but application teams own their queries and schemas. You need primary and secondary ownership. Third-party services might route to vendor management or account managers depending on the issue. Services without clear owners route to a default team (often platform or DevOps) who can triage and reassign.
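
To make this concrete, here is a minimal sketch of an ownership lookup, assuming a hypothetical in-memory catalog. The service names, team names, and default team are illustrative, not a prescribed schema.

  # Minimal sketch of a service catalog lookup. Service, team, and default
  # team names are hypothetical placeholders.
  SERVICE_CATALOG = {
      "payment-api":  {"primary": "payments", "secondary": None},
      "auth-service": {"primary": "auth", "secondary": None},
      "orders-db":    {"primary": "database", "secondary": "orders"},  # shared ownership
  }

  DEFAULT_TEAM = "platform"  # triages and reassigns alerts for unowned services

  def route_by_ownership(service: str) -> list[str]:
      """Return the teams to notify for an alert on `service`."""
      entry = SERVICE_CATALOG.get(service)
      if entry is None:
          return [DEFAULT_TEAM]
      teams = [entry["primary"]]
      if entry["secondary"]:
          teams.append(entry["secondary"])
      return teams

  print(route_by_ownership("orders-db"))       # ['database', 'orders']
  print(route_by_ownership("legacy-service"))  # ['platform']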

But the principle holds: clear ownership eliminates the "not my problem" phenomenon.

Severity Determines the Channel

Not all alerts deserve the same interruption:

Critical: Phone call to on-call. Wake them up. Send to primary and backup simultaneously. Create a high-priority incident ticket. Post to emergency Slack with @here. This is the fire alarm.

High: Push notification and Slack message. Create an incident ticket. Don't call unless it escalates. This needs attention soon, but not "interrupt dinner" soon.

Medium: Email to team alias. Post to team Slack without @mentions. Create a standard ticket. This can wait until someone checks their queue.

Low: Dashboard indicators and log entries. Weekly summary emails. No immediate notification. This is informational.

Different teams might have different preferences for the same severity. The database team might want SMS for high-severity issues while the frontend team prefers push notifications. Let teams configure their channels.
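
A sketch of what that mapping can look like, with per-team overrides layered on top of shared defaults. The channel names and the database team's preference here are assumptions for illustration.

  # Severity-to-channel defaults plus per-team overrides (illustrative names).
  DEFAULT_CHANNELS = {
      "critical": ["phone_call", "incident_ticket", "slack_emergency"],
      "high":     ["push", "slack_team", "incident_ticket"],
      "medium":   ["email", "slack_team", "ticket"],
      "low":      ["dashboard", "weekly_summary"],
  }

  # Teams can override the defaults for a given severity.
  TEAM_OVERRIDES = {
      ("database", "high"): ["sms", "slack_team", "incident_ticket"],
  }

  def channels_for(team: str, severity: str) -> list[str]:
      return TEAM_OVERRIDES.get((team, severity), DEFAULT_CHANNELS[severity])

  print(channels_for("database", "high"))  # ['sms', 'slack_team', 'incident_ticket']
  print(channels_for("frontend", "high"))  # ['push', 'slack_team', 'incident_ticket']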

Time-Aware Routing

A non-critical alert at 3 AM shouldn't page anyone. Route it to the morning queue.

Time-based routing accounts for reality:

Business hours vs. after hours: Some alerts warrant immediate response during the day but can wait until morning if they occur overnight.

Weekends: Different on-call rotations. Route to weekend coverage instead of weekday rotation.

Timezones: For distributed teams, route to the region where it's currently business hours. 3 PM Pacific pages US on-call. 3 AM Pacific pages Asia-Pacific on-call if they're covering. This is follow-the-sun—passing the pager westward as the earth turns, so someone is always awake to answer.

Maintenance windows: During planned maintenance, suppress expected alerts or route them to a maintenance channel instead of on-call.
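
Here is a rough sketch that combines these rules: follow-the-sun for critical alerts, a morning queue for non-critical alerts after hours, and a maintenance channel during planned work. The regions, business hours, and rotation names are illustrative assumptions.

  # Time-aware routing sketch (regions, hours, and names are illustrative).
  from datetime import datetime, time, timezone
  from zoneinfo import ZoneInfo

  REGIONS = {  # rotation name -> timezone it covers
      "us-oncall":   "America/Los_Angeles",
      "apac-oncall": "Asia/Singapore",
  }

  def in_business_hours(now_utc: datetime, tz: str) -> bool:
      local = now_utc.astimezone(ZoneInfo(tz))
      return local.weekday() < 5 and time(9) <= local.time() < time(17)

  def follow_the_sun(now_utc: datetime) -> str:
      """Page the rotation whose region is currently in business hours."""
      for rotation, tz in REGIONS.items():
          if in_business_hours(now_utc, tz):
              return rotation
      return "us-oncall"  # fallback: someone still has to carry the pager

  def route_with_time(severity: str, team_tz: str, now_utc: datetime,
                      in_maintenance: bool) -> str:
      if in_maintenance:
          return "maintenance-channel"    # expected alerts skip on-call entirely
      if severity == "critical":
          return follow_the_sun(now_utc)  # always page whoever is awake
      if in_business_hours(now_utc, team_tz):
          return "team-oncall"
      return "morning-queue"              # the non-critical 3 AM alert can wait

  print(route_with_time("high", "America/New_York",
                        datetime(2024, 6, 5, 7, 0, tzinfo=timezone.utc), False))
  # -> 'morning-queue' (3 AM Eastern)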

Escalation: When No One Answers

Routing doesn't end at the first notification. When the initial page goes unacknowledged, escalate:

  • If primary on-call doesn't acknowledge within 5 minutes, route to backup
  • If backup doesn't respond within 10 minutes, route to team lead
  • If an issue persists beyond time thresholds (high severity unresolved after 1 hour), escalate to higher severity routing
  • For major incidents lasting beyond thresholds, automatically notify engineering management
  • After exhausting individual escalation tiers, broadcast to entire team as last resort

Escalation is the safety net. It ensures that even if the primary responder is unavailable, unreachable, or overwhelmed, someone will eventually see the alert.
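
A minimal sketch of such an escalation chain, with thresholds mirroring the list above. The tier names are placeholders, and the notify/acknowledged helpers stand in for whatever paging backend is in use.

  # Time-based escalation chain (tier names and thresholds are illustrative).
  import time

  ESCALATION_CHAIN = [
      ("primary-oncall", 5 * 60),   # escalate if unacknowledged after 5 minutes
      ("backup-oncall", 10 * 60),   # then 10 more minutes for the backup
      ("team-lead",     10 * 60),
      ("entire-team",   None),      # last resort: broadcast, no further tier
  ]

  def notify(target: str) -> None:
      print(f"notifying {target}")  # placeholder for a real pager call

  def acknowledged(target: str) -> bool:
      return False                  # placeholder: poll the paging system

  def escalate(alert_id: str) -> None:
      for target, ack_window in ESCALATION_CHAIN:
          notify(target)
          if ack_window is None:
              return                # broadcast tier: nothing left to try
          deadline = time.monotonic() + ack_window
          while time.monotonic() < deadline:
              if acknowledged(target):
                  return            # someone owns the alert; stop escalating
              time.sleep(30)        # poll interval (illustrative)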

Context Informs Routing

Alert content can determine destination:

  • Authentication errors route to auth team
  • Payment errors route to payments team
  • Database errors route to database team
  • Alerts affecting enterprise customers might escalate to include customer success
  • Infrastructure issues route to teams responsible for specific regions
  • Third-party service failures might route to vendor management

The routing system reads the alert and makes intelligent decisions. Machine learning can even learn patterns from historical incident responses—observing which teams actually resolve which types of issues.
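
One way to express content-based rules, assuming a simple dictionary-shaped alert payload. The field names, patterns, and team names are illustrative.

  # Content-based routing sketch: match rules against fields in the alert.
  CONTENT_RULES = [
      # (predicate over the alert, destination team)
      (lambda a: "auth" in a.get("error_class", ""),     "auth"),
      (lambda a: "payment" in a.get("error_class", ""),  "payments"),
      (lambda a: a.get("component") == "database",       "database"),
      (lambda a: a.get("customer_tier") == "enterprise", "customer-success"),
      (lambda a: a.get("source") == "third-party",       "vendor-management"),
  ]

  def route_by_content(alert: dict, default: str = "platform") -> list[str]:
      """Return every team whose rule matches; fall back to a default owner."""
      matches = [team for rule, team in CONTENT_RULES if rule(alert)]
      return matches or [default]

  alert = {"error_class": "payment_timeout", "customer_tier": "enterprise"}
  print(route_by_content(alert))  # ['payments', 'customer-success']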

Self-Service Configuration

Teams should control their own routing:

On-call rotations: Teams manage their own schedules, swaps, and coverage.

Service ownership: Teams claim ownership of services, automatically routing related alerts to them.

Notification preferences: Teams customize which alerts go to on-call vs. team channel vs. individual engineers.

Temporary overrides: Easy creation of routing changes for planned maintenance or special events.

Centralized routing rules should live in version control, be testable before deployment, and be documented. But the day-to-day management of who gets paged for what should be in the hands of the teams themselves.
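
A small sketch of what "testable before deployment" can mean in practice: routing rules kept as reviewable data, plus a validation check that runs in CI. The rule fields and team names are assumptions.

  # Routing rules as version-controlled data, validated before they ship.
  ROUTING_RULES = [
      {"service": "payment-api",  "team": "payments", "override_until": None},
      {"service": "auth-service", "team": "auth",     "override_until": "2024-07-01"},  # temporary override
  ]

  KNOWN_TEAMS = {"payments", "auth", "database", "platform"}

  def validate_rules(rules: list[dict]) -> list[str]:
      """Catch routing mistakes (unknown teams, duplicate services) in CI."""
      errors, seen = [], set()
      for rule in rules:
          if rule["team"] not in KNOWN_TEAMS:
              errors.append(f"{rule['service']}: unknown team {rule['team']}")
          if rule["service"] in seen:
              errors.append(f"{rule['service']}: duplicate routing rule")
          seen.add(rule["service"])
      return errors

  assert validate_rules(ROUTING_RULES) == []  # run in CI before rules deploy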

What Goes Wrong

Common routing failures:

Broadcast everything: Sending all alerts to the entire engineering organization. This creates massive alert fatigue and diffuses responsibility. When everyone is responsible, no one is.

Single point routing: Everything goes to one person or team. Overloads them and creates a single point of failure.

Static configuration: Never updating routing as team structures change. Alerts continue going to people who left months ago.

Unclear ownership: Services without clear owners lead to alerts bouncing between teams or going unaddressed.

The fix for all of these is the same: clear ownership, regular review, and feedback loops. When engineers report receiving irrelevant alerts, routing should adjust.

Key Takeaways

  • Every service needs an owner; alerts for that service go to that owner
  • Severity determines the channel—phone calls for critical, dashboard entries for low
  • Time-aware routing respects sleep and routes to whoever is awake
  • Escalation ensures someone always responds, even if the primary responder is unavailable
  • Teams should control their own rotations and preferences
  • Broadcast is the absence of routing—when alerts go to everyone, they go to no one
