On-Call Rotations

Someone needs to respond when systems fail at 3 AM. On-call rotations distribute this responsibility across team members—but they're more than schedules. They're a social contract. Engineers agree to have their sleep interrupted, their dinners paused, their weekend plans held hostage by a phone that might ring. In exchange, the organization implicitly promises not to abuse that agreement.

Well-designed rotations honor this contract. Poorly designed ones break it, burning out engineers who eventually leave or stop caring.

Why On-Call Exists

Modern systems promise 24/7 availability. Humans don't work 24/7. This gap creates on-call.

Continuous coverage: Users expect services to work at midnight on Sunday. When they don't, someone must respond—regardless of what time the clock shows.

Distributed responsibility: One person handling all alerts burns out within months. Spreading the burden keeps any single engineer from drowning.

Knowledge distribution: Rotating on-call exposes different engineers to operational reality. The person who built the service learns how it actually behaves at scale. The person who never touches it learns why certain design decisions matter.

Skill development: Nothing teaches troubleshooting like a production incident at 2 AM with customers waiting. On-call accelerates system understanding in ways no documentation can.

Shared burden: Everyone who benefits from running production systems should share the cost of maintaining them. On-call makes that cost visible and distributable.

Rotation Schedules

Different rotation lengths and coverage models come with different tradeoffs.

Weekly Rotations

One engineer covers one week. Predictable, long enough to develop rhythm, short enough that bad weeks eventually end. The downside: a week of frequent pages means a week of disrupted sleep. Best for teams of 4-8 engineers.
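
As a rough illustration, a weekly rotation can be generated with a simple round-robin. This is a minimal sketch, not a scheduling tool: the team names, the start date, and the `weekly_rotation` helper are all hypothetical.

```python
from datetime import date, timedelta


def weekly_rotation(engineers, start, weeks):
    """Assign one engineer per week, round-robin, beginning at `start`."""
    schedule = []
    for i in range(weeks):
        week_start = start + timedelta(weeks=i)
        schedule.append((week_start, engineers[i % len(engineers)]))
    return schedule


# Hypothetical four-person team; 2024-01-01 is a Monday.
team = ["ana", "ben", "chen", "dee"]
for week_start, engineer in weekly_rotation(team, date(2024, 1, 1), 6):
    print(week_start.isoformat(), engineer)
```

With four engineers, each person comes back on-call every fourth week. Real schedules also have to account for vacations and shift swaps, which this sketch ignores.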

Two-Week Rotations

Longer periods mean more time between rotations and deeper context on ongoing issues. But two weeks of nighttime pages is exhausting, and the unfairness feels sharper when your two weeks have a major incident while someone else's are quiet. Best for teams of 6-12.

Follow-the-Sun

Different regions cover different hours: a European team handles European business hours, an American team handles American business hours. On-call aligns with working hours, so no one is paged at night. This requires teams spread across enough time zones to cover the full day, plus excellent handoff documentation, but it eliminates the sleep disruption that makes traditional on-call so costly.
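
One way to picture follow-the-sun is as a lookup from the current UTC hour to the region on duty, plus a check that the windows leave no gaps. The sketch below assumes three regions with eight-hour windows; the region names and hours are illustrative assumptions, not a recommendation.

```python
# Hypothetical shift windows as (region, start_hour, end_hour) in UTC,
# with the end hour exclusive. Real windows depend on where teams sit.
SHIFTS = [
    ("APAC", 0, 8),
    ("EMEA", 8, 16),
    ("AMER", 16, 24),
]


def region_on_call(utc_hour, shifts=SHIFTS):
    """Return the region whose window contains `utc_hour` (0-23)."""
    for region, start, end in shifts:
        if start <= utc_hour < end:
            return region
    raise ValueError(f"no region covers hour {utc_hour}")


def coverage_gaps(shifts):
    """Return the UTC hours that no region covers."""
    covered = set()
    for _, start, end in shifts:
        covered.update(range(start, end))
    return sorted(set(range(24)) - covered)


print(region_on_call(10))        # region on duty at 10:00 UTC
print(coverage_gaps(SHIFTS))     # empty list means full coverage
```

Dropping a region from `SHIFTS` makes the gap visible immediately: those hours either need a night shift or a longer working window somewhere else.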

Primary Plus Backup

Primary on-call takes alerts first. Backup responds if primary doesn't acknowledge or needs help. This provides redundancy and training opportunities, but it leaves two people partially disrupted rather than one person fully on-call.

The Fairness Problem

Rotations feel unfair faster than almost anything else in engineering.

Equal frequency: Everyone rotates at the same rate. No one permanently avoids on-call while others carry the burden.

Weekend distribution: Rotations that consistently give some engineers every weekend while others never have weekend duty breed resentment quickly.

Holiday equity: Rotate through major holidays. No single person should always be on-call for Christmas.

Page volume tracking: If someone consistently gets more pages due to when they're scheduled (maybe deployments happen during their shift), adjust rotations or address the root cause.

Shift swapping: Let engineers trade shifts. Personal lives don't pause for rotation schedules.
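
Fairness is easier to discuss with numbers in hand. The following sketch tallies total, weekend, and holiday on-call days per engineer from a list of daily assignments. The `burden_by_engineer` helper, the names, and the holiday dates are hypothetical.

```python
from collections import Counter
from datetime import date, timedelta


def burden_by_engineer(shifts, holidays=frozenset()):
    """Tally total, weekend, and holiday on-call days per engineer.

    `shifts` is an iterable of (day, engineer) pairs, one per on-call day.
    """
    totals, weekends, holiday_days = Counter(), Counter(), Counter()
    for day, engineer in shifts:
        totals[engineer] += 1
        if day.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
            weekends[engineer] += 1
        if day in holidays:
            holiday_days[engineer] += 1
    return {
        e: {"days": totals[e], "weekend": weekends[e], "holiday": holiday_days[e]}
        for e in totals
    }


# Two engineers alternating weeks for four weeks, starting on a Monday.
start = date(2024, 12, 16)
shifts = [
    (start + timedelta(days=d), "ana" if (d // 7) % 2 == 0 else "ben")
    for d in range(28)
]
holidays = {date(2024, 12, 25), date(2025, 1, 1)}
for engineer, tally in burden_by_engineer(shifts, holidays).items():
    print(engineer, tally)
```

Large differences between engineers in the weekend or holiday columns are a signal to rebalance the schedule before resentment builds.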

Handoffs

The moment responsibility transfers is when context dies. Good handoffs prevent this.

Scheduled time: A fixed handoff time (for example, Monday 9 AM) when responsibility transfers. No ambiguity about who owns what.

Written summary: The outgoing on-call engineer documents active incidents, recent alerts that might recur, systems in a degraded state, planned maintenance, and known issues with their workarounds.

Synchronous call: Brief conversation where outgoing and incoming discuss current state. Written documentation alone misses nuance and the chance to ask "wait, what do you mean by that?"

Overlap period: 30-60 minutes where both engineers are available. Incoming can ask questions while outgoing still has fresh context.

Compensation

How organizations compensate on-call reveals what they actually believe about the burden.

On-call pay: Flat payment for being on-call regardless of pages. This acknowledges that availability itself has a cost: you can't drink, you can't go to the movies, and you can't be far from your laptop and phone.

Incident pay: Additional compensation for actually responding. Per-incident or hourly rate.

Comp time: Time off after on-call periods, especially after rough ones with many incidents.

No additional pay: Some organizations expect on-call as part of standard senior engineer responsibilities. This works only when on-call burden is genuinely light and evenly distributed.

The phone that might ring at 3 AM changes how you sleep, how you drink, whether you leave the house. That's not a scheduling problem—it's a lifestyle tax. Compensation should reflect this.

Burnout Prevention

On-call burnout is predictable and preventable.

Page volume limits: If a single engineer gets paged more than a threshold (perhaps 5 times in a shift), automatically escalate to backup and investigate why alerts are so frequent. That's not on-call—that's a system on fire.

Post-incident recovery: After major incidents involving hours of response, provide compensatory time off. Adrenaline debt is real.

Rotation caps: No one should be on-call more than 20-25% of the time. A team of four with weekly rotations puts each engineer on-call one week in four, which is already at the 25% limit.

No permanent backup: Avoid having senior engineers "always on-call" as unofficial backup. Everyone needs complete breaks where the phone genuinely cannot ring for them.

Alert quality investment: The single biggest burnout factor is frequent pages, especially false alarms. Every false positive trains engineers to ignore alerts. Invest in reducing noise.

Respect boundaries: Genuine emergencies warrant pages. Non-emergencies should wait. The question "is this worth waking someone up for?" should be asked seriously.
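
The page volume limit and the rotation cap above both reduce to simple checks. The sketch below uses the thresholds mentioned in the text (5 pages per shift, 25% of time on-call) as defaults; the function names are hypothetical, and `on_call_share` assumes an evenly divided rotation.

```python
PAGE_LIMIT = 5            # pages per shift, the example threshold from the text
MAX_ON_CALL_SHARE = 0.25  # upper end of the 20-25% cap


def should_escalate(pages_this_shift, limit=PAGE_LIMIT):
    """True once an engineer has been paged more than `limit` times."""
    return pages_this_shift > limit


def on_call_share(team_size, concurrent_roles=1):
    """Fraction of time each engineer is on-call in an even rotation.

    `concurrent_roles` is how many people are on-call at once, for
    example 2 when a primary and a backup are both scheduled.
    """
    return concurrent_roles / team_size


def within_cap(team_size, concurrent_roles=1, cap=MAX_ON_CALL_SHARE):
    return on_call_share(team_size, concurrent_roles) <= cap


print(should_escalate(6))                 # more than 5 pages in a shift
print(within_cap(4))                      # team of four, primary only
print(within_cap(4, concurrent_roles=2))  # same team with primary plus backup
```

Counting backup duty as on-call time changes the picture: under this assumption, a primary-plus-backup rotation needs roughly eight engineers to stay within the same cap.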

Escalation

Clear escalation ensures coverage when primary on-call is unavailable or overwhelmed.

Tiered escalation: Primary on-call first; backup if primary doesn't acknowledge within 5 minutes; team lead if backup doesn't respond within 10; engineering director for major incidents or when no one responds.

Automatic escalation: Systems should escalate automatically based on acknowledgment. Humans shouldn't need to manually escalate at 3 AM.

Cross-team escalation: Some incidents require other teams. Escalation policies should include relevant teams for different incident types.
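
A tiered policy like the one above can be expressed as data and evaluated automatically. This sketch treats each number as the time a tier has to acknowledge before the page moves on; that reading, the 15-minute wait for the team lead, and the `current_tier` helper are assumptions made for illustration. Paging tools usually offer this as built-in configuration.

```python
# (tier, minutes this tier has to acknowledge before escalating further).
# None marks the final tier, which has nowhere left to escalate.
POLICY = [
    ("primary", 5),
    ("backup", 10),
    ("team lead", 15),              # assumed wait; not given in the text
    ("engineering director", None),
]


def current_tier(minutes_unacknowledged, policy=POLICY):
    """Return which tier should be holding the page right now."""
    elapsed = 0
    for tier, wait in policy:
        if wait is None or minutes_unacknowledged < elapsed + wait:
            return tier
        elapsed += wait
    return policy[-1][0]


for minutes in (0, 5, 15, 30):
    print(minutes, current_tier(minutes))
```

Keeping the policy as plain data makes it easy to review alongside the rotation itself and to define separate policies for different incident types or teams.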

Junior Engineers

On-call is valuable learning but shouldn't throw inexperienced engineers into situations beyond their capabilities.

Shadow first: Junior engineers shadow a senior engineer's on-call shift, asking questions without carrying primary responsibility.

Paired on-call: Junior handles initial triage with senior as backup and mentor.

Graduated severity: Start with low-severity alerts, increase responsibility as skills develop.

Low escalation threshold: Junior engineers should escalate early and often. Escalating unnecessarily is always better than struggling alone while an incident grows.

Documentation

Quality documentation makes on-call sustainable.

Runbooks: Step-by-step guides for common issues. On-call engineers shouldn't need to figure everything out from scratch at 2 AM.

Architecture diagrams: Visual representations of components and dependencies help troubleshoot unfamiliar issues.

Recent incident summaries: If a similar issue occurs, the on-call engineer can see how it was resolved last time.

Access instructions: How to access logs, dashboards, production systems—documented so midnight troubleshooting doesn't start with "how do I even get in here?"

Metrics

Track data to improve on-call experience over time.

Page frequency: Average pages per shift. Increasing trends indicate declining reliability or alert quality.

Time to acknowledge: Increasing times might indicate alert fatigue—engineers are slower to respond when they've been trained that most pages don't matter.

False positive rate: Percentage of pages that don't represent real problems. High rates cause burnout and erode trust in alerting.

Time distribution: When do pages occur? Concentration at specific times suggests addressable root causes.

Use this data to improve systems, refine alerts, and distribute burden more fairly.
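
The four metrics above can be computed from a basic log of pages. The sketch below assumes each page records when it fired, when it was acknowledged, and whether it turned out to be actionable; the `Page` record and `summarize` function are hypothetical, and the sample data is invented purely to exercise the code.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Page:
    fired_at: datetime
    acknowledged_at: datetime
    actionable: bool  # False means the page was a false positive


def summarize(pages, shifts):
    """Compute page frequency, time to acknowledge, false positive
    rate, and hour-of-day distribution for a list of pages."""
    if not pages:
        return {
            "pages_per_shift": 0.0,
            "mean_minutes_to_ack": None,
            "false_positive_rate": None,
            "pages_by_hour": Counter(),
        }
    ack_minutes = [
        (p.acknowledged_at - p.fired_at).total_seconds() / 60 for p in pages
    ]
    return {
        "pages_per_shift": len(pages) / shifts,
        "mean_minutes_to_ack": mean(ack_minutes),
        "false_positive_rate": sum(not p.actionable for p in pages) / len(pages),
        "pages_by_hour": Counter(p.fired_at.hour for p in pages),
    }


# Invented sample: three pages around 02:00 and one in the afternoon.
night = datetime(2024, 3, 4, 2, 0)
afternoon = datetime(2024, 3, 5, 14, 0)
sample = [
    Page(night, night + timedelta(minutes=2), True),
    Page(night, night + timedelta(minutes=4), False),
    Page(night + timedelta(minutes=30), night + timedelta(minutes=38), True),
    Page(afternoon, afternoon + timedelta(minutes=6), True),
]
print(summarize(sample, shifts=2))
```

Tracked per shift over several months, these numbers show whether alert tuning is working and whether some time slots carry more of the load than others.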

Key Takeaways

On-call is a social contract, not just a schedule. Engineers agree to availability; organizations promise not to abuse it.

Fair rotations distribute weekends, holidays, and page volume equitably. Perceived unfairness destroys morale faster than heavy load.

Handoffs with written documentation and synchronous calls prevent context loss between rotations.

Burnout prevention requires page volume limits, compensatory time, and relentless investment in alert quality.

Compensation should reflect the reality that on-call changes how people live, not just how they work.
