At 2 AM, a pager wakes you with an alert about database connection failures. Half-asleep, heart racing, you need to diagnose and fix the problem quickly. This is not the moment for creative problem-solving. This is the moment for a runbook.
A runbook is a letter from your past self to your future self—written when you're calm, read when you're not. It documents known issues, troubleshooting steps, and solutions so you don't have to figure everything out from scratch while the clock ticks and customers can't complete purchases.
What Makes a Runbook Work
The real test of a runbook: could a panicked engineer with shaking hands, reading on a phone screen in the dark, follow this? Every word that fails that test is a word that might cost you minutes you don't have.
Actionable steps: "Check database connection pool status" not "investigate database." Tell engineers exactly what to do, in order, numbered.
Assumes stress: Written for engineers woken at night. Simple language. Clear structure. No assumptions about context the reader might be missing. Your 2 AM brain works well below its daytime capacity. Write for that brain.
Complete context: What does this alert mean? Why does it fire? What are users experiencing right now? How urgent is this? Don't make engineers guess at stakes.
Decision trees: Real problems branch. "If connection count is high, go to step 5. If connection count is normal but queries are slow, go to step 8." Guide the diagnosis instead of assuming a single path.
Copy-paste commands: Include exact commands with placeholders clearly marked. Engineers shouldn't construct commands from memory under pressure. One typo in a command at 2 AM can turn a 10-minute fix into an hour-long disaster.
Links to everything: Dashboards, logs, configuration files, related docs. Every reference is a URL. No hunting for information while the incident clock runs.
Escalation criteria: "If issue not resolved within 30 minutes, escalate to database team lead." Engineers need permission to ask for help. Without explicit criteria, they'll either escalate too early (wasting others' time) or too late (prolonging outages).
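One way to keep copy-paste commands safe is to store them as templates and refuse to render any command that still has an unfilled placeholder. The following is a minimal sketch, not a prescribed tool: the PostgreSQL `psql` command, the placeholder names `db_host` and `db_name`, and the helper `render_command` are all assumptions chosen for illustration.

```python
from string import Template

# Hypothetical runbook command. The source does not name a database;
# PostgreSQL and these placeholder names are assumptions for illustration.
CHECK_CONNECTIONS = Template(
    'psql -h $db_host -d $db_name -c "SELECT count(*) FROM pg_stat_activity;"'
)


def render_command(template: Template, **values: str) -> str:
    """Fill every placeholder or fail loudly; never emit a half-filled command."""
    try:
        return template.substitute(**values)
    except KeyError as missing:
        raise ValueError(f"placeholder {missing} was not filled in") from None


if __name__ == "__main__":
    print(render_command(CHECK_CONNECTIONS,
                         db_host="db.example.internal", db_name="orders"))
```

Failing on a missing value is deliberate: a command with a literal `$db_name` left in it is exactly the kind of 2 AM typo the runbook is meant to prevent.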
Anatomy of a Runbook
A consistent structure means engineers always know where to look:
Header
Alert name, severity, affected service, responsible team. Orientation in seconds.
Impact
What this means for users and business. "Payment processing fails. Customers cannot complete purchases. Revenue loss ~$5000/minute during peak hours." This isn't drama—it's calibration. The severity of your response should match the severity of the impact.
Immediate Actions
First steps before deep diagnosis. Check the dashboard. Verify user impact. If the situation is catastrophic (>50% failure rate), skip diagnosis and go straight to mitigation.
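The "skip diagnosis when catastrophic" rule can be written down as a single check so nobody has to do the arithmetic under pressure. This is a minimal sketch: the 50% threshold comes from the text above, while the function name `next_phase` and the choice to diagnose first when there is no traffic data are assumptions.

```python
CATASTROPHIC_FAILURE_RATE = 0.50  # from the text above: ">50% failure rate"


def next_phase(failed_requests: int, total_requests: int) -> str:
    """Return 'mitigation' when the failure rate is catastrophic, else 'diagnosis'."""
    if total_requests <= 0:
        # Assumption: with no traffic data, impact is unverified, so diagnose first.
        return "diagnosis"
    failure_rate = failed_requests / total_requests
    if failure_rate > CATASTROPHIC_FAILURE_RATE:
        return "mitigation"
    return "diagnosis"
```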
Diagnosis
Systematic troubleshooting with decision points. Check connection pool status. If exhausted, determine cause: connection leak? Traffic spike? Database degradation? Each branch leads to different resolution steps.
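The branching above can be sketched as a small decision function. Everything here is illustrative: the `PoolSnapshot` fields stand in for numbers an engineer would read off a dashboard, and the "twice the baseline" thresholds are assumptions, not recommended values.

```python
from dataclasses import dataclass


@dataclass
class PoolSnapshot:
    """Hypothetical measurements read from a dashboard during the incident."""
    active_connections: int
    max_connections: int
    requests_per_second: float
    baseline_requests_per_second: float
    query_latency_ms: float
    baseline_query_latency_ms: float


def diagnose(s: PoolSnapshot) -> str:
    """Walk the branches in order and name the resolution path to follow."""
    if s.active_connections < s.max_connections:
        return "pool not exhausted"
    # Pool is exhausted. The 2x thresholds below are illustrative assumptions.
    if s.requests_per_second > 2 * s.baseline_requests_per_second:
        return "traffic spike"
    if s.query_latency_ms > 2 * s.baseline_query_latency_ms:
        return "database degradation"
    # Normal traffic and a healthy database, yet no free connections.
    return "connection leak"
```

Each returned label maps to one branch of the Resolution section, so the written runbook and the check stay in step.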
Mitigation
How to reduce impact while you're still figuring out root cause. Increase pool size temporarily. Scale horizontally. Buy yourself time. Mitigation isn't resolution—it's stopping the bleeding.
Resolution
Permanent fixes for each root cause identified in diagnosis. Restart services to clear leaked connections, then file the code fix for the leak. Scale for traffic. Escalate for database issues beyond your control.
Escalation
When to escalate, who to contact, how to reach them. PagerDuty links. Slack channels. Phone numbers for true emergencies. Don't make engineers search for contact info during an incident.
Post-Resolution
Return temporary changes to normal. Document what you learned. File the post-incident review. Update the runbook itself if you discovered something new.
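A consistent structure is easy to check mechanically. The sketch below assumes each runbook is plain text with every section heading on its own line, as in the anatomy above; the function name `missing_sections` is hypothetical.

```python
# Section names taken from the anatomy described above.
REQUIRED_SECTIONS = (
    "Header", "Impact", "Immediate Actions", "Diagnosis",
    "Mitigation", "Resolution", "Escalation", "Post-Resolution",
)


def missing_sections(runbook_text: str) -> list[str]:
    """Return required section headings that do not appear on their own line."""
    headings = {line.strip() for line in runbook_text.splitlines()}
    return [name for name in REQUIRED_SECTIONS if name not in headings]


if __name__ == "__main__":
    draft = "Header\nAlert: db-connections\nImpact\nPayments fail.\nDiagnosis\n..."
    print(missing_sections(draft))
```

A check like this could run when a runbook is saved or reviewed, so gaps surface while the author is calm instead of during an incident.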
Where Runbooks Come From
The best runbooks are born from incidents.
During incidents: Document troubleshooting steps as you perform them. What commands did you run? What worked? What didn't? These notes are raw material.
After incidents: Transform notes into structured runbooks. The troubleshooting path that led to resolution becomes the runbook for next time.
From repeated alerts: Any alert firing more than a few times per year deserves a runbook. Don't make engineers solve the same problem from scratch repeatedly.
From senior engineers' heads: When experienced engineers explain how to handle situations, capture that knowledge. Tribal knowledge dies when people leave. Runbooks persist.
Keeping Runbooks Alive
A runbook that was accurate six months ago might be dangerously wrong today. Systems change. Commands change. Thresholds change.
Update after every use: Did the runbook work? What was confusing? What was missing? What was wrong? Every incident is a test of the runbook. Every post-incident review is a chance to improve it.
Update when systems change: New architecture, different tools, updated procedures—all require runbook updates. The deploy that changes your database connection handling should include a runbook update.
Assign ownership: Each runbook has an owner responsible for keeping it current. They review it quarterly and after every related incident.
Track staleness: Mark last-update dates. Flag runbooks not updated in 6+ months for review. Stale runbooks are worse than no runbooks—they inspire false confidence.
Make feedback easy: Include "Report problem with this runbook" links. Engineers who hit issues during incidents usually don't have time to file detailed bug reports. Make it trivially easy.
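Staleness tracking can be as simple as comparing last-update dates against a cutoff. This is a minimal sketch that assumes the dates are already available as a mapping from runbook name to date; where they come from (file metadata, a wiki API, a front-matter field) is left open, and the 183-day cutoff is an approximation of "6+ months".

```python
from datetime import date, timedelta

STALE_AFTER = timedelta(days=183)  # roughly six months


def stale_runbooks(last_updated: dict[str, date], today: date) -> list[str]:
    """Return names of runbooks older than STALE_AFTER, oldest first."""
    overdue = [
        (updated, name)
        for name, updated in last_updated.items()
        if today - updated > STALE_AFTER
    ]
    return [name for _, name in sorted(overdue)]


if __name__ == "__main__":
    # Hypothetical runbook names and dates.
    runbooks = {
        "db-connections": date(2023, 11, 1),
        "disk-full": date(2024, 5, 1),
        "cert-expiry": date(2023, 6, 1),
    }
    print(stale_runbooks(runbooks, today=date(2024, 7, 1)))
```

Sorting oldest first puts the runbooks most likely to be wrong at the top of the owner's review list.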
Finding Runbooks When You Need Them
The best runbook in the world is useless if you can't find it at 2 AM.
Link from alerts: Every alert notification includes a direct link to its runbook. No searching. The alert tells you something is wrong; the link tells you what to do about it.
One repository: Everyone knows where runbooks live. Not scattered across wikis, repos, and shared drives. One source of truth.
Searchable: By service name, alert name, symptoms, error messages. However the engineer thinks about the problem, they can find the answer.
Mobile-friendly: Engineers might respond from phones initially. Runbooks must be readable on small screens.
No permissions required: Anyone who might respond needs instant access. Don't require VPN, special accounts, or approval to read runbooks. When the pager goes off, every authentication hurdle prolongs the incident.
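The "link from alerts" rule can be enforced with a small lint over alert definitions. The sketch assumes alerts are available as dictionaries with a `name` key and an optional `runbook_url` key; both field names are hypothetical and would need to match whatever your alerting system actually stores.

```python
def alerts_without_runbook(alerts: list[dict]) -> list[str]:
    """Return names of alert definitions that lack a usable runbook link."""
    missing = []
    for alert in alerts:
        # `runbook_url` is an assumed field name, not a standard one.
        url = str(alert.get("runbook_url", "")).strip()
        if not url.startswith(("http://", "https://")):
            missing.append(alert["name"])
    return missing


if __name__ == "__main__":
    alerts = [
        {"name": "DbConnectionFailures",
         "runbook_url": "https://runbooks.example.com/db-connections"},
        {"name": "DiskAlmostFull"},
    ]
    print(alerts_without_runbook(alerts))
```

Run as part of reviewing alert changes, a check like this keeps new alerts from shipping without a runbook behind them.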
Runbooks and Automation
Runbooks and automation aren't opposites—they're stages.
Start with runbooks: Document the manual process first. Understand what you're doing and why before automating.
Automate the obvious: Steps performed identically every time are automation candidates. "Restart service X" is a script waiting to happen.
Keep judgment human: "Determine whether this traffic spike is legitimate customers or a bot attack" requires human analysis. Keep these steps in runbooks.
Document automation failures: When auto-healing fails, runbooks explain manual procedures. The automation is the first attempt; the runbook is the fallback.
Progressive automation: Over time, more runbook steps become automated. The runbook evolves from "do these ten things" to "automation does steps 1-7, you do steps 8-10 if needed."
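Progressive automation can be modeled as a list of steps where some carry a script and others still need a person. The sketch below is one possible shape, with hypothetical names (`Step`, `run`): it runs automated steps in order and hands back whatever remains for a human as soon as a step is manual or its automation fails.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    description: str
    # None means the step still needs a human; a callable returns True on success.
    automation: Optional[Callable[[], bool]] = None


def run(steps: list[Step]) -> list[str]:
    """Run automated steps in order; return the steps a human must still perform.

    Stops at the first manual step or failed automation so that later steps
    never run out of order. The written runbook is the fallback.
    """
    for index, step in enumerate(steps):
        if step.automation is None or not step.automation():
            return [s.description for s in steps[index:]]
    return []


if __name__ == "__main__":
    steps = [
        Step("Restart service X", automation=lambda: True),
        Step("Decide whether the traffic spike is legitimate"),  # human judgment
        Step("Return pool size to normal"),
    ]
    print(run(steps))
```

As more steps gain an `automation` callable, the list returned to the on-call engineer shrinks, which mirrors the shift from "do these ten things" to "do steps 8-10 if needed".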
The Runbook Mindset
Writing good runbooks requires a specific kind of empathy—empathy for your future self, or a teammate, in a state you're not currently in. Calm you is writing for panicked you. Alert you is writing for exhausted you.
Every runbook is an act of kindness across time. The hour you spend documenting today saves hours of suffering during future incidents. The clarity you provide now reduces the terror someone will feel at 2 AM.
And every time a runbook helps someone resolve an incident quickly, that's a customer who completed their purchase, a patient whose medical record loaded, a family whose video call connected. The abstractions matter because the humans behind them matter.