Error Budgets


Reliability engineering is traditionally framed as preventing failure. Error budgets flip that framing.

An error budget is your allowance for failure—the maximum unreliability your service can have while still meeting its Service Level Objective. If your SLO promises 99.9% availability, then 0.1% unavailability is acceptable. That 0.1% is your error budget. You can spend it on deployments, experiments, or incidents without breaking your promises.

The question isn't "how do we prevent all failure?" It's "are we spending our failure allowance wisely?"

The Math Is Simple

Take an SLO of 99.9% availability over 30 days.

30 days is 43,200 minutes. At 99.9% availability, you're allowed 0.1% downtime:

43,200 × 0.001 = 43.2 minutes

Those 43.2 minutes are your monthly error budget. Spend them on:

Incidents: A 10-minute outage costs 10 minutes of budget
Deployments: A release causing 2 minutes of elevated errors costs 2 minutes
Experiments: Chaos engineering or production tests that increase errors temporarily
Maintenance: Planned downtime (unless your SLO excludes it)

Stay within budget and you're meeting your SLO. Exhaust it and you've broken your promise.
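The arithmetic above can be sketched in a few lines of Python. The function names and the example month (one 10-minute outage, one 2-minute bad release) are illustrative assumptions, not part of any standard tool.

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Return the allowed downtime in minutes for an availability SLO."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo)


def remaining_budget(budget: float, spent: list[float]) -> float:
    """Subtract each event's cost (in minutes) from the budget."""
    return budget - sum(spent)


budget = error_budget_minutes(slo=0.999, window_days=30)
# Hypothetical month: one 10-minute outage and one 2-minute bad release.
left = remaining_budget(budget, [10, 2])
print(f"budget: {budget:.1f} min, remaining: {left:.1f} min")
```

A negative result from `remaining_budget` would mean the SLO has been missed for that window.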

Error Budgets Work for Any SLI

Availability is just one example:

Request success rate: If your SLO is 99.9% successful requests and you handle 1 million requests daily, you're allowed 1,000 failures per day.

Latency: If 99% of requests should complete under 200ms, then 1% can be slow. For 1 million requests, that's 10,000 that can exceed the threshold.

Some organizations track separate budgets for availability, latency, and error rate—each with its own threshold and policy.
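The same calculation works when the budget is counted in requests rather than minutes. This is a minimal sketch; `allowed_bad_events` is a hypothetical helper name, and the request volume is the 1 million figure from the examples above.

```python
def allowed_bad_events(slo: float, total_events: int) -> int:
    """Number of events that may miss the objective without breaching the SLO."""
    # round() guards against floating-point noise in (1 - slo).
    return round(total_events * (1 - slo))


# Success-rate SLO: 99.9% of 1 million daily requests must succeed.
failures_allowed = allowed_bad_events(0.999, 1_000_000)

# Latency SLO: 99% of 1 million requests must finish under 200ms.
slow_allowed = allowed_bad_events(0.99, 1_000_000)

print(failures_allowed, slow_allowed)
```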

The Real Power: Decisions Become Obvious

Error budgets turn reliability into an explicit trade-off:

Budget healthy (under 50% spent): Push features aggressively. Deploy frequently. Experiment in production. Accept higher-risk changes. You have room to move fast.

Budget low (75-90% spent): Slow down releases. Increase testing rigor. Focus on reliability improvements. Defer risky changes. Investigate what's eating your budget.

Budget exhausted (100%+ spent): Stop feature releases entirely. Focus only on reliability. Root cause recent incidents. Don't resume feature work until budget recovers.

This creates a self-regulating system. Teams move fast when reliability is good. They're forced to prioritize reliability when it degrades. No arguments needed—the numbers decide.

The Sweet Spot

If you never fail, you're being too conservative. You're leaving velocity on the table—you could be shipping faster, experimenting more, taking bigger bets.

If you constantly exhaust your error budget, you're moving too fast and damaging user trust.

The sweet spot is consistently ending each measurement period with 60-80% of your budget used. This means you're pushing hard enough to innovate while maintaining the reliability users expect.

Organizations too far below their budget should ask: "Are we being too cautious? Could we ship faster?"

Organizations constantly exhausting their budget should ask: "Do we need stricter testing? Better architecture? Or a less aggressive SLO?"

Error Budget Policies

A policy formalizes what happens at different levels of remaining budget:

Budget > 50% remaining: Normal operations. Regular deployment cadence. Standard approvals.

Budget 25-50% remaining: Increased scrutiny on changes. Enhanced testing. Senior engineer approval for risky changes.

Budget 10-25% remaining: Deployment freeze except for critical fixes and reliability work. Mandatory incident retrospectives.

Budget < 10% remaining: Full freeze. War room for reliability. Executive awareness. No feature work until recovery.

The specific thresholds vary, but the principle is consistent: budget consumption drives priorities.
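A policy like this is easy to encode so that tooling can report the current tier. The sketch below assumes the percentages refer to budget remaining, and it makes its own choice about the boundary values (exactly 50%, 25%, and 10%), which the tiers above leave open. The function name and tier labels are illustrative.

```python
def policy_for(remaining_fraction: float) -> str:
    """Map the fraction of budget still unspent (0.0 to 1.0) to a policy tier."""
    if remaining_fraction > 0.50:
        return "normal operations"
    if remaining_fraction >= 0.25:
        return "increased scrutiny"
    if remaining_fraction >= 0.10:
        return "deployment freeze except critical fixes"
    return "full freeze"


# Example: 31.2 of 43.2 budget minutes remain.
print(policy_for(31.2 / 43.2))
```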

Know What's Eating Your Budget

Tracking where budget goes reveals where to invest:

By cause: If deployments consistently consume 30% of budget, you need better deployment practices—more testing, gradual rollouts, better canaries.

By team: When multiple teams share a service, track which team's changes consume budget. This creates accountability without blame.

By component: Maybe your payment service is rock-solid (5% of budget) while search is problematic (40%). This guides where reliability investment will have the most impact.
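One way to get these breakdowns is to tag each budget-consuming event and sum by tag. The event records, field names, and numbers below are made up for illustration; a real system would pull them from incident or deployment data.

```python
from collections import defaultdict


def budget_share_by(events: list[dict], key: str, budget_minutes: float) -> dict:
    """Return the fraction of the total budget consumed per value of `key`."""
    totals = defaultdict(float)
    for event in events:
        totals[event[key]] += event["minutes"]
    return {label: spent / budget_minutes for label, spent in totals.items()}


# Hypothetical events for one 30-day window with a 43.2-minute budget.
events = [
    {"cause": "deployment", "component": "search", "minutes": 6.0},
    {"cause": "incident", "component": "search", "minutes": 10.0},
    {"cause": "deployment", "component": "payments", "minutes": 2.0},
]

by_cause = budget_share_by(events, "cause", 43.2)
by_component = budget_share_by(events, "component", 43.2)
for label, share in by_component.items():
    print(f"{label}: {share:.0%} of budget")
```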

Quantifying Risk

Error budgets help evaluate proposed changes:

Before a major deployment, ask: "If this goes badly, how much budget might it consume?"

If you have 20 minutes remaining and a risky deployment could cause a 30-minute outage, you're taking a budget-negative bet. Either reduce the risk or wait for budget to recover.

If you have abundant budget and a low-risk deployment, proceed even late in the release cycle.

The conversation shifts from "should we deploy?" to "given our current budget and this deployment's risk, what's the right call?"
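The comparison described above reduces to checking a deployment's worst-case cost against what is left. This is a deliberately simple sketch: the function name and return strings are assumptions, and estimating the worst case is the hard part that the code does not solve.

```python
def deployment_call(remaining_minutes: float, worst_case_minutes: float) -> str:
    """Compare a change's worst-case budget cost with the remaining budget."""
    if worst_case_minutes > remaining_minutes:
        return "reduce risk or wait"
    return "proceed"


# 20 minutes left, risky change could cause a 30-minute outage.
print(deployment_call(remaining_minutes=20, worst_case_minutes=30))

# Abundant budget, low-risk change.
print(deployment_call(remaining_minutes=40, worst_case_minutes=2))
```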

Better Alerting

Error budget consumption contextualizes severity:

High urgency: "Budget will be exhausted in 4 hours at current consumption rate." This is more actionable than "error rate is 2%"—2% might be fine with abundant budget, or catastrophic when nearly exhausted.

Medium urgency: "Budget 75% consumed with 10 days left in measurement period." You're on track to exhaust it.

Low urgency: "Budget 50% consumed—review recent incidents." Triggers investigation without immediate action.
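Both the time-to-exhaustion and the on-track-to-exhaust alerts come from simple projections. The sketch below assumes the burn rate is measured in budget minutes per hour and that consumption continues linearly; the 30-day window is carried over from the earlier example, and both function names are hypothetical.

```python
def hours_until_exhausted(remaining_minutes: float, burn_minutes_per_hour: float) -> float:
    """Hours until the budget hits zero at the current burn rate."""
    if burn_minutes_per_hour <= 0:
        return float("inf")  # nothing is being consumed right now
    return remaining_minutes / burn_minutes_per_hour


def projected_consumption(consumed_fraction: float, days_elapsed: int, window_days: int) -> float:
    """Linear projection of the fraction consumed by the end of the window."""
    return consumed_fraction * window_days / days_elapsed


# High urgency: 8 budget minutes left, burning 2 per hour.
print(hours_until_exhausted(8, 2))

# Medium urgency: 75% consumed with 10 days left in a 30-day window.
print(projected_consumption(0.75, days_elapsed=20, window_days=30))
```

A projection above 1.0 means the budget would run out before the window ends if nothing changes.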

The Limitations

Error budgets aren't perfect:

Granularity: A single incident can exhaust a monthly budget, leaving the rest of the month with zero room. Rolling windows or weekly budgets can smooth this.

User impact distribution: 100 one-minute outages and one 100-minute outage consume identical budget but affect users very differently.

External dependencies: A cloud provider outage that exhausts your budget still triggers your policies, even though it wasn't your fault. Some organizations exclude partner outages, but this can mask architectural weaknesses.

Gaming: Teams might spend budget early to justify slower development later. Clear policies and leadership oversight help prevent this.
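The rolling window mentioned under Granularity can be sketched as a trailing sum of daily bad minutes. The 3-day window and the daily figures are shortened, made-up values chosen so the effect is easy to see; `rolling_spend` is a hypothetical helper.

```python
from collections import deque


def rolling_spend(daily_bad_minutes, window_days):
    """Yield the total bad minutes over the trailing window for each day."""
    window = deque(maxlen=window_days)  # oldest day drops off automatically
    for minutes in daily_bad_minutes:
        window.append(minutes)
        yield sum(window)


# A single 45-minute incident on day 2, then quiet days.
spend = list(rolling_spend([0, 45, 0, 0, 0], window_days=3))
print(spend)
```

With a rolling window the incident ages out gradually instead of pinning the budget at zero until a calendar boundary.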

The Cultural Shift

Error budgets change how organizations think:

From blame to learning: When an incident consumes budget, the question isn't "who caused this?" but "what do we learn?"

From arbitrary mandates to data: Instead of executives demanding deployment freezes or faster shipping, budget status drives decisions objectively.

From prevention at all costs to acceptable risk: Error budgets acknowledge that some failure is acceptable—even desirable—if it enables faster innovation.

This is the deeper insight: error budgets give you permission to fail. They transform reliability from a vague aspiration into a measurable resource, balancing user needs against the velocity that keeps your product alive.

