SLAs and SLOs

Every service makes an implicit promise: this will work. SLAs and SLOs are how you make that promise explicit—and how you decide what happens when you break it.

The distinction matters more than it seems. Get it wrong, and you'll either over-promise (burning out your team chasing impossible targets) or under-deliver (losing customers who expected better).

SLOs: What "Good Enough" Actually Means

A Service Level Objective is your internal definition of acceptable service quality. It answers the question: at what point have we failed our users?

SLOs have three components:

The indicator (SLI): What you're measuring. Request latency, error rate, availability—the metric that represents user experience.

The target: The threshold you're aiming for. "99.9% of requests succeed" or "95th percentile latency under 200ms."

The window: The time period for measurement. Rolling 30 days, calendar month, quarterly.

A typical service might have:

99.9% of requests receive a successful response
95% of requests complete in under 100ms
99% of requests complete in under 200ms
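One way to make the indicator, target, and window concrete is to write them down as data and check measured counts against them. The sketch below is illustrative only: the `SLO` class, the `compliance` and `is_met` helpers, and the request counts are all hypothetical, and treating a window with no traffic as compliant is an assumption you may not share.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    name: str         # what the indicator measures
    target: float     # required fraction of good events, e.g. 0.999
    window_days: int  # measurement window


def compliance(good_events: int, total_events: int) -> float:
    """Fraction of events that counted as 'good' for the indicator."""
    if total_events == 0:
        return 1.0  # assumption: no traffic means nothing was violated
    return good_events / total_events


def is_met(slo: SLO, good_events: int, total_events: int) -> bool:
    """True when measured compliance reaches the SLO target."""
    return compliance(good_events, total_events) >= slo.target


# Hypothetical numbers: 999,100 successful responses out of 1,000,000.
availability = SLO("successful responses", target=0.999, window_days=30)
print(is_met(availability, good_events=999_100, total_events=1_000_000))  # True
```

The same shape works for latency targets: count a request as good when it finishes under the threshold, then compare the fraction against 0.95 or 0.99.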

Different operations deserve different SLOs. Search queries might tolerate 500ms. Payment processing might demand 99.99% success rates. The SLO should reflect what users actually experience and care about—not what's convenient to measure.

The Error Budget: Where SLOs Get Interesting

Here's the insight that transforms how teams think about reliability.

If your SLO is 99.9% availability over 30 days, you have about 43 minutes of acceptable downtime: 0.1% of the 43,200 minutes in the window. That's your error budget.
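The arithmetic behind that figure is short enough to sketch. The function name `error_budget_minutes` is hypothetical, and the calculation assumes a time-based availability SLO where every minute of the window counts equally.

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Minutes of downtime a time-based availability SLO allows."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


print(round(error_budget_minutes(0.999, 30), 1))   # 43.2
print(round(error_budget_minutes(0.9999, 30), 1))  # 4.3
```

Each extra nine cuts the budget by a factor of ten, which is why tightening a target is rarely a small change.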

This budget can be spent on:

Deploying new features (which might cause brief issues)
Planned maintenance
Production experiments
Incidents that inevitably occur

Error budgets change the conversation. "Don't break things" becomes "here's how much breaking you can afford." Reliability stops being a moral imperative and becomes a resource to manage.

When the budget is healthy, ship faster. When it's depleted, slow down and fix things. The SLO becomes a self-regulating mechanism for balancing velocity against stability.
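As a rough sketch of that self-regulating idea, the snippet below turns request counts into a budget-consumed ratio and maps it to a release stance. The function names, the 80% caution threshold, and the policy labels are assumptions made for illustration; real policies are set per team.

```python
def budget_consumed(slo_target: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget used so far (1.0 means fully spent)."""
    allowed_bad = (1.0 - slo_target) * total_events
    actual_bad = total_events - good_events
    if actual_bad == 0:
        return 0.0
    if allowed_bad == 0:
        return float("inf")  # a 100% target leaves no budget at all
    return actual_bad / allowed_bad


def release_policy(consumed: float, caution_at: float = 0.8) -> str:
    """Map budget consumption to a (hypothetical) release stance."""
    if consumed >= 1.0:
        return "freeze: reliability work only"
    if consumed >= caution_at:
        return "caution: ship low-risk changes"
    return "ship"


# Hypothetical window: 500 failed requests out of 1,000,000 against 99.9%.
consumed = budget_consumed(0.999, good_events=999_500, total_events=1_000_000)
print(f"{consumed:.0%} of budget used -> {release_policy(consumed)}")
```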

Setting SLOs That Matter

The best SLOs come from understanding users, not picking impressive-sounding numbers.

Start with actual experience: What level of performance do users find acceptable? When do they become frustrated? User research reveals these thresholds better than engineering intuition.

Know your baseline: If you're currently at 99.5% availability, jumping to 99.99% isn't realistic. Start with achievable improvements.

Leave room for reality: If you need flawless execution to hit your SLO, it's too aggressive. Systems fail. Changes cause issues. You need buffer.

A common mistake: setting SLOs at current performance. If you're at 99.9% availability, setting your SLO there leaves no room for incidents. Set it at 99.5%, and suddenly you have an error budget to work with: about 3.6 hours per 30 days.

Most services need 3-5 SLOs that capture different dimensions: availability, latency, error rate. More than that diffuses focus. Fewer might miss something users care about.

SLAs: Promises With Consequences

A Service Level Agreement is what happens when SLOs meet contracts. It's a formal commitment to customers with financial consequences for failure.

SLAs include:

Committed levels: What you promise, expressed as measurable targets.

Measurement methodology: Exactly how compliance is determined. What counts as downtime? How are outages detected?

Exclusions: What doesn't count—planned maintenance, customer-caused issues, force majeure.

Remedies: What happens when you fail. Usually service credits: percentage refunds of fees.

The critical difference from SLOs: SLAs are contractual. An SLO violation triggers internal concern. An SLA violation triggers refunds.

The Buffer Between SLOs and SLAs

Smart organizations set internal SLOs stricter than external SLAs.

You might:

Target 99.95% availability internally (SLO)
Promise 99.9% availability to customers (SLA)

That buffer of 0.05 percentage points, roughly 22 minutes over 30 days, protects everyone. You treat SLO violations seriously, but they don't trigger financial penalties. You have room for incidents and maintenance without breaching contracts.

The buffer also accounts for measurement differences. Your internal monitoring might be more granular than SLA measurement methodology. You want confidence you'll meet the SLA even when internal metrics occasionally dip.
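To see how the buffer behaves, the sketch below sorts a month's downtime into three outcomes using the 99.95% SLO and 99.9% SLA from the example. The function name `classify_downtime` and the downtime figures are hypothetical, and the calculation assumes both targets are measured the same way over the same 30-day window, which real SLA methodology may not do.

```python
def classify_downtime(
    downtime_minutes: float,
    slo_target: float = 0.9995,
    sla_target: float = 0.999,
    window_days: int = 30,
) -> str:
    """Place measured downtime relative to the internal SLO and external SLA."""
    window_minutes = window_days * 24 * 60
    availability = 1.0 - downtime_minutes / window_minutes
    if availability >= slo_target:
        return "within SLO"
    if availability >= sla_target:
        return "SLO missed, SLA intact"
    return "SLA breached"


for minutes in (10, 30, 60):
    print(minutes, "min ->", classify_downtime(minutes))
```

The middle outcome is the buffer at work: the team gets an internal alarm while customers are still inside the contract.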

The Uncomfortable Truth About SLA Credits

Here's what vendors rarely emphasize: SLA credits almost never compensate for actual damages.

A 10% credit on a $100/month service doesn't cover the $100,000 in revenue you lost during an outage. Credits are typically capped at 50-100% of monthly fees, and contracts usually name them as the sole remedy, so you generally can't sue for additional damages.

You're not buying reliability insurance. You're buying the right to complain when things break.

This is why prevention matters more than credits. The SLA defines failure; it doesn't make failure acceptable.
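The gap between credit and loss is easy to quantify. This sketch reuses the figures from the example above; the function name `service_credit` and the cap parameter are hypothetical, and real credit schedules are usually tiered by how far availability fell.

```python
def service_credit(monthly_fee: float, credit_rate: float, cap_rate: float = 1.0) -> float:
    """Credit owed for a breach, limited to a capped share of the monthly fee."""
    return monthly_fee * min(credit_rate, cap_rate)


monthly_fee = 100.0      # $100/month service
outage_loss = 100_000.0  # revenue lost during the outage

credit = service_credit(monthly_fee, credit_rate=0.10)
print(f"credit: ${credit:.2f}")
print(f"share of loss covered: {credit / outage_loss:.4%}")
```

Even a credit at the full 100% cap would cover only a tenth of a percent of the loss in this scenario.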

Evaluating Vendor SLAs

When choosing providers, read the fine print:

Understand measurement: "99.9% uptime" means nothing without knowing how it's calculated. Monthly? Annual? What counts as "down"?

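Check exclusions: If planned maintenance is excluded and the vendor takes a weekly 2-hour window, the service is unavailable about 1.2% of the time (2 of every 168 hours) before any unplanned outage. Effective availability is much lower than advertised.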

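Consider cascading dependencies: Your SLA to customers depends on your vendors' SLAs to you. If you promise 99.95% but your cloud provider only commits to 99.9%, you have a math problem: without redundancy, every provider outage is also your outage, so you can't be more available than they are.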

Evaluate what credits actually cover: Calculate whether the remedy meaningfully addresses your risk, or just provides symbolic acknowledgment of failure.
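Two of the checks above, exclusions and cascading dependencies, reduce to simple arithmetic. The sketch below is a simplification under stated assumptions: the function names are hypothetical, the advertised figure is assumed to apply only to time outside maintenance windows, and dependencies are assumed to be hard, serial, and to fail independently.

```python
import math


def effective_availability(advertised: float, maintenance_hours_per_week: float) -> float:
    """Availability a user sees when excluded maintenance windows are counted."""
    maintenance_fraction = maintenance_hours_per_week / (7 * 24)
    return advertised * (1.0 - maintenance_fraction)


def composite_availability(availabilities: list[float]) -> float:
    """Best-case availability of a chain where every component must be up."""
    return math.prod(availabilities)


# Advertised 99.9% with a weekly 2-hour excluded maintenance window.
print(f"{effective_availability(0.999, 2):.4%}")

# Your own service at 99.95% running on a provider that commits to 99.9%.
print(f"{composite_availability([0.9995, 0.999]):.4%}")
```

In both cases the result lands below the headline number, which is the point of reading the measurement and exclusion clauses before relying on it.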

SLOs as Common Language

Beyond measurement, SLOs provide vocabulary for reliability conversations:

With product teams: Should we ship this feature or fix that bug? Check the error budget. If it's depleted, reliability work takes precedence.

With leadership: "Improve reliability" is vague. "Achieve 99.95% availability" is concrete. "We've consumed 80% of our error budget" communicates urgency.

With customers: Publishing SLOs (even non-contractual ones) sets clear expectations about service quality.

Well-defined SLOs transform reliability from a vague aspiration into something you can actually manage—a resource with a budget, a target with a number, a promise with a meaning.
