
Your server's CPU utilization is at 20%. Memory is fine. No errors in the logs. Everything looks healthy.

Meanwhile, users are experiencing three-second page loads because a downstream service is throttling you. Your monitoring says green. Your users say broken.

This is why SLIs exist. Service Level Indicators measure what users actually experience, not what your infrastructure reports. They're the foundation beneath SLOs (your targets) and SLAs (your commitments)—the raw measurements that tell you whether reality matches your promises.

The User Experience Test

A good SLI passes a simple test: would a user care about this number?

Server CPU utilization fails this test. Users don't care about your CPU. They care whether their request succeeded and how long it took.

Request latency passes. Users absolutely care how long they wait.

This sounds obvious, but most monitoring systems are built around infrastructure metrics—CPU, memory, disk, network. These matter for capacity planning, but they're not SLIs. They don't tell you what users experience.

A server can be on fire and users won't care—as long as their requests succeed quickly. Conversely, a server humming along perfectly might be delivering terrible experiences. SLIs force you to measure reality, not comfort.

The Core SLIs

Most services need three to five SLIs. More than that diffuses focus. These three cover most situations:

Availability

What proportion of requests succeed?

Availability = Successful Requests / Total Requests

The trick is defining "successful." A 404 (page not found) is usually a success—the server worked correctly, the resource just doesn't exist. A 503 (service unavailable) is a failure. A timeout is a failure. A response with corrupted data is a failure, even if it returned 200 OK.

Be precise about what counts.
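
As a minimal sketch of that classification, assuming simple per-request records (the fields and rules below are illustrative, not a standard schema):

```python
# Minimal availability SLI sketch. The request records and classification
# rules are illustrative assumptions, not tied to any particular framework.

def is_successful(request: dict) -> bool:
    """A request counts as successful unless the server failed or timed out."""
    if request.get("timed_out"):
        return False
    # 5xx means the server failed. 4xx (including 404) means the server answered
    # correctly about a bad or missing resource, so it still counts as success.
    return request["status"] < 500

def availability(requests: list[dict]) -> float:
    if not requests:
        return 1.0  # no traffic, nothing failed
    return sum(1 for r in requests if is_successful(r)) / len(requests)

requests = [
    {"status": 200, "timed_out": False},
    {"status": 404, "timed_out": False},  # success: server worked, resource absent
    {"status": 503, "timed_out": False},  # failure: service unavailable
    {"status": 200, "timed_out": True},   # failure: the user never got the response
]
print(f"Availability: {availability(requests):.2%}")  # 50.00%
```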

Latency

How long do requests take?

Here's where most teams go wrong: they measure average latency. Averages lie.

If 99% of your requests complete in 50ms and 1% take 10 seconds, your average is around 150ms. This number describes nobody's experience. Most users get 50ms. Some users get 10 seconds. The average of 150ms is a fiction.

Use percentiles instead:

  • P50 (median): Half of requests are faster, half slower
  • P95: 95% of requests are faster
  • P99: 99% of requests are faster

P95 and P99 reveal tail latency—the experience of your unluckiest users. These are the users who complain, who churn, who tweet about how slow your service is. The median user might be happy while your P99 users are suffering.

A good latency SLI: "P95 latency under 200ms, P99 under 500ms."
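
A small sketch with made-up latencies reproduces the arithmetic above: 99% of requests at 50ms and 1% at 10 seconds average out to roughly 150ms, which describes nobody, while percentiles describe real users.

```python
import math
from statistics import mean

def percentile(latencies_ms: list[float], p: float) -> float:
    """Nearest-rank percentile: the latency that p% of requests are at or below."""
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical traffic: 990 fast requests, 10 very slow ones.
latencies = [50.0] * 990 + [10_000.0] * 10

print(f"Average: {mean(latencies):.1f} ms")             # ~149.5 ms, nobody's experience
print(f"P50    : {percentile(latencies, 50):.1f} ms")   # 50 ms, the typical user
print(f"P99    : {percentile(latencies, 99):.1f} ms")   # 50 ms: exactly 1% is slower than this
print(f"P99.9  : {percentile(latencies, 99.9):.1f} ms") # 10000 ms, the unlucky tail
```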

Error Rate

What proportion of requests fail?

Error Rate = Failed Requests / Total Requests

This overlaps with availability but isn't identical. Availability often means "the service responded." Error rate can include responses that technically succeeded but returned wrong data.

For services where correctness matters—financial calculations, health data, anything where wrong answers are worse than no answers—error rate should include correctness failures, not just HTTP 500s.
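
As a sketch of that broader definition (the checksum comparison is a stand-in for whatever correctness check your service can actually perform), a 200 OK with wrong data counts as a failure:

```python
def is_error(request: dict) -> bool:
    """Count server failures and correctness failures; a 200 with bad data still counts."""
    if request["status"] >= 500:
        return True
    # Hypothetical correctness check: did the payload match what we expected?
    return request.get("checksum") != request.get("expected_checksum")

def error_rate(requests: list[dict]) -> float:
    if not requests:
        return 0.0
    return sum(1 for r in requests if is_error(r)) / len(requests)

requests = [
    {"status": 200, "checksum": "abc", "expected_checksum": "abc"},
    {"status": 200, "checksum": "xyz", "expected_checksum": "abc"},  # wrong data: an error
    {"status": 500, "checksum": None, "expected_checksum": None},
]
print(f"Error rate: {error_rate(requests):.1%}")  # 66.7%
```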

The Specification Problem

An SLI without a precise specification is useless. "We measure latency" means nothing until you answer:

Where do you measure? Latency at the server might be 50ms. Latency from the user's browser, including network time, might be 300ms. Both are valid measurements. They're measuring different things.

How do you measure? Server logs? Application instrumentation? Synthetic monitoring that pings every minute? Real user monitoring from actual browsers?

Over what window? Per minute? Per hour? Rolling 30-day average? Shorter windows catch brief outages. Longer windows smooth out noise.

What do you exclude? Requests from your own monitoring systems? Traffic during planned maintenance? DDoS attacks?

A real SLI specification:

Latency SLI: 95th percentile time from request received to response sent, measured at the application server, aggregated per minute, excluding requests to health check endpoints. Target: under 200ms.

Every word matters. Ambiguity in SLI definitions creates arguments later when you're trying to determine whether you met your targets.
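
One way to keep that precision is to write the specification down as structured data next to the code that computes it; the field names and health check path below are an illustrative convention, not a standard:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LatencySLISpec:
    """Every choice from the prose spec above, spelled out explicitly."""
    percentile: float = 95.0
    measured_at: str = "application server, request received to response sent"
    aggregation_window_s: int = 60
    excluded_path_prefixes: tuple[str, ...] = ("/healthz",)  # hypothetical health check path
    target_ms: float = 200.0

SPEC = LatencySLISpec()

def in_scope(path: str, spec: LatencySLISpec = SPEC) -> bool:
    """Health check traffic is excluded from the SLI entirely."""
    return not any(path.startswith(p) for p in spec.excluded_path_prefixes)

print(in_scope("/healthz"))        # False: excluded from the SLI
print(in_scope("/api/orders/42"))  # True: counts toward the SLI
```

Whether a target was met then becomes a question about concrete fields rather than prose, which narrows the arguments the definition is meant to prevent.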

Where Measurement Goes Wrong

Averaging percentiles is mathematically wrong. If Server A reports P99 of 100ms and Server B reports P99 of 200ms, the combined P99 is not 150ms. It's not the average of the individual P99s. Computing percentiles across distributed systems requires specialized approaches—streaming algorithms, centralized aggregation, or sampling with statistical correction.

This catches smart engineers who assume percentiles combine like averages. They don't.
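
A quick sketch with synthetic per-server latencies makes the point concrete: averaging each server's P99 gives a different answer than the P99 computed over the pooled data.

```python
import math
import random

def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile over raw values."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

random.seed(0)
# Synthetic latencies: server A is consistently fast, server B has a heavy tail.
server_a = [random.gauss(100, 10) for _ in range(10_000)]
server_b = [random.gauss(100, 10) for _ in range(9_000)] + \
           [random.uniform(500, 1_000) for _ in range(1_000)]

p99_a = percentile(server_a, 99)
p99_b = percentile(server_b, 99)

naive_p99 = (p99_a + p99_b) / 2                     # averaging percentiles: wrong
pooled_p99 = percentile(server_a + server_b, 99)    # percentile over pooled data: right

print(f"Server A P99 : {p99_a:.0f} ms")
print(f"Server B P99 : {p99_b:.0f} ms")
print(f"Averaged P99 : {naive_p99:.0f} ms  <- not a percentile of anything")
print(f"Pooled P99   : {pooled_p99:.0f} ms")
```

Mergeable summaries such as latency histograms, combined before percentiles are extracted, are one form of the streaming approaches mentioned above.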

Synthetic monitoring lies by omission. A health check that pings your homepage every minute will miss the 30-second outage that happened between checks. It also won't catch the specific API endpoint that's failing while everything else works.

Synthetic monitoring provides consistent baselines. Real user monitoring shows actual user experience. You need both.

Sampling introduces uncertainty. For high-traffic services, recording every request might be prohibitively expensive. Sampling 1% of requests works when you have millions of data points. It fails for low-volume critical operations where you might sample zero failures and conclude everything is fine.
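
A back-of-the-envelope sketch shows how easily this happens; the request volume and failure rate below are assumptions chosen for illustration.

```python
# Hypothetical low-volume critical operation: 200 requests/day, 2% real failure
# rate, observed through 1% sampling (roughly 2 recorded requests per day).

daily_requests = 200
failure_rate = 0.02
sample_rate = 0.01

sampled = round(daily_requests * sample_rate)   # ~2 requests recorded per day
p_miss_all = (1 - failure_rate) ** sampled      # chance the sample shows zero failures

print(f"Requests sampled per day      : {sampled}")
print(f"P(sample shows zero failures) : {p_miss_all:.0%}")  # ~96%: 'everything is fine'
```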

Multi-Dimensional SLIs

Sometimes one number isn't enough.

Combined criteria: "99.9% of requests succeed AND complete under 200ms." A request that succeeds but takes 5 seconds fails this SLI. This is stricter than separate availability and latency SLIs because every request must pass both tests.

Segmented by importance: Premium customers might have a 99.95% availability target while free users have 99%. Payment transactions might require 99.99% success while read operations accept 99.9%. Different parts of your service have different stakes.

Segmented by region: P95 latency under 100ms in North America, under 200ms in Asia. Physics constrains what's achievable across oceans.
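
As a sketch of how combined criteria and segmentation compose (the tier names, thresholds, and targets below are illustrative assumptions), a request counts as "good" only if it passes every test, and each segment is then judged against its own target:

```python
# Combined-criteria SLI: a request is "good" only if it succeeded AND was fast enough.
LATENCY_THRESHOLD_MS = 200

# Illustrative per-segment availability targets.
TARGETS = {"premium": 0.9995, "free": 0.99}

def is_good(request: dict) -> bool:
    return request["status"] < 500 and request["latency_ms"] <= LATENCY_THRESHOLD_MS

def good_ratio(requests: list[dict]) -> float:
    return sum(is_good(r) for r in requests) / len(requests) if requests else 1.0

requests = [
    {"status": 200, "latency_ms": 120, "tier": "premium"},
    {"status": 200, "latency_ms": 5_000, "tier": "premium"},  # succeeded, but too slow
    {"status": 503, "latency_ms": 30, "tier": "free"},
]

for tier, target in TARGETS.items():
    tier_requests = [r for r in requests if r["tier"] == tier]
    ratio = good_ratio(tier_requests)
    verdict = "OK" if ratio >= target else "MISS"
    print(f"{tier}: {ratio:.1%} good (target {target:.2%}) -> {verdict}")
```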

Choosing SLIs by Service Type

Different services need different SLIs:

Request-response services (APIs, web apps): Availability, latency, error rate.

Data pipelines: Throughput (records per hour), freshness (delay from data creation to processing), coverage (percentage of input successfully processed).

Storage systems: Availability, latency, durability (data not lost).

Batch jobs: Completion time (finished before deadline), success rate, freshness.

The common thread: measure what users of that service type actually care about.
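
To make one of the non-request-response rows concrete, here is a minimal freshness sketch for a data pipeline; the record fields and one-hour target are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone

FRESHNESS_TARGET = timedelta(hours=1)  # illustrative: processed within an hour of creation

def freshness_ratio(records: list[dict]) -> float:
    """Fraction of records whose creation-to-processing delay met the target."""
    fresh = sum(1 for r in records
                if r["processed_at"] - r["created_at"] <= FRESHNESS_TARGET)
    return fresh / len(records) if records else 1.0

base = datetime(2024, 1, 1, tzinfo=timezone.utc)
records = [
    {"created_at": base, "processed_at": base + timedelta(minutes=10)},  # fresh
    {"created_at": base, "processed_at": base + timedelta(hours=3)},     # stale
]
print(f"Freshness: {freshness_ratio(records):.0%}")  # 50%
```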

SLIs Drive Everything Else

Once you have SLIs, they determine what you build:

Instrumentation must capture SLI-relevant data at the right granularity. If your SLI is P99 latency, you need timing data on every request, not just averages.

Storage must retain measurements long enough to check SLO compliance—typically 30 to 90 days of detailed data.

Dashboards should show current SLI performance and trends. If you can't see your SLIs at a glance, you won't manage to them.

Alerting should trigger when SLIs approach SLO thresholds, not just when they violate them. Alert on trajectory, not just position.

The observability system exists to serve SLI measurement. Design it that way.
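
As one small piece of that design, per-request timing can be captured with something as simple as a decorator around each handler, so percentile SLIs like P99 have raw data for every request. The record_latency sink below is a hypothetical stand-in for whatever metrics backend you use.

```python
import time
from functools import wraps

def record_latency(endpoint: str, latency_ms: float) -> None:
    """Hypothetical sink: ship the measurement to your metrics backend."""
    print(f"{endpoint}: {latency_ms:.1f} ms")

def timed(endpoint: str):
    """Record latency for every call, not a pre-aggregated average."""
    def decorator(handler):
        @wraps(handler)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return handler(*args, **kwargs)
            finally:
                record_latency(endpoint, (time.perf_counter() - start) * 1000)
        return wrapper
    return decorator

@timed("/checkout")
def handle_checkout(order_id: str) -> dict:
    time.sleep(0.05)  # simulate work
    return {"order_id": order_id, "status": "confirmed"}

handle_checkout("ord-123")
```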

SLIs Evolve

Your first SLIs will be wrong. That's fine.

You'll discover metrics that matter but aren't measured. Add them.

You'll find SLIs that don't drive decisions. Remove them.

You'll hit edge cases that reveal ambiguities in your definitions. Refine them.

SLIs aren't set once and forgotten. They're refined as you understand your service and your users better.

The goal isn't perfect SLIs from day one. The goal is SLIs that tell you the truth about what your users experience—and that truth becomes clearer over time.
