Every service emits hundreds of metrics. CPU. Memory. Disk. Network. Threads. Queues. Connection pools. Request counts by endpoint, by status code, by client version. The data firehose is overwhelming.
Google's Site Reliability Engineering team asked a different question: What's the minimum set of metrics that actually tells you if users are happy?
The answer is four. Just four. They called them the Golden Signals.
The Four Signals
Latency: How Long Things Take
Latency measures the time from receiving a request to sending the complete response. This is what users feel. A slow service frustrates people even when it eventually succeeds.
But here's where most monitoring goes wrong: averages lie.
Consider a service where 95% of requests complete in 50ms, but 5% take 5 seconds. The average is around 300ms. That looks acceptable. But for 1 in 20 users, your service is painfully slow. The average hides their suffering.
This is why latency must be measured in percentiles:
- P50 (median): Half of requests are faster than this
- P95: 95% of requests are faster—this catches the slow tail
- P99: 99% of requests are faster—this catches the outliers
A healthy service might show P50=45ms, P95=120ms, P99=450ms. The gap between P50 and P99 reveals how consistent your service is.
One more subtlety: measure latency for successful requests and errors separately. Errors often fail fast, which drags the combined numbers down and hides how slow the successful path really is.
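Here's a minimal sketch of that in Python. The record fields (duration_ms, status) and the nearest-rank approximation are illustrative, not a prescribed implementation:

```python
# Sketch: compute latency percentiles from request records,
# excluding errors so fast failures don't hide slow successes.
# Field names (duration_ms, status) are illustrative.

def latency_percentiles(requests):
    # Keep only successful requests; track errors separately.
    durations = sorted(
        r["duration_ms"] for r in requests if r["status"] < 500
    )
    if not durations:
        return {}

    def pct(p):
        # Approximate nearest-rank percentile.
        index = min(len(durations) - 1, int(len(durations) * p / 100))
        return durations[index]

    return {"p50": pct(50), "p95": pct(95), "p99": pct(99)}


requests = [
    {"duration_ms": 45, "status": 200},
    {"duration_ms": 120, "status": 200},
    {"duration_ms": 5000, "status": 200},
    {"duration_ms": 3, "status": 503},  # fast failure, excluded
]
print(latency_percentiles(requests))  # {'p50': 120, 'p95': 5000, 'p99': 5000}
```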
Traffic: How Much Demand Exists
Traffic measures the load on your system right now. Requests per second. Transactions per second. Active connections. Data transfer rates.
Traffic alone doesn't tell you if something's wrong. 5,000 requests per second might be normal Tuesday afternoon traffic or the beginning of a DDoS attack. The number only has meaning against a baseline.
What traffic reveals:
- Growth patterns: Is usage increasing over time?
- Cycles: Daily peaks, weekend drops, seasonal patterns
- Anomalies: Sudden spikes that need investigation
- Correlation: When traffic spikes, do other signals degrade?
Traffic is context. It's the denominator that makes other signals meaningful.
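A rough sketch of that baseline idea: compare the current request rate to a rolling average and flag readings far above it. The window size and the 2x spike threshold are placeholders, not recommendations:

```python
from collections import deque

# Sketch: flag traffic anomalies against a rolling baseline.
# Window size and spike factor are illustrative placeholders.

class TrafficBaseline:
    def __init__(self, window=60, spike_factor=2.0):
        self.samples = deque(maxlen=window)  # recent requests-per-second readings
        self.spike_factor = spike_factor

    def observe(self, requests_per_second):
        baseline = sum(self.samples) / len(self.samples) if self.samples else None
        self.samples.append(requests_per_second)
        if baseline is None:
            return False  # not enough history yet
        # Well above the recent average: could be viral growth or an attack.
        return requests_per_second > baseline * self.spike_factor


baseline = TrafficBaseline()
for rps in [5000, 5100, 4900, 5050, 12000]:
    if baseline.observe(rps):
        print(f"traffic spike: {rps} rps")  # fires on 12000
```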
Errors: How Often Things Fail
Errors measure requests that don't succeed. This directly answers: can users accomplish what they're trying to do?
Errors come in flavors:
- Explicit failures: HTTP 5xx responses, exceptions, crashes
- Implicit failures: Requests that succeed technically but return wrong data
- Policy rejections: Rate limiting, authentication failures
A 0.1% error rate sounds tiny. But at 10,000 requests per second, that's 10 failures every second. 600 failed requests per minute. 36,000 per hour. Behind each one is a user who couldn't do what they came to do.
Error rates need segmentation. A 1% overall error rate might mean 0.01% for most users and 50% for users in one region—a regional outage hidden by global averages.
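One way to make that segmentation concrete, sketched in Python (the region and status fields are illustrative):

```python
from collections import defaultdict

# Sketch: break the error rate down by region so a localized
# outage isn't averaged away. Field names are illustrative.

def error_rates_by_region(requests):
    totals = defaultdict(int)
    errors = defaultdict(int)
    for r in requests:
        totals[r["region"]] += 1
        if r["status"] >= 500:
            errors[r["region"]] += 1
    return {region: errors[region] / totals[region] for region in totals}


requests = (
    [{"region": "us-east", "status": 200}] * 990
    + [{"region": "us-east", "status": 500}] * 10
    + [{"region": "eu-west", "status": 500}] * 50
    + [{"region": "eu-west", "status": 200}] * 50
)
print(error_rates_by_region(requests))
# us-east: 1% errors, eu-west: 50% errors.
# The global average (~5.5%) would have hidden the regional outage.
```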
Saturation: How Full Things Are
Saturation measures how close resources are to exhaustion. CPU utilization. Memory usage. Disk I/O. Connection pool depth. Queue lengths.
Here's what makes saturation different from the other three signals: saturation tells you about the future.
Latency tells you how fast things are now. Traffic tells you how busy things are now. Errors tell you what's failing now. But saturation tells you what's about to happen.
When CPU hits 85%, latency hasn't spiked yet—but it will. When the connection pool is 90% utilized, errors haven't increased yet—but they will. When the message queue is growing faster than it's draining, you're accumulating debt that will come due.
Saturation is your early warning system. Monitor it and you can scale before users notice. Ignore it and you're always reacting to fires instead of preventing them.
Healthy systems stay below 70-80% saturation during normal operation. The headroom isn't waste—it's capacity to absorb traffic spikes without degradation.
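A minimal sketch of saturation checks along those lines. The 80% threshold and the metric names are illustrative, not tuned values:

```python
# Sketch: warn before saturation turns into latency or errors.
# The threshold and metric names are illustrative.

SATURATION_THRESHOLD = 0.8  # keep ~20% headroom for traffic spikes

def check_saturation(cpu_util, pool_in_use, pool_size,
                     enqueue_rate, dequeue_rate):
    warnings = []
    if cpu_util > SATURATION_THRESHOLD:
        warnings.append(f"CPU at {cpu_util:.0%}: latency will follow")
    if pool_in_use / pool_size > SATURATION_THRESHOLD:
        warnings.append("connection pool nearly exhausted: errors will follow")
    if enqueue_rate > dequeue_rate:
        # Queue growing faster than it drains: backlog is accumulating.
        warnings.append("queue backlog growing: capacity debt accumulating")
    return warnings


print(check_saturation(cpu_util=0.85, pool_in_use=90, pool_size=100,
                       enqueue_rate=1200, dequeue_rate=1000))
```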
Why These Four Work Together
Each signal catches what the others miss.
High latency with normal saturation? The problem isn't resource exhaustion—look for a slow dependency or inefficient code path.
High errors with low traffic? A small number of users are hitting a broken code path. Check recent deployments.
High saturation with healthy latency? You're approaching the cliff but haven't fallen yet. Scale now or optimize fast.
Traffic spike with everything else degrading? Either legitimate viral growth or an attack. Either way, you need more capacity.
The signals form a diagnostic system. When something breaks, checking all four usually points you toward the cause.
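Those combinations can be read as a rough first-pass triage, sketched below. The thresholds are placeholders; real ones depend on your service:

```python
# Sketch: first-pass triage from the four signals, mirroring the
# combinations above. Thresholds are placeholders, not recommendations.

def triage(latency_p99_ms, error_rate, traffic_vs_baseline, saturation):
    high_latency = latency_p99_ms > 500
    high_errors = error_rate > 0.01
    traffic_spike = traffic_vs_baseline > 2.0
    high_saturation = saturation > 0.8

    if high_latency and not high_saturation:
        return "check slow dependencies or inefficient code paths"
    if high_errors and not traffic_spike:
        return "check recent deployments for a broken code path"
    if high_saturation and not high_latency:
        return "scale or optimize before latency degrades"
    if traffic_spike:
        return "add capacity; verify whether the growth is legitimate"
    return "all four signals look healthy"


print(triage(latency_p99_ms=800, error_rate=0.001,
             traffic_vs_baseline=1.0, saturation=0.4))
# -> check slow dependencies or inefficient code paths
```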
Adapting the Signals
The original signals assume request-response services. For other systems, translate them:
Batch processing:
- Latency → Job completion time
- Traffic → Jobs processed per hour
- Errors → Failed job percentage
- Saturation → Queue depth, processing backlog
Data pipelines:
- Latency → Data freshness (time from event to availability)
- Traffic → Records processed per second
- Errors → Invalid records, processing failures
- Saturation → Consumer lag (how far behind real-time)
Storage systems:
- Latency → Read/write operation time
- Traffic → IOPS, throughput
- Errors → Failed operations, corruption
- Saturation → Disk space, IOPS capacity
The specific metrics change. The framework doesn't.
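As one concrete translation, here's a sketch of the four signals for a data pipeline. The inputs (timestamps, offsets, counters) are illustrative and would come from your own instrumentation:

```python
import time

# Sketch: the four signals translated for a data pipeline.
# All inputs are illustrative stand-ins for real instrumentation.

def pipeline_signals(latest_event_ts, records_processed, window_s,
                     invalid_records, newest_offset, committed_offset):
    return {
        # Latency -> freshness: how old is the newest data made available?
        "freshness_s": time.time() - latest_event_ts,
        # Traffic -> throughput over the measurement window.
        "records_per_s": records_processed / window_s,
        # Errors -> fraction of records that failed validation or processing.
        "error_rate": invalid_records / max(records_processed, 1),
        # Saturation -> consumer lag: how far behind real-time the pipeline is.
        "consumer_lag": newest_offset - committed_offset,
    }


print(pipeline_signals(
    latest_event_ts=time.time() - 42,
    records_processed=60_000, window_s=60,
    invalid_records=12,
    newest_offset=1_000_000, committed_offset=998_500,
))
```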
The Power of Focus
The Four Golden Signals aren't the only metrics that matter. But they're the metrics that matter first.
When all four are healthy, your service is almost certainly healthy. When any degrades, you have a real problem—not noise, not a false alarm, but something affecting users.
This focus is the framework's real value. Hundreds of metrics create noise. Four signals create clarity. You can fit them on a single dashboard, set meaningful alerts, and know at a glance whether to worry.
The question isn't what can you monitor. It's what should you watch. Start with the Golden Signals. Add more only when these four don't explain what you're seeing.