Choosing What to Monitor

The question "What should we monitor?" seems simple. It's actually one of the hardest decisions in building reliable systems.

Monitor too little and critical issues go undetected. Monitor too much and important signals drown in noise. The answer requires understanding what actually matters—which is harder than it sounds.

Start with What Users Experience

Before instrumenting servers or databases, ask: "What does working look like from the user's perspective?"

For a web application, users care about three things:

Availability: Does it work at all? Can they reach your site, complete purchases, access their data?

Performance: How fast does it respond? Even a system that's technically "up" frustrates users when it's slow.

Correctness: Do operations produce accurate results? A payment system that charges wrong amounts is worse than one that's down.

These user-facing indicators should drive your initial monitoring decisions. Internal metrics about CPU usage or database connections matter primarily when they predict or explain user-facing problems.
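
As a concrete starting point, the sketch below probes those three indicators from the outside with a single synthetic check. The endpoint URL and the expected "status" field are hypothetical placeholders, not a real API.

    # Minimal synthetic check covering the three user-facing indicators.
    import json
    import time
    import urllib.error
    import urllib.request

    CHECK_URL = "https://example.com/api/health"    # hypothetical endpoint
    EXPECTED_STATUS = "ok"                          # hypothetical field value

    def run_check(timeout_s: float = 5.0) -> dict:
        start = time.monotonic()
        try:
            with urllib.request.urlopen(CHECK_URL, timeout=timeout_s) as resp:
                body = json.loads(resp.read().decode("utf-8"))
        except (urllib.error.URLError, TimeoutError, ValueError):
            # unreachable, timed out, or unparseable: the user sees a broken site
            return {"available": False, "latency_s": time.monotonic() - start, "correct": False}
        return {
            "available": True,                              # availability: we got a response
            "latency_s": time.monotonic() - start,          # performance: how long it took
            "correct": isinstance(body, dict) and body.get("status") == EXPECTED_STATUS,  # correctness
        }

    if __name__ == "__main__":
        print(run_check())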

The Four Golden Signals

Google's Site Reliability Engineering practices identified four metrics that apply to almost any user-facing system:

Latency: How long requests take. Track successful and failed requests separately—failures often fail fast and skew averages.

Traffic: Demand on your system. Requests per second, transactions per second, bytes per second. Traffic metrics distinguish capacity problems from other failures.

Errors: Failed requests, both explicit (HTTP 500s, exceptions) and implicit (success responses with wrong content, timeouts).

Saturation: How "full" your service is. If memory is your constraint, monitor memory. For I/O-bound services, track disk or network saturation. Saturation predicts problems before users feel them.
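
A minimal sketch of how the four signals might be recorded around a request handler, assuming the prometheus_client Python library; the metric names and the queue-depth saturation proxy are illustrative, not prescribed.

    import time
    from prometheus_client import Counter, Histogram, Gauge

    REQUESTS = Counter("http_requests_total", "Traffic: requests handled", ["outcome"])
    ERRORS = Counter("http_errors_total", "Errors: failed requests")
    LATENCY = Histogram("http_request_seconds", "Latency: request duration", ["outcome"])
    SATURATION = Gauge("worker_queue_depth", "Saturation: requests waiting for a worker")

    def handle(request, queue_depth, do_work):
        """Wrap one request so all four signals are recorded."""
        SATURATION.set(queue_depth)                  # sampled per request here for brevity
        outcome = "failure"
        start = time.monotonic()
        try:
            response = do_work(request)
            outcome = "success"
            return response
        except Exception:
            ERRORS.inc()
            raise
        finally:
            REQUESTS.labels(outcome=outcome).inc()
            # successes and failures are labelled separately so fast failures
            # do not skew the latency distribution for successful requests
            LATENCY.labels(outcome=outcome).observe(time.monotonic() - start)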

USE for Infrastructure

For servers, databases, and networks, Brendan Gregg's USE method provides systematic coverage:

  • Utilization: Percentage of time a resource is busy
  • Saturation: Work queued because the resource can't keep up
  • Errors: Failures at the resource level (disk read errors, dropped packets)

Apply this to every resource: CPU, memory, disk, network. The discipline prevents blind spots.
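
The sketch below takes a one-off USE snapshot for CPU, memory, and network, assuming the psutil library; a real setup would export these continuously and cover disk as well.

    import os
    import psutil

    def use_snapshot() -> dict:
        cpu_count = psutil.cpu_count() or 1
        load1, _, _ = os.getloadavg()            # Unix-only 1-minute load average
        mem = psutil.virtual_memory()
        swap = psutil.swap_memory()
        net = psutil.net_io_counters()
        return {
            "cpu": {
                "utilization_pct": psutil.cpu_percent(interval=1),
                "saturation": load1 / cpu_count,     # >1.0 means work is queueing
            },
            "memory": {
                "utilization_pct": mem.percent,
                "saturation_pct": swap.percent,      # swapping as a queueing proxy
            },
            "network": {
                "errors": net.errin + net.errout,    # resource-level errors
                "saturation": net.dropin + net.dropout,
            },
        }

    if __name__ == "__main__":
        print(use_snapshot())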

RED for Services

For microservices architectures, Tom Wilkie's RED method focuses on requests:

  • Rate: Requests per second
  • Errors: Failed requests as a percentage of total
  • Duration: How long requests take, using percentiles (p50, p95, p99) rather than averages

Percentiles matter because averages hide pain. If your p99 latency is 10 seconds, one request in a hundred takes at least 10 seconds, and at scale that is a lot of frustrated users.
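
A small sketch of RED aggregation over one window, using made-up sample durations; it computes rate, error percentage, and latency percentiles with the standard library rather than an average.

    import statistics

    durations_s = [0.12, 0.09, 0.15, 0.11, 2.40, 0.10, 0.13, 0.08, 0.14, 0.95]  # sample data
    error_count = 1
    window_s = 60.0

    rate = len(durations_s) / window_s                    # Rate: requests per second
    error_pct = 100.0 * error_count / len(durations_s)    # Errors: % of total
    q = statistics.quantiles(durations_s, n=100, method="inclusive")
    p50, p95, p99 = q[49], q[94], q[98]                   # Duration: percentiles, not averages

    print(f"rate={rate:.2f}/s errors={error_pct:.1f}% p50={p50:.2f}s p95={p95:.2f}s p99={p99:.2f}s")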

Business Metrics Reveal What Technical Metrics Miss

Technical metrics measure system behavior. Business metrics measure whether the system achieves its purpose.

An e-commerce site might monitor orders per hour, revenue per minute, cart abandonment rate, conversion funnel progression. These often catch problems that technical monitoring misses entirely.

Consider: a bug prevents checkout for Safari users. Servers respond successfully. Databases perform normally. Technical dashboards stay green. But conversion rates drop, and business metrics catch it immediately.

Business metrics also help prioritize. When multiple systems fail simultaneously, business impact determines which to fix first.
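
An illustrative check for the Safari scenario above: compare the current conversion rate against a recent baseline and flag a large drop. The counts and the 30% threshold are assumptions for the example.

    def conversion_rate(checkouts: int, sessions: int) -> float:
        return checkouts / sessions if sessions else 0.0

    def conversion_alert(current: float, baseline: float, max_drop: float = 0.30) -> bool:
        """Alert when conversion falls more than max_drop below the baseline."""
        return baseline > 0 and (baseline - current) / baseline > max_drop

    baseline = conversion_rate(checkouts=420, sessions=10_000)   # e.g. last week's average
    current = conversion_rate(checkouts=180, sessions=9_800)     # checkout broken for Safari
    print(conversion_alert(current, baseline))                   # True: investigate checkout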

What Not to Monitor

Choosing what to exclude matters as much as what to include.

Vanity metrics look impressive but don't drive decisions. "Total registered users" rarely helps with operational issues.

Duplicative metrics measure the same thing multiple ways. Five metrics that all reflect CPU usage are really one metric pretending to be five.

Non-actionable metrics track things you can't or won't change. Monitoring them wastes storage and attention.

Noise-prone metrics fluctuate so much they generate constant false alerts. Metrics that cry wolf train teams to ignore them—undermining their value even when they're right.

Balancing Coverage and Noise

The tension between comprehensive coverage and alert fatigue drives every monitoring decision.

Layer your monitoring. High-level dashboards show overall health at a glance. Detailed metrics exist for investigation but don't demand constant attention.

Alert only on actionable problems. If crossing a threshold doesn't require human intervention, it shouldn't generate an alert. Informational metrics belong on dashboards, not in pagers.

Use composite metrics. Rather than alerting on five metrics that indicate similar problems, create a health score that triggers when they collectively suggest issues.
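
A rough sketch of that idea: several related signals are normalized, weighted, and combined so one threshold decides whether to page. The signal names, weights, and threshold are illustrative assumptions.

    def health_score(signals: dict, weights: dict) -> float:
        """Each signal is normalized to 0.0 (healthy) .. 1.0 (bad); returns weighted badness."""
        total = sum(weights.values())
        return sum(weights[name] * signals[name] for name in weights) / total

    signals = {"error_rate": 0.7, "p99_latency": 0.4, "queue_depth": 0.2}
    weights = {"error_rate": 3.0, "p99_latency": 2.0, "queue_depth": 1.0}

    score = health_score(signals, weights)
    if score > 0.5:                      # one actionable threshold instead of three alerts
        print(f"page on-call: composite health score {score:.2f}")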

Adjust based on experience. Metrics that never help diagnose problems can be removed. Thresholds that trigger too often need adjustment. Monitoring evolves as you learn what matters.

Building Monitoring Over Time

Most teams build monitoring incrementally:

  1. Basic availability: Is the site up? Do APIs respond?
  2. Performance: How fast? Where does time go?
  3. Infrastructure: Server resources, database performance, cache hit rates
  4. Business metrics: Connecting technical behavior to business outcomes
  5. Predictive: Anticipating issues from trends rather than reacting to failures

Each phase builds on the last. Rushing to comprehensive monitoring before establishing basics creates confusion, not insight.
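
As an example of the final, predictive phase, the sketch below extrapolates a simple daily growth trend to estimate when disk usage will cross a threshold; the usage samples and thresholds are made up for illustration.

    def days_until_threshold(samples, threshold_pct: float = 90.0):
        """Fit a simple daily growth rate from evenly spaced samples (one per day)."""
        if len(samples) < 2:
            return None
        daily_growth = (samples[-1] - samples[0]) / (len(samples) - 1)
        if daily_growth <= 0:
            return None                      # not growing; nothing to predict
        return (threshold_pct - samples[-1]) / daily_growth

    disk_usage_pct = [61.0, 63.5, 65.8, 68.4, 71.0]   # last five daily samples
    days = days_until_threshold(disk_usage_pct)
    if days is not None and days < 14:
        print(f"disk projected to cross 90% in {days:.1f} days; plan capacity now")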
