The terms "monitoring" and "observability" get used interchangeably, but they represent fundamentally different assumptions about failure. Monitoring says "I know what can go wrong." Observability says "I don't know what can go wrong, but I'll be ready to understand it when it does."
The Core Distinction
Monitoring collects predetermined metrics to detect known failure modes. You decide in advance what to measure—CPU usage, request rate, error count—and set thresholds that trigger alerts. Monitoring works exceptionally well for problems you've encountered before.
Observability, borrowed from control theory, measures how well you can understand a system's internal state from its external outputs. A system is observable when you can ask arbitrary questions about its behavior without having predicted those questions in advance.
The distinction matters because modern distributed systems exhibit emergent behavior that's impossible to predict. Your system will fail in ways you've never seen, triggered by combinations of conditions you didn't anticipate when you designed your monitoring.
What Monitoring Sees
Traditional monitoring follows a straightforward pattern: identify potential failure modes, define metrics that indicate those failures, collect those metrics, alert when thresholds are crossed.
A web server might be monitored by tracking:
- Request rate (requests per second)
- Error rate (percentage returning errors)
- Response time (95th percentile latency)
- CPU and memory usage
- Disk space remaining
When any metric exceeds its threshold, you receive an alert. This works beautifully for known issues. If response times typically stay under 200 milliseconds, an alert at 500 milliseconds indicates a problem worth investigating.
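A minimal sketch of that loop, assuming hypothetical metric names, sample values, and limits (nothing here is a real monitoring API):

```python
# Illustrative threshold check; the metric names, sample values, and limits are hypothetical.
THRESHOLDS = {
    "p95_latency_ms": 500,   # alert when 95th percentile latency exceeds 500 ms
    "error_rate_pct": 1.0,   # alert when more than 1% of requests return errors
    "disk_free_pct": 10,     # alert when less than 10% of disk space remains
}

def check_thresholds(current: dict) -> list:
    """Return an alert message for every metric that crosses its predefined threshold."""
    alerts = []
    for name, limit in THRESHOLDS.items():
        value = current.get(name)
        if value is None:
            continue
        # Free disk space alerts when it drops below the limit; the others alert above it.
        breached = value < limit if name == "disk_free_pct" else value > limit
        if breached:
            alerts.append(f"{name}={value} crossed threshold {limit}")
    return alerts

print(check_thresholds({"p95_latency_ms": 620, "error_rate_pct": 0.4, "disk_free_pct": 42}))
# ['p95_latency_ms=620 crossed threshold 500']
```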
But monitoring's strength is also its limitation: you only see what you thought to measure. A new problem that doesn't trigger existing alerts goes unnoticed.
What Observability Reveals
Observability starts with comprehensive instrumentation capturing high-cardinality data—information with many unique values. Rather than pre-aggregating metrics, observability systems preserve details about individual events.
Consider a payment processing system. Monitoring might track:
- Total payments per minute
- Error rate across all payments
- Average processing time
Observability captures details about each payment:
- User ID
- Payment method
- Amount and currency
- Geographic region
- Processing time
- Every service the request touched
- All external API calls made
With this detailed data, you can ask questions that weren't anticipated: "Why are payments from users in Germany using Apple Pay failing more than other payment methods?"
This question wasn't built into your monitoring. Aggregate metrics would show "payments are mostly working." Meanwhile, a specific subset of users experiences silent failures. Observability data makes the question answerable.
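A small sketch of what that looks like in practice, using a hand-written in-memory list of events; the field names and values are hypothetical, and a real system would query an observability backend rather than a Python list:

```python
# Hypothetical wide events, one per payment; real systems store these in an
# observability backend, not an in-memory list.
payment_events = [
    {"user_id": "u-1041", "method": "apple_pay", "region": "DE", "status": "failed"},
    {"user_id": "u-2210", "method": "card",      "region": "DE", "status": "success"},
    {"user_id": "u-3307", "method": "apple_pay", "region": "US", "status": "success"},
    {"user_id": "u-4590", "method": "apple_pay", "region": "DE", "status": "failed"},
]

def failure_rate(events, **filters):
    """Failure rate for the subset of events matching arbitrary field filters."""
    subset = [e for e in events if all(e.get(k) == v for k, v in filters.items())]
    return sum(e["status"] == "failed" for e in subset) / len(subset) if subset else 0.0

# The question nobody anticipated, answered after the fact:
print(failure_rate(payment_events, region="DE", method="apple_pay"))  # 1.0
print(failure_rate(payment_events))                                   # 0.5 overall
```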
Metrics, Logs, and Traces
Observability relies on three complementary data types, each answering different questions:
Metrics provide numerical measurements over time. They answer "How many?" and "How fast?" Their aggregated nature makes them storage-efficient and quick to query—excellent for dashboards and trends.
Logs capture discrete events with full context. Each entry represents something that happened at a specific moment with all available details. Logs answer "What happened?" with complete specificity.
Traces follow individual requests through distributed systems, connecting all operations triggered by a single user action. Traces answer "Where did time go?" and "What caused this request to fail?"
The power emerges from correlation. A spike in error metrics leads to relevant logs, which reveal a trace ID. Following that trace shows exactly which service call failed and why. Each data type alone provides limited insight; together they enable complete understanding.
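A minimal sketch of the glue that makes that correlation possible: every structured log line carries the trace ID of the request that produced it. The service and field names are hypothetical, and in a real system the trace ID arrives via context propagation rather than being generated where the log is written:

```python
import json
import uuid

def log_event(service: str, message: str, trace_id: str, level: str = "error") -> None:
    """Emit a structured log line whose trace_id links it to a distributed trace."""
    print(json.dumps({
        "service": service,
        "level": level,
        "message": message,
        "trace_id": trace_id,  # the key that connects this log entry to its trace
    }))

# In practice the trace ID is created at the edge and propagated with the request.
trace_id = uuid.uuid4().hex
log_event("checkout", "downstream payment call timed out", trace_id)
# An error-rate spike points to logs like this one; the trace_id leads to the full trace.
```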
The Cardinality Problem
Cardinality—the number of unique values in a dataset—represents the crucial technical distinction between monitoring and observability.
Low-cardinality data has few unique values. HTTP status codes have only a few dozen possible values (200, 404, 500, and so on). Server hostnames might number in the hundreds. Low-cardinality data aggregates well and works beautifully for traditional monitoring.
High-cardinality data has many unique values. User IDs might number in millions. Request IDs are unique for every single request. High-cardinality data resists aggregation but enables detailed investigation.
Traditional monitoring systems struggle with high-cardinality data. Storing metrics for millions of unique user IDs becomes expensive and impractical. Observability systems embrace high-cardinality data, recognizing that filtering by any dimension is essential for investigating unexpected problems.
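A back-of-the-envelope illustration of why: in a metrics system, every unique combination of label values becomes its own time series, so the series count multiplies with each dimension's cardinality. The numbers below are illustrative:

```python
from math import prod

# Approximate number of unique values per label (illustrative).
low_cardinality  = {"status_code": 8, "method": 5, "host": 200}
high_cardinality = {**low_cardinality, "user_id": 2_000_000}

# Worst case, each unique label combination is a separate time series.
print(prod(low_cardinality.values()))   # 8,000 series: easy for a metrics backend
print(prod(high_cardinality.values()))  # 16,000,000,000 series: why metrics can't carry user IDs
```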
When Each Approach Wins
Monitoring excels at:
Known failure modes benefit from monitoring's efficiency. If disk space filling up causes problems, a simple metric and threshold provide immediate detection with minimal overhead.
Broad system health is quickly assessed through dashboards. A single view showing key metrics helps teams understand overall state at a glance.
Cost efficiency favors monitoring for long-term trends. Storing detailed observability data for years becomes prohibitively expensive; aggregated metrics remain affordable.
Automated response works well with monitoring. When a metric crosses a threshold, an automated system can scale infrastructure or restart services. Observability data's complexity makes it less suitable for automation.
Observability excels at:
Novel problems require observability's flexibility. When something unexpected occurs, you need to slice data in ways you didn't anticipate. Pre-aggregated metrics can't answer questions you didn't know to ask.
User experience benefits from high-cardinality data. Individual user journeys reveal patterns that aggregate metrics obscure. Some users experience problems invisible in overall averages.
Distributed debugging demands traces. When a request traverses dozens of microservices, understanding which service introduced latency requires detailed traces connecting all operations, as the sketch below illustrates.
Capacity planning improves with detail. Rather than guessing resource needs from aggregate trends, observability reveals exactly which users, features, or patterns drive consumption.
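A toy version of that trace analysis, with hand-written spans; the service names, timings, and parent/child structure are hypothetical, and real spans come from an instrumentation library rather than dictionaries:

```python
# One trace: a list of spans sharing a trace ID, each recording its parent and duration.
spans = [
    {"span": "api-gateway",     "parent": None,           "duration_ms": 910},
    {"span": "checkout-svc",    "parent": "api-gateway",  "duration_ms": 880},
    {"span": "inventory-svc",   "parent": "checkout-svc", "duration_ms": 35},
    {"span": "payment-svc",     "parent": "checkout-svc", "duration_ms": 820},
    {"span": "fraud-check-api", "parent": "payment-svc",  "duration_ms": 790},
]

def self_time(name: str) -> int:
    """A span's own time: its duration minus time spent in its direct children."""
    own = next(s for s in spans if s["span"] == name)
    children = sum(s["duration_ms"] for s in spans if s["parent"] == name)
    return own["duration_ms"] - children

slowest = max((s["span"] for s in spans), key=self_time)
print(slowest, self_time(slowest), "ms")  # fraud-check-api 790 ms: that's where the time went
```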
How They Work Together
Mature operations teams use both approaches where each provides the most value.
A typical workflow: monitoring detects an anomaly through threshold-based alerting. The alert includes links to observability tools pre-filtered to the relevant timeframe and service. Engineers investigate by slicing the high-cardinality data until they understand what happened. Once understood, the team adds new monitoring to catch this specific issue faster next time.
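One way to wire up the pre-filtered-links step, sketched with a hypothetical tool URL and query syntax (not any particular vendor's format):

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

def investigation_link(service: str, detected_at: datetime, window_minutes: int = 15) -> str:
    """Build a deep link into an (imaginary) observability tool, scoped to the alert's context."""
    params = {
        "query": f'service:"{service}" status:error',
        "from": (detected_at - timedelta(minutes=window_minutes)).isoformat(),
        "to": detected_at.isoformat(),
    }
    return "https://observability.example.com/explore?" + urlencode(params)

alert = {
    "summary": "checkout-svc p95 latency above threshold",
    "investigate": investigation_link("checkout-svc", datetime.now(timezone.utc)),
}
print(alert["investigate"])
```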
Monitoring provides always-on alerting for known issues. Observability enables investigation of the unknown. Neither replaces the other.
The Mindset Shift
Adopting observability requires cultural changes beyond technology. Monitoring encourages defining failure modes upfront. Observability encourages comprehensive instrumentation and exploration.
Teams accustomed to monitoring resist observability's upfront cost: "Why capture all this data if we don't know we'll need it?" The answer becomes clear during the first major incident resolved in minutes instead of hours because the necessary data already existed.
Observability also changes how teams think about health. Rather than claiming "the system is healthy" based on green dashboards, observability encourages humility: "All the things we thought to check look normal." Complex systems always have unknown unknowns. The question is whether you'll have the data to understand them when they emerge.