The terms "monitoring" and "observability" get used interchangeably, but they represent fundamentally different assumptions about failure. Monitoring says "I know what can go wrong." Observability says "I don't know what can go wrong, but I'll be ready to understand it when it does."
The Core Distinction
Monitoring collects predetermined metrics to detect known failure modes. You decide in advance what to measure—CPU usage, request rate, error count—and set thresholds that trigger alerts. Monitoring works exceptionally well for problems you've encountered before.
Observability, borrowed from control theory, measures how well you can understand a system's internal state from its external outputs. A system is observable when you can ask arbitrary questions about its behavior without having predicted those questions in advance.
The distinction matters because modern distributed systems exhibit emergent behavior that's impossible to predict. Your system will fail in ways you've never seen, triggered by combinations of conditions you didn't anticipate when you designed your monitoring.
What Monitoring Sees
Traditional monitoring follows a straightforward pattern: identify potential failure modes, define metrics that indicate those failures, collect those metrics, alert when thresholds are crossed.
A web server might be monitored by tracking:
- Request rate (requests per second)
- Error rate (percentage returning errors)
- Response time (95th percentile latency)
- CPU and memory usage
- Disk space remaining
When any metric exceeds its threshold, you receive an alert. This works beautifully for known issues. If response times typically stay under 200 milliseconds, an alert at 500 milliseconds indicates a problem worth investigating.
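A minimal sketch of that loop, assuming hypothetical metric names, sample values, and limits (nothing here is a real monitoring API):

```python
# Illustrative threshold check; the metric names, sample values, and limits are hypothetical.
THRESHOLDS = {
    "p95_latency_ms": 500,   # alert when 95th percentile latency exceeds 500 ms
    "error_rate_pct": 1.0,   # alert when more than 1% of requests return errors
    "disk_free_pct": 10,     # alert when less than 10% of disk space remains
}

def check_thresholds(current: dict) -> list:
    """Return an alert message for every metric that crosses its predefined threshold."""
    alerts = []
    for name, limit in THRESHOLDS.items():
        value = current.get(name)
        if value is None:
            continue
        # Free disk space alerts when it drops below the limit; the others alert above it.
        breached = value < limit if name == "disk_free_pct" else value > limit
        if breached:
            alerts.append(f"{name}={value} crossed threshold {limit}")
    return alerts

print(check_thresholds({"p95_latency_ms": 620, "error_rate_pct": 0.4, "disk_free_pct": 42}))
# ['p95_latency_ms=620 crossed threshold 500']
```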
But monitoring's strength is also its limitation: you only see what you thought to measure. A new problem that doesn't trigger existing alerts goes unnoticed.
What Observability Reveals
Observability starts with comprehensive instrumentation capturing high-cardinality data—information with many unique values. Rather than pre-aggregating metrics, observability systems preserve details about individual events.
Consider a payment processing system. Monitoring might track:
- Total payments per minute
- Error rate across all payments
- Average processing time
Observability captures details about each payment:
- User ID
- Payment method
- Amount and currency
- Geographic region
- Processing time
- Every service the request touched
- All external API calls made
With this detailed data, you can ask questions that weren't anticipated: "Why are payments from users in Germany using Apple Pay failing more than other payment methods?"
This question wasn't built into your monitoring. Aggregate metrics would show "payments are mostly working." Meanwhile, a specific subset of users experiences silent failures. Observability data makes the question answerable.
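A small sketch of what that looks like in practice, using a hand-written in-memory list of events; the field names and values are hypothetical, and a real system would query an observability backend rather than a Python list:

```python
# Hypothetical wide events, one per payment; real systems store these in an
# observability backend, not an in-memory list.
payment_events = [
    {"user_id": "u-1041", "method": "apple_pay", "region": "DE", "status": "failed"},
    {"user_id": "u-2210", "method": "card",      "region": "DE", "status": "success"},
    {"user_id": "u-3307", "method": "apple_pay", "region": "US", "status": "success"},
    {"user_id": "u-4590", "method": "apple_pay", "region": "DE", "status": "failed"},
]

def failure_rate(events, **filters):
    """Failure rate for the subset of events matching arbitrary field filters."""
    subset = [e for e in events if all(e.get(k) == v for k, v in filters.items())]
    return sum(e["status"] == "failed" for e in subset) / len(subset) if subset else 0.0

# The question nobody anticipated, answered after the fact:
print(failure_rate(payment_events, region="DE", method="apple_pay"))  # 1.0
print(failure_rate(payment_events))                                   # 0.5 overall
```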
Metrics, Logs, and Traces
Observability relies on three complementary data types, each answering different questions:
Metrics provide numerical measurements over time. They answer "How many?" and "How fast?" Their aggregated nature makes them storage-efficient and quick to query—excellent for dashboards and trends.
Logs capture discrete events with full context. Each entry represents something that happened at a specific moment with all available details. Logs answer "What happened?" with complete specificity.
Traces follow individual requests through distributed systems, connecting all operations triggered by a single user action. Traces answer "Where did time go?" and "What caused this request to fail?"
The power emerges from correlation. A spike in error metrics leads to relevant logs, which reveal a trace ID. Following that trace shows exactly which service call failed and why. Each data type alone provides limited insight; together they enable complete understanding.
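A minimal sketch of the glue that makes that correlation possible: every structured log line carries the trace ID of the request that produced it. The service and field names are hypothetical, and in a real system the trace ID arrives via context propagation rather than being generated where the log is written:

```python
import json
import uuid

def log_event(service: str, message: str, trace_id: str, level: str = "error") -> None:
    """Emit a structured log line whose trace_id links it to a distributed trace."""
    print(json.dumps({
        "service": service,
        "level": level,
        "message": message,
        "trace_id": trace_id,  # the key that connects this log entry to its trace
    }))

# In practice the trace ID is created at the edge and propagated with the request.
trace_id = uuid.uuid4().hex
log_event("checkout", "downstream payment call timed out", trace_id)
# An error-rate spike points to logs like this one; the trace_id leads to the full trace.
```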
The Cardinality Problem
Cardinality—the number of unique values in a dataset—represents the crucial technical distinction between monitoring and observability.
Low-cardinality data has few unique values. HTTP status codes have only a few dozen possible values (200, 404, 500, and so on). Server hostnames might number in the hundreds. Low-cardinality data aggregates well and works beautifully for traditional monitoring.
High-cardinality data has many unique values. User IDs might number in millions. Request IDs are unique for every single request. High-cardinality data resists aggregation but enables detailed investigation.
Traditional monitoring systems struggle with high-cardinality data. Storing metrics for millions of unique user IDs becomes expensive and impractical. Observability systems embrace high-cardinality data, recognizing that filtering by any dimension is essential for investigating unexpected problems.
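A back-of-the-envelope illustration of why: in a metrics system, every unique combination of label values becomes its own time series, so the series count multiplies with each dimension's cardinality. The numbers below are illustrative:

```python
from math import prod

# Approximate number of unique values per label (illustrative).
low_cardinality  = {"status_code": 8, "method": 5, "host": 200}
high_cardinality = {**low_cardinality, "user_id": 2_000_000}

# Worst case, each unique label combination is a separate time series.
print(prod(low_cardinality.values()))   # 8,000 series: easy for a metrics backend
print(prod(high_cardinality.values()))  # 16,000,000,000 series: why metrics can't carry user IDs
```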
When Each Approach Wins
Monitoring excels at:
Known failure modes benefit from monitoring's efficiency. If disk space filling up causes problems, a simple metric and threshold provide immediate detection with minimal overhead.
Broad system health is quickly assessed through dashboards. A single view showing key metrics helps teams understand overall state at a glance.
Cost efficiency favors monitoring for long-term trends. Storing detailed observability data for years becomes prohibitively expensive; aggregated metrics remain affordable.
Automated response works well with monitoring. When a metric crosses a threshold, an automated system can scale infrastructure or restart services. Observability data's complexity makes it less suitable for automation.
Observability excels at:
Novel problems require observability's flexibility. When something unexpected occurs, you need to slice data in ways you didn't anticipate. Pre-aggregated metrics can't answer questions you didn't know to ask.
User experience benefits from high-cardinality data. Individual user journeys reveal patterns that aggregate metrics obscure. Some users experience problems invisible in overall averages.
Distributed debugging demands traces. When a request traverses dozens of microservices, understanding which service introduced latency requires detailed traces connecting all operations, as the sketch below illustrates.
Capacity planning improves with detail. Rather than guessing resource needs from aggregate trends, observability reveals exactly which users, features, or patterns drive consumption.
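A toy version of that trace analysis, with hand-written spans; the service names, timings, and parent/child structure are hypothetical, and real spans come from an instrumentation library rather than dictionaries:

```python
# One trace: a list of spans sharing a trace ID, each recording its parent and duration.
spans = [
    {"span": "api-gateway",     "parent": None,           "duration_ms": 910},
    {"span": "checkout-svc",    "parent": "api-gateway",  "duration_ms": 880},
    {"span": "inventory-svc",   "parent": "checkout-svc", "duration_ms": 35},
    {"span": "payment-svc",     "parent": "checkout-svc", "duration_ms": 820},
    {"span": "fraud-check-api", "parent": "payment-svc",  "duration_ms": 790},
]

def self_time(name: str) -> int:
    """A span's own time: its duration minus time spent in its direct children."""
    own = next(s for s in spans if s["span"] == name)
    children = sum(s["duration_ms"] for s in spans if s["parent"] == name)
    return own["duration_ms"] - children

slowest = max((s["span"] for s in spans), key=self_time)
print(slowest, self_time(slowest), "ms")  # fraud-check-api 790 ms: that's where the time went
```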
How They Work Together
Mature operations teams use both approaches where each provides the most value.
A typical workflow: monitoring detects an anomaly through threshold-based alerting. The alert includes links to observability tools pre-filtered to the relevant timeframe and service. Engineers investigate by slicing the high-cardinality data until they understand what happened. Once understood, the team adds new monitoring to catch this specific issue faster next time.
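One way to wire up the pre-filtered-links step, sketched with a hypothetical tool URL and query syntax (not any particular vendor's format):

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

def investigation_link(service: str, detected_at: datetime, window_minutes: int = 15) -> str:
    """Build a deep link into an (imaginary) observability tool, scoped to the alert's context."""
    params = {
        "query": f'service:"{service}" status:error',
        "from": (detected_at - timedelta(minutes=window_minutes)).isoformat(),
        "to": detected_at.isoformat(),
    }
    return "https://observability.example.com/explore?" + urlencode(params)

alert = {
    "summary": "checkout-svc p95 latency above threshold",
    "investigate": investigation_link("checkout-svc", datetime.now(timezone.utc)),
}
print(alert["investigate"])
```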
Monitoring provides always-on alerting for known issues. Observability enables investigation of the unknown. Neither replaces the other.
The Mindset Shift
Adopting observability requires cultural changes beyond technology. Monitoring encourages defining failure modes upfront. Observability encourages comprehensive instrumentation and exploration.
Teams accustomed to monitoring resist observability's upfront cost: "Why capture all this data if we don't know we'll need it?" The answer becomes clear during the first major incident resolved in minutes instead of hours because the necessary data already existed.
Observability also changes how teams think about health. Rather than claiming "the system is healthy" based on green dashboards, observability encourages humility: "All the things we thought to check look normal." Complex systems always have unknown unknowns. The question is whether you'll have the data to understand them when they emerge.