What Is Observability?


Your service is slow. You don't know why.

The dashboards show everything is green. CPU is fine. Memory is fine. No errors in the logs. But users are complaining, and you have no idea where to even start looking.

This is the moment that reveals whether your system is observable.

The Difference That Matters

Monitoring watches for known problems. You define what to track—CPU usage, error rates, latency—and alert when thresholds are crossed. Monitoring answers the questions you thought to ask: Is the service up? Is latency acceptable? Is the error rate elevated?
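
As a rough illustration of that model, here is a minimal sketch of threshold-based monitoring. The metric names and limits are hypothetical, chosen only for the example; real values depend on the service.

```python
from dataclasses import dataclass


@dataclass
class Threshold:
    metric: str
    limit: float


# Hypothetical thresholds for illustration only.
THRESHOLDS = [
    Threshold("cpu_percent", 90.0),
    Threshold("error_rate", 0.01),
    Threshold("p99_latency_ms", 500.0),
]


def check(snapshot: dict[str, float]) -> list[str]:
    """Return one alert message for every metric above its limit."""
    alerts = []
    for t in THRESHOLDS:
        value = snapshot.get(t.metric)
        if value is not None and value > t.limit:
            alerts.append(f"{t.metric}={value} exceeds {t.limit}")
    return alerts
```

A snapshot in which every tracked metric sits under its limit returns an empty list, even if a subset of users is having a terrible time. The check can only see what someone thought to list.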

Monitoring assumes you know what might fail.

Observability assumes you don't.

An observable system provides enough information that you can answer questions you didn't anticipate. When requests are slow but only for users in Europe, or only on Tuesdays, or only for requests with a specific combination of feature flags—observability gives you the tools to explore and understand what's happening without deploying new code or instrumentation.

Monitoring tells you when something is wrong. Observability helps you understand why.

Why This Matters Now

Simple systems can often get by with monitoring alone. If your application is a monolith talking to one database, you can enumerate most of the failure modes and monitor each one.

Modern systems aren't simple. Microservices, distributed databases, queues, caches, third-party APIs—they interact in ways that create emergent behaviors impossible to predict. The specific combination of circumstances causing today's outage has likely never occurred before and will never occur again in exactly the same way.

You can't define alerts for problems you haven't imagined.

What Observable Systems Emit

Observability data traditionally comes in three forms:

Metrics: Numerical measurements over time. Request counts, latencies, error rates, resource usage. Cheap to collect and store, excellent for dashboards and trends. But metrics are aggregates—you know average latency increased, not which specific requests were slow.
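
A small sketch of how an aggregate can hide the requests that matter. The latencies are made-up numbers, not measurements from any real system.

```python
from statistics import mean

# Made-up latencies in milliseconds: 98 fast requests and 2 very slow ones.
latencies_ms = [100.0] * 98 + [5000.0] * 2

average = mean(latencies_ms)  # 198.0 ms, which looks unremarkable
slowest = max(latencies_ms)   # 5000.0 ms, which two real users experienced
```

The average alone says nothing about which two requests took five seconds, or what they had in common.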

Logs: Discrete events describing what happened. "User 123 logged in." "Database query took 2.5 seconds." "Payment processed." Logs provide detail that metrics lack, but they're expensive to store at scale and hard to query across billions of events.

Traces: Records of requests flowing through distributed systems, showing the path and timing across services. Traces reveal where time is spent and which service is the bottleneck. Essential for distributed systems, but they generate substantial data volume.
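
To make "where time is spent" concrete, here is a hand-rolled sketch of a trace as a list of spans. It is not the API of any tracing library; the `Span` fields, service names, and timings are assumptions for illustration, and the calculation assumes child spans of the same parent do not overlap.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None for the root span
    service: str
    start_ms: float
    end_ms: float


def self_time_by_service(spans: list[Span]) -> dict[str, float]:
    """Time each service spent itself, excluding time spent in child spans."""
    child_time = defaultdict(float)
    for s in spans:
        if s.parent_id is not None:
            child_time[s.parent_id] += s.end_ms - s.start_ms

    totals = defaultdict(float)
    for s in spans:
        totals[s.service] += (s.end_ms - s.start_ms) - child_time[s.span_id]
    return dict(totals)


# One hypothetical request: gateway -> checkout -> (fraud-detection, inventory)
trace = [
    Span("a", None, "gateway", 0, 1000),
    Span("b", "a", "checkout", 10, 990),
    Span("c", "b", "fraud-detection", 50, 950),
    Span("d", "b", "inventory", 955, 980),
]

totals = self_time_by_service(trace)
bottleneck = max(totals, key=totals.get)
```

Subtracting child time matters: the root span is always the longest, so ranking by raw duration would blame the gateway for time actually spent downstream.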

These three provide complementary views. Metrics tell you something is wrong. Logs tell you what happened. Traces show you where.

The Real Power: High-Cardinality Data

Traditional metrics aggregate away the details you need. You know average latency for all requests, but not latency for requests from premium users in Europe using version 2.3 of your API with a specific feature flag enabled.

High-cardinality observability preserves these dimensions. Every event carries rich context: user ID, geographic region, customer plan, feature flags, code version, infrastructure details. You can filter and group by any combination without predicting in advance which combinations you'll need.
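
One way to picture this is a list of wide events that can be grouped by any set of fields after the fact. The sketch below uses plain dictionaries and invented data; the field names are assumptions, and a real system would run the same kind of query against an event store rather than an in-memory list.

```python
from collections import defaultdict
from statistics import mean

# Invented events; every one carries the same rich context.
events = [
    {"user_id": "u1", "region": "eu", "plan": "premium", "api_version": "2.3", "flag_new_cart": True, "duration_ms": 1900.0},
    {"user_id": "u2", "region": "eu", "plan": "premium", "api_version": "2.3", "flag_new_cart": True, "duration_ms": 2100.0},
    {"user_id": "u3", "region": "eu", "plan": "premium", "api_version": "2.3", "flag_new_cart": False, "duration_ms": 110.0},
    {"user_id": "u4", "region": "us", "plan": "premium", "api_version": "2.3", "flag_new_cart": True, "duration_ms": 120.0},
    {"user_id": "u5", "region": "eu", "plan": "free", "api_version": "2.2", "flag_new_cart": False, "duration_ms": 90.0},
    {"user_id": "u6", "region": "us", "plan": "free", "api_version": "2.3", "flag_new_cart": False, "duration_ms": 100.0},
]


def mean_latency_by(events: list[dict], *dimensions: str) -> dict[tuple, float]:
    """Group events by any combination of fields and average their latency."""
    groups = defaultdict(list)
    for e in events:
        key = tuple(e[d] for d in dimensions)
        groups[key].append(e["duration_ms"])
    return {key: mean(values) for key, values in groups.items()}


by_region = mean_latency_by(events, "region")
by_combo = mean_latency_by(events, "region", "plan", "flag_new_cart")
```

Grouping by region alone shows Europe is slower. Grouping by region, plan, and flag together isolates the one combination responsible, and nobody had to define that combination as a metric in advance.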

This is what transforms observability from "slightly better monitoring" into something fundamentally different. You can ask questions you didn't know to ask.

Investigation in Practice

Something seems wrong. Here's how an investigation might unfold in an observable system:

1. Metrics show an anomaly: latency spiked at 2:47 PM.
2. You filter by dimension: it's only affecting requests to the /checkout endpoint.
3. You filter further: it's only users with items in their cart over $500.
4. You pull traces for affected requests: they all show slow calls to the fraud detection service.
5. You check the fraud service: it's calling an external API that started timing out.
6. You find the root cause in 15 minutes instead of 3 hours.
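
The trail above can be sketched as a series of narrowing queries over request events. Everything here is illustrative: the records are invented, the 1000 ms cut-off is an assumption, and `slowest_call` stands in for what a trace would reveal about each request.

```python
from collections import Counter

SLOW_MS = 1000  # assumed cut-off for "slow"

# Invented request events from around the time of the spike.
requests = [
    {"endpoint": "/checkout", "cart_total": 750.0, "duration_ms": 4200, "slowest_call": "fraud-detection"},
    {"endpoint": "/checkout", "cart_total": 620.0, "duration_ms": 3900, "slowest_call": "fraud-detection"},
    {"endpoint": "/checkout", "cart_total": 80.0, "duration_ms": 180, "slowest_call": "payments"},
    {"endpoint": "/search", "cart_total": 900.0, "duration_ms": 95, "slowest_call": "search-index"},
    {"endpoint": "/checkout", "cart_total": 45.0, "duration_ms": 160, "slowest_call": "payments"},
]

# Which requests are slow at all?
slow = [r for r in requests if r["duration_ms"] > SLOW_MS]

# Which endpoints do the slow requests hit?
slow_endpoints = Counter(r["endpoint"] for r in slow)

# How do slow and fast checkout requests differ in cart value?
min_slow_cart = min(r["cart_total"] for r in slow)
max_fast_checkout_cart = max(
    r["cart_total"]
    for r in requests
    if r["endpoint"] == "/checkout" and r["duration_ms"] <= SLOW_MS
)

# Where did the slow requests spend their time?
suspects = Counter(r["slowest_call"] for r in slow)
```

Each query is a hypothesis tested against the data, and each answer decides the next question.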

Without observability, you'd be guessing. With it, you're following a trail.

What Makes Systems Observable

Structured output: Logs as unstructured text are nearly impossible to query at scale. Observable systems emit structured data—JSON or key-value pairs—that can be searched and analyzed efficiently.
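
A minimal sketch of structured logging using only Python's standard `logging` and `json` modules. The `JsonFormatter` class, the `make_logger` helper, and the convention of passing fields under a `context` key are assumptions made for this example, not a standard.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed via extra={"context": {...}} become top-level keys.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload, sort_keys=True)


def make_logger(name: str, stream) -> logging.Logger:
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output via the root logger
    return logger


buffer = io.StringIO()  # stands in for stdout or a log shipper
logger = make_logger("checkout", buffer)
logger.info(
    "Database query finished",
    extra={"context": {"user_id": "123", "duration_ms": 2500}},
)
line = buffer.getvalue().strip()
```

Compared with the free-text message "Database query took 2.5 seconds", the duration here is a number in a named field, so it can be filtered and aggregated without parsing prose.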

Rich context on every event: User IDs, request IDs, feature flags, versions, regions. You don't know what you'll need to debug, so include everything reasonable.

Correlation IDs: Requests that span multiple services carry an ID that links events together. This lets you trace a single user action across your entire system.
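
A sketch of how a correlation ID might be carried through one service, using the standard `contextvars` module. The `X-Request-ID` header name is a common convention assumed here, and `start_request`, `outgoing_headers`, and `event` are hypothetical helpers rather than part of any framework.

```python
import contextvars
import uuid

HEADER = "X-Request-ID"  # assumed header name

_request_id = contextvars.ContextVar("request_id", default=None)


def start_request(incoming_headers: dict) -> str:
    """Reuse the caller's ID if one arrived, otherwise mint a new one."""
    rid = incoming_headers.get(HEADER) or uuid.uuid4().hex
    _request_id.set(rid)
    return rid


def outgoing_headers() -> dict:
    """Headers to attach to calls made to downstream services."""
    rid = _request_id.get()
    return {HEADER: rid} if rid else {}


def event(message: str, **fields) -> dict:
    """Build an event that carries the current correlation ID."""
    return {"request_id": _request_id.get(), "message": message, **fields}
```

If every service follows the same two rules (reuse the incoming ID, forward it on outgoing calls), a single search for that ID returns the events from every hop of the request.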

Minimal performance overhead: Instrumentation must be cheap enough to run in production at full granularity. You can't observe what you can't afford to emit.

The Trade-offs Are Real

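Cost: Comprehensive observability generates enormous data volumes. Storage and ingestion costs tend to grow with the number of events and the amount of context attached to each one, so bills can climb quickly as traffic grows.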
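
A back-of-envelope sketch of how volume adds up. The traffic and event-size figures are assumptions picked for round numbers, and no pricing is implied.

```python
SECONDS_PER_DAY = 86_400


def daily_volume_gb(requests_per_second: float,
                    events_per_request: float,
                    bytes_per_event: float) -> float:
    """Estimate raw telemetry volume per day in gigabytes (1 GB = 1e9 bytes)."""
    total_bytes = (
        requests_per_second * events_per_request * bytes_per_event * SECONDS_PER_DAY
    )
    return total_bytes / 1e9


# Assumed workload: 2,000 requests/s, one 1,000-byte event per request.
single_event = daily_volume_gb(2_000, 1, 1_000)

# Same traffic, but each request emits five events across services.
five_events = daily_volume_gb(2_000, 5, 1_000)
```

Under these assumptions a single event per request already produces about 172.8 GB per day before compression or indexing, and five events per request produces about 864 GB.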

Complexity: Multiple data types, instrumentation libraries, collection pipelines, storage systems, query interfaces—there's a lot to manage.

Privacy: Observability data often contains sensitive information. User IDs, IP addresses, request payloads. This data must be protected and potentially anonymized.
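
One possible approach, sketched with the standard `hmac` and `hashlib` modules, is to replace sensitive fields with a keyed hash before events leave the service. The field list, the decision to drop `payload` entirely, and the 16-character truncation are assumptions for the example. Keyed hashing is pseudonymization rather than full anonymization: anyone holding the key can recompute the mapping.

```python
import hashlib
import hmac

SENSITIVE_FIELDS = {"user_id", "ip_address"}  # assumed list of fields to hash
DROPPED_FIELDS = {"payload"}                  # assumed list of fields to remove


def pseudonymize(value: str, key: bytes) -> str:
    """Keyed hash: stable for the same input and key, not reversible without it."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # truncated for readability


def scrub(event: dict, key: bytes) -> dict:
    """Return a copy of the event that is safer to ship to an observability backend."""
    clean = {}
    for field, value in event.items():
        if field in DROPPED_FIELDS:
            continue
        if field in SENSITIVE_FIELDS:
            clean[field] = pseudonymize(str(value), key)
        else:
            clean[field] = value
    return clean
```

Because the hash is stable, events from the same user still group together during an investigation, which keeps most of the debugging value while keeping the raw identifier out of the telemetry store.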

The appropriate level of observability depends on system criticality, team size, and budget. Not every system needs to answer arbitrary questions about its behavior.

The Mindset Shift

Observability isn't just tooling. It's thinking differently about production systems.

Instrument before problems occur. You can't go back and instrument events that have already happened.

Design for investigation. When building features, think about how you'll debug them when they misbehave.

Form hypotheses and test them. Don't randomly try things. Use the data to narrow down possibilities systematically.

The goal is transforming your production systems from mysterious black boxes into transparent systems where problems can be understood and resolved—even problems you never predicted.

That's what observability actually is. Not a checklist of tools, but the ability to answer the questions you haven't thought to ask yet.
