When something goes wrong in a distributed system, you need to see what happened. But "seeing" isn't one thing—it's three different resolutions, each revealing what the others can't.

Metrics show you the satellite view: aggregate patterns, trends, the shape of normal and abnormal. Logs give you the forensic close-up: this specific error, this user, this stack trace. Traces draw the journey map: how a single request wandered through your services, where it got stuck, what it touched along the way.

These aren't three competing tools. They're three lenses. The question is always: which resolution do I need right now?

Metrics: The Satellite View

Metrics are numbers aggregated over time. They answer "how many?" and "how fast?" for the system as a whole.

Request rate: 1,200 requests per second. Error rate: 0.3%. P99 latency: 180ms. CPU utilization: 65%. These numbers paint a picture of system health at a glance.
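
To make that concrete, here is a rough sketch of how a service might record a request counter and a latency histogram with the Python prometheus_client library. The metric names and port are invented for the example.

from prometheus_client import Counter, Histogram, start_http_server
import time

# Hypothetical metric names for this sketch.
REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["status"])
LATENCY = Histogram("http_request_duration_seconds", "Request latency in seconds")

def handle_request():
    with LATENCY.time():                        # records the duration into the histogram
        time.sleep(0.01)                        # stand-in for real work
        REQUESTS.labels(status="200").inc()     # counts the request by outcome

if __name__ == "__main__":
    start_http_server(8000)                     # exposes /metrics for a scraper to pull
    while True:
        handle_request()

Dashboards and alerts then work from these aggregated series; the individual requests themselves are never stored.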

Metrics excel at three things:

Dashboards: Real-time visualization of system state. You see the spike, the trend, the pattern—instantly.

Alerting: "Error rate above 1%" or "latency P99 above 500ms" are natural metric conditions. They tell you when to pay attention.

Capacity planning: Long-term trends in resource usage, traffic growth, and performance guide infrastructure decisions months in advance.

Metrics are also cheap. You can store years of metrics data at reasonable cost because you're storing aggregates, not individual events. A million requests collapse into a single counter value: 1,000,000.

But that aggregation is also the limitation. Metrics tell you average latency doubled, but not which requests were slow. They tell you errors spiked, but not which user was affected or why. For that, you need to zoom in.

Logs: The Forensic Close-Up

Logs are discrete events. Each entry describes one thing that happened at one moment: a user logged in, a query executed, an error occurred.

{
  "timestamp": "2024-01-15T10:30:45Z",
  "level": "ERROR",
  "service": "payment-api",
  "user_id": "user-123",
  "request_id": "req-789",
  "error": "Payment gateway timeout",
  "duration_ms": 5000
}

This single log entry tells a story: at this exact moment, this user's payment failed because the gateway timed out after 5 seconds. Metrics could never give you this level of detail.
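
Producing entries like this is mostly a matter of emitting one JSON object per event. Here is a minimal sketch using only Python's standard library; the field names simply mirror the entry above, and a real service would usually lean on a logging framework instead.

import json, logging, time

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record):
        return json.dumps({
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname,
            "service": "payment-api",             # would normally come from configuration
            **getattr(record, "context", {}),     # per-request fields passed via extra=
            "error": record.getMessage(),         # named "error" to mirror the entry above
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("payment-api")
logger.addHandler(handler)

logger.error("Payment gateway timeout",
             extra={"context": {"user_id": "user-123", "request_id": "req-789", "duration_ms": 5000}})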

Logs shine in:

Debugging specific issues: When you know approximately when something went wrong, logs show you exactly what happened—the stack trace, the parameters, the context.

Audit trails: Who did what and when? Logs provide the detailed record that compliance requires.

Error analysis: Understanding failure modes requires the actual error messages, not just error counts.

The cost is volume. A busy service generates millions of log entries per day. Storing, indexing, and querying that data is expensive. And even with good logs, distributed systems present a puzzle: when one request touches five services, its story is scattered across five separate logs that you have to correlate yourself.

For understanding how requests flow through your system, you need a different view entirely.

Traces: The Journey Map

Traces follow individual requests as they travel through distributed systems. A single trace shows every service touched, every operation performed, and how long each step took.

Trace ID: abc-123 (Total: 150ms)
  └─ API Gateway (150ms)
       ├─ Auth Service (20ms)
       ├─ Product Service (100ms)
       │    ├─ Database Query (80ms)
       │    └─ Price Service (15ms)
       └─ Recommendation Service (25ms)
            └─ Cache Lookup (5ms)

This trace tells a story that neither metrics nor logs could: the request passed through the API gateway and four downstream services, spent 80ms of its 150ms total in a single database query, and the recommendation service was fast because it hit the cache.
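
The shape above comes from nested spans. As a sketch, this is roughly what instrumenting that flow looks like with the OpenTelemetry Python SDK, collapsed into a single process and printing spans to the console; in a real system each service creates its own spans, and the trace context propagates between them in request headers.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# Print finished spans to stdout; a real deployment would export to Jaeger, Zipkin, or Tempo.
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
tracer = trace.get_tracer("product-page")

def render_product_page():
    with tracer.start_as_current_span("API Gateway"):             # root span of the trace
        with tracer.start_as_current_span("Auth Service"):
            pass                                                  # verify the session
        with tracer.start_as_current_span("Product Service"):
            with tracer.start_as_current_span("Database Query"):
                pass                                              # the 80ms culprit in the example
            with tracer.start_as_current_span("Price Service"):
                pass
        with tracer.start_as_current_span("Recommendation Service"):
            with tracer.start_as_current_span("Cache Lookup"):
                pass

render_product_page()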

Traces answer questions like:

Where is time actually spent? Not "latency is high" but "latency is high because the database query in the product service takes 80ms."

What depends on what? Traces reveal the actual call graph—which services talk to which, and in what order.

What happened to this specific request? Following a failed request across five services to find where and why it broke.

But traces have their own limitation: volume. Tracing every request in a system handling thousands per second would drown you in data. Sampling is necessary—trace 1% of requests, or 10%, depending on volume and budget.
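
With OpenTelemetry, for instance, that head-based sampling decision is typically made once at the root span and inherited by every child span; the 1% ratio below is just the figure from the paragraph above, not a recommendation.

from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Keep roughly 1% of new traces; child spans follow whatever the root decided.
provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.01)))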

And traces are bad at aggregation. Ask a tracing system "what's the average latency for all product page views?" and it shrugs. That's metrics' job.

How They Work Together

The pillars aren't alternatives—they're layers of an investigation.

Scenario: Slow requests

  1. Metrics alert you: P99 latency spiked from 200ms to 2 seconds
  2. Traces reveal: slow requests all spend excessive time in one specific database query
  3. Logs provide: the actual slow query SQL and the parameters that triggered it

Scenario: Elevated errors

  1. Metrics show: error rate jumped from 0.1% to 5%
  2. Logs reveal: the specific error message—"Database connection timeout"
  3. Traces show: errors only affect requests that call a particular downstream service
  4. Metrics (infrastructure) confirm: that service's database CPU is at 100%

Metrics tell you something is wrong. Logs tell you what went wrong. Traces tell you where it went wrong.
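
In practice, the glue between these layers is shared identifiers: if every log entry carries the trace ID of the request that produced it, an investigation can jump from a metric spike to a sampled trace to the exact log lines. A small sketch with the OpenTelemetry API (the helper name is made up):

from opentelemetry import trace

def log_context():
    """Fields to attach to every log entry so logs and traces can be joined later."""
    ctx = trace.get_current_span().get_span_context()
    if not ctx.is_valid:
        return {}
    return {
        "trace_id": format(ctx.trace_id, "032x"),   # same ID the tracing backend displays
        "span_id": format(ctx.span_id, "016x"),
    }

Merging the returned fields into the structured log entry shown earlier is enough to make logs and traces joinable in whichever backends store them.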

What to Implement First

Not every system needs all three pillars from day one.

Start with metrics: Essential for basic monitoring and alerting. Low cost, immediate value. You can't operate a production system blind.

Add structured logging: Crucial for debugging and compliance. Moderate cost. Use structured formats (JSON) from the start—retrofitting unstructured logs is painful.

Add distributed tracing: Valuable when your system is complex enough that requests cross multiple service boundaries. High cost, high value—but only when the complexity warrants it.

A simple service with one or two dependencies might only need metrics and logs. In a distributed system with dozens of services talking to each other, tracing becomes essential for understanding what's actually happening.

The Storage Reality

Each pillar has specialized storage systems because the query patterns differ:

Metrics: Time-series databases (Prometheus, InfluxDB) optimized for "give me this number at these time intervals."

Logs: Log aggregation systems (Elasticsearch, Splunk) optimized for "find all events matching this pattern."

Traces: Tracing systems (Jaeger, Zipkin, Tempo) optimized for "show me this request's journey."

Some platforms (Datadog, Honeycomb, New Relic) store all three together, enabling unified queries—"show me traces for requests that contributed to this latency spike." This correlation across pillars is where modern observability gets powerful.
