You have metrics. You have logs. You have dashboards for every service. And yet when a user reports that their checkout took 12 seconds, you have no idea what happened.

The authentication service logged a successful token validation. The cart service logged a retrieval. The payment service logged a charge. The inventory service logged a reservation. Each service did its job, logged its work, reported its metrics. Everything looks fine. But somewhere in the gaps between these services, 12 seconds disappeared.

This is the visibility problem that distributed tracing solves. Not "what did each service do?" but "what happened to THIS request?"

The Problem Tracing Solves

In a monolith, debugging a slow request is straightforward. You get a stack trace. You see every function call, every database query, every external API call—all in one place, all in order. The story tells itself.

In microservices, that story is shattered across machines. A single user request might touch ten services, each running on different servers, each with its own logs, its own metrics, its own view of the world. No service sees the whole picture. Each service sees only its fragment.

Logs tell you what happened inside each service. Metrics tell you aggregate behavior over time. But neither can answer: "This specific request from this specific user—what path did it take, what went wrong, and where did the time go?"

Distributed tracing reconstructs the story.

Traces and Spans

A trace represents one request's complete journey through the system. When a user clicks "checkout," a trace captures everything that happens in response—every service call, every database query, every cache lookup—stitched together into a single narrative with a unique trace ID.

A span represents one operation within that trace. The API gateway handling the initial request is a span. The authentication service validating the token is a span. The database query fetching the cart is a span. Each span records:

  • What operation occurred
  • When it started and how long it took
  • Which span triggered it (building the parent-child hierarchy)
  • Metadata: the SQL query executed, the HTTP status returned, the cache hit or miss

Spans nest inside each other, forming a tree that mirrors the actual execution:

Trace: Checkout request (2,400ms total)
  └─ API Gateway (2,400ms)
       ├─ Auth Service - Validate Token (45ms)
       ├─ Cart Service - Get Cart (180ms)
       │    └─ Database Query (150ms)
       ├─ Payment Service - Charge Card (1,800ms)  ← here's your problem
       │    ├─ Fraud Check (1,200ms)
       │    └─ Payment Gateway API (580ms)
       └─ Inventory Service - Reserve Items (320ms)

Now you see it. 1,800ms of your 2,400ms request is in the payment service. Within that, 1,200ms is fraud checking. The mystery is solved. You know exactly where to look.
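
Conceptually, a span is just a small record. A minimal sketch in Python, with illustrative field names rather than any particular backend's schema:

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    trace_id: str                   # shared by every span in the same trace
    span_id: str                    # unique to this operation
    parent_span_id: Optional[str]   # None only for the root span (the API gateway above)
    name: str                       # e.g. "Cart Service - Get Cart"
    start_time_ms: float
    duration_ms: float
    tags: dict = field(default_factory=dict)   # SQL text, HTTP status, cache hit/miss

A tracing backend groups spans by trace_id and follows each parent_span_id to rebuild exactly the tree shown above.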

How Context Propagates

For tracing to work, trace context must follow the request across service boundaries. When Service A calls Service B, it passes along the trace ID and its own span ID (which becomes the parent for Service B's span).

This typically happens via HTTP headers:

X-Trace-ID: abc123xyz
X-Parent-Span-ID: span-456

Message queues carry trace context in metadata. gRPC and similar frameworks have built-in propagation. The key is that every service in the chain receives the trace ID, creates its spans with that ID, and passes it forward. When the tracing backend receives all these spans, it assembles them into the complete tree by matching IDs.
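
Without a tracing library, propagation is just reading those headers and forwarding them. A simplified sketch using the header names above; the downstream URL and the span-recording step are placeholders:

import uuid
import requests

def handle_request(incoming_headers: dict) -> requests.Response:
    # Reuse the caller's trace ID, or start a new trace at the edge of the system
    trace_id = incoming_headers.get("X-Trace-ID", str(uuid.uuid4()))
    parent_span_id = incoming_headers.get("X-Parent-Span-ID")  # parent of this service's span
    my_span_id = str(uuid.uuid4())

    # ... do local work and record a span with (trace_id, my_span_id, parent_span_id) ...

    # Pass the context forward so the downstream service's spans join the same trace
    return requests.get(
        "http://cart-service/cart",   # hypothetical downstream service
        headers={"X-Trace-ID": trace_id, "X-Parent-Span-ID": my_span_id},
    )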

Break this chain anywhere—a service that doesn't propagate headers, a message queue without metadata support—and your trace fragments. You see pieces but not the whole.

Instrumentation: Automatic and Manual

Automatic instrumentation wraps common frameworks and libraries. Install the tracing library for your web framework, and incoming requests automatically create spans. Install the database client wrapper, and queries automatically become spans. You get basic tracing with minimal code changes.
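
With OpenTelemetry's Python instrumentation packages, for example, that can be as little as the following (assuming opentelemetry-instrumentation-flask and opentelemetry-instrumentation-requests are installed, and an exporter is configured elsewhere):

from flask import Flask
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)   # every incoming HTTP request becomes a span
RequestsInstrumentor().instrument()       # every outgoing `requests` call becomes a child span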

Manual instrumentation captures custom operations:

# OpenTracing-style API; `tracer` comes from whichever tracing library is configured,
# e.g. tracer = opentracing.global_tracer()
with tracer.start_span("fraud_check") as span:
    span.set_tag("user_id", user.id)
    span.set_tag("amount", amount)
    result = fraud_service.check(user, amount)   # the operation being traced
    span.set_tag("risk_score", result.score)     # record the outcome for debugging later

The goal is spans for every operation you might need to debug: external API calls, significant business logic, anything that could be slow or fail.

The Sampling Problem

A large system might handle millions of requests per second. Tracing every request would generate terabytes of data daily—expensive to store, expensive to query, and the instrumentation overhead would degrade performance.

Sampling decides which requests to trace:

Head-based sampling decides at the start. Trace 1% of requests, discard 99%. Simple, but you might miss the one slow request that mattered.

Tail-based sampling collects everything initially, then keeps only interesting traces: errors, slow requests, specific users. More powerful but more complex—you need somewhere to buffer all that data while deciding.

The practical approach: Always trace errors (100%). Always trace slow requests. Sample normal requests at 1-10%. This catches most problems while keeping costs manageable.
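
As a sketch, the keep-or-drop decision a tail-based sampler might apply once a trace is complete could look like this; the thresholds and trace fields are illustrative:

import random

SLOW_THRESHOLD_MS = 2000       # illustrative "slow request" cutoff
BASELINE_SAMPLE_RATE = 0.05    # keep 5% of normal traffic

def should_keep(trace) -> bool:
    if trace.has_error:                              # always keep errors
        return True
    if trace.duration_ms > SLOW_THRESHOLD_MS:        # always keep slow requests
        return True
    return random.random() < BASELINE_SAMPLE_RATE    # sample the rest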

What Tracing Reveals

The actual bottleneck. Metrics tell you the checkout endpoint is slow. Tracing tells you it's slow because the fraud check calls a third-party API that's timing out, causing retries.

Cascading failures. The product page is failing. Tracing shows it calls the recommendation service, which calls the user preference service, which is overloaded. The root cause is three services away from the symptom.

Hidden dependencies. You thought the order service was standalone. Tracing reveals it calls five other services, any of which could cause failures.

Optimization opportunities. Two services are called sequentially when they could be parallel. A cache that should hit is missing. A database query runs five times when once would suffice.

The Tracing Ecosystem

OpenTelemetry has become the standard for instrumentation—a vendor-neutral way to generate traces that work with any backend. Instrument once, send data anywhere.
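
A minimal Python setup looks roughly like this; here spans print to the console, and swapping the exporter is how the same instrumentation sends data to a different backend:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# Configure once at startup; a real deployment would use an OTLP exporter pointed at
# Jaeger, Zipkin, or a commercial backend instead of the console
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")

with tracer.start_as_current_span("checkout") as span:
    span.set_attribute("cart.item_count", 3)   # illustrative attribute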

Jaeger and Zipkin are popular open-source tracing backends. Self-hosted, battle-tested, sufficient for many use cases.

Commercial platforms (Datadog, New Relic, Honeycomb) add advanced querying, anomaly detection, and integration with metrics and logs. The tradeoff is cost versus capability.

The choice matters less than having tracing at all. Start with something. Migrate later if needed.

Tracing Completes Observability

Metrics answer: "What is happening?" Latency is up. Error rate is spiking. CPU is high.

Logs answer: "What did each component do?" Service A processed request. Service B threw exception.

Traces answer: "What happened to this request?" It went here, then there, spent 400ms waiting for this database, failed when that service returned a 503.

Without tracing, debugging microservices means correlating logs across services by timestamp—possible but painful. With tracing, you query for slow checkouts and see exactly what happened, end to end, in a single view.

Distributed tracing transforms microservices from a collection of opaque boxes into a system you can actually understand. The request's journey becomes visible. The story tells itself again.
