The RED Method

The RED Method is a monitoring methodology for microservices. RED stands for Rate, Errors, and Duration—three metrics that answer the only questions users actually ask:

  • Is it working? (Errors)
  • Is it fast? (Duration)
  • Is it keeping up? (Rate)

Created by Tom Wilkie (then at Weaveworks, now at Grafana Labs), RED cuts through the noise of infrastructure metrics to focus on what matters: the user's experience, translated into numbers.

Rate: Is It Keeping Up?

Rate is requests per second. It measures load—how much demand your service faces right now.

Track total requests, requests by endpoint, requests by method. But the real value is pattern recognition: What's your normal Tuesday at 2pm? When traffic suddenly drops 80%, is your service broken or is something upstream failing to send requests? When traffic spikes 300%, is it a viral moment or an attack?

Rate establishes baseline. Without knowing normal, you can't recognize abnormal.

Segment by what matters to your business: customer tier (are premium users affected?), geographic region (is this a regional outage?), client version (did a new release break something?). Service-wide rate hides problems affecting specific populations.
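
As a minimal sketch of what "rate" means mechanically (a hypothetical in-process tracker, not how a real metrics backend computes it):

import time
from collections import deque

class RateTracker:
    """Derives requests per second from timestamps kept in a sliding window."""

    def __init__(self, window_seconds=60):
        self.window = window_seconds
        self.timestamps = deque()

    def record_request(self):
        self.timestamps.append(time.monotonic())

    def requests_per_second(self):
        cutoff = time.monotonic() - self.window
        # Drop samples that have aged out of the window
        while self.timestamps and self.timestamps[0] < cutoff:
            self.timestamps.popleft()
        return len(self.timestamps) / self.window

In practice a metrics backend derives this for you from a plain request counter; the point is simply that rate is always a count over a time window.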

Errors: Is It Working?

Error rate is failed requests divided by total requests. This is reliability from the user's perspective—the percentage of people who tried to do something and couldn't.

The distinction between error types matters:

Server errors (5xx) are your fault. The service failed. Fix them.

Client errors (4xx) are often legitimate. Authentication failures happen when passwords are wrong. Bad requests happen when clients misbehave. Don't conflate these with reliability problems—but do watch for spikes that indicate something changed.

Timeouts are a special category. The request didn't fail, exactly—it just never finished. These often indicate resource exhaustion or dependency problems.

Business errors are the sneaky ones. HTTP 200 OK, but the response body says "payment declined" or "item out of stock." Technically successful. Actually a failure from the user's perspective. Decide whether these belong in your error rate based on what you're trying to measure.
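
As a sketch of how these categories might be separated in instrumentation (the function, field names, and business-error values here are illustrative assumptions, not a standard):

def classify_response(status_code, body, timed_out=False):
    """Bucket a response into the error categories described above."""
    if timed_out:
        return "timeout"          # never finished; often resource exhaustion
    if status_code >= 500:
        return "server_error"     # our fault; counts against reliability
    if status_code >= 400:
        return "client_error"     # often legitimate; watch for spikes
    # HTTP 200, but the body reports a failure the user actually experiences
    if isinstance(body, dict) and body.get("status") in ("payment_declined", "out_of_stock"):
        return "business_error"
    return "success"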

Error rate feeds directly into SLOs. "99.9% availability" means error rate must stay below 0.1%. That's 43 minutes of errors per month. Sounds generous until you realize that's total, not consecutive.
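
The 43-minute figure falls straight out of the error budget arithmetic (assuming a 30-day month):

slo = 0.999                          # 99.9% availability target
minutes_per_month = 30 * 24 * 60     # 43,200 minutes in a 30-day month
error_budget_minutes = (1 - slo) * minutes_per_month
print(round(error_budget_minutes, 1))   # 43.2 minutes of errors, total, not consecutive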

Duration: Is It Fast?

Duration is how long requests take. This is performance, and it directly determines user experience.

Here's the critical insight: average duration is a lie.

An average of 50ms sounds great. But averages hide the distribution. If 95% of requests complete in 20ms and 5% take 2 seconds, your average is still ~120ms—but 1 in 20 users has a terrible experience. They don't care about your average. They care about their request.

Use percentiles:

  • P50 (median): Half your users are faster, half are slower
  • P95: 95% of users are faster than this
  • P99: Only 1% of users wait longer

P95 and P99 reveal tail latency—the experience of your unluckiest users. These are often your most engaged users, the ones making the most requests, the ones most likely to notice when things are slow.
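
To make the example concrete, here is a small sketch using a simple nearest-rank percentile (production systems compute percentiles from histograms, but the arithmetic is the same):

# The example above: 95% of requests finish in 20 ms, 5% take 2,000 ms
durations_ms = sorted([20] * 950 + [2000] * 50)

def percentile(samples, p):
    """Nearest-rank percentile over an already-sorted list."""
    return samples[min(len(samples) - 1, int(len(samples) * p / 100))]

print(sum(durations_ms) / len(durations_ms))   # 119.0  -- the "great" average
print(percentile(durations_ms, 50))            # 20     -- the median user is fine
print(percentile(durations_ms, 95))            # 2000   -- the unlucky 1 in 20
print(percentile(durations_ms, 99))            # 2000   -- the tail is two seconds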

Track duration separately for successful and failed requests. Failed requests often have very different timing (immediate failures vs. timeout failures), and mixing them distorts your understanding of successful request performance.

RED vs. the Four Golden Signals

RED is a subset of Google's Four Golden Signals, optimized for microservices:

RED         Golden Signals
Rate        Traffic
Errors      Errors
Duration    Latency
(omitted)   Saturation

The deliberate omission: Saturation (resource utilization).

The argument is philosophical: Saturation is an implementation detail. Users don't care if your CPU is at 80%—they care if the page loads. For microservices that auto-scale, the platform handles capacity. You care about outcomes (RED), not mechanisms (Saturation).

This is mostly true. But Saturation still matters for capacity planning, cost optimization, and understanding why RED metrics are degrading. Many teams use RED for service-level monitoring and track Saturation separately for infrastructure concerns.

Implementing RED

The good news: you probably don't have to build this yourself.

Service meshes (Istio, Linkerd) automatically collect RED metrics for every service without code changes. The mesh sees all traffic and measures it.

Application frameworks often include instrumentation. OpenTelemetry, Prometheus client libraries, and framework-specific middleware automatically emit request counts, error counts, and latency histograms.

API gateways (Kong, Ambassador, cloud provider gateways) measure everything flowing through them.

For custom protocols or specific business operations, add explicit instrumentation:

import time

def handle_request():
    start_time = time.monotonic()
    try:
        result = process_request()            # the actual work being measured
        record_success()
        return result
    except Exception as exc:
        record_error(type(exc).__name__)      # tag errors by exception type
        raise
    finally:
        # Duration and Rate are recorded for every request, success or failure
        record_duration(time.monotonic() - start_time)
        increment_request_count()
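
If you already export to Prometheus, the same pattern maps onto its Python client library. A sketch, with illustrative metric names and the same process_request placeholder as above:

from prometheus_client import Counter, Histogram

REQUESTS = Counter("myservice_requests_total", "Total requests",
                   ["endpoint", "status"])
DURATION = Histogram("myservice_request_duration_seconds", "Request latency",
                     ["endpoint"])

def handle(endpoint):
    with DURATION.labels(endpoint=endpoint).time():                      # Duration
        try:
            result = process_request()
            REQUESTS.labels(endpoint=endpoint, status="success").inc()   # Rate
            return result
        except Exception:
            REQUESTS.labels(endpoint=endpoint, status="error").inc()     # Errors
            raise

Rate is then the sum across both status labels, Errors is the error-labeled series divided by that sum, and Duration comes from the histogram buckets.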

Per-Endpoint Visibility

Service-wide RED metrics hide problems. If /api/users handles 1000 req/s at 0.01% errors and /api/payments handles 100 req/s at 1% errors, your service-wide error rate looks fine (~0.1%) while payments is failing ten times more than it should.
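
Worked out, the masking arithmetic from that example looks like this (numbers carried over from the hypothetical above):

endpoints = {
    "/api/users":    {"rate": 1000, "error_rate": 0.0001},   # 0.01% errors
    "/api/payments": {"rate": 100,  "error_rate": 0.01},     # 1% errors
}

total_requests = sum(e["rate"] for e in endpoints.values())                     # 1100/s
total_errors = sum(e["rate"] * e["error_rate"] for e in endpoints.values())     # 1.1/s

print(f"{total_errors / total_requests:.2%}")               # 0.10% -- looks healthy
print(f"{endpoints['/api/payments']['error_rate']:.2%}")    # 1.00% -- it isn't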

Break down by endpoint:

Endpoint         Rate     Errors   P95 Duration
/api/users       450/s    0.05%    80ms
/api/payments    120/s    0.18%    250ms
/api/search      200/s    0.02%    320ms

Now you know where to focus. Payments has the highest error rate and elevated latency. Search is slow but reliable. Users is healthy.

RED for Dependencies

Track RED metrics for services you call, not just services you provide. When your payment processor has a bad day, you need to know.

Dependency        Rate     Errors   P95
Payment Gateway   120/s    0.25%    180ms
User Service      500/s    0.02%    45ms
Inventory API     300/s    0.08%    120ms

This reveals whether third-party services are meeting their SLAs and whether they're the bottleneck in your latency chain.

RED and SLOs

RED metrics map directly to Service Level Objectives:

Availability SLO: "99.9% of requests succeed" → Error rate must stay below 0.1%

Latency SLO: "95% of requests complete within 200ms" → P95 Duration must stay below 200ms

Throughput SLO: "Handle 10,000 requests per second" → Rate must reach 10,000/s without degrading Errors or Duration

The third one is subtle. Throughput SLOs aren't just about rate—they're about rate while maintaining quality. Handling 10,000 req/s at 5% errors isn't meeting the objective.
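
As a sketch, the three objectives reduce to straightforward checks over measured RED values (the thresholds mirror the examples above; the function and its arguments are illustrative):

def meets_slos(rate_rps, error_rate, p95_ms):
    """Evaluate the three example SLOs against current RED measurements."""
    availability_ok = error_rate < 0.001      # 99.9% of requests succeed
    latency_ok = p95_ms < 200                 # 95% complete within 200 ms
    # Throughput only counts if quality holds at that load
    throughput_ok = rate_rps >= 10_000 and availability_ok and latency_ok
    return availability_ok, latency_ok, throughput_ok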

Beyond HTTP

RED applies to any request-response pattern:

gRPC: RPC calls per second, failed calls, call duration.

Database queries: Queries per second, failed queries, query duration. Segment by query type—reads and writes often have very different profiles.

Message queues: Messages processed per second, processing failures, processing time. For queues, also track queue depth—it's the closest analog to saturation, and the earliest sign that consumers are falling behind.

Background jobs: Jobs per hour, failed jobs, job duration. Longer time windows make sense for infrequent operations.

The pattern is universal: How much work? How much failed? How long did it take?
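
Because the pattern is identical everywhere, it generalizes to a wrapper around any unit of work. A sketch, assuming the record_* hooks from the earlier example accept an operation label (an assumption beyond the original pseudocode):

import functools
import time

def red_instrumented(operation):
    """Wrap any request/response-shaped function with RED instrumentation."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                result = func(*args, **kwargs)
                record_success(operation)
                return result
            except Exception as exc:
                record_error(operation, type(exc).__name__)
                raise
            finally:
                record_duration(operation, time.monotonic() - start)
                increment_request_count(operation)
        return wrapper
    return decorator

@red_instrumented("inventory.query")   # works for DB queries, RPCs, job handlers...
def fetch_inventory(item_id):
    ...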

Alerting on RED

Set alerts that detect real problems without crying wolf:

Errors: Alert when error rate exceeds your SLO threshold for long enough to matter. "Error rate > 1% for 5 minutes" catches real problems. "Error rate > 0.1% for 30 seconds" catches noise.

Duration: Alert on percentile degradation. "P99 > 500ms for 5 minutes" means your slowest users are having a bad time. "P50 doubled from baseline" means something fundamental changed.

Rate: Alert on dramatic changes. Traffic dropping 80% often means something upstream broke. Traffic spiking 300% might be an attack or a viral moment. Both deserve investigation.
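
The "for long enough to matter" qualifier just means the condition must hold across consecutive evaluation intervals. A toy sketch of that logic (real alerting systems express it declaratively as a duration on the alert rule):

def sustained(samples, threshold, duration_s, interval_s=30):
    """True if every sample in the last duration_s exceeded threshold.

    samples is the most recent series of error-rate measurements,
    one every interval_s seconds, newest last.
    """
    needed = duration_s // interval_s
    recent = samples[-needed:]
    return len(recent) == needed and all(s > threshold for s in recent)

# "Error rate > 1% for 5 minutes", sampled every 30 seconds
error_rate_samples = [0.012, 0.015, 0.011, 0.013, 0.014,
                      0.012, 0.016, 0.011, 0.013, 0.012]   # ten samples = 5 minutes
print(sustained(error_rate_samples, threshold=0.01, duration_s=300))   # True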

The meta-alert: Alert when you don't have data. Missing metrics often indicate a more serious problem than bad metrics.

Common Mistakes

Tracking only averages: Duration averages hide tail latency. The 5% of users waiting 2 seconds while your average is 50ms are real people having real bad experiences. Use percentiles.

Ignoring client errors: 4xx errors are often legitimate, but spikes indicate changes. If 401 errors jump 10x, something changed in authentication. Don't dismiss client errors entirely.

No segmentation: Service-wide metrics miss problems affecting specific endpoints, regions, or customer tiers. The CEO's demo failing won't show up in aggregate error rate.

Missing baselines: Is 500 req/s high or low? Is 100ms P95 good or bad? Without historical context, RED metrics are just numbers. Establish baselines, then detect deviations.

Alert fatigue: Over-sensitive alerts train teams to ignore them. Every alert should require action. If you're dismissing alerts as "just noise," your thresholds are wrong.

The Full Picture

RED provides service-level visibility, but it's not complete observability:

Logs tell you what failed when errors spike. RED says "0.5% of requests failed." Logs say "NullPointerException in PaymentProcessor.java line 247."

Traces tell you where time went when duration increases. RED says "P99 jumped to 800ms." Traces say "400ms in database query, 350ms waiting for downstream service."

Saturation tells you why capacity is constrained. RED says "latency is increasing." Saturation says "database connection pool is exhausted."

Business metrics tell you if users accomplished their goals. RED says "service is healthy." Business metrics say "but nobody is completing purchases." Technical health isn't the same as user success.

RED is the starting point. When it shows a problem, other observability tools help you understand and fix it.
