
Updated 10 hours ago

The USE Method is a diagnostic framework for finding infrastructure bottlenecks. Created by Brendan Gregg, it asks three questions of every resource in your system: Is it busy (Utilization)? Is it drowning (Saturation)? Is it dying (Errors)?

The genius is in the specificity. Instead of staring at dashboards hoping something looks wrong, USE gives you a systematic way to interrogate every resource—CPU, memory, disk, network—and determine whether it's your bottleneck.

The Three Questions

Utilization: Is It Busy?

The percentage of time a resource spends doing work or, for capacity resources such as memory, the share of capacity in use.

For CPU, this is time not idle. For disk, time performing I/O. For network, bandwidth in use as a percentage of capacity. For memory, percentage of available memory consumed.

High utilization means the resource is working hard. But here's what trips people up: high utilization isn't necessarily bad. A CPU running at 85% utilization might be perfectly healthy—you're getting your money's worth. The problem comes when utilization combines with the next metric.

Saturation: Is It Drowning?

The degree to which extra work is queued because the resource can't keep up.

This is where the real trouble lives. Saturation means demand exceeds capacity. Work is piling up. Users are waiting.

For CPU, saturation shows up as run queue length—processes waiting their turn. For memory, it's swapping to disk because RAM is exhausted. For disk, it's I/O queue depth—requests waiting for the drive. For network, it's dropped packets because buffers overflowed.

Here's the insight that changes how you think about performance: a resource at 60% utilization with significant saturation is a bigger problem than one at 90% utilization without saturation. The first is drowning. The second is just busy.

Saturation means someone is waiting. Utilization without saturation just means efficient use.
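One way a resource can show moderate utilization and still be saturated is bursty demand: averaged over an interval it looks half idle, yet work queues during each burst. The following is a minimal sketch of that effect, not a model of any real scheduler. The `simulate` function, the tick-based arrival patterns, and the capacity of 10 units per tick are all illustrative assumptions.

```python
def simulate(arrivals, capacity):
    """Return (utilization, mean_backlog) for a fixed-capacity resource.

    arrivals: work units arriving at the start of each tick (assumed data).
    capacity: work units the resource can finish per tick.
    """
    backlog = 0
    busy = 0
    backlog_sum = 0
    for arriving in arrivals:
        backlog += arriving
        done = min(backlog, capacity)
        busy += done
        backlog -= done
        backlog_sum += backlog  # work still waiting at the end of the tick
    ticks = len(arrivals)
    return busy / (capacity * ticks), backlog_sum / ticks


bursty = [30, 0, 0, 0, 0] * 4  # 30 units land at once, then nothing
steady = [9] * 20              # 9 units arrive every tick

print(simulate(bursty, capacity=10))  # (0.6, 6.0)
print(simulate(steady, capacity=10))  # (0.9, 0.0)
```

The bursty workload averages 60% utilization but keeps six units waiting on average; the steady one runs at 90% with nothing queued.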

Errors: Is It Dying?

The count of error events for the resource.

Memory ECC errors. Disk read failures. Network CRC errors. Device timeouts. These indicate hardware problems, driver issues, or misconfiguration.

Errors are different from utilization and saturation because any non-zero count demands attention. High utilization is often fine. Saturation indicates a capacity problem. But errors? Errors mean something is wrong or failing.

A disk reporting read errors might fail completely next week. Memory ECC errors might indicate a failing DIMM. Network frame errors might mean a bad cable or failing NIC.

Why These Three, in This Order

Utilization tells you a resource is working. Saturation tells you it's drowning. Errors tell you it's dying.

That's the progression: busy → overwhelmed → broken.

When investigating performance problems, this order matters. A saturated resource is your bottleneck—work is queued, users are waiting. A resource with errors might be about to fail catastrophically. But high utilization alone? That might just be a resource earning its keep.

Applying USE to Your Resources

CPU

  • Utilization: Percentage of time not idle (check per-core, not just aggregate)
  • Saturation: Run queue length, scheduler latency
  • Errors: Machine check exceptions, thermal throttling

A CPU at 95% utilization with a run queue of 1 is busy but healthy. The same CPU with a run queue of 15 is drowning: over a dozen processes are waiting for their turn at every moment.
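As a rough sketch of the CPU check, the run queue can be compared against the number of cores. The helper names below are hypothetical, and the sketch assumes the run-queue figure counts running tasks as well as queued ones. The standard-library `os.getloadavg()` is used only as a coarse proxy: on Linux the load average also includes tasks in uninterruptible sleep, so it is not a pure run-queue measurement.

```python
import os


def cpu_saturation(runnable, cores):
    """Tasks left waiting when `runnable` tasks compete for `cores` CPUs.

    Assumes `runnable` counts running tasks as well as queued ones.
    """
    return max(0, runnable - cores)


def load_per_core():
    """1-minute load average divided by logical CPU count (Unix only)."""
    load1, _, _ = os.getloadavg()
    return load1 / (os.cpu_count() or 1)


print(cpu_saturation(1, 1))   # 0: busy, but nothing is waiting
print(cpu_saturation(15, 1))  # 14: most of the queue is waiting
```

A `load_per_core()` value that stays well above 1.0 is a hint worth following up with a direct run-queue measurement.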

Memory

  • Utilization: Percentage of RAM in use
  • Saturation: Swap rate, page scanning rate
  • Errors: ECC errors, failed allocations, OOM killer events

High memory utilization is normal—unused RAM is wasted RAM. But the moment you see swapping, you have saturation. The system is using disk as overflow memory, and disk is orders of magnitude slower than RAM.
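On Linux, both memory signals can be derived from text files under `/proc`. The sketch below assumes the usual `MemTotal` and `MemAvailable` fields in `/proc/meminfo` and the `pswpin`/`pswpout` counters in `/proc/vmstat`. The function names and the sample text are illustrative, not output from a real machine.

```python
def parse_counters(text):
    """Parse 'Key: 123 kB' or 'key 123' lines into a {key: int} dict."""
    out = {}
    for line in text.splitlines():
        parts = line.replace(":", " ").split()
        if len(parts) >= 2 and parts[1].isdigit():
            out[parts[0]] = int(parts[1])
    return out


def memory_utilization(meminfo_text):
    """Share of RAM not available for new allocations (0.0 to 1.0)."""
    info = parse_counters(meminfo_text)
    return 1 - info["MemAvailable"] / info["MemTotal"]


def swap_rate(vmstat_before, vmstat_after, seconds):
    """Pages swapped in plus out per second between two samples."""
    b, a = parse_counters(vmstat_before), parse_counters(vmstat_after)
    pages = (a["pswpin"] - b["pswpin"]) + (a["pswpout"] - b["pswpout"])
    return pages / seconds


# Hypothetical samples: 92% utilized, no swap activity between samples.
meminfo = "MemTotal:       16000000 kB\nMemAvailable:    1280000 kB\n"
vm_t0 = "pswpin 100\npswpout 250\n"
vm_t1 = "pswpin 100\npswpout 250\n"

print(round(memory_utilization(meminfo), 2))  # 0.92
print(swap_rate(vm_t0, vm_t1, seconds=10))    # 0.0
```

On a live system the inputs would come from reading `/proc/meminfo` once and `/proc/vmstat` twice, a few seconds apart.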

Disk

  • Utilization: Percentage of time performing I/O
  • Saturation: I/O queue depth, wait time
  • Errors: Read/write errors, SMART warnings, timeouts

Modern SSDs can sustain high utilization without saturation. Spinning disks saturate more easily. Either way, check the queue—if requests are piling up, the disk can't keep pace.
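A sketch of how disk utilization and average queue depth might be computed on Linux from two samples of a `/proc/diskstats` line. It assumes the standard field layout, where the thirteenth field is milliseconds spent doing I/O and the fourteenth is the weighted milliseconds. The function name and sample lines are made up for illustration. For devices that serve requests in parallel, such as SSDs and RAID arrays, the busy-time figure can overstate how close the device is to its limit.

```python
def disk_use(before, after, interval_s):
    """Compare two /proc/diskstats lines for one device, interval_s apart."""
    b, a = before.split(), after.split()
    interval_ms = interval_s * 1000
    busy_ms = int(a[12]) - int(b[12])      # time with at least one I/O in flight
    weighted_ms = int(a[13]) - int(b[13])  # in-flight request count x time
    return {
        "device": a[2],
        "utilization": busy_ms / interval_ms,
        "avg_queue": weighted_ms / interval_ms,
    }


# Hypothetical samples taken 10 seconds apart.
t0 = "8 0 sda 1000 0 8000 500 2000 0 16000 900 0 4000 9000"
t1 = "8 0 sda 1500 0 12000 700 2600 0 20800 1500 0 8000 29000"

print(disk_use(t0, t1, interval_s=10))
# {'device': 'sda', 'utilization': 0.4, 'avg_queue': 2.0}
```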

Network

  • Utilization: Throughput as percentage of link capacity
  • Saturation: Dropped packets, buffer overruns, retransmits
  • Errors: Frame errors, CRC errors, collisions

A 1 Gbps link at 800 Mbps utilization might be fine—or might be dropping packets. Saturation tells you whether the pipe is full or overflowing.
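The same two-sample approach can be sketched for a network interface using Linux's `/proc/net/dev` counters. This assumes the usual column order (eight receive fields, then eight transmit fields) and a full-duplex link, so the busier direction is compared against link speed. The function names, the sample lines, and the `link_bps` value are illustrative assumptions.

```python
def parse_net_dev_line(line):
    """Extract byte, error, and drop counters from one /proc/net/dev line."""
    name, _, rest = line.partition(":")
    f = [int(x) for x in rest.split()]
    return {
        "iface": name.strip(),
        "rx_bytes": f[0], "rx_errs": f[2], "rx_drop": f[3],
        "tx_bytes": f[8], "tx_errs": f[10], "tx_drop": f[11],
    }


def network_use(before, after, interval_s, link_bps):
    """Utilization, drops, and errors between two samples of one interface."""
    b, a = parse_net_dev_line(before), parse_net_dev_line(after)
    delta = {k: a[k] - b[k] for k in a if k != "iface"}
    busiest = max(delta["rx_bytes"], delta["tx_bytes"])  # full duplex assumed
    return {
        "utilization": busiest * 8 / interval_s / link_bps,
        "drops": delta["rx_drop"] + delta["tx_drop"],
        "errors": delta["rx_errs"] + delta["tx_errs"],
    }


# Hypothetical samples 10 seconds apart on a 1 Gbps link.
t0 = "eth0: 5000000000 4000000 0 10 0 0 0 0 1000000000 900000 0 0 0 0 0 0"
t1 = "eth0: 6000000000 4800000 0 35 0 0 0 0 1200000000 1000000 0 0 0 0 0 0"

print(network_use(t0, t1, interval_s=10, link_bps=1_000_000_000))
# {'utilization': 0.8, 'drops': 25, 'errors': 0}
```

Here the link is at 800 Mbps and 25 packets were dropped in the interval, which is the "overflowing" case rather than the merely full one.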

USE in Practice

Users report slow responses. Before blaming the application, interrogate the infrastructure:

CPU: 75% utilized, run queue at 8
Verdict: Saturated. Processes are waiting. This is a bottleneck.

Memory: 92% utilized, zero swap activity
Verdict: High utilization but no saturation. Memory is well-used, not overwhelmed.

Disk: 40% utilized, queue depth at 2
Verdict: Healthy. Low utilization, minimal queuing.

Network: 25% utilized, no dropped packets
Verdict: Healthy. Plenty of headroom.

The CPU saturation explains the slow responses. Eight processes waiting at any moment means substantial queueing delay. The fix is either reducing CPU demand or adding capacity.
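The triage above can be sketched as a small classifier. Everything here is an assumption for illustration: the `Reading` type, the `verdict` function, the 90% "busy" cutoff, and the per-resource saturation limits (for instance, a run-queue limit of 4, which presumes a four-core machine). Real thresholds depend on the hardware and workload.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    name: str
    utilization: float  # 0.0 to 1.0
    saturation: float   # run queue, swap rate, queue depth, drops...
    sat_limit: float    # hypothetical threshold for this resource
    errors: int = 0


def verdict(r, busy_at=0.9):
    """Errors outrank saturation, which outranks high utilization."""
    if r.errors > 0:
        return "errors"
    if r.saturation > r.sat_limit:
        return "saturated"
    if r.utilization >= busy_at:
        return "busy"
    return "healthy"


readings = [
    Reading("cpu", 0.75, saturation=8, sat_limit=4),      # run queue
    Reading("memory", 0.92, saturation=0, sat_limit=0),   # swap activity
    Reading("disk", 0.40, saturation=2, sat_limit=4),     # queue depth
    Reading("network", 0.25, saturation=0, sat_limit=0),  # dropped packets
]

for r in readings:
    print(r.name, verdict(r))
# cpu saturated
# memory busy
# disk healthy
# network healthy
```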

USE Complements RED

The RED Method asks about services: What's the request rate? What's the error rate? What's the duration?

USE asks about infrastructure: Is the CPU busy? Is memory drowning? Is the disk dying?

They work together. RED tells you a service is slow. USE tells you why—because the CPU is saturated, or memory is swapping, or the network is dropping packets.

Example: RED shows request duration spiking from 50ms to 500ms. Something is slow. USE reveals CPU run queue jumped from 1 to 12. The service is slow because processes are waiting for CPU time. Now you know where to look.

Proactive USE Monitoring

Don't wait for outages. Track USE metrics continuously:

Capacity planning: Which resources are approaching limits? Scale before saturation.

Trend analysis: Is CPU utilization growing by 2 percentage points per week? Left unchecked, you'll run out of headroom in a few months.

Alerting: Alert on saturation immediately—it means users are waiting. Alert on high utilization as a warning—you're approaching the cliff.
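For trend analysis, a back-of-the-envelope projection is often enough. The sketch below assumes growth is linear and measured in percentage points per week. The starting point of 60% and the 90% warning line are made-up inputs, and `weeks_until` is a hypothetical helper.

```python
import math


def weeks_until(current_pct, growth_pct_per_week, limit_pct):
    """Whole weeks until utilization reaches limit_pct under linear growth.

    Returns 0 if already at or past the limit, None if it is not growing.
    """
    if current_pct >= limit_pct:
        return 0
    if growth_pct_per_week <= 0:
        return None
    return math.ceil((limit_pct - current_pct) / growth_pct_per_week)


print(weeks_until(60, 2, 90))  # 15
```

Fifteen weeks is roughly three and a half months, which is the window available for adding capacity before the warning threshold is crossed.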

What USE Doesn't Cover

USE finds infrastructure bottlenecks. It won't find:

  • Slow database queries
  • Inefficient algorithms
  • Deadlocks and race conditions
  • External service dependencies

If USE shows all resources healthy but the application is slow, the problem most likely lies in the application code or an external dependency, not in the local infrastructure.
