Interpreting Check Results

Monitoring checks generate data. Data alone doesn't tell you what's happening or what to do about it. The gap between raw results and genuine understanding is where monitoring either becomes useful or degenerates into noise.

Check States Are Symptoms, Not Diagnoses

Monitoring systems classify results into states:

OK/Success/Passing means the check met its criteria. Response arrived on time, status code matched, content appeared as expected. OK doesn't mean perfect—response time might be slower than ideal while still passing thresholds—but it means nothing requires immediate action.

Warning/Degraded signals something noteworthy without being critical. Response time exceeded the ideal range but not the critical threshold. Error rates increased but remain manageable. Warnings exist to catch problems before they become emergencies.

Critical/Error/Failing indicates clear problems. The service is unreachable, responses consistently fail, or performance degraded to unacceptable levels. Critical states demand investigation.

Unknown/Indeterminate appears when the monitoring system couldn't complete the check. Network issues, monitoring infrastructure problems, or configuration errors produce unknown states. Unknown demands a different response than critical—fix the monitoring before investigating the monitored system.

A warning isn't an emergency. An unknown might reflect monitoring problems, not service problems. Reading states correctly prevents both panic and neglect.
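
In code, that classification is usually just a small decision over thresholds and probe errors. The sketch below is only an illustration: the names (CheckState, classify_result) and the latency cutoffs are invented, and real monitoring tools apply their own rules.

```python
from enum import Enum

class CheckState(Enum):
    OK = "ok"
    WARNING = "warning"
    CRITICAL = "critical"
    UNKNOWN = "unknown"

# Illustrative thresholds only; pick values that match your service.
WARN_LATENCY_MS = 800
CRIT_LATENCY_MS = 2000

def classify_result(status_code, latency_ms, probe_error=None):
    """Map one raw check result to a state."""
    if probe_error is not None:
        # The probe itself failed (timeout, DNS, bad config): we learned
        # nothing about the target, so the state is UNKNOWN, not CRITICAL.
        return CheckState.UNKNOWN
    if status_code >= 500 or latency_ms >= CRIT_LATENCY_MS:
        return CheckState.CRITICAL
    if latency_ms >= WARN_LATENCY_MS:
        return CheckState.WARNING
    return CheckState.OK

print(classify_result(200, 120))                       # CheckState.OK
print(classify_result(200, 950))                       # CheckState.WARNING
print(classify_result(503, 40))                        # CheckState.CRITICAL
print(classify_result(None, None, probe_error="dns"))  # CheckState.UNKNOWN
```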

Context Changes Everything

The same result means different things in different contexts. A 10% error rate might be:

  • Acceptable for an experimental feature with minimal usage
  • Concerning for a mature feature during normal operations
  • Expected during a planned deployment
  • Catastrophic for payment processing where errors mean lost revenue

Effective interpretation considers:

Time patterns. Response times typically increase during peak hours. Batch processing consumes resources overnight. Weekend traffic differs from weekday traffic. Understanding when you're measuring prevents false alarms when systems behave normally for the current timeframe.

Recent changes. Did performance degrade immediately after a deployment? Did error rates spike after a configuration change? Connecting results to change events often points directly to root causes.

Correlated events. If one service reports high latency, that service might have problems. If twenty services simultaneously report latency, the network has problems.

Seasonal baselines. Traffic doubling over six months is growth. Traffic doubling in six hours is an anomaly. Comparing current metrics to the same day last week reveals whether a change is normal evolution or an incident.
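
One lightweight way to apply a seasonal baseline is to compare the current value with the same point in last week's cycle and only flag large relative deviations. A minimal sketch; the traffic numbers and the 50% tolerance are made up:

```python
def deviates_from_baseline(current, baseline, tolerance=0.5):
    """True if `current` differs from `baseline` by more than `tolerance`
    as a fraction of the baseline (0.5 = plus or minus 50%)."""
    if baseline == 0:
        return current != 0
    return abs(current - baseline) / baseline > tolerance

# Requests per minute now vs. the same hour last week (invented numbers).
current_rpm = 2400
same_hour_last_week_rpm = 1150

if deviates_from_baseline(current_rpm, same_hour_last_week_rpm):
    print("Far outside the weekly pattern; treat as an anomaly.")
else:
    print("Within the normal weekly pattern.")
```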

Single Points Don't Tell Stories

A single elevated latency measurement might result from:

  • A brief network hiccup affecting one request
  • The beginning of a degradation that will worsen
  • Normal variation within acceptable range
  • The monitoring probe happening to hit a slow endpoint

You can't know which from one data point. Understanding requires patterns:

Duration distinguishes transient blips from sustained problems. A brief spike might not merit attention. Sustained elevation indicates issues requiring investigation.
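
A common way to encode that distinction is to require several consecutive samples over the threshold before calling the elevation sustained. A sketch; the window size, threshold, and data are illustrative:

```python
def sustained_breach(values, threshold, min_consecutive=3):
    """True if `values` ends with at least `min_consecutive` consecutive
    samples above `threshold`: a sustained problem, not a one-off spike."""
    streak = 0
    for v in values:
        streak = streak + 1 if v > threshold else 0
    return streak >= min_consecutive

latencies_ms = [110, 95, 120, 900, 105, 980, 1020, 1100]  # invented samples
print(sustained_breach(latencies_ms, threshold=500))  # True: three samples in a row breach
# The lone 900ms spike earlier in the window would not have triggered this.
```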

Trend direction reveals whether problems are improving, worsening, or stable. Error rates increasing over hours suggest escalation. Error rates decreasing suggest recovery. Stable elevated rates might indicate a new normal requiring threshold adjustment.
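
Trend direction can be estimated crudely by comparing the recent half of a window against the earlier half. A sketch under that assumption; the series and the 5% noise band are invented:

```python
def trend_direction(values, flat_band=0.05):
    """Classify a series as 'worsening', 'improving', or 'stable' by comparing
    the mean of the second half to the mean of the first half. Relative changes
    smaller than `flat_band` are treated as noise."""
    mid = len(values) // 2
    first = sum(values[:mid]) / mid
    second = sum(values[mid:]) / (len(values) - mid)
    change = (second - first) / first if first else 0.0
    if change > flat_band:
        return "worsening"
    if change < -flat_band:
        return "improving"
    return "stable"

hourly_error_rates = [0.8, 0.9, 1.1, 1.4, 2.0, 2.6]  # percent, invented
print(trend_direction(hourly_error_rates))  # worsening
```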

Frequency helps assess intermittent issues. A problem appearing once per week suggests different causes than one appearing every few minutes. Rare failures might not justify investigation. Frequent intermittent failures definitely do.

The Average Lies to You

Consider a service where:

  • 95% of requests complete in under 100ms
  • 4% of requests take 500ms
  • 1% of requests take 5 seconds

The average response time works out to roughly 165ms (0.95 × 100 + 0.04 × 500 + 0.01 × 5000, taking the fast requests at 100ms). Sounds acceptable.

But that average hides the fact that 1 in 100 users waits 5 seconds—a terrible experience invisible in the summary statistic. The average lies by hiding the suffering at the edges.

Percentile metrics reveal what averages obscure:

  • 50th percentile (median): typical experience. Half faster, half slower.
  • 95th percentile: the experience of nearly all users. Only 5% of requests are slower.
  • 99th percentile: tail latency. The unluckiest users.
  • 99.9th percentile: extreme outliers indicating rare but serious problems.

Monitoring that only examines averages misses critical issues hiding in the tail. Your dashboard glows green while users suffer.
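
Recreating the example above makes the gap concrete: the same 100-request distribution (with the fast requests modeled at exactly 100ms) produces a modest-looking average and a brutal tail. The percentile function below is a simple nearest-rank implementation, not any particular vendor's definition:

```python
def percentile(values, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    ordered = sorted(values)
    rank = max(1, -(-p * len(ordered) // 100))  # ceiling of p/100 * n
    return ordered[int(rank) - 1]

# 100 samples shaped like the example: 95 fast, 4 slow, 1 terrible (ms).
samples = [100] * 95 + [500] * 4 + [5000]

print(sum(samples) / len(samples))  # 165.0 -> the "acceptable" average
print(percentile(samples, 50))      # 100   -> median: the typical request
print(percentile(samples, 95))      # 100   -> p95 still looks healthy here
print(percentile(samples, 99))      # 500   -> the tail starts to show
print(percentile(samples, 99.9))    # 5000  -> the user the average hides
```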

HTTP 200 OK Is the Most Dangerous Lie

A check receives a 200 OK response—technically a success—while:

  • The response contains an error message instead of expected content
  • The page loaded but critical JavaScript failed
  • The API returned success but with incomplete data
  • Response time was technically acceptable but noticeably degraded

Green doesn't mean good. It means the check passed. The check might not be testing what matters.

Sophisticated monitoring goes deeper:

Content validation verifies the response contains expected data. Searching for specific text, validating JSON structure, or checking for known error messages catches failures hiding behind successful status codes.

Performance thresholds ensure success responses arrive within acceptable timeframes. A page that takes 30 seconds to load has technically succeeded while practically failing.

Completeness checks verify all expected elements loaded. A partially rendered page might return 200 OK while missing critical components.
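
A single probe can fold all three ideas together: status, latency, and content. A minimal sketch using Python's requests library; the URL, marker text, and thresholds are placeholders, not a recommended configuration:

```python
import requests

def deep_check(url, expected_text, max_seconds=2.0):
    """Fail the check unless the response is a 2xx, arrives within
    `max_seconds`, and actually contains the content we care about."""
    try:
        resp = requests.get(url, timeout=10)
    except requests.RequestException as exc:
        return ("unknown", f"probe failed: {exc}")  # check could not complete

    if not resp.ok:
        return ("critical", f"status {resp.status_code}")
    if resp.elapsed.total_seconds() > max_seconds:
        return ("warning", f"slow: {resp.elapsed.total_seconds():.2f}s")
    if expected_text not in resp.text:
        return ("critical", "200 OK but expected content is missing")
    return ("ok", "passed status, latency, and content checks")

# Placeholder URL and marker text.
print(deep_check("https://example.com/", "Example Domain"))
```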

False Positives Train You to Ignore Real Problems

False positives occur when monitoring reports problems that don't exist. Network issues at the monitoring location. Thresholds too strict for normal variation. False positives waste time and, worse, train teams to dismiss alerts.

False negatives happen when monitoring misses real problems. Checks run too infrequently to catch brief outages. The specific functionality that broke wasn't covered. False negatives create dangerous false confidence.

Understanding your system's accuracy helps calibrate response. High false positive rates suggest loosening thresholds or improving check design. False negatives indicate gaps in coverage.
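
If you label past alerts after the fact (real incident or not) and count the incidents monitoring missed, precision and recall put numbers on these two failure modes. A sketch with invented counts:

```python
def alert_accuracy(true_alerts, false_alerts, missed_incidents):
    """Precision: of the alerts we fired, how many were real?
    Recall: of the real incidents, how many did we catch?"""
    precision = true_alerts / (true_alerts + false_alerts)
    recall = true_alerts / (true_alerts + missed_incidents)
    return precision, recall

# Made-up month of alert history.
precision, recall = alert_accuracy(true_alerts=12, false_alerts=48, missed_incidents=3)
print(f"precision={precision:.0%} recall={recall:.0%}")
# precision=20% recall=80%: mostly noise, and it still misses 1 in 5 incidents.
```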

Correlation Isn't Causation

CPU usage and response time both spiked. Did high CPU cause slow responses? Or did slow responses cause CPU to spike while threads waited for external services?

Untangling correlation from causation:

Timeline analysis: which metric changed first? The initial change often indicates root cause.

Change event correlation: did metric changes align with deployments or configuration changes? A shift that begins right after a change event is strong, though not conclusive, evidence of causation.

System knowledge: understanding dependencies helps establish cause-and-effect chains. Knowing a service queries a database before responding helps interpret scenarios where both database latency and API latency increased.
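
Timeline analysis in particular is easy to mechanize: find the first moment each metric left its normal range and compare. A sketch with invented series and thresholds:

```python
def first_breach(series, threshold):
    """Return the timestamp of the first sample above `threshold`, or None."""
    for ts, value in series:
        if value > threshold:
            return ts
    return None

# (timestamp_seconds, value) pairs; invented data for illustration.
db_latency_ms = [(0, 5), (60, 6), (120, 48), (180, 95), (240, 110)]
api_latency_ms = [(0, 80), (60, 85), (120, 90), (180, 400), (240, 650)]

db_first = first_breach(db_latency_ms, threshold=20)
api_first = first_breach(api_latency_ms, threshold=200)
print(f"database degraded at t={db_first}s, API at t={api_first}s")
# The database moved first (t=120 vs t=180), which points at it as the
# likelier cause; evidence, not proof.
```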

Unknown States Deserve Attention

Unknown or indeterminate states typically indicate:

  • The monitoring system couldn't reach the target
  • The check timed out before completing
  • The monitoring infrastructure itself has problems
  • Configuration errors prevent the check from running

The critical question: monitoring problem or service problem?

Other checks from the same location still succeeding suggests the target has problems, not the monitoring infrastructure.

Checks from other locations succeeding indicates location-specific network issues—possibly affecting real users in that region.

Multiple unrelated checks all showing unknown simultaneously points to monitoring infrastructure problems.
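
That triage can be written as a small decision rule over the latest result from every check and location, as in the sketch below; the data shape and rule order are assumptions for illustration, not how any particular monitoring product behaves.

```python
def triage_unknown(check, location, results):
    """Classify an UNKNOWN result for (check, location) given the latest
    states of every check: {(check_name, location_name): state}.
    A heuristic sketch, for illustration only."""
    others_same_location = [
        s for (c, loc), s in results.items() if loc == location and c != check
    ]
    same_check_elsewhere = [
        s for (c, loc), s in results.items() if c == check and loc != location
    ]

    if others_same_location and all(s == "unknown" for s in others_same_location):
        return "monitoring infrastructure problem: everything from this location is unknown"
    if same_check_elsewhere and all(s == "ok" for s in same_check_elsewhere):
        return "location-specific network issue: the check passes from everywhere else"
    return "likely a problem with the target itself: investigate the service"

results = {
    ("checkout", "fra"): "unknown",
    ("checkout", "nyc"): "ok",
    ("checkout", "syd"): "ok",
    ("homepage", "fra"): "ok",
    ("homepage", "nyc"): "ok",
}
print(triage_unknown("checkout", "fra", results))
# -> location-specific network issue: the check passes from everywhere else
```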

Unknown states often get ignored because they're ambiguous. But network partitions, firewall changes, and infrastructure failures can all manifest as unknown checks. Ambiguity isn't permission to ignore.
