A load balancer without health checks is like a receptionist forwarding calls to an office that burned down yesterday—technically doing the job, catastrophically unhelpful.
Health checks are how load balancers answer a simple question: is this server still alive? They ask constantly, and when the answer changes, they act immediately.
The Problem Health Checks Solve
Without health checks, a load balancer has no idea what's happening behind it. Server crashed? Still sending traffic. Application frozen? Still sending traffic. Database connection pool exhausted? Still sending traffic.
Users hit these dead servers and get errors. The load balancer keeps distributing requests evenly across a pool that includes corpses.
Health checks give the load balancer eyes. Now it can see when a server stops responding and stop sending traffic there. When the server recovers, traffic resumes automatically. No operator needed.
Active Health Checks
Active health checks are the load balancer poking each server: "You alive? You alive? You alive?"
TCP checks try to establish a connection. If the server accepts the connection, it's alive. This proves the server is running and reachable but says nothing about whether the application works.
HTTP checks send actual requests—typically to /health or /ping—and verify they get 200 OK back. This proves the application is running, not just the server.
Custom checks go deeper. The health endpoint might test database connectivity, cache availability, or other dependencies. Return 200 only if everything the application needs is working.
Three parameters control active checks:
- Interval: How often to check (typically 5-30 seconds)
- Timeout: How long to wait for a response (typically 2-10 seconds)
- Threshold: How many consecutive failures before marking unhealthy (typically 2-3)
Requiring multiple failures prevents a single dropped packet from triggering failover.
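A minimal sketch of how the three parameters fit together, assuming one watcher loop per backend. The probe is any single check (a TCP connect or a GET to /health) that applies the timeout internally; concrete probes appear under Health Check Examples below, and the values here are illustrative:

```python
import time
from typing import Callable

INTERVAL_S = 10   # interval: how often to check
THRESHOLD = 3     # threshold: consecutive failures before marking unhealthy

def watch(probe: Callable[[], bool]) -> None:
    """Run one server's active checks forever. `probe` performs a single
    check and returns False on any error or timeout."""
    failures = 0
    while True:
        if probe():
            failures = 0                          # any success resets the streak
        else:
            failures += 1
            if failures == THRESHOLD:
                print("server marked unhealthy")  # remove it from rotation here
        time.sleep(INTERVAL_S)
```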
Passive Health Checks
Passive checks don't send test requests. They watch real traffic.
If a server starts returning 500 errors or timing out on actual requests, the load balancer notices and marks it unhealthy. No explicit health check needed.
The advantage: catches problems that wouldn't show up on a simple health check. Maybe the server responds to /health fine but fails on real requests under load.
The disadvantage: requires real traffic to detect problems. During quiet periods, a dead server might go unnoticed. And some users hit errors before the pattern becomes clear.
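A sketch of the passive side, assuming the load balancer records the outcome of every real request it proxies; the window size and error-rate cutoff are illustrative:

```python
from collections import deque

WINDOW = 100          # most recent real requests to judge by
MAX_ERROR_RATE = 0.5  # mark unhealthy above this fraction of failures

class PassiveMonitor:
    """Tracks real-request outcomes for one backend."""

    def __init__(self) -> None:
        self.outcomes: deque[bool] = deque(maxlen=WINDOW)

    def record(self, succeeded: bool) -> None:
        # Called after every proxied request: True for a normal response,
        # False for a 5xx or a timeout.
        self.outcomes.append(succeeded)

    def unhealthy(self) -> bool:
        # With little traffic there is nothing to judge by -- the quiet-period
        # blind spot described above.
        if len(self.outcomes) < WINDOW:
            return False
        error_rate = self.outcomes.count(False) / len(self.outcomes)
        return error_rate > MAX_ERROR_RATE
```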
Most production systems use both. Active checks catch obvious failures fast. Passive monitoring catches subtle problems that slip through.
Designing Health Endpoints
A good health check endpoint:
Responds fast. Milliseconds, not seconds. Health checks run constantly—if they're slow, they'll stress the servers they're trying to protect.
Tests what matters. If your app needs a database, verify the database connection. If it needs a cache, verify the cache. Return 200 only when the server can actually do its job.
Stays lightweight. Don't run expensive queries or heavy computation. A health check that takes 2 seconds of CPU time defeats the purpose.
Lives at a dedicated URL. /health, /ping, /status—something separate from your application routes. This isolates health checking from normal request handling.
Some health endpoints return detailed JSON:
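One illustrative shape such a response might take (the field names here are made up, not a standard):

```json
{
  "status": "ok",
  "checks": {
    "database": "ok",
    "cache": "ok",
    "disk_space": "ok"
  },
  "uptime_seconds": 93212
}
```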
The load balancer only checks the status code. The details help operators debug when something's wrong.
The Failover Sequence
When health checks start failing:
- First failure: noted, but no action yet
- Consecutive failures reach the threshold: server marked unhealthy
- Load balancer updates its routing: unhealthy server removed from rotation
- New requests go only to healthy servers
- Existing connections: either terminated, allowed to finish, or drained gradually (depends on configuration)
- Health checks continue against the unhealthy server
The server is out of rotation but not forgotten. The load balancer keeps checking, waiting for recovery.
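Keeping an unhealthy server in the pool while excluding it from routing can be as simple as filtering before each pick; a sketch, assuming each backend carries a healthy flag maintained by the checks above:

```python
import itertools
from dataclasses import dataclass

@dataclass
class Backend:
    address: str
    healthy: bool = True

POOL = [Backend("10.0.0.1:8080"), Backend("10.0.0.2:8080"), Backend("10.0.0.3:8080")]
_turn = itertools.count()

def pick_backend() -> Backend:
    """Round-robin over healthy servers only. Unhealthy servers stay in POOL
    so the health checker can keep probing them and flip them back later."""
    candidates = [b for b in POOL if b.healthy]
    if not candidates:
        raise RuntimeError("no healthy backends available")
    return candidates[next(_turn) % len(candidates)]
```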
The Recovery Sequence
When an unhealthy server starts responding again:
- Health checks start succeeding
- Consecutive successes reach the recovery threshold: server marked healthy
- Server added back to rotation
- Traffic resumes
Recovery thresholds are usually higher than failure thresholds. If two failures mark a server down, you might require three successes to bring it back up. This prevents flapping: a marginal server bouncing between healthy and unhealthy.
Some load balancers ramp traffic gradually to recovered servers rather than immediately sending full load. A server that just came back might not handle a sudden traffic spike well.
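A sketch of both ideas, using two failures to mark a server down and three successes to bring it back; the comment marks where a gradual traffic ramp would begin:

```python
FALL = 2   # consecutive failed checks before marking unhealthy
RISE = 3   # consecutive passed checks before marking healthy again

class ServerState:
    def __init__(self) -> None:
        self.healthy = True
        self.fail_streak = 0
        self.ok_streak = 0

    def observe(self, check_passed: bool) -> None:
        if check_passed:
            self.ok_streak += 1
            self.fail_streak = 0
            if not self.healthy and self.ok_streak >= RISE:
                self.healthy = True
                # A gradual ramp would start here: rejoin the rotation at a low
                # weight and raise it over the next minute or two.
        else:
            self.fail_streak += 1
            self.ok_streak = 0
            if self.healthy and self.fail_streak >= FALL:
                self.healthy = False
```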
Cascading Failures
This is the nightmare scenario.
One server fails. Its traffic shifts to the remaining servers. Now they're handling more load. They slow down. Health checks start timing out. More servers marked unhealthy. More traffic shifts to fewer servers. Those servers overload. More failures. Dominoes.
A four-server cluster running at 80% utilization seems fine. One dies, and the remaining three jump to roughly 107% effective load each. They start failing. Soon you have no healthy servers.
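The arithmetic generalizes: per-server load after a failure is utilization × N / (N - failed). A quick check of the numbers above:

```python
def load_after_failures(utilization: float, servers: int, failed: int) -> float:
    """Effective per-server load once traffic redistributes to the survivors."""
    return utilization * servers / (servers - failed)

print(load_after_failures(0.80, 4, 1))  # ~1.07: each survivor runs at ~107%
print(load_after_failures(0.60, 4, 1))  # 0.80: the same cluster with headroom absorbs the loss
```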
Prevention:
- Capacity headroom: Never run at 100% even when healthy. If losing one server would overload the rest, you need more servers.
- Tolerant thresholds: Configure health checks to tolerate slowness rather than treating a slow response as a failure. A slow server is better than no server.
- Circuit breakers: When things go wrong, fail fast rather than queuing requests indefinitely.
- Cluster-level monitoring: Individual server health matters, but so does overall cluster capacity.
Health Check Examples
Basic HTTP check:
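A sketch of the probe from the load balancer's side, assuming a /health endpoint; the path and timeout are illustrative:

```python
import urllib.request

def http_check(host: str, port: int, timeout: float = 5.0) -> bool:
    """Healthy only if GET /health returns 200 within the timeout."""
    try:
        url = f"http://{host}:{port}/health"
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # connection refused, timeout, error status, DNS failure
        return False
```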
Deep application check:
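A sketch of the endpoint on the application side, verifying dependencies before answering 200. The hostnames are placeholders, and the TCP reachability probe stands in for whatever your real dependency check is (a SELECT 1, a cache PING):

```python
import socket
from flask import Flask, jsonify

app = Flask(__name__)

# Placeholder dependency locations; replace with your real ones.
DATABASE = ("db.internal", 5432)
CACHE = ("cache.internal", 6379)

def reachable(host: str, port: int, timeout: float = 1.0) -> bool:
    """Cheap stand-in check: can we open a TCP connection quickly?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

@app.route("/health")
def health():
    checks = {
        "database": reachable(*DATABASE),
        "cache": reachable(*CACHE),
    }
    status = 200 if all(checks.values()) else 503
    return jsonify(status="ok" if status == 200 else "degraded", checks=checks), status
```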
TCP check (for databases or non-HTTP services):
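A sketch of a plain TCP check, useful for databases and other non-HTTP services; host, port, and timeout are illustrative:

```python
import socket

def tcp_check(host: str, port: int, timeout: float = 3.0) -> bool:
    """Healthy if the server accepts a TCP connection within the timeout.
    Proves something is listening, not that the application works."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print(tcp_check("10.0.0.5", 5432))  # e.g. a PostgreSQL backend
```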
What to Monitor
Server status: Which servers are healthy, unhealthy, or transitioning.
Health check success rate: A server that fails 5% of health checks is flaky. Investigate before it fails completely.
Failover events: Alert when servers go down, even though the system handles it automatically. Automatic doesn't mean invisible.
Recovery time: How long do servers stay unhealthy? Long recovery times suggest persistent problems, not transient blips.