TCP connections are designed to be reliable, but reliability has a blind spot: what happens when the other side simply vanishes?
A power failure. A kernel panic. Someone trips over a network cable. The remote machine doesn't gracefully close the connection—it can't. It's gone. And your application? It's still holding that connection open, waiting for data that will never arrive.
This is the half-open connection problem. Without intervention, these zombie connections persist indefinitely, consuming file descriptors, memory, and slots in connection pools. TCP keep-alive exists to solve this: a mechanism built into the protocol itself that periodically checks whether anyone's still on the other end.
The Problem Keep-Alive Solves
A crashed server can't say goodbye. It just vanishes. And the connections it held? They wait. Forever, if you let them.
TCP's normal reliability mechanisms—acknowledgments, retransmissions, timeouts—only activate when you're trying to send data. If your connection is idle, sitting in a pool waiting to be used, there's nothing to timeout. The connection looks perfectly healthy until you try to use it and discover the other end died hours ago.
Keep-alive adds a heartbeat to idle connections. During periods when no application data flows, the operating system periodically sends a tiny probe: "Are you still there?" If the answer comes back, the connection lives. If silence persists after multiple attempts, the connection is declared dead and cleaned up.
How the Probes Work
The probe packet is a trick: it carries a sequence number one less than the next byte the peer expects, a number that's technically stale, specifically to provoke a correction. The remote TCP stack can't help itself; it has to respond with an acknowledgment saying "actually, I'm expecting byte X." That compulsion to correct is how you know someone's still there.
When TCP keep-alive is enabled on a socket, the operating system starts a timer after the last data exchange. When the timer expires with no activity, a probe goes out. If the peer responds, the timer resets. If not, another probe follows. After enough failed probes, the connection is marked dead.
The probes are tiny—just TCP headers, no payload. They don't interfere with normal data flow and only activate when the connection would otherwise be silent.
The Default Timers Are Absurd
Here's where theory meets reality: the default keep-alive settings are wildly conservative.
On Linux, the default keep-alive time is 7200 seconds. That's two hours before the first probe. Then 75 seconds between probes, with 9 probes required before declaring death. Do the math: 7200 + 9 × 75 = 7875 seconds, so a dead connection takes roughly two hours and eleven minutes to detect.
These defaults made sense when bandwidth was expensive and the Internet was fragile. They don't make sense for a database connection pool where you need to know within minutes—or seconds—that a backend server has failed.
Three parameters control the behavior:
- Keep-Alive Time: How long to wait before the first probe (default: 7200 seconds)
- Keep-Alive Interval: Time between subsequent probes (default: 75 seconds)
- Keep-Alive Probes: Failed probes before declaring death (default: 9)
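The worst-case detection time follows directly from these three parameters: idle time before the first probe, plus one interval for each failed probe. A quick sketch (the helper name is mine, not a standard API):

```python
def worst_case_detection_seconds(keepalive_time: int, interval: int, probes: int) -> int:
    """Approximate seconds from last activity until a dead peer is declared,
    assuming every probe goes unanswered."""
    return keepalive_time + interval * probes

# Linux defaults: two hours of idle, then nine probes 75 seconds apart.
print(worst_case_detection_seconds(7200, 75, 9))  # 7875 seconds, over two hours

# A tuned configuration: probe after 60s idle, 10s apart, 3 probes.
print(worst_case_detection_seconds(60, 10, 3))    # 90 seconds
```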
Production systems typically override these via socket options (on Linux: TCP_KEEPIDLE, TCP_KEEPINTVL, TCP_KEEPCNT; other platforms use different names). Common configurations probe after 60-300 seconds of inactivity, with 10-30 second intervals between probes. This detects dead connections in minutes rather than hours.
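Setting these options from an application looks like this minimal sketch (Linux option names; the idle/interval/count values are illustrative, not recommendations):

```python
import socket

def enable_keepalive(sock: socket.socket,
                     idle: int = 60, interval: int = 10, count: int = 3) -> None:
    """Enable TCP keep-alive on a socket with tightened timers (Linux)."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)       # seconds idle before first probe
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)  # seconds between probes
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)       # failed probes before declaring death

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock)
print(sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE))  # 60
```

On macOS the idle timer is set with TCP_KEEPALIVE instead of TCP_KEEPIDLE, so portable code typically feature-tests with hasattr before setting each option.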
What Keep-Alive Is Good For
Detecting dead peers: The core use case. A server that loses power, panics, or gets unplugged can't send a FIN or RST. Keep-alive probes discover the absence.
Preventing resource exhaustion: Half-open connections accumulate. On a server handling thousands of connections, even a small percentage of zombies creates pressure on file descriptors, memory, and connection limits.
Keeping NAT mappings alive: NAT devices and stateful firewalls track connections. Let a connection sit idle too long, and the middlebox forgets about it—future packets get dropped even though both endpoints think they're connected. Keep-alive probes generate just enough traffic to keep the mapping fresh.
Validating pooled connections: Before handing a connection from a pool to application code, you want confidence it actually works. Keep-alive provides passive validation—if the connection survived in the pool, it's probably usable.
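A pool can supplement keep-alive with a cheap active check at checkout time. One common technique is a non-blocking peek: if the socket is readable but a peek returns zero bytes, the peer has closed. This is a hypothetical helper, not a guarantee; it cannot see failures the kernel hasn't learned about yet:

```python
import select
import socket

def looks_usable(sock: socket.socket) -> bool:
    """Cheap liveness check before handing a pooled connection to a caller."""
    readable, _, _ = select.select([sock], [], [], 0)  # poll, don't block
    if not readable:
        return True  # nothing pending, no EOF queued: assume healthy
    try:
        data = sock.recv(1, socket.MSG_PEEK)  # peek without consuming
        return len(data) > 0  # buffered data is fine; b"" means peer closed
    except OSError:
        return False  # e.g. connection reset
```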
What Keep-Alive Cannot Tell You
TCP keep-alive confirms one thing: the remote kernel's TCP stack is responding. That's it.
It cannot tell you whether the application is healthy. A database might respond to TCP probes while completely deadlocked, unable to execute queries. A web server might acknowledge keep-alive while its worker processes are stuck in an infinite loop.
This is the fundamental limitation. Keep-alive operates at the transport layer. It sees the network. It doesn't see your application.
For detecting application-level failures—hung processes, exhausted thread pools, deadlocks—you need application-level heartbeats. A database client that periodically runs SELECT 1 validates the entire stack: network connectivity, database availability, query execution, lock acquisition. TCP keep-alive validates none of that.
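An application-level heartbeat can be as small as the sketch below. It uses sqlite3 purely as a self-contained stand-in for a real database client, and the function name is mine; the point is that it exercises query execution, not just the transport:

```python
import sqlite3

def heartbeat(conn: sqlite3.Connection) -> bool:
    """Prove the backend can actually execute a query,
    not merely that its TCP stack answers probes."""
    try:
        cur = conn.execute("SELECT 1")
        return cur.fetchone() == (1,)
    except sqlite3.Error:
        return False

conn = sqlite3.connect(":memory:")  # stand-in for a real pooled connection
print(heartbeat(conn))  # True
```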
The right approach combines both. Use TCP keep-alive as a baseline to catch network failures and dead machines. Use application heartbeats to catch everything else.
The Trade-offs
Network overhead: Probes consume bandwidth. For thousands of idle connections, the traffic adds up. Rarely significant on modern networks, but worth considering in constrained environments.
Battery drain: On mobile devices, frequent probes prevent network interfaces from sleeping. Mobile applications often disable or dramatically reduce keep-alive frequency.
False positives: Aggressive timing risks declaring connections dead during transient network hiccups that would have resolved themselves. A 10-second probe interval with 3 probes means 30 seconds of network disruption kills your connections—even if connectivity returns at second 31.
Configuration complexity: The right settings depend on your failure modes, network characteristics, and tolerance for stale connections. There's no universal answer.
Key Takeaways
TCP keep-alive detects dead connections by probing during idle periods. The operating system handles everything—applications just enable the feature and configure timing.
The defaults are too conservative for most production use. Two hours before the first probe is an eternity when you need to detect failures quickly. Override the timers.
Keep-alive catches network failures and dead machines. It doesn't catch application failures. For comprehensive health checking, combine transport-layer keep-alive with application-layer heartbeats.
Half-open connections are silent resource leaks. They don't announce themselves. They just accumulate until something breaks. Keep-alive is how you find them before they find you.