TCP doesn't trust the network. It can't afford to.
Every packet carries an implicit question: Did you get this? The receiver answers with acknowledgments. When those answers stop—or come back wrong—TCP assumes the worst and sends the data again. This paranoia is the price of reliability.
How TCP Knows Something Went Wrong
TCP uses sequence numbers and acknowledgments to track every byte in transit. The sender numbers each segment. The receiver responds with an ACK indicating the next byte it expects. This creates a feedback loop: send, confirm, send more.
When the feedback breaks down, TCP notices. There are two signals:
Timeout. The sender waits for an ACK. If none arrives within a calculated window, TCP assumes the segment was lost and retransmits.
Duplicate ACKs. The receiver gets segments out of order—say, segments 1, 2, 3, then 5. Segment 4 is missing. Because ACKs are cumulative (each one names the next byte the receiver expects), the receiver can't acknowledge segment 5 yet. So it re-sends the ACK asking for segment 4, and sends it again when segment 6 arrives. Each out-of-order segment triggers another duplicate ACK.
TCP doesn't wait for the network to tell it something went wrong. It infers loss from patterns—silence where there should be acknowledgment, repetition where there should be progress.
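A toy model makes the cumulative-ACK behavior concrete. This is a simplified sketch that numbers whole segments instead of bytes, and the function name is illustrative:

```python
# Toy model of a receiver generating cumulative ACKs.
# Simplified: whole segment numbers instead of byte sequence numbers.

def acks_for(arrivals):
    """Return the ACK sent after each arriving segment.

    Each ACK names the next segment the receiver expects, so a gap
    makes the receiver repeat the same ACK (a duplicate ACK).
    """
    received = set()
    next_expected = 1
    acks = []
    for seg in arrivals:
        received.add(seg)
        # Advance past any contiguous run the receiver now holds.
        while next_expected in received:
            next_expected += 1
        acks.append(next_expected)
    return acks

# Segments 1, 2, 3 arrive, then 5 and 6 while 4 is lost:
print(acks_for([1, 2, 3, 5, 6]))  # [2, 3, 4, 4, 4]
```

The last three ACKs are identical: one original ACK for segment 4, then two duplicates, which is exactly the "repetition where there should be progress" the sender watches for.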
The Retransmission Timeout
The Retransmission Timeout (RTO) determines how long TCP waits before deciding a segment was lost.
Too short: TCP retransmits unnecessarily, wasting bandwidth and worsening congestion.
Too long: Applications stall when packets actually are lost.
TCP calculates RTO dynamically from measured round-trip time (RTT)—the time for a segment to reach its destination and for the ACK to return. But RTT varies constantly. Congestion, routing changes, processing delays all affect it.
To handle this variance, TCP maintains a smoothed RTT (SRTT) that averages recent measurements, plus an RTT variance (RTTVAR) that tracks fluctuation. The RTO formula from RFC 6298:

RTO = SRTT + max(G, 4 × RTTVAR)

where G is the clock granularity.
This is deliberately conservative—typically several times longer than average RTT to avoid mistaking normal delay variation for loss.
When a timeout fires, TCP doesn't just retransmit—it doubles the RTO for subsequent attempts. This exponential backoff reduces load on a potentially congested network. If several retransmissions fail, TCP eventually gives up and reports an error to the application.
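The update rules can be sketched as follows. The smoothing constants and the 1-second minimum come from RFC 6298; the class name and the assumed 1 ms clock granularity are illustrative:

```python
# Sketch of the RFC 6298 RTO computation with exponential backoff.

class RtoEstimator:
    ALPHA, BETA, K = 1 / 8, 1 / 4, 4   # smoothing gains and variance weight
    G = 0.001                          # clock granularity (assumed 1 ms)
    MIN_RTO = 1.0                      # RFC 6298 lower bound, in seconds

    def __init__(self):
        self.srtt = None
        self.rttvar = None
        self.rto = 1.0                 # initial RTO before any measurement

    def on_rtt_sample(self, rtt):
        if self.srtt is None:          # first measurement
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:                          # exponentially weighted updates
            self.rttvar = (1 - self.BETA) * self.rttvar \
                + self.BETA * abs(self.srtt - rtt)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt
        self.rto = max(self.MIN_RTO,
                       self.srtt + max(self.G, self.K * self.rttvar))

    def on_timeout(self):
        self.rto *= 2                  # exponential backoff

est = RtoEstimator()
est.on_rtt_sample(0.100)  # 100 ms sample: SRTT = 0.1 s, RTTVAR = 0.05 s
print(est.rto)            # 0.1 + 4 * 0.05 = 0.3, floored to 1.0
est.on_timeout()
print(est.rto)            # 2.0 after one backoff
```

Note how conservative the result is: a 100 ms RTT still yields a 1-second RTO because of the floor, and a single timeout doubles that to 2 seconds.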
Fast Retransmit: Reading the Room
Waiting for timeouts is slow. On a connection with 50ms RTT but 200ms RTO, timeout-based recovery takes four times longer than necessary.
Fast retransmit is the shortcut: when the sender receives three duplicate ACKs for the same sequence number—four identical ACKs in total—it doesn't wait for the timeout. It retransmits the missing segment immediately.
Why three? A single duplicate ACK might just mean packets arrived out of order. Two could still be reordering. But three duplicate ACKs—the receiver asking for the same thing over and over—is TCP's signal that something is genuinely missing.
This "triple duplicate ACK" rule recovers from loss in roughly one RTT instead of waiting for the full timeout period.
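A minimal sketch of the sender-side counting (the helper is hypothetical; real stacks also reset the count when an ACK covers new data):

```python
# Toy fast-retransmit trigger: fire on the third duplicate ACK
# for the same sequence number.

def fast_retransmit_points(acks):
    """Return indices in the ACK stream where fast retransmit fires."""
    last_ack = None
    dup_count = 0
    fires = []
    for i, ack in enumerate(acks):
        if ack == last_ack:
            dup_count += 1
            if dup_count == 3:         # triple duplicate ACK
                fires.append(i)
        else:                          # new data acknowledged: reset
            last_ack, dup_count = ack, 0
    return fires

# Original ACK for segment 4, then three duplicates as later
# segments arrive out of order:
print(fast_retransmit_points([2, 3, 4, 4, 4, 4]))  # [5]
```

The trigger fires at the fourth identical ACK (index 5), matching the triple-duplicate rule: one original plus three duplicates.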
Selective Acknowledgment
Standard TCP acknowledgments are cumulative: "I've received everything up to byte 1000." This works when one segment is lost. But when multiple segments are lost from the same window, the sender knows something is missing but not what specifically.
Selective Acknowledgment (SACK), defined in RFC 2018, fixes this. With SACK enabled, the receiver reports: "I've received bytes 1-1000, and also bytes 2001-3000, but I'm missing 1001-2000."
Now the sender knows exactly which segments to retransmit. Without SACK, it might retransmit everything from the first loss onward. With SACK, it retransmits only what's actually missing.
SACK is negotiated during connection setup—both endpoints must support it. Modern operating systems enable it by default. It's especially valuable on high-bandwidth, high-latency connections where many segments can be in flight simultaneously.
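One way a sender could turn a cumulative ACK plus SACK blocks into the exact ranges to retransmit (a simplified sketch; real TCP uses 32-bit sequence numbers that wrap, which this ignores):

```python
# Sketch: derive retransmission ranges from a cumulative ACK
# and a list of SACK blocks. Ranges are half-open: [start, end).

def missing_ranges(cum_ack, sack_blocks, send_next):
    """Return the byte ranges below send_next not yet acknowledged."""
    holes = []
    cursor = cum_ack                       # everything before this is ACKed
    for start, end in sorted(sack_blocks):
        if start > cursor:                 # gap between ACKed data and block
            holes.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < send_next:                 # tail not yet acknowledged
        holes.append((cursor, send_next))
    return holes

# Receiver holds bytes up to 1000 plus 2001-3000; sender sent up to 3000:
print(missing_ranges(1001, [(2001, 3001)], 3001))  # [(1001, 2001)]
```

With the cumulative ACK alone, the sender would have to assume everything from byte 1001 onward was lost; the SACK block narrows the retransmission to the one missing range.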
The Cost of Retransmission
Every retransmission is waste—the network carried that data twice, the application needed it once.
Best case: fast retransmit works, costing one extra RTT of latency.
Worst case: timeout-based retransmission with exponential backoff turns a single lost packet into seconds of delay.
Retransmissions also trigger congestion control. A timeout tells TCP the network is overwhelmed, so TCP slashes its transmission rate—sometimes by 90% or more. It takes time to ramp back up. Fast retransmit is gentler (halving the rate instead of resetting it), but still impacts throughput.
A healthy network has retransmission rates below 1%. Rates above 3-5% indicate problems: congestion, faulty hardware, wireless interference, routing issues. Interactive traffic (SSH, gaming) suffers from increased latency. Bulk transfers suffer from reduced throughput.
Diagnosing Retransmission Issues
When retransmission rates climb, look at the pattern.
Constant low-rate retransmissions suggest persistent hardware issues: duplex mismatch, damaged cable, marginal wireless signal.
Bursty retransmissions point to congestion or interference that comes and goes with traffic load.
Packet captures (tcpdump, Wireshark) reveal specifics. Are the same segments retransmitted multiple times? That's severe loss. Are retransmissions scattered across different sequence numbers? That's random loss. Do they correlate with traffic spikes? That's congestion.
System statistics help too. On Linux, netstat -s shows aggregate retransmission counters. Tracking these over time reveals when problems started and whether they correlate with changes—new configurations, traffic pattern shifts, hardware swaps.
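The rate itself is a simple ratio of two of those counters. Here is a sketch using the same fields netstat reads; the sample counter values are made up so the snippet runs anywhere, but on Linux the live numbers come from the Tcp lines of /proc/net/snmp:

```python
# Sketch: compute the TCP retransmission rate from the kernel's
# aggregate counters (RetransSegs / OutSegs). The SAMPLE text below
# mimics the /proc/net/snmp Tcp lines with made-up values.

SAMPLE = """\
Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens AttemptFails EstabResets CurrEstab InSegs OutSegs RetransSegs InErrs OutRsts InCsumErrors
Tcp: 1 200 120000 -1 5000 3000 10 20 42 1000000 900000 4500 0 50 0
"""

def retrans_rate(snmp_text):
    """Parse header/value Tcp lines and return RetransSegs / OutSegs."""
    rows = [line.split() for line in snmp_text.splitlines()
            if line.startswith("Tcp:")]
    fields = dict(zip(rows[0][1:], map(int, rows[1][1:])))
    return fields["RetransSegs"] / fields["OutSegs"]

print(f"{retrans_rate(SAMPLE):.2%}")  # 0.50%
```

A single snapshot is cumulative since boot; to get a meaningful rate, take two snapshots and divide the deltas.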
Common culprits: buffer bloat (excessive buffering causing timeouts), congestion without proper queue management, wireless interference, faulty NICs, middleboxes dropping packets unpredictably. In data centers, watch for asymmetric routing where ACKs take different paths than data, potentially triggering spurious retransmissions.