TCP doesn't trust the network. It can't afford to.
Every packet sent carries an implicit question: Did you get this? The receiver answers with acknowledgments. When those answers stop coming—or come back wrong—TCP assumes the worst and sends the data again.
This paranoia is what makes TCP reliable. Unlike UDP, which fires packets into the void and moves on, TCP watches every byte until it's confirmed received. The cost is latency and bandwidth. The benefit is guaranteed delivery. For anything that matters—financial transactions, file transfers, remote sessions—TCP's suspicion is worth the overhead.
How TCP Knows Something Went Wrong
TCP uses sequence numbers and acknowledgments to track every byte in transit. The sender numbers each segment. The receiver responds with an ACK indicating the next sequence number it expects. This creates a feedback loop: send, confirm, send more.
When the feedback breaks down, TCP notices. There are two ways it detects loss:
Timeout. The sender waits for an ACK. If none arrives within a calculated window, TCP assumes the segment was lost and retransmits.
Duplicate ACKs. The receiver gets segments out of order—say, segments 1, 2, 3, then 5. Segment 4 is missing. The receiver can't acknowledge segment 5 because ACKs are cumulative (acknowledging up to a sequence number). So it sends another ACK for segment 3. And another when segment 6 arrives. Each out-of-order segment triggers a duplicate ACK.
TCP doesn't wait for the network to tell it something went wrong. It infers loss from patterns—silence where there should be acknowledgment, repetition where there should be progress.
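To make that cumulative-ACK behavior concrete, here is a minimal sketch in Python of a receiver that keeps re-acknowledging the last in-order point when a gap appears. The class and method names are illustrative, not part of any real TCP stack, and segments are modeled as plain integers rather than byte ranges:

```python
class Receiver:
    """Toy model of TCP's cumulative acknowledgment (not a real TCP stack)."""

    def __init__(self):
        self.next_expected = 1      # next segment number we want
        self.out_of_order = set()   # segments received beyond the gap

    def on_segment(self, seq):
        """Return the ACK value sent in response to segment `seq`."""
        if seq == self.next_expected:
            self.next_expected += 1
            # A gap may have just been filled; slide past any buffered segments.
            while self.next_expected in self.out_of_order:
                self.out_of_order.remove(self.next_expected)
                self.next_expected += 1
        elif seq > self.next_expected:
            self.out_of_order.add(seq)   # buffer it, but the ACK can't advance
        return self.next_expected        # cumulative ACK: "send me this next"


r = Receiver()
for seq in [1, 2, 3, 5, 6, 4]:
    print(f"got {seq} -> ACK {r.on_segment(seq)}")
# got 1 -> ACK 2, got 2 -> ACK 3, got 3 -> ACK 4
# got 5 -> ACK 4, got 6 -> ACK 4   (duplicate ACKs while segment 4 is missing)
# got 4 -> ACK 7                   (the gap fills and the ACK jumps forward)
```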
The Retransmission Timeout
The Retransmission Timeout (RTO) is how long TCP waits before deciding a segment was lost. Getting this value right is critical.
Too short: TCP retransmits unnecessarily, wasting bandwidth and potentially making congestion worse.
Too long: Applications wait unacceptably long when packets actually are lost.
TCP calculates RTO dynamically from measured round-trip time (RTT)—the time for a segment to reach its destination and for the ACK to return. But RTT varies constantly. Congestion, routing changes, processing delays—all affect it.
To handle this variance, TCP maintains a smoothed RTT (SRTT) that averages recent measurements, plus an RTT variance (RTTVAR) that tracks how much the measurements fluctuate. The RTO formula from RFC 6298:

RTO = SRTT + max(G, K × RTTVAR)

where K = 4 and G is the clock granularity of the retransmission timer.
This is deliberately conservative. The timeout is typically several times longer than average RTT to avoid mistaking normal delay variation for loss.
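Here is a minimal sketch of that calculation in Python, using the constants RFC 6298 specifies (alpha = 1/8, beta = 1/4, K = 4) but omitting the clock-granularity term and Karn's algorithm. The class name and the min_rto parameter are illustrative, not taken from any real implementation:

```python
class RtoEstimator:
    """Smoothed RTT / RTO per RFC 6298 (simplified sketch)."""

    ALPHA = 1 / 8   # weight of a new sample in SRTT
    BETA = 1 / 4    # weight of a new sample in RTTVAR
    K = 4           # variance multiplier in the RTO formula

    def __init__(self, min_rto=1.0):
        # RFC 6298 recommends a 1-second floor; Linux effectively uses ~200 ms.
        self.min_rto = min_rto
        self.srtt = None
        self.rttvar = None
        self.rto = 1.0              # initial RTO before any measurement

    def on_rtt_sample(self, r):
        """Update SRTT, RTTVAR, and RTO from one round-trip measurement `r` (seconds)."""
        if self.srtt is None:       # first measurement
            self.srtt = r
            self.rttvar = r / 2
        else:
            # RTTVAR is updated with the old SRTT, then SRTT itself is updated.
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - r)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * r
        self.rto = max(self.min_rto, self.srtt + self.K * self.rttvar)
        return self.rto


est = RtoEstimator(min_rto=0.2)     # Linux-like floor for the demo
for sample in [0.050, 0.055, 0.048, 0.120, 0.052]:   # RTT samples in seconds
    print(f"RTT {sample*1000:.0f} ms -> RTO {est.on_rtt_sample(sample)*1000:.0f} ms")
```

With 50 ms samples the floor dominates and the computed RTO sits around 200 ms, which matches the 50 ms RTT / 200 ms RTO example used in the next section.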
When a timeout fires, TCP doesn't just retransmit—it doubles the RTO for subsequent attempts. This exponential backoff reduces load on a potentially congested network. If several retransmissions fail, TCP eventually gives up and reports an error to the application.
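A sketch of that timeout loop in Python; send_segment and wait_for_ack are placeholder hooks standing in for real socket I/O, and the attempt limit is an arbitrary illustration (the closest Linux analog is the net.ipv4.tcp_retries2 setting):

```python
def send_with_backoff(send_segment, wait_for_ack, initial_rto, max_attempts=6):
    """Transmit a segment, doubling the RTO after each timeout, until ACKed or exhausted.

    send_segment() puts the segment on the wire; wait_for_ack(timeout) returns True
    if an acknowledgment arrives within `timeout` seconds. Both are placeholders.
    """
    rto = initial_rto
    for attempt in range(max_attempts):
        send_segment()
        if wait_for_ack(timeout=rto):
            return True                  # acknowledged; done
        rto = min(rto * 2, 60.0)         # exponential backoff; RFC 6298 allows a cap of 60 s or more
    return False                         # give up and surface the error to the application
```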
Fast Retransmit: Reading the Room
Waiting for timeouts is slow. On a connection with 50ms RTT but 200ms RTO, timeout-based recovery takes four times longer than necessary. TCP has a faster option.
When the sender receives three duplicate ACKs for the same sequence number—four identical ACKs total—it doesn't wait for the timeout. It retransmits immediately.
Why three? A single duplicate ACK might just mean packets arrived out of order. Two could still be reordering. But three duplicate ACKs—the receiver asking for the same thing over and over—is TCP's signal that something is genuinely missing.
This "triple duplicate ACK" rule is called fast retransmit. It recovers from loss in roughly one RTT instead of waiting for the full timeout period. For isolated packet loss, it's dramatically faster.
Selective Acknowledgment
Standard TCP acknowledgments are cumulative: "I've received everything up to byte 1000." This works when one segment is lost. But when multiple segments are lost from the same window, the sender has a problem. It knows something is missing but not what specifically.
Selective Acknowledgment (SACK), defined in RFC 2018, fixes this. With SACK enabled, the receiver can report: "I've received bytes 1-1000, and I've also received bytes 2001-3000, but I'm missing 1001-2000."
Now the sender knows exactly which segments to retransmit. Without SACK, it might have to retransmit everything from the first loss onward. With SACK, it retransmits only what's actually missing.
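To show what SACK buys the sender, here is a small Python sketch that takes the cumulative ACK plus the receiver's SACK blocks and works out which byte ranges still need to go again. The function name and the end-exclusive range convention are assumptions for the example, not wire-format details:

```python
def ranges_to_retransmit(cumulative_ack, sack_blocks, highest_sent):
    """Return the byte ranges the receiver is still missing.

    cumulative_ack: everything below this byte has been received in order.
    sack_blocks:    list of (start, end) ranges received out of order (end exclusive).
    highest_sent:   one past the highest byte the sender has transmitted.
    """
    missing = []
    cursor = cumulative_ack
    for start, end in sorted(sack_blocks):
        if start > cursor:
            missing.append((cursor, start))     # gap before this SACK block
        cursor = max(cursor, end)
    if cursor < highest_sent:
        missing.append((cursor, highest_sent))  # tail not yet acknowledged at all
    return missing


# The example from the text: bytes 1-1000 ACKed, 2001-3000 SACKed, 3000 bytes sent.
print(ranges_to_retransmit(1001, [(2001, 3001)], 3001))
# -> [(1001, 2001)]  i.e. only bytes 1001-2000 need to be retransmitted
```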
SACK is negotiated during connection setup. Both endpoints must support it. Modern operating systems enable it by default. It's especially valuable on high-bandwidth, high-latency connections where many segments can be in flight simultaneously.
The Cost of Retransmission
Every retransmission is waste—the network carried that data twice, the application needed it once.
Best case: fast retransmit works, and the cost is one extra RTT of latency.
Worst case: timeout-based retransmission with exponential backoff turns a single lost packet into seconds of delay.
Retransmissions also trigger congestion control. A timeout tells TCP the network is overwhelmed, so TCP slashes its transmission rate—sometimes by 90% or more. It takes time to ramp back up. Fast retransmit is gentler (halving the rate instead of resetting it), but still impacts throughput.
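The difference in severity can be sketched as a Reno-style response, with the congestion window counted in segments; real stacks (Linux defaults to CUBIC) differ in detail, so treat this as an illustration only:

```python
def on_timeout(cwnd, ssthresh):
    """RTO expired: assume serious congestion and restart from slow start."""
    ssthresh = max(cwnd // 2, 2)   # remember half the old window as the new threshold
    cwnd = 1                       # collapse to one segment and probe again
    return cwnd, ssthresh

def on_fast_retransmit(cwnd, ssthresh):
    """Triple duplicate ACK: loss, but data is still flowing, so only halve the window."""
    ssthresh = max(cwnd // 2, 2)
    cwnd = ssthresh                # continue from the halved window (fast recovery)
    return cwnd, ssthresh

print(on_timeout(40, 64))          # (1, 20): a 40-segment window collapses to 1
print(on_fast_retransmit(40, 64))  # (20, 20): the same loss only halves the window
```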
A healthy network has retransmission rates below 1%. Rates above 3-5% indicate problems: congestion, faulty hardware, wireless interference, routing issues. The impact depends on the application. Interactive traffic (SSH, gaming) suffers from increased latency. Bulk transfers suffer from reduced throughput.
Finding the Problem
When retransmission rates climb, look at the pattern.
Constant low-rate retransmissions suggest persistent hardware issues: duplex mismatch, damaged cable, marginal wireless signal.
Bursty retransmissions point to congestion or interference that comes and goes with traffic load.
Packet captures (tcpdump, Wireshark) reveal specifics. Are the same segments retransmitted multiple times? That's severe loss. Are retransmissions scattered across different sequence numbers? That's random loss. Do they correlate with traffic spikes? That's congestion.
System statistics help too. On Linux, netstat -s shows aggregate retransmission counters. Tracking these over time reveals when problems started and whether they correlate with changes—new configurations, traffic pattern shifts, hardware swaps.
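On Linux, the counters netstat -s reports come from /proc/net/snmp; here is a small Python sketch that computes the aggregate retransmission rate from the OutSegs and RetransSegs fields of that file. The counters are cumulative since boot, so in practice you would track the delta between two readings; that bookkeeping is left out here:

```python
def tcp_retransmission_rate(path="/proc/net/snmp"):
    """Return (retransmitted_segments, sent_segments, rate) from the kernel's TCP counters."""
    with open(path) as f:
        lines = [line.split() for line in f if line.startswith("Tcp:")]
    header, values = lines[0], lines[1]            # first Tcp: line is names, second is values
    counters = dict(zip(header[1:], map(int, values[1:])))
    sent = counters["OutSegs"]
    retrans = counters["RetransSegs"]
    rate = retrans / sent if sent else 0.0
    return retrans, sent, rate


retrans, sent, rate = tcp_retransmission_rate()
print(f"{retrans} of {sent} segments retransmitted ({rate:.2%})")
# Below ~1% is normal; sustained rates above 3-5% warrant investigation.
```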
Common culprits: buffer bloat (excessive buffering causing timeouts), congestion without proper queue management, wireless interference, faulty NICs, middleboxes dropping packets unpredictably. In data centers, watch for asymmetric routing where ACKs take different paths than data, potentially triggering spurious retransmissions.