[TOC]
TCP provides a connection-oriented, reliable byte stream service.
- ARQ (Automatic Repeat Request)
- ACK (Acknowledgment)
- RTT (Round-trip-time estimation)
The TCP header immediately follows the IP header or IPv6 extension header, usually 20 bytes long (without TCP options). With options, the TCP header can be up to 60 bytes. Common options include Maximum Segment Size, Timestamp, Window Scaling, and Selective ACK.
- `Source Port`: Combined with the source address in the IP header to form an endpoint that uniquely identifies the sender.
- `Destination Port`: Combined with the destination address in the IP header to form an endpoint that uniquely identifies the receiver.
- `Sequence Number`: Identifies the position, in the byte stream from the TCP sender to the TCP receiver, of the first byte of data carried in this segment.
- `Acknowledgment Number`: The next sequence number that the sender of this segment expects to receive.
- `Header Length`: The length of the header in 32-bit words. The TCP header is limited to 60 bytes; without options it is 20 bytes.
- `Reserved`
- `CWR`: Congestion Window Reduced; signals that the sender has reduced its sending rate.
- `ECE`: ECN Echo; indicates that the sender has received an earlier congestion notification.
- `URG`: Urgent; makes the `Urgent Pointer` field valid. Rarely used.
- `ACK`: Acknowledgment; makes the `Acknowledgment Number` field valid. Enabled after connection establishment.
- `PSH`: Push.
- `RST`: Reset the connection.
- `SYN`: Synchronize sequence numbers to initiate a connection.
- `FIN`: The sender has finished sending data.
- `Window Size`: Advertises the window size (in bytes, at most 65535); implements flow control.
- `TCP Checksum`: Mandatory; calculated and stored by the sender, verified by the receiver.
- `Urgent Pointer`: Valid only when the `URG` field is set.
- `Options` (variable length):

| Kind | Length | Name | Description & Purpose |
| --- | --- | --- | --- |
| 0 | 1 | EOL | End of option list |
| 1 | 1 | NOP | No operation (used for padding) |
| 2 | 4 | MSS | Maximum Segment Size |
| 3 | 3 | WSOPT | Window scaling factor (left shift) |
| 4 | 2 | SACK-Permitted | Sender supports SACK option |
| 5 | var | SACK | SACK block (received out-of-order) |
| 8 | 10 | TSOPT | Timestamp option |
| 28 | 4 | UTO | User timeout (terminate after idle) |
| 29 | var | TCP-AO | Authentication option (various algos) |
| 253 | var | Experimental | Reserved for experimental use |
| 254 | var | Experimental | Reserved for experimental use |
TCP demultiplexes incoming segments to connections using a 4-tuple:
- Destination IP address
- Destination port
- Source IP address
- Source port

Together, these four values identify the local and remote endpoints of a connection.
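The 4-tuple lookup can be sketched as a dictionary keyed by the four values. This is only an illustration: `FourTuple`, `ConnectionTable`, and the addresses used are hypothetical names chosen for the example, not part of any real TCP stack.

```python
from typing import Dict, NamedTuple, Optional


class FourTuple(NamedTuple):
    """Connection key; the field order mirrors the list above."""
    dst_ip: str
    dst_port: int
    src_ip: str
    src_port: int


class ConnectionTable:
    """Hypothetical table mapping a 4-tuple to a connection label."""

    def __init__(self) -> None:
        self._table: Dict[FourTuple, str] = {}

    def register(self, key: FourTuple, name: str) -> None:
        self._table[key] = name

    def demux(self, key: FourTuple) -> Optional[str]:
        # None means no connection matches the incoming segment.
        return self._table.get(key)


table = ConnectionTable()
# Two connections to the same server port differ only in the source port.
table.register(FourTuple("10.0.0.1", 80, "10.0.0.2", 50000), "conn-a")
table.register(FourTuple("10.0.0.1", 80, "10.0.0.2", 50001), "conn-b")
```

Because all four values are part of the key, a server can hold many connections on one listening port as long as the remote address or port differs.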
A typical TCP connection establishment and termination works as follows. Usually, the client initiates a three-way handshake. During this process, the client and server exchange their initial sequence numbers (ISNs) in SYN segments. The connection is terminated after both sides have sent a FIN and received the corresponding ACK from the other.

Connection establishment (three-way handshake):
- The client sends a SYN segment (with the SYN flag set), specifying the port and its initial sequence number ISN(c).
- The server replies with its own SYN segment carrying its initial sequence number ISN(s) and ACK = ISN(c)+1.
- The client sends an ACK segment with ACK = ISN(s)+1.

Connection termination:
- The client sends a FIN segment with sequence number K, together with an ACK that acknowledges the last data (L) sent by the peer.
- The server replies with ACK = K+1, indicating that it has received the client's FIN segment.
- The server sends its own FIN with sequence number L, indicating that it has finished sending data.
- The client sends an ACK (ACK = L+1) to acknowledge the server's FIN.
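The sequence and acknowledgment numbers in the steps above can be traced with a small sketch. It assumes the client is the active closer and that no data is exchanged during the close; the function names and the tuple layout are hypothetical.

```python
from typing import List, Optional, Tuple

SEQ_MOD = 2 ** 32  # TCP sequence numbers are 32 bits wide and wrap around

# (sender, flags, sequence number, acknowledgment number or None)
Segment = Tuple[str, str, int, Optional[int]]


def three_way_handshake(isn_c: int, isn_s: int) -> List[Segment]:
    """Segments of a normal open; a SYN consumes one sequence number."""
    return [
        ("client", "SYN", isn_c, None),
        ("server", "SYN+ACK", isn_s, (isn_c + 1) % SEQ_MOD),
        ("client", "ACK", (isn_c + 1) % SEQ_MOD, (isn_s + 1) % SEQ_MOD),
    ]


def normal_close(k: int, l: int) -> List[Segment]:
    """Segments of a normal close; a FIN also consumes one sequence number."""
    return [
        ("client", "FIN+ACK", k, l),
        ("server", "ACK", l, (k + 1) % SEQ_MOD),
        ("server", "FIN+ACK", l, (k + 1) % SEQ_MOD),
        ("client", "ACK", (k + 1) % SEQ_MOD, (l + 1) % SEQ_MOD),
    ]
```

The `+ 1` on every acknowledgment reflects that SYN and FIN each occupy one position in the sequence space even though they carry no data.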
In a TCP half-close operation, one direction of the connection is closed while the other direction continues to transmit data until it is also closed. Applications rarely use it.
Simultaneous open: both sides send a SYN before receiving one from the other. Compared to normal connection establishment, one more segment is needed. The SYN bit stays set in each side's segments until an ACK for it is received.
Simultaneous close: both sides send a FIN before receiving one from the other. It is similar to a normal close, but the order of the segments is interleaved.
If one side closes or terminates the connection without notifying the other, the TCP connection is considered to be in a half-open state.
The TIME_WAIT state, also known as 2MSL wait state or double wait, means TCP will wait for twice the Maximum Segment Lifetime (MSL).
There are two reasons for the TIME_WAIT state:
- To reliably implement full-duplex connection termination.
- To allow old duplicate segments to disappear from the network.
Factors affecting the 2MSL wait state:
- When TCP performs an active close and sends the final ACK, the connection must remain in TIME_WAIT for twice the MSL. This allows TCP to retransmit the final ACK if needed. Retransmitting the final ACK is not because TCP retransmits ACKs (they do not consume sequence numbers and are not retransmitted by TCP), but because the peer may retransmit its FIN (which does consume a sequence number).
- While in the wait state, both sides treat the connection (client IP, client port, server IP, server port) as unusable. The connection can be reused only after the 2MSL wait ends, or when a new connection uses an initial sequence number greater than those used by the previous instance, or when timestamps are used to distinguish segments from the previous connection.
There are mechanisms to bypass the 2MSL limit, such as the SO_REUSEADDR socket option...
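A minimal sketch of setting `SO_REUSEADDR` through the Berkeley sockets API, here using Python's standard `socket` module. The helper name `make_reusable_socket` is hypothetical, and the exact reuse semantics of the option differ between operating systems.

```python
import socket


def make_reusable_socket() -> socket.socket:
    """Create a TCP socket with SO_REUSEADDR enabled (not yet bound)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # The option must be set before bind() to have any effect on the bind.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    return sock
```

A server would typically call `bind()` and `listen()` on the returned socket, which lets it restart on the same port while earlier connections are still in TIME_WAIT.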
Quiet Time: [RFC0793] states that after a crash or restart, TCP should wait for one MSL before creating a new connection; this period is called quiet time.
How to handle too many TIME_WAIT states?
In the FIN_WAIT_2 state, one TCP endpoint has sent a FIN and received an ACK. Unless a half-close occurs, this endpoint will wait for the peer application to recognize the end-of-file notification and close its end, causing a FIN to be sent. Only after this close (and the FIN is received) does the closing TCP move from FIN_WAIT_2 to TIME_WAIT. This means one side can remain in this state indefinitely. The other side may also remain in CLOSE_WAIT indefinitely until the application decides to close.
Each active TCP connection maintains the following windows:
- Send window structure
- Receive window structure
- `Advertised window`: The window advertised by the receiver.
- `Usable window`: The amount of data that can be sent immediately; value = $SND.UNA + SND.WND - SND.NXT$
The TCP sender's sliding window structure records acknowledged, in-flight, and unsent data sequence numbers. The size of the advertised window is controlled by the window size field in the ACK from the receiver.
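The usable window formula $SND.UNA + SND.WND - SND.NXT$ can be evaluated as in the sketch below. The function name is hypothetical; the clamp to zero (for the case where the receiver has shrunk its window below the amount already in flight) and the modulo arithmetic for sequence wraparound are assumptions added for the example.

```python
SEQ_MOD = 2 ** 32  # 32-bit sequence space


def usable_window(snd_una: int, snd_wnd: int, snd_nxt: int) -> int:
    """Bytes that may be sent right now: SND.UNA + SND.WND - SND.NXT."""
    # Data sent but not yet acknowledged, computed modulo 2**32.
    in_flight = (snd_nxt - snd_una) % SEQ_MOD
    # Never report a negative window.
    return max(snd_wnd - in_flight, 0)
```

For example, with `snd_una=1000`, `snd_wnd=4000`, and `snd_nxt=2500`, 1500 bytes are in flight and 2500 more may be sent.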
Window boundaries move:
- `Close`: The left boundary moves right as sent data is acknowledged, reducing the window.
- `Open`: The right boundary moves right as acknowledged data is processed and the receiver's buffer increases, enlarging the window.
- `Shrink`: The right boundary moves left.
The TCP receiver's sliding window structure helps track the next expected data sequence number. If received data is within the window, it is stored; otherwise, it is discarded.
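The receiver's store-or-discard decision can be sketched as a range check. This simplified version tests only the first byte of a segment; `rcv_nxt` (next expected sequence number) and `rcv_wnd` (receive window size) are names assumed for the example.

```python
SEQ_MOD = 2 ** 32  # 32-bit sequence space


def in_receive_window(seq: int, rcv_nxt: int, rcv_wnd: int) -> bool:
    """True if seq falls in [rcv_nxt, rcv_nxt + rcv_wnd), modulo 2**32."""
    return (seq - rcv_nxt) % SEQ_MOD < rcv_wnd
```

A sequence number just below `rcv_nxt` wraps to a very large offset, so old duplicates fall outside the window and are discarded.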
TODO
Congestion window (cwnd) reflects the network's transmission capacity.
Flight size is the amount of data sent but not yet acknowledged.
Optimal window size is the amount of data that can be stored in the network, close to the bandwidth-delay product (BDP), calculated as RTT times the minimum path rate (the bottleneck between sender and receiver).
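As a worked example of the bandwidth-delay product, the sketch below multiplies RTT by the bottleneck rate. The function name and the sample figures (100 ms RTT, 100 Mbit/s bottleneck) are illustrative assumptions.

```python
def bdp_bytes(rtt_seconds: float, bottleneck_bits_per_second: float) -> float:
    """Bandwidth-delay product in bytes: RTT times the bottleneck rate."""
    return rtt_seconds * bottleneck_bits_per_second / 8


# 100 ms RTT over a 100 Mbit/s bottleneck.
example_bdp = bdp_bytes(0.1, 100e6)
```

Here the BDP is about 1.25 MB, far above the 65535-byte maximum of the plain Window Size field, which is why the Window Scaling option matters on such paths.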
The sender's actual usable window is $W = min(cwnd, awnd)$, where:
- $awnd$: Receiver's advertised window
- $cwnd$: Congestion window
Flow control: forces the sender to slow down when the receiver can't keep up. There are two methods:
- Rate-based flow control: Specifies a rate for the sender, ensuring data never exceeds this rate; suitable for streaming applications.
- Window-based flow control: Uses a sliding window whose size varies over time; the receiver uses a window advertisement (window update) to tell the sender how large a window to use.
TCP congestion control operates based on the principle of packet conservation. Due to limited transmission capacity, packets (
- Packet loss analysis
- Delay measurement
- Explicit Congestion Notification (ECN)
Nagle Algorithm: When a TCP connection has data in flight (sent but not yet acknowledged), small segments (smaller than SMSS) cannot be sent until all in-flight data is acknowledged. In the meantime, TCP collects the small pieces of data and sends them in a single segment once the ACK arrives. This forces TCP into stop-and-wait behavior: it can send more only after all in-flight data is acknowledged. The algorithm is self-clocking: the faster ACKs return, the faster data is sent, so RTT controls the sending rate. On high-latency WANs, where reducing the number of small packets matters more, the algorithm lowers the number of segments sent per unit time.
Example:
With Nagle enabled, at most one small packet is in flight at any time, which reduces the number of small packets but increases latency.
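The send decision made by the Nagle algorithm can be sketched as a single predicate. This is a simplification under stated assumptions: the function and parameter names are hypothetical, and real implementations also consider window limits and flags such as `PSH`.

```python
def nagle_can_send(segment_len: int, smss: int, bytes_in_flight: int) -> bool:
    """Decide whether a segment may be sent now under the Nagle rule."""
    if segment_len >= smss:
        # Full-sized segments are never held back by Nagle.
        return True
    # A small segment may go out only when nothing is unacknowledged.
    return bytes_in_flight == 0
```

With `smss=1460`, a 10-byte write is sent immediately on an idle connection but is held and coalesced while earlier data is still unacknowledged.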
Combining delayed ACK and Nagle can cause a temporary deadlock in which each side waits for the other: the sender holds small data until it receives an ACK, while the receiver delays that ACK. The connection sits idle in the meantime, which degrades performance. The deadlock is not permanent; it is resolved when the delayed ACK timer expires.
For latency-sensitive applications, Nagle should be disabled:
- Berkeley sockets: Set the `TCP_NODELAY` option.
- Windows: Set the registry value `HKLM\SOFTWARE\Microsoft\MsMQ\parameters\TCPNoDelay` to 1.
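With the Berkeley sockets API, disabling Nagle looks roughly like the sketch below, written with Python's standard `socket` module. The helper name `make_low_latency_socket` is hypothetical.

```python
import socket


def make_low_latency_socket() -> socket.socket:
    """Create a TCP socket with the Nagle algorithm disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_NODELAY = 1 sends small segments without waiting for ACKs.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock
```

The option applies per socket, so only the latency-sensitive connections of an application need to set it.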
When a new TCP connection is established or a retransmission timeout (RTO) occurs, slow start is performed. This allows TCP to reach a suitable $cwnd$ value.
Initial Window (IW): TCP starts slow start by sending a certain number of segments (after SYN exchange). The formula is:
Example:
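As a sketch of an initial window computation, the function below assumes the rule given in RFC 5681, which scales IW by the sender maximum segment size. Treat the choice of RFC 5681 as an assumption about which rule is meant here; the function name is hypothetical.

```python
def initial_window(smss: int) -> int:
    """Initial window in bytes, following the RFC 5681 rule (assumed)."""
    if smss > 2190:
        return 2 * smss  # at most 2 segments
    if smss > 1095:
        return 3 * smss  # at most 3 segments
    return 4 * smss      # at most 4 segments
```

With a common Ethernet-derived SMSS of 1460 bytes this gives an initial window of 4380 bytes, that is, three segments.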
Classic slow start operation: Without delayed ACK, each good ACK allows the sender to send two new packets (left). The sender window grows exponentially over time (right, upper curve). With delayed ACK (one ACK per two packets), the growth is still exponential but slower (right, lower curve).
Congestion avoidance algorithm: To obtain more transmission resources without affecting other connections, once the slow start threshold is set, TCP enters congestion avoidance, and
Analyzing the formula, suppose
Congestion avoidance operation: Without delayed ACK, each good ACK allows the sender to send
TODO
TODO
- $cwnd < ssthresh$: Use slow start
- $cwnd > ssthresh$: Use congestion avoidance
- $cwnd = ssthresh$: Either algorithm
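The choice between the two algorithms can be sketched as a per-ACK update. The increments are assumptions taken from RFC 5681: one SMSS per good ACK in slow start, and roughly $SMSS \cdot SMSS / cwnd$ per good ACK in congestion avoidance. The function names, the starting window of one SMSS, and the sample values are hypothetical, and delayed ACKs and loss are ignored.

```python
from typing import List


def on_good_ack(cwnd: int, ssthresh: int, smss: int) -> int:
    """Return the new cwnd (bytes) after one good ACK."""
    if cwnd < ssthresh:
        return cwnd + smss  # slow start: exponential growth per RTT
    # Congestion avoidance (also chosen here when cwnd == ssthresh):
    # about one SMSS of growth per RTT.
    return cwnd + max(smss * smss // cwnd, 1)


def simulate(rtts: int, smss: int = 1000, ssthresh: int = 8000) -> List[int]:
    """cwnd at the start of each RTT, one ACK per segment in flight."""
    cwnd = smss
    history = [cwnd]
    for _ in range(rtts):
        for _ in range(cwnd // smss):  # ACK count fixed at round start
            cwnd = on_good_ack(cwnd, ssthresh, smss)
        history.append(cwnd)
    return history
```

In `simulate(4)` the window doubles each RTT until it reaches `ssthresh`, then grows by a little under one SMSS in the following RTT.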
Regardless of algorithm, the slow start threshold (ssthresh) is updated as: $ssthresh = max(flight size / 2, 2*SMSS)$
At the start of a TCP connection, slow start is used ($cwnd < ssthresh$).
When three duplicate ACKs (or other fast retransmit signals) are received:
- ssthresh is updated to $ssthresh = max(flight size / 2, 2*SMSS)$
- Fast retransmit is triggered, and $cwnd$ is set to $(ssthresh + 3 * SMSS)$
- For each duplicate ACK, $cwnd$ temporarily increases by $1 SMSS$
- When a good ACK is received, $cwnd$ is reset to ssthresh
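The four steps above map onto a small state holder, sketched below. The class and method names are hypothetical, all values are in bytes, and the retransmission itself is left out.

```python
class FastRecovery:
    """Tracks cwnd and ssthresh through one fast recovery episode."""

    def __init__(self, cwnd: int, smss: int) -> None:
        self.cwnd = cwnd
        self.smss = smss
        self.ssthresh = cwnd

    def on_third_dup_ack(self, flight_size: int) -> None:
        # ssthresh = max(flight size / 2, 2 * SMSS)
        self.ssthresh = max(flight_size // 2, 2 * self.smss)
        # cwnd = ssthresh + 3 * SMSS, then the lost segment is retransmitted.
        self.cwnd = self.ssthresh + 3 * self.smss

    def on_extra_dup_ack(self) -> None:
        # Each further duplicate ACK temporarily inflates cwnd by one SMSS.
        self.cwnd += self.smss

    def on_good_ack(self) -> None:
        # A good ACK ends recovery and deflates cwnd back to ssthresh.
        self.cwnd = self.ssthresh
```

Starting from `cwnd=10000` with `smss=1000` and 10000 bytes in flight, the third duplicate ACK yields `ssthresh=5000` and `cwnd=8000`, and the next good ACK brings `cwnd` down to 5000.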
Slow start is always performed in the following cases:
- New connection establishment or retransmission timeout
- When the sender is idle for a long time, or if $cwnd$ may not accurately reflect current congestion
Fast retransmit: The TCP sender retransmits a possibly lost segment after observing at least dupthresh duplicate ACKs, without waiting for the retransmission timer to expire. New data may also be sent. Duplicate ACKs indicate loss and trigger congestion control. Without SACK, at most one segment can be retransmitted per RTT before a valid ACK is received. With SACK, ACKs can carry extra info, allowing multiple holes to be filled per RTT.
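Duplicate ACK counting for fast retransmit can be sketched as follows. This is a simplification: the class name is hypothetical, `dupthresh` defaults to 3 as an assumption, and a real TCP also checks that the ACK carries no data and does not change the advertised window before counting it as a duplicate.

```python
from typing import Optional


class DupAckDetector:
    """Signals a fast retransmit once dupthresh duplicate ACKs are seen."""

    def __init__(self, dupthresh: int = 3) -> None:
        self.dupthresh = dupthresh
        self.last_ack: Optional[int] = None
        self.dup_count = 0

    def on_ack(self, ack: int) -> Optional[int]:
        """Return the sequence number to retransmit, or None."""
        if ack == self.last_ack:
            self.dup_count += 1
            if self.dup_count == self.dupthresh:
                # The segment starting at `ack` is presumed lost.
                return ack
        else:
            # A new cumulative ACK resets the duplicate counter.
            self.last_ack = ack
            self.dup_count = 0
        return None
```

The first ACK for a given number is not a duplicate, so the retransmission fires on the fourth identical ACK overall.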
If no ACK is received for a timed segment within the RTO, TCP triggers a timeout retransmission.
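How an RTO value is derived from RTT samples can be sketched with the standard estimator from RFC 6298. Using that RFC is an assumption, as are the class name, the clock granularity, and the 1-second floor. Retransmission backoff is reduced to a simple doubling.

```python
from typing import Optional


class RtoEstimator:
    """Smoothed RTT / RTT variance estimator in the style of RFC 6298."""

    ALPHA = 1 / 8  # gain for srtt
    BETA = 1 / 4   # gain for rttvar
    K = 4

    def __init__(self, granularity: float = 0.001, min_rto: float = 1.0) -> None:
        self.granularity = granularity
        self.min_rto = min_rto
        self.srtt: Optional[float] = None
        self.rttvar: Optional[float] = None
        self.rto = 1.0  # initial RTO before any measurement

    def on_measurement(self, r: float) -> float:
        """Feed one RTT sample (seconds) and return the updated RTO."""
        if self.srtt is None or self.rttvar is None:
            self.srtt = r
            self.rttvar = r / 2
        else:
            # rttvar is updated first, using the previous srtt.
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - r)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * r
        rto = self.srtt + max(self.granularity, self.K * self.rttvar)
        self.rto = max(rto, self.min_rto)
        return self.rto

    def on_timeout(self) -> float:
        """Exponential backoff: double the RTO after a timeout."""
        self.rto *= 2
        return self.rto
```

With `min_rto=0.0`, a first sample of 100 ms gives an RTO of 300 ms; a second identical sample shrinks the variance term and lowers the RTO to 250 ms.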
SACK: Describes data the receiver holds beyond what the cumulative ACK field in the TCP header can express, using SACK blocks carried in a TCP option.
Hole: The gap between the ACK number and other data in the receiver's buffer.
Out-of-order data: Data with a sequence number higher than the hole, not contiguous with previous data.
The receiver can generate SACKs after receiving the SACK-permitted option during connection setup. Whenever out-of-order data exists in the buffer, the receiver generates SACKs. Out-of-order data may be caused by loss or by new data arriving before old data.
Selective retransmission/repeat: The sender uses received SACK blocks to retransmit lost data efficiently.
When the sender receives SACK or duplicate ACKs, it can send new or retransmit old data. SACK info provides the receiver's data ranges, so the sender can infer which data to retransmit. The simplest method is to fill the receiver's holes first, then send new data [RFC3517].
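Inferring holes from SACK information can be sketched as below. The sketch assumes each SACK block is a `(left_edge, right_edge)` pair where the right edge is the sequence number just past the last byte of the block, and it ignores sequence number wraparound. The function name is hypothetical.

```python
from typing import List, Tuple

Block = Tuple[int, int]  # (left edge, right edge), right edge exclusive


def find_holes(cum_ack: int, sack_blocks: List[Block]) -> List[Block]:
    """Return the byte ranges the receiver is still missing."""
    holes: List[Block] = []
    edge = cum_ack  # everything below this is known to be received
    for left, right in sorted(sack_blocks):
        if left > edge:
            holes.append((edge, left))
        edge = max(edge, right)
    return holes
```

A sender following the fill-holes-first approach would retransmit the returned ranges in order and only then send new data.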
Spurious retransmission: Retransmission may occur even without data loss, mainly due to spurious timeout, packet reordering, duplication, ACK loss, etc.
Solutions:
- Detection algorithms: determine whether a timeout is spurious
- Response algorithms: undo or mitigate the effects of a spurious timeout
Segment 1401 is deliberately dropped twice, causing the sender to trigger a timeout retransmission. Only when an ACK advances the send window are srtt, rttvar, and RTO updated. ACKs with a star (*) contain SACK info.
Packet reordering in IP networks occurs because IP does not guarantee in-order delivery. This is sometimes beneficial (at least for IP), as IP can choose a different path (e.g., a faster one) without worrying about new packets arriving before old ones. This causes the receive order to differ from the send order (other causes exist as well).
The keepalive mechanism probes the peer without affecting the data stream. It is driven by a keepalive timer: when the timer fires, one side sends a keepalive probe segment, and the other side responds with an ACK. A probe finds the peer in one of four states:
- The peer is still working and reachable.
- The peer has crashed and is either down or in the process of restarting.
- The client has crashed and has since restarted.
- The peer is working but unreachable for some reason.
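Enabling keepalive on a socket can be sketched with Python's standard `socket` module. `SO_KEEPALIVE` is widely available; the `TCP_KEEPIDLE`, `TCP_KEEPINTVL`, and `TCP_KEEPCNT` tuning options are platform-dependent, so the sketch applies them only where they are defined. The helper name and the default values (2 hours idle, 75 s between probes, 9 probes) are assumptions for illustration.

```python
import socket


def enable_keepalive(sock: socket.socket, idle: int = 7200,
                     interval: int = 75, count: int = 9) -> None:
    """Turn on TCP keepalive and, where supported, tune its timers."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    tuning = (("TCP_KEEPIDLE", idle),       # idle seconds before first probe
              ("TCP_KEEPINTVL", interval),  # seconds between probes
              ("TCP_KEEPCNT", count))       # unanswered probes before giving up
    for name, value in tuning:
        opt = getattr(socket, name, None)
        if opt is None:
            continue  # option not exposed on this platform
        try:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
        except OSError:
            pass  # option defined but rejected by the OS; keep defaults
```

If every probe goes unanswered, the connection is eventually reported as failed to the application on its next socket operation.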
- TCP segment forgery: By choosing the right sequence number, IP address, and port, anyone can forge a TCP segment to disrupt a normal connection [RFC5961].
TIME_WAIT Assassination (TWA): If segments, especially reset segments, are received during TIME_WAIT, the connection can be disrupted.
- Low-rate DoS attack: The attacker sends large amounts of data to keep the victim in retransmission timeout. The attacker can predict when the victim will retransmit and send more data at each retransmission. The victim always senses congestion and, per Karn's algorithm, reduces its rate and backs off, making normal bandwidth use impossible. The solution is to randomize RTO so the attacker cannot predict retransmission times.
- Slowing/speeding up TCP: The attacker can make RTT estimates too large, so the victim delays retransmission after loss. Conversely, the attacker can forge ACKs before data arrives, making the victim believe RTT is much smaller, causing excessive sending and wasted bandwidth.
- SYN flood - TCP denial attack: Malicious clients generate many SYNs (with spoofed source IPs) to a server. The server allocates resources for each half-open connection, and after exhausting memory, refuses new legitimate connections.
- Forged ICMP PTB attack: A forged ICMP PTB message with a very small MTU forces the victim to use tiny packets, greatly reducing performance.
- Sequence number attack: Disrupts or hijacks existing TCP connections, usually by desynchronizing the endpoints so they use incorrect sequence numbers.
- Spoofing attack: The attacker crafts a reset segment to disrupt or alter a TCP connection. If the 4-tuple and checksum are correct and the sequence number is in range, either endpoint can be forced to fail.
- SYN cookies: To solve SYN flood, most connection info is encoded in the SYN+ACK sequence number. The host does not allocate resources for incoming connections until the SYN+ACK is acknowledged and the initial sequence number is returned. All important parameters can be recovered, and the connection is set to ESTABLISHED.
- Client multi-"SYN cookies" technique: Exploits known timer flaws so all necessary connection state can be offloaded to the victim, exhausting its resources with minimal attacker effort.
- ACK splitting attack: Splits the original ACK range into multiple ACKs. Since TCP congestion control is based on ACK arrival (not the ACK field), the sender's $cwnd$ grows faster than normal.
- Duplicate ACK spoofing: In fast recovery, each duplicate ACK increases $cwnd$. This attack generates extra duplicate ACKs, causing $cwnd$ to grow faster than normal.
- Optimistic ACK attack: ACKs are generated for segments that have not yet arrived. Since TCP congestion control is based on end-to-end RTT, early ACKs make the sender believe RTT is smaller, causing it to send faster than normal.
- SSH has an application-layer keepalive mechanism (server/client keepalive messages), which differs from TCP keepalive in that it is sent over an encrypted channel and contains data. TCP keepalive contains no user data and is only minimally encrypted.
TODO
[1] Kevin R. Fall, W. Richard Stevens. TCP/IP Illustrated, 3rd Edition

[2] https://github.com/google/bbr/tree/v2alpha

















