TCP Deepdive

= TCP =

Source: TCP/IP Protocol-Suite, B.Forouzan


 * TCP uses the services of IP, a connectionless protocol, but itself is connection-oriented.
 * TCP uses the services of IP to deliver individual segments to the receiver, but it controls the connection itself.
 * If a segment is lost or corrupted, it is retransmitted. IP is unaware of this retransmission.
 * If a segment arrives out of order, TCP holds it until the missing segments arrive; IP is unaware of this reordering.
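 * The sequence number of a segment is the number of the first data byte it carries.
 * Together with the data length (which TCP derives from the IP datagram length minus the header lengths; the TCP header has no payload-length field), this tells us which bytes each segment carries.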
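
The byte-numbering rule above can be sketched as follows. This is a minimal illustration, not part of the source; the function names are hypothetical and 32-bit sequence number wraparound is ignored.

```python
def byte_range(seq_no, data_len):
    """Return (first, last) byte numbers carried by a segment.

    seq_no is the number of the first data byte; a segment without
    data carries no bytes, so None is returned for it.
    """
    if data_len == 0:
        return None
    return (seq_no, seq_no + data_len - 1)


def next_seq(seq_no, data_len):
    """Sequence number of the byte that follows this segment's data."""
    return seq_no + data_len


print(byte_range(101, 200))  # bytes 101..300
print(next_seq(101, 200))    # the next segment starts at 301
```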

TCP Connection
 * TCP transmits data in full-duplex mode.
 * When two TCPs in two machines are connected, they are able to send segments to each other simultaneously.
 * In TCP, connection-oriented transmission requires three phases:
 * Connection Establishment
 * Data Transfer
 * Connection Termination

Three way handshake

 * Server program tells its TCP that it is ready to accept a connection.
 * This request is called a Passive Open.
 * The client program issues a request for an active open.
 * TCP can now start the three-way handshaking process




 * 1st Packet:
 * SYN segment is for synchronization of sequence numbers.
 * The client chooses a random number as the first sequence number called Initial Sequence Number(ISN) and sends it to the Server.
 * This segment does not contain an Acknowledgment Number.
 * It does not define the window size either; a window size definition makes sense only when a segment includes an Acknowledgment.
 * This can include some options - WSF, MSS, SACK_PERM
 * The SYN segment is a control segment and carries no data; however, it consumes one sequence number.
 * When the data transfer starts, the ISN is incremented by 1.
 * We can say that the SYN segment carries no real data, but we can think of it as containing one imaginary byte.


 * 2nd Packet:
 * The server sends a SYN + ACK segment with two flag bits set: SYN and ACK.
 * This segment has a dual purpose.
 * First, it is a SYN segment for communication in the other direction.
 * The server uses this segment to initialize a sequence number for numbering the bytes sent from the server to the client.
 * The server also acknowledges the receipt of the SYN segment from the client by setting the ACK flag and displaying the next sequence number it expects to receive from the client.
 * Because it contains an acknowledgment, it also needs to define the receive window size, rwnd, to be used by the client.


 * 3rd Packet:
 * The client sends the third segment which is just an ACK segment.
 * It acknowledges the receipt of the second segment with the ACK flag and Acknowledgment Number field.
 * The sequence number in this segment is the client's ISN + 1, because the SYN consumed one sequence number; the ACK segment itself does not consume any sequence numbers.
 * The client must also define the server window size.
 * Third segment usually does not carry data and consumes no sequence numbers.


 * Note:
 * A SYN cannot carry data, but it consumes one Sequence number.
 * A SYN+ACK cannot carry data, but consumes one Sequence number.
 * An ACK, if carrying no data, consumes no sequence number.
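
The sequence and acknowledgment numbers of the three segments can be traced with a small sketch. The function name and the example ISNs (8000 and 15000) are assumptions chosen for illustration; real TCPs pick their ISNs themselves.

```python
def three_way_handshake(client_isn, server_isn):
    """Return (flags, seq, ack) for the three handshake segments.

    A SYN consumes one sequence number, so each side acknowledges
    the other's ISN + 1. The final ACK carries no data and consumes
    no sequence number.
    """
    syn = ("SYN", client_isn, None)                    # no ACK number yet
    syn_ack = ("SYN+ACK", server_isn, client_isn + 1)  # acknowledges the SYN
    ack = ("ACK", client_isn + 1, server_isn + 1)      # acknowledges the SYN+ACK
    return [syn, syn_ack, ack]


for segment in three_way_handshake(8000, 15000):
    print(segment)
```

After the exchange, the first data byte from the client is numbered ISN + 1 (8001 here), which is the "one imaginary byte" consumed by the SYN.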

Simultaneous Open

 * In rare situations, both processes issue an active open.
 * In this case, both TCPs send a SYN, and each answers the other's SYN with a SYN + ACK segment.
 * Only one single connection is established between them.

SYN Flooding Attack

 * TCP handshake is susceptible to SYN flooding attack.
 * This happens when one or more malicious attackers send a large number of SYN segments.
 * The server, assuming that the clients are issuing an active open, allocates the necessary resources and sets timers.
 * The TCP server then sends the SYN+ACK segments to the fake clients, which are lost.
 * When the server waits for the third packet, resources are allocated without being used.
 * If the number of SYN segments is large, the server eventually runs out of resources.
 * It may be unable to accept connection requests from valid clients.
 * The SYN flooding attack belongs to the group of denial-of-service attacks.
 * One strategy is to postpone resource allocation until the server can verify that the connection request is coming from a valid IP address, by using a Cookie.
 * SCTP uses this strategy.

Data Transfer

 * After connection is established, bidirectional data transfer can take place.
 * The client and server can send data and acknowledgments in both directions.
 * Data traveling in the same direction as an acknowledgment are carried on the same segment.
 * The acknowledgment is piggybacked with the data.

Connection Termination
 * Either of the two parties involved in exchanging data (client or server) can close the connection, although it is usually initiated by the client.
 * Most implementations today allow two options for connection termination:
 * Three-way Termination
 * Four-way Termination with a half-close option

Three-Way Termination



 * 1st Packet:
 * The client TCP, after receiving a close command from the client process, sends the FIN segment.
 * A FIN segment can include the last chunk of data sent by the client or it can be just a control segment.
 * If it is only a control segment, it consumes only one sequence number.


 * 2nd Packet:
 * The server TCP, after receiving the FIN, informs its process.
 * It then sends a FIN+ACK to confirm the receipt of the FIN from the client and to announce the closing of the connection in the other direction.
 * This segment can also contain the last chunk of data from the server.
 * If it does not carry data, it consumes only one sequence number.


 * 3rd Packet:
 * The client TCP sends an ACK segment to confirm the receipt of the FIN from the TCP server.
 * This segment contains the acknowledgment number, which is one plus the sequence number received in the FIN segment from the server.
 * This segment cannot carry data and consumes no sequence numbers.


 * Note:
 * The FIN segment consumes one sequence number if it does not carry data.
 * The FIN + ACK segment consumes one sequence number if it does not carry data.

Half-Close

 * In TCP, one end can stop sending data while still receiving data. This is called a Half-Close.
 * Either the server or the client can issue a half-close request.
 * It can occur when the server needs all the data before processing can begin.
 * An example is sorting.
 * When the client sends data to the server to be sorted, the server needs to receive all the data before sorting can start.
 * This means the client, after sending all data, can close the connection in the client-to-server direction.
 * However, the server-to-client direction must remain open to return the sorted data.
 * The server, after receiving the data, still needs time for sorting; its outbound direction must remain open.




 * The data transfer from the client to the server stops.
 * The client half-closes the connection by sending a FIN segment.
 * The server accepts the half-close by sending the ACK segment.
 * The server, however, can still send data.
 * When the server has sent all of the processed data, it sends a FIN segment, which is acknowledged by an ACK from the client.
 * After half closing the connection, data can travel from server to client and acknowledgments can travel from client to server.
 * The client cannot send any more data to the server.

Connection Reset
 * TCP at any end may:
 * Deny a connection request
 * Abort an existing connection
 * Terminate an idle connection
 * All of these are done with the RST flag.

Maximum Segment Lifetime (MSL)

 * The TCP standard defines MSL as being a value of 120 seconds (2 minutes).
 * In modern networks TCP allows implementations to choose a lower value.
 * The common value for MSL is between 30 seconds and 1 minute.
 * The MSL is the maximum time a segment can exist in the Internet before it is dropped.
 * TCP segment is encapsulated in an IP datagram, which has a limited lifetime (TTL).
 * When the IP datagram is dropped, the encapsulated TCP segment is also dropped.

TIME-WAIT state and 2MSL timer
There are two reasons for the existence of the TIME-WAIT state and the 2MSL timer:


 * 1st Reason:
 * If the last ACK segment is lost, the server TCP, which sets a timer for the last FIN, assumes that its FIN is lost and resends it.
 * If the client goes to the CLOSED state and closes the connection before the 2MSL timer expires, it never receives this resent FIN segment, and consequently, the server never receives the final ACK.
 * The server cannot close the connection.
 * The 2MSL timer makes the client wait for a duration that is enough time for an ACK to be lost (one MSL) and a FIN to arrive (another MSL).
 * If a new FIN arrives during the TIME-WAIT state, the client sends a new ACK and restarts the 2MSL timer.


 * 2nd Reason:
 * A duplicate segment from one connection might appear in the next one.
 * Assume a client and a server have closed a connection.
 * After a short period of time, they open a connection with the same socket addresses (same source and destination IP addresses and same source and destination port numbers).
 * This new connection is called an incarnation of the old one.
 * A duplicated segment from the previous connection may arrive in this new connection and be interpreted as belonging to the new connection if there is not enough time between the two connections.
 * To prevent this problem, TCP requires that an incarnation cannot occur unless 2MSL amount of time has elapsed.
 * Some implementations, however, ignore this rule if the initial sequence number of the incarnation is greater than the last sequence number used in the previous connection.

TCP Windows
 * TCP uses two windows for each direction of data transfer:
 * Send window
 * Receive window
 * This means four windows for a bidirectional communication.

Send Window



 * The window shown here is of size 100 bytes (normally thousands of bytes).
 * The send window size is dictated by the receiver (flow control) and the congestion in the underlying network (congestion control).
 * The figure shows how a send window opens, closes, or shrinks.

Receive Window


rwnd = buffer size − number of waiting bytes to be pulled
 * TCP allows the receiving process to pull data at its own pace.
 * This means that part of the allocated buffer at the receiver may be occupied by bytes that have been received and acknowledged, but are waiting to be pulled by the receiving process.
 * The receive window size is then always smaller than or equal to the buffer size.
 * The receiver window size determines the number of bytes that the receive window can accept from the sender before being overwhelmed (flow control).

Flow Control

 * Flow control balances the rate a producer creates data with the rate a consumer can use the data.
 * TCP separates flow control from error control.




 * Data travels from the sending process to the sending TCP, then to the receiving TCP, and finally to the receiving process (paths 1, 2, and 3).
 * Flow control feedback travels from the receiving TCP to the sending TCP and from the sending TCP up to the sending process (paths 4 and 5).
 * Most implementations of TCP do not provide flow control feedback from the receiving process to the receiving TCP; they let the receiving process pull data from the receiving TCP whenever it is ready.
 * Thus receiving TCP controls the sending TCP; the sending TCP controls the sending process.
 * Flow control feedback from the Sending TCP to the Sending Process (path 5) is achieved through simple rejection of data by sending TCP when its window is full.
 * Windows are used to achieve flow control from the receiving TCP to the sending TCP, as discussed in the section below.

Opening and Closing Windows

 * To achieve flow control, TCP forces the sender and the receiver to adjust their window sizes.
 * The size of the buffer for both parties is fixed when the connection is established.
 * The receive window closes (moves its left wall to the right) when more bytes arrive from the sender;
 * It opens (moves its right wall to the right) when more bytes are pulled by the process.
 * Assume that it does not shrink (the right wall does not move to the left).
 * The opening, closing, and shrinking of the send window is controlled by the receiver.
 * The send window closes (moves its left wall to the right) when a new acknowledgement allows it to do so.
 * The send window opens (its right wall moves to the right) when the RWND advertised by the receiver allows it to do so.



The diagram shows 8 segments:

1. The client sends the server a SYN to request a connection. The client announces its ISN = 100. The server allocates a buffer size of 800 (assumption) and sets its window to cover the whole buffer (rwnd = 800). The number of the next byte to arrive starts from 101.

2. This is an ACK + SYN segment. The segment uses ack no = 101 to show that it expects to receive bytes starting from 101. It also announces rwnd = 800, so the client can set its send window size to 800 bytes.

3. The third segment is an ACK segment from client to server.

4. After the client has set its window with the size (800) dictated by the server, the process pushes 200 bytes of data. The TCP client numbers these bytes 101 to 300. It creates a segment and sends it to server. The segment has starting byte number as 101 and the segment carries 200 bytes. The window of client is then adjusted to show 200 bytes of data are sent but waiting for acknowledgment. When this segment is received at the server, the bytes are stored, and the receive window closes to show that the next byte expected is byte 301; the stored bytes occupy 200 bytes of buffer.

5. The fifth segment is the feedback from the server to the client. The server acknowledges bytes up to and including 300 (expecting to receive byte 301). The segment also carries the size of the receive window after decrease (600). The client, after receiving this segment, purges the acknowledged bytes from its window and closes its window to show that the next byte to send is byte 301. The window size decreases to 600 bytes. Although the allocated buffer can store 800 bytes, the window cannot open (moving its right wall to the right) because the receiver does not let it.

6. Segment 6 is sent by the client after its process pushes 300 more bytes. The segment defines seq no as 301 and contains 300 bytes. When this segment arrives at the server, the server stores the bytes, but it has to reduce its window size. After its process has pulled 100 bytes of data, the window closes from the left by 300 bytes, but opens from the right by 100 bytes. The result is that the size is reduced by only 200 bytes. The receiver window size is now 400 bytes.

7. The server acknowledges the receipt of data, and announces that its window size is 400. When this segment arrives at the client, the client has no choice but to reduce its window again and set the window size to the value of rwnd = 400. The send window closes from the left by 300 bytes, and opens from the right by 100 bytes.

8. Segment 8 is also from the server, sent after its process has pulled another 200 bytes. Its window size increases. The new rwnd value is now 600. The segment informs the client that the server still expects byte 601, but the server window size has expanded to 600. After this segment arrives at the client, the client opens its window by 200 bytes without closing it. The result is that its window size increases to 600 bytes.
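
The receiver side of this walk-through can be replayed with a small sketch that applies rwnd = buffer size − number of waiting bytes to be pulled. The class and method names are hypothetical; only in-order data is modelled.

```python
class ReceiveBuffer:
    """Receiver-side bookkeeping for the example (buffer of 800 bytes)."""

    def __init__(self, size, first_expected):
        self.size = size                     # allocated buffer size
        self.waiting = 0                     # bytes received but not yet pulled
        self.next_expected = first_expected  # next in-order byte number

    @property
    def rwnd(self):
        return self.size - self.waiting

    def receive(self, seq_no, length):
        """Store an in-order segment; the window closes from the left."""
        if seq_no != self.next_expected or length > self.rwnd:
            raise ValueError("segment is out of order or does not fit")
        self.waiting += length
        self.next_expected += length

    def pull(self, length):
        """The process pulls data; the window opens from the right."""
        self.waiting -= min(length, self.waiting)

    def advertise(self):
        """Return (ackNo, rwnd) as carried in the next ACK segment."""
        return (self.next_expected, self.rwnd)


buf = ReceiveBuffer(size=800, first_expected=101)
buf.receive(101, 200)                  # segment 4
print(buf.advertise())                 # segment 5
buf.receive(301, 300); buf.pull(100)   # segment 6, then the process pulls 100 bytes
print(buf.advertise())                 # segment 7
buf.pull(200)                          # the process pulls another 200 bytes
print(buf.advertise())                 # segment 8
```

The three advertised pairs correspond to segments 5, 7, and 8 of the walk-through.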


 * Shrinking of Windows
 * The receive window cannot shrink.
 * The send window can shrink if the receiver defines a value for rwnd that results in shrinking the window.

Window Shutdown

 * Shrinking the send window by moving its right wall to the left is discouraged.
 * There is one exception: the receiver can temporarily shut down the window by sending an rwnd of 0.
 * This can happen if the receiver does not want to receive data from the sender for a while.
 * The sender does not actually shrink the size of the window, but stops sending data until a new advertisement has arrived.
 * Even when the window is shut down by an order from the receiver, the sender can always send a segment with 1 byte of data.
 * This is called Probing and is used to prevent a deadlock.

Silly Window Syndrome

 * A serious problem can arise in the sliding window operation when either the sending application program creates data slowly or the receiving application program consumes data slowly, or both.
 * Any of these situations results in the sending of data in very small segments, which reduces the efficiency of the operation.
 * If TCP sends segments containing only 1 byte of data, a 41-byte datagram (20 bytes of TCP header, 20 bytes of IP header, and 1 byte of data) transfers only 1 byte of user data.
 * The overhead is 41/1: 41 bytes are sent for every byte of user data.
 * The inefficiency is even worse after accounting for the data link layer and physical layer overhead.


 * Syndrome due to Sender


 * The sending TCP may create a silly window syndrome if it is serving an application program that creates data slowly (e.g. 1 byte at a time).
 * The application program writes 1 byte at a time into the buffer of the sending TCP.
 * If the sending TCP does not have any specific instructions, it may create segments containing 1 byte of data.
 * The result is a lot of 41-byte datagrams that are traveling through an internet.
 * The solution is to prevent the sending TCP from sending the data byte by byte.
 * The sending TCP must be forced to wait and collect data to send in a larger block.
 * If it waits too long, it may delay the process.
 * If it does not wait long enough, it may end up sending small segments.


 * Solution - Nagle’s Algorithm


 * The sending TCP sends the first piece of data it receives from the sending application program even if it is only 1 byte.
 * After sending the first segment, the sending TCP accumulates data in the output buffer and waits until either the receiving TCP sends an acknowledgment or until enough data has accumulated to fill a maximum-size segment.
 * Above Step is repeated for the rest of the transmission.
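
The steps above can be sketched as a toy sender. This is a simplified model under stated assumptions: the class name is hypothetical, a single flag stands in for "data in flight", and one call to `on_ack` is treated as acknowledging everything outstanding.

```python
class NagleSender:
    """Toy model of Nagle's algorithm; sent segments are collected in .sent."""

    def __init__(self, mss):
        self.mss = mss
        self.buffer = b""      # output buffer of the sending TCP
        self.unacked = False   # True while sent data awaits an ACK
        self.sent = []

    def write(self, data):
        """The application hands data to the sending TCP."""
        self.buffer += data
        self._try_send()

    def on_ack(self):
        """An acknowledgment arrives from the receiving TCP."""
        self.unacked = False
        self._try_send()

    def _try_send(self):
        # Send when nothing is in flight (first piece, or an ACK arrived)
        # or when a maximum-size segment has accumulated.
        while self.buffer and (not self.unacked or len(self.buffer) >= self.mss):
            segment, self.buffer = self.buffer[:self.mss], self.buffer[self.mss:]
            self.sent.append(segment)
            self.unacked = True


s = NagleSender(mss=4)
for chunk in (b"a", b"b", b"c"):
    s.write(chunk)   # only b"a" goes out; b"bc" waits in the buffer
s.on_ack()           # the ACK releases b"bc" as one segment
print(s.sent)
```

If the application is faster than the network, segments fill up to the MSS; if it is slower, segments go out as soon as each ACK arrives, so the algorithm adapts to both speeds.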


 * Syndrome Created by the Receiver


 * The syndrome may also occur if the receiving TCP is serving an application that consumes data slowly (e.g. 1 byte at a time).
 * Assume that the sender creates data in blocks of 1000 bytes, but the receiver consumes data 1 byte at a time.
 * Also assume that the input buffer of the receiving TCP is 4 kilobytes. The sender sends the first 4 kilobytes of data.
 * The receiver stores it in its buffer.
 * Now its buffer is full.
 * It advertises a window size of zero, which means the sender should stop sending data.
 * The receiving application reads the first byte of data from the input buffer of the receiving TCP.
 * Now there is 1 byte of space in the incoming buffer.
 * The receiving TCP announces a window size of 1 byte; the sending TCP takes this advertisement as good news and sends a segment carrying only 1 byte of data.
 * The procedure will continue.
 * One byte of data is consumed and a segment carrying 1 byte of data is sent.
 * This is again an efficiency problem.


 * Two solutions are possible


 * Clark’s Solution


 * Announce a window size of zero until either
 * 1) There is enough space to accommodate a segment of maximum size
 * 2) At least half of the receive buffer is empty.
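
Clark's rule can be written as a single decision function. This is a sketch; the function name is hypothetical, and the example values (a 4096-byte buffer and an MSS of 1000 bytes) are assumptions loosely based on the scenario above.

```python
def advertised_window(free_space, buffer_size, mss):
    """Window size to announce under Clark's solution.

    Announce zero until either a maximum-size segment fits or at
    least half of the receive buffer is empty.
    """
    if free_space >= mss or free_space >= buffer_size / 2:
        return free_space
    return 0


print(advertised_window(1, 4096, 1000))     # 1 free byte: keep announcing 0
print(advertised_window(1000, 4096, 1000))  # a full segment fits: announce it
```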


 * Delayed Acknowledgment


 * The second solution is to delay sending the acknowledgment.
 * This means that when a segment arrives, it is not acknowledged immediately.
 * The receiver waits until there is a decent amount of space in its incoming buffer before acknowledging the arrived segments.
 * The delayed acknowledgment prevents the sending TCP from sliding its window.
 * After the sending TCP has sent the data in the window, it stops.
 * This removes the syndrome.
 * Delayed acknowledgment also has another advantage: it reduces traffic.
 * The receiver does not have to acknowledge each segment.
 * However, there also is a disadvantage in that the delayed acknowledgment may result in the sender unnecessarily retransmitting the unacknowledged segments.
 * TCP adjusts this by defining that the acknowledgment should not be delayed by more than 500 ms.

Error Control

 * TCP is a reliable transport layer protocol.
 * This means that an application program that delivers a stream of data to TCP relies on TCP to deliver the entire stream to the application program on the other end in order, without error, and without any part lost or duplicated.
 * TCP provides reliability using error control. Error control includes mechanisms for detecting and resending corrupted segments, resending lost segments, storing out-of-order segments until missing segments arrive, and detecting and discarding duplicated segments.
 * Error control in TCP is achieved through the use of three simple tools: checksum, acknowledgment, and time-out.

Checksum

 * Each segment includes a checksum field, which is used to check for a corrupted segment.
 * If a segment is corrupted, as detected by an invalid checksum, the segment is discarded by the destination TCP and is considered as lost.
 * TCP uses a 16-bit checksum that is mandatory in every segment.
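
As an illustration of how a 16-bit checksum of this kind is computed, here is a sketch of the one's-complement Internet checksum. It is not taken from the source notes; the function name is hypothetical, and the sketch covers only the bytes passed in, whereas real TCP also includes a pseudo-header with the IP addresses in the calculation.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum over the given bytes."""
    if len(data) % 2:
        data += b"\x00"                      # pad to a whole number of 16-bit words
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:                       # fold the carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


payload = bytes.fromhex("0001f203f4f5f6f7")
checksum = internet_checksum(payload)
print(hex(checksum))
# The receiver recomputes over data plus checksum; 0 means no error detected.
print(internet_checksum(payload + checksum.to_bytes(2, "big")))
```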

Acknowledgment

 * TCP uses acknowledgments to confirm the receipt of data segments.
 * Control segments that carry no data but consume a sequence number are also acknowledged.
 * ACK segments are never acknowledged.

There are two types of acknowledgment:
 * Acknowledgment Type:


 * Cumulative Acknowledgment (ACK)
 * TCP was originally designed to acknowledge receipt of segments cumulatively.
 * The receiver advertises the next byte it expects to receive, ignoring all segments received and stored out of order.
 * Also called Positive Cumulative Acknowledgment or ACK.
 * "Positive" indicates that no feedback is provided for discarded, lost, or duplicate segments.
 * The 32-bit ACK field in the TCP header is used for cumulative acknowledgments.
 * Its value is valid only when the ACK flag bit is set to 1.


 * Selective Acknowledgment (SACK)
 * A SACK does not replace ACK, but reports additional information to the sender.
 * A SACK reports a block of data that is out of order.
 * It also reports a block of data that is duplicated, i.e. received more than once.
 * There is no provision in the TCP header for adding this type of information.
 * SACK is implemented as an option at the end of the TCP header.


 * Acknowledgment Generation

1. When end A sends a data segment to end B, it must include (piggyback) an acknowledgment that gives the next sequence number it expects to receive. This rule decreases the number of segments needed and therefore reduces traffic.

2. When the receiver has no data to send and it receives an in-order segment (with expected sequence number) and the previous segment has already been acknowledged, the receiver delays sending an ACK segment until another segment arrives or until a period of time (normally 500 ms) has passed. In other words, the receiver needs to delay sending an ACK segment if there is only one outstanding in-order segment. This rule reduces ACK segment traffic.

3. When a segment arrives with a sequence number that is expected by the receiver, and the previous in-order segment has not been acknowledged, the receiver immediately sends an ACK segment. In other words, there should not be more than two in-order unacknowledged segments at any time. This prevents the unnecessary retransmission of segments that may create congestion in the network.

4. When a segment arrives with an out-of-order sequence number that is higher than expected, the receiver immediately sends an ACK segment announcing the sequence number of the next expected segment. This leads to the fast retransmission of missing segments.

5. When a missing segment arrives, the receiver sends an ACK segment to announce the next sequence number expected. This informs the sender that segments reported missing have been received.

6. If a duplicate segment arrives, the receiver discards the segment, but immediately sends an acknowledgment indicating the next in-order segment expected. This solves some problems when an ACK segment itself is lost.
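
The six rules can be condensed into one decision function. This is a rough sketch: the function and parameter names are hypothetical, and "duplicate" is simplified to mean a sequence number below the expected one.

```python
def ack_decision(seg_seq, expected_seq, *, data_to_send=False,
                 unacked_in_order=False, gap_stored=False):
    """Return (rule number, action) for an arriving data segment.

    data_to_send:     the receiver has its own data to send (rule 1)
    unacked_in_order: a previous in-order segment is still unacknowledged
    gap_stored:       out-of-order data is stored, waiting for this segment
    """
    if seg_seq < expected_seq:
        return 6, "discard duplicate, ACK immediately"
    if seg_seq > expected_seq:
        return 4, "store out of order, ACK immediately"
    # From here on the segment is the expected, in-order one.
    if gap_stored:
        return 5, "ACK immediately"
    if data_to_send:
        return 1, "piggyback ACK on data"
    if unacked_in_order:
        return 3, "ACK immediately"
    return 2, "delay ACK (up to 500 ms)"


print(ack_decision(301, 301))                         # lone in-order segment
print(ack_decision(301, 301, unacked_in_order=True))  # second in-order segment
print(ack_decision(401, 301))                         # a segment is missing
```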

Retransmission

 * The heart of the error control mechanism is the retransmission of segments.
 * When a segment is sent, it is stored in a queue until it is acknowledged.
 * When the retransmission timer expires or when the sender receives three duplicate ACKs for the first segment in the queue, that segment is retransmitted.


 * Retransmission after RTO


 * The sending TCP maintains one retransmission time-out (RTO) for each connection.
 * When the timer matures, i.e. times out, TCP sends the segment in the front of the queue (the segment with the smallest sequence number) and restarts the timer.
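 * Note that this assumes Sf < Sn, i.e. that there are outstanding (sent but unacknowledged) bytes; in the source's notation, Sf is the first outstanding byte and Sn is the next byte to send.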
 * This version of TCP is sometimes referred to as Tahoe.
 * We will see later that the value of RTO is dynamic in TCP and is updated based on the round-trip time (RTT) of segments.
 * RTT is the time needed for a segment to reach a destination and for an acknowledgment to be received.


 * Retransmission after Three Duplicate ACK Segments (Reno)


 * The previous rule about retransmission of a segment is sufficient if the value of RTO is not large.
 * To help throughput by allowing sender to retransmit sooner than waiting for a time out, most implementations today follow the three duplicate ACKs rule and retransmit the missing segment immediately.
 * This feature is called fast retransmission, and the version of TCP that uses this feature is referred to as Reno.
 * In this version, if three duplicate acknowledgments (i.e., an original ACK plus three exactly identical copies) arrive for a segment, the missing segment (the one the duplicate ACKs ask for) is retransmitted without waiting for the time-out.
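
The three-duplicate-ACK rule can be sketched as a counter on the sender side. The class name and the ackNo value 701 are hypothetical, chosen only for illustration.

```python
class DupAckCounter:
    """Counts identical ACKs and signals a fast retransmission."""

    def __init__(self):
        self.last_ack = None
        self.dup_count = 0

    def on_ack(self, ack_no):
        """Return the sequence number to retransmit now, or None."""
        if ack_no == self.last_ack:
            self.dup_count += 1
            if self.dup_count == 3:     # original ACK plus three copies
                return ack_no           # the segment starting here is missing
        else:
            self.last_ack = ack_no      # a new ACK resets the count
            self.dup_count = 0
        return None


counter = DupAckCounter()
print([counter.on_ack(a) for a in (701, 701, 701, 701)])
```

Only the fourth identical ACK triggers the retransmission; a further duplicate after that does not trigger it again in this sketch.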


 * Out-of-Order Segments


 * TCP implementations today do not discard out-of-order segments.
 * They store them temporarily and flag them as out-of-order segments until the missing segments arrive.
 * Out-of-order segments are never delivered to the process.
 * TCP guarantees that data are delivered to the process in order.


 * Lost Segment




 * A lost segment is discarded somewhere in the network; a corrupted segment is discarded by the receiver itself.
 * Both are considered lost.
 * We are assuming that data transfer is unidirectional: one site is sending, the other receiving.
 * In our scenario, the sender sends segments 1 and 2, which are acknowledged immediately by an ACK (rule 3).
 * Segment 3, however, is lost.
 * The receiver receives segment 4, which is out of order.
 * The receiver stores the data in the segment in its buffer but leaves a gap to indicate that there is no continuity in the data.
 * The receiver immediately sends an acknowledgment to the sender displaying the next byte it expects (rule 4).
 * Note that the receiver stores bytes 801 to 900, but never delivers these bytes to the application until the gap is filled.
 * The sender TCP keeps one RTO timer for the whole period of connection.
 * When the third segment times out, the sending TCP resends segment 3, which arrives this time and is acknowledged properly (rule 5).


 * Fast Retransmission


 * Here RTO has a larger value.
 * Each time the receiver receives the fourth, fifth, and sixth segments, it triggers an acknowledgment (rule 4).
 * The sender receives four acknowledgments with the same value (three duplicates).
 * Although the timer has not matured, the rule for fast retransmission requires that segment 3, the segment that is expected by all of these duplicate acknowledgments, be resent immediately.
 * After resending this segment, the timer is restarted.


 * Delayed Segment
 * TCP uses the services of IP, which is a connectionless protocol.
 * Each IP datagram encapsulating a TCP segment may reach the final destination through a different route with a different delay.
 * Hence TCP segments may be delayed.
 * Delayed segments sometimes may time out.
 * If the delayed segment arrives after it has been resent, it is considered a duplicate segment and discarded.


 * Duplicate Segment
 * A duplicate segment can be created, for example, by a sending TCP when a segment is delayed and treated as lost by the receiver.
 * Handling the duplicated segment is a simple process for the destination TCP.
 * The destination TCP expects a continuous stream of bytes.
 * When a segment arrives that contains a sequence number equal to an already received and stored segment, it is discarded.
 * An ACK is sent with ackNo defining the expected segment.


 * Automatically Corrected Lost ACK




 * This is a key advantage of using cumulative acknowledgments.
 * The figure shows a lost acknowledgment sent by the receiver of data.
 * In the TCP acknowledgment mechanism, a lost acknowledgment may not even be noticed by the source TCP.
 * TCP uses a cumulative acknowledgment system.
 * We can say that the next acknowledgment automatically corrects the loss of the acknowledgment.




 * If the next acknowledgment is delayed for a long time or there is no next acknowledgment (the lost acknowledgment is the last one sent), the correction is triggered by the RTO timer.
 * A duplicate segment is the result.
 * When the receiver receives a duplicate segment, it discards it, and resends the last ACK immediately to inform the sender that the segment or segments have been received.
 * Note that only one segment is retransmitted although two segments are not acknowledged.
 * When the sender receives the retransmitted ACK, it knows that both segments are safe and sound because acknowledgment is cumulative.


 * Deadlock Created by Lost Acknowledgment


 * There is one situation in which loss of an acknowledgment may result in system deadlock.
 * This is the case in which a receiver sends an acknowledgment with rwnd set to 0 and requests that the sender shut down its window temporarily.
 * After a while, the receiver wants to remove the restriction; however, if it has no data to send, it sends an ACK segment and removes the restriction with a nonzero value for rwnd.
 * A problem arises if this acknowledgment is lost.
 * The sender is waiting for an acknowledgment that announces the nonzero rwnd.
 * The receiver thinks that the sender has received this and is waiting for data.
 * This situation is called a deadlock; each end is waiting for a response from the other end and nothing is happening.
 * A retransmission timer is not set.
 * To prevent deadlock, a persistence timer was designed.

Congestion Control

 * Congestion control in TCP is based on both open-loop and closed-loop mechanisms.
 * TCP uses a congestion window and a congestion policy that avoid congestion and detect and alleviate congestion after it has occurred.

 * Congestion Window
 * It is not only the receiver that can dictate to the sender the size of the sender’s window.
 * The network can also dictate the size.
 * If the network cannot deliver the data as fast as it is created by the sender, it must tell the sender to slow down.
 * So the receiver and the network together determine the size of the sender’s window.
 * The sender has two pieces of information: the receiver-advertised window size and the congestion window size.
 * The actual size of the window is the minimum of these two:
Actual window size = Minimum (rwnd, cwnd)


 * Congestion Policy
 * TCP’s general policy for handling congestion is based on three phases:
 * Slow Start
 * Congestion Avoidance
 * Congestion Detection


 * In the slow start phase, the sender starts with a slow rate of transmission, but increases the rate rapidly to reach a threshold.
 * When the threshold is reached, the rate of increase is reduced.
 * Finally if ever congestion is detected, the sender goes back to the slow start or congestion avoidance phase, based on how the congestion is detected.


 * Slow Start - Exponential Increase




 * The slow start algorithm is based on the idea that the size of the congestion window (cwnd) starts with 1 MSS.
 * The MSS is determined during connection establishment using an option of the same name.
 * The size of the window increases by one MSS each time one acknowledgment arrives.
 * The algorithm starts slowly, but grows exponentially.
 * Assume that rwnd is much larger than cwnd, so that the sender window size always equals cwnd.
 * Ignore delayed-ACK policy for now and assume that each segment is acknowledged individually.
 * The sender starts with cwnd = 1 MSS.
 * This means that the sender can send only one segment.
 * After the first ACK arrives, the size of the congestion window is increased by 1 MSS, which means that cwnd is now 2 MSS.
 * Now two more segments can be sent.
 * When two more ACKs arrive, the size of the window is increased by 1 MSS for each ACK, which means cwnd is now 4 MSS.
 * Now four more segments can be sent.
 * When four ACKs arrive, the size of the window increases by 4 MSS, which means that cwnd is now 8 MSS.
 * In the slow start algorithm, the size of the congestion window increases exponentially until it reaches a threshold.
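
The doubling described above can be reproduced with a short simulation. This is a minimal sketch under the same assumptions as the text (rwnd much larger than cwnd, one ACK per segment, no loss). The function name `slow_start_rounds` and its `ssthresh` parameter are illustrative, and cwnd is counted in MSS units.

```python
def slow_start_rounds(ssthresh, cwnd=1):
    """Return cwnd (in MSS) at the start of each round trip during slow start."""
    history = [cwnd]
    while cwnd < ssthresh:
        acks = cwnd                  # every segment sent in this round is ACKed
        for _ in range(acks):
            if cwnd >= ssthresh:
                break                # threshold reached: slow start ends
            cwnd += 1                # +1 MSS for each arriving ACK
        history.append(cwnd)
    return history


print(slow_start_rounds(ssthresh=16))  # [1, 2, 4, 8, 16]
```

Because each ACK adds one MSS and a full window of ACKs arrives per round trip, cwnd doubles every RTT until the threshold stops the growth.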


 * Congestion Avoidance - Additive Increase




 * In slow start algorithm, the size of the congestion window increases exponentially.
 * To avoid congestion before it happens, one must slow down this exponential growth.
 * TCP's Congestion avoidance feature increases the cwnd additively instead of exponentially.
 * When the size of the congestion window reaches the slow start threshold, the slow start phase stops and the additive phase begins.
 * Each time the whole “window” of segments is acknowledged, the size of the congestion window is increased by one MSS.
 * A window is the number of segments transmitted during one RTT.
 * The increase is based on RTT, not on the number of arrived ACKs.
 * Therefore the size of the congestion window increases additively until congestion is detected.
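
The additive phase can be sketched by counting ACKs and growing cwnd only when a whole window has been acknowledged. This is an illustrative model, not a real stack: `congestion_avoidance` and its parameters are hypothetical names, cwnd is in MSS units, and no loss occurs.

```python
def congestion_avoidance(cwnd, total_acks):
    """Return the successive cwnd values (in MSS) while total_acks ACKs arrive."""
    history = [cwnd]
    acked = 0
    for _ in range(total_acks):
        acked += 1
        if acked == cwnd:        # the whole window has been acknowledged
            cwnd += 1            # grow by one MSS per window, i.e. per RTT
            acked = 0
            history.append(cwnd)
    return history


print(congestion_avoidance(cwnd=4, total_acks=15))  # [4, 5, 6, 7]
```

Fifteen ACKs only move cwnd from 4 to 7 here, whereas slow start would have added one MSS for every one of them.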


 * Congestion Detection - Multiplicative Decrease


 * If congestion occurs, the congestion window size must be decreased.
 * The only way a sender can guess that congestion has occurred is the need to retransmit a segment.
 * This is a major assumption made by TCP.
 * Retransmission is needed to recover a missing packet, which is assumed to have been dropped by an overloaded or congested router.
 * Retransmission can occur in one of two cases: when the RTO timer times out or when three duplicate ACKs are received.
 * In both cases, the threshold is set to half of the current window size (multiplicative decrease).

Most TCP implementations have two reactions:

1. If a time-out occurs, there is a stronger possibility of congestion; a segment has probably been dropped in the network and there is no news about the following sent segments.

In this case TCP reacts strongly:
 * a. It sets the value of the threshold to half of the current window size.


 * b. It reduces cwnd back to one segment.


 * c. It starts the slow start phase again.

2. If three duplicate ACKs are received, there is a weaker possibility of congestion; a segment may have been dropped, but some segments after it have arrived safely, since three duplicate ACKs were received. This is called fast retransmission and fast recovery.

In this case, TCP has a weaker reaction as shown below:
 * a. It sets the value of the threshold to half of the current window size.


 * b. It sets cwnd to the value of the threshold (some implementations add three segment sizes to the threshold).


 * c. It starts the congestion avoidance phase.
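
The two reactions can be put side by side in a small function. This is a sketch under stated assumptions: the name `on_loss` and the event labels are hypothetical, sizes are in MSS units, the optional "add three segments" variant is left out, and the floor of 2 MSS on the threshold is borrowed from RFC 5681 rather than from the text above.

```python
def on_loss(cwnd, event):
    """Return (ssthresh, new_cwnd, next_phase) after a loss signal."""
    ssthresh = max(cwnd // 2, 2)     # halve the window; floor of 2 MSS is an assumption
    if event == "timeout":
        # strong reaction: restart from one segment in slow start
        return ssthresh, 1, "slow start"
    if event == "three_dup_acks":
        # weak reaction: continue from the threshold in congestion avoidance
        return ssthresh, ssthresh, "congestion avoidance"
    raise ValueError(f"unknown loss event: {event}")


print(on_loss(20, "timeout"))         # (10, 1, 'slow start')
print(on_loss(20, "three_dup_acks"))  # (10, 10, 'congestion avoidance')
```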

TCP Timers
Most TCP implementations use at least four timers
 * Retransmission
 * Persistence
 * Keepalive
 * TIME-WAIT


 * Retransmission Timer

To retransmit lost segments, TCP employs one retransmission timer for the whole connection period that handles the retransmission time-out (RTO), the waiting time for an acknowledgment of a segment.

The following rules apply to the retransmission timer:

1. When TCP sends the segment in front of the sending queue, it starts the timer.

2. When the timer expires, TCP resends the first segment in front of the queue, and restarts the timer.

3. When one or more segments are cumulatively acknowledged, they are purged from the queue.

4. If the queue is empty, TCP stops the timer; otherwise, TCP restarts the timer.

To calculate the retransmission time-out (RTO), we first need to calculate the RTT.
 * Round-Trip Time (RTT)
 * Measured RTT - The measured round-trip time for a segment is the time required for the segment to reach the destination and be acknowledged, although the acknowledgment may include other segments. In TCP only one RTT measurement can be in progress at any time.
 * Smoothed RTT - The measured RTT is likely to change for each round trip. The fluctuation is so high in today’s Internet that a single measurement alone cannot be used for retransmission time-out purposes.
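 * RTT Deviation - Most implementations do not rely on the smoothed RTT alone; they also track the RTT deviation, a running average of how far the measured RTT strays from the smoothed RTT.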


 * Retransmission Time-out (RTO)
 * The value of RTO is based on the smoothed round-trip time and its deviation.
 * RTO = Smoothed RTT + 4 × RTT Deviation: take the running smoothed average of the RTT and add four times the running smoothed average of the RTT deviation (normally a small value).
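
The RTO rule above can be sketched as a small estimator. The text does not give the smoothing weights or the initial values, so this sketch assumes the ones from RFC 6298 (weights of 1/8 and 1/4, first sample sets the deviation to half the measurement, deviation updated before the smoothed RTT). The class name `RtoEstimator` is hypothetical, and the minimum-RTO clamp used by real stacks is omitted.

```python
class RtoEstimator:
    ALPHA = 1 / 8   # weight of a new sample in the smoothed RTT (assumed, RFC 6298)
    BETA = 1 / 4    # weight of a new sample in the RTT deviation (assumed, RFC 6298)

    def __init__(self):
        self.srtt = None     # smoothed RTT
        self.rttvar = None   # RTT deviation

    def update(self, measured):
        """Feed one measured RTT (seconds) and return the new RTO."""
        if self.srtt is None:
            # first measurement: initialise both running values
            self.srtt = measured
            self.rttvar = measured / 2
        else:
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - measured)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * measured
        return self.srtt + 4 * self.rttvar


est = RtoEstimator()
print(est.update(1.5))  # 4.5
print(est.update(2.5))  # 4.875
```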


 * Karn’s Algorithm
 * Do not consider the round-trip time of a retransmitted segment in the calculation of RTTs.
 * Do not update the value of RTTs until you send a segment and receive an acknowledgment without the need for retransmission.
 * TCP does not consider the RTT of a retransmitted segment in its calculation of a new RTO.


 * Exponential Backoff
 * Most TCP implementations use an exponential backoff strategy to calculate the value of RTO if a retransmission occurs.
 * The value of RTO is doubled for each retransmission.
 * So if the segment is retransmitted once, the value is two times the RTO.
 * If it is retransmitted twice, the value is four times the RTO.
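
Karn's rule and the backoff rule fit in a few lines. This is an illustrative sketch: both function names are hypothetical, and the 60-second upper bound on the backed-off value is an assumption that the text does not state.

```python
def usable_rtt_sample(was_retransmitted):
    """Karn's algorithm: only segments sent exactly once yield an RTT sample."""
    return not was_retransmitted


def backed_off_rto(base_rto, retransmissions, max_rto=60.0):
    """Double the RTO for every retransmission, up to an assumed cap."""
    return min(base_rto * 2 ** retransmissions, max_rto)


print([backed_off_rto(1.5, n) for n in range(4)])  # [1.5, 3.0, 6.0, 12.0]
```

The two rules work together: while samples from retransmitted segments are ignored, the doubled RTO stays in effect until a segment is acknowledged without being retransmitted.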


 * Persistence Timer
 * To deal with a zero-window-size advertisement, TCP needs a persistence timer.
 * If the receiving TCP announces a window size of zero, the sending TCP stops transmitting segments until the receiving TCP sends an ACK segment announcing a nonzero window size.
 * This ACK segment can be lost.
 * Remember - ACK segments are not acknowledged nor retransmitted in TCP.
 * Both TCPs might continue to wait for each other forever (a deadlock).
 * To correct this deadlock, TCP uses a persistence timer for each connection.
 * When the sending TCP receives an acknowledgment with a window size of zero, it starts a persistence timer.
 * When the persistence timer goes off, the sending TCP sends a special segment called a Probe.
 * This segment contains only 1 byte of new data.
 * It has a sequence number, but its sequence number is never acknowledged; it is even ignored in calculating the sequence number for the rest of the data.
 * The probe causes the receiving TCP to resend the acknowledgment.
 * The value of the persistence timer is initially set to the value of the retransmission time-out (RTO).
 * If a response is not received from the receiver, another probe segment is sent and the value of the persistence timer is doubled and reset.
 * The sender continues sending the probe segments and doubling and resetting the value of the persistence timer until the value reaches a threshold (generally 60s).
 * After that the sender sends one probe segment every 60 s until the window is reopened.
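
The probe schedule described above (start at the retransmission time-out, double after every unanswered probe, level off at 60 s) can be sketched as follows. The function name `persistence_intervals` and its parameters are illustrative.

```python
def persistence_intervals(rto, probes, cap=60.0):
    """Return the waiting time (seconds) before each of the first `probes` probes."""
    intervals = []
    timer = rto
    for _ in range(probes):
        intervals.append(timer)
        timer = min(timer * 2, cap)   # double, but never beyond the threshold
    return intervals


print(persistence_intervals(rto=5.0, probes=6))  # [5.0, 10.0, 20.0, 40.0, 60.0, 60.0]
```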


 * Keepalive Timer
 * A keepalive timer is used in some implementations to prevent a long idle connection between two TCPs.
 * Suppose a client opens a TCP connection to a server, transfers some data, and then becomes silent.
 * Perhaps the client has crashed; in that case, the connection would remain open forever.
 * To remedy this situation, most implementations equip a server with a keepalive timer.
 * Each time the server hears from a client, it resets this timer.
 * The time-out is usually 2 hours.
 * If the server does not hear from the client after 2 hours, it sends a probe segment.
 * If there is no response after 10 probes, each of which is 75s apart, it assumes that the client is down and terminates the connection.
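
With the figures quoted above, the time a server needs to notice a dead client can be worked out directly. This is only arithmetic on those defaults; the function name is hypothetical, and it assumes the server also waits one full 75 s interval after the last probe before giving up.

```python
def keepalive_give_up_after(idle=2 * 60 * 60, probes=10, interval=75):
    """Seconds of silence before the connection is declared dead."""
    return idle + probes * interval


print(keepalive_give_up_after())  # 7950
```

That is 2 hours, 12 minutes, and 30 seconds after the last segment was heard from the client.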

Options
The TCP header can have up to 40 bytes of optional information.


 * 1-byte options
 * 1) End of option list
 * 2) No operation


 * Multiple-byte options
 * 1) Maximum Segment Size
 * 2) Window Scale Factor
 * 3) Timestamp
 * 4) SACK-permitted
 * 5) SACK


 * End of Option
 * EOP is a 1-byte option used for padding at the end of the option section.
 * It can only be used as the last option. There are no more options in the header after EOP.
 * Only one occurrence of this option is allowed.
 * After this option, the receiver looks for the payload data.
 * Data from the application program starts at the beginning of the next 32-bit word.


 * No Operation
 * NOP option is also a 1-byte option used as a filler.
 * It normally comes before another option to help align it on a 32-bit word boundary.
 * NOP can be used more than once.


 * Maximum Segment Size (MSS)


 * MSS option defines the size of the biggest unit of data that can be received by the destination of the TCP segment.
 * It defines the maximum size of the data, not the maximum size of the segment.
 * The field is 16 bits long, so the value can be 0 to 65,535 bytes.
 * Each party defines the MSS for the segments it will receive during the connection.
 * If a party does not define this, the default value is 536 bytes.
 * The value of MSS is determined during connection establishment and does not change during the connection.


 * Window Scale Factor
 * Window size field in the header defines the size of the sliding window.
 * This field is 16 bits long, which means that the window can range from 0 to 65,535 bytes.
 * It may not be sufficient if the data are traveling through a long channel with a wide bandwidth.
 * To increase the window size, a window scale factor is used.
 * The new window size is found by first raising 2 to the number specified in the window scale factor.
 * Then this result is multiplied by the value of the window size in the header.

New Window Size = Window Size in Header × 2^(Window Scale Factor)

For example, suppose the window scale factor is 3 and an end point receives an acknowledgment in which the window size is advertised as 32,768. Then New Window Size = 32,768 × 2^3 = 262,144 bytes.


 * Although the scale factor could be as large as 255, the largest value allowed by TCP/IP is 14.
 * Maximum window size is 2^16 × 2^14 = 2^30, which is less than the maximum value for the sequence number.
 * The size of the window cannot be greater than the maximum value of the sequence number.
 * Like the MSS, the value of the window scale factor is determined only during connection establishment; it does not change during the connection.
 * During data transfer, the size of the window (specified in the header) may be changed, but it must be multiplied by the same window scale factor.
 * One end may set the value of the window scale factor to 0, which means although it supports this option, it does not want to use it for this connection.
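
The scaling rule amounts to a left shift of the 16-bit header value. The sketch below reproduces the worked example and the upper bound; `effective_window` is a hypothetical helper name.

```python
def effective_window(header_window, scale_factor):
    """Window size in bytes after applying the negotiated window scale factor."""
    if not 0 <= header_window <= 0xFFFF:
        raise ValueError("the header window field is 16 bits")
    if not 0 <= scale_factor <= 14:
        raise ValueError("the scale factor must be between 0 and 14")
    return header_window << scale_factor   # same as header_window * 2**scale_factor


print(effective_window(32768, 3))    # 262144
print(effective_window(65535, 14))   # 1073725440, just under 2**30
```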


 * Timestamp


 * This is a 10-byte option.
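 * The end performing the active open announces the timestamp (TS) option in its SYN segment.
 * If the SYN + ACK from the other end also carries the option, it is allowed to use timestamps; otherwise, it does not use them any more.
 * The timestamp option has two applications: measuring the round-trip time and protecting against wrapped sequence numbers (PAWS).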


 * Measuring RTT


 * Timestamp can be used to measure the round-trip time (RTT).
 * TCP, when ready to send a segment, reads the value of the system clock and inserts this value, a 32-bit number, in the timestamp value field.
 * The receiver, when sending an acknowledgment for this segment or a cumulative acknowledgment that covers the bytes in this segment, copies the received timestamp into the timestamp echo reply field.
 * The sender, upon receiving the acknowledgment, subtracts the value of the timestamp echo reply from the time shown by the clock to find RTT.


 * Note that there is no need for the sender’s and receiver’s clocks to be synchronized because all calculations are based on the sender clock.
 * Also note that the sender does not have to remember or store the time a segment left because this value is carried by the segment itself.


 * The receiver needs to keep track of two variables. The first, lastack, is the value of the last acknowledgment sent.
 * The second, tsrecent, is the value of the most recent timestamp that has not yet been echoed.
 * When the receiver receives a segment that contains the byte matching the value of lastack, it inserts the value of the timestamp field in the tsrecent variable.
 * When it sends an acknowledgment, it inserts the value of tsrecent in the echo reply field.
 * The sender simply inserts the value of the clock (for example, the number of seconds past midnight) in the timestamp field for the first and second segment.
 * When an acknowledgment comes (the third segment), the value of the clock is checked and the value of the echo reply field is subtracted from the current time.
 * RTT is 12 s in this scenario.
 * The receiver’s function is more involved.
 * It keeps track of the last acknowledgment sent (12000).
 * When the first segment arrives, it contains the bytes 12000 to 12099.
 * The first byte is the same as the value of lastack.
 * It then copies the timestamp value (4720) into the tsrecent variable.
 * The value of lastack is still 12000 (no new acknowledgment has been sent).
 * When the second segment arrives, since none of the byte numbers in this segment include the value of lastack, the value of the timestamp field is ignored.
 * When the receiver decides to send a cumulative acknowledgment with acknowledgment number 12200, it changes the value of lastack to 12200 and inserts the value of tsrecent in the echo reply field.
 * The value of tsrecent will not change until it is replaced by a new segment that carries byte 12200 (next segment).
 * Note that as the example shows, the RTT calculated is the time difference between sending the first segment and receiving the third segment.
 * This is actually the meaning of RTT: the time difference between a packet sent and the acknowledgment received.
 * The third segment carries the acknowledgment for the first and second segments.
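
The receiver's bookkeeping in this example can be replayed in code. This is a sketch: the class name `TimestampReceiver` is hypothetical, and the second segment's timestamp (4725) and the sender's clock when the acknowledgment arrives (4732) are assumed values chosen to be consistent with the 12 s RTT stated above.

```python
class TimestampReceiver:
    def __init__(self, lastack):
        self.lastack = lastack     # last acknowledgment number sent
        self.tsrecent = None       # timestamp waiting to be echoed

    def on_segment(self, first_byte, length, tsval):
        # store the timestamp only if the segment contains byte number lastack
        if first_byte <= self.lastack < first_byte + length:
            self.tsrecent = tsval

    def send_ack(self, ack_number):
        self.lastack = ack_number
        return ack_number, self.tsrecent   # (acknowledgment number, echo reply)


rx = TimestampReceiver(lastack=12000)
rx.on_segment(12000, 100, tsval=4720)   # contains byte 12000: timestamp stored
rx.on_segment(12100, 100, tsval=4725)   # does not contain byte 12000: ignored
ack, echo = rx.send_ack(12200)
print(ack, echo)      # 12200 4720
print(4732 - echo)    # 12, the RTT seen by the sender
```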


 * PAWS


 * The timestamp option has another application, protection against wrapped sequence numbers (PAWS).
 * The sequence number defined in the TCP protocol is only 32 bits long.
 * Although this is a large number, it could be wrapped around in a high-speed connection.
 * This implies that if a sequence number is n at one time, it could be n again during the lifetime of the same connection.
 * Now if the first segment is duplicated and arrives during the second round of the sequence numbers, the segment belonging to the past is wrongly taken as the segment belonging to the new round.
 * One solution to this problem is to increase the size of the sequence number, but this involves increasing the size of the window as well as the format of the segment and more.
 * The easiest solution is to include the timestamp in the identification of a segment.
 * In other words, the identity of a segment can be defined as the combination of timestamp and sequence number.
 * This means increasing the size of the identification.
 * Two segments 400:12,001 and 700:12,001 definitely belong to different incarnations.
 * The first was sent at time 400, the second at time 700.


 * SACK-Permitted and SACK Options


 * Acknowledgment field is designed as cumulative acknowledgment, which means it reports the receipt of the last consecutive byte.
 * It does not report bytes that have arrived out of order, nor does it report duplicate segments.
 * This may have a negative effect on TCP’s performance.
 * If some packets are lost or dropped, the sender must wait until a time-out and then send all packets that have not been acknowledged.
 * The receiver may receive duplicate packets.
 * To improve performance, selective acknowledgment (SACK) was proposed.
 * Selective acknowledgment allows the sender to have a better idea of which segments are actually lost and which have arrived out of order.
 * The new proposal even includes a list for duplicate packets.
 * The sender can then send only those segments that are really lost.
 * The list of duplicate segments can help the sender find the segments which have been retransmitted by a short time-out.
 * The SACK-permitted option of two bytes is used only during connection establishment.
 * The host that sends the SYN segment adds this option to show that it can support the SACK option.
 * If the other end, in its SYN + ACK segment, also includes this option, then the two ends can use the SACK option during data transfer.
 * Note that the SACK-permitted option is not allowed during the data transfer phase.
 * The SACK option, of variable length, is used during data transfer only if both ends agree (if they have exchanged SACK-permitted options during connection establishment).
 * The option includes a list for blocks arriving out of order.
 * Each block occupies two 32-bit numbers that define the beginning and the end of the block.
 * The option section of a TCP header is limited to 40 bytes.
 * This means that a SACK option cannot define more than 4 blocks.
 * The information for 5 blocks occupies (5 × 2) × 4 + 2 or 42 bytes, which is beyond the available size for the option section in a segment.
 * If the SACK option is used with other options, then the number of blocks may be reduced.
 * The first block of the SACK option can be used to report the duplicates.
 * This is used only if the implementation allows this feature.
 * The SACK option announces this duplicate data first and then the out-of-order block.
 * This time, however, the duplicated block is not yet acknowledged by ACK, but because it is part of the out-of-order block (4001:5000 is part of 4001:6000), it is understood by the sender that it defines the duplicate data.
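
The block limit imposed by the 40-byte option section can be checked with a little arithmetic. This is a sketch: the helper names are hypothetical, and the 12-byte figure for a timestamp option (10 bytes plus two NOP bytes of padding) is an assumption about a common layout, not something stated above.

```python
def sack_option_length(blocks):
    """Bytes used by a SACK option: kind + length, then two 32-bit edges per block."""
    return 2 + blocks * 2 * 4


def max_sack_blocks(other_options_bytes=0, option_space=40):
    """Largest number of blocks that still fits in the option section."""
    available = option_space - other_options_bytes - 2
    return max(available // 8, 0)


print(sack_option_length(5))                    # 42, too large for 40 bytes
print(max_sack_blocks())                        # 4
print(max_sack_blocks(other_options_bytes=12))  # 3 when a padded timestamp is present
```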