TCP/IP

= Packet Headers =

The headers of the IP protocol suite are as follows:

IP Packet

 * IP Datagram header = 20-60 bytes.
 * Header = 20 bytes if no options.
 * Up to 60 bytes if it contains options.
 * Data length = total length - header length; at most [65535 - (20~60)] bytes.
 * Total length field = 16 bits.
 * Therefore max packet length = 2^16 - 1 = 65535 bytes.
 * The Fragment Offset field has only 13 bits, 3 fewer than the 16-bit Total Length field, so it cannot index every byte. It indexes 8-byte (2^3) chunks instead. Thus actual byte offset of a fragment = 8 * Fragment Offset.
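
As a minimal sketch of the fragment-offset arithmetic above, the 16-bit word that carries the three flag bits and the 13-bit Fragment Offset can be decoded as follows. The function name and the sample values are illustrative assumptions, not part of the original notes.

```python
def decode_flags_and_offset(word: int) -> dict:
    """Split the 16-bit flags/fragment-offset word of an IPv4 header."""
    return {
        "DF": bool(word & 0x4000),            # Don't Fragment flag
        "MF": bool(word & 0x2000),            # More Fragments flag
        "byte_offset": (word & 0x1FFF) * 8,   # 13-bit field counts 8-byte chunks
    }


# Hypothetical fragment: MF set, Fragment Offset = 185 -> byte offset 1480
print(decode_flags_and_offset(0x20B9))
```
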

Identification Field:
 * The IPID of a packet remains the same even after fragmentation.
 * It is used during reassembly of fragmented datagrams.
 * If a packet passes through a device multiple times, it can be traced with the IPID.

TCP Packet

 * TCP packet is called Segment.
 * TCP header length = 20-60 bytes.
 * Header = 20 bytes if no options.
 * Up to 60 bytes if it contains options.


 * Sequence number field of a segment defines the number assigned to the first data byte contained in that segment.


 * Acknowledgment field in a segment defines the number of the next byte a party expects to receive. It acknowledges that all bytes before that byte number were received. If ACK number = 1381, it means all bytes up to byte number 1380 were received and the party sending the ACK now expects bytes from 1381 onwards.


 * Byte number: TCP numbers the bytes of data transferred in each connection. The numbering starts with an arbitrarily generated number.
 * Data offset: tells the receiver where the data starts. Since the TCP header can be anywhere from 5 to 15 32-bit words long (20-60 bytes), this field marks where the header ends and the data begins.
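
The two calculations described above (header length from the data offset, and the acknowledgment number a receiver would send) can be sketched as below. The function names are hypothetical, and the second function ignores SYN/FIN flags, which also consume a sequence number.

```python
def tcp_header_length(data_offset_words: int) -> int:
    """Header length in bytes from the 4-bit data offset (32-bit words)."""
    if not 5 <= data_offset_words <= 15:
        raise ValueError("data offset must be between 5 and 15 words")
    return data_offset_words * 4


def next_ack(seq: int, payload_len: int) -> int:
    """ACK number expected after a segment: first byte number + data length.

    Sequence numbers are 32 bits wide, so the result wraps modulo 2**32.
    """
    return (seq + payload_len) % 2**32


print(tcp_header_length(5))   # header without options
print(next_ack(1001, 380))    # bytes 1001..1380 received
```
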

UDP Packet

 * UDP packet is called Datagram.
 * UDP header size = 8 bytes.
 * UDP packet size may be between 8 and 65535 bytes according to its length field.
 * But the enclosing IP datagram can have at most 65535 bytes, including the IP header.
 * Therefore, UDP length = IP total length - IP header length.

= Internet Protocol =

Fragmentation
 * Only Data in a datagram is fragmented. Header is never fragmented.
 * Options are in a TLV format, [Type][Length][Value], which can be max 40 Bytes in length.
 * Type field is 1 Byte & Length field is also 1 byte in length.
 * Options field may or may not be copied into each fragment.
 * Whether an option is copied into a fragment is based on the first bit of the Type field:
   - 1st bit = 0 ==> Copy options in 1st fragment only
   - 1st bit = 1 ==> Copy options in all fragments
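
A small sketch of the copy-bit test described above. The function name is an assumption; the two sample option types (7 = Record Route, 131 = Loose Source Route) are taken from RFC 791 rather than from these notes.

```python
def copied_to_all_fragments(option_type: int) -> bool:
    """True if the first (most significant) bit of the option Type byte is set."""
    return bool(option_type & 0x80)


print(copied_to_all_fragments(7))    # Record Route: first fragment only
print(copied_to_all_fragments(131))  # Loose Source Route: every fragment
```
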


 * Fragmented Packet Analysis (figures: before fragmentation, after fragmentation)

 * Further Fragmentation of a Fragmented Packet (figures: before fragmentation, after fragmentation)
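
Since the original figures are not available, here is a hedged sketch of how fragment offsets, lengths and the MF flag could be computed, including re-fragmentation of an existing fragment. The function, its parameters and the sample sizes (4000-byte datagram, MTUs of 1500 and 620) are illustrative assumptions; the header is assumed to be 20 bytes with no options.

```python
def fragment(total_length, mtu, header_len=20, base_offset=0, more_follow=False):
    """Return a list of (offset_field, total_length, MF) tuples.

    base_offset and more_follow describe the packet being split, so that an
    already fragmented packet can be fragmented again.
    """
    payload = total_length - header_len
    max_data = (mtu - header_len) // 8 * 8   # data per fragment, multiple of 8
    if max_data < 8:
        raise ValueError("MTU too small to carry any data")
    fragments, sent = [], 0
    while sent < payload:
        size = min(max_data, payload - sent)
        is_last = sent + size == payload
        mf = (not is_last) or more_follow    # last piece inherits the parent's MF
        fragments.append((base_offset + sent // 8, header_len + size, mf))
        sent += size
    return fragments


first_pass = fragment(4000, 1500)
print(first_pass)
# Fragment the second fragment again over a link with a smaller MTU
offset, length, mf = first_pass[1]
print(fragment(length, 620, base_offset=offset, more_follow=mf))
```

Every fragment except the last carries a multiple of 8 data bytes, which is what allows the 13-bit offset field to address it.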

= Transmission Control Protocol =

 * IP is:
   - Unreliable
   - Connection-less
   - Unstreamed
   - Responsible for Routing of Packets

 * TCP is:
   - Reliable
   - Connection Oriented
   - Stream Ordered
   - Responsible for end to end delivery


 * Sequence number of packet is the number of the first byte in the packet.
 * Together with the length of the data in the segment, we know which packet carries which bytes

 * Three phases:
   - Connection Establishment: Three-way handshake
   - Data Transfer
   - Connection Termination: Three-way Termination, or Four-way Termination with a half-close option

 * Connection Reset (using RST Flag):
   - Deny a connection request
   - Abort an existing connection
   - Terminate an idle connection

 * Maximum Segment Lifetime (MSL)
   - The MSL is the maximum time a segment can exist in the Internet before it is dropped.
   - The TCP standard defines MSL as a value of 120 seconds (2 minutes).
   - The common value for MSL in practice is between 30 seconds and 1 minute.

Windows
 * Do not confuse between:
   - Congestion control = Adjusting cwnd in response to Packet loss and Congestion in Network
   - Flow control = Adjusting Sending Rate so that we do not overwhelm the Receive Buffer

 * TCP Windows
   - Send window
   - Receive window


 * Flow Control
 * Opening and Closing Windows
 * Shrinking of Windows
 * Window Shutdown
 * Silly Window Syndrome: arises when the sending application program creates data slowly, the receiving application program consumes data slowly, or both.


 * Syndrome due to Sender
 * Solution - Nagle’s Algorithm
   - TCP sends the first piece of data
   - TCP accumulates data in the output buffer
   - Waits until either the receiving TCP sends an acknowledgment
   - Or until enough data has accumulated to fill a maximum-size segment
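
The send decision described by Nagle's algorithm can be sketched as a single predicate. This is a simplified model under stated assumptions: the function and parameter names are hypothetical, and real stacks track more state than two byte counters.

```python
def nagle_can_send(buffered_bytes: int, mss: int, unacked_bytes: int) -> bool:
    """Decide whether the sender may transmit now under Nagle's algorithm."""
    if buffered_bytes == 0:
        return False                 # nothing to send
    if buffered_bytes >= mss:
        return True                  # a maximum-size segment can be filled
    return unacked_bytes == 0        # small data goes out only when nothing is in flight


print(nagle_can_send(1, 1460, 0))     # first piece of data is sent at once
print(nagle_can_send(10, 1460, 1))    # held back until the ACK arrives
print(nagle_can_send(1460, 1460, 1))  # full segment accumulated
```
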


 * Syndrome due to Receiver
 * A. Solution - Clark’s Solution
   - Announce a window size of zero until either
   - There is enough space to accommodate a segment of maximum size
   - Or at least half of the receive buffer is empty.
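
Clark's rule above amounts to a threshold on the advertised window. A minimal sketch, with hypothetical names and byte counts chosen only for illustration:

```python
def clark_advertised_window(free_bytes: int, buffer_size: int, mss: int) -> int:
    """Window the receiver announces under Clark's solution."""
    if free_bytes >= mss or 2 * free_bytes >= buffer_size:
        return free_bytes    # room for a full segment, or buffer at least half empty
    return 0                 # otherwise keep the window closed


print(clark_advertised_window(100, 4096, 1460))   # too little space: announce 0
print(clark_advertised_window(1460, 4096, 1460))  # a full segment fits
print(clark_advertised_window(600, 1000, 1460))   # at least half the buffer is empty
```
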


 * B. Solution - Delayed Acknowledgment
   - Segment is not acknowledged immediately
   - Prevents the sender TCP from sliding its window
   - Another advantage is it reduces traffic
   - May result in the sender unnecessarily retransmitting the unacknowledged segments.
   - Should not be delayed by more than 500 ms to prevent retransmission

Error Control
 * Checksum
 * Acknowledgment
   - Cumulative Acknowledgment (ACK)
   - Selective Acknowledgment (SACK)
 * Retransmission
   - Retransmission after RTO
   - Retransmission after Three Duplicate ACK Segments (Reno)
 * Out-of-Order Segments
   - Store them temporarily
   - Flag them as out-of-order segments until the missing segments arrive
   - Not delivered to the process directly
   - Data is delivered to the process in order
 * Lost Segment
   - Receive data in other segments in its buffer but leaves a gap to indicate non continuity in the data
   - Receiver immediately sends an acknowledgment displaying the next byte it expects
   - Segment retransmitted after RTO or after 3 duplicate Acks.
 * Fast Retransmission
   - If RTO has a larger value
   - If sender receives four acknowledgments with same value (three duplicates)
   - Segment expected by all of these Ack is resent immediately
 * Delayed Segment
   - May time out
   - It is discarded if retransmitted & both reach
 * Duplicate Segment
   - It is discarded
 * Automatically Corrected Lost ACK
   - Key advantage of using cumulative acknowledgments
   - By Retransmission of Ack of next segments
   - By Ack received of next segments
 * Deadlock Created by Lost Acknowledgment
   - When receiver sends Ack with rwnd 0
   - Sender shut down its window temporarily
   - Receiver sends Ack if it wants to remove the restriction
   - Problem arises if this acknowledgment is lost
   - Persistence timer (=RTO) is used to resolve this deadlock
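
The fast retransmission rule (four acknowledgments with the same value, i.e. three duplicates) can be sketched as a counter over the stream of incoming ACK numbers. The function name and the sample ACK sequence are illustrative assumptions.

```python
def fast_retransmit_points(ack_numbers):
    """Return the ACK numbers for which a fast retransmission would fire."""
    fired = []
    last_ack, duplicates = None, 0
    for ack in ack_numbers:
        if ack == last_ack:
            duplicates += 1
            if duplicates == 3:          # original ACK + three duplicates
                fired.append(ack)        # resend the segment starting at this byte
        else:
            last_ack, duplicates = ack, 0
    return fired


# The segment starting at byte 301 is lost; later segments trigger duplicate ACKs
print(fast_retransmit_points([101, 201, 301, 301, 301, 301, 701]))
```
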

Congestion Control
 * Congestion Window
 * Size of the sender’s window is dictated by:
   - Receiver (Receiver-Advertised window size)
   - Network (Congestion window size)


 * Actual Window Size = Minimum (rwnd, cwnd)

 * Slow Start - Exponential Increase
   - Sender starts with a slow rate of transmission (cwnd = 1 MSS)
   - Size increases by one MSS each time one acknowledgement arrives
   - Increases the rate exponentially (1, 2, 4, 8, ...) until a threshold is reached

 * Congestion Avoidance - Additive Increase
   - To avoid congestion TCP must slow down this exponential growth
   - Increases the cwnd additively instead of exponentially
   - When a “window” of segments is acknowledged the size of the congestion window is increased by one
   - A window is the number of segments transmitted during one RTT
   - The increase is based on RTT, not on the number of arrived ACKs
   - Therefore the size of the congestion window increases additively until congestion is detected

 * Congestion Detection - Multiplicative Decrease
   - If congestion occurs, the window size must be decreased
   - Sender knows about congestion if it needs to retransmit (RTO or 3 duplicate ACKs received)
   - In both cases, the size of the threshold is dropped to half

 * If RTO occurred, TCP Reacts Strongly
   - Stronger possibility of congestion, segment probably dropped in the network
   - TCP sets threshold to half of the current window size
   - Reduces cwnd back to 1 segment, starts the slow start phase again

 * If 3 Duplicate ACKs are received, TCP has a Weaker Reaction
   - Weaker possibility of congestion
   - Segment may be dropped but some segments have arrived safely as 3 duplicate ACKs are received
   - TCP sets threshold to half of the current window size
   - Sets cwnd to the value of the threshold
   - Starts the Congestion Avoidance phase
   - This is called fast retransmit and fast recovery
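
The three phases above can be sketched per round trip, with cwnd and threshold counted in MSS units. This is a simplified model, not a full TCP implementation: the function names are hypothetical, capping slow start at the threshold is a modelling choice, and the floor of 2 MSS on the threshold is borrowed from RFC 5681 rather than from these notes.

```python
def after_acked_round(cwnd: int, ssthresh: int) -> int:
    """cwnd after one RTT in which the whole window was acknowledged."""
    if cwnd < ssthresh:
        return min(cwnd * 2, ssthresh)   # slow start: exponential increase
    return cwnd + 1                      # congestion avoidance: additive increase


def after_timeout(cwnd: int):
    """Strong reaction to an RTO: returns (new cwnd, new threshold)."""
    return 1, max(cwnd // 2, 2)


def after_three_dup_acks(cwnd: int):
    """Weaker reaction to 3 duplicate ACKs: returns (new cwnd, new threshold)."""
    ssthresh = max(cwnd // 2, 2)
    return ssthresh, ssthresh


cwnd, ssthresh, history = 1, 16, []
for _ in range(7):
    history.append(cwnd)
    cwnd = after_acked_round(cwnd, ssthresh)
print(history)                      # growth over seven round trips
print(after_timeout(18))            # back to slow start
print(after_three_dup_acks(18))     # continue in congestion avoidance
```
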

Tahoe vs Reno
Source: Wikipedia
 * For each connection, TCP maintains a congestion window, limiting the total number of unacknowledged packets that may be in transit end-to-end.
 * This is analogous to TCP's sliding window used for flow control.


 * As long as non-duplicate ACKs are received, the congestion window is additively increased by one MSS every round trip time.
 * When a packet is lost, the likelihood of duplicate ACKs being received is very high:
 * The behavior of Tahoe and Reno differs in how they detect and react to packet loss:


 * Fast Recovery (Reno only): In this state, TCP retransmits the missing packet that was signaled by three duplicate ACKs, and waits for an acknowledgment of the entire transmit window before returning to congestion avoidance.
 * If there is no acknowledgment, TCP Reno experiences a timeout and enters the slow-start state.
 * Both algorithms reduce congestion window to 1 MSS on a timeout event.

TCP Timers
 * Retransmission Time-out (RTO)
   - Needs Round-Trip Time (RTT)
   - Measured RTT - time required for segment to reach the destination and be acknowledged
   - Smoothed RTT
   - RTT Deviation
   - Most implementations use RTT deviation
   - RTO = Smoothed RTT + [4 x RTT Deviation]
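
A hedged sketch of the RTO formula above. The notes do not say how the smoothed RTT and the deviation are updated; the weights 1/8 and 1/4 and the first-sample initialisation used here are taken from RFC 6298 and are therefore an assumption relative to this page. The class name is hypothetical, and the minimum-RTO and clock-granularity rules of the RFC are left out.

```python
class RtoEstimator:
    ALPHA = 1 / 8   # weight of a new sample in the smoothed RTT (RFC 6298)
    BETA = 1 / 4    # weight of a new sample in the RTT deviation (RFC 6298)

    def __init__(self):
        self.srtt = None
        self.rttvar = None

    def sample(self, measured_rtt: float) -> float:
        """Feed one measured RTT (seconds) and return the resulting RTO."""
        if self.srtt is None:                       # first measurement
            self.srtt = measured_rtt
            self.rttvar = measured_rtt / 2
        else:
            self.rttvar = ((1 - self.BETA) * self.rttvar
                           + self.BETA * abs(self.srtt - measured_rtt))
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * measured_rtt
        return self.srtt + 4 * self.rttvar          # RTO = SRTT + 4 x deviation


estimator = RtoEstimator()
print(estimator.sample(1.0))
print(estimator.sample(1.0))
```

Following Karn's algorithm, a caller would simply not feed samples taken from retransmitted segments.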

 * Karn’s Algorithm
   - Do not consider RTT of a Retransmitted segment in calculation of RTO value

 * Exponential Backoff
   - The value of RTO is doubled for each retransmission

 * Persistence Timer
   - Resolves the deadlock created by a lost ACK that was meant to reopen a window advertised earlier as 0
   - After timeout, sending TCP sends a special segment (1 byte of new data) called a probe
   - Probe causes the receiving TCP to resend the ACK
   - If no reply, another probe is sent and the value of the persistence timer is doubled and reset
   - Sender continues sending probes, doubling and resetting the value of the persistence timer until it reaches a threshold (generally 60s)
   - After that the sender sends one probe segment every 60s until the window is reopened
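
The probe schedule of the persistence timer (doubling until a threshold, then a fixed interval) can be sketched as follows. The function name and the 1.5 s starting value are assumptions for illustration; the 60 s threshold comes from the notes above.

```python
def persistence_intervals(initial: float, probes: int, threshold: float = 60.0):
    """Waiting time in seconds before each successive probe."""
    intervals, timer = [], min(initial, threshold)
    for _ in range(probes):
        intervals.append(timer)
        timer = min(timer * 2, threshold)   # double, but never exceed the threshold
    return intervals


print(persistence_intervals(1.5, 8))
```
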

 * Keepalive Timer
   - Without it, if the client crashes the connection remains open forever
   - Time-out is usually 2 hours.
   - If the server does not hear from the client after 2 hours, it sends a probe segment
   - If no response after 10 probes (75s apart) the server terminates the connection

Options
 * 1-byte options
 * End of option list (EOP)
   - It can only be used as the last option
   - There are no more options in the header after EOP
   - Only one occurrence of this option is allowed
 * No operation (NOP)
   - NOP option is also a 1-byte option used as a filler
   - Comes before another option to help align it on a four-byte (32-bit word) boundary

 * Multiple-byte options
 * Maximum Segment Size
   - Size of the biggest unit of data that can be received by the destination of the TCP segment
   - Defines the maximum size of the data, not the maximum size of the segment
   - Field is 16 bits long, the value can be 0 to 65,535 bytes
   - Default value is 536 bytes
   - Value is determined during connection establishment (1st & 2nd Packet)
   - Does not change during the connection

 * Window Scale Factor
   - Window size field is 16 bits long so window can range from 0 to 65,535 bytes
   - To increase the window size beyond this limit, WSF is used
   - [New Window Size] = [Window Size in Header] × 2^WSF
   - Value can be determined only during connection establishment
   - Does not change during the connection
   - During data transfer, the size of the window (specified in the header) may be changed, but must be multiplied by same WSF
   - One end may set the value of the window scale factor to 0, which means it supports this option but does not want to use it for this connection.
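
The window scaling formula above is a left shift of the 16-bit header field. A minimal sketch; the function name is hypothetical, and the upper limit of 14 for the scale factor comes from RFC 7323, not from these notes.

```python
def scaled_window(window_field: int, wsf: int) -> int:
    """Effective window in bytes = header window field x 2^WSF."""
    if not 0 <= window_field <= 0xFFFF:
        raise ValueError("window field is 16 bits")
    if not 0 <= wsf <= 14:               # limit taken from RFC 7323 (assumption here)
        raise ValueError("window scale factor must be 0..14")
    return window_field << wsf


print(scaled_window(65535, 0))   # option supported but not used
print(scaled_window(32768, 3))   # 32768 x 2^3
```
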

 * Timestamp
   - Used to measure RTT
   - Protection against wrapped sequence numbers

 * SACK-permitted
   - Determined during connection establishment only (1st and 2nd Packet)

 * SACK
   - Allows the sender to know which segments are actually lost and which have arrived out of order
   - Sender can then send only those segments that are really lost
   - Option includes a list for blocks arriving out of order
   - Each block occupies two 32-bit numbers
   - SACK option cannot define more than 4 blocks
   - The information for 5 blocks occupies (5 × 2) × 4 + 2 or 42 bytes
   - Allowed size of an option in TCP is only 40 bytes
   - The first block of the SACK option can be used to report the duplicates
   - The SACK option announces this duplicate data first and then the out-of-order block
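
The size arithmetic behind the 4-block limit can be checked with a short sketch. The function names are hypothetical; the 2 extra bytes are the option's kind and length fields.

```python
def sack_option_length(blocks: int) -> int:
    """Bytes used by a SACK option: 2 header bytes + two 32-bit numbers per block."""
    return 2 + blocks * 2 * 4


def max_sack_blocks(option_space: int = 40) -> int:
    """Largest number of blocks that fits in the given option space."""
    return (option_space - 2) // 8


print(sack_option_length(5))   # exceeds the 40 bytes available for options
print(sack_option_length(4))
print(max_sack_blocks())
```
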

Source: packetlife.net

       

Packet Capture: [[Media:TCP_SACK.cap|TCP SACK Sample Capture]]

PUSH vs URG Flags
Source: Packetlife.net

PSH Flag

 * Buffers are implemented on both sides of a TCP connection in both directions
 * Buffers allow for more efficient transfer of data when sending more than one MSS of data
 * Large buffers do more harm than good when dealing with real-time applications
 * For a Telnet session, if TCP waited until there was enough data to fill a packet before sending one, a thousand characters would be required before the first packet made it to the remote device
 * An application can write to the socket with the option of "pushing" data out immediately, rather than waiting for additional data to enter the buffer
 * The PSH flag in the outgoing TCP packet is then set to 1
 * Upon receiving a packet with the PSH flag set, the other side immediately forwards the data to the application



URG Flag

 * RFC 6093 (Proposed Standard) recommends that new applications not use the URG flag
 * The URG flag is used to inform a receiving station that certain data within a segment is urgent and should be prioritized
 * If the URG flag is set, receiver checks the urgent pointer in TCP header
 * This pointer indicates how much of the data in the segment, counting from the first byte, is urgent
 * If the data size is 100 bytes and only first 50 bytes is urgent, the urgent pointer will have a value of 50
 * The URG flag isn't employed much by modern protocols

Capture file: [[Media:telnet.cap|Telnet PCAP]]


 * The 0xFF character sent in packet #86 precedes the Telnet command 0xF2 (242) in packet #70 denoting a data mark.
 * Per RFC 854, this command should be sent with the TCP URG flag set.
 * The urgent pointer in packet #68 indicates that the first byte of the segment (which in this case is the entire segment) should be considered urgent data.

MTU vs MSS



 * The default TCP Maximum Segment Size is 536 bytes.
 * To use higher value, MSS is specified as a TCP option initially in the TCP SYN packet during the TCP handshake.
 * The value cannot be changed after the connection is established.
 * Each direction of data flow can use a different MSS.
 * Small MSS values will reduce or eliminate IP fragmentation, but will result in higher overhead.
 * For most computer users, the MSS option is established by the operating system.

= ECN =


 * Explicit Congestion Notification (ECN) is an extension to the Internet Protocol and to the Transmission Control Protocol.
 * ECN allows end-to-end notification of network congestion without dropping packets.
 * ECN is an optional feature that may be used between two ECN-enabled endpoints when the underlying network infrastructure also supports it.
 * Conventionally, TCP/IP networks signal congestion by dropping packets.
 * When ECN is successfully negotiated, an ECN-aware router may set a mark in the IP header instead of dropping a packet in order to signal impending congestion.
 * The receiver of the packet echoes the congestion indication to the sender, which reduces its transmission rate as if it detected a dropped packet.


 * ECN requires specific support at the Internet layer and the transport layer for the following reasons:
 * In TCP/IP, routers operate within the Internet layer, while the transmission rate is handled by the endpoints at the transport layer.
 * Congestion may be handled only by the transmitter, but since it is known to have happened only after a packet was sent, there must be an echo of the congestion indication by the receiver to the transmitter.
 * Without ECN, congestion indication echo is achieved indirectly by the detection of lost packets.
 * With ECN, the congestion is indicated by setting the ECN field within an IP packet to CE and is echoed back by the receiver to the transmitter by setting proper bits in the header of the transport protocol.
 * For example, when using TCP, the congestion indication is echoed back by setting the ECE bit.

Operation of ECN with IP

 * ECN uses the two least significant (right-most) bits of the DiffServ field in the IPv4 or IPv6 header to encode four different codepoints:
   - 00 – Non ECN-Capable Transport, Non-ECT
   - 10 – ECN Capable Transport, ECT(0)
   - 01 – ECN Capable Transport, ECT(1)
   - 11 – Congestion Encountered, CE.
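
A small sketch of reading and setting these codepoints on the 8-bit field that holds DSCP plus ECN. The function names are assumptions for illustration; the marking function only models the decision for ECN-capable packets, while a real router would drop a Non-ECT packet instead.

```python
ECN_CODEPOINTS = {0b00: "Non-ECT", 0b10: "ECT(0)", 0b01: "ECT(1)", 0b11: "CE"}


def ecn_codepoint(ds_byte: int) -> str:
    """Name of the ECN codepoint in the two right-most bits of the byte."""
    return ECN_CODEPOINTS[ds_byte & 0b11]


def mark_congestion(ds_byte: int) -> int:
    """Set CE on an ECN-capable packet; leave a Non-ECT packet unchanged."""
    if ds_byte & 0b11 == 0b00:
        return ds_byte            # not ECN-capable: marking is not allowed
    return ds_byte | 0b11


print(ecn_codepoint(0x02))
print(ecn_codepoint(mark_congestion(0x02)))
```
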


 * When both endpoints support ECN they mark their packets with ECT(0) or ECT(1).
 * If the packet traverses an active queue management (AQM) queue (e.g., a queue that uses random early detection (RED)) that is experiencing congestion and the corresponding router supports ECN, it may change the codepoint to CE instead of dropping the packet.
 * This act is referred to as “marking” and its purpose is to inform the receiving endpoint of impending congestion.
 * At the receiving endpoint, this congestion indication is handled by the upper layer protocol (transport layer protocol) and needs to be echoed back to the transmitting node in order to signal it to reduce its transmission rate.


 * Because the CE indication can only be handled effectively by an upper layer protocol that supports it, ECN is only used in conjunction with upper layer protocols, such as TCP, that support congestion control and have a method for echoing the CE indication to the transmitting endpoint.

Operation of ECN with TCP


 * TCP supports ECN using three flags in the TCP header.
 * The first one, the Nonce Sum (NS), is used to protect against accidental or malicious concealment of marked packets from the TCP sender.
 * The other two bits are used to echo back the congestion indication (i.e. signal the sender to reduce the amount of information it sends) and to acknowledge that the congestion-indication echoing was received.
 * These are the ECN-Echo (ECE) and Congestion Window Reduced (CWR) bits.


 * Use of ECN on a TCP connection is optional; for ECN to be used, it must be negotiated at connection establishment by including suitable options in the SYN and SYN-ACK segments.


 * When ECN has been negotiated on a TCP connection, the sender indicates that IP packets that carry TCP segments of that connection are carrying traffic from an ECN Capable Transport by marking them with an ECT codepoint.
 * This allows intermediate routers that support ECN to mark those IP packets with the CE codepoint instead of dropping them in order to signal impending congestion.


 * Upon receiving an IP packet with the Congestion Experienced codepoint, the TCP receiver echoes back this congestion indication using the ECE flag in the TCP header.
 * When an endpoint receives a TCP segment with the ECE bit it reduces its congestion window as for a packet drop.
 * It then acknowledges the congestion indication by sending a segment with the CWR bit set.


 * A node keeps transmitting TCP segments with the ECE bit set until it receives a segment with the CWR bit set.

ECN support in IP by routers


 * Since ECN marking in routers is dependent on some form of active queue management, routers must be configured with a suitable queue discipline in order to perform ECN marking.
 * Cisco IOS routers perform ECN marking if configured with the WRED queuing discipline since version 12.2(8)T.
 * Linux routers perform ECN marking if configured with one of the RED or GRED queue disciplines with an explicit ecn parameter, by using the sfb discipline, or by using the CoDel Fair Queueing (fq_codel) discipline.

= Misc =

Some misc information to be remembered regarding TCP/IP is below.

Study of a packet header
This is an IP header from an IP packet received at the destination:

4500 003c 1c46 4000 4006 b1e6 ac10 0a63 ac10 0a0c
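
The header above can be decoded field by field with a short script. This is a sketch using only the Python standard library; the function name and the dictionary keys are illustrative choices. The checksum test relies on the fact that the one's-complement sum of all header words, checksum included, is 0xFFFF for an intact header.

```python
import ipaddress
import struct


def parse_ipv4_header(hex_words: str) -> dict:
    raw = bytes.fromhex(hex_words.replace(" ", ""))
    (ver_ihl, tos, total_len, ident, flags_frag,
     ttl, proto, checksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", raw[:20])
    # One's-complement sum over every 16-bit word of the header
    total = sum(struct.unpack("!%dH" % (len(raw) // 2), raw))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return {
        "version": ver_ihl >> 4,
        "header_bytes": (ver_ihl & 0x0F) * 4,
        "total_length": total_len,
        "identification": ident,
        "DF": bool(flags_frag & 0x4000),
        "MF": bool(flags_frag & 0x2000),
        "fragment_byte_offset": (flags_frag & 0x1FFF) * 8,
        "ttl": ttl,
        "protocol": proto,                      # 6 = TCP
        "source": str(ipaddress.IPv4Address(src)),
        "destination": str(ipaddress.IPv4Address(dst)),
        "checksum_ok": total == 0xFFFF,
    }


header = parse_ipv4_header("4500 003c 1c46 4000 4006 b1e6 ac10 0a63 ac10 0a0c")
for field, value in header.items():
    print(field, "=", value)
```

For this header the script reports an IPv4 packet with a 20-byte header, total length 60, DF set, TTL 64, protocol 6 (TCP), source 172.16.10.99, destination 172.16.10.12, and a valid checksum.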


 * SNMP Connections:

