Dump

__NOINDEX__

R&S Quick Notes – IGP
RIP

 * Know your filters: offset-list, distribute-list, and the distance command.
 * With filters, read carefully: "between 25 & 45" or "from 25 to 45".
 * Know your prefix-lists, or alternatively how to use ACLs instead.
 * The "passive-interface" command ONLY stops the sending of updates out the interface. The interface will still receive and process updates.
 * Passive interfaces will still be advertised in other updates.

EIGRP

 * Advertising a default route out one interface: "ip summary-address eigrp [AD] 0.0.0.0 0.0.0.0"
 * To see if a neighbor is configured as STUB, use "show ip eigrp neighbors [detail]" and look for 'CONNECTED SUMMARY'.
 * On frame-relay multipoint interfaces, don't forget to disable split-horizon.
 * The AD of external EIGRP routes (admin distance = 170) can NOT be changed on a per-prefix basis.

Metric weight values:

 * 1 0 1 0 0 = Default
 * 0 0 1 0 0 = Only DLY
 * 1 0 0 0 0 = Only BW
 * 3 0 1 0 0 = BW has 3 times more weight than DLY

Metric formula:

Metric = ((10^7 / BW) + (DLY / 10)) * 256
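As a rough check of the formula above, here is a minimal Python sketch. It assumes (the notes do not say so explicitly) that BW is the lowest path bandwidth in kbps, DLY is the cumulative delay in microseconds, the default K values are in use, and each term is truncated to an integer. The function name is made up for illustration.

```python
def eigrp_metric(min_bw_kbps: int, total_delay_usec: int) -> int:
    """Classic EIGRP composite metric with default K values (K1 = K3 = 1)."""
    bw_term = 10**7 // min_bw_kbps      # scaled inverse bandwidth, truncated
    dly_term = total_delay_usec // 10   # delay in tens of microseconds
    return (bw_term + dly_term) * 256


# Example: a T1 serial link (1544 kbps, 20000 usec delay)
print(eigrp_metric(1544, 20000))   # 2169856
# Example: a FastEthernet link (100000 kbps, 100 usec delay)
print(eigrp_metric(100000, 100))   # 28160
```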

IPv6

 * RIPng: "no ip split-horizon" is a process command, not an interface command.
 * EIGRPv6: Do not forget to enable EIGRP under the process.
 * IPv6 tunnel method with least overhead: IPv6IP
 * Tunnel protocol numbers for ACLs: IPv6IP = Protocol-41, GRE IPv6 = Protocol-47
 * You cannot redistribute a default static route (::/0) with OSPFv3.
 * Dynamic information (i.e. IGP next-hops) recurses to the remote link-local address, not the global unicast interface address.

Windows Script Backup ScreenOS
@echo off
REM ================================================================
REM ===This Script may give following error:
REM ===FATAL ERROR: Network error: Connection timed out ==> Check IP addresses
REM ===FATAL ERROR: Network error: Connection refused ==> Check SSH Parameters on Firewall
REM ===WARNING - POTENTIAL SECURITY BREACH! ==> SSH Public Keys changed/recalculated
REM ===Access denied ==> Password wrong
REM ================================================================
REM ===No of times "Access denied" message appears ==> no of wrong firewalls with wrong pwds
REM ================================================================
REM ===Configurable Parameters
REM ================================================================
set username=aman
set CFGFILE=BackupList.txt
set DESTDIR=Backups\
REM ================================================================
REM ===Script code starts here
REM ================================================================
SET TIMESTAMP=%date:~-4,4%.%date:~-7,2%.%date:~-10,2%
for /F "tokens=1,2,3 delims=," %%A in (%CFGFILE%) do (
  IF NOT EXIST "%DESTDIR%%TIMESTAMP%" mkdir "%DESTDIR%%TIMESTAMP%"
  plink -ssh -C -batch -pw %%C %username%@%%B get config > "%DESTDIR%%TIMESTAMP%\%%A.cfg"
)
echo Backup completed

"BackupList.txt" file:
R1,192.168.1.3,cisco
R2,192.168.1.4,cisco2

Plink needs to be downloaded separately for this script to work.
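To make the loop in the batch script easier to follow, here is a small Python sketch of the same logic: it parses the "name,ip,password" lines and builds the plink command and output path for each firewall. It only builds the commands and does not run them. The function name and the tuple layout are illustrative assumptions; the plink options are the ones used in the script above.

```python
from pathlib import PureWindowsPath


def build_backup_jobs(list_text, username, dest_dir, timestamp):
    """Return (command, output_file) pairs, one per line of the backup list."""
    jobs = []
    for line in list_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        name, ip, password = (field.strip() for field in line.split(",", 2))
        command = ["plink", "-ssh", "-C", "-batch", "-pw", password,
                   f"{username}@{ip}", "get", "config"]
        outfile = PureWindowsPath(dest_dir) / timestamp / f"{name}.cfg"
        jobs.append((command, str(outfile)))
    return jobs


backup_list = "R1,192.168.1.3,cisco\nR2,192.168.1.4,cisco2\n"
for command, outfile in build_backup_jobs(backup_list, "aman", "Backups\\", "2024.01.31"):
    print(" ".join(command), ">", outfile)
```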

HA Best Practices
Basic:

1. The two firewalls should have the same hardware (model, modules, RAM, ports, etc.).
2. Firmware should be exactly the same, i.e. major, minor as well as patch version.
3. Licenses & features on both firewalls should be the same (Basic, Advanced, AV, DI, AS, Web filtering, etc.).
4. A firewall with an expired license should not be put in a cluster with a firewall with no license, as this may cause them to go out of sync and to have different amounts of free memory.
5. It is recommended to configure the cluster with 2 dedicated HA links.
6. VSD Group should be 0. If it is not 0, interfaces need to be assigned to that VSD Group on both firewalls.
7. Console access is always recommended before configuring, implementing or troubleshooting NSRP issues.
8. Hostnames of the firewalls should be different to differentiate between devices.

Preempt:

1. Preempt should be enabled.
2. Hold-down timer should have a higher value (~120-180 seconds) to prevent NSRP failover flapping.
3. Preempt need not be configured on the backup device.
4. It should not be configured in environments with dynamic routing protocols due to protocol re-convergence.
5. The priority of the preferred backup should be a higher value, as the lower priority takes precedence.

Interface Monitoring:

1.	Only add critical interfaces to Monitoring to avoid unnecessary failovers/preempts.

Track-IP:

1. Track-IP is necessary to achieve a successful failover when the primary Juniper firewall stops passing traffic but the monitored interfaces remain up, a case that interface monitoring alone does not detect.
2. Determine one or more hosts that can reliably respond to ICMP/ARP traffic.

Master-Always-Exist:

1. With NSRP monitoring enabled, both NSRP peers can become 'Inoperable'. Enabling the master-always-exist option will ensure that the cluster remains available.
2. Run the command "set nsrp vsd-group master-always-exist" only on Master & it will sync to Backup automatically.

Secondary-path:

1. To avoid Split Brain, configure NSRP with 2 dedicated HA links.
2. The secondary-path option allows NSRP to poll the peer via an alternate, non-dedicated interface. The purpose is only to prevent a split-brain scenario, so NSRP sync data is not carried across this link, only heartbeat messages.

RTO Sync:

1. Backup Session Timeout Acknowledge should be enabled.
2. Route Synchronization should not be used unless a Dynamic Routing Protocol is running.

HA probe:

1. HA probe must be enabled if the HA links are connected through a layer 2 switch.
2. It should NOT be used if they are directly connected.
3. Duplex settings on the switch and firewall interfaces should match.

Authentication & Encryption password:

1. Use NSRP Authentication & Encryption if the HA cables connect through a layer 2 switch.
2. No need to use Authentication & Encryption if they are directly connected.

Misc:

1. While adding a secondary firewall to a cluster, an interface-based default route such as "set interface gateway " will result in loss of communication, as the interface will become inactive. Add a regular default route before proceeding.
2. A duplicate MAC address is seen when 2 sets of NSRP clusters with the same Cluster ID and VSD-Group are attached to the same switch or are in the same broadcast domain. Changing the Cluster ID or VSD group number will resolve the issue.

KBs Referred:

http://kb.juniper.net/KB9311 http://kb.juniper.net/KB9309 http://kb.juniper.net/KB11432

SRX Import config
1. Paste a whole configuration block:
load merge|replace terminal
(paste the configuration in curly-brace format: {}{}{}{})
Ctrl+D
commit

2. Enter the commands one by one:
edit
set...
set...
commit

Juniper SRX Firewalls
run = used in configuration mode to execute operational mode commands

//Show Routes
show route brief
show route best x.x.x.x
set routing-options static route 10.2.2.0/24 next-hop 10.1.1.254

//Forwarding Table
run show route forwarding-table destination x.x.x.x/24

//TraceOptions settings
root@fw1# show security flow | display set
set security flow traceoptions file matt_trace
set security flow traceoptions file files 3
set security flow traceoptions file size 100000
set security flow traceoptions flag basic-datapath
set security flow traceoptions packet-filter f0 source-prefix 10.0.0.1/32 destination-prefix 200.1.2.3/32
set security flow traceoptions packet-filter f1 source-prefix 10.0.0.1/32 destination-prefix 200.1.2.3/32
activate security flow traceoptions
commit
monitor start matt_trace
monitor list

!! Kill the capture
monitor stop
clear log            !! Clear the log file
delete security flow traceoptions
commit
file delete 

//Show Traceoptions
show security flow session source-prefix 10.124.80.42 destination-prefix 117.1.1.25
start shell

egrep 'matched filter|(ge|fe|reth)-.*->.*|session found|create session|dst_xlate|routed|search|denied|src_xlate|outgoing phy if' /var/log/matt_trace | sed -e 's/.*RT://g' | sed -e 's/tcp, flag 2 syn/--TCP SYN--/g' | sed -e 's/tcp, flag 12 syn ack/--TCP SYN\/ACK--/g' | sed -e 's/tcp, flag 10/--TCP ACK--/g' | sed -e 's/tcp, flag 4 rst/--TCP RST--/g' | sed -e 's/tcp, flag 14 rst/--TCP RST\/ACK--/g' | sed -e 's/tcp, flag 18/--TCP PUSH\/ACK--/g' | sed -e 's/tcp, flag 11 fin/--TCP FIN\/ACK--/g' | sed -e 's/tcp, flag 5/--TCP FIN\/RST--/g' | sed -e 's/icmp, (0\/0)/--ICMP Echo Reply--/g' | sed -e 's/icmp, (8\/0)/--ICMP Echo Request--/g' | sed -e 's/icmp, (3\/0)/--ICMP Destination Unreachable--/g' | sed -e 's/icmp, (11\/0)/--ICMP Time Exceeded--/g' | awk '/matched/ {print "\n\t\t\t=== PACKET START ==="}; {print};'

//Show Sessions
run show security flow session destination-prefix x.x.x.x

//Match Policy
run show security match-policies from-zone zonea to-zone zoneb source-ip x.x.x.x destination-ip x.x.x.x protocol tcp source-port 1024 destination-port xx

//Check for Block Group
show security policies from-zone untrust to-zone trust | display set | grep deny

//Find Syntax for an Existing Command
show | display set | xxxxxxxxx

//VPN Troubleshooting
show security ike security-associations [index ] [detail]
show security ipsec security-associations [index ] [detail]
show security ipsec statistics [index ]

//VPN
//Set proxy IDs for a route based tunnel
set security ipsec vpn vpn-name ike proxy-identity local 10.0.0.0/8 remote 192.168.1.0/24 service any

//Packet Capture
set security datapath-debug capture-file my-capture
set security datapath-debug capture-file format pcap
set security datapath-debug capture-file size 1m
set security datapath-debug capture-file files 5
set security datapath-debug maximum-capture-size 400
set security datapath-debug action-profile do-capture event np-ingress packet-dump
set security datapath-debug packet-filter my-filter action-profile do-capture
set security datapath-debug packet-filter my-filter source-prefix 1.2.3.4/32

//Super SRX Packet Capture Filter
egrep 'matched filter|(ge|fe|reth)-.*->.*|session found|Session \(id|session id|create|dst_nat|chose interface|dst_xlate|routed|search|denied|src_xlate|dip id|outgoing phy if|route to|DEST|post' /var/log/mchtrace | uniq | sed -e 's/.*RT://g' | awk '/matched/ {print "\n\t\t\t=== PACKET START ==="} ; {print} ;' | awk '/^$/ {print "\t\t\t=== PACKET END ==="}; {print};' ; echo | awk '/^$/ {print "\t\t\t=== PACKET END ==="}; {print};'

// Policy commands

show | display set (shows policy)
set system syslog
set security log
set interfaces ge-0/0/3 gigether-options auto-negotiation (redundant-parent)
set security policies from-zone xxx to-zone xxx policy policy_name match
set security zones security-zone untrust address-book address
set security nat source rule-set zone-to-zone rule rule-source-nat match source-address 10.0.0.0
set routing-instances
set applications

set security ike proposal
set security ike policy
set security ike gateway
set security ipsec proposal
set security ipsec policy
set security ipsec vpn

show | compare
commit check
commit comment "ticket#2222" and-quit

set security policies from-zone dmz to-zone trust policy 12 match source-address h_10.124.0.1 destination-address h_1.2.3.4 application tcp_22
set security policies from-zone dmz to-zone trust policy 12 then permit
set security policies from-zone dmz to-zone trust policy 12 then log session-init session-close

+        match {
+            source-address h_10.124.0.1;
+            destination-address h_1.2.3.4;
+            application tcp_22;
+        }
+        then {
+            permit;
+            log {
+                session-init;
+                session-close;
+            }
+        }
+    }

Various:
show system uptime	Uptime
show version	Version of platform (host/model)
show chassis firmware	Firmware loaded on FPCs
show system software detail
show chassis routing-engine	CPU, Memory for Routing-Engine
show chassis fan	Speed and status of fans
show chassis environment	Temperature status of components
show chassis hardware detail	Hardware inventory (backplane)
show system core-dumps	Core-dumps
show system alarms	System alarms
show chassis alarms	Alarms for hardware and chassis
show system boot-messages	Logs from boot sequence
show log chassisd	Logs for SRX chassis (Cards)
show log messages	Recent system messages
show configuration security log	Syslog configuration
show system buffers	Utilization of memory buffers
show system virtual-memory	Virtual memory utilization
show system processes	Processes running on system
show security idp memory	IDP memory statistics
show security monitoring performance session	Session counts on each FPC

MIP in a policy-based VPN
KB9924

This work-around is for configuring a Mapped Internet Protocol (MIP) address in a policy-based VPN. MIPs are typically created on tunnel interfaces in a route-based VPN, so the workaround applies when the customer requirement does not allow for a route-based VPN.

Customer requirements:

 * A site-to-site VPN tunnel between a Juniper firewall and a Cisco.
 * The Cisco Peer IP address and the Remote subnet must use the same Public IP address.
 * MIPs need to be configured for the servers behind the Juniper Firewall.

For these requirements, a route-based VPN on the Juniper firewall is not an option because a route is needed to the remote network pointing to the tunnel interface. If the peer IP and remote IP addresses are the same for both devices, the IKE negotiation can not be established. A policy-based VPN can be configured for this design, since only a default route is needed and then a policy can be used to determine the VPN. On the Juniper firewall, a MIP needs to be configured for the servers on the private network, which need to be accessed via a VPN from the Cisco site. However, MIPs are not directly supported in policy-based VPN.

If the outgoing interface is in a zone other than Untrust (for example, zone is ISP), follow KB27122- [ScreenOS] How to configure a MIP in a policy based VPN when outgoing interface is in zone other than Untrust

Untrust-Tun is the Tunnel type zone, the carrier zone that helps encryption-decryption:
set interface tunnel.1 zone Untrust-Tun

Fixed IP on the tunnel interface:
set interface tunnel.1 ip 4.4.4.10/24

MIP will be used by the cisco-remote network to connect to the server behind the Juniper firewall's local network:
set interface tunnel.1 mip 4.4.4.11 host 20.20.20.5 netmask 255.255.255.255

A route needs to be added to send the traffic to the tunnel interface:
set route 25.34.5.7 interface tunnel.1

Phase 1 configuration:
set ike gateway Netscreen-Cisco-IKE address 25.34.5.7 main outgoing-interface ethernet4 preshare test sec-level standard

Phase 2 configuration:
set vpn Netscreen-Cisco-VPN gateway Netscreen-Cisco-IKE sec-level standard

Bind Tunnel Zone (Juniper firewall will recognize the MIP configured on the tunnel interface):
set vpn Netscreen-Cisco-VPN bind zone Untrust-Tun

Then an appropriate access-list needs to be configured on the Cisco end to support the Proxy-IDs generated by the policies in the Juniper firewall:
set policy from untrust to trust 2.2.2.2/32 MIP (4.4.4.11) any tunnel vpn Netscreen-Cisco-VPN log
set policy from trust to untrust 20.20.20.5/32 2.2.2.2/32 any tunnel vpn Netscreen-Cisco-VPN log

Note: The MIP will work in only one direction. If traffic needs to be initiated from the Netscreen Trust zone over the tunnel and that traffic must use NAT, then a DIP is required, and the DIP cannot use the same IP as the MIP. This is a limitation. If a bi-directional MIP is required a route based VPN must be used.

Workaround if outgoing is other than Untrust Zone
If the outgoing interface is in a zone other than Untrust (for example, zone is ISP) proceed with following:

ISP is the zone for outgoing interface ethernet0/2:
set zone "ISP"
set interface ethernet0/2 zone "ISP"
set interface ethernet0/2 ip 1.1.1.1/24

ISP-Tun zone is the carrier zone for the tunnel for NAT-ing:
set zone "ISP-Tun" tunnel ISP

ISP-Tun is the Tunnel type zone, the carrier zone that helps encryption-decryption:
set interface tunnel.1 zone ISP-Tun

Fixed IP on the tunnel interface:
set interface tunnel.1 ip 4.4.4.10/24

MIP will be used by the remote network to connect to the server behind the ScreenOS firewall's local network:
set interface tunnel.1 mip 4.4.4.11 host 20.20.20.5 netmask 255.255.255.255

A route needs to be added to send the traffic to the tunnel interface, for the translation to take place:
set route 6.7.8.9/32 interface tunnel.1

Phase 1 configuration:
set ike gateway Netscreen-IKE address 2.2.2.2 main outgoing-interface ethernet0/2 preshare test sec-level standard

Phase 2 configuration:
set vpn Netscreen-VPN gateway Netscreen-IKE sec-level standard

Bind Tunnel Zone (ScreenOS firewall will identify the MIP configured on the tunnel interface):
set vpn Netscreen-VPN bind zone ISP-Tun

Then an appropriate access-list must be configured on the remote end to support the Proxy-IDs generated by the policies in the ScreenOS firewall:
set policy from ISP to trust 6.7.8.9/32 MIP (4.4.4.11) any tunnel vpn Netscreen-VPN log
set policy from trust to ISP 20.20.20.5/32 6.7.8.9/32 any tunnel vpn Netscreen-VPN log

get sa detail
CORPORATE-> get sa
total configured sa: 1
HEX ID    Gateway  Port Algorithm     SPI      Life:sec kb    Sta PID vsys
00000001< 2.2.2.2  500  esp:3des/sha1 c2e1f0e4 3296     unlim A/- -1  0
00000001> 2.2.2.2  500  esp:3des/sha1 74098e47 3296     unlim A/- -1  0

We can see that the remote peer is 2.2.2.2. The State shows A/-. The possible states are below:

 * I/I: SA is Inactive, VPN is currently not connected
 * A/-: SA is Active, VPN monitoring is not enabled
 * A/D: SA is Active, VPN monitoring is enabled but failing, thus DOWN
 * A/U: SA is Active, VPN monitoring is enabled and UP
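The state codes above lend themselves to a simple lookup. The following Python sketch pulls the state column out of one "get sa" output line and describes it. It assumes the column layout shown in the sample output (state in the eighth whitespace-separated field); the function and dictionary names are illustrative only.

```python
SA_STATES = {
    "I/I": "SA inactive, VPN not connected",
    "A/-": "SA active, VPN monitoring not enabled",
    "A/D": "SA active, VPN monitoring enabled but failing (DOWN)",
    "A/U": "SA active, VPN monitoring enabled and UP",
}


def describe_sa_line(line: str) -> str:
    """Describe the 'Sta' column of a single SA line from 'get sa'."""
    fields = line.split()
    code = fields[7]  # assumed position of the Sta column
    return SA_STATES.get(code, f"unknown state {code}")


sample = "00000001< 2.2.2.2 500 esp:3des/sha1 c2e1f0e4 3296 unlim A/- -1 0"
print(describe_sa_line(sample))
```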

Gateway IP address for Next Hop
Why is it necessary to specify 'Gateway IP address for Next Hop' during the configuration of static default route?

 * Scenario I: Next-hop gateway IP address is not specified in the static default route.

SSG-> set route 0.0.0.0/0 int eth0/1

SSG-> get db st

route to 4.2.2.2
cached arp entry with MAC 000000000000 for 4.2.2.2
add arp entry with MAC 000000000000 for 4.2.2.2 to cache table
wait for arp rsp for 4.2.2.2
ifp2 ethernet0/1, out_ifp ethernet0/1, flag 10000e00, tunnel ffffffff, rc 0
outgoing wing prepared, not ready

SSG-> get route | i 4.2.2.2
* 16 0.0.0.0/0 eth0/1 0.0.0.0 S 20 1 Root

Because the next-hop IP address is not specified in the default route, the firewall is doing an ARP for 4.2.2.2.

When the firewall needs to forward a packet via the default route, it needs the MAC address of the default router in order to build the frame to forward the packet.

The reason for the failure is that the firewall is waiting for an ARP response from 4.2.2.2, as if it were on a connected segment. This is indicated by the line 'wait for arp rsp for 4.2.2.2'; the response never arrives.

It then drops the packet with the message 'outgoing wing prepared, not ready', which indicates that there is no ARP response.

 * Scenario II: Next-hop gateway IP address is specified in the static default route.

SSG-> set route 0.0.0.0/0 int eth0/1 gateway 1.1.1.2

SSG-> get db st

route to 1.1.1.2
cached arp entry with MAC 000000000000 for 1.1.1.2
add arp entry with MAC 002688e8c305 for 1.1.1.2 to cache table
arp entry found for 1.1.1.2
ifp2 ethernet0/1, out_ifp ethernet0/1, flag 10800e00, tunnel ffffffff, rc 1
outgoing wing prepared, ready

SSG-> get route | i 4.2.2.2
* 15 0.0.0.0/0 eth0/1 1.1.1.2 S 20 1 Root

In this scenario, the firewall found the MAC address for the next-hop gateway (ISP router with IP 1.1.1.2) in its ARP table.

It was then able to build the frame and forward the packet to the ISP router, which in turn routed the packet to its next hop, until the packet reached the destination IP 4.2.2.2.
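The difference between the two scenarios comes down to which IP address the firewall sends an ARP request for. A minimal Python sketch of that decision follows; it is a simplification for illustration, not ScreenOS code, and the function name is hypothetical.

```python
from typing import Optional


def arp_target(destination: str, gateway: Optional[str]) -> str:
    """Pick the IP whose MAC address is needed to build the outgoing frame."""
    if gateway and gateway != "0.0.0.0":
        return gateway       # Scenario II: ARP for the next-hop router
    return destination       # Scenario I: ARP for the final destination itself


print(arp_target("4.2.2.2", None))       # 4.2.2.2 (never answers on this segment)
print(arp_target("4.2.2.2", "1.1.1.2"))  # 1.1.1.2 (the ISP router answers)
```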

SRX Stuck on old technology
The SRX uses stateful inspection which relies on port and protocol for policy decisions, a technique that is ineffective at controlling applications that use dynamic ports, encryption, or tunnel across often used/allowed ports to bypass firewalls.

Stateful Inspection
This solution allows calls to come from any port on an inside machine, and will direct them to port 25 on the outside.

So why is it wrong?

Our defined restriction is based solely on the outside host's port number, which we have no way of controlling. Now an enemy can access any internal machine and port by originating his call from port 25 on the outside machine.

What can be a better solution?

The ACK signifies that the packet is part of an ongoing conversation. Packets without the ACK are connection establishment messages, which we are only permitting from internal hosts.

Sub interface number
The maximum permissible sub interface number on the Juniper SSG140 firewall is 100. The firewall will accept a number in the range of 1-100 only. Sub interface names in Juniper Netscreen firewalls look like eth0/1.50 or eth0/2.100. A name like eth0/2.101 or eth0/2.200 will not be accepted.

Window size smaller than MTU
If the window size is smaller than the MTU, packet retransmissions will occur. This is an application issue: the receive buffer is smaller than the packets being received.



= Certificates =

A session symmetric key between two parties is used only once.

The symmetric (shared) key in the Diffie-Hellman method is K = g^(xy) mod p.

In public-key cryptography, everyone has access to everyone’s public key; public keys are available to the public.

Our example uses small numbers, but note that in a real situation, the numbers are very large. Assume that g = 7 and p = 23. The steps are as follows:

1. Alice chooses x = 3 and calculates R1 = 7^3 mod 23 = 21.
2. Alice sends the number 21 to Bob.
3. Bob chooses y = 6 and calculates R2 = 7^6 mod 23 = 4.
4. Bob sends the number 4 to Alice.
5. Alice calculates the symmetric key K = 4^3 mod 23 = 18. Bob calculates the symmetric key K = 21^6 mod 23 = 18.

The value of K is the same for both Alice and Bob; g^(xy) mod p = 7^18 mod 23 = 18.
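The arithmetic in this example can be reproduced with Python's built-in three-argument pow(), which performs modular exponentiation. This is only a toy sketch with the small numbers from the example; the function name is made up for illustration.

```python
def dh_exchange(g: int, p: int, x: int, y: int):
    """Return (R1, R2, Alice's key, Bob's key) for a toy Diffie-Hellman run."""
    r1 = pow(g, x, p)         # Alice -> Bob
    r2 = pow(g, y, p)         # Bob -> Alice
    k_alice = pow(r2, x, p)   # Alice combines Bob's value with her secret x
    k_bob = pow(r1, y, p)     # Bob combines Alice's value with his secret y
    return r1, r2, k_alice, k_bob


print(dh_exchange(7, 23, 3, 6))  # (21, 4, 18, 18)
print(pow(7, 3 * 6, 23))         # 18, i.e. g^(xy) mod p
```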

Public Announcement: The naive approach is to announce public keys publicly. Bob can put his public key on his website or announce it in a local or national newspaper. When Alice needs to send a confidential message to Bob, she can obtain Bob’s public key from his site or from the newspaper, or even send a message to ask for it. This approach, however, is not secure; it is subject to forgery. For example, Eve could make such a public announcement. Before Bob can react, damage could be done. Eve can fool Alice into sending her a message that is intended for Bob. Eve could also sign a document with a corresponding forged private key and make everyone believe it was signed by Bob. The approach is also vulnerable if Alice directly requests Bob’s public key. Eve can intercept Bob’s response and substitute her own forged public key for Bob’s public key.

CSR has a Public Key.

CA signs it.

Certificate is a proof of public key.

Encrypt using public key & receiver decrypts using private key.

There are two types of certificate authorities (CAs), root CAs and intermediate CAs.

Certificate 1 - Issued To: example.com; Issued By: Intermediate CA 1 Certificate 2 - Issued To: Intermediate CA 1; Issued By: Intermediate CA 2 Certificate 3 - Issued To: Intermediate CA 2; Issued By: Intermediate CA 3 Certificate 4 - Issued To: Intermediate CA 3; Issued By: Root CA

Root CA certificates, on the other hand, are "Issued To" and "Issued By" themselves.
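The chain above can be walked mechanically: start at the end-entity certificate and follow "Issued By" until a certificate is issued to and by the same name. The Python sketch below uses plain dictionaries in place of real certificates and checks names only, not signatures; the data layout and function name are assumptions for illustration.

```python
def build_chain(leaf_name, certs):
    """Follow issued_by links from the leaf up to a self-issued root."""
    by_subject = {c["issued_to"]: c for c in certs}
    chain, seen = [], set()
    name = leaf_name
    while True:
        if name not in by_subject or name in seen:
            raise ValueError(f"broken chain at {name!r}")
        seen.add(name)
        cert = by_subject[name]
        chain.append(name)
        if cert["issued_by"] == name:   # root CA: issued to and by itself
            return chain
        name = cert["issued_by"]


certs = [
    {"issued_to": "example.com", "issued_by": "Intermediate CA 1"},
    {"issued_to": "Intermediate CA 1", "issued_by": "Intermediate CA 2"},
    {"issued_to": "Intermediate CA 2", "issued_by": "Intermediate CA 3"},
    {"issued_to": "Intermediate CA 3", "issued_by": "Root CA"},
    {"issued_to": "Root CA", "issued_by": "Root CA"},
]
print(" -> ".join(build_chain("example.com", certs)))
```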

For enhanced security purposes, most end user certificates today are issued by intermediate certificate authorities.

Installing an intermediate CA signed certificate on a web server or load balancer usually requires installing a bundle of certificates.

The CA will also provide a so-called intermediate CA file or chain certificate. It proves that your chosen CA is trusted by one of the root CAs. You will need the intermediate CA certificate as the 'chain' certificate in your clientssl profile.

A nonce is a number used once.

In an asymmetric key encryption scheme, anyone can encrypt messages using the public key, but only the holder of the paired private key can decrypt. Security depends on the secrecy of the private key.

In the Diffie–Hellman key exchange scheme, each party generates a public/private key pair and distributes the public key. After obtaining an authentic copy of each other's public keys, Alice and Bob can compute a shared secret offline. The shared secret can be used, for instance, as the key for a symmetric cipher.


 * Public-key encryption, in which a message is encrypted with a recipient's public key. The message cannot be decrypted by anyone who does not possess the matching private key, who is thus presumed to be the owner of that key and the person associated with the public key. This is used in an attempt to ensure confidentiality.


 * Digital signatures, in which a message is signed with the sender's private key and can be verified by anyone who has access to the sender's public key. This verification proves that the sender had access to the private key, and therefore is likely to be the person associated with the public key. This also ensures that the message has not been tampered with, as any manipulation of the message will result in changes to the encoded message digest, which otherwise remains unchanged between the sender and receiver.
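Both uses can be illustrated with textbook RSA on very small numbers. The sketch below is a toy, not a secure implementation: the key (n = 3233, e = 17, d = 2753) is a classroom example, there is no padding, and the "digest" is just a small integer standing in for a real message hash. All names are illustrative.

```python
N = 61 * 53   # 3233, toy modulus
E = 17        # public exponent
D = 2753      # private exponent, since 17 * 2753 = 1 mod (60 * 52)


def encrypt(message: int) -> int:
    return pow(message, E, N)      # anyone can do this with the public key


def decrypt(ciphertext: int) -> int:
    return pow(ciphertext, D, N)   # only the private-key holder can do this


def sign(digest: int) -> int:
    return pow(digest, D, N)       # made with the sender's private key


def verify(digest: int, signature: int) -> bool:
    return pow(signature, E, N) == digest   # checked with the public key


c = encrypt(65)
print(c, decrypt(c))                    # 2790 65
s = sign(123)
print(verify(123, s), verify(124, s))   # True False
```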

= TCP =

Source: TCP/IP Protocol-Suite, B.Forouzan


 * TCP uses the services of IP, a connectionless protocol, but itself is connection-oriented.
 * TCP uses the services of IP to deliver individual segments to the receiver, but it controls the connection itself.
 * If a segment is lost or corrupted, it is retransmitted. IP is unaware of this retransmission.
 * If a segment arrives out of order, TCP holds it until the missing segments arrive; IP is unaware of this reordering.
 * The sequence number of a segment is the number of the first byte it carries.
 * Together with the segment length, this tells us which segment carries which bytes.

TCP Connection
 * TCP transmits data in full-duplex mode.
 * When two TCPs in two machines are connected, they are able to send segments to each other simultaneously.
 * In TCP, connection-oriented transmission requires three phases: connection establishment, data transfer, and connection termination.

Three way handshake

 * The process starts with the server.
 * The server program tells its TCP that it is ready to accept a connection.
 * This request is called a passive open.
 * The client program issues a request for an active open.
 * A client that wishes to connect to an open server tells its TCP to connect to a particular server.
 * TCP can now start the three-way handshaking process




 * 1st Packet:
 * SYN segment is for synchronization of sequence numbers.
 * The client in our example chooses a random number as the first sequence number and sends this number to the server.
 * This sequence number is called the initial sequence number(ISN).
 * This segment does not contain an acknowledgment number.
 * It does not define the window size either; a window size definition makes sense only when a segment includes an acknowledgment.
 * The segment can also include some options - WS, MSS, SACK_PERM
 * Note that the SYN segment is a control segment and carries no data.
 * However, it consumes one sequence number.
 * When the data transfer starts, the ISN is incremented by 1.
 * We can say that the SYN segment carries no real data, but we can think of it as containing one imaginary byte.


 * 2nd Packet:
 * The server sends the second segment, a SYN + ACK segment with two flag bits set: SYN and ACK.
 * This segment has a dual purpose.
 * First, it is a SYN segment for communication in the other direction.
 * The server uses this segment to initialize a sequence number for numbering the bytes sent from the server to the client.
 * The server also acknowledges the receipt of the SYN segment from the client by setting the ACK flag and displaying the next sequence number it expects to receive from the client.
 * Because it contains an acknowledgment, it also needs to define the receive window size, rwnd (to be used by the client).


 * 3rd Packet:
 * The client sends the third segment which is just an ACK segment.
 * It acknowledges the receipt of the second segment with the ACK flag and acknowledgment number field.
 * Note that the sequence number in this segment is the client ISN plus one, because the SYN consumed one sequence number; the ACK segment itself does not consume any sequence numbers.
 * The client must also define the server window size.
 * The third segment usually does not carry data, in which case it consumes no sequence numbers.


 * Note:
 * A SYN cannot carry data, but it consumes one Sequence number.
 * A SYN+ACK cannot carry data, but consumes one Sequence number.
 * An ACK, if carrying no data, consumes no sequence number.
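The bookkeeping in these notes can be written down as a small rule: a segment consumes one sequence number per data byte, plus one for a SYN and one for a FIN. The Python sketch below applies that rule to the three-way handshake. The ISN values 8000 and 15000 are arbitrary examples, and the function name is made up for illustration.

```python
def seq_consumed(flags: set, data_len: int = 0) -> int:
    """Sequence numbers used up by one segment."""
    return data_len + ("SYN" in flags) + ("FIN" in flags)


client_next, server_next = 8000, 15000    # example ISNs

client_next += seq_consumed({"SYN"})          # 1st packet: SYN
server_next += seq_consumed({"SYN", "ACK"})   # 2nd packet: SYN+ACK
ack_from_server = client_next                 # server expects this byte next
client_next += seq_consumed({"ACK"})          # 3rd packet: pure ACK, no change
ack_from_client = server_next                 # client expects this byte next

print(client_next, server_next)           # 8001 15001
print(ack_from_server, ack_from_client)   # 8001 15001
```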

Simultaneous Open

 * A simultaneous open is the rare situation in which both processes issue an active open.
 * In this case, both TCPs transmit a SYN + ACK segment to each other.
 * Only one single connection is established between them.

SYN Flooding Attack

 * The TCP handshake is susceptible to a SYN flooding attack.
 * This happens when a malicious attacker sends a large number of SYN segments.
 * The server, assuming that the clients are issuing an active open, allocates the necessary resources and sets timers.
 * The TCP server then sends the SYN+ACK segments to the fake clients, which are lost.
 * When the server waits for the third packet, resources are allocated without being used.
 * If the number of SYN segments is large, the server eventually runs out of resources.
 * It may be unable to accept connection requests from valid clients.
 * The SYN flooding attack belongs to the group of denial-of-service attacks.
 * One strategy is to postpone resource allocation until the server can verify that the connection request is coming from a valid IP address, by using a Cookie.
 * SCTP uses this strategy.
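The cookie idea can be sketched as follows: instead of storing state when a SYN arrives, the server derives its initial sequence number from the connection details and a secret, then recomputes it when the final ACK comes back. This Python sketch is heavily simplified (real SYN cookie schemes also encode values such as the MSS and limit cookie lifetime), and every name in it is an assumption for illustration.

```python
import hashlib
import hmac


def make_cookie(secret: bytes, src: str, sport: int, dst: str, dport: int,
                counter: int) -> int:
    """Derive a 32-bit server ISN from the connection 4-tuple and a counter."""
    msg = f"{src}:{sport}>{dst}:{dport}|{counter}".encode()
    digest = hmac.new(secret, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:4], "big")


def ack_is_valid(secret: bytes, src: str, sport: int, dst: str, dport: int,
                 counter: int, ack_number: int) -> bool:
    """A genuine client acknowledges cookie + 1; no stored state is needed."""
    expected = (make_cookie(secret, src, sport, dst, dport, counter) + 1) % 2**32
    return ack_number == expected


secret = b"server-secret"
cookie = make_cookie(secret, "10.0.0.1", 40000, "200.1.2.3", 80, 7)
print(ack_is_valid(secret, "10.0.0.1", 40000, "200.1.2.3", 80, 7, (cookie + 1) % 2**32))
```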

Data Transfer

 * After connection is established, bidirectional data transfer can take place.
 * The client and server can send data and acknowledgments in both directions.
 * Data traveling in the same direction as an acknowledgment are carried on the same segment.
 * The acknowledgment is piggybacked with the data.

Connection Termination
 * Either of the two parties involved in exchanging data (client or server) can close the connection, although it is usually initiated by the client.
 * Most implementations today allow two options for connection termination: three-way handshaking, and four-way handshaking with a half-close option.

Three-Way Termination



 * 1st Packet:
 * The client TCP, after receiving a close command from the client process, sends the FIN segment.
 * A FIN segment can include the last chunk of data sent by the client or it can be just a control segment.
 * If it is only a control segment, it consumes only one sequence number.


 * 2nd Packet:
 * The server TCP, after receiving the FIN, informs its process.
 * It then sends a FIN+ACK to confirm the receipt of the FIN from the client and to announce the closing of the connection in the other direction.
 * This segment can also contain the last chunk of data from the server.
 * If it does not carry data, it consumes only one sequence number.


 * 3rd Packet:
 * The client TCP sends an ACK segment to confirm the receipt of the FIN from the TCP server.
 * This segment contains the acknowledgment number, which is one plus the sequence number received in the FIN segment from the server.
 * This segment cannot carry data and consumes no sequence numbers.


 * Note:
 * The FIN segment consumes one sequence number if it does not carry data.
 * The FIN + ACK segment consumes one sequence number if it does not carry data.

Half-Close

 * In TCP, one end can stop sending data while still receiving data. This is called a Half-Close.
 * Either the server or the client can issue a half-close request.
 * It can occur when the server needs all the data before processing can begin.
 * An example is sorting.
 * When the client sends data to the server to be sorted, the server needs to receive all the data before sorting can start.
 * This means the client, after sending all data, can close the connection in the client-to-server direction.
 * However, the server-to-client direction must remain open to return the sorted data.
 * The server, after receiving the data, still needs time for sorting; its outbound direction must remain open.




 * The data transfer from the client to the server stops.
 * The client half-closes the connection by sending a FIN segment.
 * The server accepts the half-close by sending the ACK segment.
 * The server, however, can still send data.
 * When the server has sent all of the processed data, it sends a FIN segment, which is acknowledged by an ACK from the client.
 * After half closing the connection, data can travel from server to client and acknowledgments can travel from client to server.
 * The client cannot send any more data to the server.

Connection Reset
 * TCP at any end may:
 * Deny a connection request
 * Abort an existing connection
 * Terminate an idle connection
 * All of these are done with the RST flag.

Maximum Segment Lifetime (MSL)

 * The TCP standard defines MSL as being a value of 120 seconds (2 minutes).
 * In modern networks TCP allows implementations to choose a lower value.
 * The common value for MSL is between 30 seconds and 1 minute.
 * The MSL is the maximum time a segment can exist in the Internet before it is dropped.
 * TCP segment is encapsulated in an IP datagram, which has a limited lifetime (TTL).
 * When the IP datagram is dropped, the encapsulated TCP segment is also dropped.

TIME-WAIT state and 2MSL timer
There are two reasons for the existence of the TIME-WAIT state and the 2MSL timer:


 * 1st Reason:
 * If the last ACK segment is lost, the server TCP, which sets a timer for the last FIN, assumes that its FIN is lost and resends it.
 * If the client goes to the CLOSED state and closes the connection before the 2MSL timer expires, it never receives this resent FIN segment, and consequently, the server never receives the final ACK.
 * The server cannot close the connection.
 * The 2MSL timer makes the client wait long enough for an ACK to be lost (one MSL) and a resent FIN to arrive (another MSL).
 * If a new FIN arrives during the TIME-WAIT state, the client sends a new ACK and restarts the 2MSL timer.


 * 2nd Reason:
 * A duplicate segment from one connection might appear in the next one.
 * Assume a client and a server have closed a connection.
 * After a short period of time, they open a connection with the same socket addresses (same source and destination IP addresses and same source and destination port numbers).
 * This new connection is called an incarnation of the old one.
 * A duplicated segment from the previous connection may arrive in this new connection and be interpreted as belonging to the new connection if there is not enough time between the two connections.
 * To prevent this problem, TCP requires that an incarnation cannot occur unless 2MSL amount of time has elapsed.
 * Some implementations, however, ignore this rule if the initial sequence number of the incarnation is greater than the last sequence number used in the previous connection.

TCP Windows
 * TCP uses two windows for each direction of data transfer:
 * Send window
 * Receive window
 * This means four windows for a bidirectional communication.

Send Window



 * The window shown here is of size 100 bytes (normally thousands of bytes).
 * The send window size is dictated by the receiver (flow control) and the congestion in the underlying network (congestion control).
 * The figure shows how a send window opens, closes, or shrinks.

Receive Window


 * TCP allows the receiving process to pull data at its own pace.
 * This means that part of the allocated buffer at the receiver may be occupied by bytes that have been received and acknowledged, but are waiting to be pulled by the receiving process.
 * The receive window size is therefore always smaller than or equal to the buffer size:
rwnd = buffer size − number of waiting bytes to be pulled
 * The receive window size determines the number of bytes that the receive window can accept from the sender before being overwhelmed (flow control).

Flow Control

 * Flow control balances the rate a producer creates data with the rate a consumer can use the data.
 * TCP separates flow control from error control.




 * Data travels from the sending process to the sending TCP, then to the receiving TCP, and finally to the receiving process (paths 1, 2, and 3).
 * Flow control feedback travels from the receiving TCP to the sending TCP and from the sending TCP up to the sending process (paths 4 and 5).
 * Most implementations of TCP do not provide flow control feedback from the receiving process to the receiving TCP; they let the receiving process pull data from the receiving TCP whenever it is ready.
 * Thus receiving TCP controls the sending TCP; the sending TCP controls the sending process.
 * Flow control feedback from the Sending TCP to the Sending Process (path 5) is achieved through simple rejection of data by sending TCP when its window is full.
 * Windows are used to achieve flow control from the receiving TCP to the sending TCP, as discussed in the section below.

Opening and Closing Windows

 * To achieve flow control, TCP forces the sender and the receiver to adjust their window sizes.
 * The size of the buffer for both parties is fixed when the connection is established.
 * The receive window closes (moves its left wall to the right) when more bytes arrive from the sender;
 * It opens (moves its right wall to the right) when more bytes are pulled by the process.
 * Assume that it does not shrink (the right wall does not move to the left).
 * The opening, closing, and shrinking of the send window is controlled by the receiver.
 * The send window closes (moves its left wall to the right) when a new acknowledgement allows it to do so.
 * The send window opens (its right wall moves to the right) when the RWND advertised by the receiver allows it to do so.



The diagram shows 8 segments:

1. Client sends the server a SYN to request connection. The client announces its ISN = 100. The server, allocates a buffer size of 800 (assumption) and sets its window to cover the whole buffer (rwnd = 800). The number of the next byte to arrive starts from 101.

2. This is an ACK + SYN segment. The segment uses ack no = 101 to show that it expects to receive bytes starting from 101. It also announces that the client can set a buffer size of 800 bytes.

3. The third segment is an ACK segment from client to server.

4. After the client has set its window with the size (800) dictated by the server, the process pushes 200 bytes of data. The TCP client numbers these bytes 101 to 300. It creates a segment and sends it to server. The segment has starting byte number as 101 and the segment carries 200 bytes. The window of client is then adjusted to show 200 bytes of data are sent but waiting for acknowledgment. When this segment is received at the server, the bytes are stored, and the receive window closes to show that the next byte expected is byte 301; the stored bytes occupy 200 bytes of buffer.

5. The fifth segment is the feedback from the server to the client. The server acknowledges bytes up to and including 300 (expecting to receive byte 301). The segment also carries the size of the receive window after decrease (600). The client, after receiving this segment, purges the acknowledged bytes from its window and closes its window to show that the next byte to send is byte 301. The window size decreases to 600 bytes. Although the allocated buffer can store 800 bytes, the window cannot open (moving its right wall to the right) because the receiver does not let it.

6. Sent by the client after its process pushes 300 more bytes. The segment defines seq no as 301 and contains 300 bytes. When this segment arrives at the server, the server stores them, but it has to reduce its window size. After its process has pulled 100 bytes of data, the window closes from the left for the amount of 300 bytes, but opens from the right for the amount of 100 bytes. The result is that the size is only reduced 200 bytes. The receiver window size is now 400 bytes.

7. The server acknowledges the receipt of data, and announces that its window size is 400. When this segment arrives at the client, the client has no choice but to reduce its window again and set the window size to the value of rwnd = 400. The send window closes from the left by 300 bytes, and opens from the right by 100 bytes.

8. This one is also from the server after its process has pulled another 200 bytes. Its window size increases. The new rwnd value is now 600. The segment informs the client that the server still expects byte 601, but the server window size has expanded to 600. After this segment arrives at the client, the client opens its window by 200 bytes without closing it. The result is that its window size increases to 600 bytes.
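
The server side of the walkthrough above can be replayed with a small model of the receive window. This is only a sketch: the class and method names are made up for illustration, and it handles in-order arrival only.

```python
class ReceiveWindow:
    """Receiver-side bookkeeping: rwnd = buffer size - bytes waiting to be pulled."""

    def __init__(self, buffer_size, next_expected):
        self.buffer_size = buffer_size
        self.next_expected = next_expected  # sequence number of the next in-order byte
        self.waiting = 0                    # bytes stored but not yet pulled by the process

    @property
    def rwnd(self):
        return self.buffer_size - self.waiting

    def receive(self, seq, length):
        """Store an in-order segment; the window closes from the left."""
        if seq != self.next_expected:
            raise ValueError("this sketch only models in-order arrival")
        if length > self.rwnd:
            raise ValueError("segment does not fit in the advertised window")
        self.next_expected += length
        self.waiting += length

    def pull(self, length):
        """The receiving process consumes data; the window opens from the right."""
        if length > self.waiting:
            raise ValueError("cannot pull more bytes than are waiting")
        self.waiting -= length

    def advertise(self):
        """Return the (ackNo, rwnd) pair carried in the next ACK."""
        return self.next_expected, self.rwnd


server = ReceiveWindow(buffer_size=800, next_expected=101)

server.receive(seq=101, length=200)   # segment 4 arrives
segment5 = server.advertise()         # (301, 600)

server.receive(seq=301, length=300)   # segment 6 arrives
server.pull(100)                      # process pulls 100 bytes
segment7 = server.advertise()         # (601, 400)

server.pull(200)                      # process pulls another 200 bytes
segment8 = server.advertise()         # (601, 600)

print(segment5, segment7, segment8)
```

The three advertised pairs match segments 5, 7, and 8 of the example: the window shrinks to 600, then 400, and grows back to 600 without the acknowledgment number moving past 601.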


 * Shrinking of Windows
 * The receive window cannot shrink.
 * The send window can shrink if the receiver defines a value for rwnd that results in shrinking the window.

Window Shutdown

 * Shrinking the send window by moving its right wall to the left is discouraged.
 * There is one exception: the receiver can temporarily shut down the window by sending a RWND of 0.
 * This can happen if the receiver does not want to receive data from the sender for a while.
 * The sender does not actually shrink the size of the window, but stops sending data until a new advertisement has arrived.
 * Even when the window is shut down by an order from the receiver, the sender can always send a segment with 1 byte of data.
 * This is called Probing and is used to prevent a deadlock.

Silly Window Syndrome

 * A serious problem can arise in the sliding window operation when either the sending application program creates data slowly or the receiving application program consumes data slowly, or both.
 * Any of these situations results in the sending of data in very small segments, which reduces the efficiency of the operation.
 * If TCP sends segments containing only 1 byte of data, it means that a 41-byte datagram (20 bytes TCP header and 20 bytes IP header) transfers only 1 byte of user data.
 * The overhead ratio is 41:1 (41 bytes sent for every byte of user data).
 * The inefficiency is even worse after accounting for the data link layer and physical layer overhead.


 * Syndrome due to Sender


 * The sending TCP may create a silly window syndrome if it is serving an application program that creates data slowly (e.g., 1 byte at a time).
 * The application program writes 1 byte at a time into the buffer of the sending TCP.
 * If the sending TCP does not have any specific instructions, it may create segments containing 1 byte of data.
 * The result is a lot of 41-byte datagrams traveling through an internet.
 * The solution is to prevent the sending TCP from sending the data byte by byte.
 * The sending TCP must be forced to wait and collect data to send in a larger block.
 * If it waits too long, it may delay the process.
 * If it does not wait long enough, it may end up sending small segments.


 * Solution - Nagle’s Algorithm


 * The sending TCP sends the first piece of data it receives from the sending application program even if it is only 1 byte.
 * After sending the first segment, the sending TCP accumulates data in the output buffer and waits until either the receiving TCP sends an acknowledgment or until enough data has accumulated to fill a maximum-size segment.
 * The step above is repeated for the rest of the transmission.
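
The send decision in Nagle's algorithm can be sketched as a small simulation. The function name and event format are invented for this illustration, and the model is simplified: a single ACK is assumed to acknowledge everything outstanding, and the receive window is assumed never to limit the sender.

```python
def nagle_send(events, mss):
    """Replay ("write", nbytes) and ("ack", 0) events; return the sizes of segments sent."""
    sent = []          # sizes of the segments put on the wire, in order
    buffered = 0       # bytes accumulated in the output buffer
    unacked = False    # True while previously sent data is still unacknowledged

    for kind, nbytes in events:
        if kind == "write":
            buffered += nbytes
        elif kind == "ack":
            unacked = False
        else:
            raise ValueError("unknown event: " + kind)

        # Enough data for a maximum-size segment: send without waiting.
        while buffered >= mss:
            sent.append(mss)
            buffered -= mss
            unacked = True

        # A smaller segment goes out only when nothing is outstanding.
        if buffered > 0 and not unacked:
            sent.append(buffered)
            buffered = 0
            unacked = True

    return sent


# An application writing 1 byte at a time, with ACKs arriving in between.
events = [("write", 1)] * 5 + [("ack", 0)] + [("write", 1)] * 2 + [("ack", 0)]
print(nagle_send(events, mss=100))   # [1, 4, 2]
```

The first byte leaves immediately; the following bytes are coalesced into one segment per acknowledgment instead of one segment per write.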


 * Syndrome Created by the Receiver


 * The syndrome may also occur if the receiving TCP is serving an application that consumes data slowly (for example, 1 byte at a time).
 * Assume that the sender creates data in blocks of 1000 bytes, but the receiver consumes data 1 byte at a time.
 * Also assume that the input buffer of the receiving TCP is 4 kilobytes. The sender sends the first 4 kilobytes of data.
 * The receiver stores it in its buffer.
 * Now its buffer is full.
 * It advertises a window size of zero, which means the sender should stop sending data.
 * The receiving application reads the first byte of data from the input buffer of the receiving TCP.
 * Now there is 1 byte of space in the incoming buffer.
 * The receiving TCP announces a window size of 1 byte; the sending TCP takes this advertisement as good news and sends a segment carrying only 1 byte of data.
 * The procedure will continue.
 * One byte of data is consumed and a segment carrying 1 byte of data is sent.
 * This is again an efficiency problem.


 * Two solutions are possible


 * Clark’s Solution


 * Announce a window size of zero until either
 * 1) There is enough space to accommodate a segment of maximum size
 * 2) At least half of the receive buffer is empty.
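
Clark's rule reduces to a single comparison on the receiver. The sketch below uses a hypothetical helper name, and the MSS of 1000 bytes in the usage lines is an assumption chosen to match the 1000-byte blocks in the example above.

```python
def clark_advertised_window(free_space, buffer_size, mss):
    """Advertise zero until the free space reaches one MSS or half the buffer."""
    if free_space >= mss or free_space >= buffer_size / 2:
        return free_space
    return 0


# 4-kilobyte buffer as in the example; the application has read 1 byte.
print(clark_advertised_window(free_space=1, buffer_size=4096, mss=1000))     # 0
# Once a full segment fits, the real window is announced.
print(clark_advertised_window(free_space=1000, buffer_size=4096, mss=1000))  # 1000
```

With this rule the sender never sees the 1-byte window that starts the syndrome.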


 * Delayed Acknowledgment


 * The second solution is to delay sending the acknowledgment.
 * This means that when a segment arrives, it is not acknowledged immediately.
 * The receiver waits until there is a decent amount of space in its incoming buffer before acknowledging the arrived segments.
 * The delayed acknowledgment prevents the sending TCP from sliding its window.
 * After the sending TCP has sent the data in the window, it stops.
 * This removes the syndrome.
 * Delayed acknowledgment also has another advantage: it reduces traffic.
 * The receiver does not have to acknowledge each segment.
 * However, there also is a disadvantage in that the delayed acknowledgment may result in the sender unnecessarily retransmitting the unacknowledged segments.
 * TCP adjusts this by defining that the acknowledgment should not be delayed by more than 500 ms.

Error Control

 * TCP is a reliable transport layer protocol.
 * This means that an application program that delivers a stream of data to TCP relies on TCP to deliver the entire stream to the application program on the other end in order, without error, and without any part lost or duplicated.
 * TCP provides reliability using error control. Error control includes mechanisms for detecting and resending corrupted segments, resending lost segments, storing out-of-order segments until missing segments arrive, and detecting and discarding duplicated segments.
 * Error control in TCP is achieved through the use of three simple tools: checksum, acknowledgment, and time-out.

Checksum

 * Each segment includes a checksum field, which is used to check for a corrupted segment.
 * If a segment is corrupted, as detected by an invalid checksum, the segment is discarded by the destination TCP and is considered lost.
 * TCP uses a 16-bit checksum that is mandatory in every segment.
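
The arithmetic behind the 16-bit checksum is the Internet checksum: a one's complement sum of 16-bit words, complemented at the end. The sketch below shows only that arithmetic. In real TCP the sum also covers a pseudoheader taken from the IP layer, which is not modelled here, and the function name is chosen for illustration.

```python
def internet_checksum(data: bytes) -> int:
    """Return the 16-bit one's complement checksum of data."""
    if len(data) % 2:
        data += b"\x00"                           # pad to a whole number of 16-bit words
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]     # big-endian 16-bit word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back into the low 16 bits
    return ~total & 0xFFFF


payload = bytes.fromhex("0001f203f4f5f6f7")
checksum = internet_checksum(payload)
print(hex(checksum))                              # 0x220d

# The receiver recomputes over data plus checksum; zero means no error was detected.
print(internet_checksum(payload + checksum.to_bytes(2, "big")))   # 0
```

A receiver that gets a nonzero result discards the segment, which the sender then treats like any other lost segment.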

Acknowledgment

 * TCP uses acknowledgments to confirm the receipt of data segments.
 * Control segments that carry no data but consume a sequence number are also acknowledged.
 * ACK segments are never acknowledged.

 * Acknowledgment Types

There are two types of acknowledgment:


 * Cumulative Acknowledgment (ACK)
 * TCP was originally designed to acknowledge receipt of segments cumulatively.
 * The receiver advertises the next byte it expects to receive, ignoring all segments received and stored out of order.
 * Also called Positive Cumulative Acknowledgment or ACK.
 * “Positive” indicates that no feedback is provided for discarded, lost, or duplicate segments.
 * The 32-bit ACK field in the TCP header is used for cumulative acknowledgments.
 * Its value is valid only when the ACK flag bit is set to 1.


 * Selective Acknowledgment (SACK)
 * A SACK does not replace ACK, but reports additional information to the sender.
 * A SACK reports a block of data that is out of order.
 * Also reports a block of segments that is duplicated.
 * There is no provision in the TCP header for adding this type of information.
 * SACK is implemented as an option at the end of the TCP header.


 * Acknowledgment Generation

1. When end A sends a data segment to end B, it must include (piggyback) an acknowledgment that gives the next sequence number it expects to receive. This rule decreases the number of segments needed and therefore reduces traffic.

2. When the receiver has no data to send and it receives an in-order segment (with expected sequence number) and the previous segment has already been acknowledged, the receiver delays sending an ACK segment until another segment arrives or until a period of time (normally 500 ms) has passed. In other words, the receiver needs to delay sending an ACK segment if there is only one outstanding in-order segment. This rule reduces ACK segment traffic.

3. When a segment arrives with a sequence number that is expected by the receiver, and the previous in-order segment has not been acknowledged, the receiver immediately sends an ACK segment. In other words, there should not be more than two in-order unacknowledged segments at any time. This prevents the unnecessary retransmission of segments that may create congestion in the network.

4. When a segment arrives with an out-of-order sequence number that is higher than expected, the receiver immediately sends an ACK segment announcing the sequence number of the next expected segment. This leads to the fast retransmission of missing segments.

5. When a missing segment arrives, the receiver sends an ACK segment to announce the next sequence number expected. This informs the sender that segments reported missing have been received.

6. If a duplicate segment arrives, the receiver discards the segment, but immediately sends an acknowledgment indicating the next in-order segment expected. This solves some problems when an ACK segment itself is lost.
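
Rules 2 to 6 can be read as a decision made by the receiver for each arriving data segment. The function below is a sketch of that decision only. Its name and parameters are hypothetical, rule 1 (piggybacking on outgoing data) is left out, and the state flags are supplied by the caller rather than tracked.

```python
def ack_decision(seg_seq, next_expected, unacked_in_order, gap_pending):
    """Return (rule number, action) for an arriving data segment.

    seg_seq          -- first byte number carried by the arriving segment
    next_expected    -- next in-order byte number the receiver is waiting for
    unacked_in_order -- True if an earlier in-order segment is still unacknowledged
    gap_pending      -- True if out-of-order data is buffered beyond next_expected
    """
    if seg_seq < next_expected:
        return 6, "discard, ACK immediately"      # duplicate of data already received
    if seg_seq > next_expected:
        return 4, "store, ACK immediately"        # out of order: re-announce next_expected
    if gap_pending:
        return 5, "ACK immediately"               # the missing segment has arrived
    if unacked_in_order:
        return 3, "ACK immediately"               # second in-order segment outstanding
    return 2, "delay ACK"                         # wait for another segment or the timer


print(ack_decision(seg_seq=701, next_expected=701, unacked_in_order=False, gap_pending=False))
print(ack_decision(seg_seq=801, next_expected=701, unacked_in_order=False, gap_pending=False))
```

Only rule 2 delays the acknowledgment; every other case answers at once, which is what lets the sender count duplicate ACKs.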

Retransmission

 * The heart of the error control mechanism is the retransmission of segments.
 * When a segment is sent, it is stored in a queue until it is acknowledged.
 * When the retransmission timer expires or when the sender receives three duplicate ACKs for the first segment in the queue, that segment is retransmitted.


 * Retransmission after RTO


 * The sending TCP maintains one retransmission time-out (RTO) for each connection.
 * When the timer matures, i.e. times out, TCP sends the segment in the front of the queue (the segment with the smallest sequence number) and restarts the timer.
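 * Note that this assumes there is outstanding data, i.e. Sf < Sn, where Sf is the first outstanding (unacknowledged) byte and Sn is the next byte to be sent.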
 * This version of TCP is sometimes referred to as Tahoe.
 * We will see later that the value of RTO is dynamic in TCP and is updated based on the round-trip time (RTT) of segments.
 * RTT is the time needed for a segment to reach a destination and for an acknowledgment to be received.


 * Retransmission after Three Duplicate ACK Segments (Reno)


 * The previous rule about retransmission of a segment is sufficient if the value of RTO is not large.
 * To improve throughput by letting the sender retransmit before the time-out expires, most implementations today follow the three-duplicate-ACKs rule and retransmit the missing segment immediately.
 * This feature is called fast retransmission, and the version of TCP most often associated with it is Reno (strictly, fast retransmission first appeared in Tahoe; Reno added fast recovery on top of it).
 * In this version, if three duplicate acknowledgments (i.e., an original ACK plus three exactly identical copies) arrive for a segment, the segment they ask for is retransmitted without waiting for the time-out.


 * Out-of-Order Segments


 * TCP implementations today do not discard out-of-order segments.
 * They store them temporarily and flag them as out-of-order segments until the missing segments arrive.
 * Out-of-order segments are never delivered to the process.
 * TCP guarantees that data are delivered to the process in order.


 * Lost Segment




 * A lost segment is discarded somewhere in the network; a corrupted segment is discarded by the receiver itself.
 * Both are considered lost.
 * We are assuming that data transfer is unidirectional: one site is sending, the other receiving.
 * In our scenario, the sender sends segments 1 and 2, which are acknowledged immediately by an ACK (rule 3).
 * Segment 3, however, is lost.
 * The receiver receives segment 4, which is out of order.
 * The receiver stores the data in the segment in its buffer but leaves a gap to indicate that there is no continuity in the data.
 * The receiver immediately sends an acknowledgment to the sender displaying the next byte it expects (rule 4).
 * Note that the receiver stores bytes 801 to 900, but never delivers these bytes to the application until the gap is filled.
 * The sender TCP keeps one RTO timer for the whole period of connection.
 * When the third segment times out, the sending TCP resends segment 3, which arrives this time and is acknowledged properly (rule 5).


 * Fast Retransmission


 * Here RTO has a larger value.
 * Each time the receiver receives the fourth, fifth, and sixth segments, it triggers an acknowledgment (rule 4).
 * The sender receives four acknowledgments with the same value (three duplicates).
 * Although the timer has not matured, the rule for fast retransmission requires that segment 3, the segment that is expected by all of these duplicate acknowledgments, be resent immediately.
 * After resending this segment, the timer is restarted.
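
On the sender, the fast retransmission rule amounts to counting identical acknowledgment numbers. A minimal sketch, with a hypothetical function name and illustrative byte numbers:

```python
def fast_retransmit_triggers(ack_numbers, threshold=3):
    """Return the ackNo values for which fast retransmission fires."""
    fired = []
    last_ack = None
    duplicates = 0
    for ack in ack_numbers:
        if ack == last_ack:
            duplicates += 1
            if duplicates == threshold:   # original ACK plus three identical copies
                fired.append(ack)         # resend the segment starting at this byte
        else:
            last_ack = ack                # new data acknowledged: start counting again
            duplicates = 0
    return fired


# One original ACK followed by three duplicates, all asking for byte 701.
print(fast_retransmit_triggers([701, 701, 701, 701]))   # [701]
```

Two duplicates are not enough: `fast_retransmit_triggers([701, 701, 701])` returns an empty list, and recovery is then left to the RTO timer.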


 * Delayed Segment
 * TCP uses the services of IP, which is a connectionless protocol.
 * Each IP datagram encapsulating a TCP segment may reach the final destination through a different route with a different delay.
 * Hence TCP segments may be delayed.
 * Delayed segments sometimes may time out.
 * If the delayed segment arrives after it has been resent, it is considered a duplicate segment and discarded.


 * Duplicate Segment
 * A duplicate segment can be created, for example, by a sending TCP when a segment is delayed and treated as lost by the sender, which then resends it.
 * Handling the duplicated segment is a simple process for the destination TCP.
 * The destination TCP expects a continuous stream of bytes.
 * When a segment arrives that contains a sequence number equal to an already received and stored segment, it is discarded.
 * An ACK is sent with ackNo defining the expected segment.


 * Automatically Corrected Lost ACK




 * Automatic correction of a lost ACK is a key advantage of using cumulative acknowledgments.
 * The figure shows a lost acknowledgment sent by the receiver of data.
 * In the TCP acknowledgment mechanism, a lost acknowledgment may not even be noticed by the source TCP.
 * TCP uses a cumulative acknowledgment system.
 * We can say that the next acknowledgment automatically corrects the loss of the acknowledgment.




 * If the next acknowledgment is delayed for a long time or there is no next acknowledgment (the lost acknowledgment is the last one sent), the correction is triggered by the RTO timer.
 * A duplicate segment is the result.
 * When the receiver receives a duplicate segment, it discards it, and resends the last ACK immediately to inform the sender that the segment or segments have been received.
 * Note that only one segment is retransmitted although two segments are not acknowledged.
 * When the sender receives the retransmitted ACK, it knows that both segments are safe and sound because acknowledgment is cumulative.


 * Deadlock Created by Lost Acknowledgment


 * There is one situation in which loss of an acknowledgment may result in system deadlock.
 * This is the case in which a receiver sends an acknowledgment with rwnd set to 0 and requests that the sender shut down its window temporarily.
 * After a while, the receiver wants to remove the restriction; however, if it has no data to send, it sends an ACK segment and removes the restriction with a nonzero value for rwnd.
 * A problem arises if this acknowledgment is lost.
 * The sender is waiting for an acknowledgment that announces the nonzero rwnd.
 * The receiver thinks that the sender has received this and is waiting for data.
 * This situation is called a deadlock; each end is waiting for a response from the other end and nothing is happening.
 * A retransmission timer is not set.
 * To prevent deadlock, a persistence timer was designed.

Congestion Control

 * Congestion control in TCP is based on both open-loop and closed-loop mechanisms.
 * TCP uses a congestion window and a congestion policy that avoid congestion and detect and alleviate congestion after it has occurred.

 * Congestion Window
 * It is not only the receiver that can dictate to the sender the size of the sender’s window.
 * The network can also dictate the size.
 * If the network cannot deliver the data as fast as it is created by the sender, it must tell the sender to slow down.
 * So the receiver and the network together determine the size of the sender’s window.
 * The sender has two pieces of information: the receiver-advertised window size (rwnd) and the congestion window size (cwnd).
 * The actual size of the window is the minimum of these two:
Actual window size = Minimum (rwnd, cwnd)


 * Congestion Policy
 * TCP’s general policy for handling congestion is based on three phases:
 * Slow Start
 * Congestion Avoidance
 * Congestion Detection


 * In the slow start phase, the sender starts with a slow rate of transmission, but increases the rate rapidly to reach a threshold.
 * When the threshold is reached, the rate of increase is reduced.
 * Finally, if congestion is ever detected, the sender goes back to the slow start or congestion avoidance phase, based on how the congestion is detected.


 * Slow Start - Exponential Increase




 * The slow start algorithm is based on the idea that the size of the congestion window (cwnd) starts with 1 MSS.
 * The MSS is determined during connection establishment using an option of the same name.
 * The size of the window increases one MSS each time one acknowledgement arrives.
 * The algorithm starts slowly, but grows exponentially.
 * Assume that rwnd is much larger than cwnd, so that the sender window size always equals cwnd.
 * Ignore delayed-ACK policy for now and assume that each segment is acknowledged individually.
 * The sender starts with cwnd = 1 MSS.
 * This means that the sender can send only one segment.
 * After the first ACK arrives, the size of the congestion window is increased by 1, which means that cwnd is now 2.
 * Now two more segments can be sent.
 * When two more ACKs arrive, the size of the window is increased by 1 MSS for each ACK, which means cwnd is now 4.
 * Now four more segments can be sent.
 * When four ACKs arrive, the size of the window increases by 4, which means that cwnd is now 8.
 * In the slow start algorithm, the size of the congestion window increases exponentially until it reaches a threshold.
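
The per-ACK growth described above can be sketched as follows. The function name is made up, cwnd is counted in segments (MSS units), and every segment is assumed to be acknowledged individually, as in the text.

```python
def slow_start_rounds(ssthresh, cwnd=1):
    """Return cwnd (in MSS) at the start of each round until ssthresh is reached."""
    sizes = [cwnd]
    while cwnd < ssthresh:
        acks_this_round = cwnd            # one ACK per segment sent in this round
        for _ in range(acks_this_round):
            if cwnd < ssthresh:
                cwnd += 1                 # +1 MSS for each arriving ACK
        sizes.append(cwnd)
    return sizes


print(slow_start_rounds(ssthresh=16))   # [1, 2, 4, 8, 16]
```

Adding 1 MSS per ACK doubles the window every round, which is where the exponential growth comes from.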


 * Congestion Avoidance - Additive Increase




 * In slow start algorithm, the size of the congestion window increases exponentially.
 * To avoid congestion before it happens, one must slow down this exponential growth.
 * TCP's Congestion avoidance feature increases the cwnd additively instead of exponentially.
 * When the size of the congestion window reaches the slow start threshold, the slow start phase stops and the additive phase begins.
 * Each time the whole “window” of segments is acknowledged, the size of the congestion window is increased by one MSS.
 * A window is the number of segments transmitted during one RTT.
 * The increase is based on RTT, not on the number of arrived ACKs.
 * Therefore the size of the congestion window increases additively until congestion is detected.
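
Viewed once per RTT, the switch from exponential to additive growth looks like the sketch below. The function name is hypothetical, cwnd is in MSS units, and no loss is modelled.

```python
def cwnd_per_rtt(ssthresh, rtts, cwnd=1):
    """Return the congestion window (in MSS) after each of the given number of RTTs."""
    sizes = [cwnd]
    for _ in range(rtts):
        if cwnd < ssthresh:
            cwnd = min(2 * cwnd, ssthresh)   # slow start: double, but stop at the threshold
        else:
            cwnd += 1                        # congestion avoidance: +1 MSS per RTT
        sizes.append(cwnd)
    return sizes


print(cwnd_per_rtt(ssthresh=8, rtts=6))   # [1, 2, 4, 8, 9, 10, 11]
```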


 * Congestion Detection - Multiplicative Decrease


 * If congestion occurs, the congestion window size must be decreased.
 * The only way a sender can guess that congestion has occurred is the need to retransmit a segment.
 * This is a major assumption made by TCP.
 * Retransmission is needed to recover a missing packet, which is assumed to have been dropped by a router because it was overloaded or congested.
 * Retransmission can occur in one of two cases: when the RTO timer times out or when three duplicate ACKs are received.
 * In both cases, the size of the threshold is dropped to half (multiplicative decrease).

Most TCP implementations have two reactions:

1. If a time-out occurs, there is a stronger possibility of congestion; a segment has probably been dropped in the network and there is no news about the following sent segments.

In this case TCP reacts strongly:
 * a. It sets the value of the threshold to half of the current window size.


 * b. It reduces cwnd back to one segment.


 * c. It starts the slow start phase again.

2. If three duplicate ACKs are received, there is a weaker possibility of congestion; a segment may have been dropped but some segments after that have arrived safely since three duplicate ACKs are received. This is called fast retransmission and fast recovery.

In this case, TCP has a weaker reaction as shown below:
 * a. It sets the value of the threshold to half of the current window size.


 * b. It sets cwnd to the value of the threshold (some implementations add three segment sizes to the threshold).


 * c. It starts the congestion avoidance phase.
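
The two reactions can be summarised in one function. This is a sketch with hypothetical names: window sizes are in segments, the halving uses integer division, and the floor of one segment on the threshold is an assumption added to keep the result usable for very small windows.

```python
def on_congestion(cwnd, event):
    """Return (ssthresh, new_cwnd, next_phase) after a congestion signal."""
    ssthresh = max(cwnd // 2, 1)          # multiplicative decrease of the threshold
    if event == "timeout":
        return ssthresh, 1, "slow start"                      # strong reaction
    if event == "three_dup_acks":
        return ssthresh, ssthresh, "congestion avoidance"     # weaker reaction
    raise ValueError("unknown event: " + event)


print(on_congestion(20, "timeout"))          # (10, 1, 'slow start')
print(on_congestion(20, "three_dup_acks"))   # (10, 10, 'congestion avoidance')
```

Both cases halve the threshold; they differ only in where cwnd restarts.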

TCP Timers
Most TCP implementations use at least four timers:
 * Retransmission
 * Persistence
 * Keepalive
 * TIME-WAIT


 * Retransmission Timer

To retransmit lost segments, TCP employs one retransmission timer for the whole connection period that handles the retransmission time-out (RTO), the waiting time for an acknowledgment of a segment.

The following rules apply to the retransmission timer:

1. When TCP sends the segment in front of the sending queue, it starts the timer.

2. When the timer expires, TCP resends the first segment in front of the queue, and restarts the timer.

3. When a segment (or segments) are cumulatively acknowledged, the segment (or segments) are purged from the queue.

4. If the queue is empty, TCP stops the timer; otherwise, TCP restarts the timer.

To calculate the retransmission time-out (RTO), we first need to calculate the RTT.
 * Round-Trip Time (RTT)
 * Measured RTT - The measured round-trip time for a segment is the time required for the segment to reach the destination and be acknowledged, although the acknowledgment may include other segments. In TCP only one RTT measurement can be in progress at any time.
 * Smoothed RTT - The measured RTT is likely to change for each round trip. The fluctuation is so high in today’s Internet that a single measurement alone cannot be used for retransmission time-out purposes.
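 * RTT Deviation - Most implementations do not rely on the smoothed RTT alone; they also track its deviation, a running measure of how far the measured RTTs stray from the smoothed value.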


 * Retransmission Time-out (RTO)
 * The value of RTO is based on the smoothed round-trip time and its deviation.
 * Take the running smoothed average value of Smoothed RTT, and add four times the running smoothed average value of RTT Deviation (normally a small value).
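
The calculation can be sketched as a small estimator. The notes do not give the smoothing weights, so the values 1/8 and 1/4 and the update order below are assumptions taken from RFC 6298. The class name is made up, and the RFC's minimum RTO and clock granularity terms are left out.

```python
class RtoEstimator:
    """Keeps a smoothed RTT and its deviation; RTO = smoothed RTT + 4 * deviation."""

    def __init__(self, alpha=1 / 8, beta=1 / 4):
        self.alpha = alpha      # weight of a new sample in the smoothed RTT
        self.beta = beta        # weight of a new sample in the deviation
        self.srtt = None        # smoothed RTT
        self.rttvar = None      # RTT deviation

    def update(self, measured_rtt):
        """Feed one measured RTT (seconds) and return the new RTO."""
        if self.srtt is None:
            # First measurement: no history to smooth against.
            self.srtt = measured_rtt
            self.rttvar = measured_rtt / 2
        else:
            # The deviation is updated first, using the old smoothed RTT.
            self.rttvar = (1 - self.beta) * self.rttvar + self.beta * abs(self.srtt - measured_rtt)
            self.srtt = (1 - self.alpha) * self.srtt + self.alpha * measured_rtt
        return self.srtt + 4 * self.rttvar


estimator = RtoEstimator()
first_rto = estimator.update(1.5)    # 4.5
second_rto = estimator.update(2.5)   # 4.875
print(first_rto, second_rto)
```

A single slow sample moves the smoothed RTT only slightly (1.5 to 1.625), but it widens the deviation, so the RTO still rises.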


 * Karn’s Algorithm
 * Do not consider the round-trip time of a retransmitted segment in the calculation of RTTs.
 * Do not update the value of RTTs until you send a segment and receive an acknowledgment without the need for retransmission.
 * TCP does not consider the RTT of a retransmitted segment in its calculation of a new RTO.


 * Exponential Backoff
 * Most TCP implementations use an exponential backoff strategy to calculate the value of RTO if a retransmission occurs.
 * The value of RTO is doubled for each retransmission.
 * So if the segment is retransmitted once, the value is two times the RTO.
 * If it is retransmitted twice, the value is four times the RTO.
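
Karn's algorithm and exponential backoff interact: a time-out doubles the RTO, and the doubled value stays in force until an acknowledgment arrives for a segment that was not retransmitted. A minimal sketch, where the function names and the `recompute_rto` callback are hypothetical stand-ins for whatever RTO calculation is in use:

```python
def on_timeout(rto):
    """Exponential backoff: double the RTO for each retransmission."""
    return rto * 2


def on_ack(rto, rtt_sample, was_retransmitted, recompute_rto):
    """Karn's rule: ignore RTT samples from retransmitted segments."""
    if was_retransmitted:
        return rto                        # ambiguous sample: keep the backed-off RTO
    return recompute_rto(rtt_sample)      # clean sample: derive a fresh RTO from it


rto = 1.0
rto = on_timeout(rto)    # first retransmission  -> 2.0
rto = on_timeout(rto)    # second retransmission -> 4.0
print(rto)

# The ACK for the retransmitted segment does not change the RTO.
print(on_ack(rto, rtt_sample=0.3, was_retransmitted=True, recompute_rto=lambda s: 2 * s))
```

The sample is ignored because the sender cannot tell whether the ACK answers the original segment or the retransmission.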


 * Persistence Timer
 * To deal with a zero-window-size advertisement, TCP needs a persistence timer.
 * If the receiving TCP announces a window size of zero, the sending TCP stops transmitting segments until the receiving TCP sends an ACK segment announcing a nonzero window size.
 * This ACK segment can be lost.
 * Remember - ACK segments are neither acknowledged nor retransmitted in TCP.
 * Both TCPs might continue to wait for each other forever (a deadlock).
 * To correct this deadlock, TCP uses a persistence timer for each connection.
 * When the sending TCP receives an acknowledgment with a window size of zero, it starts a persistence timer.
 * When the persistence timer goes off, the sending TCP sends a special segment called a Probe.
 * This segment contains only 1 byte of new data.
 * It has a sequence number, but its sequence number is never acknowledged; it is even ignored in calculating the sequence number for the rest of the data.
 * The probe causes the receiving TCP to resend the acknowledgment.
 * The value of the persistence timer is initially set to the value of the retransmission time-out (RTO).
 * If a response is not received from the receiver, another probe segment is sent and the value of the persistence timer is doubled and reset.
 * The sender continues sending the probe segments and doubling and resetting the value of the persistence timer until the value reaches a threshold (generally 60s).
 * After that the sender sends one probe segment every 60 s until the window is reopened.
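
The probe schedule described above (start at the RTO, double after each unanswered probe, level off at the threshold) might be sketched as follows. The function name and parameters are illustrative assumptions; the 60 s default comes from the notes.

```python
def probe_intervals(rto, probes, threshold=60.0):
    """Return the wait, in seconds, before each successive window probe."""
    intervals = []
    timer = min(rto, threshold)  # the timer starts at the retransmission time-out
    for _ in range(probes):
        intervals.append(timer)
        # Double the timer after every unanswered probe, up to the threshold.
        timer = min(timer * 2, threshold)
    return intervals
```

Starting from an RTO of 5 s, the waits are 5, 10, 20, 40 and then 60 s for every later probe.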


 * Keepalive Timer
 * A keepalive timer is used in some implementations to prevent a long idle connection between two TCPs.
 * Suppose a client opens a TCP connection to a server, transfers some data, and becomes silent.
 * Perhaps the client has crashed. In this case, the connection remains open forever.
 * To remedy this situation, most implementations equip a server with a keepalive timer.
 * Each time the server hears from a client, it resets this timer.
 * The time-out is usually 2 hours.
 * If the server does not hear from the client after 2 hours, it sends a probe segment.
 * If there is no response after 10 probes, each of which is 75s apart, it assumes that the client is down and terminates the connection.
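
To put a number on the keepalive behaviour, the sketch below adds up the idle time-out and the probe phase. It assumes the server waits one more 75 s interval after the last probe before giving up; the function name and that assumption are not from the notes.

```python
def keepalive_give_up_time(idle_timeout=2 * 60 * 60, probes=10, probe_interval=75):
    """Seconds of client silence after which the server terminates the connection."""
    # Idle period before the first probe, then one interval per probe
    # (assumes a final wait after the last probe).
    return idle_timeout + probes * probe_interval
```

With the usual values this comes to 7,950 s, or 2 hours 12 minutes 30 seconds.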

Options
The TCP header can have up to 40 bytes of optional information.


 * 1-byte options
 * 1) End of option list
 * 2) No operation


 * Multiple-byte options
 * 1) Maximum Segment Size
 * 2) Window Scale Factor
 * 3) Timestamp
 * 4) SACK-permitted
 * 5) SACK


 * End of Option
 * EOP is a 1-byte option used for padding at the end of the option section.
 * It can only be used as the last option. There are no more options in the header after EOP.
 * Only one occurrence of this option is allowed.
 * After this option, the receiver looks for the payload data.
 * Data from the application program starts at the beginning of the next 32-bit word.


 * No Operation
 * NOP option is also a 1-byte option used as a filler.
 * It normally comes before another option to help align it on a 32-bit (4-byte) word boundary.
 * NOP can be used more than once.
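
A small parser shows how the two 1-byte options differ from the multiple-byte ones when walking the option section. This is a sketch: the function name is hypothetical, and the kind numbers (0 for End of Option, 1 for No Operation) are the standard TCP values, which these notes do not state.

```python
def parse_tcp_options(raw):
    """Split the option section of a TCP header into (kind, data) pairs."""
    EOL, NOP = 0, 1  # standard kind numbers (not stated in the notes)
    options = []
    i = 0
    while i < len(raw):
        kind = raw[i]
        if kind == EOL:
            # End of Option: nothing after it is an option, only padding.
            options.append((kind, b""))
            break
        if kind == NOP:
            # No Operation: a 1-byte filler, may appear many times.
            options.append((kind, b""))
            i += 1
            continue
        # Multiple-byte option: kind, total length, then the data bytes.
        if i + 1 >= len(raw):
            raise ValueError("truncated option")
        length = raw[i + 1]
        if length < 2 or i + length > len(raw):
            raise ValueError("bad option length")
        options.append((kind, bytes(raw[i + 2:i + length])))
        i += length
    return options
```

For example, `parse_tcp_options(bytes([2, 4, 5, 0xB4, 1, 3, 3, 7, 0, 0, 0, 0]))` yields an MSS option, one NOP, a window scale option, and the End of Option marker; the trailing zero bytes are treated as padding.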


 * Maximum Segment Size (MSS)


 * MSS option defines the size of the biggest unit of data that can be received by the destination of the TCP segment.
 * It defines the maximum size of the data, not the maximum size of the segment.
 * The field is 16 bits long, so the value can range from 0 to 65,535 bytes.
 * Each party defines the MSS for the segments it will receive during the connection.
 * If a party does not define this, the default value is 536 bytes.
 * The value of MSS is determined during connection establishment and does not change during the connection.
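
The MSS rules can be illustrated by encoding the option and applying the default when the peer sends none. The function names and the dictionary layout are hypothetical; the kind number 2 and total length 4 are the standard values for this option, which the notes do not spell out.

```python
import struct

DEFAULT_MSS = 536  # used when the peer does not announce an MSS


def build_mss_option(mss):
    """Encode an MSS option: kind 2, length 4, 16-bit value."""
    if not 0 <= mss <= 0xFFFF:
        raise ValueError("MSS must fit in 16 bits")
    return struct.pack("!BBH", 2, 4, mss)


def peer_mss(syn_options):
    """Return the MSS announced by the peer.

    syn_options is a hypothetical dict mapping option kind to its data bytes,
    taken from the peer's SYN or SYN + ACK segment.
    """
    data = syn_options.get(2)
    if data is None:
        return DEFAULT_MSS
    return struct.unpack("!H", data)[0]
```

Because the value is fixed at connection establishment, a sender would call something like `peer_mss` once and keep the result for the life of the connection.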


 * Window Scale Factor
 * Window size field in the header defines the size of the sliding window.
 * This field is 16 bits long, which means that the window can range from 0 to 65,535 bytes.
 * It may not be sufficient if the data are traveling through a long channel with a wide bandwidth.
 * To increase the window size, a window scale factor is used.
 * The new window size is found by first raising 2 to the number specified in the window scale factor.
 * Then this result is multiplied by the value of the window size in the header.

New Window Size = Window Size in Header × 2^(Window Scale Factor)

For example, if the window scale factor is 3 and an end point receives an acknowledgment in which the window size is advertised as 32,768, then New Window Size = 32,768 × 2^3 = 262,144 bytes.


 * Although the scale factor could be as large as 255, the largest value allowed by TCP/IP is 14.
 * Maximum window size is 2^16 × 2^14 = 2^30, which is less than the maximum value for the sequence number.
 * The size of the window cannot be greater than the maximum value of the sequence number.
 * The value of the window scale factor can also be determined only during connection establishment; it does not change during the connection.
 * During data transfer, the size of the window (specified in the header) may be changed, but it must be multiplied by the same window scale factor.
 * One end may set the value of the window scale factor to 0, which means although it supports this option, it does not want to use it for this connection.
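
The window scale computation is a left shift of the 16-bit header value. The sketch below uses hypothetical names and enforces the limit of 14 given above.

```python
MAX_SCALE_FACTOR = 14  # largest window scale factor allowed


def effective_window(header_window, scale_factor):
    """Window size in bytes after applying the window scale factor."""
    if not 0 <= header_window <= 0xFFFF:
        raise ValueError("window field is 16 bits")
    if not 0 <= scale_factor <= MAX_SCALE_FACTOR:
        raise ValueError("scale factor must be between 0 and 14")
    # Multiplying by 2^scale_factor is the same as shifting left.
    return header_window << scale_factor
```

`effective_window(32768, 3)` reproduces the 262,144-byte example, and a scale factor of 0 leaves the header value unchanged.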


 * Timestamp


 * This is a 10-byte option with the format shown in Figure 15.46. Note that the end with the active open announces a timestamp in the connection request segment (SYN segment).
 * If it receives a timestamp in the next segment (SYN + ACK) from the other end, it is allowed to use the timestamp; otherwise, it does not use it any more.
 * The timestamp option has two applications: it measures the round-trip time and prevents wraparound sequence numbers.


 * Measuring RTT


 * Timestamp can be used to measure the round-trip time (RTT).
 * TCP, when ready to send a segment, reads the value of the system clock and inserts this value, a 32-bit number, in the timestamp value field.
 * The receiver, when sending an acknowledgment for this segment or an accumulative acknowledgment that covers the bytes in this segment, copies the timestamp received into the timestamp echo reply field.
 * The sender, upon receiving the acknowledgment, subtracts the value of the timestamp echo reply from the time shown by the clock to find RTT.


 * Note that there is no need for the sender’s and receiver’s clocks to be synchronized because all calculations are based on the sender clock.
 * Also note that the sender does not have to remember or store the time a segment left because this value is carried by the segment itself.

 * The receiver needs to keep track of two variables. The first, lastack, is the value of the last acknowledgment sent.
 * The second, tsrecent, is the value of the recent timestamp that has not yet been echoed.
 * When the receiver receives a segment that contains the byte matching the value of lastack, it inserts the value of the timestamp field in the tsrecent variable.
 * When it sends an acknowledgment, it inserts the value of tsrecent in the echo reply field.
 * The sender simply inserts the value of the clock (for example, the number of seconds past midnight) in the timestamp field for the first and second segment.
 * When an acknowledgment comes (the third segment), the value of the clock is checked and the value of the echo reply field is subtracted from the current time.
 * RTT is 12 s in this scenario.
 * The receiver’s function is more involved.
 * It keeps track of the last acknowledgment sent (12000).
 * When the first segment arrives, it contains the bytes 12000 to 12099.
 * The first byte is the same as the value of lastack.
 * It then copies the timestamp value (4720) into the tsrecent variable.
 * The value of lastack is still 12000 (no new acknowledgment has been sent).
 * When the second segment arrives, since none of the byte numbers in this segment include the value of lastack, the value of the timestamp field is ignored.
 * When the receiver decides to send an accumulative acknowledgment with acknowledgment 12200, it changes the value of lastack to 12200 and inserts the value of tsrecent in the echo reply field.
 * The value of tsrecent will not change until it is replaced by a new segment that carries byte 12200 (next segment).
 * Note that as the example shows, the RTT calculated is the time difference between sending the first segment and receiving the third segment.
 * This is actually the meaning of RTT: the time difference between a packet sent and the acknowledgment received.
 * The third segment carries the acknowledgment for the first and second segments.
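
The receiver's bookkeeping with lastack and tsrecent can be replayed in code. In this sketch the class and function names are hypothetical, and so are the second segment's timestamp (4725) and the sender's clock reading when the acknowledgment arrives (4732); the latter is chosen to match the 12 s RTT stated above.

```python
class TimestampReceiver:
    """Tracks lastack and tsrecent as described in the notes."""

    def __init__(self, lastack):
        self.lastack = lastack  # last acknowledgment number sent
        self.tsrecent = None    # most recent timestamp not yet echoed

    def on_segment(self, first_byte, last_byte, timestamp):
        # Only a segment containing the byte numbered lastack updates tsrecent.
        if first_byte <= self.lastack <= last_byte:
            self.tsrecent = timestamp

    def send_ack(self, ack_number):
        # The echo reply field carries tsrecent; lastack moves forward.
        self.lastack = ack_number
        return ack_number, self.tsrecent


def measure_rtt(clock_now, echo_reply):
    """Sender side: current clock minus the echoed timestamp."""
    return clock_now - echo_reply


receiver = TimestampReceiver(lastack=12000)
receiver.on_segment(12000, 12099, timestamp=4720)  # contains byte 12000
receiver.on_segment(12100, 12199, timestamp=4725)  # ignored (hypothetical timestamp)
ack_number, echo = receiver.send_ack(12200)
rtt = measure_rtt(clock_now=4732, echo_reply=echo)  # hypothetical clock reading
```

The echoed value is the first segment's timestamp, so the measured RTT spans from sending the first segment to receiving the acknowledgment.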


 * PAWS


 * The timestamp option has another application, protection against wrapped sequence numbers (PAWS).
 * The sequence number defined in the TCP protocol is only 32 bits long.
 * Although this is a large number, it could be wrapped around in a high-speed connection.
 * This implies that if a sequence number is n at one time, it could be n again during the lifetime of the same connection.
 * Now if the first segment is duplicated and arrives during the second round of the sequence numbers, the segment belonging to the past is wrongly taken as the segment belonging to the new round.
 * One solution to this problem is to increase the size of the sequence number, but this involves increasing the size of the window as well as the format of the segment and more.
 * The easiest solution is to include the timestamp in the identification of a segment.
 * In other words, the identity of a segment can be defined as the combination of timestamp and sequence number.
 * This means increasing the size of the identification.
 * Two segments 400:12,001 and 700:12,001 definitely belong to different incarnations.
 * The first was sent at time 400, the second at time 700.
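
A much simplified version of the PAWS idea is to treat a segment as stale when its timestamp is older than the most recent one accepted. The function name is hypothetical, and real implementations compare 32-bit timestamps with wraparound arithmetic, which this sketch ignores.

```python
def is_stale(segment_timestamp, most_recent_timestamp):
    """True if the segment belongs to an earlier round of sequence numbers."""
    # Simplification: plain integer comparison, no 32-bit wraparound handling.
    return segment_timestamp < most_recent_timestamp


# The two segments from the notes share a sequence number but not an identity.
old_segment = (400, 12001)  # (timestamp, sequence number)
new_segment = (700, 12001)
```

If the segment stamped 700 has already been accepted, a late duplicate stamped 400 with the same sequence number is recognised as stale and dropped.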


 * SACK-Permitted and SACK Options


 * As we discussed before, the acknowledgment field in the TCP segment is designed as an accumulative acknowledgment, which means it reports the receipt of the last consecutive byte: it does not report the bytes that have arrived out of order.
 * It is also silent about duplicate segments.
 * This may have a negative effect on TCP’s performance.
 * If some packets are lost or dropped, the sender must wait until a time-out and then send all packets that have not been acknowledged.
 * The receiver may receive duplicate packets.
 * To improve performance, selective acknowledgment (SACK) was proposed.
 * Selective acknowledgment allows the sender to have a better idea of which segments are actually lost and which have arrived out of order.
 * The new proposal even includes a list for duplicate packets.
 * The sender can then send only those segments that are really lost.
 * The list of duplicate segments can help the sender find the segments that were retransmitted unnecessarily because of a short time-out.
 * The SACK-permitted option of two bytes is used only during connection establishment.
 * The host that sends the SYN segment adds this option to show that it can support the SACK option.
 * If the other end, in its SYN + ACK segment, also includes this option, then the two ends can use the SACK option during data transfer.
 * Note that the SACK-permitted option is not allowed during the data transfer phase.
 * The SACK option, of variable length, is used during data transfer only if both ends agree (if they have exchanged SACK-permitted options during connection establishment).
 * The option includes a list for blocks arriving out of order.
 * Each block occupies two 32-bit numbers that define the beginning and the end of the blocks.
 * Remember that the allowed size of the option section in TCP is only 40 bytes.
 * This means that a SACK option cannot define more than 4 blocks.
 * The information for 5 blocks occupies (5 × 2) × 4 + 2 or 42 bytes, which is beyond the available size for the option section in a segment.
 * If the SACK option is used with other options, then the number of blocks may be reduced.
 * The first block of the SACK option can be used to report the duplicates.
 * This is used only if the implementation allows this feature.
 * The SACK option announces this duplicate data first and then the out-of-order block.
 * The duplicated block may not yet be acknowledged by ACK, but if it is part of the out-of-order block (for example, 4001:5000 is part of 4001:6000), it is understood by the sender that it defines the duplicate data.
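
The block-count limit follows directly from the size arithmetic in the notes. The sketch below uses hypothetical function names; the 12-byte figure in the usage note assumes a 10-byte timestamp option padded with two NOPs.

```python
OPTION_SPACE = 40  # bytes available for options in a TCP header


def sack_option_length(blocks):
    """Bytes used by a SACK option: two 32-bit numbers per block, plus 2."""
    return blocks * 2 * 4 + 2


def max_sack_blocks(other_option_bytes=0):
    """Largest number of SACK blocks that fits next to the other options."""
    available = OPTION_SPACE - other_option_bytes
    return max(0, (available - 2) // 8)
```

On its own the option holds 4 blocks (34 bytes), since 5 blocks would need 42 bytes. With 12 bytes taken by a padded timestamp option, `max_sack_blocks(12)` drops to 3.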