Basic Troubleshooting

__TOC__
 
= Proxy Server =
 
== Capture Traffic passing through a Transparent Proxy ==
 
The filters below are applied on the firewall (ScreenOS in this example) to capture the complete packet flow across it.
<br />
 
== Proxy Server Flow<ref>www.india.fidelity.com</ref> ==
 
[[File:Proxy_server_flow_non_transparent.png|center]]
<br />
 
=== Packet flow for HTTP Traffic ===
 
[[File:Proxy_server_flow_non_transparent_http.png|center]]
<br />
 
=== Packet flow for HTTPS Traffic ===
 
[[File:Proxy_server_flow_non_transparent_https.png|center]]
<br />
 
= Tail Latency =
 
Sources: [http://highscalability.com/blog/2012/3/12/google-taming-the-long-latency-tail-when-more-machines-equal.html highscalability.com], [https://accelazh.github.io/storage/Tail-Latency-Study accelazh.github.io]
 
*Imagine a client making a request of a single web server.
*Ninety-nine times out of a hundred that request will be returned within an acceptable period of time.
*But one time out of a hundred it may not; say the disk is slow for some reason.
*If you look at the distribution of latencies, most of them are small, but there's one out on the tail end that's large.
*That's not so bad really.
*All it means is one customer gets a slightly slower response every once in a while.
 
*Let's change the example: now, instead of one server, you have 100 servers, and a request requires a response from all 100.
*That changes everything about your system's responsiveness.
*Suddenly the majority of queries are slow: about 63% will take longer than 1 second (see the calculation after this list). That's bad.
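
That 63% figure is just independent-probability arithmetic: if each server is slow on 1% of requests, the chance that all 100 respond quickly is 0.99<sup>100</sup> ≈ 0.37, so roughly 63% of fan-out requests hit at least one slow server. A quick sketch of the calculation:

<syntaxhighlight lang="python">
# Fraction of fan-out requests that hit at least one slow server,
# assuming each of n servers is independently slow on a fraction p_slow of requests.
def slow_fraction(n, p_slow=0.01):
    return 1.0 - (1.0 - p_slow) ** n

print(slow_fraction(1))    # ≈ 0.01  -> 1% of single-server requests are slow
print(slow_fraction(100))  # ≈ 0.634 -> about 63% of 100-way fan-outs are slow
</syntaxhighlight>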
 
*Using the same components and scaling them results in a really unexpected outcome.
*This is a fundamental property of scaling systems: you need to worry not just about average latency but about tail latency, that is, the slowest events in your system (see the sketch after this list).
*High performance equals high tolerances.
*At scale you can’t ignore tail latency.
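
To make “the slowest events” concrete, here is a small sketch with a made-up latency sample, showing how the high percentiles surface the tail that the median hides:

<syntaxhighlight lang="python">
import random
import statistics

# Made-up sample: 990 requests around 10 ms, plus 10 one-second outliers (the tail).
latencies = [random.gauss(0.010, 0.002) for _ in range(990)] + [1.0] * 10

cuts = statistics.quantiles(latencies, n=100)  # 99 percentile cut points
print("p50 :", round(cuts[49], 4))   # the median barely notices the outliers
print("p99 :", round(cuts[98], 4))   # dominated by the one-second tail
print("mean:", round(statistics.mean(latencies), 4))
</syntaxhighlight>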
 
*This latency could come from:
**RPC libraries
**DNS lookups
**Slow disks
**Packet loss
**Microbursts
**Deep queues
**High task response latency
**Locking
**Garbage collection
**OS stack issues
**Router/switch overhead
**Transiting multiple hops
**Slow processing code
 
*Other reasons:
**Overprovisioned VMs
**Many OS images forked from a small shared base
**A large request pegging your CPU, network, or disk, making other requests queue up behind it
**Something gone wrong, such as an infinite loop pegging your CPU
 
 
*The latency distribution has low, middle, and tail parts.
*To reduce the low and middle parts: provisioning more resources, cutting tasks up and parallelizing them, eliminating “head-of-line” blocking, and caching will all help.
*To reduce the tail latency, the basic idea is hedging: send the same request to more than one replica and use whichever response comes back first (a sketch follows this list).
*Even though we’ve parallelized the service, the slowest instance determines when our request is done.
*Code freezes can come from interrupts, context switches, cache or buffer flushes to disk, garbage collection, or database reindexing.
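
A minimal sketch of a hedged request, assuming a hypothetical call_replica() function standing in for the real RPC to one replica: the request goes to the primary first, and only if it has not answered within a small hedge delay is the same request sent to the backups, with whichever response arrives first being used.

<syntaxhighlight lang="python">
import concurrent.futures
import time

def call_replica(name):
    """Hypothetical stand-in for a real RPC to one replica."""
    time.sleep(1.0 if name == "replica-1" else 0.05)  # replica-1 plays the straggler
    return "response from " + name

def hedged_request(replicas, hedge_delay=0.1):
    """Ask the primary replica first; if it has not answered within hedge_delay
    seconds, send the same request to the backups and return the first answer."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(replicas))
    futures = [pool.submit(call_replica, replicas[0])]
    done, _ = concurrent.futures.wait(futures, timeout=hedge_delay)
    if not done:  # the primary looks slow -> hedge to the backups
        futures += [pool.submit(call_replica, r) for r in replicas[1:]]
        done, _ = concurrent.futures.wait(
            futures, return_when=concurrent.futures.FIRST_COMPLETED)
    result = next(iter(done)).result()
    pool.shutdown(wait=False)  # let any stragglers finish in the background
    return result

print(hedged_request(["replica-1", "replica-2", "replica-3"]))
</syntaxhighlight>

One common refinement is to send the hedge only after the primary has exceeded something like its 95th-percentile latency, which keeps the extra load from hedging small while still cutting off the worst of the tail.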