Basic Troubleshooting

__TOC__
 
= Proxy Server =
 
== Capture Traffic passing through a Transparent Proxy ==
 
The filters below are applied on the firewall (ScreenOS in this example) to capture the complete packet flow across it.
<br />
 
== Proxy Server Flow<ref>www.india.fidelity.com</ref> ==
 
[[File:Proxy_server_flow_non_transparent.png|center]]
<br />
 
=== Packet flow for HTTP Traffic ===
 
[[File:Proxy_server_flow_non_transparent_http.png|center]]
<br />
 
=== Packet flow for HTTPS Traffic ===
 
[[File:Proxy_server_flow_non_transparent_https.png|center]]
<br />
 
= Tail Latency =
 
Sources: [http://highscalability.com/blog/2012/3/12/google-taming-the-long-latency-tail-when-more-machines-equal.html highscalability.com], [https://accelazh.github.io/storage/Tail-Latency-Study accelazh.github.io]
 
*Imagine a client making a request of a single web server.
*Ninety-nine times out of a hundred that request will be returned within an acceptable period of time.
*But one time out of a hundred it may not; say the disk is slow for some reason.
*If you look at the distribution of latencies, most of them are small, but there's one out on the tail end that's large.
*That's not so bad really.
*All it means is one customer gets a slightly slower response every once in a while.
 
*Let's change the example: now, instead of one server, you have 100 servers, and a request requires a response from all 100.
*That changes everything about your system's responsiveness.
*Suddenly the majority of queries are slow: about 63% will take longer than 1 second (see the calculation after this list). That's bad.
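
That 63% figure is just independent-probability arithmetic: if each server is slow on 1% of requests, the chance that all 100 respond quickly is 0.99<sup>100</sup> ≈ 0.37, so roughly 63% of fan-out requests hit at least one slow server. A quick sketch of the calculation:

<syntaxhighlight lang="python">
# Fraction of fan-out requests that hit at least one slow server,
# assuming each of n servers is independently slow on a fraction p_slow of requests.
def slow_fraction(n, p_slow=0.01):
    return 1.0 - (1.0 - p_slow) ** n

print(slow_fraction(1))    # ≈ 0.01  -> 1% of single-server requests are slow
print(slow_fraction(100))  # ≈ 0.634 -> about 63% of 100-way fan-outs are slow
</syntaxhighlight>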
 
*Using the same components and scaling them results in a really unexpected outcome.
*This is a fundamental property of scaling systems: you need to worry not just about average latency but about tail latency, that is, the slowest events in your system (see the sketch after this list).
*High performance equals high tolerances.
*At scale you can’t ignore tail latency.
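
To make “the slowest events” concrete, here is a small sketch with a made-up latency sample, showing how the high percentiles surface the tail that the median hides:

<syntaxhighlight lang="python">
import random
import statistics

# Made-up sample: 990 requests around 10 ms, plus 10 one-second outliers (the tail).
latencies = [random.gauss(0.010, 0.002) for _ in range(990)] + [1.0] * 10

cuts = statistics.quantiles(latencies, n=100)  # 99 percentile cut points
print("p50 :", round(cuts[49], 4))   # the median barely notices the outliers
print("p99 :", round(cuts[98], 4))   # dominated by the one-second tail
print("mean:", round(statistics.mean(latencies), 4))
</syntaxhighlight>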
 
*This latency could come from:
**RPC libraries
**DNS lookups
**Slow disks
**Packet loss
**Microbursts
**Deep queues
**High task response latency
**Locking
**Garbage collection
**OS stack issues
**Router/switch overhead
**Transiting multiple hops
**Slow processing code
 
*Other reasons:
**Overprovisioned VMs
**Many OS images forked from a small shared base
**A large request pegging your CPU, network, or disk, making other requests queue up behind it
**Something gone wrong, such as an infinite loop pegging your CPU
 
 
*The latency distribution has low, middle, and tail parts.
*To reduce the low and middle parts: provisioning more resources, cutting tasks up and parallelizing them, eliminating “head-of-line” blocking, and caching will all help.
*To reduce the tail latency, the basic idea is hedging: send the same request to more than one replica and use whichever response comes back first (a sketch follows this list).
*Even though we’ve parallelized the service, the slowest instance determines when our request is done.
*Code freezes can come from interrupts, context switches, cache or buffer flushes to disk, garbage collection, or database reindexing.
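
A minimal sketch of a hedged request, assuming a hypothetical call_replica() function standing in for the real RPC to one replica: the request goes to the primary first, and only if it has not answered within a small hedge delay is the same request sent to the backups, with whichever response arrives first being used.

<syntaxhighlight lang="python">
import concurrent.futures
import time

def call_replica(name):
    """Hypothetical stand-in for a real RPC to one replica."""
    time.sleep(1.0 if name == "replica-1" else 0.05)  # replica-1 plays the straggler
    return "response from " + name

def hedged_request(replicas, hedge_delay=0.1):
    """Ask the primary replica first; if it has not answered within hedge_delay
    seconds, send the same request to the backups and return the first answer."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(replicas))
    futures = [pool.submit(call_replica, replicas[0])]
    done, _ = concurrent.futures.wait(futures, timeout=hedge_delay)
    if not done:  # the primary looks slow -> hedge to the backups
        futures += [pool.submit(call_replica, r) for r in replicas[1:]]
        done, _ = concurrent.futures.wait(
            futures, return_when=concurrent.futures.FIRST_COMPLETED)
    result = next(iter(done)).result()
    pool.shutdown(wait=False)  # let any stragglers finish in the background
    return result

print(hedged_request(["replica-1", "replica-2", "replica-3"]))
</syntaxhighlight>

One common refinement is to send the hedge only after the primary has exceeded something like its 95th-percentile latency, which keeps the extra load from hedging small while still cutting off the worst of the tail.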