Basic Troubleshooting: Difference between revisions

 
= Tail Latency =
 
Source [http://highscalability.com/blog/2012/3/12/google-taming-the-long-latency-tail-when-more-machines-equal.html highscalability.com]
 
*Imagine a client making a request of a single web server.
*Ninety-nine times out of a hundred that request will be returned within an acceptable period of time.
*But one time out of a hundred it may not.
*Say the disk is slow for some reason.
*If you look at the distribution of latencies, most of them are small, but there's one out on the tail end that's large.
*That's not so bad really.
*All it means is one customer gets a slightly slower response every once in a while.
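A minimal sketch of that single-server distribution, assuming made-up numbers: the 10 ms and 1500 ms latencies and the 1000 ms "unacceptable" cutoff are illustrative assumptions, not measurements from the source. It shows why the median looks healthy even though one request sits far out on the tail.

```python
import statistics

# Hypothetical sample: 99 fast responses and one slow one, in milliseconds.
latencies_ms = [10.0] * 99 + [1500.0]
SLOW_THRESHOLD_MS = 1000.0  # assumed cutoff for an "unacceptable" response

median_ms = statistics.median(latencies_ms)  # the typical request
mean_ms = statistics.mean(latencies_ms)      # pulled up slightly by the outlier
max_ms = max(latencies_ms)                   # the tail itself
slow_share = sum(l > SLOW_THRESHOLD_MS for l in latencies_ms) / len(latencies_ms)

print(f"median={median_ms} ms  mean={mean_ms:.1f} ms  max={max_ms} ms")
print(f"share of slow requests: {slow_share:.0%}")
```

The median and the mean both stay low here, so a dashboard that tracks only those two numbers would probably never show the slow request.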
 
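*Let's change the example: now, instead of one server, you have 100 servers, and a request requires a response from all 100 of them.
*That changes everything about your system's responsiveness.
*Suddenly the majority of queries are slow. If the slow 1% of single-server responses take longer than 1 second, then about 63% of these fan-out requests will take longer than 1 second, because the chance that all 100 servers answer quickly is only 0.99^100, or roughly 0.37.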
*That's bad.
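The 63% figure can be reproduced with a short calculation. This is a sketch that assumes each server is slow independently with probability 1%; the function name `prob_any_slow` is made up for illustration.

```python
def prob_any_slow(p_slow: float, fan_out: int) -> float:
    """Probability that at least one of `fan_out` independent calls is slow."""
    return 1.0 - (1.0 - p_slow) ** fan_out

# One slow response in a hundred per server, as in the example above.
for servers in (1, 10, 100):
    share = prob_any_slow(0.01, servers)
    print(f"{servers:>3} servers: {share:.1%} of requests wait on a slow server")
```

With these assumptions the share rises from 1.0% for a single server to about 9.6% for 10 servers and about 63.4% for 100. If slowness is correlated across servers (for example, a shared switch), the real number will differ.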
 
*Using the same components and scaling them results in a really unexpected outcome.
*This is a fundamental property of scaling systems: you need to worry not just about latency, but about tail latency, that is, the slowest events in your system.
*High performance equals high tolerances.
*At scale you can’t ignore tail latency.
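The same effect can be checked empirically with a small simulation. This is a sketch under stated assumptions: the latency constants, the trial count, and the helper names are all hypothetical, and each simulated server is slow independently of the others. A fan-out request is modelled as waiting for its slowest server.

```python
import random

FAST_MS = 10.0    # assumed typical server latency
SLOW_MS = 1500.0  # assumed tail latency
P_SLOW = 0.01     # one call in a hundred is slow

def server_latency(rng: random.Random) -> float:
    """Latency of one simulated server call."""
    return SLOW_MS if rng.random() < P_SLOW else FAST_MS

def request_latency(rng: random.Random, fan_out: int) -> float:
    """A request needs every server, so the slowest one sets its latency."""
    return max(server_latency(rng) for _ in range(fan_out))

def slow_fraction(fan_out: int, trials: int = 5000, seed: int = 42) -> float:
    """Fraction of simulated requests that end up waiting on a slow server."""
    rng = random.Random(seed)
    slow = sum(request_latency(rng, fan_out) >= SLOW_MS for _ in range(trials))
    return slow / trials

single = slow_fraction(1)
fanned = slow_fraction(100)
print(f"1 server:    {single:.1%} of requests are slow")
print(f"100 servers: {fanned:.1%} of requests are slow")
```

Nothing about the individual servers changes between the two runs; only the fan-out does, which is why the rare slow case ends up dominating the request latency.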
 
*This latency could come from:
**RPC library
**DNS lookups
**Slow disks
**Packet loss
**Microbursts
**Deep queues
**High task response latency
**Locking
**Garbage collection
**OS stack issues
**Router/switch overhead
**Transiting multiple hops
**Slow processing code