Armed with these graphs and this reasoning, I went to the VPs who owned the various web pages and argued that they needed to set these SLAs. In parallel, we (the perf engineering team) built tools that made it really easy for developers to measure the latency of their pages.
That system, called PMET, let developers put little "start" and "stop" indicators in their code and then our system would scrape the logs and store the latency histograms in a database. If their page wasn't hitting SLA, they could drill down and figure out why.
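A minimal sketch of that idea, not PMET's actual implementation: the log format, the function names (`scrape_latencies`, `histogram`), and the 100 ms bucket width are all assumptions made for illustration. It pairs "start" and "stop" markers scraped from log lines and buckets the resulting latencies into a histogram.

```python
import re
from collections import Counter, defaultdict

# Hypothetical log format (an assumption, not PMET's real one):
#   "<timestamp_ms> <START|STOP> <page> <request_id>"
LINE = re.compile(r"^(\d+) (START|STOP) (\S+) (\S+)$")

def scrape_latencies(log_lines):
    """Pair START/STOP markers by (page, request_id); return {page: [latency_ms, ...]}."""
    open_timers = {}
    latencies = defaultdict(list)
    for line in log_lines:
        m = LINE.match(line.strip())
        if not m:
            continue  # ignore unrelated log lines
        ts, kind, page, req = int(m.group(1)), m.group(2), m.group(3), m.group(4)
        key = (page, req)
        if kind == "START":
            open_timers[key] = ts
        elif key in open_timers:
            # STOP with a matching START: record the elapsed time
            latencies[page].append(ts - open_timers.pop(key))
    return dict(latencies)

def histogram(values, bucket_ms=100):
    """Count latencies per fixed-width bucket, keyed by the bucket's lower bound."""
    return dict(Counter((v // bucket_ms) * bucket_ms for v in values))

log = [
    "1000 START home r1",
    "1005 START home r2",
    "1120 STOP home r1",
    "some unrelated log line",
    "1400 STOP home r2",
]
by_page = scrape_latencies(log)
home_hist = histogram(by_page["home"])
print(by_page, home_hist)
```

Storing bucket counts rather than raw samples keeps the stored data small while still letting you read percentiles off the histogram, at the cost of bucket-width precision.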
@dwagner00 wrote the prototype service for collecting and aggregating the data, and I wrote a simple visualization tool (using Perl and gnuplot). The rest is history!
All things considered, it's probably the most impactful thing I have done in my 20 years at Amazon, and I did it in my first few years! That focus on high percentiles instead of averages has driven so much good behavior.
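To illustrate why a high percentile can tell a different story than the average, here is a small sketch on made-up numbers. The sample data and the nearest-rank `percentile` helper are assumptions for illustration, not anything taken from PMET.

```python
import math
import statistics

def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Made-up sample: 98 fast requests and 2 very slow ones (milliseconds).
latencies_ms = [100] * 98 + [5000] * 2

mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)
print(mean_ms, p50_ms, p99_ms)
```

In this sample the mean is 198 ms and the median is 100 ms, while the 99th percentile is 5000 ms: the slow experience that some customers actually get only shows up in the high percentile.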
When James Hamilton started in AWS, we were chatting about PMET. When he heard that I had been one of the creators, he told me that when he read the Dynamo paper, the thing that had the biggest impact on his thinking from the paper was that we were focused on percentile latency.
If you work at Amazon, you can hear me talk about this in an old PoA talk. If you search broadcast for "andrew certain gems" it should be the only hit (it's the first ten minutes of that video). That's it!
Replying to @tacertain
I'll have to find and add that video to the monitoring bootcamp! One set of stats that PMET didn't and CloudWatch doesn't (yet) offer is 'trimmed mean' stats. Wondering if you'd ever considered them? Seems like they'd complement percentiles for use cases like observing page latency.
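For readers unfamiliar with the term, a trimmed mean drops a fixed fraction of the lowest and highest samples before averaging. A minimal sketch, assuming a symmetric trim and made-up latency data (the `trimmed_mean` name and its 10% default are illustrative choices, not an existing PMET or CloudWatch API):

```python
def trimmed_mean(values, trim_fraction=0.1):
    """Mean after dropping the lowest and highest trim_fraction of samples."""
    if not 0 <= trim_fraction < 0.5:
        raise ValueError("trim_fraction must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)  # samples to drop from each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

# Made-up page latencies in milliseconds, with one extreme outlier.
latencies_ms = [90, 95, 100, 100, 105, 110, 110, 115, 120, 4000]
plain = sum(latencies_ms) / len(latencies_ms)
trimmed = trimmed_mean(latencies_ms, 0.1)
print(plain, trimmed)
```

Here the plain mean is 494.5 ms, dominated by the single outlier, while the 10% trimmed mean is 106.875 ms, which is closer to the typical request.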
Replying to @joshea
I hadn't. I'm very self-taught when it comes to statistics, so didn't have many tools in my toolbox!
Replying to @tacertain @joshea
Related: overstats are an underappreciated gem!
What are overstats? (I tried googling for the term, but it's buried in overwatch results.)
A simple count of the number of data points over a given value. E.g., the 1ms overstat counts every data point over 1ms.
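As a rough sketch of that definition (the `overstat` function name and the sample data are made up, and "over" is read here as strictly greater than the threshold, which is an assumption):

```python
def overstat(values, threshold):
    """Count the data points strictly greater than threshold."""
    return sum(1 for v in values if v > threshold)

# Made-up latencies in milliseconds.
latencies_ms = [0.4, 0.9, 1.0, 1.2, 3.5, 12.0]
over_1ms = overstat(latencies_ms, 1.0)
over_10ms = overstat(latencies_ms, 10.0)
print(over_1ms, over_10ms)
```

Because an overstat is a plain count, counts from many hosts or time windows can be summed directly, which is not true of percentiles.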