Yep definitely don’t page on signals indicating ‘user pain’ is likely to come soon; you should 100% wait for actual users to have pain before reacting to it. Don’t try to anticipate the pain as a means of heading off and mitigating the path of pain—users need to feel it!
Replying to @allspaw
I deserve this :-D. To be clear, metrics are not the whole story of user happiness, safety, or reliability, which are discussed at length in the blog posts linked! The point I was trying to summarize was to page on signals that are closely linked to user pain. (1/X)
Or perhaps, that are observable to users (latency is observable to users, CPU usage is not). In the general case, non-user-observable signals can be quite noisy -- I can of course think of counterexamples though, like a disk becoming full (or OOMs, like the incident linked) (2/X)
In general, I agree with the prevailing sentiment that noisy pages are bad, because they sabotage trust in the system. I think high precision is important on pages and am willing to trade off some recall, especially since you can't depend on pages for everything anyway. (3/X)
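(Editor's sketch, not from the thread.) One way to make "precision" and "recall" concrete for pages, assuming you keep a log of which pages were actionable and a count of real incidents. The `Page` record and both function names are hypothetical, chosen only for illustration:

```python
from dataclasses import dataclass


@dataclass
class Page:
    """One page sent to on-call (hypothetical record shape)."""
    alert_name: str
    actionable: bool  # did the responder actually need to act?


def page_precision(pages: list[Page]) -> float:
    """Fraction of pages that were actionable; 1.0 if nothing paged."""
    if not pages:
        return 1.0
    return sum(p.actionable for p in pages) / len(pages)


def page_recall(incidents_paged: int, incidents_total: int) -> float:
    """Fraction of real incidents that produced a page; 1.0 if no incidents."""
    if incidents_total == 0:
        return 1.0
    return incidents_paged / incidents_total
```

A noisy pager shows up as low precision (many pages, few actionable); the trade-off described above accepts a lower `page_recall` in exchange for keeping `page_precision` high.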
I'm super interested if you think this argument in general has giant holes, or if the critique is "when users are unhappy" vs "on signals tied to user pain". And of course in my subsequent tweet, I mention reliability beyond SLOs and tag ACL :-P.
How does this apply to something like data loss? It seems crazy to wait for data loss to occur before paging but maybe that is not a good example.
what does data loss look like, I guess? Related, how does this apply to security? I can think of cases where there's a clear limit you're going to hit (cert expiration, I mentioned disk free above).
Luckily for me I mention that "[SLOs] won't save you from everything" -- perhaps it boils down to heavily favoring precision over recall? Pages that are 100% actionable every time, even if not user visible, might not be that bad? cc @lizthegrey
Data loss usually has leading indicators you can set an SLO on, e.g. under-replication.
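(Editor's sketch, not from the thread.) A rough illustration of paging on a leading indicator such as under-replication rather than on data loss itself. The function names, the replication target of 3, and the 1% threshold are all assumptions made for the example, not values anyone in the thread proposed:

```python
def under_replicated_fraction(replica_counts: dict[str, int], target: int) -> float:
    """Fraction of partitions with fewer live replicas than the target."""
    if not replica_counts:
        return 0.0
    under = sum(1 for n in replica_counts.values() if n < target)
    return under / len(replica_counts)


def should_page(
    replica_counts: dict[str, int],
    target: int = 3,             # assumed replication factor
    max_fraction: float = 0.01,  # assumed SLO threshold
) -> bool:
    """Page when the under-replicated fraction exceeds the SLO threshold."""
    return under_replicated_fraction(replica_counts, target) > max_fraction
```

No data has been lost when this fires, but the margin protecting against loss is shrinking, which is what makes it a leading indicator.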
Replying to @lizthegrey @jhscott and others:
This might be oversimplifying, but I feel like system resiliency can follow a similar pattern to the traditional advice for commenting your code: "Write your code so it doesn't need comments, and then comment it anyway." (Not trying to start a flame war here, bear with me.)
@mipsytipsy's thought experiment was essentially that:
Instrument and observe your code as if you don't have pager alerts. Then add pager alerts back in.
Replying to @shelbyspees @lizthegrey and others:
Alerts, like comments, are static. It takes extra cognitive resources to validate their value and correctness (compared to code, which can at least be run and tested). Comments and documentation don't get updated when code does. The same is true for many pager alerts.
Replying to @shelbyspees @lizthegrey and others:
My team has been getting a weird alert all week: "numInputRows is too low". Systems were behaving fine, data was streaming fine. Each day when it was triggered we'd query the DB to make sure data was arriving, and it was.