Related: planned downtime in 2019 is madness. Deliberate downtime to keep dependencies on their toes ... toxic. I'll make exceptions for chaos engineering *in the noise* where resilience processes can paper over a failed request or something, but not for turning off your service.
9 times out of 10 when I hear about error budgets, it's a team that is focusing on its own convenience because it hasn't invested in convenient resilience, and that isn't putting customers first. SLAs can be crude at this ... is a 4 hour outage ok on your busiest day? Still 99.95% YoY!
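To make the arithmetic behind that "still 99.95%" remark concrete, here is a minimal sketch. It assumes a non-leap 365-day year and treats availability as simple uptime over wall-clock time; the function names are made up for illustration and are not from any particular SRE tool.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, assuming a non-leap year


def yearly_availability(outage_hours: float) -> float:
    """Percentage of the year the service was up, given total outage hours."""
    return 100.0 * (1 - outage_hours / HOURS_PER_YEAR)


def allowed_downtime_hours(target_percent: float) -> float:
    """Hours of downtime per year that still meet the given availability target."""
    return HOURS_PER_YEAR * (1 - target_percent / 100.0)


# A single 4 hour outage, regardless of which day it falls on.
availability = yearly_availability(4)
# The yearly downtime allowance implied by a 99.95% target.
budget = allowed_downtime_hours(99.95)

print(f"availability after a 4h outage: {availability:.3f}%")
print(f"downtime allowed at 99.95%: {budget:.2f}h")
```

Nothing in the calculation knows whether those 4 hours landed on the busiest day of the year, which is the crudeness being pointed at.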
Does anyone really use an “error-budget” though - outside Google? I don’t know of anyone, tbh.
SRE culture has tentacles, and seems to bring error budgets, seemingly endless Kubernetes tinkering, and stack over-complication to some places. Plenty of good too, mostly good on balance, but some of it gets to me.
What do you mean by SLAs? For me, they're the public-facing targets - like EC2 not breaking two AZs in a region at the same time. Such SLAs are really weak, and we set our SLOs a lot tighter. If you're talking about something else, what is the consequence of missing?
At AWS pretty much everything has an internal SLA that is measured as the percentage of transactions. It varies from 99.95% to 99.999% depending on the service. We often measure it per-customer too. It's very similar to SLOs.
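As a rough illustration of measuring success as a percentage of transactions, both overall and per customer, here is a small sketch. The data shape (pairs of customer id and success flag), the function name, and the sample numbers are all assumptions made up for this example; this is not how AWS actually implements its measurement.

```python
from collections import defaultdict


def success_rate_by_customer(transactions):
    """Return {customer_id: percent of successful transactions}.

    `transactions` is an iterable of (customer_id, succeeded) pairs,
    a hypothetical shape chosen only for this sketch.
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for customer_id, succeeded in transactions:
        totals[customer_id] += 1
        if succeeded:
            successes[customer_id] += 1
    return {c: 100.0 * successes[c] / totals[c] for c in totals}


# Made-up sample: one large healthy customer, one small unhappy one.
sample = (
    [("a", True)] * 9999 + [("a", False)]
    + [("b", True)] * 50 + [("b", False)] * 50
)

rates = success_rate_by_customer(sample)
overall = 100.0 * sum(ok for _, ok in sample) / len(sample)

print(f"overall: {overall:.2f}%")
for customer, rate in sorted(rates.items()):
    print(f"customer {customer}: {rate:.2f}%")
```

In this sample the aggregate stays above 99% while customer "b" sees half of its requests fail, which is one reason a per-customer view can be worth having.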
I can see how they came to be, given attempts to make IT more of a weapon than a cost center. In less bureaucratic places, error budgets may be used more for expectation setting, but I’m not a fan. 1 incident can ruin a budget to the point where it’s like: ‘discounting that one time...’
Interesting. What about the logic that those using your service will expect the availability you provide, not what you say you'll provide, and making a change that reduces availability, even if still well within SLO, appears a failure and causes them issues?
I'd generally agree with @mipsytipsy that nines don't matter if users are unhappy (or if they're happy for that matter). I think the key is choosing a metric that's a proxy for user happiness. If that can translate neatly into an error budget, yay. If not, figure out what works.
There is a way without having religious wars. Both are correct, and both can be and are abused. Google has a valid and great point that there will be errors and a certain amount of errors won’t hurt the business, but trading feature velocity for keeping the SLA often does.