"Error budgets" are the god-damn worst idea I've heard of in recent years. SLAs should be realistic goals about what we can achieve with our current techniques and tools, not permission to fail a certain amount.
-
-
Even the name implies that there's a budget for errors. It's a bad framing. It leads to behaviors like being less cautious when less of the budget has been spent. Or worse ... planning to spend it proactively to help cut some corners.
-
ah but I think that's one of the best parts! Like there is always a balance between velocity and reliability, right? So setting an *appropriate* SLA and error budget makes it clear to e.g. product managers that you are moving slowly b/c you need five nines.
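For a sense of what "five nines" costs, here's a rough sketch of the arithmetic. It assumes a 365-day year and availability measured purely as uptime; the function name is made up for illustration.

```python
# Sketch: how much downtime per year an availability target allows.
# Assumes a 365-day year; `allowed_downtime_minutes` is a hypothetical helper.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def allowed_downtime_minutes(availability: float,
                             period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the minutes of downtime permitted by an availability target."""
    return (1.0 - availability) * period_minutes


if __name__ == "__main__":
    targets = [("three nines", 0.999), ("four nines", 0.9999), ("five nines", 0.99999)]
    for label, target in targets:
        print(f"{label}: {allowed_downtime_minutes(target):.2f} minutes/year")
```

Five nines works out to a bit over five minutes a year, which is why it slows everything else down.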
-
-
You've already internalized the permission to periodically fail that's implicit in an SLA. The framing helps people get there. It comes from services that were providing a higher implicit SLA (by never failing), which led to people tightly coupling things that shouldn't have been.
-
It's a framing that empowers engineers to make changes when they need to, and lets their customers know they have to engineer for service failures. In my experience it contributes to reliable services, at Google and elsewhere. YMMV
-
-
to be clear, I'm not saying you're wrong :-). Just trying to use this as an opportunity to refine my understanding -- which is that error budgets are a tool to describe what happens when you *miss* your SLA.
-
I see them used the other way around ... "our error budget is 4 hours of downtime a year, so far we've only had 1 hour of downtime, so let's green light a risky action that might incur a two-hour problem".
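That reasoning boils down to a simple subtraction. A minimal sketch, using the numbers from the example above; the function names are hypothetical and this isn't any particular team's actual process:

```python
# Sketch of the "spend the budget" reasoning: does a risky action's
# worst case fit in what's left of the yearly downtime budget?
# Function names are hypothetical, for illustration only.

def remaining_budget_hours(budget_hours: float, spent_hours: float) -> float:
    """Downtime budget left for the period (negative means overspent)."""
    return budget_hours - spent_hours


def fits_in_budget(budget_hours: float, spent_hours: float,
                   worst_case_hours: float) -> bool:
    """True if the action's worst-case downtime fits in the remaining budget."""
    return worst_case_hours <= remaining_budget_hours(budget_hours, spent_hours)


if __name__ == "__main__":
    # 4 hours/year budget, 1 hour spent, risky action might cost 2 hours.
    print(remaining_budget_hours(4, 1))  # hours left
    print(fits_in_budget(4, 1, 2))       # would the action be green-lit?
```

With 3 hours left, the 2-hour risk passes the check, which is exactly the behavior being objected to.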
-
-
FWIW: A common confusion that I see is that in “SRE land” SLAs are SLOs with an explicitly defined set of consequences (think contracts and the like). So yes, SLAs do what you are saying, but also more. After reading this thread I realized you use SLA like I use SLO.
-
SLOs need to be stricter than SLAs because of those consequences. Error budgets are derived from SLOs. SLOs & Error Budgets are a way to gauge how a service is doing. Your Error Budget Policy, though, determines what you end up doing if the service is blowing through that budget.
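A rough sketch of how those pieces relate, assuming a request-based SLO. The thresholds and actions in `policy_action` are invented for illustration; a real Error Budget Policy is whatever the team agrees on.

```python
# Sketch: error budget derived from an SLO, plus a toy policy on top.
# Assumes a request-based SLO; all names and thresholds are hypothetical.

def error_budget(slo_target: float, total_requests: int) -> float:
    """Number of failed requests the SLO tolerates over the period."""
    return (1.0 - slo_target) * total_requests


def budget_consumed(slo_target: float, total_requests: int,
                    failed_requests: int) -> float:
    """Fraction of the error budget used so far (1.0 means fully spent)."""
    budget = error_budget(slo_target, total_requests)
    return failed_requests / budget if budget > 0 else float("inf")


def policy_action(consumed: float) -> str:
    """Toy policy: the cut-offs below are assumptions, not a standard."""
    if consumed >= 1.0:
        return "freeze risky changes"
    if consumed >= 0.75:
        return "slow down and prioritize reliability work"
    return "ship as normal"


if __name__ == "__main__":
    # A 99.9% SLO over 1,000,000 requests tolerates roughly 1,000 failures.
    used = budget_consumed(0.999, 1_000_000, 500)
    print(f"{used:.0%} of budget used: {policy_action(used)}")
```

The budget itself is just a number; the policy function is where the actual decision lives.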