"Error budgets" are the god-damn worst idea I've heard of in recent years. SLAs should be realistic goals about what we can achieve with our current techniques and tools, not permission to fail a certain amount.
Replying to @colmmacc
Does anyone really use an “error-budget” though - outside Google? I don’t know of anyone, tbh.
6 replies 0 retweets 3 likes -
Replying to @copyconstruct
SRE culture has tentacles, and seems to bring error budgets, seemingly endless Kubernetes tinkering, and stack over-complication to some places. Plenty of good too, mostly good on balance, but some of it gets to me.
1 reply 1 retweet 16 likes -
Replying to @copyconstruct @colmmacc
the "platonic ideal" of error budgets seems to me like a useful goal to align all stakeholders (ugh) in the lifecycle (design, dev, operation) of a "service". It seems to me like the cost of autonomy is an increased risk of externalizing failures (1/N)
1 reply 0 retweets 3 likes -
E.g., "Whups, sorry a bug in a new feature in our microservice took down the whole platform". Error budgets simply provide a strong contract on how to gate feature velocity when these things happen. Probably more rarely, they can help enforce architectural desires (2/N)
2 replies 0 retweets 1 like -
Replying to @jhscott @copyconstruct
SLAs already do this. If you're in danger of missing an SLA ... more work on stability and resilience is evidently needed. Typically that means fewer features ... but naturally. I think having to gate features is unhealthy and a sign of organizational dysfunction.
1 reply 0 retweets 2 likes -
Replying to @colmmacc @copyconstruct
I think that "strong contracts empower distributed ownership". Thus, an explicit error budget -- with e.g. common visibility of its current value -- empowers teams to make the right decisions and have clear expectations about what happens when you miss your SLA.
2 replies 0 retweets 2 likes -
Replying to @jhscott @copyconstruct
Measuring SLAs and how you're doing, down to each customer's experience, and having transparency about that ... that's all good. But that's just using an SLA IMO. Error budgets are an extra concept layered on top, and my point is I don't think it's a good one.
4 replies 0 retweets 1 like -
Replying to @colmmacc @copyconstruct
to be clear, I'm not saying you're wrong :-). Just trying to use this as an opportunity to refine my understanding -- which is that error budgets are a tool to describe what happens when you *miss* your SLA.
1 reply 0 retweets 0 likes
I see them used the other way around ... "our error budget is 4 hours of downtime a year, so far we've only had 1 hour of downtime, so let's green light a risky action that might incur a two-hour problem".
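The arithmetic in that example can be written out as a short sketch. The `ErrorBudget` class and its method names are hypothetical, made up here for illustration; the numbers (a 4-hour yearly budget, 1 hour already used, a risky action that might cost 2 hours) are the ones from the tweet.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical model of a yearly downtime budget, in hours."""
    budget_hours: float
    used_hours: float = 0.0

    @property
    def remaining_hours(self) -> float:
        # Whatever part of the yearly allowance has not been spent yet.
        return self.budget_hours - self.used_hours

    def allows(self, worst_case_hours: float) -> bool:
        # True if the action's worst-case downtime fits in what is left.
        return worst_case_hours <= self.remaining_hours


budget = ErrorBudget(budget_hours=4.0, used_hours=1.0)
print(budget.remaining_hours)  # 3.0
print(budget.allows(2.0))      # True: the two-hour risk gets the green light
print(budget.allows(3.5))      # False: this would exceed the budget
```

Under this reading the budget works as a permission check for taking risk while still within the SLA, which is the usage the tweet objects to.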
Replying to @colmmacc @copyconstruct
I think you can rate-limit your error budget with a token bucket if the impact of an outage is superlinear?
1 reply 0 retweets 0 likes -
but I agree I probably tweeted too quickly -- if you are within SLA, (Google SRE style) error budgets do allow you to take additional risk.
1 reply 0 retweets 0 likes
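One possible reading of the token-bucket suggestion above, as a rough sketch: instead of one yearly pool, the downtime allowance refills at a steady rate and is capped, so a single long outage can never be "paid for" out of a whole year's budget. The class name, the refill rate of 1 minute per day, and the 60-minute cap are all assumptions chosen for illustration; none of them come from the thread.

```python
class TokenBucketBudget:
    """Hypothetical downtime budget that refills over time and is capped."""

    def __init__(self, refill_minutes_per_day: float, capacity_minutes: float):
        self.refill = refill_minutes_per_day
        self.capacity = capacity_minutes
        self.tokens = capacity_minutes  # start with a full bucket

    def advance(self, days: float) -> None:
        # Refill for the elapsed time, never beyond the cap.
        self.tokens = min(self.capacity, self.tokens + self.refill * days)

    def try_spend(self, minutes: float) -> bool:
        # Approve the risk only if its worst case fits in the bucket right now.
        if minutes > self.tokens:
            return False
        self.tokens -= minutes
        return True


bucket = TokenBucketBudget(refill_minutes_per_day=1.0, capacity_minutes=60.0)
print(bucket.try_spend(120.0))  # False: a two-hour risk never fits under the cap
print(bucket.try_spend(45.0))   # True: 15 minutes remain
print(bucket.try_spend(45.0))   # False: only 15 minutes left
bucket.advance(days=30)
print(bucket.try_spend(45.0))   # True: 15 + 30 = 45 minutes were available
```

The cap is what addresses superlinear impact: many short incidents can be absorbed over time, but one long incident is rejected regardless of how little has been spent so far.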