Does anyone really use an “error-budget” though - outside Google? I don’t know of anyone, tbh.
Replying to @copyconstruct
SRE culture has tentacles, and seems to bring error budgets, seemingly endless Kubernetes tinkering, and stack over-complication to some places. Plenty of good too, mostly good on balance, but some of it gets to me.
1 reply 1 retweet 16 likes -
Replying to @copyconstruct @colmmacc
the "platonic ideal" of error budgets seems to me like a useful goal to align all stakeholders (ugh) in the lifecycle (design, dev, operation) of a "service". It seems to me like the cost of autonomy is an increased risk of externalizing failures (1/N)
1 reply 0 retweets 3 likes -
E.g., "Whups, sorry a bug in a new feature in our microservice took down the whole platform". Error budgets simply provide a strong contract on how to gate feature velocity when these things happen. Probably more rarely, they can help enforce architectural desires (2/N)
2 replies 0 retweets 1 like -
Replying to @jhscott @copyconstruct
SLAs already do this. If you're in danger of missing an SLA ... more work on stability and resilience is evidently needed. Typically that means fewer features .. but naturally. I think having to gate features is unhealthy and a sign of organizational dysfunction.
1 reply 0 retweets 2 likes -
Replying to @colmmacc @copyconstruct
I think that "strong contracts empower distributed ownership". Thus, an explicit error budget -- with e.g. common visibility of its current value -- empowers teams to make the right decisions and have clear expectations about what happens when you miss your SLA.
2 replies 0 retweets 2 likes -
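For readers who have not seen the arithmetic behind an "explicit error budget" with a visible current value, here is a minimal sketch. It is not taken from the thread: the `ErrorBudget` class, its field names, and the "stop shipping features once the budget is spent" policy are all illustrative assumptions, and real policies vary by team.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical model of an error budget for one service over one window."""

    slo_target: float     # e.g. 0.999 means 99.9% of requests should succeed
    total_requests: int   # requests served so far in the window
    failed_requests: int  # requests that failed so far in the window

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of requests the target permits to fail.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        # 1.0 = untouched, 0.0 = exactly spent, negative = overspent.
        if self.allowed_failures == 0:
            # A 100% target leaves no budget: any failure overspends it.
            return 0.0 if self.failed_requests else 1.0
        return 1.0 - self.failed_requests / self.allowed_failures

    def feature_releases_allowed(self) -> bool:
        # Assumed policy: gate feature work once the budget is spent.
        return self.remaining_fraction > 0.0


if __name__ == "__main__":
    budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
    print(f"allowed failures: {budget.allowed_failures:.0f}")
    print(f"budget remaining: {budget.remaining_fraction:.0%}")
    print(f"feature releases allowed: {budget.feature_releases_allowed()}")
```

The same numbers could be read straight off an SLA report, which is the objection raised in the next reply. The budget framing only adds the shared "remaining" figure and an agreed rule attached to it.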
Replying to @jhscott @copyconstruct
Measuring SLAs and how you're doing, down to each customer's experience, and having transparency about that ... that's all good. But that's just using an SLA IMO. Error budgets are an extra concept layered on top, and my point is I don't think it's a good one.
4 replies 0 retweets 1 like -
You've already internalized your permission to periodically fail implicit in an SLA. The framing helps people get there. It comes from services that were providing a higher implicit SLA (by never failing), which led to people tightly coupling things that shouldn't have been.
1 reply 0 retweets 2 likes -
Replying to @NYCDubliner @colmmacc and
It's a framing that empowers the engineers to make changes when they need to and for their customers to know they have to engineer for service failures. In my experience it contributes to reliable services, at Google and elsewhere. YMMV
2 replies 0 retweets 2 likes
Non-100% SLAs already tell people they need to engineer for service failures. My worldview, from meaningful business ownership (talking to customers, setting the roadmap, etc ..) is that every X service calls put a dollar in my pocket, and thousands of dollars in my customer's.
Replying to @colmmacc @NYCDubliner and
So when I see folks proposing deliberate downtime, to match an SLA, I hear "How about we just switch off the money tap a bit" and I look at them with wide eyes. Tells me more value is being placed on elegance and convenience than on end-to-end business aspects.
2 replies 0 retweets 1 like -
I think we fully understand each other and disagree. Have a great day.
0 replies 0 retweets 2 likes