The "platonic ideal" of error budgets seems to me like a useful goal to align all stakeholders (ugh) in the lifecycle (design, dev, operation) of a "service". It seems to me like the cost of autonomy is an increased risk of externalizing failures (1/N)
E.g., "Whups, sorry a bug in a new feature in our microservice took down the whole platform". Error budgets simply provide a strong contract on how to gate feature velocity when these things happen. Probably more rarely, they can help enforce architectural desires (2/N)
Replying to @jhscott @copyconstruct
SLAs already do this. If you're in danger of missing an SLA ... more work on stability and resilience is evidently needed. Typically that means fewer features ... but naturally. I think having to gate features is unhealthy and a sign of organizational dysfunction.
Replying to @colmmacc @copyconstruct
I think that "strong contracts empower distributed ownership". Thus, an explicit error budget -- with e.g. common visibility of its current value -- empowers teams to make the right decisions and have clear expectations about what happens when you miss your SLA.
Replying to @jhscott @copyconstruct
Measuring SLAs and how you're doing, down to each customer's experience, and having transparency about that ... that's all good. But that's just using an SLA IMO. Error budgets are an extra concept layered on top, and my point is I don't think it's a good one.
Even the name implies that there's a budget for errors. It's a bad framing. It leads to behaviors like being less cautious when less of the budget has been spent. Or worse ... that you can plan to spend it proactively to help cut some corners.
Replying to @colmmacc @copyconstruct
Ah, but I think that's one of the best parts! Like, there is always a balance between velocity and reliability, right? So setting an *appropriate* SLA and error budget makes it clear to e.g. product managers that you are moving slowly b/c you need five nines.
vs moving faster if you can get away with three nines. If your business use case can't take the corner cutting, tighten your SLA!
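To put rough numbers on the gap between those two targets, here is a minimal sketch that converts an availability target into allowed downtime. The function name `allowed_downtime_minutes` and the 30-day window are assumptions for illustration; nothing in the thread specifies a measurement window.

```python
def allowed_downtime_minutes(availability: float, window_days: float = 30.0) -> float:
    """Return the error budget, in minutes, for an availability target over a window.

    `availability` is a fraction (0.999 for "three nines"). The 30-day default
    window is an illustrative assumption, not something taken from the thread.
    """
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - availability) * window_minutes


# Three nines leaves roughly 43 minutes of downtime per 30 days.
three_nines = allowed_downtime_minutes(0.999)

# Five nines leaves well under one minute over the same window.
five_nines = allowed_downtime_minutes(0.99999)
```

Under these assumptions the five-nines budget is about a hundredth of the three-nines one, which is the scale of the trade-off being argued over.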
Replying to @jhscott @copyconstruct
*shudder*. That is the exact unhealthy organizational dysfunction I'm talking about! First - we don't need to trade off velocity and reliability. Investments in test and deployment automation, and in compartmentalization, usually deliver both. Second ...
if you're in a PHB "product managers are in charge but not listening" situation ... fix that :) The best model is comprehensive team ownership of business, development, and operations, with mutual understanding.
and it's not that the business "can't take" it, it's that generally every request, every sale ... is good. We should want as many as possible, and we *are* the business. The common world-view that we, as engineers, just build and operate "for" the business is harmful IMO.
Replying to @colmmacc @copyconstruct
I would hypothesize (perhaps a reach) that your concerns about improper corner-cutting can be addressed by having the error budget expire (so you can only bank X hours vs Y months) and setting the SLA well. It shouldn't matter to stakeholders what you do if you meet your SLA, right?
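One possible reading of an "expiring" budget is a sliding window: only recent outages count against the budget, and unspent budget from older periods cannot be banked. The sketch below assumes that reading. The class `RollingErrorBudget`, its method names, and the choice to count an outage in full while its end time is inside the window are all hypothetical simplifications, not something proposed in the thread.

```python
from collections import deque


class RollingErrorBudget:
    """Error budget measured over a sliding window, so unspent budget expires.

    Hypothetical sketch: times are plain seconds, outages are assumed to be
    recorded in chronological order, and an outage counts in full for as long
    as its end time lies inside the window.
    """

    def __init__(self, availability: float, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self.budget_seconds = (1.0 - availability) * window_seconds
        self._outages = deque()  # (end_time, duration) pairs, oldest first

    def record_outage(self, end_time: float, duration: float) -> None:
        """Record an outage that ended at `end_time` and lasted `duration` seconds."""
        self._outages.append((end_time, duration))

    def remaining(self, now: float) -> float:
        """Return budget left at time `now`; a negative value means the SLA is blown."""
        window_start = now - self.window_seconds
        # Outages that ended at or before the window start no longer count.
        while self._outages and self._outages[0][0] <= window_start:
            self._outages.popleft()
        spent = sum(duration for _, duration in self._outages)
        return self.budget_seconds - spent


# Three nines over a one-hour window gives a budget of about 3.6 seconds.
budget = RollingErrorBudget(availability=0.999, window_seconds=3600)
budget.record_outage(end_time=100, duration=2)
left_soon_after = budget.remaining(now=200)    # outage still inside the window
left_much_later = budget.remaining(now=4000)   # outage has aged out of the window
```

With a short window the budget resets quickly, which limits how much slack a team could save up and then spend deliberately.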
All decisions are business decisions (if you're being paid to make them).