Let's say you're running some kind of service that clients call, and it has an "availability event" ... which is, you know, PR speak for "outage". Anyway, so you have this event and you fix the problem, but that doesn't always mean that you recover right away.
Often your clients will have started to retry their requests. Retrying clients build up over time, more and more of them. You can easily get into a situation where you now have way more requests than you ever ordinarily would. And this happens just as you are trying to recover. Uggh.
If you have some kind of throttling, you can apply that, but the best strategy is actually to have clients back off.
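Server-side throttling is often something like a token bucket. Here's a minimal sketch under that assumption (the class and parameter names are mine, not from any particular library):

```python
import time

class TokenBucket:
    """Illustrative token-bucket throttle: admits `rate` requests per
    second on average, with bursts of up to `burst` requests."""

    def __init__(self, rate: float, burst: float):
        self.rate = rate            # tokens refilled per second
        self.burst = burst          # bucket capacity
        self.tokens = burst         # start full
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should shed or delay this request
```

During recovery you can even dial `rate` down below normal capacity, so the backlog drains instead of growing.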
@clare_liguori has gone over this very well, so read what she has to say: https://twitter.com/clare_liguori/status/1034829330201731072
The main takeaways are to use exponential backoff and add some random jitter. Do that and the recovery will go more quickly. OK, so what am I tweeting about?
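A minimal client-side sketch of that idea (the function and parameter names are my own, not from any specific SDK): the wait cap grows exponentially per attempt, and "full jitter" sleeps a uniformly random time up to that cap so clients don't retry in lockstep.

```python
import random
import time

def call_with_backoff(request, max_attempts=5, base_delay=0.1, max_delay=10.0):
    """Retry `request` (any zero-argument callable that raises on
    failure) with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return request()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the failure
            # Exponential backoff: cap doubles each attempt, bounded by max_delay.
            cap = min(max_delay, base_delay * (2 ** attempt))
            # Full jitter: random sleep in [0, cap] spreads retries out.
            time.sleep(random.uniform(0, cap))
```

The jitter matters as much as the exponent: without it, clients that failed together retry together, in synchronized waves.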
There's a challenge with this kind of backoff; it works well only if everyone, or at least "nearly everyone", does it. A single client could decide to be more greedy and retry more aggressively - and it's sort of in their interest.
If everyone else is well behaved, that greedy client will probably get service more quickly, they'll jump the queue basically. But the overall recovery will be slowed down a bit by the aggressive tactics. Then, if this behavior spreads, it's disaster.
Everyone retrying aggressively makes the whole situation worse, for everyone. It's a "tragedy of the commons" - everyone acting in their own narrow self-interest actually makes everyone worse off in the end.
I often think of airport baggage belts, where people compete and huddle to get closer to the belt; we'd all be better off if we *all* just stood back a few meters. Clearer views, easier access, easier exit.
Many distributed systems really rely on this kind of community-oriented cooperation. TCP, BGP, the Web, e-commerce, Cloud Services. All have examples where a kind of codified socialism, fairness and restraint on individual competition, keeps the systems safe and cost-effective.
It needn't be like that; industries don't have to co-operate! Stock markets are the opposite, where there's a costly (maybe wasteful) race to be the fastest and to act very aggressively, producing HFT.
Blockchains are another example, where there is deliberate aggressive competition, and it's even exploited to drive a perverse consensus based on mutual mistrust.
Anyway, back to the tragedy of the commons! Afaik, no-one has really found a robust solution to this beyond blocking misbehaving clients before their behavior spreads, which also comes at a cost.
IME, how it "really" works is that the folks who build clients and services have a sort of honor system. Everyone recognizes that it would be a wasteful race to the bottom, and mostly self-policing works. So browsers, SDKs, etc ... all do sane, safe, decent things.
I don't really have a big technical lesson there, I just find it fascinating that huge sections of the economy can get by like that. It's very inspiring and reassuring! /EOF