Well, no, sometimes servers are down and errors happen, so you need a workflow engine to drive retries. Oh and that implies you have some way to tell if the change even made it there, so you need a poller or a pusher or something to monitor config propagation status. YOU GET ME.
But then we build all that, and it works, and changes get integrated and customer configs change in seconds, and WE'RE GOOD, RIGHT?
We're good until one day we're not. Imagine a big event happens: a power outage, or a spike on the internet due to the Super Bowl or something, and for whatever reason a bunch of customers all try to make changes at the same time.
The API gets choked, the workflow gets backed up, the status monitors start to lag. OUCH. Even worse, some customers start undoing their changes because they don't see them happening quickly. Now we have pointless changes stuck in the system that no-one even wants!! OUCH OUCH.
Now, let's try a much better, O(1) control plane. DUMB AS ROCKS. Imagine instead that the customer API pretty much edits a document directly, like a file on S3 or whatever. And imagine the servers just pull that file every few seconds, WHETHER IT CHANGED OR NOT.
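A minimal sketch of that dumb-as-rocks loop, assuming a local file stands in for the S3 object (a real server would do something like an S3 GET instead; the names here are made up for illustration):

```python
import time

def fetch_config(path="config.json"):
    # Stand-in for fetching the whole document (e.g. an S3 GET).
    # A local file keeps the sketch self-contained.
    with open(path) as f:
        return f.read()

def apply_config(config, state):
    # Apply the entire document, whether it changed or not: no diffing,
    # no per-change workflow, so cost per cycle is O(1) in the number
    # of changes that happened since the last pull.
    state["config"] = config

def poll_loop(state, cycles=3, interval=0.0, path="config.json"):
    # The whole control plane from the server's point of view.
    for _ in range(cycles):
        apply_config(fetch_config(path), state)
        time.sleep(interval)  # "every few seconds" in a real deployment
```

Note there is no success/failure tracking per change and nothing to retry: a failed pull is just healed by the next pull.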
This system is O(1). The whole config can change, or none of it, and the servers don't even care. They are dumb. They just carry on. This is MUCH MUCH more robust.
It never lags, it self-heals very quickly, and it's always ready for a set of events like everyone changing everything at once. The SYSTEM DOES NOT CARE.
This is how we build the most critical control planes at AWS. For example if you use Route 53 health checks, or rely on NLB, or NAT GW, the underlying health statuses are *ALWAYS* being pushed around as a bitset. ALWAYS ALWAYS.
Replying to @colmmacc
Fantastic thread. You say "pushed around as a bitset" here but refer above to servers pulling the file that the O(1) control plane updates. In this case are the servers pulling the health statuses as bitsets or are they being pushed?
Replying to @moderat10n @hartley
For health statuses, every status is assigned a position in a bitset. High level: Healthcheckers push to aggregators, aggregators apply a stabilizing function to the raw statuses, and then the aggregators push to the servers via a replication tree.
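A rough sketch of "every status gets a position in a bitset" plus a stabilizing function. The bit layout and the quorum rule below are invented for illustration; the actual AWS stabilizing function is not public:

```python
def pack_statuses(statuses):
    # statuses: dict mapping a fixed per-check bit position -> bool healthy.
    # Every check owns one bit, so the wire format is the same size no
    # matter how many statuses changed this round.
    bits = 0
    for position, healthy in statuses.items():
        if healthy:
            bits |= 1 << position
    return bits

def unpack_status(bits, position):
    return bool(bits >> position & 1)

def stabilize(history, quorum=2):
    # Toy stabilizing function: report a check healthy only if its bit
    # was set in at least `quorum` of the recent raw observations,
    # damping checks that flap between healthy and unhealthy.
    counts = {}
    for bits in history:
        pos = 0
        while bits:
            if bits & 1:
                counts[pos] = counts.get(pos, 0) + 1
            bits >>= 1
            pos += 1
    out = 0
    for pos, n in counts.items():
        if n >= quorum:
            out |= 1 << pos
    return out
```

The point of the fixed bit positions is the same O(1) property as the config file: the aggregators always ship the whole bitset, so the payload doesn't grow when lots of statuses flip at once.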
For push vs pull: if information has to get from a small number of boxes (like the aggregators) to a large number (like the servers), we prefer to push. Pushing from small fleets to bigger ones avoids thundering herd problems where the big fleet hammers the small one.
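A toy sketch of that push direction as a replication tree: a small root fleet fans the bitset out through intermediate layers so no large fleet ever polls a small one. The tree shape and names here are made up:

```python
def push_down(tree, node, bits, deliver):
    # tree: dict mapping each node to its children; the root is the
    # small aggregator fleet, the leaves are the big server fleet.
    # Each node forwards the whole bitset to its children, so fan-out
    # load is bounded by a node's child count, not the fleet size.
    deliver(node, bits)
    for child in tree.get(node, []):
        push_down(tree, child, bits, deliver)
```

Every node receives the full bitset exactly once, and the small root only ever talks to its direct children.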