Anyway, crypto people have been studying this for a long time. Our Automated Reasoning Group built a tool to analyze code and tell you whether it's O(1), or how close it is: https://github.com/awslabs/s2n/tree/master/tests/sidetrail …
But it turns out that O(1) patterns matter for distributed systems and availability EVEN MORE! Completely unrelated fields really, it's surprising, so let's dig in ...
O.k. so imagine a very simple control plane. Let's say you have a bunch of servers and they do things for customers and their users. Each customer gets some knobs and dials, a configuration they can edit. This describes lots and lots of web services.
So customer makes a change. What do we do? Well it goes to an API, and that API produces a diff or a delta or whatever you want to call it and sends it to the servers. The servers then apply the patch, and the new config is in place. SIMPLE, RIGHT?
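To make the diff-and-patch flow concrete, here is a minimal sketch in Python. The function names `make_delta` and `apply_delta` and the dict-shaped config are assumptions for illustration; nothing in the thread says what the real API or patch format looks like.

```python
def make_delta(old: dict, new: dict) -> dict:
    """API side (hypothetical): describe how to turn `old` into `new`."""
    changed = {k: v for k, v in new.items() if k not in old or old[k] != v}
    removed = [k for k in old if k not in new]
    return {"set": changed, "delete": removed}


def apply_delta(config: dict, delta: dict) -> dict:
    """Server side (hypothetical): apply a delta to the local config copy."""
    patched = dict(config)
    patched.update(delta["set"])
    for key in delta["delete"]:
        patched.pop(key, None)
    return patched
```

Note the hidden dependency: a patch only produces the right config if the server started from the same old config the API diffed against. A server that missed an earlier delta ends up wrong, and that is what drags in the retries and status tracking.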
Well, no, sometimes servers are down and errors happen, so you need a workflow engine to drive retries. Oh and that implies you have some way to tell if the change even made it there, so you need a poller or a pusher or something to monitor config propagation status. YOU GET ME.
But then we build all that, and it works, and changes get integrated and customer configs change in seconds, and WE'RE GOOD, RIGHT?
We're good until one day we're not. Imagine a big event happens, maybe a power outage, or a spike on the internet due to the Super Bowl or something, and for whatever reason a bunch of customers all try to make changes at the same time.
The API gets choked, the workflow gets backed up, the status monitors start to lag. OUCH. Even worse, some customers start undoing their changes because they don't see them happening quickly. Now we have pointless changes stuck in the system that no-one even wants!! OUCH OUCH.
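A rough way to see the pile-up is a toy queue model. This is a sketch under assumed numbers: `backlog_over_time`, the per-tick arrival counts, and the fixed `capacity` are all made up for illustration and are not measurements from any real system.

```python
def backlog_over_time(arrivals, capacity):
    """Toy model: each tick, `n` new changes arrive and at most `capacity`
    of the queued changes get pushed out. Returns the backlog after each tick."""
    backlog = 0
    history = []
    for n in arrivals:
        backlog = max(0, backlog + n - capacity)
        history.append(backlog)
    return history


# Quiet ticks, then a two-tick burst, then quiet again (hypothetical load).
history = backlog_over_time([5, 5, 50, 50, 5, 5], capacity=10)
```

With these numbers the backlog is still large several ticks after the burst ends, so changes made during the burst (including the undo requests nobody wants anymore) keep waiting in line.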
Now, let's try a much better, O(1) control plane. DUMB AS ROCKS. Imagine instead that the customer API pretty much edits a document directly, like a file on S3 or whatever. And imagine the servers just pull that file every few seconds, WHETHER IT CHANGED OR NOT.
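The pull loop on each server could look roughly like this. It is a sketch, not anyone's actual implementation: `poll_config`, its parameters, and the injected `fetch` and `apply` callables are hypothetical stand-ins for "download the document" and "load it as the live config".

```python
import time
from typing import Callable, Optional


def poll_config(
    fetch: Callable[[], dict],
    apply: Callable[[dict], None],
    interval_seconds: float = 5.0,
    rounds: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Fetch the whole config document and apply it on a fixed schedule.

    There is deliberately no "did it change?" check: every round does the
    same work. `rounds=None` loops forever; a number bounds the loop so the
    sketch can be exercised in a test. Returns the number of rounds run.
    """
    done = 0
    while rounds is None or done < rounds:
        apply(fetch())  # full document, every time, changed or not
        done += 1
        sleep(interval_seconds)
    return done
```

The work per server per interval stays the same whether zero customers or every customer just edited their config, which is the O(1) property the thread is pointing at.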
Replying to @colmmacc
This sounds like the “edge-driven” vs “level-driven” discussion that e.g. k8s has as one of its core design decisions, or am I getting that wrong?
That's related too! Level-triggered control planes are usually O(1) or closer to it. Edge triggering can be very easily stressed. You have to decide if you want better average performance (edge) or reliable peak performance (level).
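The average-versus-peak tradeoff can be put in numbers with a deliberately crude cost model. Everything here is an assumption for illustration: `work_per_interval` is a made-up helper, the edge-triggered design is charged one push per change per server, and the level-triggered design is charged one pull per server per interval.

```python
def work_per_interval(changes, servers):
    """Crude cost model (assumed, not measured).

    changes: number of customer changes in each interval
    servers: number of servers that need the config
    Returns (edge, level): units of work done in each interval.
    """
    edge = [n * servers for n in changes]  # every change is pushed to every server
    level = [servers for _ in changes]     # every server pulls once, regardless
    return edge, level


# Idle interval, normal interval, then a burst (hypothetical load).
edge, level = work_per_interval([0, 1, 100], servers=10)
```

Under this model the edge-triggered design does nothing when idle but scales with the burst, while the level-triggered design pays the same fixed cost in every interval.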