Control Planes are all about taking intent and translating it into real-world action in the universe. Just like your TV remote: you tell it what you want, and its job is to make that happen. But it's harder than it seems!
Every stable control system needs 3 things: a measurement process, a controller, and an actuator. Basically something to see how the world is, something to figure out how the world needs to change, and something that makes that change happen.
That simple mental model is very very important. Most control systems built by CS people *don't* have a measurement element. Like the remote control we've already seen! These systems propagate errors they can't correct. BAD BAD.
So always start with the idea of a measurer; poll every server to know what state it is in, check whether the user's settings got there, etc. ... and build the system as something that corrects any errors it sees, not just a system that blindly shouts instructions.
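To make that concrete, here's a minimal sketch of a closed control loop in Python. It's purely illustrative, not real control plane code: `world` and `intent` are toy stand-ins for your fleet and your config store. The point is the shape of it: measure, compare against intent, correct the drift you actually observed.

```python
import time

# Toy stand-ins for the real world and the intent store (purely illustrative).
world = {"servers_running": 2}
intent = {"servers_running": 3}

def measure_state():
    """Measurement: poll the world for what is actually true right now."""
    return dict(world)

def apply_change(desired):
    """Actuation: nudge the world toward the desired state."""
    world.update(desired)

def control_loop(iterations=3, poll_interval=0.1):
    """Controller: compare the measurement with intent and correct any drift seen."""
    for _ in range(iterations):
        actual = measure_state()
        if actual != intent:
            apply_change(intent)      # correct the error we observed
        time.sleep(poll_interval)

control_loop()
print(world)  # {'servers_running': 3}
```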
O.k. that's 80% of control theory right there for you. The next 10% is that controllers are very sensitive to lag. Imagine a furnace that heated your boiler based on the temperature from an hour ago? It'd be very unstable!
Imagine scaling up based on the system's load from 2 hours ago? You might not even need those machines any more, peak may have passed! So systems need to be fast. Low lag is critical. O.k. now we know 90% of control theory.
If you want to get the next 5%, 9% ... 10%, and please do, then focus on learning what "PID" means. I'm just going to say this to tempt you: if you can learn to recognise the P.I.D. components of real-world control systems, it is a design review super-power.
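For a flavour of what P, I, and D mean in code, here's a textbook-style sketch with made-up gains and a pretend plant, purely illustrative: P reacts to the current error, I to the accumulated error, D to the trend.

```python
class PID:
    """Textbook PID sketch: output = Kp*error + Ki*sum(error*dt) + Kd*d(error)/dt."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def update(self, setpoint, measurement, dt):
        error = setpoint - measurement        # P: how wrong are we right now?
        self.integral += error * dt           # I: how wrong have we been over time?
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error               # D: which way is the error trending?
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Toy example: drive a measured temperature toward 70 with made-up gains.
pid = PID(kp=0.6, ki=0.1, kd=0.05)
temperature = 50.0
for _ in range(20):
    correction = pid.update(setpoint=70.0, measurement=temperature, dt=1.0)
    temperature += 0.5 * correction           # pretend plant: the heater's effect
print(round(temperature, 1))
```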
Like in seconds you can spot that a system can't possibly be stable. Buy this book: https://www.amazon.com/Designing-Distributed-Control-Systems-Language/dp/1118694155/ ... it's very approachable and takes a pattern-based approach.
Since it is so accessible, I'm going to borrow the pattern approach and give 10 patterns we use at Amazon. I've chosen patterns that I hope will be interesting, new, and short enough to synopsise. We have way more!
O.k. pattern 1: CHECKSUM ALL THE THINGS. Because this: https://status.aws.amazon.com/s3-20080720.html ... Never underestimate the ability of bit-rot to set in. S3 had an event in 2008 due to a single corrupt bit!!
To this day, we still ask teams if they are checksumming everything. Another example of how corruption can slip in is ... YAML. Because YAML is truncatable, configs can fail back to implicit defaults due to partial transfers, full disks, etc. *sigh* CHECKSUM ALL THE THINGS.
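A minimal illustration of the idea, assuming a hypothetical publish/apply pair and JSON rather than YAML: ship a digest alongside the config and refuse to apply anything that doesn't verify, instead of silently falling back to defaults.

```python
import hashlib
import json

def publish(config: dict):
    """Serialise the config and compute a digest to ship alongside it."""
    blob = json.dumps(config, sort_keys=True).encode()
    return blob, hashlib.sha256(blob).hexdigest()

def apply_config(blob: bytes, expected_sha256: str) -> dict:
    """Refuse truncated or corrupted config instead of falling back to defaults."""
    if hashlib.sha256(blob).hexdigest() != expected_sha256:
        raise ValueError("config checksum mismatch; refusing to apply")
    return json.loads(blob)

blob, digest = publish({"max_connections": 100, "timeout_seconds": 5})
print(apply_config(blob, digest))        # applies cleanly
# apply_config(blob[:10], digest)        # a truncated transfer raises instead of half-applying
```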
Pattern 2: control planes need strong cryptographic authentication! They are important security systems; make sure that they are protected from malicious data. It's ALSO useful to make sure that test stacks don't talk to prod and that operators aren't manually poking things.
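As a toy illustration only (a real control plane would use mTLS, request signing, or similar rather than a shared key sitting in code), here's what authenticating a control message with an HMAC looks like; a test stack signing with a different key simply fails verification.

```python
import hashlib
import hmac

PROD_KEY = b"prod-control-plane-key"     # hypothetical; in practice this comes from a key store

def sign(message: bytes, key: bytes) -> str:
    return hmac.new(key, message, hashlib.sha256).hexdigest()

def verify(message: bytes, signature: str, key: bytes) -> bool:
    # compare_digest is constant-time, which avoids timing side channels
    return hmac.compare_digest(sign(message, key), signature)

msg = b'{"action": "scale", "fleet": "web", "count": 10}'
sig = sign(msg, PROD_KEY)
print(verify(msg, sig, PROD_KEY))                            # True: accepted
print(verify(msg, sign(msg, b"test-stack-key"), PROD_KEY))   # False: test stack can't poke prod
```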
Pattern 3: reduce blast radius. Do your best, write great code, do great code reviews, test everything, twice, more. But still have some humility and assume things will fail. So reduce the scope of impact, have circuit breakers and so on.
Watch @PeterVosshall's talk to go much deeper on this: https://www.youtube.com/watch?v=swQbA4zub20
Pattern 4: Asynchronous Coupling! If system A calls system B synchronously, which means that B has to succeed for A to make any progress, then they are basically one system. There is no real insulation or meaningful separation.
Worse still: if A calls B which calls C and so on, and they have retries built in, things can get really bad really quickly when there are problems! Just 3 layers deep with 3 retries per layer, and you have a 27x amplification factor if the deepest service fails. Oh wow is that bad.
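A quick back-of-the-envelope check of that number, reading "3 retries" as 3 total attempts per hop: the load on the deepest service multiplies at every layer above it.

```python
def amplification(depth: int, attempts_per_hop: int) -> int:
    """Worst-case calls reaching the deepest service while it keeps failing:
    each layer makes its attempts for every attempt of the layer above it."""
    return attempts_per_hop ** depth

print(amplification(depth=3, attempts_per_hop=3))   # 27
```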
Asynchronous systems are more forgiving: queues and workflows and step functions and so on are all examples. They retry persistently and they can make partial progress when dependencies fail. Of course don't let queues grow infinitely either; have some limits.
All of AWS's multi-region offerings, like S3 cross-region replication, or DynamoDB global tables, are asynchronously coupled. That means that if there is a problem in one region, the other regions don't just stall waiting for it. Very powerful and important!
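Here's a minimal sketch of the bounded-queue idea from two tweets up, with a hypothetical `handle()` worker and only the standard library: the producer enqueues and moves on, the consumer drains at its own pace, and the `maxsize` bound keeps the backlog from growing without limit.

```python
import queue
import threading

work = queue.Queue(maxsize=1000)     # bounded: the backlog can't grow without limit

def handle(item):
    """Hypothetical handler standing in for the real work (it may retry internally)."""
    print("processed", item)

def producer(item):
    try:
        work.put_nowait(item)        # system A enqueues and moves on
    except queue.Full:
        print("shedding", item)      # shed or defer load instead of stalling A

def consumer():
    while True:
        item = work.get()            # system B drains at its own pace
        try:
            handle(item)
        finally:
            work.task_done()

threading.Thread(target=consumer, daemon=True).start()
for i in range(5):
    producer(i)
work.join()                          # wait for the queue to drain in this toy example
```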
Pattern 5: use closed feedback loops! Always Be Checking. Never fire and forget. So important that I repeat this a lot. Repeating good advice over and over is actually a good habit.
Pattern 6: should we push data or pull data from the control plane to the data plane? WRONG QUESTION! I mean we can get into eventing systems and edge triggering, but let's not. What really matters 99% of the time is the relative size of fleets ...
The way to think about it is this: don't have large fleets connect to small fleets. They will overwhelm the small fleet with a thundering herd during cold starts or stress events! Optimize connection direction for that.
Related is pattern 7: Avoid cold start caching problems! If you end up with a caching layer in your system, be very very careful. Can it cope if the origin goes down for an extended duration? When the TTLs expire, will the system stall?
Try to build caches that will serve stale entries, and caches that self-warm or prime their cache before accepting requests; pre-fetching is nice too. Wherever you see caches, see danger, and go super deep on whether they will safely recover from blips.
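Here's a sketch of both ideas, stale-serving and priming, assuming a hypothetical `fetch` callable that talks to the origin: expired entries are kept around and served when a refresh fails, and `prime()` warms the cache before it takes traffic.

```python
import time

class StaleServingCache:
    """Cache that keeps expired entries around and serves them when the origin is down."""

    def __init__(self, fetch, ttl_seconds=60):
        self.fetch = fetch               # hypothetical callable that talks to the origin
        self.ttl = ttl_seconds
        self.entries = {}                # key -> (value, fetched_at)

    def get(self, key):
        value, fetched_at = self.entries.get(key, (None, 0.0))
        if time.time() - fetched_at < self.ttl:
            return value                 # fresh hit
        try:
            value = self.fetch(key)      # try to refresh from the origin
            self.entries[key] = (value, time.time())
            return value
        except Exception:
            if key in self.entries:
                return value             # origin is down: serve stale rather than stall
            raise                        # nothing cached at all: no choice but to fail

    def prime(self, keys):
        """Warm the cache before taking traffic, so expiry doesn't become a stampede."""
        for key in keys:
            self.entries[key] = (self.fetch(key), time.time())

cache = StaleServingCache(fetch=lambda key: key.upper(), ttl_seconds=60)
cache.prime(["a", "b"])
print(cache.get("a"))   # 'A'
```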
Pattern 8: if you have to throttle things to safely recover and shorten the duration of events, do! Have a throttling system at hand. But don't kid yourself either: throttling a customer is also an outage. Think instead about how throttling can be used to prioritise smartly ...
Example: ELB is a fault-tolerant, AZ-redundant system. We can lose an AZ at any time and ELB is scaled for capacity; it'll be fine. We can deliberately throttle ELB's recovery in a zone after a power event to give our paying customers priority. Works great! Good use of throttling.
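A token-bucket sketch of that kind of prioritised throttling (the 90/10 split and the bucket sizes are made up for illustration): customer-facing work gets most of the recovery capacity, background recovery takes what's left.

```python
import time

class TokenBucket:
    """allow() succeeds while tokens remain; tokens refill at a fixed rate."""

    def __init__(self, rate_per_second, burst):
        self.rate = rate_per_second
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# Made-up 90/10 split: customer-facing work gets most of the recovery capacity,
# internal/background recovery work takes what's left.
customer_bucket = TokenBucket(rate_per_second=90, burst=90)
background_bucket = TokenBucket(rate_per_second=10, burst=10)

def admit(customer_facing: bool) -> bool:
    bucket = customer_bucket if customer_facing else background_bucket
    return bucket.allow()

print(admit(True), admit(False))
```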
Pattern 9: I couldn't say it at the time, but basically use a system like QLDB (https://aws.amazon.com/qldb/) for your control plane data flow if you can! If you have an immutable append-only ledger for your data flow then ...
... you can compute and merge deltas easily, minimising data volume, and you get item history, so you can implement point-in-time recovery and rollback! You can also optimise out no-op changes. We use this pattern in Route 53, EC2, a bunch of places.
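Not QLDB itself, just a toy in-memory version of the pattern: every write is a delta appended to a log, the state at any version is a replay of the log, no-op changes are dropped, and rollback is just reading an earlier version.

```python
import copy

class Ledger:
    """Append-only change log: every write is a delta, state is a replay of the deltas."""

    def __init__(self):
        self.changes = []                          # list of {key: value} deltas, never mutated

    def append(self, delta: dict):
        current = self.head()
        # Optimise out no-op changes: only record keys that actually differ.
        effective = {k: v for k, v in delta.items() if current.get(k) != v}
        if effective:
            self.changes.append(copy.deepcopy(effective))

    def state_at(self, version: int) -> dict:
        """Point-in-time recovery: replay deltas up to the requested version."""
        state = {}
        for delta in self.changes[:version]:
            state.update(delta)
        return state

    def head(self) -> dict:
        return self.state_at(len(self.changes))

ledger = Ledger()
ledger.append({"dns_record": "1.1.1.1"})
ledger.append({"dns_record": "1.1.1.1"})           # no-op, not recorded
ledger.append({"dns_record": "2.2.2.2"})
print(ledger.head())                               # {'dns_record': '2.2.2.2'}
print(ledger.state_at(1))                          # {'dns_record': '1.1.1.1'}  <- rollback target
```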
O.k. I left the most important thought and pattern for last. You have to filter every element of your design through the lens of "How many modes of operation do I have?" For stability, that number needs to be minimal.
Avoid emergency modes that are different, or anything that can alter what the system is doing suddenly. Think about your system in terms of state space, or code branches. How many can you get rid of?
Branches and state spaces are evil because they grow exponentially, past the point where you can test or predict behaviour; behaviour becomes emergent instead. A simple example here is relational databases.
I'm not knocking offerings like RDS or Aurora, relational DBs are great for versatile business queries, but they are terrible for control planes. We essentially ban them for that purpose at AWS. Why?