Last week I spoke about how we build ultra-reliable AWS services. It's my favourite talk that I've given. Everyone I've asked has told me that they learned something new and interesting.
Here I'm going to tweet some highlights to tempt you to watch ... https://www.youtube.com/watch?v=O8xLxNje30M …
Control planes are all about taking intent and translating it into real-world action in the universe. Just like your TV remote. You tell it what you want, and its job is to make that happen. But it's harder than it seems!
Have you ever used a universal remote control, and had it turn your TV on, but not your audio system? Super common problem! There are two things going on: one is that there's a network partition ... your remote can't reach everything, ok fine, so you move it around and press again.
But more deeply, the real problem is that the control has no idea whether it achieved success or not. It has no feedback mechanism! This is the most common design problem for control planes. A system like that can never be stable!
I see this all the time in customer designs. For example: users change settings, but sometimes they don't take, because the update doesn't make it to all the servers. Often they end up with support processes to push everything again, on demand, or tell users to try resetting.
Systems like this aren't just annoying, they fail catastrophically under stress! So we have approaches and patterns that help us build higher-quality systems. I'm going to share ten of these patterns, and they are awesome, but also they aren't magic.
The most important thing I know about building anything high quality is that there is no magic quality sprinkle dust. Instead it comes about as a result of good habits. Paying lots of attention to detail, always testing and checking things. That's where it really comes from.
Before building great systems, you need to build a great team. Strive for fearlessness. That means a no-blame, no-shame, awesome positive atmosphere ... and then translate that into action and calculated risk taking ... the safety to try new things and new ideas.
But DON'T take risks with security, durability, or availability. Those are core values and top priorities that need to be inviolable. Take risks with business ideas and features, and product names, and have some fun!
With that context, let's build some stable and reliable control systems! What do we use them for? 4 common reasons: 1/ lifecycling resources (launching, scaling, etc) 2/ deploying system config 3/ deploying software 4/ deploying user settings.
At Amazon, we encourage merging 2 and 3. Deploying system config, like global feature flags, *IS* deploying software. So where possible, we use the same system for both. We have awesome awesome deployment safety systems. One-boxing, staggering, rollback, etc. So use them for both!
For building control systems, it turns out there's a whole branch of rigorous engineering called control theory. There's a lot of math, and it is awesome, well worth knowing, but also you don't need all of that to get most of the benefit. Here is what is worth knowing ...
Every stable control system needs 3 things: a measurement process, a controller, and an actuator. Basically something to see how the world is, something to figure out how the world needs to change, and something that makes that change happen.
That simple mental model is very very important. Most control systems built by CS people *don't* have a measurement element. Like the remote control we've already seen! These systems propagate errors they can't correct. BAD BAD.
So always start with the idea of a measurer; poll every server to know what state it is in, check if the user settings got there, etc ... and build the system as something that corrects any errors it sees, not a system that just blindly shouts instructions.
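
To make the measure, decide, act idea concrete, here is a minimal Python sketch of one reconcile pass. Everything in it (the `Server` class, `reconcile`, the setting values) is hypothetical and invented for illustration; it is not how any AWS system is implemented.

```python
from dataclasses import dataclass


@dataclass
class Server:
    """Hypothetical stand-in for a data plane host holding one setting."""
    name: str
    setting: str = "old"
    reachable: bool = True

    def read_setting(self) -> str:
        return self.setting

    def apply_setting(self, value: str) -> bool:
        # Simulates a lost update when the server cannot be reached.
        if not self.reachable:
            return False
        self.setting = value
        return True


def reconcile(servers: list[Server], desired: str) -> list[str]:
    """One pass of measure -> decide -> act. Returns names still out of sync."""
    # Measure: poll every server for its actual state.
    drifted = [s for s in servers if s.read_setting() != desired]
    # Act: push the desired state only where the measurement shows drift.
    for server in drifted:
        server.apply_setting(desired)
    # Measure again: report what is still wrong instead of assuming success.
    return [s.name for s in servers if s.read_setting() != desired]
```

Running `reconcile` on a schedule is what turns a one-shot push into a loop that eventually corrects the servers it missed the first time.
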
O.k. that's 80% of control theory right there for you. The next 10% is that controllers are very sensitive to lag. Imagine a furnace that heated your boiler based on what the temperature was an hour ago? It'd be very unstable!
Imagine scaling up based on the system's load from 2 hours ago? You might not even need those machines any more, peak may have passed! So systems need to be fast. Low lag is critical. O.k. now we know 90% of control theory.
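
The effect of lag is easy to see in a toy simulation. The sketch below is an assumption-laden illustration, not a model of any real autoscaler: a simple proportional controller nudges a value toward a target, but only sees measurements that are `lag_steps` old. The function name, the gain of 0.5, and the target of 10 are all made up for the example.

```python
from collections import deque


def simulate(lag_steps: int, gain: float = 0.5, target: float = 10.0,
             steps: int = 40) -> list[float]:
    """Return the error (target - value) after each step of a proportional
    controller that acts on measurements that are lag_steps old."""
    value = 0.0
    # Seed the measurement pipeline with the starting value.
    stale = deque([value] * (lag_steps + 1), maxlen=lag_steps + 1)
    errors = []
    for _ in range(steps):
        measured = stale[0]                  # oldest reading in the pipeline
        value += gain * (target - measured)  # correct based on stale data
        stale.append(value)                  # newest reading evicts the oldest
        errors.append(target - value)
    return errors
```

With `lag_steps=0` the error shrinks by half every step. With `lag_steps=4` and the same gain, the controller keeps correcting for a world that no longer exists, so it overshoots and the swings grow over time.
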
If you want to get the next 5%, 9% ... 10%, and please do, then focus on learning what "PID" means. I'm just going to say this to tempt you: if you can learn to recognise the P.I.D. components of real-world control systems, it is a design review super-power.
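
For reference, PID stands for proportional, integral, derivative. A textbook discrete PID controller fits in a few lines; the sketch below is the generic form with illustrative gains, and the class and parameter names are chosen for this example only.

```python
class PID:
    """Textbook discrete PID controller; the gains passed in are illustrative."""

    def __init__(self, kp: float, ki: float, kd: float) -> None:
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.previous_error = None

    def update(self, setpoint: float, measured: float, dt: float = 1.0) -> float:
        error = setpoint - measured
        # P: react to the error as it is right now.
        p_term = self.kp * error
        # I: accumulate past error so a persistent offset eventually gets removed.
        self.integral += error * dt
        i_term = self.ki * self.integral
        # D: react to how fast the error is changing, which damps overshoot.
        if self.previous_error is None:
            d_term = 0.0
        else:
            d_term = self.kd * (error - self.previous_error) / dt
        self.previous_error = error
        return p_term + i_term + d_term
```

In a design review, the useful question is which of these three terms a system actually has: a system that only reacts to the present error (P only) behaves very differently from one that also remembers the past (I) or anticipates the trend (D).
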
Like in seconds you can spot that a system can't possibly be stable. Buy this book ... https://www.amazon.com/Designing-Distributed-Control-Systems-Language/dp/1118694155/ … it's very approachable and takes a pattern-based approach.
Since it is so accessible, I'm going to borrow the pattern approach and give 10 patterns we use at Amazon. I've chosen patterns that I hope will be interesting, new, and short enough to synopsise. We have way more!
O.k. pattern 1: CHECKSUM ALL THE THINGS. Because this: https://status.aws.amazon.com/s3-20080720.html … Never underestimate the ability of bit-rot to set in. S3 had an event in 2008 due to a single corrupt bit!!
To this day, we still ask teams if they are checksumming everything. Another example of how corruption can slip in is ... YAML. Because YAML is truncatable, configs can fall back to implicit defaults due to partial transfers, full disks, etc. *sigh* CHECKSUM ALL THE THINGS.
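
A minimal sketch of the idea, assuming the config travels as raw bytes next to a SHA-256 digest. The function names and the sample config are hypothetical; the thread does not say which checksum or transport AWS uses.

```python
import hashlib


def package(config_bytes: bytes) -> tuple[bytes, str]:
    """Sender side: ship the payload together with its SHA-256 digest."""
    return config_bytes, hashlib.sha256(config_bytes).hexdigest()


def verify(config_bytes: bytes, expected_digest: str) -> bool:
    """Receiver side: reject any payload whose digest does not match."""
    return hashlib.sha256(config_bytes).hexdigest() == expected_digest


# Hypothetical config; cutting it at a line boundary leaves text that still
# looks like valid YAML, so only the checksum reveals the missing setting.
payload, digest = package(b"timeout: 30\nretries: 5\n")
truncated = payload[:12]  # b"timeout: 30\n"
```

Here `verify(payload, digest)` succeeds, while `verify(truncated, digest)` fails even though the truncated text would still load as a config with `retries` silently missing.
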
Pattern 2: control planes need strong cryptographic authentication! They are important security systems, so make sure that they are protected from malicious data. It's ALSO useful to make sure that test stacks don't talk to prod and that operators aren't manually poking things.
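
One common way to authenticate messages is an HMAC over a shared secret, sketched below with Python's standard library. This is an assumption for illustration: the thread does not say whether AWS control planes use HMACs, public-key signatures, mutual TLS, or something else, and the keys and message here are made up.

```python
import hashlib
import hmac


def sign(key: bytes, message: bytes) -> str:
    """Compute an HMAC-SHA256 tag for a control message."""
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def is_authentic(key: bytes, message: bytes, signature: str) -> bool:
    """Accept the message only if the tag was made with the same key."""
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(sign(key, message), signature)


# Hypothetical keys: giving each stack its own key means a message signed by
# the test stack is rejected by prod, and so is anything typed in by hand.
PROD_KEY = b"prod-secret-for-illustration-only"
TEST_KEY = b"test-secret-for-illustration-only"
```

A data plane host holding only `PROD_KEY` would reject a command signed with `TEST_KEY`, which covers the "test stacks don't talk to prod" case as a side effect of the security control.
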
Pattern 3: reduce blast radius. Do your best, write great code, do great code reviews, test everything, twice, more. But still have some humility and assume things will fail. So reduce the scope of impact, have circuit breakers and so on.
Watch @PeterVosshall's talk to go much deeper on this: https://www.youtube.com/watch?v=swQbA4zub20 …
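
As one example of the circuit breakers mentioned in pattern 3, here is a minimal sketch. The class, its thresholds, and the injectable `clock` are all hypothetical choices for this example; production breakers usually add more (per-dependency state, metrics, jitter).

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is skipped because the breaker is open."""


class CircuitBreaker:
    """Opens after max_failures consecutive failures, then allows a trial
    call once cooldown_seconds have passed. Thresholds are illustrative."""

    def __init__(self, max_failures=3, cooldown_seconds=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, func):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown_seconds:
                # Fail fast instead of piling more load on a sick dependency.
                raise CircuitOpenError("circuit open: call skipped")
            # Cooldown elapsed: fall through and allow one trial call.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping calls to a dependency in `breaker.call(...)` limits the scope of impact: once the dependency is clearly failing, callers stop hammering it and get a fast, explicit error instead.
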
Pattern 4: Asynchronous Coupling! If system A calls system B synchronously, which means that B has to succeed for A to make any progress, then they are basically one system. There is no real insulation or meaningful separation.
Worse still: if A calls B which calls C and so on, and they have retries built-in, things can get really bad really quickly when there are problems! Just 3 layers deep with 3 retries per layer, and you have a 27x amplification factor if the deepest service fails. Oh wow is that bad.
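
The 27x figure is 3 x 3 x 3, reading "3 retries per layer" as three attempts in total at each of three calling layers. The small simulation below counts the requests that reach the failing service under that assumption; the function is hypothetical and exists only to show the arithmetic.

```python
def count_deepest_calls(layers: int, attempts_per_layer: int) -> int:
    """Simulate a synchronous call chain in which the deepest service always
    fails and every layer above it retries. Returns how many requests
    actually reach the deepest service for one top-level request."""
    calls = 0

    def call(depth: int) -> bool:
        nonlocal calls
        if depth == layers:
            calls += 1
            return False  # the deepest service is down
        # Each caller tries its dependency up to attempts_per_layer times.
        for _ in range(attempts_per_layer):
            if call(depth + 1):
                return True
        return False

    call(0)
    return calls


print(count_deepest_calls(layers=3, attempts_per_layer=3))  # 27
```

If "3 retries" instead means three retries on top of the first attempt, that is four attempts per layer and the factor becomes 4 x 4 x 4 = 64, which is even worse.
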
Asynchronous systems are more forgiving: queues and workflows and step functions and so on are all examples. They tend to retry consistently and they can make partial progress when dependencies fail. Of course don't let queues grow infinitely either, have some limits.
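
A bounded queue is the simplest form of "have some limits". The sketch below uses Python's standard `queue.Queue` with a `maxsize`; the `submit` helper and the tiny limit of 2 are assumptions for illustration, and a real system would pick the limit and the rejection behaviour deliberately.

```python
import queue


def submit(work_queue: queue.Queue, item) -> bool:
    """Try to enqueue work without blocking. Returns False when the queue is
    full so the caller can shed load or back off instead of growing a backlog."""
    try:
        work_queue.put_nowait(item)
        return True
    except queue.Full:
        return False


# Hypothetical limit, kept tiny so the behaviour is easy to see.
work = queue.Queue(maxsize=2)
accepted = [submit(work, n) for n in range(3)]  # third item is rejected
```

The producer keeps making progress while the consumer is slow or down, but only up to the limit; after that it gets an explicit signal rather than an ever-growing backlog.
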
All of AWS's multi-region offerings, like S3 cross-region replication, or DynamoDB global tables, are asynchronously coupled. That means that if there is a problem in one region, the other regions don't just stall waiting for it. Very powerful and important!
Pattern 5: use closed feedback loops! Always Be Checking. Never fire and forget. So important that I repeat this a lot. Repeating good advice over and over is actually a good habit.
Pattern 6: should we push data or pull data from the control plane to the data plane? WRONG QUESTION! I mean we can get into eventing systems and edge triggering, but let's not. What really matters 99% of the time is the relative size of fleets ...