Since it is so accessible, I'm going to borrow the pattern approach and give 10 patterns we use at Amazon. I've chosen patterns that I hope will be interesting, new, and short enough to synopsise. We have way more!
Pattern 6: should we push data or pull data from the control plane to the data plane? WRONG QUESTION! I mean we can get into eventing systems and edge triggering, but let's not. What really matters 99% of the time is the relative size of fleets ...
The way to think about it is this: don't have large fleets connect to small fleets. They will overwhelm the small fleet with a thundering herd during cold starts or stress events! Optimize connection direction for that.
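(The fix in the thread is architectural: pick the connection direction so the small fleet dials the large one. As a complementary illustration of the thundering-herd problem itself, here's a minimal Python sketch of jittered polling, a standard mitigation that is not from the thread; `fetch` and the intervals are hypothetical.)

```python
import random
import time

def poll_with_jitter(fetch, base_interval=10.0, jitter=5.0):
    """Poll an origin on a jittered schedule, so a large fleet that
    cold-starts at the same moment doesn't hammer a small fleet at once."""
    # A random initial delay spreads out simultaneous cold starts.
    time.sleep(random.uniform(0, jitter))
    while True:
        fetch()
        # Re-jitter every cycle so hosts that synchronised during a
        # stress event drift apart again instead of herding forever.
        time.sleep(base_interval + random.uniform(0, jitter))
```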
Related is pattern 7: Avoid cold start caching problems! If you end up with a caching layer in your system, be very very careful. Can it cope if the origin goes down for an extended duration? When the TTLs expire, will the system stall?
Try to build caches that will serve stale entries, and caches that self-warm or prime themselves before accepting requests; pre-fetching is nice too. Wherever you see caches, see danger, and go super deep on whether they will safely recover from blips.
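A minimal sketch of such a cache in Python, assuming a synchronous `origin_fetch` callable; this is illustrative, not how any AWS cache is actually built:

```python
import time

class StaleServingCache:
    """A cache that (1) primes itself before serving traffic and
    (2) keeps serving stale entries if the origin is down at TTL expiry."""

    def __init__(self, origin_fetch, ttl=60.0):
        self.origin_fetch = origin_fetch  # callable: key -> value
        self.ttl = ttl
        self.entries = {}  # key -> (value, fetched_at)

    def prime(self, keys):
        # Warm the cache up front; don't accept traffic until this runs.
        for key in keys:
            self.entries[key] = (self.origin_fetch(key), time.monotonic())

    def get(self, key):
        value, fetched_at = self.entries.get(key, (None, 0.0))
        if time.monotonic() - fetched_at > self.ttl:
            try:
                value = self.origin_fetch(key)
                self.entries[key] = (value, time.monotonic())
            except Exception:
                # Origin is down: serve the stale value rather than stall.
                if key not in self.entries:
                    raise  # nothing stale to fall back on
        return value
```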
If you have to throttle things to safely recover and shorten the duration of events, do! Have a throttling system at hand. But don't kid yourself either: throttling a customer is also an outage. Think instead how throttling can be used to prioritise smartly ...
Example: ELB is a fault-tolerant AZ-redundant system. We can lose an AZ at any time and ELB is scaled for capacity; it'll be fine. We can deliberately throttle ELB's recovery in a zone after a power event to give our paying customers priority. Works great! Good use of throttling.
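One way to make throttling prioritise smartly is a token bucket that reserves headroom for customer traffic and sheds background recovery work first. A hedged sketch; the class, the priority labels, and the half-bucket reservation are assumptions for illustration, not ELB's actual mechanism:

```python
import time

class PriorityThrottle:
    """Token bucket that admits customer work first and sheds
    low-priority recovery/background work under pressure."""

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = burst
        self.last = time.monotonic()

    def _refill(self):
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now

    def try_acquire(self, priority):
        # Background work is refused once tokens fall below half the
        # bucket; customer work may drain the bucket completely.
        self._refill()
        floor = 0.0 if priority == "customer" else self.capacity / 2
        if self.tokens - 1 >= floor:
            self.tokens -= 1
            return True
        return False
```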
Pattern 9: I couldn't say it at the time, but basically use a system like QLDB (https://aws.amazon.com/qldb/) for your control plane data flow if you can! If you have an immutable append-only ledger for your data flow then ...
... you can compute and merge deltas easily, minimising data volume, and you get item history, so you can implement point-in-time recovery and rollback! You can also optimise out no-op changes. We use this pattern in Route 53, EC2, and a bunch of places.
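A toy in-memory stand-in for the ledger idea, just to make the delta, rollback, and no-op points concrete; a real system would use QLDB or similar, and every name here is hypothetical:

```python
class Ledger:
    """Append-only log of (key, value) changes. Replaying a prefix gives
    point-in-time recovery; diffing two versions gives a minimal delta."""

    def __init__(self):
        self.log = []  # append-only; entries are never mutated

    def append(self, key, value):
        # Optimise out no-op changes: skip if this is already the value.
        if self.snapshot().get(key) == value:
            return
        self.log.append((key, value))

    def snapshot(self, version=None):
        # Replay the log up to `version` (None = latest). A toy: a real
        # ledger would index this instead of replaying every time.
        state = {}
        for key, value in self.log[:version]:
            state[key] = value
        return state

    def delta(self, old_version, new_version=None):
        # Only ship what actually changed between two versions.
        old, new = self.snapshot(old_version), self.snapshot(new_version)
        return {k: v for k, v in new.items() if old.get(k) != v}
```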
O.k. I left the most important thoughts and pattern for last. You have to filter every element of your design through the lens of "How many modes of operation do I have?". For stability, that number needs to be minimal.
Avoid emergency modes that behave differently, or anything that can suddenly alter what the system is doing. Think about your system in terms of state space, or code branches. How many can you get rid of?
Branches and state spaces are evil because they grow exponentially: with n independent branches you get 2^n combinations, quickly past the point where you can test or predict behaviour, so it becomes emergent instead. A simple example here is relational databases.
I'm not knocking offerings like RDS or Aurora; relational DBs are great for versatile business queries, but they are terrible for control planes. We essentially ban them for that purpose at AWS. Why?
RDBMSs have built-in fancy Query Plan Optimizers that can suddenly change which indices are used, or how tables are scanned. That can have a disastrous effect on performance or behaviour. Another problem is that they are very accessible and tempting ...
... an operator, a product manager, or a business analyst might each think it's safe to run a one-time read-only query, but a simple SQL typo can choke up the system! Bad bad. So what's the fix?
Use NoSQL and do things the "dumb" way every time: do a full join, or a full table scan, every time for every query. Because the perf characteristics are then much more obvious to the programmer and designer, the result is much more stable!
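For instance, with DynamoDB you can deliberately make every query a full Scan plus an in-memory filter, so every query costs the same and the cost is visible. A sketch assuming boto3, AWS credentials, and a hypothetical table name:

```python
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("control-plane-config")  # hypothetical table

def all_items():
    # The "dumb" way: a full table scan, paginated to completion,
    # for every single query. No query planner to surprise you.
    items, resp = [], table.scan()
    items.extend(resp["Items"])
    while "LastEvaluatedKey" in resp:
        resp = table.scan(ExclusiveStartKey=resp["LastEvaluatedKey"])
        items.extend(resp["Items"])
    return items

def find(predicate):
    # Every query is the same scan plus an in-memory filter, so the
    # performance profile is identical no matter what you ask.
    return [item for item in all_items() if predicate(item)]
```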
I've tweet-stormed about this before, but now we're getting into the "constant work" pattern. The most stable control systems do the same work all of the time, with no change that depends on the data, or even on the volume of change.
Suppose you need to get some config to your data plane. What if the data plane just fetched the config from S3 every 10 seconds, whether it changed or not? And reloaded the configuration, every time, whether it changed or not?
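That design is small enough to show in full. A minimal sketch assuming boto3; the bucket, key, and `apply_config` callable are hypothetical:

```python
import time
import boto3

s3 = boto3.client("s3")

def constant_work_loop(bucket, key, apply_config, interval=10.0):
    """Fetch and apply the config every `interval` seconds, whether it
    changed or not. The work is constant, so a burst of changes (or an
    outage recovery) looks exactly like steady state."""
    while True:
        start = time.monotonic()
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        apply_config(body)  # reload every time, even if identical
        time.sleep(max(0.0, interval - (time.monotonic() - start)))
```

Usage would be something like `constant_work_loop("my-config-bucket", "config.json", apply)`; the point of the design is that there is no "changed vs. unchanged" branch at all.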
This simple, simple design is rarely seen in the wild, and I don't know why. It's very very reliable ... incredibly resilient, and it will recover from all sorts of issues. It's not even expensive! We're talking hundreds of dollars per year. Not even a few days of SDE time.
That's the pattern we use for our most critical systems. The network health check statuses that allow AWS to instantly handle an Availability Zone power issue? Those are always flowing, all the time, 0 or 1, whether they change or not.
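To illustrate the shape of that (not AWS's actual implementation), here is a constant-rate UDP heartbeat that sends the current 0/1 status every period, whether it changed or not; receivers treat silence, not a special message, as trouble. The address and `is_healthy` callable are hypothetical:

```python
import socket
import time

def stream_health_bits(receiver, is_healthy, period=1.0):
    # Constant work: send the current status bit every period,
    # changed or not. Silence is what signals failure downstream.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    while True:
        bit = b"1" if is_healthy() else b"0"
        sock.sendto(bit, receiver)  # e.g. receiver = ("10.0.0.1", 9000)
        time.sleep(period)
```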
We have these and so many more patterns, and ... we've been building them into API Gateway and Lambda behind the scenes too! So consider building your control planes on those!
Thank you for listening to my talk! Always always feel free to AMA. This is the last tweet in the thread for now, and I won't even promote my SoundCloud!