Asynchronous systems are more forgiving: queues, workflows, step functions and so on are all examples. They tend to try consistently, and they can make partial progress when dependencies fail. Of course, don't let queues grow infinitely either; have some limits.
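A minimal sketch of "have some limits", assuming Python's standard `queue` module. The limit `MAX_PENDING` and the `submit` helper are hypothetical names for illustration, not anything from the talk.

```python
import queue

# Hypothetical limit; tune it to how much backlog you can afford to hold.
MAX_PENDING = 3

work = queue.Queue(maxsize=MAX_PENDING)


def submit(job):
    """Enqueue a job, or reject it right away if the queue is at its limit."""
    try:
        work.put_nowait(job)
        return True
    except queue.Full:
        # Shed load instead of letting the backlog grow without bound.
        return False


# Five submissions against a limit of three: the last two are rejected.
results = [submit(f"job-{i}") for i in range(5)]
print(results)
```

Rejecting at the door keeps the queue's memory use and drain time bounded, at the cost of the caller having to deal with a "no".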
O.k. I left the most important thoughts and pattern for last. You have to filter every element of your design through the lens of "How many modes of operation do I have?". For stability, that needs to be minimal. pic.twitter.com/kvXZfbTlzl
Avoid emergency modes that are different, or anything that can alter what the system is doing suddenly. Think about your system in terms of state space, or code branches. How many can you get rid of?
Branches and state spaces are evil, because they grow exponentially. Past the point where you can test or predict behaviour, they become emergent instead. A simple example here is relational databases. pic.twitter.com/jiHIm1zGNk
I'm not knocking offerings like RDS or Aurora; relational DBs are great for versatile business queries, but they are terrible for control planes. We essentially ban them for that purpose at AWS. Why?
RDBMSs have built-in fancy Query Plan Optimizers that can suddenly change what indices are being used, or how tables are being scanned. That can have a disastrous effect on performance or behaviour. Another is that they are very accessible and tempting ...
... an operator, product manager, or business analyst might all think it's safe to run a one-time read-only query, but a simple SQL typo can choke up the system! Bad bad. So what's the fix?
Use NoSQL and do things the "dumb" way every time. Because the perf characteristics are much more obvious to the programmer and designer, now you can just do a full join, or a full table scan every time for every query. Much more stable!
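To make the "dumb way every time" idea concrete, here is a rough sketch with a plain Python list standing in for a NoSQL table. The `TABLE` contents and the `scan_where` helper are made up for illustration; the point is only that every query visits every row, so there is no plan that can change underneath you.

```python
# Hypothetical in-memory stand-in for a key-value (NoSQL) table.
TABLE = [
    {"id": "i-1", "zone": "a", "healthy": True},
    {"id": "i-2", "zone": "b", "healthy": False},
    {"id": "i-3", "zone": "a", "healthy": True},
]


def scan_where(table, predicate):
    """Answer a query by visiting every row, every time.

    There is no index choice and no query plan: the cost depends only on
    the table size, never on which predicate is asked.
    """
    visited = 0
    matches = []
    for row in table:
        visited += 1
        if predicate(row):
            matches.append(row["id"])
    return matches, visited


# Two very different questions, identical amount of work.
zone_a, cost_a = scan_where(TABLE, lambda r: r["zone"] == "a")
unhealthy, cost_b = scan_where(TABLE, lambda r: not r["healthy"])
print(zone_a, cost_a)
print(unhealthy, cost_b)
```

The scan is slower than a well-chosen index on a good day, but its cost on a bad day is the same as on a good day.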
I've tweet stormed about this before, but now we're getting into the "constant work" pattern. The most stable control systems do the same work all of the time, with no change that is dependent on the data, or even the volume of change. pic.twitter.com/Gp0eD5emZi
Suppose you need to get some config to your data plane. What if the data plane just fetched the config from S3 every 10 seconds, whether it changed or not? And reloaded the configuration, every time, whether it changed or not?
This simple, simple design is rarely seen in the wild, but I don't know why. It's very, very reliable ... incredibly resilient, and will recover from all sorts of issues. It's not even expensive! We're talking hundreds of dollars per year. Not even a few days of SDE time. pic.twitter.com/6ZBaxiamwP
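The fetch-and-reload loop described above can be sketched roughly as follows. This is an illustration under assumptions, not the actual AWS implementation: `ConfigHolder`, `poll`, and the JSON config format are hypothetical, and `fetch` is any callable that returns the raw config text (in practice it could wrap an S3 read).

```python
import json
import time

POLL_INTERVAL_SECONDS = 10  # from the thread: fetch every 10 seconds


class ConfigHolder:
    """Holds the live config and counts how often it was reloaded."""

    def __init__(self):
        self.config = {}
        self.reloads = 0

    def apply(self, raw):
        # Parse and swap in the config unconditionally.
        # There is deliberately no "did it change?" branch.
        self.config = json.loads(raw)
        self.reloads += 1


def poll(fetch, holder, interval=POLL_INTERVAL_SECONDS, cycles=None,
         sleep=time.sleep):
    """Fetch and reload on every tick, whether or not anything changed.

    `cycles=None` runs forever; a number is handy for demos and tests.
    """
    done = 0
    while cycles is None or done < cycles:
        try:
            holder.apply(fetch())
        except Exception:
            # Keep the last good config; the next tick simply tries again.
            pass
        sleep(interval)
        done += 1


# Demo with a stand-in fetch and no real sleeping.
holder = ConfigHolder()
poll(lambda: '{"rate_limit": 100}', holder, cycles=3, sleep=lambda _: None)
print(holder.reloads, holder.config)
```

Because the loop does the same thing on every tick, a failed fetch or a bad deploy needs no special recovery path: the next tick is the recovery.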
That's the pattern we use for our most critical systems. The network health check statuses that allow AWS to instantly handle an Availability Zone power issue? Those are always flowing, all the time, 0 or 1, whether they change or not.
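As a toy illustration of "always flowing, 0 or 1, whether they change or not" (the target names and `build_report` are invented here, and this says nothing about how AWS actually builds it), a reporter can emit one bit per target on every tick instead of sending only the changes:

```python
# Hypothetical targets; the names are placeholders.
TARGETS = ["az-1", "az-2", "az-3"]


def build_report(targets, is_healthy):
    """Return one 0/1 entry for every target, on every tick.

    The report is the same size whether nothing changed or everything did.
    """
    return [1 if is_healthy(t) else 0 for t in targets]


calm = build_report(TARGETS, lambda t: True)
outage = build_report(TARGETS, lambda t: t != "az-2")
print(calm, outage)
```

A quiet day and a bad day produce reports of identical size, so the system carrying them sees the same load in both cases.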
We have these and so many more patterns, and ... we've been building them into API Gateway and Lambda behind the scenes too! So consider building your control planes on those! pic.twitter.com/DgzdZAyNNC
Thank you for listening to my talk! Always, always feel free to AMA. This is the last tweet in the thread for now, and I won't even promote my Soundcloud!