Colm MacCárthaigh

@colmmacc

AWS, Apache, Crypto, Irish Music, Haiku, Photography

Seattle
notesfromthesound.com
Joined April 2008

    1. Colm MacCárthaigh‏ @colmmacc 7 Dec 2018

      But DON'T take risks with security, durability, or availability. Those are core values and top priorities that need to be inviolable. Take risks with business ideas, features, and product names, and have some fun! pic.twitter.com/yWfZSaMyh0

    2. Colm MacCárthaigh‏ @colmmacc 7 Dec 2018

      With that context, let's build some stable and reliable control systems! What do we use them for? 4 common reasons: 1/ lifecycling resources (launching, scaling, etc.) 2/ deploying system config 3/ deploying software 4/ deploying user settings. pic.twitter.com/00uaBoD3mr

    3. Colm MacCárthaigh‏ @colmmacc 7 Dec 2018

      At Amazon, we encourage merging 2 and 3. Deploying system config, like global feature flags, *IS* deploying software. So where possible, we use the same system for both. We have awesome, awesome deployment safety systems: one-boxing, staggering, rollback, etc. So use them for both! pic.twitter.com/kyv7AmwPUP

    4. Colm MacCárthaigh‏ @colmmacc 7 Dec 2018

      For building control systems, it turns out there's a whole branch of rigorous engineering called control theory. There's a lot of math, and it is awesome, well worth knowing, but also you don't need all of that to get most of the benefit. Here is what is worth knowing ... pic.twitter.com/LtjZtIMs1J

    5. Colm MacCárthaigh‏ @colmmacc 7 Dec 2018

      Every stable control system needs 3 things: a measurement process, a controller, and an actuator. Basically something to see how the world is, something to figure out how the world needs to change, and something that makes that change happen. pic.twitter.com/cuZP7dJLdr

    6. Colm MacCárthaigh‏ @colmmacc 7 Dec 2018

      That simple mental model is very very important. Most control systems built by CS people *don't* have a measurement element. Like the remote control we've already seen! These systems propagate errors they can't correct. BAD BAD.

    7. Colm MacCárthaigh‏ @colmmacc 7 Dec 2018

      So always start with the idea of a measurer: poll every server to know what state it is in, check whether the user settings got there, etc ... and build the system as something that corrects any errors it sees, not a system that just blindly shouts instructions.

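An illustrative sketch of the measure / decide / act loop described above (not from the thread). The function names, the toy `fleet` dictionary, and the version strings are all hypothetical; a real measurer would poll actual servers.

```python
from typing import Callable, Dict


def reconcile(desired: Dict[str, str],
              measure: Callable[[], Dict[str, str]],
              actuate: Callable[[str, str], None]) -> int:
    """One pass of a closed loop. Returns how many corrections were made."""
    observed = measure()                       # measurement: how is the world?
    corrections = 0
    for server, wanted in desired.items():     # controller: what needs to change?
        if observed.get(server) != wanted:
            actuate(server, wanted)            # actuator: make the change happen
            corrections += 1
    return corrections


# Toy stand-in for a fleet: server name -> deployed version (assumed data).
fleet = {"web-1": "v1", "web-2": "v2"}
desired = {"web-1": "v2", "web-2": "v2", "web-3": "v2"}


def measure_fleet() -> Dict[str, str]:
    return dict(fleet)


def push_version(server: str, version: str) -> None:
    fleet[server] = version


first_pass = reconcile(desired, measure_fleet, push_version)
second_pass = reconcile(desired, measure_fleet, push_version)
```

The second pass finds nothing to fix because the loop corrects only the differences it measures, rather than re-sending instructions blindly.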
    8. Colm MacCárthaigh‏ @colmmacc 7 Dec 2018

      O.k. that's 80% of control theory right there for you. The next 10% is that controllers are very sensitive to lag. Imagine a furnace that heated your boiler based on the temperature from an hour ago. It'd be very unstable! pic.twitter.com/4GJ21YM7uu

    9. Colm MacCárthaigh‏ @colmmacc 7 Dec 2018

      Imagine scaling up based on the system's load from 2 hours ago. You might not even need those machines any more; the peak may have passed! So systems need to be fast. Low lag is critical. O.k. now we know 90% of control theory.

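To make the lag point concrete, here is a small simulation sketch (not from the thread). It assumes a simple proportional controller; the gain, the lag of 5 steps, the target, and the step count are arbitrary illustrative values.

```python
def simulate(gain: float, lag: int, steps: int = 200, target: float = 10.0) -> list:
    """Proportional controller acting on a measurement that is `lag` steps old.

    Returns the error (target - value) at every step.
    """
    history = [0.0]                                   # the controlled value starts at 0
    for _ in range(steps):
        stale_index = max(0, len(history) - 1 - lag)
        measured = history[stale_index]               # what the controller believes
        correction = gain * (target - measured)
        history.append(history[-1] + correction)
    return [target - value for value in history]


fresh = simulate(gain=0.5, lag=0)   # controller sees the current value
stale = simulate(gain=0.5, lag=5)   # controller sees the value from 5 steps ago

fresh_final_error = abs(fresh[-1])
stale_recent_peak = max(abs(error) for error in stale[-20:])
```

With these assumed numbers the no-lag run settles on the target, while the lagged run overshoots and oscillates with growing amplitude, even though the controller logic is identical.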
    10. Colm MacCárthaigh‏ @colmmacc 7 Dec 2018

      If you want to get the next 5%, 9% ... 10%, and please do, then focus on learning what "PID" means. I'm just going to say this to tempt you: if you can learn to recognise the P.I.D. components of real-world control systems, it is a design review super-power. pic.twitter.com/dDOzyIx8bV

      Colm MacCárthaigh‏ @colmmacc 7 Dec 2018

      Like, in seconds you can spot that a system can't possibly be stable. Buy this book ... https://www.amazon.com/Designing-Distributed-Control-Systems-Language/dp/1118694155/ it's very approachable and takes a pattern-based approach. pic.twitter.com/hm0xj0XELo

      9:18 AM - 7 Dec 2018
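
For readers meeting "PID" for the first time, a textbook discrete PID controller looks roughly like the sketch below. This is the generic form, not something taken from the thread or the book; the gains and sample values are arbitrary assumptions.

```python
class PID:
    """Generic discrete PID controller: proportional, integral, derivative terms."""

    def __init__(self, kp: float, ki: float, kd: float):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.previous_error = None

    def update(self, setpoint: float, measured: float, dt: float) -> float:
        error = setpoint - measured
        self.integral += error * dt                            # I: accumulated past error
        if self.previous_error is None:
            derivative = 0.0                                   # no slope on the first sample
        else:
            derivative = (error - self.previous_error) / dt    # D: how fast the error changes
        self.previous_error = error
        # P reacts to the present error, I to its history, D to its trend.
        return self.kp * error + self.ki * self.integral + self.kd * derivative


pid = PID(kp=2.0, ki=0.5, kd=0.1)
first_output = pid.update(setpoint=10.0, measured=8.0, dt=1.0)
second_output = pid.update(setpoint=10.0, measured=9.0, dt=1.0)
```

Recognising the three terms in a real system means asking what reacts to the current error, what reacts to accumulated error, and what reacts to the rate of change.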
        2. Colm MacCárthaigh‏ @colmmacc 7 Dec 2018

          Since it is so accessible, I'm going to borrow the pattern approach and give 10 patterns we use at Amazon. I've chosen patterns that I hope will be interesting, new, and short enough to synopsise. We have way more! pic.twitter.com/3p15C3yohx

        3. Colm MacCárthaigh‏ @colmmacc 7 Dec 2018

          O.k. pattern 1: CHECKSUM ALL THE THINGS. Because this: https://status.aws.amazon.com/s3-20080720.html Never underestimate the ability of bit-rot to set in. S3 had an event in 2008 due to a single corrupt bit!! pic.twitter.com/7pSSWYfeb8

        4. Colm MacCárthaigh‏ @colmmacc 7 Dec 2018

          To this day, we still ask teams if they are checksumming everything. Another example of how corruption can slip in is ... YAML. Because YAML is truncatable, configs can fall back to implicit defaults due to partial transfers, full disks, etc. *sigh* CHECKSUM ALL THE THINGS. pic.twitter.com/h0HB1ZZLF3

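A minimal sketch of checksumming a config before loading it (not from the thread). The `publish` and `load_verified` helpers and the sample YAML are hypothetical; SHA-256 via `hashlib` is just one reasonable choice of checksum.

```python
import hashlib


def publish(config_bytes: bytes) -> dict:
    """Bundle a config payload with its SHA-256 digest."""
    return {"payload": config_bytes,
            "sha256": hashlib.sha256(config_bytes).hexdigest()}


def load_verified(bundle: dict) -> bytes:
    """Return the payload only if it matches the digest it was published with."""
    actual = hashlib.sha256(bundle["payload"]).hexdigest()
    if actual != bundle["sha256"]:
        raise ValueError("config checksum mismatch; refusing to load")
    return bundle["payload"]


original = b"timeout_seconds: 30\nmax_connections: 100\n"
bundle = publish(original)

# Simulate a partial transfer: the truncated text is still valid YAML,
# so without the checksum it would load and silently drop a setting.
truncated = dict(bundle, payload=original[:20])
```

Calling `load_verified(truncated)` raises `ValueError` instead of quietly falling back to defaults for the missing keys.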
        5. Colm MacCárthaigh‏ @colmmacc 7 Dec 2018

          Pattern 2: control planes need strong cryptographic authentication! They are important security systems, so make sure that they are protected from malicious data. It's ALSO useful to make sure that test stacks don't talk to prod and that operators aren't manually poking things. pic.twitter.com/NGy45zzaxs

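One way to picture this is a data plane that only accepts commands carrying a valid message authentication code. The sketch below is an assumption-laden illustration, not the mechanism AWS uses: the keys, the command format, and the helper names are hypothetical, and a production system would use managed keys or asymmetric signatures.

```python
import hashlib
import hmac


def sign(key: bytes, message: bytes) -> str:
    """HMAC-SHA256 tag for a control-plane message."""
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def verify(key: bytes, message: bytes, signature: str) -> bool:
    expected = sign(key, message)
    return hmac.compare_digest(expected, signature)   # constant-time comparison


# Hypothetical per-environment keys; never hard-code real keys like this.
PROD_KEY = b"hypothetical-prod-key"
TEST_KEY = b"hypothetical-test-key"

command = b"set feature_x=on"
signed_by_prod = sign(PROD_KEY, command)
signed_by_test = sign(TEST_KEY, command)

accepts_prod = verify(PROD_KEY, command, signed_by_prod)
accepts_test_stack = verify(PROD_KEY, command, signed_by_test)
accepts_tampered = verify(PROD_KEY, b"set feature_x=off", signed_by_prod)
```

Because the prod data plane checks against the prod key, a test stack or a hand-typed command is rejected the same way malicious data would be.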
        6. Colm MacCárthaigh‏ @colmmacc 7 Dec 2018

          Pattern 3: reduce blast radius. Do your best, write great code, do great code reviews, test everything, twice, more. But still have some humility and assume things will fail. So reduce the scope of impact, have circuit breakers and so on. pic.twitter.com/6vWzwqv5tC

        7. Colm MacCárthaigh‏ @colmmacc 7 Dec 2018

          Watch @PeterVosshall's talk to go much deeper on this: https://www.youtube.com/watch?v=swQbA4zub20

        8. Colm MacCárthaigh‏ @colmmacc 7 Dec 2018

          Pattern 4: Asynchronous Coupling! If system A calls system B synchronously, which means that B has to succeed for A to make any progress, then they are basically one system. There is no real insulation or meaningful separation. pic.twitter.com/LL8T36UwSx

        9. Colm MacCárthaigh‏ @colmmacc 7 Dec 2018

          Worse still: if A calls B which calls C and so on, and they have retries built-in, things can get really bad really quickly when there are problems! Just 3 layers deep with 3 retries per layer, and you have a 27x amplification factor if the deepest service fails. Oh wow is that bad.

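The 27x figure can be checked with a small simulation (a sketch, not from the thread). It assumes "3 retries" means 3 attempts in total at each layer, giving 3 x 3 x 3 = 27; with one initial try plus 3 retries it would be 4 x 4 x 4 = 64.

```python
def call_with_retries(depth: int, attempts: int, counter: dict) -> bool:
    """Simulate a call chain `depth` layers deep where the deepest service always fails."""
    for _ in range(attempts):
        if depth == 1:
            counter["deepest_calls"] += 1      # a request reaches the failing service
            succeeded = False
        else:
            succeeded = call_with_retries(depth - 1, attempts, counter)
        if succeeded:
            return True
    return False                               # every attempt at this layer failed


counter = {"deepest_calls": 0}
call_with_retries(depth=3, attempts=3, counter=counter)
amplification = counter["deepest_calls"]       # requests hitting the deepest service per user request
```

One incoming request turns into `attempts ** depth` requests at the bottom of the stack, exactly when that service is least able to cope.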
        10. Colm MacCárthaigh‏ @colmmacc 7 Dec 2018

          Asynchronous systems are more forgiving: queues and workflows and step functions and so on are all examples. They tend to try consistently and they can make partial progress when dependencies fail. Of course don't let queues grow infinitely either; have some limits.

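A sketch of the "have some limits" advice using Python's standard `queue` module (not from the thread). The limit of 3 and the `enqueue` helper are arbitrary illustrative choices.

```python
import queue

work = queue.Queue(maxsize=3)      # hypothetical limit; pick one that fits the system


def enqueue(item) -> bool:
    """Accept work if there is room, otherwise reject instead of growing without bound."""
    try:
        work.put_nowait(item)
        return True
    except queue.Full:
        return False               # caller can shed, defer, or surface the overload


accepted = [enqueue(n) for n in range(5)]
```

Rejecting at the edge keeps the backlog bounded, so the consumer never faces an unbounded pile of stale work after a dependency recovers.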
        11. Colm MacCárthaigh‏ @colmmacc 7 Dec 2018

          All of AWS's multi-region offerings, like S3 cross-region replication, or DynamoDB global tables, are asynchronously coupled. That means that if there is a problem in one region, the other regions don't just stall waiting for it. Very powerful and important!

        12. Colm MacCárthaigh‏ @colmmacc 7 Dec 2018

          Pattern 5: use closed feedback loops! Always Be Checking. Never fire and forget. So important that I repeat this a lot. Repeating good advice over and over is actually a good habit. pic.twitter.com/sFP6G2alYm

        13. Colm MacCárthaigh‏ @colmmacc 7 Dec 2018

          Pattern 6: should we push data or pull data from the control plane to the data plane? WRONG QUESTION! I mean we can get into eventing systems and edge triggering, but let's not. What really matters 99% of the time is the relative size of fleets ... pic.twitter.com/SEZ6of3NPn

        14. Colm MacCárthaigh‏ @colmmacc 7 Dec 2018

          The way to think about it is this: don't have large fleets connect to small fleets. They will overwhelm the small fleet with a thundering herd during cold starts or stress events! Optimize connection direction for that.

        15. Colm MacCárthaigh‏ @colmmacc 7 Dec 2018

          Related is pattern 7: Avoid cold start caching problems! If you end up with a caching layer in your system, be very very careful. Can it cope if the origin goes down for an extended duration? When the TTLs expire, will the system stall? pic.twitter.com/x3SjogK1vJ

        16. Colm MacCárthaigh‏ @colmmacc 7 Dec 2018

          Try to build caches that will serve stale entries, and caches that self-warm or prime their cache before accepting requests; pre-fetching is nice too. Wherever you see caches, see danger, and go super deep on whether they will safely recover from blips.

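A rough sketch of a cache that primes itself and serves stale entries when the origin is down (not from the thread). The class name, the TTL, the injected clock, and the fake origin are all assumptions made for illustration.

```python
import time


class StaleOkCache:
    """Cache that refreshes after a TTL but falls back to stale data if the origin fails."""

    def __init__(self, fetch, ttl_seconds: float, clock=time.monotonic):
        self.fetch = fetch
        self.ttl = ttl_seconds
        self.clock = clock
        self.entries = {}                      # key -> (value, fetched_at)

    def prime(self, keys) -> None:
        """Warm the cache before accepting requests."""
        for key in keys:
            self.get(key)

    def get(self, key):
        entry = self.entries.get(key)
        now = self.clock()
        if entry is not None and now - entry[1] < self.ttl:
            return entry[0]                    # fresh hit
        try:
            value = self.fetch(key)
        except Exception:
            if entry is not None:
                return entry[0]                # origin down: serve the stale entry
            raise                              # nothing cached, so surface the failure
        self.entries[key] = (value, now)
        return value


# Fake origin and clock so the example runs without a network or real waiting.
now = [0.0]
origin_up = [True]


def fetch_from_origin(key):
    if not origin_up[0]:
        raise ConnectionError("origin unavailable")
    return f"value-for-{key}"


cache = StaleOkCache(fetch_from_origin, ttl_seconds=30, clock=lambda: now[0])
cache.prime(["routes"])
origin_up[0] = False          # origin goes down
now[0] = 120.0                # well past the TTL
stale_value = cache.get("routes")
```

The entry is long expired, but the cache keeps answering instead of stalling while the origin is unavailable.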
        17. Colm MacCárthaigh‏ @colmmacc 7 Dec 2018

          If you have to throttle things to safely recover and shorten the duration of events, do! Have a throttling system at hand. But don't kid yourself either: throttling a customer is also an outage. Think instead how throttling can be used to prioritise smartly ... pic.twitter.com/nDPlEoSmYB

        18. Colm MacCárthaigh‏ @colmmacc 7 Dec 2018

          Example: ELB is a fault-tolerant, AZ-redundant system. We can lose an AZ at any time and, because ELB is scaled for capacity, it'll be fine. We can deliberately throttle ELB's recovery in a zone after a power event to give our paying customers priority. Works great! Good use of throttling.

        19. Colm MacCárthaigh‏ @colmmacc 7 Dec 2018

          Pattern 9: I couldn't say it at the time, but basically use a system like QLDB (https://aws.amazon.com/qldb/) for your control plane data flow if you can! If you have an immutable append-only ledger for your data flow then ... pic.twitter.com/N1C9XQuO9D

        20. Colm MacCárthaigh‏ @colmmacc 7 Dec 2018

          ... you can compute and merge deltas easily, minimising data volume, and you get item history, so you can implement point-in-time recovery and rollback! You can also optimise out no-op changes. We use this pattern in Route 53, EC2, and a bunch of other places.

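A toy in-memory stand-in for the append-only ledger idea (a sketch; this is not the QLDB API and not how Route 53 or EC2 implement it). The `Ledger` class, its method names, and the sample records are hypothetical, and deletions are left out for brevity.

```python
class Ledger:
    """Append-only log of (version, key, value) entries; nothing is modified in place."""

    def __init__(self):
        self.entries = []

    def current(self, as_of=None) -> dict:
        """Replay the log; `as_of` gives point-in-time recovery."""
        state = {}
        for version, key, value in self.entries:
            if as_of is not None and version > as_of:
                break
            state[key] = value
        return state

    def put(self, key, value) -> bool:
        if self.current().get(key) == value:
            return False                       # no-op change optimised out
        self.entries.append((len(self.entries) + 1, key, value))
        return True

    def delta_since(self, version: int) -> dict:
        """Merged changes after `version`; later writes to a key win."""
        return {key: value for v, key, value in self.entries if v > version}


ledger = Ledger()
ledger.put("a.example.com", "10.0.0.1")              # version 1
ledger.put("b.example.com", "10.0.0.2")              # version 2
noop_written = ledger.put("a.example.com", "10.0.0.1")
ledger.put("a.example.com", "10.0.0.9")              # version 3

latest = ledger.current()
rolled_back = ledger.current(as_of=2)
delta = ledger.delta_since(2)
```

A data plane that already holds version 2 only needs `delta`, and an operator can recover the earlier state by replaying up to an older version.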
        21. Colm MacCárthaigh‏ @colmmacc 7 Dec 2018

          O.k. I left the most important thoughts and pattern for last. You have to filter every element of your design through the lens of "How many modes of operation do I have?". For stability, that needs to be minimal. pic.twitter.com/kvXZfbTlzl

        22. Colm MacCárthaigh‏ @colmmacc 7 Dec 2018

          Avoid emergency modes that are different, or anything that can alter what the system is doing suddenly. Think about your system in terms of state space, or code branches. How many can you get rid of?

        23. Colm MacCárthaigh‏ @colmmacc 7 Dec 2018

          Branches and state spaces are evil because they grow exponentially, past the point where you can test or predict behaviour; the behaviour becomes emergent instead. A simple example here is relational databases. pic.twitter.com/jiHIm1zGNk

        24. Colm MacCárthaigh‏ @colmmacc 7 Dec 2018

          I'm not knocking offerings like RDS or Aurora; relational DBs are great for versatile business queries, but they are terrible for control planes. We essentially ban them for that purpose at AWS. Why?

        25. Colm MacCárthaigh‏ @colmmacc 7 Dec 2018

          RDBMSs have built-in fancy Query Plan Optimizers that can suddenly change what indices are being used, or how tables are being scanned. That can have a disastrous effect on performance or behaviour. Another problem is that they are very accessible and tempting ...

        26. Colm MacCárthaigh‏ @colmmacc 7 Dec 2018

          ... an operator, product manager, or business analyst might each think it's safe to run a one-time read-only query, but a simple SQL typo can choke up the system! Bad bad. So what's the fix?

        27. Colm MacCárthaigh‏ @colmmacc 7 Dec 2018

          Use NoSQL and do things the "dumb" way every time. Because the perf characteristics are much more obvious to the programmer and designer, now you can just do a full join, or a full table scan every time for every query. Much more stable!

        28. Colm MacCárthaigh‏ @colmmacc 7 Dec 2018

          I've tweet stormed about this before, but now we're getting into the "constant work" pattern. The most stable control systems do the same work all of the time, with no change that is dependent on the data, or even the volume of change. pic.twitter.com/Gp0eD5emZi

        29. Colm MacCárthaigh‏ @colmmacc 7 Dec 2018

          Suppose you need to get some config to your data plane. What if the data plane just fetched the config from S3 every 10 seconds, whether it changed or not? And reloaded the configuration, every time, whether it changed or not?

        30. Colm MacCárthaigh‏ @colmmacc 7 Dec 2018

          This simple, simple design is rarely seen in the wild, but I don't know why. It's very very reliable ... incredibly resilient and will recover from all sorts of issues. It's not even expensive! We're talking hundreds of dollars per year. Not even a few days of SDE time. pic.twitter.com/6ZBaxiamwP

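The fetch-and-reload-every-10-seconds design can be sketched as a loop that does identical work on every cycle (an illustration, not from the thread). Here `fetch_config` stands in for the S3 read and `apply_config` for the reload; both, along with the injected `sleep` and the bounded `cycles` count, are assumptions so the example runs without a network.

```python
import time


def run_constant_work(fetch_config, apply_config, cycles: int,
                      interval_seconds: float = 10.0, sleep=time.sleep) -> None:
    """Fetch and reload on every cycle, whether or not anything changed."""
    for _ in range(cycles):
        config = fetch_config()     # always fetch the whole config
        apply_config(config)        # always reload it, with no "did it change?" branch
        sleep(interval_seconds)


# Stand-ins for the real fetch and reload so the sketch is self-contained.
published = iter(["v1", "v1", "v2"])
applied = []

run_constant_work(fetch_config=lambda: next(published),
                  apply_config=applied.append,
                  cycles=3,
                  sleep=lambda seconds: None)   # skip real waiting in the example
```

There is no separate code path for "config changed", so a busy day and a quiet day exercise exactly the same work.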
        31. Colm MacCárthaigh‏ @colmmacc 7 Dec 2018

          That's the pattern we use for our most critical systems. The network health check statuses that allow AWS to instantly handle an Availability Zone power issue? Those are always flowing, all the time, 0 or 1, whether they change or not.
