Ed: Sounds like a COE is approximately a postmortem. I wonder how many COEs there are per week at AWS, and what the threshold is for writing one?
Also, @colmmacc mentions "root cause" here, to which I knee-jerk copy-paste:
https://www.verica.io/inhumanity-of-root-cause-analysis/
https://www.kitchensoap.com/2012/02/10/each-necessary-but-only-jointly-sufficient/
After reviewing postmortems, they use a wheel to select a team for a deep dive into graphs, metrics, etc. (ed: they can't do everyone, and it looks like the wheel scheme incentivizes everyone to be prepared even if only one team presents, since no one wants to look like a chump at a meeting this large)
Also look at "operational sustainability" to keep an eye out for burnout (ed: more details here plz @colmmacc?). Meeting involves not just ops folks but senior execs. Led by SVP @charlieaws, many other senior folks also show up, which shows its importance. pic.twitter.com/oobj6WLX1h
Teams have their own, smaller version of this meeting, including on-call handoff, metrics review, etc. If Colm is thinking about taking a dependency on a service, he can show up to that team's ops meeting to check on how things are going.
What happens when things go wrong? Usually means that something "quite unusual" has happened. Will share AWS' event response workflows. Every team has an oncall, so they can be paged, and most issues can be handled by that on-call (ed: see https://how.complexsystems.fail/#2 )
AWS uses voice conference calls as main coordination backbone, mainly because AWS has expertise and a history of doing so. Avoids ChatOps because many run on AWS itself, and they don't want a dependency on AWS during events (ed: so voice conferences must not be on Chime!)
Every call has a call leader and a facilitator for coordination. The call leader manages engineering coordination; the facilitator handles externally facing status updates, etc. These roles are filled by experienced staff who are empowered to make decisions.
Challenge: how do you build a habit of excellence in event response if you want events to be rare? AWS uses a simple protocol:
* Stay calm
* Assess the situation
* Focus on mitigation, not root cause analysis (ed: already discussed my feelings on RCA)
* Escalate early and often
Ed: these slides have tons of text, perhaps from internal docs? @colmmacc, are those docs published anywhere? Also, this might be an interesting part of the talk to go watch; the story/language is compelling. https://www.slideshare.net/AmazonWebServices/the-theory-and-practice-practice-practice-of-aws-operations-aws-summit-sydney (slides 49-53)
Summary is that operations is not an afterthought, and this talk shows
* how AWS thinks about operational risk
* how AWS deploys
* how SAFE works
@colmmacc where's a good talk on AWS' postmortem process? :-D
Becky gave a talk about that at Sydney too, and I'm giving some more detail about our process at Networking @ Scale on September 9th. Let me know what you'd like to hear!
Replying to @colmmacc
Ah, that's "The Art of Successful Failure" by Becky Weiss (twitter unknown), https://embed.vidyard.com/share/6upxJjy58auqAsYThuj8v6 I'll take a watch in the next week or two and let you know what questions I have after :-P