The system itself is probably nowhere near the complexity of what @joyent is running, but I can relate to what @bcantrill says in this talk, from a couple of years ago: https://youtu.be/30jNsCVLpAE [/2]
Automation rules out a whole class of possible errors, but it also often reduces the number of people who have a sufficiently holistic understanding of the whole system. [/3]
And when an incident strikes, because of all those safeguards, it's usually much more difficult to pin down. The root cause, if any, is subtler. Hidden behind layers of stuff that "used to just work". [/4]
And all the monitoring, the logs, the alerts... A deluge of data, useless without a hypothesis to glue the pieces together. Cultivating a structured approach to investigation is something I never thought about explicitly, but it makes so much sense! [/5]
And it's probably one of the reasons I find some of these challenges to be so interesting. Because you get the complexity of real natural systems, distilled in an environment where you control the knobs, the variables, the setting. [/6]
ouch. sorry. When I had to do stuff like that I started reading every story I could find of outages. Consolidating what worked and didn't. I found that making a physical list of things to check helped. Even if I didn't always follow the list, it helped to know there was a "plan"
Nothing to be sorry about :) Playbooks are definitely useful; the challenge, I guess, is keeping them relevant and up to date.