As a company evolves, new and different challenges arise, and people are called to face them even if they never had to before. I have recently been wearing the operator hat for part of our systems, both as a maintainer and as a member of the on-call team. [1/]
And when an incident strikes, because of all those safeguards, it's usually much more difficult to pin down. The root cause, if any, is subtler. Hidden behind layers of stuff that "used to just work". [/4]
And all the monitoring, the logs, the alerts... A deluge of data, useless without a hypothesis to glue the pieces together. Cultivating a structured approach to investigation is something I never thought about explicitly, but it makes so much sense! [/5]
And it's probably one of the reasons I find some of these challenges to be so interesting. Because you get the complexity of real natural systems, distilled in an environment where you control the knobs, the variables, the setting. [/6]
Writing about stuff to learn how it works, mostly in Rust.
Lead Engineer at