A more detailed description of last weekend’s massive GCP outage is on the blog - there’s so much I winced at I can’t even highlight specific interesting lines. https://status.cloud.google.com/incident/cloud-networking/19009 … Configuration change. Automation making things worse. And the postmortem isn’t even fully done
-
-
"we will harden Google's cluster management software such that it rejects such requests regardless of origin, providing an additional layer of defense in depth and eliminating other similar classes of failure." ... this feels like more layering, more complexity. 2/n
-
Why does automation even have access to multiple regions? Why add more logic, which could have its own bugs and problems, instead of enforcing more robust separation; compartmentalization with their own local fail-safes. 3/n
- 3 more replies
New conversation -
-
-
So as per my earlier comment, there is a job that Google needs to do, and I assure you that it knows that it needs to do it, which is to remove global co-ordination, split namespaces, better canarying, etc. However. 1/N
-
You might call it hubris but actually the whole place was designed that way, from the beginning - for maximal scale with minimum effort. Google can’t just throw away that assumption and introduce system partitions immediately. It is rational for them to proceed this way. 3/N
- 1 more reply
New conversation -
Loading seems to be taking a while.
Twitter may be over capacity or experiencing a momentary hiccup. Try again or visit Twitter Status for more information.