Kubernetes Borg/Omega history topic 11: PodDisruptionBudget. Google constantly performs software and hardware maintenance in its datacenters: firmware updates, kernel and image updates, disk repairs, switch updates, battery tests, and so on, with more kinds added over time.
For Omega, we developed a model, disruption counters, that could be applied both during task preemption (to run a higher-priority task) and during eviction for maintenance. There was a time dimension that ended up not being effective due to constant changes, so we dropped it in K8s.
I think I first mentioned this in Kubernetes in my big scheduling braindump comment: https://github.com/kubernetes/kubernetes/issues/4301#issuecomment-74355529 …. It came up again when I proposed maxUnavailable to moderate concurrent disruptions caused by updates during the design of Deployment: https://github.com/kubernetes/kubernetes/pull/12236#discussion_r36501373 …
That discussion was forked into https://github.com/kubernetes/kubernetes/issues/12611 …. Around that time, Matt Liggett (https://github.com/kubernetes/kubernetes/pulls?q=is%3Apr+author%3Amml+is%3Aclosed …) joined the GKE team from Borg SRE (woo hoo!). One of the first things Matt worked on was improving node drains: https://github.com/kubernetes/kubernetes/issues/6080 …
Together with @davidopp and @erictune4, we folded disruption budgets into the rescheduling design proposal: https://github.com/kubernetes/community/blob/master/contributors/design-proposals/scheduling/rescheduling.md#disruption-budget …. (Rescheduling deserves its own thread -- I'll do that one next.) Implementation began in https://github.com/kubernetes/kubernetes/pull/24697 … and https://github.com/kubernetes/kubernetes/pull/25551 …
PodDisruptionBudget is now documented: https://kubernetes.io/docs/concepts/workloads/pods/disruptions/ … and https://kubernetes.io/docs/tasks/run-application/configure-pdb/ …. Try it out and give us feedback on how well it works for you. We're looking to advance it from beta to GA: https://github.com/kubernetes/enhancements/issues/85 …
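For anyone trying it out, here is a rough sketch of the budget arithmetic as I understand it from the documentation linked above: a budget sets either minAvailable or maxUnavailable (an absolute count or a percentage), and a voluntary eviction is allowed only while the number of healthy pods stays at or above the resulting floor. The function and parameter names are made up for illustration, and this is a simplification, not the actual controller code.

```python
from typing import Optional, Union

IntOrPercent = Union[int, str]


def resolve(value: IntOrPercent, total: int) -> int:
    """Turn an absolute count or a percentage string such as "25%" into a pod count.

    Assumption: percentages are rounded up, as the PDB docs describe.
    """
    if isinstance(value, int):
        return value
    percent = int(value.rstrip("%"))
    # Integer ceiling division avoids floating-point rounding surprises.
    return (percent * total + 99) // 100


def disruptions_allowed(
    expected_pods: int,
    healthy_pods: int,
    min_available: Optional[IntOrPercent] = None,
    max_unavailable: Optional[IntOrPercent] = None,
) -> int:
    """Return how many more pods may be voluntarily evicted right now."""
    # Exactly one of the two limits may be set on a budget.
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")
    if min_available is not None:
        desired_healthy = resolve(min_available, expected_pods)
    else:
        desired_healthy = expected_pods - resolve(max_unavailable, expected_pods)
    # Never report a negative budget when the workload is already degraded.
    return max(0, healthy_pods - desired_healthy)


if __name__ == "__main__":
    print(disruptions_allowed(5, 5, min_available=4))
    print(disruptions_allowed(10, 8, max_unavailable=3))
```

When the result is zero, an eviction request for a covered pod should be refused until more pods become healthy, which is what lets a drain proceed gradually instead of taking a whole workload down at once.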
You can safely drain a node with kubectl drain: https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/ …. Node upgrades and the cluster autoscaler in Google Kubernetes Engine (GKE) also respect PodDisruptionBudget. The latter is documented here: https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-autoscaler …
Node upgrade behavior is documented here: https://cloud.google.com/kubernetes-engine/docs/how-to/node-auto-upgrades …
And more about the Google SRE philosophy behind automation can be found in the SRE book: https://landing.google.com/sre/sre-book/chapters/automation-at-google/ …
And the Safe Removal Service was also mentioned in Google's "VM Live Migration at Scale" paper in VEE 2018: https://dl.acm.org/citation.cfm?id=3186415 …