Kubernetes Borg/Omega history topic 12: A follow-on to the PodDisruptionBudget topic: the descheduler (https://github.com/kubernetes-incubator/descheduler). "Descheduler" is a more appropriate name than the original term "rescheduler", because its job is to decide which pods to kill, not to replace or schedule them
-
In Kubernetes, when running on a cloud provider such as GKE, if pods are pending because there is no available space for them on existing nodes, cluster autoscaling or even node auto-provisioning (https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/proposals/node_autoprovisioning.md, https://cloud.google.com/kubernetes-engine/docs/how-to/node-auto-provisioning) can create new nodes for them
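For a rough sense of what node auto-provisioning configuration looks like on GKE: the cluster-wide resource limits that bound auto-provisioned nodes can be supplied as a YAML config file. This is an illustrative sketch, not the authoritative schema (see the linked GKE docs), and the limit values are made up:

```yaml
# Illustrative GKE node auto-provisioning limits (values are made up;
# see the linked GKE docs for the authoritative schema).
# Auto-provisioned node pools may not grow the cluster's aggregate
# resources beyond these bounds.
resourceLimits:
  - resourceType: cpu
    minimum: 4
    maximum: 64
  - resourceType: memory
    maximum: 256
```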
-
In Borg, the rescheduler was created to defragment nodes to make room. It selected tasks to evict so that new tasks could schedule, while also ensuring that replacements for the evicted tasks could find new homes, so as not to cause unnecessary churn
-
In K8s, the purpose of the descheduler is mainly to reshuffle pods to improve their overall distribution across nodes. After churn in a cluster from pod terminations caused by pod autoscaling, pod updates, batch/CI workloads, etc., the pod layout can become uneven
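For instance, the descheduler's LowNodeUtilization strategy evicts pods from overutilized nodes in the hope that the scheduler will place the replacements onto underutilized ones. A minimal policy sketch, assuming the v1alpha1 policy format from the linked repo (the threshold values are illustrative):

```yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "LowNodeUtilization":
    enabled: true
    params:
      nodeResourceUtilizationThresholds:
        # Nodes with usage below all of these percentages are
        # considered underutilized.
        thresholds:
          cpu: 20
          memory: 20
          pods: 20
        # Pods are evicted from nodes above these percentages, so the
        # scheduler can respread them onto the underutilized nodes.
        targetThresholds:
          cpu: 50
          memory: 50
          pods: 50
```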
-
A simple example: say the cluster autoscaler (https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler) added a new node for new pods. If those pods were created by a new Deployment or ReplicaSet, they could all land on the new node if there wasn't enough space on existing nodes
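The descheduler's RemoveDuplicates strategy addresses exactly this case: it evicts pods so that at most one pod with the same owner (ReplicaSet, Job, etc.) runs on each node, letting the scheduler spread the replacements across the cluster. A sketch, again assuming the v1alpha1 policy format:

```yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  # Evict co-located duplicates (pods with the same owner) so the
  # scheduler can respread them across nodes.
  "RemoveDuplicates":
    enabled: true
```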
-
From our experience with Borg, we knew from the beginning of the Kubernetes project that the descheduler would be needed. I think it was first mentioned when discussing the addition of liveness and readiness probes: https://github.com/kubernetes/kubernetes/issues/620#issuecomment-50110653
-
This enabled us to establish a clear separation of concerns: pod creation and replacement by workload controllers, horizontal scaling by the HPA, placement by the scheduler, and rebalancing across nodes and failure domains by the descheduler, which would respect PDBs
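For reference, a PodDisruptionBudget along these lines (policy/v1beta1 was the API group/version at the time; the name and labels here are hypothetical) bounds how many of the matched pods the descheduler, or anything else using the eviction API, can voluntarily disrupt:

```yaml
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: myapp-pdb           # hypothetical name
spec:
  minAvailable: 2           # keep at least 2 matched pods running
  selector:
    matchLabels:
      app: myapp            # hypothetical label
```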
-
That division was discussed when designing eviction for unresponsive nodes (https://github.com/kubernetes/kubernetes/issues/3885#issuecomment-71984989) and then in http://issues.k8s.io/12140. The design docs can be found at https://github.com/kubernetes/community/blob/master/contributors/design-proposals/scheduling/rescheduler.md and https://github.com/kubernetes/community/blob/master/contributors/design-proposals/scheduling/rescheduling.md
-
Note that if churn in the cluster is sufficiently high and eviction is tightly constrained by PodDisruptionBudgets, the descheduler may not be able to keep up. This is one reason why achieving an "optimal" layout may not be possible