Disruption budgets were never added to the scheduler: doing so would have been hard, and there were also concerns about performance and priority inversion. Instead, higher-priority tasks could specify how long they would wait for lower-priority ones to terminate gracefully
Priorities were used to ensure production/critical serving workloads could always get the resources they needed. This was essential to enabling mixed workloads to run together in the same clusters. Batch and experimental workloads ran at lower priorities, infrastructure at higher
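In Kubernetes terms, this workload separation is expressed with PriorityClass objects. A minimal sketch (the class names and values here are illustrative, not prescribed defaults):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: production-serving
value: 1000000          # higher value = scheduled and retained first
description: "Critical serving workloads; may preempt lower priorities."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch
value: 1000             # batch/experimental work yields to serving
description: "Batch and experimental workloads."
```

Pods reference a class by name via `priorityClassName` in their spec.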
For a while, users tried spreading their workloads across multiple priority bands in order to be nice to other tenants -- a crude kind of fairness during resource crunches. That resulted in preemption cascades, with higher-priority tasks preempting lower-priority ones
Batch workloads, many of which were submitted automatically and continuously, primarily preempted other batch tasks, causing significant amounts of lost work. So, priorities were "collapsed" into bands such that everything in the same band was treated as the same priority
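The collapse can be sketched as a mapping from raw priority values to bands, with preemption permitted only across bands, never within one. The band names and ranges below are hypothetical; Borg's actual boundaries aren't public:

```python
# Hypothetical band table: (low, high, name), ordered low to high.
BANDS = [
    (0, 99, "free"),
    (100, 199, "batch"),
    (200, 299, "production"),
]

def band_of(priority: int) -> str:
    """Map a raw task priority to its band; all tasks in a band
    are treated as having equal priority."""
    for lo, hi, name in BANDS:
        if lo <= priority <= hi:
            return name
    raise ValueError(f"priority {priority} outside all bands")

def may_preempt(preemptor: int, victim: int) -> bool:
    """Preemption only across bands, never within a band, so two
    batch tasks at priorities 150 and 120 can't evict each other."""
    names = [name for _, _, name in BANDS]
    return names.index(band_of(preemptor)) > names.index(band_of(victim))
```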
The collapse reduced preemption, but other mechanisms were needed to ensure timely and efficient scheduling. The rescheduler ensured that pending production-priority tasks could schedule by choosing others to displace. It verified that both tasks would schedule, to avoid cascades
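The rescheduler's safety check amounts to: only displace a victim if, afterwards, both the pending task and the victim can schedule. A minimal sketch with an assumed toy resource model (names and data structures are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    free: int                 # free resource units on this node

@dataclass
class Task:
    name: str
    usage: int                # resource units the task needs
    node: "Node" = None       # where it currently runs, if anywhere

def fits(task: Task, node: Node, extra_free: int = 0) -> bool:
    return node.free + extra_free >= task.usage

def find_safe_displacement(pending: Task, victims: list, nodes: list):
    """Return a victim whose removal lets `pending` schedule, but only
    if the victim itself can immediately fit elsewhere -- otherwise
    displacing it would just start a preemption cascade."""
    for victim in victims:
        node = victim.node
        if not fits(pending, node, extra_free=victim.usage):
            continue  # evicting this victim still wouldn't free enough
        if any(fits(victim, other) for other in nodes if other is not node):
            return victim  # both tasks end up scheduled
    return None  # no safe displacement exists
```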
Groups of batch tasks were queued and admitted to the cluster when enough resources became available to schedule them. Resource quota by priority prevented priority inflation over time. Space was left between the bands in case new bands were needed -- like BASIC line numbering
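Kubernetes later grew an analogous quota-by-priority mechanism (the resource-quota KEP linked below): a ResourceQuota can be scoped to a PriorityClass, capping what each band can consume. A sketch, assuming a PriorityClass named `batch` exists:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: batch-quota
spec:
  hard:
    pods: "100"           # cap on pods admitted at this priority
  scopeSelector:
    matchExpressions:
    - operator: In
      scopeName: PriorityClass
      values: ["batch"]   # quota applies only to batch-priority pods
```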
Eventually, through a painstaking process, the priority values of virtually all tasks -- across thousands of jobs, in their configuration files -- were changed to rationalize them with the new scheme. This reiterated the importance of abstracting operational intent.
Borg's approach is described in the Borg paper: https://ai.google/research/pubs/pub43438. K8s design proposals were in https://github.com/kubernetes/community/blob/master/contributors/design-proposals/scheduling/pod-preemption.md and https://github.com/kubernetes/community/blob/master/contributors/design-proposals/scheduling/pod-priority-api.md. Priority in resource quota: https://github.com/kubernetes/enhancements/blob/master/keps/sig-scheduling/20190409-resource-quota-ga.md. Coscheduling: https://github.com/kubernetes/enhancements/blob/master/keps/sig-scheduling/34-20180703-coscheduling.md
Priority in Kubernetes is relatively new, and it's still evolving. For instance, there's an open proposal to add a preemption policy, https://github.com/kubernetes/enhancements/pull/1096, primarily to avoid preempting other pods. Borg has a similar mechanism. I'll discuss why when covering QoS
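That proposal became the `preemptionPolicy` field on PriorityClass: a pod can be scheduled ahead of lower classes without ever evicting running pods. A sketch (the class name and value are illustrative):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-nonpreempting
value: 100000
preemptionPolicy: Never   # default is PreemptLowerPriority
description: "High scheduling priority that never preempts other pods."
```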
Waiting for preempted pods to terminate gracefully before starting newly scheduled pods creates significant complexity in the design. The scheduler then needs to model the future state, and some controller needs to watch for the space to become available before starting the new pod
The complexity of priority and preemption is primarily what drove the change to have the DaemonSet controller rely on the default scheduler to bind pods to nodes, as well as the scheduler framework proposal, https://github.com/kubernetes/enhancements/issues/624, so that the code could be reused in custom schedulers
I'll cover Quality of Service (QoS) and oversubscription next. Over time, priority bands in Borg (specific hardcoded integer values) came to be used as part of the determination of QoS level, for reasons I'll go into in that thread.