Kubernetes Borg/Omega history topic 13: Priority and preemption. Some work is more important and/or urgent than other work. Borg represented this as an integer value: priority. A higher value meant a task was more important than a lower value, and should be able to displace it.
For a while, users tried spreading their workloads across multiple priority bands in order to be nice to other tenants -- a crude kind of fairness in the case of resource crunches. That resulted in preemption cascades: higher-priority tasks preempted lower-priority ones, which in turn preempted tasks below them.
Batch workloads, many of which were continuously and automatically submitted, primarily preempted other batch tasks, causing significant amounts of lost work. So, priorities were "collapsed" into bands such that everything in the same band was treated as the same priority.
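A minimal sketch of what collapsing priorities into bands could look like. The band floors and the helper names `band_of` and `may_preempt` are hypothetical, chosen only for illustration; the actual Borg values are not given in this thread.

```python
import bisect

# Hypothetical band floors, lowest first. The gaps leave room for new bands.
BAND_FLOORS = [0, 100, 200, 300]
BAND_NAMES = ["best-effort", "batch", "production", "monitoring"]

def band_of(priority: int) -> int:
    """Return the index of the band that contains `priority`."""
    if priority < BAND_FLOORS[0]:
        raise ValueError("priority is below the lowest band")
    return bisect.bisect_right(BAND_FLOORS, priority) - 1

def may_preempt(pending_priority: int, running_priority: int) -> bool:
    """After the collapse, only a strictly higher band can displace a task."""
    return band_of(pending_priority) > band_of(running_priority)
```

With these made-up floors, a task at priority 150 cannot preempt one at 110, because both fall in the same band, while a task at 210 can preempt either.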
The collapse reduced preemption, but other mechanisms were needed to ensure timely and efficient scheduling. The rescheduler ensured that pending production-priority tasks could schedule by choosing others to displace. It verified that both the pending task and the displaced task would schedule, to avoid cascades.
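The "both tasks would schedule" check can be sketched roughly as follows. This is an illustration under simplifying assumptions (a single CPU dimension, and only non-production tasks eligible as victims), not Borg's actual rescheduler; `Task`, `Node`, and `find_displacement` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Task:
    name: str
    cpu: int
    production: bool = False

@dataclass
class Node:
    name: str
    capacity: int
    tasks: list = field(default_factory=list)

    def free(self) -> int:
        return self.capacity - sum(t.cpu for t in self.tasks)

def find_displacement(pending: Task, nodes: list) -> Optional[tuple]:
    """Return (victim, source_node, destination_node), or None.

    A move is accepted only if the pending task fits on the source node once
    the victim leaves AND the victim fits on some other node without
    displacing anything there.
    """
    for src in nodes:
        for victim in src.tasks:
            if victim.production:
                continue  # assumption: only non-production tasks are displaced
            if src.free() + victim.cpu < pending.cpu:
                continue  # evicting this victim would not make enough room
            for dst in nodes:
                if dst is not src and dst.free() >= victim.cpu:
                    return victim, src, dst
    return None
```

Returning `None` when the victim has nowhere to go is what keeps a displacement from turning into a cascade in this sketch.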
Groups of batch tasks were queued and admitted to the cluster when enough resources became available to schedule them. Resource quota by priority prevented priority inflation over time. Space was left between the bands in case new bands were needed -- like BASIC line numbering.
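To make queued, all-or-nothing admission with per-band quota concrete, here is a small sketch. The function `admit_groups`, the FIFO ordering, and the choice to stop at the first group that does not fit are all assumptions made for the example, not a description of Borg's admission logic.

```python
from collections import deque

def admit_groups(queue, free_cpu, quota_left):
    """Admit queued groups in FIFO order, each one all-or-nothing.

    queue:      deque of (group_name, band, [task_cpu, ...])
    free_cpu:   CPU currently unallocated in the cluster
    quota_left: dict mapping band -> CPU still allowed at that band
    Returns the names of the admitted groups. Admission stops at the first
    group that does not fit (an assumption), so later groups cannot jump it.
    """
    admitted = []
    while queue:
        name, band, cpus = queue[0]
        need = sum(cpus)
        if need > free_cpu or need > quota_left.get(band, 0):
            break  # the whole group must fit in both resources and quota
        queue.popleft()
        free_cpu -= need
        quota_left[band] -= need
        admitted.append(name)
    return admitted
```

Because quota is tracked per band, asking for a higher priority only helps a user who actually holds quota at that band, which is the property that discourages priority inflation.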
Eventually the priority values of virtually all tasks were changed to rationalize them with the new scheme, across thousands of jobs, in their configuration files, through a painstaking process. This reiterated the importance of abstracting the operational intent.
Borg's approach is described in the Borg paper: https://ai.google/research/pubs/pub43438 …. K8s design proposals were in https://github.com/kubernetes/community/blob/master/contributors/design-proposals/scheduling/pod-preemption.md … and https://github.com/kubernetes/community/blob/master/contributors/design-proposals/scheduling/pod-priority-api.md …. Priority in resource quota: https://github.com/kubernetes/enhancements/blob/master/keps/sig-scheduling/20190409-resource-quota-ga.md …. Coscheduling: https://github.com/kubernetes/enhancements/blob/master/keps/sig-scheduling/34-20180703-coscheduling.md …
Priority in Kubernetes is relatively new, and it's still evolving. For instance, there's an open proposal to add a preemption policy, https://github.com/kubernetes/enhancements/pull/1096 …, primarily to avoid preempting other pods. Borg has a similar mechanism. I'll discuss why when covering QoS.
Waiting for preempted pods to terminate gracefully before starting newly scheduled pods creates significant complexity in the design. The scheduler then needs to model the future state, and some controller needs to watch for the space to become available before starting the new pod.
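One way to picture "modeling the future state" is to track, per node, both what is in use right now and what has been promised to a pod that is waiting for its victims to exit. The sketch below is a simplified illustration with a single CPU dimension; the function names and parameters are hypothetical and do not mirror the Kubernetes scheduler's code.

```python
def can_place(capacity, running_cpu, terminating_cpu, nominated_cpu, request):
    """Decide whether another pod may be placed on the node.

    The pod must fit now (terminating pods still hold their resources) and
    must also fit in the future state, where the terminating pods are gone
    but the space promised to the waiting pod is taken.
    """
    free_now = capacity - running_cpu - terminating_cpu
    free_future = capacity - running_cpu - nominated_cpu
    return request <= free_now and request <= free_future

def can_start_nominated(capacity, running_cpu, terminating_cpu, request):
    """The waiting pod may start only once enough space is actually free."""
    return request <= capacity - running_cpu - terminating_cpu
```

The second function stands in for whatever component watches for the space to become available; until the terminating pods exit, it keeps returning `False` for the waiting pod.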
The complexity of priority and preemption is primarily what drove the change for the DaemonSet controller to rely on the default scheduler to bind pods to nodes, as well as the scheduler framework proposal https://github.com/kubernetes/enhancements/issues/624 …, so the code could be reused in custom schedulers.
I'll cover Quality of Service (QoS) and oversubscription next. Over time, priority bands in Borg (specific hardcoded integer values) came to be used as part of the determination of QoS level, for reasons I'll go into in that thread.