Batch workloads, many of which were continuously and automatically submitted, primarily preempted other batch tasks, causing significant amounts of lost work. So priorities were "collapsed" into bands such that everything in the same band was treated as the same priority.
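A minimal sketch of the banding idea in Go; the band width and example values are purely illustrative, not Borg's actual numbers:

```go
// Illustrative sketch of collapsing fine-grained priorities into bands.
package bands

// bandOf maps a raw priority value to its band, so every priority within
// a band compares as equal and same-band tasks never preempt each other.
func bandOf(priority int) int {
	const bandWidth = 100 // made-up width; Borg's real values differ
	return (priority / bandWidth) * bandWidth
}

// Preempts reports whether task a may preempt task b: only when a's band
// is strictly higher, regardless of fine-grained differences within a band.
func Preempts(a, b int) bool {
	return bandOf(a) > bandOf(b)
}
```

For example, Preempts(178, 103) is false because both collapse to the same band, while Preempts(203, 178) is true.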
-
The collapse reduced preemption, but other mechanisms were needed to ensure timely and efficient scheduling. The rescheduler ensured that pending production-priority tasks could schedule by choosing other tasks to displace. It verified that both the displaced and pending tasks would be able to schedule, to avoid preemption cascades.
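A hypothetical sketch of that check; the Cluster interface and Task type are made-up stand-ins for the real scheduling machinery:

```go
// Hypothetical sketch: only displace a victim if the pending task will fit
// once the victim is gone AND the victim itself can reschedule elsewhere,
// so one displacement doesn't cascade into further displacements.
package resched

type Task struct {
	Name     string
	Priority int
}

// Cluster stands in for the scheduler's feasibility checks.
type Cluster interface {
	// FitsWithout reports whether t would schedule if victim were removed.
	FitsWithout(t, victim Task) bool
	// FitsElsewhere reports whether t could schedule on another machine.
	FitsElsewhere(t Task) bool
}

// PickVictim chooses a lower-priority task to displace such that both the
// pending task and the victim end up schedulable.
func PickVictim(c Cluster, pending Task, candidates []Task) (Task, bool) {
	for _, v := range candidates {
		if v.Priority >= pending.Priority {
			continue // never displace equal- or higher-priority work
		}
		if c.FitsWithout(pending, v) && c.FitsElsewhere(v) {
			return v, true
		}
	}
	return Task{}, false
}
```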
-
Groups of batch tasks were queued and admitted to the cluster when enough resources became available to schedule them. Resource quota by priority prevented priority inflation over time. Space was left between the bands in case new bands were needed -- like BASIC line numbering
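On the Kubernetes side, quota by priority is expressed by scoping a ResourceQuota to a priority class. A sketch using the real core/v1 types; the "prod" class name, namespace, and quantity are illustrative:

```go
// Sketch: a ResourceQuota that caps the CPU claimable by pods whose
// priorityClassName is "prod". Names and quantities are illustrative.
package quota

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func ProdQuota() *corev1.ResourceQuota {
	return &corev1.ResourceQuota{
		ObjectMeta: metav1.ObjectMeta{Name: "prod-quota", Namespace: "default"},
		Spec: corev1.ResourceQuotaSpec{
			Hard: corev1.ResourceList{
				corev1.ResourceCPU: resource.MustParse("100"),
			},
			// Only pods in the "prod" priority class count against this quota,
			// which keeps teams from inflating priorities to grab capacity.
			ScopeSelector: &corev1.ScopeSelector{
				MatchExpressions: []corev1.ScopedResourceSelectorRequirement{{
					ScopeName: corev1.ResourceQuotaScopePriorityClass,
					Operator:  corev1.ScopeSelectorOpIn,
					Values:    []string{"prod"},
				}},
			},
		},
	}
}
```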
-
Eventually, through a painstaking process, the priority values of virtually all tasks, across thousands of jobs and their configuration files, were changed to rationalize them with the new scheme. This reinforced the importance of abstracting operational intent.
-
Borg's approach is described in the Borg paper: https://ai.google/research/pubs/pub43438. K8s design proposals were in https://github.com/kubernetes/community/blob/master/contributors/design-proposals/scheduling/pod-preemption.md and https://github.com/kubernetes/community/blob/master/contributors/design-proposals/scheduling/pod-priority-api.md. Priority in resource quota: https://github.com/kubernetes/enhancements/blob/master/keps/sig-scheduling/20190409-resource-quota-ga.md. Coscheduling: https://github.com/kubernetes/enhancements/blob/master/keps/sig-scheduling/34-20180703-coscheduling.md
-
Priority in Kubernetes is relatively new, and it's still evolving. For instance, there's an open proposal to add a preemption policy, https://github.com/kubernetes/enhancements/pull/1096, primarily to avoid preempting other pods. Borg has a similar mechanism. I'll discuss why when covering QoS.
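A sketch of the proposed API shape, assuming the preemptionPolicy field on PriorityClass from that proposal; the class name and value here are illustrative:

```go
// Sketch: a priority class whose pods are scheduled ahead of lower-priority
// pods but never evict running ones; they wait for capacity instead.
package priority

import (
	corev1 "k8s.io/api/core/v1"
	schedulingv1 "k8s.io/api/scheduling/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func NonPreemptingClass() *schedulingv1.PriorityClass {
	never := corev1.PreemptNever
	return &schedulingv1.PriorityClass{
		ObjectMeta: metav1.ObjectMeta{Name: "high-non-preempting"}, // illustrative
		Value:      100000,                                         // illustrative
		// Never: don't trigger preemption; just queue ahead of lower priorities.
		PreemptionPolicy: &never,
	}
}
```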
-
Waiting for preempted pods to terminate gracefully before starting newly scheduled pods creates significant complexity in the design. The scheduler then needs to model the future state, and some controller needs to watch for the space to become available before starting the new pod.
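A hypothetical sketch of that bookkeeping. In Kubernetes the scheduler records the chosen node in pod.Status.NominatedNodeName; the types and callbacks below are simplified stand-ins, and the polling loop stands in for what would really be event-driven logic:

```go
// Hypothetical sketch: after preemption is triggered, the new pod cannot
// bind until the victims' resources are actually released, so something
// must remember the nominated node and act when space appears.
package nominate

import "time"

type PendingPod struct {
	Name          string
	NominatedNode string // space reserved here while victims terminate
}

// BindWhenFree waits for the nominated node to have room, then binds.
// Meanwhile the scheduler must model this future state so other pods
// aren't placed into the space being vacated.
func BindWhenFree(p *PendingPod, hasRoom func(node string) bool, bind func(pod, node string)) {
	for !hasRoom(p.NominatedNode) {
		time.Sleep(time.Second) // stand-in for watching victim deletions
	}
	bind(p.Name, p.NominatedNode)
}
```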
-
The complexity of priority and preemption is primarily what drove the change to have the DaemonSet controller rely on the default scheduler to bind pods to nodes, as well as the scheduler framework proposal https://github.com/kubernetes/enhancements/issues/624, so the code could be reused in custom schedulers.
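An illustrative sketch of the framework idea, with made-up interfaces rather than the real scheduler framework API: scheduling behavior is decomposed into plugins called at fixed extension points around a reusable core loop:

```go
// Illustrative sketch: filter plugins reject infeasible nodes, score
// plugins rank the survivors, and the core loop is shared by any scheduler.
package framework

type Pod struct{ Name string }
type Node struct{ Name string }

// FilterPlugin rejects nodes that cannot run the pod.
type FilterPlugin interface {
	Filter(pod Pod, node Node) bool
}

// ScorePlugin ranks nodes that passed all filters.
type ScorePlugin interface {
	Score(pod Pod, node Node) int
}

// Schedule runs the shared core loop; custom schedulers reuse it by
// supplying their own plugin sets.
func Schedule(pod Pod, nodes []Node, filters []FilterPlugin, scorers []ScorePlugin) (Node, bool) {
	var best Node
	bestScore, found := -1, false
	for _, n := range nodes {
		feasible := true
		for _, f := range filters {
			if !f.Filter(pod, n) {
				feasible = false
				break
			}
		}
		if !feasible {
			continue
		}
		score := 0
		for _, s := range scorers {
			score += s.Score(pod, n)
		}
		if score > bestScore {
			best, bestScore, found = n, score, true
		}
	}
	return best, found
}
```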
-
I'll cover Quality of Service (QoS) and oversubscription next. Over time, priority bands in Borg (specific hardcoded integer values) came to be used as part of the determination of QoS level, for reasons I'll go into in that thread.
-
After QoS, also on my list are liveness and readiness probes, the networking model, bootstrapping, state machines, and a handful of smaller topics. Configuration seemed popular, and is a bottomless area, so I will get back to it at some point as well
-
Replying to @bgrant0607
Oh yes, liveness probes and their rare use cases... I've already seen too many errors caused by liveness probes.