With time-division multiplexing, many CPU threads can be interleaved. They can be blocked and queued by the OS, typically at the cost of context switches and waiting a few time slices. Thus there is no fixed limit to how many can be packed onto a machine. CPU is compressible
-
OTOH, swapping memory pages, even to local SSD, is painfully expensive. This is why systems hosting services that need to respond with subsecond latency disable swap. Memory is considered an incompressible resource. For simplicity, I'll ignore resources other than CPU and memory
-
Compressible resources like CPU can be made available quickly by the kernel with low impact on the threads that were interrupted, provided it knows which threads urgently need the resources and which ones don't. We call these latency-sensitive and latency-tolerant, respectively
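To make the distinction concrete, here is a minimal Go sketch, assuming we tag latency-tolerant work ourselves on plain Linux: give it a high nice value so that, under contention, the kernel reclaims CPU from it first with little harm. Borg/LMCTFY/Kubernetes express this through cgroups rather than per-process nice values; this is only illustrative.

```go
// Illustrative only: mark a process as latency-tolerant by weakening its
// scheduling priority, so the kernel preferentially takes CPU away from it
// when latency-sensitive work needs the cycles.
package main

import (
	"fmt"
	"os"
	"syscall"
)

func markLatencyTolerant(pid int) error {
	// nice 19 is the weakest priority for normal (SCHED_OTHER) tasks.
	return syscall.Setpriority(syscall.PRIO_PROCESS, pid, 19)
}

func main() {
	if err := markLatencyTolerant(os.Getpid()); err != nil {
		fmt.Fprintln(os.Stderr, "setpriority failed:", err)
	}
	fmt.Println("this process now yields CPU readily under contention")
}
```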
-
Borg used an explicit attribute to indicate this, called appclass, which is described by the Borg paper: https://ai.google/research/pubs/pub43438. This was translated to scheduling latency in LMCTFY: https://github.com/google/lmctfy/blob/master/include/lmctfy.proto#L142. In Kubernetes, it's inferred from resource requests and limits
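A simplified Go sketch of that inference; the real logic lives in the kubelet, covers multiple containers and resources, and defaults requests from limits, but the shape is: no requests or limits means best-effort, requests equal to limits means guaranteed, anything in between is burstable.

```go
// Simplified sketch of inferring a Kubernetes-style QoS tier from requests
// and limits. Names and structure here are illustrative, not the kubelet's.
package main

import "fmt"

type Resources struct {
	CPURequest, CPULimit int64 // millicores; 0 means unset
	MemRequest, MemLimit int64 // bytes; 0 means unset
}

type QoSClass string

const (
	Guaranteed QoSClass = "Guaranteed" // requests == limits for all resources
	Burstable  QoSClass = "Burstable"  // some requests set, but not fully guaranteed
	BestEffort QoSClass = "BestEffort" // no requests or limits at all
)

func inferQoS(r Resources) QoSClass {
	if r.CPURequest == 0 && r.CPULimit == 0 && r.MemRequest == 0 && r.MemLimit == 0 {
		return BestEffort
	}
	if r.CPULimit != 0 && r.MemLimit != 0 &&
		r.CPURequest == r.CPULimit && r.MemRequest == r.MemLimit {
		return Guaranteed
	}
	return Burstable
}

func main() {
	// Request 500m / limit 1000m CPU, 256Mi / 512Mi memory: Burstable.
	fmt.Println(inferQoS(Resources{
		CPURequest: 500, CPULimit: 1000,
		MemRequest: 256 << 20, MemLimit: 512 << 20,
	}))
}
```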
-
In order to reallocate incompressible resources quickly, threads need to be killed, which is obviously not low impact. (For memory, in Linux this is done by the OOM killer.) This was why Borg used priority (production priority vs not) to make memory oversubscription decisions
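A hypothetical sketch of how a node agent can encode that priority for the Linux OOM killer via /proc/<pid>/oom_score_adj, so that non-production work is killed first when memory is oversubscribed. The specific values are made up; Kubernetes uses its own per-QoS-class formula in the kubelet.

```go
// Illustrative only: bias the Linux OOM killer so lower-priority work is
// sacrificed before latency-sensitive production work.
package main

import (
	"fmt"
	"os"
)

// oomScoreAdj picks an oom_score_adj for a workload tier.
// Higher values are killed first; negative values are protected.
func oomScoreAdj(productionPriority bool) int {
	if productionPriority {
		return -900 // strongly protected from the OOM killer
	}
	return 900 // preferred victim when memory runs out
}

func applyOOMScoreAdj(pid, adj int) error {
	path := fmt.Sprintf("/proc/%d/oom_score_adj", pid)
	return os.WriteFile(path, []byte(fmt.Sprintf("%d", adj)), 0644)
}

func main() {
	// Example: protect the current process as if it were production work.
	if err := applyOOMScoreAdj(os.Getpid(), oomScoreAdj(true)); err != nil {
		fmt.Fprintln(os.Stderr, "failed to set oom_score_adj:", err)
	}
}
```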
-
Borg's resource reclamation approach is described in the paper: reservations were computed from observed usage, and oversubscribed resources (latency-tolerant CPU and non-production memory) were tallied against those reservations, whereas guaranteed resources were tallied against limits. Complicated.
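A minimal sketch of that accounting, with made-up names: guaranteed work is charged at its limit, while oversubscribable work is charged at a reservation derived from observed usage plus a safety margin. Borg's actual estimator is more sophisticated; see the paper.

```go
// Illustrative reclamation accounting: how much capacity a scheduler can
// still hand out on a machine when some work is charged at its reservation
// rather than its limit.
package main

import "fmt"

type Task struct {
	Limit         float64 // configured limit (cores or bytes)
	ObservedUsage float64 // recent measured usage
	Guaranteed    bool    // production / latency-sensitive?
}

func reservation(t Task, margin float64) float64 {
	if t.Guaranteed {
		return t.Limit // no reclamation for guaranteed work
	}
	r := t.ObservedUsage * (1 + margin) // pad observed usage
	if r > t.Limit {
		r = t.Limit
	}
	return r
}

// schedulableCapacity is what remains for new work on the machine.
func schedulableCapacity(machine float64, tasks []Task, margin float64) float64 {
	free := machine
	for _, t := range tasks {
		free -= reservation(t, margin)
	}
	return free
}

func main() {
	tasks := []Task{
		{Limit: 4, ObservedUsage: 3.5, Guaranteed: true},  // charged 4.0
		{Limit: 4, ObservedUsage: 0.5, Guaranteed: false}, // charged ~0.6
	}
	fmt.Printf("free: %.1f cores\n", schedulableCapacity(16, tasks, 0.2))
}
```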
-
Vertical autoscaling (VA) added even more complexity. VA changed limits, but left its own padding to provide slack for reaction time and observation of demand. Ad hoc mechanisms were added to disable limit enforcement for each resource, creating a notion similar to request in K8s
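A hypothetical sketch of the padding idea: set the target above a high percentile of observed demand so there is slack to absorb growth while the next observe/actuate cycle catches up. The percentile and padding factor here are illustrative, not Borg's or the Kubernetes VPA's.

```go
// Illustrative vertical-autoscaling recommendation: pad a high percentile
// of recent usage to leave headroom for reaction time.
package main

import (
	"fmt"
	"sort"
)

func percentile(samples []float64, p float64) float64 {
	s := append([]float64(nil), samples...)
	sort.Float64s(s)
	idx := int(p * float64(len(s)-1))
	return s[idx]
}

// recommendedTarget pads a high percentile of observed usage.
func recommendedTarget(usageSamples []float64, padding float64) float64 {
	return percentile(usageSamples, 0.95) * (1 + padding)
}

func main() {
	usage := []float64{0.8, 1.1, 0.9, 1.4, 1.0, 1.2, 1.3, 0.7} // cores
	fmt.Printf("target: %.2f cores\n", recommendedTarget(usage, 0.15)) // 15% slack
}
```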
-
In K8s, I wanted something simpler, to directly convey the desire for oversubscription and bursting flexibility. The discussion started way back in http://issues.k8s.io/147 and http://issues.k8s.io/168. The model we settled on was determined by looking at limits and requests
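Roughly how a runtime can translate that model onto cgroup v1 knobs, as a sketch: the CPU request becomes a proportional share (a floor under contention), the CPU limit becomes a CFS quota (a cap on bursting), and the memory limit is a hard cap enforced by the OOM killer. The constants follow the usual conventions, but treat the details as assumptions.

```go
// Sketch of mapping requests/limits to cgroup v1 settings.
package main

import "fmt"

const cfsPeriodUs = 100000 // conventional 100ms CFS period

// cpuShares: proportional weight from the request (1 core ~ 1024 shares).
func cpuShares(requestMilliCPU int64) int64 {
	shares := requestMilliCPU * 1024 / 1000
	if shares < 2 {
		shares = 2
	}
	return shares
}

// cpuQuotaUs: hard cap from the limit; a limit above the request leaves
// room to burst into otherwise-idle cycles.
func cpuQuotaUs(limitMilliCPU int64) int64 {
	return limitMilliCPU * cfsPeriodUs / 1000
}

func main() {
	// Request 500m, limit 2000m: guaranteed half a core, may burst to 2 cores.
	fmt.Println("cpu.shares:", cpuShares(500))         // 512
	fmt.Println("cpu.cfs_quota_us:", cpuQuotaUs(2000)) // 200000
	// The memory limit goes to memory.limit_in_bytes; no bursting, since
	// memory is incompressible.
}
```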
-
Replying to @bgrant0607
Do you think it was a good decision to make it simpler and derive from that implicitly? I am not sure I like all of the autoscaling behavior. Sometimes it feels a bit over-engineered and too implicit.
-
Replying to @sszuecs
Other than latency tolerance, Borg was even more implicit, and had too many overlapping control loops. I think the K8s model is a better starting point, but it has had less iteration, because it's 10 years younger, and also because it's not the same priority when machines can also be scaled
-
The unified cgroup hierarchy (cgroup v2) forces some changes as well
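One example of those changes, sketched in Go: the v1 cpu.shares knob (range 2..262144) becomes cpu.weight (range 1..10000) under cgroup v2, so request-derived shares have to be rescaled. The linear conversion below is the one commonly used by runtimes such as runc; treat it as illustrative.

```go
// Sketch of rescaling cgroup v1 cpu.shares into cgroup v2 cpu.weight.
package main

import "fmt"

func sharesToWeight(shares uint64) uint64 {
	if shares == 0 {
		return 0 // unset
	}
	// Map [2, 262144] linearly onto [1, 10000].
	return 1 + ((shares-2)*9999)/262142
}

func main() {
	fmt.Println(sharesToWeight(2))      // 1     (minimum)
	fmt.Println(sharesToWeight(1024))   // 39    (a one-CPU request)
	fmt.Println(sharesToWeight(262144)) // 10000 (maximum)
}
```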