I'm not surprised to find out that Linux kernel socket load balancing is in a terrible state with no good options. Some background:
blog.cloudflare.com/the-sad-state-
Traditional epoll wakes every thread and they race to accept the connection. EPOLLEXCLUSIVE fixes it but uses LIFO order.
LIFO order is terrible for a web server. HTTP connections are generally long-lived and reused for mixed / varying workloads. That's even more true with HTTP/2 where clients are only supposed to make a single connection to each server and multiplex everything over it concurrently.
In a standard nginx setup on Linux, it uses EPOLLEXCLUSIVE. This gives nearly all the connections to the same worker until it starts getting overloaded. Even then, the most overloaded workers still keep getting the most connections while handling other events. It's pretty awful.
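A toy simulation of why LIFO wakeups pile everything onto one worker. This is not kernel code. It assumes light load, where each accept finishes before the next connection arrives, so the worker that was just woken is again the most recent waiter. The function name and parameters are made up for illustration.

```python
from collections import Counter


def simulate_lifo(num_workers=4, num_connections=1000):
    """Count accepted connections per worker under LIFO wakeup order."""
    # Treat the list as a stack: the last element is the worker that
    # most recently started waiting.
    waiting = list(range(num_workers))
    accepted = Counter()
    for _ in range(num_connections):
        worker = waiting.pop()      # LIFO: wake the most recent waiter
        accepted[worker] += 1
        waiting.append(worker)      # accept is quick, so it waits again first
    return accepted
```

With 4 workers and 1000 connections, one worker ends up with all 1000 and the other three get nothing.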
The alternative that's offered is using reuseport to have the kernel evenly distribute new connections across workers.
It doesn't account for the varying usage of connections. It happily hands out new connections to a worker that's not even idle. It's not as nice as it sounds.
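A minimal sketch of the reuseport approach, assuming a platform that defines `SO_REUSEPORT` (Linux 3.9+). The helper name `reuseport_listener` is hypothetical. Each worker opens its own listening socket on the same port, and the kernel picks a socket for each new connection without knowing how busy the worker behind it is.

```python
import socket


def reuseport_listener(port, host="127.0.0.1"):
    """Open a listening socket with SO_REUSEPORT set (illustrative helper)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Must be set before bind() on every socket sharing the port.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind((host, port))
    sock.listen(128)
    return sock
```

Usage would be one call per worker with the same port. Each worker then has its own accept queue, so a connection queued behind a busy worker just waits there.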
The kernel developers were gradually improving REUSEPORT in different ways. However, they stopped and allowed some of it to regress. You're supposed to use BPF to give the kernel a load balancing approach specific to the application. Of course, applications don't actually do it.
A decent approach for many applications would be if epoll used FIFO instead of LIFO order. It would work like reuseport but only distributing connections to idle workers. It was proposed in 2015 with EPOLLEXCLUSIVE but didn't land. It's apparently what Cloudflare uses themselves.
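A toy model of the FIFO idea, again not kernel code. It assumes busy workers are simply not in the wait queue and that each accept finishes before the next connection arrives. The function name and the `busy` parameter are made up for illustration.

```python
from collections import Counter, deque


def simulate_fifo(num_workers=4, num_connections=1000, busy=frozenset()):
    """Count accepted connections per worker under FIFO wakeup order.

    Workers listed in `busy` are not waiting, so they never get woken.
    """
    # Queue of idle workers; the left end has been waiting the longest.
    waiting = deque(w for w in range(num_workers) if w not in busy)
    accepted = Counter()
    for _ in range(num_connections):
        worker = waiting.popleft()  # FIFO: wake the longest-waiting worker
        accepted[worker] += 1
        waiting.append(worker)      # goes to the back of the queue
    return accepted
```

With worker 3 marked busy, the 1000 connections are spread almost evenly over workers 0, 1 and 2, and worker 3 gets none.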
marc.info/?l=linux-fsdev was an attempt at reviving it. Cloudflare tends to not be persistent enough to get through the hassle of getting changes landed in the Linux kernel or nginx.
They deal with the initial technical aspect but not the politics / evangelism to get it landed.
They also tend not to invest resources into making things into a more generic solution suitable beyond their use case. Makes sense for them. They clearly don't see that much value in their changes being upstream. The upstream projects tend not to care much about latency, etc.
It only seems useful when you have such a high load that serializing accepting connections is the bottleneck.
I don't think it really makes much sense as the approach on a server doing TLS. The overhead of TLS for accepting a new connection is pretty high.
It would make sense on some kind of internal reverse proxy or perhaps a static web server not handling TLS itself. I don't think it makes much sense on the frontend. The issue is that EPOLLEXCLUSIVE is so ridiculously stupid and chooses the worst idle process to handle accepts...
I'm thinking that maybe `accept_mutex on` is the least bad option for an HTTPS-only web server on a non-NUMA machine. REUSEPORT doesn't seem worthwhile unless you have a setup like multiple many-core CPUs.
EPOLLEXCLUSIVE is simply awful if you actually have real load + TLS.

