🤣🐧🤡 This would not even be an issue if everyone hadn't jumped on the epoll 🤡🚗
Quote Tweet
I'm not surprised to find out that Linux kernel socket load balancing is in a terrible state with no good options. Some background: blog.cloudflare.com/the-sad-state- Traditional epoll wakes every thread and they race to accept the connection. EPOLLEXCLUSIVE fixes it but uses LIFO order.
The Linux kernel didn't really provide other viable options. The least bad option not requiring patching the kernel is probably REUSEPORT combined with dealing with the BPF nonsense they expect you to do. You still use epoll but with a socket per worker instead of sharing one.
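A rough sketch of the socket-per-worker setup, assuming Linux 3.9+ for SO_REUSEPORT. The BPF steering program is left out entirely, so the kernel's default hashing decides which socket gets each connection. `make_reuseport_listener` and `make_worker_listeners` are hypothetical helper names.

```python
import select
import socket


def make_reuseport_listener(host, port, backlog=128):
    """Create one non-blocking listener that shares its port via SO_REUSEPORT."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Must be set before bind() on every socket that shares the port.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind((host, port))
    sock.listen(backlog)
    sock.setblocking(False)
    return sock


def make_worker_listeners(n_workers, host="127.0.0.1"):
    """Return (port, [(listener, epoll), ...]) with one pair per worker."""
    first = make_reuseport_listener(host, 0)  # let the kernel pick the port
    port = first.getsockname()[1]
    listeners = [first] + [
        make_reuseport_listener(host, port) for _ in range(n_workers - 1)
    ]
    workers = []
    for listener in listeners:
        ep = select.epoll()
        # Each worker still uses epoll, but only on its own socket.
        ep.register(listener.fileno(), select.EPOLLIN)
        workers.append((listener, ep))
    return port, workers
```

Each new connection lands in exactly one listener's accept queue, so there is no wakeup race between workers. The trade-off is that a connection is committed to a worker at arrival time, whether or not that worker is busy.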
Blocking accept uses FIFO rather than LIFO, but has other issues. It requires a weird non-standard approach to properly handle graceful reloads, etc. It doesn't really make any sense that EPOLLEXCLUSIVE uses a different order than blocking accept by default. It's a weird choice.
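To make the FIFO vs LIFO difference concrete, here is a toy model in plain Python. It is not kernel code: it assumes connections arrive one at a time and each is handled instantly, so the worker is back in the wait queue before the next one arrives. `simulate` is a name invented for this sketch.

```python
from collections import Counter, deque


def simulate(policy, n_workers=4, n_connections=1000):
    """Count connections per worker under a given wakeup order (toy model)."""
    waiting = deque(range(n_workers))  # left end = next worker to be woken
    handled = Counter()
    for _ in range(n_connections):
        worker = waiting.popleft()
        handled[worker] += 1
        if policy == "lifo":
            # The most recent waiter is woken next, so it goes back in front.
            waiting.appendleft(worker)
        elif policy == "fifo":
            # The worker goes to the back of the line.
            waiting.append(worker)
        else:
            raise ValueError(policy)
    return [handled[w] for w in range(n_workers)]


print(simulate("lifo"))  # [1000, 0, 0, 0]
print(simulate("fifo"))  # [250, 250, 250, 250]
```

Under this idealized low load, LIFO hands every connection to the same worker while FIFO rotates through all of them. Real workloads are messier, but the skew points the same way.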
LIFO is lower overhead under very low load. If you have high load, there are no cold workers. The nginx model is to make one worker per CPU core. On many high performance systems, those will even be pinned to cores. It wants to have even load balancing across those threads/CPUs.
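For reference, pinning a worker to a core looks roughly like this on Linux. This is a sketch using Python's `os.sched_setaffinity`, not how nginx does it internally; `pin_to_cpu` is a hypothetical helper, and it only picks from the CPUs the process is already allowed to run on.

```python
import os


def pin_to_cpu(index):
    """Pin the calling process to one of its allowed CPUs; return the CPU id."""
    allowed = sorted(os.sched_getaffinity(0))  # pid 0 means the caller
    cpu = allowed[index % len(allowed)]        # wrap if more workers than CPUs
    os.sched_setaffinity(0, {cpu})
    return cpu
```

A one-worker-per-core server would call something like `pin_to_cpu(worker_number)` in each worker right after it starts.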
The reason nginx uses multiple workers is scaling to multiple cores. It matters a lot with TLS, especially against denial of service attacks. It scales up disk I/O on Linux with a per-worker thread pool. It could use io_uring but that hasn't landed yet and isn't that much better.
It would help if they had work stealing to let workers with less load steal connections from those with more load, but it's hard to fit into how it works. Workers that are almost entirely busy quickly processing events still accept connections fairly fast under high load, making it worse.
Even if there was work stealing... you would still want the connections to be decently distributed across them. REUSEPORT is good enough at that but giving more connections to non-idle workers isn't good. FIFO EPOLLEXCLUSIVE would be great for baseline load balancing.
There were also some patches for round-robin wakeup. All of this does give me a sense of why kernel developers go with BPF for the 'policy' part, we now know of two ways users may want this. BPF would allow them to inspect more kernel state to make that decision.
There are use cases where REUSEPORT + BPF may be superior, but a public-facing high-load web server really isn't one of them. You're sacrificing latency to 'perfectly' balance the new connections across workers via REUSEPORT, yet with very mixed/varying usage per connection the load isn't really balanced anyway.
LIFO EPOLLEXCLUSIVE could be mitigated by nginx doing work stealing to re-balance load but... you still want it to balance new connections with low latency (unlike REUSEPORT) and evenly (unlike EPOLLEXCLUSIVE). I also don't know how practical work stealing is for socket stuff.