I've worked a lot on performance-oriented allocators too and most of those projects are focused on the wrong things. For example, using thread caches as anything more than a way to amortize locking costs is missing the point and hurting real world performance by wasting memory.
One of the highest priorities for a perf-oriented allocator should be keeping the working set of memory small by minimizing internal / external fragmentation and reducing memory held in caches. Understanding real world performance well is why jemalloc is being adopted everywhere.
The jemalloc design choices are bad for security, but it's an incredibly good allocator for long-term performance in everything but small programs. It has a strong focus on having minimal fragmentation and keeping other waste (metadata) at a tiny percentage of allocated memory.
Google's TCMalloc was too concerned with performance of the allocator rather than applications and systems as a whole. Android chose jemalloc as the dlmalloc replacement despite jemalloc being maintained by Facebook and being somewhat more oriented towards server workloads.
I do think thread caches make sense in most allocators, but they can accomplish nearly their entire purpose without being larger than arrays of about 16 pointers per size class, reducing locking overhead to 1/8th (a simple approach is to fill half the cache when it's empty and flush half when it's full).
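A minimal sketch of that fill-half/flush-half scheme for a single size class. The central allocator here is a stand-in (plain malloc/free behind a mutex), and names like `central_alloc_batch` are hypothetical, not any real allocator's API:

```c
#include <pthread.h>
#include <stdlib.h>

#define CACHE_SLOTS 16  /* ~16 pointers per size class */
#define OBJ_SIZE 64     /* the size class this cache serves */

/* Central allocator behind a lock (stand-in: plain malloc/free). */
static pthread_mutex_t central_lock = PTHREAD_MUTEX_INITIALIZER;

static size_t central_alloc_batch(void **out, size_t n) {
    pthread_mutex_lock(&central_lock);
    for (size_t i = 0; i < n; i++)
        out[i] = malloc(OBJ_SIZE);
    pthread_mutex_unlock(&central_lock);
    return n;
}

static void central_free_batch(void **ptrs, size_t n) {
    pthread_mutex_lock(&central_lock);
    for (size_t i = 0; i < n; i++)
        free(ptrs[i]);
    pthread_mutex_unlock(&central_lock);
}

struct thread_cache {
    void *slots[CACHE_SLOTS];
    size_t count;
};

/* The lock is only taken once per CACHE_SLOTS/2 = 8 operations in the
 * steady state, cutting locking overhead to roughly 1/8th without a
 * large cache holding memory hostage. */
static void *cache_alloc(struct thread_cache *tc) {
    if (tc->count == 0)  /* empty: fill half in one locked batch */
        tc->count = central_alloc_batch(tc->slots, CACHE_SLOTS / 2);
    return tc->slots[--tc->count];
}

static void cache_free(struct thread_cache *tc, void *p) {
    if (tc->count == CACHE_SLOTS) {  /* full: flush half in one batch */
        central_free_batch(tc->slots + CACHE_SLOTS / 2, CACHE_SLOTS / 2);
        tc->count = CACHE_SLOTS / 2;
    }
    tc->slots[tc->count++] = p;
}
```

The half-fill/half-flush hysteresis avoids the pathological case where a thread alternating between allocation and free at a batch boundary would take the lock on every call.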
Replying to
Is part of the problem perhaps that people writing them are thinking about scaling to 1024 cores or something?
The lock cost you're trying to amortize should scale with the number of cores due to scaling of cache coherency traffic/contention, no?
Replying to
It might scale up somewhat, but at least in jemalloc there are arenas assigned to threads either via round-robin (static assignment) or dynamic load balancing (sched_getcpu is the naive way, but Facebook has some kind of sophisticated replacement). The arenas are what give the scaling.
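The two assignment policies could be sketched like this (Linux-only for `sched_getcpu`; the arena count and function names are illustrative, not jemalloc's actual internals):

```c
#define _GNU_SOURCE  /* for sched_getcpu on Linux */
#include <sched.h>
#include <stdatomic.h>

#define NUM_ARENAS 8

/* Static policy: round-robin assignment, fixed at first use per thread. */
static _Atomic unsigned next_arena;
static _Thread_local unsigned my_arena = ~0u;

static unsigned arena_round_robin(void) {
    if (my_arena == ~0u)
        my_arena = atomic_fetch_add(&next_arena, 1) % NUM_ARENAS;
    return my_arena;
}

/* Dynamic policy: the naive sched_getcpu approach, picking the arena
 * matching the current CPU on each call. */
static unsigned arena_by_cpu(void) {
    int cpu = sched_getcpu();  /* can change between calls (migration) */
    return (cpu < 0 ? 0u : (unsigned)cpu) % NUM_ARENAS;
}
```

The static policy keeps a thread's allocations in one arena forever; the CPU-based one tracks where the thread actually runs, at the cost of a syscall-ish lookup and possible cross-arena frees after migration.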
Thread caches batch together a bunch of work, which results in holding the locks for much longer, especially if the thread caches are large. They're kept quite small in jemalloc and have non-trivial garbage collection / heuristics, especially compared to the massive ones used by TCMalloc.
In TCMalloc, it was a band-aid for not having an efficient underlying allocator. It's a lot better for the throughput to come from a very high performance allocator behind the locks, with the thread caches only needing to make small batches of work to wipe out locking costs.
For my new hardened allocator, providing the option of having arenas will be natural since it just requires having multiple slab allocation regions and then assigning threads to them. It will only be a dozen lines of code.
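As a rough illustration of that idea (not the actual allocator's code; region count, struct layout, and names are all hypothetical), the whole mechanism is just independent slab regions with their own locks plus a per-thread assignment:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define N_REGIONS 4

/* Each region is an independent slab allocator with its own lock,
 * so threads mapped to different regions never contend. */
struct slab_region {
    pthread_mutex_t lock;
    size_t allocs;  /* placeholder for real slab metadata */
};

static struct slab_region regions[N_REGIONS] = {
    { PTHREAD_MUTEX_INITIALIZER, 0 },
    { PTHREAD_MUTEX_INITIALIZER, 0 },
    { PTHREAD_MUTEX_INITIALIZER, 0 },
    { PTHREAD_MUTEX_INITIALIZER, 0 },
};

static _Atomic unsigned next_region;
static _Thread_local struct slab_region *my_region;

/* Assign each thread a region round-robin on first use. */
static struct slab_region *thread_region(void) {
    if (!my_region)
        my_region = &regions[atomic_fetch_add(&next_region, 1) % N_REGIONS];
    return my_region;
}

/* The allocation path only ever locks the thread's own region. */
static size_t region_op(void) {
    struct slab_region *r = thread_region();
    pthread_mutex_lock(&r->lock);
    size_t n = ++r->allocs;  /* stand-in for actual slab allocation */
    pthread_mutex_unlock(&r->lock);
    return n;
}
```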

