I've worked a lot on performance-oriented allocators too and most of those projects are focused on the wrong things. For example, using thread caches as anything more than a way to amortize locking costs is missing the point and hurting real world performance by wasting memory.
-
One of the highest priorities for a perf-oriented allocator should be keeping the working set of memory small by minimizing internal / external fragmentation and reducing memory held in caches. Understanding real world performance well is why jemalloc is being adopted everywhere.
-
The jemalloc design choices are bad for security, but it's an incredibly good allocator for long-term performance in everything but small programs. It has a strong focus on having minimal fragmentation and keeping other waste (metadata) at a tiny percentage of allocated memory.
-
Google's TCMalloc was too concerned with performance of the allocator rather than applications and systems as a whole. Android chose jemalloc as the dlmalloc replacement despite jemalloc being maintained by Facebook and being somewhat more oriented towards server workloads.
-
I do think thread caches make sense in most allocators, but they can accomplish nearly their entire purpose without being larger than arrays of about 16 pointers per size class, reducing locking overhead to 1/8 (a simple approach is to fill half when empty, flush half when full).
-
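The fill-half/flush-half scheme can be sketched in a few lines of C. This is a hypothetical illustration, not jemalloc's or musl's actual code: `arena_fill` and `arena_flush` are stand-ins for the locked slow path, and the cache is just an array of 16 slots per size class, as suggested above.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_SLOTS 16

/* Hypothetical per-thread, per-size-class cache: an array of ~16 pointers,
 * refilled half-full when empty and flushed half-empty when full, so each
 * trip through the locked arena path is amortized over ~8 operations. */
struct thread_cache {
    void *slots[CACHE_SLOTS];
    size_t count;
};

/* Stand-in slow path: a real allocator would take the arena lock here. */
static size_t arena_fill(void **out, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = &out[i]; /* dummy non-NULL pointers for the sketch */
    return n;
}
static void arena_flush(void **ptrs, size_t n) {
    (void)ptrs; (void)n; /* would return this batch to the arena */
}

static void *cache_alloc(struct thread_cache *c) {
    if (c->count == 0) /* one lock acquisition serves 8 allocations */
        c->count = arena_fill(c->slots, CACHE_SLOTS / 2);
    return c->slots[--c->count];
}

static void cache_free(struct thread_cache *c, void *p) {
    if (c->count == CACHE_SLOTS) { /* one lock acquisition serves 8 frees */
        arena_flush(c->slots + CACHE_SLOTS / 2, CACHE_SLOTS / 2);
        c->count = CACHE_SLOTS / 2;
    }
    c->slots[c->count++] = p;
}
```

Because at most half the array is ever handed back or fetched at once, the cache never thrashes between fill and flush on an alternating alloc/free pattern, while the memory held per thread stays bounded at 16 pointers per size class.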
Replying to @DanielMicay
Is part of the problem perhaps that people writing them are thinking about scaling to 1024 cores or something?
-
Replying to @RichFelker @DanielMicay
The lock cost you're trying to amortize should scale with the number of cores due to scaling of cache coherency traffic/contention, no?
-
Replying to @RichFelker
It might scale up somewhat, but at least in jemalloc there are arenas assigned to threads either via round-robin (static assignment) or dynamic load balancing (sched_getcpu is the naive way, but Facebook has some kind of sophisticated replacement). The arenas are what gives the scaling.
-
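The two assignment strategies mentioned above can be sketched as follows. The function names are illustrative, not jemalloc's actual API; `arena_by_cpu` uses the Linux-specific `sched_getcpu` as the "naive" dynamic approach described.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdatomic.h>

/* Static assignment: hand out arenas round-robin as threads first allocate,
 * spreading threads evenly regardless of where they run. */
static _Atomic unsigned next_arena;
static unsigned arena_round_robin(unsigned n_arenas) {
    return atomic_fetch_add(&next_arena, 1) % n_arenas;
}

/* Naive dynamic balancing: pick the arena matching the CPU the thread is
 * currently running on, so threads sharing a core share an arena
 * (Linux-specific; falls back to arena 0 if sched_getcpu fails). */
static unsigned arena_by_cpu(unsigned n_arenas) {
    int cpu = sched_getcpu();
    return (cpu < 0 ? 0u : (unsigned)cpu) % n_arenas;
}
```

Either way, contention is bounded by threads-per-arena rather than total thread count, which is why the lock cost behind a small thread cache need not grow with core count.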
Replying to @DanielMicay @RichFelker
Thread caches batch together a bunch of work which results in holding the locks for much longer, especially if the thread caches are large. They're kept quite small in jemalloc and have non-trivial garbage collection / heuristics esp compared to the massive ones used by TCMalloc.
-
Replying to @DanielMicay @RichFelker
In TCMalloc, it was a bandaid for not having an efficient underlying allocator. It's a lot better for the throughput to come from a very high performance allocator behind the locks with the thread caches only needing to make small batches of work to wipe out locking costs.
-
In any case your thread is interesting because rewriting malloc is on the mid-term agenda for musl. Our goals don't match entirely but have a lot of overlap.
-
Replying to @RichFelker @DanielMicay
And I think you're very right that heavy thread caching is a bad idea, and that organizing storage by size classes is smart.