does it make sense to spin up a thread for cleaning up (merging deallocations) the global allocator?
Conversation
Replying to
in this case the merging thread would not need to be particularly fast, so spin-atomic-compare-exchanging stuff would be sufficient
1
1
but this only makes sense to do for the global allocator, spinning up a thread for every local allocator would be extremely wasteful
1
1
Traditional malloc implementations have quick coalescing via headers with a pointer to the previous header and the next one.
Modern slab allocation designs don't need that stuff. Purging spans of dirty pages is the slow part partly since they need to be faulted in again too.
1
1
In jemalloc, they support enabling worker threads for that kind of maintenance work. They keep a ratio of freed dirty pages to allocated memory and purge ranges beyond it in two phases with time delayed MADV_FREE followed by a further delayed MADV_DONTNEED.
1
1
dlmalloc-style allocators have quick coalescing and track free spans via power of 2 size class bins for a bad approximation of best fit allocation. Has a ton of metadata overhead (16 bytes or more per alloc instead of 2-4 bits) and more importantly still too much fragmentation.
That's essentially O(1) when not obtaining memory from the OS or returning it, and you don't always have to do that. It's faster than you'd think. An intrusive rbtree ordered by (size, address) can give you precise first-best-fit but that's O(log n) and takes 2 pointers.
1
1
Modern ones use slabs of allocations of a specific size class with internal free lists or bitmaps and then free lists of those slabs. Usually have mostly implicit (address alignment / ranges, etc.) and out-of-line metadata, What remains slow: getting / returning memory from OS.


