Long thread about allocator design choices: twitter.com/DanielMicay/st. It's entirely possible to provide a hardened allocator with decent performance, low memory use and great scalability without giving up core security properties like fully out-of-line and authoritative metadata.
Quote Tweet
Replying to @ebeip90 and @crypt0ad
Scudo is entirely based on inline metadata and free lists. It relies on CRC32 to detect metadata corruption and can't reliably detect invalid free in the same way. Having fully out-of-line metadata is extremely important for providing many other security properties too.
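To make the invalid-free point concrete, here is a toy model of out-of-line, authoritative metadata for a single size class. It is only a sketch under assumptions: the `Slab` class, its `in_use` table and the integer "addresses" are hypothetical names for illustration, not code from Scudo or from any real hardened allocator. The point it shows is that when the allocator's own table is the only source of truth, a free can be validated without reading anything stored next to the user data.

```python
class Slab:
    """Toy slab for one size class with metadata kept outside the slots."""

    def __init__(self, base, slot_size, slot_count):
        self.base = base
        self.slot_size = slot_size
        self.slot_count = slot_count
        # Out-of-line, authoritative record of which slots are live.
        # Nothing inside the (simulated) slot memory is ever trusted.
        self.in_use = [False] * slot_count

    def allocate(self):
        # Hand out the first slot the metadata table marks as unused.
        for index, used in enumerate(self.in_use):
            if not used:
                self.in_use[index] = True
                return self.base + index * self.slot_size
        raise MemoryError("slab is full")

    def free(self, address):
        # Every check below consults only the allocator's own table.
        offset = address - self.base
        if not 0 <= offset < self.slot_size * self.slot_count:
            raise ValueError("invalid free: address outside this slab")
        if offset % self.slot_size != 0:
            raise ValueError("invalid free: not the start of a slot")
        index = offset // self.slot_size
        if not self.in_use[index]:
            raise ValueError("invalid free: slot is not allocated")
        self.in_use[index] = False
```

In this model a double free, a pointer into the middle of a slot and a pointer outside the region are all rejected deterministically, because corrupting the slot contents cannot change what the table says.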
I've worked a lot on performance-oriented allocators too and most of those projects are focused on the wrong things. For example, using thread caches as anything more than a way to amortize locking costs is missing the point and hurting real world performance by wasting memory.
One of the highest priorities for a perf-oriented allocator should be keeping the working set of memory small by minimizing internal / external fragmentation and reducing memory held in caches. Understanding real world performance well is why jemalloc is being adopted everywhere.
The jemalloc design choices are bad for security, but it's an incredibly good allocator for long-term performance in everything but small programs. It has a strong focus on having minimal fragmentation and keeping other waste (metadata) at a tiny percentage of allocated memory.
Google's TCMalloc was too concerned with performance of the allocator rather than applications and systems as a whole. Android chose jemalloc as the dlmalloc replacement despite jemalloc being maintained by Facebook and being somewhat more oriented towards server workloads.
I do think thread caches make sense for most allocators, but they can accomplish nearly their entire purpose without being larger than arrays of about 16 pointers per size class, which reduces locking overhead to 1/8 (a simple approach is fill half when empty, flush half when full).
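The 1/8 figure follows from the batch size: with 16 slots and half-sized transfers, every trip to the shared allocator moves 8 pointers under a single lock acquisition. A minimal simulation of that policy, assuming hypothetical `CentralPool` and `ThreadCache` classes and integers standing in for pointers (this is not jemalloc's or any other allocator's actual code):

```python
CACHE_SLOTS = 16          # pointers per size class
BATCH = CACHE_SLOTS // 2  # fill half when empty, flush half when full


class CentralPool:
    """Stand-in for the shared, lock-protected state of one size class."""

    def __init__(self):
        self.lock_acquisitions = 0
        self.next_address = 0
        self.free_slots = []

    def refill(self, count):
        self.lock_acquisitions += 1  # one lock round trip per batch
        batch = []
        for _ in range(count):
            if self.free_slots:
                batch.append(self.free_slots.pop())
            else:
                batch.append(self.next_address)
                self.next_address += 1
        return batch

    def flush(self, slots):
        self.lock_acquisitions += 1  # one lock round trip per batch
        self.free_slots.extend(slots)


class ThreadCache:
    """Small per-thread array of cached slots for one size class."""

    def __init__(self, pool):
        self.pool = pool
        self.slots = []

    def allocate(self):
        if not self.slots:
            # Empty: take half a cache worth of slots in one locked call.
            self.slots = self.pool.refill(BATCH)
        return self.slots.pop()

    def free(self, slot):
        if len(self.slots) == CACHE_SLOTS:
            # Full: return the oldest half, keep the most recent half.
            self.pool.flush(self.slots[:BATCH])
            self.slots = self.slots[BATCH:]
        self.slots.append(slot)
```

In this model, 800 back-to-back allocations from a fresh cache cost 100 lock acquisitions instead of 800, and the cache never holds more than 16 slots per size class.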
It might scale up somewhat, but at least in jemalloc there are arenas assigned to threads either via round-robin (static assignment) or dynamic load balancing (sched_getcpu is the naive way, but Facebook has some kind of sophisticated replacement). The arenas are what give scaling.
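Static round-robin assignment can be sketched in a few lines. This is an illustration under assumptions, not jemalloc's implementation: `Arena` and `ArenaSet` are hypothetical names, and each arena is reduced to an index plus its own lock. A thread is bound to an arena the first time it asks for one and keeps it afterwards, so threads on different arenas never contend on the same lock.

```python
import itertools
import threading


class Arena:
    """Hypothetical arena: independent allocator state behind its own lock."""

    def __init__(self, index):
        self.index = index
        self.lock = threading.Lock()


class ArenaSet:
    def __init__(self, arena_count):
        self.arenas = [Arena(i) for i in range(arena_count)]
        self._next = itertools.cycle(range(arena_count))
        self._assign_lock = threading.Lock()
        self._local = threading.local()

    def arena_for_current_thread(self):
        # First call from a thread: take the next arena in round-robin order.
        # Later calls from the same thread reuse that assignment.
        arena = getattr(self._local, "arena", None)
        if arena is None:
            with self._assign_lock:
                arena = self.arenas[next(self._next)]
            self._local.arena = arena
        return arena
```

A dynamic scheme would replace the round-robin counter with a lookup keyed on the current CPU, which is the role the thread describes for sched_getcpu.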
Thread caches batch together a bunch of work which results in holding the locks for much longer, especially if the thread caches are large. They're kept quite small in jemalloc and have non-trivial garbage collection / heuristics esp compared to the massive ones used by TCMalloc.
For my new hardened allocator, providing the option of having arenas will be natural since it just requires having multiple slab allocation regions and then assigning threads to them. It will only be a dozen lines of code.