Holy shit mallocng is doing something right. I'm used to mutt growing to over 120 MB and sticking there forever after opening large folders...
Welp, with mallocng LD_PRELOADed, it's dropping down to 2-4 MB after switching back to a small folder.
Have you tried using it with a web browser and notoriously resource-hungry pages? I remember the people behind the Mesh allocator having good luck with Firefox.
Yeah, I've been using it with Firefox for a long time with usage & performance "feeling" better but I don't have a good way to measure.
Running in headless mode to render one page, write the output, and then exit would probably be a good benchmark?
Yes, it probably would for performance. I'm not sure how to compare memory usage, though. The interesting thing for a large app isn't initial usage but the absence of leaks: no growth over time, and a return to previous usage after closing stuff.
Reducing the working set of pages by avoiding fragmentation and metadata overhead improves performance too. It not only frees up memory for caching but also reduces TLB pressure. For small allocations, lower metadata overhead and better packing improve cache utilization.
It improves the performance of an application's use of mostly static allocations, but not of frequent allocation/deallocation.
The performance of the allocator itself definitely still matters. The issue is that it's generally all that gets considered. The performance impact of expanding the working set of pages is ignored, and that can end up being substantial for large applications.
Similarly, making more efficient use of cache by having everything nicely packed together tends to be ignored. I don't like the approach of layering on caches to try to hide the costs of a less efficient underlying design in common cases. The wasted memory hurts performance.
I do think thread caches make sense for a performance-oriented allocator that doesn't care about hardening, but ONLY for amortizing the cost of locking. They should be tiny to avoid waste and latency spikes, and arrays are nicer than free lists because they avoid touching cold allocations.
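The thread-cache shape described above can be sketched in a few lines. This is a minimal illustration, not any real allocator's code, and the names (`tcache_alloc`, etc.) are made up: a tiny bounded array of pointers per thread amortizes lock acquisition on the global allocator (plain `malloc`/`free` stand in for it here), and a full cache just falls through to the slow path.

```c
/* Minimal sketch of a tiny array-based (LIFO) thread cache for one size
 * class. Hypothetical names; malloc/free stand in for the locked global
 * allocator. */
#include <stddef.h>
#include <stdlib.h>

#define TCACHE_SLOTS 8          /* tiny, to limit waste and latency spikes */
#define SIZE_CLASS   64

static _Thread_local void *tcache[TCACHE_SLOTS];
static _Thread_local int tcache_count;

void *tcache_alloc(void)
{
    if (tcache_count > 0)
        return tcache[--tcache_count];   /* hot path: lock-free array pop */
    return malloc(SIZE_CLASS);           /* miss: fall back to global allocator */
}

void tcache_free(void *p)
{
    if (tcache_count < TCACHE_SLOTS) {
        tcache[tcache_count++] = p;      /* hot path: array push; note we never
                                            dereference the cold allocation */
        return;
    }
    free(p);                             /* cache full: return to global allocator */
}
```

The array is the point of the "arrays over free lists" remark: a free list would write a next-pointer into each freed chunk, touching cold memory, while the array push only writes into the (hot) cache itself.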
Linux 4.18+ supports restartable sequences, which can be used to implement per-core caches instead of per-thread caches. It's unfortunate that it conflicts with robustness/security. You could still have tiny per-core allocation queues for the smallest size classes with little sacrifice, though.
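A rough sketch of the per-core (rather than per-thread) shape, with an important caveat: a real implementation would use restartable sequences (`rseq(2)`) so the per-CPU push/pop is aborted and retried on preemption or migration. Here a per-CPU spinlock stands in for rseq, which gives up the main benefit but shows the structure; all names are hypothetical and `sched_getcpu()` only identifies the current core, it doesn't pin to it.

```c
/* Sketch of tiny per-core allocation queues for one small size class.
 * A per-CPU spinlock stands in for rseq (which would make the fast path
 * lock-free); malloc/free stand in for the global allocator. */
#define _GNU_SOURCE
#include <sched.h>        /* sched_getcpu() */
#include <stdatomic.h>
#include <stdlib.h>

#define MAX_CPUS  64
#define QUEUE_LEN 8       /* tiny per-core queue */

struct percpu_cache {
    atomic_int lock;
    int count;
    void *slots[QUEUE_LEN];
};

static struct percpu_cache caches[MAX_CPUS];

static struct percpu_cache *my_cache(void)
{
    int cpu = sched_getcpu();
    if (cpu < 0) cpu = 0;             /* fallback if sched_getcpu fails */
    return &caches[cpu % MAX_CPUS];
}

void *percpu_alloc(size_t size_class)
{
    struct percpu_cache *c = my_cache();
    void *p = NULL;
    while (atomic_exchange(&c->lock, 1)) ;   /* rseq would avoid this lock */
    if (c->count > 0)
        p = c->slots[--c->count];
    atomic_store(&c->lock, 0);
    return p ? p : malloc(size_class);       /* miss: global allocator */
}

void percpu_free(void *p)
{
    struct percpu_cache *c = my_cache();
    while (atomic_exchange(&c->lock, 1)) ;
    if (c->count < QUEUE_LEN) {
        c->slots[c->count++] = p;
        p = NULL;
    }
    atomic_store(&c->lock, 0);
    if (p) free(p);                          /* queue full: global allocator */
}
```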