I love libdivide. Moving to libdivide-2.0 in hardened_malloc is an easy win.
16 byte malloc microbenchmark on Broadwell-E:
Hardware division: 1s
libdivide-1.1: 0.74s
libdivide-2.0: 0.71s
In a lightweight build:
Hardware division: 1s
libdivide-1.1: 0.62s
libdivide-2.0: 0.59s
Conversation
Replying to
This is comparing to hardware-based division on a modern x86 CPU... so consider how much difference it makes on hardware without hardware-based division. The entire library is a single header with ~1.4k lines of code (~760 without C++ and SSE/AVX). Did I mention it's awesome?
1
6
It gives you compiler-style optimizations for division by a constant where the divisor is a runtime constant rather than a compile-time constant. In hardened_malloc, I use it to quickly perform division by the allocation size and slab size. It makes the out-of-line metadata fast.
2
4
The performance win is almost as good even for non-power-of-two size classes and slab sizes (see github.com/GrapheneOS/har), which I why I use it. I had started working on micro-optimizing the small size classes myself, then stumbled across github.com/ridiculousfish via a search.
1
1
15
If you need to divide vectors by a runtime constant, libdivide is even more amazing because it supports integer division with vectors while the vector instruction sets (AVX512, AVX2, SSE) don't provide it. You get an amplified version of the scalar win. Sadly I don't need this...
1
1
9
Replying to
I’m surprised libdivide would need malloc at all (knowing next to nothing about how division is implemented.) Could it be even faster using fixed allocation?
2
Replying to
I'm not saying it uses malloc but rather I use it in hardened_malloc to perform division by slab size (to find the index of the slab metadata) and by the size class (to find the bitmap index for the slot in the slab): github.com/GrapheneOS/har. The benchmark is of hardened_malloc.
1
2
Show replies
Replying to
this is amazing. I guess divide isn't a bottleneck often enough for Intel to give that unit a lot of attention?
3
1
Replying to
I never used to think about it but I'm come to realize that integer division is ridiculously slow and it stands out to me in code that's supposed to perform well. The trick with libdivide is that it's doing the same kind of division by a constant optimizations as compilers.
1
6
Show replies
Replying to
Interesting! WebKit has been doing this in its IsoHeaps for close to two years now, but not with libdivide. IsoHeaps are C++ template program that generates a type specialized malloc, and internally it uses division by type size, which then gets compiled to mul-immediate.
3



