I love libdivide. Moving to libdivide-2.0 in hardened_malloc is an easy win.
16 byte malloc microbenchmark on Broadwell-E:
Hardware division: 1s
libdivide-1.1: 0.74s
libdivide-2.0: 0.71s
In a lightweight build:
Hardware division: 1s
libdivide-1.1: 0.62s
libdivide-2.0: 0.59s
Conversation
Replying to
this is amazing. I guess divide isn't a bottleneck often enough for Intel to give that unit a lot of attention?
3
1
Replying to
I never used to think about it but I'm come to realize that integer division is ridiculously slow and it stands out to me in code that's supposed to perform well. The trick with libdivide is that it's doing the same kind of division by a constant optimizations as compilers.
1
6
So it's doing a fair bit of work to figure out the proper shifts / multiplications when you set up the divisor and then you reuse it many times. In hardened_malloc, it sets up a slab size divisor and size divisor for each size class, and then uses those to find the metadata.
1
2
Replying to
ok, this answers my question "why doesn't the compiler just emit the right thing in the first place"
3
1
Replying to
github.com/GrapheneOS/har
You can search for size_divisor and slab_size_divisor in this file to see where it gets used. Each allocation size class has a dedicated address space region, and a separate region reserved for metadata, so it needs to map slab address -> slab metadata.
Once it has the slab metadata, it needs to figure out the bit index in the bitmap(s) corresponding to the slot within the slab. Division is crucial to this approach to fully out-of-line metadata without needing something like a hash table and without needing power of two sizes.
1
There's an explanation of the size class choices and a table of the sizes / slab sizes in github.com/GrapheneOS/har. It has much less waste from rounding than power of two sizes, and finer-grained address space isolation for security. Using powers of two can't even be considered.
4

