Conversation

Replying to
I never used to think about it but I'm come to realize that integer division is ridiculously slow and it stands out to me in code that's supposed to perform well. The trick with libdivide is that it's doing the same kind of division by a constant optimizations as compilers.
1
6
Replying to and
So it's doing a fair bit of work to figure out the proper shifts / multiplications when you set up the divisor and then you reuse it many times. In hardened_malloc, it sets up a slab size divisor and size divisor for each size class, and then uses those to find the metadata.
1
2
Replying to and
fwiw the same “slightly too sophisticated for compilers” problem comes up when implementing a decimal floating-point library. Ideally you'd like to write: uint64_t power_of_ten[17]={1,10,100,1000…}; and then let the compiler optimize “e/power_of_ten[k]” but current compilers…
2
1
So for example, for free(p), it figures out it's a slab allocation with an address range check: github.com/GrapheneOS/har Next, a subtraction and division by a constant (size class region size) to determine the size class and get the size class index: github.com/GrapheneOS/har
1
In my benchmarks (twitter.com/DanielMicay/st), the first set of results is with the default configuration which enables all the optional security features including the quarantine. The 2nd set of results is with all the optional security features disabled, which speeds it up a lot.
Quote Tweet
I love libdivide. Moving to libdivide-2.0 in hardened_malloc is an easy win. 16 byte malloc microbenchmark on Broadwell-E: Hardware division: 1s libdivide-1.1: 0.74s libdivide-2.0: 0.71s In a lightweight build: Hardware division: 1s libdivide-1.1: 0.62s libdivide-2.0: 0.59s
Show this thread
I normalized them to 1s to make it easy to read, but the 2nd set of tests is definitely running much faster for the same amount of work. I wanted to show overhead of hardware division vs. libdivide rather than the overhead of the optional security features which is more complex.
1