I love libdivide. Moving to libdivide-2.0 in hardened_malloc is an easy win.
16 byte malloc microbenchmark on Broadwell-E:
Hardware division: 1s
libdivide-1.1: 0.74s
libdivide-2.0: 0.71s
In a lightweight build:
Hardware division: 1s
libdivide-1.1: 0.62s
libdivide-2.0: 0.59s
Conversation
This is comparing to hardware-based division on a modern x86 CPU... so consider how much difference it makes on hardware without hardware-based division. The entire library is a single header with ~1.4k lines of code (~760 without C++ and SSE/AVX). Did I mention it's awesome?
1
6
It gives you compiler-style optimizations for division by a constant where the divisor is a runtime constant rather than a compile-time constant. In hardened_malloc, I use it to quickly perform division by the allocation size and slab size. It makes the out-of-line metadata fast.
2
4
The performance win is almost as good even for non-power-of-two size classes and slab sizes (see github.com/GrapheneOS/har), which I why I use it. I had started working on micro-optimizing the small size classes myself, then stumbled across github.com/ridiculousfish via a search.
Replying to
If you need to divide vectors by a runtime constant, libdivide is even more amazing because it supports integer division with vectors while the vector instruction sets (AVX512, AVX2, SSE) don't provide it. You get an amplified version of the scalar win. Sadly I don't need this...
1
1
9
