So it's doing a fair bit of work to figure out the proper shifts / multiplications when you set up the divisor and then you reuse it many times. In hardened_malloc, it sets up a slab size divisor and size divisor for each size class, and then uses those to find the metadata.
Conversation
Replying to
ok, this answers my question "why doesn't the compiler just emit the right thing in the first place"
3
1
This Tweet was deleted by the Tweet author. Learn more
This Tweet was deleted by the Tweet author. Learn more
For my use case, I could technically try doing something like making a switch covering all possible divisors with each case performing division by a constant. I do know the set of divisors in advance but it's a specific one at runtime. Not sure Clang / GCC would handle it well.
1
I'm also not sure that would perform better, and it's a lot more work and it would be a bunch of code that needs to be kept consistent with the chosen size classes. Could also generate the libdivide divisors at compile-time and read them from a table but I don't think it'd help.
3
1
Jemalloc used to "switch over divisor", and we went to table-based magic constants.
It turns out that you can use the fact that you're dividing an exact multiple to shave off a couple of instructions, too; just a multiply followed by taking the high half bits.
1
1
See e.g. github.com/jemalloc/jemal. This enables some other tricks, like selecting size classes at runtime, so that they can be a binary-level tuning parameter.
1
2
That's essentially what's happening with libdivide. I generate size (u32) and slab size (u64) divisors for each size class during initialization, which are these: github.com/ridiculousfish and they're stored in the size class metadata next to fields it already has to read anyway.
1
1
IIRC, libdivide has to do a couple of extra dynamic-size shifts (and corresponding magic number loads), since it needs to handle non-exact divisors. But with slab offsets, we know that the division result is exact; we can *just* do the multiply.
1
1
I don't have a guarantee that it's an exact multiple though: github.com/GrapheneOS/har. I need to check for unaligned frees as part of detecting all invalid frees. There's a lot of extra work to do for all the optional hardening features too but definitely room for optimization.


