still can't make int128 division "fast", but have 0.9-9M divs/sec via x86 ASM (shift-cmp-sub loop). dunno if faster exists for general case.
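For reference, a minimal C sketch of the kind of shift-compare-subtract loop meant here. This is not the actual ASM: the `u128` struct (two 64-bit halves) and the helper names are hypothetical, chosen only for illustration.

```c
#include <stdint.h>

/* Hypothetical 128-bit unsigned value held as two 64-bit halves. */
typedef struct { uint64_t hi, lo; } u128;

/* a >= b, comparing the high halves first. */
static int u128_ge(u128 a, u128 b) {
    return a.hi > b.hi || (a.hi == b.hi && a.lo >= b.lo);
}

/* a - b with borrow from the low half into the high half. */
static u128 u128_sub(u128 a, u128 b) {
    u128 r;
    r.lo = a.lo - b.lo;
    r.hi = a.hi - b.hi - (a.lo < b.lo);
    return r;
}

/* Bit-at-a-time division: shift the next numerator bit into the
 * remainder, compare against the divisor, subtract if it fits.
 * Returns 0 on success, -1 on division by zero. */
static int u128_divmod(u128 n, u128 d, u128 *q, u128 *r) {
    u128 quo = {0, 0}, rem = {0, 0};
    if (d.hi == 0 && d.lo == 0) return -1;
    for (int i = 127; i >= 0; i--) {
        uint64_t bit = (i >= 64 ? n.hi >> (i - 64) : n.lo >> i) & 1;
        /* rem = (rem << 1) | bit; cannot overflow because rem never
         * exceeds the numerator bits consumed so far. */
        rem.hi = (rem.hi << 1) | (rem.lo >> 63);
        rem.lo = (rem.lo << 1) | bit;
        /* quo <<= 1 */
        quo.hi = (quo.hi << 1) | (quo.lo >> 63);
        quo.lo <<= 1;
        if (u128_ge(rem, d)) {
            rem = u128_sub(rem, d);
            quo.lo |= 1;
        }
    }
    *q = quo;
    *r = rem;
    return 0;
}
```

The loop always runs 128 iterations. Skipping the numerator's leading zero bits is one way such a loop can become input-dependent, though whether the ASM version does that is not stated.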
-
-
I have considered some LUT-based special-case strategies, but have not tested them yet. Long division with bigger units is not yet explored.
-
a concern, though, is that I am on an AMD K10, which has a ~70-cycle DIV / IDIV; break-even means under 3000 cycles for a full divide.
-
currently, full-width 128-bit integer divides take between ~400 and ~3000 clock cycles, depending on the input values.
-
Addendum: this is for 32-bit x86. for x86-64, it should be possible to do it faster. test: int128 ADD/SUB ~30 cycles, MUL ~50 cycles.
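To illustrate why even ADD/SUB costs more than one instruction on 32-bit x86, here is a rough C sketch of 128-bit add and subtract over four 32-bit words with carry/borrow propagation. The `u128x32` type and the function names are made up for this example; the tested code is ASM and may be structured differently.

```c
#include <stdint.h>

/* Hypothetical 128-bit value as four 32-bit words, w[0] least significant. */
typedef struct { uint32_t w[4]; } u128x32;

/* a + b modulo 2^128, rippling the carry through the four words. */
static u128x32 add128(u128x32 a, u128x32 b) {
    u128x32 r;
    uint64_t carry = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t s = (uint64_t)a.w[i] + b.w[i] + carry;
        r.w[i] = (uint32_t)s;
        carry = s >> 32;            /* 0 or 1 */
    }
    return r;
}

/* a - b modulo 2^128, rippling the borrow through the four words. */
static u128x32 sub128(u128x32 a, u128x32 b) {
    u128x32 r;
    uint64_t borrow = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t d = (uint64_t)a.w[i] - b.w[i] - borrow;
        r.w[i] = (uint32_t)d;
        borrow = (d >> 32) & 1;     /* 1 if the subtraction wrapped */
    }
    return r;
}
```

On 32-bit x86 the same ripple maps naturally onto one ADD followed by three ADC (or SUB followed by three SBB); on x86-64 two 64-bit words suffice.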