still can't make int128 division "fast", but have 0.9-9M divs/sec via x86 ASM (shift-cmp-sub loop). dunno if faster exists for general case.
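For reference, a minimal C sketch of the shift-compare-subtract idea, one quotient bit per iteration. This is not the original x86 ASM: the `u128` struct and the `u128_*` helper names are hypothetical, and it uses two 64-bit halves for brevity where a 32-bit build would use four 32-bit words.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 128-bit unsigned type built from two 64-bit halves. */
typedef struct { uint64_t hi, lo; } u128;

/* Returns 1 if a >= b. */
int u128_ge(u128 a, u128 b) {
    return a.hi > b.hi || (a.hi == b.hi && a.lo >= b.lo);
}

/* a - b, assuming a >= b. */
u128 u128_sub(u128 a, u128 b) {
    u128 r;
    r.lo = a.lo - b.lo;
    r.hi = a.hi - b.hi - (a.lo < b.lo);   /* borrow out of the low half */
    return r;
}

/* Restoring division: shift in one dividend bit, compare, subtract. */
u128 u128_divmod(u128 n, u128 d, u128 *rem) {
    u128 q = {0, 0}, r = {0, 0};
    assert(d.hi != 0 || d.lo != 0);       /* division by zero is not handled */
    for (int i = 127; i >= 0; i--) {
        uint64_t nbit = (i >= 64) ? (n.hi >> (i - 64)) & 1 : (n.lo >> i) & 1;
        /* r = (r << 1) | nbit; r holds at most the top 127-i bits of n,
           so this shift cannot overflow 128 bits. */
        r.hi = (r.hi << 1) | (r.lo >> 63);
        r.lo = (r.lo << 1) | nbit;
        if (u128_ge(r, d)) {
            r = u128_sub(r, d);
            if (i >= 64) q.hi |= 1ULL << (i - 64);
            else         q.lo |= 1ULL << i;
        }
    }
    *rem = r;
    return q;
}
```

The loop always runs 128 iterations here; skipping the leading zero bits of the dividend is one way the cost could come to depend on the input values.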
Am I missing something? Why not just long division with 32- (on x86_64) or 16- (on i386) bit units?
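A rough C sketch of what long division with 32-bit units looks like in the easy case, where the divisor fits in a single unit. The function name `div128_by_u32` and the least-significant-first limb layout are assumptions for illustration. A full 128-by-128 divide with a multi-unit divisor also needs a quotient-digit estimate and correction step (as in Knuth's Algorithm D), which is not shown here.

```c
#include <assert.h>
#include <stdint.h>

/* Divide a 128-bit value held as four 32-bit limbs (least significant first)
   by a single 32-bit divisor. The quotient overwrites n; the remainder is
   returned. */
uint32_t div128_by_u32(uint32_t n[4], uint32_t d) {
    uint64_t rem = 0;
    assert(d != 0);                          /* division by zero is not handled */
    for (int i = 3; i >= 0; i--) {           /* most significant limb first */
        uint64_t cur = (rem << 32) | n[i];   /* rem < d, so cur / d fits in 32 bits */
        n[i] = (uint32_t)(cur / d);
        rem = cur % d;
    }
    return (uint32_t)rem;
}
```

This case takes four limb-sized divides in total, one per 32-bit unit of the dividend.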
-
There may also be ways to take advantage of floating point division.
-
I had considered some LUT based special-case strategies, but not tested them yet. LD with bigger units not yet explored.
-
a concern though: I am on an AMD K10, where DIV / IDIV take ~70 cycles each, so break-even means staying under 3000 cycles for a full divide.
-
currently full-width 128-bit integer divides range between ~ 400 and 3000 clock cycles, depending somewhat on the input values.
-
ADD: this is for 32-bit x86. for x86-64, it should be possible to do it faster. test: int128 ADD/SUB ~30 cycles, MUL ~50 cycles.
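For context, a small C sketch of what a 128-bit ADD involves on a 32-bit target: four 32-bit words with the carry passed along, much like an ADD followed by three ADCs. The function name `add128` and the least-significant-first limb layout are assumptions, not the code that was timed above.

```c
#include <stdint.h>

/* r = a + b over four 32-bit limbs (least significant first).
   Returns the carry out of the top limb. */
uint32_t add128(uint32_t r[4], const uint32_t a[4], const uint32_t b[4]) {
    uint64_t carry = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t s = (uint64_t)a[i] + b[i] + carry;  /* at most 33 bits */
        r[i] = (uint32_t)s;                          /* low 32 bits of the sum */
        carry = s >> 32;                             /* 0 or 1 */
    }
    return (uint32_t)carry;
}
```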