The bottleneck in fine-grained multithreading is the cost of interlocked operations like LOCK XADD, LOCK OR, etc. In the case where code doesn't depend on the opcode's result or flags, couldn't these be made free by extending CPU write buffers to support GPU-style ROPs?
CMPXCHG = 18 clocks, no ILP, so 100x slower than a non-atomic operation. And this is stupid, as some CPU architecture improvements could make uncontended atomics much faster.
-
-
Could they? Excellent. Drop me a line, I'll add them.
-
Please educate me on how write buffers are implemented and whether adding a ROP is feasible. I mean, if it's just a queue of (address, size, data), then adding an operation field wouldn't be hard, and the hardware for add/or/xor/and ROPs would be minimal.
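To make the question concrete, here is a toy Python model of the idea, offered only as a sketch. It assumes the write buffer really is a simple FIFO of (address, size, data) entries, which is exactly the assumption being asked about, and it adds a hypothetical `rop` field per entry. The names `Entry`, `WriteBuffer`, and `ROPS` are made up for illustration. The model ignores store-to-load forwarding, memory ordering, and cache coherence, which are likely where the real difficulty lies.

```python
from collections import deque
from dataclasses import dataclass
import operator

# Hypothetical ROP codes. "store" is an ordinary write that ignores the old value.
ROPS = {
    "store": lambda old, new: new,
    "add": operator.add,
    "or": operator.or_,
    "xor": operator.xor,
    "and": operator.and_,
}


@dataclass
class Entry:
    address: int
    size: int           # operand width in bytes
    data: int
    rop: str = "store"  # assumed extra field; not part of any real CPU


class WriteBuffer:
    """Toy FIFO write buffer that applies a ROP when an entry drains."""

    def __init__(self, memory):
        self.memory = memory  # dict of address -> int, standing in for the cache
        self.queue = deque()

    def push(self, address, size, data, rop="store"):
        # The core only enqueues; it never sees the old value or any flags.
        self.queue.append(Entry(address, size, data, rop))

    def drain(self):
        # Entries retire in order; the read-modify-write happens here,
        # off the core's critical path.
        while self.queue:
            e = self.queue.popleft()
            mask = (1 << (8 * e.size)) - 1
            old = self.memory.get(e.address, 0)
            self.memory[e.address] = ROPS[e.rop](old, e.data) & mask


memory = {0x100: 5}
wb = WriteBuffer(memory)
wb.push(0x100, 4, 3, "add")    # like LOCK ADD with the result discarded
wb.push(0x100, 4, 0x10, "or")  # like LOCK OR with the result discarded
wb.drain()
print(memory[0x100])  # 24, i.e. (5 + 3) | 0x10
```

In this model the extra cost per entry is one small field and one ALU operation at drain time, which is the "minimal hardware" intuition above. Whether that holds up on a real core depends on the parts the model leaves out.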