The bottleneck in fine-grained multithreading is the cost of interlocked operations like LOCK XADD, LOCK OR, etc. When the code doesn't depend on the instruction's result or flags, couldn't these be made free by extending CPU write buffers to support GPU-style ROPs?
TSX costs >60 clocks even when uncontended. All atomics on Intel hardware are way too slow to be workable for fine-grained multithreading. In this use case, something like 99.99% of operations never contend, yet each one is about 100x slower than a regular op.
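To make the claim concrete, here is a minimal C++ sketch of the case described above: an uncontended atomic read-modify-write whose result is thrown away, next to a plain increment. The names `bump_plain`, `bump_atomic`, and `ns_per_op` are made up for illustration and are not from the thread. On x86-64, compilers typically emit a LOCK-prefixed add for the discarded `fetch_add`, but that is an assumption about code generation, and the timings will vary by CPU and compiler.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>

// Plain increment. `volatile` is only here so the compiler keeps the loop;
// it gives no atomicity and no ordering guarantees.
std::uint64_t bump_plain(std::uint64_t n) {
    volatile std::uint64_t counter = 0;
    for (std::uint64_t i = 0; i < n; ++i) {
        counter = counter + 1;
    }
    return counter;
}

// Atomic increment with the returned old value ignored: the "code doesn't
// depend on the result or flags" case. Single-threaded, so never contended.
std::uint64_t bump_atomic(std::uint64_t n) {
    std::atomic<std::uint64_t> counter{0};
    for (std::uint64_t i = 0; i < n; ++i) {
        counter.fetch_add(1, std::memory_order_relaxed);
    }
    return counter.load(std::memory_order_relaxed);
}

// Hypothetical timing helper: average wall-clock nanoseconds per iteration.
template <class F>
double ns_per_op(F f, std::uint64_t n) {
    auto t0 = std::chrono::steady_clock::now();
    volatile std::uint64_t sink = f(n);  // keep the result alive
    (void)sink;
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(t1 - t0).count() /
           static_cast<double>(n);
}
```

Calling `ns_per_op(bump_plain, n)` and `ns_per_op(bump_atomic, n)` with a large `n` gives a rough per-operation comparison on a given machine. It is a crude microbenchmark, not a measurement of the figures quoted in the thread.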
-
Well there's a reason for that, right? It's not like Intel architects don't also want it to go fast.
-
Please explain the reason.