The bottleneck in fine-grained multithreading is the cost of interlocked operations like LOCK XADD, LOCK OR, etc. In the case where code doesn't depend on the opcode's result or flags, couldn't these be made free by extending CPU write buffers to support GPU-style ROPs?
That's my understanding. Nontemporal stores aside, I think all the write buffers do is wait for the current core to own the cache line, then write their data into the local cache. That would be the place for a ROP to occur.
-
-
So isn’t the cache line transfer the bottleneck? On x86 I suspect the memory model is the limiting factor. It may be that instructions allowing a looser memory model would be easier to integrate into the hardware.
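For concreteness, here is a minimal C++ sketch of the "result not needed" case being discussed: atomic add and OR operations whose return values are discarded, requested with relaxed ordering. The names `Totals` and `run_fire_and_forget` are made up for illustration. On x86, mainstream compilers typically still emit a LOCK-prefixed instruction for these, since the ISA has no weaker read-modify-write form, which is the memory-model limitation mentioned above.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical result type for this sketch.
struct Totals {
    std::uint64_t count;
    std::uint32_t flags;
};

// Each worker bumps a shared counter and sets its own bit in a shared mask.
// No worker ever looks at the value returned by fetch_add / fetch_or, so in
// principle the operations could be "posted" and applied wherever the cache
// line currently lives.
Totals run_fire_and_forget(int num_threads, int iters_per_thread) {
    assert(num_threads > 0 && num_threads <= 32);  // one flag bit per thread

    std::atomic<std::uint64_t> counter{0};
    std::atomic<std::uint32_t> flags{0};

    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&counter, &flags, t, iters_per_thread] {
            for (int i = 0; i < iters_per_thread; ++i) {
                // Result discarded; relaxed ordering requests atomicity only.
                counter.fetch_add(1, std::memory_order_relaxed);
            }
            // Result discarded here as well.
            flags.fetch_or(1u << t, std::memory_order_relaxed);
        });
    }
    for (auto& w : workers) {
        w.join();  // join synchronizes, so the loads below see every update
    }
    return {counter.load(), flags.load()};
}
```

Calling `run_fire_and_forget(4, 1000)` should give a count of 4000 and a flag mask of 0xF. The final values are the same regardless of interleaving, which is why these operations look like candidates for a ROP-style unit in the first place.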