The bottleneck in fine-grained multithreading is the cost of interlocked operations like LOCK XADD, LOCK OR, etc. In the case where code doesn't depend on the opcode's result or flags, couldn't these be made free by extending CPU write buffers to support GPU-style ROPs?
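To make the case in the question concrete, here is a minimal C++ sketch of interlocked operations whose result is never read. The function name `count_events` and its parameters are hypothetical, chosen only for illustration. On x86, compilers typically lower a discarded `fetch_add` or `fetch_or` to a LOCK-prefixed instruction such as `lock add` or `lock or`, but the exact instruction depends on the compiler and flags.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical helper: every thread bumps a shared counter and sets one flag
// bit, and never looks at the value the atomic operation returns.
// Assumes num_threads <= 32 so each thread gets its own bit.
std::uint64_t count_events(int num_threads, int events_per_thread,
                           std::uint32_t* flags_out) {
    std::atomic<std::uint64_t> counter{0};
    std::atomic<std::uint32_t> flags{0};
    std::vector<std::thread> workers;

    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            for (int i = 0; i < events_per_thread; ++i) {
                // Return value discarded: only the side effect matters.
                counter.fetch_add(1, std::memory_order_relaxed);
            }
            // Also discarded: set this thread's bit in the shared mask.
            flags.fetch_or(1u << t, std::memory_order_relaxed);
        });
    }
    for (auto& w : workers) w.join();

    *flags_out = flags.load();
    return counter.load();
}
```

The final values are the same no matter how the individual updates interleave, which is the property the question relies on when it asks whether such updates could be deferred.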
Replying to @TimSweeneyEpic
The question is whether they're slow because they're actually contesting or not. IIRC they're about 20-30 clocks if uncontested and the data's in the cache? Deferring the update breaks the whole ordered memory thing, which is like - the whole point of doing the LOCK!
Replying to @tom_forsyth @TimSweeneyEpic
What I mean is - the whole point of the LOCKed instruction is usually to say "I own this data now". If you defer it and allow updates to proceed, then... you just broke the code!
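The "I own this data now" pattern can be sketched as a simple spinlock. This is an illustrative example, not code from anyone in the thread; `SpinLock` and `guarded_sum` are hypothetical names. The atomic `exchange` is the interlocked operation, and the plain, non-atomic `total` is only safe to touch because the exchange has completed and been ordered before the update.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Minimal test-and-set spinlock (a sketch, not a production lock).
class SpinLock {
    std::atomic<bool> locked_{false};
public:
    void lock() {
        // The interlocked operation: whoever flips false -> true owns the data.
        while (locked_.exchange(true, std::memory_order_acquire)) {
            std::this_thread::yield();  // someone else owns it, try again
        }
    }
    void unlock() { locked_.store(false, std::memory_order_release); }
};

// Hypothetical demo: threads update a plain (non-atomic) variable under the lock.
long guarded_sum(int num_threads, int iters_per_thread) {
    SpinLock lock;
    long total = 0;  // protected only by `lock`
    std::vector<std::thread> workers;

    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < iters_per_thread; ++i) {
                lock.lock();
                total += 1;  // safe only because the exchange already happened
                lock.unlock();
            }
        });
    }
    for (auto& w : workers) w.join();
    return total;
}
```

If the hardware were allowed to defer the exchange and let the following `total += 1` proceed first, two threads could be inside the critical section at once and the sum would no longer be reliable.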
Replying to @tom_forsyth @TimSweeneyEpic
Now you could defer ALL the updates (locked and not), and that's what TSX/HLE basically does. And it's really difficult and scary and there's lots of caveats!
Replying to @tom_forsyth
TSX costs >60 clocks when uncontended. All atomics on Intel hardware are way too slow to be workable for fine-grained multithreading. Here, like 99.99% of operations won't contend, but they're 100x slower than regular ops.
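One rough way to check the uncontended cost on a given machine is a single-threaded micro-benchmark like the sketch below. It is an assumption-laden illustration: `time_uncontended` and `UncontendedTiming` are hypothetical names, the `volatile` load and store is only a stand-in for a "regular" memory operation, and the result is reported in nanoseconds per operation rather than clocks. Numbers will vary with CPU, compiler, and optimization flags.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>

struct UncontendedTiming {
    double plain_ns_per_op;    // volatile load + add + store
    double atomic_ns_per_op;   // atomic fetch_add, result discarded
    std::uint64_t plain_total;
    std::uint64_t atomic_total;
};

// Hypothetical helper; iters must be > 0. Single-threaded, so nothing contends.
UncontendedTiming time_uncontended(std::uint64_t iters) {
    using clock = std::chrono::steady_clock;
    auto ns = [](clock::duration d) {
        return std::chrono::duration<double, std::nano>(d).count();
    };

    // volatile keeps the compiler from folding the loop into one addition.
    volatile std::uint64_t plain = 0;
    auto t0 = clock::now();
    for (std::uint64_t i = 0; i < iters; ++i) plain = plain + 1;
    auto t1 = clock::now();

    std::atomic<std::uint64_t> locked{0};
    for (std::uint64_t i = 0; i < iters; ++i) {
        locked.fetch_add(1, std::memory_order_relaxed);
    }
    auto t2 = clock::now();

    return {ns(t1 - t0) / iters, ns(t2 - t1) / iters, plain, locked.load()};
}
```

Comparing `atomic_ns_per_op` with `plain_ns_per_op` gives a machine-specific ratio to set against the figures quoted in the thread.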
Replying to @TimSweeneyEpic
Well there's a reason for that, right? It's not like Intel architects don't also want it to go fast.
1 reply 0 retweets 0 likes
Please explain the reason.