The bottleneck in fine-grained multithreading is the cost of interlocked operations like LOCK XADD, LOCK OR, etc. In the case where code doesn't depend on the opcode's result or flags, couldn't these be made free by extending CPU write buffers to support GPU-style ROPs?
Replying to @TimSweeneyEpic
The question is whether they're slow because they're actually contesting or not. IIRC they're about 20-30 clocks if uncontested and the data's in the cache? Deferring the update breaks the whole ordered memory thing, which is like - the whole point of doing the LOCK!
Replying to @tom_forsyth @TimSweeneyEpic
What I mean is - the whole point of the LOCKed instruction is usually to say "I own this data now". If you defer it and allow updates to proceed, then... you just broke the code!
Replying to @tom_forsyth @TimSweeneyEpic
Now you could defer ALL the updates (locked and not), and that's what TSX/HLE basically does. And it's really difficult and scary and there's lots of caveats!
Replying to @tom_forsyth @TimSweeneyEpic
So you have to fit your entire state update inside a "ROP packet" or similar, which is a bit of a limited circumstance. Note that you can actually do this - make a thread that is the "ROP" unit and have other threads send it update packets, and it applies the packets in order.
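A minimal C++ sketch of that "ROP thread" idea, under stated assumptions: the `RopPacket` layout, the `RopThread` class, and the set of operations are all hypothetical names invented for illustration, not anything from the thread. Note that the queue here is guarded by a mutex, so this moves the synchronisation cost into the queue rather than removing it; it only shows the ordering property (one thread owns the data and applies packets in arrival order).

```cpp
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical packet format: which slot to update, how, and with what operand.
enum class RopOp { Add, Or, And, Xor };
struct RopPacket {
    std::size_t slot;
    RopOp op;
    std::uint64_t operand;
};

class RopThread {
public:
    explicit RopThread(std::size_t slots)
        : data_(slots, 0), worker_([this] { run(); }) {}
    ~RopThread() { finish(); }

    // Any thread may call this; the packet is queued, not applied immediately.
    // Packets sent after finish() are never applied.
    void send(RopPacket p) {
        {
            std::lock_guard<std::mutex> lock(m_);
            q_.push(p);
        }
        cv_.notify_one();
    }

    // Let the worker drain the queue, stop it, and return the final state.
    std::vector<std::uint64_t> finish() {
        {
            std::lock_guard<std::mutex> lock(m_);
            done_ = true;
        }
        cv_.notify_one();
        if (worker_.joinable()) worker_.join();
        return data_;
    }

private:
    void run() {
        std::unique_lock<std::mutex> lock(m_);
        for (;;) {
            cv_.wait(lock, [this] { return done_ || !q_.empty(); });
            if (q_.empty()) return;  // done_ is set and nothing is left
            RopPacket p = q_.front();
            q_.pop();
            lock.unlock();
            apply(p);  // only this thread ever touches data_
            lock.lock();
        }
    }

    void apply(const RopPacket& p) {
        std::uint64_t& v = data_[p.slot];
        switch (p.op) {
            case RopOp::Add: v += p.operand; break;
            case RopOp::Or:  v |= p.operand; break;
            case RopOp::And: v &= p.operand; break;
            case RopOp::Xor: v ^= p.operand; break;
        }
    }

    std::vector<std::uint64_t> data_;
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<RopPacket> q_;
    bool done_ = false;
    std::thread worker_;  // declared last so the other members exist before it starts
};
```

Producers call `send({slot, RopOp::Add, 1})` and never read the result, which matches the "code doesn't depend on the opcode's result" case from the original question.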
Replying to @tom_forsyth
Write buffers store (address,value) already - both up to 64 bits. Couldn't we easily add a ROP byte? Semantics are clean for ADD, OR, AND, XOR, etc.
Replying to @TimSweeneyEpic
Well, you can do any 64-bit operation you like by wrapping it in CMPXCHG.
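The usual way to express this in C++ is a compare-exchange retry loop. The sketch below is one possible shape; `atomic_update` is a hypothetical helper name, and the claim that `compare_exchange_weak` maps to `LOCK CMPXCHG` applies to x86 targets.

```cpp
#include <atomic>
#include <cstdint>

// Apply an arbitrary 64-bit function `op` to `target` atomically.
// Returns the value that was in `target` just before the update.
template <typename F>
std::uint64_t atomic_update(std::atomic<std::uint64_t>& target, F op) {
    std::uint64_t expected = target.load(std::memory_order_relaxed);
    // On failure, compare_exchange_weak writes the current value into
    // `expected`, so the loop recomputes op(expected) and retries.
    while (!target.compare_exchange_weak(expected, op(expected))) {
    }
    return expected;
}
```

For example, `atomic_update(x, [](std::uint64_t v) { return v * 3 + 1; })` performs an update that has no dedicated atomic instruction. Under contention the loop may run several times, which is part of the cost being discussed here.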
Replying to @tom_forsyth
CMPXCHG = 18 clocks, no ILP, so 100x slower than a non-atomic operation. And this is stupid, as some CPU architecture improvements could make uncontended atomics much faster.
Replying to @TimSweeneyEpic
Could they? Excellent. Drop me a line, I'll add them.
Please educate me on how write-buffers are implemented and whether adding a ROP is a feasible idea? I mean, if it's just a queue of (address,size,data), then adding an operation wouldn't be hard, and the hardware for add/or/xor/and ROPs would be minimal.
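To make the proposal concrete, here is a toy single-threaded software model of a queue of (address, size, data) entries extended with an operation field. It is only an illustration of the idea as worded in the tweet: `WriteBufferEntry`, `Rop`, and `ToyWriteBuffer` are hypothetical names, memory is modelled as a map of 64-bit cells, and nothing here reflects how real store buffers, caches, or coherence protocols are built.

```cpp
#include <cstdint>
#include <deque>
#include <unordered_map>

// Hypothetical "ROP byte": Store is the ordinary write, the rest are read-modify-write.
enum class Rop : std::uint8_t { Store, Add, Or, And, Xor };

struct WriteBufferEntry {
    std::uint64_t address;
    std::uint8_t size;   // operand width in bytes: 1, 2, 4, or 8
    std::uint64_t data;
    Rop op;
};

class ToyWriteBuffer {
public:
    void push(WriteBufferEntry e) { pending_.push_back(e); }

    // Retire entries oldest-first into the toy memory (one 64-bit cell per address).
    void drain(std::unordered_map<std::uint64_t, std::uint64_t>& memory) {
        while (!pending_.empty()) {
            const WriteBufferEntry e = pending_.front();
            pending_.pop_front();
            const std::uint64_t mask =
                e.size >= 8 ? ~std::uint64_t{0}
                            : (std::uint64_t{1} << (8 * e.size)) - 1;
            std::uint64_t& cell = memory[e.address];
            std::uint64_t result = cell;
            switch (e.op) {
                case Rop::Store: result = e.data; break;
                case Rop::Add:   result = cell + e.data; break;
                case Rop::Or:    result = cell | e.data; break;
                case Rop::And:   result = cell & e.data; break;
                case Rop::Xor:   result = cell ^ e.data; break;
            }
            // Only the low `size` bytes of the cell are modified.
            cell = (cell & ~mask) | (result & mask);
        }
    }

private:
    std::deque<WriteBufferEntry> pending_;
};
```

The model applies entries strictly in queue order, so it sidesteps the real question raised in the replies: what happens to memory ordering when such entries are allowed to retire late relative to other loads and stores.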
Replying to @TimSweeneyEpic
Right, but if you allow it to go out of order with other memory operations, you just broke all the code - coz they're not barriers any more. And if you don't then hey - it stalls.
Replying to @tom_forsyth
For many common operations, atomicity is vital but order is not:
- ORing a shared bitmask, a common technique used by concurrent garbage collectors.
- Increasing or decreasing a reference count without checking its value.
- Adding to a shared counter.