The question is whether they're slow because they're actually contested or not. IIRC they're about 20-30 clocks if uncontested and the data's in the cache? Deferring the update breaks the whole ordered-memory guarantee, which is - the whole point of doing the LOCK!
-
-
More simply put: I can atomically MOV BYTE PTR [eax], IMM without LOCK, but I can't atomically set bits that way. I see no fundamental reason why, if the CPU supports the former, it can't economically support the latter.
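The contrast can be sketched with C11 atomics, assuming typical x86-64 codegen (the instruction names in the comments are what GCC/Clang usually emit, not a guarantee): an aligned whole-byte store is a single plain MOV and is atomic on its own, but setting one bit is a read-modify-write and gets the LOCK prefix.

```c
#include <stdatomic.h>
#include <stdint.h>

_Atomic uint8_t status; /* one shared status byte */

void publish(uint8_t v) {
    /* whole-byte store: a plain MOV, atomic without any LOCK prefix */
    atomic_store_explicit(&status, v, memory_order_release);
}

uint8_t set_flag(uint8_t bit) {
    /* bit set: read-modify-write, compiles to LOCK OR (or LOCK BTS) */
    return atomic_fetch_or_explicit(&status, bit, memory_order_seq_cst);
}
```

Both calls touch the same byte; only the second one has to pay the locked-RMW cost, because the new value depends on the old one.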
-
You could probably build special hardware to pipeline atomics to specific memory locations, but generally you have to read/write an entire cache line at a time, so that piece of memory has to be locked while you do it.
New conversation -
-
-
Those are certainly interesting use cases. I know we kicked around the ROP idea for KNF+KNC. It's pretty complex, and of course totally destroys the x86 memory model. In the end we decided that making a "ROP thread" was the better way - especially since we had 4 threads per core.
-
Adding features that walk the borders of the core architecture is always dangerous, to say the least. This also applies to higher-level software, which is why for production code it's usually better to choose the slightly slower path that is the more reliable one.
End of conversation