The bottleneck in fine-grained multithreading is the cost of interlocked operations like LOCK XADD, LOCK OR, etc. In the case where code doesn't depend on the opcode's result or flags, couldn't these be made free by extending CPU write buffers to support GPU-style ROPs?
Replying to @TimSweeneyEpic
The question is whether they're slow because they're actually contended or not. IIRC they're about 20-30 clocks even if uncontested and the data's in the cache? Deferring the update breaks the whole ordered-memory thing, which is like - the whole point of doing the LOCK!
Replying to @tom_forsyth
On Skylake, LOCK XADD / LOCK OR are 18 clocks when uncontested. That's fine for infrequent operations. But if you're building a fine-grained C++ multithreading library, they're so frequent that they can create a 2x-4x slowdown.
Replying to @TimSweeneyEpic @tom_forsyth
This is where per-thread work queues are a win, along with task stealing when a thread is out of tasks.
Unless you’re talking about the tasks themselves. In which case there’s not much to do other than change the algorithm.
Replying to @dubejf @tom_forsyth
CPUs are fine for large-grained concurrent tasks with infrequent synchronization. But performance falls apart with fine-grained concurrency, as happens with concurrent garbage collection, pervasive futures, and software transactional memory, for example.
I think this type of code will play an important role in the future with many-core CPUs, but writing it today feels like writing a renderer on a 486: it works, yet it’s brutally inefficient, and is just a couple of microarchitectural steps away from excellence.
Yes, but modern rendering is very different from the 486 era - non-parallel smart algorithms were replaced by brute-force parallel ones (e.g. z-buffer instead of triangle sorting / BSP). IMO future CPUs will follow this pattern (Amdahl's law and x86's strict memory model).
Since GPUs are already so awesome for parallel processing, I think we’ll see the further specialization of CPUs towards branchy, random-access, synchronization-heavy, transactional processing tasks rather than data parallelism.