The bottleneck in fine-grained multithreading is the cost of interlocked operations such as LOCK XADD and LOCK OR. In the case where code doesn't depend on the opcode's result or flags, couldn't these be made free by extending CPU write buffers to support GPU-style ROPs?
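For concreteness, here is a minimal C++ sketch of the "result not needed" case the question describes. The helper names `count_event` and `set_flag` are hypothetical, and the instructions named in the comments are what x86-64 compilers typically emit, not a guarantee.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical helper: bump a shared counter and ignore the old value.
// With the result unused, x86-64 compilers typically emit a LOCK-prefixed
// ADD/INC here; LOCK XADD is only needed when the old value is consumed.
inline void count_event(std::atomic<std::uint64_t>& counter) {
    counter.fetch_add(1, std::memory_order_relaxed);
}

// Hypothetical helper: set one bit in a shared word and ignore the old value.
// This typically compiles to LOCK OR.
inline void set_flag(std::atomic<std::uint32_t>& word, unsigned bit) {
    assert(bit < 32);
    word.fetch_or(std::uint32_t{1} << bit, std::memory_order_relaxed);
}
```

Neither caller ever sees a return value or a flag, yet today each call still pays for a full interlocked read-modify-write.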
This is a good optimization for small things, but for something huge (like a garbage collector's reachability bitmask), it doesn't scale well. E.g. instead of 1 bit per flag, you have to choose either 1 byte per flag, or 1 bit per flag per thread.
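A rough C++ sketch of the two alternatives and what they cost in memory, compared with a shared 1-bit-per-flag bitmap. All type and function names here are hypothetical, and the layouts are only one way to read the options mentioned above.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Baseline: shared bitmap, 1 bit per flag, rounded up to 64-bit words.
// Marking a bit here would need an interlocked OR.
inline std::size_t shared_bitmap_bytes(std::size_t flags) {
    return (flags + 63) / 64 * sizeof(std::uint64_t);
}

// Option 1: 1 byte per flag. A relaxed atomic byte store needs no LOCK
// prefix on x86 (it is an ordinary MOV), but memory use grows 8x.
struct ByteFlags {
    std::vector<std::atomic<std::uint8_t>> bytes;
    explicit ByteFlags(std::size_t flags) : bytes(flags) {}
    void mark(std::size_t i) {
        assert(i < bytes.size());
        bytes[i].store(1, std::memory_order_relaxed);
    }
    bool test(std::size_t i) const {
        return bytes[i].load(std::memory_order_relaxed) != 0;
    }
    std::size_t size_bytes() const { return bytes.size(); }
};

// Option 2: 1 bit per flag per thread. Each thread ORs into its own
// private bitmap with plain instructions; memory use grows with the
// thread count, and the bitmaps must be merged afterwards.
struct PerThreadBitmaps {
    std::size_t words_per_map;
    std::vector<std::vector<std::uint64_t>> maps;  // maps[t] is written only by thread t
    PerThreadBitmaps(std::size_t flags, std::size_t threads)
        : words_per_map((flags + 63) / 64),
          maps(threads, std::vector<std::uint64_t>(words_per_map, 0)) {}
    void mark(std::size_t thread, std::size_t i) {
        assert(thread < maps.size() && i / 64 < words_per_map);
        maps[thread][i / 64] |= std::uint64_t{1} << (i % 64);
    }
    // Call only after all marking threads have finished.
    std::vector<std::uint64_t> merge() const {
        std::vector<std::uint64_t> out(words_per_map, 0);
        for (const auto& m : maps)
            for (std::size_t w = 0; w < words_per_map; ++w) out[w] |= m[w];
        return out;
    }
    std::size_t size_bytes() const {
        return maps.size() * words_per_map * sizeof(std::uint64_t);
    }
};
```

For a reachability bitmask covering a large heap, the 8x factor of option 1 or the per-thread factor of option 2 is paid across the whole bitmask, which is the scaling problem described above.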
For small objects probably too much overhead. Might be ok for islands of objects with fused lifetime.