A major bottleneck in fine-grained multithreading is the cost of interlocked operations such as LOCK XADD and LOCK OR. When the code doesn't depend on the instruction's result or flags, couldn't these be made free by extending CPU write buffers to support GPU-style ROPs, i.e. read-modify-write units that apply the operation at the memory side instead of in the core?
Consider updating a large shared bit mask with LOCK OR. You need the LOCK's atomicity to prevent a read-modify-write race. You don't want or need the LOCK to act as a memory ordering fence.
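To make the bit mask case concrete, here is a minimal C++ sketch. The names `g_mask`, `kWords`, `set_bit` and `test_bit` are made up for illustration. The result of `fetch_or` is discarded and the ordering is `memory_order_relaxed`, so the source asks only for atomicity and no fence.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical shared bit mask: kWords 64-bit words (65536 bits in total).
// Objects with static storage duration start out zeroed.
constexpr std::size_t kWords = 1024;
std::atomic<std::uint64_t> g_mask[kWords];

// Atomically set one bit. The old value returned by fetch_or is ignored,
// and memory_order_relaxed requests atomicity only, with no ordering
// relative to other loads and stores.
inline void set_bit(std::size_t bit) {
    assert(bit < kWords * 64);
    g_mask[bit / 64].fetch_or(std::uint64_t{1} << (bit % 64),
                              std::memory_order_relaxed);
}

// Read one bit back, again without requesting any ordering.
inline bool test_bit(std::size_t bit) {
    assert(bit < kWords * 64);
    return ((g_mask[bit / 64].load(std::memory_order_relaxed) >> (bit % 64)) & 1u) != 0;
}
```

Any number of threads can call `set_bit` concurrently without losing updates. On x86-64, though, compilers typically lower this `fetch_or` to a `lock or` instruction, and every LOCK-prefixed instruction also acts as a full memory barrier. So even though the source asks only for atomicity, the hardware still charges for the fence, which is exactly the cost in question.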