Was about to lament the lack of a warp-level barrier in #CUDA, but it seems as simple as "bar.sync (1+warpIndex), 32;" in inline PTX asm? #untested. Is the size of a warp set in stone? Everything I know of is 32, but doc updates hint that this may change in future.
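A minimal sketch of what that inline PTX might look like, in the same untested spirit. The names `warpBarrier`, `neighbourExchange`, `runWarpBarrierDemo` and `kThreads` are hypothetical and not from any library. It assumes a 1D block, that every lane of a warp reaches the barrier together, and that the block has at most 15 warps, since PTX only provides barrier IDs 0 to 15 and ID 0 is the one `__syncthreads()` uses. The warp size is read from `warpSize` on the device and `cudaDeviceProp::warpSize` on the host rather than hard-coding 32.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kThreads = 128;  // assumed block size: 4 warps when warpSize == 32

// Hypothetical helper: each warp syncs on its own named barrier (IDs 1..15).
// Assumes a 1D block and that all lanes of the warp execute this call.
__device__ __forceinline__ void warpBarrier()
{
    unsigned barrierId = 1u + threadIdx.x / warpSize;  // 1 + warpIndex
    unsigned threadCount = warpSize;                   // must be a multiple of the warp size
    asm volatile("bar.sync %0, %1;" :: "r"(barrierId), "r"(threadCount) : "memory");
}

// Each thread publishes a value, waits on the warp barrier, then reads the
// value published by the next lane of the same warp.
__global__ void neighbourExchange(int *out)
{
    __shared__ int slots[kThreads];
    int tid = threadIdx.x;
    int lane = tid % warpSize;
    int warpBase = tid - lane;

    slots[tid] = 3 * tid + 1;
    warpBarrier();
    out[tid] = slots[warpBase + (lane + 1) % warpSize];
}

// Returns the number of mismatching outputs, or -1 on a CUDA error or an
// unsupported configuration.
int runWarpBarrierDemo()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return -1;
    int ws = prop.warpSize;  // queried, not assumed to be 32
    if (kThreads % ws != 0 || kThreads / ws > 15) return -1;

    int *dOut = nullptr;
    if (cudaMalloc(&dOut, kThreads * sizeof(int)) != cudaSuccess) return -1;
    neighbourExchange<<<1, kThreads>>>(dOut);

    std::vector<int> hOut(kThreads);
    cudaError_t err = cudaMemcpy(hOut.data(), dOut, kThreads * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dOut);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int tid = 0; tid < kThreads; ++tid) {
        int lane = tid % ws;
        int neighbour = tid - lane + (lane + 1) % ws;
        if (hOut[tid] != 3 * neighbour + 1) ++mismatches;
    }
    return mismatches;
}
```

Calling `runWarpBarrierDemo()` from a host `main` should return 0 if the exchange behaved as expected on the installed GPU. Since CUDA 9 there is also a built-in `__syncwarp()`, which is likely the simpler choice where it is available and avoids the 15-warp limit of the named-barrier approach.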