Was about to lament the lack of a warp-level barrier in #CUDA, but seems as simple as "bar.sync (1+warpIndex), 32;" inline PTX asm? #untested

Note there are 16 (0-15) numbered barriers. Count of 32 is practical for sm_1x and sm_2x but prob. too few for sm_30.

Thanks, but this is per block, right? So I just need to have more blocks per SM to fill sm_30?

Correct! That's a good strategy for avoiding running out of numbered barriers.
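A minimal sketch of the idea in the thread, written out as CUDA with inline PTX. Like the original suggestion, treat it as lightly reasoned rather than battle-tested. The names `barrier_id_for_warp`, `warp_barrier`, `reverse_within_warp`, and `run_check` are hypothetical helpers invented for this illustration. It assumes barrier 0 is left alone because `__syncthreads()` uses it, which leaves barriers 1-15 and caps a block at 15 warps (480 threads) if every warp gets its own barrier. It also assumes sm_20 or newer, since, as far as I can tell, the thread-count operand and register operands of `bar.sync` are not available before that. Because the numbered barriers are per block, the host side simply launches several such blocks rather than making one block larger.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr unsigned kWarpSize    = 32;
constexpr unsigned kNumBarriers = 16;               // barriers 0-15, as noted in the thread
constexpr unsigned kMaxWarps    = kNumBarriers - 1; // barrier 0 is skipped, so 15 warps

// Hypothetical helper: give each warp its own barrier, skipping barrier 0
// (assumed to be the one __syncthreads() uses).
__host__ __device__ constexpr unsigned barrier_id_for_warp(unsigned warpIndex)
{
    return 1u + warpIndex;
}

// Hypothetical helper: the "bar.sync (1+warpIndex), 32;" from the thread.
// Every thread of the warp must reach this call.
__device__ __forceinline__ void warp_barrier(unsigned warpIndex)
{
    asm volatile("bar.sync %0, %1;"
                 :: "r"(barrier_id_for_warp(warpIndex)), "r"(kWarpSize)
                 : "memory");
}

// Each warp reverses its own 32 values through shared memory,
// synchronizing only with itself between the write and the read.
__global__ void reverse_within_warp(const int* in, int* out)
{
    __shared__ int scratch[kMaxWarps * kWarpSize];
    const unsigned tid       = threadIdx.x;
    const unsigned warpIndex = tid / kWarpSize;
    const unsigned lane      = tid % kWarpSize;
    const unsigned gid       = blockIdx.x * blockDim.x + tid;

    scratch[tid] = in[gid];
    warp_barrier(warpIndex);
    out[gid] = scratch[warpIndex * kWarpSize + (kWarpSize - 1 - lane)];
}

// Launches numBlocks blocks of 15 warps each and returns the number of
// mismatching elements, or -1 if a CUDA call fails.
int run_check(int numBlocks)
{
    const int threads = static_cast<int>(kMaxWarps * kWarpSize); // 480
    const int n = numBlocks * threads;
    const size_t bytes = static_cast<size_t>(n) * sizeof(int);

    std::vector<int> h_in(n), h_out(n, -1);
    for (int i = 0; i < n; ++i) h_in[i] = i;

    int *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return -1; }

    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);
    reverse_within_warp<<<numBlocks, threads>>>(d_in, d_out);
    const cudaError_t err =
        cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        const int lane      = i % static_cast<int>(kWarpSize);
        const int warpStart = i - lane;
        const int expected  = h_in[warpStart + static_cast<int>(kWarpSize) - 1 - lane];
        if (h_out[i] != expected) ++mismatches;
    }
    return mismatches;
}
```

To try it, call `run_check(4)` from a `main` and build with `nvcc`; a return value of 0 means every warp saw its own writes after the barrier. On recent toolkits `__syncwarp()` is the documented way to get a warp-level barrier, so this sketch is mainly useful for seeing how the numbered barriers behave.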