Was about to lament the lack of a warp-level barrier in #CUDA, but it seems as simple as "bar.sync (1+warpIndex), 32;" in inline PTX asm? #untested

Is the size of a warp set in stone? Everything I know of is 32, but doc updates hint that this may change in the future.

Note there are 16 (0-15) numbered barriers. Count of 32 is practical for sm_1x and sm_2x but prob. too few for sm_30.

Thanks, but this is per block, right? So I just need to have more blocks per SM to fill sm_30?

... and it begs the question "why?" given that warps are implicitly sync'd (for now).

Want to safely use shared memory for some warp-level storage/reductions; not confident about going unsynced since NV removed that from the docs!
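
For reference, a rough sketch of the named-barrier idea from the thread, applied to a per-warp shared-memory reduction. Like the original suggestion, it is untested. It assumes a warp size of 32, sm_20 or newer (needed for a register barrier ID and a thread count on bar.sync), and at most 15 warps per block, since barrier 0 is the one __syncthreads() uses and only IDs 1-15 remain. The names warp_barrier, warp_sums, kWarpSize and kWarpsPerBlock are made up for illustration, and the host code assumes managed memory is available.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kWarpSize      = 32;  // assumption: true for all current hardware
constexpr int kWarpsPerBlock = 8;   // must stay <= 15 (barrier IDs 1..15)

// Hypothetical helper: one named barrier per warp, 32 participating threads.
// Barrier 0 is left alone because __syncthreads() uses it.
__device__ __forceinline__ void warp_barrier(unsigned warpIndex)
{
    asm volatile("bar.sync %0, 32;" :: "r"(1u + warpIndex) : "memory");
}

// Each warp sums its 32 inputs in its own row of shared memory.
__global__ void warp_sums(const float* in, float* out)
{
    __shared__ float scratch[kWarpsPerBlock][kWarpSize];

    const unsigned lane = threadIdx.x % kWarpSize;
    const unsigned warp = threadIdx.x / kWarpSize;

    scratch[warp][lane] = in[blockIdx.x * blockDim.x + threadIdx.x];
    warp_barrier(warp);

    // Tree reduction; every lane reaches the barrier on every step.
    for (unsigned stride = kWarpSize / 2; stride > 0; stride /= 2) {
        if (lane < stride)
            scratch[warp][lane] += scratch[warp][lane + stride];
        warp_barrier(warp);
    }

    if (lane == 0)
        out[blockIdx.x * kWarpsPerBlock + warp] = scratch[warp][0];
}

int main()
{
    const int blocks  = 2;
    const int threads = kWarpsPerBlock * kWarpSize;
    const int n       = blocks * threads;
    const int warps   = blocks * kWarpsPerBlock;

    float *in = nullptr, *out = nullptr;
    cudaMallocManaged(&in,  n * sizeof(float));
    cudaMallocManaged(&out, warps * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;

    warp_sums<<<blocks, threads>>>(in, out);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int w = 0; w < warps; ++w)
        printf("warp %d sum = %g\n", w, out[w]);

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

The barrier IDs are per block, so the 15-warp ceiling limits block size (480 threads here), not the number of blocks resident on an SM. On toolkits that provide __syncwarp(), that intrinsic is likely the simpler choice, and it does not consume a named barrier.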