Was about to lament the lack of a warp-level barrier in #CUDA, but it seems to be as simple as "bar.sync (1+warpIndex), 32;" in inline PTX asm? #untested

... and it raises the question "why?", given that warps are implicitly sync'd (for now).

I want to safely use shared memory for some warp-level storage/reductions, and I'm not confident about going unsynced since NV removed that from the docs!

I think a lot of us are wondering what that doc edit implies. I wonder if "bar.sync 15, 32" winds up being a "free" barrier? Will try.

Yeah, no idea if the special case of 32 has a fast path somewhere. Also no idea yet if the asm needs a "memory" clobber (suspect yes).
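For anyone who wants to try the idea, here is a minimal sketch of a per-warp barrier built on "bar.sync", used for a warp-level sum in shared memory. It has not been benchmarked, so it says nothing about whether the 32-thread case is "free". The names `warpBarrier`, `warpSums`, `kWarpSize`, and `kWarpsPerBlock` are made up for illustration. The "memory" clobber is included on the suspicion voiced above that the compiler should not move shared-memory accesses across the asm. PTX exposes only 16 barriers per block and `__syncthreads()` uses barrier 0, so the "1 + warpIndex" scheme presumably tops out at 15 warps per block.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kWarpSize = 32;
constexpr int kWarpsPerBlock = 4;  // assumption: must stay <= 15 for ids 1..15

// Hypothetical helper: named barrier that waits for 32 threads (one warp).
// barrierId must be in 0..15; id 0 is left alone because __syncthreads() uses it.
__device__ __forceinline__ void warpBarrier(unsigned barrierId)
{
    // "memory" clobber: asks the compiler not to reorder memory accesses
    // across the barrier (assumed necessary, as suspected in the thread).
    asm volatile("bar.sync %0, %1;" :: "r"(barrierId), "r"(kWarpSize) : "memory");
}

// Each warp sums its own 32 inputs using a private row of shared memory.
__global__ void warpSums(const int* in, int* out)
{
    __shared__ int scratch[kWarpsPerBlock][kWarpSize];
    const unsigned lane = threadIdx.x % kWarpSize;
    const unsigned warp = threadIdx.x / kWarpSize;

    scratch[warp][lane] = in[blockIdx.x * blockDim.x + threadIdx.x];
    warpBarrier(1 + warp);

    // Tree reduction: at each step lanes [0, stride) read from [stride, 2*stride).
    for (unsigned stride = kWarpSize / 2; stride > 0; stride /= 2) {
        if (lane < stride)
            scratch[warp][lane] += scratch[warp][lane + stride];
        warpBarrier(1 + warp);  // outside the branch so all 32 lanes arrive
    }

    if (lane == 0)
        out[blockIdx.x * kWarpsPerBlock + warp] = scratch[warp][0];
}

int main()
{
    const int n = kWarpSize * kWarpsPerBlock;
    int *in = nullptr, *out = nullptr;
    cudaMallocManaged(&in, n * sizeof(int));
    cudaMallocManaged(&out, kWarpsPerBlock * sizeof(int));
    for (int i = 0; i < n; ++i) in[i] = i;

    warpSums<<<1, n>>>(in, out);
    if (cudaDeviceSynchronize() != cudaSuccess) {
        printf("kernel failed\n");
        return 1;
    }

    // Compare each warp's result against a host-side reference sum.
    for (int w = 0; w < kWarpsPerBlock; ++w) {
        int expected = 0;
        for (int l = 0; l < kWarpSize; ++l) expected += in[w * kWarpSize + l];
        printf("warp %d: got %d, expected %d\n", w, out[w], expected);
    }

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Passing the thread count as an operand keeps the helper easy to adapt to other multiples of the warp size. On toolkits from CUDA 9 onward, the built-in `__syncwarp()` covers the same need without inline PTX or using up a named barrier, so the asm route is mostly of interest on older toolkits or for experimenting with the "bar.sync 15, 32" question.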