It's interesting to observe how the spread of ideas actually works in practice. I've seen DeOldify cited in many places, yet the key insights that seem like low-hanging fruit to me don't actually get traction. I suppose not having a paper might be part of it, but it's still odd.
3/ Hence I've also noticed that adding more self-attention makes the network disproportionately more powerful for the memory/computation it adds, compared with simply widening the convolutional channels.
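To put rough numbers on the cost side of that comparison, here is a minimal sketch in plain Python. It assumes a SAGAN-style self-attention layer (1x1 query and key projections reduced to channels/8, a full-width value projection, one learned scale, no biases) and compares its weight count with widening a single 3x3 convolution. The function names, the reduction factor of 8, and the example sizes are illustrative assumptions, not taken from the DeOldify code, and the sketch says nothing about how much more powerful either option makes the network.

```python
def self_attention_params(channels: int, reduction: int = 8) -> int:
    """Weight count of an assumed SAGAN-style self-attention layer (no biases)."""
    qk_channels = channels // reduction
    query = channels * qk_channels   # 1x1 projection to the query space
    key = channels * qk_channels     # 1x1 projection to the key space
    value = channels * channels      # 1x1 projection at full width
    gamma = 1                        # learned scalar that scales the attention output
    return query + key + value + gamma


def conv_params(in_channels: int, out_channels: int, kernel: int = 3) -> int:
    """Weight count of a square 2D convolution without bias."""
    return in_channels * out_channels * kernel * kernel


def widening_cost(channels: int, extra: int, kernel: int = 3) -> int:
    """Extra weights from widening one conv layer's input and output by `extra` channels."""
    wide = channels + extra
    return conv_params(wide, wide, kernel) - conv_params(channels, channels, kernel)


def attention_map_entries(height: int, width: int) -> int:
    """Entries in the (H*W) x (H*W) attention map for one sample and one head."""
    positions = height * width
    return positions * positions


if __name__ == "__main__":
    c = 256  # hypothetical channel count at the layer where attention is inserted
    print("self-attention weights:", self_attention_params(c))
    print("widen one 3x3 conv by 32 channels:", widening_cost(c, 32))
    print("attention map entries at 32x32:", attention_map_entries(32, 32))
```

Under these assumptions, one attention layer at 256 channels adds 81,921 weights, while widening a single 3x3 convolution from 256 to 288 channels adds 156,672. The catch is activation memory: the attention map grows with the square of the number of spatial positions (1,048,576 entries for a 32x32 feature map), so where the layer sits in the network matters.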