Shivon Zilis (@shivon) · Dec 13, 2020
OH: Why are language models so much bigger than computer vision models? ... Because a picture is worth a thousand words 🥁🤦🏻♀️
Lucas Beyer (@giffmana), replying to @shivon
Well actually... we found that a picture is only worth 16x16 words ;-)
Link (arxiv.org): An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either...
1:01 PM · Dec 13, 2020 · Twitter Web App · 13 Likes