Two *necessary* conditions (often met in practice): (a) multiple heads, e.g. a 3x3 kernel requires 9 heads; (b) relative positional encoding to allow translation invariance. Each head can attend to pixels at a fixed shift from the query pixel, forming the pixel receptive field. 2/5
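As an illustration of condition (a), here is a minimal NumPy sketch (not the paper's construction; the image and kernel values are made up) in which 9 heads, each placing all attention mass on one fixed shift and applying its own value projection, jointly reproduce a zero-padded 3x3 convolution:

```python
import numpy as np

H, W, C = 4, 4, 2                      # toy feature map
img = np.random.randn(H, W, C)

# one head per offset of a 3x3 kernel
shifts = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
# hypothetical per-head value weights; stacked, they form a 3x3 conv kernel
kernel = np.random.randn(3, 3, C, C)

def attend_fixed_shift(img, dy, dx):
    """Attention that puts all its mass on the pixel at a fixed shift
    from the query pixel (zero padding at the border)."""
    out = np.zeros_like(img)
    ys = slice(max(0, -dy), H - max(0, dy))
    xs = slice(max(0, -dx), W - max(0, dx))
    out[ys, xs] = img[max(0, dy):H + min(0, dy), max(0, dx):W + min(0, dx)]
    return out

# each head attends to its shift and applies its value matrix; heads sum up
attn_out = sum(
    attend_fixed_shift(img, dy, dx) @ kernel[dy + 1, dx + 1]
    for dy, dx in shifts
)

# reference: direct zero-padded 3x3 convolution
conv_out = np.zeros_like(img)
for y in range(H):
    for x in range(W):
        for dy, dx in shifts:
            yy, xx = y + dy, x + dx
            if 0 <= yy < H and 0 <= xx < W:
                conv_out[y, x] += img[yy, xx] @ kernel[dy + 1, dx + 1]

print(np.allclose(attn_out, conv_out))  # True
```

The fixed-shift attention maps stand in for what relative positional scores can learn to produce; the content-based part of attention is switched off entirely here.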
Our work explains the recent success of Transformer architectures applied to vision:
Attention Augmented Convolutional Networks. @IrwanBello et al., 2019. https://arxiv.org/abs/1904.09925
Stand-Alone Self-Attention in Vision Models. Ramachandran et al., 2019. https://arxiv.org/abs/1906.05909 3/5
Our interactive website displays the attention maps:
- some heads ignore content and attend to pixels at *fixed* shifts (confirms the theory), sliding a grid-like receptive field,
- some heads seem to use content-based attention -> an expressive advantage over CNNs.
http://epfml.github.io/attention-cnn 4/5
I deeply appreciated the engagement of the reviewers (especially the critiques from Reviewer 3). I am thankful to
@loukasa_tweet and Martin Jaggi for their support at @epfl_en. My Ph.D. is funded by @SDSCdatascience; Andreas is supported by @snsf_ch. Addis Ababa, here we come!
5/5
Great paper! Well written and insightful. Convolution is a strong inductive prior, alleviating the need to see shifted examples. Fully-connected layers can also learn convolution, but will only do so if the training data has many shifted variants. How does this work for attention?
Thank you for your interest! MHSA layers appear to be more efficient than FC layers at learning the right inductive biases. We are not certain why. My hypothesis is that the softmax promotes learning attention patterns that are sparse (due to the exp) and semi-orthogonal across heads (due to non-negativity).
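A toy numerical sketch of that hypothesis (illustrative only; the scores are arbitrary): the exp in softmax concentrates mass on the largest logits, and since its outputs are non-negative, two heads that peak on different positions have nearly orthogonal attention vectors.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# exp amplifies the gap: most of the mass lands on the largest score
scores = np.array([4.0, 1.0, 0.0, -1.0, 0.5])
p = softmax(scores)
print(p.round(3))          # mass concentrates on the first position

# two heads peaking on different positions: because all entries are
# non-negative, their attention vectors are close to orthogonal
head_a = softmax(10 * np.eye(5)[0])   # peaks at position 0
head_b = softmax(10 * np.eye(5)[3])   # peaks at position 3
print(head_a @ head_b)                # close to 0
```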
#ICLR2020 @threader_app please compile it.
Hi, you can read this thread from
@jb_cordonnier here: https://threader.app/thread/1215581826187743232 #ICLR2020
Can someone please try positional encoding with a Bresenham circle instead of a square? I'm sure it will revolutionize this (KxK) space.
I tried with mnasnet once for the bigger kernels. It seems you can cut corners without losing much precision.
Paper:
Code:
Blog: