Just realized positional encoding plus “attention” in transformers is basically generalized, dynamic, non-local convolution 🧐
Or conversely, CNNs (e.g. 1-d convs over time series for modeling sensors) are "static," local transformers
I.e., transformers = semantic "spooky action at a distance"
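Rough sketch of what I mean (plain NumPy, single head, no multi-head or learned positional encoding, all names mine, not anyone's reference implementation): a 1-d conv mixes a fixed local window with the same static weights at every position, while attention mixes all positions with weights computed from the data itself. The positional encoding is what gives attention back a notion of "where" — without it the mixing is permutation-invariant, whereas a conv gets locality for free from its window.

```python
import numpy as np

def conv1d(x, w):
    """'Static local' mixing: one fixed kernel w, reused at every position."""
    k = len(w)
    T = len(x) - k + 1
    return np.array([np.dot(w, x[t:t + k]) for t in range(T)])

def attention(X, Wq, Wk, Wv):
    """'Dynamic non-local' mixing: weights depend on the input and span all positions."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])        # (T, T): every position scores every other
    A = np.exp(scores - scores.max(-1, keepdims=True))
    A /= A.sum(-1, keepdims=True)                  # each softmax row is a per-position "kernel"
    return A @ V                                   # same weighted-sum form as a conv, but A is data-dependent

# toy comparison
T, d = 8, 4
rng = np.random.default_rng(0)

x = rng.standard_normal(T)
print(conv1d(x, np.array([0.25, 0.5, 0.25])))      # one static, local kernel everywhere

X = rng.standard_normal((T, d))                    # in a real transformer, positional encoding is added to X here
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
print(attention(X, Wq, Wk, Wv))                    # a different, input-dependent kernel at each position
```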
Replying to
I'm vaguely reminded of alpha/beta mixing processes (generalizations of Markov processes) here. I think a transformer is basically an "unmixing" process

