Conversation

Just realized positional encoding plus “attention” in transformers is basically a generalized, dynamic, non-local convolution 🧐 Or conversely, CNNs (1-D convs over time series for modeling sensors) are “static”, local transformers. I.e. transformers = semantic “spooky action at a distance”
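A rough NumPy sketch of that analogy (toy code, hypothetical names, not from the thread): the first function applies a 1-D conv by materializing the equivalent weight matrix, which is fixed, banded, and depends only on relative position; the second is single-head self-attention, whose weights are computed from the content at every pair of positions and span the whole sequence.

```python
import numpy as np

def conv1d_as_static_attention(x, kernel):
    """Apply a 1-D convolution by materializing the equivalent banded weight matrix."""
    T, K = len(x), len(kernel)
    W = np.zeros((T, T))                    # static, local "attention" weights
    for t in range(T):
        for j in range(K):
            s = t + j - K // 2              # source position for tap j (odd-length kernel assumed)
            if 0 <= s < T:
                W[t, s] = kernel[j]         # weight depends only on the offset t - s
    return W @ x                            # each output mixes a fixed local neighborhood

def toy_self_attention(X):
    """Single-head self-attention: weights computed from content, spanning all positions."""
    d = X.shape[1]
    rng = np.random.default_rng(0)
    Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d)           # every position scores every other position
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)      # softmax rows: the dynamic, input-dependent "kernel"
    return A @ V

t = np.linspace(0, 4 * np.pi, 32)
x = np.sin(t)                                              # toy sensor signal
smoothed = conv1d_as_static_attention(x, np.ones(5) / 5)   # moving average as "static attention"
X = np.stack([np.sin(t), np.cos(t)], axis=1)               # toy 2-d embeddings carrying position info
mixed = toy_self_attention(X)
print(smoothed.shape, mixed.shape)                         # (32,) (32, 2)
```

The only difference between the two is where the mixing weights come from: a lookup keyed on relative position (conv) versus a function of the inputs themselves (attention), which is what makes the latter dynamic and non-local.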
I’m vaguely reminded of alpha/beta mixing processes (generalizations of Markov processes) here. I think a transformer is basically an “unmixing” process
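For reference, the strong (alpha-) mixing coefficient this seems to allude to measures how quickly dependence between the past and the far future of a process decays; a sketch of the standard definition (not from the thread):

$$
\alpha(n) = \sup_{k \ge 1}\,\sup\Big\{\,\big|P(A\cap B)-P(A)\,P(B)\big| : A \in \sigma(X_1,\dots,X_k),\; B \in \sigma(X_{k+n},X_{k+n+1},\dots)\Big\}
$$

A process is alpha-mixing if $\alpha(n)\to 0$ as $n\to\infty$; beta-mixing (absolute regularity) is the stronger total-variation analogue, and both generalize the way a Markov chain forgets its past.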