2/ More concretely, we find that, on a downstream word analogy task, it beats 1) the NTK limit and 2) all finite-width models, whose performance approaches the infinite-width performance from below as width increases. pic.twitter.com/FJ8QVsGKOO
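For concreteness, here is a minimal sketch of this kind of word analogy evaluation: answering "a : b :: c : ?" by nearest neighbor in embedding space under cosine similarity. The embedding matrix `emb` and vocabulary `vocab` are hypothetical placeholders, and this is not the paper's exact evaluation pipeline.

```python
# Minimal word-analogy evaluation sketch (hypothetical `emb`, `vocab`; not the paper's code).
import numpy as np

def analogy(emb: np.ndarray, vocab: list, a: str, b: str, c: str) -> str:
    """emb has shape (vocab_size, dim); rows are assumed L2-normalized."""
    idx = {w: i for i, w in enumerate(vocab)}
    query = emb[idx[b]] - emb[idx[a]] + emb[idx[c]]
    query /= np.linalg.norm(query)
    scores = emb @ query              # cosine similarity against every word
    for w in (a, b, c):               # standard convention: exclude the query words themselves
        scores[idx[w]] = -np.inf
    return vocab[int(np.argmax(scores))]

# e.g. analogy(emb, vocab, "man", "king", "woman") should ideally return "queen"
```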
3/ Same thing for few-shot learning on Omniglot via MAML. IMO this is the right kind of infinite-width neural network to study! pic.twitter.com/Zbc0npnHiq
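For reference, a minimal sketch of the MAML-style few-shot setup mentioned here, in its first-order (FOMAML) variant. `sample_task` is a hypothetical function returning support/query tensors for one Omniglot-style task; this is not the paper's code.

```python
# First-order MAML (FOMAML) sketch; `sample_task` is a hypothetical task sampler.
import torch
import torch.nn as nn

def maml_step(model, sample_task, inner_lr=0.4, n_tasks=4):
    loss_fn = nn.CrossEntropyLoss()
    meta_loss = 0.0
    for _ in range(n_tasks):
        sx, sy, qx, qy = sample_task()  # support/query inputs and labels for one task
        # Inner loop: one adaptation step on the support set.
        fast = {n: p.clone() for n, p in model.named_parameters()}
        support_loss = loss_fn(torch.func.functional_call(model, fast, sx), sy)
        grads = torch.autograd.grad(support_loss, list(fast.values()))  # detached (first-order)
        fast = {n: w - inner_lr * g for (n, w), g in zip(fast.items(), grads)}
        # Outer loop: evaluate the adapted weights on the query set.
        meta_loss = meta_loss + loss_fn(torch.func.functional_call(model, fast, qx), qy)
    return meta_loss / n_tasks  # backprop this through the meta-parameters
```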
4/ This paper is the 4th in the Tensor Programs series (following https://arxiv.org/abs/1910.12478 https://arxiv.org/abs/2006.14548 https://arxiv.org/abs/2009.10685 ). In fact, I started this series 2 years ago precisely so I could eventually write this paper! I'm so happy and relieved this is now out :)
5/ Thanks to @ZeyuanAllenZhu @prfsanjeevarora @BachFrancis @yasamanbb @LenaicChizat @deepcohen @yaringal @QuanquanGu Bobby He, Jiaoyang Huang, Arthur Jacot, @hoonkp @jasondeanlee Zhiyuan Li Etai Littwin @2prime_PKU Song Mei @ARomanNovak @vinaysrao Michael Santacroce @sschoenholz
6/ Lisa Schut @jaschasd @murefil Denny Wu, Huishuai Zhang, Pengchuan Zhang for discussions and feedback!
Also @roydanroy this is what I promised for "beyond NTK" :)
7/ Shoutout to my co-author, former Microsoft AI Resident @edwardjhu (who graduated into a full-time position at Microsoft). Fast learner, hard worker, curiosity-driven, great communicator, trustworthy - what a monster of a researcher. Follow him now! (and more to come from us :)
Can we not "explain" something like BERT/GPT3* by just computing the posterior of the NNGP conditioned on all the data that GPT3 got, and then using any new data to further update the posterior? I would guess most of GPT3's performance can be explained that way.
*GPT3 isn't even "pre-training" in the usual sense of "for later fine-tuning". It's just evaluated and it does "metalearning" on the function from inputs to outputs it implements
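To make the suggestion in this reply concrete: for an exact GP, "further updating the posterior" with new data is the same as conditioning on the concatenated dataset. Below is a minimal sketch of that posterior computation; the RBF kernel is a stand-in, not an actual NNGP kernel (which would be derived from the architecture, e.g. via a library like neural-tangents).

```python
# Exact GP regression posterior sketch; RBF kernel is a placeholder for an NNGP kernel.
import numpy as np

def rbf(A, B, lengthscale=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def gp_posterior(X, y, X_test, kernel=rbf, noise=1e-3):
    K = kernel(X, X) + noise * np.eye(len(X))
    K_star = kernel(X_test, X)
    mean = K_star @ np.linalg.solve(K, y)
    cov = kernel(X_test, X_test) - K_star @ np.linalg.solve(K, K_star.T)
    return mean, cov

# "Further updating the posterior" with new data (X_new, y_new) is just conditioning
# on the concatenated dataset:
# gp_posterior(np.vstack([X, X_new]), np.concatenate([y, y_new]), X_test)
```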