
I respect Jacob a lot, but I find it really difficult to engage with predictions of LLM capabilities that presume some version of the scaling hypothesis will continue to hold - they just seem highly implausible given everything we already know about the limits of transformers!
Quote Tweet
Some disconcerting predictions about developments in large language models over the next 7 years from @JacobSteinhardt: bounded-regret.ghost.io/what-will-gpt-
If someone can explain how the predictions above could still come true in light of the following findings, that'd honestly be helpful.
- Transformers appear unable to learn non-finite or context-free languages, even autoregressively:
Quote Tweet
Very cool paper to start the week: "Neural Networks and the Chomsky Hierarchy", showing which NLP architectures are able to generalize to which different formal languages! (1/8) 🧵
Diagram showing the Chomsky hierarchy of languages and the models that can learn them. FFNNs and transformers can learn finite languages, RNNs regular languages, LSTMs counter languages, Stack-RNNs deterministic context free languages, and Tape-RNNs context-sensitive languages. No model can learn recursively enumerable languages.
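As a concrete illustration of the kind of test behind that finding: Dyck-1 (balanced parentheses) is context-free, and recognizing it at unbounded nesting depth requires a stack rather than finite pattern matching. A minimal sketch of the standard depth-generalization probe - train on shallow strings, evaluate on deeper ones - where the sampler and split sizes are illustrative, not taken from the paper:

```python
import random

def gen_dyck1(max_depth):
    """Sample a balanced-parenthesis (Dyck-1) string with bounded nesting depth."""
    s, depth = [], 0
    while True:
        if depth == 0 and s and random.random() < 0.3:
            return "".join(s)
        if depth < max_depth and random.random() < 0.5:
            s.append("(")
            depth += 1
        elif depth > 0:
            s.append(")")
            depth -= 1

def is_dyck1(s):
    """Ground-truth recognizer: depth never goes negative and ends at zero."""
    depth = 0
    for c in s:
        depth += 1 if c == "(" else -1
        if depth < 0:
            return False
    return depth == 0

# Depth-generalization split: a model that learned the grammar should
# transfer from shallow to deep strings; a finite-pattern matcher should not.
random.seed(0)
train = [gen_dyck1(max_depth=3) for _ in range(1000)]
test = [gen_dyck1(max_depth=10) for _ in range(100)]
```

The interesting quantity is accuracy on `test` after training only on `train`; per the quoted results, transformers tend to fit the shallow distribution without acquiring the stack-like rule.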
- Transformers learn shortcuts (via linearized subgraph matching) to multi-step reasoning problems instead of the true algorithm that would systematically generalize:
Quote Tweet
Faith and Fate: Limits of Transformers on Compositionality Transformer large language models (LLMs) have sparked admiration for their exceptional performance on tasks that demand intricate multi-step reasoning. Yet, these models simultaneously show failures on surprisingly…
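The standard way to expose this kind of shortcut is a compositional split: train on problems up to some size, then evaluate exact-match accuracy on strictly larger ones, where memorized subgraphs can't help but a true algorithm would. A minimal sketch of such a split using multi-digit multiplication (one of the tasks studied in that line of work; the function name and split sizes here are illustrative):

```python
import random

def make_split(n_train, train_digits, test_digits):
    """Build (prompt, answer) pairs; test operands are strictly longer than
    anything seen in training, so shortcut solutions should fail there."""
    def sample(digits):
        a = random.randrange(10 ** (digits - 1), 10 ** digits)
        b = random.randrange(10 ** (digits - 1), 10 ** digits)
        return (f"{a}*{b}=", str(a * b))
    train = [sample(random.randint(1, train_digits)) for _ in range(n_train)]
    test = [sample(test_digits) for _ in range(n_train // 10)]
    return train, test

random.seed(0)
train, test = make_split(n_train=1000, train_digits=3, test_digits=6)
```

High accuracy on `train`-sized problems combined with near-zero exact match on `test` is the signature of linearized subgraph matching rather than the multiplication algorithm.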
- Similarly, transformers learn shortcuts to recursive algorithms from input/output examples, instead of the recursive algorithm itself:
Quote Tweet
New preprint just dropped! "Can Transformers Learn to Solve Problems Recursively?" With @dylanszzhang, @CurtTigges, @BlancheMinerva, @mraginsky, and @TaliaRinger. arxiv.org/abs/2305.14699
https://arxiv.org/abs/2305.14699
These are all limits that I don't see how "just add data" or "just add compute" could solve. General algorithms can be observationally equivalent to ensembles of heuristics on arbitrarily large datasets, as long as the NN has capacity to represent that ensemble.
So unless you restrict the capacity of the model or do intense process-based supervision (which would require knowing enough of the desired algorithm that you could just program it directly in the first place), it seems exceedingly unlikely that transformers would learn generalizable solutions.
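The observational-equivalence point can be made concrete in a few lines: on any bounded training distribution, a pure lookup table - the extreme case of an "ensemble of heuristics" - is indistinguishable from the general algorithm, and only diverges off-distribution. A toy sketch with addition standing in for the target algorithm (the bound N and the silent fallback are illustrative choices):

```python
def true_add(a, b):
    """The general algorithm: correct for all inputs."""
    return a + b

# An "ensemble of heuristics": a lookup table covering every input the
# bounded training distribution can produce.
N = 100
table = {(a, b): a + b for a in range(N) for b in range(N)}

def shortcut(a, b):
    """Matches true_add exactly on-distribution; fails silently beyond it."""
    return table.get((a, b), 0)

# On the training distribution the two are observationally equivalent...
assert all(shortcut(a, b) == true_add(a, b) for a in range(N) for b in range(N))
# ...so no amount of on-distribution data distinguishes them. They only
# come apart outside the distribution:
assert shortcut(N, 1) != true_add(N, 1)
```

Enlarging the dataset just enlarges N; as long as the model has the capacity to represent the bigger table, the same equivalence holds, which is the sense in which "just add data" doesn't force the general algorithm.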
Some additional thoughts on what autoregressive transformers can express (some variants are Turing-complete), vs. what they can learn, in this thread!
Quote Tweet
Replying to @xuanalogue
Consider that all those recent proofs of hard limits do not engage with how Transformers are used in practice, and they are qualitatively more expressive when utilized with even rudimentary scaffolding. twitter.com/bohang_zhang/s
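The expressivity-with-scaffolding point is also easy to make concrete: if the "model" is just a fixed per-step transition function and the scaffold supplies external memory and a loop, the combination is a Turing machine. A toy sketch deciding 0^n 1^n - context-free, so beyond what the hierarchy diagram above says a bare transformer can learn - where the transition table is a standard textbook construction, not anything from the quoted thread:

```python
# The "model": a fixed map (state, symbol) -> (state, symbol_to_write, head_move).
# Marks the leftmost 0 as X, the matching rightmost-unmarked 1 as Y, and repeats.
DELTA = {
    ("q0", "0"): ("q1", "X", +1), ("q0", "Y"): ("q3", "Y", +1), ("q0", "_"): ("acc", "_", 0),
    ("q1", "0"): ("q1", "0", +1), ("q1", "Y"): ("q1", "Y", +1), ("q1", "1"): ("q2", "Y", -1),
    ("q2", "0"): ("q2", "0", -1), ("q2", "Y"): ("q2", "Y", -1), ("q2", "X"): ("q0", "X", +1),
    ("q3", "Y"): ("q3", "Y", +1), ("q3", "_"): ("acc", "_", 0),
}

def accepts(s, max_steps=10_000):
    """The scaffold: an external tape plus a loop of single-step calls."""
    tape, head, state = dict(enumerate(s)), 0, "q0"
    for _ in range(max_steps):
        if state == "acc":
            return True
        key = (state, tape.get(head, "_"))
        if key not in DELTA:  # no transition defined: reject
            return False
        state, tape[head], move = DELTA[key]
        head += move
    return False
```

The per-step function is fixed and finite - the unbounded memory and iteration live entirely in the scaffold - which is why hard limits proved for a single forward pass don't directly bound scaffolded use.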
I should also add that I don't find all the predictions implausible in the original piece - inference time will definitely go down, model copying and parallelization is already happening, as is multimodal training. I just don't buy the superhuman capabilities.
(Also not convinced that multimodal training buys that much - more tasks will become automatable, but I don't think there's reason to expect synergistic increase in capabilities. And PaLM-E was pretty underwhelming...)
Quote Tweet
Okay I skimmed this paper, and I don't think this is really a scary / unexpected advance in LLM capabilities, mostly just clever integration. twitter.com/DannyDriess/st…
LLMs don't have to mean transformers though. Seems like there is a lot of research effort going into finding better architectures, and 7 years is a decent chunk of time to find them.
Yup - I think predictions should make that clear though! Then they're based not just on the scaling hypothesis but also on algorithmic advances - which the post doesn't base its predictions on, as far as I can tell.