"just engineering" 😂😭
Susan Zhang
@suchenzang
Research Engineer. Past: …, etc. Generally found using an excessive amount of compute.
San Francisco, CA · Joined April 2014
Susan Zhang’s Tweets
Now apply this to annual performance reviews on a bell curve in any industry research lab.
(No good solution under any implicitly zero-sum situation...)
Quote Tweet
I think awards are, at best, pointless, and at worst harmful.
Some thoughts in this thread; feel free to share your own (in either direction). 1/n twitter.com/thegautamkamat…
We’re awarding prizes to 7/48 submissions to the Inverse Scaling Prize Round 2! Tasks show inverse scaling on models, often even after training with human feedback. Details at irmckenzie.co.uk/round2 and 🧵 on winners:
"Transformers were invented at Google and OpenAI just scaled them up" is such a terrible take.
The transformer was a breakthrough, but the science and engineering required to scale up correctly is equally important. 1/
At this point, as long as benchmarks are not "exhaustive", it's impossible to claim "state-of-the-art" in any meaningful way. Perhaps we can amend SOTA to mean maximal performance on a maximal number of "domains", but with exhaustive prompt-tuning, that's probably a stretch too. 8/
In the Codex intro, they mention that what was surprising about GPT-3 was its ability to generate code _without even intending for the model to do so_. This requires sufficient model capacity to find and home in on, which is what scale enables. 7/8
So can smaller models "outperform" larger ones with more data? Of course. Just look at Codex (arxiv.org/pdf/2107.03374), which fine-tunes small models on GitHub and handily beats all GPT models (175B too).
But citing that result completely misses the point of (model) scale. 6/8
Not to mention all the other formatting tricks for where to add newlines/whitespace, etc.
Section 8.2 in Stanford's HELM paper (arxiv.org/pdf/2211.09110) explains this perfectly: they found the "best prompt" for a (model, task) pair to not be consistent across models. 🤯 5/8
So here's a job for a prompt-engineer: convert a pronoun-disambiguation task into a "Final Exam with Answer Key" task. Now the models are going to be purring with eagerness to respond with the right answers. 😑
Oh, and note the "=" separating out the instructions on top. 4/8
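As a purely hypothetical illustration of that reframing trick (this is not necessarily the exact Appendix G template, and the passage here is just the classic trophy/suitcase example), here's what taking a plain pronoun-disambiguation question and dressing it up as an exam looks like:

```python
# Hypothetical illustration only -- not necessarily the exact Appendix G template.
sentence = "The trophy didn't fit in the suitcase because it was too big."
pronoun = "it"

# The "plain" framing: just ask the question.
plain_prompt = f"{sentence}\nWhat does '{pronoun}' refer to?"

# The "exam" framing: wrap the same example in instructions plus a separator,
# cueing the model to behave as if it's filling in an answer key.
exam_prompt = (
    "Final Exam with Answer Key\n"
    "Instructions: Please carefully read the passage and determine what the "
    "marked pronoun refers to.\n"
    "=====\n"
    f"Passage: {sentence}\n"
    f"Question: What does '{pronoun}' refer to?\n"
    "Answer:"
)

print(exam_prompt)
```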
Next up: formatting. Why does CB get prompted with true/false, but RTE with True/False?
Why does WebQA use "Q/A", WiC use "question/answer", and ARC use "Question/Answer"?
Could it be... that you simply get better results switching it up? 🤔
It just keeps going... 3/8
First up: BoolQ. If you download the actual benchmark, it's true/false completions. GPT-3 swaps in yes/no instead. Why? Well when we did the same swap to yes/no, we saw a +10% accuracy jump on this benchmark.
Wonderful. Clearly on track for a better model already. 2/8
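To make the mechanics concrete, here's a minimal sketch of rank-classification scoring for a BoolQ-style example. It uses gpt2 from Hugging Face as a stand-in model and a made-up passage; it is not the actual GPT-3 eval setup, just the general recipe. The "prediction" is whichever candidate string gets the higher log-probability, so swapping the candidate strings (true/false vs yes/no) can move the reported accuracy.

```python
# Minimal sketch of rank-classification scoring for a BoolQ-style example.
# gpt2 is a stand-in model and the passage is made up; this is NOT the
# GPT-3 eval harness, just the general scoring recipe.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def completion_logprob(prompt: str, completion: str) -> float:
    """Sum of log-probs the model assigns to `completion` given `prompt`."""
    full = tok(prompt + completion, return_tensors="pt").input_ids
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    with torch.no_grad():
        logits = model(full).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = full[0, 1:]
    # Only score the completion's tokens, not the prompt's.
    return sum(
        logprobs[i, targets[i]].item()
        for i in range(prompt_len - 1, full.shape[1] - 1)
    )

prompt = (
    "passage: Persian cats are a long-haired breed of cat.\n"
    "question: are persian cats long haired?\n"
    "answer:"
)

# Same example, two different label vocabularies -- the prediction is just
# an argmax over whichever strings you happened to pick as labels.
for labels in [(" yes", " no"), (" true", " false")]:
    scores = {label: completion_logprob(prompt, label) for label in labels}
    print(labels, "->", max(scores, key=scores.get))
```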
Piling on to the pile-on (sorry - it's always easy to criticize 😛), here's a rant about benchmarks for LLMs that are used to back claims of "stronger" or "better" models.
Let's start with a tour through GPT-3's Appendix G... 1/8
Quote Tweet
The thread is mostly skeptical that any of this work will extrapolate in practice, but that was why we trained Chinchilla 70B. It's not a perfect comparison, but it was trained on the same arch, codebase, and dataset as Gopher 280B for the same compute --- and it was a lot better. 2/
Seeing a bit of a Chinchilla pile-on from this thread. The 'train smaller models longer' paper. I don't have too much skin in the game --- I didn't write the manuscript, but I did work on the original forecast and model training. There seem to be a few misconceptions. 1/
Quote Tweet
After ignoring the details in all these "lets-fit-a-cloud-of-points-to-a-single-line" papers (all likely wrong when you really extrapolate), @stephenroller finally convinced me to work through the math in the Chinchilla paper and as expected, this was a doozy. [1/7] twitter.com/stephenroller/…
While I fully agree that we should be training these LLMs on more data, the above seems to be a particularly misleading claim about _how much more data_ we should be aiming for.
And then hiding everything in a log-log graph is just...😡 [7/end rant]
Not to mention this insane point-cloud of a plot for their Figure 1 (aka main result), which draws 3 lines, one for each of the "Approaches", to claim that all 3 yield "mostly similar" results.
Since when did a 10x difference become "mostly similar" in the literature? [6/7]
...anything from a 28B model (on 2.5T tokens) to a 260B model (on 270B tokens) between their 3 "Approaches".
That's an unhelpful order of magnitude difference in how large of a model you should be training in order to be considered "compute optimal" 😐. [5/7]
... you have to go all the way into their Appendix D.2 to see some concrete numbers given in equation (10).
Combine with equation (4), plug in some value for C (say, something like 4.30E+23, the total FLOPs budget of OPT-175B), and we get... [4/7]
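To make that plug-in step concrete, here's a rough sketch of the calculator, using the closed form of equation (4) with the Appendix D.2 / equation (10) constants as I read them (A ≈ 406.4, B ≈ 410.7, α ≈ 0.34, β ≈ 0.28). Treat both the constants and the FLOPs figure as approximations, not authoritative numbers.

```python
# Rough "Chinchilla calculator" sketch: equation (4)'s closed form with the
# fitted constants from Appendix D.2 / equation (10), as I read them.
# All numbers here are approximate transcriptions, not authoritative values.
A, B = 406.4, 410.7          # fitted loss coefficients
alpha, beta = 0.34, 0.28     # fitted exponents on N and D

a = beta / (alpha + beta)                               # exponent for N_opt
b = alpha / (alpha + beta)                              # exponent for D_opt
G = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))  # scalar coefficient

C = 4.30e23  # total training FLOPs, roughly OPT-175B's budget

N_opt = G * (C / 6) ** a           # "compute-optimal" parameter count
D_opt = (1.0 / G) * (C / 6) ** b   # "compute-optimal" token count

print(f"N_opt ~ {N_opt / 1e9:.0f}B parameters")
print(f"D_opt ~ {D_opt / 1e12:.1f}T tokens")
# With these constants this lands around ~28B params on ~2.5T tokens --
# the low end of the range quoted above.
```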
So then you naturally start wondering what A/B/a/b could be. First stop: (a,b) is set to different values for 3 different "Approaches" in Table 2, each seeming to differ by just a hair: (0.5,0.5) vs (0.49,0.51) vs (0.46,0.54). Ok, sure, why not.
Now for A,B... [3/7]
First thing to make me eye-roll a bit was this fancy equation (4) that seems to re-parameterize the key exponent terms (a,b) into (alpha,beta) to define a coefficient term G. Why this level of indirection just to define a scalar-coefficient? No idea. [2/7]
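For reference, here's equation (4) as I remember it (double-check the paper for the exact statement): the fitted exponents (α, β) and coefficients (A, B) get repackaged into the exponents (a, b) and the scalar G.

```latex
N_{\mathrm{opt}}(C) = G \left(\frac{C}{6}\right)^{a},
\qquad
D_{\mathrm{opt}}(C) = \frac{1}{G} \left(\frac{C}{6}\right)^{b},
\qquad \text{where} \quad
G = \left(\frac{\alpha A}{\beta B}\right)^{\frac{1}{\alpha + \beta}},
\quad
a = \frac{\beta}{\alpha + \beta},
\quad
b = \frac{\alpha}{\alpha + \beta}.
```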
After ignoring the details in all these "lets-fit-a-cloud-of-points-to-a-single-line" papers (all likely wrong when you really extrapolate), @stephenroller finally convinced me to work through the math in the Chinchilla paper and as expected, this was a doozy. [1/7]
Quote Tweet
Have any of you ever set up a calculator of the Chinchilla equations and plugged in their upper and lower error bars? It’s pretty interesting.
I'm excited to present: Scaling Laws for Generative Mixed-Modal Language Models. In this paper we explore the scaling properties of mixed-modal generative models, discovering new scaling laws that unify the contributions of individual modalities and the interactions between them.
Even after all that fancy RLHF work into aligning with "values", this seems like an unfortunate side-effect.
Maybe with all the data ChatGPT is collecting right now, it can RL its way out of the RL hole it trained right into.
Quote Tweet
But HELM measures not just accuracy, but also calibration, robustness, fairness, efficiency, bias, toxicity. On fairness, text-davinci-003 outperforms text-davinci-002, but on calibration, bias, and toxicity, both OpenAI models are much worse than other models.
And a huge shoutout to … for bringing together …, …, and me to organize this workshop (and doing the lion's share of the work 🙏🙏)!
In a world where compute resources are bounded, this line of research will enable us to make the most of our collective compute footprint in the field - a lesson hopefully generalizable to many domains!
Can't wait to see all the submissions to this #ICLR2023 workshop!
Quote Tweet
Announcing the #ICLR2023 workshop on "Reincarnating Reinforcement Learning".
Have you ever wondered why we almost always train RL agents from scratch? Our @iclr_conf workshop instead focuses on reusing prior computational work in RL. See reincarnating-rl.github.io for details.
lmao no transformers or attention layers at all
incredibly telling
Quote Tweet
Wonder which research direction will end up with the most compute in all of the major labs...
Quote Tweet
All research groups will somehow scale to consume all available GPUs
We are apparently all bad at poetry, coding, arithmetic, and drawing...
Quote Tweet
I don't want human intelligence, I want super-human performance in tasks humans are bad at. RL seems a totally fine tool for this.
"Multimodal" means nothing without a language interface for "control"
Quote Tweet
🫣
Quote Tweet
So while FTX's onboarding process was easier than gmail in 2004, I made burner accounts that I would loop my capital through and continue to extract value from SBF. After they changed the strat to frontrun their own rebalancing (wonder with whose funds), I would frontrun their (15/x)
Watching my friend gallop around on her white tiger in the latest #AssassinsCreedValhalla and can't stop gawking at how beautiful this game is. Even the shrubs are pretty here...
We use higher order optimization in production for neural networks.
Train large, then compress
Quote Tweet
Google employee reports LLMs need a 10x inference cost decrease to be deployed at scale given infinitesimal ad revenue per search
@moneyincineratingVCs wya
Found in Stanford's huge HELM benchmark - in 5-shot, OPT-175B is well above the original 2020 GPT-3's performance, and BLOOM is just behind despite being trained on only 30% English. Give it time, and there's nothing open-source can't catch up to.
another variant on meritocracy myth: i used to think i could get ahead of the disrespect & misogyny i experienced as a woman in tech by working really hard and getting really good at what i do. HA now i know that my being competent makes insecure men even more resentful and mean.
Post NeurIPS recovery checklist:
✅ eat a giant 🌯
✅ pet a chubby 🐰 in a bunny stroller
✅ order a dozen 🍪 from uber eats
✅ make sure 📉 are still alive
✅ take a 😴
✅ lay in 🛌 and scroll twitter
10/10 would recommend
A recording of the full session of exchanging model training horror stories can be found at neurips.cc/virtual/2022/w with a panel discussion at the very end (starting at the 7:01:27 mark)!
Now for that 6am flight out of here...
#NeurIPS2022 #aiburningman
A huge thanks to the HITY (Has It Trained Yet) workshop organizers (…) for bringing … and me together for a day of commiserating on "graduate student descent"!
just over here watching all the LLM/AGI startups trade researchers like Pokémon
#NeurIPS2022