Susan Zhang
@suchenzang
Research Engineer · Past: …, etc. Generally found using an excessive amount of compute.
San Francisco, CA · Joined April 2014

Susan Zhang’s Tweets

"just engineering" 😂😭
Quote Tweet
AI startups actually have a very strong moat. We simply refer to it as "python virtual environment" and "cuda version". Most of your competitors will give up before they get past this step.
Now apply this to annual performance reviews on a bell curve in any industry research lab. (No good solution under any implicitly zero-sum situation...)
Quote Tweet
I think awards are, at best, pointless, and at worst harmful. Some thoughts in this thread 🧵, feel free to share your own (in either direction). 1/n twitter.com/thegautamkamat…
At this point, as long as benchmarks are not "exhaustive", it's impossible to claim "state-of-the-art" in any meaningful way. Perhaps we can amend SOTA to mean maximal performance on a maximal number of "domains", but with exhaustive prompt-tuning, that's probably a stretch too. 8/
In the Codex intro, they mention that what was surprising about GPT-3 was its ability to generate code _without even intending for the model to do so_. This requires sufficient capacity in a model to find and hone in on, which is what scale enables. 7/8
So here's a job for a prompt-engineer: convert a pronoun-disambiguation task into a "Final Exam with Answer Key" task. Now the models are going to be purring with eagerness to respond with the right answers. 😑 Oh, and note the "=" separating out the instructions on top. 4/8
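To make the complaint concrete, here is a rough reconstruction of that "exam" dressing. The exact wording is approximate (reconstructed from memory of GPT-3's Appendix G, so treat the header and question phrasing as illustrative, not verbatim):

```python
# Sketch of wrapping a pronoun-disambiguation example in a
# "Final Exam with Answer Key" frame before the model sees it.
# The "=" divider below is the separator the tweet points out.

def exam_prompt(passage: str, pronoun: str) -> str:
    """Dress a Winograd-style example up as an exam question."""
    header = (
        "Final Exam with Answer Key\n"
        "Instructions: Please carefully read the following passages. "
        "For each passage, you must identify which noun the pronoun "
        "marked in *bold* refers to.\n"
        "=====\n"
    )
    body = (
        f"Passage: {passage}\n"
        f'Question: In the passage above, what does the pronoun "*{pronoun}*" refer to?\n'
        "Answer:"
    )
    return header + body

print(exam_prompt(
    "The trophy doesn't fit into the brown suitcase because *it* is too large.",
    "it",
))
```

The model then completes the text after "Answer:", so the entire disambiguation task rides on how convincingly the template mimics an answer key.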
Next up: formatting. Why does CB get prompted for true/false and RTE with True/False? Why does WebQA use "Q/A", WiC use "question/answer", and ARC use "Question/Answer"? Could it be... that you simply get better results switching it up? 🤔 It just keeps going... 3/8
First up: BoolQ. If you download the actual benchmark, it's true/false completions. GPT-3 swaps in yes/no instead. Why? Well, when we did the same swap to yes/no, we saw a +10% accuracy jump on this benchmark. Wonderful. Clearly on track for a better model already. 2/8
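A minimal sketch of the label swap being described. The passage, question, and prompt template here are made-up stand-ins; the point is only that the dataset ships true/false completions while the evaluated prompt scores yes/no:

```python
# Render one BoolQ-style example as two candidate completions,
# once with the benchmark's native labels and once with the swapped ones.

def format_boolq(passage: str, question: str, answer_words: list[str]) -> list[str]:
    """Return one candidate completion string per answer word."""
    prompt = f"{passage}\nquestion: {question}?\nanswer:"
    return [f"{prompt} {w}" for w in answer_words]

example = (
    "The aurora is visible at high latitudes.",   # hypothetical passage
    "can you see the aurora near the equator",    # hypothetical question
)

native = format_boolq(*example, ["true", "false"])  # what the benchmark ships
swapped = format_boolq(*example, ["yes", "no"])     # what the paper evaluates

# Evaluation then picks the argmax of model log-likelihood over the
# candidates; the thread's point is that this label swap alone was
# worth roughly +10% accuracy.
```

Nothing about the task changes between `native` and `swapped`, only the surface form of the answer vocabulary.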
Piling on to the pile-on (sorry - it's always easy to criticize 😛), here's a rant about benchmarks for LLMs that are used to back claims of "stronger" or "better" models. Let's start with a tour through GPT-3's Appendix G... 1/8
Quote Tweet
The thread is mostly skeptical that any of this work will extrapolate in practice, but that was why we trained chinchilla 70B. It's not a perfect comparison, but trained on the same arch, codebase, dataset as gopher 280B for the same compute --- and it was a lot better 2/
Seeing a bit of a chinchilla pile-on from this thread. The 'train smaller models longer' paper. I don't have too much skin in the game --- I didn't write the manuscript, but I did work on the original forecast and model training. There seem to be a few misconceptions 1/
Quote Tweet
After ignoring the details in all these "lets-fit-a-cloud-of-points-to-a-single-line" papers (all likely wrong when you really extrapolate), @stephenroller finally convinced me to work through the math in the Chinchilla paper and as expected, this was a doozy. [1/7] twitter.com/stephenroller/…
While I fully agree that we should be training these LLMs on more data, the above seems to be a particularly misleading claim on _how much more data_ we should be aiming for. And then hiding everything in a log-log graph is just...😡 [7/end rant]
Not to mention this insane point-cloud of a plot for their Figure 1 (aka main result) that draws 3 lines for each of the "Approaches" to claim that all 3 yield "mostly similar" results. Since when did a 10x difference become "mostly similar" in the literature? [6/7]
...anything from a 28B model (on 2.5T tokens) to a 260B model (on 270B tokens) between their 3 "Approaches". That's an unhelpful order of magnitude difference in how large of a model you should be training in order to be considered "compute optimal" 😐. [5/7]
... you have to go all the way into their Appendix D.2 to see some concrete numbers, given in equation (10). Combine with equation (4) and plug in some value for C (say, 4.30E+23, the total FLOPs budget of OPT-175B), and we get... [4/7]
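You can redo this plug-in yourself. The sketch below uses the Approach 3 fitted constants as I recall them from the paper (A=406.4, B=410.7, alpha=0.34, beta=0.28) together with the C ≈ 6ND approximation; the coefficient-of-roughly-1 for Approach 1 is my own simplifying assumption to show where the other end of the thread's 28B-to-260B range comes from:

```python
# Chinchilla-style compute-optimal allocation:
#   N_opt = G * (C/6)^a,  D_opt = (1/G) * (C/6)^b
# with a = beta/(alpha+beta), b = alpha/(alpha+beta), and the
# scalar coefficient from equation (4):
#   G = (alpha*A / (beta*B)) ** (1/(alpha+beta))

A, B = 406.4, 410.7         # fitted loss coefficients (Approach 3)
alpha, beta = 0.34, 0.28    # fitted exponents (Approach 3)

a = beta / (alpha + beta)   # ~0.46 -> matches Approach 3's (a,b) in Table 2
b = alpha / (alpha + beta)  # ~0.54
G = (alpha * A / (beta * B)) ** (1 / (alpha + beta))

C = 4.30e23                 # total training FLOPs of OPT-175B
N_opt = G * (C / 6) ** a            # "optimal" parameter count
D_opt = (1 / G) * (C / 6) ** b      # "optimal" token count

print(f"Approach 3: ~{N_opt/1e9:.0f}B params on ~{D_opt/1e12:.1f}T tokens")

# Approach 1 reports (a,b) = (0.5, 0.5); assuming a coefficient near 1:
N1 = (C / 6) ** 0.5
print(f"Approach 1 (coef ~1): ~{N1/1e9:.0f}B params on ~{(C/6)/N1/1e9:.0f}B tokens")
```

At this compute budget, Approach 3 lands near 28B parameters on ~2.5T tokens while the Approach 1 end lands near 260B parameters on ~270B tokens, which is the order-of-magnitude spread the next tweets complain about.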
So then you naturally start wondering what A/B/a/b could be. First stop: (a,b) is set to different values for 3 different "Approaches" in Table 2, each seeming to differ by just a hair: (0.5,0.5) vs (0.49,0.51) vs (0.46,0.54). Ok, sure, why not. Now for A,B... [3/7]
First thing to make me eye-roll a bit was this fancy equation (4) that seems to re-parameterize the key exponent terms (a,b) into (alpha,beta) to define a coefficient term G. Why this level of indirection just to define a scalar coefficient? No idea. [2/7]
After ignoring the details in all these "lets-fit-a-cloud-of-points-to-a-single-line" papers (all likely wrong when you really extrapolate), @stephenroller finally convinced me to work through the math in the Chinchilla paper and as expected, this was a doozy. [1/7]
Quote Tweet
Have any of you ever set up a calculator of the Chinchilla equations and plugged in their upper and lower error bars? It’s pretty interesting.
I'm excited to present: Scaling Laws for Generative Mixed-Modal Language Models. In this paper we explore the scaling properties of mixed-modal generative models, discovering new scaling laws that unify the contributions of individual modalities and the interactions between them.
Quote Tweet
Scaling Laws for Generative Mixed-Modal Language Models abs: arxiv.org/abs/2301.03728
Even after all that fancy RLHF work into aligning with "values", this seems like an unfortunate side-effect. Maybe with all the data ChatGPT is collecting right now, it can RL its way out of the RL hole it trained right into.
Quote Tweet
But HELM measures not just accuracy, but also calibration, robustness, fairness, efficiency, bias, toxicity. On fairness, text-davinci-003 outperforms text-davinci-002, but on calibration, bias, and toxicity, both OpenAI models are much worse than other models.
In a world where compute resources are bounded, this line of research will enable us to make the best use of our collective compute footprint in the field - a lesson hopefully generalizable to many domains! Can't wait to see all the submissions to this #ICLR2023 workshop!
Quote Tweet
Announcing the #ICLR2023 workshop on "Reincarnating Reinforcement Learning". Have you ever wondered why we almost always train RL agents from scratch? Our @iclr_conf workshop instead focuses on reusing prior computational work in RL. See reincarnating-rl.github.io for details.
🫣
Quote Tweet
So while FTX's onboarding process was easier than gmail in 2004, I made burner accounts that I would loop my capital through and continue to extract value from SBF. After they changed the strat to frontrun their own rebalancing (wonder whose funds), I would frontrun their (15/x)
another variant on meritocracy myth: i used to think i could get ahead of the disrespect & misogyny i experienced as a woman in tech by working really hard and getting really good at what i do. HA now i know that my being competent makes insecure men even more resentful and mean.
What's actually dangerous isn't a toxic model but a lack of "AI literacy" in general: open-source will be critical to educating all on the limitations (and potential!) of this technology.
Quote Tweet
The future of AI must be open-source. twitter.com/elonmusk/statu…
Post NeurIPS recovery checklist:
- eat a giant 🌯
- pet a chubby 🐰 in a bunny stroller
- order a dozen 🍪 from uber eats
- make sure 📉 are still alive
- take a 😴
- lay in 🛌 and scroll twitter
10/10 would recommend