A simple overview of the state of massive language models like GPT-3.
/thread
Since 2018, each year has brought new models that are typically 10x+ larger than models from the year prior.
2018:
GPT-1 | 110M Parameters
BERT | 340M Parameters
2019:
GPT-2 | 1.5B Parameters
Megatron | 8.3B Parameters
2020:
Turing-NLG | 17B Parameters
GPT-3 | 175B Parameters
2021:
Google Switch | 1.6T Parameters
What is coming next?
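To make the growth rate concrete, here is a quick tally of the sizes listed above, in a short Python sketch. It takes the largest model quoted for each year and prints the year-over-year ratio; the parameter counts are exactly the ones in the list, and the "largest per year" pairing is the only assumption.

```python
# Largest model listed in the thread for each year (parameter counts as quoted above).
largest = {2018: 340e6, 2019: 8.3e9, 2020: 175e9, 2021: 1.6e12}

# Year-over-year growth factor of the largest listed model.
years = sorted(largest)
for prev, curr in zip(years, years[1:]):
    ratio = largest[curr] / largest[prev]
    print(f"{prev} -> {curr}: {ratio:.0f}x larger")
# 2018 -> 2019: 24x larger
# 2019 -> 2020: 21x larger
# 2020 -> 2021: 9x larger
```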
We’re at a point where these models are capable enough to perform many tasks. Optimization now becomes just as important as scaling up further.
Techniques like Mixture of Experts, PPLM, distillation, and random feature attention are all being actively researched.
These will reduce costs and compute needs, as well as improve the control developers have over large language models.
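As one illustration of the compute-saving direction, here is a minimal sketch of top-1 Mixture-of-Experts routing in PyTorch. It is not the Switch Transformer or any paper's implementation; the class name, layer sizes, and expert count are illustrative assumptions. The point it shows: each token activates only one expert's feed-forward weights, so parameter count can grow much faster than per-token compute.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    """Illustrative top-1 Mixture-of-Experts feed-forward layer (sizes are assumptions)."""
    def __init__(self, d_model=512, d_hidden=2048, n_experts=4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)   # gating network
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                             # x: (n_tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)      # routing probabilities
        top_prob, top_idx = gate.max(dim=-1)          # pick one expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            routed = top_idx == i                     # tokens sent to expert i
            if routed.any():
                out[routed] = top_prob[routed].unsqueeze(1) * expert(x[routed])
        return out

tokens = torch.randn(8, 512)
print(TinyMoE()(tokens).shape)                        # torch.Size([8, 512])
```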
The largest models (GPT-3, Turing-NLG, etc.) already hold a lot of knowledge and capability. The question is: how do we retrieve that knowledge more effectively, reliably, and systematically?
As answers to this question become clearer, language models will become more useful.
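One common way to retrieve that knowledge today is few-shot prompting: show the model a handful of worked examples and let it complete the pattern. A hedged sketch follows; the task, examples, and Q/A format are placeholders I chose for illustration, and the resulting string would be sent to whatever text-completion API you have access to.

```python
# Build a few-shot prompt for a text-completion model such as GPT-3.
# The examples and task here are made up for illustration.
examples = [
    ("The capital of France is", "Paris"),
    ("The capital of Japan is", "Tokyo"),
]
query = "The capital of Canada is"

prompt = "\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
prompt += f"\nQ: {query}\nA:"

print(prompt)
# Q: The capital of France is
# A: Paris
# Q: The capital of Japan is
# A: Tokyo
# Q: The capital of Canada is
# A:
```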
In this paper from OpenAI: cdn.openai.com/papers/ai_and_
“We argue that algorithmic progress has an aspect that is both straightforward to measure and interesting: reductions over time in the compute needed to reach past capabilities.”
We’re seeing algorithmic efficiency doubling every 16 months.
By the end of 2021, it will cost around half of what it cost in early 2020 to train a GPT-3-sized model.
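To make that arithmetic explicit: a 16-month doubling means the compute needed to reach a fixed capability falls by a factor of 2^(months/16). The elapsed-time figure below (early 2020 to end of 2021, roughly 21 months) is my approximation, and it lands in the same ballpark as the "around half" estimate above.

```python
# Compute reduction implied by algorithmic efficiency doubling every 16 months.
DOUBLING_MONTHS = 16
months_elapsed = 21   # early 2020 to end of 2021, approximately

reduction = 2 ** (months_elapsed / DOUBLING_MONTHS)
print(f"~{reduction:.1f}x less compute, i.e. ~{100 / reduction:.0f}% of the early-2020 cost")
# ~2.5x less compute, i.e. ~40% of the early-2020 cost
```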
Great 🧵! Are there datasets that illustrate the 16 month phenomenon anywhere?

