Conversation

2018: GPT-1 | 110M parameters; BERT | 340M parameters
2019: GPT-2 | 1.5B parameters; Megatron | 8.3B parameters
2020: Turing-NLG | 17B parameters; GPT-3 | 175B parameters
2021: Google Switch | 1.6T parameters

What is coming next?
We’re at a point where these models are capable enough to perform many tasks. Optimization is now becoming just as important as scaling up further.
Techniques like Mixture of Experts, PPLM, distillation, and random feature attention are all being actively researched. They will reduce costs and compute requirements while also improving the control developers have over large language models.
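
To make one of those concrete: knowledge distillation trains a small "student" model to imitate a larger "teacher". Below is a minimal sketch of the standard distillation loss, assuming PyTorch and logits from hypothetical teacher and student models; it illustrates the general technique, not any specific system named above.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend a soft teacher-matching loss with the usual hard-label loss."""
    # Soft targets: KL divergence between temperature-softened distributions,
    # scaled by T^2 so gradients keep a comparable magnitude.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: standard cross-entropy against the true labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```

The student keeps much of the teacher's behavior at a fraction of the inference cost, which is the "optimize costs and reduce compute" point above.
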
The largest models (GPT-3, Turing-NLG, etc.) already have lots of knowledge and capabilities. The question is, how do we more effectively, reliably, and systematically retrieve that knowledge? As answers to this question become clearer, language models will become more useful.
We’re seeing algorithmic efficiency doubling every 16 months. By the end of 2021, it will cost around half of what it cost in early 2020 to train a GPT-3-sized model.
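
As a rough back-of-the-envelope illustration of that compounding (my own arithmetic, not a figure from the conversation): if algorithmic efficiency doubles every 16 months, the relative cost of training a fixed-size model falls as 0.5 raised to (months elapsed / 16).

```python
# Illustrative arithmetic only: relative training cost under a 16-month
# algorithmic-efficiency doubling time.
def relative_cost(months_elapsed: float, doubling_months: float = 16.0) -> float:
    """Fraction of the original training cost after `months_elapsed` months."""
    return 0.5 ** (months_elapsed / doubling_months)

print(relative_cost(16))            # 0.5 -- cost halves after one doubling period
print(round(relative_cost(22), 2))  # ~0.39 -- early 2020 to late 2021, a bit under half
```
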