The AI explosion is warping our sense of time. Can you believe Stable Diffusion is only 4 months old, and ChatGPT <4 weeks old 🤯? If you blink, you miss a whole new industry. Here are my TOP 10 AI spotlights, from a breathtaking 2022 in rewind ⏮: a long thread 🧵
🎉No. 1: Text -> Image
DALLE-2 was the first large-scale diffusion model that can generate realistic, high-res images from an arbitrary caption. It kickstarted the AI4art revolution that spawned many new applications, startups, and ways of thinking.
openai.com/dall-e-2/
1.1/
But DALLE-2 is behind OpenAI’s walled garden. Researchers at LMU and their collaborators took the heroic step of training their own internet-scale text2image model, based on the “latent diffusion” algorithm. They called the model “Stable Diffusion” and open-sourced the code & weights.
1.2/
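For readers new to diffusion, here is a minimal sketch of the forward noising step that a diffusion model learns to reverse; latent diffusion applies it to a compressed latent rather than to raw pixels. The linear beta schedule, the 1000-step count, and the 4-number toy "latent" are illustrative assumptions, not Stable Diffusion's actual configuration.

```python
import math
import random

def alpha_bar(t, num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_s) for s = 0..t, using an assumed linear schedule."""
    prod = 1.0
    for s in range(t + 1):
        beta = beta_start + (beta_end - beta_start) * s / (num_steps - 1)
        prod *= 1.0 - beta
    return prod

def add_noise(latent, t, rng):
    """Sample x_t = sqrt(a_bar) * x_0 + sqrt(1 - a_bar) * eps, with eps ~ N(0, 1)."""
    a = alpha_bar(t)
    noise = [rng.gauss(0.0, 1.0) for _ in latent]
    noisy = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(latent, noise)]
    return noisy, noise

rng = random.Random(0)
latent = [0.5, -1.0, 0.25, 2.0]  # toy stand-in for an encoded image (hypothetical values)
noisy, noise = add_noise(latent, t=500, rng=rng)
```

The larger `t` is, the smaller `alpha_bar(t)` gets and the closer `noisy` is to pure Gaussian noise; training teaches a network to predict `noise` from `noisy`.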
There were 2 other text2image models from this year. Neither released the model or an API to play with, but the papers still had interesting insights.
1. Imagen: imagen.research.google
2. Parti: parti.research.google. A transformer model without diffusion!
1.4/
🎉No. 2: Text -> Text
Well, that’s an easy guess - ChatGPT! The only app in history that gained 1 million users in 5 days.
ChatGPT has brought out our human creativity as well. I refer you to this list of useful & imaginative ChatGPT ideas: github.com/f/awesome-chat
2.1/
ChatGPT & GPT-3.5 both use a new technology called RLHF (“Reinforcement Learning from Human Feedback”). Its profound implication is that prompt engineering will disappear very soon. See my deep dive 🧵 on this topic:
2.2/
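To make RLHF a bit more concrete: one common ingredient is a reward model trained on pairs of replies that humans have ranked. Below is a minimal sketch of the standard pairwise preference loss used for that step. The scores are made-up numbers, and this is the textbook formulation, not a claim about ChatGPT's exact training recipe.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)): small when the preferred reply scores higher."""
    margin = reward_chosen - reward_rejected
    return math.log1p(math.exp(-margin))

# Hypothetical reward-model scores for two replies to the same prompt.
agrees = preference_loss(2.0, -1.0)      # model ranks the human-preferred reply higher
disagrees = preference_loss(-1.0, 2.0)   # model ranks it lower, so the loss is larger
```

Minimizing this loss pushes the reward model to agree with human rankings; the language model is then tuned with reinforcement learning to produce replies that the reward model scores highly.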
ChatGPT’s popularity spawned a wave of new startups and competitors, notably Jasper Chat, YouChat, and Ghostwriter chat. Some of them offer much more intuitive ways to do search, so Google execs are getting sweaty! Here is a nice thread:
2.3/
Quote Tweet
Publicly announced ChatGPT variants and competitors: a thread
In October, my coauthors and I took a step towards building a “Robot GPT” that takes in any *mixture* of text, images, and videos as prompt, and outputs robot motor control! Our model is called VIMA (“VisuoMotor Attention”) and has been fully open-sourced 🧵:
3.2/
Quote Tweet
We trained a transformer called VIMA that ingests *multimodal* prompt and outputs controls for a robot arm. A single agent is able to solve visual goal, one-shot imitation from video, novel concept grounding, visual constraint, etc. Strong scaling with model capacity and data!
Along a similar path as VIMA, researchers from Google Robotics announced RT-1, a robot transformer trained on 700 tasks and 130K human demonstrations. These data were collected by a literal *Iron Fleet* - 13 robots over 17 months!
I also wrote a deep dive 🧵 for RT-1:
3.3/
Quote Tweet
ChatGPT can plan a theme party, but can it clean up your messy house after the party? Sadly no. What might a GPT for robotics look like? My friends at @Google Robotics just announced RT-1, a Transformer with eyes, arms & wheels! How does it read, see & act? Let’s dive in:
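A quick back-of-the-envelope on the RT-1 data collection numbers quoted above (700 tasks, 130K demonstrations, 13 robots, 17 months). It assumes the work was split evenly across robots, months, and tasks, which is a simplification for illustration and not something the announcement states.

```python
demos, tasks, robots, months = 130_000, 700, 13, 17

# Assumes an even split, purely for a sense of scale.
demos_per_robot = demos / robots                    # demonstrations per robot
demos_per_robot_month = demos_per_robot / months    # per robot, per month
demos_per_task = demos / tasks                      # average demonstrations per task

print(f"{demos_per_robot:,.0f} demos per robot")
print(f"{demos_per_robot_month:,.0f} demos per robot per month")
print(f"{demos_per_task:,.0f} demos per task on average")
```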

Make-A-Video: Text-to-Video Generation without Text-Video Data. From Meta AI.
You can sign up for trial access here: makeavideo.studio
Paper: arxiv.org/abs/2209.14792
4.2/
Quote Tweet
Yes, it’s here :) Zero-shot text-to-video generation!
Introducing Make-A-Video.
The new SOTA, by a large margin, in text-to-video generation. And it doesn’t use any paired text-video data!
Examples, paper, and a sign up sheet at makeavideo.studio @MetaAI
Imagen Video: High Definition Video Generation with Diffusion Models. From Google, a natural follow-up to the Imagen static image generator.
Demos: imagen.research.google/video/
Paper: arxiv.org/abs/2210.02303
4.3/
Phenaki: Variable Length Video Generation From Open Domain Textual Description. Also from Google.
Demos: phenaki.video
Paper: arxiv.org/abs/2210.02399
4.4/
🎉No. 5: Text -> 3D
From designing innovative products to creating stunning visual effects in movies & games, 3D modeling will be the next step in the creative AI field to materialize ideas from text. 2022 has seen a few primitive but promising 3D generative models!
5.1/
DreamFusion: Text-to-3D using 2D Diffusion. From Google. It builds on the NeRF algorithm, so the 3D model generated from a given text can be viewed from any angle, relit by arbitrary illumination, or composited into any 3D environment.
Project site: dreamfusion3d.github.io
5.2/
2 works from my colleagues at NVIDIA:
👉 GET3D: A Generative Model of High Quality 3D Textured Shapes Learned from Images. nv-tlabs.github.io/GET3D/
👉 Magic3D: High-Resolution Text-to-3D Content Creation. deepimagination.cc/Magic3D/
5.3/
Point-E: A System for Generating 3D Point Clouds from Complex Prompts. By OpenAI, a preliminary version of 3D DALLE!
Paper: arxiv.org/abs/2212.08751
5.4/
Quote Tweet
OpenAI just dropped a prototype of 3D DALLE (called “Point-E”). It isn’t as good as Google’s DreamFusion, but blazing fast! Like ~600x faster to generate. 2D DALLE has already turned the creative world upside down. How will 3D DALLE disrupt games, VR, metaverse, …?
My coauthors and I developed the first Minecraft AI that can solve many tasks given a *natural language prompt*. Our ultimate goal is to build an “Embodied ChatGPT”. We’ve fully open-sourced our development platform “MineDojo”. Deep dive 🧵 on our NeurIPS Outstanding Paper:
6.2/
Quote Tweet
GPT3 is powerful but blind. The future of Foundation Models will be embodied agents that proactively take actions, endlessly explore the world, and continuously self-improve. What does it take? In our NeurIPS Outstanding Paper “MineDojo”, we provide a blueprint for this future:
Concurrently, a team at OpenAI also announced a model called VPT (“Video Pre-Training”) that directly outputs keyboard and mouse actions. It can solve longer-horizon tasks, but is not language-conditioned. MineDojo & VPT complement each other!
openai.com/blog/vpt/
6.3/
🎉No. 7: AI learns to negotiate!
CICERO is the first AI agent to achieve human-level performance in Diplomacy, a strategy game that requires extensive natural language negotiation to cooperate & compete with humans. AI can now persuade and bluff effectively 😮!
7.1/
Quote Tweet
3 years ago my teammates and I set out toward a goal that seemed like science fiction: to build an AI that could strategically outnegotiate humans *in natural language* in Diplomacy. Today, I’m excited to share our Science paper showing we’ve succeeded!
twitter.com/MetaAI/status/…
Concurrently, DeepMind also announced their Diplomacy-playing agent. What would happen if CICERO played DeepMind’s AI? 🤔
dpmd.ai/diplomacy-natu
7.2/
🎉No. 8: Audio -> Text
OpenAI Whisper is a large Transformer that approaches human-level robustness & accuracy on English speech recognition. It’s trained on 680,000 hours of audio data from the web! Will Whisper unlock more text tokens to feed GPT-4? 8/
🎉No. 9: Nuclear Fusion control☢️
Researchers developed the first deep reinforcement learning system that can keep nuclear fusion plasma stable inside a tokamak (a device that uses a powerful magnetic field to confine plasma in a torus).
nature.com/articles/s4158
9.1/
NVIDIA also expanded the BioNeMo LLM framework to help biotech companies and researchers generate, predict, and understand biomolecular data.
blogs.nvidia.com/blog/2022/09/2
10.2/