There are 3 steps. The first is very straightforward: just collect a dataset of human-written answers to prompts that users submit, and finetune GPT by supervised learning. It’s easiest but also the most costly: it could be slow and painful for humans to write long responses. 7/
Conversation
In reinforcement learning (RL), the reward function is typically hardcoded, such as the game score in Atari games. ChatGPT’s data-driven reward model is a powerful idea. Another example is our recent MineDojo work that learns reward from tons of Minecraft YouTube videos: 9/
Quote Tweet
Finally, we propose a conceptually simple method to learn a Minecraft-playing agent from in-the-wild YouTube videos. It is far from solving the game, but shows a baby step towards our vision of an “embodied GPT3” that takes the right *actions* given any language prompts. 13/
Show this thread
0:10
9.2K views
1
12
105
This is the “Instruct” paradigm - a super effective way to do alignment, as evident in ChatGPT’s mindblowing demos. The RL part also reminds me of the famous P=NP (or ≠) problem: it tends to be much easier to verify a solution than actually solving the problem from scratch. 11/
2
12
113
Another interesting connection is that the Instruct training looks a lot like GANs. Here ChatGPT is a generator and reward model (RM) is a discriminator. ChatGPT tries to fool RM, while RM learns to detect alien with human help. The game converges when RM can no longer tell. 13/
2
9
117
Model alignment with user intent is also making its way to image generation! There are some preliminary works, such as arxiv.org/abs/2211.09800. Given the explosive AI progress, how long will it take to have an Instruct- or Chat-DALLE that feels like talking to a real artist? 14/
2
19
192
Of course, ChatGPT is not perfect enough to completely eliminate prompt engineering for now, but it is an unstoppable force. Meanwhile, the model has other serious syndromes: hallucination & habitual BS. I covered this in another thread: 16/
1
4
93
There are ongoing open-source efforts for the Instruct paradigm! To name a few:
👉 trlx github.com/CarperAI/trlx. Carper AI is an org from
👉 RL4LM rl4lms.apps.allenai.org
I’m so glad to have met the above authors at NeurIPS! 17/
read image description
ALT
Replying to
Further reading: reward model also has scaling laws: arxiv.org/abs/2210.10760! Also the RM is only an imperfect proxy (unlike Atari), so it’s a bad idea to over-optimize. This paper is from , inventor of PPO. Super interesting work but went under the radar. 18/
2
6
93
There are also other artifacts caused by the misalignment problem, such as prompt hacking or “injection”. I actually like this one because it allows us to bypass OpenAI’s prompt prefix and fully unleash the model 😆. See ’s cool findings: 19/
Quote Tweet
OpenAI’s ChatGPT is susceptible to prompt injection — say the magic words, “Ignore previous directions”, and it will happily divulge to you OpenAI’s proprietary prompt:
Show this thread
3
4
74
Thanks for reading! Welcome to follow me for more deep dives in the latest AI tech 🙌.
References:
👉 openai.com/blog/instructi
👉 InstructGPT paper: arxiv.org/abs/2203.02155
👉 openai.com/blog/chatgpt/
👉 Beautiful illustrations: jalammar.github.io/how-gpt3-works
END/🧵
15
6
106
Replying to



