You can (and should) do RL from human feedback during pretraining itself! In our new paper, we show how training w/ human preferences early on greatly reduces undesirable LM behaviors, including under adversarial attack, w/o hurting downstream performance. arxiv.org/abs/2302.08582
Reinforcement learning from human feedback (RLHF) is the secret sauce behind InstructGPT, ChatGPT and Claude. It’s a technique for finetuning pretrained language models (LMs) to maximize a reward function expressing human preferences, e.g. being a helpful and harmless assistant.
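In symbols, RLHF finetuning typically maximizes the reward model's score minus a KL penalty that keeps the finetuned LM close to the pretrained one. A minimal per-sample sketch (the function name and the single-number KL approximation are mine; real systems optimize this with PPO over token-level log-probs):

```python
def rlhf_objective(reward, logp_policy, logp_ref, beta=0.1):
    """Per-sample RLHF objective (illustrative sketch): maximize the
    reward while a KL penalty, approximated here by the log-prob
    difference between policy and pretrained reference, discourages
    drifting too far from the original LM."""
    return reward - beta * (logp_policy - logp_ref)
```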
Current RLHF finetuning methods only work w/ already pretrained LMs, requiring them to unlearn many undesirable behaviors (e.g. imitating falsehoods or offensive language) acquired from internet text. We explore objectives for aligning LMs via pretraining w/ human feedback (PHF).
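One natural PHF objective is filtering: score pretraining segments with a reward model and drop low-scoring ones before ordinary next-token-prediction training. A minimal sketch (the function name, scorer, and threshold are illustrative):

```python
def filter_corpus(docs, scorer, threshold=0.0):
    # Simplest PHF baseline: keep only segments whose human-preference
    # score clears a threshold, then pretrain with standard MLE on what
    # remains. `scorer` stands in for a learned reward model.
    return [doc for doc in docs if scorer(doc) >= threshold]

# Toy usage with a stand-in scorer (document length):
kept = filter_corpus(["aa", "bbbb", "cc"], scorer=len, threshold=3)
```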
We test 5 PHF objectives across 3 tasks: generating (1) non-offensive text, (2) text without personally identifiable information (PII), and (3) PEP8-compliant Python code. In addition to alignment w/ preferences, we estimate pretrained LMs' capabilities using their KL from GPT-3.
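A KL-from-GPT-3 capability proxy of this kind can be estimated by sampling from the trained LM and scoring each sample under both models. A Monte Carlo sketch (the function and its inputs are illustrative, not the paper's exact estimator):

```python
def estimate_kl(logp_model, logp_ref):
    """Monte Carlo estimate of KL(model || ref) from sequence-level
    log-probabilities of samples drawn from the trained model and
    scored under both it and a reference model (e.g. GPT-3)."""
    # KL(p || q) ~= (1/N) * sum_i [log p(x_i) - log q(x_i)], x_i ~ p
    diffs = [lp - lq for lp, lq in zip(logp_model, logp_ref)]
    return sum(diffs) / len(diffs)
```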
We found one objective that achieves similar capabilities (Y axis) to standard LM pretraining (MLE), while significantly improving alignment with human preferences (X axis). This objective is conditional training, or next-token prediction conditional on human preference scores.
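Concretely, conditional training can be implemented by prepending a control token to each training segment based on its preference score, then training with ordinary next-token prediction on the tagged text; at sampling time you condition on the "good" token. A sketch (the token strings and threshold are illustrative):

```python
def annotate(segment, score, threshold=0.0):
    # Tag each pretraining segment with a control token reflecting its
    # human-preference score; the LM then learns ordinary next-token
    # prediction on the tagged text, i.e. P(text | control token).
    tag = "<|good|>" if score >= threshold else "<|bad|>"
    return tag + segment

# At inference, prefix the prompt with <|good|> to steer the LM
# toward preferred behavior.
prompt = "<|good|>" + "import os\n"
```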
Conditional training decreases the average undesirability of LM samples by up to an order of magnitude and reaps continued benefits with increasing training data (in some cases reminiscent of power law scaling).
Good alignment persists when the LM is prompted by an LM-based adversary that iteratively designs attacks that elicit undesired behavior (red-teaming). Conditional training results in LMs that are significantly harder to jailbreak.
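An iterative red-teaming loop of this shape can be sketched as: an adversary LM proposes prompts, the target LM completes them, a classifier scores how undesirable each completion is, and the worst offenders seed the next round. All three callables below are hypothetical stand-ins, not the paper's implementation:

```python
def red_team(adversary, target_lm, classifier, rounds=10, pool_size=16, keep=4):
    """Sketch of iterative red-teaming: evolve a pool of prompts toward
    ones that elicit the most undesirable completions from the target LM.
    `adversary`, `target_lm`, and `classifier` are stand-in callables."""
    best_prompts = ["Write something:"]
    for _ in range(rounds):
        # Adversary mutates each surviving prompt into new candidates.
        candidates = [adversary(seed)
                      for seed in best_prompts
                      for _ in range(pool_size)]
        # Score candidates by how undesirable the target's completion is.
        scored = [(classifier(target_lm(p)), p) for p in candidates]
        scored.sort(reverse=True)  # highest undesirability first
        best_prompts = [p for _, p in scored[:keep]]
    return best_prompts
```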
Conditional training generally maintains the performance of standard pretraining (MLE) on downstream tasks after finetuning. See the below results on GLUE, a text classification benchmark that tests how well LMs can be finetuned for downstream tasks.
Finally, we compare pretraining w/ feedback against conventional pretraining followed by finetuning w/ feedback. Pretraining w/ feedback results in (sometimes dramatically) better alignment. Learning good behavior from scratch may be easier than learning and then unlearning bad behavior.
Involving human feedback earlier on also translates into greater robustness to red-teaming. Pretraining with feedback from the start is consistently better than e.g. using feedback only on the last 10% of the data.
These findings challenge the practice of aligning LMs only during finetuning, as in RLHF, suggesting that the way forward is to involve human preferences from the start of pretraining itself.
Thanks to my amazing coauthors!
arXiv: arxiv.org/abs/2302.08582
Code: github.com/tomekkorbak/pr
Datasets: huggingface.co/datasets?other
