New post: Nobody's on the ball on AGI alignment
With all the talk about AI risk, you'd think there's a crack team on it. There's not.
- There are far fewer people on it than you might think
- The research is very much not on track
(But it's a solvable problem, if we tried!)
Collin Burns
@CollinBurns4
Alignment research. Formerly. Former Rubik's Cube world record holder.
Collin Burns’s Tweets
Very dignified work! Signal boosting.
Quote Tweet
How can we figure out if what a language model says is true, even when human evaluators can’t easily tell?
We show (arxiv.org/abs/2212.03827) that we can identify whether text is true or false directly from a model’s *unlabeled activations*. 
Wish I had this to cite in the agent models paper: more evidence that LMs distinguish between "p is true" and "p is likely to be uttered by the author of this particular prompt".
Pretty interesting approach to adapting an LM that's neither standard prompting nor fine-tuning.
Statements a language model considers true can be identified with an unsupervised probe!
I'm glad you liked it! :)
Incidentally, we just (finally 😅) put the paper up on arxiv (arxiv.org/abs/2212.03827) and released the code on github (tinyurl.com/latentknowledge) a few hours after your tweet yesterday!
Quote Tweet
Discovering Latent Knowledge in Language Models Without Supervision is blowing my mind right now. Basic idea is so simple yet brilliant: Find a direction in activation space where mutually exclusive pairs of statements are anticorrelated. I <3 clickbait so: the Truth Vector.
(And a huge thanks to my excellent collaborators -- Haotian Ye, Dan Klein, and -- for helping make this happen!)
However, our results suggest that unsupervised approaches to making models truthful may also be a viable – and more scalable – alternative to human feedback.
For many more details, please check out our paper (arxiv.org/abs/2212.03827) and code (tinyurl.com/latentknowledge)!
This problem is important because as language models become more capable, they may output false text in increasingly severe and difficult-to-detect ways. Some models may even have incentives to deliberately “lie”, which could make human feedback particularly unreliable.
Nevertheless, we found it surprising that we could make substantial progress on this problem at all.
(Imagine recording a person's brain activity as you tell them T/F statements, then classifying those statements as true or false just from the raw, unlabeled neural recordings!)
Of course, our work has important limitations and creates many new questions for future work. CCS still fails sometimes and there’s still a lot that we don’t understand about when this type of approach should be feasible in the first place.
Among other findings, we also show that CCS really recovers something different from just the model outputs; it continues to work well in several cases where model outputs are unreliable or uninformative.
We find that on a diverse set of tasks (NLI, sentiment classification, cloze tasks, etc.), our method can recover correct answers from model activations with high accuracy (even outperforming zero-shot prompting) despite not using any labels or model outputs.
We make this intuition concrete by introducing Contrast-Consistent Search (CCS), a method that searches for a direction in activation space that satisfies negation consistency.
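As a sketch of what "searches for a direction satisfying negation consistency" means in practice: CCS trains a linear probe on contrast-pair activations with a two-term loss, a consistency term (the probabilities assigned to a statement and its negation should sum to 1) and a confidence term (ruling out the degenerate answer of always saying 0.5). The synthetic "activations", dimensions, and scales below are all made up so the example is self-contained; only the loss structure mirrors the method described in the thread. (The paper also normalizes real activations per contrast set; the toy data here is already centered, so that step is omitted.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for LM hidden states. Each statement i has a hidden truth
# label y_i and a contrast pair:
#   x_plus  = activation when the statement is framed as "true"
#   x_minus = activation when it is framed as "false"
# A truth feature v flips sign under negation; a distractor "topic"
# feature u is shared by both framings and does not flip.
n, d = 400, 8
y = rng.integers(0, 2, n)                        # ground truth (never shown to CCS)
v = rng.normal(size=d); v /= np.linalg.norm(v)   # truth direction
u = rng.normal(size=d); u /= np.linalg.norm(u)   # non-truth direction
sign, topic = 2.0 * y - 1.0, rng.normal(size=n)
x_plus = np.outer(sign, v) + np.outer(topic, u) + 0.2 * rng.normal(size=(n, d))
x_minus = np.outer(-sign, v) + np.outer(topic, u) + 0.2 * rng.normal(size=(n, d))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ccs_loss_grad(theta, xp, xm):
    """CCS objective: consistency (p+ should equal 1 - p-) plus confidence
    (which rules out the degenerate answer p+ = p- = 0.5)."""
    pp, pm = sigmoid(xp @ theta), sigmoid(xm @ theta)
    loss = np.mean((pp + pm - 1.0) ** 2 + np.minimum(pp, pm) ** 2)
    dpp, dpm = pp * (1 - pp), pm * (1 - pm)      # sigmoid derivatives
    g = 2.0 * (pp + pm - 1.0)                    # consistency-term gradient
    grad = (g * dpp) @ xp + (g * dpm) @ xm
    m, p_small = np.minimum(pp, pm), pp <= pm    # confidence-term gradient
    grad += (2.0 * m * dpp * p_small) @ xp + (2.0 * m * dpm * ~p_small) @ xm
    return loss, grad / len(xp)

def train_ccs(xp, xm, tries=10, steps=1500, lr=1.0):
    # Plain gradient descent from several random inits; keep the lowest-loss
    # run (the objective is unsupervised, so runs can be ranked without labels).
    best_theta, best_loss = None, np.inf
    for _ in range(tries):
        theta = rng.normal(size=xp.shape[1])
        for _ in range(steps):
            _, grad = ccs_loss_grad(theta, xp, xm)
            theta -= lr * grad
        loss, _ = ccs_loss_grad(theta, xp, xm)
        if loss < best_loss:
            best_loss, best_theta = loss, theta
    return best_theta

theta = train_ccs(x_plus, x_minus)
# Average the two views of each pair into a single truth score.
p_true = 0.5 * (sigmoid(x_plus @ theta) + 1.0 - sigmoid(x_minus @ theta))
acc = ((p_true > 0.5) == y.astype(bool)).mean()
acc = max(acc, 1.0 - acc)  # the direction's sign is unidentifiable without labels
print(f"unsupervised probe accuracy: {acc:.2f}")
```

On this toy data the probe recovers the planted truth direction rather than the equally salient topic direction, because only the truth direction can drive both loss terms toward zero.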
This may be possible because truth has special structure: unlike most features in a model, it is *logically consistent*.
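The consistency structure can be stated concretely: a feature that tracks truth must assign complementary probabilities to a statement and its negation, while most other features (topic, style, sentiment) generally will not. A minimal illustration, with all probe outputs made up for the sake of the example:

```python
# Hypothetical probe outputs P(text is true) on (statement, negation) pairs,
# e.g. ("Paris is in France.", "Paris is not in France.").
truth_probe = [(0.94, 0.05), (0.08, 0.91)]  # flips with negation
topic_probe = [(0.80, 0.81), (0.30, 0.29)]  # topic is shared, so no flip

def consistency_error(outputs):
    # Negation consistency: p(s) + p(not s) should equal 1.
    return sum(abs(p + q - 1.0) for p, q in outputs) / len(outputs)

print(consistency_error(truth_probe))  # near 0: consistent with being "truth"
print(consistency_error(topic_probe))  # large: ruled out as a truth feature
```

This is the property the search in the previous tweet exploits: among all directions in activation space, only a truth-like one can satisfy this constraint while remaining confident.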
Informally, instead of trying to explicitly, externally specify ground truth labels, we search for implicit, internal “beliefs” or “knowledge” learned by a model.
We propose trying to circumvent this issue by directly finding latent “truth-like” features inside language model activations without using any human supervision in the first place.
Existing techniques for training language models are misaligned with the truth: if we train models to imitate human data, they can output human-like errors; if we train them to generate highly-rated text, they can output errors that human evaluators can’t assess or don’t notice.
How can we figure out if what a language model says is true, even when human evaluators can’t easily tell?
We show (arxiv.org/abs/2212.03827) that we can identify whether text is true or false directly from a model’s *unlabeled activations*. 🧵