Collin Burns
@CollinBurns4
Alignment research. Former Rubik's Cube world record holder.
San Francisco · collinpburns.com · Joined March 2020

Collin Burns’s Tweets

Wish I had this to cite in the agent models paper: more evidence that LMs distinguish between "p is true" and "p is likely to be uttered by the author of this particular prompt".
Quote Tweet
How can we figure out if what a language model says is true, even when human evaluators can’t easily tell? We show (arxiv.org/abs/2212.03827) that we can identify whether text is true or false directly from a model’s *unlabeled activations*. 🧵
Pretty interesting approach to adapting an LM that's not standard prompting or fine-tuning.
Quote Tweet
How can we figure out if what a language model says is true, even when human evaluators can’t easily tell? We show (arxiv.org/abs/2212.03827) that we can identify whether text is true or false directly from a model’s *unlabeled activations*. 🧵
I'm glad you liked it! :) Incidentally, we just (finally 😅) put the paper up on arxiv (arxiv.org/abs/2212.03827) and released the code on github (tinyurl.com/latentknowledge) a few hours after your tweet yesterday!
Quote Tweet
Discovering Latent Knowledge in Language Models Without Supervision is blowing my mind right now. Basic idea is so simple yet brilliant: Find a direction in activation space where mutually exclusive pairs of statements are anticorrelated. I <3 clickbait so: the Truth Vector.
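The quoted idea — finding a direction in activation space along which a statement and its negation get anticorrelated truth scores — can be sketched on toy data. Everything below is illustrative: the "activations" are synthetic, the names are made up, and the real CCS method (arxiv.org/abs/2212.03827) differs in details such as normalization and the optimizer. The sketch only shows the shape of the unsupervised objective.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ccs_style_loss(theta, x_pos, x_neg):
    """Unsupervised loss on paired activations: the probe should assign a
    statement and its negation probabilities that sum to ~1 (consistency),
    while avoiding the degenerate answer p = 0.5 everywhere (confidence)."""
    p_pos = sigmoid(x_pos @ theta)
    p_neg = sigmoid(x_neg @ theta)
    consistency = np.mean((p_pos - (1.0 - p_neg)) ** 2)
    confidence = np.mean(np.minimum(p_pos, p_neg) ** 2)
    return consistency + confidence

# Synthetic "activations": each statement/negation pair differs along one
# hidden truth direction, plus independent noise. The labels are never used
# for training, only to measure accuracy at the end.
d, n = 8, 300
truth_dir = rng.normal(size=d)
truth_dir /= np.linalg.norm(truth_dir)
labels = rng.integers(0, 2, size=n)
signs = 2 * labels - 1  # +1 where the statement is true
x_pos = signs[:, None] * truth_dir + 0.2 * rng.normal(size=(n, d))
x_neg = -signs[:, None] * truth_dir + 0.2 * rng.normal(size=(n, d))

# Plain gradient descent with a numerical gradient (kept simple on purpose;
# not what the paper uses).
theta = 0.1 * rng.normal(size=d)
lr, eps = 1.0, 1e-5
for _ in range(400):
    grad = np.zeros(d)
    for j in range(d):
        e = np.zeros(d)
        e[j] = eps
        grad[j] = (ccs_style_loss(theta + e, x_pos, x_neg)
                   - ccs_style_loss(theta - e, x_pos, x_neg)) / (2 * eps)
    theta -= lr * grad

# Classify each pair by averaging the probe's two views of it.
scores = 0.5 * (sigmoid(x_pos @ theta) + (1.0 - sigmoid(x_neg @ theta)))
preds = (scores > 0.5).astype(int)
# The learned direction is only identified up to sign, so check both.
accuracy = max(np.mean(preds == labels), np.mean(preds != labels))
print(f"unsupervised accuracy: {accuracy:.2f}")
```

Note why both terms are needed: with the consistency term alone, θ = 0 (probe outputs 0.5 for everything) is a perfect solution; the confidence term rules that degenerate probe out.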
This problem is important because as language models become more capable, they may output false text in increasingly severe and difficult-to-detect ways. Some models may even have incentives to deliberately “lie”, which could make human feedback particularly unreliable.
Nevertheless, we found it surprising that we could make substantial progress on this problem at all. (Imagine recording a person's brain activity as you tell them T/F statements, then classifying those statements as true or false just from the raw, unlabeled neural recordings!)
Of course, our work has important limitations and raises many new questions for future work. CCS still fails sometimes, and there is a lot we don’t understand about when this type of approach should be feasible in the first place.
Among other findings, we also show that CCS recovers something genuinely different from the model outputs: it continues to work well in several cases where model outputs are unreliable or uninformative.
We find that on a diverse set of tasks (NLI, sentiment classification, cloze tasks, etc.), our method can recover correct answers from model activations with high accuracy (even outperforming zero-shot prompting) despite not using any labels or model outputs.
Existing techniques for training language models are misaligned with the truth: if we train models to imitate human data, they can output human-like errors; if we train them to generate highly-rated text, they can output errors that human evaluators can’t assess or don’t notice.