Some of the most exciting and on-point AI safety work.
See also Collin's excellent conceptual post on how this fits into a broader scalable alignment scheme: alignmentforum.org/posts/L4anhrxj
Quote Tweet
How can we figure out if what a language model says is true, even when human evaluators can’t easily tell?
We show (arxiv.org/abs/2212.03827) that we can identify whether text is true or false directly from a model’s *unlabeled activations*. 
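The paper's core method is Contrast-Consistent Search (CCS): for each statement, build a contrast pair (the statement phrased as true vs. as false), extract the model's hidden-state activations for both, and train a small probe with no labels, using only the logical constraint that the two phrasings should get complementary probabilities. Below is a minimal sketch of that idea, assuming `pos` and `neg` are precomputed activation tensors for the two phrasings; the names and hyperparameters are illustrative, not the authors' code.

```python
# Hedged sketch of the CCS objective from arxiv.org/abs/2212.03827.
# Assumes `pos`/`neg` are (n_examples, dim) activations for the
# "statement is true" / "statement is false" phrasings of each example.
import torch
import torch.nn as nn

class CCSProbe(nn.Module):
    """Linear probe mapping an activation vector to P(statement is true)."""
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, 1)

    def forward(self, x):
        return torch.sigmoid(self.linear(x))

def ccs_loss(p_pos, p_neg):
    # Consistency: the two phrasings should have complementary probabilities.
    consistency = ((p_pos + p_neg) - 1.0) ** 2
    # Confidence: push away from the degenerate p = 0.5 solution.
    confidence = torch.minimum(p_pos, p_neg) ** 2
    return (consistency + confidence).mean()

def train_ccs(pos, neg, epochs=1000, lr=1e-3):
    """Unsupervised training: no truth labels are used anywhere."""
    probe = CCSProbe(pos.shape[1])
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = ccs_loss(probe(pos), probe(neg))
        loss.backward()
        opt.step()
    return probe
```

Because the loss is symmetric in true/false, the learned probe's direction is only determined up to sign, so in practice its orientation is resolved afterwards (e.g., by checking which assignment makes the probe more confident on its own predictions).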