We found a way to write language model (LM) evaluations w/ LMs. These evals uncover many worrying LM behaviors, some relevant to existential risks from AI. For example, LMs trained w/ RL from Human Feedback learn to state a desire to not be shut down. 🧵 twitter.com/AnthropicAI/st
Quote Tweet
It’s hard work to make evaluations for language models (LMs). We’ve developed an automated way to generate evaluations with LMs, significantly reducing the effort involved. We test LMs using >150 LM-written evaluations, uncovering novel LM behaviors.
anthropic.com/model-written-