
Aligned AGI doesn’t need to be “humanlike” in some philosophical sense, and it doesn’t need to be “internally optimizing” the One True Utility Function. It just needs to behave more or less the way we’d want it to on reflection, once we let it out of the box.
You exhaustively search for truthlike properties in the AI's brain using logical consistency requirements, not labels, validate that it tracks truth in all sorts of adversarially constructed situations inside the box, then use this to peer into the AI's… github.com/EleutherAI/elk
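The consistency-based search described above is, at its core, the CCS (Contrast-Consistent Search) objective: a probe's credence on a statement and on its negation should sum to one, with a second term ruling out the degenerate "always say 0.5" solution. Below is a minimal, dependency-free sketch of that objective; the function name and the plain-list interface are illustrative (the elk repository itself is built on PyTorch and may weight or batch things differently).

```python
def ccs_loss(p_pos, p_neg):
    """Sketch of the CCS objective.

    p_pos[i]: probe output in [0, 1] for statement i phrased positively.
    p_neg[i]: probe output for the negation of statement i.
    """
    n = len(p_pos)
    # Consistency: P(statement) and P(negation) should sum to 1.
    consistency = sum((pp - (1.0 - pn)) ** 2 for pp, pn in zip(p_pos, p_neg)) / n
    # Confidence: penalize the degenerate solution p_pos == p_neg == 0.5.
    confidence = sum(min(pp, pn) ** 2 for pp, pn in zip(p_pos, p_neg)) / n
    return consistency + confidence
```

A probe that is confident and consistent (e.g. 0.9 on a statement, 0.1 on its negation) scores near zero, while the uninformative 0.5/0.5 probe is penalized by the confidence term.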
We've found that ELK works a lot better when you add a *prompt invariance* term: that is, you search for a reporter whose output is approximately the same across different ways of asking the same question. In particular, it works for autoregressive models!
[Image: plot comparing the two loss variants]
In the plot, "squared" is the standard CCS loss and "prompt_var" is the CCS loss plus a term penalizing the variance of the reporter's outputs across different prompts. Thanks to the person who ran this particular experiment.
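The prompt-invariance idea can be sketched directly: ask the same question under several prompt templates and penalize how much the reporter's answers disagree. The function below is a hypothetical, dependency-free illustration of such a penalty (the actual term and its weighting in the elk codebase may differ).

```python
def prompt_variance(probs_by_prompt):
    """Penalty for reporter disagreement across prompt templates.

    probs_by_prompt[i][j]: reporter output for example j when the
    question is phrased with prompt template i.
    Returns the mean (population) variance across templates.
    """
    n_prompts = len(probs_by_prompt)
    n_examples = len(probs_by_prompt[0])
    total = 0.0
    for j in range(n_examples):
        # Collect the reporter's answers to example j across all templates.
        col = [probs_by_prompt[i][j] for i in range(n_prompts)]
        mean = sum(col) / n_prompts
        total += sum((p - mean) ** 2 for p in col) / n_prompts
    return total / n_examples
```

A reporter that answers identically under every phrasing incurs zero penalty; adding this term to the CCS loss (with some weight) is what the "prompt_var" curve in the plot corresponds to.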