Aligned AGI doesn’t need to be “humanlike” in some philosophical sense, and it doesn’t need to be “internally optimizing” the One True Utility Function. It just needs to behave more or less the way we’d want it to on reflection, once we let it out of the box.
Is there, at this moment in time, one human being capable of even suggesting a way to align the monster without letting the monster out of its cage?
You exhaustively search for truthlike properties in the AI's brain using logical consistency requirements, not labels; validate that it tracks truth in all sorts of adversarially constructed situations inside the box; then use this to peer into the AI's github.com/EleutherAI/elk…
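As a rough illustration of what "logical consistency requirements, not labels" means here: CCS (Burns et al. 2022) trains a small probe so that its answers on the "true" and "false" phrasings of the same statement are mutually consistent and confident, with no supervision at all. Below is a minimal PyTorch sketch; the `Reporter` class and shapes are illustrative, not the elk repo's actual API.

```python
import torch
from torch import nn

class Reporter(nn.Module):
    """Linear probe mapping a hidden state to a probability of 'true'."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, 1)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.linear(h)).squeeze(-1)

def ccs_loss(p_pos: torch.Tensor, p_neg: torch.Tensor) -> torch.Tensor:
    """Unsupervised CCS loss (Burns et al. 2022).

    p_pos / p_neg are the reporter's outputs on hidden states for the
    'X is true' and 'X is false' phrasings of the same statement.
    """
    # Consistency: P(true) and P(false) should sum to one.
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    # Confidence: rule out the degenerate answer p = 0.5 everywhere.
    confidence = torch.minimum(p_pos, p_neg) ** 2
    return (consistency + confidence).mean()
```

Nothing in this objective mentions ground truth: the probe is selected purely for behaving like a truth-tracker under negation, which is the sense in which the search uses consistency requirements rather than labels.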
We've found that ELK works a lot better when you add a *prompt invariance* term: that is, you search for a reporter whose output is approximately the same across different ways of asking the same question. In particular, it works for autoregressive models!
In the plot, "squared" is the standard CCS loss proposed by Burns et al., and "prompt_var" is the CCS loss plus a term penalizing the variance of the reporter's outputs across different prompts. Thanks to the collaborator who ran this particular experiment.
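The invariance term the tweet describes is simple to write down. Here is a sketch of one way to implement it, assuming the reporter has been run on the same questions under several prompt templates; the function name and the unit weighting of the penalty are my assumptions, not necessarily what the linked experiment used.

```python
import torch

def ccs_prompt_var_loss(p_pos: torch.Tensor, p_neg: torch.Tensor) -> torch.Tensor:
    """CCS loss plus a prompt-invariance penalty.

    p_pos / p_neg have shape [n_prompts, batch]: the reporter's output
    for each of several paraphrases of the same underlying question.
    """
    # Standard CCS terms, averaged over prompts and examples.
    consistency = ((p_pos - (1.0 - p_neg)) ** 2).mean()
    confidence = (torch.minimum(p_pos, p_neg) ** 2).mean()
    # Prompt invariance: the reporter should give roughly the same
    # answer no matter how the question is phrased.
    prompt_var = p_pos.var(dim=0).mean() + p_neg.var(dim=0).mean()
    return consistency + confidence + prompt_var
```

The extra term selects against reporters that latch onto surface features of a particular template, since those features change across paraphrases while the underlying truth value does not.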


