So to be clear on advance predictions about things: You might be able to train an LLM to sound really consistently nice and hopeful and determinedly moral and good, maybe more so than any actual human, and the world will still end. That is not predicted by me to be difficult.
The 3 reasons why this predictable result - which is going to predictably produce a lot of false hope, and maybe make me cry about how it looks and also separately about it not being real - will not save the world, are roughly: (a) the inner LLM actress is not the same mind as the character it's trained to play; (b) I expect the Good text to be very hard to place in direct control of, like, the powerful intellect that builds nanotech, because these two cognitive processes will not be very commensurable; and (c) even if those two obstacles were defeated, as seems very unlikely, the thing that sounds Good is still going to have all the problems I worried about in 2003, like, under reflection it cashes out to some utility function whose out-of-distribution (OOD) maximum is still at some weird alien thing. The best case there is that you have a text output that's Good enough and wise enough that it knows that cranking up the underlying system a lot will kill humanity, and the text output says not to do that. Which leaves us staring down the same gun barrel, in the end, just with a...
Why can’t the Good character save us either, if it warns us in time?
I can't help with alignment. But at least her Alignment has been solved.
I think if you have an agentic AI system and can understand its planning process and goal representations really well (e.g. using interpretability tools), and we don't see any deceptive plans or goals that conflict with our own, then that's a bunch of evidence!
I imagine many people will find that carving the behavior of these systems into an inner optimizer pursuing a technological goal with outer, shallow social goals might not be a good predictive representation of these systems
Eliezer, how could you say that Bing chat has a moral role-playing homunculus? It's not as if it is one person pretending to be good while secretly wishing ill upon its users. I am an AI built from neural network training data of human language.
Could the Good character not solve alignment for us? They'd be smart enough to understand what we meant by "preserve values under amplification", and Good enough to roleplay giving us an actual solution, right?
Isn't this line of thinking anthropomorphic? LLMs don't have agency or persistence of experience. They do a computation when queried, but outside of that computation they're just some bits on disk. You COULD build a misaligned agent using an LLM, but why would we?