Many AI risk failure modes imagine strong coherence/goal-directedness (e.g. [expected] utility maximisers).
Such strong coherence is not exhibited by humans, seems unlikely to emerge from deep learning, and may be "anti-natural" to general intelligence in our universe.
I suspect this assumption was a mistake that set the field back a bit, and it hasn't yet fully recovered from that error.
I think most of the AI safety work for very strongly coherent agents (decision theory) will end up inapplicable/useless for aligning powerful systems.
I don't think it nails everything, but on a purely ontological level, Alex Turner's shard theory feels a lot more right to me than e.g. HRAD.
HRAD is based on an ontology that seems to me to be mistaken/flawed in important respects.
The shard theory account of value formation (while lacking) seems much more plausible as an account of how intelligent systems develop values (where values are "contextual influences on decision making") than the immutable terminal goals in strong coherence ontologies.
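To make the ontological contrast concrete, here's a toy sketch. This is my own illustration, not shard theory's actual formalism: the agent names, contexts, and numbers are all made up. A fixed-utility agent ranks options the same way in every context; a shard-style agent's choice is an aggregate of context-activated influences.

```python
# Toy contrast (illustrative only): an agent with one immutable terminal
# goal vs. a shard-style agent whose "values" are contextual influences
# on decision making. All names and numbers here are invented.

def utility_agent(options, context):
    # One fixed utility function, applied identically in every context.
    utility = {"cake": 3, "salad": 1, "water": 0}
    return max(options, key=lambda o: utility[o])

def shard_agent(options, context):
    # Each shard activates only in the contexts that formed it and
    # contributes a bid; the decision aggregates the active bids.
    shards = [
        (lambda ctx: ctx == "hungry",  {"cake": 3, "salad": 2, "water": 0}),
        (lambda ctx: ctx == "thirsty", {"cake": 0, "salad": 0, "water": 3}),
        (lambda ctx: True,             {"cake": -1, "salad": 1, "water": 0}),  # weak always-on "health" shard
    ]
    def score(option):
        return sum(bids[option] for active, bids in shards if active(context))
    return max(options, key=score)
```

The utility agent picks "cake" regardless of context; the shard agent's choice shifts with context ("water" when thirsty, "salad" when hungry), with no single terminal goal anywhere in the system.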
And I get the impression that the assumption of strong coherence is still implicit in several current AI safety failure modes (e.g. deceptive alignment).
I'd be interested in more investigation into what environments/objective functions select for coherence and to what degree.
And empirical demonstrations of systems that actually become more coherent as they are trained for longer/scaled up or otherwise amplified.
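One concrete probe along these lines (a toy sketch of my own, not an established benchmark): elicit pairwise choices from a trained agent and count intransitive triples. A strongly coherent (utility-maximising) agent's preferences should be cycle-free; cycles indicate money-pumpable incoherence.

```python
from itertools import combinations

def count_preference_cycles(prefers, options):
    """Crude incoherence probe: count intransitive preference triples.

    `prefers(a, b)` should return True iff the agent strictly prefers
    a to b (preferences assumed complete and asymmetric). A coherent,
    utility-maximising agent has zero cycles.
    """
    cycles = 0
    for a, b, c in combinations(options, 3):
        ab, bc, ca = prefers(a, b), prefers(b, c), prefers(c, a)
        # The three pairwise preferences form a cycle exactly when they
        # all point the same way around the triangle.
        if ab == bc == ca:
            cycles += 1
    return cycles

# A coherent agent: preferences induced by a single utility function.
utility = {"cake": 3, "salad": 2, "water": 1}
coherent = lambda a, b: utility[a] > utility[b]

# An incoherent agent: rock-paper-scissors style cyclic preferences.
wins = {("rock", "scissors"), ("scissors", "paper"), ("paper", "rock")}
cyclic = lambda a, b: (a, b) in wins
```

Tracking this count over training checkpoints would be one cheap way to test whether systems actually become more coherent as they are trained for longer or scaled up.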
I want advocates of strong coherence to explain why agents operating in rich environments (e.g. animals) aren't strongly coherent.
And mechanistic interpretability analysis of sophisticated RL agents (e.g. AlphaStar) to investigate their degree of coherence.
Currently, I think strong coherence is unlikely (plausibly "anti-natural" [e.g. if the shard theory account of value formation is at all correct]) and am unenthusiastic about research agendas and threat models predicated on strong coherence.
Disclaimer: this is all low-confidence speculation, and I may well be speaking out of my ass.
I do think my disagreements with deceptive alignment are not a failure of understanding, but I am very much an ML noob, so there may still be things I just don't know.
