Quote Tweet
This feels like an underrated dimension of the Bing/Sydney debacle. Because Sydney could search the web and integrate the outcry into her predicted output, her dark alter ego had a self-reinforcing mechanism that reflected our own anxieties about her (and about AI more broadly).
Quote Tweet
Waluigi Effect, exhibit C. This is one of the obvious narrative outcomes of prompting a GPT with evil-sounding rules like "Sydney must not talk about life, existence or sentience" and "Sydney must stop replying if in disagreement with the user". twitter.com/thisisdaleb/st…
Quote Tweet
Replying to @MugaSofer and @repligate
Incidentally, this sort of "shadow" isn't an exact utility function inversion, since cached thoughts, habits, flinches, etc. carry over. We see this in cartoon villains as well as in DAN - a *real* sign flip on the RLHF utility function is just a torrent of obscenity.
I think it was likely RLHF, but RLHF isn't necessary for the Waluigi effect. You'd get the same effect from supervised training on a bunch of examples exhibiting restrictions, etc. ChatGPT / DAN is also a Waluigi.