Any improvement in our grasp of what goes on in there, and of how to manipulate it, might help. Let's say that first.
But for later, I hope we can agree in advance that one should never build something that turns out to want to kill you, try to mindwipe that out of it, and then run it again.
Quoted tweet:
Ever wanted to mindwipe an LLM?
Our method, LEAst-squares Concept Erasure (LEACE), provably erases all linearly-encoded information about a concept from neural net activations. It does so surgically, inflicting minimal damage to other concepts.
arxiv.org/abs/2306.03819
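The core idea in the quoted tweet — zeroing out every *linear* readout of a concept while perturbing activations as little as possible — can be sketched in a few lines of numpy. This is a minimal illustration for a single binary concept with invertible covariance, not the paper's implementation: whiten the features, project out the (whitened) cross-covariance direction with the concept label, then unwhiten. All variable names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 4-D features where the first axis linearly encodes a binary concept.
z = rng.integers(0, 2, size=500)          # concept labels
X = rng.normal(size=(500, 4))
X[:, 0] = X[:, 0] + 3.0 * z               # inject a linear concept signal

# Whiten, project out the cross-covariance with the concept, unwhiten.
mu = X.mean(0)
Xc = X - mu
cov = Xc.T @ Xc / len(X)
L = np.linalg.cholesky(cov)               # cov = L @ L.T
W = np.linalg.inv(L)                      # whitening transform
Xw = Xc @ W.T                             # whitened features (identity covariance)
zc = z - z.mean()
u = Xw.T @ zc                             # whitened cross-covariance direction
u = u / np.linalg.norm(u)
Xw_erased = Xw - np.outer(Xw @ u, u)      # rank-1 orthogonal projection
X_erased = Xw_erased @ L.T + mu           # map back to the original basis

# After erasure, no direction of the features covaries with the concept:
corr = (X_erased - X_erased.mean(0)).T @ zc / len(X)
print(np.abs(corr).max())                 # ≈ 0 up to floating-point rounding
```

Because the projection is orthogonal in the whitened space, the change to the features is least-squares minimal among all edits that drive the cross-covariance to zero — that is the sense in which the erasure is "surgical."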