Ever wanted to mindwipe an LLM?
Our method, LEAst-squares Concept Erasure (LEACE), provably erases all linearly-encoded information about a concept from neural net activations. It does so surgically, inflicting minimal damage to other concepts. 🧵
Conversation
(2/7) Concept erasure is an important tool for fairness, allowing us to prevent features like race or gender from being used by classifiers when inappropriate, as well as interpretability, letting us study the causal impact of features on a model's behavior.
5
7
54
(3/7) We also introduce a procedure called “concept scrubbing,” which applies LEACE to all layers of a deep network simultaneously. We find LLM performance depends heavily on linear part-of-speech information, while erasing a random feature has little to no effect.
1
2
43
(4/7) We prove that LEACE is the smallest possible linear edit, in the least squares sense, needed to erase a concept— all previous concept erasure methods have been suboptimal. We also show empirically that it’s less destructive to model performance than previous methods.
1
4
50
(5/7) LEACE has a closed-form solution that fits on a T-shirt. This makes it orders of magnitude faster than popular concept erasure methods like INLP and R-LACE, which require gradient-based optimization. And the solution can be efficiently updated to accommodate new data.
4
5
46
(6/7) We’ve released all code needed to reproduce our results at github.com/EleutherAI/con! You can also `pip install concept-erasure` to get the PyPI package.
2
7
76
(7/7) LEACE wouldn’t be possible without , who proved the theorem that led to this paper. I'd also like to thank our other coauthors !
1
4
53
Wait, this also allows you to isolate a concept without necessarily removing it, right? What would be the chances of a GPT4-tier LLM being able to self-explore itself?
1
1
31
Yes, I think it's likely that LEACE will be useful for editing concepts and not just erasing them.
1
2
40
Show replies
Yes INLP in particular is quite bad! It inflicts a lot of "collateral damage" on other parts of the representation
1
1
5


