A common criticism of LLM watermarks is that they can be removed by AI paraphrasing or human editing. Let's put this theory to the test! Can a watermark be automatically removed by GPT? Can a grad student do any better? The results surprised me 🧵
arxiv.org/pdf/2306.04634
First, if you don’t remember how watermarks work, you might revisit my original post on this issue.
TL;DR The watermark is a subtle pattern embedded in LLM outputs that labels them as machine-generated. High-accuracy detection usually requires 50-ish words.
Quote Tweet
#OpenAI is planning to stop #ChatGPT users from making social media bots and cheating on homework by "watermarking" outputs. How well could this really work? Here's just 23 words from a 1.3B parameter watermarked LLM. We detected it with 99.999999999994% confidence. Here's how 
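The green-list detector behind those confidence numbers can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the hash construction, key, and GAMMA value here are all made up for clarity (the real scheme hashes token ids with a keyed PRF).

```python
import hashlib
import math

GAMMA = 0.5  # fraction of the vocabulary on the "green" list (assumed value)

def is_green(prev_token: str, token: str, key: str = "secret") -> bool:
    # Toy stand-in for the watermark's pseudorandom vocabulary partition:
    # the previous token plus a secret key decide whether `token` lands
    # on the green list. Real schemes hash token ids, not strings.
    h = hashlib.sha256(f"{key}|{prev_token}|{token}".encode()).digest()
    return h[0] < GAMMA * 256

def z_score(tokens: list[str]) -> float:
    # Count green hits over all (previous, current) token pairs, then
    # measure how far the count sits above the chance rate GAMMA, in
    # standard deviations. Unwatermarked text should score near zero;
    # watermarked text scores many sigmas above it.
    n = len(tokens) - 1
    greens = sum(is_green(p, t) for p, t in zip(tokens, tokens[1:]))
    return (greens - GAMMA * n) / math.sqrt(GAMMA * (1 - GAMMA) * n)
```

The z-score is what turns a short passage into an astronomically small false positive rate: each extra sigma multiplies the confidence, which is how 23 watermarked words can already be damning.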
The experiment: we generated watermarked text using the Llama model, then asked a non-watermarked LLM (GPT-3.5) to rewrite it. We did lots of prompt engineering to try to get rid of the watermark. Finally, we checked whether we could still detect the watermark in the rewritten text.
Even after AI paraphrasing, we still detect the watermark, but we need more text to do it. Once we observe about 500 tokens (≈ half a page), we can reliably detect the watermark with a false positive rate of about 1 in a million.
Here’s why this happened. GPT is statistically likely to recycle word combinations, long (multi-token) words, and short phrases from the original text. This preserves the watermark in the paraphrased text. But the signal is diluted: it takes about 10X more tokens to reliably detect it.
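The dilution effect has a simple square-law behind it: inverting the z-score formula shows that the tokens needed to reach a fixed detection threshold grow with the inverse square of the green-token excess. The green rates below are illustrative numbers I picked to show the shape of the curve, not measurements from the paper.

```python
import math

GAMMA = 0.5      # green-list fraction (assumed value)
Z_TARGET = 4.75  # z-score giving roughly a 1-in-a-million false positive rate

def tokens_needed(green_rate: float) -> int:
    # Invert z = (g - GAMMA) * sqrt(n) / sqrt(GAMMA * (1 - GAMMA)) for n:
    # the smaller the green-token excess, the more tokens we must observe
    # before the watermark clears the detection threshold.
    excess = green_rate - GAMMA
    return math.ceil(Z_TARGET**2 * GAMMA * (1 - GAMMA) / excess**2)

# Halving the excess quadruples the required text; a ~3X dilution
# explains a ~10X jump in the tokens needed for detection.
print(tokens_needed(0.83))  # fresh watermark: a few dozen tokens
print(tokens_needed(0.60))  # after paraphrasing: several hundred tokens
```

This is why paraphrasing weakens but rarely kills the watermark: the attacker has to push the green rate all the way back to chance, not merely lower it.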
Now let’s see if humans can do any better. We had CS grad students rewrite passages to remove the watermark. The top performers got $100 coffee shop gift certificates (if you’ve been to grad school you know that both coffee and free stuff are a big deal). Here are the results.
Grad students do better than GPT. For an average grad student, we need to observe about 1100 tokens (≈ 1 page) before we can detect the watermark with a 1 in a million error rate. There’s quite a bit of variation between people, though.
It seems scrubbing the watermark is tougher than people think. But there are some caveats worth mentioning. Maintaining this much security requires the watermark parameters to be kept secret; otherwise the attacker knows which “green” tokens to remove, which helps scrub the watermark.
Also, removing the watermark is tough but not impossible. A sophisticated actor might have the capability to remove it through (very) meticulous scrubbing or via an LLM with a specialized decoder. Still, many malicious LLM users are, shall we say, not so sophisticated.
Some people argue that we shouldn't watermark because we can’t catch everything. I don't like this argument. In a world with spam, harassment, misinformation, and bots, I’m ok with stopping some things even if we can’t catch everything.
