`LC_ALL=en_US.UTF-8 grep -c . surrogate` reports a count of 1 where `surrogate` contains \xED\xAE\xA0, which is a UTF-8 encoding of a surrogate codepoint. But surrogates are not allowed in UTF-8. (Note that `.` correctly fails to match invalid UTF-8 such as `\xFF`.) Is it a bug?
-
Show this thread
I discovered this while tracking down a discrepancy in match count between ripgrep and grep. ripgrep does not permit `.` to match surrogates.
6:04 PM - 6 Sep 2018
0 replies
0 retweets
3 likes
Loading seems to be taking a while.
Twitter may be over capacity or experiencing a momentary hiccup. Try again or visit Twitter Status for more information.