`LC_ALL=en_US.UTF-8 grep -c . surrogate` reports a count of 1 where `surrogate` contains \xED\xAE\xA0, which is a UTF-8 encoding of a surrogate codepoint. But surrogates are not allowed in UTF-8. (Note that `.` correctly fails to match invalid UTF-8 such as `\xFF`.) Is it a bug?
Good question. I don't even know if it's well formed though haha! Usually `.` is "matches a single codepoint," but certainly, in UTF-16, you want it to match the pair, not one surrogate. I guess you would want to carry that over to UTF-8 as well, but idk what ramifications are...