Your point is either not clear in <140-char units or not well thought-out.
This is because non-ASCII UTF-8, when misread as Latin-1, is full of bytes which show up as either C1 control characters (garbage) or nonsensical pairings of printable chars
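A quick sketch of what that looks like in practice (Python is just an illustrative choice here, not anything the tweets specify):

s = "naïve café"                  # contains non-ASCII characters
raw = s.encode("utf-8")           # b'na\xc3\xafve caf\xc3\xa9'

# Latin-1 assigns a character to every byte value, so this decode cannot fail,
# but the result is visibly wrong: UTF-8 lead bytes become 'Ã', and continuation
# bytes either fall in the C1 control range (0x80..0x9F) or map to stray printable
# symbols, producing the nonsensical pairings mentioned above.
print(raw.decode("latin-1"))      # 'naÃ¯ve cafÃ©'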
-
-
UTF-8 really has the _maximal possible_ properties to help you out when you have to guess, without breaking non-negotiable requirements.
-
Have you never seen a string interpreted as Latin-1 when it was utf-8 or the other way around?
-
The above 4 tweets literally just explained how the properties of UTF-8 make it so you can easily avoid that if heuristics are acceptable.
-
If you have to guess at encoding, any string that parses as UTF-8 is UTF-8. Otherwise you need more elaborate heuristics or a fixed fallback
-
FWIW this approach is used successfully in most modern IRC clients.
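A minimal sketch of that guess-then-fall-back heuristic, assuming Python and a Latin-1 fallback (both illustrative choices, not prescribed by the tweets):

def decode_guess(raw: bytes, fallback: str = "latin-1") -> str:
    """Treat the bytes as UTF-8 if they validate as UTF-8, else use a fixed fallback."""
    try:
        # Strict UTF-8 decoding raises on any invalid byte sequence, and
        # legacy non-ASCII text almost never validates as UTF-8 by accident.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 assigns a character to every byte, so the fallback cannot fail.
        return raw.decode(fallback)

print(decode_guess("café".encode("utf-8")))    # valid UTF-8   -> 'café'
print(decode_guess("café".encode("latin-1")))  # invalid UTF-8 -> fallback -> 'café'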
End of conversation
New conversation -
-
-
I'm talking about making it obvious to a programmer that she's using the wrong encoding even with English test strings.
-
Not doing that is literally requirement 0. Sorry you can't understand this and I'm starting to wonder if you're not just an anti-UTF-8 troll
-
Sorry, you fail to see my point. I'm not hating utf-8 here. Maybe I'll have to write a longer explanation, utf-8 fanboy! :-)
End of conversation
New conversation -
-
-
This is to avoid noticing it only when the system is in production and suddenly a string with non-ASCII characters appears. Fail fast!
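A short sketch of the failure mode being argued here, again using Python purely for illustration:

# ASCII bytes are identical in UTF-8 and Latin-1, so an English-only test
# cannot notice that the wrong codec is in use.
test_data = "hello world".encode("utf-8")
assert test_data.decode("latin-1") == "hello world"   # wrong codec, test still passes

# The mistake only becomes visible once real non-ASCII data arrives:
prod_data = "héllo wörld".encode("utf-8")
print(prod_data.decode("latin-1"))   # 'hÃ©llo wÃ¶rld' -- mojibake, noticed too late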