That's inherent, not a property of UTF-8. The same applies to any character encoding you pick.
Replying to @RichFelker @vilcans
You can make the situation even worse by not even having a known common subset (ASCII), but you can't make it any better without metadata.
Replying to @RichFelker
My point is that not knowing the encoding of a text string is common. Some don't even care/know that there are different encodings.
Replying to @vilcans
If a string is valid UTF-8, either it's ASCII (in which case it's valid as nearly any encoding, but doesn't tell you how to add new data)...
Replying to @RichFelker @vilcans
...or it's extremely unlikely (heuristically, yes) that it was intended as anything other than UTF-8.
Replying to @RichFelker @vilcans
This is because non-ASCII UTF-8, when read as a legacy 8-bit encoding such as Latin-1, is full of bytes which are either C1 control characters (garbage) or nonsensical pairings of printable chars.
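A minimal sketch of that point, assuming Latin-1 as the legacy encoding a reader might wrongly guess. The helper names `misread_as_latin1` and `is_c1_control` are made up for illustration and do not come from the thread.

```python
def misread_as_latin1(text: str) -> str:
    """Encode text as UTF-8, then decode the bytes as if they were Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def is_c1_control(ch: str) -> bool:
    """True if ch falls in the C1 control range U+0080..U+009F."""
    return 0x80 <= ord(ch) <= 0x9F


# U+00E9 (e with acute) is the two bytes C3 A9 in UTF-8.
# Read as Latin-1 they become an odd pairing of printable characters.
pair = misread_as_latin1("\u00e9")

# U+2018 (left single quotation mark) is the three bytes E2 80 98 in UTF-8.
# Read as Latin-1, the last two bytes land in the C1 control range.
garbage = misread_as_latin1("\u2018")

print(repr(pair))
print(repr(garbage))
print([is_c1_control(ch) for ch in garbage])
```

Text that someone really meant as Latin-1 rarely contains such sequences, which is why a successful UTF-8 parse is strong evidence.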
Replying to @RichFelker @vilcans
UTF-8 really has the _maximal possible_ properties to help you out when you have to guess, without breaking non-negotiable requirements.
Replying to @RichFelker
Have you never seen a string interpreted as Latin-1 when it was UTF-8, or the other way around?
Replying to @vilcans
The above 4 tweets literally just explained how the properties of UTF-8 make it so you can easily avoid that if heuristics are acceptable.
Replying to @RichFelker @vilcans
If you have to guess at encoding, any string that parses as UTF-8 is UTF-8. Otherwise you need more elaborate heuristics or a fixed fallback.
FWIW this approach is used successfully in most modern IRC clients.
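The "parses as UTF-8, so treat it as UTF-8" rule can be sketched as below. This is only an illustration: the function name `decode_guessing` and the choice of Latin-1 as the fixed fallback are assumptions, not something the thread or any particular IRC client specifies.

```python
from typing import Tuple


def decode_guessing(raw: bytes, fallback: str = "latin-1") -> Tuple[str, str]:
    """Return (text, encoding_used) for bytes of unknown encoding.

    Strict UTF-8 decoding is tried first. If the bytes are not valid
    UTF-8, the fixed fallback encoding is used instead. Latin-1 is an
    assumed default here; it can decode any byte sequence.
    """
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return raw.decode(fallback), fallback


# Valid UTF-8 is accepted as UTF-8.
print(decode_guessing("caf\u00e9".encode("utf-8")))

# The Latin-1 bytes for the same word are not valid UTF-8 (a lone E9
# byte), so the fallback is used.
print(decode_guessing("caf\u00e9".encode("latin-1")))
```

Pure ASCII input decodes identically either way, so reporting it as UTF-8 loses nothing.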