Apparently the extent of UTF-8's amazingness still isn't widely known. Today on #musl someone asked for confirmation that multibyte chars never contain ASCII bytes.
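A minimal C sketch of the property being confirmed: every byte of a multibyte UTF-8 sequence has its high bit set, so no ASCII byte can ever appear inside one. The sample characters below are arbitrary.

```c
/* Every byte of a multibyte UTF-8 sequence is >= 0x80, i.e. outside the
 * ASCII range 0x00-0x7F. */
#include <assert.h>
#include <stdio.h>

int main(void)
{
    const unsigned char samples[][5] = {
        { 0xC3, 0xA9 },             /* U+00E9, 2-byte sequence */
        { 0xE2, 0x82, 0xAC },       /* U+20AC, 3-byte sequence */
        { 0xF0, 0x9F, 0x99, 0x82 }, /* U+1F642, 4-byte sequence */
    };

    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++)
        for (const unsigned char *p = samples[i]; *p; p++)
            assert(*p >= 0x80);  /* never an ASCII byte */

    puts("no ASCII bytes inside any multibyte sequence");
    return 0;
}
```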
-
So that, for example, C0 80 (whose decoded payload bits are all zero) would represent 0x80 + 0 = 0x80 rather than being an illegal sequence.
This would have allowed shorter encoding of a few chars and eliminated the confusion about "overlong sequences" & their invalidity.
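A sketch of the hypothetical biased decoding described above, not of real UTF-8; the function name decode2_biased and the test values are made up for illustration. Real UTF-8 decodes C0 80 to payload 0 and must reject it as an overlong form of U+0000, whereas the biased scheme adds 0x80 (the first value that actually needs two bytes) to the payload, so each sequence length maps to a disjoint range and overlong forms simply can't exist.

```c
#include <stdio.h>

/* Hypothetical biased two-byte decode: value = 0x80 + payload bits.
 * This is the scheme discussed above, NOT how UTF-8 actually works. */
static long decode2_biased(unsigned char b1, unsigned char b2)
{
    if ((b1 & 0xE0) != 0xC0 || (b2 & 0xC0) != 0x80)
        return -1;                               /* not a 2-byte sequence */
    long payload = ((long)(b1 & 0x1F) << 6) | (b2 & 0x3F);
    return 0x80 + payload;                       /* bias by the range start */
}

int main(void)
{
    /* C0 80: payload 0, so it would mean U+0080 instead of being illegal. */
    printf("C0 80 -> U+%04lX\n", (unsigned long)decode2_biased(0xC0, 0x80));
    /* DF BF: highest 2-byte value, U+087F instead of UTF-8's U+07FF. */
    printf("DF BF -> U+%04lX\n", (unsigned long)decode2_biased(0xDF, 0xBF));
    return 0;
}
```

The extra reach (two bytes covering up to U+087F rather than U+07FF, and likewise for longer sequences) is the "shorter encoding of a few chars" mentioned above.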
New conversation -
-
Maybe it would've been better if it were incompatible with ASCII, avoiding confusion about which encoding a given string has.
-
Demonstrating the original point of the thread? :-)
-
In any case there's no such confusion. An ASCII string *is* (not "is confusable with") UTF-8.
-
As ASCII is a subset of both Latin-1 and UTF-8, it causes a lot of confusion, I'd say.
-
Your point is either not clear in <140-char units or not well thought-out.
-
The only "confusion" I can see here is that, when existing data is all ASCII, you don't inherently know whether the processes handling it will accept UTF-8 after edits.
-
But there's never any confusion about how to interpret the existing ASCII data. Any way you choose is right.
-
If you are given a string, e.g. "Hello world", how do you know its encoding? You can't see the difference in that specific case.
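A small C sketch relevant here: for a pure-ASCII string like "Hello world" the bytes are identical under ASCII, Latin-1 and UTF-8, and each byte below 0x80 *is* the code point in all three interpretations, so (as argued above) any choice yields the same characters.

```c
#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *s = "Hello world";

    for (size_t i = 0; i < strlen(s); i++) {
        unsigned char b = (unsigned char)s[i];
        /* In ASCII, Latin-1 and UTF-8 alike, a byte below 0x80 decodes
         * directly to the code point of the same value. */
        printf("byte 0x%02X -> U+%04X ('%c')\n", b, b, b);
    }
    return 0;
}
```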
- 2 more replies