Apparently the extent of UTF-8's amazingness still isn't widely known. Today on #musl someone asked to confirm that multibyte chars never contain ASCII bytes.
Not only do ASCII bytes never appear in multibyte UTF-8 chars; NO character is ever a substring of another character.
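A rough sketch of why this holds, assuming nothing beyond Python's built-in UTF-8 encoder: every byte falls into exactly one role, determined by its top bits. The helper names `byte_role` and `roles` are made up for illustration.

```python
def byte_role(b: int) -> str:
    """Classify a byte by the role it can play in valid UTF-8."""
    if b < 0x80:
        return "ascii"         # 0xxxxxxx: a complete one-byte character
    if b < 0xC0:
        return "continuation"  # 10xxxxxx: never the first byte of a character
    # 11xxxxxx: only ever a first byte. Simplification: a few values in this
    # range (0xC0, 0xC1, 0xF5..0xFF) never occur in valid UTF-8 at all.
    return "lead"


def roles(ch: str) -> list:
    """Roles of each byte in the UTF-8 encoding of a single character."""
    return [byte_role(b) for b in ch.encode("utf-8")]


if __name__ == "__main__":
    for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
        print(ch.encode("utf-8").hex(" "), roles(ch))
```

Each encoded character is either a single ASCII byte or one lead byte followed only by continuation bytes. A match for one character inside another would have to start on a continuation byte, which no character's encoding does, so a plain bytewise search never lands in the middle of a character.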
UTF-8 was really a work of brilliance, guaranteeing what's pretty much a maximal set of important desirable properties like this.
Of course the desirable properties necessitate one property that's hard to like: not all byte sequences can be legal/valid.
But understanding the design goals/motivations makes it pretty obvious that this was the only reasonable tradeoff.
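To make the "not all byte sequences are valid" point concrete, here is a minimal sketch that leans on Python's strict built-in decoder. The helper name `is_valid_utf8` is hypothetical, and Python's decoder is only one implementation of the validity rules.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8, False otherwise."""
    try:
        data.decode("utf-8")  # the default error handler is "strict"
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    candidates = [
        b"\xe2\x82\xac",  # a complete three-byte character (U+20AC)
        b"\x80",          # a continuation byte with no lead byte
        b"\xe2\x82",      # a three-byte character cut short
        b"\xc0\x80",      # an overlong two-byte form of U+0000
    ]
    for data in candidates:
        print(data.hex(" "), is_valid_utf8(data))
```

Only the first candidate is accepted. The other three are exactly the kinds of sequences that the non-overlap properties force to be illegal.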
@rob_pike's UTF-8 history (https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt …) sheds a lot of light on the desirable props, but some didn't become obvious til later.
Just about the only thing that arguably could have been done better in UTF-8 would have been offsetting the base value for multibyte chars...
So that, for example, c0 80 would represent 0x80+[decoded_bits] = 0x80+0 = 0x80 rather than being an illegal sequence.
This would have allowed shorter encoding of a few chars and eliminated the confusion about "overlong sequences" & their invalidity.
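A sketch of the difference for two-byte sequences only. The offset decoder is the hypothetical variant described above, not real UTF-8, and the function names and `BASE_2BYTE` constant are invented for illustration. How longer sequences would be based is left open here.

```python
BASE_2BYTE = 0x80  # first code point that does not fit in a single byte


def payload_bits(lead: int, cont: int) -> int:
    """Extract the 11 payload bits from a 110xxxxx 10xxxxxx byte pair."""
    if (lead & 0xE0) != 0xC0 or (cont & 0xC0) != 0x80:
        raise ValueError("not a two-byte lead/continuation pair")
    return ((lead & 0x1F) << 6) | (cont & 0x3F)


def decode_real(lead: int, cont: int) -> int:
    """Real UTF-8: the payload is the code point; values below 0x80 are overlong."""
    value = payload_bits(lead, cont)
    if value < BASE_2BYTE:
        raise ValueError("overlong sequence")
    return value


def decode_offset(lead: int, cont: int) -> int:
    """Hypothetical variant: the payload is added to the base, so no pair is overlong."""
    return BASE_2BYTE + payload_bits(lead, cont)


if __name__ == "__main__":
    print(hex(decode_offset(0xC0, 0x80)))  # c0 80 under the hypothetical scheme
    print(hex(decode_real(0xC3, 0xA9)))    # c3 a9 under real UTF-8
```

Under the offset scheme the two-byte range would run from 0x80 to 0x87F instead of 0x80 to 0x7FF, which is where the "shorter encoding of a few chars" comes from, and the decoder would need no overlong check.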