@rob_pike's UTF-8 history (https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt) sheds a lot of light on the desirable properties, but some didn't become obvious until later.
Just about the only thing that arguably could have been done better in UTF-8 would have been offsetting the base value for multibyte chars.
So that, for example, C0 80 would represent 0x80 + [decoded bits] = 0x80 rather than being an illegal sequence.
This would have allowed shorter encodings of a few characters and eliminated the whole confusion about "overlong sequences" and their invalidity.
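
A minimal sketch of what that offset variant might look like for two-byte sequences, next to what UTF-8 actually does. The function names and the exact scheme here are my own illustration, not anything that was ever specified:

    #include <stdint.h>

    /* Actual UTF-8: the two-byte payload bits are the codepoint
     * directly, so C0 80 decodes to 0 -- an overlong encoding of
     * U+0000 that conformant decoders must reject. */
    static int32_t decode2_utf8(uint8_t b1, uint8_t b2)
    {
        if ((b1 & 0xE0) != 0xC0 || (b2 & 0xC0) != 0x80) return -1;
        int32_t c = ((int32_t)(b1 & 0x1F) << 6) | (b2 & 0x3F);
        return c < 0x80 ? -1 : c;   /* reject overlongs */
    }

    /* Hypothetical offset variant: add 0x80, the first value that
     * actually needs two bytes. Every bit pattern is then a distinct
     * valid codepoint: C0 80 is U+0080, and two-byte sequences cover
     * U+0080..U+087F instead of U+0080..U+07FF. */
    static int32_t decode2_offset(uint8_t b1, uint8_t b2)
    {
        if ((b1 & 0xE0) != 0xC0 || (b2 & 0xC0) != 0x80) return -1;
        return 0x80 + (((int32_t)(b1 & 0x1F) << 6) | (b2 & 0x3F));
    }

With the offset, every codepoint has exactly one encoding by construction, so overlong sequences (and the decoder bugs that come from accepting them) can't exist. The three-byte range would likewise start at 0x880 rather than 0x800, which is where the shorter encodings for a few characters come from.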
Replying to @RichFelker
Punycode does that offset trick. Another UTF-8 stupidity is artificially limiting it to UTF-16's range: UTF-8 could have supported up to U+7FFFFFFF.
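
For reference, the sequence-length rule from the original definition (RFC 2279) covered the full 31-bit range with sequences of up to 6 bytes; RFC 3629 later cut it down to 4 bytes / U+10FFFF purely so UTF-8 wouldn't exceed what UTF-16 can address. A sketch, with the function name mine:

    #include <stdint.h>

    /* Bytes needed to encode a codepoint under original UTF-8
     * (RFC 2279), which covered the full 31-bit range. */
    static int utf8_len_rfc2279(uint32_t c)
    {
        if (c < 0x80)       return 1;
        if (c < 0x800)      return 2;
        if (c < 0x10000)    return 3;
        if (c < 0x200000)   return 4;
        if (c < 0x4000000)  return 5;
        if (c < 0x80000000) return 6;
        return -1;          /* beyond U+7FFFFFFF */
    }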
Replying to @bofh453 @FakeUnicode
Because a 1M-entry table (one entry per codepoint up to U+10FFFF) is large but manageable. A 2G-entry table (one per codepoint up to U+7FFFFFFF) absolutely is not.
Thankfully there are ways to reduce the awfulness, with multilevel tables and the sparseness of ranges that have non-constant properties. But still...
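
A sketch of the multilevel idea with a two-level table. The block size, the names, and the toy data are all illustrative; real table generators pick the block size that minimizes the total footprint:

    #include <stdint.h>

    /* Two-level lookup: the codepoint's high bits select a block, the
     * low bits index within it. Blocks whose entries are all identical
     * (e.g. the huge unassigned ranges) all point at one shared row,
     * so the table stays small. */

    #define BLOCK_BITS 8
    #define BLOCK_SIZE (1 << BLOCK_BITS)
    #define NBLOCKS    (0x110000 >> BLOCK_BITS)  /* 4352 blocks */

    enum { PROP_UNKNOWN = 0, PROP_LETTER = 1 };

    /* One shared "all Unknown" row plus one row with distinct data. */
    static const uint8_t block_data[2][BLOCK_SIZE] = {
        [0] = { 0 },                      /* every entry PROP_UNKNOWN */
        [1] = { [0x41] = PROP_LETTER },   /* toy: mark U+0041 'A' */
    };

    /* Most blocks share row 0; only block 0 has distinct data here. */
    static const uint16_t block_index[NBLOCKS] = { [0] = 1 };

    static uint8_t codepoint_property(uint32_t c)
    {
        if (c > 0x10FFFF) return PROP_UNKNOWN;
        return block_data[block_index[c >> BLOCK_BITS]]
                         [c & (BLOCK_SIZE - 1)];
    }

Because the vast unassigned ranges all map to the shared all-Unknown row, the storage cost is driven by the number of distinct blocks, not by the size of the codepoint space.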
Replying to @RichFelker @FakeUnicode
So, because assigned characters are sparse, you can limit the range for *properties* by mapping most codepoints to "Unknown" while still preserving the full conversion range (see the sketch below).
Only as long as they remain unassigned. That's not very comforting.
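
A sketch of that range-capping technique, with the caveat from the reply built in. The names and the limit are illustrative:

    #include <stdint.h>

    #define PROP_LIMIT   0x110000   /* tables cover U+0000..U+10FFFF only */
    #define PROP_UNKNOWN 0

    /* Stand-in for a real table lookup, e.g. the two-level one above. */
    static uint8_t prop_lookup(uint32_t c) { (void)c; return PROP_UNKNOWN; }

    /* Conversion can round-trip the full 31-bit range, while property
     * queries clamp to what the tables cover. As the reply notes, the
     * "Unknown" answer is only truthful while codepoints beyond the
     * cap remain unassigned -- if they're ever assigned, the cap has
     * to move. */
    static uint8_t prop_capped(uint32_t c)
    {
        return c < PROP_LIMIT ? prop_lookup(c) : PROP_UNKNOWN;
    }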