Apparently the extent of UTF-8's amazingness still isn't widely known. Today on #musl someone asked to confirm that multibyte chars never contain ASCII bytes.
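The property asked about can be checked by brute force. A minimal sketch, assuming Python's built-in UTF-8 codec as the reference encoder; the function name is made up for illustration:

```python
def multibyte_has_no_ascii_bytes() -> bool:
    """Check that every non-ASCII code point encodes to bytes >= 0x80 only."""
    for cp in range(0x80, 0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates are not valid scalar values, so skip them
        if any(b < 0x80 for b in chr(cp).encode("utf-8")):
            return False
    return True


if __name__ == "__main__":
    print(multibyte_has_no_ascii_bytes())
```

This is why byte-oriented code that searches for an ASCII delimiter such as "/" or NUL can run over UTF-8 text without decoding it first.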
-
-
Punycode does that offset thing. Another UTF-8 stupidity is artificially limiting it to UTF-16's range. UTF-8 could have supported up to U+7FFFFFFF.
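For reference, the pre-restriction scheme used lead bytes up to 0xFC and sequences of up to 6 bytes to reach U+7FFFFFFF. A rough sketch of that wider encoder, with a hypothetical function name and no surrogate checks:

```python
def encode_original_utf8(cp: int) -> bytes:
    """Encode cp with the original 1..6 byte scheme (values up to 0x7FFFFFFF)."""
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("value out of range for the 6-byte scheme")
    if cp < 0x80:
        return bytes([cp])
    # (exclusive upper limit, lead-byte marker, number of continuation bytes)
    for limit, lead, ncont in (
        (0x800, 0xC0, 1),
        (0x10000, 0xE0, 2),
        (0x200000, 0xF0, 3),
        (0x4000000, 0xF8, 4),
        (0x80000000, 0xFC, 5),
    ):
        if cp < limit:
            break
    out = [lead | (cp >> (6 * ncont))]
    # Each continuation byte carries 6 payload bits under a 10xxxxxx marker.
    for i in range(ncont - 1, -1, -1):
        out.append(0x80 | ((cp >> (6 * i)) & 0x3F))
    return bytes(out)
```

For anything up to U+10FFFF the output matches today's UTF-8; only the 5- and 6-byte forms (and 4-byte forms above U+10FFFF) are what the current definition rules out.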
-
Limiting range is a feature.
-
Because a 1M entry table is large but manageable. A 2G entry table is absolutely not.
-
Thankfully there are ways you can reduce the awfulness with multilevel tables & sparseness of ranges with non-const properties. But still...
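A sketch of the multilevel idea, assuming 256-entry blocks and using `unicodedata.category` as a stand-in property; the function names are illustrative only. Identical blocks are stored once, so the long runs with constant properties cost almost nothing:

```python
import unicodedata

BLOCK = 256  # entries per second-stage block (an arbitrary choice here)


def build_two_stage(prop, limit=0x110000):
    """Return (stage1, stage2): stage1 maps block number -> index into stage2."""
    seen = {}
    stage1, stage2 = [], []
    for start in range(0, limit, BLOCK):
        block = tuple(prop(cp) for cp in range(start, start + BLOCK))
        if block not in seen:
            seen[block] = len(stage2)  # first time this block content appears
            stage2.append(block)
        stage1.append(seen[block])
    return stage1, stage2


def lookup(stage1, stage2, cp):
    """Two array reads replace one read from a flat limit-sized table."""
    return stage2[stage1[cp // BLOCK]][cp % BLOCK]


if __name__ == "__main__":
    s1, s2 = build_two_stage(lambda cp: unicodedata.category(chr(cp)))
    print(len(s1), "first-stage entries,", len(s2), "distinct blocks")
    print(lookup(s1, s2, ord("A")))
```

The first stage still grows linearly with the code space, which is the point being made above: with a 2G range even the index would be millions of entries.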
-
-
Converting between encodings is simpler and faster if there is a small upper bound on the byte size of each encoded code point.
-
I don't think it's simpler or faster, but a limit of 4 bytes does allow in-place (reusing input buffer) conversion from UTF-32 to UTF-8.
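A sketch of that in-place conversion, assuming little-endian UTF-32 in a `bytearray` (the function name is made up). It works because each 4-byte input unit yields at most 4 output bytes, so the write position never passes the read position:

```python
def utf32le_to_utf8_in_place(buf: bytearray) -> int:
    """Rewrite buf from UTF-32LE to UTF-8 in the same storage; return new length."""
    if len(buf) % 4:
        raise ValueError("UTF-32 input must be a multiple of 4 bytes")
    w = 0  # write offset; stays <= the read offset r at all times
    for r in range(0, len(buf), 4):
        cp = int.from_bytes(buf[r:r + 4], "little")  # read before overwriting
        encoded = chr(cp).encode("utf-8")  # raises for surrogates and > U+10FFFF
        buf[w:w + len(encoded)] = encoded  # same-length slice, no resizing
        w += len(encoded)
    del buf[w:]  # drop the unused tail
    return w


if __name__ == "__main__":
    data = bytearray("a\u00e9\u20ac\U0001F600".encode("utf-32-le"))
    print(utf32le_to_utf8_in_place(data), bytes(data))
```

With 5- or 6-byte sequences allowed, one input unit could expand past its own 4 bytes and clobber data that has not been read yet.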
-
-
There is the JVM's Modified UTF-8 encoding, which is like UTF-8 but uses an overlong encoding for U+0000, and encodes non-BMP chars as UTF-8-encoded surrogate pairs.
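A small sketch of those two differences, assuming the commonly documented rules (U+0000 becomes C0 80, and each supplementary character becomes two 3-byte sequences, one per surrogate). The function name is hypothetical and this is not the JVM's own code:

```python
def encode_modified_utf8(text: str) -> bytes:
    """Encode text the way Modified UTF-8 is usually described."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            out += b"\xc0\x80"  # overlong 2-byte form, so no 0x00 byte appears
        elif cp < 0x10000:
            # BMP characters use ordinary UTF-8; lone surrogates pass through.
            out += ch.encode("utf-8", "surrogatepass")
        else:
            v = cp - 0x10000
            for unit in (0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)):
                # Each UTF-16 surrogate is encoded as its own 3-byte sequence.
                out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


if __name__ == "__main__":
    print(encode_modified_utf8("A\x00\U0001F600").hex())
```

So a non-BMP character takes 6 bytes here instead of the 4 that standard UTF-8 uses.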