Apparently the extent of UTF-8's amazingness still isn't widely known. Today on #musl someone asked for confirmation that multibyte characters never contain ASCII bytes.
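That property is easy to check exhaustively. A minimal sketch, assuming Python's built-in UTF-8 encoder as the reference; the helper name `multibyte_has_ascii_byte` is made up for illustration:

```python
def multibyte_has_ascii_byte(cp: int) -> bool:
    """True if the UTF-8 encoding of code point cp is multibyte
    and nevertheless contains a byte in the ASCII range (< 0x80)."""
    encoded = chr(cp).encode("utf-8")
    return len(encoded) > 1 and any(b < 0x80 for b in encoded)


# Check every Unicode scalar value. Surrogates (U+D800..U+DFFF) are
# skipped because they have no UTF-8 encoding.
offenders = [
    cp
    for cp in range(0x110000)
    if not 0xD800 <= cp <= 0xDFFF and multibyte_has_ascii_byte(cp)
]
print(len(offenders))
```

Every byte of a multibyte sequence has its high bit set (lead bytes are 0xC2..0xF4, continuation bytes are 0x80..0xBF), so the list should come out empty.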
-
-
Because a 1M entry table is large but manageable. A 2G entry table is absolutely not.
-
Thankfully you can reduce the awfulness with multilevel tables and by exploiting how sparse the ranges with non-constant properties are. But still...
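For anyone who hasn't seen the multilevel trick, here is a rough sketch of a two-level table. The 256-entry block size, the function names, and the toy `is_upper` property are all assumptions picked for illustration, not anything musl actually does:

```python
BLOCK_BITS = 8                 # assumed block size: 256 code points
BLOCK_SIZE = 1 << BLOCK_BITS


def build_two_level(prop, limit=0x110000):
    """Build (index, blocks) for a per-code-point property.

    index[cp >> BLOCK_BITS] is a position in blocks; identical blocks
    are stored only once, which is where the space saving comes from.
    """
    blocks = []   # unique blocks, each a tuple of BLOCK_SIZE values
    seen = {}     # block contents -> position in blocks
    index = []    # one entry per block of the code point range
    for start in range(0, limit, BLOCK_SIZE):
        block = tuple(prop(cp) for cp in range(start, start + BLOCK_SIZE))
        if block not in seen:
            seen[block] = len(blocks)
            blocks.append(block)
        index.append(seen[block])
    return index, blocks


def lookup(index, blocks, cp):
    """Look up the property of cp with two array accesses."""
    return blocks[index[cp >> BLOCK_BITS]][cp & (BLOCK_SIZE - 1)]


def is_upper(cp):
    """Toy property for the demo: Python's idea of an uppercase character."""
    return chr(cp).isupper()


index, blocks = build_two_level(is_upper)
print(len(index), len(blocks))  # blocks in the range vs. blocks actually stored
```

Most blocks are constant (for this property, all false), so they collapse into a single shared block. The first-level index still grows linearly with the size of the code point range, though, which is why the range limit matters.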
-
-
Converting between encodings is simpler and faster if there is a small upper bound on the byte size of each encoded code point.
-
I don't think it's simpler or faster, but a limit of 4 bytes does allow in-place (reusing input buffer) conversion from UTF-32 to UTF-8.
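A sketch of why the 4-byte limit makes that work, assuming little-endian UTF-32 in a `bytearray`; the function name is hypothetical. Each code point is read from a 4-byte slot and written back as at most 4 bytes, so the write position can never pass the read position:

```python
import struct


def utf32le_to_utf8_in_place(buf: bytearray) -> int:
    """Rewrite UTF-32LE code units in buf as UTF-8, reusing the buffer.

    Returns the number of UTF-8 bytes now at the start of buf.
    On invalid input it raises ValueError, leaving buf partly converted.
    """
    if len(buf) % 4:
        raise ValueError("length is not a multiple of 4")
    write = 0
    for read in range(0, len(buf), 4):
        (cp,) = struct.unpack_from("<I", buf, read)
        if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError("not a Unicode scalar value")
        if cp < 0x80:
            out = bytes([cp])
        elif cp < 0x800:
            out = bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
        elif cp < 0x10000:
            out = bytes([0xE0 | (cp >> 12),
                         0x80 | ((cp >> 6) & 0x3F),
                         0x80 | (cp & 0x3F)])
        else:
            out = bytes([0xF0 | (cp >> 18),
                         0x80 | ((cp >> 12) & 0x3F),
                         0x80 | ((cp >> 6) & 0x3F),
                         0x80 | (cp & 0x3F)])
        # Invariant: write <= read and len(out) <= 4, so this store ends
        # at or before read + 4 and never touches input not yet consumed.
        buf[write:write + len(out)] = out
        write += len(out)
    return write


text = "a\u00e9\u20ac\U0001F600"          # 1-, 2-, 3- and 4-byte characters
buf = bytearray(text.encode("utf-32-le"))
n = utf32le_to_utf8_in_place(buf)
print(bytes(buf[:n]) == text.encode("utf-8"))
```

With the old 6-byte forms for code points up to 31 bits, a single 4-byte input slot could expand to 6 output bytes and overwrite input that hasn't been read yet.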
-
Definitely potentially faster if your vector width is large enough and you have a compress operation.
-
I don't see any viable way to vectorize the conversion unless you can assume validity of input, and even then it sounds hard.
-
How will we fit all the emoji!