People writing fast UTF-8 decoders are crying over Zen 2 having slow PEXT, to give some idea how esoteric that gets.
I have no problem with PEXT (called BEXT in RISC-V BitManip). Here are my encoder and decoder for LEB128 as used in the DWARF debug file format (without returning the symbol length, because that's trivial). But I have a really hard time coming up with reasonable branchless code for UTF-8. pic.twitter.com/oVxKYeye95
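For readers without the screenshot, here is a minimal sketch of unsigned LEB128 in C. It is a plain loop version, not the PEXT/BEXT-based code from the tweet, and the names `uleb128_encode` and `uleb128_decode` are made up for illustration. The decoder assumes well-formed input.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode `value` as unsigned LEB128 into `out`.
   `out` needs room for up to 10 bytes. Returns the number of bytes written. */
size_t uleb128_encode(uint64_t value, uint8_t *out)
{
    size_t n = 0;
    do {
        uint8_t byte = (uint8_t)(value & 0x7f); /* low 7 payload bits */
        value >>= 7;
        if (value != 0)
            byte |= 0x80;                       /* continuation bit: more bytes follow */
        out[n++] = byte;
    } while (value != 0);
    return n;
}

/* Hypothetical helper: decode unsigned LEB128 from `in` into `*value`.
   Assumes well-formed input; stops after 10 bytes at the latest.
   Returns the number of bytes consumed. */
size_t uleb128_decode(const uint8_t *in, uint64_t *value)
{
    uint64_t result = 0;
    unsigned shift = 0;
    size_t n = 0;
    uint8_t byte;
    do {
        byte = in[n++];
        result |= (uint64_t)(byte & 0x7f) << shift; /* place 7 bits at their position */
        shift += 7;
    } while ((byte & 0x80) && shift < 64);
    *value = result;
    return n;
}
```

Each byte carries 7 payload bits, least significant group first, which is why gathering or scattering them with a single PEXT/PDEP-style instruction is attractive.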
UTF-8 is, like, made for bitshifting. It's all about the bits. pic.twitter.com/Jl1pMlbqU8
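To make the "all about the bits" point concrete, here is a rough sketch of a shift-and-mask UTF-8 encoder in C. It is the ordinary branchy form, not the code in the attached picture and not a branchless variant; `utf8_encode` is a name invented for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point as UTF-8 into `out`
   (room for 4 bytes). Returns the byte count, or 0 if `cp` is a
   surrogate or lies above U+10FFFF. */
size_t utf8_encode(uint32_t cp, uint8_t *out)
{
    if (cp < 0x80) {                         /* 0xxxxxxx */
        out[0] = (uint8_t)cp;
        return 1;
    }
    if (cp < 0x800) {                        /* 110xxxxx 10xxxxxx */
        out[0] = (uint8_t)(0xC0 | (cp >> 6));
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                      /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                        /* surrogates are not encodable */
        out[0] = (uint8_t)(0xE0 | (cp >> 12));
        out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                    /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (uint8_t)(0xF0 | (cp >> 18));
        out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (uint8_t)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                                /* beyond the Unicode range */
}
```

The shifts are simple; what makes a branchless version awkward is that the length, the lead-byte prefix, and the validity checks all depend on the range of `cp`.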
So far this is the best I could do. The lookup tables are ugly, of course, and even with them it still takes approximately 18 RISC-V instructions per encoded code point. pic.twitter.com/5navEJ0ZTD
UTF is quite mad, and UTF-8 actually made it worse because it mostly is interchangeable with ASCII so mostly appears to work - except when it doesn’t.
The surrogate range (U+D800 to U+DFFF) is forbidden as code points forever, because people thought 16 bits had to be enough for everyone. And now the range of all code points ever is capped at about 1.1 million (U+10FFFF), because surrogate pairs can't encode anything beyond that. Haha, and you say UTF-8 is crazy...
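The arithmetic behind that limit can be sketched in a few lines of C. The function names are invented for illustration, and the decoder assumes it is handed a valid high/low pair.

```c
#include <stdint.h>

/* Hypothetical helper: split a supplementary code point into a UTF-16
   surrogate pair. Returns 1 on success, 0 if `cp` is outside
   U+10000..U+10FFFF. */
int utf16_encode_pair(uint32_t cp, uint16_t *hi, uint16_t *lo)
{
    if (cp < 0x10000 || cp > 0x10FFFF)
        return 0;
    uint32_t v = cp - 0x10000;              /* 20-bit offset */
    *hi = (uint16_t)(0xD800 | (v >> 10));   /* top 10 bits */
    *lo = (uint16_t)(0xDC00 | (v & 0x3FF)); /* bottom 10 bits */
    return 1;
}

/* Hypothetical helper: combine a valid surrogate pair back into a code point. */
uint32_t utf16_decode_pair(uint16_t hi, uint16_t lo)
{
    uint32_t top = (uint32_t)(hi - 0xD800) << 10;
    uint32_t bottom = (uint32_t)(lo - 0xDC00);
    return 0x10000 + (top | bottom);
}
```

Two surrogates carry 10 bits each, so the largest value a pair can express is 0x10000 + 0xFFFFF = 0x10FFFF, which is where the code point range ends.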
The interesting design point of UTF-8 is the error detection capability, designed around serial with no stop bits. In a modern world, with layering of error detection and correction, it ends up being the only place for errors in terms of degrees of freedom. It makes it complex.
The very first code I got committed to Samsung Android was NEON code to convert between UTF-8 and UCS-2. It’ll be easier in RVV.