I liked UTF-8 more before I started to think hard about efficient RISC-V BitManip code for encoding/decoding it.
Replying to @oe1cxw
UTF-8 is, like, made for bitshifting. It's all about the bits. pic.twitter.com/Jl1pMlbqU8
Replying to @FakeUnicode
So far this is the best I could do. The lookup tables are ugly, of course, and even with them it still takes approximately 18 RISC-V instructions per encoded code point. pic.twitter.com/5navEJ0ZTD
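For readers without the screenshot, here is a minimal sketch of what "encoding a code point" means, in plain portable C. It is not the table-driven BitManip code from the image; it is an ordinary branchy reference version, and the name `utf8_encode` and its return convention are assumptions made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the thread): encode one code point as
 * UTF-8 into out[0..3]. Returns the number of bytes written (1..4), or 0
 * if cp is a surrogate or lies above U+10FFFF. */
size_t utf8_encode(uint32_t cp, uint8_t out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {                      /* 0xxxxxxx */
        out[0] = (uint8_t)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 110xxxxx 10xxxxxx */
        out[0] = (uint8_t)(0xC0 | (cp >> 6));
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                   /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                     /* surrogates are not encodable */
        out[0] = (uint8_t)(0xE0 | (cp >> 12));
        out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (uint8_t)(0xF0 | (cp >> 18));
        out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (uint8_t)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                             /* above U+10FFFF */
}
```

The length decision is where the branches (or, in a branch-free version, the lookup tables) come from; the shifts and masks themselves are the cheap part.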
Replying to @oe1cxw @FakeUnicode
Just finished the decoder. I'm reasonably happy with the result. Also 18 instructions per code point, but no tables needed for this one. pic.twitter.com/b0qOcYmED3
Replying to @oe1cxw
How is it with edge cases?
overbyte: c0 80 e0 80 80 f0 80 80 80 f8 80 80 80 80 fc 80 80 80 80 80
cesu-8: ed a0 bd ed b1 86
>10FFFF: f4 90 80 80
5 byte nonsense: f9 84 80 80 80
split surrogates: ed a0 80 ed bf bf
overbyte split surrogates: f0 8d a0 80 f0 8d bf bf
This is just a benchmark example to see if the hot loop can be written branch-free using RISC-V BitManip instructions. Edge cases can be handled in branches that are marked with __builtin_expect() as not taken.
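To make the "unlikely branch" idea concrete, here is a rough sketch in C of a decoder that pushes the edge cases listed earlier in the thread into branches hinted as not taken. It makes no attempt at the branch-free BitManip hot loop and is not the code from the screenshot; `utf8_decode` and `UNLIKELY` are hypothetical names, and `__builtin_expect` is a GCC/Clang builtin, so this assumes one of those compilers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hint to GCC/Clang that the condition is expected to be false. */
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

/* Hypothetical helper (not from the thread): decode one code point from
 * s[0..n-1]. Returns the number of bytes consumed (1..4) and stores the
 * code point in *cp, or returns 0 on any malformed sequence. */
size_t utf8_decode(const uint8_t *s, size_t n, uint32_t *cp)
{
    assert(s != NULL && cp != NULL);
    if (UNLIKELY(n == 0))
        return 0;

    uint32_t b0 = s[0];
    if (b0 < 0x80) {                      /* ASCII fast path */
        *cp = b0;
        return 1;
    }

    size_t len;
    uint32_t min, v;
    if ((b0 & 0xE0) == 0xC0)      { len = 2; min = 0x80;    v = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; min = 0x800;   v = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; min = 0x10000; v = b0 & 0x07; }
    else return 0;                        /* stray 10xxxxxx, or f8..ff lead byte */

    if (UNLIKELY(n < len))
        return 0;                         /* truncated input */
    for (size_t i = 1; i < len; i++) {
        if (UNLIKELY((s[i] & 0xC0) != 0x80))
            return 0;                     /* not a continuation byte */
        v = (v << 6) | (s[i] & 0x3F);
    }
    if (UNLIKELY(v < min))
        return 0;                         /* overlong encoding */
    if (UNLIKELY(v >= 0xD800 && v <= 0xDFFF))
        return 0;                         /* surrogate, including CESU-8 pairs */
    if (UNLIKELY(v > 0x10FFFF))
        return 0;                         /* beyond the Unicode range */

    *cp = v;
    return len;
}
```

With these checks, every sequence in the edge-case list above is rejected at its first lead byte, while well-formed input only ever falls through the hinted branches.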