Apparently the extent of UTF-8's amazingness still isn't widely known. Today on #musl someone asked to confirm that multibyte chars never contain ASCII bytes.
-
But understanding the design goals/motivations makes it pretty obvious that this was the only reasonable tradeoff.
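The claim above is easy to check directly: in UTF-8, every byte of a multibyte sequence (lead byte and continuation bytes alike) has the high bit set, so no ASCII byte can ever appear inside one. A minimal sketch in Python; the specific characters are arbitrary examples:

```python
# Every byte of a UTF-8 multibyte sequence is >= 0x80, so bytes like
# '/', '\0', or '\n' (all < 0x80) can never occur inside one.
for ch in "é€𝄞中":  # 2-, 3-, 4-, and 3-byte encodings respectively
    encoded = ch.encode("utf-8")
    assert len(encoded) > 1                     # all multibyte
    assert all(b >= 0x80 for b in encoded)      # no ASCII bytes
print("no ASCII bytes in any multibyte sequence")
```

This is the property that makes naive byte-oriented code (path splitting on '/', C string handling on '\0') safe on UTF-8 data.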
@rob_pike's UTF-8 history (https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt) sheds a lot of light on the desirable properties, but some didn't become obvious until later.
Just about the only thing that arguably could have been done better in UTF-8 would have been offsetting the base value for multibyte chars.
So that, for example, C0 80 would decode to 0x80 + [decoded bits] = 0x80 rather than being an illegal sequence.
This would have allowed shorter encoding of a few chars and eliminated the confusion about "overlong sequences" & their invalidity.
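Both points can be sketched concretely. C0 80 is an "overlong" encoding of U+0000: its payload bits decode to 0, a value that already fits in one byte, so conforming decoders (including Python's) must reject it. The offset arithmetic below mirrors the hypothetical scheme described above, not anything in the actual standard:

```python
# C0 80 is an overlong two-byte encoding of U+0000; strict decoders reject it.
try:
    b"\xc0\x80".decode("utf-8")
    overlong_accepted = True
except UnicodeDecodeError:
    overlong_accepted = False
print(overlong_accepted)  # False

# Under the hypothetical offset scheme, each sequence length would start
# at the first value not representable in the shorter form, so C0 80
# would decode to 0x80 + 0 = 0x80 instead of being illegal.
payload = ((0xC0 & 0x1F) << 6) | (0x80 & 0x3F)  # payload bits = 0
print(hex(0x80 + payload))  # 0x80
```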
-
I actually see that as a very nice "feature". It makes it easy to guess if a long piece of data is UTF-8 or random binary.
-
Yes, once you weigh all the pros and cons, I agree it's a feature. But in my experience it's hard for people to accept it at first.
-
I'm probably biased because I've written auto-detectors for various binary formats. When a format accepts all bit patterns, it's a pain.
-
With regard to auto-detection, the best part about UTF-8 is that false positives are extremely rare.
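One rough way to see how rare false positives are is to feed random binary to a strict UTF-8 decoder; because the valid byte strings are such a tiny fraction of all byte strings, random data essentially never validates. This is an illustrative sketch, not a measurement of any particular detector:

```python
import random

def looks_like_utf8(data: bytes) -> bool:
    """Return True iff data is valid UTF-8 (strict decoding)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

random.seed(0)  # deterministic samples for reproducibility
samples = [bytes(random.randrange(256) for _ in range(64))
           for _ in range(1000)]
false_positives = sum(looks_like_utf8(s) for s in samples)
print(false_positives)  # 0 — valid UTF-8 is vanishingly rare in random data
```

The flip side, as noted above, is that a format accepting all bit patterns gives a detector nothing to reject on.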