Also, UTF-8 has problems with uniform representation of POSIX-compatible and Windows-compatible path names. Currently Rust has some workarounds, such as special arf-strings, OsString, etc.
That's not accurate. UTF-8, UTF-16 and UTF-32 represent exactly the same set of strings. They're encodings of Unicode. JavaScript is not using UTF-16 but rather has a legacy implementation that's really an array of UCS-2 characters, which permits invalid Unicode strings.
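A minimal Rust sketch of both points, using only the standard library. The helper names `round_trip` and `is_valid_utf16` are made up for illustration. The first sends a string through UTF-16 and UTF-32 and back to UTF-8; the second checks whether a sequence of 16-bit code units is valid UTF-16, which a lone surrogate of the kind a JavaScript string can hold is not.

```rust
/// Round-trips a string through UTF-16 and UTF-32 and back to UTF-8.
/// (Hypothetical helper, for illustration only.)
fn round_trip(s: &str) -> String {
    // UTF-8 -> UTF-16 code units.
    let utf16: Vec<u16> = s.encode_utf16().collect();

    // UTF-16 -> UTF-32 code points. Input produced from a &str is always valid.
    let utf32: Vec<u32> = char::decode_utf16(utf16.iter().copied())
        .map(|r| r.expect("UTF-16 produced from a &str is valid") as u32)
        .collect();

    // UTF-32 -> UTF-8 (a Rust String is always UTF-8).
    utf32
        .iter()
        .map(|&cp| char::from_u32(cp).expect("valid Unicode scalar value"))
        .collect()
}

/// Returns true if the 16-bit code units form valid UTF-16.
/// (Hypothetical helper, for illustration only.)
fn is_valid_utf16(units: &[u16]) -> bool {
    String::from_utf16(units).is_ok()
}
```

Here `round_trip("héllo 😀")` returns the same text, `is_valid_utf16(&[0xD83D, 0xDE00])` accepts a proper surrogate pair, and `is_valid_utf16(&[0xD800])` rejects a lone surrogate.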
UTF-8 is able to encode every single string that UTF-16 and UTF-32 can encode. Your claims are inaccurate.
Rust is not working around any issues with UTF-8 but rather OS path names are usually not guaranteed to be Unicode. NTFS and ext4 paths are allowed to be invalid Unicode.
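A small sketch of what that looks like from Rust, assuming a Unix target (the `std::os::unix` extension trait is not available on Windows). The helper names `os_str_from_bytes` and `is_unicode` are made up for illustration. An `OsStr` can hold bytes that are not valid UTF-8, and converting it to `&str` fails in that case.

```rust
use std::ffi::OsStr;
use std::os::unix::ffi::OsStrExt; // Unix-only extension trait

/// Views arbitrary bytes as an OS string. On Unix no validation is performed.
/// (Hypothetical helper, for illustration only.)
fn os_str_from_bytes(bytes: &[u8]) -> &OsStr {
    OsStr::from_bytes(bytes)
}

/// True if the OS string happens to be valid UTF-8 and can be viewed as &str.
/// (Hypothetical helper, for illustration only.)
fn is_unicode(name: &OsStr) -> bool {
    name.to_str().is_some()
}
```

For example, `os_str_from_bytes(b"fo\xFF")` is a legal file name on most *nix filesystems, yet `is_unicode` returns `false` for it, and `to_string_lossy()` has to substitute U+FFFD for the invalid byte.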
Most *nix filesystems permit paths to be any NUL-terminated C string with a special meaning for the slash character. That's why you can't represent them with Unicode strings. This is not an issue with UTF-8. You can't represent them as UTF-16 or UTF-32 either. You're very wrong.
What's specifically wrong with the slash character in a path? At least for UTF8, slash is in the first 128 characters (since slash exists in ASCII as well).
Semantically, it carries additional info, but I don't see what's wrong w/ storing it in a UTF8 string.
It's not a problem. I was explaining that the restrictions on what can be in a path stop far short of allowing only valid Unicode. UTF-8 works fine as the encoding for *nix paths but there's nothing stopping anyone from using any other byte strings without internal \0 characters.
I just mentioned slash because the special meaning means a filename can't contain either NUL or slash unlike a path as a whole which just can't contain NUL.
Unicode permits NUL inside strings, so not every Unicode string can be converted in a lossless way to a path either.
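These rules can be sketched as byte-level checks in Rust. This is a simplified model that ignores length limits and filesystem-specific restrictions, and the function names are made up for illustration. The last helper shows the reverse direction: a valid Unicode string with an interior NUL cannot be turned into a C string.

```rust
use std::ffi::CString;

/// A single file name: any non-empty byte string without NUL or '/'.
/// (Simplified; hypothetical helper.)
fn is_valid_file_name(bytes: &[u8]) -> bool {
    !bytes.is_empty() && !bytes.iter().any(|&b| b == 0 || b == b'/')
}

/// A whole path: any non-empty byte string without NUL.
/// (Simplified; hypothetical helper.)
fn is_valid_path(bytes: &[u8]) -> bool {
    !bytes.is_empty() && !bytes.contains(&0)
}

/// Tries to turn a Unicode string into a NUL-terminated C string.
/// Fails when the string contains an interior NUL.
fn to_c_path(s: &str) -> Option<CString> {
    CString::new(s).ok()
}
```

With these, `is_valid_file_name(b"fo\xFF")` is `true` even though the bytes are not valid UTF-8, while `to_c_path("a\0b")` is `None` even though the input is valid Unicode.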
Should ideally avoid becoming a problem elsewhere and should be solved for JavaScript. They could add a document/program wide mode where valid Unicode strings are enforced and then people can opt-out of the problem. Can require it to use new features like they often do with TLS.
I don't really think they need new APIs or API redesign to fix JavaScript's Unicode issues. Need a way to opt-out of legacy strings. In practice, not much would break, and the breakage would be a nice way of uncovering a lot of latent bugs that are potentially already serious.
It would then be using UTF-16, which is still unfortunate due to wasted memory, engine complexity from optimizations to avoid wasting as much memory, conversion overhead, etc.
Separate feature could be adding a nice new string type using UTF-8 and requiring the Unicode mode.
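To make the memory point concrete, here is a rough Rust sketch that compares the storage needed for the same text in each encoding. The helper names are made up for illustration, and the counts cover only the encoded text, not any per-string overhead an engine adds.

```rust
/// Bytes needed to store `s` as UTF-8 (Rust strings are already UTF-8).
fn utf8_len(s: &str) -> usize {
    s.len()
}

/// Bytes needed to store `s` as UTF-16: two bytes per 16-bit code unit.
fn utf16_len(s: &str) -> usize {
    s.encode_utf16().count() * 2
}
```

For ASCII-heavy text such as `"hello"`, UTF-8 takes 5 bytes and UTF-16 takes 10. The ratio depends on the text: `"日本語"` takes 9 bytes in UTF-8 and 6 in UTF-16.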
It really wouldn't be that hard to just turn JavaScript strings into UTF-16 where you opt in, either with an equivalent to "use strict" (globally or not at all) or via document metadata like a header.
It'd be nice to give it a modern immutable UTF-8 string type but not needed.