Conversation

That's not accurate. UTF-8, UTF-16 and UTF-32 represent exactly the same set of strings. They're encodings of Unicode. JavaScript is not really using UTF-16 but rather has a legacy implementation that's an array of 16-bit UCS-2 code units, which permits invalid Unicode strings such as unpaired surrogates.
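A rough way to see this, sketched in Python (whose `str` type also tolerates lone surrogates, much like JavaScript strings): a valid string round-trips through all three encodings, while an unpaired surrogate is rejected by all three. The helper name `encodable` and the sample text are made up for illustration.

```python
# Sample text with a character outside the Basic Multilingual Plane (U+1F600).
text = "na\u00efve caf\u00e9 \U0001F600"

CODECS = ("utf-8", "utf-16", "utf-32")

# Every valid Unicode string round-trips through every UTF encoding.
for codec in CODECS:
    assert text.encode(codec).decode(codec) == text


def encodable(s: str, codec: str) -> bool:
    """Hypothetical helper: True if `s` can be encoded with `codec`."""
    try:
        s.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# An unpaired surrogate fits in an array of 16-bit code units,
# but it is not a valid Unicode string, so no UTF encoding accepts it.
lone_surrogate = "\ud800"
results = {codec: encodable(lone_surrogate, codec) for codec in CODECS}
print(results)
```

The point of the sketch is that the difference between the encodings is size and layout, not which strings they can express.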
UTF-8 is able to encode every single string that UTF-16 and UTF-32 can encode. Your claims are inaccurate. Rust is not working around any issues with UTF-8 but rather OS path names are usually not guaranteed to be Unicode. NTFS and ext4 paths are allowed to be invalid Unicode.
Most *nix filesystems permit paths to be any NUL-terminated C string with a special meaning for the slash character. That's why you can't represent them with Unicode strings. This is not an issue with UTF-8. You can't represent them as UTF-16 or UTF-32 either. You're very wrong.
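As a minimal sketch of that rule (it only models the byte-level constraint described here and does not touch a real filesystem), the following checks a byte string that would be acceptable as a path component yet is not valid UTF-8. The names `is_valid_utf8`, `ok_name` and `bad_name` are illustrative.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Hypothetical helper: True if `raw` decodes as strict UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# A name that happens to be valid UTF-8.
ok_name = "r\u00e9sum\u00e9.txt".encode("utf-8")

# A name with no NUL and no slash, so it satisfies the rule above,
# but 0xFF can never appear in well-formed UTF-8.
bad_name = b"report-\xff\xfe.txt"

for name in (ok_name, bad_name):
    allowed_by_rule = b"\0" not in name and b"/" not in name
    print(name, "allowed:", allowed_by_rule, "valid UTF-8:", is_valid_utf8(name))
```

Since `bad_name` does not correspond to any Unicode string, re-encoding as UTF-16 or UTF-32 would not help either.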
What's specifically wrong with the slash character in a path? At least for UTF-8, slash is in the first 128 characters (since slash exists in ASCII as well). Semantically, it carries additional info, but I don't see what's wrong w/ storing it in a UTF-8 string.
It's not a problem. I was explaining that the restrictions on what can be in a path stop far short of allowing only valid Unicode. UTF-8 works fine as the encoding for *nix paths but there's nothing stopping anyone from using any other byte strings without internal \0 characters.
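One reason UTF-8 works fine here, sketched below: every byte of a multi-byte UTF-8 sequence has its high bit set, so the ASCII slash byte 0x2F can only ever be an actual slash. Splitting the raw bytes on `b"/"` therefore never cuts a character in half. The sample path is made up for illustration.

```python
# Path with non-ASCII components (Latin, Cyrillic, CJK).
path = "donn\u00e9es/\u0444\u0430\u0439\u043b/\u540d\u524d.txt"
raw = path.encode("utf-8")

# Bytes of 2-, 3- and 4-byte sequences are all >= 0x80,
# so none of them can collide with b"/" (0x2F).
multibyte = "\u00e9\u0444\u540d\U0001F600".encode("utf-8")
assert all(b >= 0x80 for b in multibyte)

# Splitting at the byte level gives the same components
# as splitting the decoded string.
components = [part.decode("utf-8") for part in raw.split(b"/")]
assert components == path.split("/")
print(components)
```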
I just mentioned slash because its special meaning means a filename can't contain either NUL or slash, unlike a path as a whole, which just can't contain NUL. Unicode permits NUL inside strings, so not every Unicode string can be converted in a lossless way to a path either.
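The two rules can be written down as a small sketch. It models only the NUL and slash constraints mentioned here and ignores everything else a real system enforces (length limits, `.` and `..`, and so on); the function names are hypothetical.

```python
def valid_posix_path(raw: bytes) -> bool:
    """A path as a whole: non-empty and no NUL byte."""
    return len(raw) > 0 and b"\0" not in raw


def valid_posix_filename(raw: bytes) -> bool:
    """A single filename: the path rule plus no slash."""
    return valid_posix_path(raw) and b"/" not in raw


# A slash is fine in a path but not inside one filename.
assert valid_posix_path(b"usr/share/doc")
assert not valid_posix_filename(b"usr/share/doc")

# "a\0b" is a perfectly valid Unicode string (it encodes cleanly),
# yet its encoding cannot be used as a path.
with_nul = "a\0b"
encoded = with_nul.encode("utf-8")
print(encoded, valid_posix_path(encoded))
```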
This Tweet was deleted by the Tweet author.
This should ideally avoid becoming a problem elsewhere and should be solved for JavaScript. They could add a document- or program-wide mode where valid Unicode strings are enforced, and then people can opt out of the problem. They can require it in order to use new features, like they often do with TLS.
I don't really think they need new APIs or an API redesign to fix JavaScript's Unicode issues. They need a way to opt out of legacy strings. In practice, not much would break, and the breakage would be a nice way of uncovering a lot of latent bugs that are potentially already serious.
It would then be using UTF-16, which is still unfortunate due to wasted memory, engine complexity from optimizations to avoid wasting as much memory, conversion overhead, etc. A separate feature could be adding a nice new string type using UTF-8 and requiring the Unicode mode.
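For a feel of the memory point, here is a quick size comparison for ASCII-heavy text, which is what a lot of program data looks like. It only counts encoded bytes and says nothing about what any particular engine does internally; the sample string is made up.

```python
# Illustrative ASCII-only payload.
sample = '{"user": "alice", "roles": ["admin", "dev"], "active": true}'

utf8_size = len(sample.encode("utf-8"))
# "utf-16-le" is used so a byte order mark is not counted.
utf16_size = len(sample.encode("utf-16-le"))

print("UTF-8 bytes: ", utf8_size)
print("UTF-16 bytes:", utf16_size)

# ASCII characters take 1 byte in UTF-8 and 2 bytes in UTF-16.
assert utf16_size == 2 * utf8_size
```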
This Tweet was deleted by the Tweet author.
It shouldn't be possible for anything involving strings to end up containing invalid Unicode. With that properly enforced, most of the problems go away. When converting from arbitrary bytes, you can choose between multiple approaches, and that's already an existing choice.
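A sketch of what that choice can look like when decoding arbitrary bytes as UTF-8: reject the input, substitute U+FFFD for the bad parts, or simply keep the data as bytes and never call it a string. The variable names are illustrative, and this is only one way to lay out the options.

```python
# Valid UTF-8 for "café " followed by one stray byte.
raw = b"caf\xc3\xa9 \xff"

# Approach 1: strict. Invalid input is an error, no string is produced.
try:
    strict_result = raw.decode("utf-8")
except UnicodeDecodeError:
    strict_result = None

# Approach 2: lossy. Invalid bytes become U+FFFD REPLACEMENT CHARACTER,
# so the result is always a valid Unicode string.
replaced = raw.decode("utf-8", errors="replace")

# Approach 3: don't convert. Keep opaque data as bytes end to end.
kept_as_bytes = raw

print(strict_result, repr(replaced), kept_as_bytes)

# Whatever the lossy path produces can be re-encoded cleanly.
assert replaced.encode("utf-8").decode("utf-8") == replaced
```

None of the three ever yields a string holding invalid Unicode, which is the property being argued for.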