I'm considering requiring file names to be UTF-8 on Sortix. Traditionally Unix allows any byte sequence (except '\0' and '/') in an unspecified encoding. This just creates problems. File names you can't type. Programming languages with UTF-16 strings can't access them.
It's nice, in a way, that the kernel doesn't care about the encoding. But it shifts the problem to user-space. Just today a bot broke at work because of a file name that wasn't UTF-8 and Python just couldn't handle that. It's 2018, I can try and make everyone use a UTF-8 locale.
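A minimal Python sketch of the user-space problem described above. It assumes the file name arrives as raw bytes; the helper names `decode_name`, `encode_name` and `is_valid_utf8` are made up for illustration. Python's `surrogateescape` error handler (PEP 383) is one way to carry undecodable bytes through a `str` without losing them, though code that later encodes the name strictly as UTF-8 still fails.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the byte string is well-formed UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def decode_name(raw: bytes) -> str:
    """Decode a file name, smuggling undecodable bytes as lone surrogates."""
    return raw.decode("utf-8", errors="surrogateescape")


def encode_name(name: str) -> bytes:
    """Reverse decode_name, restoring the original bytes exactly."""
    return name.encode("utf-8", errors="surrogateescape")


# "café.txt" encoded as Latin-1, which is not valid UTF-8.
raw = b"caf\xe9.txt"
name = decode_name(raw)

# The round trip is lossless...
round_trip = encode_name(name)

# ...but a strict encode (e.g. when printing or logging) raises.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```

Requiring UTF-8 in the kernel would make the `is_valid_utf8` check always succeed, so the escape path would never be needed.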
I'm also considering disallowing newlines in file names. I can't think of a good use. Their existence in file names is why some tools produce/consume \0 delimited records, but many tools just do newlines (especially for portability). Forbidding them makes things a lot simpler.
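To illustrate why newlines in file names push tools toward \0 delimited records, here is a small Python sketch. The function `split_records` and the sample names are hypothetical; the point is only that a newline inside a name is indistinguishable from a record separator, while a NUL byte can never appear in a name.

```python
from typing import List


def split_records(data: bytes, delimiter: bytes) -> List[bytes]:
    """Split a delimiter-terminated stream into records."""
    records = data.split(delimiter)
    # A terminated stream ends with an empty trailing piece; drop it.
    if records and records[-1] == b"":
        records.pop()
    return records


# Three file names, one of which contains a newline.
names = [b"a.txt", b"odd\nname.txt", b"b.txt"]

newline_stream = b"".join(n + b"\n" for n in names)
nul_stream = b"".join(n + b"\0" for n in names)

# Newline-delimited parsing sees four records instead of three.
from_newlines = split_records(newline_stream, b"\n")

# NUL-delimited parsing recovers the original names.
from_nuls = split_records(nul_stream, b"\0")
```

If newlines were forbidden in file names, the newline-delimited form would be just as reliable as the NUL-delimited one.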
Interoperability is a key concern. What if I mount such a filesystem? I could try to 'correct' the file name (drop or fix chars), or have a fallback representation (hex-encode). I could move the inode to /lost+found as corrupted or return EIO.
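One possible shape for the hex-encoded fallback, sketched in Python. Everything here is an assumption for illustration: the `%hex%` marker, the function names, and the choice to also escape names that already start with the marker (so the mapping stays reversible). It is not how Sortix or any existing filesystem driver does it.

```python
PREFIX = "%hex%"  # hypothetical marker for escaped names


def present_name(raw: bytes) -> str:
    """Map an on-disk name to a UTF-8-clean, newline-free name."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return PREFIX + raw.hex()
    # Escape names with newlines, and names that would collide
    # with the marker, so that original_name() stays unambiguous.
    if "\n" in text or text.startswith(PREFIX):
        return PREFIX + raw.hex()
    return text


def original_name(shown: str) -> bytes:
    """Recover the on-disk bytes from a presented name."""
    if shown.startswith(PREFIX):
        return bytes.fromhex(shown[len(PREFIX):])
    return shown.encode("utf-8")


samples = [b"plain.txt", b"caf\xe9.txt", b"two\nlines", b"%hex%6162"]
shown = [present_name(s) for s in samples]
recovered = [original_name(s) for s in shown]
```

The trade-off of such a scheme is that escaped names are unreadable, but no file becomes inaccessible and no on-disk name is silently altered.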
Replying to @sortiecat
IMO, the best approach is to have a fallback representation, and maybe do charset transcoding on mounted filesystems.
Replying to @pikhq @sortiecat
You already need to handle something like this if you want to mount FAT, after all, since the only charsets there are legacy and UTF-16.
Replying to @pikhq @sortiecat
s/UTF-16/UCS-2 with illegal code values in the range D800-DFFF allowed/ (the FS makes no requirement that they be in pairs that are valid as UTF-16).
Replying to @RichFelker @sortiecat
Oh, right. Either UCS-2 or potentially-invalid UTF-16, depending on how you describe it.
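The distinction between UCS-2 and valid UTF-16 comes down to whether surrogate code units (D800-DFFF) appear in proper high/low pairs. A minimal Python sketch of that check, assuming the name is given as a list of 16-bit code units; the function name `is_well_formed_utf16` is made up here.

```python
def is_well_formed_utf16(units):
    """Return True if every surrogate code unit is part of a valid pair."""
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # A high surrogate must be followed by a low surrogate.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return False
        if 0xDC00 <= u <= 0xDFFF:
            # A low surrogate with no preceding high surrogate.
            return False
        i += 1
    return True


paired = [0x0041, 0xD83D, 0xDE00]    # "A" followed by a valid pair
lone_high = [0x0041, 0xD83D]         # high surrogate at end of name
swapped = [0xDE00, 0xD83D]           # low surrogate before high
```

A filesystem that stores arbitrary 16-bit units can hold `lone_high` and `swapped`, which have no UTF-8 equivalent, so a driver needs some policy for them.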
The fun part is that countries like Japan widely adopted multibyte encodings long before UTF-8 was invented, and have stuck with them.
And the USA telling a billion Chinese to stop using https://en.wikipedia.org/wiki/Chinese_character_encoding#Guobiao … because Bearded White Guy Says So... not gonna happen.
GB18030 is a UTF (bijective mapping with Unicode scalar values) and the future direction is all Unicode. A huge portion of Chinese software now uses Unicode internally anyway.
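Because GB18030 covers all of Unicode, a name stored in it can be transcoded to UTF-8 and back, which is what charset transcoding at mount time would rely on. A small Python sketch using the standard `gb18030` codec; the helper names and sample strings are hypothetical, and only these samples are checked here, not the full code space.

```python
def gb18030_to_utf8(raw: bytes) -> bytes:
    """Transcode a GB18030 file name to UTF-8."""
    return raw.decode("gb18030").encode("utf-8")


def utf8_to_gb18030(raw: bytes) -> bytes:
    """Transcode a UTF-8 file name back to GB18030."""
    return raw.decode("utf-8").encode("gb18030")


# Sample names: CJK text, a Latin letter with an accent, and a
# supplementary-plane character (four bytes in GB18030).
samples = ["文件名.txt", "café.txt", "😀.txt"]

on_disk = [s.encode("gb18030") for s in samples]
as_utf8 = [gb18030_to_utf8(b) for b in on_disk]
back = [utf8_to_gb18030(b) for b in as_utf8]
```

Legacy encodings that are not full Unicode transformation formats do not give this guarantee in general, which is where a fallback representation would still be needed.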