Yeah, bytes -> JS string -> bytes is a lossy process
Replying to @jaffathecake @rem
Wait, is it? I was trying to claim the opposite
You can store any byte sequence in a JS string because JS strings are unsanitized UTF-16, so invalid sequences will just continue to sit there. But once you pipe it through TextEncoder/TextDecoder, you lose data.
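To make the difference concrete, here is a minimal sketch of the two round trips being compared. The sample bytes and variable names are made up for illustration; 0xFF is used because it can never appear in valid UTF-8.

```javascript
// Sketch: compare two bytes -> JS string -> bytes round trips.
// `bytes` is an arbitrary example; 0xFF is never valid in UTF-8.
const bytes = Uint8Array.of(0x68, 0x69, 0xff);

// Route 1: one UTF-16 code unit per byte; nothing is validated.
const raw = String.fromCharCode(...bytes);
const rawBack = Uint8Array.from(raw, (ch) => ch.charCodeAt(0));

// Route 2: decode as UTF-8; the invalid byte is replaced with U+FFFD.
const decoded = new TextDecoder().decode(bytes);
const decodedBack = new TextEncoder().encode(decoded);

console.log(Array.from(rawBack));     // [104, 105, 255]
console.log(Array.from(decodedBack)); // [104, 105, 239, 191, 189]

// The same substitution happens to a lone surrogate held in a JS string.
const lone = "\uD83D";
const loneBytes = new TextEncoder().encode(lone);
console.log(Array.from(loneBytes));   // [239, 191, 189]
```

The first route keeps the original bytes, while the second comes back with the UTF-8 encoding of U+FFFD in place of the byte it could not decode.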
UTF-16 can't be unsanitized. JavaScript strings use UCS-2, which is an "unsanitized" version of UTF-16, but technically a different encoding, which is why actual UTF-16 is lossy in Text{Encoder,Decoder}.
That’s why I said “unsanitized UTF-16” (I wasn’t sure if UCS-2 is exactly that or not). I guess I’m trying to distinguish between building a JS string from bytes using String.fromCharCode et al. and using TextEncoder (which doesn’t support UTF-16, interestingly enough).
Heh yeah the problem is that Web specs (including Text*coder) can't deal with invalid Unicode in any encoding, while JS spec can.
Replying to @RReverser @DasSurma and
I'm still strongly convinced that TextEncoder should at least have a `fatal` option (see discussion at https://github.com/whatwg/encoding/issues/174#issuecomment-478959889 …) to make it easy to catch these mismatches, but not everyone agrees :(
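For comparison, TextDecoder already accepts a `fatal` option, which is roughly the behavior being asked for on the encoding side. A small sketch, with made-up sample data and variable names:

```javascript
// Sketch: TextDecoder's `fatal` option turns silent replacement into an error.
const invalidUtf8 = Uint8Array.of(0x68, 0x69, 0xff); // 0xFF is invalid UTF-8

const lenient = new TextDecoder("utf-8");
const strict = new TextDecoder("utf-8", { fatal: true });

// Default mode: the bad byte quietly becomes U+FFFD.
const lenientResult = lenient.decode(invalidUtf8);
console.log(lenientResult === "hi\uFFFD"); // true

// Fatal mode: the same input throws a TypeError instead.
let threwTypeError = false;
try {
  strict.decode(invalidUtf8);
} catch (err) {
  threwTypeError = err instanceof TypeError;
}
console.log(threwTypeError); // true

// TextEncoder takes no options, so a lone surrogate is replaced silently.
const replaced = new TextEncoder().encode("\uD800");
console.log(Array.from(replaced)); // [239, 191, 189]
```

With only the decoder able to fail loudly, catching a malformed string before encoding means checking it by hand.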
Replying to @RReverser @DasSurma and
I was wondering today if @mathias’s String.wellFormed idea (though perhaps on the prototype?) could go through TC39. (Still not convinced to special case some APIs taking USVString over others.)
Replying to @annevk @RReverser and
Is the proposed idea about something beyond str => !/[uD800-UDFFF]/u.test(str) ?
(missing backslash typo, twitter is maybe not the best coding env)
Replying to @bhathos @RReverser and
It’s slightly more complicated (see the @encodings issue for @mathias’s take) as surrogate pairs are fine.
Actually, that regular expression makes use of the `u` flag to correctly match only lone surrogates. The way to do it without `u` is using lookbehind assertions.
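A rough sketch of both variants, with the backslashes restored. The helper name `hasNoLoneSurrogates` and the sample strings are made up here for illustration and are not the proposed API; the non-`u` pattern is one possible way to write the lookaround version.

```javascript
// With `u`, matching is done per code point, so a valid surrogate pair
// (a single code point above U+FFFF) never falls inside this class.
const loneSurrogateU = /[\uD800-\uDFFF]/u;

// Without `u`, matching is per code unit, so each surrogate has to be
// checked against its neighbor: a high surrogate not followed by a low
// one, or a low surrogate not preceded by a high one.
const loneSurrogate =
  /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/;

// Hypothetical helper name, used only for this example.
const hasNoLoneSurrogates = (str) => !loneSurrogateU.test(str);

const samples = [
  "abc",                // plain ASCII
  "\uD83D\uDE00",       // valid surrogate pair
  "\uD83D",             // lone high surrogate
  "a\uDE00",            // lone low surrogate
  "\uD83D\uDE00\uDE00", // valid pair followed by a lone low surrogate
];

// Collect [u-flag result, lookaround result] for each sample.
const results = samples.map((s) => [
  hasNoLoneSurrogates(s),
  !loneSurrogate.test(s),
]);
console.log(results);
```

Both checks should agree on every sample: the first two strings pass and the last three are flagged.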
But now I have two problems!
That's alright, you can make a combined regular expression to match both your problems.