JavaScript strings aren't UTF-16. JavaScript is perfectly capable of supporting actual UTF-16, or much more sensible UTF-8 strings. WTF-16 is not UTF-16 and doesn't belong in new languages or environments.
JavaScript strings are arrays of 16-bit integers, not Unicode.
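For example, here's a quick sketch of what WTF-16 means in practice, in any modern JS engine (String.prototype.isWellFormed is ES2024, so the guarded check is optional on older engines):

// A lone high surrogate is a legal JavaScript string value, even though it
// isn't valid Unicode text -- this is the "WTF-16" behaviour.
const lone = "\uD83D";            // first half of a surrogate pair, on its own
console.log(lone.length);         // 1 -- one 16-bit code unit
console.log([...lone].length);    // 1 -- iterates as a single, ill-formed unit

// ES2024 engines can report this directly.
if (String.prototype.isWellFormed) {
  console.log(lone.isWellFormed());            // false
  console.log("\uD83D\uDE00".isWellFormed());  // true -- a paired surrogate (an emoji)
}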
I agree that JS does not have UTF-16 but rather WTF-16, but that is not a big deal, since all FFI crossings and UTF-16 -> UTF-8 conversions within the engines are sanitized: unpaired surrogates are replaced by U+FFFD. Wasm IT is now discussing how this sanitization will take place.
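For illustration, the standard TextEncoder already behaves this way, since it applies the USVString conversion on the way to UTF-8:

// TextEncoder replaces any unpaired surrogate with U+FFFD before encoding --
// the same kind of sanitization described above for FFI crossings.
const bytes = new TextEncoder().encode("a\uD800b");  // "a", lone surrogate, "b"
console.log(bytes);                           // 5 bytes: 0x61, EF BF BD (U+FFFD), 0x62
console.log(new TextDecoder().decode(bytes)); // "a\uFFFDb" -- the surrogate is gone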
Why doesn't JavaScript add proper Unicode strings with a sensible in-memory format that doesn't waste a ton of memory and force conversions? Why should the people who are doing things better have to start using UTF-16?
Same question: why UTF-8, when it wastes a ton of memory for Asian languages and emoji?
UTF-8 uses at most 4 bytes for a code point, just like UTF-16. Emojis aren't in the BMP. UTF-16 needs the same amount of data to represent them.
UTF-8 takes 3 bytes for basic Chinese, Japanese, and Korean characters rather than 2, but the surrounding markup is still 1 byte per character instead of 2.
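As a rough check of those per-character sizes (using TextEncoder for the UTF-8 side and 2 bytes per code unit for UTF-16):

// Encoded size per character: UTF-8 via TextEncoder, UTF-16 as 2 bytes per code unit.
const utf8Bytes  = s => new TextEncoder().encode(s).length;
const utf16Bytes = s => s.length * 2;

for (const ch of ["a", "é", "漢", "한", "😀"]) {
  console.log(ch, "UTF-8:", utf8Bytes(ch), "UTF-16:", utf16Bytes(ch));
}
// a   UTF-8: 1  UTF-16: 2
// é   UTF-8: 2  UTF-16: 2
// 漢  UTF-8: 3  UTF-16: 2   <- basic CJK: 3 bytes vs 2
// 한  UTF-8: 3  UTF-16: 2
// 😀  UTF-8: 4  UTF-16: 4   <- outside the BMP: same size in both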
curl news.baidu.com > data.utf8
wc -c data.utf8
76013 data.utf8
iconv -f UTF-8 -t UTF-16 data.utf8 > data.utf16
wc -c data.utf16
142622 data.utf16
They're using ASCII for their classes, ids, most of the URLs, etc. so this isn't just about it being an HTML file either.
You are overlooking the fact that UTF-16 is much faster for text processing. But no one disputes that it is better to use UTF-8 for serialization, which is exactly what happens on the web: JSON, CSS, and HTML are UTF-8 encoded.
Data starts as UTF-8 and ends up as UTF-8. The main string operation is appending. UTF-8 is faster simply because it's smaller, in nearly every case. What are you doing with text beyond reading, writing, appending (including formatting), and displaying it? UTF-8 is faster in the real world.
UTF-8 may be faster, but only for Latin-1. In general UTF-8 has a much more complex code point encoder/decoder than UTF-16.
Surrogate pairs behave in almost the same way as combining characters do, so UTF-16 can usually be processed as a fixed-size encoding.
You don't need to decode UTF-8 to append strings or do formatting. It's smaller, so it's faster. A fast SIMD-accelerated UTF-8 decoder is ridiculously fast anyway. String performance is almost entirely about optimized memory allocation and layout, and the best approach is not needing it in the first place.
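A minimal sketch of the appending point, using plain Uint8Arrays (in Node you'd more likely reach for Buffer.concat):

// Appending UTF-8 never requires decoding it: valid UTF-8 followed by
// valid UTF-8 is still valid UTF-8, so a byte-level copy is all it takes.
function concatUtf8(a, b) {
  const out = new Uint8Array(a.length + b.length);
  out.set(a, 0);
  out.set(b, a.length);
  return out;
}

const joined = concatUtf8(
  new TextEncoder().encode("news."),
  new TextEncoder().encode("baidu.com")
);
console.log(new TextDecoder().decode(joined));  // "news.baidu.com"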
Treating UTF-16 as a fixed-size character encoding is broken. It's widely done anyway, and it's the source of massive Unicode compatibility issues and substantial real-world harm, as the example below shows.
What exactly are you doing with strings where UTF-16 would be faster than UTF-8?
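To make the fixed-size hazard concrete, indexing or slicing by 16-bit code unit happily cuts a surrogate pair in half:

// Treating UTF-16 code units as "characters" silently corrupts text.
const s = "x😀y";                 // the emoji is a surrogate pair
console.log(s.length);            // 4 -- code units, not characters
console.log(s[1]);                // "\uD83D" -- half an emoji
console.log(s.slice(0, 2));       // "x\uD83D" -- now an ill-formed string
console.log([...s].length);       // 3 -- iterating by code point gets it right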

