UTF8 doesn't solve security / validation problems:
onlineutf8tools.com/validate-utf8
websec.github.io/unicode-securi
It's the standard string encoding in 2021 and good enough. Other encodings should be used for legacy purposes only.
Dart, C#, Java, and JavaScript use UTF16 internally, and they won't be able to change to UTF8 even in the future: that would require a complete redo of the String API and lose compatibility. Also, UTF8 does not solve the security problems, like I said.
This isn't accurate. JavaScript doesn't have UTF-16 strings. If it did, every JavaScript string could be represented with standard UTF-8.
JavaScript implements strings as arrays of UCS2 characters which is not standard Unicode or UTF-16. That's why an extended UTF-8 is required.
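Whatever name is used for the encoding, the practical point can be checked directly: a JavaScript string may hold an unpaired surrogate, and standard UTF-8 has no byte sequence for it. A minimal sketch, assuming a runtime with the global `TextEncoder` and `TextDecoder` (modern browsers and Node.js); the variable names are illustrative only.

```javascript
// A lone high surrogate is a legal JavaScript string value.
const lone = "\uD83D";

// TextEncoder cannot emit UTF-8 for it, so it substitutes U+FFFD (EF BF BD).
const encoded = new TextEncoder().encode(lone);
const decoded = new TextDecoder().decode(encoded);

// encodeURIComponent refuses outright and throws a URIError.
let threwUriError = false;
try {
  encodeURIComponent(lone);
} catch (err) {
  threwUriError = err instanceof URIError;
}

console.log(Array.from(encoded), decoded === lone, threwUriError);
```

The round trip loses the original value, which is why formats that must preserve arbitrary JavaScript strings need something beyond plain UTF-8.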
JavaScript has a mixed UCS2 and UTF16 API. Some methods like codePointAt, fromCodePoint, toUpperCase/toLowerCase, for..of str, Array.from(str), normalize, and regex's test (with the u flag) are Unicode-aware. The rest, including substr and length, are UCS2. But it doesn't really make much difference...
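The split between the two kinds of API is easy to see on a string with one astral character. This is only an illustrative sketch; the variable names are made up for the example.

```javascript
const s = "a💩";

// Code-unit based: length and substring count UTF-16 code units.
const codeUnits = s.length;               // "a" is 1 unit, the emoji is 2
const halfEmoji = s.substring(1, 2);      // cuts the surrogate pair in half

// Code-point based: iteration and Array.from walk whole code points.
const codePoints = Array.from(s).length;
const iterated = [];
for (const ch of s) iterated.push(ch);

// Regular expressions only match by code point when the u flag is set.
const withoutFlag = /^.$/.test("💩");
const withFlag = /^.$/u.test("💩");

console.log(codeUnits, codePoints, iterated, withoutFlag, withFlag);
```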
The super duper fun gotcha of .codePointAt() is that the pos parameter is still a UCS-2 index.
'💩💩'.codePointAt(0) //128169
'💩💩'.codePointAt(1) //56489
Imagine all the devs swapping .charCodeAt(i) with .codePointAt(i) in their loops and thinking their code is now astral-safe. 😱
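To make the gotcha concrete, here is a sketch comparing the naive swap with a loop that advances by two code units after an astral code point. The function names `naiveCodePoints` and `codePointsOf` are hypothetical, chosen for this example.

```javascript
// Naive swap: still steps one code unit at a time, so it also
// reports the trailing surrogate of every astral character.
function naiveCodePoints(str) {
  const out = [];
  for (let i = 0; i < str.length; i++) {
    out.push(str.codePointAt(i));
  }
  return out;
}

// Astral-safe: skip two code units when the code point is above U+FFFF.
function codePointsOf(str) {
  const out = [];
  for (let i = 0; i < str.length; ) {
    const cp = str.codePointAt(i);
    out.push(cp);
    i += cp > 0xffff ? 2 : 1;
  }
  return out;
}

const naive = naiveCodePoints("💩💩"); // [128169, 56489, 128169, 56489]
const safe = codePointsOf("💩💩");     // [128169, 128169]
console.log(naive, safe);
```

In most code, `for (const ch of str)` gives the same result as the second function without any index arithmetic.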
A nicer API would use opaque cursor objects for nearly all cases not covered by iterators. Internally indexing by UTF-8 or UTF-16 code units is fine. Indexing by code point is still extremely arbitrary and doesn't make much sense outside of cursor/iterator use cases.
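One way to picture the cursor idea is the sketch below. It is not an existing or proposed JavaScript API: `StringCursor` and its methods are hypothetical names, and the class simply hides a UTF-16 code unit offset behind a private field so callers never do index arithmetic themselves.

```javascript
// Hypothetical cursor type; the code unit offset stays private.
class StringCursor {
  #text;
  #offset; // UTF-16 code unit offset, never exposed to callers

  constructor(text, offset = 0) {
    this.#text = text;
    this.#offset = offset;
  }

  // True when the cursor sits at the end of the string.
  get atEnd() {
    return this.#offset >= this.#text.length;
  }

  // The code point under the cursor as a string, or undefined at the end.
  peek() {
    const cp = this.#text.codePointAt(this.#offset);
    return cp === undefined ? undefined : String.fromCodePoint(cp);
  }

  // A new cursor one code point further along (or itself at the end).
  advance() {
    const ch = this.peek();
    return ch === undefined
      ? this
      : new StringCursor(this.#text, this.#offset + ch.length);
  }

  // The text between this cursor and a later cursor over the same string.
  sliceTo(end) {
    return this.#text.slice(this.#offset, end.#offset);
  }
}

const start = new StringCursor("a💩b");
const afterEmoji = start.advance().advance();
const prefix = start.sliceTo(afterEmoji); // "a💩"
const nextChar = afterEmoji.peek();       // "b"
console.log(prefix, nextChar);
```

Because the offset is opaque, the same interface could sit on top of UTF-8 or UTF-16 storage, or step by grapheme cluster instead of code point, without changing calling code.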
The closest thing to cursor objects would probably be "Extended Grapheme Cluster", and there has been discussion about how that could possibly be done in JS for a while: esdiscuss.org/topic/working-
But this definition evolves as new combining blocks or ZWJs are added.
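Since this thread was written, `Intl.Segmenter` has become available in current browsers and Node.js, and it can split a string into extended grapheme clusters. A minimal sketch, assuming a runtime that ships `Intl.Segmenter`; the helper name `graphemes` is made up for the example.

```javascript
// Grapheme segmentation via the ECMAScript Internationalization API.
const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });

// Hypothetical helper: returns the grapheme clusters of a string as an array.
function graphemes(str) {
  return Array.from(segmenter.segment(str), (part) => part.segment);
}

// "e" followed by U+0301 COMBINING ACUTE ACCENT renders as a single "é".
const accented = "e\u0301";

const units = accented.length;                // 2 UTF-16 code units
const points = Array.from(accented).length;   // 2 code points
const clusters = graphemes(accented).length;  // 1 grapheme cluster

console.log(units, points, clusters);
```

For newer emoji and ZWJ sequences the cluster count depends on the Unicode data bundled with the runtime, which is exactly the evolving-definition caveat raised above.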
All I mean is that UTF-8 and UTF-16 encoding for strings can and should be an implementation detail where code units aren't heavily exposed. Code points aren't glyphs but they at least aren't an implementation detail of encoding and are something with actual meaning to people.
I don't think it can be expected that code points are hidden away as an implementation detail and aren't used in the APIs. It definitely shouldn't be encouraged to index strings by code point index, etc. though. It's hard to see when that would ever be useful.
You need to treat strings as indexable arrays of grapheme clusters for monospace-font-based UIs like a terminal, but things like deciding where to wrap lines are more complicated. In between input/parsing and display, strings should largely be opaque blobs.



