I'm optimistic JavaScript will get proper strings because WebAssembly is UTF-8 focused and they're making good decisions about how interoperability works with JavaScript despite it upsetting people invested in UCS2. Python switched from UCS2 to UCS4 which was a dumb decision...
Python took that same multiple string representation approach and browsers did the same for JavaScript. The strings are always arrays of fixed-size units in Java / JavaScript / Python, and the whole string bloats up to a larger unit size if you add a single character from a range requiring it.
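A rough way to see that bloat on CPython, as a sketch only: `sys.getsizeof` reports the in-memory size of a string object, and the exact numbers depend on the CPython version, so only the relative jumps matter. The helper name `string_sizes` is made up for this illustration, and other Python implementations may store strings differently.

```python
import sys


def string_sizes(n=1000):
    """Return the object size of n ASCII chars plus one extra character."""
    base = "a" * n
    samples = {
        "ascii": base,                    # every unit fits in 1 byte
        "latin1": base + "\u00e9",        # é still fits in 1 byte per unit
        "bmp": base + "\u20ac",           # € forces 2 bytes per unit
        "astral": base + "\U0001F600",    # emoji forces 4 bytes per unit
    }
    return {name: sys.getsizeof(s) for name, s in samples.items()}


if __name__ == "__main__":
    for name, size in string_sizes().items():
        print(f"{name:>7}: {size} bytes")
```

On CPython 3.3 or later, one extra character from a higher range should roughly double or quadruple the size of the whole string.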
Using UTF-8 everywhere (strings, input/output) and having iterators/cursors for code points and grapheme clusters works fine. It's perfectly efficient to get back to the same place in the string as before. There's no real world use case for jumping to the nth code point quickly.
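A minimal sketch of the cursor idea, assuming the buffer is already valid UTF-8 and the cursor is a byte offset sitting on a code point boundary. The names `next_code_point` and `code_points` are hypothetical, and validation is left out to keep it short.

```python
from typing import Iterator, Tuple


def next_code_point(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode one code point at byte offset pos; return (code_point, next_pos).

    Assumes buf is valid UTF-8 and pos is on a code point boundary.
    """
    first = buf[pos]
    if first < 0x80:                      # 1-byte sequence (ASCII)
        return first, pos + 1
    if first < 0xE0:                      # 2-byte sequence
        length, cp = 2, first & 0x1F
    elif first < 0xF0:                    # 3-byte sequence
        length, cp = 3, first & 0x0F
    else:                                 # 4-byte sequence
        length, cp = 4, first & 0x07
    for b in buf[pos + 1 : pos + length]:
        cp = (cp << 6) | (b & 0x3F)       # append 6 payload bits per byte
    return cp, pos + length


def code_points(buf: bytes, pos: int = 0) -> Iterator[Tuple[int, int]]:
    """Yield (byte_offset, code_point) pairs starting from a saved cursor."""
    while pos < len(buf):
        cp, nxt = next_code_point(buf, pos)
        yield pos, cp
        pos = nxt


if __name__ == "__main__":
    text = "a\u00e9\u20ac\U0001F600".encode("utf-8")
    print(list(code_points(text)))        # offsets 0, 1, 3, 6
    print(list(code_points(text, 3)))     # resume from a saved byte offset
```

The saved byte offset is the cursor: getting back to the same place is a plain index into the buffer, with no need to count code points from the start.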
The funny thing is I think I've never actually used code that truly needs to know whether a given index in a string really is the beginning of a real character. strcmp doesn't care. strlen doesn't care. strdup doesn't care. What cares, aside maybe from font rendering code?
A lot of the common functions treat the string as nothing more than an array of elements (uint8_t, uint16_t, uint32_t) and don't really need to know what they correspond to individually. Maybe one exception would be uppercasing/lowercasing, because it needs the real "character".
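To illustrate with UTF-8 specifically, here is a small sketch where sorting and substring search run on raw bytes without decoding anything. It relies on two properties of UTF-8: byte-wise comparison gives the same order as code point comparison, and a valid needle can only match at a code point boundary. The helper names `sort_by_bytes` and `find_bytes` are made up for this example.

```python
def sort_by_bytes(strings):
    """Sort strings by comparing their UTF-8 bytes, strcmp-style."""
    encoded = sorted(s.encode("utf-8") for s in strings)
    return [b.decode("utf-8") for b in encoded]


def find_bytes(haystack, needle):
    """Return the byte offset of needle in haystack, or -1 if absent."""
    return haystack.encode("utf-8").find(needle.encode("utf-8"))


if __name__ == "__main__":
    words = ["zebra", "\u00e9clair", "apple", "\u20acuro", "\U0001F600"]
    # Byte order and code point order agree for UTF-8.
    print(sort_by_bytes(words) == sorted(words))
    # The match lands on a code point boundary even though the search is byte-wise.
    print(find_bytes("na\u00efve caf\u00e9", "caf\u00e9"))
```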
Unicode code points are only low-level semantic characters. They form higher-level grapheme clusters composed out of an arbitrary number of code points. In a terminal, the cells are generally grapheme clusters, but free flowing text has higher level groups due to ligatures, etc.
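A quick sketch of how far apart the layers are. The Python standard library has no grapheme cluster segmentation, so this only counts code points and code units for strings that typically render as a single cluster. `unit_counts` is a hypothetical helper name.

```python
def unit_counts(s):
    """Count code points, UTF-8 bytes and UTF-16 code units in s."""
    return {
        "code_points": len(s),
        "utf8_bytes": len(s.encode("utf-8")),
        "utf16_units": len(s.encode("utf-16-le")) // 2,
    }


if __name__ == "__main__":
    # "e" followed by a combining acute accent: one visible character.
    accented = "e\u0301"
    # Man + ZWJ + woman + ZWJ + girl: usually drawn as one family emoji.
    family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"
    print(unit_counts(accented))  # 2 code points, 3 bytes, 2 UTF-16 units
    print(unit_counts(family))    # 5 code points, 18 bytes, 8 UTF-16 units
```

Neither a UCS2-style nor a UCS4-style index lands on "one character" in either case, which is the point being made about fixed-size units.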
It's not like that outside of those cell-based cases because fonts can define ligatures and combine multiple grapheme clusters when they have a fancier way of rendering them together. Java style UCS2 strings would still be misguided if they used UCS4.
Trying to define strings based on grapheme clusters would also be horribly wrong because the rules change over time as Unicode adds things like combining forms of emoji, and it's not actually useful. It's not actually what a terminal or text editor wants. It's more flexible than that.