Conversation

I'm optimistic JavaScript will get proper strings because WebAssembly is UTF-8 focused and they're making good decisions about how interoperability works with JavaScript despite it upsetting people invested in UCS2. Python switched from UCS2 to UCS4 which was a dumb decision...
Python took that same multiple string representation approach and browsers did the same for JavaScript. The strings are always fixed size units in Java / JavaScript / Python and they bloat themselves up to a larger unit if you add a single character from the range requiring it.
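To make the "bloat up to a larger unit" point concrete, here is a rough sketch that measures the per-element width CPython picks for a string. It assumes CPython 3.3 or later (the flexible string representation); `bytes_per_code_point` is a hypothetical helper, and the numbers are an implementation detail rather than a language guarantee.

```python
import sys


def bytes_per_code_point(s: str) -> int:
    """Estimate the storage width CPython uses for each element of s.

    s + s has the same header as s, so the size difference is pure payload.
    Assumption: CPython 3.3+; other implementations may differ.
    """
    return (sys.getsizeof(s + s) - sys.getsizeof(s)) // len(s)


ascii_only = "a" * 1000
one_bmp_char = "a" * 999 + "\u0100"        # one code point above U+00FF
one_astral_char = "a" * 999 + "\U0001F600"  # one code point above U+FFFF

for label, s in [("ASCII", ascii_only),
                 ("one BMP char", one_bmp_char),
                 ("one astral char", one_astral_char)]:
    print(label, bytes_per_code_point(s), "byte(s) per code point")
```

A single character from a higher range widens every element in the string, which is the behaviour described above.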
Using UTF-8 everywhere (strings, input/output) and having iterators/cursors for code points and grapheme clusters works fine. It's perfectly efficient to get back to the same place in the string as before. There's no real world use case for jumping to the nth code point quickly.
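A minimal sketch of the cursor idea, assuming the buffer is valid UTF-8 and that a cursor is just a byte offset. `next_code_point` is a hypothetical helper written for illustration, not an API from any library.

```python
def next_code_point(buf: bytes, pos: int) -> tuple[int, int]:
    """Decode the code point starting at byte offset pos.

    Returns (code_point, offset_of_next_code_point).
    Assumes buf is valid UTF-8 and pos is on a code point boundary.
    """
    lead = buf[pos]
    if lead < 0x80:
        length = 1
    elif lead >> 5 == 0b110:
        length = 2
    elif lead >> 4 == 0b1110:
        length = 3
    elif lead >> 3 == 0b11110:
        length = 4
    else:
        raise ValueError("pos is not at a code point boundary")
    # Let the built-in decoder validate the continuation bytes.
    return ord(buf[pos:pos + length].decode("utf-8")), pos + length


text = "a\u00e9\u20ac\U0001F600".encode("utf-8")  # 1 + 2 + 3 + 4 bytes

# Walk the string once, remembering each cursor (byte offset).
cursors = []
pos = 0
while pos < len(text):
    cursors.append(pos)
    _, pos = next_code_point(text, pos)

# Returning to a saved cursor is a plain offset lookup, no scanning needed.
saved = cursors[2]
code_point, _ = next_code_point(text, saved)
print(cursors, hex(code_point))
```

Nothing here ever asks for "the nth code point"; the saved byte offset is enough to get back to the same place.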
The funny thing is, I don't think I've ever actually used code that truly needs to know whether a given index in a string is the beginning of a real character. strcmp doesn't care. strlen doesn't care. strdup doesn't care. What does care, aside maybe from font rendering code?
A lot of the common functions treat the string as nothing more than an array of elements (uint8_t, uint16_t, uint32_t) and don't really need to know what they correspond to individually. Maybe one exception would be uppercasing/lowercasing, because it needs the real "character".
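A small sketch of both halves of that claim, using Python's `bytes` as a stand-in for a `uint8_t` array. Comparing the raw UTF-8 bytes (roughly what strcmp does) gives the same ordering as comparing code points, while case mapping breaks if it only looks at individual bytes. The word list is made up for illustration.

```python
words = ["zebra", "\u00e9clair", "\u65e5\u672c", "\U0001F600", "apple"]

# Python compares str values code point by code point.
by_code_point = sorted(words)

# Comparing the encoded bytes element by element never looks at "characters".
by_bytes = sorted(words, key=lambda w: w.encode("utf-8"))

# UTF-8 is designed so that both orderings agree.
print(by_code_point == by_bytes)

# Case mapping is the exception: it has to understand the text.
word = "stra\u00dfe"                                         # "straße"
text_aware = word.upper()                                    # ß expands to SS
byte_wise = word.encode("utf-8").upper().decode("utf-8")     # ASCII-only mapping
print(text_aware, byte_wise, len(word), len(text_aware))
```

The byte-wise version leaves "ß" untouched, and the text-aware version changes the length of the string, which is why this operation cannot treat the string as an opaque array.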
There are useful operations which can be performed on code points, but indexing code points isn't one of them. A monospace text editor or terminal is presumably storing a string of code points (a grapheme cluster) for each cell. You may input code points, but you edit them in groups.
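As a very rough sketch of "input code points, edit in groups": the hypothetical `rough_clusters` helper below attaches combining marks to the preceding base character. This is only an approximation for illustration. Real grapheme segmentation follows the Unicode text segmentation rules and also handles cases such as emoji ZWJ sequences and Hangul, which this ignores.

```python
import unicodedata


def rough_clusters(text: str) -> list[str]:
    """Group each combining mark with the code point before it.

    Assumption: a crude stand-in for real grapheme cluster segmentation.
    """
    clusters: list[str] = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch   # mark joins the previous cell
        else:
            clusters.append(ch)  # base character starts a new cell
    return clusters


typed = "e\u0301x"            # 'e', COMBINING ACUTE ACCENT, 'x'
cells = rough_clusters(typed)
print(len(typed), len(cells))

# Deleting the first cell removes both of its code points together.
after_delete = "".join(cells[1:])
print(after_delete)
```

Three code points went in, but the editor works with two cells.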
It's not like that outside of those cell-based cases, because fonts can define ligatures and similar features that render multiple grapheme clusters together when they have a fancier way of drawing them. Java-style UCS2 strings would still be misguided even if they used UCS4.