Seems this was a regression caused by the complexity of Compact Strings, which has been fixed for a while now.
bugs.openjdk.java.net/browse/JDK-817
Maybe one day Java will get proper strings instead of pretending that its UCS2 strings are UTF-16 and that UTF-16 is actually a good idea.
Java just ossified around all the worst ideas of the '90s, so I wouldn't count on it.
I'm optimistic JavaScript will get proper strings, because WebAssembly is UTF-8 focused and its designers are making good decisions about how interoperability with JavaScript works, despite it upsetting people invested in UCS2.
Python switched from UCS2 to UCS4, which was a dumb decision...
Python took that same multiple-representation approach, and browsers did the same for JavaScript. The strings are always fixed-size units in Java / JavaScript / Python, and they bloat up to a larger unit if you add a single character from the range requiring it.
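You can watch this happen in CPython (the PEP 393 "flexible string representation"): one code point outside the narrow range forces the whole string to the wider element size. A minimal sketch:

```python
import sys

# CPython stores every string at one fixed element width: 1 byte per code
# point if all fit in Latin-1, 2 bytes for the BMP, 4 bytes otherwise.
narrow = "a" * 100                    # 1 byte per code point
wide = "a" * 99 + "\U0001F600"        # one emoji -> 4 bytes per code point

# Same number of code points, very different storage.
assert len(narrow) == len(wide) == 100
assert sys.getsizeof(wide) > sys.getsizeof(narrow)
print(sys.getsizeof(narrow), sys.getsizeof(wide))
```

The exact byte counts vary by CPython version, but the wide string always pays roughly 4x per character because every code point is stored at the width of the largest one.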
I think the most reasonable approach today would be for most languages to use UTF-8 internally, yet natively support UTF-16 and UTF-32 as well, with easy conversion. Trying to settle on *one* isn't going to work because that boat has already sailed ⛵
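The "easy conversion" part is the key: all three encodings carry the same code point sequence, so round-tripping between them is mechanical. A Python sketch (the sample string is arbitrary):

```python
s = "na\u00efve \u26f5"  # "naïve ⛵" — 7 code points, all in the BMP

utf8 = s.encode("utf-8")
utf16 = s.encode("utf-16-le")   # little-endian, no BOM
utf32 = s.encode("utf-32-le")

# UTF-32 is fixed width: exactly 4 bytes per code point.
assert len(utf32) == 4 * len(s)

# All three decode back to the same code point sequence.
assert utf8.decode("utf-8") == utf16.decode("utf-16-le") \
    == utf32.decode("utf-32-le") == s
```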
Using UTF-8 everywhere (strings, input/output) and having iterators/cursors for code points and grapheme clusters works fine. It's perfectly efficient to get back to the same place in the string as before. There's no real-world use case for jumping to the nth code point quickly.
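The cursor idea is just "remember a byte offset and resume there." A minimal sketch of such an iterator over UTF-8 bytes (assumes valid UTF-8, no error handling; the function name is my own):

```python
def utf8_code_points(data: bytes):
    """Yield (byte_offset, code_point) pairs from valid UTF-8 bytes.

    A caller can save any yielded byte offset and later resume iteration
    from it in O(1) — no code-point indexing required.
    """
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:        # 0xxxxxxx: 1-byte sequence
            n = 1
        elif b < 0xE0:      # 110xxxxx: 2-byte sequence
            n = 2
        elif b < 0xF0:      # 1110xxxx: 3-byte sequence
            n = 3
        else:               # 11110xxx: 4-byte sequence
            n = 4
        yield i, ord(data[i:i + n].decode("utf-8"))
        i += n

text = "h\u00e9llo \u26f5".encode("utf-8")  # "héllo ⛵"
assert [cp for _, cp in utf8_code_points(text)] \
    == [ord(c) for c in "h\u00e9llo \u26f5"]
```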
The funny thing is I don't think I've ever actually used code that truly needs to know that a given index in a string is the beginning of a real character. strcmp doesn't care. strlen doesn't care. strdup doesn't care. What cares, aside maybe from font rendering code?
A lot of the common functions treat the string as nothing more than an array of elements (uint8_t, uint16_t, uint32_t) and don't really need to know what they correspond to individually. Maybe one exception would be uppercasing/lowercasing, because it needs the real "character".
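There's a nice property backing this up: comparing UTF-8 strings byte by byte (which is all strcmp does) yields the same ordering as comparing them code point by code point, so sorting never needs to decode. A quick demonstration with an arbitrary word list:

```python
# UTF-8 was designed so that bytewise lexicographic order matches
# code point order — strcmp on UTF-8 "just works" for sorting.
words = ["zebra", "apple", "\u00e9migr\u00e9", "\u26f5", "na\u00efve"]

by_bytes = sorted(words, key=lambda w: w.encode("utf-8"))
by_code_points = sorted(words)  # Python compares strings by code point

assert by_bytes == by_code_points
```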
Unicode code points are only low-level semantic characters. They form higher-level grapheme clusters composed of an arbitrary number of code points. In a terminal, the cells are generally grapheme clusters, but free-flowing text has higher-level groups due to ligatures, etc.
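Concretely, one user-perceived character can be one code point or several, and the same on-screen glyph can have multiple encodings:

```python
import unicodedata

composed = "\u00e9"         # "é" as one code point (U+00E9)
decomposed = "e\u0301"      # "é" as 'e' + combining acute accent

# Different code point sequences, same grapheme cluster on screen.
assert composed != decomposed
assert len(composed) == 1 and len(decomposed) == 2
# NFC normalization collapses the pair into the single code point.
assert unicodedata.normalize("NFC", decomposed) == composed

# Emoji push this further: a "family" emoji is several code points
# joined with zero-width joiners, rendered as one cluster.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"
assert len(family) == 5     # 5 code points, 1 grapheme cluster
```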
There are useful operations which can be performed on code points but indexing code points isn't one of them.
A monospace text editor or terminal is presumably storing a string of code points (a grapheme cluster) for each cell. You may input code points, but you edit them in groups.
It's not like that outside of those cell-based cases, because fonts can define ligatures and combine multiple grapheme clusters when they have a fancier way of rendering them together.
Java-style UCS2 strings would still be misguided if they used UCS4.
Trying to define strings based on grapheme clusters would also be horribly wrong, because the rules change over time (such as adding combining forms of emoji), and it's not actually useful. It's not actually what a terminal or text editor wants; they need something more flexible than that.


