Many users know that a major OS upgrade without rebuilding text indexes can corrupt data. But few realize the extent of the risk. I'm working on an analysis of GNU C Library collation changes over the last 10 years - for a start, as seen in Ubuntu.
I understand neither why these rules are so complicated nor why they keep changing. At least in English, alphabetical order can be assessed one character at a time, and the ordering of the letters is well-known.
Maybe try to learn a couple of other languages, and read some history about how languages and cultures evolve over time? My understanding is that a language is only entirely fixed when it’s dead (e.g. Latin).
I mean that might be a good idea but I don’t see how it’s going to help me understand why the example he’s using sorts as it does. “Wine glass” isn’t a character from a language I haven’t bothered learning.
Ah I see. Well yeah I suppose we kind of need a sorting rule for those parts of Unicode too, and that the usual “design by committee” problems are at play. In other words, it sounds like a political problem to me, one with engineering consequences. File under time zones etc?
Yeah I guess. But sorting the things English speakers are likely to have opinions about in a way they will like, and the rest by code point, would seem to be good enough. Yet it’s clearly much more complicated than that.
Yes it must be. To accommodate the actual world. In French, letters such as È, É, Ê can be represented in more than one way in Unicode IIRC. In UTF-8 that could be 2 bytes (a single precomposed code point) or 3 bytes (a base letter plus a combining accent). The 3-byte version must sort the same as the 2-byte version. And I never know if it is E < È < É or something else.
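A minimal sketch of the "more than one representation" point, using Python's standard `unicodedata` module. The helper name `nfc_key` is made up for illustration; real collation does considerably more than normalize, so treat this only as a demonstration that the two spellings of É differ as raw strings yet become equal after normalization.

```python
import unicodedata

precomposed = "\u00c9"   # É as one code point (LATIN CAPITAL LETTER E WITH ACUTE)
decomposed = "E\u0301"   # E followed by COMBINING ACUTE ACCENT

def nfc_key(s: str) -> str:
    """Hypothetical helper: normalize to NFC before comparing or sorting."""
    return unicodedata.normalize("NFC", s)

# The raw strings differ, and so do their UTF-8 encodings (2 bytes vs 3 bytes).
raw_equal = precomposed == decomposed
utf8_lengths = (len(precomposed.encode("utf-8")), len(decomposed.encode("utf-8")))

# After normalization the two spellings are the same string.
normalized_equal = nfc_key(precomposed) == nfc_key(decomposed)

print(raw_equal, utf8_lengths, normalized_equal)
```

A comparison that works on raw bytes or code points would treat the two spellings as different strings, which is one reason a collation cannot simply compare code points.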
Right and I agree those need to be sorted in with the letters even in en_US. What I don’t understand is the rules where the sort order depends on adjacent characters. There are languages with such rules but English is not one of them. So why is the collation like that?
en_US is about as complicated as most other UCA collations, at least at the implementation level. Punctuation and whitespace are compared in separate "passes" over each string/UCA level, which native English speakers actually rely on. See: unicode.org/reports/tr10/#
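A rough illustration of the "separate passes" idea, not the UCA algorithm itself: the hypothetical `two_pass_key` below compares letters and digits first and only falls back to the full string, punctuation included, to break ties. The word list is invented for the example.

```python
def two_pass_key(s: str):
    """Toy key: first pass ignores punctuation and whitespace, second pass breaks ties."""
    letters_only = "".join(ch for ch in s if ch.isalnum()).lower()
    return (letters_only, s)

words = ["co-op", "coop", "co-opt", "coat"]

by_code_point = sorted(words)                  # hyphen (U+002D) sorts before every letter
by_two_pass = sorted(words, key=two_pass_key)  # hyphen only matters as a tie-breaker

print(by_code_point)  # ['co-op', 'co-opt', 'coat', 'coop']
print(by_two_pass)    # ['coat', 'co-op', 'coop', 'co-opt']
```

With plain code point order the hyphenated words cluster ahead of "coat"; with the two-pass key, "co-op" lands next to "coop", which is closer to what a dictionary reader expects.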
That's a great link! "Table 3. Canonical Equivalence" gives a few examples that directly illustrate the answer to your question about adjacent characters, a.k.a. "sequences that are canonically equivalent"
This is also why strxfrm() output is ~3.5x larger than the caller's original C string. It more or less produces a materialized binary string for each of the levels, and concatenates them (in level-wise order) to produce its final output.
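To make the level-wise concatenation concrete, here is a toy sketch. It is not glibc's strxfrm() and the weights are invented: `toy_sort_key` is a hypothetical function that only handles ASCII letters and digits plus combining accents, builds one byte string per level (base letters, accents, case), and joins them with a zero byte that sorts below every weight.

```python
import unicodedata

def toy_sort_key(s: str) -> bytes:
    """Toy three-level sort key with made-up weights (assumption, not real UCA data)."""
    primary, secondary, tertiary = bytearray(), bytearray(), bytearray()
    for ch in unicodedata.normalize("NFD", s):
        if unicodedata.combining(ch):
            # An accent raises the secondary weight of the preceding letter (range 2..255).
            if secondary:
                secondary[-1] = 2 + ord(ch) % 254
            continue
        if not (ch.isascii() and ch.isalnum()):
            continue  # toy rule: everything else is ignored at these levels
        primary.append(ord(ch.lower()))           # level 1: base letter
        secondary.append(1)                       # level 2: 1 means "no accent"
        tertiary.append(2 if ch.isupper() else 1) # level 3: case
    # Concatenate the levels in order; b"\x00" is lower than any weight above.
    return bytes(primary) + b"\x00" + bytes(secondary) + b"\x00" + bytes(tertiary)

words = ["r\u00e9sum\u00e9", "Resume", "resume", "resumes"]
ordered = sorted(words, key=toy_sort_key)
key_length = len(toy_sort_key("abc"))

print(ordered)      # ['resume', 'Resume', 'résumé', 'resumes']
print(key_length)   # 11 bytes for a 3-byte input
```

Because every level of the whole string is materialized before the next level begins, a plain byte comparison of two keys settles base letters first, accents second, and case last. It also shows why such a key is several times longer than its input, although the exact ratio here is an artifact of the toy weights.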


