Today's adventure: parsing Unicode's awful XML from CLDR with sed, because XML is unusably awful to deal with.
/<calendar.*gregorian/,/<\/calendar/{ /<monthWidth.*wide/,/<\/monthWidth/{ /<month /p } }
-
-
Wait... are you... parsing....xml...with..reg....e...x..... https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 pic.twitter.com/T5s5IcRCqL
-
Yes. Because all the alternatives are just as awful.
-
Of course you can't parse arbitrary XML (or [X]HTML) with regex. (Ignoring the fact that XML with a bounded nesting depth is actually a regular language and _could_ be parsed with a hideously huge regex...) On the other hand...
-
XML that has to meet a fixed form to be meaningful can be parsed with regex, assuming a particular pretty-printing (and if that ever changes, a black-box XML pretty-printer can restore it without getting your hands dirty on XML outside the black box).
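
A rough sketch of that idea in Python, not the code from this thread: a stock pretty-printer is used as the black box to force one tag per line, and then sed-style `/start/,/end/` line ranges are applied. The sample document, `normalize`, `sed_range`, and `wide_months` are all made-up names for illustration, and the CLDR-like layout of the sample is an assumption.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical CLDR-shaped sample, deliberately crammed onto one line.
SAMPLE = (
    '<ldml><dates><calendars>'
    '<calendar type="buddhist"><months><monthContext type="format">'
    '<monthWidth type="wide"><month type="1">X</month></monthWidth>'
    '</monthContext></months></calendar>'
    '<calendar type="gregorian"><months><monthContext type="format">'
    '<monthWidth type="abbreviated"><month type="1">Jan</month></monthWidth>'
    '<monthWidth type="wide"><month type="1">January</month>'
    '<month type="2">February</month></monthWidth>'
    '</monthContext></months></calendar>'
    '</calendars></dates></ldml>'
)

def normalize(xml_text):
    """Black box: re-emit the XML with one element per line."""
    root = ET.fromstring(xml_text)
    ET.indent(root)  # requires Python 3.9+
    return ET.tostring(root, encoding="unicode")

def sed_range(lines, start, end):
    """Yield lines in /start/,/end/ ranges; the end pattern is only
    tested on lines after the one that opened the range, as in sed."""
    active = False
    for line in lines:
        if active:
            yield line
            if re.search(end, line):
                active = False
        elif re.search(start, line):
            active = True
            yield line

def wide_months(xml_text):
    lines = normalize(xml_text).splitlines()
    calendar = sed_range(lines, r"<calendar.*gregorian", r"</calendar")
    width = sed_range(calendar, r"<monthWidth.*wide", r"</monthWidth")
    return [line.strip() for line in width if "<month " in line]

print(wide_months(SAMPLE))
```

The line patterns never have to know how the input was formatted originally; only `normalize` touches the XML as XML.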
-
If you trust Unicode not to change the sigils on you, causing your rune casting to 𝖶Âke the beast, so be it, bu̺t 𐌜e it Ớn ͨyȮur ȟeͮȀ𝗱 Ṧ͟Н͌oựld ̃̇Ύ̜̅̉o̒ꓴ ̸b̬ͭ͏r̢̼ͅᶧ🄽Ԍ̻ r͓͙ǘ̒ͬі͒ͅṄͤ̃ ̄Ŧ̱Ὁ̓ ̈́̏ͭ͢ͅᵁ͂𝚂͖́ ̮̥̦ͬͅἊ̡̡̢̪ͯ̓ℓ̷̤̬̋͝҉̙̖𝙻̺̓̀͏̐ͬ̉!̛͇ͤͫ̂
-
If they change anything, the code consuming the data has to be changed anyway. Aside from malformed nesting constructs, the sed line-match patterns are equivalent to chained XML element selectors for matching elements.
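
For comparison, here is roughly what the chained-selector version of the same match could look like with Python's standard `xml.etree.ElementTree`. This is only a sketch: the sample XML is a made-up, CLDR-shaped fragment, and `select_wide_months` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical CLDR-shaped fragment for illustration.
SAMPLE_XML = """
<ldml><dates><calendars>
  <calendar type="gregorian">
    <months>
      <monthContext type="format">
        <monthWidth type="abbreviated">
          <month type="1">Jan</month>
        </monthWidth>
        <monthWidth type="wide">
          <month type="1">January</month>
          <month type="2">February</month>
        </monthWidth>
      </monthContext>
    </months>
  </calendar>
</calendars></dates></ldml>
"""

def select_wide_months(xml_text):
    root = ET.fromstring(xml_text)
    # calendar[gregorian] -> monthWidth[wide] -> month, mirroring the
    # nested sed ranges /<calendar.*gregorian/ and /<monthWidth.*wide/.
    path = ".//calendar[@type='gregorian']//monthWidth[@type='wide']/month"
    return [(m.get("type"), m.text) for m in root.findall(path)]

print(select_wide_months(SAMPLE_XML))
```

Each step of the path corresponds to one sed address range, which is the sense in which the two are equivalent on well-formed, consistently printed input.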
-
i can write some python scripts to dump it in a more reasonable format if that would make your life easier
-
Probably not. There're actually (or at least there used to be) some already-done "posix" versions of the CLDR data, but they're in POSIX localedef format which is even worse than XML.
-
They're full of crap like <LATIN_SMALL_LETTER_E_WITH_ACUTE> (rather than literal text) that's equivalent to XML entity defs (defined custom per file).