In the video, you mentioned looking for a C library that does grapheme clustering. utf8proc is a C library with a simple API for finding grapheme cluster breaks: https://juliastrings.github.io/utf8proc/doc/utf8proc_8h.html#aae83bdcabf3a97c1046c0700ba353640 …
Replying to @kssreeram
I had tried utf8proc, because someone else recommended it. Unfortunately it doesn't seem to actually produce extended grapheme boundaries, only smaller ones? Maybe there are ways to use it better than I did, but it did not seem to produce the information I actually needed.
1 reply 0 retweets 1 like -
Replying to @cmuratori @kssreeram
Here's a ~200 line extended grapheme cluster break finder I use in my editors. Haven't abused it, but I believe it's complete - you can see the Unicode spec transpiled from English to C. https://gist.github.com/MarkMendell/e854207bedcf34145197cd12fd0003c3 …
2 replies 1 retweet 4 likes -
Replying to @mark_dev_ @kssreeram
Thank you very much for the pointer! I will take a look.
1 reply 0 retweets 1 like -
Replying to @cmuratori @mark_dev_
Are you using EGCs as the unit for caching glyphs? If so, that might be the real problem. To render many scripts correctly, EGC to glyph mapping must be treated as a many-to-many mapping (called “shaping”). EGC code like utf8proc won’t do that. But Uniscribe presumably does.
2 replies 0 retweets 1 like -
This page has some examples. (They use “character” to mean EGC!) https://gankra.github.io/blah/text-hates-you/ …
1 reply 0 retweets 1 like -
Replying to @kssreeram @mark_dev_
Yes - we already do all that, because as you can see from the demo we handle Arabic, Hindi, Hebrew, etc. That's why it is necessary to do more than simple grapheme clustering.
1 reply 0 retweets 1 like -
Replying to @cmuratori @mark_dev_
I could have communicated better. Refterm does indeed render correctly. It's just a matter of terminology. The clusters from Uniscribe's ScriptItemize/ScriptBreak are not "extended grapheme clusters". They also do "shaping".
1 reply 0 retweets 2 likes -
To do that without Uniscribe, an algorithm for extended grapheme clusters isn't enough. Shaping is needed too. Harfbuzz can do that for example.
1 reply 0 retweets 2 likes -
To be more precise, the clustering algorithm must be “shaping” aware. The term “shaping” includes glyph generation/positioning too.
1 reply 0 retweets 2 likes
Yes - I find the terms to be kind of confusing and not well-defined :( So yes, we basically need "shaping-based breaking", meaning that the algorithm needs to tell us where to split inputs on boundaries where the glyphs won't change because of the split.
I guess the easiest way to say it is, we need an algorithm that will tell us where we can safely break inputs into pieces which will not "look different" when broken at those points.
2 replies 0 retweets 2 likes -
Perhaps, "appearance-preserving break"?
0 replies 0 retweets 3 likes
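To make the "appearance-preserving break" idea from the last few replies concrete, here is a minimal sketch. It is not how Refterm, Uniscribe, or HarfBuzz work: `toy_shape` is a hypothetical stand-in shaper that knows a single "fi" ligature, and `is_appearance_preserving` and `safe_breaks` are names invented for this illustration. The only point is the definition: a split is safe when shaping the two pieces separately gives the same glyphs as shaping the whole input.

```python
from typing import Callable, List

# Hypothetical toy shaper (an assumption for illustration only). A real
# shaper such as Uniscribe or HarfBuzz also handles reordering, positioning,
# and script-specific rules; this one only knows one ligature.
LIGATURES = {"fi": "fi_ligature"}


def toy_shape(text: str) -> List[str]:
    """Map text to a list of glyph names, merging known ligature pairs."""
    glyphs: List[str] = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in LIGATURES:
            glyphs.append(LIGATURES[pair])  # two code points, one glyph
            i += 2
        else:
            glyphs.append(text[i])  # one code point, one glyph
            i += 1
    return glyphs


def is_appearance_preserving(
    text: str, i: int, shape: Callable[[str], List[str]] = toy_shape
) -> bool:
    """True if splitting text at index i does not change the glyphs produced."""
    return shape(text[:i]) + shape(text[i:]) == shape(text)


def safe_breaks(
    text: str, shape: Callable[[str], List[str]] = toy_shape
) -> List[int]:
    """All interior indices where the input can be split without changing glyphs."""
    return [i for i in range(1, len(text)) if is_appearance_preserving(text, i, shape)]


if __name__ == "__main__":
    # Index 3 falls inside the "fi" ligature, so it is not a safe break.
    print(safe_breaks("office"))  # [1, 2, 4, 5]
```

This sketch compares glyph identities only; a real check would also have to compare glyph positions, since kerning or mark placement can change across a split. Re-shaping at every candidate index is also brute force and is meant to state the property, not to suggest an efficient algorithm.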