(And you are right that the regex crate knows what a Unicode codepoint is. It is, in fact, the fundamental atom of a match for Unicode regexes. This is not as good as using grapheme clusters, but is easier to implement!)
-
-
-
Is it ridiculous to consider exposing a "codepoint match" facility? Or did I just not understand something about what makes matching the `char` type to a codepoint difficult? (the ontology is complicated enough that I could be missing a mismatch somewhere)
-
I don't know if I would use the term 'ridiculous' necessarily, but I think I would need some compelling evidence to motivate it. There are some incongruities to consider (like regexes that never match a single codepoint), and the question of whether it's really worth a new API item.
-
e.g., If there was an example that said, "match on a specific codepoint by doing `re.is_match(codepoint.encode_utf8(&mut [0; 4]))`" that might be enough. https://play.rust-lang.org/?gist=79cd9455d12d186af21ae685f6f909fb&version=stable …
-
I would never have thought to turn a `char` into a `&str` that way! I wonder if it means we should add `char::as_str(&self) -> &str`
-
Hmm. Don't think that would work. You would need to return an array, but we don't have a type for "fixed size array whose contents are guaranteed to be UTF-8."
-
I mean once you have the &[u8] can't you read it into a &str? Where would the unsafety come from?
-
I don't think you can get a &[u8] for anything other than ASCII, since the in-memory representation is different between `char` and UTF-8. Would have to be owned
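To make the representation mismatch concrete, here is a small sketch using only the standard library. The helper name `representations` is made up for illustration; it returns a `char`'s scalar value alongside its UTF-8 bytes (copied into a `Vec` purely so the two are easy to compare).

```rust
/// Hypothetical helper: returns the Unicode scalar value of `c` together
/// with its UTF-8 encoding (owned here only to make comparison easy).
fn representations(c: char) -> (u32, Vec<u8>) {
    // A single codepoint never needs more than 4 bytes in UTF-8.
    let mut buf = [0u8; 4];
    let utf8 = c.encode_utf8(&mut buf).as_bytes().to_vec();
    (c as u32, utf8)
}

fn main() {
    // In memory, a `char` is always a 4-byte scalar value.
    assert_eq!(std::mem::size_of::<char>(), 4);

    // Non-ASCII: the scalar value 0xE9 and the UTF-8 bytes differ.
    assert_eq!(representations('é'), (0xE9, vec![0xC3, 0xA9]));

    // ASCII: the scalar value and the single UTF-8 byte coincide.
    assert_eq!(representations('q'), (0x71, vec![0x71]));
}
```

So a borrowed UTF-8 view of a `char` has to point at some other buffer that holds the encoded bytes.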
New conversation -
-
-
Single character is a very vague concept when it comes to Unicode. Maybe that’s why there’s no obvious solution.
-
It might be vague in Unicode, but regexes certainly understand them ("character class") and characters in Rust do represent Unicode characters. And there are crates like unicode-xid and the unstable unicode internals in rustc that serve this purpose...
End of conversation
New conversation -
-
-
Well essentially it would give you a slice with a length, which is what you already get with String and &str. What else did you have in mind?
-
Let's say I have a char 'q' and I want to find out if it matches ID_Start. It seems weird that I need to make a string to find out. I can use the unic crates, but regex already has a good character matching DSL so I'd rather just use it.
End of conversation
New conversation -
-
-
More annoying alternative: generate the table of interest yourself with ucd-generate: https://github.com/BurntSushi/ucd-generate … Then you can binary search it yourself. Each table in the regex crate includes the ucd-generate command used to produce it: https://github.com/rust-lang/regex/blob/2b1fc2772dc4d99ad732a43751fb5627f327abc8/regex-syntax/src/unicode_tables/property_bool.rs#L3 …
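A rough sketch of the "binary search it yourself" step, assuming the generated table is a sorted slice of inclusive `(char, char)` ranges. `DEMO_TABLE` and `in_table` are made-up names, and the table is a tiny stand-in, not real ucd-generate output.

```rust
use std::cmp::Ordering;

// Hypothetical stand-in table: sorted, non-overlapping, inclusive ranges.
// A real table for a Unicode property would be far larger.
const DEMO_TABLE: &[(char, char)] = &[
    ('A', 'Z'),
    ('a', 'z'),
    ('\u{AA}', '\u{AA}'),
    ('\u{C0}', '\u{D6}'),
];

/// Returns true if `c` falls inside any range of `table`.
fn in_table(table: &[(char, char)], c: char) -> bool {
    table
        .binary_search_by(|&(start, end)| {
            if end < c {
                Ordering::Less // range lies entirely below `c`
            } else if start > c {
                Ordering::Greater // range lies entirely above `c`
            } else {
                Ordering::Equal // `c` is inside this range
            }
        })
        .is_ok()
}

fn main() {
    assert!(in_table(DEMO_TABLE, 'q'));
    assert!(!in_table(DEMO_TABLE, '7'));
}
```

This works directly on a `char`, so no string is needed, at the cost of maintaining the table yourself.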
-
-
-
Internal tables are definitely not exposed. The quickest "trick" I can think of is to ignore your requirement. :-) Namely, use https://doc.rust-lang.org/std/primitive.char.html#method.encode_utf8 … to turn a `char` into a `&str`, which can be done without an alloc, and then match against regex.
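A minimal sketch of that trick, kept to the standard library so it runs on its own. `char_matches` is a hypothetical helper, and the closure stands in for a regex; with the regex crate one would presumably pass something like `|s| re.is_match(s)` for a `re: regex::Regex`.

```rust
/// Hypothetical helper: runs any `&str` predicate against a single `char`
/// without allocating.
fn char_matches<F: Fn(&str) -> bool>(c: char, is_match: F) -> bool {
    // Stack buffer; a UTF-8 encoded codepoint is at most 4 bytes long.
    let mut buf = [0u8; 4];
    // `encode_utf8` writes into `buf` and returns a string slice borrowing it.
    let s: &str = c.encode_utf8(&mut buf);
    is_match(s)
}

fn main() {
    // Stand-in predicate instead of a compiled regex.
    let is_alpha = |s: &str| s.chars().all(char::is_alphabetic);
    assert!(char_matches('q', is_alpha));
    assert!(char_matches('é', is_alpha));
    assert!(!char_matches('7', is_alpha));
}
```

If the predicate is a regex, anchoring it (`^...$`) is probably what you want, since an unanchored pattern that can match the empty string would accept every `char`.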
-
-
-
More generally, internals do not actually have a direct way to do this, since char classes are matched differently depending on engine. For DFA, classes compile down to UTF-8 automata. For NFA, classes are binary searched: https://github.com/rust-lang/regex/blob/2b1fc2772dc4d99ad732a43751fb5627f327abc8/src/prog.rs#L370 …
-
And if you have utf-8 and your character is ascii, just look for the character as a byte. Magic of utf-8.
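For the ASCII case, a small sketch (the function name `contains_ascii` is made up here). It relies on the UTF-8 guarantee that every byte of a multi-byte sequence is `0x80` or above, so an ASCII byte can only ever appear as that ASCII character.

```rust
/// Hypothetical helper: checks whether an ASCII character occurs in
/// UTF-8 text by scanning raw bytes, with no decoding.
fn contains_ascii(haystack: &str, needle: u8) -> bool {
    // Only valid for ASCII needles; non-ASCII bytes can occur inside
    // multi-byte sequences and would give false positives.
    debug_assert!(needle.is_ascii());
    haystack.as_bytes().contains(&needle)
}

fn main() {
    // The non-ASCII 'ï' is two bytes, neither of which can equal b'q'.
    assert!(contains_ascii("naïve q", b'q'));
    assert!(!contains_ascii("naïve", b'q'));
}
```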
End of conversation