This seems like just a special case of Matcher with an added label; how else does it differ? Tagger seems most appropriate, but if it could extend to fuzzy or similarity-based matching, an EntityClassifier would be great.
-
-
-
Kind of, yes – it's a component that you can add to your pipeline and that uses both the Matcher and PhraseMatcher to add (and optionally, overwrite) entities. It's probably one of the most common custom components users build, so I want to ship it as a package or in core.
-
Example patterns, which can be loaded from Python or JSON, and saved to a file:
{'label': 'ORG', 'pattern': 'Apple'}
{'label': 'GPE', 'pattern': [{'lower': 'san'}, {'lower': 'francisco'}]}
-
The component can also be part of a model pipeline, with the patterns file included in the model data. This would let you develop, train, package and ship a model that uses both statistical predictions and a word list / ontology for NER.
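As a rough illustration of "loaded from Python or JSON, and saved to a file", the sketch below round-trips the two example patterns through a file using only the standard library. The one-JSON-object-per-line layout, the file name `patterns.jsonl`, and the helper names `save_patterns` / `load_patterns` are assumptions made for this example, not the component's confirmed API.

```python
import json
import tempfile
from pathlib import Path

# The two example patterns from the thread: a phrase (string) pattern
# and a token-based (list of dicts) pattern.
PATTERNS = [
    {"label": "ORG", "pattern": "Apple"},
    {"label": "GPE", "pattern": [{"lower": "san"}, {"lower": "francisco"}]},
]


def save_patterns(patterns, path):
    """Write one JSON object per line (assumed layout, hypothetical helper)."""
    with Path(path).open("w", encoding="utf8") as f:
        for entry in patterns:
            f.write(json.dumps(entry) + "\n")


def load_patterns(path):
    """Read the patterns back, skipping blank lines (hypothetical helper)."""
    with Path(path).open(encoding="utf8") as f:
        return [json.loads(line) for line in f if line.strip()]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "patterns.jsonl"
        save_patterns(PATTERNS, path)
        # The round trip should preserve both pattern types unchanged.
        assert load_patterns(path) == PATTERNS
```

Keeping the patterns as plain data like this is what would make it possible to ship them alongside a model's other files.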
-
That's great, I've been wanting to do this for some time. It'd be fantastic to have that built in.
End of conversation
New conversation -
-
-
This feels a bit like a Matcher with an EntityFactory callback (would you ever need to compose one with a PhraseMatcher?). Not that they're snappy names.
Hard thing #1...
-
Yeah, it actually supports both matchers via patterns like this:
{'label': 'ORG', 'pattern': 'Apple'}
{'label': 'GPE', 'pattern': [{'lower': 'san'}, {'lower': 'francisco'}]}
The difference from a matcher callback is that it's a pipeline component, so you can add it at any step.
-
You can also require it in the pipeline defined in a model package, so you could ship a model that combines statistical and rule-based NER and ships with both the binary weights and a lexicon of match patterns.
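To make the two pattern types and the "add (and optionally, overwrite) entities" behaviour concrete, here is a pure-Python toy. It is only a sketch: the real component relies on spaCy's `Matcher` and `PhraseMatcher`, whereas this stand-in works on a plain list of token strings. The function name `find_entities`, the `(start, end, label)` span format, the case-sensitive handling of string patterns, and the overlap rule are all assumptions made for illustration.

```python
PATTERNS = [
    {"label": "ORG", "pattern": "Apple"},
    {"label": "GPE", "pattern": [{"lower": "san"}, {"lower": "francisco"}]},
]


def overlaps(a, b):
    """True if two (start, end, label) token spans share at least one token."""
    return a[0] < b[1] and b[0] < a[1]


def find_entities(tokens, patterns, existing=(), overwrite=False):
    """Return sorted (start, end, label) spans after applying the patterns.

    existing:  spans that are already set, e.g. by a statistical model.
    overwrite: if True, a rule match replaces any span it overlaps;
               if False, a rule match that overlaps a kept span is dropped.
    """
    matches = []
    for entry in patterns:
        pattern = entry["pattern"]
        if isinstance(pattern, str):
            # Phrase pattern: exact match on token texts (assumed case-sensitive).
            want, have = pattern.split(), list(tokens)
        else:
            # Token pattern: this sketch only understands the 'lower' attribute.
            want = [spec["lower"] for spec in pattern]
            have = [tok.lower() for tok in tokens]
        if not want:
            continue
        n = len(want)
        for i in range(len(have) - n + 1):
            if have[i:i + n] == want:
                matches.append((i, i + n, entry["label"]))

    result = list(existing)
    for match in matches:
        clashing = [span for span in result if overlaps(match, span)]
        if clashing and not overwrite:
            continue
        result = [span for span in result if span not in clashing] + [match]
    return sorted(result)


if __name__ == "__main__":
    tokens = "Apple opened an office in San Francisco".split()
    print(find_entities(tokens, PATTERNS))
    # A pre-existing (hypothetical) model prediction on "San" is kept by default...
    print(find_entities(tokens, PATTERNS, existing=[(5, 6, "PERSON")]))
    # ...and replaced when overwrite=True.
    print(find_entities(tokens, PATTERNS, existing=[(5, 6, "PERSON")], overwrite=True))
```

Because the function only takes existing spans in and hands spans back, it can in principle run before or after a statistical step, which is the point made above about a pipeline component versus a matcher callback.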
-
This sounds really useful!
End of conversation
New conversation -
-
-
PatternMatcher?
-
Ah, yes – that's actually what we call the similar functionality in https://prodi.gy. The spaCy component will even accept data in the same format. (Just worried that this might cause confusion? We already have overlapping names for spaCy's/Prodigy's EntityRecognizer etc.)
End of conversation
New conversation -
-
-
Why not like RegexpEntityLinker and then have an ML-based EntityLinker?
-
Thanks! The only difficulty here is that the patterns would go beyond regular expressions so anything "regex" won't work. And maybe "linker" could be confusing because people might think it's also doing entity disambiguation when it's only labeling spans? Ahhh, naming is hard
End of conversation
New conversation -
-
-
Is there an ETA on when this new component will be released? It would come in really handy.
-
WIP PR is here: https://github.com/explosion/spaCy/pull/2513 …
I really want to get it into the v2.1.0 nightly, but we still haven't decided on the name. There were many great suggestions in this thread, but I feel like that made it even more difficult to decide, haha.
End of conversation
New conversation -
-
-
Some funky ones: EntityHunter, EntitySeeker, EntityScout. Or maybe name it after the tools/algo used, since in one of the comments you mentioned that things other than regex are used as well.
-
Thanks – I love how creative everyone is!
And yeah, the matching happens via spaCy's `Matcher` and `PhraseMatcher` (see here: https://spacy.io/usage/linguistic-features#rule-based-matching …). So it's really a RuleBasedMatcherPhraseMatcherEntityAdderAndOptionallyReplacerPipelineComponent.
End of conversation
New conversation -
-
-
EntityRecognizer
-
Unfortunately, that already exists – it's the model / pipeline component that *predicts* named entities (see https://spacy.io/api/entityrecognizer …). It'd definitely be nice to find a name that's just as straightforward and logical / easy to type.
End of conversation
New conversation -
-
-
Since the vanilla entity tagger is called EntityRecognizer... EntityLexiconRecognizer or EntityRuleRecognizer seem to keep API consistency while being suggestive about the logic backing this component.
-
Thanks! API consistency is definitely a very important factor. I also really like the "lexicon" terminology btw (you're the first one to suggest that, actually). Even just `EntityLexicon` would be pretty spot-on...
-
Yeah, that would work... and isn't a mouthful (:
End of conversation
New conversation -
Developing
spaCy users: Imagine a pipeline component that takes token/phrase patterns & labels them as entities. Basically, a rule-based entity recognizer. How would you expect it to be named?
`EntityMatcher` is confusing because it's *not* a matcher. But `EntityRuler` sounds weird?