Thanks! Will fix. I'm not sure how my benchmark went so wrong though? I evaluated 2014-10-14, using the provided batch.sh
-
https://github.com/explosion/spacy-benchmarks/blob/master/bin/run_raw_stanford.py
I don't see why I've called ssplit here; spaCy's tokenizer doesn't split sentences. Could this explain it?
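To see what the ssplit annotator costs, a minimal sketch (not the original benchmark; the classpath and input file are placeholders) that times the CoreNLP pipeline with and without the sentence splitter:

```python
# Hedged sketch: time CoreNLP with vs. without ssplit.
# Classpath and input file are assumptions, not from the original repo.
# Note each run also pays JVM + model startup, discussed later in the thread.
import subprocess
import time

def time_corenlp(annotators, input_file="input.txt"):
    """Run the CoreNLP pipeline once and return wall-clock seconds."""
    start = time.time()
    subprocess.run(
        ["java", "-cp", "stanford-corenlp/*",
         "edu.stanford.nlp.pipeline.StanfordCoreNLP",
         "-annotators", annotators,
         "-file", input_file],
        check=True,
    )
    return time.time() - start

print("tokenize only:    %.2fs" % time_corenlp("tokenize"))
print("tokenize,ssplit:  %.2fs" % time_corenlp("tokenize,ssplit"))
```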
-
Thanks for the rapid & friendly response! Yeah, that could totally explain it – if anything the tokenizer is probably a bit slower since 2014 as more exceptions & non-BMP Unicode support were added, but I think the sentence splitter used to be quite slow for no very good reason.
-
v2's current tokenisation speed is considered a bug btw --- it's high on our list
-
batch.sh is missing from GitHub. Timing spaCy inside a Python runner script vs. forking a subprocess with a bash script (run_stanford.py) isn't fair for fast tokenization. stanford.ini is also missing, so I'm unsure what run_raw_stanford.py is doing, but likely more overhead there too?
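For comparison, an in-process timing of just spaCy's tokenizer might look like this sketch (assuming spaCy v2 and a newline-delimited input file; neither detail comes from the missing scripts):

```python
# Minimal in-process timing of spaCy's tokenizer, with no subprocess overhead.
# spacy.blank("en") loads only the language data, no pipeline components.
import time
import spacy

nlp = spacy.blank("en")
texts = [line for line in open("input.txt", encoding="utf-8") if line.strip()]

start = time.time()
n_tokens = sum(len(nlp.tokenizer(text)) for text in texts)
elapsed = time.time() - start
print("%d tokens in %.2fs (%.0f tokens/sec)" % (n_tokens, elapsed, n_tokens / elapsed))
```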
-
Still reconstructing, but I believe batch.sh just invokes CoreNLP with the list of filenames. My current theory is that I checked run_stanford.py against run_raw_stanford.py and saw little difference, but the raw version is wrong for tokenization, as it also calls ssplit.
End of conversation
-
New conversation
-
Thanks for this information. The reason we're using spaCy is that it's much more robustly implemented than other Python NLP libraries. I'd like to try the Stanford tokenizer, but first I need to see whether calling Java code from inside a Python module adds significant overhead.
-
There is no need to call the Java CoreNLP code from within Python, since CoreNLP has an embedded API.
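For instance, against the documented CoreNLP server, which keeps the JVM warm between requests (the port and annotator settings below are assumptions):

```python
# Hedged sketch: tokenize via a locally running CoreNLP server.
# Start the server separately, e.g.:
#   java -mx4g -cp "stanford-corenlp/*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000
import json
import requests

def corenlp_tokenize(text, url="http://localhost:9000"):
    props = {"annotators": "tokenize,ssplit", "outputFormat": "json"}
    resp = requests.post(url, params={"properties": json.dumps(props)},
                         data=text.encode("utf-8"))
    resp.raise_for_status()
    doc = resp.json()
    return [tok["word"] for sent in doc["sentences"] for tok in sent["tokens"]]

print(corenlp_tokenize("No new JVM is launched for this call."))
```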
-
FWIW, I tried this (and now I promise to stop). Tokenizing in Python using CoreNLP via the stanfordcorenlp Python package is faster than spaCy v2, but with the conversions to/from JSON etc. it ends up only 30% faster, not 20x faster. https://nlp.stanford.edu/software/tokenizer.html#Speed
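A rough reconstruction of that head-to-head (not the poster's actual script; the CoreNLP path and input file are assumptions):

```python
# Hedged sketch of the comparison: stanfordcorenlp wrapper vs. spaCy v2.
# The wrapper launches and talks to a CoreNLP server, exchanging JSON under
# the hood, which is where the conversion overhead comes from.
import time
import spacy
from stanfordcorenlp import StanfordCoreNLP

corenlp = StanfordCoreNLP("./stanford-corenlp-full")  # path to a CoreNLP download
spacy_nlp = spacy.blank("en")
texts = [line for line in open("input.txt", encoding="utf-8") if line.strip()]

start = time.time()
n_corenlp = sum(len(corenlp.word_tokenize(t)) for t in texts)
print("CoreNLP: %d tokens, %.2fs" % (n_corenlp, time.time() - start))

start = time.time()
n_spacy = sum(len(spacy_nlp.tokenizer(t)) for t in texts)
print("spaCy:   %d tokens, %.2fs" % (n_spacy, time.time() - start))

corenlp.close()
```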
-
At this point, a question: how can I write a simple PTBTokenizer test case?
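One simple option is a sketch like this, shelling out to PTBTokenizer's documented command-line entry point (the classpath is a placeholder, and reading stdin when no file is given is an assumption):

```python
# Hedged sketch: smoke-test PTBTokenizer via its command-line entry point.
# PTBTokenizer prints one token per line; with default PTB escaping,
# straight quotes come back as `` and ''.
import subprocess

def ptb_tokenize(text, classpath="stanford-corenlp/*"):
    proc = subprocess.run(
        ["java", "-cp", classpath, "edu.stanford.nlp.process.PTBTokenizer"],
        input=text.encode("utf-8"),  # assumes the tool reads stdin when no file is given
        stdout=subprocess.PIPE,
        check=True,
    )
    return proc.stdout.decode("utf-8").split()

tokens = ptb_tokenize('"Oh, no," she said.')
print(tokens)  # expected something like: ['``', 'Oh', ',', 'no', ',', "''", 'she', 'said', '.']
```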
End of conversation
-
New conversation
-
@honnibal it seems like the benchmark is also measuring the load time of Java and the CoreNLP library, and the cost of serializing the text to disk and then calling CoreNLP...