Very impressed with https://github.com/BurntSushi/ripgrep …

$ time grep -v -x -i -f stoplist words > words-stop
real 100m39.604s
user 97m24.045s
sys 0m13.054s

$ time rg -v -x -i -f stoplist words > words-stop-rg
real 4m9.471s
user 1m5.310s
sys 2m57.231s
Replying to @burntsushi5
Whoa, how did I miss this? It was GNU grep, I think version 2.25.
Replying to @_devkev_
Thanks for responding! Is it possible to share `words` and `stoplist`?
Replying to @burntsushi5
stoplist yes, but words no. I could give you summary stats, if it would help. What's your interest?
Replying to @_devkev_
One possible explanation is that the first run with GNU grep was reading `words` from disk (it is presumably a large file), but the second run with rg was mostly reading `words` from the I/O cache. You could test this by running GNU grep a few times.
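A rough way to check the cache hypothesis is to time the same command several times in a row. The sketch below is only illustrative: `time_runs` is a hypothetical helper name, and `date +%s` gives whole-second resolution, which should be enough for runs that take minutes. If only the first run is slow, the I/O cache is a plausible explanation.

```shell
#!/bin/sh
# time_runs N CMD [ARGS...]
# Hypothetical helper: run CMD N times, discard its output,
# and print the wall-clock seconds taken by each run.
time_runs() {
    runs=$1
    shift
    i=1
    while [ "$i" -le "$runs" ]; do
        start=$(date +%s)
        # grep-like tools exit non-zero when nothing matches; ignore that here.
        "$@" > /dev/null || true
        end=$(date +%s)
        echo "run $i: $((end - start))s"
        i=$((i + 1))
    done
}

# Example with the command from the first tweet (commented out because it
# needs the original `stoplist` and `words` files):
# time_runs 3 grep -v -x -i -f stoplist words
```

Comparing the first line of output against the later ones should show whether the initial read from disk dominates the timing.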
Replying to @burntsushi5
Later I can try to narrow down the culprit grep flag(s). I could also maybe try running a profiler or something - let me know if you have any ideas about what might be fruitful.
Replying to @_devkev_ @burntsushi5
Looks like grep just has terrible scaling factors wrt the number of patterns (especially with -x) pic.twitter.com/DQJ8UqnGK6
I bet if you keep increasing the number of patterns, you will see a performance cliff. (And then you should be able to fix it by increasing `--dfa-size-limit`.)
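One way to look for that cliff is to time the matcher against progressively larger prefixes of the pattern file. This is a sketch under assumptions: `scale_test` is a hypothetical helper, the sizes 10/100/1000 are arbitrary and would need to grow toward the real size of `stoplist`, and the `1G` value passed to `--dfa-size-limit` is just an example (the flag accepts a number with a K, M, or G suffix).

```shell
#!/bin/sh
# scale_test PATTERN_FILE HAYSTACK CMD [ARGS...]
# Hypothetical helper: run CMD with -f pointing at the first N patterns of
# PATTERN_FILE, for a few increasing N, and print the seconds each run took.
scale_test() {
    patterns=$1
    haystack=$2
    shift 2
    for n in 10 100 1000; do
        subset=$(mktemp)
        head -n "$n" "$patterns" > "$subset"
        start=$(date +%s)
        # A non-zero exit (e.g. no matching lines) is ignored on purpose.
        "$@" -f "$subset" "$haystack" > /dev/null || true
        end=$(date +%s)
        echo "$n patterns: $((end - start))s"
        rm -f "$subset"
    done
}

# Examples using the flags from the thread (commented out because they need
# the original `stoplist` and `words` files):
# scale_test stoplist words grep -v -x -i
# scale_test stoplist words rg -v -x -i
# scale_test stoplist words rg -v -x -i --dfa-size-limit 1G
```

If the rg timings jump sharply at some pattern count and the jump goes away with the larger limit, that would be consistent with the DFA size limit being the cause.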