Continuing the string searching story, I had the brain wave to try ripgrep. Searching for fixed strings in binary files, it tops out at 900 MiB/s. That's the target to meet. I should inspect the code to see how it's done.
-
Show this thread
-
Replying to @chvest
The key to success in this space is less about algorithm and more about making efficient use of the hardware via vectorization. ripgrep's primary literal searcher is here: https://github.com/rust-lang/regex/blob/a0f541bd707a39094d839c1ffd0141d27fe40681/src/literal/imp.rs … And in particular, the single substring case is here: https://github.com/rust-lang/regex/blob/a0f541bd707a39094d839c1ffd0141d27fe40681/src/literal/imp.rs …
1 reply 0 retweets 1 like -
Replying to @burntsushi5 @chvest
Make sure your benchmarks include a healthy variety. 900MB/s sounds a bit slow to me in many cases. For example: https://github.com/rust-lang/regex/blob/a0f541bd707a39094d839c1ffd0141d27fe40681/bench/log/07/rust#L85 …
4 replies 0 retweets 0 likes -
Replying to @burntsushi5
The 900 MiB/s were achieved with searching for an 11 byte string in a 4 GiB binary file, after drop_caches, on Ext4 with LUKS encryption, 2xPCIe 3.0. So roughly half of link-speed.
1 reply 0 retweets 0 likes -
Replying to @chvest
I see. Does the speed remain the same if the file is in cache? Also, how many matches are there? e.g., For a 13GB file in cache, `rg 'Sherlock Holmes' OpenSubtitles2018.raw.en -c` takes 1.7s for 7673 matches, which is about 7.5 GB/s.
3 replies 0 retweets 0 likes -
Replying to @burntsushi5
It drops from 5.5s to 1.4s, but I also just noticed that I’ve been running rg —version 0.9.0. Trying again with… *squints* 11.0.2.
1 reply 0 retweets 0 likes -
Replying to @chvest @burntsushi5
4.8s and 1.2s. Oddly no count is printed, though I run with `-caF --no-ignore`.
2 replies 0 retweets 0 likes -
Replying to @chvest @burntsushi5
`perf report` says 53.2% are in memchr::x86::avx::memchr. Then 12.2% in regex::re_bytes::Regex::shortest_match_at. The rest is in kernel page cache.
1 reply 0 retweets 0 likes -
Replying to @chvest
Anyway, it helps to standardize on the same input. Otherwise it's hard to diagnose. :-)
1 reply 0 retweets 0 likes -
Replying to @burntsushi5
I guess I had a typo. Tried a simpler string, and now rg and fgrep both find matches but disagree on the count (different handling of binary?). I'll try your linked file when I get home.
1 reply 0 retweets 0 likes
If you're using `-a` then the counts should be identical, otherwise you may have found a bug. If you aren't using `-a`, then indeed, there can be differences if your input contains a NUL byte. Try running with `--no-mmap` to see if that changes things.
Loading seems to be taking a while.
Twitter may be over capacity or experiencing a momentary hiccup. Try again or visit Twitter Status for more information.