Continuing the string searching story, I had the brain wave to try ripgrep. Searching for fixed strings in binary files, it tops out at 900 MiB/s. That's the target to meet. I should inspect the code to see how it's done.
-
Show this thread
-
Replying to @chvest
The key to success in this space is less about algorithm and more about making efficient use of the hardware via vectorization. ripgrep's primary literal searcher is here: https://github.com/rust-lang/regex/blob/a0f541bd707a39094d839c1ffd0141d27fe40681/src/literal/imp.rs … And in particular, the single substring case is here: https://github.com/rust-lang/regex/blob/a0f541bd707a39094d839c1ffd0141d27fe40681/src/literal/imp.rs …
1 reply 0 retweets 1 like -
Replying to @burntsushi5 @chvest
Make sure your benchmarks include a healthy variety. 900MB/s sounds a bit slow to me in many cases. For example: https://github.com/rust-lang/regex/blob/a0f541bd707a39094d839c1ffd0141d27fe40681/bench/log/07/rust#L85 …
4 replies 0 retweets 0 likes -
Replying to @burntsushi5
The 900 MiB/s were achieved with searching for an 11 byte string in a 4 GiB binary file, after drop_caches, on Ext4 with LUKS encryption, 2xPCIe 3.0. So roughly half of link-speed.
1 reply 0 retweets 0 likes -
Replying to @chvest
I see. Does the speed remain the same if the file is in cache? Also, how many matches are there? e.g., For a 13GB file in cache, `rg 'Sherlock Holmes' OpenSubtitles2018.raw.en -c` takes 1.7s for 7673 matches, which is about 7.5 GB/s.
3 replies 0 retweets 0 likes -
Replying to @burntsushi5
You got faster CPU or RAM than I do: $ time rg 'Sherlock Holmes' ~/Downloads/en.txt -c 7673 real0m16,405s user0m1,998s sys0m3,852s $ time rg 'Sherlock Holmes' ~/Downloads/en.txt -c 7673 real0m2,734s user0m1,548s sys0m1,179s $
1 reply 0 retweets 0 likes -
Replying to @chvest
Yeah that's fine. My main point here was to question whether 900 MB/s should be your target. :-) Your search there (in cache) is running at ~4.9 GB/s. It's interesting though that your out of cache search isn't saturating your disk bandwidth? Or is it what you'd expect?
1 reply 0 retweets 0 likes -
Replying to @burntsushi5
I'd expect it to not saturate IO, yeah. It's single-threaded, alternating between searching and IO. Add in system call and decryption overheads, and it doesn't strike me as out of the ordinary.
1 reply 0 retweets 0 likes -
Replying to @chvest
Interesting. I wouldn't necessarily think that, although perhaps decryption overhead is the variable I'm missing. What does `pv < OpenSubtitles2018.raw.en > /dev/null` say after dropping caches? If ripgrep can get 4.9 GB/s in cache, then I'd expect it to be as fast as `pv`.
1 reply 0 retweets 0 likes -
Replying to @burntsushi5
`pv` gives me the same numbers; ~800 MiB/s for IO, 4.7 GiB/s for cache. Fio says I should be able to get 1.6 GiB/s for sequential read using io_uring, so that's something to try. I should also see what the other IO engines can do, though.
1 reply 0 retweets 0 likes
Ah I see, okay good. So I think this at least confirms the problem isn't with ripgrep. :-) Thanks for your patience debugging this with me! Especially on an imperfect medium.
-
-
Replying to @burntsushi5
Yeah, no worries. And thanks for the pointers earlier. I plan on giving the vectorised Two Way implementations a try.
0 replies 0 retweets 1 likeThanks. Twitter will use this to make your timeline better. UndoUndo
-
Loading seems to be taking a while.
Twitter may be over capacity or experiencing a momentary hiccup. Try again or visit Twitter Status for more information.