Oh, I had missed that the Galil rule gives Boyer-Moore linear worst case performance. This might mean that BM is always a better choice than KMP.
-
Show this thread
-
And Apostolico claims to place an O(2n) upper bound on BM. That's the same worst case as KMP.
1 reply 0 retweets 2 likesShow this thread -
Continuing the string searching story, I had the brain wave to try ripgrep. Searching for fixed strings in binary files, it tops out at 900 MiB/s. That's the target to meet. I should inspect the code to see how it's done.
2 replies 0 retweets 0 likesShow this thread -
Replying to @chvest
The key to success in this space is less about algorithm and more about making efficient use of the hardware via vectorization. ripgrep's primary literal searcher is here: https://github.com/rust-lang/regex/blob/a0f541bd707a39094d839c1ffd0141d27fe40681/src/literal/imp.rs … And in particular, the single substring case is here: https://github.com/rust-lang/regex/blob/a0f541bd707a39094d839c1ffd0141d27fe40681/src/literal/imp.rs …
1 reply 0 retweets 1 like -
Replying to @burntsushi5 @chvest
Make sure your benchmarks include a healthy variety. 900MB/s sounds a bit slow to me in many cases. For example: https://github.com/rust-lang/regex/blob/a0f541bd707a39094d839c1ffd0141d27fe40681/bench/log/07/rust#L85 …
4 replies 0 retweets 0 likes -
Replying to @burntsushi5
The 900 MiB/s were achieved with searching for an 11 byte string in a 4 GiB binary file, after drop_caches, on Ext4 with LUKS encryption, 2xPCIe 3.0. So roughly half of link-speed.
1 reply 0 retweets 0 likes -
Replying to @chvest
I see. Does the speed remain the same if the file is in cache? Also, how many matches are there? e.g., For a 13GB file in cache, `rg 'Sherlock Holmes' OpenSubtitles2018.raw.en -c` takes 1.7s for 7673 matches, which is about 7.5 GB/s.
3 replies 0 retweets 0 likes -
Replying to @burntsushi5
It drops from 5.5s to 1.4s, but I also just noticed that I’ve been running rg —version 0.9.0. Trying again with… *squints* 11.0.2.
1 reply 0 retweets 0 likes -
Replying to @chvest @burntsushi5
4.8s and 1.2s. Oddly no count is printed, though I run with `-caF --no-ignore`.
2 replies 0 retweets 0 likes -
Replying to @chvest
Maybe there are no matches? In any case, the strange thing here is that when you run a search after dropping caches, then you only get half your stated disk bandwidth. But when you run with cache, your search goes faster, which would *normally* suggest you were I/O bound before.
1 reply 0 retweets 0 likes
(But if you're I/O bound, then you should be saturating disk bandwidth.)
Loading seems to be taking a while.
Twitter may be over capacity or experiencing a momentary hiccup. Try again or visit Twitter Status for more information.