Oh, I had missed that the Galil rule gives Boyer-Moore linear worst-case performance. This might mean that BM is always a better choice than KMP.
And Apostolico claims to place an upper bound of 2n character comparisons on BM. That's the same worst case as KMP.
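To make the KMP side of that comparison concrete, here is a minimal sketch of a textbook KMP search that counts text/pattern character comparisons. The function names are illustrative, not taken from any library, and the counter only covers the search phase (building the failure table is not counted). Under those assumptions the count should stay at or below 2n for a text of length n.

```python
def failure_table(pattern):
    """fail[i] = length of the longest proper border of pattern[:i + 1]."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(text, pattern):
    """Return (match start positions, comparisons made while scanning text)."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    fail = failure_table(pattern)
    matches = []
    comparisons = 0
    k = 0  # number of pattern characters currently matched
    for i, c in enumerate(text):
        while True:
            comparisons += 1
            if c == pattern[k]:
                k += 1
                break
            if k == 0:
                break
            k = fail[k - 1]  # mismatch: fall back without re-reading text
        if k == len(pattern):
            matches.append(i - k + 1)
            k = fail[k - 1]  # keep going so overlapping matches are found
    return matches, comparisons


if __name__ == "__main__":
    text = "a" * 999 + "b"
    matches, comparisons = kmp_search(text, "aab")
    print(matches, comparisons, 2 * len(text))
```

The input in the usage example is chosen to push KMP close to its bound: almost every text character is compared twice, once failing against `b` and once succeeding against `a`.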
Continuing the string searching story, I had the brain wave to try ripgrep. Searching for fixed strings in binary files, it tops out at 900 MiB/s. That's the target to meet. I should inspect the code to see how it's done.
Replying to @chvest
The key to success in this space is less about algorithm and more about making efficient use of the hardware via vectorization. ripgrep's primary literal searcher is here: https://github.com/rust-lang/regex/blob/a0f541bd707a39094d839c1ffd0141d27fe40681/src/literal/imp.rs … And in particular, the single substring case is here: https://github.com/rust-lang/regex/blob/a0f541bd707a39094d839c1ffd0141d27fe40681/src/literal/imp.rs …
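One common way to lean on the hardware, shown here only as a rough sketch and not as a description of the linked code, is to scan for a single byte of the needle with a fast primitive and verify the full needle only at candidate positions. In this Python version `bytes.find` stands in for a vectorized byte scan (an assumption about the runtime, which usually delegates to an optimized C routine). `find_all` and the `rare_offset` parameter are hypothetical names, and choosing which byte counts as "rare" is left to the caller.

```python
def find_all(haystack, needle, rare_offset=0):
    """Return start offsets of needle in haystack (overlaps included).

    rare_offset is the index, within needle, of the byte to scan for.
    Picking a byte that is uncommon in the haystack keeps the number of
    candidate positions (and therefore verifications) small.
    """
    if not needle:
        raise ValueError("needle must be non-empty")
    if not 0 <= rare_offset < len(needle):
        raise ValueError("rare_offset must index into needle")

    rare = needle[rare_offset:rare_offset + 1]
    hits = []
    pos = rare_offset  # a match cannot place the rare byte any earlier
    while True:
        # Fast path: skip ahead to the next occurrence of the rare byte.
        pos = haystack.find(rare, pos)
        if pos == -1:
            return hits
        start = pos - rare_offset
        # Slow path: confirm the whole needle at this candidate position.
        if haystack.startswith(needle, start):
            hits.append(start)
        pos += 1


if __name__ == "__main__":
    data = b"xxSherlock Holmes and Sherlock Holmes"
    # Scan for b"k" (index 7 in the needle) instead of the leading b"S".
    print(find_all(data, b"Sherlock Holmes", rare_offset=7))
```

The design point is that the inner loop spends nearly all of its time inside the single-byte scan, so overall speed depends more on how fast that primitive is than on the asymptotics of the verification step.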
Replying to @burntsushi5 @chvest
Make sure your benchmarks include a healthy variety. 900MB/s sounds a bit slow to me in many cases. For example: https://github.com/rust-lang/regex/blob/a0f541bd707a39094d839c1ffd0141d27fe40681/bench/log/07/rust#L85 …
Replying to @burntsushi5
The 900 MiB/s was achieved by searching for an 11-byte string in a 4 GiB binary file, after drop_caches, on ext4 with LUKS encryption, 2x PCIe 3.0. So roughly half of link speed.
Replying to @chvest
I see. Does the speed remain the same if the file is in cache? Also, how many matches are there? For example, for a 13GB file in cache, `rg 'Sherlock Holmes' OpenSubtitles2018.raw.en -c` takes 1.7s for 7673 matches, which is about 7.5 GB/s.
If I clear caches first, then it takes about 24.5 seconds, or ~543 MB/s. (I have a SATA III SSD, so this is close to its top speed.) The input file can be downloaded here: http://opus.nlpl.eu/download.php?f=OpenSubtitles/v2018/mono/OpenSubtitles.raw.en.gz …
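The thread mixes MiB/s and MB/s, so a small sketch of the arithmetic may help when comparing the figures. It assumes the "13GB" file is exactly 13 × 10^9 bytes, which is only a rounded value, so the results land near the quoted numbers rather than exactly on them. The helper names are illustrative.

```python
MIB = 1024 ** 2  # bytes per mebibyte
MB = 1000 ** 2   # bytes per megabyte
GB = 1000 ** 3   # bytes per gigabyte


def throughput_gb_per_s(size_bytes, seconds):
    """Average search throughput in decimal gigabytes per second."""
    return size_bytes / seconds / GB


def mib_per_s_to_mb_per_s(rate_mib):
    """Convert a rate from MiB/s to MB/s."""
    return rate_mib * MIB / MB


if __name__ == "__main__":
    size = 13 * GB  # assumption: the rounded "13GB" file size
    print(f"cached:   {throughput_gb_per_s(size, 1.7):.2f} GB/s")
    print(f"uncached: {throughput_gb_per_s(size, 24.5) * 1000:.0f} MB/s")
    print(f"900 MiB/s = {mib_per_s_to_mb_per_s(900):.0f} MB/s")
```

With a file slightly larger than 13 GB, the uncached figure moves toward the ~543 MB/s quoted above.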