Oh, I had missed that the Galil rule gives Boyer-Moore linear worst case performance. This might mean that BM is always a better choice than KMP.
Make sure your benchmarks include a healthy variety. 900MB/s sounds a bit slow to me in many cases. For example: https://github.com/rust-lang/regex/blob/a0f541bd707a39094d839c1ffd0141d27fe40681/bench/log/07/rust#L85 …
-
-
The 900 MiB/s were achieved with searching for an 11 byte string in a 4 GiB binary file, after drop_caches, on Ext4 with LUKS encryption, 2xPCIe 3.0. So roughly half of link-speed.
-
I see. Does the speed remain the same if the file is in cache? Also, how many matches are there? e.g., For a 13GB file in cache, `rg 'Sherlock Holmes' OpenSubtitles2018.raw.en -c` takes 1.7s for 7673 matches, which is about 7.5 GB/s.
- 7 more replies
New conversation -
-
-
Finally, multiple pattern searching opens up a whole new can of worms with even more opportunities for vectorization: https://github.com/BurntSushi/aho-corasick/blob/master/DESIGN.md …
-
Thanks for sharing all this. I have a lot of homework now
End of conversation
New conversation -
-
-
And in particular, there will be no one algorithm that is good for everything. regex actually has a BM implementation too, which it uses in a limited set of circumstances: https://github.com/rust-lang/regex/blob/a0f541bd707a39094d839c1ffd0141d27fe40681/src/literal/imp.rs#L506-L546 …
Thanks. Twitter will use this to make your timeline better. UndoUndo
-
-
-
My hope is to move the regex code over to using bstr, which provides a Two-Way implementation that also uses the same frequency style heuristics found in regex today: https://github.com/BurntSushi/bstr/blob/f8a6c7031df1748b8469eb94372e2bd53d82ef1c/src/search/twoway.rs …
Thanks. Twitter will use this to make your timeline better. UndoUndo
-
Loading seems to be taking a while.
Twitter may be over capacity or experiencing a momentary hiccup. Try again or visit Twitter Status for more information.