Intersection of cryptography, software engineering, statistics, and performance:
If I want to draw a random sample from a set of data, and I wish to simply do "hash(dataitem) & bitmask < value", what properties does my hash function need for this to be statistically sound? Clearly, a cryptographic hash will suffice in terms of uniformity of sampling, but how about non-cryptographic hashes? CityHash64?
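For concreteness, here is a minimal sketch of the sampling rule in the question, in Python. The names `hash64`, `in_sample`, `MASK_BITS` and the synthetic `record-N` items are made up for illustration, and SHA-256 truncated to 64 bits is only a stand-in for whatever hash ends up being used. Note the explicit parentheses: in C-like languages `hash & bitmask < value` parses as `hash & (bitmask < value)`, which is not what is meant.

```python
import hashlib

MASK_BITS = 16                      # assumption: sample on the low 16 bits
BITMASK = (1 << MASK_BITS) - 1

def hash64(item: bytes) -> int:
    """Stand-in 64-bit hash: the first 8 bytes of SHA-256."""
    digest = hashlib.sha256(item).digest()
    return int.from_bytes(digest[:8], "little")

def in_sample(item: bytes, rate: float) -> bool:
    """Keep an item when its masked hash falls below the threshold."""
    threshold = int(rate * (BITMASK + 1))
    return (hash64(item) & BITMASK) < threshold

# Synthetic items, purely for illustration.
items = [f"record-{i}".encode() for i in range(100_000)]
kept = sum(in_sample(x, 0.10) for x in items)
print(f"kept {kept} of {len(items)} ({kept / len(items):.4f})")
```

The decision depends only on the item itself, so the same item is always in or out of the sample. Whether the kept fraction matches the requested rate depends on how uniform the masked bits are, which is exactly the property being asked about.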
Ah, uniformity is a harsher constraint than limiting P(collision), so not many hashes will be optimized for it. What kind of data (and how much information per hash) are we talking about?
(plus, if in doubt, assuming you ask for non-cryptographic hashes for efficiency reasons: can't you test it on a representative corpus of data? SHA-256/512 would probably be fine. Chances are you have even better-optimized DEFLATE, so maybe use that to decrease computation)
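One way the "test it on a representative corpus" suggestion could look, as a rough Python sketch: bucket the masked hash values and compute a chi-square statistic against a uniform expectation. Everything here is an assumption for illustration: the synthetic e-mail-like corpus stands in for real data, `zlib.crc32` stands in for a non-cryptographic candidate (CityHash64 is not in the standard library), and the helper names are invented.

```python
import hashlib
import zlib
from collections import Counter

def low_bits_sha256(item: bytes, bits: int) -> int:
    """Low `bits` bits of SHA-256 truncated to 64 bits (reference hash)."""
    h = int.from_bytes(hashlib.sha256(item).digest()[:8], "little")
    return h & ((1 << bits) - 1)

def low_bits_crc32(item: bytes, bits: int) -> int:
    """Low `bits` bits of CRC-32 (stand-in non-cryptographic candidate)."""
    return zlib.crc32(item) & ((1 << bits) - 1)

def chi_square(corpus, bucket_fn, bits: int = 8) -> float:
    """Chi-square statistic of bucket counts against a uniform distribution."""
    n_buckets = 1 << bits
    counts = Counter(bucket_fn(x, bits) for x in corpus)
    expected = len(corpus) / n_buckets
    return sum((counts.get(b, 0) - expected) ** 2 / expected
               for b in range(n_buckets))

# Synthetic corpus; replace with a sample of the real data.
corpus = [f"user-{i}@example.com".encode() for i in range(200_000)]
for name, fn in [("sha256", low_bits_sha256), ("crc32", low_bits_crc32)]:
    print(name, round(chi_square(corpus, fn), 1))
```

With 256 buckets there are 255 degrees of freedom, so a uniform hash should give a statistic around 255, with a standard deviation of roughly 23. Values far above that suggest clumping; values far below suggest the output is more regular than random, which can also matter for sampling structured keys.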
SipHash is cryptographically secure as long as you truly only need 64-bit output, which is generally the case for these use cases. It has a variant with 128-bit output too. There's no point in using SHA-2 for these use cases unless you have ridiculously good CPU acceleration for it.


