-
-
Notifications
You must be signed in to change notification settings - Fork 3k
Add rapidhash #22085
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add rapidhash #22085
Conversation
Mind posting the benchmarks? Both for small keys and larger ones, I'm curious how it compares. |
Here's the results of the builtin benchmark. My use case is 32-64 byte keys, so these are nice results. I'll look into getting a comparison against the full C version. Thought that will be more of a test of the compiler difference wouldn't it? I imagine the C version compiled with zig cc or any llvm based compiler (I'm on Mac), would spit out largely similar results.
|
@Rexicon226 Here's the output from rapidhash and zig-rapidhash C rapidhash
Zig rapidhash
|
Two things I'm noticing, the verification fails at the top. I'm looking into that, not sure what that's being based on exactly. Maybe it's the seed? Although it seems like the Zig version is winning in the 4-16 byte range, keeping on par with the 1-3 byte range, with the expected jump at 17+. |
Interesting, thanks for running these. The verification must succeed; otherwise, something is wrong with the implementation. The seed provided by smhasher is 0. You can find the logic behind the verification pass here: Regarding the benchmarks themselves, I'm happy as long as they're roughly the same cycles per hash. Ping me when you figure out the issue, and I'll happily give this a code review. |
@Rexicon226 Fixed it! My inner loops were reading wrong indexes. Latest commit.
|
Here's the full output of running all the tests. https://gist.github.com/Cluster444/57f403fb7121b601295d5aeb049f702f |
Can you share how you ran this test using smhasher? |
See my first comment where I show how I did it for XXHash3. I assume it was a similar process. |
Yup, I the same zig build and link direct into smhasher. |
@Rexicon226 Inlined into file level struct. That was a little weird on Mac because apparently it is case insensitive. Kind of surprised, a git mv --force was needed to get it to actually rename. Let me know if there's anything else! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry about the delay, I've been quite busy.
I gave a couple of comments for you to work on :)
const e = len - 4; | ||
a = (r32(k) << 32) | r32(k[e..]); | ||
b = ((r32(k[d..]) << 32) | r32(k[(e - d)..])); | ||
} else if (len > 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you put this if
case into brackets, I think it'd be more readable that way.
return mix(a ^ sc[0] ^ len, b ^ sc[1]); | ||
} | ||
|
||
test "RapidHash.hash" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move the test case below implementation-related functions.
for (sizes, outcomes) |s, e| { | ||
const r = hash(RAPID_SEED, bytes[0..s]); | ||
|
||
expectEqual(e, r) catch |err| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a nit, you can use expectEqualSlice
here instead.
var a: u64 = 0; | ||
var b: u64 = 0; | ||
var k = input; | ||
var is: [3]u64 = .{ seed, 0, 0 }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you be able to name these variables something more descriptive? The code is a bit hard to read right now since it's difficult to understand what each variable is representing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ya for sure, that was mostly to ease writing it since it was easier for me to read, I'll do some rename swaps on those.
|
||
if (len <= 16) { | ||
if (len >= 4) { | ||
const d: u64 = ((len & 24) >> @intCast(len >> 3)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extra parenthesis here and in a couple of places below.
Fixes #2497 While investigating the issue I added an integrity check that found a problem hiding in the insert logic that was unrelated- in fixing that I greatly simplified the insert logic. It turns out that #2497 is ultimately just a case of bad luck, pathological inputs that result in very non-uniform hashes so the clustering overwhelms things. The solution was just to add a check and claim we're out of memory. I tried adding an entropy folding function to fix the hash a little but it had a measurable negative impact on performance and isn't necessary so I've not included it here. Currently there's an open PR to Zig to [add RapidHash](ziglang/zig#22085), which is the successor to Wyhash and apparently has much better statistical characteristics on top of being faster. I imagine it will land in time for 0.14 so whenever we update to 0.14 we should probably switch our standard hash function to RapidHash, which I imagine should yield improvements across the board. Using the AutoHasher may also be not the best idea, I may explore ways to improve how we generate our style hashes in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given it outperforms wyhash this PR should probably also switch the default hasher used by auto-hashing containers.
Is Swiss Tables related? https://x.com/petermattis/status/1889080982273163466 |
Thanks! Follow-up PR welcome to switch std lib and/or compiler from Wyhash, provided that it comes with perf data points. |
This is the a zig implementation of the rapidhash successor to wyhash.
It's an improvement on small key performance and quality. Especially now that wyhash is no longer being maintained.