optimize _Popcount_fallback #2079
Conversation
Won't that hurt ARM?
Co-authored-by: Alex Guteniev <[email protected]>
Actually, #1924 is the reason why I looked at _Popcount_fallback. I wanted to do such an optimization, but I didn't find what I needed in https://docs.microsoft.com/en-us/cpp/intrinsics/arm64-intrinsics?view=msvc-160
You are looking at documentation. Look at the source instead; not all of them are documented.
I would like to see benchmark results before making this change. It's non-obvious to me that one multiply and one shift are faster than three shifts and three additions (which is what unrolling this loop for 64-bit arguments yields). IIRC this fallback implementation is used only on ARM and ARM64 (and ancient/irrelevant x86), so ARM and ARM64 performance is the only performance that matters.
No, it's eight of both, and it does not unroll.
This loop:

```cpp
for (int _Shift_digits = 8; _Shift_digits < _Digits; _Shift_digits <<= 1) {
    _Val = static_cast<_Ty>(_Val + static_cast<_Ty>(_Val >> _Shift_digits));
}
```

has one shift and one add on the single line of code in the loop body.
Do the STL maintainers have a Windows ARM64 laptop? The only ARM device I have is an Orange Pi (Cortex-A53) running Linux, and I'm not sure Orange Pi results would reflect reality...
Approved in the absence of profiling results. Rationale:
If other maintainers want to see profiling results, I'm totally fine with that. 😸
I'm mirroring this to an MSVC-internal PR. Changes can still be pushed during final review, but please notify me if that happens.
Thanks for improving/simplifying this fallback implementation! ✔️ 😺 🚀
What I did:
https://quick-bench.com/q/e4jOb3zpeqd7uk-6hrRApCktuGA
I looked at the assembly: https://gcc.godbolt.org/z/edhP9oY7q
and I rewrote it in C++:
STL/stl/inc/limits, lines 1043 to 1058 in 9a9820d

The assembly for `unsigned char` and `unsigned short` is different (they don't do a `static_cast` on the bit patterns), but the result is the same. (The codegen is also better than the clang libc++ version for `unsigned char`, because `imul` and `shr` were eliminated: https://godbolt.org/z/93z9zqz93, and the perf is the same for `unsigned short`.)

https://quick-bench.com/q/6gyvnQibtHGDunNHM5q350YzF14
https://quick-bench.com/q/GLcsXB1abxHSjtofp1NaUNcT_Hk
diff: https://godbolt.org/z/MfYzfov3M
tests: https://gcc.godbolt.org/z/qse6Gfdjr
See @AlexGuteniev's explanation for more: #2079 (comment)
Brute-force testing (for uint8_t and uint16_t) and identical assembly (for uint32_t and uint64_t) give me confidence that I didn't break anything here. (I hope)