
Conversation

@fsb4000 (Contributor) commented Jul 25, 2021

What I did:

  1. I found that the current MS STL implementation is slower than clang's libc++:

https://quick-bench.com/q/e4jOb3zpeqd7uk-6hrRApCktuGA

  2. I looked at the assembly: https://gcc.godbolt.org/z/edhP9oY7q

  3. I rewrote it in C++:

#include <limits>

// Direct C++ transcription of the assembly; the variable names
// (rdi, rax, rcx) mirror the registers.
template<typename T>
int xxx(T x) {
    constexpr int _Digits = std::numeric_limits<T>::digits;
    // pairwise bit counts (2-bit groups)
    T rdi = x;
    T rax = x;
    rax = rax >> 1;
    T rcx = static_cast<T>(0x5555'5555'5555'5555ull);
    rcx = rcx & rax;
    rdi = rdi - rcx;
    // 2-bit group counts -> 4-bit group counts
    rax = static_cast<T>(0x3333'3333'3333'3333ull);
    rcx = rdi;
    rcx = rcx & rax;
    rdi = rdi >> 2;
    rdi = rdi & rax;
    rdi = rdi + rcx;
    // 4-bit group counts -> per-byte counts
    rax = rdi;
    rax = rax >> 4;
    rax = rax + rdi;
    rcx = static_cast<T>(0x0F0F'0F0F'0F0F'0F0Full);
    rcx = rcx & rax;
    // one multiply + one shift sum the byte counts into the top byte
    rax = static_cast<T>(0x0101'0101'0101'0101ull);
    rax = rax * rcx;
    rax = rax >> (_Digits - 8);
    return rax;
}

template int xxx<unsigned long long>(unsigned long long);
template int xxx<unsigned>(unsigned);
template int xxx<unsigned short>(unsigned short);
template int xxx<unsigned char>(unsigned char);
  4. I checked the current implementation of _Popcount_fallback and realized that it is almost the same; only two lines are different.

STL/stl/inc/limits, lines 1043 to 1058 at commit 9a9820d:

// Implementation of popcount without using specialized CPU instructions.
// Used at compile time and when said instructions are not supported.
template <class _Ty>
_NODISCARD constexpr int _Popcount_fallback(_Ty _Val) noexcept {
    constexpr int _Digits = numeric_limits<_Ty>::digits;
    // we static_cast these bit patterns in order to truncate them to the correct size
    _Val = static_cast<_Ty>(_Val - ((_Val >> 1) & static_cast<_Ty>(0x5555'5555'5555'5555ull)));
    _Val = static_cast<_Ty>((_Val & static_cast<_Ty>(0x3333'3333'3333'3333ull))
                            + ((_Val >> 2) & static_cast<_Ty>(0x3333'3333'3333'3333ull)));
    _Val = static_cast<_Ty>((_Val + (_Val >> 4)) & static_cast<_Ty>(0x0F0F'0F0F'0F0F'0F0Full));
    for (int _Shift_digits = 8; _Shift_digits < _Digits; _Shift_digits <<= 1) {
        _Val = static_cast<_Ty>(_Val + static_cast<_Ty>(_Val >> _Shift_digits));
    }
    // we want the bottom "slot" that's big enough to store _Digits
    return static_cast<int>(_Val & static_cast<_Ty>(_Digits + _Digits - 1));
}

  5. The unsigned char and unsigned short assembly is different (they don't static_cast the bit patterns), but the result is the same.

(And the codegen is better than the clang libc++ version for unsigned char, because the imul and shr were eliminated: https://godbolt.org/z/93z9zqz93 ; for unsigned char the multiplier truncates to 0x01 and the final shift amount is _Digits - 8 = 0, so both instructions can be folded away. Perf is the same for unsigned short.)

https://quick-bench.com/q/6gyvnQibtHGDunNHM5q350YzF14

https://quick-bench.com/q/GLcsXB1abxHSjtofp1NaUNcT_Hk

diff: https://godbolt.org/z/MfYzfov3M

tests: https://gcc.godbolt.org/z/qse6Gfdjr

See @AlexGuteniev's explanation for more: #2079 (comment)
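For intuition about what the two changed lines do (an editorial sketch, not part of the PR): after the nibble step, each byte of the value holds that byte's popcount (0 to 8). Multiplying by 0x0101'0101'0101'0101 adds byte i into every byte j >= i, so the most significant byte ends up holding the sum of all byte counts, and the final shift by _Digits - 8 extracts it. A minimal illustration for uint64_t (horizontal_byte_sum is a hypothetical helper, not STL code):

#include <cstdint>

// Precondition: every byte of v holds a per-byte count in [0, 8].
constexpr int horizontal_byte_sum(std::uint64_t v) {
    // No carries can occur: the maximum possible total (64) fits in one byte.
    return static_cast<int>((v * 0x0101'0101'0101'0101ull) >> 56);
}

static_assert(horizontal_byte_sum(0x0101'0101'0101'0101ull) == 8);
static_assert(horizontal_byte_sum(0x0800'0000'0000'0003ull) == 11);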

Brute-force testing (for uint8_t and uint16_t) and identical assembly (for uint32_t and uint64_t) give me confidence that I didn't break anything here. (I hope!)
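As an illustration of that brute-force check (an editorial sketch, not the PR's actual test code), one can compare the rewritten xxx against a naive bit-by-bit count over the full uint8_t and uint16_t ranges:

#include <cassert>

// Naive reference: count set bits one at a time.
template <typename T>
int naive_popcount(T x) {
    int n = 0;
    while (x != 0) {
        n += static_cast<int>(x & 1);
        x = static_cast<T>(x >> 1);
    }
    return n;
}

int main() {
    // Exhaustive comparison against xxx from the snippet above.
    for (unsigned v = 0; v <= 0xFF; ++v) {
        assert(xxx<unsigned char>(static_cast<unsigned char>(v)) == naive_popcount(v));
    }
    for (unsigned v = 0; v <= 0xFFFF; ++v) {
        assert(xxx<unsigned short>(static_cast<unsigned short>(v)) == naive_popcount(v));
    }
}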

@fsb4000 fsb4000 requested a review from a team as a code owner July 25, 2021 22:31
@fsb4000 fsb4000 force-pushed the popcount_fallback branch from 22e5f41 to 1d7bd54 on July 25, 2021 22:45
@AlexGuteniev (Contributor)

Won't that hurt ARM? It could be faster for ARM too, but it might not be.

fsb4000 and others added 2 commits July 26, 2021 17:56
@fsb4000 (Contributor, Author) commented Jul 26, 2021

> Won't that hurt ARM? It could be faster for ARM too, but it might not be.

@AlexGuteniev

Actually, #1924 is the reason why I looked at _Popcount_fallback. I wanted to do that optimization, but I didn't find a cnt intrinsic :(. Then, out of curiosity, I took a look at what clang does with std::popcount. The rest is in the first comment. I cannot test ARM, but I hope that one multiplication and one shift should be faster than the loop. If anyone has an ARM machine, it would be great to see some benchmark results.

https://docs.microsoft.com/en-us/cpp/intrinsics/arm64-intrinsics?view=msvc-160

@AlexGuteniev (Contributor)

> https://docs.microsoft.com/en-us/cpp/intrinsics/arm64-intrinsics?view=msvc-160

You are looking at the documentation. Look at the sources instead; not all of the intrinsics are documented.

@CaseyCarter (Contributor)

> I cannot test ARM, but I hope that one multiplication and one shift should be faster than the loop. If anyone has an ARM machine, it would be great to see some benchmark results.

I would like to see benchmark results before making this change. It's non-obvious to me that one multiply and one shift are faster than three shifts and three additions (which result from unrolling this loop for 64-bit arguments). IIRC this fallback implementation is used only on ARM and ARM64 - and ancient/irrelevant x86 - so ARM and ARM64 performance is the only performance that matters.

@CaseyCarter added the performance (Must go faster) label on Jul 26, 2021
@AlexGuteniev (Contributor)

> three shifts and three additions (which result from unrolling this loop for 64-bit arguments).

No, eight of both, and it does not unroll.

@CaseyCarter (Contributor) commented Jul 26, 2021

>> three shifts and three additions (which result from unrolling this loop for 64-bit arguments).

> No, eight of both, and it does not unroll.

This loop:

for (int _Shift_digits = 8; _Shift_digits < _Digits; _Shift_digits <<= 1) {
    _Val = static_cast<_Ty>(_Val + static_cast<_Ty>(_Val >> _Shift_digits));
}

has one shift and one add on the single line of code in the loop body. _Shift_digits is initially 8; _Digits is 8/16/32/64 depending on the source type (remember, it's numeric_limits<T>::digits, not numeric_limits<T>::digits10), so 64 for uint64_t as in the benchmark. On each loop iteration, _Shift_digits is doubled, so it takes on the values in {8, 16, 32} (in order) that are less than _Digits. Again for uint64_t, that means the loop iterates three times. Three times (1 ADD + 1 SHIFT) = 3 ADD + 3 SHIFT.
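To make the iteration count concrete, here is a small compile-time sketch (editorial; trips is a hypothetical helper, not STL code) that counts the loop trips per bit width:

// How many times the reduction loop in _Popcount_fallback runs for a
// type with the given number of digits.
template <int Digits>
constexpr int trips() {
    int n = 0;
    for (int shift = 8; shift < Digits; shift <<= 1) {
        ++n;
    }
    return n;
}

static_assert(trips<8>() == 0);   // unsigned char: the loop body never runs
static_assert(trips<16>() == 1);  // unsigned short: one shift-and-add (by 8)
static_assert(trips<32>() == 2);  // unsigned: shifts by 8 and 16
static_assert(trips<64>() == 3);  // unsigned long long: shifts by 8, 16, 32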

@fsb4000 (Contributor, Author) commented Jul 26, 2021

@CaseyCarter

Do the STL maintainers have a Windows ARM64 laptop?

The only ARM device I have is an Orange Pi (Cortex-A53) running Linux.

I'm not sure Orange Pi results would reflect reality...

@CaseyCarter (Contributor)

> Do the STL maintainers have a Windows ARM64 laptop?

@cbezault @AnjuDel, does one of you have possession of the ARM64 test box to run the benchmark in the OP?

@AlexGuteniev (Contributor)

> Actually, #1924 is the reason why I looked at _Popcount_fallback. I wanted to do that optimization, but I didn't find a cnt intrinsic :(.

@fsb4000, the intrinsic is available; I commented in #1924.

@StephanTLavavej (Member)

Approved in the absence of profiling results. Rationale:

  • For ARM32, according to my understanding, micro-optimizing performance is not a priority for us at this time.
  • For ARM64, #2127 ("<bit>: popcount() utilizes cnt instruction on arm64") will unconditionally use intrinsics.
  • For ARM64EC, if we want to micro-optimize performance in a followup PR, we should investigate whether x64 or ARM64 intrinsics can be activated. If neither can, then we can profile the original implementation versus this multiplication. I am not concerned about anchoring to the status quo here - it's not like we had ARM64EC profiling results when we shipped the original implementation.
  • For constexpr evaluation, I am reasonably confident that the multiplication technique results in fewer steps and therefore greater throughput - certainly no concerns about regressions. (I'm not sure if there's a way to get detailed stats from the compiler through undocumented options.)

If other maintainers want to see profiling results, I'm totally fine with that. 😸
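Regarding the constexpr bullet above, a small compile-time usage illustration (editorial, not from the PR); in a constant expression std::popcount cannot use the CPU instructions, so _Popcount_fallback is what runs:

#include <bit>

static_assert(std::popcount(0xF0u) == 4);
static_assert(std::popcount(0xFFFF'FFFF'FFFF'FFFFull) == 64);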

@StephanTLavavej (Member)

I'm mirroring this to an MSVC-internal PR. Changes can still be pushed during final review, but please notify me if that happens.

@StephanTLavavej StephanTLavavej merged commit c261323 into microsoft:main Aug 27, 2021
@StephanTLavavej (Member)

Thanks for improving/simplifying this fallback implementation! ✔️ 😺 🚀

@fsb4000 fsb4000 deleted the popcount_fallback branch August 27, 2021 12:49