
Conversation

@fsb4000 (Contributor) commented Jul 25, 2021

What I did:

  1. I found that the current MS STL implementation is slower than clang's libc++:

https://quick-bench.com/q/e4jOb3zpeqd7uk-6hrRApCktuGA

  2. I looked at the assembly: https://gcc.godbolt.org/z/edhP9oY7q

  3. I rewrote it in C++:

#include <limits>

// Direct C++ transcription of the assembly; the variable names
// (rdi, rax, rcx) mirror the registers.
template<typename T>
int xxx(T x) {
    constexpr int _Digits = std::numeric_limits<T>::digits;
    // pairwise bit counts (2-bit groups)
    T rdi = x;
    T rax = x;
    rax = rax >> 1;
    T rcx = static_cast<T>(0x5555'5555'5555'5555ull);
    rcx = rcx & rax;
    rdi = rdi - rcx;
    // 2-bit group counts -> 4-bit group counts
    rax = static_cast<T>(0x3333'3333'3333'3333ull);
    rcx = rdi;
    rcx = rcx & rax;
    rdi = rdi >> 2;
    rdi = rdi & rax;
    rdi = rdi + rcx;
    // 4-bit group counts -> per-byte counts
    rax = rdi;
    rax = rax >> 4;
    rax = rax + rdi;
    rcx = static_cast<T>(0x0F0F'0F0F'0F0F'0F0Full);
    rcx = rcx & rax;
    // one multiply + one shift sum the byte counts into the top byte
    rax = static_cast<T>(0x0101'0101'0101'0101ull);
    rax = rax * rcx;
    rax = rax >> (_Digits - 8);
    return rax;
}

template int xxx<unsigned long long>(unsigned long long);
template int xxx<unsigned>(unsigned);
template int xxx<unsigned short>(unsigned short);
template int xxx<unsigned char>(unsigned char);
  4. I checked the current implementation of _Popcount_fallback and realized that it is almost the same; only two lines are different.

STL/stl/inc/limits, lines 1043 to 1058 at commit 9a9820d:

// Implementation of popcount without using specialized CPU instructions.
// Used at compile time and when said instructions are not supported.
template <class _Ty>
_NODISCARD constexpr int _Popcount_fallback(_Ty _Val) noexcept {
    constexpr int _Digits = numeric_limits<_Ty>::digits;
    // we static_cast these bit patterns in order to truncate them to the correct size
    _Val = static_cast<_Ty>(_Val - ((_Val >> 1) & static_cast<_Ty>(0x5555'5555'5555'5555ull)));
    _Val = static_cast<_Ty>((_Val & static_cast<_Ty>(0x3333'3333'3333'3333ull))
                            + ((_Val >> 2) & static_cast<_Ty>(0x3333'3333'3333'3333ull)));
    _Val = static_cast<_Ty>((_Val + (_Val >> 4)) & static_cast<_Ty>(0x0F0F'0F0F'0F0F'0F0Full));
    for (int _Shift_digits = 8; _Shift_digits < _Digits; _Shift_digits <<= 1) {
        _Val = static_cast<_Ty>(_Val + static_cast<_Ty>(_Val >> _Shift_digits));
    }
    // we want the bottom "slot" that's big enough to store _Digits
    return static_cast<int>(_Val & static_cast<_Ty>(_Digits + _Digits - 1));
}

  5. The unsigned char and unsigned short assembly is different (they don't static_cast the bit patterns), but the result is the same.

(And the codegen is better than the clang libc++ version for unsigned char, because the imul and shr were eliminated: https://godbolt.org/z/93z9zqz93 ; for unsigned char the multiplier truncates to 0x01 and the final shift amount is _Digits - 8 = 0, so both instructions can be folded away. Perf is the same for unsigned short.)

https://quick-bench.com/q/6gyvnQibtHGDunNHM5q350YzF14

https://quick-bench.com/q/GLcsXB1abxHSjtofp1NaUNcT_Hk

diff: https://godbolt.org/z/MfYzfov3M

tests: https://gcc.godbolt.org/z/qse6Gfdjr

See @AlexGuteniev's explanation for more: #2079 (comment)
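For intuition about what the two changed lines do (an editorial sketch, not part of the PR): after the nibble step, each byte of the value holds that byte's popcount (0 to 8). Multiplying by 0x0101'0101'0101'0101 adds byte i into every byte j >= i, so the most significant byte ends up holding the sum of all byte counts, and the final shift by _Digits - 8 extracts it. A minimal illustration for uint64_t (horizontal_byte_sum is a hypothetical helper, not STL code):

#include <cstdint>

// Precondition: every byte of v holds a per-byte count in [0, 8].
constexpr int horizontal_byte_sum(std::uint64_t v) {
    // No carries can occur: the maximum possible total (64) fits in one byte.
    return static_cast<int>((v * 0x0101'0101'0101'0101ull) >> 56);
}

static_assert(horizontal_byte_sum(0x0101'0101'0101'0101ull) == 8);
static_assert(horizontal_byte_sum(0x0800'0000'0000'0003ull) == 11);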

Brute-force testing (for uint8_t and uint16_t) and identical assembly (for uint32_t and uint64_t) give me confidence that I didn't break anything here. (I hope!)
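As an illustration of that brute-force check (an editorial sketch, not the PR's actual test code), one can compare the rewritten xxx against a naive bit-by-bit count over the full uint8_t and uint16_t ranges:

#include <cassert>

// Naive reference: count set bits one at a time.
template <typename T>
int naive_popcount(T x) {
    int n = 0;
    while (x != 0) {
        n += static_cast<int>(x & 1);
        x = static_cast<T>(x >> 1);
    }
    return n;
}

int main() {
    // Exhaustive comparison against xxx from the snippet above.
    for (unsigned v = 0; v <= 0xFF; ++v) {
        assert(xxx<unsigned char>(static_cast<unsigned char>(v)) == naive_popcount(v));
    }
    for (unsigned v = 0; v <= 0xFFFF; ++v) {
        assert(xxx<unsigned short>(static_cast<unsigned short>(v)) == naive_popcount(v));
    }
}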

@fsb4000 fsb4000 requested a review from a team as a code owner July 25, 2021 22:31
@fsb4000 fsb4000 force-pushed the popcount_fallback branch from 22e5f41 to 1d7bd54 on July 25, 2021 22:45
@AlexGuteniev (Contributor)

Won't that hurt ARM? It could be faster for ARM too, but it might not be.

fsb4000 and others added 2 commits July 26, 2021 17:56
@fsb4000 (Contributor, Author) commented Jul 26, 2021

> Won't that hurt ARM? It could be faster for ARM too, but it might not be.

@AlexGuteniev

Actually, #1924 is the reason why I looked at _Popcount_fallback. I wanted to do that optimization, but I didn't find a cnt intrinsic :(. Then, out of curiosity, I took a look at what clang does with std::popcount. The rest is in the first comment. I cannot test ARM, but I hope that one multiplication and one shift should be faster than the loop. If anyone has an ARM machine, it would be great to see some benchmark results.

https://docs.microsoft.com/en-us/cpp/intrinsics/arm64-intrinsics?view=msvc-160

@AlexGuteniev (Contributor)

> https://docs.microsoft.com/en-us/cpp/intrinsics/arm64-intrinsics?view=msvc-160

You are looking at the documentation. Look at the sources instead; not all of the intrinsics are documented.

@CaseyCarter (Contributor)

> I cannot test ARM, but I hope that one multiplication and one shift should be faster than the loop. If anyone has an ARM machine, it would be great to see some benchmark results.

I would like to see benchmark results before making this change. It's non-obvious to me that one multiply and one shift are faster than three shifts and three additions (which result from unrolling this loop for 64-bit arguments). IIRC this fallback implementation is used only on ARM and ARM64 - and ancient/irrelevant x86 - so ARM and ARM64 performance is the only performance that matters.

@CaseyCarter added the performance (Must go faster) label on Jul 26, 2021
@AlexGuteniev (Contributor)

> three shifts and three additions (which result from unrolling this loop for 64-bit arguments).

No, eight of both, and it does not unroll.

@CaseyCarter (Contributor) commented Jul 26, 2021

>> three shifts and three additions (which result from unrolling this loop for 64-bit arguments).

> No, eight of both, and it does not unroll.

This loop:

for (int _Shift_digits = 8; _Shift_digits < _Digits; _Shift_digits <<= 1) {
    _Val = static_cast<_Ty>(_Val + static_cast<_Ty>(_Val >> _Shift_digits));
}

has one shift and one add on the single line of code in the loop body. _Shift_digits is initially 8; _Digits is 8/16/32/64 depending on the source type (remember, it's numeric_limits<T>::digits, not numeric_limits<T>::digits10), so 64 for uint64_t as in the benchmark. On each loop iteration, _Shift_digits is doubled, so it takes on the values in {8, 16, 32} (in order) that are less than _Digits. Again for uint64_t, that means the loop iterates three times. Three times (1 ADD + 1 SHIFT) = 3 ADD + 3 SHIFT.
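To make the iteration count concrete, here is a small compile-time sketch (editorial; trips is a hypothetical helper, not STL code) that counts the loop trips per bit width:

// How many times the reduction loop in _Popcount_fallback runs for a
// type with the given number of digits.
template <int Digits>
constexpr int trips() {
    int n = 0;
    for (int shift = 8; shift < Digits; shift <<= 1) {
        ++n;
    }
    return n;
}

static_assert(trips<8>() == 0);   // unsigned char: the loop body never runs
static_assert(trips<16>() == 1);  // unsigned short: one shift-and-add (by 8)
static_assert(trips<32>() == 2);  // unsigned: shifts by 8 and 16
static_assert(trips<64>() == 3);  // unsigned long long: shifts by 8, 16, 32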

@fsb4000 (Contributor, Author) commented Jul 26, 2021

@CaseyCarter

Do the STL maintainers have a Windows ARM64 laptop?

The only ARM device I have is an Orange Pi (Cortex-A53) running Linux.

I'm not sure Orange Pi results would reflect reality...

@CaseyCarter (Contributor)

> Do the STL maintainers have a Windows ARM64 laptop?

@cbezault @AnjuDel, does one of you have possession of the ARM64 test box to run the benchmark in the OP?

@AlexGuteniev (Contributor)

> Actually, #1924 is the reason why I looked at _Popcount_fallback. I wanted to do that optimization, but I didn't find a cnt intrinsic :(.

@fsb4000, the intrinsic is available; I commented in #1924.

@StephanTLavavej (Member)

Approved in the absence of profiling results. Rationale:

  • For ARM32, according to my understanding, micro-optimizing performance is not a priority for us at this time.
  • For ARM64, #2127 ("<bit>: popcount() utilizes cnt instruction on arm64") will unconditionally use intrinsics.
  • For ARM64EC, if we want to micro-optimize performance in a followup PR, we should investigate whether x64 or ARM64 intrinsics can be activated. If neither can, then we can profile the original implementation versus this multiplication. I am not concerned about anchoring to the status quo here - it's not like we had ARM64EC profiling results when we shipped the original implementation.
  • For constexpr evaluation, I am reasonably confident that the multiplication technique results in fewer steps and therefore greater throughput - certainly no concerns about regressions. (I'm not sure if there's a way to get detailed stats from the compiler through undocumented options.)

If other maintainers want to see profiling results, I'm totally fine with that. 😸
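Regarding the constexpr bullet above, a small compile-time usage illustration (editorial, not from the PR); in a constant expression std::popcount cannot use the CPU instructions, so _Popcount_fallback is what runs:

#include <bit>

static_assert(std::popcount(0xF0u) == 4);
static_assert(std::popcount(0xFFFF'FFFF'FFFF'FFFFull) == 64);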

@StephanTLavavej (Member)

I'm mirroring this to an MSVC-internal PR. Changes can still be pushed during final review, but please notify me if that happens.

@StephanTLavavej StephanTLavavej merged commit c261323 into microsoft:main Aug 27, 2021
@StephanTLavavej (Member)

Thanks for improving/simplifying this fallback implementation! ✔️ 😺 🚀

@fsb4000 fsb4000 deleted the popcount_fallback branch August 27, 2021 12:49