Skip to content

Conversation

pps83
Copy link
Contributor

@pps83 pps83 commented May 23, 2025

provides 1.05 to 1.1 times speed up on x64 targets.

x64 results

Benchmark                               Time 
---------------------------------------------
bm_has_single_bit_if<uint8_t>        13.4 ns 
bm_has_single_bit_if_new<uint8_t>    12.7 ns 
bm_has_single_bit_if<uint16_t>       13.9 ns 
bm_has_single_bit_if_new<uint16_t>   10.8 ns 
bm_has_single_bit_if<uint32_t>       15.9 ns 
bm_has_single_bit_if_new<uint32_t>   12.6 ns 
bm_has_single_bit_if<uint64_t>       14.3 ns 
bm_has_single_bit_if_new<uint64_t>   11.0 ns 
bm_has_single_bit<uint8_t>           6.34 ns 
bm_has_single_bit_new<uint8_t>       5.88 ns 
bm_has_single_bit<uint16_t>          6.19 ns 
bm_has_single_bit_new<uint16_t>      4.52 ns 
bm_has_single_bit<uint32_t>          5.57 ns 
bm_has_single_bit_new<uint32_t>      5.99 ns 
bm_has_single_bit<uint64_t>          6.07 ns 
bm_has_single_bit_new<uint64_t>      5.41 ns 

fixes #5359

@pps83
Copy link
Contributor Author

pps83 commented May 23, 2025

@AlexGuteniev, do I need to update benchmarks, or simply put benchmark numbers as part of the commit message?

@AlexGuteniev
Copy link
Contributor

@AlexGuteniev, do I need to update benchmarks, or simply put benchmark numbers as part of the commit message?

I think the current benchmarks are fine, just post before/after number

@AlexGuteniev
Copy link
Contributor

Ideally, we'd need ARM64 numbers as well. But I think we can do without them.

@AlexGuteniev
Copy link
Contributor

I think the current benchmarks are fine, just post before/after number

You also need to pass somehow /arch:AVX or greater to make it different, perhaps by temporarily alterint CmakeFIle.txt locally, or by any other working method.

@pps83
Copy link
Contributor Author

pps83 commented May 23, 2025

You also need to pass somehow /arch:AVX or greater to make it different, perhaps by temporarily alterint CmakeFIle.txt locally, or by any other working method.

shouldn't unit tests automatically run with/without these switches?

@AlexGuteniev
Copy link
Contributor

shouldn't unit tests automatically run with/without these switches?

Unit tests should.
I mean benchmarks, which currently don't

@AlexGuteniev
Copy link
Contributor

Also please edit the PR description to link it with the issue using one of Github keywords.

@pps83
Copy link
Contributor Author

pps83 commented May 23, 2025

I'm getting conflicting results for the test: popcount on my pc is somehow slower by around 30%
it's a tiger-lake-u core i7-11390H, shouldn't have slow popcount :)

@pps83
Copy link
Contributor Author

pps83 commented May 23, 2025

I'm getting conflicting results for the test: popcount on my pc is somehow slower by around 30% it's a tiger-lake-u core i7-11390H, shouldn't have slow popcount :)

no idea how is that it doesn't get inlined :)

image

@pps83
Copy link
Contributor Author

pps83 commented May 23, 2025

wow, accidentially, i had /Only __inline (/Ob1) option. Apparently, this destroys most of std code that doesn't have inlines. Imo, all these functions should be marked inline.

@pps83
Copy link
Contributor Author

pps83 commented May 23, 2025

popcount is on the right. So, it does improve, but will be 30% slower for projects using /Ob1

image

@AlexGuteniev
Copy link
Contributor

Imo, all these functions should be marked inline.

No; I asked a while ago, see #619

@AlexGuteniev
Copy link
Contributor

provides 5-10% speed up

✋ we don't do that here.

See https://github.com/microsoft/STL/wiki/Benchmarking-the-STL#calculating-speedups

So you probably mean 1.05 to 1.1 times speed up

@pps83
Copy link
Contributor Author

pps83 commented May 23, 2025

So you probably mean 1.05 to 1.1 times speed up

updated. I based it on my previous commit message: #5367

(actual git history doesn't have that anyways)

@AlexGuteniev
Copy link
Contributor

AlexGuteniev commented May 23, 2025

Mixed results for _if version, but consistent improvement of the branchless one

i5-1235U (Alder Lake)

P Cores

Benchmark Before After Speedup
bm_has_single_bit_if<uint8_t> 9.78 ns 9.90 ns 0.99
bm_has_single_bit_if<uint16_t> 12.1 ns 10.1 ns 1.20
bm_has_single_bit_if<uint32_t> 11.4 ns 12.3 ns 0.93
bm_has_single_bit_if<uint64_t> 9.66 ns 12.7 ns 0.76
bm_has_single_bit<uint8_t> 4.44 ns 5.56 ns 0.80
bm_has_single_bit<uint16_t> 4.66 ns 3.59 ns 1.30
bm_has_single_bit<uint32_t> 4.66 ns 3.59 ns 1.30
bm_has_single_bit<uint64_t> 4.90 ns 3.47 ns 1.41

E Cores

Benchmark Before After Speedup
bm_has_single_bit_if<uint8_t> 14.7 ns 14.0 ns 1.05
bm_has_single_bit_if<uint16_t> 16.6 ns 14.7 ns 1.13
bm_has_single_bit_if<uint32_t> 14.7 ns 14.3 ns 1.03
bm_has_single_bit_if<uint64_t> 14.7 ns 14.1 ns 1.04
bm_has_single_bit<uint8_t> 9.09 ns 8.84 ns 1.03
bm_has_single_bit<uint16_t> 9.09 ns 7.24 ns 1.26
bm_has_single_bit<uint32_t> 9.07 ns 7.25 ns 1.25
bm_has_single_bit<uint64_t> 9.08 ns 7.27 ns 1.25

@StephanTLavavej StephanTLavavej self-assigned this May 27, 2025
@StephanTLavavej StephanTLavavej added the performance Must go faster label May 27, 2025
@StephanTLavavej
Copy link
Member

Thanks! 😻 Results on my 5950X, comparing main to this PR with set _CL_=/arch:AVX2 injected non-intrusively:

Benchmark Before After Speedup
bm_has_single_bit_if<uint8_t> 12.9 ns 11.2 ns 1.15
bm_has_single_bit_if<uint16_t> 12.9 ns 11.6 ns 1.11
bm_has_single_bit_if<uint32_t> 13.1 ns 12.9 ns 1.02
bm_has_single_bit_if<uint64_t> 13.3 ns 13.1 ns 1.02
bm_has_single_bit<uint8_t> 5.10 ns 5.30 ns 0.96
bm_has_single_bit<uint16_t> 5.09 ns 3.62 ns 1.41
bm_has_single_bit<uint32_t> 5.10 ns 3.62 ns 1.41
bm_has_single_bit<uint64_t> 5.13 ns 3.61 ns 1.42

I'm not worried about the 0.96, that looks like noise (I reran and it was slightly better), and the 1.41 speedups are very healthy indeed.

@StephanTLavavej StephanTLavavej removed their assignment May 28, 2025
@StephanTLavavej StephanTLavavej moved this from Initial Review to Ready To Merge in STL Code Reviews May 28, 2025
@StephanTLavavej StephanTLavavej moved this from Ready To Merge to Merging in STL Code Reviews May 28, 2025
@StephanTLavavej
Copy link
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@pps83
Copy link
Contributor Author

pps83 commented May 28, 2025

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

no changes planned.

@StephanTLavavej StephanTLavavej merged commit 9ef878d into microsoft:main May 28, 2025
48 checks passed
@github-project-automation github-project-automation bot moved this from Merging to Done in STL Code Reviews May 28, 2025
@StephanTLavavej
Copy link
Member

0️⃣ 1️⃣ 0️⃣

@pps83 pps83 deleted the bit-has_single_bit-faster2 branch May 31, 2025 13:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Must go faster
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

<bit>: Could has_single_bit() be faster?
3 participants