Use `string::resize_and_overwrite` in `bitset` to avoid buffer initialization #3904

StephanTLavavej · 2023-07-25T19:13:37Z

No change requested: I observed that we could ignore _Len and take advantage of our knowledge that it will always be the compile-time constant _Bits, which is available without capturing. I checked the codegen, and this indeed allows the compiler to bake in the constant. However, while benchmarking showed some improvement for BM_bitset_to_string<64, char>, it showed significant pessimization for BM_bitset_to_string<15, char> .

BEFORE USING _Bits ------------------------------------------------------------------------------------ Benchmark Time CPU Iterations ------------------------------------------------------------------------------------ BM_bitset_to_string<15, char> 140 ns 144 ns 4977778 BM_bitset_to_string<64, char> 3336 ns 3348 ns 224000 BM_bitset_to_string_large_single<char> 2008 ns 2040 ns 344615 BM_bitset_to_string<7, wchar_t> 88.3 ns 87.9 ns 7466667 BM_bitset_to_string<64, wchar_t> 2211 ns 2197 ns 298667 BM_bitset_to_string_large_single<wchar_t> 1451 ns 1430 ns 448000 AFTER USING _Bits ------------------------------------------------------------------------------------ Benchmark Time CPU Iterations ------------------------------------------------------------------------------------ BM_bitset_to_string<15, char> 640 ns 642 ns 1120000 BM_bitset_to_string<64, char> 2646 ns 2609 ns 263529 BM_bitset_to_string_large_single<char> 2010 ns 2040 ns 344615 BM_bitset_to_string<7, wchar_t> 82.3 ns 82.0 ns 8960000 BM_bitset_to_string<64, wchar_t> 2407 ns 2441 ns 320000 BM_bitset_to_string_large_single<wchar_t> 1372 ns 1360 ns 448000

I think despite some pessimization of the certain case, it is still the right thing to do

Oh, actually the pessimization is too much. I think, angry loop unrolling strikes here too.

Let's keep it as a parameter for now then

As _Bits is the natural decision but we have to use _Len for performance reasons, I think we need to add a comment for it.

Hmm, how to compare benchmark results before and after change in the same run in a convenient way? I'm doing this in a very awkward way.

Benchmark

#include <array> #include <benchmark/benchmark.h> #include <bit> #include <bitset> #include <climits> #include <cstddef> #include <cstdint> #include <random> #include <iostream> using namespace std; struct impl_old { template <class _Elem = char, size_t _Bits, class _Tr = char_traits<_Elem>, class _Alloc = allocator<_Elem>> static basic_string<_Elem, _Tr, _Alloc> to_string(const std::bitset<_Bits>& bs, const _Elem _Elem0 = static_cast<_Elem>('0'), const _Elem _Elem1 = static_cast<_Elem>('1')) { basic_string<_Elem, _Tr, _Alloc> _Str(_Bits, _Elem0); for (size_t _Pos = 0; _Pos < _Bits; ++_Pos) { if (bs[_Bits - 1 - _Pos]) { _Str[_Pos] = _Elem1; } } return _Str; } }; struct impl_new_Bits { template <class _Elem = char, size_t _Bits, class _Tr = char_traits<_Elem>, class _Alloc = allocator<_Elem>> static basic_string<_Elem, _Tr, _Alloc> to_string(const std::bitset<_Bits>& bs, const _Elem _Elem0 = static_cast<_Elem>('0'), const _Elem _Elem1 = static_cast<_Elem>('1')) { basic_string<_Elem, _Tr, _Alloc> _Str; _Str.resize_and_overwrite(_Bits, [&bs, _Elem0, _Elem1](_Elem* _Buf, size_t) { for (size_t _Pos = 0; _Pos < _Bits; ++_Pos) { _Buf[_Pos] = bs[_Bits - 1 - _Pos] ? _Elem1 : _Elem0; } return _Bits; }); return _Str; } }; struct impl_new_Len { template <class _Elem = char, size_t _Bits, class _Tr = char_traits<_Elem>, class _Alloc = allocator<_Elem>> static basic_string<_Elem, _Tr, _Alloc> to_string(const std::bitset<_Bits>& bs, const _Elem _Elem0 = static_cast<_Elem>('0'), const _Elem _Elem1 = static_cast<_Elem>('1')) { basic_string<_Elem, _Tr, _Alloc> _Str; _Str.resize_and_overwrite(_Bits, [&bs, _Elem0, _Elem1](_Elem* _Buf, size_t _Len) { for (size_t _Pos = 0; _Pos < _Len; ++_Pos) { _Buf[_Pos] = bs[_Len - 1 - _Pos] ? _Elem1 : _Elem0; } return _Len; }); return _Str; } }; namespace { const auto random_bits = [] { mt19937_64 rnd{}; array<uint64_t, 32> arr; for (auto& d : arr) { d = rnd(); } return arr; }(); template <class op, size_t N, class charT> void BM_bitset_to_string(benchmark::State& state) { for (auto _ : state) { for (const auto& bits : random_bits) { bitset<N> bs{ bits }; benchmark::DoNotOptimize(op::template to_string<charT>(bs)); } } } template <class op, class charT> void BM_bitset_to_string_large_single(benchmark::State& state) { const auto large_bitset = bit_cast<bitset<CHAR_BIT * sizeof(random_bits)>>(random_bits); for (auto _ : state) { benchmark::DoNotOptimize(op::template to_string<charT>(large_bitset)); } } } // namespace #define MAKE_BENCHMARKS(impl)\ BENCHMARK(BM_bitset_to_string<impl, 15, char>)->Name("\nBM_bitset_to_string<" #impl ", 15, char>");\ BENCHMARK(BM_bitset_to_string<impl, 64, char>);\ BENCHMARK(BM_bitset_to_string_large_single<impl, char>);\ BENCHMARK(BM_bitset_to_string<impl, 7, wchar_t>);\ BENCHMARK(BM_bitset_to_string<impl, 64, wchar_t>);\ BENCHMARK(BM_bitset_to_string_large_single<impl, wchar_t>); MAKE_BENCHMARKS(impl_old); MAKE_BENCHMARKS(impl_new_Bits); MAKE_BENCHMARKS(impl_new_Len); BENCHMARK_MAIN();

(And I'm feeling nervious that the effect is very complex and can even vary between different runs)

Results

--------------------------------------------------------------------------------------------------- Benchmark Time CPU Iterations --------------------------------------------------------------------------------------------------- BM_bitset_to_string<impl_old, 15, char> 1562 ns 1535 ns 448000 BM_bitset_to_string<impl_old, 64, char> 11188 ns 11230 ns 64000 BM_bitset_to_string_large_single<impl_old, char> 9014 ns 9050 ns 127758 BM_bitset_to_string<impl_old, 7, wchar_t> 915 ns 907 ns 896000 BM_bitset_to_string<impl_old, 64, wchar_t> 6158 ns 6166 ns 83627 BM_bitset_to_string_large_single<impl_old, wchar_t> 5732 ns 5755 ns 149333 BM_bitset_to_string<impl_new_Bits, 15, char> 898 ns 914 ns 768981 BM_bitset_to_string<impl_new_Bits, 64, char> 8399 ns 8438 ns 100000 BM_bitset_to_string_large_single<impl_new_Bits, char> 3976 ns 3962 ns 228761 BM_bitset_to_string<impl_new_Bits, 7, wchar_t> 198 ns 197 ns 3733333 BM_bitset_to_string<impl_new_Bits, 64, wchar_t> 11847 ns 11858 ns 44800 BM_bitset_to_string_large_single<impl_new_Bits, wchar_t> 7289 ns 7357 ns 148670 BM_bitset_to_string<impl_new_Len, 15, char> 550 ns 562 ns 1000000 BM_bitset_to_string<impl_new_Len, 64, char> 6000 ns 6099 ns 128085 BM_bitset_to_string_large_single<impl_new_Len, char> 3976 ns 4004 ns 320000 BM_bitset_to_string<impl_new_Len, 7, wchar_t> 135 ns 137 ns 5120000 BM_bitset_to_string<impl_new_Len, 64, wchar_t> 12374 ns 12556 ns 44800 BM_bitset_to_string_large_single<impl_new_Len, wchar_t> 15409 ns 15346 ns 56000

Another run

--------------------------------------------------------------------------------------------------- Benchmark Time CPU Iterations --------------------------------------------------------------------------------------------------- BM_bitset_to_string<impl_old, 15, char> 1731 ns 1726 ns 298667 BM_bitset_to_string<impl_old, 64, char> 7147 ns 7150 ns 89600 BM_bitset_to_string_large_single<impl_old, char> 4201 ns 4148 ns 139378 BM_bitset_to_string<impl_old, 7, wchar_t> 734 ns 732 ns 746667 BM_bitset_to_string<impl_old, 64, wchar_t> 10492 ns 10463 ns 74667 BM_bitset_to_string_large_single<impl_old, wchar_t> 5644 ns 5620 ns 122334 BM_bitset_to_string<impl_new_Bits, 15, char> 1435 ns 1416 ns 640000 BM_bitset_to_string<impl_new_Bits, 64, char> 5507 ns 5418 ns 115348 BM_bitset_to_string_large_single<impl_new_Bits, char> 3620 ns 3594 ns 247782 BM_bitset_to_string<impl_new_Bits, 7, wchar_t> 188 ns 190 ns 2635294 BM_bitset_to_string<impl_new_Bits, 64, wchar_t> 8363 ns 8231 ns 112000 BM_bitset_to_string_large_single<impl_new_Bits, wchar_t> 5042 ns 4933 ns 104533 BM_bitset_to_string<impl_new_Len, 15, char> 232 ns 235 ns 2986667 BM_bitset_to_string<impl_new_Len, 64, char> 6508 ns 6562 ns 100000 BM_bitset_to_string_large_single<impl_new_Len, char> 3964 ns 3941 ns 281493 BM_bitset_to_string<impl_new_Len, 7, wchar_t> 174 ns 174 ns 4480000 BM_bitset_to_string<impl_new_Len, 64, wchar_t> 15721 ns 15625 ns 56000 BM_bitset_to_string_large_single<impl_new_Len, wchar_t> 7529 ns 7533 ns 112000

Are there problems with my comparing benchmark? I'm finding resize_and_overwrite&_Len approach stably slow for large wchar_t cases in my testing.

It may be CPU-dependent.
For _Bits the compiler is more likely to advantage of knowing _Bits and unroll loops or vectorize something.
For _Len the compiler may be restricted from doing that.

Whether the optimizations like loop unrolling are useful or harmful -- dependent on CPU type.

Also non-unrolled loop (_Len) is subject to JCC errata pessimization on some CPUs, /QIntel-jcc-erratum might help in this case https://learn.microsoft.com/ru-ru/cpp/build/reference/qintel-jcc-erratum?view=msvc-170

I'm neutral with regards to _Bits vs _Len if none of them are absolutely advantage.

I think we need to test for _Bits vs _Len in more environments to do the decision. I tend to believe that by using _Bits we can get general improvements over previous impl, though not significant in some cases like small bitset.to_string<char>; by using _Len we risk introducing some regressions.

Here is some results after turning on that switch:
.......I find my benchmark result highly unstable between separate runs...

Result

--------------------------------------------------------------------------------------------------- Benchmark Time CPU Iterations --------------------------------------------------------------------------------------------------- BM_bitset_to_string<impl_old, 15, char> 1718 ns 1726 ns 298667 BM_bitset_to_string<impl_old, 64, char> 6191 ns 6147 ns 119467 BM_bitset_to_string_large_single<impl_old, char> 5596 ns 5580 ns 89600 BM_bitset_to_string<impl_old, 7, wchar_t> 842 ns 825 ns 1003520 BM_bitset_to_string<impl_old, 64, wchar_t> 8230 ns 8196 ns 89600 BM_bitset_to_string_large_single<impl_old, wchar_t> 5406 ns 5312 ns 100000 BM_bitset_to_string<impl_new_Bits, 15, char> 993 ns 993 ns 802816 BM_bitset_to_string<impl_new_Bits, 64, char> 7076 ns 7063 ns 139378 BM_bitset_to_string_large_single<impl_new_Bits, char> 4549 ns 4552 ns 281493 BM_bitset_to_string<impl_new_Bits, 7, wchar_t> 132 ns 133 ns 4480000 BM_bitset_to_string<impl_new_Bits, 64, wchar_t> 5765 ns 5755 ns 89600 BM_bitset_to_string_large_single<impl_new_Bits, wchar_t> 7747 ns 7782 ns 136533 BM_bitset_to_string<impl_new_Len, 15, char> 446 ns 445 ns 2986667 BM_bitset_to_string<impl_new_Len, 64, char> 5924 ns 5929 ns 89600 BM_bitset_to_string_large_single<impl_new_Len, char> 2957 ns 2979 ns 320000 BM_bitset_to_string<impl_new_Len, 7, wchar_t> 242 ns 239 ns 3200000 BM_bitset_to_string<impl_new_Len, 64, wchar_t> 11633 ns 11509 ns 44800 BM_bitset_to_string_large_single<impl_new_Len, wchar_t> 5270 ns 5336 ns 149333

Another run

--------------------------------------------------------------------------------------------------- Benchmark Time CPU Iterations --------------------------------------------------------------------------------------------------- BM_bitset_to_string<impl_old, 15, char> 1415 ns 1445 ns 432552 BM_bitset_to_string<impl_old, 64, char> 6740 ns 6627 ns 89600 BM_bitset_to_string_large_single<impl_old, char> 5079 ns 5000 ns 100000 BM_bitset_to_string<impl_old, 7, wchar_t> 632 ns 628 ns 1120000 BM_bitset_to_string<impl_old, 64, wchar_t> 8644 ns 8545 ns 89600 BM_bitset_to_string_large_single<impl_old, wchar_t> 7400 ns 7324 ns 74667 BM_bitset_to_string<impl_new_Bits, 15, char> 1096 ns 1108 ns 719369 BM_bitset_to_string<impl_new_Bits, 64, char> 6058 ns 6104 ns 143360 BM_bitset_to_string_large_single<impl_new_Bits, char> 4695 ns 4708 ns 298667 BM_bitset_to_string<impl_new_Bits, 7, wchar_t> 247 ns 248 ns 5600000 BM_bitset_to_string<impl_new_Bits, 64, wchar_t> 5696 ns 5755 ns 89600 BM_bitset_to_string_large_single<impl_new_Bits, wchar_t> 6840 ns 6875 ns 100000 BM_bitset_to_string<impl_new_Len, 15, char> 231 ns 234 ns 2800000 BM_bitset_to_string<impl_new_Len, 64, char> 5700 ns 5622 ns 144516 BM_bitset_to_string_large_single<impl_new_Len, char> 2904 ns 2916 ns 203636 BM_bitset_to_string<impl_new_Len, 7, wchar_t> 133 ns 132 ns 5439529 BM_bitset_to_string<impl_new_Len, 64, wchar_t> 13153 ns 13253 ns 44800 BM_bitset_to_string_large_single<impl_new_Len, wchar_t> 5030 ns 5017 ns 102784

I suggest don't think too much on _Bits vs _Len. The next step would be vectorization, it would be obviously faster than this, and obviously will use _Bits.

-Original file line number
+Diff line change
@@ Expand Up / @@ -347,13 +347,13 @@ public: @@
         _NODISCARD _CONSTEXPR23 basic_string<_Elem, _Tr, _Alloc> to_string(
             const _Elem _Elem0 = static_cast<_Elem>('0'), const _Elem _Elem1 = static_cast<_Elem>('1')) const {
             // convert bitset to string
-            basic_string<_Elem, _Tr, _Alloc> _Str(_Bits, _Elem0);
-            for (size_t _Pos = 0; _Pos < _Bits; ++_Pos) {
-                if (_Subscript(_Bits - 1 - _Pos)) {
-                    _Str[_Pos] = _Elem1;
+            basic_string<_Elem, _Tr, _Alloc> _Str;
+            _Str._Resize_and_overwrite(_Bits, [this, _Elem0, _Elem1](_Elem* _Buf, size_t _Len) {
+                for (size_t _Pos = 0; _Pos < _Len; ++_Pos) {
+                    _Buf[_Pos] = _Subscript(_Len - 1 - _Pos) ? _Elem1 : _Elem0;
                 }
-            }
+                return _Len;
+            });
             return _Str;
         }
@@ Expand Down @@

-Original file line number
+Diff line change
@@ Expand Up / @@ -4158,11 +4158,16 @@ public: @@
             }
         }
-    #if _HAS_CXX23
     #pragma warning(push)
     #pragma warning(disable : 4018) // '<=': signed/unsigned mismatch (we compare to 0 before _New_size below, so it's safe)
         template <class _Operation>
-        constexpr void resize_and_overwrite(_CRT_GUARDOVERFLOW const size_type _New_size, _Operation _Op) {
+        constexpr void
+    #if _HAS_CXX23
+            resize_and_overwrite
+    #else // ^^^ _HAS_CXX23 / !_HAS_CXX23 vvv
+            _Resize_and_overwrite
+    #endif // ^^^ !_HAS_CXX23 ^^^
+            (_CRT_GUARDOVERFLOW const size_type _New_size, _Operation _Op) {
             if (_Mypair._Myval2._Myres < _New_size) {
                 _Reallocate_grow_by(_New_size - _Mypair._Myval2._Mysize,
                     [](_Elem* const _New_ptr, const _Elem* const _Old_ptr, const size_type _Old_size) {
@@ Expand All / @@ -4180,6 +4185,12 @@ public: @@
             _Eos(static_cast<size_type>(_Result_size));
         }
     #pragma warning(pop)
+    #if _HAS_CXX23
+        template <class _Operation>
+        constexpr void _Resize_and_overwrite(_CRT_GUARDOVERFLOW const size_type _New_size, _Operation _Op) {
+            resize_and_overwrite(_New_size, _Op);
+        }
     #endif // _HAS_CXX23
         _NODISCARD _CONSTEXPR20 size_type capacity() const noexcept {
@@ Expand Down @@

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Use `string::resize_and_overwrite` in `bitset` to avoid buffer initialization #3904

Uh oh!

Diff view

Diff view

There are no files selected for viewing

StephanTLavavej Jul 25, 2023

Uh oh!

AlexGuteniev Jul 25, 2023

Uh oh!

AlexGuteniev Jul 25, 2023

Uh oh!

AlexGuteniev Jul 25, 2023

Uh oh!

achabense Jul 26, 2023

Uh oh!

achabense Jul 26, 2023 •

edited

Loading

Uh oh!

achabense Jul 26, 2023 •

edited

Loading

Uh oh!

AlexGuteniev Jul 26, 2023

Uh oh!

achabense Jul 26, 2023 •

edited

Loading

Uh oh!

AlexGuteniev Jul 28, 2023

Uh oh!

Use string::resize_and_overwrite in bitset to avoid buffer initialization #3904

Uh oh!

Use string::resize_and_overwrite in bitset to avoid buffer initialization #3904

Uh oh!

Uh oh!

Diff view

Diff view

There are no files selected for viewing

StephanTLavavej Jul 25, 2023

Choose a reason for hiding this comment

Uh oh!

AlexGuteniev Jul 25, 2023

Choose a reason for hiding this comment

Uh oh!

AlexGuteniev Jul 25, 2023

Choose a reason for hiding this comment

Uh oh!

AlexGuteniev Jul 25, 2023

Choose a reason for hiding this comment

Uh oh!

achabense Jul 26, 2023

Choose a reason for hiding this comment

Uh oh!

achabense Jul 26, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

achabense Jul 26, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

AlexGuteniev Jul 26, 2023

Choose a reason for hiding this comment

Uh oh!

achabense Jul 26, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

AlexGuteniev Jul 28, 2023

Choose a reason for hiding this comment

Uh oh!

Use `string::resize_and_overwrite` in `bitset` to avoid buffer initialization #3904

Use `string::resize_and_overwrite` in `bitset` to avoid buffer initialization #3904

achabense Jul 26, 2023 •

edited

Loading

achabense Jul 26, 2023 •

edited

Loading

achabense Jul 26, 2023 •

edited

Loading