-
Notifications
You must be signed in to change notification settings - Fork 1.6k
Specialize some algorithms for vector<bool> #750
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
f3a1123
to
74b38e4
Compare
So I was wondering. This is a performance improvement. Is there any interest in collecting benchmarks in this repository, e.g. a separate It feels kind of strange that there are none currently |
stl/inc/vector
Outdated
// copy [_First, _Last) to [_Dest, ...) | ||
_Adl_verify_range(_First, _Last); | ||
|
||
// Slow path as _First and _Dest are not aligned |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure copy for 'only aligned vectors' is worth the metaprogramming cost here. The interesting cases to optimize are:
vector -> vector
vector -> any other bool range
any other bool range -> vector
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not really sure whether it is worth it actually. This was more an exercise in bit fiddling for me as I never really did any bit manipulations so far.
I am not an expert but can we do unaligned memcopy? Is that even inside the limits of the language to copy an non byte-aligned range into another non-byte aligned range?
How would that work with strict aliasing. Would we need to cast to void*
rather than char
to get an unaligned pointer?
I guess one could try some tricks to expand / collaps the individual bits, but that would require SSE instructions and each direction would need its own special case.
That said, in my experience vector is rather less used and I would be curious wether there are actually use cases with different ranges
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is that even inside the limits of the language to copy an non byte-aligned range into another non-byte aligned range?
As in 'through memcpy', no. As in 'better than extracting bit by bit', yes.
How would that work with strict aliasing
Access as an unsigned int, no need for reinterpret here.
I guess one could try some tricks to expand / collaps the individual bits, but that would require SSE instructions
I mean something like this pseudocode:
const unsigned int* data();
int main() {
const unsigned int* first = data();
size_t _Offset = 1234;
constexpr auto _Bits_per = (8*sizeof(unsigned int)); // 8 == char_bit
first += _Offset/_Bits_per;
_Offset = _Offset % _Bits_per;
unsigned int current = (first[0] >> _Offset) | first[1] << (_Bits_per - _Offset); // read 32 bools
}
If the target is another vector we can do a similar alignment fixup without unpacking current
, if the target is a bool buffer that expansion would indeed be faster with AVX (since 32 bools * 8 bits == 256 bits, the AVX register size) but the same shift and mask thing can be done on a per size_t basis or similar. It won't be as fast as AVX but it will still be faster an unpacking and repacking on a bit by bit basis like before.
To be clear, I'm not saying you need to implement optimizations like that, but I am saying I believe the value of the optimization for its metaprogramming cost is questionable when the result is that copy(v.begin(), v.end(), v2.begin())
is 10x faster than copy(v.begin(), v.end(), v2.begin() + 1)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As in 'through memcpy', no. As in 'better than extracting bit by bit', yes.
Is it too early for some warhammer references?
_INLINE_VAR constexpr bool _Is_vb_iterator<_Vb_iterator<_Alloc>> = true; | ||
|
||
template <class _InIt, class _OutIt> | ||
_CONSTEXPR20 _OutIt _Copy_vbool(_InIt _First, _InIt _Last, _OutIt _Dest) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since _InIt and _OutIt are vector iterator do we need this to be a template?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes as we always need to consider allocators :(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could still be templated on allocator instead then to make it clearer that the types accepted must be vector iterators but I agree there's less value there :(
} | ||
|
||
template <class _FwdIt, class _Ty> | ||
_CONSTEXPR20 void _Fill_vbool(_FwdIt _First, _FwdIt _Last, const _Ty& _Val) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto does this need to be a template
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto because of template class Alloc> vb_iterator
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto :(
// compare [_First1, _Last1) to [_First2, ...) | ||
_Adl_verify_range(_First1, _Last1); | ||
|
||
// Slow path as _First1 and _First2 are not aligned |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto not sure this is worth the cost if unaligned does not work.
stl/inc/vector
Outdated
template <class _InIt, class _OutIt> | ||
_CONSTEXPR20 _OutIt _Copy_vbool(_InIt _First, _InIt _Last, _OutIt _Dest) { | ||
// copy [_First, _Last) to [_Dest, ...) | ||
_Adl_verify_range(_First, _Last); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note to myself I took care to always put the specialization after the _Adl_verify_range
in the caller so that I can repeat it here for fun
Thinking about this a bit more I believe there is a path forward for the copy family:
I will have a look at that in the evening |
78ebbe7
to
6d09e07
Compare
2b4b709
to
35dbfdc
Compare
Sorry for the noise, I am slowly working on this on a linux machine currently. I have to massively expand testing so do not bother to review right now |
35dbfdc
to
d647b6c
Compare
d647b6c
to
cba418d
Compare
9d051ae
to
09b8fde
Compare
09b8fde
to
6c86f50
Compare
Dropping because of #879 and the other stuff I have based on that |
This addresses #625
I went ahead and added specialization for the following classes of algorithms:
What is still open are find / count, which all need bitops support
Also lexicographical_compare was not really something I would want to touch.
That said I only added them when
if constexpr
is available. Life is too short for tag dispatch