Vectorize unique
#5092
Conversation
Less error prone, especially if implementing _copy someday
The speedups are less significant on my 5950X, but good across the board with no regressions:
I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.
I've discovered that something is missing. Unlike …, the vectorized path performs extra writes. For performance, this is clearly a missed opportunity, though the vectorization improvement should outweigh the negative effect of the extra writes. For correctness, I'm not sure. [algorithms.requirements]/3 says:
However, as the extra writes store equal integer values, they are not observable by concurrent reads, even if the container violates alignment requirements and the writes are not atomic. The only case where the extra writes could be observable is running this algorithm on read-only data without adjacent duplicates, but that is a very silly use case. It is easily fixable with …
Thanks for the one-of-a-kind PR! 😹 🚀 🎉
Not really unique, modelled on #4987
⏬ Double load

To compare adjacent values, the same memory is loaded twice, with a one-element shift. It is possible to reuse the previous vector instead, mixing it with the current one, to save one load, at the cost of some extra mixing instructions and a loop-carried dependency. On the SSE path this is possible with `_mm_alignr_epi8` (except for 8-bit elements). For AVX it would be far more complex due to AVX lanes. Benchmarking shows that the double load is faster than any reuse attempt. To some extent this result overlaps with #4958.
⏱️ Benchmark results