Skip to content

Conversation

AlexGuteniev
Copy link
Contributor

Not really unique, modelled on #4987

⏬ Double load

To compare adjacent values, the same memory is loaded twice with an element shift.

It is possible to reuse the previous vector part, and mix it with the current, to save one load, but have some extra instructions to mix values, and a loop-carried dependency. On SSE path it is possible with _mm_alignr_epi8 (except for 8-bit elements). For AVX it would be way more complex due to AVX lanes.

Benchmarking shows that double load is faster than any reuse attempt. To some extent such a result overlaps with #4958

⏱️ Benchmark results

Benchmark main this
u<alg_type::std_fn, std::uint8_t> 1166 ns 190 ns
u<alg_type::std_fn, std::uint16_t> 1222 ns 247 ns
u<alg_type::std_fn, std::uint32_t> 1555 ns 310 ns
u<alg_type::std_fn, std::uint64_t> 1470 ns 665 ns
u<alg_type::rng, std::uint8_t> 1230 ns 187 ns
u<alg_type::rng, std::uint16_t> 1204 ns 233 ns
u<alg_type::rng, std::uint32_t> 1268 ns 308 ns
u<alg_type::rng, std::uint64_t> 1505 ns 665 ns

@AlexGuteniev AlexGuteniev requested a review from a team as a code owner November 16, 2024 20:59
@StephanTLavavej StephanTLavavej added the performance Must go faster label Nov 16, 2024
@StephanTLavavej StephanTLavavej self-assigned this Nov 16, 2024
Less error prone, especially if implementing _copy someday
@CaseyCarter

This comment was marked as resolved.

@StephanTLavavej

This comment was marked as resolved.

This comment was marked as resolved.

@StephanTLavavej StephanTLavavej changed the title unique vectorization Vectorize unique Dec 17, 2024
@github-project-automation github-project-automation bot moved this from Initial Review to Work In Progress in STL Code Reviews Mar 3, 2025
@StephanTLavavej

This comment was marked as resolved.

@StephanTLavavej StephanTLavavej removed their assignment Mar 3, 2025
@StephanTLavavej StephanTLavavej moved this from Work In Progress to Initial Review in STL Code Reviews Mar 4, 2025
@StephanTLavavej StephanTLavavej self-assigned this Mar 4, 2025
@StephanTLavavej
Copy link
Member

The speedups are less significant on my 5950X, but good across the board with no regressions:

Benchmark Before After Speedup
u<alg_type::std_fn, std::uint8_t> 1681 ns 494 ns 3.40
u<alg_type::std_fn, std::uint16_t> 1566 ns 503 ns 3.11
u<alg_type::std_fn, std::uint32_t> 1319 ns 657 ns 2.01
u<alg_type::std_fn, std::uint64_t> 1466 ns 1138 ns 1.29
u<alg_type::rng, std::uint8_t> 1261 ns 497 ns 2.54
u<alg_type::rng, std::uint16_t> 1691 ns 584 ns 2.90
u<alg_type::rng, std::uint32_t> 1765 ns 601 ns 2.94
u<alg_type::rng, std::uint64_t> 1403 ns 1139 ns 1.23

@StephanTLavavej StephanTLavavej removed their assignment Mar 4, 2025
@StephanTLavavej StephanTLavavej moved this from Initial Review to Ready To Merge in STL Code Reviews Mar 4, 2025
@StephanTLavavej StephanTLavavej moved this from Ready To Merge to Merging in STL Code Reviews Mar 21, 2025
@StephanTLavavej StephanTLavavej self-assigned this Mar 21, 2025
@StephanTLavavej
Copy link
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

StephanTLavavej added a commit to StephanTLavavej/STL that referenced this pull request Mar 21, 2025
@AlexGuteniev
Copy link
Contributor Author

AlexGuteniev commented Mar 23, 2025

I've discovered that something is missing.
But I think it can wait for a follow up PR.


Unlike remove algorithm, this one doesn't have the search for the first duplicate before the main vectorization loop. The scalar implementations in headers have that part, the vectorized one currently doesn't.

For performance it is clearly a missed opportunity. Though the vectorization improvement should be bigger than the negative effect of extra writes.

For correctness, I'm not sure. [algorithms.requirements]/3 says:

For purposes of determining the existence of data races, algorithms shall not modify objects referenced through an iterator argument unless the specification requires such modification.

However as the writes write equal integer values, this is not observable for concurrent reads, even if container violates alignment requirements and the write is not atomic.

The only thing where extra writes can be observable is running this algorithm on a read-only data without adjacent duplicates. But this is a very silly use case.


It is easily fixable with adjacent_find, vectorized in another PR here.

@StephanTLavavej StephanTLavavej merged commit 2b11023 into microsoft:main Mar 24, 2025
39 checks passed
@github-project-automation github-project-automation bot moved this from Merging to Done in STL Code Reviews Mar 24, 2025
@StephanTLavavej
Copy link
Member

Thanks for the one-of-a-kind PR! 😹 🚀 🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Must go faster
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

4 participants