Merged
13 changes: 8 additions & 5 deletions stl/inc/memory
@@ -4000,15 +4000,18 @@ protected:
_Atomic_ptr_base(remove_extent_t<_Ty>* const _Px, _Ref_count_base* const _Ref) noexcept
: _Ptr(_Px), _Repptr(_Ref) {}

-void _Wait(remove_extent_t<_Ty>* _Old, memory_order) const noexcept {
+void _Wait(remove_extent_t<_Ty>* _Old_ptr, _Ref_count_base* const _Old_rep, memory_order) const noexcept {
+    unsigned long _Remaining_timeout = 16; // milliseconds
+    const unsigned long _Max_timeout = 1048576; // milliseconds, ~17.5 minutes
for (;;) {
auto _Rep = _Repptr._Lock_and_load();
-bool _Equal = _Ptr.load(memory_order_relaxed) == _Old;
+bool _Equal = _Ptr.load(memory_order_relaxed) == _Old_ptr && _Rep == _Old_rep;
Contributor:
No change requested; we should do this in a followup if at all to avoid resetting testing, since it's not a correctness issue. Should this be reordered as:

Suggested change:
-bool _Equal = _Ptr.load(memory_order_relaxed) == _Old_ptr && _Rep == _Old_rep;
+bool _Equal = _Rep == _Old_rep && _Ptr.load(memory_order_relaxed) == _Old_ptr;

so we short-circuit when the core-local _Rep == _Old_rep is false before potentially loading _Ptr's cache line from some other core's data cache? (I suspect I understand cache coherence protocols just well enough to be dangerous.) Would any difference just be noise compared to the expense of _Repptr._Lock_and_load()?

Contributor:
(Pulling in @AlexGuteniev to render an opinion and/or tell me how wrong I am 😄.)

Member:
My vague understanding is that a relaxed load has no barriers, so it has no special costs. It would be fine to reorder, though, since the .load is at least a debug mode function call.

Contributor:
Aren't _Repptr and _Ptr in the same cache line? Sure, whatever they point to can be anywhere, but the pointers themselves are (as far as I can see) right beside each other.

(STL's comment sounds accurate, though.)

Contributor:
I agree that they are always in the same cache line (really always, not just most of the time, because we overalign to alignas(2 * sizeof(void*))). So from the cache perspective it does not matter.

I also agree that eliminating the debug mode call could be beneficial.

Still, I prefer the original order, because it compares _Rep == _Old_rep last, as close to the wait as possible, reducing the probability of resorting to the timeout.

Contributor:
Considering _Rep is just a stack-local and the real load is on the line above, I can't quite follow your reasoning.

AlexGuteniev (Contributor), Feb 16, 2024:

> Considering _Rep is just a stack-local and the real load is on the line above, I can't quite follow your reasoning.

Hm, you are right; we've already loaded it into a register at the same time anyway.

Then saving the debug mode call could be the main deciding factor here.

Contributor:
Ah, I'd forgotten that we overalign, so my cache coherence point is invalid. +1 to avoiding the call in debug compiles, though. We really want to minimize the time between the load and the check.

_Repptr._Store_and_unlock(_Rep);
if (!_Equal) {
break;
}
-__std_atomic_wait_direct(&_Ptr, &_Old, sizeof(_Old), __std_atomic_wait_no_timeout);
+__std_atomic_wait_direct(&_Ptr, &_Old_ptr, sizeof(_Old_ptr), _Remaining_timeout);
+_Remaining_timeout = (_STD min)(_Max_timeout, _Remaining_timeout * 2);
}
}

@@ -4114,7 +4117,7 @@ public:
}

void wait(shared_ptr<_Ty> _Old, memory_order _Order = memory_order_seq_cst) const noexcept {
-this->_Wait(_Old._Ptr, _Order);
+this->_Wait(_Old._Ptr, _Old._Rep, _Order);
}

using _Base::notify_all;
@@ -4237,7 +4240,7 @@ public:
}

void wait(weak_ptr<_Ty> _Old, memory_order _Order = memory_order_seq_cst) const noexcept {
-this->_Wait(_Old._Ptr, _Order);
+this->_Wait(_Old._Ptr, _Old._Rep, _Order);
}

using _Base::notify_all;
71 changes: 71 additions & 0 deletions tests/std/include/test_atomic_wait.hpp
@@ -202,6 +202,75 @@ struct with_padding_bits {
};
#pragma warning(pop)

template <class T, class U>
[[nodiscard]] bool ownership_equal(const T& t, const U& u) {
return !t.owner_before(u) && !u.owner_before(t);
}

inline void test_gh_3602() {
// GH-3602 std::atomic<std::shared_ptr>::wait does not seem to care about control block difference. Is this a bug?
{
auto sp1 = std::make_shared<char>();
auto holder = [sp1] {};
auto sp2 = std::make_shared<decltype(holder)>(holder);
std::shared_ptr<char> sp3{sp2, sp1.get()};

std::atomic<std::shared_ptr<char>> asp{sp1};
asp.wait(sp3);
}
{
auto sp1 = std::make_shared<char>();
auto holder = [sp1] {};
auto sp2 = std::make_shared<decltype(holder)>(holder);
std::shared_ptr<char> sp3{sp2, sp1.get()};
std::weak_ptr<char> wp3{sp3};

std::atomic<std::weak_ptr<char>> awp{sp1};
awp.wait(wp3);
}

{
auto sp1 = std::make_shared<char>();
auto holder = [sp1] {};
auto sp2 = std::make_shared<decltype(holder)>(holder);
std::shared_ptr<char> sp3{sp2, sp1.get()};

std::atomic<std::shared_ptr<char>> asp{sp3};

std::thread t([&] {
std::this_thread::sleep_for(std::chrono::milliseconds(100));
asp = sp1;
asp.notify_one();
});

asp.wait(sp3);

t.join();
}

{ // Also test shared_ptrs that own the null pointer.
int* const raw = nullptr;

std::shared_ptr<int> sp_empty;
std::shared_ptr<int> sp_also_empty;
std::shared_ptr<int> sp_original(raw);
std::shared_ptr<int> sp_copy(sp_original);
std::shared_ptr<int> sp_different(raw);

assert(ownership_equal(sp_empty, sp_also_empty));
assert(!ownership_equal(sp_original, sp_empty));
assert(ownership_equal(sp_original, sp_copy));
assert(!ownership_equal(sp_original, sp_different));

std::atomic<std::shared_ptr<int>> asp_empty;
asp_empty.wait(sp_original);

std::atomic<std::shared_ptr<int>> asp_copy{sp_copy};
asp_copy.wait(sp_empty);
asp_copy.wait(sp_different);
}
}

inline void test_atomic_wait() {
// wait for all the threads to be waiting; if this value is too small the test might be ineffective but should not
// fail due to timing assumptions except where otherwise noted; if it is too large the test will only take longer
@@ -292,4 +361,6 @@ inline void test_atomic_wait() {
test_pad_bits<with_padding_bits<32>>(waiting_duration);
#endif // ^^^ !ARM ^^^
#endif // ^^^ no workaround ^^^

test_gh_3602();
}