Skip to content

Conversation

muellerj2
Copy link
Contributor

Fixes #5212.

At first glance, calling LCMapStringEx with LCMAP_HASH seems to be the obvious solution, but the docs state (emphasis mine):

Strings that appear equivalent typically return the same hash (for example, "hello" and "HELLO" with LCMAP_IGNORECASE). However, some complex cases, such as East Asian languages, can have similar strings with identical weights that compare as equal but do not return the same hash.

I briefly tested this and the warning seems to be true, as the following program finds some single-character strings that collate the same but yield different hashes:

#include <iostream>
#include <windows.h>

using namespace std;

int main() {
	const wchar_t* locale = L"ja-JP";
	for (wchar_t w = 1; w != 0; ++w) {
		int hashw;
		if (LCMapStringEx(locale, LCMAP_HASH, &w, 1, reinterpret_cast<LPWSTR>(&hashw), sizeof(int), nullptr, nullptr, 0) == 0) {
			continue;
		}

		for (wchar_t x = w + 1; x != 0; ++x) {
			int hashx;
			if (LCMapStringEx(locale, LCMAP_HASH, &x, 1, reinterpret_cast<LPWSTR>(&hashx), sizeof(int), nullptr, nullptr, 0) == 0) {
				continue;
			}
			if (hashw != hashx && CompareStringEx(locale, 0, &w, 1, &x, 1, nullptr, nullptr, 0) == CSTR_EQUAL) {
				cout << "found different hashes for characters collating the same at: " << int(w) << " : " << int(x) << '\n';
			}
		}
	}
}

Output:

found different hashes for characters collating the same at: 1556 : 64606
found different hashes for characters collating the same at: 1611 : 65137
found different hashes for characters collating the same at: 1614 : 65143
[...]

This means that LCMapStringEx with LCMAP_HASH is not good enough to meet the requirements in [locale.collate.virtuals]/3.

For this reason, this PR computes the sort key first (by calling do_transform) and then hashes the sort key.

@muellerj2 muellerj2 requested a review from a team as a code owner May 3, 2025 18:55
@github-project-automation github-project-automation bot moved this to Initial Review in STL Code Reviews May 3, 2025
@StephanTLavavej StephanTLavavej added the bug Something isn't working label May 3, 2025
@StephanTLavavej StephanTLavavej self-assigned this May 3, 2025
@StephanTLavavej
Copy link
Member

Thanks! 😻 I pushed a trivial nitpick and a change to guard this with SKIP_COLLATE_TRANSFORM_TESTS and a comment explaining why. Please meow if I got that wrong.

@StephanTLavavej StephanTLavavej removed their assignment May 4, 2025
@StephanTLavavej StephanTLavavej moved this from Initial Review to Ready To Merge in STL Code Reviews May 4, 2025
@StephanTLavavej StephanTLavavej moved this from Ready To Merge to Merging in STL Code Reviews May 9, 2025
@StephanTLavavej
Copy link
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

StephanTLavavej added a commit to StephanTLavavej/STL that referenced this pull request May 9, 2025
@StephanTLavavej StephanTLavavej merged commit a323953 into microsoft:main May 10, 2025
39 checks passed
@github-project-automation github-project-automation bot moved this from Merging to Done in STL Code Reviews May 10, 2025
@StephanTLavavej
Copy link
Member

Thanks for figuring out the correct way to fix this bug! 💯 😻 😸

@muellerj2 muellerj2 deleted the locale-collate-hash branch May 31, 2025 21:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

<locale>: std::collate_byname<_Elem>::hash() yields different hashes for strings that collate the same
2 participants