Skip to content

Conversation

muellerj2
Copy link
Contributor

@muellerj2 muellerj2 commented Aug 17, 2025

This is the first minor step on the journey of making the matcher non-recursive to deal with #997 and #1528.

This moves all instances of _Tgt_state or _Bt_state from the stack to the heap (into a vector). These objects store the current match state and are used to restore it when backtracking. In functions, the _Tgt_state objects in the vector are referenced using their indices (and not using pointers because the vector might reallocate).

Note that we have to pop the frames from the stack whenever we return during backtracking, but we don't have to worry about exceptions: If an exception occurs, the whole _Matcher2 object will be destructed.

Since these objects are large, this slightly reduces stack pressure: Copying #997 to a test and running it with the usual matrix, the first asan configs now start failing after 258 instead of 249 repetitions.

This PR adds no additional tests because all existing regex tests already provide coverage for this change and there is no obvious gap where further tests are needed. It does add internal checks, though, to verify that we don't accidentally fail to pop from the stack somewhere when backtracking.

The impact on performance is mixed: On the one hand, the change can improve performance when regex matching involves a substantial amount of repeated backtracking, because it avoids reallocations (and regex_search usually backtracks completely at least once because the initial match fails). On the other hand, the new vector performs additional allocations, which noticeably worsens performance when individual _Tgt_state/_Bt_state objects don't allocate at all after #5518, such as the patterns (:?bibe)+ and ((?!lorem)bibe) in the regex_search benchmark. (In the long run, we probably want to use a vector variant implementing small vector optimization here.)

Benchmark for regex_search:

name before [ns] after [ns] speedup
bm_lorem_search/"^bibe"/2 59 58 1.01
bm_lorem_search/"^bibe"/3 61 57 1.07
bm_lorem_search/"^bibe"/4 63 59 1.05
bm_lorem_search/"bibe"/2 3076 3013 1.02
bm_lorem_search/"bibe"/3 6719 6250 1.08
bm_lorem_search/"bibe"/4 12556 11300 1.11
bm_lorem_search/"bibe".collate/2 3069 3380 0.91
bm_lorem_search/"bibe".collate/3 5999 6138 0.98
bm_lorem_search/"bibe".collate/4 12242 11300 1.08
bm_lorem_search/"(bibe)"/2 5162 5999 0.86
bm_lorem_search/"(bibe)"/3 10045 10254 0.98
bm_lorem_search/"(bibe)"/4 20926 22469 0.93
bm_lorem_search/"(bibe)+"/2 12556 12905 0.97
bm_lorem_search/"(bibe)+"/3 24554 25635 0.96
bm_lorem_search/"(bibe)+"/4 47571 54688 0.87
bm_lorem_search/"(?:bibe)+"/2 6094 7813 0.78
bm_lorem_search/"(?:bibe)+"/3 10986 15067 0.73
bm_lorem_search/"(?:bibe)+"/4 22496 28878 0.78
bm_lorem_search/R"(\bbibe)"/2 104980 92072 1.14
bm_lorem_search/R"(\bbibe)"/3 214471 179983 1.19
bm_lorem_search/R"(\bbibe)"/4 442661 351500 1.26
bm_lorem_search/R"(\Bibe)"/2 263660 217644 1.21
bm_lorem_search/R"(\Bibe)"/3 513827 460379 1.12
bm_lorem_search/R"(\Bibe)"/4 1108600 815763 1.36
bm_lorem_search/R"((?=....)bibe)"/2 4185 3850 1.09
bm_lorem_search/R"((?=....)bibe)"/3 8371 8196 1.02
bm_lorem_search/R"((?=....)bibe)"/4 16044 15067 1.06
bm_lorem_search/R"((?=bibe)....)"/2 4143 3683 1.12
bm_lorem_search/R"((?=bibe)....)"/3 7673 6975 1.10
bm_lorem_search/R"((?=bibe)....)"/4 14648 14125 1.04
bm_lorem_search/R"((?!lorem)bibe)"/2 3749 5000 0.75
bm_lorem_search/R"((?!lorem)bibe)"/3 7499 9417 0.80
bm_lorem_search/R"((?!lorem)bibe)"/4 14509 18415 0.79

@muellerj2 muellerj2 requested a review from a team as a code owner August 17, 2025 12:48
@github-project-automation github-project-automation bot moved this to Initial Review in STL Code Reviews Aug 17, 2025
@StephanTLavavej StephanTLavavej added enhancement Something can be improved regex meow is a substring of homeowner labels Aug 18, 2025
@StephanTLavavej StephanTLavavej self-assigned this Aug 18, 2025
@StephanTLavavej
Copy link
Member

Thanks as always for the detailed explanation and careful changes, looks perfect! 😻

@StephanTLavavej StephanTLavavej removed their assignment Aug 22, 2025
@StephanTLavavej StephanTLavavej moved this from Initial Review to Ready To Merge in STL Code Reviews Aug 22, 2025
@StephanTLavavej StephanTLavavej moved this from Ready To Merge to Merging in STL Code Reviews Aug 25, 2025
@StephanTLavavej
Copy link
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@StephanTLavavej StephanTLavavej merged commit 023485c into microsoft:main Aug 25, 2025
39 checks passed
@github-project-automation github-project-automation bot moved this from Merging to Done in STL Code Reviews Aug 25, 2025
@StephanTLavavej
Copy link
Member

Thanks for taking the first step towards this long-awaited impossible dream! 😻 🏃 🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Something can be improved regex meow is a substring of homeowner
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

2 participants