muellerj2
Contributor

Since we settled on reasonable semantics for leftmost-longest matching in #5218, I think we should remove the code for another (abandoned) attempt to implement the leftmost-longest rule in _Matcher::_Do_if: an attempt to set _Tgt_state to the leftmost-longest match found under this _N_if node.

This is neither necessary (in leftmost-longest mode, we take the final result from _Res, not _Tgt_state) nor sufficient (it assigns one of the longest matches to _Tgt_state, but if there are several of the same length it doesn't necessarily pick the correct one among these matches).
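For readers less familiar with the rule itself, here is a minimal sketch of what leftmost-longest matching means at the `std::regex` level. The helper `first_match` is a hypothetical name introduced only for this illustration; it is not part of the STL or of this PR. On a conforming implementation, the POSIX `extended` grammar is expected to pick the longest of the alternatives at the leftmost position, while ECMAScript takes the first alternative that succeeds.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not part of the STL): returns the text of the first
// match of `pattern` in `input` under the given grammar, or "" if none.
std::string first_match(const std::string& pattern, const std::string& input,
                        std::regex::flag_type grammar) {
    const std::regex re(pattern, grammar);
    std::smatch m;
    return std::regex_search(input, m, re) ? m.str() : std::string{};
}
```

Calling `first_match("a|ab", "ab", std::regex::ECMAScript)` and `first_match("a|ab", "ab", std::regex::extended)` side by side shows the difference between the two matching rules.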

On the other hand, removing it saves 96 bytes of stack space per call to _Do_if in debug mode, slightly alleviating the stack overflow issues (#997, #1528).

Potentially leaving _Tgt_state in a garbage state is fine here, because none of the functions further up on the callstack rely on its value:

  • _Match_pat in the _N_if case: Will just return immediately to its caller without changing _Tgt_state.
  • _Do_rep0: Can't be a caller of _Match_pat because a repetition containing an _N_if node anywhere is not simple.
  • _Do_assert/_Do_neg_assert: Can't be the callers of _Match_pat because there are no lookahead assertions in the POSIX grammars that demand application of the leftmost-longest rule. (But even if there were lookahead assertions -- notwithstanding their currently unknown semantics -- it would be fine in the sense that the matcher wouldn't crash: _Do_assert would reset the position pointer to a savepoint, while _Do_neg_assert would fail, resulting in _Match_pat returning immediately to its caller so that the remaining analysis here applies. Even so, there is the issue that _Do_assert would not reset the capture groups in _Tgt_state -- but we don't know what to set them to either as long as we don't know the semantics of such assertions. Nevertheless, all "valid" capture groups would still point to legal ranges in the input, so even the matcher with this PR's change wouldn't crash. This means we wouldn't have to worry about a newer parser emitting assertion nodes because running them with the old matcher would at worst produce wrong results. To get correct semantics, an updated parser and matcher are required, but this would also be the case if we didn't do this PR's change.)
  • Another _Do_if: Will either reset _Tgt_state to some savepoint before calling _Match_pat or immediately return to its caller and leave _Tgt_state as-is.
  • _Do_rep: Will either reset _Tgt_state to some savepoint before doing the next _Match_pat call or return to its caller while leaving _Tgt_state as-is.
  • _Match_pat in the _N_rep or _N_end_rep cases: Will just return immediately to its caller without changing _Tgt_state.
  • _Match: Will evaluate _Res in leftmost-longest mode, not _Tgt_state. (_Match only evaluated _Res before "<regex>: Fix depth-first and leftmost-longest matching rules" #5218 as well, so this change also doesn't pose a problem if old and new functions are mixed.)

So in all cases, either _Tgt_state isn't evaluated anymore or it is reset to some savepoint before it is used again.
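The argument above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the STL's actual matcher: `ScratchState`, `BestResult`, `try_alternatives` and `longest_match_end` are hypothetical names, the "pattern" is just a list of literal alternatives, and tie-breaking between equally long matches is not modeled. It shows the two properties the analysis relies on: the scratch state is reset to a savepoint before every reuse, and the final answer is read from the separately tracked result, so whatever is left in the scratch state at the end does not matter.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the matcher's working state (plays the role of _Tgt_state).
struct ScratchState {
    std::size_t pos = 0; // current position in the input
};

// Hypothetical stand-in for the separately tracked best match (plays the role of _Res).
struct BestResult {
    bool matched = false;
    std::size_t end = 0; // end position of the longest match seen so far
};

// Tries every literal alternative at the position stored in `scratch`.
// The longest match is recorded in `best`; `scratch` is NOT restored on exit.
bool try_alternatives(const std::string& input,
                      const std::vector<std::string>& alternatives,
                      ScratchState& scratch, BestResult& best) {
    const ScratchState savepoint = scratch;
    bool any = false;
    for (const auto& alt : alternatives) {
        scratch = savepoint; // reset to the savepoint before each attempt
        if (input.compare(scratch.pos, alt.size(), alt) == 0) {
            scratch.pos += alt.size();
            any = true;
            if (!best.matched || scratch.pos > best.end) {
                best.matched = true;
                best.end = scratch.pos;
            }
        } else {
            scratch.pos = input.size(); // simulate garbage left by a failed attempt
        }
    }
    return any; // scratch may hold garbage here; callers must not rely on it
}

// Top-level caller: evaluates only `best`, never `scratch`.
std::size_t longest_match_end(const std::string& input,
                              const std::vector<std::string>& alternatives) {
    ScratchState scratch;
    BestResult best;
    try_alternatives(input, alternatives, scratch, best);
    return best.matched ? best.end : std::string::npos;
}
```

In this model, `longest_match_end("abc", {"a", "ab", "x"})` reports the end of the longest alternative even though the last (failed) attempt leaves `scratch` in a garbage state.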

@muellerj2 muellerj2 requested a review from a team as a code owner April 13, 2025 12:59
@github-project-automation github-project-automation bot moved this to Initial Review in STL Code Reviews Apr 13, 2025
@StephanTLavavej StephanTLavavej added the enhancement (Something can be improved) and regex (meow is a substring of homeowner) labels Apr 14, 2025
@StephanTLavavej StephanTLavavej self-assigned this Apr 14, 2025
@StephanTLavavej
Member

Thanks for the exceptional analysis and writeup - I wouldn't have noticed that this code was dead! 😻

Anything to mitigate stack consumption is also greatly appreciated.

@StephanTLavavej StephanTLavavej removed their assignment Apr 15, 2025
@StephanTLavavej StephanTLavavej moved this from Initial Review to Ready To Merge in STL Code Reviews Apr 15, 2025
@StephanTLavavej StephanTLavavej moved this from Ready To Merge to Merging in STL Code Reviews Apr 22, 2025
@StephanTLavavej
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@StephanTLavavej StephanTLavavej merged commit f57e71d into microsoft:main Apr 22, 2025
39 checks passed
@github-project-automation github-project-automation bot moved this from Merging to Done in STL Code Reviews Apr 22, 2025
@StephanTLavavej
Member

Thanks again for this simplification! 🧹 ✨ 🐱

@muellerj2 muellerj2 deleted the regex-simplify-if-matching branch May 31, 2025 21:34