-
-
Notifications
You must be signed in to change notification settings - Fork 19k
BUG: fix bug in str.fullmatch for Arrow backend with optional groups #61073
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…nation in PyArrow strings Fixes an issue where regex patterns with alternation (|) produce different results between str dtype and string[pyarrow] dtype. When using patterns like "(as)|(as)", PyArrow implementation would incorrectly match "asdf" while Python's implementation correctly rejects it. The fix adds special handling to ensure alternation patterns are properly parenthesized when using PyArrow-backed strings.
Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen. |
@ptth222 sorry you didn't get any feedback on the PR initially, but thanks for opening the issue and this PR! It looks like a good start to me, but will need to address some failing tests (and also add a new test case for what it is fixing) |
): | ||
if not pat.endswith("$") or pat.endswith("\\$"): | ||
pat = f"{pat}$" | ||
if not pat.endswith("$") and not pat.startswith("^"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing that is change is missing from the original code is the or pat.endswith("\\$")
, which ensures you can match a literal dollar sign (this is covered by the test_fullmatch_dollar_literal
test, which is failing with this change).
I think every time you do if not pat.endswith("$")
(also below), you'll have to add the `or pat.endswith("\$")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added what you said here and it seems to pass now. The one error doesn't seem to have anything to do with this code, but I could be wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to reply before, but it seems not to have gone through. I made these changes and it seems like it passes now. The 1 error that is generated doesn't seem to be because of this.
Made the changes suggested by @jorisvandenbossche.
Added a test to confirm that the arrow implementation gives the same result as the python one. Also changed test_str_fullmatch to use the fullmatch method instead of the match method.
There were errors, so I changed the tests to try and make them pass.
Had to change expected result again because I missed one previously. Also had to change test structure to reuse parametrize.
There were arrows about the series types not being the same, so I tried to address that.
Trying again to get the result types correct so that the equal assertion works.
I added a test that tests the arrow fullmatch against the python fullmatch. I also might have found a mistake in the fullmatch test. "test_str_fullmatch" in tests/extension/test_arrow.py was calling the I can't for the life of me figure out what the problem is with the new test I added, I didn't add any more tests except for the one for fullmatch, but it might be a good idea to add the same kind of test for match, search, etc. Assuming we can get the issues straightened out. |
Trying to address #61072 since I created the issue. I've never contributed to pandas, but I have tried to follow what is in the guide.