Skip to content

Conversation

ptth222
Copy link

@ptth222 ptth222 commented Mar 7, 2025

Trying to address #61072 since I created the issue. I've never contributed to pandas, but I have tried to follow what is in the guide.

Pedro-Santos04 added a commit to Pedro-Santos04/pandas that referenced this pull request Mar 23, 2025
…nation in PyArrow strings

Fixes an issue where regex patterns with alternation (|) produce different results between str dtype and string[pyarrow] dtype. When using patterns like "(as)|(as)", PyArrow implementation would incorrectly match "asdf" while Python's implementation correctly rejects it. The fix adds special handling to ensure alternation patterns are properly parenthesized when using PyArrow-backed strings.
@mroeschke
Copy link
Member

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

@jorisvandenbossche
Copy link
Member

@ptth222 sorry you didn't get any feedback on the PR initially, but thanks for opening the issue and this PR! It looks like a good start to me, but will need to address some failing tests (and also add a new test case for what it is fixing)

@jorisvandenbossche jorisvandenbossche added Strings String extension data type and string data Arrow pyarrow functionality labels Sep 12, 2025
@jorisvandenbossche jorisvandenbossche added this to the 2.3.3 milestone Sep 12, 2025
@jorisvandenbossche jorisvandenbossche changed the title BUG: Addressing #61072 BUG: fix bug in str.fullmatch for Arrow backend with optional groups Sep 12, 2025
):
if not pat.endswith("$") or pat.endswith("\\$"):
pat = f"{pat}$"
if not pat.endswith("$") and not pat.startswith("^"):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing that is change is missing from the original code is the or pat.endswith("\\$"), which ensures you can match a literal dollar sign (this is covered by the test_fullmatch_dollar_literal test, which is failing with this change).

I think every time you do if not pat.endswith("$") (also below), you'll have to add the `or pat.endswith("\$")

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added what you said here and it seems to pass now. The one error doesn't seem to have anything to do with this code, but I could be wrong.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to reply before, but it seems not to have gone through. I made these changes and it seems like it passes now. The 1 error that is generated doesn't seem to be because of this.

Added a test to confirm that the arrow implementation gives the same result as the python one. Also changed test_str_fullmatch to use the fullmatch method instead of the match method.
There were errors, so I changed the tests to try and make them pass.
Had to change expected result again because I missed one previously. Also had to change test structure to reuse parametrize.
There were arrows about the series types not being the same, so I tried to address that.
Trying again to get the result types correct so that the equal assertion works.
@ptth222
Copy link
Author

ptth222 commented Sep 17, 2025

I added a test that tests the arrow fullmatch against the python fullmatch. I also might have found a mistake in the fullmatch test. "test_str_fullmatch" in tests/extension/test_arrow.py was calling the match method and not the fullmatch method, so I changed it to use fullmatch. I also had to change some of the expected values to match what should be expected from fullmatch.

I can't for the life of me figure out what the problem is with the new test I added, test_str_fullmatch_against_python_fullmatch. It is producing AssertionError: Series NA mask are different, but both series should be the exact same values and types. I have tried using both ArrowDtype(pa.string()) and result.dtype in the astype method, but still get this same error. I can't reproduce this on my own machine. If someone has any insights into what the issue could be, I would love to hear it.

I didn't add any more tests except for the one for fullmatch, but it might be a good idea to add the same kind of test for match, search, etc. Assuming we can get the issues straightened out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Arrow pyarrow functionality Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants