Removal of leading WWWnnn. in URL canonicalization is too aggressive #29

@tfmorris

This bug (internetarchive/surt#28) reported against the Python SURT module applies to the URL canonicalization here as well.

The following URLs are all incorrectly canonicalized to the same SURT, "com)/", because the whole "wwwNNN." hostname label is stripped as if it were a "www" prefix:

1. https://www1355544.com/
2. https://www3288.com/
3. https://www504778.com/
4. https://www556798.com/
5. https://www57912.com/
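
To make the collapse concrete, here is a minimal sketch of how a "www plus optional digits" prefix rule produces this result. The pattern `WWW_PREFIX` and the helper `simple_surt` are hypothetical and written only for illustration; they are not the actual code of either the Java or the Python package, and the SURT form is simplified to host and path only.

```python
import re
from urllib.parse import urlsplit

# Hypothetical prefix rule (an assumption, not the library's real pattern):
# "www", any number of digits, then a dot.
WWW_PREFIX = re.compile(r"^www\d*\.")


def simple_surt(url: str) -> str:
    """Simplified SURT: strip one www-like prefix, reverse the host labels."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # Remove a single leading "wwwNNN." label, if present.
    host = WWW_PREFIX.sub("", host, count=1)
    reversed_host = ",".join(reversed(host.split(".")))
    path = parts.path or "/"
    return f"{reversed_host}){path}"


if __name__ == "__main__":
    for url in (
        "https://www1355544.com/",
        "https://www3288.com/",
        "https://www.example.com/",
    ):
        print(url, "->", simple_surt(url))
```

Under this rule, distinct registered domains such as www1355544.com and www3288.com lose their only distinguishing label, so both end up with the same key, while an ordinary www.example.com is handled as intended.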

There's also a difference in how the two packages handle these prefixes: the Java package removes ALL leading matching prefixes, while the Python package removes only the first one. I think the less aggressive approach of the Python package might be preferable.
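
The difference between the two strategies can be sketched as follows. This is an illustration under the same assumed prefix pattern, not the code of either package; `strip_first`, `strip_all`, and the example hostname are hypothetical.

```python
import re

# Hypothetical prefix rule (an assumption for illustration only).
WWW_PREFIX = re.compile(r"^www\d*\.")


def strip_first(host: str) -> str:
    """Remove at most one leading www-like label (the behaviour described for Python)."""
    return WWW_PREFIX.sub("", host, count=1)


def strip_all(host: str) -> str:
    """Keep removing leading www-like labels until none match (the behaviour described for Java)."""
    while True:
        stripped = WWW_PREFIX.sub("", host, count=1)
        if stripped == host:
            return host
        host = stripped


if __name__ == "__main__":
    host = "www.www2.example.com"
    print("first only:", strip_first(host))
    print("all:       ", strip_all(host))
```

With a host like www.www2.example.com, removing only the first match leaves www2.example.com, whereas removing all matches reduces it to example.com, so the repeated variant discards more of the original hostname.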
