Conversation

@luis-cd luis-cd commented Aug 19, 2025

…ove_orphans

Closes #3418

Background

When duplicating pages and then removing one, the file size after calling compress_identical_objects did not decrease as expected. The root causes were:

  • StreamObjects were not considered in duplicate compression.
  • The remove_orphans parameter was accepted but never actually applied.

Fix

  • Include StreamObjects in duplicate detection.
  • Ensure remove_orphans is honored and unreferenced objects are removed.

Tests

Note on file size

The absolute file size is not reduced back to the exact original value; some overhead remains in the file structure even after deduplication, and that remains an open issue. The important part is that the size no longer grows unbounded and that the remove_orphans parameter now works as documented.
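
For reference, a toy sketch of the duplicate-then-remove scenario using pypdf's public API (the blank pages and the size check are illustrative only; this is not the regression test added in this PR):

from io import BytesIO

from pypdf import PdfWriter

# Two identical blank pages, then drop one of them again.
writer = PdfWriter()
writer.add_blank_page(width=72, height=72)
writer.add_blank_page(width=72, height=72)
writer.remove_page(len(writer.pages) - 1)

# Deduplicate identical objects (now including StreamObjects) and sweep
# objects that are no longer referenced from anywhere.
writer.compress_identical_objects(remove_identicals=True, remove_orphans=True)

buffer = BytesIO()
writer.write(buffer)
print(len(buffer.getvalue()))  # should stay close to the single-page size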

Notes on Code Observations

During the implementation, several potential security and design issues were observed throughout the code. These are not addressed in this PR, but have been documented inline for visibility.

These items should be reviewed in future issues or PRs to improve overall security, maintainability, and robustness of the library.

By flagging them now, we ensure they are tracked and can be prioritized for proper fixes later.

…ove_orphans

Closes py-pdf#3418

Previously, StreamObjects were not considered in duplicate compression
and the remove_orphans parameter was ignored. This caused unexpected
file size increases when removing duplicated pages.

This commit ensures StreamObjects are handled and remove_orphans
removes unreferenced objects correctly.

Regression test included.

codecov bot commented Aug 19, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 96.99%. Comparing base (ee922f3) to head (b523768).
⚠️ Report is 8 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3440      +/-   ##
==========================================
+ Coverage   96.97%   96.99%   +0.01%     
==========================================
  Files          54       54              
  Lines        9337     9379      +42     
  Branches     1711     1724      +13     
==========================================
+ Hits         9055     9097      +42     
  Misses        168      168              
  Partials      114      114              


Reordered imports alphabetically and replaced deprecated typing.Dict/typing.Tuple
with dict/tuple where applicable. Trailing whitespace and blank line issues removed.

Security and design issues are documented in code for future review.

Args:
remove_identicals: Remove identical objects.
remove_orphans: Remove unreferenced objects.
"""
if not remove_identicals and not remove_orphans:
logger_warning(
Collaborator

Why does this need a warning?

Author

Usually, both args being set to True is a mistake. It can be the case that this happens programmatically as a result of some logic I don't even want to think about. In this case, just a warning is fine.

Collaborator

I assume you mean both arguments being set to False is a mistake. While this method does not have an effect in this case, I do not see why we should issue a warning. If we want to completely avoid this, we should issue a proper exception to point the user to a mis-use. Warnings might be read, but are not guaranteed to be always read.

Author

I would advise not to use exceptions in this case. A warning is fine for debugging purposes. But it's not my call.
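
For reference, a minimal sketch of the exception-based guard the reviewer suggests (the helper name is hypothetical, not pypdf API; the PR itself issues a warning here instead):

def check_compression_args(remove_identicals: bool, remove_orphans: bool) -> None:
    # Hypothetical helper illustrating the reviewer's preference.
    if not remove_identicals and not remove_orphans:
        # Exception variant: misuse cannot go unnoticed, unlike a warning.
        raise ValueError("Either remove_identicals or remove_orphans must be True.")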

# recursive DFS (mark phase)
orphans = self._perform_orphan_recursive_mark_phase()
# sweep phase
assert orphans is not None
Collaborator

Why do we need this assertion?

Author

Mostly for defensive programming, and so that future developers see that orphans must be set at this point. If somebody changes _perform_orphan_recursive_mark_phase and the new logic no longer returns the orphans here, it will stop at the assertion. It's better to fail at that point than later, when orphans is used and something other than None is expected.

# When remove_orphans=True, also mark reachability while rewriting refs
orphans = self._remove_identical_objects(mark_orphans=remove_orphans)

if remove_orphans:
Collaborator

I assume that this closes #3306 as well?

Author

Oh, yes. I didn't realize that issue was there.

can later sweep truly unreachable ones.

This function:
- groups objects by obj.hash_value(), --- WARNING: this can cause collisions !!!
Collaborator

Which real-world effect and probability does this warning have?

Author

@luis-cd Sep 8, 2025

Maybe really large PDFs with repeating elements. I was just being extra cautious. I don't really think it pays off to change this.
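
If the collision risk ever needed to be closed, one option is to compare the serialized objects whenever their hashes match before merging them. A rough sketch with hypothetical names (objects, serialize), not pypdf internals:

from collections import defaultdict

def find_duplicates(objects, serialize):
    # hash -> list of (canonical index, serialized bytes) sharing that hash
    groups = defaultdict(list)
    duplicates = {}  # duplicate index -> canonical index
    for idx, obj in enumerate(objects):
        if obj is None:
            continue
        data = serialize(obj)
        bucket = groups[hash(data)]
        for canonical_idx, canonical_data in bucket:
            if canonical_data == data:  # equal bytes, genuinely identical
                duplicates[idx] = canonical_idx
                break
        else:
            bucket.append((idx, data))  # first occurrence or hash collision
    return duplicates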

obj: PdfObject, crossref: dict[IndirectObject, IndirectObject]
Returns:
Optional[list[bool]]: a boolean list the size of self._objects where True means
"unvisited (potential orphan)" and False means "reachable". Returns None if
Collaborator

Is this indentation correct?

Author

Yep 👍, that indentation is correct, at least per the Google-style docstring convention.

If you want to be strictly PEP-conforming, we can add a blank line between the Args and Returns sections.
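
For illustration, the shape being discussed, with the extra blank line between the sections (the signature is approximated from the diff context above):

def _remove_identical_objects(self, mark_orphans: bool = False):
    """Replace identical objects and optionally mark reachability.

    Args:
        mark_orphans: Mark reachable objects while rewriting references.

    Returns:
        Optional[list[bool]]: a boolean list the size of self._objects where True means
            "unvisited (potential orphan)" and False means "reachable". Returns None if
            mark_orphans is False.
    """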

"unvisited (potential orphan)" and False means "reachable". Returns None if
mark_orphans is False.
"""
# -----
Collaborator

This looks superfluous.

Author

We can change it. But I would make it clear that True means unvisited but not necessarily orphan, until the end of the mark phase.

If mark_orphans is True, mark reachable objects during the rewrite so that the caller
can later sweep truly unreachable ones.

--- SECURITY CONCERN !!!
Collaborator

Please propose a solution on how you recommend to mitigate this.

Author

If you're asking about the SECURITY CONCERN !!! comment:

The dirtiest solution is to set a MAX_DEPTH variable and, if it is exceeded, stop the mark phase and the compression and raise the necessary errors. Timeouts or iteration caps of some sort go in the same direction. Although I believe it's a good idea to implement this, even as optional arguments, I don't think it would be a complete solution.

A better, more complete solution would be to use an explicit stack instead of recursion.
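
A minimal sketch of that stack-based variant, assuming a simplified object model (get_references is a hypothetical callback; the real code would walk DictionaryObject and ArrayObject values):

def mark_reachable(root_refs, num_objects, get_references):
    # Iterative mark phase: arbitrarily deep reference chains in a crafted
    # PDF cannot exhaust the Python call stack.
    orphans = [True] * num_objects  # True = unvisited (potential orphan)
    stack = list(root_refs)
    while stack:
        ref = stack.pop()
        idx = ref.idnum - 1
        if not orphans[idx]:
            continue  # already visited
        orphans[idx] = False  # reachable
        stack.extend(get_references(ref.get_object()))
    return orphans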

# look for similar objects
for idx, obj in enumerate(self._objects):
replace_indirect_object(value, crossref, mark_orphans)
# -----
Collaborator

Suggested change
# -----

Author

@luis-cd Sep 8, 2025

# ----- marks the end of the inner function. I believe it adds readability. But we can remove it if you want.

if is_null_or_none(obj):
continue
assert obj is not None, "mypy" # mypy: TypeGuard of `is_null_or_none` does not help here.
assert obj is not None
Collaborator

Please elaborate on this change.

Author

Probably my mistake. Having the "mypy" message in the assertion is a good idea.

self._idnum_hash[h][1].append(obj.indirect_reference)
hsh = obj.hash_value()
if hsh in self._idnum_hash:
# duplicate: record its indirect and clear the slot
Collaborator

"its indirect" what?

Author

Good catch — that comment is a bit shorthand-y.

The “its indirect” means its IndirectObject reference (the pointer that other PDF objects use to refer to it).
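
As a small illustration (a fresh writer with one blank page; the printed idnum is just an example):

from pypdf import PdfWriter

writer = PdfWriter()
page = writer.add_blank_page(width=72, height=72)

# Other PDF objects (e.g. the page tree) store this reference rather than
# embedding the page object itself.
ref = page.indirect_reference
print(ref.idnum, ref.generation)  # e.g. "4 0"
resolved = ref.get_object()       # follows the reference back to the PageObject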

try:
orphans[self._ID.indirect_reference.idnum - 1] = False
except AttributeError:
logger_debug(
Collaborator

  1. Instead of catching exception, please prefer checking the availability beforehand.
  2. I do not see any benefit of the debug logging in this case.
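
Roughly, the first point could look like this, reusing the names from the diff above (sketch only):

# Check availability up front instead of catching AttributeError:
if self._ID is not None and self._ID.indirect_reference is not None:
    orphans[self._ID.indirect_reference.idnum - 1] = False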

Mark reachable objects during the rewrite so that the caller
can later sweep truly unreachable ones.

--- SECURITY CONCERN !!!
Collaborator

Please propose a solution for this concern.

Author

The dirtiest solution is to set a MAX_DEPTH variable and, if it is exceeded, stop the mark phase and the compression and raise the necessary errors. Timeouts or iteration caps of some sort go in the same direction. Although I believe it's a good idea to implement this, even as optional arguments, I don't think it would be a complete solution.

A better, more complete solution would be to use an explicit stack instead of recursion.
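
A sketch of the depth-cap option mentioned above; MAX_DEPTH and get_references are hypothetical names, not pypdf API:

MAX_DEPTH = 1000  # hypothetical hard limit

def mark_reachable_recursive(obj, orphans, get_references, depth=0):
    # Recursive mark phase with a hard cap: a maliciously deep reference
    # chain aborts the compression instead of exhausting the stack.
    if depth > MAX_DEPTH:
        raise RecursionError("compress_identical_objects: reference chain too deep")
    for ref in get_references(obj):
        idx = ref.idnum - 1
        if orphans[idx]:           # still unvisited
            orphans[idx] = False   # mark reachable
            mark_reachable_recursive(ref.get_object(), orphans, get_references, depth + 1)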

orphans[value.idnum - 1] = False
else:
mark_orphans(value)
# -----
Collaborator

This comment is superfluous.

try:
orphans[self._ID.indirect_reference.idnum - 1] = False # type: ignore
orphans[self._ID.indirect_reference.idnum - 1] = False
Collaborator

See above.

for i in compress(range(len(self._objects)), orphans):
self._objects[i] = None


Collaborator

These empty lines look wrong.

# remove duplicated page and compress
try:
writer_double.remove_page(len(writer_double.pages) - 1)
except TypeError:
Collaborator

We should not need any exception handling in tests. Could you please elaborate why this is required here?

compress_identical_objects restores the PDF size to approximately the single-page
size after removing a duplicated page.
"""
SAMPLE_URLS = [
Collaborator

@stefan6419846 Sep 8, 2025

The preferred way to handle multiple cases is to use parameterization as this produces nicer error messages.

Additionally: No need to download the sample files - they can be used directly from the source checkout and pytest.mark.enable_socket can be replaced by pytest.mark.samples.
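
As a sketch of that suggestion (the sample file names and the SAMPLE_ROOT layout are assumptions; the markers follow the reviewer's hint):

from pathlib import Path

import pytest
from pypdf import PdfReader, PdfWriter

SAMPLE_ROOT = Path(__file__).parent.parent / "sample-files"  # assumed checkout layout

@pytest.mark.samples
@pytest.mark.parametrize(
    "sample_path",
    [
        SAMPLE_ROOT / "some-sample.pdf",     # hypothetical file names
        SAMPLE_ROOT / "another-sample.pdf",
    ],
)
def test_compress_after_removing_duplicated_pages(sample_path):
    reader = PdfReader(sample_path)
    writer = PdfWriter()
    writer.append(reader)  # original pages
    writer.append(reader)  # duplicate all pages
    writer.remove_page(len(writer.pages) - 1)
    writer.compress_identical_objects(remove_identicals=True, remove_orphans=True)
    assert len(writer.pages) == 2 * len(reader.pages) - 1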


# Verificamos que el documento sigue intacto (sigue teniendo 1 página)
assert len(writer.pages) == 1

Collaborator

Suggested change

writer = PdfWriter()
writer.add_blank_page(width=72, height=72)

# Ejecutamos con ambos flags en False
Collaborator

Please use English (instead of Spanish?)

Collaborator

@stefan6419846 left a comment

Thanks for the PR.

I have added some review comments. Additionally, please have a look at the failing checks.

luis-cd and others added 2 commits September 8, 2025 15:41
Empty line removal

Co-authored-by: Stefan <[email protected]>
Removes sneaky indentation error

Co-authored-by: Stefan <[email protected]>
Empty line style correction

Co-authored-by: Stefan <[email protected]>
Successfully merging this pull request may close these issues:

PdfWriter remove_page/compress_identical_objects does not clean up properly