-
Notifications
You must be signed in to change notification settings - Fork 1.5k
BUG: Fix compress_identical_objects handling of StreamObjects and rem… #3440
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…ove_orphans Closes py-pdf#3418 Previously, StreamObjects were not considered in duplicate compression and the remove_orphans parameter was ignored. This caused unexpected file size increases when removing duplicated pages. This commit ensures StreamObjects are handled and remove_orphans removes unreferenced objects correctly. Regression test included.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #3440 +/- ##
==========================================
+ Coverage 96.97% 96.99% +0.01%
==========================================
Files 54 54
Lines 9337 9379 +42
Branches 1711 1724 +13
==========================================
+ Hits 9055 9097 +42
Misses 168 168
Partials 114 114 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Reordered imports alphabetically and replaced deprecated typing.Dict/typing.Tuple with dict/tuple where applicable. Trailing whitespace and blank line issues removed. Security and design issues are documented in code for future review.
|
||
Args: | ||
remove_identicals: Remove identical objects. | ||
remove_orphans: Remove unreferenced objects. | ||
""" | ||
if not remove_identicals and not remove_orphans: | ||
logger_warning( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does this need a warning?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usually, both args being set to True is a mistake. It can be the case that this happens programmatically as a result of some logic I don't even want to think about. In this case, just a warning is fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume you mean both arguments being set to False
is a mistake. While this method does not have an effect in this case, I do not see why we should issue a warning. If we want to completely avoid this, we should issue a proper exception to point the user to a mis-use. Warnings might be read, but are not guaranteed to be always read.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would advice not to use exceptions in this case. Warning is for debugging. But it's not my call.
# recursive DFS (mark phase) | ||
orphans = self._perform_orphan_recursive_mark_phase() | ||
# sweep phase | ||
assert orphans is not None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this assertion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly for defensive programming and so that next developers see that orphans must be something at that point. If somebody changes the method _perform_orphan_recursive_mark_phase and the new logic doesn't necessarily return the orphans at that point, it will stop at the assertion. It's better to have it fail at that point than later when using orphans expecting something else than None.
# When remove_orphans=True, also mark reachability while rewriting refs | ||
orphans = self._remove_identical_objects(mark_orphans=remove_orphans) | ||
|
||
if remove_orphans: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume that this closes #3306 as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, yes. I didn't realize that issue was there
can later sweep truly unreachable ones. | ||
|
||
This function: | ||
- groups objects by obj.hash_value(), --- WARNING: this can cause collisions !!! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which real-world effect and probability does this warning have?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe really large PDFs with repeating elements. I was just being extra. I don't really think it pays off to change this.
obj: PdfObject, crossref: dict[IndirectObject, IndirectObject] | ||
Returns: | ||
Optional[list[bool]]: a boolean list the size of self._objects where True means | ||
"unvisited (potential orphan)" and False means "reachable". Returns None if |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this indentation correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep 👍 — that indentation is correct.
Well as the Google-style docstring convention.
If you want to be super PEP conforming, we can add a blank line between the sections Args and Returns.
"unvisited (potential orphan)" and False means "reachable". Returns None if | ||
mark_orphans is False. | ||
""" | ||
# ----- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks superfluous.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can change it. But I would make it clear that True means unvisited but not necessarily orphan, until the end of the mark phase.
If mark_orphans is True, mark reachable objects during the rewrite so that the caller | ||
can later sweep truly unreachable ones. | ||
|
||
--- SECURITY CONCERN !!! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please propose a solution on how you recommend to mitigate this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you're asking about the SECURITY CONCERN !!! thing
The dirtiest solution is to set a MAX_DEPTH int variable and if exceeded, stop the mark phase and the compression, raise the necessary errors. Timeouts or iteration caps of some short go in the same direction. Although I believe it's a good idea to implement this, even if they are optional arguments, I don't think it would be a complete solution.
A better, more complete solution, would be to use stacks instead of recursion.
# look for similar objects | ||
for idx, obj in enumerate(self._objects): | ||
replace_indirect_object(value, crossref, mark_orphans) | ||
# ----- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# ----- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# ----- marks the end of the inner function. I believe it adds readability. But we can remove it if you want.
if is_null_or_none(obj): | ||
continue | ||
assert obj is not None, "mypy" # mypy: TypeGuard of `is_null_or_none` does not help here. | ||
assert obj is not None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please elaborate on this change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably my mistake, having
, "mypy
in is a good idea.
self._idnum_hash[h][1].append(obj.indirect_reference) | ||
hsh = obj.hash_value() | ||
if hsh in self._idnum_hash: | ||
# duplicate: record its indirect and clear the slot |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"its indirect" what?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch — that comment is a bit shorthand-y.
The “its indirect” means its IndirectObject reference (the pointer that other PDF objects use to refer to it).
try: | ||
orphans[self._ID.indirect_reference.idnum - 1] = False | ||
except AttributeError: | ||
logger_debug( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Instead of catching exception, please prefer checking the availability beforehand.
- I do not see any benefit of the debug logging in this case.
Mark reachable objects during the rewrite so that the caller | ||
can later sweep truly unreachable ones. | ||
|
||
--- SECURITY CONCERN !!! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please propose a solution for this concern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The dirtiest solution is to set a MAX_DEPTH int variable and if exceeded, stop the mark phase and the compression, raise the necessary errors. Timeouts or iteration caps of some short go in the same direction. Although I believe it's a good idea to implement this, even if they are optional arguments, I don't think it would be a complete solution.
A better, more complete solution, would be to use stacks instead of recursion.
orphans[value.idnum - 1] = False | ||
else: | ||
mark_orphans(value) | ||
# ----- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment is superfluous.
try: | ||
orphans[self._ID.indirect_reference.idnum - 1] = False # type: ignore | ||
orphans[self._ID.indirect_reference.idnum - 1] = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above.
for i in compress(range(len(self._objects)), orphans): | ||
self._objects[i] = None | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These empty lines look wrong.
# remove duplicated page and compress | ||
try: | ||
writer_double.remove_page(len(writer_double.pages) - 1) | ||
except TypeError: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should not need any exception handling in tests. Could you please elaborate why this is required here?
compress_identical_objects restores the PDF size to approximately the single-page | ||
size after removing a duplicated page. | ||
""" | ||
SAMPLE_URLS = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The preferred way to handle multiple cases is to use parameterization as this produces nicer error messages.
Additionally: No need to download the sample files - they can be used directly from the source checkout and pytest.mark.enable_socket
can be replaced by pytest.mark.samples
.
|
||
# Verificamos que el documento sigue intacto (sigue teniendo 1 página) | ||
assert len(writer.pages) == 1 | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
writer = PdfWriter() | ||
writer.add_blank_page(width=72, height=72) | ||
|
||
# Ejecutamos con ambos flags en False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use English (instead of Spanish?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR.
I have added some review comments. Additionally, please have a look at the failing checks.
Empty line removal Co-authored-by: Stefan <[email protected]>
Removes sneaky identation error Co-authored-by: Stefan <[email protected]>
empty line style correction Co-authored-by: Stefan <[email protected]>
…ove_orphans
Closes #3418
Background
When duplicating pages and then removing one, the file size after calling
compress_identical_objects
did not decrease as expected. The root causes were:StreamObject
s were not considered in duplicate compression.remove_orphans
parameter was accepted but never actually applied.Fix
StreamObject
s in duplicate detection.remove_orphans
is honored and unreferenced objects are removed.Tests
compress_identical_objects
, and checks that the file size approximates the original single-page PDF.Note on file size
The absolute file size is not reduced back to the exact original value. This is still a problem: some additional overhead remains in the file structure even after deduplication. The important part is that the size no longer grows unbounded and that the
remove_orphans
parameter now works as documented.Notes on Code Observations
During the implementation, several potential security and design issues were observed throughout the code. These are not addressed in this PR, but have been documented inline for visibility.
These items should be reviewed in future issues or PRs to improve overall security, maintainability, and robustness of the library.
By flagging them now, we ensure they are tracked and can be prioritized for proper fixes later.