BUG: Fix compress_identical_objects handling of StreamObjects and rem… #3440

luis-cd · 2025-08-19T22:01:13Z

…ove_orphans

Background

When duplicating pages and then removing one, the file size after calling compress_identical_objects did not decrease as expected. The root causes were:

StreamObjects were not considered in duplicate compression.
The remove_orphans parameter was accepted but never actually applied.

Fix

Include StreamObjects in duplicate detection.
Ensure remove_orphans is honored and unreferenced objects are removed.

Tests

Added a regression test based on the original report (PdfWriter remove_page/compress_identical_objects does not clean up properly #3418).
The test duplicates pages from sample PDFs, removes one, calls compress_identical_objects, and checks that the file size approximates the original single-page PDF.

Note on file size

The absolute file size is not reduced back to the exact original value. This is still a problem: some additional overhead remains in the file structure even after deduplication. The important part is that the size no longer grows unbounded and that the remove_orphans parameter now works as documented.

Notes on Code Observations

During the implementation, several potential security and design issues were observed throughout the code. These are not addressed in this PR, but have been documented inline for visibility.

These items should be reviewed in future issues or PRs to improve overall security, maintainability, and robustness of the library.

By flagging them now, we ensure they are tracked and can be prioritized for proper fixes later.

…ove_orphans Closes py-pdf#3418 Previously, StreamObjects were not considered in duplicate compression and the remove_orphans parameter was ignored. This caused unexpected file size increases when removing duplicated pages. This commit ensures StreamObjects are handled and remove_orphans removes unreferenced objects correctly. Regression test included.

codecov · 2025-08-19T22:11:32Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 96.99%. Comparing base (ee922f3) to head (b523768).
⚠️ Report is 8 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3440      +/-   ##
==========================================
+ Coverage   96.97%   96.99%   +0.01%     
==========================================
  Files          54       54              
  Lines        9337     9379      +42     
  Branches     1711     1724      +13     
==========================================
+ Hits         9055     9097      +42     
  Misses        168      168              
  Partials      114      114

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Reordered imports alphabetically and replaced deprecated typing.Dict/typing.Tuple with dict/tuple where applicable. Trailing whitespace and blank line issues removed. Security and design issues are documented in code for future review.

pypdf/_utils.py

stefan6419846 · 2025-09-08T09:00:43Z

pypdf/_writer.py


        Args:
            remove_identicals: Remove identical objects.
            remove_orphans: Remove unreferenced objects.
+        """
+        if not remove_identicals and not remove_orphans:
+            logger_warning(


Why does this need a warning?

Usually, both args being set to True is a mistake. It can be the case that this happens programmatically as a result of some logic I don't even want to think about. In this case, just a warning is fine.

I assume you mean both arguments being set to False is a mistake. While this method does not have an effect in this case, I do not see why we should issue a warning. If we want to completely avoid this, we should issue a proper exception to point the user to a mis-use. Warnings might be read, but are not guaranteed to be always read.

I would advice not to use exceptions in this case. Warning is for debugging. But it's not my call.

stefan6419846 · 2025-09-08T09:02:09Z

pypdf/_writer.py

+                # recursive DFS (mark phase)
+                orphans = self._perform_orphan_recursive_mark_phase()
+            # sweep phase
+            assert orphans is not None


Why do we need this assertion?

Mostly for defensive programming and so that next developers see that orphans must be something at that point. If somebody changes the method _perform_orphan_recursive_mark_phase and the new logic doesn't necessarily return the orphans at that point, it will stop at the assertion. It's better to have it fail at that point than later when using orphans expecting something else than None.

stefan6419846 · 2025-09-08T09:02:39Z

pypdf/_writer.py

+            # When remove_orphans=True, also mark reachability while rewriting refs
+            orphans = self._remove_identical_objects(mark_orphans=remove_orphans)
+
+        if remove_orphans:


I assume that this closes #3306 as well?

Oh, yes. I didn't realize that issue was there

stefan6419846 · 2025-09-08T09:03:26Z

pypdf/_writer.py

+        can later sweep truly unreachable ones.
+
+        This function:
+        - groups objects by obj.hash_value(),  ---  WARNING: this can cause collisions !!!


Which real-world effect and probability does this warning have?

Maybe really large PDFs with repeating elements. I was just being extra. I don't really think it pays off to change this.

pypdf/_writer.py

stefan6419846 · 2025-09-08T09:04:34Z

pypdf/_writer.py

-            obj: PdfObject, crossref: dict[IndirectObject, IndirectObject]
+        Returns:
+            Optional[list[bool]]: a boolean list the size of self._objects where True means
+            "unvisited (potential orphan)" and False means "reachable". Returns None if


Is this indentation correct?

Yep 👍 — that indentation is correct.
Well as the Google-style docstring convention.

If you want to be super PEP conforming, we can add a blank line between the sections Args and Returns.

stefan6419846 · 2025-09-08T09:04:46Z

pypdf/_writer.py

+            "unvisited (potential orphan)" and False means "reachable". Returns None if
+            mark_orphans is False.
+        """
+        # -----


This looks superfluous.

We can change it. But I would make it clear that True means unvisited but not necessarily orphan, until the end of the mark phase.

stefan6419846 · 2025-09-08T09:05:25Z

pypdf/_writer.py

+            If mark_orphans is True, mark reachable objects during the rewrite so that the caller
+            can later sweep truly unreachable ones.
+
+            --- SECURITY CONCERN !!!


Please propose a solution on how you recommend to mitigate this.

If you're asking about the SECURITY CONCERN !!! thing

The dirtiest solution is to set a MAX_DEPTH int variable and if exceeded, stop the mark phase and the compression, raise the necessary errors. Timeouts or iteration caps of some short go in the same direction. Although I believe it's a good idea to implement this, even if they are optional arguments, I don't think it would be a complete solution.

A better, more complete solution, would be to use stacks instead of recursion.

stefan6419846 · 2025-09-08T09:05:42Z

pypdf/_writer.py

-        # look for similar objects
-        for idx, obj in enumerate(self._objects):
+                    replace_indirect_object(value, crossref, mark_orphans)
+        # -----


Suggested change

# -----

# ----- marks the end of the inner function. I believe it adds readability. But we can remove it if you want.

stefan6419846 · 2025-09-08T09:06:03Z

pypdf/_writer.py

            if is_null_or_none(obj):
                continue
-            assert obj is not None, "mypy"  # mypy: TypeGuard of `is_null_or_none` does not help here.
+            assert obj is not None


Please elaborate on this change.

Probably my mistake, having

, "mypy

in is a good idea.

stefan6419846 · 2025-09-08T09:06:24Z

pypdf/_writer.py

-                self._idnum_hash[h][1].append(obj.indirect_reference)
+            hsh = obj.hash_value()
+            if hsh in self._idnum_hash:
+                # duplicate: record its indirect and clear the slot


"its indirect" what?

Good catch — that comment is a bit shorthand-y.

The “its indirect” means its IndirectObject reference (the pointer that other PDF objects use to refer to it).

stefan6419846 · 2025-09-08T09:09:56Z

pypdf/_writer.py

+            try:
+                orphans[self._ID.indirect_reference.idnum - 1] = False
+            except AttributeError:
+                logger_debug(


Instead of catching exception, please prefer checking the availability beforehand.

I do not see any benefit of the debug logging in this case.

stefan6419846 · 2025-09-08T09:10:14Z

pypdf/_writer.py

+            Mark reachable objects during the rewrite so that the caller
+            can later sweep truly unreachable ones.
+
+            --- SECURITY CONCERN !!!


Please propose a solution for this concern.

The dirtiest solution is to set a MAX_DEPTH int variable and if exceeded, stop the mark phase and the compression, raise the necessary errors. Timeouts or iteration caps of some short go in the same direction. Although I believe it's a good idea to implement this, even if they are optional arguments, I don't think it would be a complete solution.

A better, more complete solution, would be to use stacks instead of recursion.

stefan6419846 · 2025-09-08T09:10:44Z

pypdf/_writer.py

+                    orphans[value.idnum - 1] = False
+                else:
+                    mark_orphans(value)
+        # -----


This comment is superfluous.

stefan6419846 · 2025-09-08T09:11:08Z

pypdf/_writer.py

        try:
-            orphans[self._ID.indirect_reference.idnum - 1] = False  # type: ignore
+            orphans[self._ID.indirect_reference.idnum - 1] = False


stefan6419846 · 2025-09-08T09:11:45Z

pypdf/_writer.py

        for i in compress(range(len(self._objects)), orphans):
            self._objects[i] = None

+


These empty lines look wrong.

tests/test_writer.py

stefan6419846 · 2025-09-08T09:12:45Z

tests/test_writer.py

+        # remove duplicated page and compress
+        try:
+            writer_double.remove_page(len(writer_double.pages) - 1)
+        except TypeError:


We should not need any exception handling in tests. Could you please elaborate why this is required here?

stefan6419846 · 2025-09-08T09:13:30Z

tests/test_writer.py

+    compress_identical_objects restores the PDF size to approximately the single-page
+    size after removing a duplicated page.
+    """
+    SAMPLE_URLS = [


The preferred way to handle multiple cases is to use parameterization as this produces nicer error messages.

Additionally: No need to download the sample files - they can be used directly from the source checkout and pytest.mark.enable_socket can be replaced by pytest.mark.samples.

stefan6419846 · 2025-09-08T09:14:33Z

tests/test_writer.py

+
+    # Verificamos que el documento sigue intacto (sigue teniendo 1 página)
+    assert len(writer.pages) == 1
+


Suggested change

stefan6419846 · 2025-09-08T09:15:11Z

tests/test_writer.py

+    writer = PdfWriter()
+    writer.add_blank_page(width=72, height=72)
+
+    # Ejecutamos con ambos flags en False


Please use English (instead of Spanish?)

stefan6419846

Thanks for the PR.

I have added some review comments. Additionally, please have a look at the failing checks.

Empty line removal Co-authored-by: Stefan <[email protected]>

Removes sneaky identation error Co-authored-by: Stefan <[email protected]>

empty line style correction Co-authored-by: Stefan <[email protected]>

luis-cd mentioned this pull request Aug 20, 2025

PdfWriter remove_page/compress_identical_objects does not clean up properly #3418

Open

TST: added test coverage for noop in compress_identical_objects

438d5dc