BUG: Handling array-based content stream -- specific to adding watermark #3537

RichPereira · 2025-11-27T02:56:29Z

Hi, I am a first-time contributor and wanted to contribute to the project. I attempted a fix for #3497, specifically the watermark issue. In local testing, the fixed code adds watermarks to the PDF document properly.

Overview
Observation: When calling page.merge_page(), the recursive deep copy of PDF objects (specifically the ArrayObject inside the /ProcSet resource dictionary, or other arrays) failed because the parent cloning function attempted to call obj._clone(). The method name expected by the recursive cloning routine (_clone) did not match the public method (clone) implemented on ArrayObject and possibly other generic types.

Fix: Added a hasattr(data, "_clone") check before treating a PDF object as a generic dictionary. Objects like ArrayObject, which incorrectly appeared dictionary-like during the deep-copy recursion, are now recognized as not being a DictionaryObject (the only class defining the private _clone helper) and fall through to the proper list/tuple handling logic. This prevents access to a non-existent _clone attribute.
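
To make the guard concrete, here is a minimal sketch that uses stand-in classes rather than pypdf's real ones. ToyArray, ToyDictionary, copy_value, and copy_value_unguarded are hypothetical names introduced only for illustration. The only details taken from the description above are that the dictionary type alone defines a private _clone helper and that the array type exposes only a public clone.

```python
class ToyArray(list):
    """Stand-in for an array-like PDF object: public clone() only."""

    def clone(self):
        # Copy element by element, recursing into nested values.
        return ToyArray(copy_value(item) for item in self)


class ToyDictionary(dict):
    """Stand-in for a dictionary-like PDF object: also has a private _clone()."""

    def clone(self):
        result = ToyDictionary()
        result._clone(self)
        return result

    def _clone(self, src):
        # Private helper: fill this (empty) dictionary from src.
        for key, value in src.items():
            self[key] = copy_value(value)


def copy_value_unguarded(data):
    # Assumes every container defines _clone(); this breaks on ToyArray.
    if isinstance(data, (ToyDictionary, ToyArray)):
        result = data.__class__()
        result._clone(data)
        return result
    return data


def copy_value(data):
    # Guarded version: take the dictionary path only when _clone exists.
    if hasattr(data, "_clone"):
        result = data.__class__()
        result._clone(data)
        return result
    # Arrays (and plain lists/tuples) fall through to sequence handling.
    if isinstance(data, (list, tuple)):
        return data.__class__(copy_value(item) for item in data)
    return data


resources = ToyDictionary({"/ProcSet": ToyArray(["/PDF", "/Text"])})
copied = copy_value(resources)
print(copied == resources, copied["/ProcSet"] is resources["/ProcSet"])

try:
    copy_value_unguarded(ToyArray(["/PDF"]))
except AttributeError as exc:
    print("unguarded copy failed:", exc)
```

In this sketch the unguarded variant raises AttributeError on the array, while the guarded variant copies the nested array through the list/tuple branch. Whether a hasattr check or an explicit isinstance check against the dictionary class is the better guard in pypdf itself is a design question this toy example does not settle.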

Files affected: pypdf/generic/_data_structures.py

Note: I added test cases in pypdf/tests/generic/test_array_based_input.py.

codecov · 2025-11-27T03:05:46Z

Codecov Report

❌ Patch coverage is 66.66667% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 97.15%. Comparing base (310e571) to head (7a53028).
⚠️ Report is 1 commit behind head on main.

Files with missing lines	Patch %	Lines
pypdf/generic/_data_structures.py	66.66%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3537      +/-   ##
==========================================
- Coverage   97.16%   97.15%   -0.01%     
==========================================
  Files          57       57              
  Lines        9809     9810       +1     
  Branches     1781     1782       +1     
==========================================
  Hits         9531     9531              
  Misses        167      167              
- Partials      111      112       +1

stefan6419846

Thanks for the PR.

I am not completely sure whether these changes actually are valid. Why can we do this on every object? What is the difference between clone and _clone?

Additionally, the tests do not quite seem to follow our usual pattern:

Please prefer a more generic test file name, where the file could be used for more tests. It seems like it mostly covers DictionaryObject, thus we might use tests/generic/test_dictionary_object.py instead.
Please keep exactly two empty lines between functions.
Do we really need all those generic fixtures?
pytest.fail() should be avoided if possible. Why are regular assertions not sufficient?
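
As a rough illustration of the points above, a test module in that style might look like the following sketch. clone_resources is a hypothetical stand-in (built on copy.deepcopy) for whatever pypdf call the real tests would exercise, and the test names are made up for this example.

```python
import copy


def clone_resources(resources):
    # Hypothetical stand-in for the pypdf call under test.
    return copy.deepcopy(resources)


def test_clone_keeps_array_values():
    resources = {"/ProcSet": ["/PDF", "/Text"]}

    cloned = clone_resources(resources)

    # An unexpected exception already fails the test with a full traceback,
    # so no try/except around the call and no pytest.fail() are needed.
    assert cloned == resources


def test_clone_does_not_share_nested_array():
    resources = {"/ProcSet": ["/PDF", "/Text"]}

    cloned = clone_resources(resources)
    cloned["/ProcSet"].append("/ImageB")

    # Mutating the clone must leave the original array untouched.
    assert resources["/ProcSet"] == ["/PDF", "/Text"]
```

The inputs are built inline instead of through fixtures, the functions are separated by exactly two empty lines, and each expectation is a plain assert.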

stefan6419846 · 2025-12-01T12:33:40Z