Skip to content

Conversation

@RichPereira
Copy link

@RichPereira RichPereira commented Nov 27, 2025

Hi, I am a first time contributor and wanted to contribute to the project. I attempted fix for #3497 - specifically the watermark issue. Tested the fixed code, seems to add watermarks to the pdf document properly after local testing.

Overview
Observation: After looking closely, I saw that when calling page.merge_page(), the recursive deep copy of PDF objects (specifically the ArrayObject inside the /ProcSet resource dictionary, or other arrays) failed because the parent cloning function was attempting to call obj._clone(). The method name expected by the recursive cloning routine (_clone) did not match the implemented public method (clone) on the ArrayObject and possibly other generic types.

Fix: Added a specific check for hasattr(data, "_clone") before attempting to treat a PDF object as a generic dictionary. This ensures that objects like ArrayObject (which incorrectly appeared as a dictionary-like type during the deep-copy recursion) are now correctly identified as not being a DictionaryObject—the only class defining the private _clone helper—and are allowed to fall through to the proper list/tuple handling logic, thus preventing the access of a non-existent _clone attribute.

Files affected: pypdf/generic/_data_structures.py

Note - did add test cases in pypdf/tests/generic/test_array_based_input.py.

@codecov
Copy link

codecov bot commented Nov 27, 2025

Codecov Report

❌ Patch coverage is 66.66667% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 97.15%. Comparing base (310e571) to head (7a53028).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
pypdf/generic/_data_structures.py 66.66% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3537      +/-   ##
==========================================
- Coverage   97.16%   97.15%   -0.01%     
==========================================
  Files          57       57              
  Lines        9809     9810       +1     
  Branches     1781     1782       +1     
==========================================
  Hits         9531     9531              
  Misses        167      167              
- Partials      111      112       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@RichPereira RichPereira marked this pull request as draft November 27, 2025 04:33
@RichPereira RichPereira marked this pull request as ready for review November 28, 2025 03:05
Copy link
Collaborator

@stefan6419846 stefan6419846 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR.

I am not completely sure whether these changes actually are valid. Why can we do this on every object? What is the difference between clone and _clone?

Additionally, the tests do not completely look like they follow our usual pattern:

  • Please prefer a more generic test file name, where the file could be used for more tests. It seems like it mostly covers DictionaryObject, thus we might use tests/generic/test_dictionary_object.py instead.
  • Please keep exactly two empty lines between functions.
  • Do we really need all those generic fixtures?
  • pytest.fail() should be avoided if possible. Why are regular assertions not sufficient?


# Final checks after merge
# The original page should now have the resources and contents from the watermark
if NameObject("/Contents") not in page:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a test with controlled inputs. Why do we need this condition?

assert len(reader.pages) == 1

resources_obj = page["/Resources"].get_object()
assert len(resources_obj.get("/ProcSet", [])) > 0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we able to check exact values here?

watermark_page = watermark_reader.pages[0]

# Store an original property before cloning
original_media_box = watermark_page[NameObject("/MediaBox")]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
original_media_box = watermark_page[NameObject("/MediaBox")]
original_media_box = watermark_page.mediabox

content_array = source_page.get(NameObject("/Contents")).get_object()

# test presence of indirect_reference attribute
if isinstance(content_array, ArrayObject):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this heavy construct for controlled inputs?

assert content is not None

# Check that the process completed successfully and resulted in a valid object type
assert isinstance(content, (StreamObject, ArrayObject))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we know which of these it is?

"""
original_data = b"BT /F0 12 Tf 50 50 Td (Test Content) Tj ET"

#Create a populated StreamObject (which is truthy: len() > 0)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#Create a populated StreamObject (which is truthy: len() > 0)
# Create a populated StreamObject (which is truthy: len() > 0)

populated_stream = StreamObject()
populated_stream.set_data(original_data)

#Convert the Python integer len(original_data) to a PdfObject (NumberObject)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is superfluous.

populated_stream[NameObject("/Length")] = NumberObject(len(original_data))
populated_stream[NameObject("/Filter")] = NameObject("/FlateDecode")

assert len(populated_stream) > 0 # Not falsy
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
assert len(populated_stream) > 0 # Not falsy
assert len(populated_stream) > 0 # Not falsy

Additionally, shouldn't we know the exact length?

pdf_writer = create_pdf_writer
# Clone and verify it completes successfully
cloned = content_stream_with_data.clone(pdf_writer, force_duplicate=True)
# The _clone method should have been called, so verify the clone is valid
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would happen without the _clone method?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants