-
Notifications
You must be signed in to change notification settings - Fork 1.5k
BUG: Handling array-based content stream -- specific to adding watermark #3537
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3537 +/- ##
==========================================
- Coverage 97.16% 97.15% -0.01%
==========================================
Files 57 57
Lines 9809 9810 +1
Branches 1781 1782 +1
==========================================
Hits 9531 9531
Misses 167 167
- Partials 111 112 +1 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
stefan6419846
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR.
I am not completely sure whether these changes actually are valid. Why can we do this on every object? What is the difference between clone and _clone?
Additionally, the tests do not completely look like they follow our usual pattern:
- Please prefer a more generic test file name, where the file could be used for more tests. It seems like it mostly covers DictionaryObject, thus we might use
tests/generic/test_dictionary_object.pyinstead. - Please keep exactly two empty lines between functions.
- Do we really need all those generic fixtures?
pytest.fail()should be avoided if possible. Why are regular assertions not sufficient?
|
|
||
| # Final checks after merge | ||
| # The original page should now have the resources and contents from the watermark | ||
| if NameObject("/Contents") not in page: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a test with controlled inputs. Why do we need this condition?
| assert len(reader.pages) == 1 | ||
|
|
||
| resources_obj = page["/Resources"].get_object() | ||
| assert len(resources_obj.get("/ProcSet", [])) > 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we able to check exact values here?
| watermark_page = watermark_reader.pages[0] | ||
|
|
||
| # Store an original property before cloning | ||
| original_media_box = watermark_page[NameObject("/MediaBox")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| original_media_box = watermark_page[NameObject("/MediaBox")] | |
| original_media_box = watermark_page.mediabox |
| content_array = source_page.get(NameObject("/Contents")).get_object() | ||
|
|
||
| # test presence of indirect_reference attribute | ||
| if isinstance(content_array, ArrayObject): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this heavy construct for controlled inputs?
| assert content is not None | ||
|
|
||
| # Check that the process completed successfully and resulted in a valid object type | ||
| assert isinstance(content, (StreamObject, ArrayObject)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we know which of these it is?
| """ | ||
| original_data = b"BT /F0 12 Tf 50 50 Td (Test Content) Tj ET" | ||
|
|
||
| #Create a populated StreamObject (which is truthy: len() > 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| #Create a populated StreamObject (which is truthy: len() > 0) | |
| # Create a populated StreamObject (which is truthy: len() > 0) |
| populated_stream = StreamObject() | ||
| populated_stream.set_data(original_data) | ||
|
|
||
| #Convert the Python integer len(original_data) to a PdfObject (NumberObject) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment is superfluous.
| populated_stream[NameObject("/Length")] = NumberObject(len(original_data)) | ||
| populated_stream[NameObject("/Filter")] = NameObject("/FlateDecode") | ||
|
|
||
| assert len(populated_stream) > 0 # Not falsy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| assert len(populated_stream) > 0 # Not falsy | |
| assert len(populated_stream) > 0 # Not falsy |
Additionally, shouldn't we know the exact length?
| pdf_writer = create_pdf_writer | ||
| # Clone and verify it completes successfully | ||
| cloned = content_stream_with_data.clone(pdf_writer, force_duplicate=True) | ||
| # The _clone method should have been called, so verify the clone is valid |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would happen without the _clone method?
Hi, I am a first time contributor and wanted to contribute to the project. I attempted fix for #3497 - specifically the watermark issue. Tested the fixed code, seems to add watermarks to the pdf document properly after local testing.
Overview
Observation: After looking closely, I saw that when calling page.merge_page(), the recursive deep copy of PDF objects (specifically the ArrayObject inside the /ProcSet resource dictionary, or other arrays) failed because the parent cloning function was attempting to call obj._clone(). The method name expected by the recursive cloning routine (_clone) did not match the implemented public method (clone) on the ArrayObject and possibly other generic types.
Fix: Added a specific check for hasattr(data, "_clone") before attempting to treat a PDF object as a generic dictionary. This ensures that objects like ArrayObject (which incorrectly appeared as a dictionary-like type during the deep-copy recursion) are now correctly identified as not being a DictionaryObject—the only class defining the private _clone helper—and are allowed to fall through to the proper list/tuple handling logic, thus preventing the access of a non-existent _clone attribute.
Files affected: pypdf/generic/_data_structures.py
Note - did add test cases in pypdf/tests/generic/test_array_based_input.py.