Follow XMP metadata recommendations from PDF specification #3396

@stefan6419846

Explanation

We currently use the xml.dom.minidom stdlib package to work with XMP metadata. This does not completely fulfill the recommendations from Annex H, section H.2, of the PDF 2.0 specification regarding XMP metadata:

  • Document.toxml() always adds the <?xml version="1.0" ?> header to the serialized packet.
  • We write no padding between </x:xmpmeta> and <?xpacket end="w"?>, although the specification recommends "white-space padding to permit in-place updating of metadata" there.

Section H.7.5 has an additional note, which might be considered outdated:

Note that applications which fully understand PDF updating do not usually update in-place
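
To illustrate the padding recommendation, here is a minimal sketch which inserts white-space before the trailing `<?xpacket end=...?>` instruction of an already serialized packet. The helper name `add_padding` and the 2048-character default are assumptions chosen for illustration; neither is part of pypdf.

```python
def add_padding(packet: str, size: int = 2048) -> str:
    """Insert `size` white-space characters before the trailing xpacket PI.

    Hypothetical helper, not part of pypdf. The default size is an assumption.
    """
    marker = "<?xpacket end="
    head, separator, tail = packet.rpartition(marker)
    if not separator:
        raise ValueError("No trailing xpacket processing instruction found.")
    # Build lines of 99 spaces plus a newline and cut to the requested size.
    line = " " * 99 + "\n"
    padding = (line * (size // len(line) + 1))[:size]
    # Drop white-space which is already there to avoid accumulating padding.
    return head.rstrip() + "\n" + padding + separator + tail
```

Calling `add_padding(xmp_document_as_string)` before encoding the data would leave room for packet-scanning processors, at the cost of a slightly larger file.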

Code Example

Exactly like in pypdf/docs/user/metadata.md, lines 139 to 162 in 0b64266:

from pypdf import PdfWriter
writer = PdfWriter(clone_from="example.pdf")
metadata = writer.xmp_metadata
assert metadata # Ensure that it is not `None`.
rdf_root = metadata.rdf_root
xmp_meta = rdf_root.parentNode
xmp_document = xmp_meta.parentNode
# Please note that without a text node, the corresponding elements might
# be omitted completely.
pdfuaid_description = xmp_document.createElement("rdf:Description")
pdfuaid_description.setAttribute("rdf:about", "")
pdfuaid_description.setAttribute("xmlns:pdfuaid", "http://www.aiim.org/pdfua/ns/id/")
pdfuaid_part = xmp_document.createElement("pdfuaid:part")
pdfuaid_part_text = xmp_document.createTextNode("1")
pdfuaid_part.appendChild(pdfuaid_part_text)
pdfuaid_description.appendChild(pdfuaid_part)
rdf_root.appendChild(pdfuaid_description)
metadata.stream.set_data(xmp_document.toxml().encode("utf-8"))
writer.write("output.pdf")

I currently use monkey-patching, which does not feel right:

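# Replacement for Document.writexml() which serializes only the child nodes
# and thus skips the XML declaration.
def writexml(minidom_writer, indent='', addindent='', newl='', encoding=None, standalone=None):
    for _child_node in xmp_document.childNodes:
        _child_node.writexml(minidom_writer, indent, addindent, newl)


# Assigned on the instance, thus no implicit `self` is passed.
xmp_document.writexml = writexml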
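
A possible alternative without monkey-patching is to serialize the child nodes of the document directly, as only `Document.writexml()` emits the XML declaration. The following is a minimal sketch using the stdlib only; the helper name `serialize_without_declaration` and the sample packet are assumptions for illustration, not pypdf API.

```python
from xml.dom import minidom

# Shortened sample packet, for illustration only.
XMP_PACKET = (
    '<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>'
    '<x:xmpmeta xmlns:x="adobe:ns:meta/">'
    '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"/>'
    '</x:xmpmeta>'
    '<?xpacket end="w"?>'
)


def serialize_without_declaration(document: minidom.Document) -> str:
    """Serialize all top-level nodes, skipping the XML declaration.

    Hypothetical helper, not part of pypdf.
    """
    return "".join(node.toxml() for node in document.childNodes)


document = minidom.parseString(XMP_PACKET)
serialized = serialize_without_declaration(document)
```

With the example from above, this would correspond to `metadata.stream.set_data(serialize_without_declaration(xmp_document).encode("utf-8"))`.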

According to section 7.3.2 of the XMP Specification Part 1, we might have to use end="r" to disallow in-place modifications as we leave no whitespace:

The end="w" or end="r" portion shall be used by packet scanning processors to determine whether the XMP may be modified in-place. The end="w" form indicates writable and the end="r" form indicates read-only. The writable or read-only indication should be ignored by all “smart” (not packet scanning) processors.
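
If we decide to mark the packet as read-only, the trailing processing instruction could be adjusted on the minidom document before serializing. This is a minimal sketch; the helper name `set_packet_writable` is an assumption for illustration and not part of pypdf.

```python
from xml.dom import minidom


def set_packet_writable(document: minidom.Document, writable: bool) -> bool:
    """Set the trailing xpacket PI to end="w" or end="r".

    Hypothetical helper, not part of pypdf. Returns whether a matching
    processing instruction has been found.
    """
    for node in document.childNodes:
        if (
            node.nodeType == node.PROCESSING_INSTRUCTION_NODE
            and node.target == "xpacket"
            and node.data.startswith("end=")
        ):
            node.data = 'end="w"' if writable else 'end="r"'
            return True
    return False


sample = minidom.parseString(
    '<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>'
    '<x:xmpmeta xmlns:x="adobe:ns:meta/"/>'
    '<?xpacket end="w"?>'
)
found = set_packet_writable(sample, writable=False)
trailer = sample.childNodes[-1].toxml()
```

As the quoted section states, "smart" processors should ignore this flag, so it would only matter for packet scanning tools.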
