The Wiert Corner – irregular stream of stuff

Jeroen W. Pluimers on .NET, C#, Delphi, databases, and personal interests


Removing identifiable metadata from PDF files

Posted by jpluimers on 2023/06/19

I archived a long thread that started with [Archive] 𝚓𝚘𝚗𝚗𝚒﹏𝚜𝚊𝚞𝚗𝚍𝚎𝚛𝚜 on Twitter: “More fun publisher surveillance: Elsevier embeds a hash in the PDF metadata that is unique for each time a PDF is downloaded, this is a diff between metadata from two of the same paper. Combined with access timestamps, they can uniquely identify the source of any shared PDFs.” / Twitter at [Wayback/Archive] Thread by @json_dirs on Thread Reader App – Thread Reader App.

TL;DR: publishers put hashes in PDF metadata to trace redistribution back to the downloader; they rarely use smarter watermarking because such watermarks are difficult to parse automatically; the hashes are easy to remove.
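The diff mentioned in the thread can be reproduced for any two downloads of the same paper. Below is a minimal Python sketch, not the method used in the thread: it assumes exiftool is on the PATH and uses its -j (JSON output) option. The names read_metadata, diff_metadata and IGNORED_TAGS are made up for this illustration, and the list of ignored file-system tags is an assumption you may need to extend.

```python
import json
import subprocess

# File-system tags that differ between any two files on disk.
# This ignore list is an assumption for illustration; extend it as needed.
IGNORED_TAGS = {
    "SourceFile", "FileName", "Directory",
    "FileModifyDate", "FileAccessDate", "FileInodeChangeDate",
}


def read_metadata(path):
    """Return the tags exiftool reports for one file as a dict."""
    result = subprocess.run(
        ["exiftool", "-j", str(path)],
        capture_output=True, text=True, check=True,
    )
    # exiftool -j prints a JSON array with one object per input file.
    return json.loads(result.stdout)[0]


def diff_metadata(first, second, ignored=IGNORED_TAGS):
    """Map each differing tag to its (first, second) pair of values."""
    tags = (first.keys() | second.keys()) - set(ignored)
    return {
        tag: (first.get(tag), second.get(tag))
        for tag in sorted(tags)
        if first.get(tag) != second.get(tag)
    }
```

Calling diff_metadata(read_metadata("download1.pdf"), read_metadata("download2.pdf")) on two downloads (the file names are placeholders) lists the tags whose values differ; any tag that changes on every download is a candidate tracking field.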

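To remove all of the top-level metadata, you can use exiftool and qpdf. The qpdf step matters: exiftool edits a PDF by appending an update, so the old metadata stays in the file and the edit can be reverted until qpdf rewrites the file: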

exiftool -all:all= <path.pdf> -o <output1.pdf>
qpdf --linearize <output1.pdf> <output2.pdf>
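If you need to do this for many files, the two commands can be wrapped in a small script. This is a sketch under assumptions, not a hardened tool: it assumes exiftool and qpdf are on the PATH, and the function names build_commands and strip_metadata are invented for this example. It runs exactly the two commands shown above, with a temporary file in place of <output1.pdf>.

```python
import subprocess
import tempfile
from pathlib import Path


def build_commands(source, intermediate, output):
    """Return the exiftool and qpdf invocations as argument lists."""
    return [
        # Step 1: write a copy of the PDF with all top-level metadata cleared.
        ["exiftool", "-all:all=", str(source), "-o", str(intermediate)],
        # Step 2: rewrite the file so the old metadata cannot be restored.
        ["qpdf", "--linearize", str(intermediate), str(output)],
    ]


def strip_metadata(source, output):
    """Run both steps, keeping the intermediate file in a temp directory."""
    with tempfile.TemporaryDirectory() as workdir:
        intermediate = Path(workdir) / "exiftool-output.pdf"
        for command in build_commands(source, intermediate, output):
            # check=True raises CalledProcessError if a tool exits non-zero.
            subprocess.run(command, check=True)
```

For example, strip_metadata("paper.pdf", "paper-clean.pdf") leaves the original untouched and writes the cleaned copy next to it; both file names are placeholders.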

To remove *all* metadata, you can use dangerzone or mat2.

Related links:

  • [Wayback/Archive] ExifTool by Phil Harvey

    ExifTool is a platform-independent Perl library plus a command-line application for reading, writing and editing meta information in a wide variety of files. ExifTool supports many different metadata formats including EXIF, GPS, IPTC, XMP, JFIF, GeoTIFF, ICC Profile, Photoshop IRB, FlashPix, AFCP and ID3, as well as the maker notes of many digital cameras by Canon, Casio, DJI, FLIR, FujiFilm, GE, GoPro, HP, JVC/Victor, Kodak, Leaf, Minolta/Konica-Minolta, Motorola, Nikon, Nintendo, Olympus/Epson, Panasonic/Leica, Pentax/Asahi, Phase One, Reconyx, Ricoh, Samsung, Sanyo, Sigma/Foveon and Sony.

  • [Wayback/Archive] QPDF: A Content-Preserving PDF Transformation System
    What is QPDF?

    QPDF is a command-line tool and C++ library that performs content-preserving transformations on PDF files. It supports linearization, encryption, and numerous other features. It can also be used for splitting and merging files, creating PDF files (but you have to supply all the content yourself), and inspecting files for study or analysis. QPDF does not render PDFs or perform text extraction, and it does not contain higher-level interfaces for working with page contents. It is a low-level tool for working with the structure of PDF files and can be a valuable tool for anyone who wants to do programmatic or command-line-based manipulation of PDF files.

  • [Wayback/Archive] Dangerzone

    Take potentially dangerous PDFs, office documents, or images and convert them to a safe PDF.

    HOW IT WORKS

    Dangerzone works like this: You give it a document that you don’t know if you can trust (for example, an email attachment). Inside of a sandbox, dangerzone converts the document to a PDF (if it isn’t already one), and then converts the PDF into raw pixel data: a huge list of RGB color values for each page. Then, in a separate sandbox, dangerzone takes this pixel data and converts it back into a PDF.

  • [Wayback/Archive] Elsevier PDF “hashes”
  • [Wayback/Archive] Strip PDF Metadata (gist)

–jeroen

