Removing identifiable metadata from PDF files
Posted by jpluimers on 2023/06/19
I archived a long thread that started with [Archive] @json_dirs on Twitter: “More fun publisher surveillance: Elsevier embeds a hash in the PDF metadata that is unique for each time a PDF is downloaded, this is a diff between metadata from two of the same paper. Combined with access timestamps, they can uniquely identify the source of any shared PDFs.” / Twitter at [Wayback/Archive] Thread by @json_dirs on Thread Reader App - Thread Reader App.
TL;DR: publishers put hashes in PDF metadata to trace redistributed copies back to the download; they rarely use smarter watermarking because such watermarks are difficult to parse automatically; the hashes can be easily removed.
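The diff idea from the thread can be sketched in a few lines of Python. This is only an illustration: it assumes the metadata of two downloads of the same paper has already been read into plain dictionaries (for instance from the JSON that `exiftool -j` prints), and the field names and values below are made up, not taken from a real PDF.

```python
def differing_fields(meta_a: dict, meta_b: dict) -> dict:
    """Return {field: (value_a, value_b)} for every field whose value differs
    between the two downloads, or that exists in only one of them."""
    diff = {}
    for key in sorted(set(meta_a) | set(meta_b)):
        a, b = meta_a.get(key), meta_b.get(key)
        if a != b:
            diff[key] = (a, b)
    return diff


# Hypothetical sample data: two downloads of the same paper.
first = {"Title": "Some paper", "Producer": "SomeTool", "ExampleTrackingField": "aaa111"}
second = {"Title": "Some paper", "Producer": "SomeTool", "ExampleTrackingField": "bbb222"}

print(differing_fields(first, second))
# → {'ExampleTrackingField': ('aaa111', 'bbb222')}
```

Any field that changes between two downloads of otherwise identical content is a candidate for a per-download identifier.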
To remove all of the top-level metadata, you can use exiftool and qpdf:
    exiftool -all:all= <path.pdf> -o <output1.pdf>
    qpdf --linearize <output1.pdf> <output2.pdf>
To remove *all* metadata, you can use dangerzone or mat2.
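For batch use, the exiftool and qpdf steps above could be wrapped in a small Python helper. This is a minimal sketch, assuming both tools are installed and on the PATH; the function names and file names are hypothetical. As far as I understand it, ExifTool writes PDF changes as an incremental update, so the old values stay recoverable until the file is rewritten, which is why the qpdf pass matters.

```python
import subprocess


def build_strip_commands(src: str, stripped: str, linearized: str) -> list:
    """Build the two command lines from the post as argument lists."""
    return [
        # Step 1: blank out all metadata tags ExifTool can write.
        ["exiftool", "-all:all=", src, "-o", stripped],
        # Step 2: have qpdf rewrite the file, so the result is a fresh PDF.
        ["qpdf", "--linearize", stripped, linearized],
    ]


def strip_top_level_metadata(src: str, stripped: str, linearized: str) -> None:
    """Run both steps in order; raises CalledProcessError if a tool fails."""
    for cmd in build_strip_commands(src, stripped, linearized):
        subprocess.run(cmd, check=True)


# Example call (file names are placeholders):
# strip_top_level_metadata("paper.pdf", "paper.stripped.pdf", "paper.clean.pdf")
```

Passing argument lists instead of one shell string avoids quoting problems with file names that contain spaces.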
Related links:
- [Wayback/Archive] ExifTool by Phil Harvey
ExifTool is a platform-independent Perl library plus a command-line application for reading, writing and editing meta information in a wide variety of files. ExifTool supports many different metadata formats including EXIF, GPS, IPTC, XMP, JFIF, GeoTIFF, ICC Profile, Photoshop IRB, FlashPix, AFCP and ID3, as well as the maker notes of many digital cameras by Canon, Casio, DJI, FLIR, FujiFilm, GE, GoPro, HP, JVC/Victor, Kodak, Leaf, Minolta/Konica-Minolta, Motorola, Nikon, Nintendo, Olympus/Epson, Panasonic/Leica, Pentax/Asahi, Phase One, Reconyx, Ricoh, Samsung, Sanyo, Sigma/Foveon and Sony.
- [Wayback/Archive] QPDF: A Content-Preserving PDF Transformation System
What is QPDF?
QPDF is a command-line tool and C++ library that performs content-preserving transformations on PDF files. It supports linearization, encryption, and numerous other features. It can also be used for splitting and merging files, creating PDF files (but you have to supply all the content yourself), and inspecting files for study or analysis. QPDF does not render PDFs or perform text extraction, and it does not contain higher-level interfaces for working with page contents. It is a low-level tool for working with the structure of PDF files and can be a valuable tool for anyone who wants to do programmatic or command-line-based manipulation of PDF files.
- [Wayback/Archive] Dangerzone
Take potentially dangerous PDFs, office documents, or images and convert them to a safe PDF.
HOW IT WORKS
Dangerzone works like this: You give it a document that you don’t know if you can trust (for example, an email attachment). Inside of a sandbox, dangerzone converts the document to a PDF (if it isn’t already one), and then converts the PDF into raw pixel data: a huge list of RGB color values for each page. Then, in a separate sandbox, dangerzone takes this pixel data and converts it back into a PDF.
- [Wayback/Archive] Elsevier PDF “hashes”
- [Wayback/Archive] Strip PDF Metadata (gist)
Related tweets:
- [Archive] @json_dirs on Twitter: “@AndyPerfors @ceptional good question, and unfortunately yes the metadata does seem to be intact on sci-hub. Gently, since we may be on same page, I think the time for being coy is over, and it’s time to start vocally advocating against and actively working to replace the for-profit publication system.” / Twitter
- [Archive] Karl Magnacca on Twitter: “@json_dirs I checked with ExifTool and these fields are also in pdfs from BioOne and T&F (and also Pensoft, but not really relevant there), but not Wiley. Also I only checked one so far, but a BioOne paper from SciHub has the metadata mostly stripped out compared to the version on RG.” / Twitter
- [Archive] Sophie, indistinguishable from random noise on Twitter: “@karanlyons @json_dirs I still think it’s websafe base64 (not standard base64, it’s using - and _ and not + and /), with implicit padding, but these are a lot of tokens, and there is nothing constant between them. It’s certainly a strange way to add a fingerprint, but I don’t believe it’s encrypted” / Twitter
- [Archive] Kukka de Bierguirb Häst on Twitter: “@json_dirs @SchmiegSophie 1) there are no actual spaces, exiftool inserts them at lowercase-uppercase borders 2) exiftool strips those periods in the tag so looks like `exiftool -b -xmp PDFFILE | grep -oP ”` is the only way to get the intact tag out” / Twitter
- [Archive] @json_dirs on Twitter: “@horsemankukka @SchmiegSophie yep seeing the same thing. updated the gist with the samples and some python code to extract, mine i think still needs a lil regex tweaking but definitely more pattern to be found now. Tma appears to be a suffix. also seeing columns of “lt”/”lw”, “o9e” and ”G” https://t.co/pjECXgwpwf” / Twitter
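The "websafe base64 with implicit padding" guess from the tweets above can be tried out with the Python standard library. This is a sketch under that assumption only: the sample token is generated on the spot, it is not a real publisher hash, and nothing here claims the decoded bytes of a real token mean anything.

```python
import base64


def decode_websafe_b64(token: str) -> bytes:
    """Decode URL-safe base64 (- and _ instead of + and /) whose trailing
    '=' padding was left off."""
    padding = "=" * (-len(token) % 4)  # restore the implicit padding
    return base64.urlsafe_b64decode(token + padding)


# Made-up sample: encode some bytes, then drop the padding like the tokens do.
sample = base64.urlsafe_b64encode(b"\xfb\xff\xfe example").decode().rstrip("=")
print(sample)                      # contains - and _ rather than + and /
print(decode_websafe_b64(sample))  # round-trips to the original bytes
```

If real tokens decode without errors but the bytes look random, that fits the observation in the thread that nothing is constant between them.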
–jeroen





