The Wiert Corner – irregular stream of stuff

Jeroen W. Pluimers on .NET, C#, Delphi, databases, and personal interests


Installing Poppler on Windows via Chocolatey, which includes pdfimages for lossless extraction of images from PDF files

Posted by jpluimers on 2026/03/09

At the time of writing, the [Wayback/Archive] Chocolatey Software | Poppler 0.89.0 package was almost 3 years old, so I filed the issue [Wayback/Archive] poppler 23.03 has been out for a few weeks, can you please update the build? · Issue #88 · chtof/chocolatey-packages, mentioning [Wayback/Archive] Pull requests · oschwartz10612/poppler-windows.

Poppler 23.03.0

Since that never really got solved, I eventually found out that, after installing scoop, running scoop install poppler did work and installed version 23.08.0. I documented this in [Wayback/Archive] Poppler version out of date · Issue #75 · chtof/chocolatey-packages; the scoop package installs from the most recent [Wayback/Archive] Releases · oschwartz10612/poppler-windows.
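A quick way to confirm that the install put the tools on the PATH is a small check like the one below. It is only a sketch in Python: the tool names are the two poppler-utils commands used in this post, and the helper name find_poppler_tools is made up for illustration.

```python
import shutil

# The two poppler-utils tools used in this post.
POPPLER_TOOLS = ("pdftotext", "pdfimages")


def find_poppler_tools(tools=POPPLER_TOOLS):
    """Map each tool name to its full path, or to None when it is not on the PATH."""
    return {tool: shutil.which(tool) for tool in tools}


if __name__ == "__main__":
    for tool, path in find_poppler_tools().items():
        print(f"{tool}: {path or 'not found on PATH'}")
```

If a tool shows up as not found right after installing, opening a new console window so the updated PATH is picked up is usually the first thing to try.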

A very different approach is to install Poppler inside Windows Subsystem for Linux (WSL) as explained in [Wayback/Archive] Poppler On Windows. Python, PDFs, and Window’s Subsytem for… | by Matthew Earl Miller | Towards Data Science.

I needed Poppler (or actually the Windows equivalent of poppler-utils) for two reasons:

  1. I wanted to experiment with pdftotext, as it has some very compelling command-line switches.
  2. I needed to export images, for which pdfimages is the go-to Poppler tool.

pdftotext

Let’s start with quotes from [Wayback/Archive] pdftotext: Portable Document Format (PDF) to text converter (version 3.03) | poppler-utils Commands | Man Pages | ManKier:

-layout

Maintain (as best as possible) the original physical layout of the text.  The default is to ‘undo’ physical layout (columns, hyphenation, etc.) and output the text in reading order.

-fixed number

Assume fixed-pitch (or tabular) text, with the specified character width (in points).  This forces physical layout mode.

-raw

Keep the text in content stream order.  This is a hack which often “undoes” column formatting, etc.  Use of raw mode is no longer recommended.

-nodiag

Discard diagonal text (i.e., text that is not close to one of the 0, 90, 180, or 270 degree axes). This is useful for skipping watermarks drawn on body text.

In my case, -raw was the best option for exporting bank account statements to text for post-processing, even though the man page no longer recommends it.
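For batch work it can help to wrap the call in a few lines of script. The following is a minimal sketch, assuming pdftotext is on the PATH and takes its arguments as options, then the PDF file, then the text file, as the man page describes. The function names and the file names in the usage note are placeholders, not part of Poppler.

```python
import subprocess


def pdftotext_command(pdf_path, txt_path, mode="-raw"):
    """Build a pdftotext command line; mode is one switch such as -raw or -layout."""
    return ["pdftotext", mode, pdf_path, txt_path]


def convert(pdf_path, txt_path, mode="-raw"):
    """Run pdftotext and raise CalledProcessError when it exits with an error."""
    subprocess.run(pdftotext_command(pdf_path, txt_path, mode), check=True)
```

Calling convert("statement.pdf", "statement.txt") would then write the text in content stream order; passing mode="-layout" makes it easy to compare the two outputs for the same statement.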

pdfimages

And some quotes from [Wayback/Archive] pdfimages: Portable Document Format (PDF) image extractor (version 3.03) | poppler-utils Commands | Man Pages | ManKier:

-all

Write JPEG, JPEG2000, JBIG2, and CCITT images in their native format. CMYK files are written as TIFF files. All other images are written as PNG files. This is equivalent to specifying the options -png -tiff -j -jp2 -jbig2 -ccitt.

-list

Instead of writing the images, list the images along with various information for each image. Do not specify an image-root with this option.

-p

Include page numbers in output file names.

-q

Don’t print any messages or errors.

In my case, -list showed me the available pictures, and the combination -all -p -q exported all images, included the page number in the exported image filenames, and suppressed error/warning messages. Omitting -q results in large sections of output like this (as most PDF document writers are not really conformant):

Syntax Error (5859555): Unknown compression method in flate stream
Syntax Error (6188476): Bad two dim code (0001) in CCITTFax stream
Syntax Error (6188699): Bad two dim code (0001) in CCITTFax stream
Syntax Error (6188874): Bad two dim code (0001) in CCITTFax stream
Syntax Error (6189003): CCITTFax row is wrong length (54)
Syntax Error (6189004): CCITTFax row is wrong length (48)
Syntax Error (6189006): CCITTFax row is wrong length (47)
...
Syntax Error (6594849): Bad white code (0008) in CCITTFax stream
Syntax Error (6594849): Bad two dim code (0000) in CCITTFax stream
Syntax Error (6594968): CCITTFax row is wrong length (68)
Syntax Error (6594968): CCITTFax row is wrong length (68)
Syntax Error (6594968): CCITTFax row is wrong length (68)
Syntax Error (6594968): CCITTFax row is wrong length (92)
Syntax Error (6594968): CCITTFax row is wrong length (71)
Syntax Error (6594968): CCITTFax row is wrong length (1983)
Syntax Error (6594968): Bad two dim code (0001) in CCITTFax stream
Syntax Error (6595113): Invalid CCITTFax code
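The two pdfimages steps described above (list first, then export) can be scripted along these lines. This is a sketch only, assuming pdfimages is on the PATH. The helper names are hypothetical, as are the file name and image-root in the usage note.

```python
import subprocess


def list_images_command(pdf_path):
    """Command that lists the images in a PDF; -list takes no image-root."""
    return ["pdfimages", "-list", pdf_path]


def extract_images_command(pdf_path, image_root):
    """Command that writes all images in native format, with page numbers, quietly."""
    return ["pdfimages", "-all", "-p", "-q", pdf_path, image_root]


def run(command):
    """Run a command and return its standard output as text."""
    completed = subprocess.run(command, check=True, capture_output=True, text=True)
    return completed.stdout
```

For example, print(run(list_images_command("scan.pdf"))) shows the image table, after which run(extract_images_command("scan.pdf", "scan")) writes the image files using scan as the image-root.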

–jeroen
