Converting multi-page PDFs to several JPGs using ImageMagick and/or GhostScript - pdf

I am trying to convert a multi-page PDF file into a bunch of JPEGs, one for each page in the PDF. I have spent hours and hours looking up how to do this, and eventually I discovered that I need Ghostscript installed. So I got it from this website: http://downloads.ghostscript.com/public/ (I used the most recent link, "ghostscript-9.05.tar.gz", from Feb 8, 2012).
However, even with this installed/downloaded, I am still unable to do what I want. Should I have this saved somewhere special, like in the same folder as ImageMagick?
What I have figured out so far is this:
In Command Prompt I change the working directory to the ImageMagick folder, where that is saved.
I then type
convert "<full file path to pdf>" "<full file path to jpg>"
This is followed by a giant blob of error. It begins with:
Unrecoverable error: rangecheck in .setuserparams
Operand stack:
Followed by a blurb of unreadable numbers and caps. It ends with:
While reading gs_lev2.ps:
%%[ Error: invalidaccess; OffendingCommand: put ]%%
Needless to say, after hours and hours of deliberation, I don't think I am any closer to doing the seemingly simple task of converting this PDF into a JPG.
What I would like are some step by step instructions on how to make this work. Don't leave out anything, no matter how "obvious" it might seem (especially anything involving ghostscript). This has been troubling me and my supervisor for months now.
For further clarification, we are on a Windows XP operating system. The eventual intention is to call these command lines in R, the statistical language, and run it in a script. In addition, I have been able to successfully convert JPGs to PNG format and vice versa, but PDF just is not working.
Help!!!

You don't need ImageMagick for this; Ghostscript can do it all alone. (ImageMagick cannot do that conversion itself anyway; it HAS to use Ghostscript as its 'delegate'.)
Try this for directly using Ghostscript:
c:\path\to\gswin32c.exe ^
-o page_%03d.jpg ^
-sDEVICE=jpeg ^
d:/path/to/input.pdf
This will create a new JPEG for each page, and the filenames will increment as page_001.jpg, page_002.jpg,...
Note, this will also create JPEGs which use all the default settings of the jpeg device (one of the most important ones will be that the resolution will be 72dpi).
If you need a higher (or lower) resolution for your images, you can add other options:
gswin32c.exe ^
-o page_%03d.jpg ^
-sDEVICE=jpeg ^
-r300 ^
-dJPEGQ=100 ^
d:/path/to/input.pdf
-r300 sets the resolution to 300dpi and -dJPEGQ=100 sets the highest JPEG quality level (Ghostscript's default is 75).
Also note, please: JPEG is not well suited to represent shapes with sharp edges and high contrast in good quality (such as you typically see in black-on-white text pages with small characters).
The (lossy) JPEG compression method is optimized for continuous-tone pictures + photos, and not for line graphics. Therefore it is sub-optimal for such PostScript or PDF input pages which mainly contain text. Here, the lossy compression of the JPEG format will result in poorer quality output even if the input is excellent. See also the JPEG FAQ for more details on this topic.
You may get better image output by choosing PNG as the output format (PNG uses a lossless compression):
gswin32c.exe ^
-o page_%03d.png ^
-sDEVICE=png16m ^
-r150 ^
d:/path/to/input.pdf
The png16m device produces 24bit RGB color. You could swap this for pnggray (for pure grayscale output), png256 (for 8-bit color), png16 (4-bit color), pngmono (black and white only) or pngmonod (alternative black-and-white module).
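Since the eventual goal is to run this from a script, here is a minimal sketch of the same call in Python (chosen only because it is portable; the same argument list could be passed from R). The function names, the out_dir layout and the defaults are illustrative assumptions, not part of Ghostscript; only the switches shown in the commands above are used.

```python
import subprocess
from pathlib import Path


def gs_render_args(pdf, out_dir, device="png16m", dpi=150, gs_exe="gswin32c.exe"):
    """Build the argument list for rendering each PDF page to its own image.

    gs_exe is an assumption: use the full path to your Ghostscript binary
    if it is not on the PATH.
    """
    ext = "jpg" if device == "jpeg" else "png"
    # %03d is expanded by Ghostscript itself into 001, 002, ...
    pattern = Path(out_dir) / f"page_%03d.{ext}"
    return [gs_exe, "-o", str(pattern), f"-sDEVICE={device}", f"-r{dpi}", str(pdf)]


def render_pages(pdf, out_dir, **options):
    """Run Ghostscript; raises CalledProcessError if it exits with an error."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(gs_render_args(pdf, out_dir, **options), check=True)
```

Calling render_pages("d:/path/to/input.pdf", "pages", device="jpeg", dpi=300) would correspond to the 300dpi JPEG command, minus the -dJPEGQ switch. Because the arguments are passed as a list, no shell is involved and the % in the file name pattern needs no escaping.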

There are numerous SaaS services that will do this for you too. HyPDF and Blitline come to mind.

Related

Ghostscript to compress a batch of PDFs

I have no experience of programming.
My PDFs won't display images on the iPad in PDFExpert or GoodNotes as the images are in JPEG2000, from what I could find on the internet.
These are large PDFs, up to 1500-2000 pages with images. One of these was an 80MB or so file. I tried printing it with Foxit to convert the images to JPG from JPEG2000 but the file size jumped to 800MB...plus it's taking too long.
I stumbled upon Ghostscript, but I have NO clue how to use the command line interface.
I am very short on time. Pretty much need a step by step guide for a small script that converts all my PDFs in one go.
Very sorry about my inexperience and helplessness. Can someone spoon-feed me the steps for this?
EDIT: I want to switch the JPEG2000 to any other format that produces less of an increase in file size and causes a minimal loss in quality (within reason). I have no clue how to use Ghostscript. I basically want to change the compression on the images to something that will display correctly on the iPad while maintaining the quality of the rest of the text, as well as the embedded bookmarks.
I'll repeat that I have NO experience with command line...I don't even know how to point GS to the folder my PDFs are in...
You haven't really said what it is you want. 'Convert' PDFs how exactly ?
Note that switching from JPX (JPEG2000) to JPEG will result in a quality loss, because the image data will be quantised (with a different quantisation scheme to JPX) by the JPEG encoder. You can use a lossless compression scheme instead, but then you won't get the same kind of compression. You won't get the same compression ratio as JPX anyway no matter what you use, the result will be larger.
A simple Ghostscript command would be:
gs -sDEVICE=pdfwrite -o out.pdf in.pdf
Because JPEG2000 encoding is (or at least, was) patent encumbered, the pdfwrite device doesn't write images as JPX; by default it will write them several times with different compression schemes, and then use the one that gives the best compression (practically always JPEG).
Getting better results will require a more complex command line, but you'll also have to be more explicit about what exactly you want to achieve, and what the perceived problem with the simplistic command line is.
[EDIT]
Well, giving help on executing a command line is a bit off-topic for Stack Overflow, this is supposed to be a site for software developers :-)
Without knowing what operating system you are using, it's hard to give you detailed instructions. I also have no idea what an iPad uses; I don't generally use Apple devices, and my only experience is with Macs.
Presumably you know where (the directory) you installed Ghostscript. Either open a command shell there and type the command ./gs or execute the command by giving the full path, such as :
/usr/bin/gs
I thought the arguments on the command line were self-explanatory, but....
The -sDEVICE=pdfwrite switch tells Ghostscript to use the pdfwrite device, as you might guess from the name, that device writes PDF files as its output.
The -o switch is the name (and full path if required) of the output file.
The final argument is the name (and again, full path if it's not in the current directory) of the input file.
So a command might look like:
/usr/bin/gs -sDEVICE=pdfwrite -o /home/me/output.pdf /home/me/input.pdf
Or if Ghostscript and the input file are in the same directory:
./gs -sDEVICE=pdfwrite -o out.pdf input.pdf
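For converting a whole folder in one go, the single command above can be wrapped in a loop. The following is only a sketch in Python under a few assumptions: gs is on the PATH (otherwise pass the full path as gs_exe), the folder names are placeholders, and the function names are made up for illustration. It uses exactly the switches explained above and nothing else.

```python
import subprocess
from pathlib import Path


def pdfwrite_args(src, dst, gs_exe="gs"):
    """Argument list equivalent to: gs -sDEVICE=pdfwrite -o dst src"""
    return [gs_exe, "-sDEVICE=pdfwrite", "-o", str(dst), str(src)]


def plan_batch(in_dir, out_dir):
    """Pair every PDF in in_dir with an output path of the same name in out_dir."""
    return [(p, Path(out_dir) / p.name) for p in sorted(Path(in_dir).glob("*.pdf"))]


def convert_folder(in_dir, out_dir, gs_exe="gs"):
    """Rewrite every PDF in in_dir into out_dir, leaving the originals untouched."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for src, dst in plan_batch(in_dir, out_dir):
        subprocess.run(pdfwrite_args(src, dst, gs_exe), check=True)
```

Writing into a separate output folder matters here: the input and output of one Ghostscript run should not be the same file.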

shrinking a PDF

I'm not sure if this is the right place to post this question.
I'm trying to reduce the size of multiple 7MB PDF files so I tried these Ghostscript commands I found online:
simple ghostscript with printer quality setting
gswin32c.exe -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/printer -dNOPAUSE -dQUIET -dBATCH -sOutputFile=output.pdf input.pdf
tried this
gswin32c.exe -o output.pdf -sDEVICE=pdfwrite -dColorConversionStrategy=/LeaveColorUnchanged -dDownsampleMonoImages=false -dDownsampleGrayImages=false -dDownsampleColorImages=false -dAutoFilterColorImages=false -dAutoFilterGrayImages=false -dColorImageFilter=/FlateEncode -dGrayImageFilter=/FlateEncode input.pdf
and this
gswin32c.exe -o output.pdf -sDEVICE=pdfwrite -dColorConversionStrategy=/LeaveColorUnchanged -dEncodeColorImages=false -dEncodeGrayImages=false -dEncodeMonoImages=false input.pdf
but in all cases the PDF files obtained were 'bigger' than the original.
All these pdf files are basically a collection of scanned images so maybe I need a specific option to 'tell' ghostscript to compress them ?
The strange thing I found is that using the trial version of phantom pdf I was able to reduce the size to 2-5MB without visible loss of quality.
How do I do the same with ghostscript ?
Firstly, Ghostscript (or more accurately, Ghostscript's pdfwrite device) doesn't 'shrink' PDF files, it makes new ones which may, or may not, be smaller.
Secondly, it's practically impossible to say what might be happening with a PDF file without an example to look at.
If your files really are scanned images, then (assuming sensible initial compression) there's probably no way to reduce the file size without reducing quality. You might not notice the reduction in quality, especially if you're just viewing on screen, but it will be there.
Random poking with command lines which you run across online is probably not going to result in useful output either; you really need to understand where the size is being used in your original files, and then select options which are likely to reduce that.
For example, you say the pages are scanned images; there are only two realistic ways to reduce the size of an image: downsample it to a lower resolution, or select a different (more efficient, possibly lossy) compression. Ghostscript already compresses image data (unless you tell it not to).
The latter two of your command lines explicitly disable image downsampling, so they are not likely to reduce the size of scanned images. (by default the pdfwrite device doesn't downsample images, we try to preserve quality)
The middle option disables auto compression, and selects Flate compression. If your images were previously JPEG compressed, or are not contone images, then this is probably reasonable.
You also say that the PDF files got larger. Most likely this is because the originals use compressed object streams and a compressed xref, which is a PDF 1.5 feature that the pdfwrite device doesn't support. However, it's not likely to save you much space.
I'd say the most likely difference is that 'phantom PDF' is using more aggressive downsampling, which you could reproduce with pdfwrite.
I'm assuming, of course, that you are using a recent version of Ghostscript. Older versions unsurprisingly perform less well than recent ones.
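To make "more aggressive downsampling" concrete, here is a hedged sketch that builds a pdfwrite command line with downsampling switched on for color and grayscale images. The 150 dpi target is an arbitrary assumption to experiment with, not a recommendation, and the function name is made up; the switches themselves are pdfwrite's distiller parameters.

```python
def downsample_args(src, dst, dpi=150, gs_exe="gswin32c.exe"):
    """Build a pdfwrite command that downsamples color and gray images to dpi."""
    args = [gs_exe, "-o", str(dst), "-sDEVICE=pdfwrite"]
    for kind in ("Color", "Gray"):
        args += [
            f"-dDownsample{kind}Images=true",         # off by default in pdfwrite
            f"-d{kind}ImageDownsampleType=/Bicubic",  # resampling method
            f"-d{kind}ImageResolution={dpi}",         # target resolution
        ]
    return args + [str(src)]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first.
    print(" ".join(downsample_args("input.pdf", "output.pdf")))
```

Whether the result is acceptable depends entirely on the scans, so compare the output against the original at the zoom level you actually use.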

Ghostscript error when converting PostScript to PDF file

I convert a PDF with Ghostscript (9.20) to a PostScript File:
pdf2ps original.pdf optimized.ps
and then try to reconvert the PostScript to a smaller PDF file with the -dPDFSETTINGS=/screen or /ebook option to hopefully obtain a smaller PDF file size in the end:
ps2pdf -dPDFSETTINGS=/screen optimized.ps optimized.pdf
But then I get the following error during conversion:
Subsample filter does not support non-integer downsample factor (2.400000)
Failed to initialise downsample filter, downsampling aborted
What's missing or what I'm doing wrong? Couldn't find any solutions yet… :-(
Firstly you don't need to do a multiple step conversion PDF->PS->PDF, a simple PDF->PDF will work.
The warning is due to trying to downsample images to a lower resolution, and the scale factor is not an integer. So in this case, it won't downsample. If you insist on using the canned settings instead of setting the controls yourself, then I'm afraid you are pretty much always going to be in the dark. It would be much better to read the documentation and work out which controls to set, based on the type of input you have, and the compromises you are prepared to accept on quality.
In this case, you will almost certainly have to not downsample monochrome images. See the documentation on how to achieve that.
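As a rough illustration of where a factor like 2.4 comes from: the downsample factor is the image's resolution divided by the target resolution. The sketch below uses made-up numbers (720 dpi and 300 dpi are assumptions, not values read from the file in the question) and also builds a direct PDF-to-PDF command that keeps the canned /screen settings but switches off monochrome downsampling. The function names are hypothetical.

```python
def downsample_factor(source_dpi, target_dpi):
    """Ratio by which an image would have to be reduced."""
    return source_dpi / target_dpi


def is_integer_factor(source_dpi, target_dpi, tol=1e-9):
    """True if the reduction is a whole number, which simple subsampling needs."""
    factor = downsample_factor(source_dpi, target_dpi)
    return abs(factor - round(factor)) < tol


def screen_args_without_mono_downsampling(src, dst, gs_exe="gs"):
    """Direct PDF -> PDF, /screen preset, monochrome images left at full resolution."""
    return [
        gs_exe, "-o", dst, "-sDEVICE=pdfwrite",
        "-dPDFSETTINGS=/screen",
        "-dDownsampleMonoImages=false",  # placed after the preset so it overrides it
        src,
    ]


if __name__ == "__main__":
    print(downsample_factor(720, 300), is_integer_factor(720, 300))
    print(downsample_factor(600, 300), is_integer_factor(600, 300))
```

The first pair is the non-integer case that triggers the warning; the second would be handled without complaint.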
You have not stated the version of Ghostscript you are using which makes it even harder to comment here, however there is an open enhancement request regarding the downsampling filter here
Which originated with a Stack Overflow question here

Preflight program for PDFs using PoDoFo or anything else open source? [closed]

Closed 10 years ago.
I have to automate a preflight check on PDF documents. The preflight consists of:
Detect the resolution of images in an existing document and change them to 300dpi if they are not already at that resolution.
Detect the colorspace of images and if not in CMYK, then convert them to CMYK using color profiles.
Detect whether or not fonts are embedded in an existing PDF document, and correct this problem by substituting fonts. (or drawing font outlines — I'm not sure about this part).
Just wondering if this can be done using PoDoFo or any other open source projects out there. Or if I really need to go order some proprietary software between $2K to $6K. My hosting environment is on Linux and supports PHP, Perl, Python, Ruby, Java.
Any ideas?
I'm not aware of any ready-made Open Source software which meets your requirements.
Only a part of it could be solved by writing your own shell script (or other program).
Detect resolution of images.
Run pdfimages -list some.pdf to output a list of images contained in the PDF as well as their dimensions... seemingly. But what is not obvious about it: these dimensions are the ones of the raw image (as embedded in the PDF). This could be 720x720 pixels. However, if rendered onto a 10x10 inch square of the page this image will be 72 DPI on the page. If rendered on a 1x1 inch square, it will be 720 DPI. Both types of 'rendering' inside a PDF can be made from the same embedded raw image, and it is the context of the current 'graphic state' which determines which is applied. So to determine the actual DPI of an image as it appears on the page requires some additional PDF parsing...
In any case, you can tell Ghostscript to re-sample images to 300 dpi, and to use a 'threshold' for this. (Ghostscript will never "upsample" an image; it only downsamples those that exceed the threshold. Upsampling almost never makes sense -- it only blows up the file size with no gain in quality.)
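The 720x720 pixel example can be written out as a small calculation. This is only a sketch of the arithmetic: the function and parameter names are invented here, and it assumes you have already obtained the placed size of the image on the page in PDF points (1 point = 1/72 inch), which is the part that needs the additional PDF parsing.

```python
def effective_dpi(pixels, placed_points):
    """Resolution of an image as it appears on the page.

    pixels        -- raw pixel count along one axis (as embedded in the PDF)
    placed_points -- size of the image on the page along that axis, in points
    """
    placed_inches = placed_points / 72.0
    return pixels / placed_inches


if __name__ == "__main__":
    print(effective_dpi(720, 720))  # 720 px across 10 inches -> 72.0
    print(effective_dpi(720, 72))   # 720 px across 1 inch    -> 720.0
```

The horizontal and vertical resolutions can differ if the image is stretched, so a real check would compute both axes.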
Convert colors to colorspace CMYK using ICC profiles.
The most recent versions of Ghostscript can do that. See also the most recent Ghostscript documentation describing its support for ICC.
Embed un-embedded fonts.
Running (and evaluating the results of) pdffonts some.pdf will show you which fonts are not embedded.
Ghostscript can embed un-embedded fonts.
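Evaluating the pdffonts output can be scripted. The sketch below is an assumption-heavy illustration: it relies on the usual column layout of pdffonts (name, type, encoding, emb, sub, uni, object ID), which may differ between versions, and it reads the 'emb' column by counting fields from the right because the type column can itself contain spaces. The sample text is invented for the example, and font names containing spaces would be truncated to their first word.

```python
import subprocess

SAMPLE = """\
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
ABCDEF+Helvetica                     Type 1C           WinAnsi          yes yes no       8  0
Times-Roman                          Type 1            Custom           no  no  no      12  0
"""


def unembedded_fonts(pdffonts_output):
    """Return the names of fonts whose 'emb' column says 'no'."""
    missing = []
    for line in pdffonts_output.splitlines()[2:]:  # skip header and dashed rule
        fields = line.split()
        if len(fields) < 7:
            continue  # blank or unexpected line
        # From the right: object ID (2 fields), uni, sub, emb.
        if fields[-5] == "no":
            missing.append(fields[0])
    return missing


def check_pdf(path):
    """Run pdffonts on a file and list its un-embedded fonts."""
    result = subprocess.run(["pdffonts", path], capture_output=True, text=True, check=True)
    return unembedded_fonts(result.stdout)


if __name__ == "__main__":
    print(unembedded_fonts(SAMPLE))
```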
So one Ghostscript command that would cover most of your requirements is this:
gs \
-o cmyk.pdf \
-sDEVICE=pdfwrite \
-sColorConversionStrategy=CMYK \
-sProcessColorModel=DeviceCMYK \
-sOutputICCProfile=/path/to/your.icc \
-dDownsampleColorImages=true \
-dColorImageDownsampleThreshold=2 \
-dColorImageDownsampleType=/Bicubic \
-dColorImageResolution=300 \
-dDownsampleGrayImages=true \
-dGrayImageDownsampleThreshold=2 \
-dGrayImageDownsampleType=/Bicubic \
-dGrayImageResolution=300 \
-dDownsampleMonoImages=true \
-dMonoImageDownsampleThreshold=2 \
-dMonoImageDownsampleType=/Bicubic \
-dMonoImageResolution=1200 \
-dSubsetFonts=true \
-dEmbedAllFonts=true \
-dCannotEmbedFontPolicy=/Error \
-c ".setpdfwrite<</NeverEmbed[ ]>> setdistillerparams" \
-f some.pdf
This command would downsample every image whose resolution is more than twice the target resolution (*ImageDownsampleThreshold=2). It would also apply all of these settings to every input file (unlike dedicated PDF preflighting software, which can apply selective 'fixups' based on the results of 'checks' for particular properties).
Lastly, I cannot see what made you think you'd have to spend $2k to $6k if you had to resort to closed-source, commercial preflighting software. (My favorite in this field is the very powerful callas pdfToolbox6 (which even has a version that runs as CLI on Linux) -- its basic version costs 500 €.)
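The threshold rule can be sketched as a one-line decision. This is an illustration of the rule as described here, not Ghostscript's actual code: the function name is made up, and how Ghostscript treats an image sitting exactly on the boundary is not something this sketch claims to reproduce.

```python
def will_downsample(image_dpi, target_dpi=300, threshold=2.0):
    """True if the image resolution exceeds threshold * target resolution."""
    return image_dpi > target_dpi * threshold


if __name__ == "__main__":
    for dpi in (150, 450, 720, 1200):
        print(dpi, will_downsample(dpi))
```

With a 300 dpi target and a threshold of 2, a 450 dpi image is left alone even though it is above the target, while a 720 dpi image is reduced.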
My background is in printing, so please keep this in mind when reading my answer. The items you propose to do seem somewhat straight forward, but when you get into the nitty gritty of it, there's a lot of print-industry knowledge that goes into these operations.
Here's some quick feedback to your bullet points:
You won't want to upsample a low-res image to 300 dpi as it will decrease image quality (via re-interpolation) and increase file size.
You need to be careful with color conversions. There may be certain builds of RGB which you'd want to convert to black only. Or what happens if someone supplies a file which is already cmyk and tagged with the incorrect profile.
Font detection - very complicated to substitute fonts. If you don't have the exact same font as the originator, you could end up with text reflow problems. To own that font, you'll have to pay for a license. You also can't convert fonts to outlines without them being embedded.
My recommendation is to look at a commercial package for preflighting. These developers have invested years into developing their programs and are experts within the field of printing. The challenging part will be finding ones that are unix based in your price range. Most are designed for Windows or Mac. Callas has a linux cl version but not at the price listed. You'd need the server version.
What type of volume are you planning to run through it?
Did you try Enfocus PitStop Pro? Contact their support department with your specific request. They have tons of PDF preflight examples and will be happy to help you out.

Using ps2pdf on EPS files with PNG used for bitmaps?

We're currently using ps2pdf to convert EPS files to PDF. These EPS files contain both vector information (lines and text) and bitmap data.
However, by default ps2pdf converts the bitmap components of these images to JPG as they're embedded within the PDF, whereas for the type of graphics we have (data visualisation) it would be much more appropriate to use lossless compression. PDF supports PNG, so it should be possible to achieve what we're trying to do, but I'm having trouble finding a relevant option in the somewhat intimidating manual.
So the short question is: what is the correct way to write this?
    ps2pdf -dPDFSETTINGS=UsePNGinsteadOfJPGcompression input.eps output.pdf
The answer is not -dUseFlateCompression, since that option refers to using Flate instead of LZW compression; both are lossless but LZW was covered by patents for a while. Since that's not a problem any more, the option is ignored.
Instead, the options needed to get lossless encoding of bitmap data are (all four of them):
-dAutoFilterColorImages=false
-dAutoFilterGrayImages=false
-dColorImageFilter=/FlateEncode
-dGrayImageFilter=/FlateEncode
You might also want to do the same thing with MonoImageFilter as well, but I assume /CCITTFaxEncode does a reasonable job there so it's not too important.
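Put together, the four options go before the input file on the ps2pdf command line. A minimal sketch that assembles the call from a script follows; the function name is hypothetical and ps2pdf is assumed to be on the PATH.

```python
import subprocess


def ps2pdf_lossless_args(src, dst, ps2pdf_exe="ps2pdf"):
    """ps2pdf call that stores color and gray bitmaps with Flate (lossless)."""
    return [
        ps2pdf_exe,
        "-dAutoFilterColorImages=false",   # stop automatic filter selection
        "-dAutoFilterGrayImages=false",
        "-dColorImageFilter=/FlateEncode",  # use Flate for color images
        "-dGrayImageFilter=/FlateEncode",   # and for grayscale images
        src,
        dst,
    ]


def convert_lossless(src, dst):
    """Run the conversion; raises CalledProcessError on failure."""
    subprocess.run(ps2pdf_lossless_args(src, dst), check=True)
```

convert_lossless("input.eps", "output.pdf") would then replace the placeholder command from the question.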