Using ps2pdf on EPS files with PNG used for bitmaps?

We're currently using ps2pdf to convert EPS files to PDF. These EPS files contain both vector information (lines and text) and bitmap data.
However, by default ps2pdf converts the bitmap components of these images to JPG as they're embedded within the PDF, whereas for the type of graphics we have (data visualisation) it would be much more appropriate to use lossless compression. PDF supports the same lossless Flate compression that PNG uses, so it should be possible to achieve what we're trying to do, but I'm having trouble finding the relevant option in the somewhat intimidating manual.
So the short question is: what is the correct way to write this?
    ps2pdf -dPDFSETTINGS=UsePNGinsteadOfJPGcompression input.eps output.pdf

The answer is not -dUseFlateCompression, since that option refers to using Flate instead of LZW compression; both are lossless but LZW was covered by patents for a while. Since that's not a problem any more, the option is ignored.
Instead, the options needed to achieve lossless encoding of bitmap data are (all four of them):
-dAutoFilterColorImages=false
-dAutoFilterGrayImages=false
-dColorImageFilter=/FlateEncode
-dGrayImageFilter=/FlateEncode
You might want to do the same with MonoImageFilter as well, but I assume /CCITTFaxEncode does a reasonable job there, so it's not too important.
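Putting all four together with the question's filenames, the complete invocation is:
    ps2pdf -dAutoFilterColorImages=false -dAutoFilterGrayImages=false -dColorImageFilter=/FlateEncode -dGrayImageFilter=/FlateEncode input.eps output.pdf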

Related

Ghostscript to compress a batch of PDFs

I have no experience of programming.
My PDFs won't display images on the iPad in PDFExpert or GoodNotes; from what I could find on the internet, this is because the images are in JPEG2000.
These are large PDFs, up to 1500-2000 pages with images. One of these was an 80MB or so file. I tried printing it with Foxit to convert the images from JPEG2000 to JPG, but the file size jumped to 800MB... plus it's taking too long.
I stumbled upon Ghostscript, but I have NO clue how to use the command line interface.
I am very short on time. I pretty much need a step-by-step guide for a small script that converts all my PDFs in one go.
Very sorry about my inexperience and helplessness. Can someone spoon-feed me the steps for this?
EDIT: I want to switch the JPEG2000 to any other format that produces less of an increase in file size and causes a minimal loss in quality (within reason). I have no clue how to use Ghostscript. I basically want to change the compression on the images to something that will display correctly on the iPad while maintaining the quality of the rest of the text, as well as the embedded bookmarks.
I'll repeat that I have NO experience with command line...I don't even know how to point GS to the folder my PDFs are in...
You haven't really said what it is you want. 'Convert' PDFs how, exactly?
Note that switching from JPX (JPEG2000) to JPEG will result in a quality loss, because the image data will be quantised (with a different quantisation scheme to JPX) by the JPEG encoder. You can use a lossless compression scheme instead, but then you won't get the same kind of compression. In fact, you won't get the same compression ratio as JPX no matter what you use; the result will be larger.
A simple Ghostscript command would be:
gs -sDEVICE=pdfwrite -o out.pdf in.pdf
Because JPEG2000 encoding is (or at least, was) patent encumbered, the pdfwrite device doesn't write images as JPX; by default it will write them several times with different compression schemes, and then use the one that gives the best compression (practically always JPEG).
Getting better results will require a more complex command line, but you'll also have to be more explicit about what exactly you want to achieve, and what the perceived problem with the simplistic command line is.
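For example, to force the lossless Flate scheme mentioned above instead of letting pdfwrite choose (a sketch; expect a larger output file):
gs -sDEVICE=pdfwrite -dAutoFilterColorImages=false -dColorImageFilter=/FlateEncode -dAutoFilterGrayImages=false -dGrayImageFilter=/FlateEncode -o out.pdf in.pdf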
[EDIT]
Well, giving help on executing a command line is a bit off-topic for Stack Overflow; this is supposed to be a site for software developers :-)
Without knowing what operating system you are using, it's hard to give you detailed instructions. I also have no idea what an iPad uses; I don't generally use Apple devices, and my only experience is with Macs.
Presumably you know where (in which directory) you installed Ghostscript. Either open a command shell there and type the command ./gs, or execute the command by giving the full path, such as:
/usr/bin/gs
I thought the arguments on the command line were self-explanatory, but....
The -sDEVICE=pdfwrite switch tells Ghostscript to use the pdfwrite device; as you might guess from the name, that device writes PDF files as its output.
The -o switch sets the name (and full path, if required) of the output file.
The final argument is the name (and again, the full path if it's not in the current directory) of the input file.
So a command might look like:
/usr/bin/gs -sDEVICE=pdfwrite -o /home/me/output.pdf /home/me/input.pdf
Or if Ghostscript and the input file are in the same directory:
./gs -sDEVICE=pdfwrite -o out.pdf input.pdf
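And since you asked about converting all your PDFs in one go, a minimal sketch, assuming a Unix-like shell with gs on the PATH (adjust the folder path to your own):
cd /home/me/pdfs
for f in *.pdf; do gs -sDEVICE=pdfwrite -o "converted_$f" "$f"; done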

Ghostscript error when converting PostScript to PDF file

I convert a PDF with Ghostscript (9.20) to a PostScript File:
pdf2ps original.pdf optimized.ps
and then try to reconvert the PostScript to a smaller PDF file with the -dPDFSETTINGS=/screen or /ebook option to hopefully obtain a smaller PDF file size in the end:
ps2pdf -dPDFSETTINGS=/screen optimized.ps optimized.pdf
But then I get the following error during conversion:
Subsample filter does not support non-integer downsample factor (2.400000)
Failed to initialise downsample filter, downsampling aborted
What's missing, or what am I doing wrong? I couldn't find any solutions yet… :-(
Firstly, you don't need to do a multiple-step conversion PDF->PS->PDF; a simple PDF->PDF will work.
The warning is due to trying to downsample images to a lower resolution when the scale factor is not an integer. In this case, it simply won't downsample those images. If you insist on using the canned settings instead of setting the controls yourself, then I'm afraid you are pretty much always going to be in the dark. It would be much better to read the documentation and work out which controls to set, based on the type of input you have and the compromises you are prepared to accept on quality.
In this case, you will almost certainly have to avoid downsampling monochrome images; see the documentation on how to achieve that.
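For example (a sketch; the right controls depend on your input), a direct PDF-to-PDF pass that keeps the /screen preset but leaves monochrome images alone would be:
gs -sDEVICE=pdfwrite -dPDFSETTINGS=/screen -dDownsampleMonoImages=false -o optimized.pdf original.pdf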
There is also an open enhancement request regarding the downsampling filter here, which originated with a Stack Overflow question here.

Converting multi-page PDFs to several JPGs using ImageMagick and/or GhostScript

I am trying to convert a multi-page PDF file into a bunch of JPEGs, one for each page in the PDF. I have spent hours and hours looking up how to do this, and eventually I discovered that I need Ghostscript installed. So I did that (from this website: http://downloads.ghostscript.com/public/ ; I used the most recent link, "ghostscript-9.05.tar.gz", from Feb 8, 2012).
However, even with this installed/downloaded, I am still unable to do what I want. Should I have this saved somewhere special, like in the same folder as ImageMagick?
What I have figured out so far is this:
In Command Prompt I change the working directory to the ImageMagick folder, where that is saved.
I then type
convert "<full file path to pdf>" "<full file path to jpg>"
This is followed by a giant blob of error. It begins with:
Unrecoverable error: rangecheck in .setuserparams
Operand stack:
Followed by a blurb of unreadable numbers and caps. It ends with:
While reading gs_lev2.ps:
%%[ Error: invalidaccess; OffendingCommand: put ]%%
Needless to say, after hours and hours of deliberation, I don't think I am any closer to doing the seemingly simple task of converting this PDF into a JPG.
What I would like are some step by step instructions on how to make this work. Don't leave out anything, no matter how "obvious" it might seem (especially anything involving ghostscript). This has been troubling me and my supervisor for months now.
For further clarification, we are on a Windows XP operating system. The eventual intention is to call these command lines in R, the statistical language, and run it in a script. In addition, I have been able to successfully convert JPGs to PNG format and vice versa, but PDF just is not working.
Help!!!
You don't need ImageMagick for this; Ghostscript can do it all alone. (If you used ImageMagick, it couldn't do that conversion itself; it HAS to use Ghostscript as its 'delegate'.)
Try this for directly using Ghostscript:
c:\path\to\gswin32c.exe ^
-o page_%03d.jpg ^
-sDEVICE=jpeg ^
d:/path/to/input.pdf
This will create a new JPEG for each page, and the filenames will increment as page_001.jpg, page_002.jpg,...
Note that this will create JPEGs which use all the default settings of the jpeg device (one of the most important being that the resolution will be 72dpi).
If you need a higher (or lower) resolution for your images, you can add other options:
gswin32c.exe ^
-o page_%03d.jpg ^
-sDEVICE=jpeg ^
-r300 ^
-dJPEGQ=100 ^
d:/path/to/input.pdf
-r300 sets the resolution to 300dpi and -dJPEGQ=100 sets the highest JPEG quality level (Ghostscript's default is 75).
Also note, please: JPEG is not well suited to representing shapes with sharp edges and high contrast in good quality (such as you typically see on black-on-white text pages with small characters).
The (lossy) JPEG compression method is optimized for continuous-tone pictures and photos, not for line graphics. Therefore it is sub-optimal for PostScript or PDF input pages which mainly contain text. Here, the lossy compression of the JPEG format will result in poorer-quality output even if the input is excellent. See also the JPEG FAQ for more details on this topic.
You may get better image output by choosing PNG as the output format (PNG uses a lossless compression):
gswin32c.exe ^
-o page_%03d.png ^
-sDEVICE=png16m ^
-r150 ^
d:/path/to/input.pdf
The png16m device produces 24bit RGB color. You could swap this for pnggray (for pure grayscale output), png256 (for 8-bit color), png16 (4-bit color), pngmono (black and white only) or pngmonod (alternative black-and-white module).
There are numerous SaaS services that will do this for you too. HyPDF and Blitline come to mind.

Are all PDF files compressed?

There are some threads here on PDF compression saying that there is some, but not much, gain in compressing PDFs, as PDFs are already compressed.
My question is: Is this true for all PDFs including older version of the format?
Also, I'm sure it's possible for someone (an idiot, maybe) to place bitmaps into the PDF rather than JPEG etc. Our company has a lot of PDFs in its DBs (some older formats, maybe). We are considering using gzip to compress during transmission, but don't know if it's worth the hassle.
PDFs in general use internal compression for the objects they contain. But this compression is by no means compulsory according to the file format specifications. All (or some) objects may appear completely uncompressed, and they would still make a valid PDF.
There are commandline tools out there which are able to decompress most (if not all) of the internal object streams (even of the most modern versions of PDFs) -- and the new, uncompressed version of the file will render exactly the same on screen or on paper (if printed).
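As an illustration (qpdf is one such tool; it is named here as an example and is not from the original answer):
qpdf --stream-data=uncompress --object-streams=disable compressed.pdf uncompressed.pdf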
So to answer your question: no, you cannot assume that gzip compression adds only hassle and no benefit. You have to test it with a representative sample set of your files: just gzip them and take note of the time used and the space saved.
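A quick way to run that test from a shell (a sketch; point it at a representative sample folder):
time gzip -k *.pdf
ls -l *.pdf *.pdf.gz
The -k switch keeps the originals, so you can compare the .pdf and .pdf.gz sizes side by side.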
It also depends on the type of PDF producing software which was used...
Instead of applying gzip compression, you would get much better gain by using PDF utilities to apply compression to the contents within the format as well as remove things like unneeded embedded fonts. Such utilities can downsample images and apply the proper image compression, which would be far more effective than gzip. JBIG2 can be applied to bilevel images and is remarkably effective, and JPEG can be applied to natural images with the quality level selected to suit your needs. In Acrobat Pro, you can use Advanced -> PDF Optimizer to see where space is used and selectively attack those consumers. There is also a generic Document -> Reduce File Size to automatically apply these reductions.
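If Acrobat is not available, Ghostscript's pdfwrite device exposes the same kind of downsampling controls (a sketch; tune the target resolution and filters to your needs):
gs -sDEVICE=pdfwrite -dDownsampleColorImages=true -dColorImageDownsampleType=/Bicubic -dColorImageResolution=150 -o smaller.pdf input.pdf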
Update:
Ika's answer has a link to a PDF optimization utility that can be used from Java. You can look at their sample Java code there. That code lists exactly the things I mentioned:
Remove duplicated fonts, images, ICC profiles, and any other data stream.
Optionally convert high-quality or print-ready PDF files to small, efficient and web-ready PDF.
Optionally down-sample large images to a given resolution.
Optionally compress or recompress PDF images using JBIG2 and JPEG2000 compression formats.
Compress uncompressed streams and remove unused PDF objects.

How to optimize PDF file size?

I have an input PDF file (usually, but not always, generated by pdfTeX) which I want to convert to an output PDF that is visually equivalent (no matter the resolution), has the same metadata (Unicode text info, hyperlinks, outlines etc.), and has as small a file size as possible.
I know about the following methods:
java -cp Multivalent.jar tool.pdf.Compress input.pdf (from http://multivalent.sourceforge.net/). This recompresses all streams, removes unused objects, unifies equivalent objects, compresses whitespace, removes default values, compresses the cross-reference table.
Recompressing suitable images with jbig2 and PNGOUT.
Re-encoding Type1 fonts as CFF fonts.
Unifying equivalent images.
Unifying subsets of the same font to a bigger subset.
Remove fillable forms.
When distilling or otherwise converting (e.g. gs -sDEVICE=pdfwrite), make sure it doesn't degrade image quality, and doesn't increase (!) the image sizes.
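On that last point: newer Ghostscript versions can pass existing JPEG data through unchanged rather than decoding and recompressing it (a sketch, assuming a version that supports this switch):
gs -sDEVICE=pdfwrite -dPassThroughJPEGImages=true -o output.pdf input.pdf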
I know about the following techniques, but they don't apply in my case, since I already have a PDF:
Use smaller and/or less fonts.
Use vector images instead of bitmap images.
Do you have any other ideas how to optimize PDF?
Optimize PDF Files
Avoid Refried Graphics
For graphics that must be inserted as bitmaps, prepare them for maximum compressibility and minimum dimensions. Use the best quality images that you can at the output resolution of the PDF. Inserting compressed JPEGs into PDFs and Distilling them may recompress JPEGs, which can create noticeable artifacts. Use black and white images and text instead of color images to allow the use of the newer JBIG2 standard that excels in monochromatic compression. Be sure to turn off thumbnails when saving PDFs for the Web.
Use Vector Graphics
Use vector-based graphics wherever possible for images that would normally be made into GIFs. Vector images scale perfectly, look marvelous, and their mathematical formulas usually take up less space than bitmapped graphics that describe every pixel (although there are some cases where bitmap graphics are actually smaller than vector graphics). You can also compress vector image data using ZIP compression, which is built into the PDF format. Acrobat Reader versions 5 and 6 also support the SVG standard.
Minimize Fonts
How you use fonts, especially in smaller PDFs, can have a significant impact on file size. Minimize the number of fonts you use in your documents to minimize their impact on file size. Each additional fully embedded font can easily take 40K in file size, which is why most authors create "subsetted" fonts that only include the glyphs actually used.
Flatten Fat Forms
Acrobat forms can take up a lot of space in your PDFs. New in Acrobat 8 Pro, you can flatten form fields in the Advanced -> PDF Optimizer -> Discard Objects dialog. Flattening forms makes form fields unusable, and the form data is merged with the page. You can also use PDF Enhancer from Apago to reduce forms by 50% by removing information present in the file but never actually used. You can also combine a refried PDF with the old form pages to create a hybrid PDF in Acrobat (see the "Refried PDF" section of the article).
see article
As of PDF specification version 1.5, there are two new methods of compression: object streams and cross-reference streams.
You mention that the Multivalent.jar compress tool compresses the cross reference table. This usually means the cross reference table is converted into a stream and then compressed.
The format of this cross reference stream is not fixed. You can change the bit size of the three "columns" of data. It's also possible to pre-process the stream data using a predictor function which will improve the compression level of the data. If you look inside the PDF with a text editor you might be able to find the /Predictor entry in the cross reference stream dictionary to check whether the tool you're using is taking advantage of this feature.
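For illustration, a cross-reference stream dictionary might look something like this (values made up; /W gives the byte widths of the three columns, and Predictor 12 is the PNG "Up" predictor applied over /Columns bytes):
42 0 obj
<< /Type /XRef
   /Filter /FlateDecode
   /DecodeParms << /Predictor 12 /Columns 4 >>
   /W [1 2 1]
   /Size 100
   /Root 1 0 R
>>
stream
...
endstream
endobj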
Using a predictor on the compression might be handy for images too.
The second type of compression offered is the use of object streams.
Often in a PDF you have many similar objects. These can now be combined into a single object and then compressed. The documentation for the Multivalent Compress tool mentions that object streams are used but doesn't have many details on the actual choice of which objects to group together. The compression will be better if you group similar objects together into an object stream.
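If you want to experiment with object streams outside Multivalent, qpdf (named here as an example) can force them to be generated:
qpdf --object-streams=generate in.pdf out.pdf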