PDF Optimization Acrobat vs. Ghostscript


I have a PDF file that I would like to optimize. I am receiving the file from an outside source so I don't have the means to recreate it from the beginning.
When I open the file in Acrobat and query the resources, it says that the fonts in the file take up 90%+ of the space. If I save the file as postscript and then save the postscript file to an optimized PDF, the file is significantly smaller (upwards of 80% smaller) and the fonts are still embedded.
I am trying to recreate these results with ghostscript. I have tried various permutations of options with pswrite and pdfwrite but what happens is when I do the initial conversion from PDF to Postscript, the text gets converted to an image. When I convert back to PDF the font references are gone so I end up with a PDF file that has 'imaged' text rather than actual fonts.
The file contains 22 embedded custom Type1 fonts which I have. I have added the fonts to the ghostscript search path and proved that ghostscript can find them with:
gs \
-I/home/nauc01 \
-sFONTPATH=/home/nauc01/fonts/Type1 \
-o 3783QP.pdf \
-sDEVICE=pdfwrite \
-g5950x8420 \
-c "200 700 moveto" \
-c "/3783QP findfont 60 scalefont setfont" \
-c "(TESTING !!!!!!) show showpage"
The resulting file has the font correctly embedded.
I have also tried using ghostscript to go from PDF to PDF like this:
gs \
-sDEVICE=pdfwrite \
-sNOPAUSE \
-I/home/nauc01 \
-dBATCH \
-dCompatibilityLevel=1.4 \
-dPDFSETTINGS=/printer \
-CompressFonts=true \
-dSubsetFonts=true \
-sOutputFile=output.pdf \
input.pdf
but the output is usually larger than the input and I can't view the file in anything but ghostscript (adobe reader gives "Object label badly formatted").
I can't provide the original file because they contain confidential information but I will try to answer any questions that need to be answered regarding them.
Any ideas? Thanks in advance.

Don't use pswrite. As you've discovered, this will render text. Instead, use the ps2write device, which retains fonts and text.
You don't say which version of Ghostscript you are using but I would recommend you use a recent one.
One point: Ghostscript isn't 'optimising' the PDF the way Acrobat does, it's re-creating it. The original PDF is fully interpreted to produce a sequence of operations that mark the page; pdfwrite (and ps2write) then make a new file which only has those operations inside.
If you choose to subset fonts, then only the required glyphs will be included. If the original PDF contains extraneous information (Adobe Illustrator, for example, usually embeds a complete copy of the .ai file) then this will be discarded. This may result in a smaller file, or it may not.
Note that pdfwrite does not support compressed xref and some other later features at present, so some files may well get bigger.
I would personally not go via ps2write, since this just adds another layer of processing and discarding of information. I would just use pdfwrite to create a new PDF file. If you find files for which this does not work (using current code) then you should raise a bug report at http://bugs.ghostscript.com so that someone can address the problem.
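As a minimal sketch of the direct PDF-to-PDF route described above (not a tuned recipe), the wrapper below calls pdfwrite in a single pass with font subsetting and font compression switched on. The function name rewrite_pdf and the GS variable are illustrative assumptions; GS only exists so the executable name can be swapped, for example to gswin64c on Windows.

```shell
# GS names the Ghostscript executable; override it (e.g. GS=gswin64c) if needed.
GS="${GS:-gs}"

# Re-create INPUT as a new PDF in one pass, without a PostScript detour.
# Fonts stay embedded, but only the glyphs actually used are kept.
# Usage: rewrite_pdf INPUT OUTPUT
rewrite_pdf() {
  "$GS" -o "$2" \
    -sDEVICE=pdfwrite \
    -dSubsetFonts=true \
    -dCompressFonts=true \
    "$1"
}
```

Calling rewrite_pdf input.pdf output.pdf and then comparing the two file sizes shows whether re-creating the file helped; as noted above, the result may be smaller or larger depending on the input.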

You might want to try the Multivalent Compress tool. It has an (experimental) option to subset embedded fonts that might make your PDF much smaller. It also contains a lot of switches that allow for better compression, sometimes at the cost of quality (JPEG compression of bitmaps, for example).
Unfortunately, the most recent version of Multivalent no longer includes the tools. Search for Multivalent20060102.jar; that version still includes them. To run Compress:
java -classpath /path/to/Multivalent20060102.jar tool.pdf.Compress [options] <pdf file>

Related

Ghostscript pdf conversion makes ligatures unable to copy & paste

I have a pdf (created with latex with \usepackage[a-2b]{pdfx}) where I am able to correctly copy & paste ligatures, i.e., "fi" gets pasted in my text editor as "fi". The pdf is quite large, so I'm trying to reduce its size with this ghostscript command:
gs -dPDFA-2 -dBATCH -dNOPAUSE -sPDFACompatibilityPolicy=1 -sDEVICE=pdfwrite
-dPDFSETTINGS=/printer -sProcessColorModel=DeviceRGB
-sColorConversionStrategy=UseDeviceIndependentColor
-dColorImageDownsampleType=/Bicubic -dAutoRotatePages=/None
-dCompatibilityLevel=1.5 -dEmbedAllFonts=true -dFastWebView=true
-sOutputFile=main_new.pdf main.pdf
While this produces a nice, small pdf, now when I copy a word with "fi", I instead (often) get "ő".
Since the correct characters are somehow encoded in the original pdf, is there some parameter I can give ghostscript so that it simply preserves this information in the converted pdf?
I'm using ghostscript 9.27 on macOS 10.14.

Without seeing your original file, so that I can see the way the text is encoded, it's not possible to be definitive. It certainly is not possible to have the pdfwrite device 'preserve the information'; for an explanation, see here.
If your original PDF file has a ToUnicode CMap then the pdfwrite device should use that to generate a new ToUnicode CMap in the output file, maintaining cut&paste/search. If it doesn't then the conversion process will destroy the encoding. You might be able to get an improvement in results by setting SubsetFonts to false, but it's just a guess without seeing an example.
My guess is that your original file doesn't have a ToUnicode CMap, which means that it's essentially only working by luck.
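One way to test that guess, assuming the Poppler utility pdffonts is installed (it is not mentioned in the question), is to look at the "uni" column of its font listing, which says whether each font carries a ToUnicode map. The helper below is a sketch with a hypothetical name; it reads pdffonts output on stdin and counts fields from the end of each row, because the "type" column can itself contain spaces (for example "Type 1").

```shell
# Read `pdffonts` output on stdin and print the names of fonts whose "uni"
# column is "no". That column is the third field from the end of each row,
# just before the two-part object ID. The first two lines are the header.
fonts_without_tounicode() {
  awk 'NR > 2 && $(NF - 2) == "no" { print $1 }'
}
```

Typical use would be pdffonts main.pdf | fonts_without_tounicode. Any font listed there has no ToUnicode CMap in the original file, which would match the explanation above.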

How can I disable Ghostscript rasterization of images and paths?

I need to convert a PDF to a different ICC color profile. Through different searches and tests, I found out a way to do that:
First I convert my PDF to a PS file with:
.\gswin64c.exe -dNOPAUSE -dBATCH -sDEVICE=ps2write -sOutputFile="test.ps" "test.pdf"
Then I convert the PS back to a PDF with the following (this is to generate a valid PDF/X-3 file):
.\gswin64c.exe -dPDFX -dNOPAUSE -dBATCH -sDEVICE=pdfwrite
-sColorConversionStrategy=/UseDeviceIndependentColor -sProcessColorModel=DeviceCMYK
-dColorAccuracy=2 -dRenderIntent=0 -sOutputICCProfile="WebCoatedSWOP2006Grade5.icc"
-dDeviceGrayToK=true -sOutputFile="final.pdf" test_PDFX_def.ps test.ps
The ICC profile is embedded and all works perfectly. The only problem is that the whole final PDF is rasterized, so I lose all the paths and other vector elements I have in the starting file. I need to keep them as vectors because this PDF will have a specific application.

First step: don't convert to PostScript!
Any transparent marking operations will have to be rendered if you do that, because PostScript doesn't support transparency. Other features will be lost as well, so really, don't do that. The input and output ends of Ghostscript are more or less independent; the pdfwrite device doesn't know whether the input was PDF or PostScript, and doesn't care. So you don't need to convert a PDF file into PostScript before sending it as input.
You can feed the original PDF file into the second command line in place of the PostScript file.
As long as you are producing PDF/X-3 or later then the transparency will be preserved. Make sure you are using an up to date version of Ghostscript.
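A sketch of that single-step conversion, assuming the same definition file and options as in the question: the wrapper passes the PDF/X definition file first and the original PDF second, and forwards any extra options untouched. The function name pdf_to_pdfx and the GS variable are hypothetical; set GS=gswin64c on Windows.

```shell
# GS names the Ghostscript executable; override it (e.g. GS=gswin64c) if needed.
GS="${GS:-gs}"

# Convert a PDF straight to PDF/X with pdfwrite, skipping the ps2write step.
# Usage: pdf_to_pdfx DEF_FILE INPUT_PDF OUTPUT_PDF [extra gs options...]
pdf_to_pdfx() {
  def_file=$1
  input=$2
  output=$3
  shift 3
  # Extra options (colour strategy, ICC profile, ...) are passed through as given.
  "$GS" -dPDFX -dNOPAUSE -dBATCH -sDEVICE=pdfwrite "$@" \
    -sOutputFile="$output" "$def_file" "$input"
}
```

For the question's files this would be called as pdf_to_pdfx test_PDFX_def.ps test.pdf final.pdf followed by the colour options already used in the second command line.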

Replace all font glyphs in a PDF by converting them to outline shapes

I am looking for a way to 'outline' all text/fonts in a PDF file, i.e. convert them to curves.
I would prefer to do this without having to convert the PDF to PostScript and back. Also, I would like to use free lightweight cross-platform tools that can be automated from the command line, such as Ghostscript or MuPDF.

Yes, you can use Ghostscript to achieve what you want.
I. For Ghostscript versions up to 9.14
You need to go through 2 steps:
1. Convert the PDF to a PostScript file, but use the side effect of a relatively unknown parameter, -dNOCACHE. This will convert all used fonts to outline shapes:
gs -o somepdf.ps -dNOCACHE -sDEVICE=pswrite somepdf.pdf
2. Convert the PS back to PDF (and maybe delete the intermediate PS again):
gs -o somepdf-with-outlines.pdf -sDEVICE=pdfwrite somepdf.ps
rm somepdf.ps
This method is not reliable long-term, because the Ghostscript developers have stated that -dNOCACHE may not be present in future versions.
Note: the resulting PDF will very likely be larger than the original one. Plus, without additional command line parameters, all images in the original PDF will likely also be processed according to Ghostscript's built-in defaults. This can lead to unwanted side effects, which can be avoided by adding command line parameters that override those defaults.
II. Ghostscript versions 9.15 or newer
Ghostscript version 9.15 (released in September 2014) supports a new command line parameter:
-dNoOutputFonts
This will cause the output devices pdfwrite, ps2write and eps2write "to 'flatten' glyphs into 'basic' marking operations (rather than writing fonts to the output)".
This means: the two steps described for pre-9.15 GS versions can be avoided. The desired result can be achieved with a single command:
gs -o file-with-outlines.pdf -dNoOutputFonts -sDEVICE=pdfwrite file.pdf
Note: the same caveat is true as already noted in part I. If your PDF includes images, there may be unwanted side effects introduced by the simple command line above. To avoid these, you need to add more specific parameters.
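Since the choice between method I and method II depends only on the Ghostscript version, a script can decide automatically. The following is a sketch with a hypothetical function name; it assumes the version string has the usual MAJOR.MINOR or MAJOR.MINOR.PATCH form that gs --version prints.

```shell
# Succeed (exit status 0) if the given Ghostscript version is 9.15 or newer,
# i.e. new enough for -dNoOutputFonts. Accepts strings like 9.14 or 9.54.0.
supports_no_output_fonts() {
  major=${1%%.*}      # text before the first dot
  rest=${1#*.}        # text after the first dot
  minor=${rest%%.*}   # minor version, with any patch level dropped
  [ "$major" -gt 9 ] || { [ "$major" -eq 9 ] && [ "$minor" -ge 15 ]; }
}
```

A script could then run supports_no_output_fonts "$(gs --version)" and use the single -dNoOutputFonts command when it succeeds, falling back to the two-step -dNOCACHE route otherwise.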

This commit adds a new switch -dNoOutputFonts to the Ghostscript pdfwrite and ps2write devices which will produce a PDF file (or PostScript, depending on the selected device) where all the glyphs have been created as vectors, not as text.
You will need at least version 9.15 of Ghostscript to get this feature. Be aware that the PDF file will almost certainly be larger and copy/paste/search will (obviously) not work.

III. Ghostscript versions 9.54.0 (Windows 10)
I found a method that preserves all fonts flawlessly as vectors without any visual errors and with just two printing steps, after Ghostscript is first installed and configured correctly.
(Note! You must add the Ghostscript bin and lib folders to your Windows PATH in order to get Ghostscript to do anything.)
Print your PDF file that contains vector-based fonts or other vector elements from Acrobat Reader, using the Microsoft PS Class Driver, to a YourFile.prn file. (To install this driver: Control Panel - Devices - Printers & Scanners - Add a printer or scanner. Let Windows look for a connected printer for a while, and when it stops, select "The printer that I want is not listed" - Add a local printer or network printer with manual settings - Next - Use an existing port: File: (Print to File) - Next - Microsoft: Microsoft PS Class Driver - Next.)
Open a Command Prompt, navigate to the folder where YourFile.prn is located, and type:
"C:\Program Files\gs\gs9.54.0\bin\gswin64c.exe" -dNOPAUSE -dNOCACHE -dBATCH -sDEVICE=eps2write -sOutputFile=YourFile.eps YourFile.prn
If you have a constant need to do this, you can also create a prn2eps.bat file containing the following:
"C:\Program Files\gs\gs9.54.0\bin\gswin64c.exe" -dNOPAUSE -dNOCACHE -dBATCH -sDEVICE=eps2write -sOutputFile=%1.eps %1.prn
To use that bat file you just need to type: prn2eps YourFile.
(Note! You must have the bat file and YourFile.prn in the same directory.)
For some reason the newest Ghostscript ps2epsi function didn't work in Windows 10, and Adobe-made PDFs had, for example, minor but consistent errors in some font characters when I imported them as PDFs into non-Adobe design software. Over the years I have found that EPS is one of the most reliable formats when vectors must be preserved from one program to another. Often, printing the PDF to PDF again using a different printer driver, or a single file format change using Ghostscript, is enough, but not always.

Is there a way to ignore the watermark when using Ghostscript to convert PDF to TIFF

I'm using gs9.10 and have successfully converted my PDF to TIFF using this command line:
gswin64c -dNOPAUSE -q -r300x300 -sDEVICE=tifflzw \
-dBATCH -sCompression=lzw -dFirstPage=1 -dLastPage=5 \
-sOutputFile=TEST.TIFF \
TEST.PDF
However, I don't want the TIFF to have the watermark that is on every page of the PDF. Is there an option to ignore the watermark layer when writing out to a TIFF?

To be honest, this sounds suspiciously like trying to circumvent copyright. Obviously I can't tell since I haven't seen your original PDF file but watermarks are often applied to 'demo' or paid-for PDF files.
In any event, without seeing the file it's impossible to say whether a watermark can be removed, because it depends on how the watermark has been applied. There are at least 3 different ways that I can think of off-hand, and for 2 of those I could eliminate the watermark later. There is unlikely to be a 'watermark layer' in the PDF file.
If you post a URL to the original PDF file I can look at it.

If it is just about text extraction, this command should do it:
pdftotext \
-layout \
input.pdf \
output.txt
If your "watermark" is not text but some sort of image or vector graphic, it will not be part of your output.txt.
If your "watermark" is indeed text, that watermark string will also appear in your output text for each page. It should be easy to remove that string from the text (and replace it with nothing if unwanted).
If your "watermark" text appears as gobbledygook in output.txt, then the font type ("Type 3"?) or font encoding ("custom"?) used for the watermark text does not allow for easy text extraction, or a valid "ToUnicode" map is missing for the font used.
If your main text did not extract successfully from the PDF and comes out as gobbledygook, it very likely won't extract any better after removing the "watermark" from the original PDF file before applying pdftotext either.

Using ImageMagick or Ghostscript (or something) to scale PDF to fit page?

I need to shrink some large PDFs to print on an 8.5x11 inch (standard letter) page. Can ImageMagick/Ghostscript handle this sort of thing, or am I having so much trouble because I'm using the wrong tool for the job?
Just relying on the 'shrink to page' option in client-side print dialogs is not an option, as we'd like for this to be easy-to-use for the end users.

I would not use convert. It uses Ghostscript in the background, but is much slower. I'd use Ghostscript directly, since it gives me much more direct control (and also some control over settings which are much more difficult to achieve with convert). And for convert to work for PDF-to-PDF conversion you'll need to have Ghostscript installed anyway:
gs \
-o /path/to/resized.pdf \
-sDEVICE=pdfwrite \
-dPDFFitPage \
-r300x300 \
-g2550x3300 \
/path/to/original.pdf
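The -g value in the command above is simply the page size in inches multiplied by the -r resolution. A small sketch of that arithmetic, with a hypothetical helper name, makes it easy to derive the geometry for other page sizes or resolutions:

```shell
# Print the WIDTHxHEIGHT pixel geometry for Ghostscript's -g option,
# given a page size in inches and a resolution in dpi.
# Usage: gs_geometry WIDTH_IN HEIGHT_IN DPI
gs_geometry() {
  # Adding 0.5 before the integer conversion rounds to the nearest pixel.
  awk -v w="$1" -v h="$2" -v dpi="$3" \
    'BEGIN { printf "%dx%d\n", w * dpi + 0.5, h * dpi + 0.5 }'
}
```

For US letter at 300 dpi, gs_geometry 8.5 11 300 prints 2550x3300, the value used with -r300x300 above.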

The problem with using ImageMagick is that you are converting to a raster image format, increasing file size and decreasing quality for any vector elements on your pages.
Multivalent will retain the vector information of the PDF.
Try:
java -cp Multivalent.jar tool.pdf.Impose -dim 1x1 -paper "8.5x11in" myFile.pdf
to create an output file myFile-up.pdf

ImageMagick's mogrify/convert commands will indeed do the job. Stephen Page had just about the right idea, but you do need to set the dpi of the file as well, or you won't get the job done.
Assuming you have a file that's 300 dpi and already the same aspect ratio as 8.5 x 11 the command would be:
# 300 dpi x 8.5 in = 2550 px, 300 dpi x 11 in = 3300 px
convert -density 300 original.pdf -resize 2550x3300 resized.pdf
If the aspect ratio is different, then you need to do some slightly trickier cropping.

The Ghostscript approach worked well for me. (I moved my file from my Windows PC to a Linux computer and ran it there.) I made one small change to the Ghostscript command because the Ghostscript resize command above completely fills an 8.5 by 11 inch page. My printer cannot print to the edge, though, so several millimeters along each page edge were lost. To overcome that problem, I scaled my PDF document to 0.92 of a full 8.5 by 11 inches. That way I saw everything centered on the page and had a slight margin. Because 0.92 * (2550x3300) = (2346x3036), I ran the following Ghostscript command:
gs -sDEVICE=pdfwrite \
-dPDFFitPage \
-r300x300 \
-g2346x3036 \
/home/user/path/original.pdf \
-o /home/user/path/resized.pdf
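The 0.92 factor is specific to that printer, so a different margin needs a different geometry. As a sketch (the helper name is an assumption), the scaled -g value can be computed from the full-page geometry and any scale factor:

```shell
# Scale a WIDTHxHEIGHT pixel geometry by a factor and print the result
# in the form expected by Ghostscript's -g option.
# Usage: scaled_geometry WIDTHxHEIGHT FACTOR
scaled_geometry() {
  awk -v g="$1" -v f="$2" 'BEGIN {
    split(g, dim, "x")   # dim[1] = width, dim[2] = height
    # Adding 0.5 before the integer conversion rounds to the nearest pixel.
    printf "%dx%d\n", dim[1] * f + 0.5, dim[2] * f + 0.5
  }'
}
```

scaled_geometry 2550x3300 0.92 prints 2346x3036, matching the command above; a factor of 1 returns the full-page geometry unchanged.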

If you use Insert > Image... in LibreOffice Writer to insert a PDF, you can use direct manipulation or its Image Properties to resize and reposition the PDF, and when you File > Export as... PDF the PDF remains vectors and text. Interestingly when I did this with a PDF invoice the PDF exported from LO is smaller than the original, but the Linux pdfimages command-line utility suggests LO preserves any raster images within the original PDF.
However, you want something easier-to-use for your end users than the print dialog's "Shrink to page" option. There are tools like Adobe Acrobat that lay out PDFs to form print jobs that are PDFs; I don't know which ones have a simple "Change the bounding box and scale to letter-size". Surprisingly the do-it-all qpdf tool lacks this feature.
