Scaling PDF file using ghostscript

Our system takes 8.5 x 11 PDF files (only) and does things to them. Sometimes customers hand us files to manipulate into the right shape. We're working to automate scaling non-standard sized PDF files into 8.5 x 11.
We've been able to handle most files we've tested with ghostscript, but we have this one customer submitted file that we are unable to handle. (And unfortunately we can't recreate the condition and, of course, can't post the customer's data.)
The file is PDF v1.7 and contains seven 8.5x11 pages followed by four pages that are 25.5 x 45.33 inches. I don't know how they were generated (Adobe Acrobat 10.1.2 per pdfinfo).
We have gradually added a series of parameters to our gs command until we arrived at this:
gs -sDEVICE=pdfwrite -sOutputFile=$final_file -dBATCH -dNOPAUSE -sPAPERSIZE=letter -q -r720 -g6120x7920 -dPDFFitPage -dFIXEDMEDIA $files_to_convert
This seems to work fine for our other files, but for this ONE file, the 25.5 x 45.33 pages are not scaled to letter size. Here are the measurements for the output file's pages 7 and 8's per pdfinfo:
Page 7 size: 612 x 792 pts (letter)
Page 7 rot: 0
Page 8 size: 1836 x 3264 pts
Page 8 rot: 0
I've read that PostScript has Policies, PageSize options, but I'm not aware of such a thing with PDF. And if it exists, I don't know how to alter it using ghostscript.
How can I make sure all pages are scaled to letter?

Well, Ghostscript uses PostScript as its scripting language, so anything you can do in PostScript you can do to a PDF file.
I really wouldn't use -g with pdfwrite, because -g specifies pixels, and since pdfwrite is a vector device that doesn't really work well. Use DEVICEHEIGHTPOINTS and DEVICEWIDTHPOINTS instead.
Don't set -sPAPERSIZE either, you can't set the media to be letter in one place and something different (the -g switch) elsewhere.
It's not really possible to tell you exactly what's going on with your PDF file without seeing it, and you haven't really explained what's wrong. You imply that the pages are not being scaled, but you don't say what size they are being drawn at. You also don't say why you think the pages are 'legal' size when viewed in Acrobat.
If you are saying that the pages in question are 'legal' but the media is much larger, then that is entirely possible and would suggest that the pages have a CropBox. Ghostscript uses the MediaBox for page sizes, Acrobat uses a plethora of different boxes, but usually defaults to the CropBox.
If you want Ghostscript to use the CropBox then just tell it -dUseCropBox.
Alternatively post an example somewhere and I can look at it.
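As a rough sketch of the suggestions above (untested against your problem file, which I haven't seen), the command could be wrapped something like this. The function name scale_to_letter and its two arguments are made up for illustration, and -dUseCropBox will only help if the oversized pages really do carry a CropBox.

```shell
# Hypothetical helper: scale every page of a PDF onto US Letter media.
# Usage: scale_to_letter input.pdf output.pdf
scale_to_letter() {
    # 612x792 points = 8.5x11 inches. Sizes in points suit a vector device
    # better than -g (pixels) combined with a separate -sPAPERSIZE.
    gs -q -dBATCH -dNOPAUSE -sDEVICE=pdfwrite \
       -dDEVICEWIDTHPOINTS=612 -dDEVICEHEIGHTPOINTS=792 \
       -dFIXEDMEDIA -dPDFFitPage -dUseCropBox \
       -sOutputFile="$2" "$1"
}
```

Running pdfinfo on the result, as you did before, should show whether the large pages now come out at 612 x 792 pts.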

Related

Ghostscript add white background image

I have a script which automatically adds a gutter to a PDF file. It adds gutter to left for ODD numbered pages and gutter to the right for EVEN numbered pages. It does this by moving the existing image over.
Here is the code for that:
gs -sDEVICE=pdfwrite -dPDFSETTINGS=/printer -o output.pdf \
-dDEVICEWIDTHPOINTS=513 \
-dDEVICEHEIGHTPOINTS=738 -dFIXEDMEDIA -c \
"<< /CurrPageNum 1 def /Install { /CurrPageNum CurrPageNum 1 add def CurrPageNum 2 mod 1 eq \
{-4.5 0 translate} {4.5 0 translate} \
ifelse } bind >> setpagedevice" -f input_file.pdf
I've found that when I send this PDF file to the printer, the additional space is not "counting", so the printed page is now narrower. I think this is because transparency doesn't count on the PDF, and so when sent to the printer the pages are seen as narrower.
Is it possible to add a white background to the pdf so it ISN'T seen as transparent? Or is there an alternative way to fix this?
I'm afraid your assumption is flawed: your 'translate' has no transparency involvement at all, it's shifting the content on the media (NB this is not an image, i.e. a bitmap, in general; it's more complex content). All the content is shifted, no matter whether it is transparent or not.
I'm afraid I can't follow what you mean about the printed page being 'narrower'. The media request will be for a page of 513x738 points, which is a really weird size: 7.125 by 10.25 inches. Unless that matches the page size of your printer, it's going to do 'something' with the result. Probably it will center it if the media is larger than the request, but if the media is smaller than requested, then it will either scale it down or crop it. Either will result in something different to what you expect.
Is there a reason you are changing the media size of the original PDF file ?
If the media request does match the printer then it's still possible that there will be cropping or scaling going on, because the printable area may not be the same as the size of the media. The paper handling of some printers means that they cannot print all the way to the edge of the media. In that case the printer may scale or crop the output again.
You can easily eliminate transparency as the culprit by simply starting with a test file which does not contain any transparency. If you aren't certain then one solution would be to use a recent version of Ghostscript and use the pdfimage32 device. That will create a PDF file from the original PDF, but the output file will only contain a bitmap image, no transparency at all.
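For what it's worth, a minimal sketch of that test could look like the following. The function name, the argument order and the 300 dpi default are my own assumptions, so pick a resolution that suits your printer.

```shell
# Hypothetical helper: re-render a PDF as one flat bitmap per page.
# Usage: flatten_to_image_pdf input.pdf output.pdf [dpi]
flatten_to_image_pdf() {
    # pdfimage32 writes a PDF whose pages each hold a single image,
    # so the result cannot contain any transparency.
    gs -q -dBATCH -dNOPAUSE -sDEVICE=pdfimage32 -r"${3:-300}" \
       -sOutputFile="$2" "$1"
}
```

If the flattened file prints with the same 'narrower' result, transparency is not the cause.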
To help us consider the problem, it would be helpful to see the original PDF file, the PDF file you send to the printer, and a scan or photograph of the final printed page. It would also be useful to know the version of Ghostscript you are using, the make and model of the printer, and how you are sending the PDF file to the printer.

ghostscript shrinking pdf doesn't work anymore

first question here.
So I was using the ghostscript command to shrink my pdf, which yielded good results (around 30-40% decrease in size). However, one day last week it stopped shrinking them and instead returned a pdf of the same size or even a bit heavier (around 1% or less). I don't know what's going on, since the command used to work fine and I was able to shrink some pdfs easily...
I will note that when using gs on my pdfs it always returns an error about some glyphs missing in the GlyphLessFont, but I don't think it's related to my issue (though if you could point me towards fixing the GlyphLessFont error that would be much appreciated).
Here's the command I use :
`gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/screen -dNOPAUSE -dQUIET -dBATCH -sOutputFile=out.pdf`
Here's also a pdf sample that was shrunk correctly (original file size 4.7 MB / shrunk version 2.9 MB) https://nofile.io/f/39Skta4n25R/bulletin1_ocr.pdf
EDIT: light version that worked for the file above : https://nofile.io/f/QOKfG34d5Cg/bulletin1_light.pdf
Here's the input and output file of another pdf that didn't work
(input) https://nofile.io/f/sXsU0Mcv35A/bulletin15_ocr.pdf
(output through the gs command above) https://nofile.io/f/STdJYqqt6Fq/out.pdf
You'll notice that both the input and output files are 27.6 MB, whereas the first file was reduced.
I would also add that i've performed OCR on these pdf using pdfocr and the tesseract engine and that's why i didn't try to convert to png to reduce the size, i need the extra OCR layer so that we can publish those file for our website and we want them to be lighter if possible.
Final info : ghostscript -v is 9.10 (2013-08-30) and tesseract is 3.03 with leptonica-1.70 and pdfocr is 0.1.4
Hope you guys can help !
EDIT2: while waiting for the answer I continued scanning and OCRing the documents, and it appears that after passing my pdf through pdfocr it was shrunk like it used to be with ghostscript. So I wonder if the pdfocr script does the shrinking with ghostscript, since I know it invokes it for other tasks during the OCR process.
The PDF has a media size of 35.44 by 50.11 inches, is that really the size of the original ?
Given that you appear to commonly use OCR I assume that, in general, your PDF files simply consist of very large images. In that case the major impact on the file size is going to come from downsampling the images. If you look at the documentation you can see that the /screen settings downsample images to 72 dpi, with a threshold of 1.5 (so images over 72 * 1.5 = 108 dpi will be downsampled to 72, anything less is regarded as not worth it).
Your PDF file has a media size of 35.44 x 50.11 inches. Its rather a large file (26 pages) so I'll limit myself to considering page 1. On this page there is one image, and a bunch of invisible text, placed there by Tesseract. The image on page 1 is a 8-bit RGB image with dimensions 2481x3508, and it covers the entire page.
So the resolution of that image is 2481 / 35.44 by 3508 / 50.11 = approximately 70.0 x 70.0 dpi.
Since that is less than 72 dpi, pdfwrite isn't going to downsample it.
Had your media been 8.5 x 11 inches then the image would have had an effective resolution of 2481 / 8.5 by 3508 / 11 = 291.9 x 318.9 dpi and so would have been downsampled by a factor of about 4.
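If you want to repeat that arithmetic for other files, here is a small sketch using awk. The function names are mine, and the 72 dpi target and 1.5 threshold are just the /screen values mentioned above, so adjust them if you use different settings.

```shell
# Effective resolution = image pixels / size on the page in inches.
# Usage: effective_dpi PIXELS INCHES
effective_dpi() {
    awk -v px="$1" -v inches="$2" 'BEGIN { printf "%.1f\n", px / inches }'
}

# Images are only downsampled when their resolution exceeds
# target * threshold (72 * 1.5 = 108 dpi for /screen).
# Usage: would_downsample DPI [TARGET] [THRESHOLD]; exit status 0 means yes.
would_downsample() {
    awk -v dpi="$1" -v target="${2:-72}" -v thr="${3:-1.5}" \
        'BEGIN { exit !(dpi > target * thr) }'
}

effective_dpi 2481 35.44   # image width on the 35.44 inch media
effective_dpi 2481 8.5     # image width if the media were Letter
```

The two calls print 70.0 and 291.9, and would_downsample succeeds only for the second value.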
However..... for me your 'working' PDF file also has a large media size, and the images are also already below the downsampling resolution. When I run that file using your command line, the output file is essentially the same size as the input file.
So I'm at a loss to see how you could ever have experienced the reduced file size. Perhaps you could post the reduced file as well.
EDIT
So, the reason that your files are smaller after passing through Ghostscript is because the vast majority of the content is the scanned pages. These are stored in the PDF file as DCT encoded images (JPEG).
The resolution of the images is low enough (see above) that they are not downsampled. However, the way that old versions of Ghostscript work is that image data is always decompressed on reading, and then recompressed when writing.
Because JPEG is a lossy image format, this means that the decompressed and recompressed image is of lower quality than the original, and the way that loss of quality is applied means that the data compresses better.
So a quirk of the way that Ghostscript works results in you losing quality, but getting smaller files. Note that for current versions of Ghostscript, the JPEG data is passed through unchanged, unless your configuration requires it to be downsampled or colour converted.
So why doesn't it compress the other file ? Well for current code, of course, which is what I'm using, it won't, because the image doesn't need downsampling or anything.
Now, when I run it through an old version of Ghostscript which I have here (9.10, chosen because that's what your working reduced file is using) then I do indeed see the file size reduced. It goes down from 26MB to 15MB.
When I look at your 'not working' reduced file, I see that it has been produced by Ghostscript 9.23, not Ghostscript 9.10.
So the reason you see a difference in behaviour is because you have upgraded to a newer version of Ghostscript which does a better job of preserving the image data unchanged.
If you really want to reduce the quality of the images you can set -dPassThroughJPEGImages=false, but IMO you'd do better to either get the media size of the original PDF correct (surely the pages are not really 35x50 inches?) or set the ColorImageResolution to a lower value.
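A sketch of the ColorImageResolution route, based on your own command line. The function name and the default of 36 dpi are assumptions on my part: the images sit at about 70 dpi on the oversized media, so the target has to be low enough that 70 exceeds target * 1.5, and 36 on a 35 x 50 inch page is roughly 150 dpi on a real A4 page. I have not run this against your files.

```shell
# Hypothetical helper: shrink scanned pages by downsampling colour images.
# Usage: shrink_scans input.pdf output.pdf [dpi]
shrink_scans() {
    # The explicit -dColorImageResolution is meant to override the 72 dpi
    # that /screen would otherwise use; 36 is an illustrative default.
    gs -q -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 \
       -dPDFSETTINGS=/screen -dColorImageResolution="${3:-36}" \
       -sOutputFile="$2" "$1"
}
```

Check the output visually before publishing, since halving the resolution of a scan is quite noticeable on small text.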

How to stop GhostScript replacing Times-Roman with NimbusRomNo9L-Regu (on Windows)

I am using Ghostscript v9.07 on Windows 7 (x64) to convert PostScript files into PDFs. Inside the PostScript files, the fonts Times-Roman and Times-Bold are used - this can be seen in the PostScript plain-text:
/Times-Roman findfont 24.00 scalefont setfont
.
.
.
/Times-Bold findfont 24.00 scalefont setfont
.
.
.
(This is extremely common in PostScript output from TeX.)
When I convert the PostScript to PDF using Ghostscript, Ghostscript uses the Nimbus-Roman font-family.
GPL Ghostscript 9.07 (2013-02-14)
Copyright (C) 2012 Artifex Software, Inc. All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
Loading NimbusRomNo9L-Regu font from %rom%Resource/Font/NimbusRomNo9L-Regu... 39
38016 2393624 2695216 1387174 4 done.
Loading NimbusRomNo9L-Medi font from %rom%Resource/Font/NimbusRomNo9L-Medi... 39
38016 2453738 2695216 1390537 4 done.
This output suggests that it is loading the font-family from resources embedded in the executable. In any case, it isn't loading it from the PostScript file so it is either from the application's distribution or the operating system.
The result looks abysmal in every PDF viewer I have tried, regardless of font-smoothing settings.
I cannot understand it. The resulting PDF's text is not rasterised - you can select it and cut and paste it and the file-size alone shows that it is still real text - so I can only conclude that this is a quirk of the Nimbus-Roman typeface.
There is absolutely no reason to use that typeface but, no matter what I do, I cannot seem to get Ghostscript to use a Times-Roman (or equivalent) typeface from the Windows font set.
I tried the -sFONTPATH switch (alternatively called sFONTDIR on some sites - I tried both) to no avail and I went looking for the fontmap file to try hacking that but didn't find one for Windows - only fontmaps for other operating systems in the install directory.
How can I change this behaviour and stop Ghostscript from using the awful font?
The way to have Ghostscript use a specific font, rather than a substitute, is to embed the font in the input.
This:
/Times-Roman findfont 24.00 scalefont setfont
merely requests a font (and size), it does not include the font data. In the absence of the font, Ghostscript is forced to use a substitute font.
On Windows, Ghostscript uses the ROM file system, and does not ship with the support files. If you want to use those you can pull them from Git, and then use the -I switch. That will allow you to alter cidfmap which controls the CIDFonts known to Ghostscript, and fontmap.GS which controls the regular fonts known.
However, I don't see the result you do with the apparently bitmapped font. Perhaps you should try a newer version of Ghostscript. Or post an original PostScript program and converted PDF file so it would be possible to examine them.
[edit starts here]
OK so basically there's some problems with what you've put up above. The screenshot you have there is from page 1, but the 'font substitution', ie the loading of NimbusRoman-Regular and NimbusRoman-Bold (on current versions of Ghostscript) take place on page 5. Clearly the 'substitution' on page 5 can't affect what's on page 1. You would have seen that if you hadn't set -dBATCH, moral; when debugging a problem, keep the command line simple.
The actual 'problem' is that the PostScript program uses a type 3 bitmap font for all the text, the font being defined in the body of the program.
In short, the text is a crappy bitmap font because that's the way the PostScript has been created, it contains a crappy bitmap font and insists on using it (FWIW it appears to have been rendered at 300 dpi).
In fact the only text in the document which is scalable is the very text which uses Times-Roman (not a substitution, it uses the font from the interpreter instead of the bitmap font).
This is the text for the figure at the top of the page; 'point', 'p', 'X', 'line', 'Y', 'u', 'W'. This has been included as an EPS (originally called ray_space.eps) into the document, which is why it uses real scalable fonts. If you zoom in on the top of page 5 you will see that the scalable fonts continue to look smooth (the body of the figure) while the bitmapped font (the figure title for example) look bitmapped.
There's nothing you can do about that, essentially, other than recreate the PostScript from the original DVI document. This is not a bug (or at least, not a Ghostscript bug).
[and another edit...]
I'm afraid it's not really simple.
There are 26 fonts named /Fa to /Fx, I am unsure why this is the case, since each font is capable of containing 255 glyphs yet most do not (at the same time, they don't just contain two glyphs either). There seems no good reason for the fonts being encoded as they are.
The program uses each font as it requires it, switching fonts when it needs different glyphs. Eg:
1 0 bop
392 319 a Fx (An)21 b(In)n(tro)r(duction)g(to)h(Pro)t(jectiv)n
(e)h(Geometry)670 410 y(\(for)f(computer)f(vision\))
815
561 y Fw(Stan)c(Birc)o(h\014eld)74 754 y Fv(1)67 b(In)n(tro)r(duction)
74 856 y Fu(W)l(e)
The 'Fx' and 'Fw' calls load the font dictionary which is then used by the following code. Frankly the TeX PostScript is difficult to figure out, one might almost think it was designed to be obfuscated.
Identifying which 2-byte pair beginning with 'F' require removing or replacing, and which do not (eg ASCII encoded image data) would be tricky. In addition the glyphs are rendered flipped in the y axis (to match the transformation matrix in use) so you would have to adjust the FontMatrix similarly for a substitute font.
If the fonts in these TeX font dictionaries always contained the same glyphs, it would be possible to define a set of substitute fonts (/Fa to /Fx) which could be used instead. These would need to be loaded into the TeXDict dictionary to replace the ones defined in the header of the program, of course.
I'm not saying it's impossible, but it would be a good amount of research effort to figure out what would be possible, and then a good amount of coding to implement it.

Using Ghostscript to change page size from A4 and other to Letter

I recently asked this question about changing paper size and have a command that scales properly most of time:
gs -sDEVICE=pdfwrite -sOutputFile=$outFile -dBATCH -dNOPAUSE -q -dDEVICEHEIGHTPOINTS=792 -dDEVICEWIDTHPOINTS=612 -dPDFFitPage -dFIXEDMEDIA $inFile
We need to have it generate output pages that are US Letter. I have two problem files I can post. One needs extra whitespace; the other just needs a very small rescaling but it gets a wildly different output.
The first file is an A4 file. The command scales it to a height of 792 and the width is scaled to 559.667. It's scaled accurately, but we need whitespace either on both sides or on the right. How can I modify my command (or run a second command) to do this?
The second file is 8.52" x 11.02". For this one I get an output file that is 549.127 x 709.8 pts and I don't get it at all. 0.02" is within our printing tolerances so we can let it go, but a) I'd rather have just one process b) maybe the issue isn't the small scaling adjustments and maybe it will be a problem for other files.
These, along with your previous question, are all related to the myriad different Boxes available in a PDF file, and the various ways that a PDF processor will deal with the available Boxes.
For your first file, the output of pdfwrite has a MediaBox of 612x792 but a CropBox of [26.1662903 0 585.83374 792.0], which is (of course) not Letter. It's the result of scaling A4 down to Letter and centring that scaled-down area on the Letter media.
If you remove the CropBox (using a binary editor) and open the file in Acrobat you will see that the white space is evenly distributed left and right of the page.
So really its up to your printing process. Either you need to tell that process to use the MediaBox and ignore the CropBox, or have it centre the CropBox on the media when the CropBox is not the same as the media.
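The CropBox numbers for the first file can be reproduced with a little arithmetic. This sketch assumes the A4 page was 595 x 842 points (A4 rounded to whole points, which is my guess from the numbers, not something I read out of the file) and the function name is made up.

```shell
# Fit a page (in points) onto 612x792 Letter media, keeping the aspect ratio.
# Prints: scaled width, scaled height, left offset, bottom offset.
# Usage: fit_to_letter WIDTH HEIGHT
fit_to_letter() {
    awk -v w="$1" -v h="$2" 'BEGIN {
        if (612 / w < 792 / h) {         # width is the tighter fit
            s = 612 / w; sw = 612; sh = h * s
        } else {                         # height is the tighter fit
            s = 792 / h; sw = w * s; sh = 792
        }
        printf "%.3f %.3f %.3f %.3f\n", sw, sh, (612 - sw) / 2, (792 - sh) / 2
    }'
}

fit_to_letter 595 842   # assumed A4 size in whole points
```

This prints 559.667 792.000 26.166 0.000, which matches the 559.667 width you measured and the left edge of the CropBox. The right edge is 26.166 + 559.667 = 585.833.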
Your second file has a MediaBox of 684x864, which is 9.5x12 inches. However, it has a TrimBox of [33.8027 33.7217 647.533 827.028], which works out to 8.524x11.018 inches.
Clearly Acrobat (or whatever you are using to get the size) is using the TrimBox, not the MediaBox. Ghostscript uses the MediaBox by default; if you want it to use a different *Box then you have to tell it so. Try adding -dUseTrimBox to your command line. See my answer to your previous question :-)
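The box arithmetic is simple enough to script if you need to check other files. A minimal sketch, with a made-up function name, that converts a box given in points to inches:

```shell
# Convert a PDF box "llx lly urx ury" (points) to width x height in inches.
# Usage: box_inches LLX LLY URX URY
box_inches() {
    awk -v llx="$1" -v lly="$2" -v urx="$3" -v ury="$4" \
        'BEGIN { printf "%.3f x %.3f\n", (urx - llx) / 72, (ury - lly) / 72 }'
}

box_inches 33.8027 33.7217 647.533 827.028   # the TrimBox
box_inches 0 0 684 864                       # the MediaBox
```

This prints 8.524 x 11.018 for the TrimBox and 9.500 x 12.000 for the MediaBox, which is why a page that Acrobat reports as roughly Letter gets scaled down so much when Ghostscript fits the full MediaBox.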

Convert PDF to PS in Ghostscript and preserve CMYK split

I have an RGB PDF which I've preflighted in Adobe Acrobat pro to a PDF x1a compliant PDF in US Web Coated SWOP v2.
The PDF now has 4 plates (C/M/Y/K)
C plate is empty
M plate has 100% of a red image
Y plate has 100% of the same red image
K plate has 100% of black text on page (text is not on any other plate)
I'm now trying to convert that PDF into a PS using ghostscript
I've tried:
gs -dNOPAUSE -dBATCH -sDEVICE=ps2write -sProcessColorModel=DeviceCMYK -sOutputFile=output.ps input.pdf
But then when I distill this PS back to a PDF the text is on all the plates and not just the K plate.
I've used this online tool:
http://pdf.my-addr.com/free-online-pdf-to-ps-convert.php
To also do the conversion and the distilled version of the PS generated by that preserves the plate breakdown. They are also using Ghostscript to create the PS.
So I'm assuming there is some setting I am missing.
Does anyone know?
Update 1
Trying in pdftops too and again it is taking my K plate and spreading it across all CMYK plates.
What secret magic are they doing on that web site to preserve plates?!
Update 2
Only main difference I can see is I'm using
%%Creator: GPL Ghostscript 905 (pswrite)
and that website is using
%%Creator: GPL Ghostscript 871 (pswrite)
Could it be a version thing, or are they doing something I'm not?
Ghostscript 9 and above use much better colour management than previous versions, but you have to get the ICC profiles correct. I'd guess you are using the default profiles and I think the first thing I'd suggest is that you use the current version of Ghostscript which is 9.07, I think there were a few changes made to the default profiles.
It's also possible that the PDF file now has an input RGB profile associated with it, which Ghostscript is now using whereas previously it didn't. I'd need to see the file to be able to tell better what is going on, but I have a sneaking suspicion that your 'pre-flight' conversion is causing the problem. What happens if you use the original PDF file?
I very much doubt if the PDF file actually contains CMYK colour components, I would imagine all that has happened is that different profiles have been inserted into the file that control the conversion from RGB to CIE and from CIE to CMYK.
In passing, don't use pswrite. It's a terrible low-level output that converts much of the content into images. It produces large PostScript that processes very slowly and doesn't scale well (i.e. if the printer is a different resolution). Use the ps2write device instead.
By the way, since you've already used Acrobat, why don't you just use 'Save As' PostScript from there ?