I am using (PDF)LaTeX to make a document, and I also need to embed already existing PDF documents in it. The problem is that I have PDF documents in several different page sizes (letter, a4, etc) and I want to compile all of them into a single b5 PDF document.
If I use the pdfpages package from CTAN, all hyperlinks from the original PDFs are removed. So I tried to do it with GhostScript.
This sounds like something normal to do but I have failed to find a working solution.
In the meantime I have read a few questions and answers, but failed to figure out what I am doing wrong and what I am missing.
This doesn't seem to address my problem of scaling.
Neither does that.
This seems to go in the right direction but I couldn't make use of the information :-(.
To make the problem easier, let's just try to resize a single PDF so that:
its contents are scaled to fit the page
the new page has the size I want
Sounds easy, and it is easy to do, for example with pdfjam:
pdfjam --outfile b5-foo.pdf --paper b5paper foo.pdf
Now the problem with this is that pdfjam throws away hyperlinks. From its website:
A potential drawback of pdfjam and other scripts based upon it is that any hyperlinks in the source PDF are lost.
This must be because it seems to use pdfpages mentioned above.
Unlike pdfjam, GhostScript keeps hyperlinks. However, it either:
crops the original when I downscale; or
does not put the scaled content on a page of the size I need -- instead, I get a page that seems to be scaled down, while keeping the original aspect ratio.
This is what I have installed:
$ gs --version
9.21
(Installed on Linux)
This is how I can use GhostScript to crop the content:
gs -dBATCH -dNOPAUSE \
-sDEVICE=pdfwrite -dFIXEDMEDIA -sPAPERSIZE=isob5 \
-o b5-foo.pdf foo.pdf
... and here is how I can use -dPDFFitPage to scale the content but also keep the aspect ratio of the original page size:
gs -dBATCH -dNOPAUSE \
-sDEVICE=pdfwrite -dFIXEDMEDIA -sPAPERSIZE=isob5 -dPDFFitPage \
-o b5-foo.pdf foo.pdf
To be even clearer: I seem to get a page that is scaled so that it would fit inside the b5 I am asking for, but it is not b5: it still has the H/W ratio the original (letter) had!
I'd be happy if this can be done just using switches but if I need to use PostScript that's perfectly fine.
The solution seems to be to use -dPSFitPage instead of -dPDFFitPage. This might have something to do with the PDF files that I am trying to resize. Unfortunately, I cannot share those :-(. When I tried to reproduce this with files that I generated, the problem did not reproduce. I don't know why this is or how I should have known it.
To summarize, using PDF files for both input and output:
-dFitPage and -dPDFFitPage give me scaled pages with the original aspect ratio
-dPSFitPage gives me scaled content on the page size I request with -sPAPERSIZE="$PAPERSIZE"
This seems to go against what the documentation says.
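The combination that worked in the summary above can be wrapped in a small shell function. This is only a sketch: the function name resize_to_b5 and the GS override variable are made up for illustration, and -dPSFitPage is used because it is what worked for these particular files, not because the documentation says it should be needed.

```shell
# Sketch: scale a PDF onto ISO B5 pages with pdfwrite.
# resize_to_b5 and the GS variable are hypothetical names, not part of Ghostscript.
# Set GS=echo to print the arguments instead of running Ghostscript.
resize_to_b5() {
    # $1 = input PDF, $2 = output PDF
    "${GS:-gs}" -dBATCH -dNOPAUSE \
        -sDEVICE=pdfwrite -dFIXEDMEDIA -sPAPERSIZE=isob5 -dPSFitPage \
        -o "$2" "$1"
}
```

Calling resize_to_b5 foo.pdf b5-foo.pdf runs the same command as above with -dPSFitPage swapped in. If the result still keeps the letter aspect ratio, the input files probably behave differently from the ones described here.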
I've read a few threads on this general subject, but they don't seem to help so I'm asking as a new question.
I'm currently creating a process to use Ghostscript from vb.net (I've had some issues with parameters with Ghostscript.NET, but an external process works well enough for me).
I'm capturing a small square area from a specific page of a pdf and converting to a png. There's an offset involved as the area is not on the left/bottom edge, so I'm using 200x200 for the size and -220, 206 for x/y offset and I've specified page 1 for this example, as below:
" -dBATCH -dNOPAUSE -sDEVICE=pnggray -r600 -dDEVICEWIDTHPOINTS=200 -dDEVICEHEIGHTPOINTS=200 -dFIXEDMEDIA -dFirstPage=1 -dLastPage=1 -SOutputFile="mypath\output.png" -c "<</PageOffset [-220 206]>> setpagedevice" -f "mypath\input.pdf"
This works fine, but I next need to restrict the output png file to 600 x 600 pixels, regardless of input size.
All the solutions I've found seem to be related to changing the device width/height points, but I use those to specify the area I want to capture, so I'm confused as to how this should work.
Can anyone help?
Sorry I couldn't tag this for Ghostscript - my reputation isn't high enough!
Thanks
Thanks K J for your comments, very helpful. I've been testing an alternative library for decoding the QR codes - ZXing.net - and this seems not to be so fussy about resolution and image size, so I think I've gotten around the problem that way.
I've tried it with both 300 and 600 resolution files of different sizes and it's worked well, so I'd recommend it to anyone with this sort of requirement!
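Returning to the original 600 x 600 pixel requirement: the device size in points and the resolution together determine the pixel size (pixels = points / 72 * dpi), so a 200 point square at -r600 comes out at roughly 1667 pixels. A minimal sketch of the arithmetic, assuming the capture area stays at 200 x 200 points; res_for_pixels is a hypothetical helper name, and the sketch does not check whether the PageOffset needs adjusting.

```shell
# Sketch: pick the -r value so that a capture area given in points
# renders to a fixed pixel size. pixels = points / 72 * dpi.
# res_for_pixels is a hypothetical helper name.
res_for_pixels() {
    # $1 = target size in pixels, $2 = capture size in points
    awk -v px="$1" -v pt="$2" 'BEGIN { printf "%g\n", px * 72 / pt }'
}

res_for_pixels 600 200    # prints 216
```

Under that assumption, replacing -r600 with -r216 in the command above should give a 600 x 600 pixel PNG for the same 200 x 200 point area.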
I have an old Kindle Dx. Owing to disabilities, I can't use tablets or other touch devices, and I transfer pdfs to the Kindle to read them. It requires pre-processing.
What is a good option to pre-process pdfs without rasterizing them?
[When rasterizing is acceptable:
k2pdfopt -mode copy for maps or for small text. This rasterizes, enhances contrast, and makes everything 1.4-compatible.
k2pdfopt -mode copy -dev dx for other works. This rasterizes to 800x1080, downsamples as needed, enhances contrast while making everything grayscale, and makes everything 1.4-compatible.
When rasterizing text is not acceptable:
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -sstdout=%stderr -dNOPAUSE -dQUIET -dBATCH -sOutputFile=output.pdf input.pdf if you want to preserve graphics. This makes minimal changes to make everything 1.4 compatible.
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 \
-g800x1080 -r150 -dPDFFitPage \
-dFastWebView -sColorConversionStrategy=RGB \
-dDownsampleColorImages=true -dDownsampleGrayImages=true -dDownsampleMonoImages=true -dColorImageResolution=150 -dGrayImageResolution=150 -dMonoImageResolution=300 -dColorImageDownsampleThreshold=1.0 -dGrayImageDownsampleThreshold=1.0 -dMonoImageDownsampleThreshold=1.0 \
-sstdout=%stderr -dNOPAUSE -dQUIET -dBATCH -sOutputFile=output.pdf input.pdf if you want moderate downsampling. This re-rasterizes existing raster images to fit 800x1080 and makes everything 1.4 compatible.
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 \
-g800x1080 -r150 -dPDFFitPage \
-dFastWebView -sColorConversionStrategy=Gray \
-dDownsampleColorImages=true -dDownsampleGrayImages=true -dDownsampleMonoImages=true -dColorImageResolution=75 -dGrayImageResolution=75 -dMonoImageResolution=150 -dColorImageDownsampleThreshold=1.0 -dGrayImageDownsampleThreshold=1.0 -dMonoImageDownsampleThreshold=1.0 \
-sstdout=%stderr -dNOPAUSE -dQUIET -dBATCH -sOutputFile=output.pdf input.pdf if you want more aggressive downsampling. This re-rasterizes raster images to fit 400x540, makes them grayscale, and makes everything 1.4 compatible. Low image quality, but usually still recognizable.
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dFILTERIMAGE -dFILTERVECTOR -sstdout=%stderr -dNOPAUSE -dQUIET -dBATCH -sOutputFile=output.pdf input.pdf if you want to cut all graphics.
If using any of these options to pre-process for another device check its screen size in pixels. Don't worry too much about pixels per inch.]
[I.S. My goals are to fix pdfs so they 1. don't crash my Kindle, 2. don't freeze my Kindle or take too long to load each page, and 3. don't take up too much of the limited disk space on my Kindle. Preferably also 4. not rasterizing text, 5. not cutting out all images, which can sometimes lose tables, etc. and 6. not reflowing text, which will generally lose tables. But I'm happy to downsample most images.]
[I.S. Note that I'm keeping copies of the originals. This is not a way to save disk space!]
For scanned pdfs, Willus's k2pdfopt is a great option. I've set up Mac Automator for
k2pdfopt -mode copy -dev dx
or occasionally just -mode copy.
For pdf-born-pdfs, I'd rather not rasterize everything.
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -sstdout=%stderr
-dNOPAUSE -dQUIET -dBATCH
can usually convert files, so the Kindle Dx can open them, but the Kindle will still slow, freeze, or crash with some pages.
One option is to combine Ghostscript and Mutool as follows:
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -sstdout=%stderr -dNOPAUSE -dQUIET -dBATCH to pre-process pdfs to remove passwords,
mutool clean -g -g -d -s -l to sort out the junk, and then
gs
-sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -sstdout=%stderr -dNOPAUSE -dQUIET -dBATCH again to get a smaller and faster pdf.
Note: I think Mutool's 3rd -g is the equivalent of Ghostscript's -dDetectDuplicateImages. Since it slows rendering down it may be better to do the opposite. I'm not sure how to set it to false. -dDetectDuplicateImages false? -uDetectDuplicateImages?
Note: I'm using gtime to time pdf rendering.
A single-step tool in a single application would help. And an image-reduction tool would also help. Ghostscript's documentation is hard to follow.
For cleanup, as an alternative to running mutool:
-dFastWebView might help.
-dNOGC indicates that Ghostscript does garbage collection by default.
For image reduction:
-dPDFSETTINGS=/screen seems to work better in 9.50 than 9.23. /ebook might be better since it embeds all fonts.
-dFILTERIMAGE -dFILTERVECTOR also work better in 9.50 than 9.23, but are more drastic than I'd like.
A lot of settings seem to rely on input resolution and/or input page size.
-r seems to rely on input page size, rather than output page size. The Kindle Dx is 800 pixels by 1180 pixels.
-dDownScaleFactor reduces relative to input resolution.
-g800x1080 seems to crop pages, not shrink them.
I think -sDEVICE=pdfimage8 rasterizes everything, like k2pdfopt.
In some cases
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dFastWebView
-uDetectDuplicateImages -dPDFSETTINGS=/ebook -sstdout=%stderr -dNOPAUSE -dQUIET -dBATCH yields larger and slower files than just -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -sstdout=%stderr -dNOPAUSE -dQUIET -dBATCH
... I'm not sure what to make of these results.
You've asked an awful lot in here, which makes it rather difficult to read and answer cogently. You haven't really made it clear exactly what it is you want to achieve (you also haven't said what version of GS and MuPDF you are using).
Here are some points;
You don't need to 'clean out the junk' from PDF files produced by Ghostscript; these rarely have anything which can be removed. That's one reason people run PDF files through GS+pdfwrite (despite my saying constantly it's a bad idea).
Using the -g switch with Mutool twice doesn't (AFAIK) do anything extra, but adding -d decompresses the files. You can have Ghostscript produce uncompressed PDF files too, use -dCompressPages=false -dCompressFonts=false -dCompressStreams=false.
When you pass your PDF through pdfwrite, then MuPDF, then pdfwrite again, you are risking quality degradation at every step, and the intermediate MuPDF step is unlikely to achieve anything. Most likely what you are doing is reducing the compression (and quality) of any JPEG compressed images; I doubt much else of use is happening.
I can't think why you'd want to not detect duplicate images, it really just makes the file bigger, but if you want to, you use the switch the same way as all the other GS switches: -dDetectDuplicateImages=false. Note this won't change the processing speed (and generally pdfwrite doesn't do rendering, but perhaps you mean on the target device...). The detection is done by applying an MD5 filter to every image as it is read, then comparing the MD5 hashes. Switching that off doesn't stop the MD5, it just stops the comparison.
If you find Ghostscript's documentation hard to follow, then use the Adobe documentation for distillerparams, that's where the majority of the pdfwrite settings come from (ie blame Adobe for this ;-)
-dFastWebView is (IMO) totally pointless; it's there purely for compatibility with Adobe, and because a lot of people won't accept that it's useless and insist on it. All it does is speed up loading of the first page of a PDF file, by PDF consumers which support it (which is practically none). And to do this it makes the file slightly bigger and more complicated.
Do NOT use -dNOGC. I keep telling people not to do this; it's a debugging tool, and it has no practical value in production other than to potentially make Ghostscript use more memory. Everything else you hear about it is cargo cult.
-r has nothing to do with the media size at all, and does (more or less) nothing with pdfwrite. It sets the resolution of a page when rendering. Since you don't want to render to an image, setting the resolution is not a useful thing to do.
No pdfwrite settings rely on the "input resolution" because PDF (and PostScript) files don't have a resolution, they are vector page descriptions.
-dDownscaleFactor is a switch which only applies to the downscaling devices; tiffscaled and friends, which are rendering devices, it has no effect at all on pdfwrite.
Setting a fixed media size (using -g) does indeed rely on the resolution (because it's specified in device pixels) and does indeed only alter the media size, not the content. If you want to rescale the content to fit the new media, then you need to use -dFitPage. I can't really see why you would do that. Note that it doesn't affect the content of a PDF file (unless it's a rendered image), it just makes all the numeric values smaller.
The pdfimage devices do indeed produce a PDF file where the entire content is an image; hence the name....
Now, if you could define what you actually want to achieve, I could make some suggestions.....
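The point above about -g depending on the resolution can be checked with a little arithmetic: media size in points = pixels * 72 / dpi. The sketch below uses a hypothetical helper name, g_to_points, and the 720 dpi figure is the pdfwrite default quoted in another answer on this page, taken here as an assumption.

```shell
# Sketch: media size in points implied by -gWxH at a given resolution.
# points = pixels * 72 / dpi. g_to_points is a hypothetical helper name.
g_to_points() {
    # $1 = width in pixels, $2 = height in pixels, $3 = resolution in dpi
    awk -v w="$1" -v h="$2" -v r="$3" \
        'BEGIN { printf "%g %g\n", w * 72 / r, h * 72 / r }'
}

g_to_points 800 1080 150   # prints 384 518.4
g_to_points 800 1080 720   # prints 80 108
```

So -g800x1080 on its own would describe a very small page at 720 dpi, which may be why it looked like cropping without -dFitPage.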
[EDIT]
image downsampling
Firstly there are three controls which turn this feature on/off altogether;
-dDownsampleMonoImages, -dDownsampleGrayImages and -dDownsampleColorImages. Assuming you don't select a PDFSETTINGS (I would recommend you do not) these are all initially false. If you want to downsample any images you need to set the relevant mono/gray/color switch to true.
Once downsampling is enabled then you need to set the relevant ImageResolution and ImageDownsampleThreshold; there are again switches for each colour depth.
Now although PDF files don't have a resolution, the images have an effective resolution, but it's not easy to calculate (actually without a lot of effort it's impossible). It's the number of image samples in the bitmap in each direction, divided by the extent of the media covered by the image in that direction.
As an example if I have an image 100x100 samples, and that is placed on the page in a 1 inch square, then the resolution of the image is 100 dpi. If I then scale the image up so that it covers 2 inches square (but don't change the image data) then its 50 dpi.
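The example above is just samples divided by inches covered. A minimal sketch, with effective_dpi as a hypothetical helper name:

```shell
# Sketch: effective resolution of an image = samples / inches covered on the page.
# effective_dpi is a hypothetical helper name.
effective_dpi() {
    # $1 = samples along one direction, $2 = inches covered in that direction
    awk -v s="$1" -v len="$2" 'BEGIN { printf "%g\n", s / len }'
}

effective_dpi 100 1   # prints 100
effective_dpi 100 2   # prints 50
```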
So you need to decide what resolution looks OK on your device. You then set -dColorImageResolution=, -dMonoImageResolution, -dGrayImageResolution.
That's the 'target' resolution. But if the image is already close to that it can be wasteful to process it, so the Downsampling threshold is consulted. The actual resolution of the image in the input has to be the target resolution times the threshold, or more, to be reduced for output.
If we consider, for example, a target resolution of 300 and a threshold of 1.5 then the actual resolution of an image in the input file would have to exceed 450 dpi to be considered for downsampling.
Obviously you can set the threshold to 1.0 eg -dColorImageDownsampleThreshold=1.0
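The threshold rule can be written as a one-line check. This sketch follows the wording above ("the target resolution times the threshold, or more"), so it treats an image exactly at the limit as eligible; whether Ghostscript itself uses "greater than" or "greater than or equal" is not something verified here. needs_downsampling is a hypothetical helper name.

```shell
# Sketch: would an image be considered for downsampling?
# Rule used (an assumption based on the text above): actual >= target * threshold.
# needs_downsampling is a hypothetical helper name.
needs_downsampling() {
    # $1 = actual image dpi, $2 = target dpi, $3 = threshold
    # Exit status 0 means "yes", 1 means "no".
    awk -v a="$1" -v t="$2" -v k="$3" 'BEGIN { exit !(a >= t * k) }'
}

if needs_downsampling 600 300 1.5; then
    echo "600 dpi image: downsample to 300"
fi
```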
Finally there is the downsampling type; this is the filter used to create the lower resolution image from the higher. The simplest is /Subsample: basically throw away enough lines and columns until we reach the required resolution (this is the only filter available for monochrome images, as all the others would change the colour depth). Then there's /Average, which averages the value in each direction, effectively a bilinear filter. Finally there's /Bicubic, which probably does the 'best' job but will be the slowest to process.
On top of all that you can choose the Image Filter (the compression filter) used to write the image data. We don't support JPXEncode in the AGPL version of Ghostscript and pdfwrite. That leaves you /CCITTFaxEncode (for monochrome), /DCTEncode (JPEG) and /FlateEncode (basically Zip compression). That's MonoImageFilter, GrayImageFilter and ColorImageFilter.
If you want to use these you must first set AutoFilterGrayImages to false and/or AutoFilterColorImages to false, because if these are true the pdfwrite device will choose a compression method by looking to see which one compresses most. For Gray and Color images this will almost certainly be JPEG.
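Putting the downsampling switches described above together for colour and gray images might look like the sketch below. The function name downsample_pdf, the GS override variable, the 150 dpi default and the choice of /Bicubic with /DCTEncode are all illustrative assumptions, not recommendations from the answer.

```shell
# Sketch: downsample colour and gray images with explicit filters.
# downsample_pdf and GS are hypothetical names; 150 dpi is only an example target.
# Set GS=echo to print the arguments instead of running Ghostscript.
downsample_pdf() {
    # $1 = input PDF, $2 = output PDF, $3 = target dpi (default 150)
    res="${3:-150}"
    "${GS:-gs}" -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 \
        -dDownsampleColorImages=true -dDownsampleGrayImages=true \
        -dColorImageResolution="$res" -dGrayImageResolution="$res" \
        -dColorImageDownsampleThreshold=1.0 -dGrayImageDownsampleThreshold=1.0 \
        -dColorImageDownsampleType=/Bicubic -dGrayImageDownsampleType=/Bicubic \
        -dAutoFilterColorImages=false -dAutoFilterGrayImages=false \
        -dColorImageFilter=/DCTEncode -dGrayImageFilter=/DCTEncode \
        -o "$2" "$1"
}
```

For example, downsample_pdf input.pdf output.pdf 75 would request 75 dpi images. Monochrome images are left alone in this sketch, since only /Subsample applies to them.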
Final point is that linework (vector data) cannot be selectively rendered; either everything is rendered or everything is maintained 'as it was'. The only time (in general) that pdfwrite renders content is when transparency is present and the output CompatibilityLevel doesn't support transparency (1.3 or below). There are exceptions but they are quite uncommon.
You might want to consider setting the ColorConversionStrategy to either /DeviceRGB or /DeviceGray. I've no idea if you are using colour or grayscale devices, but if they are grayscale creating a gray PDF file would reduce the size and processing significantly. Creating an RGB file for colour devices probably makes sense too, in case the input is CMYK.
I have many (around 1000) multiple-page PDFs for a program I am writing.
The problem is that many of them are inconsistent about page size, even within the same document at times. Does anyone know of a way I could programmatically go through the files and resize the pages to what I want? This can be in any language.
I can accomplish this in Adobe Acrobat Pro, but there are so many files that it would end up taking a long, long time. The only way I can get it to resize there is to add a background from a file, and then choose the file I want to resize.
Generally, PDFtk is a good fit for this kind of problem. It will let you pull everything apart, and reorder/resize/modify pages on the command line.
I had a similar problem and could easily solve it with PDF split and merge, a Java based toolkit for editing PDF files.
You can resize a PDF with a command line tool, Ghostscript.
Assuming you want to resize a PDF to 306x396 points (which would give you a quarter of a letter-sized page), do it like this:
gs \
-o 306x396-points.pdf \
-sDEVICE=pdfwrite \
-g3060x3960 \
-dPDFFitPage \
-dUseCropBox \
input.pdf
Note that the -g... dimensions are in device pixels. Because the pdfwrite device computes at 720 PPI by default, they are 10 times the sizes in points (720 / 72 = 10).
For Windows, use gswin32c.exe or gswin64c.exe instead of gs.
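For a large batch of files, the same command can be driven from a loop. A minimal sketch, assuming the 720 dpi default described above, whole-point page sizes, and input files in the current directory; points_to_g, resize_all and the GS override variable are hypothetical names.

```shell
# Sketch: build the -g argument for pdfwrite from a page size in whole points,
# assuming the 720 dpi default described above (10 device pixels per point).
# points_to_g and resize_all are hypothetical helper names.
points_to_g() {
    # $1 = width in points, $2 = height in points
    echo "-g$(( $1 * 10 ))x$(( $2 * 10 ))"
}

# Resize every given PDF, writing resized-NAME.pdf next to it.
# Set GS=echo to print the arguments instead of running Ghostscript.
resize_all() {
    # $1 = width in points, $2 = height in points, remaining args = input PDFs
    g=$(points_to_g "$1" "$2")
    shift 2
    for f in "$@"; do
        "${GS:-gs}" -o "resized-$f" -sDEVICE=pdfwrite "$g" -dPDFFitPage -dUseCropBox "$f"
    done
}

points_to_g 306 396   # prints -g3060x3960
```

Something like resize_all 306 396 *.pdf would then process a whole directory; the output naming breaks if the inputs include a directory path, so adjust it for other layouts.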
I have a library manual whose creator changed some of the LaTeX code, altering the page position and size, but didn't check the result before compiling, distilling and sending it off. He is currently unavailable, so if I want to print it I have to fix it myself.
I was able to use a Ghostscript command to push the entire text down to something approaching centered on the page. The command is shown below:
/usr/bin/gs -sDEVICE=pdfwrite -o /home/user/shiftdown.pdf -dPDFSETTINGS=/prepress -c "<</PageOffset [0 -35]>> setpagedevice" -f /home/user/brokendoc.pdf
The issue is that while the page is now printable without hitting hardware margins, the chapter titles are still halfway cut off at the top. If I open the PDF in Acrobat or Reader, I can select the chapter title and copy it and it pastes the full text in the program of my choosing. When I tried printing it on a Xerox MFP with a partially incompatible driver it printed the header, but it wouldn't duplex and I didn't want to print 700+ pages and then use the copy to 1 -> 2 function.
Does anyone know of a way to fix these cut off headers such that they either appear correctly in the PDF file or at least reliably print correctly? I have ghostscript easily available, TeX relatively easily available and the standard version of Acrobat X.
[update:]
After downloading the demo of Acrobat Pro XI, I was able to go to the "Print Production" tab and click on "Edit Object". When I clicked on the cut off chapter titles it showed me two bounding boxes that covered the entire page with one just a little taller than the other. When I right clicked on it I got the option to Add Clip and Delete Clip. When I click on Delete Clip it shows the entire chapter title. If I click on Add Clip it says, "One or more of the selected regions already have a clipping region. Proceed with setting the clipping regions for the selected objects? [No] [Yes]"
With that added information, I know there has to be a way to fix the issue in batch mode. Does anyone know what command this translates into?
Without seeing the 'brokendoc.pdf' it's hard to know. If I see the file, I can tell you what's going on, and (probably) how to fix it or work around it.
I don't need the entire file, so just a shortened version that only has a few pages that shows the problem will suffice. You might be able to get this from the complete brokendoc.pdf using:
gs -sDEVICE=pdfwrite -o part.pdf -dLastPage=10 brokendoc.pdf
Also, you may want to try:
gs -sDEVICE=pdfwrite -o fitted.pdf -dPDFFitPage -sPAPERSIZE=letter -dFIXEDMEDIA brokendoc.pdf
The above will scale (and center) the page on to the specified page size. You can specify 'letter' or 'a4' or use -dMEDIAWIDTHPOINTS=_ -dMEDIAHEIGHTPOINTS=_ to get a specific output page size. The -dFIXEDMEDIA option causes gs to ignore the MediaBox in the file.
I've been using Ghostscript to convert my single figure plots rendered in PDF to PNG:
gswin32c -sDEVICE=png16m -r300x300 -sOutputFile=junk.png ^
-dBATCH -dNOPAUSE Figure_001-a.pdf
This works in the sense I get a PNG out and it contains the plot.
But it contains a huge amount of white space as well (an example source image: http://cdsweb.cern.ch/record/1258681/files/Figure_001-a.pdf).
If you view it in Acrobat you'll note there is no white space around the plot. If you use the above command line you'll find the plot is only about 1/3 of the space.
When doing the same thing with an EPS file I run into the same problem. However, there is the command-line parameter -dEPSCrop that one can pass to get the PS rendering engine to pay attention to the BoundingBox.
I need a similar argument for rendering PDFs. I was not able to find it in the docs (nor even -dEPSCrop, actually).
I had exactly the same issue. I fixed it by adding -dUseArtBox switch.
Example:
/usr/bin/gs -dUseArtBox -dNOPAUSE -sDEVICE=pngalpha -sOutputFile=output.png input.pdf
Note: -dUseArtBox switch is supported since ghostscript version 9.07
-dUseArtBox
Sets the page size to the ArtBox rather than the MediaBox. The art box defines the extent of the page's meaningful content (including potential white space) as intended by the page's creator. The art box is likely to be the smallest box. It can be useful when one wants to crop the page as much as possible without losing the content.
There are various options to control which "media size" Ghostscript renders a given input to:
-dPDFFitPage
-dUseTrimBox
-dUseCropBox
With PDFFitPage Ghostscript will render to the current page device size (usually the default page size).
With UseTrimBox it will use the TrimBox (and it will at the same time set the PageSize to that value).
With UseCropBox it will use the CropBox (and it will at the same time set the PageSize to that value).
By default (given no parameter), Ghostscript will render using the MediaBox.
For your example, it looks like adding "-dUseCropBox" will do the job you're expecting.
Note, you can additionally control the overall size of your output by using "-sPAPERSIZE" (select amongst all pre-defined values Ghostscript knows) or (for more flexibility) use "-dDEVICEWIDTHPOINTS=NNN -dDEVICEHEIGHTPOINTS=NNN".
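Applied to the PNG conversion in the question, the CropBox suggestion could look like the sketch below. pdf_to_png and the GS override variable are hypothetical names, and 300 dpi is simply the resolution used in the question; whether the CropBox is the right box for a given file is something to check per document.

```shell
# Sketch: render a PDF to PNG using its CropBox as the page size.
# pdf_to_png and GS are hypothetical names; 300 dpi is only an example.
# Set GS=echo to print the arguments, or GS=gswin32c on Windows.
pdf_to_png() {
    # $1 = input PDF, $2 = output PNG, $3 = resolution in dpi (default 300)
    "${GS:-gs}" -dBATCH -dNOPAUSE -sDEVICE=png16m -r"${3:-300}" \
        -dUseCropBox -sOutputFile="$2" "$1"
}
```

For instance, pdf_to_png Figure_001-a.pdf junk.png mirrors the original command with -dUseCropBox added.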
Have you tried using pdfcrop using pdftex (comes with texlive for example) or (not tried yet) the python script pdfcrop?
I have a similar workflow using the first tool mentioned.