Lossless rotation of PDF files with imagemagick

I want to rotate a 351K PDF named 08-file.pdf using CLI tools. I've tried imagemagick:
convert 08-file.pdf -rotate 90 08-file-rotated.pdf
But the quality suffered serious degradation compared with the original.
I've tried adding the -density 300x300 argument, but the outcome was a 2.5M file, nearly one order of magnitude larger than the original, which is a huge waste.
Any idea how to losslessly rotate a PDF file using imagemagick?

I have always had bad results converting or altering PDF files with ImageMagick's convert (poor resolution or huge files). Playing with the -compress, -density, and -quality options was always frustrating and a waste of time (but I am no expert).
Proposal 1: pdftk
So I would recommend pdftk (you may need to install it via apt-get install)
Try :
pdftk 08-file.pdf cat 1-endright output 08-file-rotated.pdf
In old versions of pdftk (v<3), the rotation was indicated by a single letter:
N: 0, E: 90, S: 180, W: 270, L: -90, R: +90, D: +180. The same command was:
pdftk 08-file.pdf cat 1-endR output 08-file-rotated.pdf
From another post on this site, here is a brief explanation of the syntax:
pdftk input.pdf cat 1-endsouth output output.pdf
#     \_______/     \___/\___/        \________/
#     input file    range  |          output file
#                       direction
You can see also https://linux.die.net/man/1/pdftk
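The keyword list above can be wrapped in a small helper. This is only a sketch: `rotation_keyword` is a hypothetical function name (not part of pdftk), and it assumes a pdftk version that accepts the long keywords right, left and down for relative rotation.

```shell
#!/bin/sh
# Map a relative rotation in degrees to the pdftk keyword listed above.
# rotation_keyword is a hypothetical helper, not part of pdftk.
rotation_keyword() {
    case "$1" in
        90|+90)        echo right ;;
        -90|270|+270)  echo left ;;
        180|+180|-180) echo down ;;
        *) echo "unsupported angle: $1" >&2; return 1 ;;
    esac
}

# Usage (needs pdftk installed):
#   pdftk in.pdf cat "1-end$(rotation_keyword 90)" output out.pdf
```

Only multiples of 90 degrees are handled, since those are the only rotations the keywords describe.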
Edit 2020:
Proposal 2: qpdf
I have found another alternative which is equivalent: qpdf, easier to remember and more powerful
see QPDF manual
# Syntax (you can rotate only some pages of the document; see the manual)
qpdf --rotate=[+|-]angle[:page-range]
# Example
qpdf in.pdf out.pdf --rotate=+180
Other options worth considering
pdfjam
PDF manipulation tools (CLI)
To be considered if pdftk is not available on your system.
pdfjam looks quite similar to pdftk
pdfsam
This is a toolbox to modify pdf files with a GUI (graphical user interface).
The code is open source and multi-platform.

Please use -compress lossless option:
convert -rotate 90 -compress lossless 08-file.pdf 08-file-rotated.pdf
From the documentation:
https://www.imagemagick.org/script/command-line-options.php#compress
Lossless refers to lossless JPEG, which is only available if the JPEG
library has been patched to support it.
Another option is to use the following command:
jhead -cmd "jpegtran -progressive -perfect -rotate 270 &i > &o" Image-0001.jpeg
It will write output to a temporary file and when it succeeds it will overwrite the original file:
Cmd:jpegtran -progressive -perfect -rotate 270 "Image-0001.jpeg" > "h1xQ6q"
Modified: Image-0001.jpeg

Related

Batch PDF Watermark [PDF -> JPG -> PDF]

I am working with thousands of PDF files for a sheet music publisher.
All of these PDF files need a preview PDF. A watermark on a PDF file can easily be removed, so I am looking for a robust way to watermark our PDFs in a batch operation.
PDF->Apply Watermark->JPG->Back to PDF
How can I do this? Is there a good tool for these operations?
The free route
ImageMagick can do the complete process for you, especially with the composite command's -watermark operator.
#!/bin/sh
# ImageMagick will pick the correct conversion formats based on filename suffixes, or maybe actual binary content?
InputPDF=$1
WatermarkImg=$2
OutputPDF=$3
pdfToImage=pdfToImage.png
imageWithWatermark=imageWithWatermark.png
# Convert PDF to image
convert \
-density 300 \
-trim \
"$InputPDF" \
-quality 100 \
-flatten \
-sharpen 0x1.0 \
$pdfToImage
# Add watermark to intermediate image
composite \
-dissolve 15 \
-tile \
"$WatermarkImg" \
$pdfToImage \
$imageWithWatermark
# Convert intermediate image back to PDF
convert \
$imageWithWatermark \
"$OutputPDF"
# Clean up
rm $pdfToImage $imageWithWatermark
I find the PDF to image conversion acceptable in terms of quality, though you can see some differences when looking at the before and after side-by-side, especially in how bolded glyphs seem less bold:
You can check this good post and its answers for a number of options for converting a PDF to an image, Convert PDF to image with high resolution.
I checked out pdftoppm, which was also frequently mentioned in that thread, and I still see some degradation of the bold fonts after conversion:
Some more tiling Magick
I used this copyright symbol from Wikimedia Commons and this ImageMagick script:
#!/bin/sh
Infile="Copyright.png"
Outfile="Copyright_tiled.png"
h2=$(convert $Infile -format "%[fx:round(h/2)]" info:)
convert $Infile \
\( -clone 0 -roll +0+"$h2" \) \
+append \
-write mpr:sometile \
+delete \
-size 1224X1584 \
tile:mpr:sometile \
$Outfile
to create this staggered tiling (1224X1584 is the page size (8.5in x 11in) multiplied by 72 px/in, times 2 for a good density of tiles):
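The 1224X1584 figure can be reproduced with a little arithmetic. A minimal sketch, assuming awk is available; `tile_canvas_size` is a hypothetical helper name, and the 72 px/in and the factor of 2 are the values quoted above.

```shell
#!/bin/sh
# Compute a tile canvas size in pixels from a page size in inches.
# tile_canvas_size is a hypothetical helper; 72 px/in and the factor 2
# are the values quoted in the text.
tile_canvas_size() {
    # $1 = width (in), $2 = height (in), $3 = px per inch, $4 = multiplier
    awk -v w="$1" -v h="$2" -v ppi="$3" -v m="$4" \
        'BEGIN { printf "%dx%d\n", w * ppi * m + 0.5, h * ppi * m + 0.5 }'
}

tile_canvas_size 8.5 11 72 2   # US Letter: prints 1224x1584
```

For another page size, pass its dimensions in inches and use the result as the -size argument.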
And here it is unwatermarked again
@ZachYoung I used some different ImageMagick processing, also scriptable. The point is:
Although "What's done cannot be undone" (Macbeth, Act 5.1, 63-4) is very true, especially within a PDF or image, we also know and expect that it applies to any PDF (de)construction. Thus, depending on the value of a forgery, it will always be worth engineering a partially reversed copy that is fit for scrutiny or use. Like the watermarked copy, it will still not be the original, but it may look almost as good.
The idiom implies "don't bother yourself about it": it's best not done in the first place.
The nearest to best is to use a watermark exactly the same as the text outlines, like this:

PDF to EPS or PS to EPS conversion maintaining page size

I need to convert a PDF or Postscript file to EPS, I tried using Ghostscript with the following command to convert Postscript to EPS:
gswin32.exe -o output.eps -sDEVICE=eps2write -dFitPage input.ps
Or PDF to EPS:
gswin32c.exe -q -dNOCACHE -dNOPAUSE -dBATCH -dSAFER -sDEVICE=eps2write -o output.eps -dFitPage input.pdf
They both complete successfully, but they do not maintain the page size. The input PDF and PS files contain the same drawing, and both have a page size of 300x300 pt. You can download these files here and here. They look like this:
After conversion to EPS, the results look like this; the first is the result of PS to EPS and the second is the result of PDF to EPS (they are opened in EPS Viewer, which rasterizes the image, hence the low quality):
As you can see, none of them have the original 300x300 pts size, I've tried many Ghostscript options but I can't manage to get an EPS with the right Bounding Box. I just need to convert a PDF OR PS to EPS, whatever is easier or gives better results.
What you are asking for is, more or less, the exact opposite of what is normally required.
In general people want the EPS Bounding Box to be as tight as possible to the actual marks made by the EPS, because the normal use for an EPS file is to 'embed' it in another document. If you want extra white space you would normally add it around the EPS when you embed it.
Indeed, the EPS specification says that the BoundingBox comment should not include the white space. On page 8 of the EPSF specification:
"For an EPS file, the bounding box is the smallest rectangle that encloses all the marks painted on the single page of the EPS file"
Messing with Ghostscript switches isn't going to do anything helpful for you here; the device explicitly records the marks that are made by the input and sets the BoundingBox from those.
Perhaps if you were to explain why you want to have an EPS file with incorrect BoundingBox comments it would be possible to make some suggestions, but Ghostscript is doing exactly what it should do here.
[addendum]
(see comment below, this is in reply)
I suspect you need to change your process in some way then. One solution is to have the PDF start by filling the entire page with white. Contrary to many people's expectations that counts as making a mark on the page so the entire page would then be considered as the BoundingBox.
As long as you are using the Ghostscript eps2write device, you could also parse the document for %%BeginPageSetup; the eps2write device still writes the original document size out in this section, e.g.:
%!PS-Adobe-3.0 EPSF-3.0
%%Invocation: path/gswin32c -dDisplayFormat=198788 -dDisplayResolution=96 -sDEVICE=eps2write -sOutputFile=? ?
%%BoundingBox: 101 132 191 256
%%HiResBoundingBox: 101.80 132.80 190.30 255.20
%%Creator: GPL Ghostscript GIT PRERELEASE 951 (eps2write)
....
....
%%EndProlog
%%Page: 1 1
%%BeginPageSetup
4 0 obj
<</Type/Page/MediaBox [0 0 300 300]
/Parent 3 0 R
/Resources<</ProcSet[/PDF]
>>
/Contents 5 0 R
>>
endobj
%%EndPageSetup
You can see here that the original media size was 300x300, even though the BoundingBox correctly reflects the marks made on the page. Note! This is characteristic of EPS files produced by the current version of eps2write, it won't work for EPS files from other sources and may not work with eps2write in the future.
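Parsing that section could look roughly like the following. It is a sketch only: `media_box_size` is a hypothetical helper, it relies on the eps2write layout shown above (a /MediaBox entry on a single line between %%BeginPageSetup and %%EndPageSetup), and it shares the caveat that this layout may change.

```shell
#!/bin/sh
# Print the width and height of the first /MediaBox found between
# %%BeginPageSetup and %%EndPageSetup. media_box_size is a hypothetical
# helper and relies on the eps2write layout shown above.
media_box_size() {
    awk '
        /^%%BeginPageSetup/ { in_setup = 1; next }
        /^%%EndPageSetup/   { in_setup = 0 }
        in_setup && match($0, /\/MediaBox *\[[0-9. -]*\]/) {
            box = substr($0, RSTART, RLENGTH)
            gsub(/\/MediaBox *\[|\]/, "", box)   # keep only the four numbers
            split(box, v, " ")
            print v[3] - v[1], v[4] - v[2]       # width, height in points
            exit
        }
    ' "$1"
}

# Usage: media_box_size output.eps
```

For the file above this would report a 300 by 300 point page, which can then be passed to whatever program places the EPS.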
Other than that you're stuck with finding the media size from the input and passing it separately to the program doing the insertion, presumably by putting the data in some other text file to accompany the EPS. Or, of course, manually or programmatically editing the urx,ury co-ordinates of the BoundingBox.
Ghostscript isn't going to do this for you I'm afraid.

Ghostscript converting PostScript to PDF seems to ignore the page size / BoundingBox

I have created a PostScript file from a TIFF image using ImageMagick.
The command-line I am using is:
convert input.tif[0] -density 600 -alpha Off -size 5809x9408 -depth 16 intermediate.ps
This takes my input tiff image (just the main image, and not the thumbnail via using [0]) and creates a .ps file from the bitmap.
When I look at the header of my PostScript file, I can see that it has the correct page size:
%!PS-Adobe-3.0
%%Creator: (ImageMagick)
%%Title: (intermediate.ps)
%%CreationDate: (2017-05-22T08:43:44+10:00)
%%BoundingBox: -0 -0 697 1129
%%HiResBoundingBox: 0 0 697.08 1129
%%DocumentData: Clean7Bit
%%LanguageLevel: 1
%%Orientation: Portrait
%%PageOrder: Ascend
%%Pages: 1
%%EndComments
Yet, when I use Ghostscript to convert this to a PDF, unless I go to a lot of trouble to specify otherwise, gs crops it and puts it on a US Letter sized page.
gs -dPDFA=1 -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -sDefaultRGBProfile=AdobeRGB1998.icc -dOverrideICC -sOutputFile=output.pdf -r600 -P PDFA_def.ps intermediate.ps
When I open the resulting PDF, the crop box is 612 x 792 pt, which is US Letter. It should be 697 x 1129 pt, the size of the Bounding Box in the PostScript file.
I have created a custom .joboptions file using Acrobat Distiller that sets image compression and the like, and in this file if I specify the page size at the end, then the resulting PDF comes out the correct size:
<<
/HWResolution [600 600]
/PageSize [697.080 1128.960]
>> setpagedevice
Now this isn't a huge issue for a one-off conversion, but I have to convert a large number of images and I don't want to set the page size manually for every single file.
The lines you quote above are comments and, from the comments present, suggest that this is an EPS file, not a PostScript program.
The main difference is that EPS is 'encapsulated', which means it's intended to be placed verbatim inside a PostScript program. The enclosing program contains the intelligence regarding the media size, and arranges to set the context such that the EPS is scaled, rotated, and translated so that it fits appropriately on the media.
In order to do this successfully, the EPS file must follow certain rules; in particular it must not set any media size itself (because that would mess with the enclosing program).
So it seems likely to me that what you have is an EPS file which does not request any media size at all. So it's hardly surprising that you have to tell Ghostscript what you want to do with it.
Now in order for the enclosing program to place the EPS it needs to know its characteristics, the size and shape of the content. That's what the comments are for. Ordinarily an EPS file is read by an application (eg MS Word, LibreOffice etc) which parses out those comments and uses the information when generating the final PostScript program. The reason an EPS uses comments to store this information is precisely so that it has no effect on the actual content of the EPS and so the entire EPS can be included without further processing by the application.
The short answer is that if you read the Ghostscript documentation here you will find descriptions of the EPSCrop and EPSFitPage command line switches which will do all the work for you.
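For the batch case, the switch can be added to one command per file. The sketch below only prints the commands so they can be reviewed first. `gs_epscrop_cmd` is a hypothetical helper, the PDF/A and ICC options from the question are left out for brevity, file names are assumed to contain no whitespace, and it has not been tested against the file in the question.

```shell
#!/bin/sh
# Build (but do not run) a pdfwrite command for one input file, using the
# -dEPSCrop switch so the page size is taken from the %%BoundingBox comment.
# gs_epscrop_cmd is a hypothetical helper; PDF/A and ICC options are omitted.
gs_epscrop_cmd() {
    in=$1
    out=${in%.*}.pdf   # same base name, .pdf extension
    printf 'gs -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -dEPSCrop -sOutputFile=%s %s\n' "$out" "$in"
}

# Usage: print one command per PostScript file in the current directory.
#   for f in *.ps; do gs_epscrop_cmd "$f"; done
```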

Add white border to PDF (change paper format)

I have to change a given PDF from A4 (210mm*297mm) to 216mm*303mm.
The additional 6 mm for each dimension should be set as white border of 3mm on each side. The original content of the PDF pages should be centered on the output pages.
I tried with convert:
convert in.pdf -bordercolor "#FFFFFF" -border 9 out.pdf
This gives me exactly the needed result, but I lose a great deal of sharpness in the original images in the PDF. Everything is kind of blurry.
I also checked with
convert in.pdf out.pdf
which makes no changes at all but also degrades the images.
So I tried Ghostscript but did not get any result. The best approach I have found so far, from a German site, is:
gs -sOutputFile=out.pdf -sDEVICE=pdfwrite -g6120x8590 \
-c "<</Install{1 1 scale 8.5 8.5}>> setpagedevice" \
-dNOPAUSE -dBATCH in.pdf
but I get Error: /typecheck in --.postinstall--.
By default, ImageMagick converts input PDF files into images at 72 dpi. This is an awfully low resolution, as you experienced firsthand. The output of ImageMagick is always a raster image, so if your input PDF was text, it will no longer be.
If you don't mind the output PDFs getting bigger, you can simply increase the resolution at which ImageMagick rasterizes the original PDF with the -density option, like this:
convert -density 600 in.pdf -bordercolor "#FFFFFF" -border 9 out.pdf
I used 600 because it is the sweet spot that works well for OCR. I recommend trying 300, 450, 600, 900 and 1200 and picking the best one that doesn't produce an unwieldy file.
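One thing to watch when raising the density: -border takes its size in pixels, so a value that gives roughly 3 mm at 72 dpi gives a much thinner border at 600 dpi. A minimal sketch of the conversion, assuming awk is available; `mm_to_px` is a hypothetical helper name.

```shell
#!/bin/sh
# Convert a length in millimetres to whole pixels at a given density.
# mm_to_px is a hypothetical helper; 25.4 mm = 1 in.
mm_to_px() {
    awk -v mm="$1" -v dpi="$2" 'BEGIN { printf "%.0f\n", mm / 25.4 * dpi }'
}

mm_to_px 3 72    # prints 9, the border used at the default density
mm_to_px 3 600   # prints 71

# Usage (needs ImageMagick):
#   convert -density 600 in.pdf -bordercolor "#FFFFFF" -border "$(mm_to_px 3 600)" out.pdf
```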
Shifting the content on the media is not especially hard, but it does mean altering the content stream of the PDF file, which most PDF manipulation packages avoid, with good reason.
The code you quote above really won't work: it leaves garbage on the operand stack, and the PLRM explicitly states that the Install procedure is followed by an implicit initgraphics, which will reset all the standard parameters anyway.
You could try instead setting a /BeginPage procedure to translate the origin, which will probably work:
<</BeginPage {8.5 8.5 translate} >> setpagedevice
Note that you aren't simply manipulating the original PDF file; Ghostscript takes the original PDF file, interprets it into graphics primitives, then reassembles those primitives into a new PDF file, this has implications... For example, if an image is DCT encoded (a JPEG) in the original, it will be decompressed before being passed into the output file. You probably don't want to reapply DCT encoding as this will introduce visible artefacts.
A simpler alternative, but involving multiple processing steps and therefore more potential for problems, is to first convert the PDF to PostScript with the ps2write device, specifying your media size, and also the -dCenterPages switch, then use the pdfwrite device to turn the resulting PostScript into a new PDF file.
Instead of
-g6120x8590 \
-c "<</Install{1 1 scale 8.5 8.5}>> setpagedevice"
(which is wrong), you should use:
-g6120x8590 \
-c "<</Install{8.5 8.5 translate}>> setpagedevice"
or
-g6120x8590 \
-c "<</Install{3 25.4 div 72 mul dup translate}>> setpagedevice"
(which lets Ghostscript calculate the "3mm == 8.5pt" itself...)
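The same conversion can be checked outside Ghostscript. A small sketch, assuming awk is available; `mm_to_pt` is a hypothetical helper name. It also shows where the -g values come from, assuming pdfwrite's default resolution of 720 dpi (10 device pixels per point).

```shell
#!/bin/sh
# Convert millimetres to PostScript points (1 in = 25.4 mm = 72 pt).
# mm_to_pt is a hypothetical helper.
mm_to_pt() {
    awk -v mm="$1" 'BEGIN { printf "%.2f\n", mm / 25.4 * 72 }'
}

mm_to_pt 3     # prints 8.50, the translate offset
mm_to_pt 216   # prints 612.28, page width  (-g uses 6120, i.e. 612 pt x 10)
mm_to_pt 303   # prints 858.90, page height (-g uses 8590, i.e. 859 pt x 10)
```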

Replacing vector images in a PDF with raster images

Is there any easy (scriptable) way to convert a PDF with vector images into a PDF with raster images? In other words, I want to generate a PDF with the exact same (un-rasterized) text but with each vector image replaced with a rasterized version.
I occasionally read PDFs of technical articles on my Kindle, and have found that reading a PDF directly is frustrating. Thankfully, Amazon's automatic conversion of PDFs to the Kindle format does a good job of reflowing the text portions of most of PDFs I have tried. However, while raster images seem to make it through the conversion process fine, vector images get horribly mangled. It would be great if I could easily convert a PDF so that all of its vector images were rasterized.
I am interested in any possible solutions, but a Linux- or Windows-based one would be preferable.
I had a similar issue and solved it using ImageMagick's convert tool (http://www.imagemagick.org/script/index.php). It comes with Linux and runs fine on Windows/Cygwin or OS X.
convert -density 300 largeVectorFileFromR.pdf out.pdf
With -density 300 you control resolution (as DPI).
Downside: Text is rasterized as well, I understand that Michael does not want this.
After some days searching for a solution, and based on "Remove all text from PDF file" and "How to add a picture onto an existing pdf file?", I found an (ugly) scriptable solution:
gs -o /tmp/onlytxt.pdf -sDEVICE=pdfwrite -dFILTERVECTOR -dFILTERIMAGE "$INPUT_FILE" && \
gs -o /tmp/graphics.pdf -sDEVICE=pdfwrite -dFILTERTEXT "$INPUT_FILE" && \
convert -density "$DPI" -quality 100 /tmp/graphics.pdf /tmp/graphics.png && \
convert -density "$DPI" -quality 100 /tmp/graphics.png /tmp/graphics.pdf && \
pdftk /tmp/graphics.pdf stamp /tmp/onlytxt.pdf output "$OUTPUT_FILE" && \
rm /tmp/onlytxt.pdf /tmp/graphics.pdf /tmp/graphics.png
where we have three variables: INPUT_FILE, OUTPUT_FILE, and DPI. We split the textual and graphical contents with Ghostscript, convert the graphical content to a raster image (PNG), and join the two using pdftk.
I've been using this successfully to convert huge vector images for use in scientific papers.
Pitstop Pro v2 update 3 from Enfocus can do exactly that. It has an action called "Rasterize page content, keeping text" which works pretty well. It is a plugin to Adobe Acrobat so it requires a little more but is also available as a server solution.
It's a little complicated, but you asked for any possible solution. Furthermore this solution is not automatable.
1) Open the pdf with the vector images in Inkscape. Then select the whole image with the select tool (F1)
2) If the vector image consists of more than one SVG graphic, press Ctrl + G (Object --> Group)
3) cut the grouped SVG image with Ctrl + x
4) open a new Inkscape window with Ctrl + n and paste the image with Ctrl + v
5) choose File --> Export Bitmap (Shift + Ctrl + e); you may want to increase the dpi
6) go back to the first Inkscape window, choose File --> Import (Ctrl + i), and select the previously exported bitmap
7) place the bitmap to the location where the svg image was
Save the pdf and the vector image is replaced by a bitmap image.
Here's one way to solve your problem:
Step 1: Use an online PDF-to-HTML converter, like the one here:
http://www.idrsolutions.com/online-pdf-to-html5-converter/
This tool converts the PDF into a set of images and a text overlay. The vector images should be converted to raster at this point.
Step 2: Convert the HTML+images back into PDF:
http://pdfcrowd.com/#convert_by_upload+with_options
The resulting PDF will have all the vector images rasterized, and all text will remain text, so you can select, copy, etc.
Convert the pdf to djvu with https://jwilk.net/software/pdf2djvu converter. Uncheck "antialias fonts,vectors..". It will reduce file size significantly and improve document load times.
I used the following:
gswin32c -o "%2" -dFirstPage=1 -dLastPage=1 -sDEVICE=pngalpha -r72x72 -dUseCropBox -dFitPage "%1" -dBATCH -dNOPAUSE
where %1 is the input file and %2 is the output. This can be used with LaTeX, the generated PNG has the same ratio and page size as the original PDF so the relative position of the image will not change.
Note that in Linux, you may need to use gs rather than gswin32c.
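A script that should run on both platforms can pick whichever binary is present. A minimal sketch; `first_available` is a hypothetical helper, and the candidate names are just the two mentioned above.

```shell
#!/bin/sh
# Print the first command name from the argument list that exists on PATH.
# first_available is a hypothetical helper.
first_available() {
    for candidate in "$@"; do
        if command -v "$candidate" >/dev/null 2>&1; then
            echo "$candidate"
            return 0
        fi
    done
    return 1
}

# Usage:
#   GS=$(first_available gs gswin32c) || { echo "Ghostscript not found" >&2; exit 1; }
#   "$GS" -o out.png -sDEVICE=pngalpha -r72x72 -dUseCropBox -dFitPage in.pdf
```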
You can also set the page range and then print the pages back to PDF. The downside is that the text gets rasterized as well.
Inkscape is the best solution. I quickly made this rather unoptimized batch file that does exactly that; you can play with it and change the options. ImageMagick convert, gs, or pdftoimages don't work as well as Inkscape: they either don't export the layers, or export them with bad quality:
#!/bin/bash
#set -xev
ORIGINAL_FOLDER=`pwd`
JPEGS=`mktemp -d`
unzip "$1" -d "$JPEGS"
cd "$JPEGS"
# expand the pdf into single pdf pages
pdftk combined_to_do.pdf burst output pg_%04d.pdf
#1) print the pdfs to pngs as they are seen, with alpha, layers, transparency etc.; this cannot be done by ImageMagick convert or pdftoimages
ls ./pg*.pdf | xargs -L1 -I {} inkscape {} -z --export-dpi=300 --export-area-drawing --export-png={}.png
#2) Second change to jpgs
rm *.pdf
ls ./p*.png | xargs -L1 -I {} convert {} -quality 100 -density 300 {}.jpg
#3) This to make a pdf file out of every jpg image without loss of either resolution or quality:
ls -1 ./*jpg | xargs -L1 -I {} img2pdf {} -o {}.pdf
#4) This to concatenate the pdfpages into one:
pdftk *.jpg.pdf cat output combined.pdf
#5) And last I add an OCRed text layer that doesn't change the quality of the scan in the pdfs so they can be searchable:
pypdfocr combined.pdf
cp "$JPEGS/combined_ocr.pdf" "$ORIGINAL_FOLDER/$1_ocr.pdf"
cp "$JPEGS/combined.pdf" "$ORIGINAL_FOLDER/$1.pdf"
Based on Civ Lins solution, I came up with this:
#!/usr/bin/env sh
gs -o /tmp/onlytxt.pdf -sDEVICE=pdfwrite -dFILTERVECTOR -dFILTERIMAGE "$1" && \
gs -o /tmp/graphics.pdf -sDEVICE=pdfimage24 -dFILTERTEXT -r600 -dDownScaleFactor=6 "$1" && \
pdftk /tmp/graphics.pdf multistamp /tmp/onlytxt.pdf output "$2" && \
rm /tmp/onlytxt.pdf /tmp/graphics.pdf
(In contrast to the previous solution, it handles multipage PDFs and uses gs to directly render the rasterized image without the detour of convert.)