ImageMagick Split PDF Output File Name Always Starts at Zero

I run the following command to split a PDF in ImageMagick:
convert file.pdf[5-10] file.png
The resulting output files are always suffixed starting with zero. That is:
file-0.png, file-1.png, file-2.png...
Any ideas what I might be doing wrong? The documentation states that the files should be suffixed starting at 5, matching the page numbers of the pages extracted.

I ended up solving this by using the -scene # command line parameter.
This causes the output to begin at the desired index. For posterity:
convert file.pdf -scene 5 file-%d.png

You see the result you describe because ImageMagick's page indexing for multi-page image formats is zero-based: page 1 has index 0, page 2 has index 1, etc.
Also, ImageMagick cannot process PDF input files itself: it employs Ghostscript as its 'delegate' -- Ghostscript consumes the PDF first and emits a raster file for each PDF page. Only these raster files are then processed by ImageMagick.
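If you want to see how your ImageMagick build hands PDF input off to Ghostscript, you can list the configured delegates (the grep pattern below is just one way to filter the output; the exact entries vary by version and build):
convert -list delegate | grep -Ei 'pdf|ps:'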
Depending on your exact ImageMagick version and IM setup, this may result in an indirect PNG output generation, and the conversion chain may look like this:
PDF --> PPM (portable pixmap) --> PNG
     ^                         ^
     |                         |
     |                         +--- (handled by ImageMagick)
     +--- (handled by Ghostscript)
If you are unlucky, the result will be slow and the quality may not be as good as it could be.
To verify what exactly happens in a convert a.pdf a.png command, you can add the -verbose parameter. That will show you the Ghostscript command being employed by IM to process the PDF input:
convert -verbose a.pdf a.png
/var/tmp/magick-15951W3TZ3WRpwIUk1 PNG 612x792 612x792+0+0 8-bit sRGB 3.73KB 0.000u 0:00.000
a.pdf PDF 612x792 612x792+0+0 16-bit sRGB 3.73KB 0.000u 0:00.000
a.pdf=>a.png PDF 612x792 612x792+0+0 8-bit sRGB 2c 2.95KB 0.000u 0:00.000
[ghostscript library] -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT \
-dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 "-sDEVICE=pngalpha" \
-dTextAlphaBits=4 -dGraphicsAlphaBits=4 "-r72x72" \
"-sOutputFile=/var/tmp/magick-15951W3TZ3WRpwIUk%d" \
"-f/var/tmp/magick-15951nJD8-fF8kA7j" \
"-f/var/tmp/magick-15951JTZDMwtEswHn"
(As you can see, my IM installation is set up to do a PDF->PNG conversion without the detour via PPM... Your mileage may vary.)
You may get better results when using Ghostscript directly, instead of running an IM convert command. (If ImageMagick works at all with PDF->PNG conversion, you have a working Ghostscript installation for sure.) So you can try this:
gs \
-o file-%03d.png \
-sDEVICE=pngalpha \
file.pdf
The %03d part of the output file name will cause Ghostscript to number the output files file-001.png, file-002.png, file-003.png, ...
However, if you are unlucky and have an older version of Ghostscript installed, the numbering may also start at file-000.png...
In any case, since your sample command seems to suggest that you want to convert only a page range (5--10) from the PDF file (not all pages), here is the command to use:
gs \
-o file-%03d.png \
-sDEVICE=pngalpha \
-dFirstPage=5 \
-dLastPage=10 \
file.pdf
But the bad news here is: Ghostscript will STILL start with naming the output files as file-001.png (page 5) ... file-006.png (page 10).
To work around that, you'll have to generate the PNGs for the first 4 pages too, and delete them again afterwards:
gs \
-o file-%03d.png \
-sDEVICE=pngalpha \
-dFirstPage=1 \
-dLastPage=10 \
file.pdf
rm -rf file-00{1,2,3,4}.png
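Alternatively, keep -dFirstPage=5 / -dLastPage=10 from the earlier command and simply renumber the six resulting files afterwards. A minimal Bash sketch (assuming exactly the outputs file-001.png ... file-006.png, which get renamed to file-005.png ... file-010.png; it works backwards so no file is overwritten before it has been moved):
for i in 6 5 4 3 2 1; do
  mv "$(printf 'file-%03d.png' "$i")" "$(printf 'file-%03d.png' "$((i + 4))")"
done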

Related

Batch PDF Watermark [PDF -> JPG -> PDF]

I am working with thousands of PDF files for a sheet music publisher.
All of these PDF files need a preview PDF. A watermark on a PDF can easily be removed, so I am asking for a robust way to watermark our PDFs in a batch operation.
PDF->Apply Watermark->JPG->Back to PDF
How can I do this? Is there a good tool for these operations?
The free route
ImageMagick can do the complete process for you, especially with the composite command's -watermark operator.
#!/bin/sh
# ImageMagick picks the conversion formats based on file name suffixes (and, where needed, the actual file content)
InputPDF=$1
WatermarkImg=$2
OutputPDF=$3
pdfToImage=pdfToImage.png
imageWithWatermark=imageWithWatermark.png
# Convert PDF to image
convert \
-density 300 \
-trim \
"$InputPDF" \
-quality 100 \
-flatten \
-sharpen 0x1.0 \
$pdfToImage
# Add watermark to intermediate image
composite \
-dissolve 15 \
-tile \
"$WatermarkImg" \
$pdfToImage \
$imageWithWatermark
# Convert intermediate image back to PDF
convert \
$imageWithWatermark \
"$OutputPDF"
# Clean up
rm $pdfToImage $imageWithWatermark
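If you save the script above as, say, watermark.sh (the name is just an example) and make it executable, a typical invocation would be:
chmod +x watermark.sh
./watermark.sh input.pdf watermark.png output-preview.pdf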
I find the PDF to image conversion acceptable in terms of quality, though you can see some differences when looking at the before and after side-by-side, especially in how bolded glyphs seem less bold:
You can check this good post and its answers for a number of options for converting a PDF to an image, Convert PDF to image with high resolution.
I checked out PDFtoPPM, which was also highly mentioned in that thread, and I still see some degrading of the bolded fonts when converted:
Some more tiling Magick
I used this copyright symbol from Wikimedia Commons and this ImageMagick script:
#!/bin/sh
Infile="Copyright.png"
Outfile="Copyright_tiled.png"
h2=$(convert $Infile -format "%[fx:round(h/2)]" info:)
convert $Infile \
\( -clone 0 -roll +0+"$h2" \) \
+append \
-write mpr:sometile \
+delete \
-size 1224X1584 \
tile:mpr:sometile \
$Outfile
to create this staggered tiling (1224X1584 is the page size (8.5in x 11in) multiplied by 72 px/in, times 2 for a good density of tiles):
And here it is unwatermarked again
@ZachYoung I used some different ImageMagick commands, also scriptable; the point is:
Although "What's done cannot be undone" (Macbeth, Act 5, Scene 1) is very true, especially within a PDF or image, we also know and expect that it applies to any PDF (de)construction. So, depending on the value of a forgery, it will always be worth engineering a partially reversed copy, fit for scrutiny or use; like the watermarked copy, it will still not be the original, but it may look almost as good.
The idiom implies: don't bother yourself about it, it's best not done in the first place.
The nearest to best is to use a watermark that exactly matches the text outlines, like this:

Ghostscript renders ugly text

I'm trying to add the capability to render LaTeX equations to a project I'm working on. To do so, I use XeLaTeX to create a PDF file, which I then render to a (transparent) 96dpi-PNG using Ghostscript.
I'd like to have the rendered LaTeX blend in with the rest of the text (which is rendered using standard .NET GDI+ methods, but that's off-topic), but I can't get a reliably "good" text rendering: the output always looks somehow blurry or otherwise "bad".
Example:
From left to right, the same (small) PDF rendered at 96dpi with Ghostscript, Photoshop, and TexWorks (which I understand uses Ghostscript internally).
The command I use to run Ghostscript is the following:
"C:/Program Files (x86)/gs/gs9.09/bin/gswin32c.exe" \
-q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT \
-dMaxBitmap=500000000 -dAlignToPixels=1 -dGridFitTT=2 \
"-sDEVICE=pngalpha" -dTextAlphaBits=4 \
-dGraphicsAlphaBits=4 "-r96" -dFirstPage=1 -dLastPage=1 \
-sOutputFile="output.png" "input.pdf"
(which I actually pretty much copied from the command ImageMagick calls when converting a PDF file, but that's another story). I tried changing each of the relevant options (-dAlignToPixels=0, -dGridFitTT=0/1/2, -dTextAlphaBits=2/4, or leaving that parameter out altogether) and I even tried rendering the PDF at 4 times the resolution and then downscaling it, without any noticeable improvement.
Yet, I'm sure there must be some way of decently rendering the PDF with Ghostscript (since TexWorks does), although I'm unable to find it.
Any hint? The PDF is this one.
You could try to render your PDF at a higher resolution. 96dpi just isn't enough for text with 11 pt size.
If you use 192dpi and then scale the display of the resulting image to 50% (wherever you use the PNG), these parts should still appear in the same size as before, but with a higher resolution. What used to be a 4x7 pixel 's' should now be an 8x14 pixel 's'...
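In terms of the command from the question, that just means raising the -r value; a trimmed-down variant at 192 dpi could look like this (all switches are ones already used in the question):
"C:/Program Files (x86)/gs/gs9.09/bin/gswin32c.exe" \
  -q -dSAFER -dBATCH -dNOPAUSE \
  -sDEVICE=pngalpha -dTextAlphaBits=4 -dGraphicsAlphaBits=4 \
  -r192 -dFirstPage=1 -dLastPage=1 \
  -sOutputFile="output.png" "input.pdf"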
Update
OK, since my explanation seems not to have been comprehensible enough for the OP, here's the deal.
Generate a PDF file containing the word "Test", using Ghostscript. In my case, it is Ghostscript v9.10:
gs \
-o test.pdf \
-sDEVICE=pdfwrite \
-g230x100 \
-c "/Helvetica findfont \
11 scalefont \
setfont \
1 1 moveto \
(Test) show \
showpage"
From this PDF, generate 6 different images depicting the word "Test", using 6 different resolutions. The gs is still Ghostscript v9.10 (to be checked with gs -version):
for i in 1 2 3 4 5 6; do \
gs \
-o t$(( ${i} * 96 )).png \
-r$(( ${i} * 96 )) \
-sDEVICE=pngalpha \
-dAlignToPixels=1 \
-dGridFitTT=2 \
-dTextAlphaBits=4 \
-dGraphicsAlphaBits=4 \
test.pdf ; \
done
This will create the following PNGs, as confirmed by ImageMagick's identify command:
identify -format "%f : %Wx%H pixels -- %b filesize\n" t[1-9]*.png
t96.png : 31x13 pixels -- 475B filesize
t192.png : 61x27 pixels -- 774B filesize
t288.png : 92x40 pixels -- 1.1KB filesize
t384.png : 123x53 pixels -- 1.43KB filesize
t480.png : 153x67 pixels -- 1.76KB filesize
t576.png : 184x80 pixels -- 2.01KB filesize
Create a sample LaTeX document and embed the different images side by side and/or line by line. Here is my sample code:
\documentclass{article}
\usepackage{graphicx}  % needed for \includegraphics
\begin{document}
Test
\includegraphics[height=7.5pt]{t96.png}
\includegraphics[height=7.5pt]{t96.png}
\includegraphics[height=7.5pt]{t192.png}
\includegraphics[height=7.5pt]{t288.png}
\includegraphics[height=7.5pt]{t384.png}
\includegraphics[height=7.5pt]{t480.png}
\includegraphics[height=7.5pt]{t576.png}
Test\\
{}
Test <== real text \\
\includegraphics[height=7.5pt]{t96.png} <-- 96 dpi figure \\
\includegraphics[height=7.5pt]{t192.png} <-- 192 dpi figure \\
\includegraphics[height=7.5pt]{t288.png} <-- 288 dpi figure \\
\includegraphics[height=7.5pt]{t384.png} <-- 384 dpi figure \\
\includegraphics[height=7.5pt]{t480.png} <-- 480 dpi figure \\
\includegraphics[height=7.5pt]{t576.png} <-- 576 dpi figure \\
Test <== real text
\end{document}
Here is a screenshot (at 400% zoom) from the PDF created via LuaLaTeX from the above LaTeX code:
The first line, with its 9 "Test" words, has actual text only in the first and the last word. The 7 words in between are images at 96, 96, 192, 288, 384, 480 and 576 dpi.
I hope you can see now clearly how scaling up your image generation to a higher resolution will result in better quality for your final PDF if you include the higher resolution images into your LaTeX code...
You are rendering text at 11 points at 96 dpi; that works out to about 14 pixels in height, which, frankly, is not a lot (and in my output the 's' is 7 pixels high by 4 wide). Looking at your output, all 3 look 'blurry', and the Photoshop output looks overly bold in the capital T.
If you don't want it blurred, then don't set TextAlphaBits, or don't set it to such a high value.
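For example, you could re-run your command with the text anti-aliasing reduced, e.g. -dTextAlphaBits=2 (valid values are 1, 2 and 4), or leave the parameter out entirely; a trimmed-down variant of your own command:
gswin32c.exe -q -dSAFER -dBATCH -dNOPAUSE \
  -sDEVICE=pngalpha -dTextAlphaBits=2 -dGraphicsAlphaBits=4 \
  -r96 -sOutputFile="output.png" "input.pdf"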
I'd also suggest using the current release (9.15).

Remove / Delete all images from a PDF using Ghostscript or ImageMagick

I want to delete / remove all the images in a PDF, leaving only the text / fonts in the PDF, with whatever command-line tool possible.
I tried using -dGraphicsAlphaBits=1 in a Ghostscript command, but the images are still present, just looking like big pixels.
You can use the draft option of cpdf:
cpdf -draft in.pdf -o out.pdf
This should work in most situations, but file a bug report if it doesn't do the right thing for you.
Disclosure: I am the author of cpdf.
Time has passed, and development of Ghostscript has progressed...
The latest releases support the following new command line parameters:
-dFILTERIMAGE: produces an output where all raster drawings are removed.
-dFILTERTEXT: produces an output where all text elements are removed.
-dFILTERVECTOR: produces an output where all vector drawings are removed.
Any two of these options can be combined.
Example command:
gs -o noimage.pdf -sDEVICE=pdfwrite -dFILTERIMAGE input.pdf
More details (including some illustrative screenshots) can be found in my answer to "How can I remove all images from a PDF?".
No, AFAIK, it's not possible to remove all images in a PDF with a commandline tool.
What's the purpose of your request anyway? Save on filesize? Remove information contained in images? Or ...?
Workaround
Whatever you aim at, here is a command that will downsample all images to a resolution of 2 ppi (Update: 1 ppi doesn't work), which achieves two goals at once:
reduce the file size
make all images basically incomprehensible
Here's how to do it selectively, for only the images on page 33 of original.pdf:
gs \
-o images-uncomprehendable.pdf \
-sDEVICE=pdfwrite \
-dDownsampleColorImages=true \
-dDownsampleGrayImages=true \
-dDownsampleMonoImages=true \
-dColorImageResolution=2 \
-dGrayImageResolution=2 \
-dMonoImageResolution=2 \
-dFirstPage=33 \
-dLastPage=33 \
original.pdf
If you want to do it for all images on all pages, just skip the -dFirstPage and -dLastPage parameters.
If you want to remove all color information from images, convert them to Grayscale in the same command (search other answers on Stackoverflow where details for this are discussed).
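One possible way to combine both in a single command (a sketch only; the grayscale switches are the same ones used elsewhere in these answers and may need adjusting for your Ghostscript version, and the output name is just an example):
gs \
  -o images-uncomprehendable-gray.pdf \
  -sDEVICE=pdfwrite \
  -dColorConversionStrategy=/Gray \
  -dProcessColorModel=/DeviceGray \
  -dDownsampleColorImages=true \
  -dDownsampleGrayImages=true \
  -dDownsampleMonoImages=true \
  -dColorImageResolution=2 \
  -dGrayImageResolution=2 \
  -dMonoImageResolution=2 \
  original.pdf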
Update: Originally, I had proposed to use a resolution of 1 PPI. It seems this doesn't work with Ghostscript. I now tested with 2 PPI. This works.
Update 2: See also the following (new) question with the answer:
How can I remove all images from a PDF?
It provides some sample PostScript code which completely removes all (raster) images from the PDF, leaving the rest of the page layout unchanged.
It also reflects the expanded new capabilities of Ghostscript which can now selectively remove either all text, or all raster images, or all vector objects from a PDF, or any combination of these 3 types.
To separate images and text onto different layers, there is unfortunately no Free/Open Source Software utility available, nor a free-as-in-beer one...
This task can only be achieved with various payware software solutions. Since you didn't exclude this in your question, but asked for 'whatever commandline tool possible', I'll tell you my favorite one:
callas pdfToolbox
A version for CLI usage (which includes a powerful SDK enabling lots of low-level PDF manipulations) is available, and this is supported on all major OS platforms, including Linux.
callas offers you a fully featured gratis test license which is enabled for (I believe) 14 days.
gs -o noImages.pdf -sDEVICE=pdfwrite -dFILTERIMAGE input.pdf
gs -o noText.pdf -sDEVICE=pdfwrite -dFILTERTEXT input.pdf
gs -o noVectors.pdf -sDEVICE=pdfwrite -dFILTERVECTOR input.pdf
gs -o onlyImages.pdf -sDEVICE=pdfwrite -dFILTERVECTOR -dFILTERTEXT input.pdf
gs -o onlyText.pdf -sDEVICE=pdfwrite -dFILTERVECTOR -dFILTERIMAGE input.pdf
gs -o onlyVectors.pdf -sDEVICE=pdfwrite -dFILTERIMAGE -dFILTERTEXT input.pdf

Reducing PDF file size using Ghostscript on Linux didn't work

I have about 50-60 pdf files (images) that are 1.5MB large each. Now I don't want to have such large pdf files in my thesis as that would make downloading, reading and printing a pain in the rear. So I tried using ghostscript to do the following:
gs \
-dNOPAUSE -dBATCH \
-sDEVICE=pdfwrite \
-dCompatibilityLevel=1.4 \
-dPDFSETTINGS="/screen" \
-sOutputFile=output.pdf \
L_2lambda_max_1wl_E0_1_zg.pdf
However, my 1.4MB PDF is now 1.5MB large.
What did I do wrong? Is there some way I can check the resolution of the PDF file? I just need 300dpi images. Would anyone suggest using convert to change the resolution, or is there some way I could reduce the image resolution with gs, since the image is very grainy when I use convert?
How I use convert:
convert \
-units PixelsPerInch \
~/Desktop/L_2lambda_max_1wl_E0_1_zg.pdf \
-density 600 \
~/Desktop/output.pdf
Example File
http://dl.dropbox.com/u/13223318/L_2lambda_max_1wl_E0_1_zg.pdf
If you run Ghostscript with -dPDFSETTINGS=/screen, this is just a sort of shortcut. In fact you'll (implicitly) get a whole bunch of settings applied, which you can query with the following command:
gs \
-dNODISPLAY \
-c ".distillersettings {exch ==only ( ) print ===} forall quit" \
| grep '/screen'
On my Ghostscript (v9.06prerelease) I get the following output (slightly edited to increase readability):
/screen
<< /DoThumbnails false
/MonoImageResolution 300
/ColorImageDownsampleType /Average
/PreserveEPSInfo false
/ColorConversionStrategy /sRGB
/GrayImageDownsampleType /Average
/EmbedAllFonts true
/CannotEmbedFontPolicy /Warning
/PreserveOPIComments false
/GrayImageResolution 72
/GrayACSImageDict <<
/ColorTransform 1
/QFactor 0.76
/Blend 1
/HSamples [2 1 1 2]
/VSamples [2 1 1 2]
>>
/ColorImageResolution 72
/PreserveOverprintSettings false
/CreateJobTicket false
/AutoRotatePages /PageByPage
/MonoImageDownsampleType /Average
/NeverEmbed [/Courier
/Courier-Bold
/Courier-Oblique
/Courier-BoldOblique
/Helvetica
/Helvetica-Bold
/Helvetica-Oblique
/Helvetica-BoldOblique
/Times-Roman
/Times-Bold
/Times-Italic
/Times-BoldItalic
/Symbol
/ZapfDingbats]
/ColorACSImageDict <<
/ColorTransform 1
/QFactor 0.76
/Blend 1
/HSamples [2 1 1 2]
/VSamples [2 1 1 2] >>
/CompatibilityLevel 1.3
/UCRandBGInfo /Remove
>>
I'm wondering if your PDFs are image-heavy, and if this sort of conversion does unwelcome things (e.g. re-sampling images with the 'wrong' parameters) which increase the file size...
If this is the case (image-heavy PDF), say so, and I'll update this answer with a few suggestions...
Update
I had a look at the sample file provided by DNA. Interesting...
No, it does not contain any image.
Instead, it contains one large stream (compressed using /FlateDecode) which consists of roughly 700,000+ (!!) operations, mostly single vector operations in PDF language, such as:
m (moveto),
l (lineto),
d (setdash),
w (setlinewidth),
S (stroke),
s (closepath and stroke),
W* (eoclip),
rg and RG (setrgbcolor)
and a few more.
That PDF code is very inefficiently written AFAICS (but does its job), because it concatenates many short strokes instead of drawing 'long' ones, nearly every stroke defines the color again (even if it is the same as before), and it has all the other overhead (start stroke, end stroke, ...).
Ghostscript's -dPDFSETTINGS=/screen does not have any effect here (there are no images to downsample, for example). The increased file size (+48 kByte to be precise) is probably due to Ghostscript re-organizing some of the internal stroking etc. commands into a different order when it interprets the file.
So there is not much you can do about the PDF file size ...
...unless you convert each of these PDF pages into a REAL image such as PNG:
gs \
-o out72.png \
-sDEVICE=pngalpha \
L_2lambda_max_1wl_E0_1_zg.pdf
(I used the pngalpha output device to get a transparent background.) The image dimensions of 'out72.png' are 259x213 px; the filesize is now 70 kByte. But I'm sure you'll not like the quality :-)
The output quality is 'bad' because Ghostscript uses a default resolution of 72 dpi.
Since you said you'd like to have 300dpi, the command becomes this:
gs \
-o out300.png \
-sDEVICE=pngalpha \
-r300 \
L_2lambda_max_1wl_E0_1_zg.pdf
The filesize now is 750 kByte, the image dimensions are 1080x889 Pixels.
Update 2
Since Curiosity is en vogue these days... :-) ...I tried to bring down the file size with the help of Adobe Acrobat X Pro on Mac.
You wanna know the results?
Performing a 'Save as... (PDF with reduced filesize)' -- which in the past has always yielded very good results for me! -- created a 1.8+ MByte file (+29%). I guess this definitely puts Ghostscript's performance (a file size increase of only +3%) into a realistic perspective!
DNA decided to go for grayscale PNGs. The way he's creating them is in two steps:
Step 1: Convert a color PDF page (such as this) to a grayscale PDF page, using Ghostscript's pdfwrite device and the settings
-dColorConversionStrategy=/Gray and
-dProcessColorModel=/DeviceGray.
Step 2: Convert the grayscale PDF page to a PNG, using Ghostscript's pngalpha device at a resolution of 300 dpi (-r300 on the GS commandline).
This reduces his initial file size of 1.4 MB to 0.7 MB.
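Spelled out as commands, DNA's two-step workflow would look roughly like this (file names are placeholders):
# Step 1: color PDF -> grayscale PDF
gs \
  -o gray.pdf \
  -sDEVICE=pdfwrite \
  -dColorConversionStrategy=/Gray \
  -dProcessColorModel=/DeviceGray \
  page.pdf
# Step 2: grayscale PDF -> 300 dpi PNG with alpha channel
gs \
  -o gray-300.png \
  -sDEVICE=pngalpha \
  -r300 \
  gray.pdf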
But this workflow has the following disadvantage:
It loses all color info, without saving much disk space compared to a color output written at the same resolution, directly from the PDF!
There are 2 alternatives to DNA's workflow:
A one-step conversion of (color) PDF -> (color) PNG, using Ghostscript's pngalpha device with the original PDF as input (same settings of 300 dpi resolution). This would have this advantage:
It would keep the color information in the PNG output, requiring only a little more space on disk!
A one-step conversion of (color) PDF -> grayscale PNG, using Ghostscript's pnggray device with the original PDF as input (same settings of 300 dpi resolution), with this mix of advantage/disadvantage :
It would lose the color information in the PNG output.
It would lose the transparent background that was preserved in DNA's workflow.
It would save lots of disk space, because the filesize would go down to about 20% of the output from DNA's workflow.
So that you can make up your mind and see the output sizes and quality side by side, here is a shell script that demonstrates the differences:
#!/bin/bash
#
# Copyright (c) 2012 <kurt.pfeifle#gmail.com>
# License: Creative Commons (CC BY-SA 3.0)
function echo_do() {
    echo
    echo "Command: ${*}"
    echo "--------"
    echo
    "${@}"
}
[ -d out ] || mkdir out
echo
echo " We assume all PDF pages are 1-page PDFs!"
echo " (otherwise we'd have to include something like '%03d'"
echo " into the output filenames in order to get paged output)"
echo
echo '
# Convert color PDF to grayscale PDF.
# If the PDF has a transparent background (most do),
# this will remain transparent in the output.
# ATTENTION: since we do not use a resolution,
# pdfwrite will use its default value of -r720.
# (However, this setting will only affect raster objects...)
'
for i in *.pdf
do
echo_do gs \
-o "out/${i}---pdfwrite-devicegray-gs.pdf" \
-sDEVICE=pdfwrite \
-dColorConversionStrategy=/Gray \
-dProcessColorModel=/DeviceGray \
-dCompatibilityLevel=1.4 \
"${i}"
done
echo '
# Convert (previously generated) grayscale PDF to PNG using Alpha channel
# (Alpha channel can make backgrounds transparent)
'
for i in out/*pdfwrite-devicegray*.pdf
do
echo_do gs \
-o "out/$(basename "${i}")---pngalpha-from-pdfwrite-devicegray-gs.png" \
-sDEVICE=pngalpha \
-r300 \
"${i}"
done
echo '
# Convert (color) PDF to grayscale PNG using Alpha channel
# (Alpha channel can make backgrounds transparent)
'
for i in *.pdf
do
# Following only required for 'pdfwrite' output device, not for 'pngalpha'!
# -dProcessColorModel=/DeviceGray
echo_do gs \
-o "out/${i}---pngalphagray_gs.png" \
-sDEVICE=pngalpha \
-dColorConversionStrategy=/Gray \
-r300 \
"${i}"
done
echo '
# Convert (color) PDF to (color) PNG using Alpha channel
# (Alpha channel can make backgrounds transparent)
'
for i in *.pdf
do
echo_do gs \
-o "out/${i}---pngalphacolor_gs.png" \
-sDEVICE=pngalpha \
-r300 \
"${i}"
done
echo '
# Convert (color) PDF to grayscale PNG
# (no Alpha channel here, therefore [mostly] white backgrounds)
'
for i in *.pdf
do
echo_do gs \
-o "out/${i}---pnggray_gs.png" \
-sDEVICE=pnggray \
-r300 \
"${i}"
done
echo " All output to be found in ./out/ ..."
echo
Run this script and compare the different outputs side by side.
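Assuming you saved it as, say, compare-pdf-to-png.sh (the name is just an example), run it from the directory that holds your single-page PDFs:
cd /path/to/your/pdf/pages
bash compare-pdf-to-png.sh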
Yes, the 'direct-grayscale-PNG-from-color-PDF-using-pnggray-device' one may look a bit worse (and it doesn't sport the transparent background) than the other one -- but it is also only 20% of its file size. On the other hand, if you want to buy a bit more quality by sacrificing a bit of disk space, you could use -r400 instead of -r300...

Replacing vector images in a PDF with raster images

Is there any easy (scriptable) way to convert a PDF with vector images into a PDF with raster images? In other words, I want to generate a PDF with the exact same (un-rasterized) text but with each vector image replaced with a rasterized version.
I occasionally read PDFs of technical articles on my Kindle, and have found that reading a PDF directly is frustrating. Thankfully, Amazon's automatic conversion of PDFs to the Kindle format does a good job of reflowing the text portions of most of PDFs I have tried. However, while raster images seem to make it through the conversion process fine, vector images get horribly mangled. It would be great if I could easily convert a PDF so that all of its vector images were rasterized.
I am interested in any possible solutions, but a Linux- or Windows-based one would be preferable.
I had a similar issue, and solved it using ImageMagick's convert tool (http://www.imagemagick.org/script/index.php). It comes with Linux and runs fine on Windows/Cygwin or OS X.
convert -density 300 largeVectorFileFromR.pdf out.pdf
With -density 300 you control the resolution (in DPI).
Downside: text is rasterized as well; I understand that Michael does not want this.
After some days searching for a solution, based on "Remove all text from PDF file" and "How to add a picture onto an existing pdf file?", I found an (ugly) scriptable solution:
gs -o /tmp/onlytxt.pdf -sDEVICE=pdfwrite -dFILTERVECTOR -dFILTERIMAGE $INPUT_FILE && \
gs -o /tmp/graphics.pdf -sDEVICE=pdfwrite -dFILTERTEXT $INPUT_FILE && \
convert -density $DPI -quality 100 /tmp/graphics.pdf /tmp/graphics.png && \
convert -density $DPI -quality 100 /tmp/graphics.png /tmp/graphics.pdf && \
pdftk /tmp/graphics.pdf stamp /tmp/onlytxt.pdf output $OUTPUT_FILE && \
rm /tmp/onlytxt.pdf /tmp/graphics.pdf /tmp/graphics.png
where we have three variables: INPUT_FILE, OUTPUT_FILE, and DPI. We split the textual and graphical contents via Ghostscript, convert the graphical part to a raster image (PNG) and join the two using pdftk.
I've been using this successfully to convert huge vector images for use in scientific papers.
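A possible way to call it (the script name rasterize-figures.sh is hypothetical; the three variables are the ones from the snippet above):
INPUT_FILE=paper.pdf \
OUTPUT_FILE=paper-rasterized.pdf \
DPI=300 \
  sh rasterize-figures.sh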
Pitstop Pro v2 update 3 from Enfocus can do exactly that. It has an action called "Rasterize page content, keeping text" which works pretty well. It is a plugin to Adobe Acrobat so it requires a little more but is also available as a server solution.
It's a little complicated, but you asked for any possible solution. Furthermore this solution is not automatable.
1) Open the pdf with the vector images in Inkscape. Then select the whole image with the select tool (F1)
2) If the vector image consists of more than one SVG graphic, press Ctrl + G (Object --> Group)
3) cut the grouped svg image Ctrl + x
4) open a new InkScape Window Ctrl + n and paste the image Ctrl + v
5) choose File --> export Bitmap (Shift + Ctrl + e), maybe you want to increase the dpi
6) go back to the first InkScape window, File --> import (Ctrl + i) and choose the previously exported bitmap
7) place the bitmap to the location where the svg image was
Save the pdf and the vector image is replaced by a bitmap image.
Here's one way to solve your problem:
Step 1: Use an online PDF-to-HTML converter, like the one here:
http://www.idrsolutions.com/online-pdf-to-html5-converter/
This tool converts the PDF into a set of images and a text overlay. The vector images should be converted to raster at this point.
Step 2: Convert the HTML+images back into PDF:
http://pdfcrowd.com/#convert_by_upload+with_options
The resulting PDF will have all the vector images rasterized, and all text will remain text, so you can select, copy, etc.
Convert the PDF to DjVu with the pdf2djvu converter (https://jwilk.net/software/pdf2djvu). Uncheck "antialias fonts, vectors...". It will reduce the file size significantly and improve document load times.
I used the following:
gswin32c -o "%2" -dFirstPage=1 -dLastPage=1 -sDEVICE=pngalpha -r72x72 -dUseCropBox -dFitPage "%1" -dBATCH -dNOPAUSE
where %1 is the input file and %2 is the output. This can be used with LaTeX; the generated PNG has the same ratio and page size as the original PDF, so the relative position of the image will not change.
Note that in Linux, you may need to use gs rather than gswin32c.
You can also set the page range and then print the pages back to PDF. The downside is that the text gets rasterized as well.
Inkscape is the best solution; I quickly made this rather unoptimized batch script that does exactly that, and you can play with it and change the options. ImageMagick convert, gs, and pdftoimages don't work as well as Inkscape: they either don't export the layers, or they export them with bad quality:
#!/bin/bash
#set -xev
ORIGINAL_FOLDER=`pwd`
JPEGS=`mktemp -d`
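# NOTE: "$1" is expected to be a zip archive containing combined_to_do.pdf (see the pdftk burst below)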
unzip "$1" -d "$JPEGS"
cd "$JPEGS"
# explode the PDF into single-page PDFs
pdftk combined_to_do.pdf burst output pg_%04d.pdf
#1) Print the PDFs to PNGs as they are seen, with alpha, layers, transparency etc.; this cannot be done by ImageMagick convert or pdftoimages
ls ./pg*.pdf | xargs -L1 -I {} inkscape {} -z --export-dpi=300 --export-area-drawing --export-png={}.png
#2) Second change to jpgs
rm *.pdf
ls ./p*.png | xargs -L1 -I {} convert {} -quality 100 -density 300 {}.jpg
#3) This to make a pdf file out of every jpg image without loss of either resolution or quality:
ls -1 ./*jpg | xargs -L1 -I {} img2pdf {} -o {}.pdf
#4) This to concatenate the pdfpages into one:
pdftk *.jpg.pdf cat output combined.pdf
#5) And last I add an OCRed text layer that doesn't change the quality of the scan in the pdfs so they can be searchable:
pypdfocr combined.pdf
cp "$JPEGS/combined_ocr.pdf" "$ORIGINAL_FOLDER/$1_ocr.pdf"
cp "$JPEGS/combined.pdf" "$ORIGINAL_FOLDER/$1.pdf"
Based on Civ Lin's solution, I came up with this:
#!/usr/bin/env sh
gs -o /tmp/onlytxt.pdf -sDEVICE=pdfwrite -dFILTERVECTOR -dFILTERIMAGE $1 && \
gs -o /tmp/graphics.pdf -sDEVICE=pdfimage24 -dFILTERTEXT -r600 -dDownScaleFactor=6 $1 && \
pdftk /tmp/graphics.pdf multistamp /tmp/onlytxt.pdf output $2 && \
rm /tmp/onlytxt.pdf /tmp/graphics.pdf
(In contrast to the previous solution, it handles multi-page PDFs and uses gs to directly render the rasterized image, without the detour via convert.)
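Saved as e.g. rasterize.sh (the name is just an example), it takes the input PDF and the output PDF as its two positional arguments:
sh rasterize.sh input.pdf output.pdf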