I want to delete / remove all the images in a PDF, leaving only the text / fonts, with whatever command-line tool possible.
I tried using -dGraphicsAlphaBits=1 in a Ghostscript command, but the images are still present, just rendered as big blocky pixels.
You can use the draft option of cpdf:
cpdf -draft in.pdf -o out.pdf
This should work in most situations, but file a bug report if it doesn't do the right thing for you.
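If you only need this on some pages, cpdf accepts a page range after the input file. A sketch, assuming the standard cpdf range syntax (check the cpdf manual):
cpdf -draft in.pdf 2-5 -o out.pdf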
Disclosure: I am the author of cpdf.
Time has passed, and development of Ghostscript has progressed...
The latest releases support the following new command-line parameters:
-dFILTERIMAGE: produces an output where all raster drawings are removed.
-dFILTERTEXT: produces an output where all text elements are removed.
-dFILTERVECTOR: produces an output where all vector drawings are removed.
Any two of these options can be combined.
Example command:
gs -o noimage.pdf -sDEVICE=pdfwrite -dFILTERIMAGE input.pdf
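Two filters can also be combined in a single call; for example, to remove both raster images and vector drawings, keeping only the text:
gs -o onlyText.pdf -sDEVICE=pdfwrite -dFILTERVECTOR -dFILTERIMAGE input.pdf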
More details (including some illustrative screenshots) can be found in my answer to "How can I remove all images from a PDF?".
No, AFAIK, it's not possible to remove all images in a PDF with a command-line tool.
What's the purpose of your request anyway? Save on filesize? Remove information contained in images? Or ...?
Workaround
Whatever your aim, here is a command that will downsample all images to a resolution of 2 ppi (update: 1 ppi doesn't work), which achieves two goals at once:
reduce the file size
make all images basically incomprehensible
Here's how to do it selectively, for only the images on page 33 of original.pdf:
gs \
-o images-uncomprehendable.pdf \
-sDEVICE=pdfwrite \
-dDownsampleColorImages=true \
-dDownsampleGrayImages=true \
-dDownsampleMonoImages=true \
-dColorImageResolution=2 \
-dGrayImageResolution=2 \
-dMonoImageResolution=2 \
-dFirstPage=33 \
-dLastPage=33 \
original.pdf
If you want to do it for all images on all pages, just skip the -dFirstPage and -dLastPage parameters.
If you want to remove all color information from the images, convert them to grayscale in the same command (details are discussed in other answers on Stack Overflow).
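A sketch of how that grayscale conversion might look, assuming a reasonably recent Ghostscript (older releases may additionally need -sProcessColorModel=DeviceGray); note that this converts all page content to grayscale, not only the images:
gs \
-o images-gray-uncomprehendable.pdf \
-sDEVICE=pdfwrite \
-sColorConversionStrategy=Gray \
-dDownsampleColorImages=true \
-dDownsampleGrayImages=true \
-dColorImageResolution=2 \
-dGrayImageResolution=2 \
original.pdf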
Update: Originally, I had proposed to use a resolution of 1 PPI. It seems this doesn't work with Ghostscript. I now tested with 2 PPI. This works.
Update 2: See also the following (new) question with the answer:
How can I remove all images from a PDF?
It provides some sample PostScript code which completely removes all (raster) images from the PDF, leaving the rest of the page layout unchanged.
It also reflects the newly expanded capabilities of Ghostscript, which can now selectively remove all text, all raster images, or all vector objects from a PDF, or any combination of these three types.
To separate images and text into different layers, unfortunately there is no Free/Open Source Software utility available, nor even a free-as-in-beer one.
This task can only be achieved with various payware software solutions. Since you didn't exclude these in your question, and you asked for 'whatever command-line tool possible', I'll tell you my favorite one:
callas pdfToolbox
A version for CLI usage (which includes a powerful SDK enabling lots of low-level PDF manipulations) is available, and this is supported on all major OS platforms, including Linux.
callas offers a fully featured gratis test license, which is valid for (I believe) 14 days.
gs -o noImages.pdf -sDEVICE=pdfwrite -dFILTERIMAGE input.pdf
gs -o noText.pdf -sDEVICE=pdfwrite -dFILTERTEXT input.pdf
gs -o noVectors.pdf -sDEVICE=pdfwrite -dFILTERVECTOR input.pdf
gs -o onlyImages.pdf -sDEVICE=pdfwrite -dFILTERVECTOR -dFILTERTEXT input.pdf
gs -o onlyText.pdf -sDEVICE=pdfwrite -dFILTERVECTOR -dFILTERIMAGE input.pdf
gs -o onlyVectors.pdf -sDEVICE=pdfwrite -dFILTERIMAGE -dFILTERTEXT input.pdf
Related
I used instructions from here to convert an RGB PDF to CMYK using Ghostscript, and it's mostly OK except all the blacks are "rich" - they use not just K but also CMY inks.
Is there a way to convert such that all blacks are "flat" and just use K?
This is the code I used:
gs \
-o test-cmyk.pdf \
-sDEVICE=pdfwrite \
-sProcessColorModel=DeviceCMYK \
-sColorConversionStrategy=CMYK \
-sColorConversionStrategyForImages=CMYK \
test.pdf
Assuming your PDF file actually does use RGB (and not an ICC profile embedded in the PDF), then, in order to map R=G=B to C=M=Y=0 with the gray carried on the K channel alone, you need to set up a custom ICC profile link.
You'll need to tell Ghostscript to use a custom RGB profile in place of default_rgb.icc and a custom CMYK profile in place of default_cmyk.icc, and to ensure that the mapping RGB->XYZ->CMYK results in pure K when R=G=B.
There's documentation in the Ghostscript 'doc' folder on the various colour management settings, but most of these are only effective when rendering, not when outputting a PDF file. About the only things you can usefully alter when outputting PDF are the input and output profiles.
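As a sketch, those two overrides are passed on the command line like this (the .icc file names here are placeholders; authoring profiles that actually map R=G=B to K-only is the hard part):
gs \
-o flat-black.pdf \
-sDEVICE=pdfwrite \
-sColorConversionStrategy=CMYK \
-sDefaultRGBProfile=custom_rgb.icc \
-sOutputICCProfile=custom_cmyk.icc \
test.pdf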
I run the following command to split a PDF in ImageMagick:
convert file.pdf[5-10] file.png
The resulting output files are always suffixed starting with zero. That is:
file-0.png, file-1.png, file-2.png...
Any ideas what I might be doing wrong? The documentation states that the files should be suffixed starting at 5, matching the page numbers of the pages extracted.
I ended up solving this by using the -scene # command line parameter.
This causes the output to begin at the desired index. For posterity:
convert file.pdf -scene 5 file-%d.png
You see the result you describe because ImageMagick's page count for multi-page image formats is zero-based: Page 1 will have index 0, page 2 will have index 1, etc.
Also, ImageMagick cannot process PDF input files itself: it employs Ghostscript as its 'delegate' -- Ghostscript consumes the PDF first and emits a raster file for each PDF page. Only these raster files are then processed by ImageMagick.
Depending on your exact ImageMagick version and IM setup, this may result in an indirect PNG output generation, and the conversion chain may look like this:
PDF --(handled by Ghostscript)--> PPM (portable pixmap) --(handled by ImageMagick)--> PNG
If you are unlucky, the result will be slow and the quality may not be as good as it could be.
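To see which Ghostscript command line your ImageMagick build uses for PDF and PS input, you can list its delegates (the exact output varies by installation):
convert -list delegate | grep -Ei 'pdf|ps'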
To verify what exactly happens in a convert a.pdf a.png command, you can add the -verbose parameter. That will show you the Ghostscript command being employed by IM to process the PDF input:
convert -verbose a.pdf a.png
/var/tmp/magick-15951W3TZ3WRpwIUk1 PNG 612x792 612x792+0+0 8-bit sRGB 3.73KB 0.000u 0:00.000
a.pdf PDF 612x792 612x792+0+0 16-bit sRGB 3.73KB 0.000u 0:00.000
a.pdf=>a.png PDF 612x792 612x792+0+0 8-bit sRGB 2c 2.95KB 0.000u 0:00.000
[ghostscript library] -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT \
-dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 "-sDEVICE=pngalpha" \
-dTextAlphaBits=4 -dGraphicsAlphaBits=4 "-r72x72" \
"-sOutputFile=/var/tmp/magick-15951W3TZ3WRpwIUk%d" \
"-f/var/tmp/magick-15951nJD8-fF8kA7j" \
"-f/var/tmp/magick-15951JTZDMwtEswHn"
(As you can see, my IM installation is set up to do a PDF->PNG conversion without the detour via PPM... Your mileage may vary.)
You may get better results when using Ghostscript directly, instead of running an IM convert command. (If ImageMagick works at all with PDF->PNG conversion, you have a working Ghostscript installation for sure.) So you can try this:
gs \
-o file-%03d.png \
-sDEVICE=pngalpha \
file.pdf
The -%03d part of the output file name will cause Ghostscript to produce file-001.png, file-002.png, file-003.png, ...
However, if you are unlucky and have an older version of Ghostscript installed, the numbering may also start with a file-000.png...
In any case, since your sample command seems to suggest that you want to convert only a page range (5-10) from the PDF file (not all pages), here is the command to use:
gs \
-o file-%03d.png \
-sDEVICE=pngalpha \
-dFirstPage=5 \
-dLastPage=10 \
file.pdf
But the bad news here is: Ghostscript will STILL name the output files file-001.png (page 5) ... file-006.png (page 10).
To work around that, you'll have to generate the PNGs for the first 4 pages too, and delete them again afterwards:
gs \
-o file-%03d.png \
-sDEVICE=pngalpha \
-dFirstPage=1 \
-dLastPage=10 \
file.pdf
rm file-00{1,2,3,4}.png
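Alternatively, a small shell loop (a sketch) can render one page at a time and name each output file after its real page number, so nothing has to be deleted afterwards:
for p in $(seq 5 10); do
  gs -o "file-$(printf '%03d' "$p").png" \
     -sDEVICE=pngalpha \
     -dFirstPage="$p" \
     -dLastPage="$p" \
     file.pdf
done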
I'm trying to upload hi-res PDF files to our servers, but would like to generate a smaller PDF file size so that it loads quickly in my web application, by reducing the dpi resolution.
Is this something that iTextSharp can do? Or is there another free software that can achieve this?
PDF files, in general, do not have a DPI; raster images embedded in a PDF file do. What you can do is extract the images embedded in your PDF file, resize them to a lower resolution, and put them back in your file.
There is a chapter about this topic in the book iText in Action.
Ghostscript is Free Software (if you want), and it can downsample PDFs any way you want (well, downsample the pixel images that may be embedded on its pages).
Example command line, which downsamples all images to 72 dpi (provided they have a resolution of more than 144 dpi). This is not the shortest possible command; I deliberately enumerate all potentially useful parameters, so that you can experiment:
gs \
-o downsampled.pdf \
-sDEVICE=pdfwrite \
-dColorImageDownsampleThreshold=2.0 \
-dGrayImageDownsampleThreshold=2.0 \
-dMonoImageDownsampleThreshold=2.0 \
-dColorImageDownsampleType=/Bicubic \
-dGrayImageDownsampleType=/Bicubic \
-dMonoImageDownsampleType=/Bicubic \
-dDownsampleColorImages=true \
-dDownsampleGrayImages=true \
-dDownsampleMonoImages=true \
-dColorImageResolution=72 \
-dGrayImageResolution=72 \
-dMonoImageResolution=72 \
-dAutoFilterColorImages=false \
-dAutoFilterGrayImages=false \
\
-dEncodeColorImages=true \
-dEncodeGrayImages=true \
-dEncodeMonoImages=true \
-dColorImageFilter=/DCTEncode \
-dGrayImageFilter=/DCTEncode \
-dMonoImageFilter=/CCITTFaxEncode \
input.pdf
If you want to downsample all color images (that is, also the ones from 73 dpi to 144 dpi), use -dColorImageDownsampleThreshold=1.0 (Ghostscript's default is 1.5); the same goes for the other *ImageDownsampleThreshold settings.
For the *ImageDownsampleType parameters, you can also experiment with values of /Average or /Subsample instead of my suggested /Bicubic. And you are of course also free to use different settings for resolution, sampling type, and threshold across the mono, gray, and color image types.
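For example, a minimal variant (all other parameters left at Ghostscript's defaults) that downsamples every color image above 72 dpi:
gs \
-o downsampled.pdf \
-sDEVICE=pdfwrite \
-dDownsampleColorImages=true \
-dColorImageResolution=72 \
-dColorImageDownsampleThreshold=1.0 \
input.pdf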
Is there any easy (scriptable) way to convert a PDF with vector images into a PDF with raster images? In other words, I want to generate a PDF with the exact same (un-rasterized) text but with each vector image replaced with a rasterized version.
I occasionally read PDFs of technical articles on my Kindle, and have found that reading a PDF directly is frustrating. Thankfully, Amazon's automatic conversion of PDFs to the Kindle format does a good job of reflowing the text portions of most of PDFs I have tried. However, while raster images seem to make it through the conversion process fine, vector images get horribly mangled. It would be great if I could easily convert a PDF so that all of its vector images were rasterized.
I am interested in any possible solutions, but a Linux- or Windows-based one would be preferable.
I had a similar issue, and solved it using ImageMagick's convert tool (http://www.imagemagick.org/script/index.php), which comes with Linux and runs fine on Windows/Cygwin or OS X:
convert -density 300 largeVectorFileFromR.pdf out.pdf
With -density 300 you control the resolution (in DPI).
Downside: the text is rasterized as well, which I understand Michael does not want.
After some days of searching for a solution, based on "Remove all text from PDF file" and "How to add a picture onto an existing pdf file?", I found an (ugly) scriptable solution:
gs -o /tmp/onlytxt.pdf -sDEVICE=pdfwrite -dFILTERVECTOR -dFILTERIMAGE $INPUT_FILE && \
gs -o /tmp/graphics.pdf -sDEVICE=pdfwrite -dFILTERTEXT $INPUT_FILE && \
convert -density $DPI -quality 100 /tmp/graphics.pdf /tmp/graphics.png && \
convert -density $DPI -quality 100 /tmp/graphics.png /tmp/graphics.pdf && \
pdftk /tmp/graphics.pdf stamp /tmp/onlytxt.pdf output $OUTPUT_FILE && \
rm /tmp/onlytxt.pdf /tmp/graphics.pdf /tmp/graphics.png
where we have three variables: INPUT_FILE, OUTPUT_FILE, and DPI. We split the textual and graphical contents via Ghostscript, convert the graphical content to a raster image (PNG) and back to PDF, then join the two using pdftk.
I've been using this successfully to convert huge vector images for use in scientific papers.
Pitstop Pro v2 update 3 from Enfocus can do exactly that. It has an action called "Rasterize page content, keeping text" which works pretty well. It is a plugin for Adobe Acrobat, so it requires a little more setup, but it is also available as a server solution.
It's a little complicated, but you asked for any possible solution. Also note that this solution is not automatable.
1) Open the PDF with the vector images in Inkscape, then select the whole image with the selection tool (F1)
2) If the vector image consists of more than one SVG graphic, press Ctrl + G (Object --> Group)
3) Cut the grouped SVG image with Ctrl + X
4) Open a new Inkscape window with Ctrl + N and paste the image with Ctrl + V
5) Choose File --> Export Bitmap (Shift + Ctrl + E); you may want to increase the DPI
6) Go back to the first Inkscape window, choose File --> Import (Ctrl + I), and select the previously exported bitmap
7) Place the bitmap at the location where the SVG image was
Save the PDF, and the vector image is replaced by a bitmap image.
Here's one way to solve your problem:
Step 1: Use an online PDF-to-HTML converter, like the one here:
http://www.idrsolutions.com/online-pdf-to-html5-converter/
This tool converts the PDF into a set of images and a text overlay. The vector images should be converted to raster at this point.
Step 2: Convert the HTML+images back into PDF:
http://pdfcrowd.com/#convert_by_upload+with_options
The resulting PDF will have all the vector images rasterized, and all text will remain text, so you can select, copy, etc.
Convert the PDF to DjVu with the pdf2djvu converter (https://jwilk.net/software/pdf2djvu). Uncheck "antialias fonts, vectors...". This will reduce the file size significantly and improve document load times.
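A sketch of the equivalent command line (the anti-aliasing switches vary between pdf2djvu versions, so check pdf2djvu --help):
pdf2djvu --dpi=300 -o out.djvu in.pdf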
I used the following:
gswin32c -o "%2" -dFirstPage=1 -dLastPage=1 -sDEVICE=pngalpha -r72x72 -dUseCropBox -dFitPage "%1" -dBATCH -dNOPAUSE
where %1 is the input file and %2 is the output. This can be used with LaTeX, the generated PNG has the same ratio and page size as the original PDF so the relative position of the image will not change.
Note that in Linux, you may need to use gs rather than gswin32c.
You can also set the page range and then print the pages back to PDF. The downside is that the text gets rasterized as well.
Inkscape is the best solution; I quickly made this rather unoptimized batch script that does exactly that, and you can play with it and change the options. ImageMagick convert, gs, and pdftoimages don't work as well as Inkscape: they either don't export the layers, or export them with bad quality:
#!/bin/bash
#set -xev
ORIGINAL_FOLDER=`pwd`
JPEGS=`mktemp -d`
unzip "$1" -d "$JPEGS"
cd "$JPEGS"
# expand the pdf into individual pdf pages
pdftk combined_to_do.pdf burst output pg_%04d.pdf
#1) print the pdfs to pngs as they are seen, with alpha, layers, transparency etc.; this cannot be done by ImageMagick convert or pdftoimages
ls ./pg*.pdf | xargs -L1 -I {} inkscape {} -z --export-dpi=300 --export-area-drawing --export-png={}.png
#2) Second change to jpgs
rm *.pdf
ls ./p*.png | xargs -L1 -I {} convert {} -quality 100 -density 300 {}.jpg
#3) This to make a pdf file out of every jpg image without loss of either resolution or quality:
ls -1 ./*jpg | xargs -L1 -I {} img2pdf {} -o {}.pdf
#4) This to concatenate the pdfpages into one:
pdftk *.jpg.pdf cat output combined.pdf
#5) And last I add an OCRed text layer that doesn't change the quality of the scan in the pdfs so they can be searchable:
pypdfocr combined.pdf
cp "$JPEGS/combined_ocr.pdf" "$ORIGINAL_FOLDER/$1_ocr.pdf"
cp "$JPEGS/combined.pdf" "$ORIGINAL_FOLDER/$1.pdf"
Based on Civ Lin's solution, I came up with this:
#!/usr/bin/env sh
gs -o /tmp/onlytxt.pdf -sDEVICE=pdfwrite -dFILTERVECTOR -dFILTERIMAGE $1 && \
gs -o /tmp/graphics.pdf -sDEVICE=pdfimage24 -dFILTERTEXT -r600 -dDownScaleFactor=6 $1 && \
pdftk /tmp/graphics.pdf multistamp /tmp/onlytxt.pdf output $2 && \
rm /tmp/onlytxt.pdf /tmp/graphics.pdf
(In contrast to the previous solution, it handles multi-page PDFs and uses gs to directly render the rasterized image, without the detour via convert.)
This is related to:
Converting PDF to CMYK (with identify recognizing CMYK).
Script (or some other means) to convert RGB to CMYK in PDF?
... but a bit more specific here: say I have an RGB PDF where the text color is "rich black" (R:0 G:0 B:0, which would convert to C:100 M:100 Y:100 K:100), plus diverse images and vector graphics.
I would like to convert this to a CMYK PDF, using a free command-line tool (so it is batch-scriptable under Linux), which produces content only in the black (K) channel:
vector graphics (+ text glyphs) are preserved, with colors becoming grayscale in the black (K) channel only
images get converted to grayscale in the black (K) channel only
Thanks in advance for any answers,
Cheers!
As hinted in my comment to @Mark Storer, it turns out that forcing a gray print only onto the K plate in CMYK may not be so trivial... I guess it depends a lot on what is used as the "preflight" preview device - for Linux, the only thing I can find is Ghostscript with tiffsep, which is what I use for a sanity check of the CMYK separations.
Anyways, I got a lot of help in this thread on comp.lang.postscript:
PDF to PDF (gs?): rich RGB black to plain K (CMYK) black? - comp.lang.postscript | Google Groups
... and one workflow that works for me is:
Convert PDF to PS using ghostscript's ps2write
Use ghostscript to convert this PS back to PDF, while executing replacement functions in HackRGB-cmyk-inv.ps
Use ghostscript's tiffsep to check actual separations
In respect to, say, this PDF generated by OpenOffice: blah-slide.pdf, the command lines would be:
# PDF to PS using `ps2write` device of `ghostscript`
gs \
-dNOPAUSE \
-dBATCH \
-sDEVICE=ps2write \
-sOutputFile=./blah-slide-gsps2w.ps \
./blah-slide.pdf
# PS to PDF using replacement function in HackRGB-cmyk-inv.ps
gs \
-dNOPAUSE \
-dBATCH \
-sDEVICE=pdfwrite \
-sOutputFile=./blah-slide-hackRGB-cmyk-inv.pdf \
./HackRGB-cmyk-inv.ps \
./blah-slide-gsps2w.ps
# check separations
gs \
-dNOPAUSE \
-dBATCH \
-dSAFER \
-sDEVICE=tiffsep \
-dFirstPage=1 \
-dLastPage=1 \
-sOutputFile=p%02d.tif \
blah-slide-hackRGB-cmyk-inv.pdf \
\
&& eog p01.tif 2>/dev/null
This should only work on RGB values where R=G=B (and hopefully on grayscale values), and only on text colors; it also flattens the text information - but it should be possible to confirm via tiffsep that the text indeed ends up only on the K plate.
As mentioned in the newsgroup post, this is not extensively tested, but looks promising so far...
Cheers!
As an improvement to sdaau's great answer, I can recommend using pdftops from xpdf to convert the PDF to PS, instead of Ghostscript's ps2write, because the latter causes fonts to become staircasey, for example, and is said not to preserve the original PDF accurately. Compare by zooming into text areas of the resulting PDFs.
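A sketch of the modified first step, replacing Ghostscript's ps2write with pdftops (the output file name is arbitrary; the rest of the workflow stays the same):
pdftops ./blah-slide.pdf ./blah-slide-pdftops.ps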
I suggest you convert the PDF using GS twice: once to a shades-of-gray colorspace, and then to CMYK.
I'm not sure it'll work, but I'd be a bit surprised if it didn't. G->CMYK sounds like a brain-dead X -> 0 0 0 X conversion. At least if you stick to "device gray" and "device CMYK" instead of some calibrated color space that'll tweak things this way and that.
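A sketch of that two-pass conversion (untested, as noted above; whether the gray really lands on the K plate alone depends on how your Ghostscript version handles the gray-to-CMYK mapping):
# Pass 1: convert everything to DeviceGray
gs \
-o gray.pdf \
-sDEVICE=pdfwrite \
-sColorConversionStrategy=Gray \
-sProcessColorModel=DeviceGray \
input.pdf
# Pass 2: convert the gray PDF to CMYK
gs \
-o cmyk.pdf \
-sDEVICE=pdfwrite \
-sColorConversionStrategy=CMYK \
-sProcessColorModel=DeviceCMYK \
gray.pdf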