Can iTextSharp reduce the DPI resolution of a PDF?

I'm uploading high-resolution PDF files to our servers, but I'd like to generate smaller files, by reducing the DPI, so that they load quickly in my web application.
Is this something that iTextSharp can do? Or is there other free software that can achieve this?

PDF files, in general, do not have a DPI; raster images embedded in a PDF file do. What you can do is extract the images embedded in your PDF file, resize them to a lower resolution, and put them back in the file.
There is a chapter about this topic in the book iText in Action.

Ghostscript is Free Software, and it can downsample PDFs any way you want (more precisely: it can downsample the raster images that may be embedded on their pages).
Here is an example command line which downsamples all images to 72 dpi (provided their current resolution is more than 144 dpi). It is deliberately not the shortest possible command; instead it enumerates all potentially useful parameters, so that you can experiment:
gs \
-o downsampled.pdf \
-sDEVICE=pdfwrite \
-dColorImageDownsampleThreshold=2.0 \
-dGrayImageDownsampleThreshold=2.0 \
-dMonoImageDownsampleThreshold=2.0 \
-dColorImageDownsampleType=/Bicubic \
-dGrayImageDownsampleType=/Bicubic \
-dMonoImageDownsampleType=/Bicubic \
-dDownsampleColorImages=true \
-dDownsampleGrayImages=true \
-dDownsampleMonoImages=true \
-dColorImageResolution=72 \
-dGrayImageResolution=72 \
-dMonoImageResolution=72 \
-dAutoFilterColorImages=false \
-dAutoFilterGrayImages=false \
\
-dEncodeColorImages=true \
-dEncodeGrayImages=true \
-dEncodeMonoImages=true \
-dColorImageFilter=/DCTEncode \
-dGrayImageFilter=/DCTEncode \
-dMonoImageFilter=/CCITTFaxEncode \
input.pdf
If you want to downsample all color images (that is, also the ones between 73 dpi and 144 dpi), use -dColorImageDownsampleThreshold=1.0 (Ghostscript's default is 1.5); the same goes for the other *ImageDownsampleThreshold settings.
For the *ImageDownsampleType settings, you can also experiment with the values /Average or /Subsample instead of my suggested /Bicubic. And you are of course also free to use different settings for resolution, sampling type and threshold across the mono, gray and color image types.
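For example, here is a minimal sketch (same placeholder file names as above, using only the flags already shown) that downsamples every color image above 72 dpi:
gs \
-o downsampled.pdf \
-sDEVICE=pdfwrite \
-dDownsampleColorImages=true \
-dColorImageResolution=72 \
-dColorImageDownsampleThreshold=1.0 \
input.pdf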

Related

Batch PDF Watermark [PDF -> JPG -> PDF]

I am working with thousands of PDF files for a sheet music publisher.
All of these PDF files need a preview PDF. A watermark on a PDF can easily be removed, so I am asking for a reliable way to watermark our PDFs in a batch operation:
PDF -> Apply Watermark -> JPG -> Back to PDF
How can I do this? Is there a good tool for these operations?
The free route
ImageMagick can do the complete process for you, especially with the composite command's -watermark operator.
#!/bin/sh
# ImageMagick picks the conversion formats based on filename suffixes (or possibly the actual binary content)
InputPDF=$1
WatermarkImg=$2
OutputPDF=$3
pdfToImage=pdfToImage.png
imageWithWatermark=imageWithWatermark.png
# Convert PDF to image
convert \
-density 300 \
-trim \
"$InputPDF" \
-quality 100 \
-flatten \
-sharpen 0x1.0 \
$pdfToImage
# Add watermark to intermediate image
composite \
-dissolve 15 \
-tile \
"$WatermarkImg" \
$pdfToImage \
$imageWithWatermark
# Convert intermediate image back to PDF
convert \
$imageWithWatermark \
"$OutputPDF"
# Clean up
rm $pdfToImage $imageWithWatermark
I find the PDF-to-image conversion acceptable in terms of quality, though you can see some differences when comparing the before and after side by side, especially in how bold glyphs come out less bold.
You can check this good post and its answers for a number of options for converting a PDF to an image: Convert PDF to image with high resolution.
I also checked out pdftoppm, which was highly recommended in that thread, and I still see some degradation of the bold fonts after conversion.
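For reference, here is a minimal pdftoppm invocation as a sketch (file names are placeholders; -r sets the rendering resolution, -png selects PNG output):
pdftoppm -r 300 -png input.pdf pageimg
This writes one numbered PNG per page (pageimg-1.png, pageimg-2.png, and so on).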
Some more tiling Magick
I used this copyright symbol from Wikimedia Commons and this ImageMagick script:
#!/bin/sh
Infile="Copyright.png"
Outfile="Copyright_tiled.png"
h2=$(convert $Infile -format "%[fx:round(h/2)]" info:)
convert $Infile \
\( -clone 0 -roll +0+"$h2" \) \
+append \
-write mpr:sometile \
+delete \
-size 1224x1584 \
tile:mpr:sometile \
$Outfile
to create this staggered tiling (1224x1584 is the page size, 8.5 in x 11 in, multiplied by 72 px/in, times 2 for a good density of tiles).
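The tiled watermark can then be fed into the earlier watermarking script; assuming that script is saved as watermark.sh (a hypothetical name, as are the input/output file names):
./watermark.sh score.pdf Copyright_tiled.png score_preview.pdf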
And here it is unwatermarked again
@ZachYoung I used some different ImageMagick processing, also scriptable; the point is this:
Although "What's done cannot be undone" (Macbeth, Act 5, Scene 1) is very true, especially within a PDF or image, we also know and expect that it applies to any PDF (de)construction. Thus, depending on the value of a forgery, it will always be worth someone's while to engineer a partially reversed copy, fit for scrutiny or use; like the watermarked copy, it will not be the original, but all the same it may look almost as good.
The idiom implies: don't bother yourself about it, since it's best not done in the first place.
The nearest thing to a robust approach is to use a watermark with exactly the same outlines as the text.

Remove / Delete all images from a PDF using Ghostscript or ImageMagick

I want to delete/remove all the images in a PDF, leaving only the text/fonts, with whatever command-line tool possible.
I tried using -dGraphicsAlphaBits=1 in a Ghostscript command, but the images are still present, just rendered as big blocky pixels.
You can use the draft option of cpdf:
cpdf -draft in.pdf -o out.pdf
This should work in most situations, but file a bug report if it doesn't do the right thing for you.
Disclosure: I am the author of cpdf.
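If you need to process many files, here is a simple loop sketch around the same command (the output directory name is illustrative):
mkdir -p out
for f in *.pdf; do
cpdf -draft "$f" -o "out/$f"
done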
Time has passed, and development of Ghostscript has progressed...
The latest releases provide the following new parameters, which can be added to the command line:
-dFILTERIMAGE: produces an output where all raster drawings are removed.
-dFILTERTEXT: produces an output where all text elements are removed.
-dFILTERVECTOR: produces an output where all vector drawings are removed.
Any two of these options can be combined.
Example command:
gs -o noimage.pdf -sDEVICE=pdfwrite -dFILTERIMAGE input.pdf
More details (including some illustrative screenshots) can be found in my answer to "How can I remove all images from a PDF?".
No, AFAIK it's not possible to remove all images in a PDF with a command-line tool.
What's the purpose of your request anyway? To save on file size? To remove information contained in images? Or something else?
Workaround
Whatever you aim at, here is a command that will downsample all images to a resolution of 2 ppi (update: 1 ppi doesn't work), which achieves two goals at once:
reduce the file size
make all images basically incomprehensible
Here's how to do it selectively, for only the images on page 33 of original.pdf:
gs \
-o images-uncomprehendable.pdf \
-sDEVICE=pdfwrite \
-dDownsampleColorImages=true \
-dDownsampleGrayImages=true \
-dDownsampleMonoImages=true \
-dColorImageResolution=2 \
-dGrayImageResolution=2 \
-dMonoImageResolution=2 \
-dFirstPage=33 \
-dLastPage=33 \
original.pdf
If you want to do it for all images on all pages, just skip the -dFirstPage and -dLastPage parameters.
If you want to remove all color information from the images, convert them to grayscale in the same command (search other answers on Stack Overflow where the details for this are discussed), as sketched below.
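As a sketch, the grayscale conversion can be combined with the downsampling in one pdfwrite run; the two grayscale flags are the same ones used in other answers on this page, but this exact combination is an untested assumption, and the output name is illustrative:
gs \
-o images-gray-uncomprehendable.pdf \
-sDEVICE=pdfwrite \
-dDownsampleColorImages=true \
-dDownsampleGrayImages=true \
-dColorImageResolution=2 \
-dGrayImageResolution=2 \
-dColorConversionStrategy=/Gray \
-dProcessColorModel=/DeviceGray \
original.pdf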
Update: Originally, I had proposed to use a resolution of 1 PPI. It seems this doesn't work with Ghostscript. I now tested with 2 PPI. This works.
Update 2: See also the following (new) question with the answer:
How can I remove all images from a PDF?
It provides some sample PostScript code which completely removes all (raster) images from the PDF, leaving the rest of the page layout unchanged.
It also reflects the expanded new capabilities of Ghostscript which can now selectively remove either all text, or all raster images, or all vector objects from a PDF, or any combination of these 3 types.
To separate images and text into different layers, unfortunately there is no Free/Open Source Software utility available; nor a free-as-in-beer one, for that matter...
This task can only be achieved with various payware solutions. Since you didn't exclude this in your question, but asked for 'whatever command-line tool possible', I'll tell you my favorite one:
callas pdfToolbox
A CLI version (which includes a powerful SDK enabling lots of low-level PDF manipulations) is available, and it is supported on all major OS platforms, including Linux.
callas offers a fully featured gratis test license which is valid for (I believe) 14 days.
For completeness, here are the six useful combinations of the new filter switches:
gs -o noImages.pdf -sDEVICE=pdfwrite -dFILTERIMAGE input.pdf
gs -o noText.pdf -sDEVICE=pdfwrite -dFILTERTEXT input.pdf
gs -o noVectors.pdf -sDEVICE=pdfwrite -dFILTERVECTOR input.pdf
gs -o onlyImages.pdf -sDEVICE=pdfwrite -dFILTERVECTOR -dFILTERTEXT input.pdf
gs -o onlyText.pdf -sDEVICE=pdfwrite -dFILTERVECTOR -dFILTERIMAGE input.pdf
gs -o onlyVectors.pdf -sDEVICE=pdfwrite -dFILTERIMAGE -dFILTERTEXT input.pdf

Reducing PDF file size using Ghostscript on Linux didn't work

I have about 50-60 PDF files (images) that are 1.5 MB each. I don't want such large PDF files in my thesis, as that would make downloading, reading and printing a pain in the rear. So I tried using Ghostscript to do the following:
gs \
-dNOPAUSE -dBATCH \
-sDEVICE=pdfwrite \
-dCompatibilityLevel=1.4 \
-dPDFSETTINGS="/screen" \
-sOutputFile=output.pdf \
L_2lambda_max_1wl_E0_1_zg.pdf
However, now my 1.4 MB PDF is 1.5 MB large.
What did I do wrong? Is there some way I can check the resolution of the PDF file? I just need 300 dpi images. Would anyone suggest using convert to change the resolution, or is there some way I could reduce the image resolution with gs, since the image is very grainy when I use convert?
How I use convert:
convert \
-units PixelsPerInch \
~/Desktop/L_2lambda_max_1wl_E0_1_zg.pdf \
-density 600 \
~/Desktop/output.pdf
Example File
http://dl.dropbox.com/u/13223318/L_2lambda_max_1wl_E0_1_zg.pdf
If you run Ghostscript with -dPDFSETTINGS=/screen, this is just a sort of shortcut. In fact you implicitly get a whole bunch of settings applied, which you can query with the following command:
gs \
-dNODISPLAY \
-c ".distillersettings {exch ==only ( ) print ===} forall quit" \
| grep '/screen'
On my Ghostscript (v9.06prerelease) I get the following output (slightly edited to increase readability):
/screen
<< /DoThumbnails false
/MonoImageResolution 300
/ColorImageDownsampleType /Average
/PreserveEPSInfo false
/ColorConversionStrategy /sRGB
/GrayImageDownsampleType /Average
/EmbedAllFonts true
/CannotEmbedFontPolicy /Warning
/PreserveOPIComments false
/GrayImageResolution 72
/GrayACSImageDict <<
/ColorTransform 1
/QFactor 0.76
/Blend 1
/HSamples [2 1 1 2]
/VSamples [2 1 1 2]
>>
/ColorImageResolution 72
/PreserveOverprintSettings false
/CreateJobTicket false
/AutoRotatePages /PageByPage
/MonoImageDownsampleType /Average
/NeverEmbed [/Courier
/Courier-Bold
/Courier-Oblique
/Courier-BoldOblique
/Helvetica
/Helvetica-Bold
/Helvetica-Oblique
/Helvetica-BoldOblique
/Times-Roman
/Times-Bold
/Times-Italic
/Times-BoldItalic
/Symbol
/ZapfDingbats]
/ColorACSImageDict <<
/ColorTransform 1
/QFactor 0.76
/Blend 1
/HSamples [2 1 1 2]
/VSamples [2 1 1 2] >>
/CompatibilityLevel 1.3
/UCRandBGInfo /Remove
>>
I'm wondering if your PDFs are image-heavy, and if this sort of conversion does unwelcome things (e.g. re-sampling images with the 'wrong' parameters) which increase the file size...
If this is the case (an image-heavy PDF), say so, and I'll update this answer with a few suggestions...
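One quick way to check whether a PDF is image-heavy is poppler's pdfimages, if you have it installed (a sketch):
pdfimages -list L_2lambda_max_1wl_E0_1_zg.pdf
It lists every embedded raster image along with its pixel dimensions and ppi; if the list is empty, the PDF contains no raster images at all.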
Update
I had a look at the sample file provided by DNA. Interesting...
No, it does not contain any image.
Instead, it contains one large stream (compressed using /FlateDecode) which consists of roughly 700,000+ (!!) operations, mostly single vector operations in the PDF language, such as:
m (moveto),
l (lineto),
d (setdash),
w (setlinewidth),
S (stroke),
s (closepath and stroke),
W* (eoclip),
rg and RG (setrgbcolor)
and a few more.
(That PDF code is written very inefficiently AFAICS, though it does its job: it concatenates many short strokes instead of drawing long ones, nearly every stroke defines the color again (even when it is the same as before), and it carries all the other per-stroke overhead: start stroke, end stroke, ...)
Ghostscript's -dPDFSETTINGS=/screen does not have any effect here (there are no images to downsample, for example). The increased file size (+48 kByte, to be precise) is probably due to Ghostscript re-organizing some of the internal stroking etc. commands into a different order when it interprets the file.
So there is not much you can do about the PDF file size...
...unless you convert each of these PDF pages into a REAL image such as PNG:
gs \
-o out72.png \
-sDEVICE=pngalpha \
L_2lambda_max_1wl_E0_1_zg.pdf
(I used the pngalpha output device to get a transparent background.) The image dimensions of 'out72.png' are 259x213 px, and the file size is now 70 kByte. But I'm sure you'll not like the quality :-)
The output quality is 'bad' because Ghostscript uses a default resolution of 72 dpi.
Since you said you'd like to have 300dpi, the command becomes this:
gs \
-o out300.png \
-sDEVICE=pngalpha \
-r300 \
L_2lambda_max_1wl_E0_1_zg.pdf
The file size now is 750 kByte, and the image dimensions are 1080x889 pixels.
Update 2
Since Curiosity is en vogue these days... :-) ...I tried to bring down the file size with the help of Adobe Acrobat X Pro on Mac.
You wanna know the results?
Performing a 'Save as... (PDF with reduced file size)' -- which in the past had always yielded very good results for me! -- created a 1.8+ MByte file (+29%). I guess this definitely puts Ghostscript's performance (a file size increase of just +3%) into a realistic perspective!
DNA decided to go for grayscale PNGs. The way he's creating them is in two steps:
Step 1: Convert a color PDF page (such as this) to a grayscale PDF page, using Ghostscript's pdfwrite device and the settings
-dColorConversionStrategy=/Gray and
-dProcessColorModel=/DeviceGray.
Step 2: Convert the grayscale PDF page to a PNG, using Ghostscript's pngalpha device at a resolution of 300 dpi (-r300 on the GS commandline).
This reduces his initial file size of 1.4 MB to 0.7 MB.
But this workflow has the following disadvantage:
It loses all color info, without saving much disk space compared to a color output written at the same resolution directly from the PDF!
There are 2 alternatives to DNA's workflow:
A one-step conversion of (color) PDF -> (color) PNG, using Ghostscript's pngalpha device with the original PDF as input (same settings, 300 dpi resolution). This has the advantage that:
It would keep the color information in the PNG output, requiring only a little more space on disk!
A one-step conversion of (color) PDF -> grayscale PNG, using Ghostscript's pnggray device with the original PDF as input (same settings, 300 dpi resolution), with this mix of advantages and disadvantages:
It would lose the color information in the PNG output.
It would lose the transparent background that was preserved in DNA's workflow.
It would save lots of disk space, because the file size would go down to about 20% of the output from DNA's workflow.
So that you can make up your own mind and see the output sizes and quality side by side, here is a shell script that demonstrates the differences:
#!/bin/bash
#
# Copyright (c) 2012 <kurt.pfeifle@gmail.com>
# License: Creative Commons (CC BY-SA 3.0)
function echo_do() {
echo
echo "Command: ${*}"
echo "--------"
echo
"${#}"
}
[ -d out ] || mkdir out
echo
echo " We assume all PDF pages are 1-page PDFs!"
echo " (otherwise we'd have to include something like '%03d'"
echo " into the output filenames in order to get paged output)"
echo
echo '
# Convert Color PDF to Grayscale PDF.
# If the PDF has a transparent background (most do),
# it will remain transparent in the output.
# ATTENTION: since we do not use a resolution,
# pdfwrite will use its default value of -r720.
# (However, this setting will only affect raster objects...)
'
for i in *.pdf
do
echo_do gs \
-o "out/${i}---pdfwrite-devicegray-gs.pdf" \
-sDEVICE=pdfwrite \
-dColorConversionStrategy=/Gray \
-dProcessColorModel=/DeviceGray \
-dCompatibilityLevel=1.4 \
"${i}"
done
echo '
# Convert (previously generated) grayscale PDF to PNG using Alpha channel
# (Alpha channel can make backgrounds transparent)
'
for i in out/*pdfwrite-devicegray*.pdf
do
echo_do gs \
-o "out/$(basename "${i}")---pngalpha-from-pdfwrite-devicegray-gs.png" \
-sDEVICE=pngalpha \
-r300 \
"${i}"
done
echo '
# Convert (color) PDF to grayscale PNG using Alpha channel
# (Alpha channel can make backgrounds transparent)
'
for i in *.pdf
do
# Following only required for 'pdfwrite' output device, not for 'pngalpha'!
# -dProcessColorModel=/DeviceGray
echo_do gs \
-o "out/${i}---pngalphagray_gs.png" \
-sDEVICE=pngalpha \
-dColorConversionStrategy=/Gray \
-r300 \
"${i}"
done
echo '
# Convert (color) PDF to (color) PNG using Alpha channel
# (Alpha channel can make backgrounds transparent)
'
for i in *.pdf
do
echo_do gs \
-o "out/${i}---pngalphacolor_gs.png" \
-sDEVICE=pngalpha \
-r300 \
"${i}"
done
echo '
# Convert (color) PDF to grayscale PNG
# (no Alpha channel here, therefore [mostly] white backgrounds)
'
for i in *.pdf
do
echo_do gs \
-o "out/${i}---pnggray_gs.png" \
-sDEVICE=pnggray \
-r300 \
"${i}"
done
echo " All output to be found in ./out/ ..."
echo
Run this script and compare the different outputs side by side.
Yes, the 'direct-grayscale-PNG-from-color-PDF-using-pnggray-device' output may look a bit worse (and it doesn't sport the transparent background) than the other one -- but it is also only 20% of its file size. On the other hand, if you want to buy a bit more quality by sacrificing a bit of disk space, you could use -r400 instead of -r300...
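To compare the variants quickly after running the script, a small sketch (identify is part of ImageMagick):
ls -lh out/
identify out/*.png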

Ghostscript to merge PDFs compresses the result

I found this neat command to merge multiple PDF into one, using Ghostscript:
gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -sOutputFile=out.pdf in1.pdf in2.pdf
The resulting size is smaller than the combined size of the two PDFs.
Running the command with a single file as input still results in a smaller output file.
Is there an option in Ghostscript to just copy the pages as they are when merging, without doing any compression?
If not, is it possible that the Ghostscript compression is so good that it will result in absolutely no loss in quality?
Here are some additional options that you can pass when using pdfwrite as your device. According to that page, if you don't pass anything, -dPDFSETTINGS gets set to something close to /screen, although the page doesn't get more specific. You could try setting it to -dPDFSETTINGS=/prepress, which should only compress things above 300 dpi.
gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress -sOutputFile=out.pdf in1.pdf in2.pdf
Another alternative is pdftk:
pdftk in1.pdf in2.pdf cat output out.pdf
Some of the size optimizations that you observed may come from Ghostscript's cleaning up of unused objects, its recently acquired font optimization improvements (do you use a very recent version of GS?!?) and possibly image re-/down-sampling that may have happened.
Ghostscript, if used for PDF -> PDF conversions, basically operates like this:
Read in the input file(s) with all its objects and convert them into its internal format for graphical page representations.
Do the manipulations asked for on the commandline to the page contents in the internal format.
Write out a completely new PDF.
This means that for most PDF -> PDF operations you'll have different ordering and numbering for the PDF objects, and even the object's internal code may have changed (even if your eyes don't discover any differences between input and output PDF).
By default Ghostscript also will compress any object streams that have been uncompressed in the original file (but this is a lossless compression).
Now, for your very simplistic command line, which does not ask for any manipulations, Ghostscript assumes you want to use -dPDFSETTINGS=/default, sets this parameter implicitly, and operates accordingly.
Now what are the /default PDFSETTINGS?! You have two options to find out:
Read the manual. The large table in the middle of this section gives an overview. You can see that -dPDFSETTINGS=/default is itself just shorthand for the several dozen more specific settings it represents. The documentation link given is for the current HEAD of the development code, so the version you actually use may of course differ.
Query (your own) Ghostscript for the detailed meaning of this setting. My answers to question 'Querying Ghostscript for the default options/settings of an output device...' and question 'What are PostScript dictionaries, and how can they be accessed (via Ghostscript)?' do elaborate a bit more on this. In short, to query Ghostscript for the details of its /default PDFSETTINGS, run this command:
gs \
-q \
-dNODISPLAY \
-c ".distillersettings /default get {exch ==only ( ) print ===} forall quit"
You should get a result very similar to this:
/Optimize false
/DoThumbnails false
/PreserveEPSInfo true
/ColorConversionStrategy /LeaveColorUnchanged
/DownsampleMonoImages false
/EmbedAllFonts true
/CannotEmbedFontPolicy /Warning
/PreserveOPIComments true
/GrayACSImageDict << /HSamples [2 1 1 2] /VSamples [2 1 1 2] /QFactor 0.9 /Blend 1 >>
/DownsampleColorImages false
/PreserveOverprintSettings true
/CreateJobTicket false
/AutoRotatePages /PageByPage
/NeverEmbed [/Courier /Courier-Bold /Courier-Oblique /Courier-BoldOblique /Helvetica /Helvetica-Bold /Helvetica-Oblique /Helvetica-BoldOblique /Times-Roman /Times-Bold /Times-Italic /Times-BoldItalic /Symbol /ZapfDingbats]
/ColorACSImageDict << /HSamples [2 1 1 2] /VSamples [2 1 1 2] /QFactor 0.9 /Blend 1 >>
/DownsampleGrayImages false
/UCRandBGInfo /Preserve
The only point that stands out from these: you may want to change /AutoRotatePages from /PageByPage to /None. On the command line you would put it as -dAutoRotatePages=/None, as shown in the example below.
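For example, applied to the merge command from the question (a sketch):
gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -dAutoRotatePages=/None -sOutputFile=out.pdf in1.pdf in2.pdf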
Here is a complete list of parameters which specifically tell Ghostscript to employ as much of a passthrough mode as it possibly can for the input PDF:
-dAntiAliasColorImage=false \
-dAntiAliasGrayImage=false \
-dAntiAliasMonoImage=false \
-dAutoFilterColorImages=false \
-dAutoFilterGrayImages=false \
-dDownsampleColorImages=false \
-dDownsampleGrayImages=false \
-dDownsampleMonoImages=false \
-dColorConversionStrategy=/LeaveColorUnchanged \
-dConvertCMYKImagesToRGB=false \
-dConvertImagesToIndexed=false \
-dUCRandBGInfo=/Preserve \
-dPreserveHalftoneInfo=true \
-dPreserveOPIComments=true \
-dPreserveOverprintSettings=true \
So you could try this command:
gs \
-o output.pdf \
-sDEVICE=pdfwrite \
-dAntiAliasColorImage=false \
-dAntiAliasGrayImage=false \
-dAntiAliasMonoImage=false \
-dAutoFilterColorImages=false \
-dAutoFilterGrayImages=false \
-dDownsampleColorImages=false \
-dDownsampleGrayImages=false \
-dDownsampleMonoImages=false \
-dColorConversionStrategy=/LeaveColorUnchanged \
-dConvertCMYKImagesToRGB=false \
-dConvertImagesToIndexed=false \
-dUCRandBGInfo=/Preserve \
-dPreserveHalftoneInfo=true \
-dPreserveOPIComments=true \
-dPreserveOverprintSettings=true \
input1.pdf \
input2.pdf
Finally, as Chris Haas already hinted: you can also use pdftk if you specifically do not want any of the optimizations that Ghostscript applies by default. pdftk is simply unable to do such things, and you'll gain quite a bit of speed in exchange for its relative dumbness of operation (but probably also get much larger output files than from Ghostscript).
I used the following command with success in the macOS terminal to compress multiple PDFs recursively. I'm posting it because I couldn't find something that worked for me with a simple copy and paste.
find . -name '*.pdf' | while IFS= read -r pdf; do gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE -dQUIET -dBATCH -sOutputFile="${pdf}_new.pdf" "$pdf"; done
Note that you may want a different output quality, so you can change the -dPDFSETTINGS parameter as follows:
-dPDFSETTINGS=/screen: lower quality, smaller size.
-dPDFSETTINGS=/ebook: better quality, but slightly larger PDFs.
-dPDFSETTINGS=/prepress: output similar to the Acrobat Distiller "Prepress Optimized" setting.
-dPDFSETTINGS=/printer: output similar to the Acrobat Distiller "Print Optimized" setting.
-dPDFSETTINGS=/default: output intended to be useful across a wide variety of uses, possibly at the expense of a larger output file.
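To pick a preset empirically, you could run all of them over one file and compare the resulting sizes; a sketch (output file names are illustrative):
for s in screen ebook printer prepress default; do
gs -o "out-${s}.pdf" -sDEVICE=pdfwrite -dPDFSETTINGS=/${s} input.pdf
done
ls -l out-*.pdf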

Converting (any) PDF to black (K)-only CMYK

This is related to:
Converting PDF to CMYK (with identify recognizing CMYK).
Script (or some other means) to convert RGB to CMYK in PDF?
... but a bit more specific here: say I have an RGB PDF where the text color is "rich black" (R:0 G:0 B:0 turned into C:100 M:100 Y:100 K:100), plus diverse images and vector graphics.
I would like to convert this to a CMYK PDF, using a free command-line tool (so it is batch-scriptable under Linux), which has content only in the black (K) channel:
Vector graphics (+ text glyphs) are preserved; colors become grayscale in the black (K) channel only
Images get converted to grayscale in the black (K) channel only
As hinted in my comment to @Mark Storer, it turns out that forcing a gray print only onto the K plate in CMYK may not be so trivial... I guess it depends a lot on what is being used as the "preflight" preview device; for Linux, the only thing I can find is Ghostscript with tiffsep, which is what I use for a sanity check regarding CMYK separations.
Anyways, I got a lot of help in this thread on comp.lang.postscript:
PDF to PDF (gs?): rich RGB black to plain K (CMYK) black? - comp.lang.postscript | Google Groups
... and one workflow that works for me is:
Convert PDF to PS using ghostscript's ps2write
Use ghostscript to convert this PS back to PDF, while executing replacement functions in HackRGB-cmyk-inv.ps
Use ghostscript's tiffsep to check actual separations
For, say, this PDF generated by OpenOffice, blah-slide.pdf, the command lines would be:
# PDF to PS using `ps2write` device of `ghostscript`
gs \
-dNOPAUSE \
-dBATCH \
-sDEVICE=ps2write \
-sOutputFile=./blah-slide-gsps2w.ps \
./blah-slide.pdf
# PS to PDF using replacement function in HackRGB-cmyk-inv.ps
gs \
-dNOPAUSE \
-dBATCH \
-sDEVICE=pdfwrite \
-sOutputFile=./blah-slide-hackRGB-cmyk-inv.pdf \
./HackRGB-cmyk-inv.ps \
./blah-slide-gsps2w.ps
# check separations
gs \
-dNOPAUSE \
-dBATCH \
-dSAFER \
-sDEVICE=tiffsep \
-dFirstPage=1 \
-dLastPage=1 \
-sOutputFile=p%02d.tif \
blah-slide-hackRGB-cmyk-inv.pdf \
\
&& eog p01.tif 2>/dev/null
This should only work on RGB values where R=G=B (and hopefully grayscale values), and only on text colors, and it also flattens text information - but it should be possible to confirm via tiffsep that the text indeed ends up only on the K plate.
As mentioned in the newsgroup post, this is not extensively tested, but looks promising so far...
As an improvement on sdaau's great answer, I can recommend using pdftops from xpdf for converting the PDF to PS, instead of Ghostscript's ps2write, because the latter e.g. causes fonts to become staircase-like, and is said not to preserve the original PDF accurately. Compare by zooming into text areas of the resulting PDFs.
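A sketch of that substitution, reusing the file names from sdaau's workflow above (the intermediate .ps name is illustrative; pdftops ships with xpdf/poppler):
# PDF to PS using pdftops instead of Ghostscript's ps2write
pdftops ./blah-slide.pdf ./blah-slide-pdftops.ps
# PS back to PDF with the replacement functions, as before
gs \
-dNOPAUSE \
-dBATCH \
-sDEVICE=pdfwrite \
-sOutputFile=./blah-slide-hackRGB-cmyk-inv.pdf \
./HackRGB-cmyk-inv.ps \
./blah-slide-pdftops.ps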
I suggest you convert the PDF using GS twice: once to a shades-of-gray colorspace, and then to CMYK.
I'm not sure it'll work, but I'd be a bit surprised if it didn't. Gray -> CMYK sounds like a brain-dead X -> 0 0 0 X conversion. At least if you stick to "device gray" and "device CMYK" instead of some calibrated color space that would tweak things this way and that.
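A minimal sketch of that two-pass idea with Ghostscript (untested here; the flag values are assumptions, and whether gray really lands on the K plate only should be verified with the tiffsep check shown earlier):
# Pass 1: convert everything to DeviceGray
gs \
-o gray.pdf \
-sDEVICE=pdfwrite \
-dColorConversionStrategy=/Gray \
-dProcessColorModel=/DeviceGray \
input.pdf
# Pass 2: convert the grayscale PDF to DeviceCMYK
gs \
-o k-only.pdf \
-sDEVICE=pdfwrite \
-dColorConversionStrategy=/CMYK \
-dProcessColorModel=/DeviceCMYK \
gray.pdf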