Converting (any) PDF to black (K)-only CMYK - pdf

This is related to:
Converting PDF to CMYK (with identify recognizing CMYK).
Script (or some other means) to convert RGB to CMYK in PDF?
... but a bit more specific here: say I have an RGB PDF, where the text color is "rich black" (R:0 G:0 B:0 gone to C:100 M:100 Y:100 K:100), and diverse images and vector graphics.
I would like to convert this to a CMYK PDF, using a free command line tool (so it is batch scriptable under Linux), which
has contents only in the black (K) channel:
Preserves vector graphics (+ text glyphs) - colors become grayscale in black (K) channel only
Images get converted to grayscale in black (K) channel only
Thanks in advance for any answers,
Cheers!

As hinted in my comment to #Mark Storer, it turns out that forcing a gray print only on the K plate in CMYK, may not be so trivial ... I guess it depends much on what is being used as "preflight" preview device - for Linux, the only thing I can find is ghostscript with tiffsep, which is what I use for 'sanity check' regarding CMYK separations.
Anyways, I got a lot of help in this thread on comp.lang.postscript:
PDF to PDF (gs?): rich RGB black to plain K (CMYK) black? - comp.lang.postscript | Google Groups
... and one workflow that works for me is:
Convert PDF to PS using ghostscript's ps2write
Use ghostscript to convert this PS back to PDF, while executing replacement functions in HackRGB-cmyk-inv.ps
Use ghostscript's tiffsep to check actual separations
In respect to, say, this PDF generated by OpenOffice: blah-slide.pdf, the command lines would be:
# PDF to PS using `ps2write` device of `ghostscript`
gs \
-dNOPAUSE \
-dBATCH \
-sDEVICE=ps2write \
-sOutputFile=./blah-slide-gsps2w.ps \
./blah-slide.pdf
# PS to PDF using replacement function in HackRGB-cmyk-inv.ps
gs \
-dNOPAUSE \
-dBATCH \
-sDEVICE=pdfwrite \
-sOutputFile=./blah-slide-hackRGB-cmyk-inv.pdf \
./HackRGB-cmyk-inv.ps \
./blah-slide-gsps2w.ps
# check separations
gs \
-dNOPAUSE \
-dBATCH \
-dSAFER \
-sDEVICE=tiffsep \
-dFirstPage=1 \
-dLastPage=1 \
-sOutputFile=p%02d.tif \
blah-slide-hackRGB-cmyk-inv.pdf \
\
&& eog p01.tif 2>/dev/null
This should only work on RGB values where R=G=B (and hopefully grayscale values), and only on text colors, and it also flattens text information - but it should be possible to confirm via tiffsep that the text indeed ends up only on the K plate.
As mentioned in the newsgroup post, this is not extensively tested, but looks promising so far...
Cheers!

As an improvement to sdaau's great answer, I can recommend using pdftops from xpdf for converting pdf to ps, instead of ghostscript ps2write, because the latter e.g. causes the font to become staircasey, and is said to not preserve the original pdf accurately. Compare by zooming into text areas of the resulting pdfs.

I suggest you convert the PDF using GS twice. Once to a Shades Of Gray colorspace, and then to CMYK.
I'm not sure it'll work, but I'd be a bit surprised if it didn't. G->CMYK sounds like a brain-dead X -> 0 0 0 X conversion. At least if you stick to "device gray" and "device CMYK" instead of some calibrated color space that'll tweak things this way and that.

Related

Batch PDF Watermark [PDF -> JPG -> PDF]

I am working with over thousands PDF files for a Sheet Music publisher.
All of these PDF files needs a preview PDF. A watermark for PDF files can easily be removed so I am asking for a true way to watermark our PDF:s in a batch operation.
PDF->Apply Watermark->JPG->Back to PDF
How can I do this? Is there a good tool for this operations?
The free route
ImageMagick can do the complete process for you, especially with the composite command's -watermark operator.
#!/bin/sh
# ImageMagick will pick the correct conversion formats based on filename suffixes, or maybe actual binary content?
InputPDF=$1
WatermarkImg=$2
OutputPDF=$3
pdfToImage=pdfToImage.png
imageWithWatermark=imageWithWatermark.png
# Convert PDF to image
convert \
-density 300 \
-trim \
"$InputPDF" \
-quality 100 \
-flatten \
-sharpen 0x1.0 \
$pdfToImage
# Add watermark to intermediate image
composite \
-dissolve 15 \
-tile \
"$WatermarkImg" \
$pdfToImage \
$imageWithWatermark
# Convert intermediate image back to PDF
convert \
$imageWithWatermark \
"$OutputPDF"
# Clean up
rm $pdfToImage $imageWithWatermark
I find the PDF to image conversion acceptable in terms of quality, though you can see some differences when looking at the before and after side-by-side, especially in how bolded glyphs seem less bold:
You can check this good post and its answers for a number of options for converting a PDF to an image, Convert PDF to image with high resolution.
I checked out PDFtoPPM, which was also highly mentioned in that thread, and I still see some degrading of the bolded fonts when converted:
Some more tiling Magick
I used this copyright symbol from Wikimedia Commons and this ImageMagick script:
#!/bin/sh
Infile="Copyright.png"
Outfile="Copyright_tiled.png"
h2=$(convert $Infile -format "%[fx:round(h/2)]" info:)
convert $Infile \
\( -clone 0 -roll +0+"$h2" \) \
+append \
-write mpr:sometile \
+delete \
-size 1224X1584 \
tile:mpr:sometile \
$Outfile
to create this staggered tiling (1224X1584 is the page size (8.5in x 11in) multiplied by 72 px/in, times 2 for a good density of tiles):
And here it is unwatermarked again
#ZachYoung I used some different image magic, also scriptable, the point is:-
Although "What's done cannot be undone" Macbeth (Act 5.1. 63-4) is very true especially within a PDF or image. We also know and expect that it too applies to any PDF (de)constructs. Thus depending on value of a forgery it will always be worth engineering a partially reversed copy, fit for scrutiny or use, but will like the watermarked copy, still not be the original, however all the same, may look almost just as good.
The Idiom implies don't bother yourself about it. Its best not done in the first place.
The nearest to best, is use a watermark exactly the same as the text outlines, like this:-

How can I convert an RGB PDF to CMYK with flat black?

I used instructions from here to convert an RGB PDF to CMYK using Ghostscript, and it's mostly OK except all the blacks are "rich" - they use not just K but also CMY inks.
Is there a way to convert such that all blacks are "flat" and just use K?
This is the code I used:
gs \
-o test-cmyk.pdf \
-sDEVICE=pdfwrite \
-sProcessColorModel=DeviceCMYK \
-sColorConversionStrategy=CMYK \
-sColorConversionStrategyForImages=CMYK \
test.pdf
Assuming your PDF file actually does use RGB (and not an ICC profile embedded in the PDF) then, in order to get R=G=B->C==M=Y=0, K=R, you need to set up a custom ICC profile link.
You'll need to tell Ghostscript to use a custom RGB profile in place of default_rgb.icc and a cuustom CMYK profile in place of default_cmyk.icc. You'll need to ensure that the mapping RGB->XYZ->CMYK results in pure K when R=G=B.
There's documentation in the Ghostscript 'doc' folder on the various colour management settings, but most of these will only be effective when rendering, not when outputting a PDF file. About the only things you can usefully alter when outputting PDF is the input and output profiles.

ghostscript cmyk export yields wrong black

I'm trying to convert an RGB-pdf file created by Inkscape to a print ready cmyk-pdf using PSOcoated_v3.icc color profile. PDF generation works fine. However, I'd like to check for correct final colors, especially for the black. As I did not found any (free) tool to pick the cmyk color from the final pdf I thought as a first check I convert the RGB-pdf to cmyk-tiff and check the black value. Doing so using
gs -q -dBATCH -dSAFER -dNOPAUSE \
-sDEVICE=tiff32nc \
-sDefaultRGBProfile=sRGB2014.icc \
-dOverrideICC \
-sOutputICCProfile=PSOcoated_v3.icc \
-sProcessColorModel=DeviceCMYK \
-sColorConversionStrategy=CMYK \
-sOutputFile=rgb.pdf \
cmyk.tiff
yields a cmyk black value of [0.83, 0.67, 0.51, 0.95]. Contrarily, when I use libcms2 to convert rgb (0,0,0) to cmyk I get [0.92, 0.64, 0.45, 0.96] which matches (almost) some information about the PSOcoated_v3.icc profile I found here. To confirmed that the source RGB-file black reads (0,0,0) I convert the RGB-pdf to an RGB-tiff and do find the black to be (0,0,0).
Am I missing something in the command of might this be a gs bug?
If I take an RGB color of [0,0,0] in sRGB color space and convert it to the CMYK value defined by the PSO coated v3 ICC profile in Photoshop (using Adobe ACE CMM in Photoshop), I get exactly the CMYK values you are seeing with gs, that is [0.83, 0.67, 0.51, 0.95].
That was using a relative colorimetric rendering intent with black point compensation enabled. Those are the settings that gs would be using for lcms by default.
I suspect that when you use libcms2, it is using a different rendering intent. For example, when I use the perceptual rendering intent with Adobe ACE I get [0.90, 0.64, 0.45, 0.96].
Note that you can specify with gs what rendering intent you want to use with
-dRenderIntent=0/1/2/3 . See https://ghostscript.com/doc/current/Use.htm#ICC_color_parameters for details.

Convert PDF to CMYK but ignore black?

I am converting an RGB PDF to CMYK using the following command:
/usr/local/bin/gs -dSAFER -dBATCH -dNOPAUSE -dNOCACHE -sDEVICE=pdfwrite \
-sColorConversionStrategy=CMYK -sColorConversionStrategyForImages=CMYK \
-dProcessColorModel=/DeviceCMYK -dEncodeColorImages=false \
-dEncodeGrayImages=false -dEncodeMonoImages=false -sOutputFile=CMYK.PDF RGB.PDF
The resulting file is 100% CMYK, however anything that was 100% black in the RGB PDF is now:
C: 72%
M: 68%
Y: 67%
K: 89%
The result is that black is a slightly dark grey when really it should be black.
Is there anything I need to add to this command to ensure black remains as black during the conversion?
With the current version of pdfwrite you are restricted to the colour conversions in the software, and there is nothign you can do about this.
The colour management is undergoing serious revision, and the next release should make it possible to deal with this.
Examples of Ghostscript commands to convert PDF from sRGB or eciRGB_v2 to eciCMYK_v2 (FOGRA 59) while keeping black plain (K only), not rich (CMYK)
I managed to convert PDF file in RGB to CMYK while keeping RGB black #000000 plain K black, not rich CMYK black.
DeviceLink ICC had to be created from e.g. eciRGB_v2 (or sRGB, if your source is sRGB) profile to the appropriate CMYK profile using collink tool (from argyll package) with the "-f" attribute to hack the black colors.
Ghostscript is then called with a control file declaring use of the profile and its parameters.
Example of making the DeviceLink RGB to CMYK profile
collink -v -f eciRGB_v2.icc eciCMYK_v2.icc eciRGB_v2_to_eciCMYK_v2.icc
Example of a control file to map eciRGB_v2 to eciCMYK_v2
Image_RGB eciRGB_v2_to_eciCMYK_v2.icc 0 1 0
Graphic_RGB eciRGB_v2_to_eciCMYK_v2.icc 0 1 0
Text_RGB eciRGB_v2_to_eciCMYK_v2.icc 0 1 0
Sample ghostscript command to do the actual conversion
gs -o 2-output-cmyk-from-eciRGB.pdf \
-sDEVICE=pdfwrite \
-sColorConversionStrategy=CMYK \
-sSourceObjectICC=control-eciRGB_v2.txt \
1-input-rgb.pdf

Replacing vector images in a PDF with raster images

Is there any easy (scriptable) way to convert a PDF with vector images into a PDF with raster images? In other words, I want to generate a PDF with the exact same (un-rasterized) text but with each vector image replaced with a rasterized version.
I occasionally read PDFs of technical articles on my Kindle, and have found that reading a PDF directly is frustrating. Thankfully, Amazon's automatic conversion of PDFs to the Kindle format does a good job of reflowing the text portions of most of PDFs I have tried. However, while raster images seem to make it through the conversion process fine, vector images get horribly mangled. It would be great if I could easily convert a PDF so that all of its vector images were rasterized.
I am interested in any possible solutions, but a Linux- or Windows-based one would be preferable.
I had a similar issue, and solved it using ImageMagics convert tool (http://www.imagemagick.org/script/index.php). That comes with linux and runs fine on Windows/Cygwin or OS X
convert -density 300 largeVectorFileFromR.pdf out.pdf
With -density 300 you control resolution (as DPI).
Downside: Text is rasterized as well, I understand that Michael does not want this.
After some days searching for some solution, based on "Remove all text from PDF file" and "How to add a picture onto an existing pdf file?" I found a (ugly) scriptable solution:
gs -o /tmp/onlytxt.pdf -sDEVICE=pdfwrite -dFILTERVECTOR -dFILTERIMAGE $INPUT_FILE && \
gs -o /tmp/graphics.pdf -sDEVICE=pdfwrite -dFILTERTEXT $INPUT_FILE && \
convert -density $DPI -quality 100 /tmp/graphics.pdf /tmp/graphics.png && \
convert -density $DPI -quality 100 /tmp/graphics.png /tmp/graphics.pdf && \
pdftk /tmp/graphics.pdf stamp /tmp/onlytxt.pdf output $OUTPUT_FILE && \
rm /tmp/onlytxt.pdf /tmp/graphics.pdf /tmp/graphics.png
were we have three variables INPUT_FILE, OUTPUT_FILE, and DPI. We split the textual and graphical contents via Ghostscript, convert the graphical image to a raster image (PNG) and join the two using pdftk.
I've been using this successfully to convert huge vector images for use in scientific papers.
Pitstop Pro v2 update 3 from Enfocus can do exactly that. It has an action called "Rasterize page content, keeping text" which works pretty well. It is a plugin to Adobe Acrobat so it requires a little more but is also available as a server solution.
It's a little complicated, but you asked for any possible solution. Furthermore this solution is not automatable.
1) Open the pdf with the vector images in Inkscape. Then select the whole image with the select tool (F1)
2) If the vector image is consistant of more than one svg graphic press Ctrl + G (Object --> Group)
3) cut the grouped svg image Ctrl + x
4) open a new InkScape Window Ctrl + n and paste the image Ctrl + v
5) choose File --> export Bitmap (Shift + Ctrl + e), maybe you want to increase the dpi
6) go back to the first InkScape window, File --> import (Ctrl + i) and choose the previously exported bitmap
7) place the bitmap to the location where the svg image was
Save the pdf and the vector image is replaced by a bitmap image.
Here's one way to solve your problem:
Step 1: Use an online PDF-to-HTML converter, like the one here:
http://www.idrsolutions.com/online-pdf-to-html5-converter/
This tool converts the PDF into a set of images and a text overlay. The vector images should be converted to raster at this point.
Step 2: Convert the HTML+images back into PDF:
http://pdfcrowd.com/#convert_by_upload+with_options
The resulting PDF will have all the vector images rasterized, and all text will remain text, so you can select, copy, etc.
Convert the pdf to djvu with https://jwilk.net/software/pdf2djvu converter. Uncheck "antialias fonts,vectors..". It will reduce file size significantly and improve document load times.
I used the following:
gswin32c -o "%2" -dFirstPage=1 -dLastPage=1 -sDEVICE=pngalpha -r72x72 -dUseCropBox -dFitPage "%1" -dBATCH -dNOPAUSE
where %1 is the input file and %2 is the output. This can be used with LaTeX, the generated PNG has the same ratio and page size as the original PDF so the relative position of the image will not change.
Note that in Linux, you may need to use gs rather than gswin32c.
You can also set the page range and then print the pages back to PDF. The downside is that the text gets rasterized as well.
inkscape is the best solution, I quickly made this rather unoptimized batch file that does exactly that and you can play with it and change options. ImageMacick convert, gs, or pdftoimages don't work as good as inkscape they either don't export the layers or export but with bad quality :
#!/bin/bash
#set -xev
ORIGINAL_FOLDER=`pwd`
JPEGS=`mktemp -d`
unzip "$1" -d "$JPEGS"
cd "$JPEGS"
# expang the pdf in pdf pages
pdftk combined_to_do.pdf burst output pg_%04d.pdf
#1) print the pdf's to pngs as they are seen with alpha, layers, transparency etc, this cannot be done by ImageMacick convert or pdftoimages
ls ./pg*.pdf | xargs -L1 -I {} inkscape {} -z --export-dpi=300 --export-area-drawing --export-png={}.png
#2) Second change to jpgs
rm *.pdf
ls ./p*.png | xargs -L1 -I {} convert {} -quality 100 -density 300 {}.jpg
#3) This to make a pdf file out of every jpg image without loss of either resolution or quality:
ls -1 ./*jpg | xargs -L1 -I {} img2pdf {} -o {}.pdf
#4) This to concatenate the pdfpages into one:
pdftk *.jpg.pdf cat output combined.pdf
#5) And last I add an OCRed text layer that doesn't change the quality of the scan in the pdfs so they can be searchable:
pypdfocr combined.pdf
cp "$JPEGS/combined_ocr.pdf" "$ORIGINAL_FOLDER/$1_ocr.pdf"
cp "$JPEGS/combined.pdf" "$ORIGINAL_FOLDER/$1.pdf"
Based on Civ Lins solution, I came up with this:
#!/usr/bin/env sh
gs -o /tmp/onlytxt.pdf -sDEVICE=pdfwrite -dFILTERVECTOR -dFILTERIMAGE $1 && \
gs -o /tmp/graphics.pdf -sDEVICE=pdfimage24 -dFILTERTEXT -r600 -dDownScaleFactor=6 $1 && \
pdftk /tmp/graphics.pdf multistamp /tmp/onlytxt.pdf output $2 && \
rm /tmp/onlytxt.pdf /tmp/graphics.pdf
(In contrast to the previous solution, it handles multipage PDFs and uses gs to directly render the rasterized image without the detour of convert.)