How to batch convert CMYK PDFs to RGB? - pdf

I have a large batch of PDFs (6,000 +) that need to be converted from a CMYK color profile to RGB. Are there any scripts that can accomplish this task, and ideally without a (too) visible change in color? The PDFs are book files that were originally designed for print and are being prepared to be loaded as e-books.
I've found a few InDesign scripts that might be able to do this, but at this point obtaining and re-exporting from the original design files will be extraordinarily time consuming. Another option seems to be running actions via Adobe Acrobat, but I haven't had any success with that yet.
I've also found this bit of Java, if anyone can vouch for it:
http://www.aspose.com/docs/display/pdfjava/Changing+Color+space+of+a+PDF+document
Any suggestions or insights?

You can use Ghostscript for this job. Make sure to use a very recent release though.
Here is a command to try:
gs \
-o rgb.pdf \
-sDEVICE=pdfwrite \
-sProcessColorModel=DeviceRGB \
-sColorConversionStrategy=RGB \
-sColorConversionStrategyForImages=RGB \
cmyk.pdf
Note, that your goal to achieve the conversion 'ideally without a (too) visible change in color' is not always possible. It very much depends on wether the input PDF did use an embedded color profile, and which.
It also depends on the color profile you apply. Above command will use a default RGB profile compiled into Ghostscript. To use a custom profile, you can add various command line parameters. To use one profile for all types of PDF content, use:
-sDefaultRGBProfile=rgb-profile-filename
This defines source colors which are not already colorimetrically defined in the source docu- ment.
If you want to override the profiles which are already embedded in the PDF document, add this:
-dOverrideICC=true
On top of these options, you can also control the ICC profile for the output device, by adding:
-sOutputICCProfile=output-profile-filename
When using an output profile, you frequently also want to set the rendering intent. For this purpose use:
-dRenderIntent=intent
where intent is one of
0 : for Perceptual
1 : for Colorimetric
2 : for Saturation
3 : for Absolute Colorimetric intent.
Ghostscript even supports to use different profiles for different types of PDF content: graphics, text and images. See here:
-sGraphicICCProfile=graphicprofile-filename
-sTextICCProfile=textprofile-filename
-sImageICCProfile=imageprofile-filename
Similar to the above explained generic option -dRenderIntent, you can specify different intents for different content types:
-dGraphicIntent=intent
-dTextIntent=intent
-dImageIntent=intent

I'd use Adobe Acrobat Pro. Go into Tools, Preflight (might be in a different spot depending on what version you have).
In the PDF fixups section look for "Convert to sRGB". You can run this command manually a single PDF to see if it works for you. If it does, go to the Options menu and select "Create Preflight Droplet"
You'll get some options about what to do on success and failure but when you click the "Save" button you'll get an actual EXE file for Windows, Mac you should get an Application file. This file you can drag file and I think folders directly only to and it will run that action, just like Photoshop.

Preflight generated using Adobe Acrobat Pro can be used in a batch process. In my case I have to convert spot colors to CMYK without affecting other colours so i choose convert to CMYK only (SWOP) in the PDF fixups section
A .exe file is generated after saving the preflight. That can be tested using the command below in command prompt this can be tested.
"%location_of_file%\Convert to CMYK only (SWOP).exe" "" "%file_name%"
I also prepared a script so as to automate the process i can give the small prototype of it.
d:
cd %dir% ::directory on which the batch process is to be run.
:cycle
set count_files=0
for %%x in (*.pdf) do set /a count_files+=1 ::PDF in my case so *.pdf
if %count_files%==0 ( GOTO :MISSING ) else (for /F %%a in ('dir /a-d /b /o-d *.pdf') do set oldest=%%a)
"%location_of_executable%\Convert to CMYK only (SWOP).exe" "" "%oldest%"
move %oldest% %output_folder_with_location%\%oldest%
timeout 3 ::delay so that conversion process get completed
:MISSING
goto :cycle
This batch script goes on looping itself weather it gets failed or pass and in case there are PDF to be processed this BATCH script starts converting the oldest file first.

Related

inkscape: multiple page pdf to multiple png

when I convert pdf to image in linux command line, it seems inkscape gets the best result (better quality than gs with same dpi). Unfortunately, it only converts the first page to png. How to convert every pdf page to different png file? Do I have to extract one PDF page and store to a new pdf file , then do inkscape concert, and so on?
This isn't solely using Inkscape, but you could use e.g. pdftk to split up the pdf-file into separate pages and convert every page into a png with Inkscape. For example, like this:
pdftk file.pdf burst;
l=$(ls pg_*.pdf)
for i in $l; do inkscape "$i" -z --export-dpi=300 --export-area-page --export-png="$i.png"; done
Note that pdftk burst creates pdf-files called pg_0001.pdf, etc., so if you have any files named like that, they'll be overwritten. You can remove them afterwards easily using
rm pg_*.pdf
Lu Kas' answer threw warnings for me without doing the conversion. Probably because I'm running Inkscape 1.1
However, i got it running by replacing some deprecated commands:
inkscape pdfFile.pdf --export-dpi=300 --export-area-page --export-filename=imageFile.png;
For batch processing rather than slowly looping through file by file inkscape has a shell mode for command file scripting. See https://wiki.inkscape.org/wiki/index.php/Using_the_Command_Line#Shell_mode
However like all other #file.txt scripts you need to write a custom text file. and for Windows users run against higher ranking inkscape.com not .exe
Since version 1.0 (currently 1.2) a multipage pdf of contents can be addressed for multiple outputs. for some other examples see https://inkscape.org/doc/inkscape-man.html#EXAMPLES
Commands get replaced over time so currently to export png use --export-type="xxx" to batch export a list of input files to type xxx. Thus in this case --export-type="png"
Also for pdf related inputs and support see https://wiki.inkscape.org/wiki/index.php/Using_the_Command_Line#New_options
For windows users there is a handy batchfile converter here https://gist.github.com/JohannesDeml/779b29128cdd7f216ab5000466404f11

PDF to PDF/A-2b without dUseCIEColor

My goal is to take arbitrary PDF from the users, and save it as PDF/A-2b.
The current approach is to use Ghostscript 9.21 (via ghost4j) to create the converted file. This works but not without some problems. I got it to work with two different sets of parameters to Ghostscript.
Firstly
Using the option -dUseCIEColor as shown below will work, and produces valid PDF/A-2b with a couple of different test files. This will however print pages of errors into the log saying it is not recommended to use.
These are the full arguments:
-dBATCH
-dNOPAUSE
-dPrinted=true
-sDEVICE=pdfwrite
-dPDFACompatibilityPolicy=1
-sColorConversionStrategy=/UseDeviceIndependentColor
-sProcessColorModel=DeviceCMYK
-sOutputICCProfile=/tmp/icc.icc
-sOutputFile=/tmp/result.pdf
-dPDFA=2
-dUseCIEColor
/tmp/PDFA_def.ps
/tmp/test.pdf
And the PDFA_def.ps is the default supplier 9.21, pointing to the same ICC Profile and this line in the bottom:
<</NeverEmbed []>> setdistillerparams
The ICC Profile is a random (CMYK) profile published by Adobe.
This works, apart from the errors in the log.
Secondly
Then I will try to do as the log errors told, and remove the -dUseCIEColor.
Now some of the test files work, some wont. I suspect this has to do with the color profile of the original PDF, or something like that.
3-heights gives the validation error: A device-specific color space (DeviceRGB) without an appropriate output intent is used.
This can be corrected by switching the -sProcessColorModel=DeviceRGB and switch the ICC Profile to an random RGB profile.
Then for another document, you get the error: A device-specific color space (DeviceCMYK) without an appropriate output intent is used.
Is there something I'm missing with this? It would seem I need to switch the options based on the original PDF file which would be far from the preferred style. If it helps, I would be fine with black and white PDF/A-2b too. Thanks!
Its impossible to say what the problems are without seeing the files. UseCIEColor is an awful PostScript hack to try and do colour management, it isn't reliable (in terms of colour) and will effectively defeat any real colour management. Clearly you aren't performing colour management since you are using a random profile, but all the same....
Since you don't really care about colour management, I would suggest that instead of UseDeviceIndependentColor, you select CMYK (as that's the ProcessColorModel you are using). Note that if you select ColorConversionStrategy=/CMYK then you don't need to set ProcessColorModel, that's assumed from the conversion.
Other than that, I'd have to suggest that you open a bug report. If people don't report problems then they won't get fixed....
The correct PDF/A-compatible replacement for UseCIEColor seems to be in combination of these 2 options:
-sProcessColorModel=DeviceCMYK
-sColorConversionStrategy=UseDeviceIndependentColor
Both DeviceCMYK and DeviceRGB work for me.

Ghostscript: How to auto-crop STDIN to "bounding box" and write to PDF?

Here have been already quite a few questions and answers about cropping documents with Ghostscript.
However, the answers are not matching my exact needs and are still confusing to me.
I expected that there would be a single option e.g. "-AutoCropToBBox" or something like this.
For clarification, as a bounding box, I understand the smallest rectangular box which contains all (non-white(?)) printed objects completely.
Furthermore, I want/have to use a printer port redirection (RedMon) to generate a cropped PDF via printing to a Postscript-printer from basically any application.
So, under Win7/64bit, I set the redirected port properties:
Redirected port properties Win7/64bit
The output is redirected to
C:\Windows\system32\cmd.exe
The arguments for the program are:
/c gswin64c.exe -sDEVICE=pdfwrite -o -sOutputFile="%1".pdf -
"%1" contains the user input for filename. With this, I get a full-page PDF. Fine!
But how to add the cropping options?
Additional question:
If I have a multipage document will such an (auto-)cropping be individual for each page? Or would there be an option to keep it all the same e.g. like the first page or like the largest bounding box of all pages?
Another related issue:
the window for prompting for the filename is always popping up behind the application I am printing from. Any ideas to always bring it to the front?
Another question:
There is the Perl-script "ps2eps" and program bbox.exe (see http://ctan.org/pkg/ps2eps). It's said there that Ghostscript (or ps2epsi) is occationally(?) calculating wrong bounding boxes. Is this (still) true?
Thanks for your help.
Well your first problem is that PostScript programs are normally written to expect to be rendered to a specific media size, and are usually not tightly bounded to it. White space is important for readability.
So ordinarily the PostScript program you generate will request a specific media size, and the interpreter will do its best to match that. If it can't match it then it will use a strategy to try and get as close as possible, and scale the entire content to fit that media.
You can't expect the printer to perform any of those things if it doesn't know the required size until its finished, and you can't be certain of the bounding box until you have rendered all the marking content. It is true that some files generally EPS files have a %%BoundingBox comment but.. that's a comment, it has no effect in PostScript, its there for the benefit of applications which don't want to interpret the PostScript.
So that's why the simple switch you want isn't there, it would break the interpreter's normal functioning, for rendering.
So, the first thing you need to do is determine the bounding box of the content. You can do that, as Stefan says, by using the bbox device. And on that note, as far as I know the bbox device produces accurate output. If it does not then we would appreciate a bug report proving it so we can fix it. If people don't report bugs how are we supposed to know about them ? Its disappointing to see someone spreading FUD instead of helping out with a bug report.......
ps2epsi isn't Ghostscript, its a crappy cheap and cheerful script, I wouldn't use it. However..... If the original PostScript leaves stuff on the stack then it will end up as a corrupted (or invalid) EPS file and the original PostScript should be fixed before trying to use it as it will break any PostScript program that tries to use it (eg if you include the EPS in a docuemnt and then print it).
So if you are using Ghostscript, and you want to take a PostScript program and get an EPS out of it, use the eps2write device. It won't have a preview bu frankly who cares.
Now if I remember correctly the bbox device (and eps2write) record all marking operations, you can't simply record all the non-white marking operations; what if the white overwrites an existing mark on the page ? What if the media is not white ? Note that if you render to a PNG with Ghostscript, the untouched portion of the output is transparent, whereas white marks are not.
So the bbox is the extent of all the marking operations, regardless of the colour. The only other way to proceed would be to render the content and count the non-white pixels. But that only works at a specific resolution, change the resolution and the precise bounding box may change as well.
Once you have the Bounding Box you can tell Ghostscript to use media that size. Note that you will almost certainly also have to translate the origin, as its unlikely that the content will start tightly at the bottom left corner. You will need -dDEVICEWIDTHPOINTS and -dDEVICEHEIGHTPOINTS to set the media size, and you will need to use -c and -f to send PostScript to alter the origin appropriately. In simple cases an '-x -y translate' will suffice but if the program executes initgraphics you will instead have to set a BeginPage procedure to alter the initial CTM.
If you set the media size with -dDEVICEWIDTHPOINTS etc then all pages will be the same size. If you don't want that then you need to write a BeginPage procedure to resize each page individually (you will also need to hook setpagedevice and remove the /PageSize entries from the dictionary.
I've no idea why Windows is putting the dialog box behind the active Window, it seems to have started doing that with Windows 7 (or possibly Vista). I don't see any way to alter that because I'm not sure what is generating the dialog.....
Personally I would suggest that you try the 2-step approach of running the original through Ghostscript's eps2write device and then take the EPS and create a PDF file using the pdfwrite device and the -dEPSCrop switch. Double converting is bad, but other solutions are worse. Note that EPS files cannot be multi-page, so you will have to create 'n' EPS files from an n-page PostScript program, and then supply a command line listing each EPS file as input to the pdfwrite device.
Take an example file and try this out from the command line before you try scripting it.
As I understood from #KenS explanations:
the way eps2write works, it may not or will not or actually cannot result in the minimum possible bounding box
it needs to be a 2-step process via -sDEVICE=bbox
So, I now ended up with the following process to "print" a PDF with a correct minimum possible bounding box:
Redirected Printer Port to cmd.exe
C:\Windows\system32\cmd.exe
Arguments for the program:
/c gswin64c.exe -q -o "%1".ps -sDEVICE=ps2write - && gswin64c.exe -q -dBATCH -dNOPAUSE -sDEVICE=bbox -dLastPage=1 "%1".ps 2>&1 >nul | perl.exe C:\myFiles\CropPS2PDF.pl "%1"
Unfortunately, it requires a little Perl script (let's call it: CropPS2PDF.pl):
#!usr/bin/perl -w
use strict;
my $FileName = $ARGV[0];
$/ = undef;
my $Crop = <STDIN>;
$Crop =~ /%%BoundingBox: (\d+) (\d+) (\d+) (\d+)/s; # get the bbox coordinates
my ($llx, $lly, $urx, $ury) = ($1, $2, $3, $4);
print "\n$FileName: $llx, $lly, $urx, $ury \n"; # print just to check
my $Command = qq{gswin64c.exe -q -o $FileName.pdf -sDEVICE=pdfwrite -c "[/CropBox [$llx $lly $urx $ury]" -c " /PAGE pdfmark" -f $FileName.ps};
print $Command; # print just to check
system($Command); # execute command
It seems to work... :-)
Improvements are welcome.
My questions are still:
Can this be done somehow without Perl? Just Win7, cmd.exe and Ghostscript?
Is there maybe a way without writing the PS-File to disk which I do not need? Of course, I could also delete it afterwards with the Perl-script.

Replace all font glyphs in a PDF by converting them to outline shapes

I am looking for a way to 'outline' all text/fonts in a PDF file, i.e. convert them to curves.
I would prefer to do this without having to convert the PDF to PostScript and back. Also, I would like to use free lightweight cross-platform tools that can be automated from the command line, such as Ghostscript or MuPDF.
Yes, you can use Ghostscript to achieve what you want.
I. For Ghostscript versions up to 9.14
You need to go through 2 steps:
Convert the PDF to a PostScript file, but use the side effect of a relatively unknown parameter: it is called -dNOCACHE. This will convert all used fonts to outline shapes:
gs -o somepdf.ps -dNOCACHE -sDEVICE=pswrite somepdf.pdf
Convert the PS back to PDF (and, maybe delete the intermediate PS again):
gs -o somepdf-with-outlines.pdf -sDEVICE=pdfwrite somepdf.ps
rm somepdf.ps
This method is not reliable long-term, because the Ghostscript developers have stated that -dNOCACHE may not be present in future versions.
Note: the resulting PDF will very likely be larger than the original one. Plus, without additional command line parameters, all images in the original PDF will likely also be processed according to Ghostscript builtin defaults. This can lead to unwanted side-effects. Those side-effects can be avoided by adding more command line parameters to do otherwise.
II. Ghostscript versions 9.15 or newer
Ghostscript version 9.15 (released in September 2014) supports a new command line parameter:
-dNoOutputFonts
This will cause the output devices pdfwrite, ps2write and eps2write "to 'flatten' glyphs into 'basic' marking operations (rather than writing fonts to the output)".
This means: the two steps described for pre-9.15 GS versions can be avoided. The desired result can be achieved with a single command:
gs -o file-with-outlines.pdf -dNoOutputFonts -sDEVICE=pdfwrite file.pdf
Note: the same caveat is true as already noted in part I. If your PDF includes images, there may be unwanted side effects introduced by the simple command line above. To avoid these, you need to add more specific parameters.
This commit adds a new switch -dNoOutputFonts to the Ghostscript pdfwrite and ps2write devices which will produce a PDF file (or PostScript, depending on the selected device) where all the glyphs have been created as vectors, not as text.
You will need at least version 9.15 of Ghostscript to get this feature. Be aware that the PDF file will almost certainly be larger and copy/paste/search will (obviously) not work.
III. Ghostscript versions 9.54.0 (Windows 10)
I found a method that preserves all fonts flawlessly as vectors without any visual errors and with just two printing steps, after Ghostscript is first installed and configured correctly.
(Note! You must Add the Ghostscript bin-/ and lib-folder to your windows PATH in order to get Ghostscript to do anything)
Instructions here
Print your PDF-file that contains vector based fonts or other vector elements with Acrobat Reader and using Microsoft PS Class Driver to a YourFile.prn file. (To install this driver -- Control Panel - Devices - Printers & Scanners - Add a Printer or scanner -- and let first Windows to look for a while for a connected printer, and when it stops select an option -- The printer that I want is not listed - Add a local printer or network printer with manual settings - Next - Use an existing port: > File:(Print to File) - Next - Microsoft: Microsoft PS Class Driver - Next)
Open Command prompt, navigate to the folder where YourFile.prn file is located and type: "C:\Program Files\gs\gs9.54.0\bin\gswin64c.exe" -dNOPAUSE -dNOCACHE -dBATCH -sDEVICE=eps2write -sOutputFile=YourFile.eps YourFile.prn
If you have a constant need to do this you can also create prn2eps.bat file containing the following:
"C:\Program Files\gs\gs9.54.0\bin\gswin64c.exe" -dNOPAUSE -dNOCACHE -dBATCH -sDEVICE=eps2write -sOutputFile=%1.eps %1.prn
To use that bat file you just need to type: prn2eps YourFile.
(Note! you must have the bat file and Yourfile.prn in the same directory)
For some reason newest Ghostscript ps2epsi function didn't work in Windows 10, and Adobe made PDF:s had e.g. minor but consistent errors in some font characters when I imported them in non-Adobe design software as PDF:s. I have found out during the years that EPS-file format is one of the most reliable formats when vectors must be preserved from one software to another. Many times printing PDF again to PDF using just another printer driver may be enough or single file format change using Ghostscript, but not always.

How to merge many PDF files into a single one? [duplicate]

This question already has answers here:
Merge / convert multiple PDF files into one PDF [closed]
(23 answers)
Closed 6 years ago.
I have 16 pdfs that I want to convert into a single one... I am on Ubuntu 10.10, how can I do it?
First, get Pdftk:
sudo apt-get install pdftk
Now, as shown on example page, use
pdftk 1.pdf 2.pdf 3.pdf cat output 123.pdf
for merging pdf files into one.
You can also use Ghostscript to merge different PDFs. You can even use it to merge a mix of PDFs, PostScript (PS) and EPS into one single output PDF file:
gs \
-o merged.pdf \
-sDEVICE=pdfwrite \
-dPDFSETTINGS=/prepress \
input_1.pdf \
input_2.pdf \
input_3.eps \
input_4.ps \
input_5.pdf
However, I agree with other answers: for your use case of merging PDF file types only, pdftk may be the best (and certainly fastest) option.
Update:
If processing time is not the main concern, but if the main concern is file size (or a fine-grained control over certain features of the output file), then the Ghostscript way certainly offers more power to you. To highlight a few of the differences:
Ghostscript can 'consolidate' the fonts of the input files which leads to a smaller file size of the output. It also can re-sample images, or scale all pages to a different size, or achieve a controlled color conversion from RGB to CMYK (or vice versa) should you need this (but that will require more CLI options than outlined in above command).
pdftk will just concatenate each file, and will not convert any colors. If each of your 16 input PDFs contains 5 subsetted fonts, the resulting output will contain 80 subsetted fonts. The resulting PDF's size is (nearly exactly) the sum of the input file bytes.
You can use http://www.mergepdf.net/ for example
Or:
PDFTK http://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/
If you are NOT on Ubuntu and you have the same problem (and you wanted to start a new topic on SO and SO suggested to have a look at this question) you can also do it like this:
Things You'll Need:
* Full Version of Adobe Acrobat
Open all the .pdf files you wish to merge. These can be minimized on your desktop as individual tabs.
Pull up what you wish to be the first page of your merged document.
Click the 'Combine Files' icon on the top left portion of the screen.
The 'Combine Files' window that pops up is divided into three sections. The first section is titled, 'Choose the files you wish to combine'. Select the 'Add Open Files' option.
Select the other open .pdf documents on your desktop when prompted.
Rearrange the documents as you wish in the second window, titled, 'Arrange the files in the order you want them to appear in the new PDF'
The final window, titled, 'Choose a file size and conversion setting' allows you to control the size of your merged PDF document. Consider the purpose of your new document. If its to be sent as an e-mail attachment, use a low size setting. If the PDF contains images or is to be used for presentation, choose a high setting. When finished, select 'Next'.
A final choice: choose between either a single PDF document, or a PDF package, which comes with the option of creating a specialized cover sheet. When finished, hit 'Create', and save to your preferred location.
Tips & Warnings
Double check the PDF documents prior to merging to make sure all pertinent information is included. Its much easier to re-create a single PDF page than a multi-page document.
There are lots of free tools that can do this.
I use PDFTK (a open source cross-platform command-line tool) for things like that.
Also seem pdfjam: http://www2.warwick.ac.uk/fac/sci/statistics/staff/academic/firth/software/pdfjam/