PDF to PDF/A-2b without dUseCIEColor - pdf

My goal is to take arbitrary PDFs from users and save them as PDF/A-2b.
The current approach is to use Ghostscript 9.21 (via ghost4j) to create the converted file. This works but not without some problems. I got it to work with two different sets of parameters to Ghostscript.
Firstly
Using the option -dUseCIEColor as shown below will work, and produces valid PDF/A-2b with a couple of different test files. This will however print pages of errors into the log saying it is not recommended to use.
These are the full arguments:
-dBATCH
-dNOPAUSE
-dPrinted=true
-sDEVICE=pdfwrite
-dPDFACompatibilityPolicy=1
-sColorConversionStrategy=/UseDeviceIndependentColor
-sProcessColorModel=DeviceCMYK
-sOutputICCProfile=/tmp/icc.icc
-sOutputFile=/tmp/result.pdf
-dPDFA=2
-dUseCIEColor
/tmp/PDFA_def.ps
/tmp/test.pdf
The PDFA_def.ps is the default one supplied with 9.21, pointing to the same ICC profile, with this line at the bottom:
<</NeverEmbed []>> setdistillerparams
The ICC Profile is a random (CMYK) profile published by Adobe.
This works, apart from the errors in the log.
Secondly
Next I tried to do as the log errors advised and removed -dUseCIEColor.
Now some of the test files work and some won't. I suspect this has to do with the color profile of the original PDF, or something like that.
3-heights gives the validation error: A device-specific color space (DeviceRGB) without an appropriate output intent is used.
This can be corrected by switching to -sProcessColorModel=DeviceRGB and switching the ICC profile to a random RGB profile.
Then for another document, you get the error: A device-specific color space (DeviceCMYK) without an appropriate output intent is used.
Is there something I'm missing with this? It would seem I need to switch the options based on the original PDF file which would be far from the preferred style. If it helps, I would be fine with black and white PDF/A-2b too. Thanks!

It's impossible to say what the problems are without seeing the files. UseCIEColor is an awful PostScript hack to try and do colour management; it isn't reliable (in terms of colour) and will effectively defeat any real colour management. Clearly you aren't performing colour management since you are using a random profile, but all the same....
Since you don't really care about colour management, I would suggest that instead of UseDeviceIndependentColor, you select CMYK (as that's the ProcessColorModel you are using). Note that if you select ColorConversionStrategy=/CMYK then you don't need to set ProcessColorModel, that's assumed from the conversion.
Other than that, I'd have to suggest that you open a bug report. If people don't report problems then they won't get fixed....
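As a rough sketch of the suggestion above, the argument list could be assembled like this in Python. The helper name `pdfa_args` and the default paths are hypothetical. The switches are a subset of those in the question, with ColorConversionStrategy set to CMYK and both -sProcessColorModel and -dUseCIEColor dropped. Whether the output validates still depends on the input file and on the ICC profile referenced from PDFA_def.ps.

```python
def pdfa_args(input_pdf, output_pdf, pdfa_def="/tmp/PDFA_def.ps"):
    """Build Ghostscript arguments for a PDF/A-2b conversion attempt.

    Hypothetical helper: it only assembles switches discussed in this thread.
    """
    return [
        "-dBATCH",
        "-dNOPAUSE",
        "-sDEVICE=pdfwrite",
        "-dPDFA=2",
        "-dPDFACompatibilityPolicy=1",
        # Selecting CMYK here implies the CMYK process colour model,
        # so -sProcessColorModel is not passed.
        "-sColorConversionStrategy=CMYK",
        "-sOutputFile=" + output_pdf,
        pdfa_def,   # assumed to reference a CMYK ICC profile
        input_pdf,
    ]


args = pdfa_args("/tmp/test.pdf", "/tmp/result.pdf")
print(" ".join(args))
```

The resulting list can be handed to whatever launches Ghostscript (ghost4j in the question, or a plain process call).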

The correct PDF/A-compatible replacement for UseCIEColor seems to be the combination of these two options:
-sProcessColorModel=DeviceCMYK
-sColorConversionStrategy=UseDeviceIndependentColor
Both DeviceCMYK and DeviceRGB work for me.

Related

Ghostscript: How to auto-crop STDIN to "bounding box" and write to PDF?

Here have been already quite a few questions and answers about cropping documents with Ghostscript.
However, the answers are not matching my exact needs and are still confusing to me.
I expected that there would be a single option e.g. "-AutoCropToBBox" or something like this.
For clarification, as a bounding box, I understand the smallest rectangular box which contains all (non-white(?)) printed objects completely.
Furthermore, I want/have to use a printer port redirection (RedMon) to generate a cropped PDF via printing to a Postscript-printer from basically any application.
So, under Win7/64bit, I set the redirected port properties:
The output is redirected to
C:\Windows\system32\cmd.exe
The arguments for the program are:
/c gswin64c.exe -sDEVICE=pdfwrite -o -sOutputFile="%1".pdf -
"%1" contains the user input for filename. With this, I get a full-page PDF. Fine!
But how to add the cropping options?
Additional question:
If I have a multipage document will such an (auto-)cropping be individual for each page? Or would there be an option to keep it all the same e.g. like the first page or like the largest bounding box of all pages?
Another related issue:
the window for prompting for the filename is always popping up behind the application I am printing from. Any ideas to always bring it to the front?
Another question:
There is the Perl script "ps2eps" and the program bbox.exe (see http://ctan.org/pkg/ps2eps). It's said there that Ghostscript (or ps2epsi) is occasionally(?) calculating wrong bounding boxes. Is this (still) true?
Thanks for your help.
Well your first problem is that PostScript programs are normally written to expect to be rendered to a specific media size, and are usually not tightly bounded to it. White space is important for readability.
So ordinarily the PostScript program you generate will request a specific media size, and the interpreter will do its best to match that. If it can't match it then it will use a strategy to try and get as close as possible, and scale the entire content to fit that media.
You can't expect the printer to perform any of those things if it doesn't know the required size until it's finished, and you can't be certain of the bounding box until you have rendered all the marking content. It is true that some files, generally EPS files, have a %%BoundingBox comment, but that's a comment: it has no effect in PostScript, and it's there for the benefit of applications which don't want to interpret the PostScript.
So that's why the simple switch you want isn't there, it would break the interpreter's normal functioning, for rendering.
So, the first thing you need to do is determine the bounding box of the content. You can do that, as Stefan says, by using the bbox device. And on that note, as far as I know the bbox device produces accurate output. If it does not then we would appreciate a bug report proving it so we can fix it. If people don't report bugs how are we supposed to know about them? It's disappointing to see someone spreading FUD instead of helping out with a bug report.......
ps2epsi isn't Ghostscript, it's a crappy cheap and cheerful script, I wouldn't use it. However..... If the original PostScript leaves stuff on the stack then it will end up as a corrupted (or invalid) EPS file and the original PostScript should be fixed before trying to use it, as it will break any PostScript program that tries to use it (e.g. if you include the EPS in a document and then print it).
So if you are using Ghostscript, and you want to take a PostScript program and get an EPS out of it, use the eps2write device. It won't have a preview but frankly who cares.
Now if I remember correctly the bbox device (and eps2write) record all marking operations, you can't simply record all the non-white marking operations; what if the white overwrites an existing mark on the page ? What if the media is not white ? Note that if you render to a PNG with Ghostscript, the untouched portion of the output is transparent, whereas white marks are not.
So the bbox is the extent of all the marking operations, regardless of the colour. The only other way to proceed would be to render the content and count the non-white pixels. But that only works at a specific resolution, change the resolution and the precise bounding box may change as well.
Once you have the bounding box you can tell Ghostscript to use media that size. Note that you will almost certainly also have to translate the origin, as it's unlikely that the content will start tightly at the bottom left corner. You will need -dDEVICEWIDTHPOINTS and -dDEVICEHEIGHTPOINTS to set the media size, and you will need to use -c and -f to send PostScript to alter the origin appropriately. In simple cases an '-x -y translate' will suffice but if the program executes initgraphics you will instead have to set a BeginPage procedure to alter the initial CTM.
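The arithmetic for the simple case can be sketched in a few lines of Python. The function name `crop_args` is hypothetical, and this covers only the plain 'translate' variant; it does not handle programs that execute initgraphics or that request their own media size.

```python
def crop_args(bbox, ps_file):
    """Turn a bounding box (llx, lly, urx, ury) in points into Ghostscript
    arguments that set the media size and shift the origin.

    Hypothetical helper covering only the simple 'translate' case.
    """
    llx, lly, urx, ury = bbox
    width = urx - llx    # media width in points
    height = ury - lly   # media height in points
    return [
        f"-dDEVICEWIDTHPOINTS={width}",
        f"-dDEVICEHEIGHTPOINTS={height}",
        # Move the lower-left corner of the content to the media origin.
        "-c", f"{-llx} {-lly} translate",
        "-f", ps_file,
    ]


print(crop_args((36, 50, 300, 400), "in.ps"))
```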
If you set the media size with -dDEVICEWIDTHPOINTS etc. then all pages will be the same size. If you don't want that then you need to write a BeginPage procedure to resize each page individually (you will also need to hook setpagedevice and remove the /PageSize entries from the dictionary).
I've no idea why Windows is putting the dialog box behind the active Window, it seems to have started doing that with Windows 7 (or possibly Vista). I don't see any way to alter that because I'm not sure what is generating the dialog.....
Personally I would suggest that you try the 2-step approach of running the original through Ghostscript's eps2write device and then take the EPS and create a PDF file using the pdfwrite device and the -dEPSCrop switch. Double converting is bad, but other solutions are worse. Note that EPS files cannot be multi-page, so you will have to create 'n' EPS files from an n-page PostScript program, and then supply a command line listing each EPS file as input to the pdfwrite device.
Take an example file and try this out from the command line before you try scripting it.
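A minimal sketch of the two-step approach, written as a Python helper that only builds the two command lines. The helper name, the output file names and the assumption that the page count is known in advance are all illustrative. It relies on Ghostscript's %d page-number substitution in -sOutputFile to get one EPS per page.

```python
def two_step_commands(ps_file, page_count, gs="gswin64c.exe"):
    """Return (to_eps, to_pdf) argument lists for the two-step conversion.

    Hypothetical helper; 'page%d.eps' and 'cropped.pdf' are made-up names.
    """
    eps_pattern = "page%d.eps"   # %d is replaced by the page number
    to_eps = [
        gs, "-q", "-dBATCH", "-dNOPAUSE",
        "-sDEVICE=eps2write",
        "-sOutputFile=" + eps_pattern,
        ps_file,
    ]
    # One EPS file per page, listed in order as inputs to pdfwrite.
    eps_files = [eps_pattern % n for n in range(1, page_count + 1)]
    to_pdf = [
        gs, "-q", "-dBATCH", "-dNOPAUSE",
        "-sDEVICE=pdfwrite",
        "-dEPSCrop",             # crop each page to the EPS bounding box
        "-sOutputFile=cropped.pdf",
    ] + eps_files
    return to_eps, to_pdf


to_eps, to_pdf = two_step_commands("in.ps", 2)
print(" ".join(to_eps))
print(" ".join(to_pdf))
```

Each list can be run with `subprocess.run`; passing the arguments as a list avoids cmd.exe interpreting the percent sign.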
As I understood from @KenS's explanations:
the way eps2write works, it may not or will not or actually cannot result in the minimum possible bounding box
it needs to be a 2-step process via -sDEVICE=bbox
So, I now ended up with the following process to "print" a PDF with a correct minimum possible bounding box:
Redirected Printer Port to cmd.exe
C:\Windows\system32\cmd.exe
Arguments for the program:
/c gswin64c.exe -q -o "%1".ps -sDEVICE=ps2write - && gswin64c.exe -q -dBATCH -dNOPAUSE -sDEVICE=bbox -dLastPage=1 "%1".ps 2>&1 >nul | perl.exe C:\myFiles\CropPS2PDF.pl "%1"
Unfortunately, it requires a little Perl script (let's call it: CropPS2PDF.pl):
#!/usr/bin/perl -w
use strict;
my $FileName = $ARGV[0];
$/ = undef;
my $Crop = <STDIN>;
$Crop =~ /%%BoundingBox: (\d+) (\d+) (\d+) (\d+)/s; # get the bbox coordinates
my ($llx, $lly, $urx, $ury) = ($1, $2, $3, $4);
print "\n$FileName: $llx, $lly, $urx, $ury \n"; # print just to check
my $Command = qq{gswin64c.exe -q -o $FileName.pdf -sDEVICE=pdfwrite -c "[/CropBox [$llx $lly $urx $ury]" -c " /PAGE pdfmark" -f $FileName.ps};
print $Command; # print just to check
system($Command); # execute command
It seems to work... :-)
Improvements are welcome.
My questions are still:
Can this be done somehow without Perl? Just Win7, cmd.exe and Ghostscript?
Is there maybe a way without writing the PS-File to disk which I do not need? Of course, I could also delete it afterwards with the Perl-script.
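For comparison, the parsing step of the Perl script above could be sketched in Python. This does not remove the scripting dependency; it only illustrates the same logic with an explicit check for a missing %%BoundingBox line. The function names and the sample text are made up for the example, and the regular expression also accepts negative coordinates, which the Perl version does not.

```python
import re

# Matches the integer bounding box line, not the %%HiResBoundingBox line.
BBOX_RE = re.compile(r"%%BoundingBox: (-?\d+) (-?\d+) (-?\d+) (-?\d+)")


def parse_bbox(bbox_output):
    """Extract (llx, lly, urx, ury) from the text printed by the bbox device."""
    match = BBOX_RE.search(bbox_output)
    if match is None:
        raise ValueError("no %%BoundingBox line found")
    return tuple(int(value) for value in match.groups())


def cropbox_command(file_name, bbox, gs="gswin64c.exe"):
    """Build the pdfwrite command that sets a CropBox via pdfmark."""
    llx, lly, urx, ury = bbox
    return [
        gs, "-q", "-o", file_name + ".pdf", "-sDEVICE=pdfwrite",
        "-c", f"[/CropBox [{llx} {lly} {urx} {ury}] /PAGE pdfmark",
        "-f", file_name + ".ps",
    ]


# Made-up sample of what the bbox device writes to stderr.
sample = (
    "%%BoundingBox: 36 50 300 400\n"
    "%%HiResBoundingBox: 36.500000 50.200000 299.900000 399.700000\n"
)
print(cropbox_command("out", parse_bbox(sample)))
```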

Ghostscript won't convert PDF to PDF/A. Annotation Issue

I'm trying to convert a PDF to PDF/A. At every pass I'm getting the error "GPL Ghostscript 9.19: Annotation set to non-printing, not permitted in PDF/A, reverting to normal PDF output".
The PDF has previously been generated from HTML by wkhtmltopdf. With the error being quite vague I've done some research around PDF annotations. I've confirmed the PDF has no annotations, flattening annotations (though there isn't one) hasn't worked, I tried the -dShowAnnots=false switch. All to no avail. I've also tried it with a variety of different PDFs and I'm getting the same error on them all.
The command I'm using to do the conversion is "gs -dPDFA=2 -dNOOUTERSAVE -sProcessColorModel=DeviceRGB -sDEVICE=pdfwrite -o output.pdf /Users/work/Documents/Projects/pdf-generator-service-tests/PDFA_def.ps -dPDFACompatibilityPolicy=1 input.pdf"
I tried creating a basic PDF page from Google's homepage using wkhtmltopdf https://google.com putput.pdf and again, no joy (this is an example of the PDFs I've tried to convert, for people who may want to try and replicate the issue).
I thought the error was quite specific: PDF/A does not permit annotations to be set to non-printing. You haven't included an actual example of the kind of file causing you a problem, so I can't possibly comment on the presence of any annotations, but I assure you that it's not possible to get this message without having annotations.
Since you've already set PDFACompatibilityPolicy to 1 there's not much else I can say. You could open a bug report and attach the file there, or post a link to one here. Without that I can't say much.
Oh and you don't say which version of Ghostscript you are using, or where you sourced it from. Occasionally packagers break things so it might be worth trying to build from source.
One point: you execute the PDFA_def.ps file before setting PDFACompatibilityPolicy=1; that's probably not going to work, so you'll want to switch those two around. You should set the controls before you do any input or stuff might go awry; trying to change midstream isn't really a good idea.
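To illustrate the ordering point, here is a small Python sketch that moves every switch ahead of the input files while keeping the relative order within each group. The function name is hypothetical, and it assumes a simple command line: -o is the only switch taking a separate value, and there are no -c ... -f sequences.

```python
def switches_first(args):
    """Reorder Ghostscript arguments so all switches precede input files.

    Hypothetical helper; assumes '-o' is the only switch followed by a
    separate value and that no '-c ... -f' PostScript snippets are present.
    """
    switches, inputs = [], []
    it = iter(args)
    for arg in it:
        if arg == "-o":
            switches += [arg, next(it)]   # keep -o together with its file name
        elif arg.startswith("-"):
            switches.append(arg)
        else:
            inputs.append(arg)
    return switches + inputs


original = [
    "-dPDFA=2", "-dNOOUTERSAVE", "-sProcessColorModel=DeviceRGB",
    "-sDEVICE=pdfwrite", "-o", "output.pdf",
    "PDFA_def.ps", "-dPDFACompatibilityPolicy=1", "input.pdf",
]
print(" ".join(switches_first(original)))
```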
I used gs (v9.21) to convert a PDF with annotations set to non-printing (hyperref) to a PDF/A compliant file. Annotations will not be present in the output file but, in my case, that was not an issue.
The command I used is:
gs -dPDFA=2 -dBATCH -dNOPAUSE -dPDFACompatibilityPolicy=1 -dUseCIEColor -sProcessColorModel=DeviceGray -sDEVICE=pdfwrite -sOutputFile=output_file.pdf input_file.pdf
Notes:
-dPDFACompatibilityPolicy=1 instead of -sPDFACompatibilityPolicy=1. The latter does not force gs to elide the annotation while the former does.
I used -dUseCIEColor because pdfa validation (https://www.pdf-online.com/osa/validate.aspx) failed with an issue related to the color space. This parameter is deprecated but I did not find any other way around this issue. For more details, see Convert PS files to PDF/A via Ghostscript, color space problems
Like KenS said, it's hard to know anything without a PDF to look at but since you're having trouble with the Google home page converted to PDF, I suspect that it's the external links that are causing the problem. Links are annotations and in PDF/A, external links are not permitted. Any link in HTML when converted to PDF will be considered external.

Ghostscript loses emdash characters and replaces with hyphens

When I run a PDF which was originally created with LibreOffice on Linux, through ghostscript 9.19 on OSX, to produce another (flattened) PDF, the output is perfect except for one problem. All emdashes in the entire document have been replaced with a standard hyphen (awkwardly followed by half of a space.) Oddly enough, if I highlight the resulting "hyphen+space", my context menu shows that I've selected an emdash, so the underlying text is still an emdash, it is just rendering the wrong glyph.
I can reproduce this on multiple documents from the same source, and I'm assuming there's a setting or switch somewhere that can help resolve this.
I don't know whether the font used makes a difference, but for the sake of reference, the body text of my document is set in Arno Pro. When I use a modern version of LibreOffice on OS X to make a sample document also containing an emdash in Arno Pro, the same problem is not exhibited, so it seems to be specific to the software which originally made these PDF files.
These PDFs are of legacy projects that I am not set-up to re-produce at this time, so I need to prepare them for reprinting using the existing files.
How do I retain emdash glyphs when running a command such as the following?
gs -dSAFER -dBATCH -dNOPAUSE -dNOCACHE -sDEVICE=pdfwrite \
-sColorConversionStrategy=/LeaveColorUnchanged \
-dAutoFilterColorImages=true -dAutoFilterGrayImages=true \
-sOutputFile=output.pdf input.pdf
I can add an example of the input PDF to this question if needed.
Without seeing the PDF file it isn't possible to give you an answer. Most likely the font isn't embedded, or if it is embedded doesn't have an emdash glyph.
Copy and paste uses the ToUnicode CMap, so it isn't dependent on the font. It's simply a list of character codes and the Unicode code point associated with each, when using a given font.
Note that this doesn't mean 'the underlying text is still an emdash'. The ToUnicode information is utterly separate from the font end of things, it is effectively metadata and bears no real relation to the font or rendering.
Put the file on DropBox and post the URL and someone can look into it. I'll be on vacation for the next few days though, but maybe someone else will look.
Note that in PDF you don't necessarily specify characters and positions as a list of consecutive characters; you can specify the position of each individually, or you can specify widths which override the width in the font, etc. So there almost certainly is only one glyph; the 'white space' you refer to is probably just that, white space, it's not another glyph.
I should also point out (I do this a lot) that Ghostscript never 'flattens', concatenates, merges, or performs any similar operation on PDF files. When using Ghostscript and the pdfwrite device the original input (in whatever format) is fully interpreted into graphics marking operations, and sent to the device. The device executes the marking operations; in the case of a rendering device, it scan-converts and writes to a bitmap. In the case of pdfwrite, it creates PDF operators.
The result of this is that the output PDF file bears no relation to the input PDF, other than its visual appearance.
You also don't say which version of Ghostscript you are using....

CMYK Overprinting and Knockout in Ghostscript

I'm trying to get a grasp on the capabilities of the current version of Ghostscript (see also this question that I asked a few days ago). So, I downloaded a "test form" for the PDF/X-4 standard from www.pdfx-ready.ch, a standards organization in Switzerland, and tried to render it... (In case anyone wants to try this, here's the direct download link: http://www.pdfx-ready.ch/files/PDFX-ready-OutputTest_PDFX4-CMYK_V301d.zip. You can find more info on this page (in German): http://www.pdfx-ready.ch/index.php?show=496)
Anyway: I was pleasantly surprised to see that most of the test fields were rendered correctly on screen. Most of the other PDF viewers that I had tried had failed miserably. Then I noticed that there were a few test cases that produced errors:
CMYK Overprint Mode (on page 1) is not respected for fonts and
vectors (it works fine for images, masks and shadings).
Rendering of Knockout Transparency Groups (on page 2) is not performed correctly.
Rendering of a few more fields (on page 4) that had to do with overprinting (Spot to CMYK, CMYK over Spot, Image Overprint etc.) failed.
So, I started experimenting... First I noticed that I still had an old version of Ghostscript installed. So, I compiled the new version 9.16 and tried again. This time, the Knockout Transparency Groups (see above) were rendered correctly. Great!
Then I read here that "the handling of overprinting and spot colors depends upon the process color model of the output device". So, instead of -sDEVICE=x11 I now tried -sDEVICE=x11cmyk. And to my surprise, the errors regarding the CMYK Overprint Mode went away. Unfortunately, the errors on page 4 remained.
What's more, I now have two new problems: First of all, the pages are now rendered in wrong colors. In fact, the white background of the test pages now appears in cyan! Also, it seems, Ghostscript is now trying to simulate some kind of ugly halftoning on screen. I read here again that "The differences in appearance of files with overprinting and spot colors caused by the differences in the color model of the output device [...] are not due to a limitation in the implementation of Ghostscript or its output devices." So, I'm assuming that I'm missing something. But what is it?
Summarizing:
Is there a way (maybe another device, a command line parameter or something) to tell Ghostscript to handle overprinting correctly? Or hasn't this been implemented yet?
What causes the cyan tinting of the white background?
Is there a way to print this correctly to an inkjet, the way it appears on screen? (lpr doesn't seem to work well.)
Thanks in advance.
UPDATE
So, I experimented a lot and read a few discussions. Also, the documentation here, which I found pretty interesting, as it says:
"Ghostscript currently provides overprint simulation for spot
colorants when rendering to the separation devices psdcmyk and
tiffsep. These devices maintain all the spot color planes and merge
these together to provide a simulated preview of what would be
printed."
Alright, this is what @KenS (see below) mentioned in a comment. But then
"It is possible to get a simulated preview of overprinting with other
CMYK devices by specifying -dSimulateOverprint = true/false In this
case, simulated overprinting is achieved through a blending of the
CMYK colorants." [p.9]
Now, I read that as saying that I can use a CMYK device (like tiff32nc) to get a simulated preview of overprinting with spot colors. Am I correct? So, after some more reading here (just in case this has anything to do with CMYK, which I doubt), I finally tried the following:
gs -dBATCH -dNOPAUSE -dSAFER
-dSimulateOverprint=true
-sDefaultCMYKProfile=ISOcoated_v2_300_eci.icc
-sOutputICCProfile=ISOcoated_v2_300_eci.icc
-sDEVICE=tiff32nc
-sOutputFile=out.tif
in.pdf
I even experimented with the options -dOverrideICC, -dRenderIntent and -sProofProfile. Nothing seems to work. What am I misunderstanding here? Is there really no way to render a non-separated full-color preview of correctly overprinted spot colors?
UPDATE 2
So, I finally tried the tiffsep device (not really what I would like to achieve, but interesting as a test case) and checked the five files that are produced. And there are still errors! If you would like to check, run the command
gs -dBATCH -dNOPAUSE -dSAFER
-sDEVICE=tiffsep
-dFirstPage=4
-dLastPage=4
-sOutputFile=page4.tif
PDFX-ready_Output-Test_301d_X4.pdf
over the aforementioned PDF/X-4 document. Then examine, e.g., the third test field in the first row in the left column (page 4).
So, I really don't know what to make of this. Does that mean that Ghostscript can't handle overprinting with spot colors at all - contrary to what the documentation says? Is that a bug? Or do I have the command wrong? Am I missing anything?
First answer is stop trying to use the X11 device, it's an RGB device and not hugely well supported. In order to do X11CMYK the input must be rendered to CMYK then post filtered to RGB. It's not a good solution.
Overprinting is only defined for CMYK process colours (and spots), any other colour model will not perform overprinting. So I would suggest you render to TIFF or JPEG devices using their CMYK variants.
Spot colours are even more complex, if the device does not support the requested spot colour then it uses the tint transform to convert into the defined alternate colour space. If tint transformation takes place the spot is not overprinted.
Since the display devices cannot support spot colours, you can't preview spot overprinting using a display device. If you want to do this you should use the tiffsep device.
If you believe you have found a bug in Ghostscript, then please report it as such, but you will have to report it against a CMYK device, and I'll say now that we won't be very active with bugs in the X11 CMYK device, it's practically unused.
Printing to an inkjet device depends on the printing workflow, and I have no idea what you are using for that. If it's CUPS (and I'm guessing solely based on the fact that you are using an X11 device) then this 'should' just work. But it depends on the complete end-to-end print process, and I have no idea what it is you are doing.
Again note that spot colours will not be available on a CMYK printer, so overprinting spots is probably not going to work the way you expect.
I may be very late to the party but this works for me:
gs -dBATCH -dNOPAUSE -dSimulateOverprint=true \
-sDEVICE=jpegcmyk -sOUTPUTFILE=overprint.jpg overprint.pdf

Obey the MediaBox/CropBox in PDF when using Ghostscript to render a PDF to a PNG

I've been using Ghostscript to convert my single figure plots rendered in PDF to PNG:
gswin32c -sDEVICE=png16m -r300x300 -sOutputFile=junk.png ^
-dBATCH -dNOPAUSE Figure_001-a.pdf
This works in the sense I get a PNG out and it contains the plot.
But it contains a huge amount of white space as well (an example source image: http://cdsweb.cern.ch/record/1258681/files/Figure_001-a.pdf).
If you view it in Acrobat you'll note there is no white space around the plot. If you use the above command line you'll find the plot is only about 1/3 of the space.
When doing the same thing with an EPS file I run into the same problem. However, there is the command-line parameter -dEPSCrop that one can pass to get the PS rendering engine to pay attention to the BoundingBox.
I need the similar argument for rendering PDFs. I was not able to find it in docs (nor even the -dEPSCrop, actually).
I had exactly the same issue. I fixed it by adding -dUseArtBox switch.
Example:
/usr/bin/gs -dUseArtBox -dNOPAUSE -sDEVICE=pngalpha -sOutputFile=output.png input.pdf
Note: -dUseArtBox switch is supported since ghostscript version 9.07
-dUseArtBox
Sets the page size to the ArtBox rather than the MediaBox. The art box defines the extent of the page's meaningful content (including potential white space) as intended by the page's creator. The art box is likely to be the smallest box. It can be useful when one wants to crop the page as much as possible without losing the content.
There are various options to control which "media size" Ghostscript renders a given input to:
-dPDFFitPage
-dUseTrimBox
-dUseCropBox
With PDFFitPage Ghostscript will render to the current page device size (usually the default page size).
With UseTrimBox it will use the TrimBox (and it will at the same time set the PageSize to that value).
With UseCropBox it will use the CropBox (and it will at the same time set the PageSize to that value).
By default (given no parameter), Ghostscript will render using the MediaBox.
For your example, it looks like adding "-dUseCropBox" will do the job you're expecting.
Note, you can additionally control the overall size of your output by using "-sPAPERSIZE" (select amongst all pre-defined values Ghostscript knows) or (for more flexibility) use "-dDEVICEWIDTHPOINTS=NNN -dDEVICEHEIGHTPOINTS=NNN".
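The choice between the page boxes can be sketched as a small lookup in Python. The helper name `render_args`, the short box names used as keys and the default of 300 dpi are assumptions for illustration; the switches themselves are the ones mentioned in the answers here.

```python
# Maps a short box name (hypothetical keys) to the Ghostscript switch.
BOX_SWITCHES = {
    "media": None,            # default behaviour, no switch needed
    "crop": "-dUseCropBox",
    "trim": "-dUseTrimBox",
    "art": "-dUseArtBox",
}


def render_args(pdf_file, png_file, box="crop", dpi=300):
    """Build Ghostscript arguments to render a PDF to PNG using a given box."""
    if box not in BOX_SWITCHES:
        raise ValueError("unknown box: " + box)
    args = [
        "-dBATCH", "-dNOPAUSE",
        "-sDEVICE=png16m",
        f"-r{dpi}",
        "-sOutputFile=" + png_file,
    ]
    switch = BOX_SWITCHES[box]
    if switch is not None:
        args.append(switch)
    args.append(pdf_file)
    return args


print(" ".join(render_args("Figure_001-a.pdf", "junk.png")))
```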
Have you tried using pdfcrop using pdftex (comes with texlive for example) or (not tried yet) the python script pdfcrop?
I have a similar workflow using the first tool mentioned.