Ghostscript removing trim and bleed boxes and output profiles - pdf

We have ready created single page pdfs with trim and bleed boxes and greyscaled using an ICC profile. We are then using Ghostscript to combine into a multi-page pdf however after it has combined them the trim and bleed boxes disappear and the greyscale reverts to color. We can use the Ghostscript greyscale command but this doesn't help with the trim/bleed boxes which we need for imposition.
This is what we are using:
$command = 'gs -sDEVICE=pdfwrite -dNOPAUSE -dBATCH -dSAFER -sOutputFile="' . $outputPath . '" ' . implode(' ', $pdfFiles);
Be glad of any help or suggestions, we do a high volume so are currently using PDFTK to combine which keeps the boxes but doesn't fix the greyscale issue either.

You have not stated which operating system you are using, nor which version of Ghostscript, and you haven't supplied an example file.
The pdfwrite device goes to considerable effort not to alter the colour space or values of the input. If the input is in DeviceGray, then the output will be in DeviceGray, unless you specifically request a different space using the ColorConversionStrategy switch. What exactly do you mean by "the greyscale reverts to colour" ? The PDF displays differently ? Some other tool reports the file is 'colour' ?
There's really nothing anyone can suggest without a lot more information, in particular an example input file and ideally the file after you've run it through Ghostscript using the pdfwrite device.
Please note that Ghostscript's pdfwrite device does not 'combine' PDF files. The actual process is complex, and though the end result may appear to be the original files 'combined' that's not what is going on behind the scenes. The actual process is documented here.

Related

PDF to PDF/A-2b without dUseCIEColor

My goal is to take arbitrary PDF from the users, and save it as PDF/A-2b.
The current approach is to use Ghostscript 9.21 (via ghost4j) to create the converted file. This works but not without some problems. I got it to work with two different sets of parameters to Ghostscript.
Firstly
Using the option -dUseCIEColor as shown below will work, and produces valid PDF/A-2b with a couple of different test files. This will however print pages of errors into the log saying it is not recommended to use.
These are the full arguments:
-dBATCH
-dNOPAUSE
-dPrinted=true
-sDEVICE=pdfwrite
-dPDFACompatibilityPolicy=1
-sColorConversionStrategy=/UseDeviceIndependentColor
-sProcessColorModel=DeviceCMYK
-sOutputICCProfile=/tmp/icc.icc
-sOutputFile=/tmp/result.pdf
-dPDFA=2
-dUseCIEColor
/tmp/PDFA_def.ps
/tmp/test.pdf
And the PDFA_def.ps is the default supplier 9.21, pointing to the same ICC Profile and this line in the bottom:
<</NeverEmbed []>> setdistillerparams
The ICC Profile is a random (CMYK) profile published by Adobe.
This works, apart from the errors in the log.
Secondly
Then I will try to do as the log errors told, and remove the -dUseCIEColor.
Now some of the test files work, some wont. I suspect this has to do with the color profile of the original PDF, or something like that.
3-heights gives the validation error: A device-specific color space (DeviceRGB) without an appropriate output intent is used.
This can be corrected by switching the -sProcessColorModel=DeviceRGB and switch the ICC Profile to an random RGB profile.
Then for another document, you get the error: A device-specific color space (DeviceCMYK) without an appropriate output intent is used.
Is there something I'm missing with this? It would seem I need to switch the options based on the original PDF file which would be far from the preferred style. If it helps, I would be fine with black and white PDF/A-2b too. Thanks!
Its impossible to say what the problems are without seeing the files. UseCIEColor is an awful PostScript hack to try and do colour management, it isn't reliable (in terms of colour) and will effectively defeat any real colour management. Clearly you aren't performing colour management since you are using a random profile, but all the same....
Since you don't really care about colour management, I would suggest that instead of UseDeviceIndependentColor, you select CMYK (as that's the ProcessColorModel you are using). Note that if you select ColorConversionStrategy=/CMYK then you don't need to set ProcessColorModel, that's assumed from the conversion.
Other than that, I'd have to suggest that you open a bug report. If people don't report problems then they won't get fixed....
The correct PDF/A-compatible replacement for UseCIEColor seems to be in combination of these 2 options:
-sProcessColorModel=DeviceCMYK
-sColorConversionStrategy=UseDeviceIndependentColor
Both DeviceCMYK and DeviceRGB work for me.

Ghostscript: How to auto-crop STDIN to "bounding box" and write to PDF?

Here have been already quite a few questions and answers about cropping documents with Ghostscript.
However, the answers are not matching my exact needs and are still confusing to me.
I expected that there would be a single option e.g. "-AutoCropToBBox" or something like this.
For clarification, as a bounding box, I understand the smallest rectangular box which contains all (non-white(?)) printed objects completely.
Furthermore, I want/have to use a printer port redirection (RedMon) to generate a cropped PDF via printing to a Postscript-printer from basically any application.
So, under Win7/64bit, I set the redirected port properties:
Redirected port properties Win7/64bit
The output is redirected to
C:\Windows\system32\cmd.exe
The arguments for the program are:
/c gswin64c.exe -sDEVICE=pdfwrite -o -sOutputFile="%1".pdf -
"%1" contains the user input for filename. With this, I get a full-page PDF. Fine!
But how to add the cropping options?
Additional question:
If I have a multipage document will such an (auto-)cropping be individual for each page? Or would there be an option to keep it all the same e.g. like the first page or like the largest bounding box of all pages?
Another related issue:
the window for prompting for the filename is always popping up behind the application I am printing from. Any ideas to always bring it to the front?
Another question:
There is the Perl-script "ps2eps" and program bbox.exe (see http://ctan.org/pkg/ps2eps). It's said there that Ghostscript (or ps2epsi) is occationally(?) calculating wrong bounding boxes. Is this (still) true?
Thanks for your help.
Well your first problem is that PostScript programs are normally written to expect to be rendered to a specific media size, and are usually not tightly bounded to it. White space is important for readability.
So ordinarily the PostScript program you generate will request a specific media size, and the interpreter will do its best to match that. If it can't match it then it will use a strategy to try and get as close as possible, and scale the entire content to fit that media.
You can't expect the printer to perform any of those things if it doesn't know the required size until its finished, and you can't be certain of the bounding box until you have rendered all the marking content. It is true that some files generally EPS files have a %%BoundingBox comment but.. that's a comment, it has no effect in PostScript, its there for the benefit of applications which don't want to interpret the PostScript.
So that's why the simple switch you want isn't there, it would break the interpreter's normal functioning, for rendering.
So, the first thing you need to do is determine the bounding box of the content. You can do that, as Stefan says, by using the bbox device. And on that note, as far as I know the bbox device produces accurate output. If it does not then we would appreciate a bug report proving it so we can fix it. If people don't report bugs how are we supposed to know about them ? Its disappointing to see someone spreading FUD instead of helping out with a bug report.......
ps2epsi isn't Ghostscript, its a crappy cheap and cheerful script, I wouldn't use it. However..... If the original PostScript leaves stuff on the stack then it will end up as a corrupted (or invalid) EPS file and the original PostScript should be fixed before trying to use it as it will break any PostScript program that tries to use it (eg if you include the EPS in a docuemnt and then print it).
So if you are using Ghostscript, and you want to take a PostScript program and get an EPS out of it, use the eps2write device. It won't have a preview bu frankly who cares.
Now if I remember correctly the bbox device (and eps2write) record all marking operations, you can't simply record all the non-white marking operations; what if the white overwrites an existing mark on the page ? What if the media is not white ? Note that if you render to a PNG with Ghostscript, the untouched portion of the output is transparent, whereas white marks are not.
So the bbox is the extent of all the marking operations, regardless of the colour. The only other way to proceed would be to render the content and count the non-white pixels. But that only works at a specific resolution, change the resolution and the precise bounding box may change as well.
Once you have the Bounding Box you can tell Ghostscript to use media that size. Note that you will almost certainly also have to translate the origin, as its unlikely that the content will start tightly at the bottom left corner. You will need -dDEVICEWIDTHPOINTS and -dDEVICEHEIGHTPOINTS to set the media size, and you will need to use -c and -f to send PostScript to alter the origin appropriately. In simple cases an '-x -y translate' will suffice but if the program executes initgraphics you will instead have to set a BeginPage procedure to alter the initial CTM.
If you set the media size with -dDEVICEWIDTHPOINTS etc then all pages will be the same size. If you don't want that then you need to write a BeginPage procedure to resize each page individually (you will also need to hook setpagedevice and remove the /PageSize entries from the dictionary.
I've no idea why Windows is putting the dialog box behind the active Window, it seems to have started doing that with Windows 7 (or possibly Vista). I don't see any way to alter that because I'm not sure what is generating the dialog.....
Personally I would suggest that you try the 2-step approach of running the original through Ghostscript's eps2write device and then take the EPS and create a PDF file using the pdfwrite device and the -dEPSCrop switch. Double converting is bad, but other solutions are worse. Note that EPS files cannot be multi-page, so you will have to create 'n' EPS files from an n-page PostScript program, and then supply a command line listing each EPS file as input to the pdfwrite device.
Take an example file and try this out from the command line before you try scripting it.
As I understood from #KenS explanations:
the way eps2write works, it may not or will not or actually cannot result in the minimum possible bounding box
it needs to be a 2-step process via -sDEVICE=bbox
So, I now ended up with the following process to "print" a PDF with a correct minimum possible bounding box:
Redirected Printer Port to cmd.exe
C:\Windows\system32\cmd.exe
Arguments for the program:
/c gswin64c.exe -q -o "%1".ps -sDEVICE=ps2write - && gswin64c.exe -q -dBATCH -dNOPAUSE -sDEVICE=bbox -dLastPage=1 "%1".ps 2>&1 >nul | perl.exe C:\myFiles\CropPS2PDF.pl "%1"
Unfortunately, it requires a little Perl script (let's call it: CropPS2PDF.pl):
#!usr/bin/perl -w
use strict;
my $FileName = $ARGV[0];
$/ = undef;
my $Crop = <STDIN>;
$Crop =~ /%%BoundingBox: (\d+) (\d+) (\d+) (\d+)/s; # get the bbox coordinates
my ($llx, $lly, $urx, $ury) = ($1, $2, $3, $4);
print "\n$FileName: $llx, $lly, $urx, $ury \n"; # print just to check
my $Command = qq{gswin64c.exe -q -o $FileName.pdf -sDEVICE=pdfwrite -c "[/CropBox [$llx $lly $urx $ury]" -c " /PAGE pdfmark" -f $FileName.ps};
print $Command; # print just to check
system($Command); # execute command
It seems to work... :-)
Improvements are welcome.
My questions are still:
Can this be done somehow without Perl? Just Win7, cmd.exe and Ghostscript?
Is there maybe a way without writing the PS-File to disk which I do not need? Of course, I could also delete it afterwards with the Perl-script.

Ghostscript loses emdash characters and replaces with hyphens

When I run a PDF which was originally created with LibreOffice on Linux, through ghostscript 9.19 on OSX, to produce another (flattened) PDF, the output is perfect except for one problem. All emdashes in the entire document have been replaced with a standard hyphen (awkwardly followed by half of a space.) Oddly enough, if I highlight the resulting "hyphen+space", my context menu shows that I've selected an emdash, so the underlying text is still an emdash, it is just rendering the wrong glyph.
I can reproduce this on multiple documents from the same source, and I'm assuming there's a setting or switch somewhere that can help resolve this.
I don't know whether the font used makes a difference, but for the sake of reference, the body text of my document is set in Arno Pro. When I use a modern version of LibreOffice on OS X to make a sample document also containing an emdash in Arno Pro, the same problem is not exhibited, so it seems to be specific to the software which originally made these PDF files.
These PDFs are of legacy projects that I am not set-up to re-produce at this time, so I need to prepare them for reprinting using the existing files.
How do I retain emdash glyphs when running a command such as the following?
gs -dSAFER -dBATCH -dNOPAUSE -dNOCACHE -sDEVICE=pdfwrite \
-sColorConversionStrategy=/LeaveColorUnchanged \
-dAutoFilterColorImages=true -dAutoFilterGrayImages=true \
-sOutputFile=output.pdf input.pdf
I can add an example of the input PDF to this question if needed.
Without seeing the PDF file it isn't possible to give you an answer. Most likely the font isn't embedded, or if it is embedded doesn't have an emdash glyph.
Copy and paste uses the ToUnicode CMap, so it isn't dependent on the font. Its simply a list of character codes and the Unicode code point associated with each, when using a given font.
Note that this doesn't mean 'the underlying text is still an emdash'. The ToUnicode information is utterly separate from the font end of things, it is effectively metadata and bears no real relation to the font or rendering.
Put the file on DropBox and post the URL and someone can look into it. I'll be on vacation for the next few days though, but maybe someone else will look.
Note that in PDF you don't necessarily specify characters and positions as a list of consecutive characters; you can specify the position of each individually, or you can specify widths which override the width in the font, etc. So there almost certainly is only one glyph, the 'white space' you refer to is probably just that, white space, its not another glyph.
I should also point out (I do this a lot) that Ghostscript never 'flattens', concatenates, merges, or anything similar operation on PDF files. WHen using Ghostscript and the pdfwrite device the original input (in whatever format) is fully interpreted into graphics marking operations, and sent tot eh device. The device executes the marking operations; in the case of a rendering device, it scan-converts and writes to a bitmap. In the case of pdfwrite, it creates PDF operators.
The result of this is that the output PDF file bears no relation to the input PDF, other than its visual appearance.
You also don't say which version of Ghostscript you are using....

Ghostscript textwriter preserve blank lines

I'm trying to convert pdfs to text files.
I use this command to perform the conversion:
gs -dBATCH -dNOPAUSE -sDEVICE=txtwrite -sOutputFile=output.txt input.pdf
Ghostscript version is 9.07.
I get all the text shown in PDF. I'd like to preserve the blank lines in the text file if possible.
Thanks
You should upgrade, the current version of Ghostscript is 9.18 and 9.19 will be released very shortly. Each of the interim versions includes fixes to the txtwrite device.
Although it is true that PDF files do not include blank lines, the txtwrite device does have a mode whereby it will attempt to produce a reasonable representation of the original layout by using spaces and blank lines in a text file.
This is the default action in the current version of txtwrite, so you ought to be getting this already, unless you have selected a different TextFormat.
This mode is highly heuristic, easily fooled, doesn't cope well with superscripts, subscripts, significant point size changes and possibly other attributes which make the layout difficult to reproduce. Obviously without seeing your input file, there's nothing more I can tell you.

Obey the MediaBox/CropBox in PDF when using Ghostscript to render a PDF to a PNG

I've been using Ghostscript to convert my single figure plots rendered in PDF to PNG:
gswin32c -sDEVICE=png16m -r300x300 -sOutputFile=junk.png ^
-dBATCH -dNOPAUSE Figure_001-a.pdf
This works in the sense I get a PNG out and it contains the plot.
But it contains a huge amount of white space as well (an example source image: http://cdsweb.cern.ch/record/1258681/files/Figure_001-a.pdf).
If you view it in Acrobat you'll note there is no white space around the plot. If you use the above command line you'll find the plot is only about 1/3 of the space.
When doing the same thing with an EPS file I run into the same problem. However, there is the command-line parameter -dEPSCrop that one can pass to get the PS rendering engine to pay attention to the BoundingBox.
I need the similar argument for rendering PDFs. I was not able to find it in docs (nor even the -dEPSCrop, actually).
I had exactly the same issue. I fixed it by adding -dUseArtBox switch.
Example:
/usr/bin/gs -dUseArtBox -dNOPAUSE -sDEVICE=pngalpha -sOutputFile=output.png input.pdf
Note: -dUseArtBox switch is supported since ghostscript version 9.07
-dUseArtBox
Sets the page size to the ArtBox rather than the MediaBox. The art box defines the extent of the page's meaningful content (including potential white space) as intended by the page's creator. The art box is likely to be the smallest box. It can be useful when one wants to crop the page as much as possible without losing the content.
There are various options to control which "media size" Ghostscript renders a given input:
-dPDFFitPage
-dUseTrimBox
-dUseCropBox
With PDFFitPage Ghostscript will render to the current page device size (usually the default page size).
With UseTrimBox it will use the TrimBox (and it will at the same time set the PageSize to that value).
With UseCropBox it will use the CropBox (and it will at the same time set the PageSize to that value).
By default (give no parameter), Ghostscript will render using the MediaBox.
For your example, it looks like adding "-dUseCropBox" will do the job you're expecting.
Note, you can additionally control the overall size of your output by using "-sPAPERSIZE" (select amongst all pre-defined values Ghostscript knows) or (for more flexibility) use "-dDEVICEWIDTHPOINTS=NNN -dDEVICEHEIGHTPOINTS=NNN".
Have you tried using pdfcrop using pdftex (comes with texlive for example) or (not tried yet) the python script pdfcrop?
I have a similar workflow using the first tool mentioned.