Ghostscript won't convert PDF to PDF/A. Annotation Issue

I'm trying to convert a PDF to PDF/A. At every pass I'm getting the error "GPL Ghostscript 9.19: Annotation set to non-printing, not permitted in PDF/A, reverting to normal PDF output".
The PDF was generated from HTML by wkhtmltopdf. Since the error is quite vague, I've done some research into PDF annotations. I've confirmed the PDF has no annotations; flattening annotations (though there aren't any) hasn't worked, and neither has the -dShowAnnots=false switch. I've also tried a variety of different PDFs and get the same error on all of them.
The command I'm using to do the conversion is "gs -dPDFA=2 -dNOOUTERSAVE -sProcessColorModel=DeviceRGB -sDEVICE=pdfwrite -o output.pdf /Users/work/Documents/Projects/pdf-generator-service-tests/PDFA_def.ps -dPDFACompatibilityPolicy=1 input.pdf"
I also created a basic PDF of Google's homepage using wkhtmltopdf https://google.com output.pdf and again, no joy (this is an example of the PDFs I've tried to convert, for anyone who wants to try to replicate the issue).

I thought the error was quite specific: PDF/A does not permit annotations to be set to non-printing. You haven't included an actual example of the kind of file causing the problem, so I can't comment on the presence of any annotations, but I assure you that it's not possible to get this message without having annotations.
Since you've already set PDFACompatibilityPolicy to 1 there's not much else I can say. You could open a bug report and attach the file there, or post a link to one here. Without that I can't say much.
Oh and you don't say which version of Ghostscript you are using, or where you sourced it from. Occasionally packagers break things so it might be worth trying to build from source.
One point: you execute the PDFA_def.ps file before setting -dPDFACompatibilityPolicy=1. That's probably not going to work, so switch those two around. Set the controls before any input is processed, otherwise things might go awry; trying to change them midstream isn't a good idea.
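As a rough sketch of that reordering, the function below assembles the asker's switches with every control ahead of PDFA_def.ps and the input PDF. The function name pdfa_cmd and the DRY_RUN variable are hypothetical helpers introduced only for illustration; the switches themselves are the ones from the question.

```bash
#!/usr/bin/env bash
# Sketch: put every switch ahead of PDFA_def.ps and the input PDF.
# pdfa_cmd and DRY_RUN are hypothetical helpers, not part of Ghostscript.
pdfa_cmd() {
  local pdfa_def="$1" input="$2" output="$3"
  # Order matters: controls first, then the PostScript prologue, then the PDF.
  local cmd=(
    gs
    -dPDFA=2
    -dNOOUTERSAVE
    -dPDFACompatibilityPolicy=1
    -sProcessColorModel=DeviceRGB
    -sDEVICE=pdfwrite
    -o "$output"
    "$pdfa_def"
    "$input"
  )
  if [ -n "${DRY_RUN:-}" ]; then
    printf '%s\n' "${cmd[*]}"   # print the command instead of running it
  else
    "${cmd[@]}"                 # needs Ghostscript on PATH
  fi
}
```

Calling DRY_RUN=1 pdfa_cmd PDFA_def.ps input.pdf output.pdf prints the command without running it, which makes it easy to check the ordering first.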

I used gs (v9.21) to convert a PDF with annotations set to non-printing (hyperref) to a PDF/A compliant file. Annotations will not be present in the output file but, in my case, that was not an issue.
The command I used is:
gs -dPDFA=2 -dBATCH -dNOPAUSE -dPDFACompatibilityPolicy=1 -dUseCIEColor -sProcessColorModel=DeviceGray -sDEVICE=pdfwrite -sOutputFile=output_file.pdf input_file.pdf
Notes:
Use -dPDFACompatibilityPolicy=1 rather than -sPDFACompatibilityPolicy=1 (-s defines the value as a string, -d as a number). The latter does not force gs to elide the annotation, while the former does.
I used -dUseCIEColor because PDF/A validation (https://www.pdf-online.com/osa/validate.aspx) failed with an issue related to the color space. This parameter is deprecated, but I did not find any other way around the issue. For more details, see "Convert PS files to PDF/A via Ghostscript, color space problems".

Like KenS said, it's hard to know anything without a PDF to look at but since you're having trouble with the Google home page converted to PDF, I suspect that it's the external links that are causing the problem. Links are annotations and in PDF/A, external links are not permitted. Any link in HTML when converted to PDF will be considered external.

Related

Ghostscript: preserve annotations (keep print/non-print distinction)?

I'm currently using GS to convert PDFs into PDFs (mostly to convert T1 fonts to CFF).
-dPDFSETTINGS=/prepress -dAutoRotatePages=/None -dEmbedAllFonts=true -dSubsetFonts=true -dPrinted=false
These are the flags I use right now. Because I set -dPrinted=false, it preserves the links in my files, even when they are set not to print.
But if I have links and annotations which normally would only show in one medium (on screen), dPrinted seems to force me to choose one way or the other for my converted file. Is there any way to input a PDF which has annotations on screen but not on paper, and output a PDF which has the same distinction?
Thanks.
Currently, no, there is no way to keep both printing and non-printing annotations, you can have one or the other.
By altering Printed you can keep either printing annotations (Printed=true) or non-printed (Printed=false) annotations.
The distinction isn't down to the creation of the output PDF file, but the PDF interpreter, which can only behave, at the moment, as either a screen device or a printer. So it only processes one set of annotations.
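A minimal sketch of the either/or choice described above: one helper that picks the Printed value from the target medium. The names annot_cmd and DRY_RUN are hypothetical and exist only for this illustration; only -dPrinted and the pdfwrite device come from the thread.

```bash
#!/usr/bin/env bash
# Sketch: choose which set of annotations survives the conversion.
# annot_cmd and DRY_RUN are hypothetical helpers, not part of Ghostscript.
annot_cmd() {
  local medium="$1" input="$2" output="$3" printed
  case "$medium" in
    screen) printed=false ;;  # interpreter behaves like a screen device
    print)  printed=true ;;   # interpreter behaves like a printer
    *) echo "medium must be 'screen' or 'print'" >&2; return 2 ;;
  esac
  local cmd=(gs -sDEVICE=pdfwrite "-dPrinted=$printed" -o "$output" "$input")
  if [ -n "${DRY_RUN:-}" ]; then
    printf '%s\n' "${cmd[*]}"   # print the command instead of running it
  else
    "${cmd[@]}"                 # needs Ghostscript on PATH
  fi
}
```

There is deliberately no "both" option, since the interpreter only processes one set of annotations per run.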

PDF to PDF/A-2b without dUseCIEColor

My goal is to take arbitrary PDF from the users, and save it as PDF/A-2b.
The current approach is to use Ghostscript 9.21 (via ghost4j) to create the converted file. This works but not without some problems. I got it to work with two different sets of parameters to Ghostscript.
Firstly
Using the option -dUseCIEColor as shown below will work, and produces valid PDF/A-2b with a couple of different test files. This will however print pages of errors into the log saying it is not recommended to use.
These are the full arguments:
-dBATCH
-dNOPAUSE
-dPrinted=true
-sDEVICE=pdfwrite
-dPDFACompatibilityPolicy=1
-sColorConversionStrategy=/UseDeviceIndependentColor
-sProcessColorModel=DeviceCMYK
-sOutputICCProfile=/tmp/icc.icc
-sOutputFile=/tmp/result.pdf
-dPDFA=2
-dUseCIEColor
/tmp/PDFA_def.ps
/tmp/test.pdf
The PDFA_def.ps is the default one supplied with 9.21, pointing to the same ICC profile, with this line added at the bottom:
<</NeverEmbed []>> setdistillerparams
The ICC Profile is a random (CMYK) profile published by Adobe.
This works, apart from the errors in the log.
Secondly
Then I tried to do as the log errors said and removed -dUseCIEColor.
Now some of the test files work and some won't. I suspect this has to do with the color profile of the original PDF, or something like that.
3-heights gives the validation error: A device-specific color space (DeviceRGB) without an appropriate output intent is used.
This can be corrected by switching to -sProcessColorModel=DeviceRGB and changing the ICC profile to a random RGB profile.
Then for another document, you get the error: A device-specific color space (DeviceCMYK) without an appropriate output intent is used.
Is there something I'm missing with this? It would seem I need to switch the options based on the original PDF file which would be far from the preferred style. If it helps, I would be fine with black and white PDF/A-2b too. Thanks!
It's impossible to say what the problems are without seeing the files. UseCIEColor is an awful PostScript hack to try to do colour management; it isn't reliable (in terms of colour) and will effectively defeat any real colour management. Clearly you aren't performing colour management since you are using a random profile, but all the same....
Since you don't really care about colour management, I would suggest that instead of UseDeviceIndependentColor, you select CMYK (as that's the ProcessColorModel you are using). Note that if you select ColorConversionStrategy=/CMYK then you don't need to set ProcessColorModel, that's assumed from the conversion.
Other than that, I'd have to suggest that you open a bug report. If people don't report problems then they won't get fixed....
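A sketch of the suggestion above, selecting a fixed target colour space through ColorConversionStrategy and leaving ProcessColorModel unset. The names pdfa2_color_cmd and DRY_RUN are hypothetical, and the sketch assumes the ICC profile referenced in PDFA_def.ps matches the chosen colour space (a CMYK profile for CMYK, an RGB profile for RGB, and so on).

```bash
#!/usr/bin/env bash
# Sketch: convert everything to one colour space instead of using -dUseCIEColor.
# pdfa2_color_cmd and DRY_RUN are hypothetical helpers, not part of Ghostscript.
# Assumption: the ICC profile named in PDFA_def.ps matches the chosen strategy.
pdfa2_color_cmd() {
  local strategy="$1" pdfa_def="$2" input="$3" output="$4"
  case "$strategy" in
    RGB|CMYK|Gray) ;;
    *) echo "strategy must be RGB, CMYK or Gray" >&2; return 2 ;;
  esac
  # No -sProcessColorModel: per the answer, it is implied by the strategy.
  local cmd=(
    gs -dPDFA=2 -dBATCH -dNOPAUSE
    -dPDFACompatibilityPolicy=1
    -sDEVICE=pdfwrite
    "-sColorConversionStrategy=$strategy"
    "-sOutputFile=$output"
    "$pdfa_def"
    "$input"
  )
  if [ -n "${DRY_RUN:-}" ]; then
    printf '%s\n' "${cmd[*]}"   # print the command instead of running it
  else
    "${cmd[@]}"                 # needs Ghostscript on PATH
  fi
}
```

Because every page is converted to the same space, the options no longer need to change per input file; whether the result validates still depends on the profile and should be checked with a validator.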
The correct PDF/A-compatible replacement for UseCIEColor seems to be the combination of these two options:
-sProcessColorModel=DeviceCMYK
-sColorConversionStrategy=UseDeviceIndependentColor
Both DeviceCMYK and DeviceRGB work for me.

Scale to fit and resize page

I am using (PDF)LaTeX to make a document, and I also need to embed already existing PDF documents in it. The problem is that I have PDF documents in several different page sizes (letter, a4, etc) and I want to compile all of them into a single b5 PDF document.
If I use the pdfpages package from CTAN, all hyperlinks from the original PDFs are removed. So I tried to do it with GhostScript.
This sounds like something normal to do but I have failed to find a working solution.
In the meantime I have read a few questions and answers, but failed to figure out what I am doing wrong and what I am missing.
This doesn't seem to address my problem of scaling.
Neither does that.
This seems to go in the right direction but I couldn't make use of the information :-(.
To make the problem easier, let's just try to resize a single PDF so that:
its contents are scaled to fit the page
the new page has the size I want
Sounds easy, and it is easy to do, for example with pdfjam:
pdfjam --outfile b5-foo.pdf --paper b5paper foo.pdf
Now the problem with this is that pdfjam throws away hyperlinks. From its website:
A potential drawback of pdfjam and other scripts based upon it is that any hyperlinks in the source PDF are lost.
This must be because it seems to use pdfpages mentioned above.
Unlike pdfjam, GhostScript keeps hyperlinks. However, it either:
crops the original when I downscale; or
does not put the scaled content on a page of the size I need -- instead, I get a page that seems to be scaled down, while keeping the original aspect ratio.
This is what I have installed:
$ gs --version
9.21
(Installed on Linux)
This is how I can use GhostScript to crop the content:
gs -dBATCH -dNOPAUSE \
-sDEVICE=pdfwrite -dFIXEDMEDIA -sPAPERSIZE=isob5 \
-o b5-foo.pdf foo.pdf
... and here is how I can use -dPDFFitPage to scale the content but also keep the aspect ratio of the original page size:
gs -dBATCH -dNOPAUSE \
-sDEVICE=pdfwrite -dFIXEDMEDIA -sPAPERSIZE=isob5 -dPDFFitPage \
-o b5-foo.pdf foo.pdf
To be even clearer: I seem to get a page that is scaled so that it would fit inside the b5 I am asking for, but it is not b5: it still has the H/W ratio the original (letter) had!
I'd be happy if this can be done just using switches but if I need to use PostScript that's perfectly fine.
The solution seems to be to use -dPSFitPage instead of -dPDFFitPage. This might have something to do with the PDF files that I am trying to resize. Unfortunately, I cannot share those :-(. When I tried to reproduce this with files that I generated myself, the problem did not occur. I don't know why this is or how I should have known it.
To summarize, using PDF files for both input and output:
-dFitPage and -dPDFFitPage give me scaled pages with the original aspect ratio
-dPSFitPage gives me scaled content on the page size I request with -sPAPERSIZE="$PAPERSIZE"
This seems to go against what the documentation says.
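For reference, a sketch of the variant that worked here: the same command as in the question with -dPSFitPage swapped in. The names fit_cmd and DRY_RUN are hypothetical helpers; as noted above, the behaviour may depend on the input files.

```bash
#!/usr/bin/env bash
# Sketch: scale content onto a fixed page size using -dPSFitPage.
# fit_cmd and DRY_RUN are hypothetical helpers, not part of Ghostscript.
fit_cmd() {
  local paper="$1" input="$2" output="$3"
  local cmd=(
    gs
    -sDEVICE=pdfwrite
    -dFIXEDMEDIA              # keep the requested media size
    "-sPAPERSIZE=$paper"      # e.g. isob5
    -dPSFitPage               # the switch that scaled onto the requested page here
    -o "$output"
    "$input"
  )
  if [ -n "${DRY_RUN:-}" ]; then
    printf '%s\n' "${cmd[*]}"   # print the command instead of running it
  else
    "${cmd[@]}"                 # needs Ghostscript on PATH
  fi
}
```

For example, fit_cmd isob5 foo.pdf b5-foo.pdf corresponds to the b5 case in the question.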

Ghostscript textwriter preserve blank lines

I'm trying to convert pdfs to text files.
I use this command to perform the conversion:
gs -dBATCH -dNOPAUSE -sDEVICE=txtwrite -sOutputFile=output.txt input.pdf
Ghostscript version is 9.07.
I get all the text shown in PDF. I'd like to preserve the blank lines in the text file if possible.
Thanks
You should upgrade, the current version of Ghostscript is 9.18 and 9.19 will be released very shortly. Each of the interim versions includes fixes to the txtwrite device.
Although it is true that PDF files do not include blank lines, the txtwrite device does have a mode whereby it will attempt to produce a reasonable representation of the original layout by using spaces and blank lines in a text file.
This is the default action in the current version of txtwrite, so you ought to be getting this already, unless you have selected a different TextFormat.
This mode is highly heuristic, easily fooled, doesn't cope well with superscripts, subscripts, significant point size changes and possibly other attributes which make the layout difficult to reproduce. Obviously without seeing your input file, there's nothing more I can tell you.
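A sketch of requesting the layout-approximating mode explicitly rather than relying on the default. It assumes that -dTextFormat=3 selects that mode with UTF-8 output, which is how the documentation for recent releases describes it; check the documentation for your version. The names txt_cmd and DRY_RUN are hypothetical helpers.

```bash
#!/usr/bin/env bash
# Sketch: run txtwrite with an explicit TextFormat.
# txt_cmd and DRY_RUN are hypothetical helpers, not part of Ghostscript.
# Assumption: TextFormat=3 is the layout-approximating UTF-8 mode.
txt_cmd() {
  local input="$1" output="$2" format="${3:-3}"
  local cmd=(
    gs -dBATCH -dNOPAUSE
    -sDEVICE=txtwrite
    "-dTextFormat=$format"
    "-sOutputFile=$output"
    "$input"
  )
  if [ -n "${DRY_RUN:-}" ]; then
    printf '%s\n' "${cmd[*]}"   # print the command instead of running it
  else
    "${cmd[@]}"                 # needs Ghostscript on PATH
  fi
}
```

If the blank lines are still missing with an up-to-date Ghostscript, the heuristics described above are the likely cause rather than the switch.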

Ghostscript PDF printing garbled

I'm trying to use Ghostscript 9.02 on Windows 7 to print a PDF to an Epson Workforce printer from the command line using the following command:
gswin32c -dPrinted -dBATCH -dNOPAUSE -dNOSAFER -q -dNumCopies=1 -sDEVICE=epson -sOutputFile=\\spool\EPSON C:\Document1.pdf
When executing this command, pages will print from my printer, but it is just garbled text instead of the PDF.
I have tried 3 different PDF files with similar results.
I doubt that the previous answer identifies the issue; more likely it is a problem with getting the Epson-format data passed through correctly as binary. In particular, if the 'init_string' == "\f\033#" doesn't make it through, the printer will interpret the rest of the data as text instead of raster data.
Since you are on Windows, you may get better results by using the -sDEVICE=mswinpr2 device which sends the raster image for the page through GDI to the manufacturer's driver. See http://artifex.com/gs-current-release/Devices.htm#Win for documentation on printing from windows using Ghostscript.
BTW, you can easily check whether the problem is with gswin32c being able to properly render the input PDF by looking at it on the default 'display' device using:
gswin32c C:\Document1.pdf
Your problem is probably related to the encoding used by the PDF file.
How was this PDF produced?
I have seen this problem several times with PDFs produced by OpenOffice's internal PDF exporter.
I have had a similar issue, and it looks like not all listed devices are capable of printing PDF files. I used the ljet4 device for a Ricoh network printer and it prints fine. The only problem is that it always prints immediately instead of going to the "HoldPrint" queue.