Ghostscript won't generate PDF/A with UTF16BE text string detected in DOCINFO - in spite of PDFACompatibilityPolicy saying otherwise

I am trying to convert normal PDF files to PDF/A with this command line:
gs -dPDFA -dBATCH -dNOPAUSE -sProcessColorModel=DeviceCMYK -sDEVICE=pdfwrite -sPDFACompatibilityPolicy=1 -sOutputFile=output.pdf input.pdf
However, I get the message
GPL Ghostscript 9.26: UTF16BE text string detected in DOCINFO cannot be represented in XMP for PDF/A1, reverting to normal PDF output
and gs reverts to normal PDF output.
Apparently, the message stems from this code fragment in the gs source, but there we read that the message can occur only when pdev->PDFACompatibilityPolicy == 0. My understanding was that the parameter -sPDFACompatibilityPolicy=1 in the command line has the purpose of preventing this.
Q: Why does gs behave as if the desired policy were 0 instead of 1? Is there another way to set the policy to 1?
Also, just because it makes me curious:
Q: Is there a way to see what kind of strange DOCINFO is causing the original problem, or to prevent it in the first place? Using Acrobat Reader, I cannot see anything "suspicious" in the file. If it helps: the input.pdf is generated on Windows from Word (and I even tried the UseISO19005-1 setting, which should produce PDF/A to begin with, but the problem occurs anyway).

You have put -sPDFACompatibilityPolicy=1. That, I'm afraid, is incorrect. Ghostscript has two kinds of switches: -s, which deals with string values, and -d, which deals with numeric and name values (names in PostScript begin with '/').
You've assigned a string value of '1' to the parameter PDFACompatibilityPolicy, which (internally) expects a numeric value. For reasons to do with the fact that these values are required to be accessible from the PostScript environment, we can't flag the type confusion as an error. Instead we leave the actual control at its default value of 0.
If you instead set -dPDFACompatibilityPolicy=1 I expect you will see the behaviour you expect.
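With the numeric switch, the command line from the question becomes:
gs -dPDFA -dBATCH -dNOPAUSE -sProcessColorModel=DeviceCMYK -sDEVICE=pdfwrite -dPDFACompatibilityPolicy=1 -sOutputFile=output.pdf input.pdf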
As for seeing the data, without looking at the PDF file I cannot tell. However, if you stop in the debugger at that point and look at p->data you will be able to see what the data is. If you look at pairs + i instead of pairs + i + 1 you will be able to see the key which is associated with the value from the DOCINFO pdfmark.
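A minimal debugger session might look like this (a sketch, assuming a Ghostscript build with debug symbols; the exact source line varies by version, around 1951 in 9.50 according to a later answer):
gdb --args gs -dPDFA -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -sOutputFile=output.pdf input.pdf
(gdb) break gdevpdfm.c:1951
(gdb) run
(gdb) x/32xb p->data
(gdb) print *(pairs + i)
The x/32xb dump of the value should start with the bytes 0xfe 0xff (the UTF16BE byte order mark), and pairs + i shows the DOCINFO key the value belongs to.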
You won't be able to see anything 'suspicious' by looking at the file in Acrobat, because Acrobat will translate the UTF16BE into whatever your system requires in order to display the text correctly. It may even be that the text is plain ASCII; ASCII can still be represented as UTF16.
If you open the file in a text editor you may be able to see the relevant string (note that the BOM in the Ghostscript source is written in octal, \376\377, which is 0xFE 0xFF in hexadecimal), provided it's not in a compressed object stream.
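If the string is uncompressed, a quick way to locate candidate UTF16 BOMs is (assuming GNU grep with -P support):
grep -aobP '\xFE\xFF' input.pdf
Each match prints the byte offset of a possible BOM; you can then inspect around that offset in a text or hex editor.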

Examining the source of the latest Ghostscript (9.50), it seems that the PDFACompatibilityPolicy values in this case (see devices/vector/gdevpdfm.c around line 1951) select the error-handling behaviour as follows:
0 will revert to normal PDF output (not really what I wanted)
1 will discard PDFINFO (even worse)
2 will throw an error (even even worse)
any other value is ignored in the switch and works as a pass-through!
So, in my case, the whole thing was solved simply by setting
-dPDFACompatibilityPolicy=3
Ghostscript does not complain, does not abort PDF/A output, does not discard the PDFINFO, and, most importantly, veraPDF checker still verifies the PDF as perfectly okay.
I'm not commenting on how ugly this solution is, but it works just great. Since all other switch statements just assume compatibility policy 0 if anything above 2 gets passed in, this "shortcut" seems to be an unintended, but very useful bug.
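Applied to the command line from the question at the top, that is:
gs -dPDFA -dBATCH -dNOPAUSE -sProcessColorModel=DeviceCMYK -sDEVICE=pdfwrite -dPDFACompatibilityPolicy=3 -sOutputFile=output.pdf input.pdf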

The answer by exa is not quite correct. Ghostscript will continue its output, but the resulting PDF will not pass the veraPDF validator.
At the moment I'm busy trying to make Ghostscript produce a valid ZUGFeRD invoice PDF. For that, the PDF needs to be a valid PDF/A-3 (a, b or u) file.
Problem with the Answer
If you just use -dPDFACompatibilityPolicy=3, veraPDF won't validate the PDF.
Instead you should fix the file with the right encoding.
How to resolve it:
Create a new file (example "pdfmarks") with this content:
[ /Title (Foo Title)
/Author (Foo Bar)
/Subject (Foo Bar Subject)
/Keywords ()
/ModDate (D:20061204092842)
/CreationDate (D:20061204092842)
/Creator (Foo Bar)
/Producer (Foo Bar)
/DOCINFO pdfmark
(Note that there is no closing square bracket ']'.)
Run gs like this:
Windows:
"C:\Program Files\gs\gs9.53.3\bin\gswin64c.exe" -dSAFER -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -sOutputFile=/path/to/output.pdf /path/to/input.pdf /path/to/pdfmarks
Linux:
gs -dSAFER -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -sOutputFile=/path/to/output.pdf /path/to/input.pdf /path/to/pdfmarks
You can either include the rest of your conversion switches in the same run, or call gs a second time; a sketch follows.
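For example, a single pass that also requests PDF/A-3 might look like this (a sketch; it assumes a properly customised PDFA_def.ps with a real ICC profile, as discussed elsewhere on this page):
gs -dPDFA=3 -dBATCH -dNOPAUSE -sColorConversionStrategy=RGB -sDEVICE=pdfwrite -dPDFACompatibilityPolicy=1 -sOutputFile=/path/to/output.pdf /path/to/PDFA_def.ps /path/to/input.pdf /path/to/pdfmarks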
I hope I could save you some time with this.

Related

How to identify the pdf object in raw pdf file?

I want to remove certain objects programmatically.
Using cpdf I can get the objects; if I can somehow identify the ones I want to delete, then I should be able to modify PDF files with programs.
$ cpdf in.pdf -output-json -output-json-parse-content-streams -o out.json
$ cpdf -j out.json -o out.pdf
However, I cannot find the object corresponding to my target text. For example, text search does not work on a raw PDF file. What is the best way to identify the object containing a given piece of text?
EDIT: Here is a test PDF. Please remove XYZ from the top of each page. Note that the test is a significant simplification of the real PDF file, so the solution should not be so simple that it cannot be applied to real, complicated PDF files.
curl -s https://i.stack.imgur.com/whsnm.gif | tail -c +43 > test.pdf
The output of cpdf -output-json -output-json-parse-content-streams may or may not contain text which is recognisable to you. This depends on the font encodings in use, and the way in which text is laid out. In your file, for example, the painting of the string "XYZ" is represented as
[ "\u0000;\u0000<\u0000=", "Tj" ]
This is a string representing three two-byte codes (0x003B, 0x003C, 0x003D) indexing into the font. Cpdf presently has no way to show you what actual text this corresponds to; a future version will.
So I don't think your task can be done via cpdf -output-json in the general case, or indeed in this specific case.
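You can at least dump every Tj operand to see which objects paint text, e.g. with jq (a sketch, assuming the parsed content streams represent each text-showing operator as a two-element [operand, "Tj"] array, as in the snippet above):
jq '.. | arrays | select(length == 2 and .[1] == "Tj") | .[0]' out.json
That still leaves you with font-specific codes rather than readable text, which is exactly the limitation described above (and real files may also use TJ with an array operand).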

GhostScript PDF 1.5 (from tiff to PDF with ImageMagick) convert to PDF/A

I need to create a PDF/A from a folder of TIFF files.
Creating a PDF (1.5) works with ImageMagick.
But converting this PDF to a PDF/A using Ghostscript is a problem.
My GhostScript cmd:
-dPDFA=2 -dNOOUTERSAVE -sProcessColorModel=DeviceRGB -sDEVICE=pdfwrite -o "C:\Temp\TestData\TIFF to PDF Imagemagick\pdfa.pdf" "C:\Temp\TestData\TIFF to PDF Imagemagick\PDFA_def.ps" -dPDFACompatibilityPolicy=1 "C:\Temp\TestData\TIFF to PDF Imagemagick\test.pdf"
I also tried:
-dPDFA=2 -dBATCH -dNOPAUSE -sColorConversionStrategy=RGB -sDEVICE=pdfwrite -sPDFACompatibilityPolicy=1 -sOutputFile="C:\Temp\TestData\TIFF to PDF Imagemagick\pdfa.pdf" "C:\Temp\TestData\TIFF to PDF Imagemagick\PDFA_def.ps" "C:\Temp\TestData\TIFF to PDF Imagemagick\test.pdf"
my PDFA_def.ps is the GS standard with:
/ICCProfile (AdobeRGB1998.icc) % Customise
The created PDF/? does not pass the "Verify compliance with PDF/A-2b" preflight in Adobe Acrobat:
Error
Metadata missing (XMP)
PDF/A entry missing
Syntax problem: Indirect object “endobj” keyword not preceded by an EOL marker
Syntax problem: Stream dictionary improperly formatted
Nor does it pass the https://www.pdf-online.com/osa/validate.aspx validator:
File pdfa.pdf
Compliance pdf1.5
Result Document does not conform to PDF/A.
Details
Validating file "pdfa.pdf" for conformance level pdf1.5
XML line 10:212: xmlParseCharRef: invalid xmlChar value 0.
The document does not conform to the requested standard.
The document's meta data is either missing or inconsistent or corrupt.
The document does not conform to the PDF 1.5 standard.
Done.
I also tried veraPDF ...
What kind of settings have I forgotten?
Well, there are quite a few problems here.
You haven't said what version of Ghostscript you are using, nor have you supplied an example file to experiment with. You also haven't given the back channel output which might contain additional information.
You can't use the supplied model PDFA_def.ps without modification; at the very least you need to modify the /ICCProfile entry to point to a real, valid ICC profile. I suspect this has caused pdfwrite to abort PDF/A-2 production, which would normally be mentioned in the back channel output.
You haven't set ColorConversionStrategy; just setting ProcessColorModel is not sufficient, pdfwrite will mostly ignore it. If you don't tell pdfwrite that you want colours converted to a different space, it will preserve them unchanged, regardless of the process color model.
With this command it's now running:
-dPDFA=2 -sColorConversionStrategy=RGB -sDEVICE=pdfwrite -dPDFACompatibilityPolicy=1 -dNOPAUSE -dBATCH -o "C:\Temp\TestData\tiff2pdfa\pdfatest.pdf" "C:\Temp\TestData\tiff2pdfa\PDFA\PDFA_def.ps" "C:\Temp\TestData\tiff2pdfa\test.pdf"
Thanks to:
Batch Convert PDF to PDF/A - MARK BERRY
But I still get some errors:
GPL Ghostscript 9.25: UTF16BE text string detected in DOCINFO cannot be represented
in XMP for PDF/A 1, discarding DOCINFO
Processing pages 1 through 56.
Page 1
GPL Ghostscript 9.25: Setting Overprint Mode to 1
not permitted in PDF/A-2, overprint mode not set
Should I be thinking about this "Overprint Mode"?

Ghostscript's pdfwrite to grayscale results in wrong graylevel

I am trying to convert a PDF file (test.pdf, attached below) using Ghostscript (9.20 on Windows) to only use the gray colorspace (not RGB or CMY):
gswin64c.exe -sDEVICE=pdfwrite -sProcessColorModel=DeviceGray -sColorConversionStrategy=Gray -dOverrideICC -dUseCIEColor -o gray.pdf -f test.pdf
The result indeed only uses gray colors:
>gswin64c.exe -o - -sDEVICE=inkcov gray.pdf
GPL Ghostscript 9.20 (2016-09-26)
Copyright (C) 2016 Artifex Software, Inc. All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
Processing pages 1 through 1.
Page 1
0.00000 0.00000 0.00000 0.92673 CMYK OK
(I need to use -dUseCIEColor, otherwise CMY values are >0; this is a separate problem which I haven't yet solved...)
My problem: The resulting gray.pdf uses significantly different graylevels than the original test.pdf (open in your PDF viewer and compare for yourself).
Does anyone see my mistake or what I should do differently to get the same PDF but in grayscale rather than RGB colorspace?
Thank you very much!
test.pdf: https://drive.google.com/open?id=0BzjatAIrG6P3S2F5Vng4cUhUS0U
gray.pdf: https://drive.google.com/open?id=0BzjatAIrG6P3cEtTY3JaaTJCS2c
You are doing a multiple conversion, and not managing the colour space conversions at all.
Firstly you convert the original colour into a CIEBased colour space (and the space varies depending on the number of components in the original space). Since you don't specify Colour Rendering Dictionaries, this is an uncontrolled conversion, you are using the defaults.
You then embark on another conversion from CIEBased (which cannot, in general, be represented in PDF anyway, so would always result in an additional conversion) into DeviceGray. Again you haven't supplied any ICC profiles for this conversion, so you are using the default ones.
If you insist on using -dUseCIEColor (which I would very strongly advise against, controlling this is hard) then you need to supply Colour Rendering Dictionaries to control the conversion from device space into CIE space, and also ICC profiles to control the subsequent conversion from CIE space into DeviceGray.
But I strongly suspect that you will get better results by not using -dUseCIEColor, just like Ghostscript tells you.
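That is, essentially the questioner's command with the CIE switches removed (whether this alone cures the non-zero CMY values on 9.20 is a separate question):
gswin64c.exe -sDEVICE=pdfwrite -sColorConversionStrategy=Gray -sProcessColorModel=DeviceGray -o gray.pdf -f test.pdf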
I can only guess at what you need based on the source file. There's a DeviceRGB 0.5/0.5/0.5 filled rectangle, and I suspect you want it to become 0.5 DeviceGray.
The solutions and speculations below will work for that and similar cases only. (E.g., I have no idea what the "CMY values" you write about are, i.e. whether there are DeviceCMYK or ICC-based or any other colours in your files.) There are simple formulas to convert between device color spaces (see the PDF Reference); one of them indeed maps equal values in DeviceRGB to the same value in DeviceGray (gray = 0.3 R + 0.59 G + 0.11 B, so 0.5/0.5/0.5 becomes 0.5). To make it work, use Ghostscript 9.10:
"C:\Program Files\gs\gs9.10\bin\gswin32c.exe" -sDEVICE=pdfwrite -sProcessColorModel=DeviceGray -sColorConversionStrategy=Gray -dUseFastColor -o test_1.pdf -f test.pdf
Note the switch -dUseFastColor. You'll get "correct" 0.5 grayscale filled rectangle.
To make it work in versions between 9.10 and 9.20 (excluding both), I had to add another switch: -dPDFUseOldCMS. Again, this yields a 0.5 grayscale filled rectangle in the result.
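For example (a sketch; the 9.16 path is hypothetical, and the switch combination is the one described above):
"C:\Program Files\gs\gs9.16\bin\gswin32c.exe" -sDEVICE=pdfwrite -sProcessColorModel=DeviceGray -sColorConversionStrategy=Gray -dUseFastColor -dPDFUseOldCMS -o test_1.pdf -f test.pdf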
As the last switch name indicates, this simple behaviour was probably considered deprecated, and it looks like it was scrapped in 9.20.
Instead, a wonderful new CMS engine was introduced (since 9.10). Except it doesn't work for high-level devices (pdfwrite included); it has been either switched off or broken for many releases.
I was unable to make it actually use the -sOutputICCProfile option for any combination of device- or ICC-based colors in the source and command-line options, for either DeviceCMYK or DeviceGray output (or ICC-based output, whatever); the produced files had the same color values.
I'd appreciate if someone indicates I'm wrong and shows an opposite example.
It worked, actually (partly -- for device source colors only), in 9.10:
"C:\Program Files\gs\gs9.10\bin\gswin32c.exe" -sDEVICE=pdfwrite -sProcessColorModel=DeviceGray -sColorConversionStrategy=Gray -sOutputICCProfile=sgray.icc -o test_2.pdf -f test.pdf
Using different ICC profiles results in different (and, it looks, correct) output. To convert equal RGB values to the same gray value one would need a grayscale profile with the same gamma as the (default) sRGB. Just use the free ICC Profile Inspector to extract the curve from sRGB and import it into e.g. sgray.icc (distributed with Ghostscript).
The advantage of using a profile to convert RGB to Gray, preserving gamma, opposed to "simple formula" described above, may or may not be worth the effort. Check for your files and purposes.

Ghostscript: convert PDF to EPS with embeded font rather than outlined curve

I use the following command to convert a PDF to EPS:
gswin32 -dNOCACHE -dNOPAUSE -dBATCH -dSAFER -sDEVICE=epswrite -dLanguageLevel=2 -sOutputFile=test.eps -f test.pdf
I then use the following command to convert the EPS to another PDF (test2.pdf) to view the EPS figure.
gswin32 -dSAFER -dNOPLATFONTS -dNOPAUSE -dBATCH -dEPSCrop -sDEVICE=pdfwrite -dPDFSETTINGS=/printer -dCompatibilityLevel=1.4 -dMaxSubsetPct=100 -dSubsetFonts=true -dEmbedAllFonts=true -sOutputFile=test2.pdf -f test.eps
I found that the text in the generated test2.pdf has been converted to outline curves. There are no fonts embedded anymore either.
Is it possible to convert PDF to EPS without converting text to outlines? I mean, to EPS with embedded fonts and text.
Also, after the conversion (test.pdf -> test.eps -> test2.pdf), the height and width of the PDF figure (test2.pdf) are a little bit smaller than in the original PDF (test.pdf):
(screenshots of test.pdf and test2.pdf omitted)
Is it possible to keep the width and height of the figure after conversion?
Here is the test.pdf: https://dl.dropboxusercontent.com/u/45318932/test.pdf
I tried KenS's suggestion:
gswin32 -dNOPAUSE -dBATCH -dSAFER -sDEVICE=eps2write -dLanguageLevel=2 -sOutputFile=test.eps -f test.pdf
gswin32 -dSAFER -dNOPLATFONTS -dNOPAUSE -dBATCH -dEPSCrop -sDEVICE=pdfwrite -dPDFSETTINGS=/printer -dCompatibilityLevel=1.4 -dMaxSubsetPct=100 -dSubsetFonts=true -dEmbedAllFonts=true -sOutputFile=test2.pdf -f test.eps
I can see that the converted test2.pdf has a very weird font, different from the original font in test.pdf.
When I copy the text from test2.pdf, I only get a couple of symbols like:
✕ ✖ ✗✘✙ ✚✛
Here is the test2.pdf: https://dl.dropboxusercontent.com/u/45318932/test2.pdf
I was using the latest Ghostscript 9.15. So what is the problem?
I just noticed you are using epswrite; you don't want to do that. That device is terrible and has been deprecated (and removed now). Use the eps2write device instead (you will need a relatively recent version of Ghostscript).
There's nothing you can do with epswrite except throw it away; it makes terrible EPS files. It also can't make level 2 files, no matter what you set -dLanguageLevel to.
Oh, and don't use -dNOCACHE; that prevents fonts from being processed and decomposes everything to outlines or bitmaps.
UPDATE
You set subset fonts to true. By doing so the character codes which are used are more or less random. The first glyph in the document (say for example the 'H' in 'Hello World') gets the code 1, the second one (eg 'e') gets the code 2 and so on.
If you have a ToUnicode CMap, then Acrobat and other readers can convert these character codes to Unicode code points; without that, the readers have to fall back on heuristics, the final one being 'treat it as ASCII'. Because the encoding arrangement isn't ASCII, you get gibberish. MS Windows' PostScript output can contain additional ToUnicode information, but that's not something we try to mimic in ps2write. After all, presumably you had a PDF file already....
Every time you do a conversion you run the risk of this kind of degradation, you should really try and minimise this in your workflow.
The problem is even worse in this case: the input PDF file has a TrueType CID font. Basic language level 2 PostScript can't handle CIDFonts (IIRC support for them was introduced in PostScript version 2015). Since eps2write only emits basic level 2, it cannot write the font as a CIDFont. So instead it captures the glyph outlines and stores them in a type 3 font.
However, our EPS/PS output doesn't attempt to embed ToUnicode information in the PostScript (it's non-standard, very few applications can make use of it, and it therefore makes the files larger for little benefit). In addition, CIDFonts use multiple (2 or more) bytes for the character code, so there's no way to encode the type 3 fonts as ASCII.
Fundamentally you cannot use Ghostscript to go PDF->PS->PDF and still be able to copy/paste/search text, if the input contains CIDFonts.
By the way, there's no point in setting -dLanguageLevel at all. eps2write only creates level 2 output.
I used Inkscape to convert a .pdf to .eps. Just open the .pdf file in Inkscape, choose high mesh in the import options, and save it as an EPS file.

Cropped PCL after gswin PDF to PCL conversion

I have a PDF, which I want to convert to PCL
I convert PDF to PCL using the following command:
(gs 8.70)
gswin32c.exe -q -dNOPAUSE -dBATCH \
-sDEVICE=ljetplus -dDuplex=false -dTumble=false \
-sPAPERSIZE=a4 -sOutputFile="d:\doc1.pcl" \
-f"d:\doc1.pdf" -c -quit
When I view or print the output PCL, it is cropped. I would expect the output to start right at the edge of the paper (at least in the viewer).
Is there any way to get the whole output without moving the contents of the page away from the paper edge?
I tried the -dPDFFitPage option, which works, but results in scaled output.
You are using -sPAPERSIZE=a4. This causes the PCL to render for A4 sized media.
Very likely, your input PDF is made for a non-A4 size. That leaves you with 3 options:
...you either use that exact page size for the PCL too (which your printer possibly cannot handle),
...or you have to add -dPDFFitPage (as you tried, but didn't like),
...or you skip the -sPAPERSIZE=... parameter altogether (which will most likely automatically use the same size as the PDF, and which your printer possibly cannot handle; see the example after this list).
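For example, the third option is simply the original command without the paper size switch (a sketch based on the command in the question):
gswin32c.exe -q -dNOPAUSE -dBATCH -sDEVICE=ljetplus -dDuplex=false -dTumble=false -sOutputFile="d:\doc1.pcl" -f "d:\doc1.pdf"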
Update 1:
In case ljetplus is not a hard requirement for your requested PCL format variant, you could try this:
gs -sDEVICE=pxlmono -o pxlmono.pcl a4-fo.pdf
gs -sDEVICE=pxlcolor -o pxlcolor.pcl a4-fo.pdf
Update 2:
I can confirm now that even the most recent version of Ghostscript (v9.06) cannot handle non-Letter page sizes for ljetplus output.
I'd regard this as a bug... but it could well be that it won't be fixed, even if reported at the GS bug tracker. However, the least that can be expected is that it will get documented as a known limitation for ljetplus output...