I am attempting to convert various pdf files to match PDF/A-2b compliance. I am using this command:
gswin64c -dPDFA=2 -dBATCH -dNOPAUSE -sProcessColorModel=DeviceRGB -sDEVICE=pdfwrite -dPDFACompatibilityPolicy=1 -sOutputFile=output.pdf PDFA_def.ps cid0.pdf
with a default PDFA_def.ps file which uses ps_rgb.icc (although I don't think that affects this error).
This works for most files, but when I try to process one particular PDF, I get this error from Ghostscript:
GPL Ghostscript 9.25: A CIDFont uses CID 0, which is not legal for PDF/A, reverting to normal PDF output.
Opening up the pdf file in the free acrobat version, I found multiple fonts that were Type 1 (CID) and Identity-H. Following the documentation on https://www.ghostscript.com/doc/current/Use.htm#CIDFonts, I tried editing the cidfmap file to replace the fonts I found using downloaded ttf fonts, but I don't think that did anything. Using https://www.ilovepdf.com/convert-pdf-to-pdfa, I was able to successfully convert the file to pdfa/2-b, but the fonts in the converted file were the same (type 1 cid and identity-h).
I also checked the converted file using VeraPDF and the errors seemed to show that the file had never been converted to pdfa in the first place.
I am just getting started with ghostscript/pdf conversion things so I am not sure what the proper way to handle this is, or if I am headed in entirely the wrong direction. Is there a general way for ghostscript to handle pdf files with CID 0 type fonts so that they can be turned into PDFA/2-B compliant files? Is there some processing/preconversion I can do on this file to edit the fonts or something else to make it convertible? Or is the noncompliance for this file caused by something else?
The file is in Google drive here: cid0.pdf
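With -dPDFACompatibilityPolicy=1, the warning quoted above is the only sign that the output is not PDF/A. In an automated pipeline, one option is to capture Ghostscript's output and look for that message. The following is a minimal sketch, not a definitive implementation: the function names and the GS variable are made up for illustration, and it assumes the warning text is the same as in the 9.25 message quoted in the question.

```shell
# pdfa_reverted LOGFILE
# Succeeds (exit 0) if the captured Ghostscript output contains the
# "reverting to normal PDF output" warning quoted in the question.
pdfa_reverted() {
    grep -q 'reverting to normal PDF output' "$1"
}

# convert_to_pdfa INPUT OUTPUT LOGFILE   (hypothetical wrapper)
# Runs the command from the question, keeps the log, and reports the fallback.
# GS defaults to "gs"; on Windows it would be gswin64c.
convert_to_pdfa() {
    src=$1
    dst=$2
    log=$3
    "${GS:-gs}" -dPDFA=2 -dBATCH -dNOPAUSE -sProcessColorModel=DeviceRGB \
        -sDEVICE=pdfwrite -dPDFACompatibilityPolicy=1 \
        -sOutputFile="$dst" PDFA_def.ps "$src" >"$log" 2>&1 || return 1
    if pdfa_reverted "$log"; then
        echo "output is not PDF/A: $src" >&2
        return 2
    fi
}
```

A validator such as VeraPDF remains the real check; this only catches the case where Ghostscript itself says it gave up.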
I spent almost a day trying to understand how to convert a "print to file" PRN file to a PS or PDF file. The PRN file is in PCL format, and while searching on Google I found that GhostPCL should do the conversion from the command line:
gpcl6win64.exe -dNOPAUSE -sDEVICE=ps2write -sOutputFile="c:\output.ps" "c:\input.prn"
Unfortunately, all I get is binary data from the PRN file moved into the ps/pdf file without anything useful being rendered, only garbage symbols.
Any idea how to convert PRN PCL files to PS or PDF?
I am using the newest version of GhostPCL 9.23 for Win64.
I am also attaching test files.
input.prn - the print to file PCL.
output.ps - the PostScript file created using command line above with ps2write. Bad result.
output.pdf - the PDF file created using command line above with pdfwrite. Bad result.
output-correct.pdf - the PDF file created using one of the online converters. It produces correct output, but judging by the PDF metadata it seems to use GPL Ghostscript 9.19.
The command line you specify will create a PostScript program from the PCL input. I don't understand what you mean by:
binary data from PRN file is moved to ps/pdf file without rendering anything
The process won't render anything; it will produce a PostScript file. Note that you can create a PDF file instead by using the pdfwrite device.
Perhaps if you shared an example of the input and output files it might be possible to say more. It would also be helpful to see the entire transcript of the back channel output. If nothing else it would contain the version of GhostPCL being used, which would be helpful to know.
[Edit after files supplied]
I've no idea what led you to believe your file was a PCL file, but it isn't.
The 'PRN' file turns out not to be a PCL file at all. It's an XPS file.
Unsurprisingly, when you run this through the PCL interpreter it doesn't know what to make of it. PCL interpreters treat anything they don't understand as 'text' and try to print it as such. Which is why the content of your PDF begins with 'PK'. XPS files are zip archives, and PK is the signature for a zip archive.
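The 'PK' signature mentioned above can be checked before choosing an interpreter. The sketch below looks at the first two bytes of a file. The function name and the output labels are invented for illustration, and the PCL case relies on the assumption that PCL/PJL jobs normally begin with an ESC byte, which is typical but not guaranteed.

```shell
# sniff_print_file FILE
# Prints a rough guess at the job format based on the first two bytes.
sniff_print_file() {
    # od prints the bytes as hex, e.g. " 50 4b"; strip the whitespace.
    magic=$(od -A n -t x1 -N 2 "$1" | tr -d ' \t\n')
    case $magic in
        504b) echo "zip" ;;         # "PK": zip archive, so possibly XPS
        2521) echo "postscript" ;;  # "%!"
        1b*)  echo "pcl" ;;         # ESC: assumed PCL or PJL
        *)    echo "unknown" ;;
    esac
}
```

A "zip" result only says the file is a zip archive; whether it is really XPS still has to be confirmed, for example by running it through GhostXPS.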
If you use GhostXPS instead it will read the file properly. Or, since this is presumably on Windows 10, you could just save it direct to a PDF file if that's what you want.
Add a Microsoft PS Class Driver printer on a local port FILE:. Make sure that its Print Processor is set to winprint/RAW.
I am exploring tools to convert PDF documents to PDF/A. Ghostscript seems to give out of the box support for such a conversion. One issue seems to be that some true type fonts that are a part of the original PDF document are not converted correctly. If I copy a text from the converted PDF/A document, and paste it in notepad, the copied text appears to be garbled text.
The original document text can be copied to notepad just fine.
I am using the following script:
gswin64 -dPDFA -dBATCH -dNOPAUSE -dUseCIEColor -sProcessColorModel=DeviceCMYK -sDEVICE=pdfwrite -sPDFACompatibilityPolicy=1 -sOutputFile=FilteredOutput.pdf Filtered1Page.pdf
I have uploaded a sample 1 page source PDF in Google Drive:
SampleInput
A sample output PDF/A document generated from the command is in Google drive here:
SampleOutput
Running the above command on this PDF on a Windows machine will reproduce the issue.
Are there any settings / commands that make the PDF/A conversion work properly?
Copy and paste from a PDF is not guaranteed. Subset fonts will not have a usable Encoding (such as ASCII or UTF-8), in which case they will only be amenable to cut/paste/search if they have an associated ToUnicode CMap; many PDF files do not contain ToUnicode CMaps.
Of course, the PDF/A specification states (oddly in my opinion) that you should not use subset fonts, but it's not always possible to tell whether a font is subset (not all creators follow the XXXXX+ convention), and even if the font isn't subset there still isn't any guarantee that its Encoding is one that is usable.
Looking at the file you have posted, it does not contain one of the fonts it uses (Arial,Bold) and so Ghostscript substitutes with DroidSansFallback, and the font it does contain (FreeSansBold) is a subset (FWIW this font doesn't actually seem to be used....). The fallback font is a CIDFont, so there is no real prospect of the text being 'correct'.
I believe that if you make a real font available to Ghostscript to replace Arial,Bold then it will probably work correctly. This would also fix the rather more obvious problem of the spacing of the characters being incorrect (in one place, wildly incorrect), which is caused by the fallback font having different widths to the original.
NB: as the warning messages have already told you, don't use -dUseCIEColor.
The fact that you cannot copy/paste/search a PDF does not mean that it is not a valid PDF/A-1b file, though, so this does not mean that the creation (NOT conversion) of the PDF/A-1b file was not 'proper'.
I am using TCPDF in order to create PDF files.
Because TCPDF has a bug in the font subsetting (link to bug),
I use the following Ghostscript command to subset fonts in the TCPDF-created PDF file:
gswin64c.exe -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 \
-dPDFSETTINGS=/prepress -dUseFlateCompression=false -dEmbedAllFonts=true \
-dSubsetFonts=true -sOutputFile="out.pdf" "input.pdf"
It works great and reduces the file size. But when I try to parse the PDF file as text (with poppler -> pdftotext) or when I open the file in PDF viewer and select text I get gibberish on UTF-8 fonts.
In order to reproduce it here is the file before ghostscript and file after ghostscript.
If you open it in Adobe Reader, copy the text, and paste it somewhere else, you can see that you can copy text from the "before GS" file. But in the second file you get gibberish unless you copy English characters (the files are in Hebrew).
Other than that the file looks great.
Do you have any idea on how to preserve the UTF8 fonts in Ghostscript?
Yes, don't subset the fonts. Subsetting the fonts causes them to be re-encoded. Because your fonts don't have a ToUnicode CMap, the copy/paste only works by heuristics (i.e. the character codes have to be meaningful). In your case the character codes are, or appear to be, Unicode, so you are in luck: the heuristics work.
Once you subset the fonts, Ghostscript re-encodes them. So the character codes are no longer Unicode. In the absence of a ToUnicode CMap, the copy/paste no longer works.
The only way you can get this to work is to not re-encode the fonts, which means you cannot subset them using Ghostscript's pdfwrite device. In fact, because you are using CIDFonts with TrueType outlines, you can't avoid subsetting the fonts, so basically, this won't work.
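Since everything here hinges on whether the fonts carry a ToUnicode CMap, a quick first check is to search the raw PDF for the /ToUnicode key. This is a rough sketch under stated assumptions: the function name is made up, it relies on grep's -a option (available in GNU and BSD grep), and it only sees keys stored in uncompressed objects. A PDF that keeps its font dictionaries inside compressed object streams can report 0 even though ToUnicode CMaps are present.

```shell
# count_tounicode FILE
# Prints the number of lines in the raw PDF that mention /ToUnicode.
# grep exits non-zero when the count is 0, so "|| true" keeps the
# function usable under "set -e".
count_tounicode() {
    grep -a -c '/ToUnicode' "$1" || true
}
```

Running it on the file before and after the Ghostscript pass gives a quick indication of whether either one has a ToUnicode CMap to fall back on.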
Please bear in mind that Ghostscript's pdfwrite device is not intended as a tool for manipulating PDF files!
By the way, your PDF file has other problems. It scales a font (Tf operator) to 0, and it has a BBox for a Form where all the co-ordinates are 0 (and indeed the form has no content, so it is pointless). This is in addition to a CIDFont with no ToUnicode CMap. Perhaps you should consider a different tool for production of PDF files.
I am looking for a way to 'outline' all text/fonts in a PDF file, i.e. convert them to curves.
I would prefer to do this without having to convert the PDF to PostScript and back. Also, I would like to use free lightweight cross-platform tools that can be automated from the command line, such as Ghostscript or MuPDF.
Yes, you can use Ghostscript to achieve what you want.
I. For Ghostscript versions up to 9.14
You need to go through 2 steps:
Convert the PDF to a PostScript file, but use the side effect of a relatively unknown parameter: it is called -dNOCACHE. This will convert all used fonts to outline shapes:
gs -o somepdf.ps -dNOCACHE -sDEVICE=pswrite somepdf.pdf
Convert the PS back to PDF (and, maybe delete the intermediate PS again):
gs -o somepdf-with-outlines.pdf -sDEVICE=pdfwrite somepdf.ps
rm somepdf.ps
This method is not reliable long-term, because the Ghostscript developers have stated that -dNOCACHE may not be present in future versions.
Note: the resulting PDF will very likely be larger than the original one. Plus, without additional command line parameters, all images in the original PDF will likely also be processed according to Ghostscript builtin defaults. This can lead to unwanted side-effects. Those side-effects can be avoided by adding more command line parameters to do otherwise.
II. Ghostscript versions 9.15 or newer
Ghostscript version 9.15 (released in September 2014) supports a new command line parameter:
-dNoOutputFonts
This will cause the output devices pdfwrite, ps2write and eps2write "to 'flatten' glyphs into 'basic' marking operations (rather than writing fonts to the output)".
This means: the two steps described for pre-9.15 GS versions can be avoided. The desired result can be achieved with a single command:
gs -o file-with-outlines.pdf -dNoOutputFonts -sDEVICE=pdfwrite file.pdf
Note: the same caveat is true as already noted in part I. If your PDF includes images, there may be unwanted side effects introduced by the simple command line above. To avoid these, you need to add more specific parameters.
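As an illustration of what "more specific parameters" could look like, the sketch below wraps the single-step command and adds pdfwrite image parameters that turn off downsampling and automatic filter selection, storing colour and grey images with Flate compression. The function name and the GS variable are assumptions for this example, and which of these parameters actually matter depends on the input file, so treat it as a starting point rather than a recommended set.

```shell
# outline_fonts INPUT OUTPUT   (hypothetical helper)
# Converts text to outlines while asking pdfwrite to leave images alone.
# GS defaults to "gs"; on Windows it would be gswin64c.
outline_fonts() {
    "${GS:-gs}" -o "$2" -sDEVICE=pdfwrite -dNoOutputFonts \
        -dDownsampleColorImages=false \
        -dDownsampleGrayImages=false \
        -dDownsampleMonoImages=false \
        -dAutoFilterColorImages=false \
        -dAutoFilterGrayImages=false \
        -dColorImageFilter=/FlateEncode \
        -dGrayImageFilter=/FlateEncode \
        "$1"
}
```

Usage would be `outline_fonts file.pdf file-with-outlines.pdf`. Flate is lossless, so images that were JPEG-compressed in the source may grow in size.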
This commit adds a new switch -dNoOutputFonts to the Ghostscript pdfwrite and ps2write devices which will produce a PDF file (or PostScript, depending on the selected device) where all the glyphs have been created as vectors, not as text.
You will need at least version 9.15 of Ghostscript to get this feature. Be aware that the PDF file will almost certainly be larger and copy/paste/search will (obviously) not work.
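A script that has to run on machines with different Ghostscript versions could check the version before choosing between the two approaches. The sketch below compares a version string such as the one printed by `gs --version` against 9.15. The function names are invented for illustration, and it assumes the version string has the usual major.minor or major.minor.patch shape.

```shell
# has_no_output_fonts VERSION
# Succeeds (exit 0) if VERSION is 9.15 or newer.
has_no_output_fonts() {
    major=${1%%.*}      # "9.54.0" -> "9"
    rest=${1#*.}        # "9.54.0" -> "54.0"
    minor=${rest%%.*}   # "54.0"   -> "54"
    [ "$major" -gt 9 ] || { [ "$major" -eq 9 ] && [ "$minor" -ge 15 ]; }
}

# outline_strategy VERSION
# Prints which of the two approaches described above applies.
outline_strategy() {
    if has_no_output_fonts "$1"; then
        echo "single-step"
    else
        echo "two-step"
    fi
}
```

For example, `outline_strategy "$(gs --version)"` would select the -dNoOutputFonts route on any release from 9.15 onwards.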
III. Ghostscript versions 9.54.0 (Windows 10)
I found a method that preserves all fonts flawlessly as vectors without any visual errors and with just two printing steps, after Ghostscript is first installed and configured correctly.
(Note! You must add the Ghostscript bin and lib folders to your Windows PATH in order to get Ghostscript to do anything.)
Instructions here
Print your PDF file that contains vector-based fonts or other vector elements from Acrobat Reader, using the Microsoft PS Class Driver, to a file named YourFile.prn. (To install this driver: Control Panel - Devices - Printers & Scanners - Add a printer or scanner. Let Windows look for a connected printer for a while, and when it stops, select "The printer that I want is not listed" - Add a local printer or network printer with manual settings - Next - Use an existing port: FILE: (Print to File) - Next - Microsoft: Microsoft PS Class Driver - Next.)
Open Command prompt, navigate to the folder where YourFile.prn file is located and type: "C:\Program Files\gs\gs9.54.0\bin\gswin64c.exe" -dNOPAUSE -dNOCACHE -dBATCH -sDEVICE=eps2write -sOutputFile=YourFile.eps YourFile.prn
If you have a constant need to do this you can also create prn2eps.bat file containing the following:
"C:\Program Files\gs\gs9.54.0\bin\gswin64c.exe" -dNOPAUSE -dNOCACHE -dBATCH -sDEVICE=eps2write -sOutputFile=%1.eps %1.prn
To use that bat file you just need to type: prn2eps YourFile.
(Note! you must have the bat file and Yourfile.prn in the same directory)
For some reason the newest Ghostscript ps2epsi function didn't work in Windows 10, and PDFs made with Adobe tools had minor but consistent errors in some font characters when I imported them as PDFs into non-Adobe design software. Over the years I have found that EPS is one of the most reliable formats when vectors must be preserved from one program to another. Often, printing a PDF to PDF again through a different printer driver, or a single file format change using Ghostscript, is enough, but not always.
We have Ghostscript set up on our server to convert a PDF into separate TIFF images when it is uploaded. It works perfectly most of the time, but sometimes it fails. I have managed to solve this on a per-PDF basis by opening the problem PDF and saving it in Acrobat as an 'Optimized PDF' with JUST these two attributes checked:
'Discard unreferenced named destinations' (in Clean Up)
'Optimize page content' (in Clean Up)
(nothing else has been checked in any section, just these two)
My question is, is there a way to have ghostscript do what I am having to currently do?
The reason I need ghostscript to do this is because it has to be fully automated so users can upload a pdf and it gets converted into images.
If it helps, here are the ghostscript settings we are using:
-dQUIET
-dSAFER
-dBATCH
-dNOPAUSE
-dNOPROMPT
-sDEVICE=tiff24nc
-dUseCIEColor
-dTextAlphaBits=4
-dGraphicsAlphaBits=4
-dEPSCrop
Many thanks,
Pat
Sometimes Ghostscript fails to open a file because its XREF table is corrupt.
Try repairing the problematic PDF with pdftk:
http://www.pdflabs.com/docs/install-pdftk/
pdftk file.pdf output fixed.pdf
If pdftk is able to repair the PDF file, you can write a shell script with an if...then...else statement: if a PDF file causes Ghostscript to fail, it is automatically repaired by pdftk and then resubmitted to Ghostscript.
Above all, you need to learn to READ THE ERROR OUTPUT, since it contains the explanation of the error almost 99% of the time.
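The repair-and-retry idea could be sketched roughly as follows. The function names, the GS and PDFTK variables, the "-fixed.pdf" suffix and the output pattern in the usage line are all assumptions made for this example. The Ghostscript options are taken from the question, leaving out -dUseCIEColor and -dEPSCrop. The sketch also assumes that Ghostscript signals the failure through a non-zero exit status; if it does not for a given file, the error output would have to be inspected instead.

```shell
# render_tiff INPUT OUTPUT_PATTERN   (hypothetical helper)
# Renders each page to a TIFF using the settings from the question.
render_tiff() {
    "${GS:-gs}" -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT \
        -sDEVICE=tiff24nc -dTextAlphaBits=4 -dGraphicsAlphaBits=4 \
        -sOutputFile="$2" "$1"
}

# render_with_repair INPUT OUTPUT_PATTERN
# Tries Ghostscript first; on failure, rewrites the PDF with pdftk
# and submits the repaired copy to Ghostscript again.
render_with_repair() {
    src=$1
    pattern=$2
    if render_tiff "$src" "$pattern"; then
        return 0
    fi
    fixed="${src%.pdf}-fixed.pdf"
    "${PDFTK:-pdftk}" "$src" output "$fixed" || return 1
    render_tiff "$fixed" "$pattern"
}
```

A call might look like `render_with_repair upload.pdf 'page-%03d.tif'`, where %03d is Ghostscript's page-number placeholder in the output file name.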