Ghostscript: PDF/A or PDF/X → regular PDF?

How can I use gs to convert a PDF/A or PDF/X file to a regular PDF file?

You can't use Ghostscript to 'convert' a PDF file as such; you can only give it a PDF as input and have it produce a new PDF as output. If you simply pass the PDF as input to Ghostscript and use the pdfwrite device, it will produce an equivalent PDF for you; unless you specify PDF/A or PDF/X, the output won't conform to either standard.
In any event, why would you want to do this? A PDF/A or PDF/X file is simply a valid PDF file which adheres to certain additional restrictions.
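If you do want to do it anyway, a minimal sketch along the lines described above (assuming a reasonably current Ghostscript; input.pdf and output.pdf are placeholder names):

```shell
# Read a PDF/A or PDF/X file and write an ordinary PDF.
# Without -dPDFA or -dPDFX on the command line, pdfwrite makes no
# attempt to conform the output to either standard.
gs -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile=output.pdf input.pdf
```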

I have been looking for the same answer: if you want to print your PDF file, some printers only accept the PDF/X format. This is the gs command I have been using:
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.3 -dNOPAUSE -dQUIET -dBATCH -sOutputFile=new.pdf original.pdf

Related

Ghostscript pdf conversion makes ligatures unable to copy & paste

I have a pdf (created with latex with \usepackage[a-2b]{pdfx}) where I am able to correctly copy & paste ligatures, i.e., "fi" gets pasted in my text editor as "fi". The pdf is quite large, so I'm trying to reduce its size with this ghostscript command:
gs -dPDFA-2 -dBATCH -dNOPAUSE -sPDFACompatibilityPolicy=1 -sDEVICE=pdfwrite
-dPDFSETTINGS=/printer -sProcessColorModel=DeviceRGB
-sColorConversionStrategy=UseDeviceIndependentColor
-dColorImageDownsampleType=/Bicubic -dAutoRotatePages=/None
-dCompatibilityLevel=1.5 -dEmbedAllFonts=true -dFastWebView=true
-sOutputFile=main_new.pdf main.pdf
While this produces a nice, small pdf, now when I copy a word with "fi", I instead (often) get "ő".
Since the correct characters are somehow encoded in the original pdf, is there some parameter I can give ghostscript so that it simply preserves this information in the converted pdf?
I'm using ghostscript 9.27 on macOS 10.14.
Without seeing your original file, so that I can see the way the text is encoded, it's not possible to be definitive. It certainly is not possible to have the pdfwrite device 'preserve the information'; for an explanation, see here.
If your original PDF file has a ToUnicode CMap, then the pdfwrite device should use that to generate a new ToUnicode CMap in the output file, maintaining copy/paste and search. If it doesn't, then the conversion process will destroy the encoding. You might be able to get an improvement in results by setting SubsetFonts to false, but that's just a guess without seeing an example.
My guess is that your original file doesn't have a ToUnicode CMap, which means that it's essentially only working by luck.
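As a concrete sketch of that experiment, here is the asker's own command with -dSubsetFonts=false added (untested against the file in question; every other option is unchanged):

```shell
# Same PDF/A-2 conversion as before, but with font subsetting disabled,
# to test whether re-encoding during subsetting is what breaks copy/paste.
gs -dPDFA-2 -dBATCH -dNOPAUSE -sPDFACompatibilityPolicy=1 -sDEVICE=pdfwrite \
  -dPDFSETTINGS=/printer -sProcessColorModel=DeviceRGB \
  -sColorConversionStrategy=UseDeviceIndependentColor \
  -dColorImageDownsampleType=/Bicubic -dAutoRotatePages=/None \
  -dCompatibilityLevel=1.5 -dEmbedAllFonts=true -dFastWebView=true \
  -dSubsetFonts=false \
  -sOutputFile=main_new.pdf main.pdf
```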

How can I disable Ghostscript rasterization of images and paths?

I need to convert a PDF to a different ICC color profile. Through different searches and tests, I found out a way to do that:
First I convert my PDF to a PS file with:
.\gswin64c.exe -dNOPAUSE -dBATCH -sDEVICE=ps2write -sOutputFile="test.ps" "test.pdf"
Then I convert the PS back to a PDF with the following (this is to generate a valid PDF/X-3 file):
.\gswin64c.exe -dPDFX -dNOPAUSE -dBATCH -sDEVICE=pdfwrite
-sColorConversionStrategy=/UseDeviceIndependentColor -sProcessColorModel=DeviceCMYK
-dColorAccuracy=2 -dRenderIntent=0 -sOutputICCProfile="WebCoatedSWOP2006Grade5.icc"
-dDeviceGrayToK=true -sOutputFile="final.pdf" test_PDFX_def.ps test.ps
The ICC profile is embedded and all works perfectly. The only problem is that the whole final PDF is rasterized, so I lose the quality of all the paths and other vector elements I have in the starting file. I need to keep them as vectors because this PDF will have a specific application.
First of all: don't convert to PostScript!
Any transparent marking operations will have to be rendered if you do that, because PostScript doesn't support transparency. Other features will be lost as well, so really, don't do that. The input and output ends of Ghostscript are more or less independent; the pdfwrite device doesn't know whether the input was PDF or PostScript, and doesn't care. So you don't need to convert a PDF file into PostScript before sending it as input.
You can feed the original PDF file into the second command line in place of the PostScript file.
As long as you are producing PDF/X-3 or later then the transparency will be preserved. Make sure you are using an up to date version of Ghostscript.
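In other words, the single-step version of the second command would look something like this (the same options as before, with the original PDF substituted for test.ps; untested against the actual files):

```shell
# One step: PDF in, PDF/X-3 out. No intermediate PostScript, so
# transparency and other PDF-only features can be preserved.
.\gswin64c.exe -dPDFX -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sColorConversionStrategy=/UseDeviceIndependentColor -sProcessColorModel=DeviceCMYK -dColorAccuracy=2 -dRenderIntent=0 -sOutputICCProfile="WebCoatedSWOP2006Grade5.icc" -dDeviceGrayToK=true -sOutputFile="final.pdf" test_PDFX_def.ps test.pdf
```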

Ghostscript - Convert vector pdf to the raster pdf

I would like to convert a vector PDF to a raster PDF using Ghostscript (i.e. rasterize the vector PDF). But I cannot find the appropriate parameters to do so, even when I add the resolution parameter -r300.
The code I used is -dSAFER -dBATCH -dNOPAUSE -dPDFSETTINGS=/screen -dGraphicsAlphaBits=1 -sDEVICE=pdfwrite -r300 -sOutputFile="output-raster.pdf" "input-vector.pdf"
Does anyone know how to rasterize the PDF?
You can use pdftocairo from the Poppler library. It can convert a PDF to a raster image format like PNG or JPEG. Then use any image viewer or imagemagick to convert the image to a PDF file if you need a PDF as output.
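A sketch of that two-step approach (assuming pdftocairo from Poppler and ImageMagick are installed; filenames and the 300 dpi resolution are placeholders):

```shell
# Step 1: rasterize every page to a 300 dpi PNG
# (produces page-1.png, page-2.png, ... in the current directory).
pdftocairo -png -r 300 input-vector.pdf page

# Step 2: collect the page images back into a single PDF.
magick page-*.png output-raster.pdf
```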

Ghostscript output PDF: text can not be copied

I am using TCPDF in order to create PDF files.
Because TCPDF has a bug in the font subsetting (link to bug),
I use the following Ghostscript command to subset fonts in the TCPDF-created PDF file:
gswin64c.exe -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 \
-dPDFSETTINGS=/prepress -dUseFlateCompression=false -dEmbedAllFonts=true \
-dSubsetFonts=true -sOutputFile="out.pdf" "input.pdf"
It works great and reduces the file size. But when I try to parse the PDF file as text (with poppler -> pdftotext) or when I open the file in PDF viewer and select text I get gibberish on UTF-8 fonts.
In order to reproduce it here is the file before ghostscript and file after ghostscript.
If you open it in Adobe Reader, copy the text, and paste it somewhere else, you can see that you can copy text from the "before GS" file. But in the second file you get gibberish unless you copy English characters (the files are in Hebrew).
Other than that the file looks great.
Do you have any idea on how to preserve the UTF8 fonts in Ghostscript?
Yes: don't subset the fonts. Subsetting the fonts causes them to be re-encoded. Because your fonts don't have a ToUnicode CMap, the copy/paste only works by heuristics (i.e. the character codes have to be meaningful). In your case the character codes are, or appear to be, Unicode, so you are in luck: the heuristics work.
Once you subset the fonts, Ghostscript re-encodes them. So the character codes are no longer Unicode. In the absence of a ToUnicode CMap, the copy/paste no longer works.
The only way you can get this to work is to not re-encode the fonts, which means you cannot subset them using Ghostscript's pdfwrite device. In fact, because you are using CIDFonts with TrueType outlines, you can't avoid subsetting the fonts, so basically, this won't work.
Please bear in mind that Ghostscript's pdfwrite device is not intended as a tool for manipulating PDF files!
By the way, your PDF file has other problems. It scales a font (the Tf operator) to 0, and it has a BBox for a Form where all the co-ordinates are 0 (and indeed the Form has no content, so it's pointless). This is in addition to a CIDFont with no ToUnicode CMap. Perhaps you should consider a different tool for producing PDF files.

Print a file (pdf) to a printer with PS driver, grab PS-file and convert to searchable pdf with ghostscript

When I print a PDF-file with a PS-driver and then convert the PS-file to a searchable PDF with ghostscript (pdfwrite device) something is wrong with the final pdf file. It becomes corrupt.
In some cases, the space character disappears, and in other cases the text width becomes too large so text overlap text.
The settings for gs are -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dEmbedAllFonts=true -dSubsetFonts=false -sOutputFile=output.pdf input.ps
I am wondering if it is Ghostscript that just can't produce a good output when the input file is a PDF.
If I print a word-document everything works fine!
Are there any other solutions, like using an XPS driver and converting the XPS file to a searchable PDF instead? Are there any tools out there that can do this?
I use gs 9.07.
Why are you going through the step of printing the PDF file to a PostScript file? Ghostscript is already capable of accepting a PDF file as input.
This simply adds more confusion; it certainly won't add anything useful.
It's not possible to say what the problem 'might' be without seeing the original PDF file and the PostScript file produced by your driver. My guess would be that whatever application is processing the PDF hasn't embedded the font, or that the PostScript driver hasn't been able to convert the font into something suitable for PostScript, resulting in the font being missing in the output, and the pdfwrite device having to substitute 'something else' for the missing font.
Ghostscript (more accurately the pdfwrite device) is perfectly capable of producing a decent PDF file when the input is PDF, but your input isn't PDF, it's PostScript!
To be perfectly honest, if your original PDF file isn't 'searchable', it's very unlikely that the PDF file produced by pdfwrite will be either, no matter whether you use the original PDF or mangle it into PostScript first.
The usual reasons why a PDF file is not 'searchable' are that there is no ToUnicode information and the font is encoded with a custom encoding that does not use standard glyph names. If this is the case, there is nothing you can do with the PDF file except OCR it.
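If OCR turns out to be the only way forward, one hedged example using the open-source ocrmypdf tool (an assumption: it and its Tesseract dependency are installed; filenames are placeholders):

```shell
# Render the pages, run OCR on them, and add an invisible text layer,
# which makes the PDF searchable without changing its appearance.
ocrmypdf input.pdf output-searchable.pdf
```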