I have a pdf (created with latex with \usepackage[a-2b]{pdfx}) where I am able to correctly copy & paste ligatures, i.e., "fi" gets pasted in my text editor as "fi". The pdf is quite large, so I'm trying to reduce its size with this ghostscript command:
gs -dPDFA-2 -dBATCH -dNOPAUSE -sPDFACompatibilityPolicy=1 -sDEVICE=pdfwrite \
  -dPDFSETTINGS=/printer -sProcessColorModel=DeviceRGB \
  -sColorConversionStrategy=UseDeviceIndependentColor \
  -dColorImageDownsampleType=/Bicubic -dAutoRotatePages=/None \
  -dCompatibilityLevel=1.5 -dEmbedAllFonts=true -dFastWebView=true \
  -sOutputFile=main_new.pdf main.pdf
While this produces a nice, small pdf, now when I copy a word with "fi", I instead (often) get "ő".
Since the correct characters are somehow encoded in the original pdf, is there some parameter I can give ghostscript so that it simply preserves this information in the converted pdf?
I'm using ghostscript 9.27 on macOS 10.14.
Without seeing your original file, so that I can see the way the text is encoded, it's not possible to be definitive. It certainly is not possible to have the pdfwrite device 'preserve the information'; for an explanation, see here.
If your original PDF file has a ToUnicode CMap, then the pdfwrite device should use it to generate a new ToUnicode CMap in the output file, maintaining cut & paste/search. If it doesn't, then the conversion process will destroy the encoding. You might be able to get an improvement in results by setting SubsetFonts to false, but that's just a guess without seeing an example.
My guess is that your original file doesn't have a ToUnicode CMap, which means that it's essentially only working by luck.
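To make the SubsetFonts suggestion concrete, here is the asker's own command with subsetting disabled; this is a sketch, and whether it actually preserves cut & paste depends on how the text in the original file is encoded:

```shell
# Variant of the original command with font subsetting turned off,
# so pdfwrite does not re-encode the fonts. All other options are
# unchanged from the question.
gs -dPDFA-2 -dBATCH -dNOPAUSE -sPDFACompatibilityPolicy=1 \
   -sDEVICE=pdfwrite -dPDFSETTINGS=/printer \
   -sProcessColorModel=DeviceRGB \
   -sColorConversionStrategy=UseDeviceIndependentColor \
   -dColorImageDownsampleType=/Bicubic -dAutoRotatePages=/None \
   -dCompatibilityLevel=1.5 -dEmbedAllFonts=true -dFastWebView=true \
   -dSubsetFonts=false \
   -sOutputFile=main_new.pdf main.pdf
```

Note that disabling subsetting will usually make the output larger, which works against the original goal of reducing file size, so it is a trade-off to test rather than a guaranteed fix.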
I need to convert a PDF to a different ICC color profile. Through different searches and tests, I found out a way to do that:
First I convert my PDF to a PS file with:
.\gswin64c.exe -dNOPAUSE -dBATCH -sDEVICE=ps2write -sOutputFile="test.ps" "test.pdf"
Then I convert the PS back to a PDF with the following (this is to generate a valid PDF/X-3 file):
.\gswin64c.exe -dPDFX -dNOPAUSE -dBATCH -sDEVICE=pdfwrite \
  -sColorConversionStrategy=/UseDeviceIndependentColor -sProcessColorModel=DeviceCMYK \
  -dColorAccuracy=2 -dRenderIntent=0 -sOutputICCProfile="WebCoatedSWOP2006Grade5.icc" \
  -dDeviceGrayToK=true -sOutputFile="final.pdf" test_PDFX_def.ps test.ps
The ICC profile is embedded and all works perfectly. The only problem is that the whole final PDF is rasterized, so I lose all the paths and the quality of the other vector elements I have in the starting file. I need to keep them as vectors because this PDF has a specific application.
First step: don't convert to PostScript!
Any transparent marking operations will have to be rendered if you do that, because PostScript doesn't support transparency. Other features will be lost as well, so really, don't do that. The input and output ends of Ghostscript are more or less independent; the pdfwrite device doesn't know whether the input was PDF or PostScript, and doesn't care. So you don't need to convert a PDF file into PostScript before sending it as input.
You can feed the original PDF file into the second command line in place of the PostScript file.
As long as you are producing PDF/X-3 or later then the transparency will be preserved. Make sure you are using an up to date version of Ghostscript.
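Following the advice above, the two-step process collapses into one: the original PDF simply replaces the intermediate PostScript file as the last input. This is a sketch using the file names and ICC profile from the question (the PDF/X definition file is still needed first):

```shell
# Single-step PDF/X-3 conversion: no PostScript intermediate, so
# transparency and vector content are not rendered to images.
.\gswin64c.exe -dPDFX -dNOPAUSE -dBATCH -sDEVICE=pdfwrite \
  -sColorConversionStrategy=/UseDeviceIndependentColor \
  -sProcessColorModel=DeviceCMYK \
  -dColorAccuracy=2 -dRenderIntent=0 \
  -sOutputICCProfile="WebCoatedSWOP2006Grade5.icc" \
  -dDeviceGrayToK=true -sOutputFile="final.pdf" \
  test_PDFX_def.ps "test.pdf"
```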
I am using TCPDF in order to create PDF files.
Because TCPDF has a bug in the font subsetting (link to bug),
I use the following Ghostscript command to subset fonts in the TCPDF-created PDF file:
gswin64c.exe -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 \
-dPDFSETTINGS=/prepress -dUseFlateCompression=false -dEmbedAllFonts=true \
-dSubsetFonts=true -sOutputFile="out.pdf" "input.pdf"
It works great and reduces the file size. But when I try to parse the PDF file as text (with poppler -> pdftotext) or when I open the file in PDF viewer and select text I get gibberish on UTF-8 fonts.
To reproduce it, here is the file before Ghostscript and the file after Ghostscript.
If you open it in Adobe Reader, copy the text, and paste it somewhere else, you can see that you can copy text from the file "before GS". But in the second file you get gibberish unless you copy English characters (the files are in Hebrew).
Other than that the file looks great.
Do you have any idea how to preserve the UTF-8 fonts in Ghostscript?
Yes, don't subset the fonts. Subsetting the fonts causes them to be re-encoded. Because your fonts don't have a ToUnicode CMap, the copy/paste only works by heuristics (i.e. the character codes have to be meaningful). In your case the character codes are, or appear to be, Unicode, so you are in luck: the heuristics work.
Once you subset the fonts, Ghostscript re-encodes them. So the character codes are no longer Unicode. In the absence of a ToUnicode CMap, the copy/paste no longer works.
The only way you can get this to work is to not re-encode the fonts, which means you cannot subset them using Ghostscript's pdfwrite device. In fact, because you are using CIDFonts with TrueType outlines, you can't avoid subsetting the fonts, so basically, this won't work.
Please bear in mind that Ghostscript's pdfwrite device is not intended as a tool for manipulating PDF files!
By the way, your PDF file has other problems: it scales a font (Tf operator) to 0, and it has a BBox for a Form where all the co-ordinates are 0 (and indeed the Form has no content, so it is pointless). This is in addition to a CIDFont with no ToUnicode CMap. Perhaps you should consider a different tool for producing PDF files.
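Since the ToUnicode CMap is the deciding factor in both this answer and the earlier one, it is worth knowing how to check for it. One way, assuming poppler-utils is installed (the question already uses poppler's pdftotext), is pdffonts:

```shell
# pdffonts lists every font in the file; the "uni" column reports
# whether a ToUnicode CMap is present ("yes") for each font.
pdffonts input.pdf
```

If the "uni" column shows "no" for a font, any copy/paste or text extraction from that font is relying on the character codes happening to be meaningful, which is exactly the luck-based behaviour described above.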
I'm using gs9.10 and have successfully converted my PDF to TIFF using this command line:
gswin64c -dNOPAUSE -q -r300x300 -sDEVICE=tifflzw \
-dBATCH -sCompression=lzw -dFirstPage=1 -dLastPage=5 \
-sOutputFile=TEST.TIFF \
TEST.PDF
However, I don't want the TIFF to have the watermark that is on every page of the PDF. Is there an option to ignore the watermark layer when writing out to a TIFF?
To be honest, this sounds suspiciously like trying to circumvent copyright. Obviously I can't tell since I haven't seen your original PDF file but watermarks are often applied to 'demo' or paid-for PDF files.
In any event, without seeing the file it's impossible to say whether a watermark can be removed, because it depends on how the watermark has been applied; there are at least three different ways I can think of off-hand, and for two of those I could eliminate the watermark afterwards. There is unlikely to be a 'watermark layer' in the PDF file.
If you post a URL to the original PDF file I can look at it.
If it is just about text extraction, this command should do it:
pdftotext \
  -layout \
  input.pdf \
  output.txt
Now, if your "watermark" is not text, but some sort of image or vector graphic, it will not be part of your output.txt.
Now, if your "watermark" indeed is also text, that watermark string will also appear in your output text for each page. It should be easy to remove that string from the text (and replace it by nothing if unwanted).
If your "watermark" text appears as gobbledygook in output.txt, then the font type ("Type 3"?) or font encoding ("custom"?) used for the watermark text does not allow for easy text extraction, or a valid "ToUnicode" map is missing for the font used.
If your main text did not extract successfully from the PDF, and your main text is gobbledygook, it very likely won't extract any better after removing the "watermark" from the original PDF file before applying pdftotext either...
When I print a PDF-file with a PS-driver and then convert the PS-file to a searchable PDF with ghostscript (pdfwrite device) something is wrong with the final pdf file. It becomes corrupt.
In some cases, the space character disappears, and in other cases the text width becomes too large so text overlap text.
The settings for gs are -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dEmbedAllFonts=true -dSubsetFonts=false -sOutputFile=output.pdf input.ps
I am wondering if it is Ghostscript that just can't produce good output when the input file is a PDF.
If I print a Word document, everything works fine!
Are there any other solutions, such as using an XPS driver and converting the XPS file to a searchable PDF instead? Are there any tools out there that can do this?
I use gs 9.07.
Best regards
Joe
Why are you going through the step of printing the PDF file to a PostScript file? Ghostscript is already capable of accepting a PDF file as input.
This simply adds more confusion; it certainly won't add anything useful.
It's not possible to say what the problem 'might' be without seeing the original PDF file and the PostScript file produced by your driver. My guess would be that whatever application is processing the PDF hasn't embedded the font, or that the PostScript driver hasn't been able to convert the font into something suitable for PostScript, resulting in the font being missing in the output, and the pdfwrite device having to substitute 'something else' for the missing font.
Ghostscript (more accurately the pdfwrite device) is perfectly capable of producing a decent PDF file when the input is PDF, but your input isn't PDF, it's PostScript!
To be perfectly honest, if your original PDF file isn't 'searchable', it's very unlikely that the PDF file produced by pdfwrite will be either, no matter whether you use the original PDF or mangle it into PostScript instead.
The usual reason a PDF file is not 'searchable' is that there is no ToUnicode information, the font uses a custom encoding, and it does not use standard glyph names. If that is the case, there is nothing you can do with the PDF file except OCR it.
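Acting on the advice above, the PS-driver step disappears and the PDF is handed straight to pdfwrite. This is a sketch with placeholder file names, reusing the font options from the question:

```shell
# Direct PDF-to-PDF pass: no PostScript intermediate, so fonts are
# not mangled by the print driver on the way through.
gs -dNOPAUSE -dBATCH -sDEVICE=pdfwrite \
   -dEmbedAllFonts=true -dSubsetFonts=false \
   -sOutputFile=output.pdf input.pdf
```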
I have a PDF file that I would like to optimize. I am receiving the file from an outside source so I don't have the means to recreate it from the beginning.
When I open the file in Acrobat and query the resources, it says that the fonts in the file take up 90%+ of the space. If I save the file as postscript and then save the postscript file to an optimized PDF, the file is significantly smaller (upwards of 80% smaller) and the fonts are still embedded.
I am trying to recreate these results with ghostscript. I have tried various permutations of options with pswrite and pdfwrite but what happens is when I do the initial conversion from PDF to Postscript, the text gets converted to an image. When I convert back to PDF the font references are gone so I end up with a PDF file that has 'imaged' text rather than actual fonts.
The file contains 22 embedded custom Type1 fonts which I have. I have added the fonts to the ghostscript search path and proved that ghostscript can find them with:
gs \
-I/home/nauc01 \
-sFONTPATH=/home/nauc01/fonts/Type1 \
-o 3783QP.pdf \
-sDEVICE=pdfwrite \
-g5950x8420 \
-c "200 700 moveto" \
-c "/3783QP findfont 60 scalefont setfont" \
-c "(TESTING !!!!!!) show showpage"
The resulting file has the font correctly embedded.
I have also tried using ghostscript to go from PDF to PDF like this:
gs \
-sDEVICE=pdfwrite \
-dNOPAUSE \
-I/home/nauc01 \
-dBATCH \
-dCompatibilityLevel=1.4 \
-dPDFSETTINGS=/printer \
-dCompressFonts=true \
-dSubsetFonts=true \
-sOutputFile=output.pdf \
input.pdf
but the output is usually larger than the input, and I can't view the file in anything but Ghostscript (Adobe Reader gives "Object label badly formatted").
I can't provide the original file because they contain confidential information but I will try to answer any questions that need to be answered regarding them.
Any ideas? Thanks in advance.
Don't use pswrite. As you've discovered, this will render text; instead use the ps2write device, which retains fonts and text.
You don't say which version of Ghostscript you are using but I would recommend you use a recent one.
One point: Ghostscript isn't 'optimising' the PDF the way Acrobat does, it's re-creating it. The original PDF is fully interpreted to produce a sequence of operations that mark the page; pdfwrite (and ps2write) then make a new file which only has those operations inside.
If you choose to subset fonts, then only the required glyphs will be included. If the original PDF contains extraneous information (Adobe Illustrator, for example, usually embeds a complete copy of the .ai file) then this will be discarded. This may result in a smaller file, or it may not.
Note that pdfwrite does not support compressed xref and some other later features at present, so some files may well get bigger.
I would personally not go via ps2write, since this just adds another layer of processing and discards information. I would just use pdfwrite to create a new PDF file. If you find files for which this does not work (using current code), then you should raise a bug report at http://bugs.ghostscript.com so that someone can address the problem.
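Putting that advice together with the asker's setup, a direct PDF-to-PDF pass might look like the following sketch. It reuses the font search path from the question; the remaining options are illustrative, and the output size will vary with how much extraneous data the original contains:

```shell
# Single pdfwrite pass, no PostScript intermediate. The -I and
# -sFONTPATH values come from the question; subsetting keeps only
# the glyphs actually used, which is where the size saving comes from.
gs -dNOPAUSE -dBATCH \
   -I/home/nauc01 \
   -sFONTPATH=/home/nauc01/fonts/Type1 \
   -sDEVICE=pdfwrite \
   -dCompatibilityLevel=1.4 \
   -dPDFSETTINGS=/printer \
   -dSubsetFonts=true \
   -sOutputFile=output.pdf input.pdf
```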
You might want to try the Multivalent Compress tool. It has an (experimental) option to subset embedded fonts that might make your PDF much smaller. It also contains a lot of switches that allow for better compression, sometimes at the cost of quality (JPEG compression of bitmaps, for example).
Unfortunately, the most recent version of Multivalent no longer includes the tools. Google for Multivalent20060102.jar; that version still includes them. To run Compress:
java -classpath /path/to/Multivalent20060102.jar tool.pdf.Compress [options] <pdf file>