I am using TCPDF to create PDF files.
Because TCPDF has a bug in its font subsetting (link to bug),
I use the following Ghostscript command to subset the fonts in the TCPDF-created PDF file:
gswin64c.exe -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 \
-dPDFSETTINGS=/prepress -dUseFlateCompression=false -dEmbedAllFonts=true \
-dSubsetFonts=true -sOutputFile="out.pdf" "input.pdf"
It works great and reduces the file size. But when I extract the text (with poppler's pdftotext) or when I open the file in a PDF viewer and select text, I get gibberish for non-ASCII text.
To reproduce it, here is the file before Ghostscript and the file after Ghostscript.
If you open them in Adobe Reader, copy the text, and paste it somewhere else, you can see that text copies correctly from the "before GS" file, but from the second file you get gibberish unless you copy English characters (the files are in Hebrew).
Other than that, the file looks great.
Do you have any idea how to preserve the UTF-8 text in Ghostscript?
Yes: don't subset the fonts. Subsetting the fonts causes them to be re-encoded. Because your fonts don't have a ToUnicode CMap, copy/paste only works by heuristics (i.e. the character codes have to be meaningful). In your case the character codes are, or appear to be, Unicode, so you are in luck: the heuristics work.
Once you subset the fonts, Ghostscript re-encodes them. So the character codes are no longer Unicode. In the absence of a ToUnicode CMap, the copy/paste no longer works.
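Since you are already using poppler, you can see this in both files with its pdffonts utility (a quick check, assuming the file names from your command above): the "sub" column shows whether each font is subset, and the "uni" column shows whether a ToUnicode CMap is present:
pdffonts input.pdf
pdffonts out.pdf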
The only way you can get this to work is to not re-encode the fonts, which means you cannot subset them using Ghostscript's pdfwrite device. In fact, because you are using CIDFonts with TrueType outlines, you can't avoid subsetting the fonts, so basically, this won't work.
Please bear in mind that Ghostscript's pdfwrite device is not intended as a tool for manipulating PDF files!
By the way, your PDF file has other problems: it scales a font to 0 (Tf operator), and it has a Form whose BBox has all co-ordinates set to 0 (and indeed the Form has no content, so it is pointless). This is in addition to a CIDFont with no ToUnicode CMap. Perhaps you should consider a different tool for producing PDF files.
Related
I have a pdf (created with latex with \usepackage[a-2b]{pdfx}) where I am able to correctly copy & paste ligatures, i.e., "ﬁ" gets pasted in my text editor as "fi". The pdf is quite large, so I'm trying to reduce its size with this ghostscript command:
gs -dPDFA-2 -dBATCH -dNOPAUSE -sPDFACompatibilityPolicy=1 -sDEVICE=pdfwrite \
-dPDFSETTINGS=/printer -sProcessColorModel=DeviceRGB \
-sColorConversionStrategy=UseDeviceIndependentColor \
-dColorImageDownsampleType=/Bicubic -dAutoRotatePages=/None \
-dCompatibilityLevel=1.5 -dEmbedAllFonts=true -dFastWebView=true \
-sOutputFile=main_new.pdf main.pdf
While this produces a nice, small pdf, now when I copy a word containing "ﬁ", I instead (often) get "ő".
Since the correct characters are somehow encoded in the original PDF, is there some parameter I can give Ghostscript so that it simply preserves this information in the converted PDF?
I'm using Ghostscript 9.27 on macOS 10.14.
Without seeing your original file, so that I can see the way the text is encoded, it's not possible to be definitive. It certainly is not possible to have the pdfwrite device 'preserve the information'; for an explanation, see here.
If your original PDF file has a ToUnicode CMap then the pdfwrite device should use it to generate a new ToUnicode CMap in the output file, maintaining cut & paste and search. If it doesn't, then the conversion process will destroy the encoding. You might be able to get an improvement in results by setting SubsetFonts to false, but that's just a guess without seeing an example.
My guess is that your original file doesn't have a ToUnicode CMap, which means that it's essentially only working by luck.
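If you want to test that guess, a minimal sketch (using the file names from your question): re-run your command with -dSubsetFonts=false appended and compare copy & paste; expect a larger output file:
gs -dPDFA-2 -dBATCH -dNOPAUSE -sPDFACompatibilityPolicy=1 -sDEVICE=pdfwrite \
-dPDFSETTINGS=/printer -sProcessColorModel=DeviceRGB \
-sColorConversionStrategy=UseDeviceIndependentColor \
-dColorImageDownsampleType=/Bicubic -dAutoRotatePages=/None \
-dCompatibilityLevel=1.5 -dEmbedAllFonts=true -dFastWebView=true \
-dSubsetFonts=false -sOutputFile=main_new.pdf main.pdf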
Referring to this post, GhostScript Conversion Font Issues, is it safe to assume that GhostScript's PS-to-PDF conversions still do not guarantee cut-&-paste text from the converted document? Because I too am getting garbled copy-&-paste results with formatted documents, although it works with plain text files.
sample Word document .DOC
printed to PostScript by MS PS Driver
converted to PDF by GhostScript
On the color issue, I am using the Microsoft PS Class Driver to print documents to PostScript format files, and then convert them to PDF format with the GhostScript v9.20 DLL (sample source and outputs attached above). The options used are as follows:
-dNOPAUSE
-dBATCH
-dSAFER
-sDEVICE=pdfwrite
-sColorConversionStrategy=/RGB
-dProcessColorModel=/DeviceRGB
However, it is converted without color. Have I missed some option?
You can never guarantee getting a PDF file with text you can cut and paste from a PostScript program. There is no guarantee that there is any ToUnicode information in the PostScript program, and without that, if the font is subset as here, then there is no way to know what the Unicode code point for a given glyph is.
Regarding colour: the PostScript file you have supplied contains no colour, so it's not Ghostscript; the problem is in the way you have produced the PostScript. At a guess, you have used a PostScript Printer Description (PPD) file which is for a monochrome printer.
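You can check this yourself; a rough sketch, assuming the PostScript file is plain text and named input.ps: search it for colour-setting operators, where zero matches suggests the driver emitted a monochrome job:
grep -c -E "setrgbcolor|setcmykcolor|sethsbcolor" input.ps
(On Windows, findstr "setrgbcolor setcmykcolor sethsbcolor" input.ps does the same.)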
You might be able to improve the text by playing with the options for downloading fonts, but the basic problem is that your PostScript program doesn't contain the information we need to be able to construct a ToUnicode CMap. Without that we are forced to assume that the character codes are ASCII, and in your case, because the fonts are subset, they are not ASCII.
For some reason your PostScript appears to download the fonts as bitmaps. This is ugly, doesn't scale well, and might be the source of your inability to get ToUnicode data inserted. It may also be caused by the fonts you are using; you might try some standard system fonts (if you aren't already), like TimesNewRoman.
While it's great that you supplied an example to look at, I'd suggest that in future you make the example smaller, much smaller. There's really no need for 13 pages of repeated content in this case; more content takes more time to decipher, so try to keep example files to the minimum required to demonstrate the problem.
In short, it looks like both your problems are due to the way you are (or the application) generating the PostScript.
I am exploring tools to convert PDF documents to PDF/A. Ghostscript seems to provide out-of-the-box support for such a conversion. One issue seems to be that some TrueType fonts that are part of the original PDF document are not converted correctly: if I copy text from the converted PDF/A document and paste it into Notepad, the copied text appears garbled.
The original document's text can be copied to Notepad just fine.
I am using the following script:
gswin64 -dPDFA -dBATCH -dNOPAUSE -dUseCIEColor -sProcessColorModel=DeviceCMYK -sDEVICE=pdfwrite -sPDFACompatibilityPolicy=1 -sOutputFile=FilteredOutput.pdf Filtered1Page.pdf
I have uploaded a sample 1 page source PDF in Google Drive:
SampleInput
A sample output PDF/A document generated from the command is in Google drive here:
SampleOutput
Running the above command on this PDF on a Windows machine will reproduce the issue.
Are there any settings or commands that make the PDF/A conversion handle this properly?
Copy and paste from a PDF is not guaranteed. Subset fonts will not have a usable Encoding (such as ASCII or UTF-8), in which case they will only be amenable to cut/paste/search if they have an associated ToUnicode CMap, many PDF files do not contain ToUnicode CMaps.
Of course, the PDF/A specification states (oddly, in my opinion) that you should not use subset fonts, but it's not always possible to tell whether a font is subset (not all creators follow the XXXXX+ convention), and even if the font isn't subset there still isn't any guarantee that its Encoding is one that is usable.
Looking at the file you have posted, it does not contain one of the fonts it uses (Arial,Bold), so Ghostscript substitutes DroidSansFallback, and the font it does contain (FreeSansBold) is a subset (FWIW, this font doesn't actually seem to be used). The fallback font is a CIDFont, so there is no real prospect of the text being 'correct'.
I believe that if you make a real font available to Ghostscript to replace Arial,Bold then it will probably work correctly. This would also fix the rather more obvious problem of the spacing of the characters being incorrect (in one place, wildly incorrect), which is caused by the fallback font having different widths to the original.
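A sketch of that fix, assuming a genuine Arial Bold TrueType file is present in C:/Windows/Fonts (adjust the path for your system): point Ghostscript at that directory with -sFONTPATH so it can use the real font instead of substituting DroidSansFallback (note -dUseCIEColor is dropped, as explained below):
gswin64 -dPDFA -dBATCH -dNOPAUSE -sProcessColorModel=DeviceCMYK -sDEVICE=pdfwrite -sPDFACompatibilityPolicy=1 -sFONTPATH="C:/Windows/Fonts" -sOutputFile=FilteredOutput.pdf Filtered1Page.pdf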
NB: as the warning messages have already told you, don't use -dUseCIEColor.
The fact that you cannot copy/paste/search a PDF does not mean that it is not a valid PDF/A-1b file, though, so this does not mean that the creation (NOT conversion) of the PDF/A-1b is not 'proper'.
I'm using gs 9.10 and have successfully converted my PDF to TIFF using this command line:
gswin64c -dNOPAUSE -q -r300x300 -sDEVICE=tifflzw \
-dBATCH -sCompression=lzw -dFirstPage=1 -dLastPage=5 \
-sOutputFile=TEST.TIFF \
TEST.PDF
However, I don't want the TIFF to have the watermark that is on every page of the PDF. Is there an option to ignore the watermark layer when writing out to a TIFF?
To be honest, this sounds suspiciously like trying to circumvent copyright. Obviously I can't tell since I haven't seen your original PDF file but watermarks are often applied to 'demo' or paid-for PDF files.
In any event, without seeing the file it's impossible to say whether a watermark can be removed, because it depends on how the watermark has been applied; there are at least three different ways I can think of off-hand, and for two of those the watermark could be eliminated afterwards. There is unlikely to be a 'watermark layer' in the PDF file.
If you post a URL to the original PDF file I can look at it.
If it is just about text extraction, this command should do it:
pdftotext \
-layout \
input.pdf \
output.txt
Now, if your "watermark" is not text but some sort of image or vector graphic, it will not be part of your output.txt.
Now, if your "watermark" indeed is also text, that watermark string will also appear in your output text for each page. It should be easy to remove that string from the text afterwards (a sketch follows at the end of this answer), replacing it with nothing if unwanted.
If your "watermark" text appears as gobbledygook in output.txt, then the font type ("Type 3"?) or font encoding ("custom"?) used for the watermark text does not allow for easy text extraction, or a valid "ToUnicode" map is missing for the font used.
If your main text did not extract successfully from the PDF, and your main text is "gobbledygook", it very likely won't extract any better if you remove the "watermark" from the original PDF file before applying pdftotext either...
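If the watermark does extract as readable text, here is a sketch of stripping it afterwards, assuming the watermark is the literal string "DRAFT" (a hypothetical placeholder; substitute your actual watermark text):
pdftotext -layout input.pdf - | sed 's/DRAFT//g' > output.txt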
When I print a PDF file with a PS driver and then convert the PS file to a searchable PDF with Ghostscript (pdfwrite device), something is wrong with the final PDF file: it becomes corrupt.
In some cases the space character disappears, and in other cases the text width becomes too large so text overlaps text.
The settings for gs are: -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dEmbedAllFonts=true -dSubsetFonts=false -sOutputFile=output.pdf input.ps
I am wondering if it is Ghostscript that just can't produce good output when the input file is a PDF.
If I print a Word document, everything works fine!
Are there any other solutions, like using an XPS driver and converting the XPS file to a searchable PDF instead? Are there any solutions out there that can do this?
I use gs 9.07.
Best regards
Joe
Why are you going through the step of printing the PDF file to a PostScript file? Ghostscript is already capable of accepting a PDF file as input.
This simply adds more confusion; it certainly won't add anything useful.
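A minimal sketch, assuming your original file is named input.pdf: feed the PDF straight to the pdfwrite device with the same options you listed:
gs -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dEmbedAllFonts=true -dSubsetFonts=false -sOutputFile=output.pdf input.pdf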
It's not possible to say what the problem 'might' be without seeing the original PDF file and the PostScript file produced by your driver. My guess would be that whatever application is processing the PDF hasn't embedded the font, or that the PostScript driver hasn't been able to convert the font into something suitable for PostScript, resulting in the font being missing in the output and the pdfwrite device having to substitute 'something else' for the missing font.
Ghostscript (more accurately, the pdfwrite device) is perfectly capable of producing a decent PDF file when the input is PDF, but your input isn't PDF, it's PostScript!
To be perfectly honest, if your original PDF file isn't 'searchable', it's very unlikely that the PDF file produced by pdfwrite will be either, no matter whether you use the original PDF or mangle it into PostScript instead.
The usual reasons why a PDF file is not 'searchable' are that there is no ToUnicode information and the font is encoded with a custom encoding and does not use standard glyph names. If this is the case, there is nothing you can do with the PDF file except OCR it.