I was trying to convert a PDF document into a text file. Everything works until I open the output file: it is unreadable, and the characters appear in what looks like a Chinese font
" 琀攀猀琀 "
This is my command line:
gswin64c.exe -ps2ascii -sDEVICE=txtwrite -sOutputFile=outputtext.txt test.pdf
Am I doing something wrong?
You haven't posted the file, so it's not possible to be absolutely certain; however...
Almost certainly the text in your PDF file is not encoded using an ASCII encoding scheme (it possibly contains subset fonts), and does not contain a ToUnicode CMap for the font in question. Additionally, the glyph names are not standard names (or it's a TrueType font, which doesn't have named glyphs).
Without any of the above information txtwrite doesn't have any clue what the character codes represent, and so simply emits them verbatim.
Given that you are seeing Chinese glyphs, I would suspect that the original font is a CIDFont, probably a TrueType font, subset, and with no ToUnicode CMap.
In this case, the only way to get the text out will be to use OCR.
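As an aside, the exact sample shown in the question is also precisely what the ASCII string "test" looks like when UTF-16 bytes are read with the wrong byte order, so it may be worth checking how the output file is being opened (and with which encoding assumed) before reaching for OCR. A quick Python sketch of that coincidence:

```python
# "test" written as UTF-16 little-endian, then misread as big-endian,
# yields exactly the CJK-looking characters from the question.
garbled = "test".encode("utf-16-le").decode("utf-16-be")
print(garbled)  # 琀攀猀琀
```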
Related
PDFs in the following link look sane but give garbled text when copied, because they lack a CMap. I don't understand why rendering works properly, and I want to know which information is used to determine the characters to render.
https://github.com/angea/PDF101/tree/master/handcoded/textextract
Please note that I'm asking about the mechanism, not for a solution.
According to https://tex.stackexchange.com/questions/526157/what-is-identity-h-encoding-should-it-be-avoided-and-if-so-how:
A CMap is a mapping between character codes and CIDs (with a CIDToGIDMap taking CIDs to glyph IDs). Rendering works when the CMap and the font file are available.
A ToUnicode CMap is a mapping between character codes and Unicode values. It is what makes extracting text from the PDF possible.
There are two fonts on page 1, and both are missing any encoding information in the PDF metadata. In particular there is no ToUnicode map.
Therefore PDF readers have to rely on the font itself, and possibly the character codes used in the content stream.
In the screenshot below, the left side is the font data in the PDF, and on the right is the content stream of the first page. As you can see, the first character code is 0x2e, which maps to the glyph "T" but in Unicode U+002e is "period". The next character code is 0x08, which is a control character. This is why, if you select text from the PDF, the first character would be "." and the second would be garbage.
Why can a PDF without a CMap still render characters?
Because the font's internal cmap maps the character codes in the PDF page content stream to the correct glyphs in the font, so you see glyphs that make sense. However, both the PDF and the font itself are missing any sensible Unicode mappings, hence you get garbage when you copy and paste the text.
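The two mappings involved can be sketched with plain Python dictionaries (the 0x2e code and its "T" glyph are taken from the screenshot discussion above; the glyph names are hypothetical):

```python
# Font-internal cmap: character code -> glyph ID. This is what the
# renderer uses, so the page *displays* correctly.
font_cmap = {0x2E: "glyph_T", 0x08: "glyph_h"}  # hypothetical glyph IDs

# ToUnicode CMap: character code -> Unicode. This is what copy/paste
# and text extraction need. Here it is missing entirely.
to_unicode = {}

def extract(code):
    # With no ToUnicode entry, a viewer can only fall back to treating
    # the raw code as a character code in some default encoding.
    return to_unicode.get(code, chr(code))

print(extract(0x2E))  # '.' -- the glyph drawn was "T", but you paste a period
```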
Referring to this post, GhostScript Conversion Font Issues, is it safe to assume that GhostScript's PS-to-PDF conversions still do not guarantee cut-&-paste text from the converted document? Because I too am getting garbled copy-&-paste results with formatted documents, although it works with plain text files.
sample Word document .DOC
printed to PostScript by MS PS Driver
converted to PDF by GhostScript
On the color issue, I am using the Microsoft PS Class Driver to print documents to PostScript format files, and then convert them to PDF format with the GhostScript v9.20 DLL (sample source and outputs attached above). The options used are as follows:
-dNOPAUSE
-dBATCH
-dSAFER
-sDEVICE=pdfwrite
-sColorConversionStrategy=/RGB
-dProcessColorModel=/DeviceRGB
However, it is converted without color. Have I missed some option?
You can never guarantee getting a PDF file with text you can cut and paste from a PostScript program. There is no guarantee that there is any ToUnicode information in the PostScript program, and without that, if the font is subset as here, then there is no way to know what the Unicode code point for a given glyph is.
Regarding colour, the PostScript file you have supplied contains no colour, so it's not Ghostscript; the problem is in the way you have produced the PostScript. At a guess, you have used a PostScript Printer Description (PPD) file which is for a monochrome printer.
You might be able to improve the text by playing with the options for downloading fonts, but the basic problem is that your PostScript program doesn't contain the information we need to construct a ToUnicode CMap. Without that, we are forced to assume that the character codes are ASCII, and in your case, because the fonts are subset, they are not ASCII.
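The effect of subsetting on the character codes can be illustrated with a small sketch (the code-assignment scheme here is invented, but it mirrors what subsetting drivers commonly do: codes are handed out in order of first use, not by ASCII value):

```python
def subset_encode(text):
    """Assign each distinct glyph the next free code, in order of first use
    (an invented scheme, but typical of what subsetting drivers produce)."""
    mapping, codes = {}, []
    for ch in text:
        if ch not in mapping:
            mapping[ch] = len(mapping) + 1  # codes 1, 2, 3, ...
        codes.append(mapping[ch])
    return codes

codes = subset_encode("Hello")
print(codes)  # [1, 2, 3, 3, 4] -- control codes, nothing like ASCII "Hello"
```

So a consumer that assumes the codes are ASCII pastes control characters instead of the original text.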
For some reason your PostScript appears to be downloading the font as bitmaps. This is ugly, doesn't scale well, and might be the source of your inability to get ToUnicode data inserted. It may also be caused by the fonts you are using; you might try some standard system fonts (if you aren't already), such as TimesNewRoman.
While it's great that you supplied an example to look at, I'd suggest that in future you make the example smaller, much smaller. There's really no need for 13 pages of repeated content in this case. More content means it takes more time to decipher; try to keep example files to the minimum required to demonstrate the problem.
In short, it looks like both your problems are due to the way you are (or the application) generating the PostScript.
I am trying to convert a PostScript file which contains a Telugu font (i.e. Vani Bold). After converting the file to PDF I am not able to copy the text from the generated PDF file. When I look at the properties of the PDF file in the CentOS document viewer, it shows the following:
I am using the command below to convert the PostScript file to PDF:
bin/gs -dBATCH -sDEVICE=pdfwrite -dNOPAUSE -dQUIET -sOutputFile=/home/cloudera/Desktop/PrintTest/telugu.pdf /home/cloudera/Desktop/PrintTest/VirtualPrinter_27_09_2016_19_11_41_691.ps
I tried with Ghostscript 9.19 and 9.20 as well, but no change.
Following is the link to my postscript file which I am trying to convert into pdf.
click here for postscript file
I have been struggling with this for 10 days. Please provide some solution for this.
I can tell you why you can't copy & paste the text, but I'm not sure I can provide an acceptable solution.
First, not all PDF viewers can deal with Unicode characters (for example, xpdf can't, it just ignores them, while mupdf and qpdfview work).
Second, to be able to convert font glyphs to Unicode characters, the font object in the PDF file must contain a /ToUnicode property. If you look at the generated PDF after decompression (mutool clean -d), you can see that the Vani font in object 8 0 doesn't have one, while both the Arial font in object 10 0 and the Calibri font in object 12 0 do.
So very likely the Vani font is missing this Unicode translation information; you need to either add it (e.g. with FontForge) or choose a different font that has it.
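If you'd rather not read the decompressed file by hand, the presence or absence of /ToUnicode per font object can be checked with a few lines of Python (run over mutool clean -d output; the fragment below is a made-up stand-in with the same shape as the font dictionaries described above):

```python
import re

# A made-up fragment shaped like decompressed PDF font dictionaries
# (the real input would be the output of `mutool clean -d`).
pdf_text = """
8 0 obj << /Type /Font /BaseFont /Vani-Bold >> endobj
10 0 obj << /Type /Font /BaseFont /Arial /ToUnicode 11 0 R >> endobj
12 0 obj << /Type /Font /BaseFont /Calibri /ToUnicode 13 0 R >> endobj
"""

# Map each font object to whether it carries a /ToUnicode entry.
results = {}
for m in re.finditer(r"(\d+ 0 obj)(.*?)endobj", pdf_text, re.S):
    obj, body = m.groups()
    if "/Type /Font" in body:
        results[obj] = "/ToUnicode" in body

print(results)  # {'8 0 obj': False, '10 0 obj': True, '12 0 obj': True}
```

Note this is only a heuristic sketch; a real PDF would need decompressing first, and nested dictionaries can fool a flat regex.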
Related question:
https://superuser.com/questions/1124583/text-in-pdf-turns-gibberish-on-copying-but-displays-fine/1124617#1124617
I am exploring tools to convert PDF documents to PDF/A. Ghostscript seems to give out-of-the-box support for such a conversion. One issue seems to be that some TrueType fonts that are part of the original PDF document are not converted correctly. If I copy text from the converted PDF/A document and paste it into Notepad, the copied text appears garbled.
The original document text can be copied to notepad just fine.
I am using the following script:
gswin64 -dPDFA -dBATCH -dNOPAUSE -dUseCIEColor -sProcessColorModel=DeviceCMYK -sDEVICE=pdfwrite -sPDFACompatibilityPolicy=1 -sOutputFile=FilteredOutput.pdf Filtered1Page.pdf
I have uploaded a sample 1 page source PDF in Google Drive:
SampleInput
A sample output PDF/A document generated from the command is in Google drive here:
SampleOutput
Running the above query on this PDF in a windows machine will reproduce the issue.
Are there any settings / commands make the PDF/A conversion to be handled properly?
Copy and paste from a PDF is not guaranteed. Subset fonts will not have a usable Encoding (such as ASCII or UTF-8), in which case they will only be amenable to cut/paste/search if they have an associated ToUnicode CMap, and many PDF files do not contain ToUnicode CMaps.
Of course, the PDF/A specification states (oddly, in my opinion) that you should not use subset fonts, but it's not always possible to tell whether a font is subset (not all creators follow the XXXXXX+ convention), and even if the font isn't subset, there still isn't any guarantee that its Encoding is usable.
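The subset-tag convention mentioned here can at least be checked mechanically: a subset font's /BaseFont name is supposed to begin with six uppercase letters and a plus sign. A sketch (the tag below is invented, and, as noted, not all creators follow the convention, so a negative result proves nothing):

```python
import re

def looks_subset(base_font):
    """True if the /BaseFont name follows the ABCDEF+Name subset-tag convention."""
    return re.match(r"[A-Z]{6}\+", base_font) is not None

print(looks_subset("EOODIA+FreeSansBold"))  # True  (invented subset tag)
print(looks_subset("Arial,Bold"))           # False
```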
Looking at the file you have posted, it does not contain one of the fonts it uses (Arial,Bold) and so Ghostscript substitutes with DroidSansFallback, and the font it does contain (FreeSansBold) is a subset (FWIW this font doesn't actually seem to be used....). The fallback font is a CIDFont, so there is no real prospect of the text being 'correct'.
I believe that if you make a real font available to Ghostscript to replace Arial,Bold then it will probably work correctly. This would also fix the rather more obvious problem of the spacing of the characters being incorrect (in one place, wildly incorrect), which is caused by the fallback font having different widths to the original.
NB as the warning messages have already told you don't use -dUseCIEColor.
The fact that you cannot copy/paste/search a PDF does not mean that it is not a valid PDF/A-1b file, though, so this does not mean that the creation (NOT conversion) of the PDF/A-1b is not 'proper'.
I am trying to copy some text from a PDF, but when I paste it into a Word file, it is just garbage, something like മുഖവുര. The PDF is in the Malayalam language. When I look at File -> Properties -> Fonts, it says BRHMalayalam (Embedded Subset), as shown in the screenshot.
I installed various Malayalam fonts but still no luck. Can anyone please guide me?
The PDF I am trying to copy from is https://drive.google.com/open?id=0B3QCwY9Vanoza0tBdFJjd295WEE&authuser=0
Installing fonts won't help, since the fonts are embedded in the document; the reader will use the ones in the document.
In fact it almost certainly must use the ones in the document, because it will probably have used character codes specific to each font subset.
Your PDF probably has character codes which are not Unicode values, and does not contain ToUnicode CMaps for the fonts in question (note the same font name embedded multiple times). There is no realistic way to copy the text.
The best you can do is OCR it.
After looking at the file, and confirming the answer already given by @KenS, the problem with this PDF document is in fact how it's constructed, or rather how the font in the document has been embedded.
The document contains a number of Times and Arial fonts, for which the text can be copied successfully. Those fonts are embedded as a subset with a WinAnsi encoding. What is actually in the file is close enough to that, that the text seems to copy out well.
The problem font (BRHMalayalam) is also embedded as a subset, and its encoding is also set as WinAnsiEncoding, which makes no sense at all.
And because the font doesn't contain a ToUnicode mapping table, a PDF viewer has no choice, when copying and pasting, but to assume the characters in the PDF really are WinAnsi encoded, which means you end up with garbled Latin characters.
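That fallback can be mimicked in a couple of lines (the byte values below are invented; the point is that subset character codes pushed through WinAnsi/cp1252 come out as unrelated Latin characters):

```python
# Invented subset character codes from a content stream.
codes = bytes([0x8A, 0x9E, 0xE5])

# With no ToUnicode map, the viewer trusts the declared WinAnsiEncoding
# (cp1252), so the Malayalam text pastes as unrelated Latin characters.
pasted = codes.decode("cp1252")
print(pasted)  # Šžå
```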
Just convert the PDF file to a Word file, and then edit, copy, or modify the text in the file. Simple :)
After you're done, go to File -> Save As and change the format from DOC to PDF. Hope you understood :)