Ghostscript mangles umlauts when reading PDFs - pdf

I use this on Linux
gs -dBATCH -dNOPAUSE -sDEVICE=txtwrite -o res.txt 1.pdf
when extracting text from a few hundred PDFs. However, umlauts and other special chars up to ASCII 255 get mangled. Any ideas?
cf https://archive.org/download/bnmm_gmx_1/1.pdf (contains two "ä")
Partial translation table: (the last one and all other special letters of the Turkish alphabet are mangled using non-printable chars, else I could help myself)
ä = À¤
é = À©
ç = À§

Looks like it ought to work as the fonts have a ToUnicode CMap. I'd suggest you open a bug report.
Note, you are not using ASCII, the embedded and subset fonts are CIDFonts and the CMap in use creates 2-byte character codes (though ridiculously all the high bytes are 0). But for example, the space is actually encoded as character code 0x0003, the '0' is code 0x0013 etc.
By the way, a simpler example would be useful; it's quite hard to pick out the accented glyphs from the regular text in this file.
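For what it is worth, the three pairs in the question's translation table can be compared byte by byte. The sketch below assumes the mangled output was viewed as Latin-1 (an assumption; the question does not say which viewer or locale was used). Under that assumption each mangled pair has the same trailing byte as the correct UTF-8 encoding, and only the lead byte differs (0xC0 instead of 0xC3). That is only a diagnostic hint for a bug report, not a confirmed cause.

```python
# Pairs copied from the question: expected character -> mangled output.
PAIRS = {
    "\u00e4": "\u00c0\u00a4",  # ä -> À¤
    "\u00e9": "\u00c0\u00a9",  # é -> À©
    "\u00e7": "\u00c0\u00a7",  # ç -> À§
}

def compare(expected, mangled):
    """Return (correct UTF-8 bytes, bytes actually seen)."""
    good = expected.encode("utf-8")
    # Assumption: the mangled text is what Latin-1 shows for the raw bytes.
    seen = mangled.encode("latin-1")
    return good, seen

for expected, mangled in PAIRS.items():
    good, seen = compare(expected, mangled)
    print(expected, good.hex(), seen.hex(),
          "same trailing byte:", good[1] == seen[1])
```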

Related

asking for the unicode of letter conjunctions

I occasionally encounter some special characters while parsing PDF documents. They are actually two English letters, like 'fi', 'tt', or 'ti', but visually they look joined, and in the PDF string they exist as a single character.
I checked the ToUnicode CMap for these characters, but the table appears to be corrupted, so I cannot find their Unicode values.
For example, <012E> Tj will print fi like attached picture. But in its corresponding Font's ToUnicode CMap: <012E> <0001>, which is meaningless.
Could anybody tell me their Unicode code points? Is it possible to find them from the corresponding font program?
Thanks for any advice.
First of all, what you call letter conjunctions are usually known as ligatures. Thus, I will use that term from now on.
Unicode discourages the use of specific code points for ligatures:
The existing ligatures exist basically for compatibility and round-tripping with non-Unicode character sets. Their use is discouraged. No more will be encoded in any circumstances.
Ligaturing is a behavior encoded in fonts: if a modern font is asked to display “h” followed by “r”, and the font has an “hr” ligature in it, it can display the ligature. Some fonts have no ligatures, while others (especially fonts for non-Latin scripts) have hundreds of ligatures. It does not make sense to assign Unicode code points to all these font-specific possibilities.
(Unicode FAQ on ligatures)
Thus, you should not use the existing ligature code points.
You appear to be looking for the correct ToUnicode mapping for ligature glyphs. For this, simply remember that the values of ToUnicode mappings need not be single code points but may be sequences of several:
n beginbfchar
srcCode dstString
endbfchar
where dstString may be a string of up to 512 bytes.
(ISO 32000-1, section 9.10.3 ToUnicode CMaps)
Concerning your example, therefore:
For example, <012E> Tj will print fi like attached picture. But in its corresponding Font's ToUnicode CMap: <012E> <0001>, which is meaningless.
Simply use
<012E> <00660069>
If you want to use ligature code points nonetheless, consult the Wikipedia article on orthographic ligatures, which lists some ligature code points. In particular, <FB01> is fi, so for your example:
<012E> <FB01>
But remember, their use is discouraged.
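As a quick check of the two alternatives, the destination strings can be decoded outside the PDF. This is a minimal sketch, relying on the fact that ToUnicode destination strings are UTF-16BE; the helper name dst_to_text is made up for illustration. It also shows that Unicode NFKC normalization folds the discouraged ligature code point back into the two plain letters.

```python
import unicodedata

def dst_to_text(hex_string):
    """Decode the hex digits of a ToUnicode destination string (UTF-16BE)."""
    return bytes.fromhex(hex_string).decode("utf-16-be")

two_letters = dst_to_text("00660069")  # two code points: f, i
ligature = dst_to_text("FB01")         # the single ligature code point U+FB01

print(two_letters, len(two_letters))
print(ligature, len(ligature))
# NFKC normalization replaces the compatibility ligature with plain letters.
print(unicodedata.normalize("NFKC", ligature))
```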

How can I properly create multilingual metadata in pdftk

pdftk lets you set the title of a PDF with the following command:
pdftk input.pdf update_info metadata.txt output output.pdf
However, if I use special characters in the metadata.txt file (such as German or Chinese characters), it doesn't seem to work.
Here's an example of changing the title:
InfoBegin
InfoKey: Title
InfoValue: Fingerspitzengefühl is a German term.
However, the PDF ends up with a strange character in place of the ü.
In the documentation of pdftk it says that non-ASCII characters should be encoded as XML numerical entities. However, I Googled myself silly but couldn't find anything that works.
The best reference I've found is Numerical Character Reference, which is applicable to XML (and XHTML and SGML).
This is generally used to represent characters that are not directly encodable.
In your case, the character is U+00FC (ü, decimal 252), which can be written as &#252; (decimal) or &#xFC; (hexadecimal). XML has no octal form of numeric character reference.
Using a decimal reference, your file should be encoded as:
InfoBegin
InfoKey: Title
InfoValue: Fingerspitzengef&#252;hl is a German term.
Note:
If you're on 'Nix, you can use recode to encode the file.
% cat metadata.txt | recode ..xml
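If recode is not available, the same substitution can be sketched in Python with the standard xmlcharrefreplace error handler, which replaces every non-ASCII character with a decimal numeric character reference. The function name to_numeric_entities is only illustrative.

```python
def to_numeric_entities(text):
    """Replace every non-ASCII character with a decimal XML reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")

line = "InfoValue: Fingerspitzengef\u00fchl is a German term."
print(to_numeric_entities(line))
```

ASCII characters pass through unchanged, so the InfoBegin and InfoKey lines of metadata.txt are not affected.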
This answer seems better as there is no need to install extra tools. Instead, it uses PDFtk’s built-in operations dump_data_utf8 and update_info_utf8:
pdftk input.pdf update_info_utf8 metadata.txt output output.pdf
It works perfectly for Chinese.

How can I change type 3 font using ghostscript?

I have a PostScript file which contains a Type 3 font. After converting that PostScript to PDF using the "gs" command, I am unable to extract the text from the PDF file. Is there any way to change the Type 3 fonts to some other font, by substitution or some other means, so that I can copy the text?
This is another case of miscomprehension regarding Type 3 fonts. The fact that a font is a Type 3 font has little to do with whether a PostScript program or PDF file using the font is 'searchable' or not.
Fonts in PostScript and PDF have an 'Encoding' which maps the character codes 0-255 to a named procedure in the font. Executing that procedure draws the glyph. The character codes can be anything, but are often (for Latin fonts) chosen to match the ASCII encoding.
PDF has the additional concept of a ToUnicode CMap, extra information which maps a character code in a font to a Unicode code point. PostScript has no such analogue; that's not what PostScript is for (it's also not what PDF was originally for, which is why ToUnicode CMaps are a later addition to the PDF standard).
In the absence of a ToUnicode CMap Acrobat uses undocumented heuristics to try and guess what the text is. The obvious one (and the only one we know of) is that it treats the character codes as ASCII.
Now, if your original PostScript program has an encoding that maps the character codes as if they were ASCII, then provided you do not subset the font, the resulting PDF file should also contain ASCII character codes. If you do subset the font then the pdfwrite device will reorder the glyphs and the character codes will no longer be ASCII.
If your original PostScript file does not order the glyphs in the font using ASCII character codes then there is nothing you can do other than apply OCR; the information simply is not present.
But forget about altering the font type, not only is it not likely to be possible, it isn't the problem.
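The effect of losing the ASCII ordering can be illustrated with a toy model. This is only a sketch: the renumbering scheme below (codes handed out in order of first use) is a made-up stand-in and not pdfwrite's actual algorithm, and guess_as_ascii merely mimics the "treat the codes as ASCII" heuristic described above.

```python
def encode_ascii_order(text):
    """Character codes equal the ASCII values, as in an unsubsetted font."""
    return list(text.encode("ascii"))

def encode_reordered(text):
    """Hypothetical subset: codes are handed out in order of first use."""
    codes = {}
    out = []
    for ch in text:
        if ch not in codes:
            codes[ch] = len(codes) + 1
        out.append(codes[ch])
    return out

def guess_as_ascii(codes):
    """What a reader without a ToUnicode CMap might do: treat codes as ASCII."""
    return "".join(chr(c) for c in codes)

# With ASCII-ordered codes the guess recovers the text.
print(guess_as_ascii(encode_ascii_order("Hello")))
# After renumbering, the same guess yields only control characters.
print(repr(guess_as_ascii(encode_reordered("Hello"))))
```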

Adding encoding in postscript, ghostscript renders text correctly, but converting to PDF does not show the characters

We have to construct a PostScript file that contains Arabic text as well as English text.
Ghostscript shows the Arabic text correctly, but converting it to PDF does not show the Arabic letters.
The PS file contains the following:
/TraditionalArabic findfont dup
length dict
copy begin
/Encoding Encoding 256 array copy def
Encoding 1 /kafinitialarabic put
Encoding 2 /behinitialarabic put
Encoding 3 /yehmedialarabic put
Encoding 4 /seenfinalarabic put
Encoding 5 /eacute put
Encoding 6 /a put
/ArabicTradDict currentdict definefont pop
end
%%Page: 1 1
%%BeginPageSetup
%%PageMedia: Color Weight Type
<< /MediaColor (Blue)/MediaWeight 75 /MediaType () /xx {2.803464567 mul} def /xx {2.83464567 mul} def /PageSize [240 xx 345 xx]>> setpagedevice
%%EndPageSetup
/ArabicTradDict 18 selectfont
72 xx 300 xx moveto
(\004\003\002\001) show
showpage
To run Ghostscript from the command line with all Windows fonts included:
gswin64.exe -sFONTPATH=%windir%/fonts -dEmbedAllFonts=true
To convert the PS file to a PDF file, I run the following command:
gswin64.exe -dBATCH -dNOPAUSE -sOutputFile=c:/Users/mob/Desktop/TimesNewRomanPSMT.pdf -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress -dCompressFonts=false -dSubsetFonts=false -sFONTPATH=%windir%/fonts -dEmbedAllFonts=true -f c:/Users/mob/Desktop/TimesNewRomanPSMT.ps
After converting to PDF, the Arabic characters are not shown correctly; they appear as meaningless squares.
If I use the Adobe tool to convert to PDF, the result is the same, except that eacute (\005), if included in the PS file, does show after conversion, whereas with the command line above none of the characters added through the Encoding are shown correctly.
Any help with that?
Thanks to KenS's hints I was able to solve my problem. The encoding used wrong glyph names such as kafinitialarabic (wrong in the sense that the PDF conversion could not resolve them); every name that ended in "arabic" was wrong, because the Traditional Arabic font does not use those names for its characters. To find out which names it does use, I converted the TTF font to AFM and PFA with the following command. This converts the TrueType font to a Type 42 font, which is understood once it is embedded in the PostScript file at conversion to PDF:
C:\Program Files\gs\gs9.10\bin>gswin64c.exe -dNODISPLAY -q -- ttf2pf.ps times timesPS timesAFM
where times is the TTF font name. I then checked the generated PFA file for the characters I wanted to add: instead of kafinitialarabic there was kafinitial, instead of kafmedialarabic there was kafmedial, and so on.
Adding those names to the encoding works fine now, but I would like to avoid adding all those characters to the dictionary and instead use the font the way we normally do with setfont in PostScript, if that is possible.
As already suggested, you need to ensure the glyph names you use are in the font you use, or create a new font.
I haven't found anything that will choose the correct glyph from the set of initial, medial, final, isolated, depending on context, though.
I resorted to writing a program which takes Unicode Arabic, reverses the Arabic characters, and then decides which form of each character to use based on its position in a word and on whether the previous or next characters are forced into isolated or final forms. Unfortunately I had to embed quite a lot of intrinsic knowledge about the font in use and its glyph names, including the typos in them, into the program.
If that's of interest, I've stuck it on github, but it's very raw and initial.
It does work, though.
https://github.com/gbjk/arabic2ps
The font I used was a traditional arabic font, with quite a few idiosyncrasies.
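The position-based choice of form described above can be sketched roughly as follows. This is not the code from the linked repository, just a minimal illustration under stated assumptions: letters are given as base glyph names in logical order, the suffixes initial, medial and final follow the names found in the generated PFA file (kafinitial, kafmedial), the isolated suffix is hypothetical, and RIGHT_JOINING is a partial set of letters that do not connect to the letter after them.

```python
# Letters that join only to the preceding letter (partial, illustrative set).
RIGHT_JOINING = {"alef", "dal", "thal", "reh", "zain", "waw"}

def shape(letters):
    """Map base letter names (logical order) to positional glyph names."""
    glyphs = []
    for i, name in enumerate(letters):
        # The previous letter connects forward unless it is right-joining.
        joins_prev = i > 0 and letters[i - 1] not in RIGHT_JOINING
        # This letter connects forward unless it is last or right-joining.
        joins_next = i < len(letters) - 1 and name not in RIGHT_JOINING
        if joins_prev and joins_next:
            form = "medial"
        elif joins_prev:
            form = "final"
        elif joins_next:
            form = "initial"
        else:
            form = "isolated"  # hypothetical suffix; check the font's names
        glyphs.append(name + form)
    return glyphs

word = ["kaf", "beh", "yeh", "seen"]
logical = shape(word)
# show draws left to right, so the glyphs are emitted in reverse order.
visual = list(reversed(logical))
print(logical)
print(visual)
```

A real font may name its glyphs differently or contain typos, as noted above, so the generated names would still need to be checked against the font's actual glyph list.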

Programmatic extraction of Unicode character values from True type font file in C/C++

I am trying to extract the UTF-8 character value from an embedded true type font file contained in a PDF. Is anyone aware of a method of doing this? The values in the PDF might be something like '2%dd! w!|<~' and this would end up as 'Hello World' in the PDF represented by the corresponding glyphs from the TTF. I'd like to be able to extract the wchar values here. Is this possible? Does the UTF-8 value for each character exist in the TTF?
Glyph IDs do not always correspond to Unicode character values, especially with non-Latin scripts that use a lot of ligatures and variant glyph forms, where there is no one-to-one correspondence between glyphs and characters.
Only Tagged PDF files store the Unicode text - otherwise you may have to reconstruct the characters from the glyph names in the fonts. This is possible if the fonts used have glyphs named according to Adobe's Glyph Naming Convention or Adobe Glyph List Specification - but many fonts, including the standard Windows fonts, don't follow this naming convention.
UTF-8 is one way of encoding Unicode code points as byte sequences; decoding a UTF-8 stream yields a sequence of code points. In any case, PDF does not encode text using UTF-8. For TrueType text, each glyph is encoded using 8 bits.
To decode:
1. Read the differences array and encoding from the font definition.
2. Read 8 bits at a time and produce an "AdobeGlyphId" using the encoding and differences array read in step 1.
3. Use the Adobe glyph id to look up the Unicode value.
This is detailed in section 9.10 of the PDF Specification.
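The decoding steps can be sketched as below. This is a simplified illustration, not a full implementation: BASE_ENCODING and GLYPH_TO_UNICODE are tiny hypothetical stand-ins for the font's base encoding and the Adobe Glyph List, and the Differences array is written as a Python list in which an integer sets the next character code and each following name is assigned to successive codes.

```python
# Tiny stand-ins for the real tables (hypothetical, for illustration only).
BASE_ENCODING = {32: "space", 72: "H", 87: "W"}
GLYPH_TO_UNICODE = {
    "space": " ", "H": "H", "W": "W", "e": "e", "l": "l",
    "o": "o", "r": "r", "d": "d", "eacute": "\u00e9",
}

def apply_differences(base, differences):
    """Overlay a Differences array: an int sets the code, names fill from it."""
    encoding = dict(base)
    code = None
    for item in differences:
        if isinstance(item, int):
            code = item
        else:
            if code is None:
                raise ValueError("Differences array must start with a code")
            encoding[code] = item
            code += 1
    return encoding

def decode(data, encoding):
    """Map each byte to a glyph name, then each glyph name to Unicode."""
    return "".join(GLYPH_TO_UNICODE[encoding[b]] for b in data)

# Equivalent of /Differences [1 /e /l /o /r /d 200 /eacute]
encoding = apply_differences(
    BASE_ENCODING, [1, "e", "l", "o", "r", "d", 200, "eacute"]
)
print(decode(b"H\x01\x02\x02\x03 W\x03\x04\x02\x05", encoding))
```

Glyph names missing from the lookup table raise a KeyError here; a real extractor would need a fallback for fonts that do not follow the glyph naming convention.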