Interpretation of CID Characters in text extracted from PDF - pdf

I use pdfminer in Python to extract text from PDF documents. Some special characters are now represented as (cid:xxx). Here are two examples of lines extracted from a German text on physics:
(cid:129) der Fragestellungen
or
um a(cid:4) Atomradius(cid:4) 0;05 nm kommt die Formel der Realität sehr nahe. Resultat
Is there any way to figure out what these codes stand for? Ideally, they should be replaced by Unicode characters.
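As far as I know, pdfminer emits a (cid:N) marker when it cannot map a character code to Unicode, typically because the font lacks a usable ToUnicode mapping. In that case the meaning has to be supplied by hand. A minimal sketch of such a post-processing step follows; the table `CID_TO_UNICODE` and the function `replace_cids` are hypothetical, and the two mapped values are guesses that would have to be confirmed against the rendered page.

```python
import re

# Hypothetical mapping from CID numbers to Unicode characters.
# Both values are guesses; confirm them by looking at the rendered PDF.
CID_TO_UNICODE = {
    129: "\u2022",  # assumed: bullet
    4: "\u2248",    # assumed: approximately-equal sign
}

CID_PATTERN = re.compile(r"\(cid:(\d+)\)")


def replace_cids(text, mapping):
    """Replace (cid:N) markers using mapping; unknown markers stay untouched."""
    def substitute(match):
        cid = int(match.group(1))
        return mapping.get(cid, match.group(0))
    return CID_PATTERN.sub(substitute, text)


print(replace_cids("(cid:129) der Fragestellungen", CID_TO_UNICODE))
```

Because the same CID can mean different characters in different fonts, a table like this is only valid per font, and often only per document.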

Related

asking for the unicode of letter conjunctions

I occasionally encounter special characters while parsing PDF documents. Each is actually two English letters, like 'fi', 'tt', or 'ti', but the letters are drawn joined together and exist in the PDF string as a single character.
I checked the ToUnicode entries for these characters, but the ToUnicode CMap tables turned out to be broken, so I cannot find their Unicode values.
For example, <012E> Tj will print fi like attached picture. But in its corresponding Font's ToUnicode CMap: <012E> <0001>, which is meaningless.
Could anybody tell me their Unicode code points? Is it possible to find them from the corresponding font program?
Thanks for any advice.
fi:
tt:
ti:
First of all, what you call letter conjunctions are usually known as ligatures, so I will use that term from here on.
Unicode discourages the use of specific code points for ligatures:
The existing ligatures exist basically for compatibility and round-tripping with non-Unicode character sets. Their use is discouraged. No more will be encoded in any circumstances.
Ligaturing is a behavior encoded in fonts: if a modern font is asked to display “h” followed by “r”, and the font has an “hr” ligature in it, it can display the ligature. Some fonts have no ligatures, while others (especially fonts for non-Latin scripts) have hundreds of ligatures. It does not make sense to assign Unicode code points to all these font-specific possibilities.
(Unicode FAQ on ligatures)
Thus, you should not use the existing ligature code points.
You appear to be looking for the correct ToUnicode mapping for ligature glyphs. For this, simply remember that the values of ToUnicode mappings need not be single code points; they may be sequences of several:
n beginbfchar
srcCode dstString
endbfchar
where dstString may be a string of up to 512 bytes.
(ISO 32000-1, section 9.10.3 ToUnicode CMaps)
Concerning your example, therefore:
For example, <012E> Tj will print fi like attached picture. But in its corresponding Font's ToUnicode CMap: <012E> <0001>, which is meaningless.
Simply use
<012E> <00660069>
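To see what such a destination string stands for, it can be decoded as UTF-16BE, which is the encoding ISO 32000-1 prescribes for ToUnicode values. A small sketch (the helper name `decode_dst_string` is made up for illustration):

```python
def decode_dst_string(hex_string):
    """Decode a ToUnicode destination string such as '<00660069>' (UTF-16BE)."""
    return bytes.fromhex(hex_string.strip("<>")).decode("utf-16-be")


print(decode_dst_string("<00660069>"))        # the two letters f and i
print(repr(decode_dst_string("<0001>")))      # the control character U+0001
```

The second call shows why the original entry is useless: it maps the ligature glyph to a control character.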
If you want to use ligature code points nonetheless, consult the Wikipedia article on orthographic ligatures, which lists some ligature code points, in particular <FB01> for fi. So for your example:
<012E> <FB01>
But remember, their use is discouraged.
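If extracted text already contains ligature code points, they can be expanded back into plain letters afterwards. One way, sketched here with Python's standard `unicodedata` module, is NFKC normalization; the function name `expand_ligatures` is hypothetical. Note the assumption that it is acceptable for NFKC to fold other compatibility characters (superscripts, full-width forms and so on) as well.

```python
import unicodedata


def expand_ligatures(text):
    """Expand compatibility ligatures such as U+FB01 into their letters.

    NFKC also folds other compatibility characters, not only ligatures.
    """
    return unicodedata.normalize("NFKC", text)


print(expand_ligatures("\ufb01nd"))
```

If only ligatures should be touched, an explicit replacement table restricted to U+FB00 through U+FB06 would be the more conservative choice.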

Convert pdf to text returns weird escape sentences

I am trying to extract the text of a PDF. The PDF contains text in Czech, which includes characters such as ščřžý...
I tried numerous approaches including tika, textract, texttopdf, calibre, PDFMiner and so on.
However, I get many undefined characters, and some characters are decoded incorrectly.
I also tried encoding and decoding the text with different codecs, but had no luck.
Could you suggest possible solutions to this problem?
So far, OCR has worked best, but it mistakes o (the letter) for 0 (zero), and some letters are capitalised.

Ghostscript mangles umlauts when reading PDFs

I use this on Linux
gs -dBATCH -dNOPAUSE -sDEVICE=txtwrite -o res.txt 1.pdf
when extracting text from some hundred PDFs. However, umlauts and other special chars up to ASCII 255 get mangled. Any ideas?
cf https://archive.org/download/bnmm_gmx_1/1.pdf (contains two "ä")
Partial translation table: (the last one and all other special letters of the Turkish alphabet are mangled using non-printable chars, else I could help myself)
ä = À¤
é = À©
ç = À§
Looks like it ought to work as the fonts have a ToUnicode CMap. I'd suggest you open a bug report.
Note that you are not using ASCII: the embedded, subset fonts are CIDFonts, and the CMap in use creates 2-byte character codes (though, ridiculously, all the high bytes are 0). For example, the space is actually encoded as character code 0x0003, the '0' as code 0x0013, etc.
By the way, a simple example would be useful; it's quite hard to pick out the accented glyphs from the regular text in this file.
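To illustrate what 2-byte character codes mean for an extractor, here is a small sketch that splits a string operand into big-endian 2-byte codes and looks each one up in a ToUnicode-style table. The table `TO_UNICODE` holds only the two codes mentioned above, and the function names are hypothetical; a real extractor would build the table from the font's ToUnicode CMap.

```python
# Only the two codes named in the answer; everything else is unknown here.
TO_UNICODE = {0x0003: " ", 0x0013: "0"}


def split_codes(raw, width=2):
    """Split a raw string operand into fixed-width big-endian character codes."""
    return [int.from_bytes(raw[i:i + width], "big")
            for i in range(0, len(raw), width)]


def decode(raw, to_unicode):
    """Map each code to Unicode; unmapped codes become U+FFFD."""
    return "".join(to_unicode.get(code, "\ufffd") for code in split_codes(raw))


print(repr(decode(bytes.fromhex("001300030013"), TO_UNICODE)))
```

Reading the same bytes one at a time, as if they were ASCII or Latin-1, would yield mostly control characters, which is why the codes cannot be interpreted without the CMap.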

How can I change type 3 font using ghostscript?

I have a PostScript file which contains a Type 3 font. After converting that PostScript file to PDF using the "gs" command, I am unable to extract the text from the PDF file. Is there any way to change the Type 3 fonts to some other font, by substitution or some other means, so that I can copy the text?
This is another case of misunderstanding regarding Type 3 fonts. The fact that a font is a Type 3 font has little to do with whether a PostScript program or PDF file using the font is 'searchable' or not.
Fonts in PostScript and PDF have an 'Encoding' which maps the character codes 0-255 to a named procedure in the font. Executing that procedure draws the glyph. The character codes can be anything, but are often (for Latin fonts) chosen to match the ASCII encoding.
PDF has the additional concept of a ToUnicode CMap, which maps a character code in a font to a Unicode code point. PostScript has no analogue, because that's not what PostScript is for (it's also not what PDF was originally for, which is why ToUnicode CMaps are a later addition to the PDF standard).
In the absence of a ToUnicode CMap Acrobat uses undocumented heuristics to try and guess what the text is. The obvious one (and the only one we know of) is that it treats the character codes as ASCII.
Now, if your original PostScript program has an encoding that maps the character codes as if they were ASCII, then provided you do not subset the font, the resulting PDF file should also contain ASCII character codes. If you do subset the font, then the pdfwrite device will reorder the glyphs and the character codes will no longer be ASCII.
If your original PostScript file does not order the glyphs in the font using ASCII character codes, then there is nothing you can do other than apply OCR; the information simply is not present.
But forget about altering the font type, not only is it not likely to be possible, it isn't the problem.
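The effect of renumbering on the "treat codes as ASCII" heuristic can be shown with a toy model. The renumbering scheme in `reencode_for_subset` (codes 1, 2, 3, ... in order of first use) is purely hypothetical and is not claimed to be what pdfwrite actually does; it only demonstrates that once the codes stop being ASCII, the guess produces nothing useful.

```python
def guess_text_as_ascii(codes):
    """Mimic the heuristic: read each character code as an ASCII value."""
    return "".join(chr(c) if 32 <= c < 127 else "?" for c in codes)


def reencode_for_subset(codes):
    """Hypothetical renumbering: assign 1, 2, 3, ... in order of first use."""
    new_code = {}
    for c in codes:
        new_code.setdefault(c, len(new_code) + 1)
    return [new_code[c] for c in codes]


original = [ord(ch) for ch in "Hello"]
print(guess_text_as_ascii(original))                       # codes are ASCII
print(guess_text_as_ascii(reencode_for_subset(original)))  # codes are not
```

The glyphs drawn on the page are identical in both cases; only the character codes differ, which is why the font type itself is irrelevant.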

Programmatic extraction of Unicode character values from True type font file in C/C++

I am trying to extract the UTF-8 character value from an embedded true type font file contained in a PDF. Is anyone aware of a method of doing this? The values in the PDF might be something like '2%dd! w!|<~' and this would end up as 'Hello World' in the PDF represented by the corresponding glyphs from the TTF. I'd like to be able to extract the wchar values here. Is this possible? Does the UTF-8 value for each character exist in the TTF?
Glyph IDs do not always correspond to Unicode character values, especially with non-Latin scripts that use many ligatures and variant glyph forms, where there is no one-to-one correspondence between glyphs and characters.
Only Tagged PDF files store the Unicode text - otherwise you may have to reconstruct the characters from the glyph names in the fonts. This is possible if the fonts used have glyphs named according to Adobe's Glyph Naming Convention or Adobe Glyph List Specification - but many fonts, including the standard Windows fonts, don't follow this naming convention.
UTF-8 is an encoding of Unicode code points as byte sequences; decoding a UTF-8 stream yields a sequence of Unicode code points. In any case, PDF does not encode text using UTF-8. For TrueType text, each glyph is encoded using 8 bits.
To decode:
1. Read the Differences array and Encoding from the font definition.
2. Read 8 bits at a time and produce an "AdobeGlyphId" using the Encoding and Differences array read in step 1.
3. Use the Adobe glyph ID to look up the Unicode value.
This is detailed in section 9.10 of the PDF Specification
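The decoding steps can be sketched in Python as follows. This is an illustration under assumptions, not a complete implementation: `BASE_ENCODING` and `GLYPH_LIST` are tiny hand-picked excerpts (a real decoder would load a full base encoding and the full Adobe Glyph List), the function names are hypothetical, and the Differences array is assumed to be already parsed into Python ints and glyph-name strings.

```python
import re

# Tiny excerpts for illustration only; the real tables are much larger.
BASE_ENCODING = {32: "space", 72: "H", 101: "e", 108: "l", 111: "o"}
GLYPH_LIST = {"space": " ", "H": "H", "e": "e", "l": "l", "o": "o",
              "fi": "\ufb01", "adieresis": "\u00e4"}


def apply_differences(base, differences):
    """Step 1: overlay a Differences array [code name name ... code name ...]."""
    encoding = dict(base)
    code = 0
    for item in differences:
        if isinstance(item, int):
            code = item            # an integer sets the next code to assign
        else:
            encoding[code] = item  # a name is assigned, then the code advances
            code += 1
    return encoding


def glyph_name_to_unicode(name):
    """Step 3: look up a glyph name, also accepting names of the form uniXXXX."""
    if name in GLYPH_LIST:
        return GLYPH_LIST[name]
    match = re.fullmatch(r"uni([0-9A-F]{4})", name)
    if match:
        return chr(int(match.group(1), 16))
    return "\ufffd"                # unknown glyph name


def decode(data, encoding):
    """Step 2: read one byte at a time and map code -> glyph name -> Unicode."""
    return "".join(glyph_name_to_unicode(encoding.get(byte, ".notdef"))
                   for byte in data)


encoding = apply_differences(BASE_ENCODING, [1, "fi", "adieresis"])
print(decode(b"He\x01\x02", encoding))
```

When the font's glyph names do not follow the Adobe naming conventions, the lookup in step 3 fails and every such glyph comes out as U+FFFD, which matches the limitation described above.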