PDF extracted text seems to be unreadable

PDF extracted text seems to be unreadable - pdf

Situation: I've a PDF using version 1.6. In that PDF, there are several streams. There were compressed text (Flate) in that streams, so I decompressed these streams. After that, I extracted the Tj-parts of the corresponding, decompressed streams. I assumed that there would be readable text between the brackets before the Tj command, but the result was the following:
Actual Question: As I have no idea, what I've got thre, I would like to know what type of content it is. Furthermore: Is it possible to get a plain text out of these string or do I need further information to extract plain texts?
Further research: The PDFs, which I try to analyze where generated by iTextSharp (seems to be an C# Library for generating PDFs). Don't know whether it is a relevant information, but it might be that that Library uses a special way of encrypt it's text data or something...

I assumed that there would be readable text between the brackets before the Tj command
This assumption only holds for simple PDFs.
To quote from the PDF specification (ISO 32000-1):
A string operand of a text-showing operator shall be interpreted as a sequence of character codes identifying the glyphs to be painted.
With a simple font, each byte of the string shall be treated as a separate character code. The character code shall then be looked up in the font’s encoding to select the glyph, as described in 9.6.6, "Character Encoding".
With a composite font (PDF 1.2), multiple-byte codes may be used to select glyphs. In this instance, one or more consecutive bytes of the string shall be treated as a single character code. The code lengths and the mappings from codes to glyphs are defined in a data structure called a CMap, described in 9.7, "Composite Fonts".
(Section 9.4.3 - Text-Showing Operators - ISO 32000-1)
Thus,
I would like to know what type of content it is.
As quoted above, these "strings" consist of single-byte or multi-byte character codes. These codes depend on the current font's encoding. Each font object in a PDF can have a different encoding.
Those encodings may be some standard encoding (MacRomanEncoding, MacExpertEncoding, or WinAnsiEncoding) or some custom encoding. In particular in case of embedded font subsets you often find encodings where 1 is the code of the first glyph drawn on a page, 2 is the code for the second, different glyph, 3 for the third, different one, etc.
Furthermore: Is it possible to get a plain text out of these string or do I need further information to extract plain texts?
As the encoding of the string arguments of text showing instructions depends on the current font, you at least need to keep track of the current font name (Tf instruction) and look up encoding information (Encoding or ToUnicode map) from the current font object.
Section 9.10 - Extraction of Text Content - of ISO 32000-1 explains this in some more detail.
Furthermore, the order of the text showing instructions need not be the order of reading. The word "Hello" can e.g. be shown by first drawing the 'o', then going left, then the 'el', then again left, then the 'H', then going right, and finally the remaining 'l'. And two words need not be separated by a space glyph, there simply might be a text positioning instruction going right a bit.
Thus, in general you also have to keep track of the position of the separate strings drawn.

Related

asking for the unicode of letter conjunctions

I occasionally encounter some special character while parsing PDF documents. They are actually two English letters, like 'fi', 'tt', or 'ti', but visually they look like conjuncted and they actually exist in PDF string as one character.
I checked the 'ToUnicode' for these characters, but I just found the 'ToUnicode' CMap table are disrupted, therefore I cannot find their unicode.
For example, <012E> Tj will print fi like attached picture. But in its corresponding Font's ToUnicode CMap: <012E> <0001>, which is meaningless.
Could anybody let me know their unicode code point? Possible to find it from the corresponding font program?
Thanks for any advice.
fi:
tt:
ti:

First of all, what you call letter conjunctions usually is known as ligatures. Thus, I will use that term here from now on.
Unicode discourages the use of specific code points for ligatures:
The existing ligatures exist basically for compatibility and round-tripping with non-Unicode character sets. Their use is discouraged. No more will be encoded in any circumstances.
Ligaturing is a behavior encoded in fonts: if a modern font is asked to display “h” followed by “r”, and the font has an “hr” ligature in it, it can display the ligature. Some fonts have no ligatures, while others (especially fonts for non-Latin scripts) have hundreds of ligatures. It does not make sense to assign Unicode code points to all these font-specific possibilities.
(Unicode FAQ on ligatures)
Thus, you should not use the existing ligature code points.
You appear to attempt to find the correct ToUnicode mapping for ligature glyphs. For this simply remember that the values of ToUnicode mappings do not need to be single code points but may be multiple ones:
n beginbfchar
srcCode dstString
endbfchar
where dstString may be a string of up to 512 bytes.
(ISO 32000-1, section 9.10.3 ToUnicode CMaps)
Concerning your example, therefore:
For example, <012E> Tj will print fi like attached picture. But in its corresponding Font's ToUnicode CMap: <012E> <0001>, which is meaningless.
Simply use
<012E> <00660069>
If you want to use ligature code points nonetheless, query the Wikipedia article on Orthographic Ligatures, it lists some ligature code points. In particular <FB01> for ﬁ, so for your example:
<012E> <FB01>
But remember, their use is discouraged.

Writing Unicode into PDF

I have Unicode text (a sequence of Unicode codes) and a TTF font (bytes of a TTF file). I would like to write that text into a PDF file using that font.
I understand PDF quite well. I don't mind using two bytes per character. I would like to attach the TTF file as it is (charcode-to-glyf map should be used from a TTF file).
What font Subtype and Encoding value should I use? Is it possible to avoid having ToUnicode record?
I tried to use Subtype = "/TrueType", but it requires to specify FirstChar, LastChar and Widths (which are already inside TTF).

You cannot use Unicode with a Font, at all (except in the limited case of Latin, or nearly Latin, languages), because Fonts use an Encoding, and an Encoding is a single byte array. So you can't reference more than 256 characters from a Font, and a character code can't be more than a single byte.
The first problem with 'using Unicode' is that Unicode is not a simple 2-byte Encoding, its a multi-byte format, with variable lengths and sometimes a single glyph is represented by multiple Unicode code points.
So, in order to deal with this you need to use a CIDFont, not a Font. You cannot 'use the charcode-to-glyf map', by which I assume you mean the CMAP subtable in the TTF font. You must compose the CIDFont with a CMap in order to map the multiple bytes in the text string into the character codes for lookup in the CMap, which gives you the CID to reference the precise character program in the font.
It may be possible to construct a single CMap which would cover every Unicode code point, but I have my doubts, it would certainly be a huge task. However certain CMaps already exist. Adobe publish a standard list on their web site which includes CMaps such as UniCNS-UCS2-H and UniCNS-UCS2-V or UniGB-UTF8-H etc.
You can probably use one of the standard CMaps.
Note that it doesn't matter that the FirstChar, LastChar etc are already stored in the TrueType font, you still need to specify them in the PDF Font object. That's because a PDF consumer might not be rendering the text at all, it could (for example) be extracting the text, in which case it doesn't need to interpret the font provided this information is available.

Encoding of PDF dictionaries

I need to know the encoding of the values of PDF dictionaries (not the text displayed to the user but the "code behind").
I plan not to use any library for that.
Where can I find it?

the encoding of the values of PDF dictionaries
Values of PDF dictionaries are PDF objects.
You should take a look at the PDF specification ISO 32000-1, in particular chapter 7 Syntax, to find out about PDF objects. You will find:
The tokens that delimit objects and that describe the structure of a PDF file shall use the ASCII character
set. In addition all the reserved words and the names used as keys in PDF standard dictionaries and
certain types of arrays shall be defined using the ASCII character set.
Thus, most of the time you have to deal with ASCII values.
The situation is tricky with strings, though, because there are several types of strings which use the same string syntax options, so you have to interpret their contents according to their context.
Table 35 – String Object Types
Type Description
text string Shall be used for human-readable text, such as text
annotations, bookmark names, article names, and
document information. These strings shall be encoded
using either PDFDocEncoding or UTF-16BE with a
leading byte-order marker.
This type is described in 7.9.2.2, "Text String Type."
PDFDocEncoded string Shall be used for characters and glyphs that are
represented in a single byte, using PDFDocEncoding.
This type is described in 7.9.2.3, "PDFDocEncoded String
Type."
ASCII string Shall be used for characters that are represented in a
single byte using ASCII encoding.
byte string Shall be used for binary data represented as a series of
bytes, where each byte can be any value representable in
8 bits. The string may represent characters but the
encoding is not known. The bytes of the string need not
represent characters. This type shall be used for data
such as MD5 hash values, signature certificates, and Web
Capture identification values.
This type is described in 7.9.2.4, "Byte String Type."
If a string is the value e.g. of the Author metadata, it is a text string, so it is encoded using either PDFDocEncoding or UTF-16BE with a leading byte-order marker.
If on the other hand a string is the value e.g. of Contents in a signature dictionary, it is a byte string holding a binary object, any attempt to interpret it according to some encoding will fail.
The situation is even more tricky with streams.
First of all the stream content may be somehow processed, e.g. it may be compressed. To get to the actual stream contents, you first have to undo this processing.
The the content may either be binary, e.g. a font program, or text, e.g. JavaScript, or it may be a content stream, e.g. the page contents.
A content stream is a PDF stream object whose data consists of a sequence of instructions describing the
graphical elements to be painted on a page. The instructions shall be represented in the form of PDF objects,
using the same object syntax as in the rest of the PDF document.
Thus, they are mostly ASCII values. The exception again are string arguments to text drawing instructions. Their encoding depends entirely on the font currently selected when the string is drawn, and fonts may use standard encodings, but they may also use completely chaotic, ad-hoc encodings.
PS: If you happen to try and analyze an encrypted PDF, you will find that Encryption
applies to all strings and streams in the document's PDF file, with very few exceptions. In particular encryption does not apply to dictionary and array structures, numbers and names. Thus, someone not aware of this might not recognize that the PDF is encrypted but instead assume that strings and streams are encoded in a very weird way.

You find that in the PDF specification (http://www.adobe.com/devnet/pdf/pdf_reference.html). To elaborate a bit on the most important points in your question...
1) PDF dictionaries can contain a variety of value types (booleans, numbers, strings...). The encoding you are going to encounter depends on the type of value.
2) Mostly, the interesting and complex case is that where the type of object is a string.
3) For a string, read section 7.9.2 in the PDF specification. That explains what encodings can be used for such strings (PDFDocEncoding, Unicode encoding...) and how to recognise what encoding you have for a particular string.

To complement #mkl's and #DavidvanDriessche's excellent answers...
Here are three OpenSource command line tools which can help you to transform any PDF into different forms which expand/uncompress/decode object streams (Note, there is not one single, "the-one-and-only-correct" way to do this -- so the outputs of each of the tools will be different):
pdftk
mutool
qpdf
Each of these should be available via your favorite operating systems package manager.
pdftkexample usage:
pdftk in.pdf cat output out1.pdf uncompress
mutool example usage:
mutool clean -d in.pdf out2.pdf
qpdf example usage (my favorite tool for this purpose):
qpdf --qdf --object-streams=disable in.pdf out3.pdf
You should try each of these, compare their outputs for different input PDFs and then decide which one is your favorite (but never forget to remember the other tools when you encounter a case where your favorite shows unexpected results).

How get symbols of glyphs in CFF font

I am parsing a CFF file. I get table with glyphs, whick looks like
GC4G38GFCGD7G70G4BGEAG39GFDG4CGEBGFEG72G4DGEC... with names.
I have offsets to each glyph and have offsets to CharStringINDEX. I have to associate glyphs manes with symbols. What do I have to do? All offsets don't explain where is the symbol.

It depends on what kind of encoding the CFF relies on: does it follow any of the predefined charset/encodings, or is it a free charset/encoding CFF block? (which type it is is recorded in the TOP DICT structure)
If the first, the glyph names are simply defined by Adobe's charset/encoding specifications and not stored in the font itself. If you're writing a parser, you'll need to code up those predefined encodings with their associated glyph names. If a font follows one of these, it must contain every character in the set, in predefined glyph ordering (so a CFF following encoding X will always have a glyph named by the 23rd name string in glyph position 23).
If the second, the names for each glyph are stored in the String block, like every other string in a CFF (but offset by 390, because there are that many documentation-specified strings that apply to all CFF and thus do not need to be stored explicitly in the CFF data). To find the glyph's name, you use the Charset structure, which tells you which string ID belongs to which glyph. You then look in the String INDEX to find the offset that id will be at, and then you grab the string from the String data block on that offset.
I wrote http://pomax.github.io/CFF-glyphlet-fonts specifically for the purposes of explaining the CFF layout, so you'll probably want to give that a look-over.

Programmatic extraction of Unicode character values from True type font file in C/C++

I am trying to extract the UTF-8 character value from an embedded true type font file contained in a PDF. Is anyone aware of a method of doing this? The values in the PDF might be something like '2%dd! w!|<~' and this would end up as 'Hello World' in the PDF represented by the corresponding glyphs from the TTF. I'd like to be able to extract the wchar values here. Is this possible? Does the UTF-8 value for each character exist in the TTF?

Glyph ID's do not always correspond to Unicode character values - especially with non latin scripts that use a lot of ligatures and variant glyph forms where there is not a one-to-one correspondance between glyphs and characters.
Only Tagged PDF files store the Unicode text - otherwise you may have to reconstruct the characters from the glyph names in the fonts. This is possible if the fonts used have glyphs named according to Adobe's Glyph Naming Convention or Adobe Glyph List Specification - but many fonts, including the standard Windows fonts, don't follow this naming convention.

UTF-8 is an encoding that allows UTF8 encoded streams to be decoded to reveal a sequence of unicode char points. In any case, PDF does not encode using UTF-8. For true type text, each glyph is encoded using 8 bits.
To decode:
Read the differences array and encoding from the font definition
Read 8 bits at a time and produce an "AdobeGlyphId" using the encoding and differences array read in step 1.
Use the adobe glyph id to look up the unicode value
This is detailed in section 9.10 of the PDF Specification

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas