Undefined characters replacing text ("ti", "fb" for example) in Apache Tika output - pdfbox

I'm trying to parse a document using Apache Tika, which unfortunately replaces some character sequences ("ti" and "fb", for example) with an unknown Unicode symbol. I don't see a way to manage this using Tika itself, as the replacement character seems to be coming from PDFBox.
I also noticed that the character sequences in question are not part of the GlyphList. Would it be possible to add the sequences and a mapping to the GlyphList to get the expected output? I'm using Tika 1.21 with PDFBox 2.0.15.

Related

Where are named pdf characters defined like "f_f", "uni00D0" and "a204"?

I'm trying to read the official pdf specification "Document management — Portable document format — Part 1: PDF 1.7" (PDF32000_2008.pdf) as bytes and then interpret them according to that specification.
In Annex D, Character Sets and Encodings, there is a list of all named characters, like:
or
When I parse PDF32000_2008.pdf, there are also named characters like "f_f", "uni00D0" and "a204", which are missing in that specification.
My guess is that "f_f" is a symbol for two 'f' characters, which might get printed with a special glyph. There is a unicode "Latin Small Ligature Ff" for 'ff'.
For example, there is also "f_i" in that file, which I expect to mean 'fi', one glyph showing the two characters 'f' and 'i'. However, the PDF specification already has 'fi' as the named character "fi", so what is the point of having two named characters pointing to the same symbol?
I can imagine that "uni00D0" means the unicode character 'Ð'. However, pdf defines it already as named character "Eth"
What could be "a204" ? Maybe Ansi 204 'Ì', which also has already a named character "Igrave" ?
Why do they use also "a62", which would be just a '<' ?
However, my main question is: Where can I find a specification for these additional named characters ?
Of course, Adobe Acrobat understands them, but also Gmail seems not to have a problem with them. So I guess, their meaning must be specified somewhere.
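The guesses above about "f_f" and "uni00D0" can be sketched in code. This is a minimal sketch, assuming the names follow the convention in Adobe's glyph naming specification (underscore-separated components, "uniXXXX" for four-digit hex code points, "uXXXX" to "uXXXXXX" for a single code point). The function name glyph_name_to_text and the tiny SAMPLE_GLYPH_LIST are hypothetical stand-ins for illustration; a real implementation would load the full glyph list. Names that do not fit the convention, such as "a204", are not resolved here and yield an empty string.

```python
import re

# Hypothetical, tiny stand-in for a full glyph list (name -> text).
SAMPLE_GLYPH_LIST = {"f": "f", "i": "i", "fi": "\ufb01", "Eth": "\u00d0"}


def glyph_name_to_text(name, glyph_list=SAMPLE_GLYPH_LIST):
    """Map a glyph name to text using the underscore / uniXXXX conventions."""
    base = name.split(".", 1)[0]  # drop a suffix such as ".alt"
    out = []
    for part in base.split("_"):  # "f_f" -> ["f", "f"]
        if part in glyph_list:
            out.append(glyph_list[part])
        elif re.fullmatch(r"uni(?:[0-9A-F]{4})+", part):
            digits = part[3:]
            # Each group of four hex digits is one code point.
            out.append("".join(chr(int(digits[i:i + 4], 16))
                               for i in range(0, len(digits), 4)))
        elif re.fullmatch(r"u[0-9A-F]{4,6}", part):
            out.append(chr(int(part[1:], 16)))
        # Components that match nothing (e.g. "a204") contribute no text.
    return "".join(out)


print(glyph_name_to_text("f_f"))      # two separate 'f' characters
print(glyph_name_to_text("uni00D0"))  # same code point as "Eth"
```

Under this reading, "f_i" decomposes into the two characters 'f' and 'i', while the plain name "fi" maps to the single ligature code point, which would explain why both names can coexist.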

Convert pdf to text returns weird escape sentences

I am trying to extract text from pdf to text. The pdf contains text in Czech, which includes characters such as ščřžý...
I tried numerous approaches including tika, textract, texttopdf, calibre, PDFMiner and so on.
However, I am getting many undefined characters and some characters are incorrectly decoded.
I also tried to encode and decode the text with different codecs, but got no luck.
Could you suggest possible solutions to this problem?
So far, OCR has worked best, but it mistakes o (the letter) for 0 (zero), and some letters come out capitalised.

What do the ASCII characters preceding a carriage return represent in a PDF page?

This is probably a rather basic question, but I'm having a bit of trouble figuring it out, and it might be useful for future visitors.
I want to get at the raw data inside a PDF file, and I've managed to decode a page using the Python library PyPDF2 with the following commands:
import PyPDF2
with open('My PDF.pdf', 'rb') as infile:
    mypdf = PyPDF2.PdfFileReader(infile)
    raw_data = mypdf.getPage(1).getContents().getData()
    print(raw_data)
Looking at the raw data provided, I have begun to suspect that the ASCII characters preceding carriage returns are significant: every carriage return that I've seen is preceded by one. It seems like they might be some kind of token identifier. I've already figured out that /RelativeColorimetric is associated with the sequence ri\r. I'm currently looking through the PDF 1.7 standard Adobe provides, and I know an explanation is in there somewhere, but I haven't been able to find it yet in that 756-page behemoth of a document.
The defining thing here is not that \r – it is just inserted instead of a regular space for readability – but the fact that ri is an operator.
A PDF content stream uses a stack-based reverse Polish (postfix) notation syntax: value1 value2 ... valuen operator
The full syntax of your ri, for example, is explained in Table 57 on p.127:
intent ri (PDF 1.1) Set the colour rendering intent in the graphics state (see 8.6.5.8, "Rendering Intents").
and the idea is that this indeed appears in this order inside a content stream. (... I tried to find an appropriate example of your ri in use but cannot find one; not even any in the ISO PDF itself that you referred to.)
A random stream snippet from elsewhere:
q
/CS0 cs
1 1 1 scn
1.5 i
/GS1 gs
0 -85.0500031 -14.7640076 0 287.0200043 344.026001 cm
BX
/Sh0 sh
EX
Q
(the indentation comes courtesy of my own PDF reader) shows operands (/CS0, 1 1 1, 1.5 etc.), with the operators (cs, scn, i etc.) at the end of each line for clarity.
This is explained in 7.8.2 Content Streams:
...
A content stream, after decoding with any specified filters, shall be interpreted according to the PDF syntax rules described in 7.2, "Lexical Conventions." It consists of PDF objects denoting operands and operators. The operands needed by an operator shall precede it in the stream. See EXAMPLE 4 in 7.4, "Filters," for an example of a content stream.
(my emphasis)
7.2.2 Character Set specifies that inside a content stream, whitespace characters such as tab, newline, and carriage return, are just that: separators, and may occur anywhere and in any number (>= 1) between operands and operators. It mentions
NOTE The examples in this standard use a convention that arranges tokens into lines. However, the examples’ use of white space for indentation is purely for clarity of exposition and need not be included in practical use.
– to which I can add that most PDF creating software indeed attempts to delimit 'lines' consisting of an operands-operator sequence with returns.
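To make the operands-then-operator reading concrete, here is a minimal sketch that groups a decoded content stream into (operands, operator) pairs. It assumes a heavily simplified token model: only names (/X), numbers, and bare operator keywords, with no strings, arrays, dictionaries, or inline images. The function name split_instructions is hypothetical and not part of PyPDF2.

```python
import re

# White-space characters listed in 7.2.2: NUL, tab, LF, FF, CR, space.
PDF_WHITESPACE = re.compile(rb"[\x00\t\n\x0c\r ]+")
NUMBER = re.compile(rb"[+-]?(\d+\.?\d*|\.\d+)")


def split_instructions(stream):
    """Group tokens into (operands, operator) pairs.

    Simplified sketch: anything that is not a name or a number is
    treated as an operator that consumes the operands collected so far.
    """
    instructions = []
    operands = []
    for token in PDF_WHITESPACE.split(stream):
        if not token:
            continue  # leading or trailing white-space leaves empty tokens
        if token.startswith(b"/") or NUMBER.fullmatch(token):
            operands.append(token.decode("ascii"))
        else:
            instructions.append((operands, token.decode("ascii")))
            operands = []
    return instructions


sample = b"q\r/CS0 cs\r1 1 1 scn\r/RelativeColorimetric ri\rQ"
for operands, operator in split_instructions(sample):
    print(operator, operands)
```

Replacing every \r in the sample with a space or a newline gives the same result, which is the point: the carriage return is only a separator, and ri is recognised because it is an operator keyword.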

PDF extracted text seems to be unreadable

Situation: I have a PDF using version 1.6. In that PDF, there are several streams. The text in those streams was compressed (Flate), so I decompressed the streams. After that, I extracted the Tj parts of the corresponding decompressed streams. I assumed that there would be readable text between the brackets before the Tj command, but the result was the following:
Actual Question: As I have no idea what I've got there, I would like to know what type of content it is. Furthermore: Is it possible to get plain text out of these strings, or do I need further information to extract plain text?
Further research: The PDFs which I am trying to analyze were generated by iTextSharp (which seems to be a C# library for generating PDFs). I don't know whether this is relevant, but it might be that the library uses a special way of encrypting its text data or something...
I assumed that there would be readable text between the brackets before the Tj command
This assumption only holds for simple PDFs.
To quote from the PDF specification (ISO 32000-1):
A string operand of a text-showing operator shall be interpreted as a sequence of character codes identifying the glyphs to be painted.
With a simple font, each byte of the string shall be treated as a separate character code. The character code shall then be looked up in the font’s encoding to select the glyph, as described in 9.6.6, "Character Encoding".
With a composite font (PDF 1.2), multiple-byte codes may be used to select glyphs. In this instance, one or more consecutive bytes of the string shall be treated as a single character code. The code lengths and the mappings from codes to glyphs are defined in a data structure called a CMap, described in 9.7, "Composite Fonts".
(Section 9.4.3 - Text-Showing Operators - ISO 32000-1)
Thus,
I would like to know what type of content it is.
As quoted above, these "strings" consist of single-byte or multi-byte character codes. These codes depend on the current font's encoding. Each font object in a PDF can have a different encoding.
Those encodings may be some standard encoding (MacRomanEncoding, MacExpertEncoding, or WinAnsiEncoding) or some custom encoding. In particular in case of embedded font subsets you often find encodings where 1 is the code of the first glyph drawn on a page, 2 is the code for the second, different glyph, 3 for the third, different one, etc.
Furthermore: Is it possible to get a plain text out of these string or do I need further information to extract plain texts?
As the encoding of the string arguments of text showing instructions depends on the current font, you at least need to keep track of the current font name (Tf instruction) and look up encoding information (Encoding or ToUnicode map) from the current font object.
Section 9.10 - Extraction of Text Content - of ISO 32000-1 explains this in some more detail.
Furthermore, the order of the text showing instructions need not be the order of reading. The word "Hello" can e.g. be shown by first drawing the 'o', then going left, then the 'el', then again left, then the 'H', then going right, and finally the remaining 'l'. And two words need not be separated by a space glyph, there simply might be a text positioning instruction going right a bit.
Thus, in general you also have to keep track of the position of the separate strings drawn.
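The font-tracking part of this can be sketched as follows. This is a minimal sketch under loud assumptions: the instructions are already parsed into (operands, operator) pairs, every font is a simple font (one byte per character code), and the per-font code-to-text tables have already been derived from each font's Encoding or ToUnicode information. The function decode_text and the fonts dictionary are hypothetical names for illustration, and text positioning is ignored entirely.

```python
def decode_text(instructions, fonts):
    """Decode Tj string operands using the font selected by the last Tf.

    instructions: list of (operands, operator) pairs, already parsed.
    fonts: hypothetical mapping of font name -> {character code: text}.
    """
    current = None
    pieces = []
    for operands, operator in instructions:
        if operator == "Tf":
            # Operands of Tf are the font name and the font size.
            current = fonts[operands[0]]
        elif operator == "Tj":
            codes = operands[0]  # bytes; one byte per code for simple fonts
            table = current or {}
            pieces.append("".join(table.get(code, "\ufffd") for code in codes))
    return "".join(pieces)


# Two subset fonts that assign the same codes to different glyphs.
fonts = {
    "/F1": {1: "H", 2: "e", 3: "l", 4: "o"},
    "/F2": {1: "W", 2: "o", 3: "r", 4: "l", 5: "d"},
}
instructions = [
    (["/F1", 12], "Tf"),
    ([b"\x01\x02\x03\x03\x04"], "Tj"),
    (["/F2", 12], "Tf"),
    ([b"\x01\x02\x03\x04\x05"], "Tj"),
]
print(decode_text(instructions, fonts))
```

The same byte values decode to different text depending on the current font, which is why the raw strings before Tj look unreadable on their own. Codes without a mapping are replaced with U+FFFD here rather than guessed.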

Inserting special character in Redmine wiki page

I'm using Redmine and I'm trying to insert the special character | inside a table in a Redmine wiki page. I don't want this character to be parsed as a column separator.
I've achieved this by doing a <code>|</code> around this character, but I don't want to use the code tag, since this character will gain code attributes, namely the courier new font.
Is there a tag for displaying plain text and avoid the parsing from the Redmine wiki engine?
I'm reading the redmine wiki formatting documentation but it is very poor and points me to textile formatting which doesn't seem to include this special case.
I could not get the exclamation point to work, but this works for me.
<notextile>|</notextile>
The only way I found to overcome this problem is to insert the HTML code for the character I want to isolate. For instance, instead of typing an underscore and making the wiki think I'm starting an italic word, I have to put the HTML code for it:
&#95;
Example:
this is a &#95;test - _text comment here_
Without the underscore code (&#95;), the Redmine wiki engine will think that italic starts at test, and this is the wrong result:
this is a test - text comment here
So, putting the HTML code for the underscore corrects this problem. Unfortunately, the parsing is not very clever (yet, I hope).
Here is a link for an ASCII code table with many symbols and characters:
http://www.ascii.cl/htmlcodes.htm