Convert pdf to text returns weird escape sentences - pdf

I am trying to extract text from pdf to text. The pdf contains text in Czech, which includes characters such as ščřžý...
I tried numerous approaches including tika, textract, texttopdf, calibre, PDFMiner and so on.
However, I am getting many undefined characters and some characters are incorrectly decoded.
I also tried to encode and decode the text with different codecs, but got no luck.
Could you suggest possible solutions to this problem?
So far, OCR worked best, but mistakes o (the letter) for 0 (zero) and some letters are capitalised.

Related

Custom Speech: "normalized text is empty"

I uploaded a wav and text file to the custom speech portal. I got the following error: “Error: normalized text is empty.”
The text file is UTF-8 BOM, and is similar in format to a file that did work.
How I can trouble-shoot this?
There can be several reasons for a normalized text to be empty, e.g. if there are words of Latin and non-Latin characters in a sentence (depending on the locale). Also, words that are repeated multiple times in a row may cause this. Can you share which locale you're using to import the data? If you could share the text we can find the reason. Otherwise you could try to reduce the input text (no need to cut the audio for this) to find out what causes the normalization to discard the sentence.

PDF extracted text seems to be unreadable

Situation: I've a PDF using version 1.6. In that PDF, there are several streams. There were compressed text (Flate) in that streams, so I decompressed these streams. After that, I extracted the Tj-parts of the corresponding, decompressed streams. I assumed that there would be readable text between the brackets before the Tj command, but the result was the following:
Actual Question: As I have no idea, what I've got thre, I would like to know what type of content it is. Furthermore: Is it possible to get a plain text out of these string or do I need further information to extract plain texts?
Further research: The PDFs, which I try to analyze where generated by iTextSharp (seems to be an C# Library for generating PDFs). Don't know whether it is a relevant information, but it might be that that Library uses a special way of encrypt it's text data or something...
I assumed that there would be readable text between the brackets before the Tj command
This assumption only holds for simple PDFs.
To quote from the PDF specification (ISO 32000-1):
A string operand of a text-showing operator shall be interpreted as a sequence of character codes identifying the glyphs to be painted.
With a simple font, each byte of the string shall be treated as a separate character code. The character code shall then be looked up in the font’s encoding to select the glyph, as described in 9.6.6, "Character Encoding".
With a composite font (PDF 1.2), multiple-byte codes may be used to select glyphs. In this instance, one or more consecutive bytes of the string shall be treated as a single character code. The code lengths and the mappings from codes to glyphs are defined in a data structure called a CMap, described in 9.7, "Composite Fonts".
(Section 9.4.3 - Text-Showing Operators - ISO 32000-1)
Thus,
I would like to know what type of content it is.
As quoted above, these "strings" consist of single-byte or multi-byte character codes. These codes depend on the current font's encoding. Each font object in a PDF can have a different encoding.
Those encodings may be some standard encoding (MacRomanEncoding, MacExpertEncoding, or WinAnsiEncoding) or some custom encoding. In particular in case of embedded font subsets you often find encodings where 1 is the code of the first glyph drawn on a page, 2 is the code for the second, different glyph, 3 for the third, different one, etc.
Furthermore: Is it possible to get a plain text out of these string or do I need further information to extract plain texts?
As the encoding of the string arguments of text showing instructions depends on the current font, you at least need to keep track of the current font name (Tf instruction) and look up encoding information (Encoding or ToUnicode map) from the current font object.
Section 9.10 - Extraction of Text Content - of ISO 32000-1 explains this in some more detail.
Furthermore, the order of the text showing instructions need not be the order of reading. The word "Hello" can e.g. be shown by first drawing the 'o', then going left, then the 'el', then again left, then the 'H', then going right, and finally the remaining 'l'. And two words need not be separated by a space glyph, there simply might be a text positioning instruction going right a bit.
Thus, in general you also have to keep track of the position of the separate strings drawn.

How to convert the PDF content code to the type like "(<0034>) Tj"?

PDF content are saved as several ways, "(abc) Tj", "(<0035><0035>) Tj" or "\u065".
I want to know if there is a way to convert the PDF code to one type, no matter direct text "(abc) Tj", or hexadecimal "(<0035><0035>) Tj", or Octal "\u065".
I think if convert and encode the PDF to one type, will be easier to analyse the content.
Is it possible to use Ghostscript or something to do that? Thanks
Essentially, no, there is no way to do so. There are two kinds of string, regular strings '(' and ')' delimited, and hex strings '<' and '>' delimited. Hex strings need not be escaped whereas regular text strings do need to be for 'special' characters, like carriage return and linefeed. Octal is also permitted in regular strings.
PDF producers are free to mix and match these all they like, but in general a given PDF producer will usually use one technique throughout.
Because Ghostscript's pdfwrite device is a PDF producer, it will (I believe) generally produce all its output the same way.
What it won't do is 'convert' your original PDF file. It produces a brand new PDF file which should look visually identical but whose internals bear no resemblance to your original PDF. In addition some metadata or fidelity may be lost.

What does an /ActualText of FEFF0009 mean in a PDF?

I've been looking into a PDF file to understand how it is built.
I noticed that InDesign has created PDFs with text as below (after decompression using pdftk).
0 Tc /Span<</ActualText<FEFF0009>>> BDC
4.018 -0.2 Td
( )Tj
I understand the role of ActualText (for copy/paste/searching) but I'm wondering exactly how I should be interpreting the FEFF0009. It looks like a UTF-16 string with BOM chars to represent a tab character. This seems incorrect as it's really a space. I'm wondering if there is a special meaning here?
.. This seems incorrect as it's really a space.
No, it's really a tab.
14.9.4 Replacement Text
NOTE 1: Just as alternate descriptions can be provided for images and other items that do not translate naturally into text (as described in the preceding sub-clause), replacement text can be specified for content that does translate into text but that is represented in a nonstandard way.
(PDF 32000-1:2008)
The PDF text engine does not support the concept of 'tabs'. In this case, InDesign mimicked the function of a tab character by inserting a space in the text stream, and it could set the space width to match the distance spanned by the original tab or use a large relative positioning for the rest of the text (which it did here: the horizontal displacement of 4.018 in your code snippet).
The general idea is that a space is rendered on the position of the tab, but when you copy this text and paste somewhere else you get a tab character. I suppose the 'space' is only inserted to have something to copy.

Superscript greater than three does not work. Only one two and three available

I'm querying a database and trying to get superscripts. The first three work fine, others don't though. What are the valid superscript characters for anything over three?
1: NCHAR(185)
2:NCHAR(178)
3:NCHAR(179)
4: NCHAR(8308)
5:NCHAR(8309)
6:NCHAR(8310)
7:NCHAR(8311)
8:NCHAR(8312)
9:NCHAR(8313)
Simply search the web for characters named SUERSCRIPT FOUR, etc.
You will find pages such as this, for superscript 4:
http://www.fileformat.info/info/unicode/char/2074/index.htm
The characters you have seem to be at the correct decimal codepoints; perhaps the issue is with the fonts. You'll just have to find fonts that support these characters.
Also check that the character encoding of your database is UTF-8 and not, say, Latin-1.