I have a PDF file with valuable textual information.
The problem is that I cannot extract the text; all I get is a bunch of garbled symbols. The same happens if I copy and paste the text from the PDF reader into a text file. Even File -> Save as text in Acrobat Reader fails.
I have used every tool I could get my hands on and the result is the same. I believe this has something to do with font embedding, but I don't know what exactly.
My questions:
What is the cause of this weird text garbling?
How can I extract the text content from the PDF (programmatically, with a tool, by manipulating the bits directly, etc.)?
How can I fix the PDF so that copied text is no longer garbled?
Some PDF files are produced without special information that is crucial for successful extraction of text from them, even by Adobe's own tools. Basically, such files do not contain glyph-to-character mapping information.
Such files will be displayed and printed just fine (because the shapes of the characters are properly defined), but text cannot be properly copied or extracted from them (because there is no information about the meaning of the glyphs/shapes used).
For example, Distiller produces such files when the "Smallest File Size" preset is used.
Other than OCR there is no way to retrieve text from such files, I'm afraid. We recently published a guide on how to OCR PDFs in .NET.
We also have sample code that shows how to perform OCR on unmapped characters and then replace them with the correct Unicode values.
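For reference, here is a minimal sketch of the OCR route in Python (not the .NET sample mentioned above). It assumes the pdf2image and pytesseract packages plus a local Tesseract installation, and "garbled.pdf" is just a placeholder file name:

from pdf2image import convert_from_path
import pytesseract

# Render each page to an image, then OCR the images and join the results.
pages = convert_from_path("garbled.pdf", dpi=300)
text = "\n".join(pytesseract.image_to_string(page) for page in pages)
print(text)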
Supplementing the original answer
The original answer mentioned the "information about meaning of used glyphs/shapes". This information should be contained in a PDF structure called a /ToUnicode table. Such a table is required for each and every font which is embedded as a subset and uses non-standard (Custom) encoding.
In order to quickly evaluate the chances of extracting text contents, you can use the pdffonts command line utility. It prints, in tabular form, a series of items about each font used by the PDF. The presence of a /ToUnicode table is indicated by the column headed uni.
A few example outputs:
$ kp#mbp:git.PDF101.angea> pdffonts handcoded/textextract/textextract-good.pdf
name type encoding emb sub uni object ID
------------------------ ----------- ---------- --- --- --- ---------
BAAAAA+Helvetica TrueType WinAnsi yes yes yes 12 0
CAAAAA+Helvetica-Bold TrueType WinAnsi yes yes yes 13 0
$ kp#mbp:git.PDF101.angea> pdffonts handcoded/textextract/textextract-bad1.pdf
name type encoding emb sub uni object ID
------------------------ ----------- ---------- --- --- --- ---------
BAAAAA+Helvetica TrueType WinAnsi yes yes no 12 0
CAAAAA+Helvetica-Bold TrueType WinAnsi yes yes no 13 0
$ kp#mbp:git.PDF101.angea> pdffonts handcoded/textextract/textextract-bad2.pdf
name type encoding emb sub uni object ID
------------------------ ----------- ---------- --- --- --- ---------
BAAAAA+Helvetica TrueType WinAnsi yes yes yes 12 0
CAAAAA+Helvetica-Bold TrueType WinAnsi yes yes no 13 0
The good.pdf lets you extract the text contents for both fonts correctly, because both fonts have an accompanying /ToUnicode table.
For the bad1.pdf and the bad2.pdf the text extraction succeeds only for one of the two fonts, and fails for the other, because only one font has a /ToUnicode table.
I, Kurt Pfeifle, have recently created a series of hand-coded PDF files to demonstrate the influence of existing, buggy, manipulated or missing /ToUnicode tables in the PDF source code. These PDFs are extensively commented and suitable to be explored with the help of a text editor. The pdffonts output examples above were created with the help of these hand-coded files. (There are a few more PDFs showing different results, which an interested reader may want to explore...)
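As a quick triage aid, here is a small Python sketch (an illustration, not part of the hand-coded files) that runs pdffonts and lists the fonts whose uni column says no; it assumes pdffonts from poppler-utils is on the PATH, and "input.pdf" is a placeholder name:

import subprocess

def fonts_without_tounicode(pdf_path):
    out = subprocess.run(["pdffonts", pdf_path],
                         capture_output=True, text=True, check=True).stdout
    suspect = []
    for line in out.splitlines()[2:]:      # skip the two header lines
        if not line.strip():
            continue
        tokens = line.split()
        # Count from the right: ... emb sub uni <object number> <generation>,
        # which is robust against multi-word type/encoding values.
        name, uni = tokens[0], tokens[-3]
        if uni == "no":
            suspect.append(name)
    return suspect

print(fonts_without_tounicode("input.pdf"))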
I went to a lot of people for help, and OCR is the only solution to this problem.
I had the same problem. Uploading it to Google Drive, opening with Google Docs and copying the text from there worked for me.
Related
I'm trying to extract text from Arabic PDFs (raw data extraction, not OCR).
I tried many packages and tools and none of them worked: Python packages, PDFBox, the Adobe API, and many others. All of them failed to extract the text correctly; they either read the text left-to-right or decode it incorrectly.
Here are two samples from different tools.
sample 1:
املحتويات
7 الثانية الطبعة مقدمة
9 وتاريخه األدب -١
51 الجاهليون -٢
95 الشعر نحل أسباب -٣
149 والشعراء الشعر -٤
213 مرض شعر -٥
271 الشعر -٦
285 الجاهيل النثر -٧
sample 2:
ﺔﻴﻧﺎﺜﻟا ﺔﻌﺒﻄﻟا ﺔﻣﺪﻘﻣ
ﻪﺨﻳرﺎﺗو بدﻷا -١
نﻮﻴﻠﻫﺎﺠﻟا -٢
ﺮﻌﺸﻟا ﻞﺤﻧ بﺎﺒﺳأ -٣
ءاﺮﻌﺸﻟاو ﺮﻌﺸﻟا -٤
ﴬﻣ ﺮﻌﺷ -٥
ﺮﻌﺸﻟا -٦
ﲇﻫﺎﺠﻟا ﺮﺜﻨﻟا -٧
original text
And yes, I can copy it and get the same rendered text.
Are there any tools that can extract the Arabic text correctly?
The book link can be found here.
The text in a PDF is not the same as the text used for its construction; we can see that in your example, where page 7 is shown in Arabic on the surface but is coded as 7 in the plain text.
However, a greater problem is the language support of the fonts: in Notepad I had to accept a script font to see a similarity, but that relies on font substitution.
Another complication is Unicode and whitespace ordering.
So the result from
pdftotext -f 5 -l 5 في_الأدب_الجاهلي.pdf try.txt
will, at best, look like the samples shown above.
Thus, in summary, your sample 1 is as good as, if not better than, any other simple attempt.
Later edit, from B.A.'s comment below:
I found a way to work around this: after extracting the text I open the txt file and normalize its content using the unicodedata Python module, which offers the unicodedata.normalize() function. So I can now say that pdftotext is the best tool for Arabic text extraction.
Unicode normalization should fix that issue (you can choose NFKC).
Most programming languages have a normalization library.
Check here for more info about normalization:
https://unicode.org/reports/tr15/
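A minimal sketch of that pipeline (assuming pdftotext from poppler-utils is installed; the output file names are placeholders):

import subprocess
import unicodedata

# Extract page 5 with pdftotext, exactly as in the command shown earlier.
subprocess.run(["pdftotext", "-f", "5", "-l", "5",
                "في_الأدب_الجاهلي.pdf", "try.txt"], check=True)

with open("try.txt", encoding="utf-8") as f:
    raw = f.read()

# NFKC folds the Arabic presentation forms (the contextual shapes seen in
# sample 2) back to the ordinary Arabic letters.
normalized = unicodedata.normalize("NFKC", raw)

with open("try-normalized.txt", "w", encoding="utf-8") as f:
    f.write(normalized)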
I'm having an issue with a filter program I wrote. It detects whether a file is a PDF document by reading the first 5 bytes of the file and comparing them to a fixed buffer:
25 50 44 46 2D
This works fine, except that I'm seeing a few files that start with a byte order mark instead:
EF BB BF 25 50 44 46 2D
^-------^
I'm wondering if that is actually allowed by the PDF specs. If I check section 7.5 of that documentation, I read it as "no":
The first line of a PDF file shall be a header consisting of the 5 characters %PDF- followed by a version number of the form 1.N, where N is a digit between 0 and 7
Yet I see these documents in the wild, and users get confused because PDF reader programs can open these documents but my filter rejects them.
So: are BOM markers allowed at the start of PDF documents? (I'm NOT talking about string objects here, but about the PDF file itself.)
So: are BOM markers allowed at the start of PDF documents?
No, just like you read in the specification, nothing is allowed before the "%PDF" bytes.
But Adobe Reader has a long history of accepting files in spite of some leading or trailing trash bytes.
Cf. the implementation notes in Appendix H of Adobe's pdf_reference_1-7:
3.4.1, “File Header”
Acrobat viewers require only that the header appear somewhere within
the first 1024 bytes of the file.
Acrobat viewers also accept a header of the form
%!PS-Adobe-N.n PDF-M.m
...
3.4.4, “File Trailer”
Acrobat viewers require only that the %%EOF marker appear somewhere
within the last 1024 bytes of the file.
And since people have a tendency to think that a PDF which Adobe Reader displays as desired is valid, there are many PDFs in the wild that do have trash bytes up front.
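If you want your filter to mirror that tolerant behaviour rather than the letter of the specification, a minimal sketch (in Python, an assumption; the 1024-byte window is the one quoted above) would be to look for the header anywhere in the first 1024 bytes:

def looks_like_pdf(path):
    # Accept the file if "%PDF-" occurs anywhere in the first 1024 bytes,
    # so a leading UTF-8 BOM or other junk bytes no longer cause a rejection.
    with open(path, "rb") as f:
        head = f.read(1024)
    return b"%PDF-" in head

Whether you want to be that lenient, or keep rejecting such files as technically invalid, is a policy decision for your filter.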
No, a BOM is not valid at the front of a PDF file.
A PDF is a binary file format so a BOM wouldn't actually make sense, it would be like having a BOM at the front of a ZIP file or a JPEG.
I'm guessing the PDFs that you are consuming are coming from misconfigured applications that either already have something at the front of their output buffer or, more likely, are created with the incorrect assumption that a PDF is a text-based format.
I have followed ideas from this thread but it does not work.
https://unix.stackexchange.com/questions/6704/how-can-i-grep-in-pdf-files
pdftotext PercivalWalden.pdf - | grep 'Slepian'
pdftotext PercivalWalden.pdf - | grep 'Naive'
pdftotext PercivalWalden.pdf - | grep 'Filter'
I know for sure that 'Filter' appears at least 100 times in this book.
Any ideas?
If you really can grep a given string (that you can 'see' and read on a rendered or printed PDF page) from a PDF, even with the help of pdftotext, then you must be very lucky indeed.
First off: most of the advice from the link you provided to unix.stackexchange.com is very uninformed (to put it most politely). Most of the answers there are clearly written by people who are not familiar with the huge range of PDF variations out there.
In your case, you are trying to convert the file with the help of pdftotext first, streaming the output to stdout.
There are many types of PDF where pdftotext cannot extract the text at all. The reasons for this may be (the list below is not complete):
The "text" that you see is not based on using a font. It may be one big raster image generated by a scan or other production process, then embedded into a PDF file shell. This may make the page only appear to be text strings.
The "text" that you see is not based on using a font. It may be a series of small vector drawings (or small raster images), that only look like text strings to our eyes and brain.
There are many software applications which convert fonts to so-called 'outlines'. The reasons for this seemingly strange behaviour may be:
Circumvent licensing problems (when a certain font disallows its embedding).
Impose a handicap upon attempts to extract the text.
Accidentally wrong setting in the PDF generating application.
The font is embedded as a subset in the PDF file (by the PDF generating software -- users usually do not have much control over the details of this operation) and uses a 'custom' encoding, but the file does not provide a toUnicode table to map the glyphs to characters.
'Glyphs' are the well-defined shapes in each font drawn on screen. Glyphs map to characters for the computer -- our eyes merely see these shapes and our brains translate these to characters without needing a toUnicode table. Programs like pdftotext require a toUnicode table to reverse the translation of glyphs back to characters.
You can use a command line utility named pdffonts to gain a first insight into the fonts used by your PDF file. Example output:
pdffonts paper-projectiris---final.pdf
name type encoding emb sub uni object ID
-------------------------- ------------ -------------- --- --- --- ---------
TCQJEF+CMCSC10 Type 1 Builtin yes yes no 96 0
VPAFLY+CMBX12 Type 1 Builtin yes yes no 97 0
CWAIXW+CMTI12 Type 1 Builtin yes yes no 98 0
OBMDLT+CMR12 Type 1 Builtin yes yes no 99 0
In this case, text extraction (and your method of grepping for strings) should work: even though the column named uni (telling whether a toUnicode map is embedded in the PDF file) says no for every single font, the encoding column does not contain Custom but Builtin (meaning that a glyph-to-character mapping is provided with the font file, which is of type Type 1).
To sum it up: Without access to your PDF file it is impossible to tell why you cannot "grep" for the strings you are looking for!
I am trying to extract text from a PDF book and continue to run into an issue where sections of copied text fail to retain the proper capitalization when pasted into a text document. I have the rights to reproduce the book and also have a license to use all necessary fonts. At first I thought that the issue was caused by the fonts not being embedded, but I checked and all fonts appear to be subset embedded. Within the PDF there are over 100 fonts used which have one of the following properties:
TrueType Encoding: Ansi
TrueType (CID) Encoding: Identity-H
Type 1 (CID) Encoding: Identity-H
Type 1 Encoding: Custom
The languages within the book include English, German, Spanish and Italian. In German, capitalization is absolutely critical. The text tends to lose the uppercase properties more often than the lowercase.
An example of the error would be: WELD -> weld
I am really at a loss at what to do here. I have requested that the owner of the book embed the fonts, which he has done as subsets, but the problem continues. I have tried saving the PDF file as PostScript and then running it through Distiller, which corrected much of the problem, but in some cases resulted in text being replaced with different characters or numbers showing up as skulls. I understand that CID fonts might be contributing to the issue, but I have come across instances where a non-CID font had the same result.
What could be causing this issue? Is it that the fonts are subset rather than fully embedded? Is there a better way to save the native file (InDesign) to a PDF that will allow for better font extraction? Does it have to do with non-Unicode fonts, and if so, is there an alternative that does not require the owner to select different fonts?
Any and all assistance is greatly appreciated.
That's indeed funny. The sample PDF provided by the OP indeed visibly contains upper case characters, some of them in upper case only lines, some in mixed case lines, which by Adobe Reader are extracted as lower case characters.
You wonder
What could be causing this issue?
As an example of how that happens, let's look at Pelle Più bella.
In the page content that phrase actually looks like the visual representation in capital letters:
/T1_0 1 Tf
-0.025 Tc 12 0 0 12 379.5354 554.8809 Tm
(PELLE PI\331 BELLA)Tj
Looking at the font used, T1_0 (a DIN-Bold subset), we see that it claims to use WinAnsiEncoding, which would also indicate that those character codes in the page stream should be interpreted as capital letters.
But the font also has a ToUnicode mapping, and this mapping maps
<41> <0061> — 'A' → a
<42> <0062> — 'B' → b
<43> <0043> — 'C' → C
<44> <0044> — 'D' → D
<45> <0065> — 'E' → e
<49> <0069> — 'I' → i
<4C> <006C> — 'L' → l
<4D> <004D> — 'M' → M
<4E> <006E> — 'N' → n
<50> <0050> — 'P' → P
<52> <0072> — 'R' → r
<53> <0053> — 'S' → S
<54> <0074> — 'T' → t
<D9> <00F9> — 'Ù' → ù
(I only extracted the mappings from character codes which in WinAnsiEncoding represent capital letters.)
Is there a better way to save the native file (InDesign) to a pdf that will allow for better font extraction?
Sorry, I'm not really into InDesign. But that software being from Adobe, I would be surprised if this was a bug in InDesign or its export to PDF. Could it instead be that there is some information in the InDesign file which tags PELLE PIÙ BELLA as Pelle Più bella, which InDesign then translates into this ToUnicode mapping during PDF export?
Does it have to do with non-unicode fonts and if so is there an alternative that does not require the owner to select different fonts?
In case of your sample document there are three fonts, all of them with an Encoding entry WinAnsiEncoding, all of them being an embedded subset, but only two have such funny ToUnicode mappings, DIN-Medium and DIN-Bold, while Helvetica has no ToUnicode mapping. So it somehow is font related. How exactly I cannot say.
A workaround in case of your sample document would be to remove the ToUnicode mapping from the font dictionaries.
For example using Java and the iText library you can do that like this:
PdfReader reader = new PdfReader(INPUT);
// Walk all indirect objects and strip the /ToUnicode entry from every font dictionary
for (int i = 1; i <= reader.getXrefSize(); i++)
{
    PdfObject obj = reader.getPdfObject(i);
    if (obj != null && obj.isDictionary())
    {
        PdfDictionary dic = (PdfDictionary) obj;
        if (PdfName.FONT.equals(dic.getAsName(PdfName.TYPE)))
        {
            dic.remove(PdfName.TOUNICODE);
        }
    }
}
// Write the manipulated document to OUTPUT
PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(OUTPUT));
stamper.close();
reader.close();
After this manipulation Adobe Reader text extraction results in
PELLE PIÙ BELLA
This obviously only works in situations like the one in your sample document.
If in your other documents there is a mixture of fonts some of which require their respective ToUnicode map for text extraction while others are like the trouble fonts above, you might want to add some extra conditions to the Java code to only remove the map in the buggy font definitions.
No need to jump through PDF hoops. It isn't even a good text interchange format to begin with.
Is there a better way to save the native file (InDesign) to a pdf that will allow for better font extraction?
Ask the file provider to make an RTF export. This will retain all used fonts and formatting.
Your WELD-weld problem might be because of the font (if it contains both upper- and lowercase mapped to the same glyphs), use of an OpenType feature such as All Capitals, or even something like a badly created text-only stream inside the PDF.
While rendering a PDF file generated by PDFCreator 0.9.x, I noticed it contains an error in the character mapping. Now, an error in a PDF file is nothing to be wondered about; Acrobat does wonders in rendering faulty PDF files, hence a lot of PDF generators create PDFs that do not adhere fully to the PDF standard.
I tried to create a small example file: http://test.continuit.nl/temp/Document.pdf
The single page renders a single glyph (a capital A) using a Tj command (see stream 5 0 obj). The selected font (7 0 obj) contains a font with a single glyph embedded. So far so good. The character is referenced by char code #1. Accordingly, the font's Encoding contains a Differences part: [ 1 /A ], thus char 1 -> character /A. Now, in the embedded subset font there is a cmap that matches no glyph at character 65 (i.e. capital A); instead, the cmap section of the font defines the characters in exactly the order of the PDF file's Font -> Encoding -> Differences array.
It looks like the character mapping / encoding is done twice. Only files from PDFCreator 0.9.x seem to be affected.
My question is: is this correct (or did I make a mistake and is the PDF correct), and what would you do to detect this situation in order to solve the rendering problem?
Note: I do need to be able to render these PDFs.
Solution
In ISO 32000 there is a remark that for symbolic TrueType fonts (flag bit 3 is set in the font descriptor) the Encoding is not allowed and you should IGNORE it, always using a simple one-to-one encoding. So all in all, if it is a symbolic font, I ignore the Encoding object altogether, and this solves the problem.
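A minimal sketch of that check in Python (using pypdf, which is an arbitrary choice here and not the renderer in question; "Document.pdf" is the sample file from above):

from pypdf import PdfReader

SYMBOLIC = 1 << 2   # /Flags bit 3 has the value 4 and marks a symbolic font

reader = PdfReader("Document.pdf")
fonts = reader.pages[0]["/Resources"]["/Font"]
for name, ref in fonts.items():
    font = ref.get_object()
    descriptor = font.get("/FontDescriptor")
    if descriptor is None:
        continue
    flags = int(descriptor.get_object().get("/Flags", 0))
    if flags & SYMBOLIC:
        print(name, "is symbolic: ignore /Encoding, use the font's built-in cmap")
    else:
        print(name, "is not symbolic: honour the /Encoding Differences")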
The first point is that the file opens and renders correctly in Acrobat, so it's almost certain that the file is correct. In fact it opens and renders correctly in a wide range of PDF consumers, so in fact it is correct.
The font in question is a TrueType font, so actually yes, there are two kinds of 'encoding'. First there is PDF/PostScript Encoding. This maps a character code into a glyph name. In your case it maps character code 1 to glyph name /A.
In a PostScript font we would then look up the name /A in the CharStrings dictionary, and that would give us the character description, which we would then execute. Things are different with a TrueType font though.
You can find this on page 430 of the 1.7 PDF Reference Manual, where it states that:
"A TrueType font program’s built-in encoding maps directly from character codes to glyph descriptions by means of an internal data structure called a “cmap” (not to be confused with the CMap described in Section 5.6.4, “CMaps”)."
I believe in your case that you simply need to use the character code (0x01) directly in the CMAP sub table. This will give you a GID of 36.
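For illustration, here is a minimal sketch with fontTools (an assumption; any TrueType parser will do) that looks the character code up directly in the embedded font's cmap subtables, assuming the font program has first been extracted from the PDF to a placeholder file "embedded-subset.ttf":

from fontTools.ttLib import TTFont

font = TTFont("embedded-subset.ttf")
for table in font["cmap"].tables:
    glyph_name = table.cmap.get(0x01)      # use the character code directly
    if glyph_name is not None:
        gid = font.getGlyphID(glyph_name)  # expected to be 36 for this file
        print("code 0x01 ->", glyph_name, "GID", gid)
        break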