I have followed ideas from this thread but it does not work.
https://unix.stackexchange.com/questions/6704/how-can-i-grep-in-pdf-files
pdftotext PercivalWalden.pdf - | grep 'Slepian'
pdftotext PercivalWalden.pdf - | grep 'Naive'
pdftotext PercivalWalden.pdf - | grep 'Filter'
I know for sure that 'Filter' appears at least 100 times in this book.
Any ideas?
If you really can grep a given string (that you can 'see' and read on a rendered or printed PDF page) from a PDF, even with the help of pdftotext, then you must be very lucky indeed.
First off: most of the advice from the link you provided to unix.stackexchange.com is very uninformed (to put it most politely). Most of the answers there are clearly written by people who are not familiar with the huge range of PDF variations out there.
In your case, you are trying to convert the file with the help of pdftotext first, streaming the output to stdout.
There are many types of PDF where pdftotext cannot extract the text at all. The reasons for this may be (listings below not complete):
The "text" that you see is not based on using a font. It may be one big raster image generated by a scan or other production process, then embedded into a PDF file shell. This may make the page only appear to be text strings.
The "text" that you see is not based on using a font. It may be a series of small vector drawings (or small raster images), that only look like text strings to our eyes and brain.
There are many software applications, which do convert fonts to so-called 'outlines'. The reason for this seemingly strange behaviour may be:
Circumvent licensing problems (when a certain font disallows its embedding).
Impose a handicap upon attempts to extract the text.
Accidentally wrong setting in the PDF generating application.
The font is embedded as a subset in the PDF file (by the PDF generating software -- users usually do not have much control over the details of this operation) and uses a 'custom' encoding, but the file does not provide a toUnicode table to map the glyphs to characters.
'Glyphs' are the well-defined shapes in each font drawn on screen. Glyphs map to characters for the computer -- our eyes merely see these shapes and our brains translate these to characters without needing a toUnicode table. Programs like pdftotext require a toUnicode table to reverse the translation of glyphs back to characters.
You can use a command line utility named pdffonts to gain a first insight into the fonts used by your PDF file. Example output:
pdffonts paper-projectiris---final.pdf
name type encoding emb sub uni object ID
-------------------------- ------------ -------------- --- --- --- ---------
TCQJEF+CMCSC10 Type 1 Builtin yes yes no 96 0
VPAFLY+CMBX12 Type 1 Builtin yes yes no 97 0
CWAIXW+CMTI12 Type 1 Builtin yes yes no 98 0
OBMDLT+CMR12 Type 1 Builtin yes yes no 99 0
In this case, text extraction (and your method of grepping for strings) should work:
Even though the column named uni (telling if a toUnicode map is embedded in the PDF file)
says no for each single font, the encoding column does not contain custom, but builtin (meaning that a glyph->character mapping is provided with the font file, which is of type Type 1.
To sum it up: Without access to your PDF file it is impossible to tell why you cannot "grep" for the strings you are looking for!
Related
I'm trying to extract text from Arabic pdfs - raw data extraction not OCR -.
I tried many packages, tools and none of them worked, python packages, pdfBox, adobe API, and many other tools and all of them field to extract the text correctly, either it reads the text LTR or it do wrong decoding.
Here is a two sample from different tools
sample 1:
املحتويات
7 الثانية الطبعة مقدمة
9 وتاريخه األدب -١
51 الجاهليون -٢
95 الشعر نحل أسباب -٣
149 والشعراء الشعر -٤
213 مرض شعر -٥
271 الشعر -٦
285 الجاهيل النثر -٧
sample 2:
ﺔﻴﻧﺎﺜﻟا ﺔﻌﺒﻄﻟا ﺔﻣﺪﻘﻣ
ﻪﺨﻳرﺎﺗو بدﻷا -١
نﻮﻴﻠﻫﺎﺠﻟا -٢
ﺮﻌﺸﻟا ﻞﺤﻧ بﺎﺒﺳأ -٣
ءاﺮﻌﺸﻟاو ﺮﻌﺸﻟا -٤
ﴬﻣ ﺮﻌﺷ -٥
ﺮﻌﺸﻟا -٦
ﲇﻫﺎﺠﻟا ﺮﺜﻨﻟا -٧
original text
and yes I can copy it and get the same rendered text.
are there any tool that can extract Arabic text correctly
the book link can be found here
The text in a PDF is not the same as the text used for its construction, we can see that in your example where page 7 is shown in Arabic on the surface but is coded as 7 in the plain text.
However a greater problem is the Languages as supported by fonts, so in Notepad I had to accept a script font to see a similarity, but that is using a font substitution.
Another complication is Unicode and whitespace ordering.
so the result from
pdftotext -f 5 -l 5 في_الأدب_الجاهلي.pdf try.txt
At best will look like
Thus in summary your Sample 1 is equal if not better, than any other simple attempt.
Later Edit from B.A. comment below
I found a way to go around this, after extracting the text I open the txt file and normalize its content using unicodedata python module that offers unicodedata.normalize() function. So I can now say that pdftotext is the best tool for Arabic text extraction
Unicode Normalization should be fixing that issue. (you can choose NFKC)
Most programming languages have a normal.
check here for more info about normalization.
https://unicode.org/reports/tr15/
I have multiple pdf files without 'toUnicode' cmap table. Absence of cmap table restricts me from copying the text from pdf files.
As far as I know, there is a possibility to add 'toUnicode' mapping in pdf file, but in my case adding static values is not an option, different files have different glyph codes.
So the question is the following. Is there any possibility to restore 'toUnicode' cmap table, perhaps with the help of Ghostscript, or are there any options at all?
Thanks.
No, you cannot add ToUnicode CMaps to an existing PDF file using Ghostscript.
In the general case, you can't do it at all, except manually. As you note in the question, different files will be constructed to use different character code->Glyph mappings, which means that the character code to Unicode mapping will also be different.
Since the character code selection is often based on the order in which glyphs are used in a file (so the fist glyph is character code 1, the second is character code 2 etc) you can see that there is no prospect of identifying a 'one size fits all' solution.
You could use some kind of OCR to scan the rendered output, identify each glyph and find the Unicode code point for it. Then you could construct a CMap by identifying the character code for the glyph and mapping it to the Unicode value.
You could, then, add the ToUnicode CMap to the PDF file, and update the Font Descriptor with the object number of the ToUnicode CMap.
Ghostscript won't do any of that for you, and I haven't heard of any tool which will.
After processing with Ghostscript, I sometimes see whitespace breaking up the words as seen with pdftotext or in a PDF viewer when searching or selecting. Possibly unrelated but the anomalies seem to correspond with kerning variations in the rendered font.
Is there a way to avoid this?
For example, from GS 9.23 (also occurred with earlier versions):
gs -sDEVICE=pdfwrite \
-dNOPAUSE -dQUIET -dPARANOIDSAFER -dBATCH \
-sOutputFile=./output.pdf input.pdf
Excerpt from pdftotext input.pdf:
Review this manual before
operating deep cleaner
while pdftotext output.pdf:
Re vie w t his m a nua l be fore
ope ra t ing de e p c le a ne r
Ghostscript and the pdfwrite device (as explained in VectorDevices.htm) does not simply 'fiddle' with the input when producing a PDF file. The input (from whatever source; PDF, PostScript, XPS, PCL, PCL-XL) is fully interpreted into marking operations, those marking operations are sent to the device which turns them back into PDF constructs.
So the low level (PDF) format describing the page need not bear any relation to the low level format of the input. In particular you cannot expect the PDF operations in the input to be reflected in the output.
The visual appearance will be the same (or should be, because that's the main goal), but the actual operations won't be.
The reason for the difference in the text output is because, basically, there is no 'metadata' in a PDF file that describes words, paragraphs, columns etc. When you extract text from a PDF file what you actually get is a series of character codes and positions.
Its up to the text extraction code to try and make some sense of that. I'd guess that pdftotext is using the rather naive approach of assuming that text strings are words.
This is problematic because there are numerous different ways to handle kerning, justification and other spacings in PDF. You could do something like :
(Te) Tj
10 0 Td
(st) Tj
Or :
[(Te) 2 (st)] TJ
The pdfwrite device doesn't know what the original was, so what it emits could be either of those, depending on some heuristics. The chances of it matching the original are low.
I suspect that pdftotext would regard the first operation as "Te st" and the second operation as "Test"
One possible solution would be to use Ghostscript's txtwrite device to extract the text, it might do a better job.
As with your other question, it would be best to supply examples when asking these kinds of questions, because without that its pretty much guesswork.
TL;DR
Is there a way to avoid this?
No.
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about a specific programming problem, a software algorithm, or software tools primarily used by programmers. If you believe the question would be on-topic on another Stack Exchange site, you can leave a comment to explain where the question may be able to be answered.
Closed 8 years ago.
Improve this question
I have a PDF file with valuable textual information.
The problem is that I cannot extract the text, all I get is a bunch of garbled symbols. The same happens if I copy and paste the text from the PDF reader to a text file. Even File -> Save as text in Acrobat Reader fails.
I have used all tools I could get my hands on and the result is the same. I believe that this has something to do with fonts embedding, but I don't know what exactly?
My questions:
What is the culprit of this weird text garbling?
How to extract the text content from the PDF (programmatically, with a tool, manipulating the bits directly, etc.)?
How to fix the PDF to not garble on copy?
Some PDF files are produced without special information that is crucial for successful extraction of text from them. Even by the Adobe tools. Basically, such files do not contain glyph-to-character mapping information.
Such files will be displayed and printed just fine (because shapes of the characters are properly defined), but text from them can't be properly copied / extracted (because there is no information about meaning of used glyphs/shapes).
For example, Distiller produces such files when "Smallest File Size" preset is used.
Other than OCR there is no other way to retrieve text from such files, I'm afraid. We recently published a guide for how to OCR PDFs in .NET.
Also we have a sample code that shows how to perform OCR for unmapped characters and then replace them with correct Unicode values.
Supplementing the original answer
The original answer mentioned the "information about meaning of used glyphs/shapes". This information should be contained in a PDF structure called a /ToUnicode table. Such a table is required for each and every font which is embedded as a subset and uses non-standard (Custom) encoding.
In order to quickly evaluate the chances for extractability of text contents, you can use the pdffonts command line utility. This prints in tabular form a series of items about each font used by the PDF. The presence of a /ToUnicode table is indicated by column headed uni.
A few example outputs:
$ kp#mbp:git.PDF101.angea> pdffonts handcoded/textextract/textextract-good.pdf
name type encoding emb sub uni object ID
------------------------ ----------- ---------- --- --- --- ---------
BAAAAA+Helvetica TrueType WinAnsi yes yes yes 12 0
CAAAAA+Helvetica-Bold TrueType WinAnsi yes yes yes 13 0
$ kp#mbp:git.PDF101.angea> pdffonts handcoded/textextract/textextract-bad1.pdf
name type encoding emb sub uni object ID
------------------------ ----------- ---------- --- --- --- ---------
BAAAAA+Helvetica TrueType WinAnsi yes yes no 12 0
CAAAAA+Helvetica-Bold TrueType WinAnsi yes yes no 13 0
$ kp#mbp:git.PDF101.angea> pdffonts handcoded/textextract/textextract-bad2.pdf
name type encoding emb sub uni object ID
------------------------ ----------- ---------- --- --- --- ---------
BAAAAA+Helvetica TrueType WinAnsi yes yes yes 12 0
CAAAAA+Helvetica-Bold TrueType WinAnsi yes yes no 13 0
The good.pdf lets you extract the text contents for both fonts correctly, because both fonts have an accompanying /ToUnicode table.
For the bad1.pdf and the bad2.pdf the text extraction succeeds only for one of the two fonts, and fails for the other, because only one font has a /ToUnicode table.
I, Kurt Pfeifle, have recently created a series of hand-coded PDF files to demonstrate the influence of existing, buggy, manipulated or missing /ToUnicode tables in the PDF source code. These PDFs are extensively-commented and suitable to be explored with the help of a text editor. Above pdffonts output examples were created with the help of these hand-coded files. (There are a few more PDFs showing different results, which an interested reader may want to explore...)
I went to a lot of people for help and OCR is the only solution to this problem
I had the same problem. Uploading it to Google Drive, opening with Google Docs and copying the text from there worked for me.
While rendering a PDF file generated by PDFCreator 0.9.x. I noticed it contains an error in the character mapping. Now, an error in a PDF file is nothing to be wondered about, Acrobat does wonders in rendering faulty PDF files hence a lot of PDF generators create PDFs that do not adhere fully to the PDF standard.
I trief to create a small example file: http://test.continuit.nl/temp/Document.pdf
The single page renders a single glyph (a capital A) using a Tj command (See stream 5 0 obj). The font selected (7 0 obj) contains a font with a single glyph embedded. So far so good. The char is referenced by char #1. Given the Encoding of the font it contains a Differences part: [ 1 /A ]. Thus char 1 -> character /A. Now in the embedded subset font there is a cmap that matches no glyph at character 65 (eg capital A) the cmap section of the font does define the character in exactly the order in the PDF file Font -> Encoding -> Differences array.
It looks like the character mapping / encoding is done twice. Only Files from PDFCreator 0.9.x seem to be affected.
My question is: Is this correct (or did I make a mistake and is the PDF correct) and what would you do to detect this situation in order to solve the rendering problem.
Note: I do need to be able to render these PDFs..
Solution
In the ISO32000 file there is a remark that symbolic TrueType fonts (flag bit 3 is on in the font descriptor) the encoding is not allowed and you should IGNORE it, using a simple 1on1 encoding always. SO all in all, if it is a symbolic font, I ignore the Encoding object altogether and this solves the problem.
The first point is that the file opens and renders correctly in Acrobat, so its almost certain that the file is correct. In fact it opens and renders correctly in a wide range of PDF consumers, so in fact it is correct.
The font in question is a TrueType font, so actually yes, there are two kinds of 'encoding'. First there is PDF/PostScript Encoding. This maps a character code into a glyph name. In your case it maps character code 1 to glyph name /A.
In a PostScript font we would then look up the name /A in the CharStrings dictionary, and that would give us the character description, which we would then execute. Things are different with a TrueType font though.
You can find this on page 430 of the 1.7 PDF Reference Manual, where it states that:
"A TrueType font program’s built-in encoding maps directly from character codes to glyph descriptions by means of an internal data structure called a “cmap” (not to be confused with the CMap described in Section 5.6.4, “CMaps”)."
I believe in your case that you simply need to use the character code (0x01) directly in the CMAP sub table. This will give you a GID of 36.