Extract text from a PDF with respect to formatting (font size, type, etc.)

Is it possible to extract text from a PDF file by specific font / font size / font colour etc.? I'd prefer Perl, Python, or *nix command-line utilities. My goal is to extract all headlines from a PDF file so that I have a nice index of the articles contained in a single PDF.

You can get text along with font, font size, and position (but no colour, as far as I checked) from Ghostscript's txtwrite device (try the -dTextFormat=0 | 1 options), as well as from MuPDF's mudraw with the -tt option. Then parse the XML-like output with e.g. Perl.
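Parsing that XML-like output is straightforward with the standard library. The sample below is synthetic, modelled loosely on mutool's structured-text output (line elements whose font children carry name/size attributes); the exact tag and attribute names vary across MuPDF versions, so adapt them to what your tool actually emits:

```python
import xml.etree.ElementTree as ET

# Synthetic sample modelled on MuPDF-style structured text output.
# The real schema differs between versions -- adjust tag/attribute names.
sample = """
<page>
  <block>
    <line><font name="Times-Bold" size="18.0"><char c="H"/><char c="i"/></font></line>
    <line><font name="Times-Roman" size="10.0"><char c="b"/><char c="o"/><char c="d"/><char c="y"/></font></line>
  </block>
</page>
"""

def lines_with_fonts(xml_text):
    """Return (text, font name, font size) for each line of stext-like XML."""
    out = []
    for line in ET.fromstring(xml_text).iter('line'):
        for font in line.iter('font'):
            text = ''.join(ch.get('c', '') for ch in font.iter('char'))
            out.append((text, font.get('name'), float(font.get('size'))))
    return out

for text, name, size in lines_with_fonts(sample):
    print(text, name, size)
```

From there, building a headline index is just a matter of filtering on the size (or font name) of each line.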

I have working code that extracts text from a PDF along with the font size. I achieved this with the help of pdfminer, and it works across many PDFs:
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTChar
import os

path = r'path\whereyour pdffile'
os.chdir(path)

Extract_Data = []
for PDF_file in os.listdir():
    if PDF_file.endswith('.pdf'):
        for page_layout in extract_pages(PDF_file):
            for element in page_layout:
                if isinstance(element, LTTextContainer):
                    Font_size = None
                    for text_line in element:
                        for character in text_line:
                            if isinstance(character, LTChar):
                                Font_size = character.size
                    # one entry per text element, with the size sampled
                    # from its characters
                    Extract_Data.append([Font_size, element.get_text()])
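Once Extract_Data is filled, picking out headlines amounts to keeping the text recorded with the largest font sizes. A minimal sketch (the top-N heuristic is my assumption, not part of the original answer; tune it for your documents):

```python
def headline_index(extract_data, top_n=2):
    """Keep text recorded with the top_n largest font sizes, in document order.

    extract_data: list of [font_size, text] pairs, as built above.
    Duplicate entries for the same text are collapsed.
    """
    sizes = sorted({size for size, _ in extract_data}, reverse=True)
    keep = set(sizes[:top_n])
    seen, out = set(), []
    for size, text in extract_data:
        t = text.strip()
        if size in keep and t not in seen:
            seen.add(t)
            out.append(t)
    return out
```

For example, with entries recorded at sizes 18, 14, and 10, `top_n=2` keeps only the 18 pt and 14 pt lines.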

I have used fitz (PyMuPDF) to accomplish the required task, as it is much faster than pdfminer. You can find my duplicate answer to a similar question here.
An example code snippet is shown below.
import fitz  # PyMuPDF

def scrape(keyword, filePath):
    results = []  # list of tuples that store the information as (text, font size, font name)
    pdf = fitz.open(filePath)  # filePath is a string that contains the path to the pdf
    for page in pdf:
        page_dict = page.get_text("dict")
        for block in page_dict["blocks"]:
            if "lines" in block:  # image blocks carry no "lines" key
                for line in block["lines"]:
                    for span in line["spans"]:
                        if keyword in span["text"].lower():  # only store font information of a specific keyword
                            # span["text"] -> string, span["size"] -> font size, span["font"] -> font name
                            results.append((span["text"], span["size"], span["font"]))
    pdf.close()
    return results
If you wish to find the font information of every line, you may omit the if condition that checks for a specific keyword.
You can extract the text information in any desired format by understanding the structure of dictionary outputs that we obtain by using get_text("dict"), as mentioned in the documentation.

Related

How does one extract the actual text from pdf lines with an unrecognized encoding?

To set the stage, I am using pikepdf. When extracting a pdf, I have first upgraded it to PDF/A using ghostscript.
In PDF/A form, I can easily render it and see the text. The PDF is also a "true" PDF in the sense that everything is structured except for the actual text, which appears to be either an image object or some sort of unrecognized encoding.
The question is: how do I determine whether it is actually an image or, if it is not an image, find the element explaining how to interpret the text encoding in a PDF/A pdf using pikepdf?
For example, a typical line of a "True" pdf will be:
'[ (C) -0.169646 (O) 0.165508 (N) -0.169646 (T) 0.16137 (A) -0.169646 (C) -0.173783 (T) 0.16137 ] TJ'
# aka "CONTACT" when parsed.
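For reference, pulling the literal strings out of such a TJ array (ignoring the kerning numbers) can be sketched with a regex. This toy parser handles only the simple `( )` string form shown above, not hex strings like the ones in the problematic file, and ignores escape-sequence decoding beyond keeping the escapes intact:

```python
import re

def tj_text(tj_line):
    """Concatenate the (...) string operands of a TJ array, ignoring kerning numbers."""
    return ''.join(re.findall(r'\(((?:[^()\\]|\\.)*)\)', tj_line))

line = '[ (C) -0.169646 (O) 0.165508 (N) -0.169646 (T) 0.16137 (A) -0.169646 (C) -0.173783 (T) 0.16137 ] TJ'
print(tj_text(line))  # prints: CONTACT
```

The hex form `<00240007>` cannot be decoded this way, because the byte values are font-specific character codes that need the font's encoding (or a ToUnicode CMap) to interpret, which is exactly the question here.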
However, when inspecting the user data input in the pdf, a typical line might be:
'[ <00240007> 1 <0067003a0063> 1.00301 <0013001300130013> ] TJ'
# where I have anonymized the numbers
What I would like to do is de-mask the text, which is clearly visible in rendered state. But I am unsure where to go look for the encoding in the PDF header.
Is this information I can find in the PDF? And, if not, is there a way to determine what exactly these text snippets are? (e.g. pointers to image streams?)

Camelot in python does not behave as expected

I have two PDF documents, both in the same layout but with different information. The problem is:
I can read one perfectly, but with the other the data comes out unrecognizable.
This is an example which I can read perfectly, download here:
from_pdf = camelot.read_pdf('2019_05_2.pdf', flavor='stream', strict=False)
df_pdf = from_pdf[0].df
camelot.plot(from_pdf[0], kind='text').show()
print(from_pdf[0].parsing_report)
This is the dataframe as expected:
This is an example where, after reading, the information is unrecognizable; download here:
from_pdf = camelot.read_pdf('2020_04_2.pdf', flavor='stream', strict=False)
df_pdf = from_pdf[0].df
camelot.plot(from_pdf[0], kind='text').show()
print(from_pdf[0].parsing_report)
This is the dataframe with unrecognizable information:
I don't understand what I have done wrong and why the same code doesn't work for both files. I need some help, thanks.
The problem: malformed PDF
Simply, the problem is that your second PDF is malformed / corrupted. It doesn't contain correct font information, so it is impossible to extract text from your PDF as is. It is a known and difficult problem (see this question).
You can check this by trying to open the PDF with Google Docs.
Google Docs tries to extract the text, and this is the result:
Possible solutions
If you want to extract the text, you can print the document to an image-based PDF and perform an OCR text extraction.
However, Camelot does not currently support image-based PDFs, so it is not possible to extract the table.
If you have no way to recover a well-formed PDF, you could try this strategy:
print PDF to an image-based PDF
add a good text layer to your image-based PDF (using OCRmyPDF)
try using Camelot to extract tables

Struggling with PDF output of bookdown

I thought it would be a good idea to write a longer report/protocol using bookdown since it's more comfortable to have one file per topic to write in instead of just one RMarkdown document with everything. Now I'm faced with the problem of sharing this document - the HTML looks best (except for wide tables being cut off) but is difficult to send via e-mail to a supervisor for example. I also can't expect anyone to be able to open the ePub format on their computer, so PDF would be the easiest choice. Now my problems:
My chapter headings are pretty long, which doesn't matter in HTML but they don't fit the page headers in the PDF document. In LaTeX I could define a short title for that, can I do that in bookdown as well?
I include figure files using knitr::include_graphics() inside of code chunks, so I generate the caption via the chunk options. For some figures, I can't avoid having an underscore in the caption, but that does not work out in LaTeX. Is there a way to escape the underscore that actually works (preferably for HTML and PDF at the same time)? My LaTeX output looks like this after rendering:
\textbackslash{}begin\{figure\}
\includegraphics[width=0.6\linewidth,height=0.6\textheight]{figures/0165_HMMER} \textbackslash{}caption\{Output of HMMER for PA\_0165\}\label{fig:0165}
\textbackslash{}end\{figure\}
Edit
MWE showing that the problem is an underscore in combination with out.height (or width) in percent:
---
title: "MWE FigCap"
author: "LilithElina"
date: "19 Februar 2020"
output: pdf_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## R Markdown
This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see <http://rmarkdown.rstudio.com>.
When you click the **Knit** button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
```{r cars}
summary(cars)
```
## Including Plots
You can also embed plots, for example:
```{r pressure, echo=FALSE, fig.cap="This is a nice figure caption", out.height='40%'}
plot(pressure)
```
Note that the `echo = FALSE` parameter was added to the code chunk to prevent printing of the R code that generated the plot.
```{r pressure2, echo=FALSE, fig.cap="This is a not nice figure_caption", out.height='40%'}
plot(pressure)
```
Concerning shorter headings: pandoc, which is used for the markdown to LaTeX conversion, does not offer a "shorter heading". You can do that yourself, though:
# Really long chapter heading
\markboth{\thechapter~short heading}{}
[...]
## Really long section heading
\markright{\thesection~short heading}
This assumes a document class with chapters and sections.
Concerning the underscore in the figure caption: For me it works for both PDF and HTML to escape the underscore:
```{r pressure2, echo=FALSE, fig.cap="This is a not nice figure\\_caption", out.height='40%'}
plot(pressure)
```

Extract text from corrupt (?) pdf document

In a project I'm working on we scrape legal documents from various government sites and then make them searchable online.
Every now and then we encounter a PDF that seems to be corrupt. Here's an example of one.
If you open it in a PDF reader, it looks fine, but:
If you try to copy and paste it, you get corrupted text
If you run it through any tools like pdftotext, you get corrupted text
If you do just about anything else to it -- you guessed it -- you get corrupted text
Yet, if you open it in a reader, it looks fine! So I know the text is there, but something is wrong, wrong, wrong! The result is that on my site it looks really bad.
Is there anything I can do?
Update: I did more research today. Thanks to @Andrew Cash's observation that this is essentially a Caesar cipher, I realized I could search for the documents. This link will show you about 200 of these in my system. Looking through the larger sample set, it looks like these are all created by the same software, pdffactory v. 3.51! So I blame a bug, not deliberate obfuscation.
Update 2: The link above won't provide any results anymore. These are purged from my system using my solution below.
The PDF is using subsetted fonts where the characters are remapped to other characters, in the same way as a simple World War II-era substitution cipher.
A = G,
B = 1,
C = #,
D = W,
...
... and so on. Every character is remapped.
The font is mapped this way, and in order to get the correct characters displaying in the PDF you need to send in "G1#W" for it to print out ABCD. Normally PDFs will have a ToUnicode table to help you with text extraction, but I suspect this table has been left out on purpose.
I have seen a few of these documents myself where they are deliberately obfuscated to prevent text extraction. I have seen a document with about 5 different fonts and they were all mapped using a different sequence.
One sure way to tell if this is the problem is to load the PDF into Acrobat and copy/paste the text into a text editor. If Acrobat cannot decode the text back to English, then there is no way to extract the text without remapping it manually, and that requires knowing the translation mappings.
The only way to extract text easily from these types of documents is to OCR the full document and discard the original text layer. The OCR would convert each page to a TIFF image and then OCR it, so the original garbled text shouldn't affect the OCR result.
Weary of this issue and not wanting to deal with OCR, I manually sorted out the cipher. Here she be, as a python dict along with some rudimentary code that I was using to test it. I'm sure this could be improved, but it does work for all letters except uppercase Q and uppercase X, which I haven't yet been able to find.
It's missing a fair bit of punctuation too, at least for now (all of these are missing, for example: <>?{}\|!~`@#$%^_=+).
# -*- coding: utf-8 -*-
import re
import sys

letter_map = {
    u'¿': 'a',
    # b, c, d, h, l, q, and x map to multi-character sequences,
    # handled by the re.sub() replacements further down
    u'»': 'e',
    u'o': 'f',
    u'1': 'g',
    u'·': 'i',
    u'¶': 'j',
    u'μ': 'k',
    u'3': 'm',
    u'2': 'n',
    u'±': 'o',
    u'°': 'p',
    u'®': 'r',
    u'-': 's',
    u'¬': 't',
    u'«': 'u',
    u'a': 'v',
    u'©': 'w',
    u'§': 'y',
    u'¦': 'z',
    u'ß': 'A',
    u'Þ': 'B',
    u'Ý': 'C',
    u'Ü': 'D',
    u'Û': 'E',
    u'Ú': 'F',
    u'Ù': 'G',
    u'Ø': 'H',
    u'×': 'I',
    u'Ö': 'J',
    u'Õ': 'K',
    u'Ô': 'L',
    u'Ó': 'M',
    u'Ò': 'N',
    u'Ñ': 'O',
    u'Ð': 'P',
    # uppercase Q and X are still missing
    u'Î': 'R',
    u'Í': 'S',
    u'Ì': 'T',
    u'Ë': 'U',
    u'Ê': 'V',
    u'É': 'W',
    u'Ç': 'Y',
    u'Æ': 'Z',
    u'ð': '0',
    u'ï': '1',
    u'î': '2',
    u'í': '3',
    u'ì': '4',
    u'ë': '5',
    u'ê': '6',
    u'é': '7',
    u'è': '8',
    u'ç': '9',
    u'ò': '.',
    u'ô': ',',
    u'æ': ':',
    u'å': ';',
    u'Ž': "'",
    u'•': "'",  # s/b double quote, but identical to the single-quote glyph
    u'Œ': "'",  # s/b double quote, but identical to the single-quote glyph
    u'ó': '-',  # dash
    u'Š': '-',  # n-dash
    u'‰': '--', # em-dash
    u'ú': '&',
    u'ö': '*',
    u'ñ': '/',
    u'÷': ')',
    u'ø': '(',
    u'Å': '[',
    u'Ã': ']',
    u'‹': '•',
}

ciphertext = u'''YOUR STUFF HERE'''
plaintext = ''

for letter in ciphertext:
    try:
        plaintext += letter_map[letter]
    except KeyError:
        plaintext += letter

# These are multi-length replacements
plaintext = re.sub(u'm⁄4', 'b', plaintext)
plaintext = re.sub(u'g⁄n', 'c', plaintext)
plaintext = re.sub(u'g⁄4', 'd', plaintext)
plaintext = re.sub(u' ́', 'l', plaintext)
plaintext = re.sub(u' ̧', 'h', plaintext)
plaintext = re.sub(u' ̈', 'x', plaintext)
plaintext = re.sub(u' ̄u', 'qu', plaintext)

for letter in plaintext:
    try:
        sys.stdout.write(letter)
    except UnicodeEncodeError:
        continue
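As a quick sanity check of the single-character part of the table, the substitution can be wrapped in a tiny helper. The four-entry map below is just an excerpt of the full table for illustration:

```python
# Excerpt of the full letter_map above, enough to decode one word.
sample_map = {u'Ì': 'T', u'»': 'e', u'-': 's', u'¬': 't'}

def decode(text, table):
    """Apply a single-character substitution table, passing unknown characters through."""
    return ''.join(table.get(ch, ch) for ch in text)

print(decode(u'Ì»-¬', sample_map))  # prints: Test
```

Characters missing from the table (like the still-unknown uppercase Q and X) simply pass through unchanged, which is also what the try/except in the script above does.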

PDF font mapping error

While rendering a PDF file generated by PDFCreator 0.9.x. I noticed it contains an error in the character mapping. Now, an error in a PDF file is nothing to be wondered about, Acrobat does wonders in rendering faulty PDF files hence a lot of PDF generators create PDFs that do not adhere fully to the PDF standard.
I tried to create a small example file: http://test.continuit.nl/temp/Document.pdf
The single page renders a single glyph (a capital A) using a Tj command (see stream 5 0 obj). The font selected (7 0 obj) contains a font with a single glyph embedded. So far so good. The char is referenced by char #1. The Encoding of the font contains a Differences part: [ 1 /A ]. Thus char 1 -> character /A. Now the embedded subset font has a cmap that matches no glyph at character 65 (i.e. capital A); the cmap section of the font defines the characters in exactly the order of the PDF file's Font -> Encoding -> Differences array.
It looks like the character mapping / encoding is done twice. Only Files from PDFCreator 0.9.x seem to be affected.
My question is: Is this correct (or did I make a mistake and is the PDF correct) and what would you do to detect this situation in order to solve the rendering problem.
Note: I do need to be able to render these PDFs.
Solution
In ISO 32000 there is a remark that for symbolic TrueType fonts (flag bit 3 is set in the font descriptor) the Encoding is not allowed and you should IGNORE it, always using a simple one-to-one encoding. So all in all: if it is a symbolic font, I ignore the Encoding object altogether, and this solves the problem.
The first point is that the file opens and renders correctly in Acrobat, so it's almost certain that the file is correct. In fact it opens and renders correctly in a wide range of PDF consumers, so the file really is correct.
The font in question is a TrueType font, so actually yes, there are two kinds of 'encoding'. First there is PDF/PostScript Encoding. This maps a character code into a glyph name. In your case it maps character code 1 to glyph name /A.
In a PostScript font we would then look up the name /A in the CharStrings dictionary, and that would give us the character description, which we would then execute. Things are different with a TrueType font though.
You can find this on page 430 of the 1.7 PDF Reference Manual, where it states that:
"A TrueType font program’s built-in encoding maps directly from character codes to glyph descriptions by means of an internal data structure called a “cmap” (not to be confused with the CMap described in Section 5.6.4, “CMaps”)."
I believe in your case that you simply need to use the character code (0x01) directly in the CMAP sub table. This will give you a GID of 36.