Extracting text from a pdf...selectively - pdf

Interesting challenge...I'm looking for ways to extract data from a pdf, selectively. These are a collection of research abstracts which consistently have pieces of text I don't want (e.g. authors' names and email addresses, web links, cities etc).
The body of the text is what I want, and I'd looked at using stopwords as a way to solve the problem, but it quickly becomes counterproductive (many of the stopwords are actually necessary words within the text body I need).
So, is there a way to almost do an opposite approach to using stopwords, based on large areas of text you want, only? For example, where there is a title and block of text (e.g. object, methods, results) could these sections of text be selectively extracted?
To add a bit of a challenge further, there isn't much consistency to the documents (so they don't all have the same headings or length).
If anyone has any experience or tips and recommendations it would be really helpful, as the alternative of manual copy and paste just isn't sustainable.
Many thanks,
Graham.

Related

How to use borb and a Translate API to translate a PDF while maintaining formatting?

I found borb - a cool Python package to analyze and create PDFs.
And there are several translation APIs available, e.g. Google Translate and DeepL.
(I realize the length of translated text is likely different than the original text, but to first order I'm willing to ignore this for now).
But I'm not clear from the borb documentation how to replace all texts with their translations, while maintaining all formatting.
Disclaimer: I am Joris Schellekens, the author of borb.
I don't think it will be easy to replace the text in the PDF. That's generally something that isn't really possible in PDF.
The problem you are facing is called "reflowing the content", the idea that you may cause a line of text to be longer/shorter. And then the whole paragraph changes. And perhaps the paragraph is part of a table, and the whole table needs to change, etc.
There are a couple of quick hacks.
You could write new content on top of the pdf, in a separate layer. The PDF spec calls this "optional content groups".
There is code in borb that does this already (the code related to OCR).
Unfortunately, there is no easy free or foolproof way to translate pdf documents and maintain document formatting.
DeepL's new Python Library allows for full document translation in this manner:
import deepl

auth_key = "YOUR_AUTH_KEY"  # replace with your own DeepL API key
translator = deepl.Translator(auth_key)
# Translate the whole file and write the result to a new PDF.
translator.translate_document_from_filepath(
    "path/to/original/file.pdf",
    "path/to/write/translation/to.pdf",
    target_lang="EN-US",
)
and the company now offers a free API with a character limit. If you have a few short pdfs you'd like to translate, this will probably be the way to go.
If you have many, longer pdfs and don't mind paying a base of $5.49/month + $25.00 per 1 million characters translated, the DeepL API is still probably the way to go.
EDIT: After trying DeepL's full document translation feature on Mandarin text, I found this method far from foolproof or accurate. The formatting of the Mandarin documents I examined varied significantly, and DeepL was unable to translate full documents accurately across that range of formatting. If you only need a rough translation of a document, I would still recommend DeepL's document translator. However, if you need a high degree of accuracy, there won't be an 'easy' way to do this (read the rest of the answer). Again, I have only tried this feature with Mandarin PDF files.
However, if you'd like to handle text extraction, translation, and formatting without DeepL's full document translation feature, and are able to sink some real time into building software that can do this, I would recommend pdfplumber. While it has a steep learning curve, it is an incredibly powerful tool that provides data on each character in the PDF and on image areas, and offers visual debugging and table extraction tools. It is important to note that it only works on machine-generated PDFs and has no OCR feature.
Many of the PDFs I deal with are in Mandarin and have characters that are listed out of order, but the data pdfplumber provides on each character makes it possible to determine their position on the page. For instance, if the distance of the left side of character n from the left side of the page is less than that of character n+1, and both have the same distance of the top of the character from the bottom of the page (see the char properties section of the docs), then it can reasonably be assumed that they are on the same line, with character n coming first.
Figuring out what looks the most typical for the body of pdfs that you typically work with is a long process, but performing the text extraction while maintaining line fidelity in this manner can be done with a high degree of accuracy. After extraction, passing the strings to DeepL and writing them in an outfile is an easy task.
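As a rough illustration of the line-rebuilding idea above, here is a minimal sketch in plain Python. It assumes each character is a dict with "text", "x0" (left edge measured from the left of the page) and "y1" (top edge measured from the bottom of the page), which is how I recall pdfplumber naming those properties; check the docs before relying on the key names. The function name and the y_tolerance value are my own, not part of pdfplumber.

```python
def chars_to_lines(chars, y_tolerance=2.0):
    """Group character dicts into lines, top of page first.

    Assumed keys per character: "text", "x0", "y1" (see lead-in).
    Characters whose y1 is within y_tolerance of the first character
    of the current line are treated as being on that line.
    """
    lines = []  # each entry: [reference_y1, [chars on that line]]
    # Larger y1 means higher on the page, so sort descending.
    for ch in sorted(chars, key=lambda c: -c["y1"]):
        if lines and abs(lines[-1][0] - ch["y1"]) <= y_tolerance:
            lines[-1][1].append(ch)
        else:
            lines.append([ch["y1"], [ch]])
    # Within a line, order characters left to right by x0.
    return [
        "".join(c["text"] for c in sorted(members, key=lambda c: c["x0"]))
        for _, members in lines
    ]


# Made-up characters, deliberately listed out of reading order.
sample = [
    {"text": "b", "x0": 20, "y1": 700},
    {"text": "c", "x0": 10, "y1": 680},
    {"text": "a", "x0": 10, "y1": 700.5},
    {"text": "d", "x0": 20, "y1": 680},
]
print(chars_to_lines(sample))  # ['ab', 'cd']
```

The tolerance is the part that needs tuning per document set; too small and superscripts split off into their own lines, too large and adjacent lines merge.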
If you can provide one of the pdfs you work with for testing that would be helpful!

PDF file search then display that page only

I create a PDF file with 20,000 pages. Send it to a printer and individual pages are printed and mailed. These are tax bills to homeowners.
I would like to place the PDF file on my web server.
When a customer inputs a unique bill number on a search page, a search for that specific page is started.
When the page within the PDF file is located, only that page is displayed to the requester.
There are other issues with security, uniqueness of bill number to search that can be worked out.
The main question is... 1: Can this be done? 2: Is a third-party program required?
I am a novice programmer and would like to try and do this myself.
Thank you
It is possible but I would strongly recommend a different route. Instead of one 20,000 page document which might be great for printing, can you instead make 20,000 individual documents and just name them with something unique (bill number or whatever)? PDFs are document presentations and aren't suited for searching or even text information storage. There are no "words" or "paragraphs" and there's not even a guarantee that text is written letter after letter. "Hello World" could be written "Wo", "He", "llo", "rld". Your customer's number might be "H1234567" but be written "1234567", "H". Text might be "in-page" but it also might be in form fields, which adds to the complexity. There are many PDF libraries out there that try to solve these problems, but if you can avoid them in the first place your life will be much easier.
If you can't re-make the main document then I would suggest a compromise. Take some time now and use a library like iText (Java) or iTextSharp (.Net) to split the giant document into smaller documents arbitrarily named. Then try to write your text extraction logic using the same libraries to find your uniqueifiers in the documents and rename each document accordingly. This is really the only way that you can prove that your logic worked on every possible scenario.
Also, be careful with your uniqueifiers. If you have accounts like "H1234" and "H12345" you need to make sure that your search algorithm is aware that one is a substring of the other (and would therefore also match in a naive search).
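To make the "H1234" vs "H12345" problem concrete, here is a small sketch. It assumes you have already extracted the text of each page into a list of strings with whatever library you use, and that the bill number comes out as one contiguous run of characters (which, as noted above, is not guaranteed). The function name is hypothetical.

```python
import re


def find_bill_pages(page_texts, bill_number):
    """Return the indexes of pages containing bill_number as a whole token.

    The lookarounds reject matches that are glued to another letter or
    digit, so "H1234" does not match inside "H12345" or "XH1234".
    """
    pattern = re.compile(
        r"(?<![A-Za-z0-9])" + re.escape(bill_number) + r"(?![A-Za-z0-9])"
    )
    return [i for i, text in enumerate(page_texts) if pattern.search(text)]


pages = ["Bill H1234 due", "Bill H12345 due", "Ref XH1234"]
# A plain substring test would report all three pages for "H1234".
print([i for i, t in enumerate(pages) if "H1234" in t])  # [0, 1, 2]
print(find_bill_pages(pages, "H1234"))                   # [0]
```

If a search ever returns more than one page for a bill number, that is a sign the numbers are not as unique as assumed and is worth treating as an error rather than showing the first hit.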
Finally, and this depends on how sensitive your client's data is, but if you're transporting very sensitive material I'd really suggest you spot-check every single document. Sucks, I know, I've had to do it. I'd get a copy of Ghostscript and convert all of the PDFs to images and then just run them through a program that can show me the document and the file name all at once. Google Picasa works nice for this. You could also write a Photoshop action that cropped the document to a specific region and then just use Windows Explorer.

If identifying text structure in PDF documents is so difficult, how do PDF readers do it so well?

I have been trying to write a simple console application or PowerShell script to extract the text from a large number of PDF documents. There are several libraries and CLI tools that offer to do this, but it turns out that none are able to reliably identify document structure. In particular I am concerned with the recognition of text columns. Even the very expensive PDFLib TET tool frequently jumbles the content of two adjacent columns of text.
It is frequently noted that the PDF format does not have any concept of columns, or even words. Several answers to similar questions on SO mention this. The problem is so great that it even warrants academic research. This journal article notes:
All data objects in a PDF file are represented in a
visually-oriented way, as a sequence of operators which...generally
do not convey information about higher level text units such as
tokens, lines, or columns—information about boundaries between such
units is only available implicitly through whitespace
Hence, all extraction tools I have tried (iTextSharp, PDFLib TET, and Python PDFMiner) have failed to recognize text column boundaries. Of these tools, PDFLib TET performs best.
However, SumatraPDF, the very lightweight and open source PDF Reader, and many others like it can identify columns and text areas perfectly. If I open a document in one of these applications, select all the text on a page (or even the entire document with CTRL+A) copy and paste it into a text file, the text is rendered in the correct order almost flawlessly. It occasionally mixes the footer and header text into one of the columns.
So my question is, how can these applications do what is seemingly so difficult (even for the expensive tools like PDFLib)?
EDIT 31 March 2014: For what it's worth I have found that PDFBox is much better at text extraction than iTextSharp (notwithstanding a bespoke Strategy implementation) and PDFLib TET is slightly better than PDFBox, but it's quite expensive. Python PDFMiner is hopeless. The best results I have seen come from Google. One can upload PDFs (2GB at a time) to Google Drive and then download them as text. This is what I am doing. I have written a small utility that splits my PDFs into 10 page files (Google will only convert the first 10 pages) and then stitches them back together once downloaded.
EDIT 7 April 2014. Cancel my last. The best extraction is achieved by MS Word. And this can be automated in Acrobat Pro (Tools > Action Wizard > Create New Action). Word to text can be automated using the .NET OpenXml library. Here is a class that will do the extraction (docx to txt) very neatly. My initial testing finds that the MS Word conversion is considerably more accurate with regard to document structure, but this is not so important once converted to plain text.
I once wrote an algorithm that did exactly what you mentioned for a PDF editor product that is still the number one PDF editor used today. There are a couple of reasons for what you mention (I think) but the important one is focus.
You are correct that PDF (usually) doesn't contain any structure information. PDF is interested in the visual representation of a page, not necessarily in what the page "means". This means in its purest form it doesn't need information about lines, paragraphs, columns or anything like that. Actually, it doesn't even need information about the text itself and there are plenty of PDF files where you can't even copy and paste the text without ending up with gibberish.
So if you want to be able to extract formatted text, you have to indeed look at all of the pieces of text on the page, perhaps taking some of the line-art information into account as well, and you have to piece them back together. Usually that happens by writing an engine that looks at white-space and then decides first what are lines, what are paragraphs and so on. Tables are notoriously difficult for example because they are so diverse.
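A very simplified sketch of the white-space approach, limited to column detection: it looks for wide vertical gutters that no word crosses and reads each column top to bottom. The word dicts ("text", "x0", "x1", "top", with "top" measured from the top of the page) and the min_gap threshold are assumptions made for this illustration; a real engine also has to cope with headers, footers, tables and text that spans columns.

```python
def split_columns(words, min_gap=20.0):
    """Return one string per column, columns ordered left to right.

    A horizontal gap of at least min_gap that no word overlaps is
    treated as a gutter between two columns.
    """
    if not words:
        return []
    # Merge the horizontal extents of all words into column ranges.
    spans = sorted((w["x0"], w["x1"]) for w in words)
    columns = [[spans[0][0], spans[0][1]]]
    for start, end in spans[1:]:
        if start - columns[-1][1] >= min_gap:
            columns.append([start, end])  # wide gap: new column starts
        else:
            columns[-1][1] = max(columns[-1][1], end)
    result = []
    for left, right in columns:
        members = [w for w in words if left <= w["x0"] <= right]
        # Reading order inside a column: top to bottom, then left to right.
        members.sort(key=lambda w: (w["top"], w["x0"]))
        result.append(" ".join(w["text"] for w in members))
    return result


# Made-up two-column page; row-by-row order would interleave the columns.
page = [
    {"text": "Left", "x0": 50, "x1": 80, "top": 100},
    {"text": "Right", "x0": 300, "x1": 340, "top": 100},
    {"text": "one", "x0": 85, "x1": 110, "top": 100},
    {"text": "two", "x0": 345, "x1": 370, "top": 100},
    {"text": "col", "x0": 50, "x1": 75, "top": 115},
    {"text": "side", "x0": 300, "x1": 330, "top": 115},
]
print(split_columns(page))  # ['Left one col', 'Right two side']
```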
Alternative strategies could be to:
Look at some of the structure information that is available in some PDF files. Some PDF/A files and all PDF/UA files (PDF for archival and PDF for Universal Accessibility) must have structure information that can very well be used to retrieve structure. Other PDF files may have that information as well.
Look at the creator of the PDF document and have specific algorithms to handle those PDFs well. If you know you're only interested in Word or if you know that 99% of the PDFs you will ever handle will come out of Word 2011, it might be worth using that knowledge.
So why are some products better at this than others? Focus I guess. The PDF specification is very broad, and some tools focus more on lower-level PDF tasks, some more on higher-level PDF tasks. Some are oriented towards "office" use - some towards "graphic arts" use. Depending on your focus you may decide that a certain feature is worth a lot of attention or not.
Additionally, and that may seem like a lousy answer, but I believe it's actually true, this is an algorithmically difficult problem and it takes only one genius developer to implement an algorithm that is much better than the average product on the market. It's one of those areas where - if you are clever and you have enough focus to put some of your attention on it, and especially if you have a good idea what the target market is you are writing this for - you'll get it right, while everybody else will get it mediocre.
(And no, I didn't get it right back then when I was writing that code - we never had enough focus to follow-through and make something that was really good)
To properly extract formatted text a library/utility should:
Retrieve correct information about properties of the fonts used in the PDF (glyph sizes, hinting information etc.)
Maintain graphics state (i.e. non-font parameters like text and page scaling etc.)
Implement some algorithm to decide which symbols on a page should be treated like words, lines or columns.
I am not really an expert in products you mentioned in your question, so the following conclusions should be taken with a grain of salt.
The tools that do not draw PDFs tend to have less expertise in the first two requirements. They have not had to deal with font details on a deeper level and they might not be that well tested in maintaining graphics state.
Any decent tool that translates PDFs to images will probably become aware of its shortcomings in text positioning sooner or later. And fixing those will help to excel in text extraction.

Obj-C / iOS: Look through a document for any one of several thousand words?

As part of a document reader I'm writing for iPhone/iPad, I need the following functionality:
Search through a document of between approximately 500 and 10,000 words for words and phrases that appear in one of several lists. Each list contains between 100 and 5,000 words and phrases. When I find a word in the document that appears in one of those lists, I mark it and move on.
I will know the word lists ahead of time, but the documents will be unknown until the moment they need to be processed.
And this needs to be VERY FAST.
Any help would be greatly appreciated!
This presentation and paper present a fast multi-pattern string search algorithm. It also mentions some predecessors, should this one not fit your needs.
Multifast is an open source (LGPLed) C library that implements the Aho-Corasick algorithm.
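For anyone unfamiliar with Aho-Corasick, here is a compact sketch of the algorithm in Python, just to show the idea: build a trie of all patterns once, add failure links, then scan the document a single time. This is not the Multifast API, the function names are my own, and on iOS you would want a C or Objective-C implementation rather than this.

```python
from collections import deque


def build_automaton(patterns):
    """Build trie transitions, failure links and output lists."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state] -> state to fall back to on a mismatch
    out = [[]]    # out[state] -> patterns that end at this state
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)
    # Breadth-first pass so shallower states are finished first.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit matches that end at the fallback state.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(text, automaton):
    """Return (start_index, pattern) for every occurrence in text."""
    goto, fail, out = automaton
    state = 0
    hits = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            hits.append((i - len(pat) + 1, pat))
    return hits


automaton = build_automaton(["he", "she", "his", "hers"])
print(find_all("ushers", automaton))  # [(1, 'she'), (2, 'he'), (2, 'hers')]
```

Since the word lists are known ahead of time, the automaton can be built once at startup and reused for every document. Note that this version matches raw substrings, so you would still need a word-boundary check (or run it over tokens instead of characters) to avoid marking "he" inside "ushers".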
I would create a huge hashmap with the phrases and words to search against at load time, since searching through hashmaps is very, very fast, especially at these sizes. Obviously a memory-hungry solution, but pretty trivial.
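A minimal sketch of the hash-based idea, using a Python set as a stand-in for the hashmap. The tokenising regex, the max_words limit and the longest-match-first rule are assumptions of mine; the word lists would be loaded into the set once at startup.

```python
import re


def mark_terms(document, terms, max_words=3):
    """Return (token_index, matched_term) pairs, preferring longer phrases.

    Terms are stored as tuples of lower-cased words in a set, so each
    candidate lookup is a single hash probe.
    """
    lookup = {tuple(re.findall(r"\w+", t.lower())) for t in terms}
    tokens = re.findall(r"\w+", document.lower())
    hits = []
    i = 0
    while i < len(tokens):
        # Try the longest window first so phrases win over single words.
        for n in range(min(max_words, len(tokens) - i), 0, -1):
            window = tuple(tokens[i:i + n])
            if window in lookup:
                hits.append((i, " ".join(window)))
                i += n
                break
        else:
            i += 1
    return hits


terms = ["heart attack", "aspirin", "attack"]
print(mark_terms("Aspirin may prevent a heart attack.", terms))
# [(0, 'aspirin'), (4, 'heart attack')]
```

The cost is roughly max_words lookups per token, so a 10,000-word document stays cheap regardless of how many thousand terms are in the set.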
iOS 4 and above seems to have functionality for custom dictionaries; perhaps you could exploit that somehow?

Question Answering with Lucene

For a toy project, I want to implement an automated question answering system with Lucene and I'm trying to figure out a reasonable way to implement it. The basic operation is as follows:
1) The user will enter a question.
2) The system will identify the keywords in the question.
3) The keywords will be searched in a large knowledgebase and matching sentences will be shown as answers.
My knowledgebase (i.e., corpus) is not structured. It is just a large, continuous text (say, a user manual without any chapters). I mean that the only structure is that sentences and paragraphs are identified.
I plan to treat each sentence or paragraph as a separate document. To present the answer in a context, I may consider keeping one sentence/paragraph before/after the indexed one as payload. I would like to know if that makes sense. Also, I'm wondering if there are other tried and well-known approaches for that kind of system. As an example, another approach that comes to mind is to index large chunks of the corpus as documents with the token positions, then process the vicinity of found keywords to construct my answers.
I would appreciate direct recommendations based on experience or intuition, but also tutorials or introductory materials to question-answering systems with Lucene in mind.
Thanks.
It's not an unreasonable approach to take.
One enhancement you might consider is incorporating learning feedback, so that you can continually improve the scoring of content vs search terms. To do this you would ask users to rate the answers that come back ('helpful vs unhelpful'), that way you can start to rank documents against keywords based on the historical data. You could classify potential documents as helpful/unhelpful for given keywords by using a simple Bayesian classifier.
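One possible reading of the feedback idea, sketched as a tiny naive Bayes classifier in plain Python: keep one classifier per document (or per answer), record the question keywords each time a user rates it, and later predict whether it is likely to be helpful for a new set of keywords. The class and method names are hypothetical and this is independent of Lucene.

```python
import math
from collections import Counter

LABELS = ("helpful", "unhelpful")


class FeedbackClassifier:
    """Naive Bayes over question keywords with add-one smoothing."""

    def __init__(self):
        self.word_counts = {label: Counter() for label in LABELS}
        self.label_counts = Counter()

    def record(self, keywords, label):
        """Store one user rating for a question with these keywords."""
        self.label_counts[label] += 1
        self.word_counts[label].update(keywords)

    def score(self, keywords, label):
        """Log of prior times keyword likelihoods for the given label."""
        vocab = set()
        for counts in self.word_counts.values():
            vocab.update(counts)
        total = sum(self.word_counts[label].values())
        ratings = sum(self.label_counts.values())
        logp = math.log((self.label_counts[label] + 1) / (ratings + len(LABELS)))
        for word in keywords:
            # The extra +1 reserves some mass for words never seen before.
            logp += math.log(
                (self.word_counts[label][word] + 1) / (total + len(vocab) + 1)
            )
        return logp

    def predict(self, keywords):
        return max(LABELS, key=lambda label: self.score(keywords, label))


clf = FeedbackClassifier()
clf.record(["reset", "password"], "helpful")
clf.record(["reset", "router"], "helpful")
clf.record(["install", "printer"], "unhelpful")
print(clf.predict(["reset"]))    # helpful
print(clf.predict(["printer"]))  # unhelpful
```

The resulting score could then be blended with the search engine's own relevance score when ranking answers.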
Indexing each sentence as a document will give you some problems. You've pointed out one: you would need to store the surrounding text as payloads. That means you'll need to store each sentence three times (before, during and after), and you'll have to manually get into the payload.
If you want to go the route of each sentence being a document, I would recommend coming up with an ID for each sentence and storing that as a separate field. Then you can display [ID-1, ID, ID+1] in each result.
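To show what the [ID-1, ID, ID+1] lookup amounts to, here is a sketch in plain Python where the sentence's position in a list plays the role of the stored ID field and a simple keyword overlap stands in for the Lucene query. None of this is Lucene API; the function name is made up.

```python
import re


def search_with_context(sentences, keywords):
    """Return [previous, hit, next] sentence groups for every match.

    The list index acts as the sentence ID; neighbours are fetched by
    ID arithmetic instead of being stored three times.
    """
    wanted = {k.lower() for k in keywords}
    results = []
    for sid, sentence in enumerate(sentences):
        words = set(re.findall(r"\w+", sentence.lower()))
        if wanted & words:
            # Clamp at the start of the corpus; slicing clamps at the end.
            results.append(sentences[max(sid - 1, 0):sid + 2])
    return results


manual = [
    "Plug in the device.",
    "Press the reset button.",
    "Wait ten seconds.",
    "The light turns green.",
]
for group in search_with_context(manual, ["reset"]):
    print(" ".join(group))
```

With a real index the neighbours would be fetched by a second lookup on the ID field, so each sentence is only stored once.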
The bigger question though is: how should you break up the text into documents? Identifying semantically related areas seems difficult, so doing it by sentence/paragraph might be the only way to go. A better way would be if you could find which text is the header of a section, and then put everything in that section as a document.
You might also want to use the index (if your corpus has one). The terms there could be boosted, as they are presumably more important.
Instead of Lucene, which does text indexing, search and retrieval, I think something like Apache Mahout would help with this. Mahout treats text as knowledge, which makes answering the question better than plain text matching. Mahout is a machine learning and data mining framework, which fits this domain better. Just a very high-level thought.
--Sai