How do I extract tables from a historical PDF?

I need to extract data from a set of similarly formatted tables in this file. There are some OCR errors, but I have an automated method to correct them.
I have tried:
ABBYY FineReader table detection
Tabula table extraction
Camelot table extraction
Custom Python code
The problem: the commercial tools are very bad at detecting the edges of the tables. The tables follow a similar general format, but each scan is aligned slightly differently, so hard-coding the borders won't work either.
Question: Do you guys know a good way to detect where the table begins and then apply one of a few templates?
Any other tips for this kind of work are greatly appreciated.

UPDATE 2/26:
I solved my own question, though feel free to respond with faster or better solutions.
One of the main problems is that the tables are roughly similar in their dimensions, but they vary from page to page. The scanned images are also slightly offset from page to page, giving two alignment problems. My current workflow solves both and is as follows.
Table Type Alignment
Solution:
Use the image editing tools in ABBYY to cut each page horizontally. This gives one table on each page.
Note that there are 4 table types. Even pages and odd pages have separate layouts, and the first table on each page includes a field for the date.
That gives first-table-even, first-table-odd, reg-table-even, and reg-table-odd. Processing one type at a time with fixed table areas and columns fixes the misalignment caused by differences between the table layouts.
Image Alignment
The images of the same table type are still not aligned with each other, so specifying a table layout in (x, y) coordinates won't work. The table's location is different in each image.
I needed to align the images based on the table location, but without already detecting the table there was no good way to do that.
I solved the problem in an interesting way, but I tried the following steps first.
Detect vertical lines using OpenCV. Result: did not detect faint lines well and would often miss lines, making it useless for alignment. (A sketch of this approach appears after this list.)
Use Scan Tailor to detect content. Result: the detection algorithm cropped some tables too much in some files, and in others included white space because of specks in the image. Despeckling didn't help.
Use Camelot with wide table areas and no column values. Result: this would probably work well in other cases, but Camelot fell down here. The data is reported down to the cent, and there are spaces between every three digits. This resulted in the misplacement of the trailing 00 in several columns.
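For reference, a minimal sketch of the OpenCV attempt from the first step above, assuming grayscale scans and placeholder filenames. Morphological opening with a tall, thin kernel is one common way to isolate vertical rules; as noted, faint lines may still be missed.

```python
import cv2
import numpy as np

# Load a scanned page as grayscale (filename is a placeholder).
img = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)

# Inverted adaptive threshold: table lines become white on black,
# which helps with uneven lighting in old scans.
binary = cv2.adaptiveThreshold(
    img, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
    cv2.THRESH_BINARY_INV, 15, 10)

# Opening with a 1x40 kernel keeps only pixel runs that are vertical
# and at least ~40 px tall (the kernel size is a tunable assumption).
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 40))
vertical = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)

# x positions of the surviving lines; the leftmost one could serve
# as an alignment anchor if detection were reliable.
xs = np.where(vertical.sum(axis=0) > 0)[0]
if xs.size:
    print("leftmost vertical line at x =", xs[0])
```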
Solution:
After cutting the images into tables, as explained in the Table Type Alignment section, use the Auto-Align Layers feature in Photoshop to align the images.
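If anyone wants to script this step instead of using Photoshop, OpenCV's ECC image registration can do roughly the same job as Auto-Align Layers. This is a sketch under the assumption that a pure translation relates scans of the same table type; filenames are placeholders.

```python
import cv2
import numpy as np

# One image serves as the reference; every other scan of the same
# table type is warped onto it (filenames are placeholders).
ref = cv2.imread("table_ref.png", cv2.IMREAD_GRAYSCALE)
mov = cv2.imread("table_002.png", cv2.IMREAD_GRAYSCALE)

# Estimate a pure-translation warp (2x3 identity as the initial guess).
warp = np.eye(2, 3, dtype=np.float32)
criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 200, 1e-6)
_, warp = cv2.findTransformECC(
    ref, mov, warp, cv2.MOTION_TRANSLATION, criteria, None, 5)

# Apply the inverse warp so the moving image lines up with the reference.
aligned = cv2.warpAffine(
    mov, warp, (ref.shape[1], ref.shape[0]),
    flags=cv2.INTER_LINEAR | cv2.WARP_INVERSE_MAP)
cv2.imwrite("table_002_aligned.png", aligned)
```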
Step-by-Step Solution:
Open Photoshop.
Load the images of one table type into a single file via File > Scripts > Load Files into Stack.
Run Edit > Auto-Align Layers.
Use the crop tool to make each file the same size.
Export each image as its own file via File > Export > Layers to Files.
Use the ABBYY OCR editor on each of the 4 table types, hard-coding the columns and rows in the GUI editor.
Export to CSV from ABBYY
Use something like clean.py to remove spaces and bad characters. (A stand-in sketch appears after this list.)
Done! Combine the files for each table however you like. I will post my Python code for doing this when I'm done with the project. Once cleaned, I will post the data too.
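Since clean.py isn't posted yet, here is a hypothetical stand-in for the kind of cleanup step 8 describes: collapsing the spaces used as digit-group separators and stripping stray OCR characters. The filenames and the exact character rules are assumptions.

```python
import csv
import re

def clean_cell(cell: str) -> str:
    # Join digit groups split by spaces: "1 234 567" -> "1234567".
    cell = re.sub(r"(?<=\d) (?=\d)", "", cell)
    # Drop anything that is not a word character, space, dot, or minus
    # (a guess at what "bad chars" means here).
    return re.sub(r"[^\w .\-]", "", cell).strip()

with open("table_raw.csv", newline="") as src, \
     open("table_clean.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):
        writer.writerow(clean_cell(cell) for cell in row)
```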

There is a free online tool here: https://www.pdftron.com/pdf-tools/pdf-table-extraction/
The related blog post https://www.pdftron.com/blog/parsing-extraction/table-extraction-and-pdf-to-xml-with-pdfgenie/ describes the PDFGenie command-line tool.

Instead of Camelot's table_areas parameter (which specifies fixed boundaries), you can try the table_regions parameter to specify the regions where the tables probably are (Camelot will only analyze the specified regions when looking for tables).
https://camelot-py.readthedocs.io/en/master/user/advanced.html#specify-table-regions
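For example (the coordinates are placeholders; per the linked docs, each region is an "x1,y1,x2,y2" string in PDF coordinate space, with (x1, y1) the left-top and (x2, y2) the right-bottom):

```python
import camelot

# Camelot needs a text layer, so this assumes the scans have been OCRed.
# table_regions narrows the search area without fixing the table borders.
tables = camelot.read_pdf(
    "ocr_output.pdf",
    flavor="stream",                  # "stream" handles tables without ruling lines
    table_regions=["50,750,550,80"],  # placeholder coordinates
)
print(tables[0].df.head())
```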
Please keep us updated.

Related

Power Automate: Is there an operation that can split PDFs based on shared text across pages?

Any advice on this would be appreciated! I'm a newbie to Power Automate and Flows, though have watched a lot of tutorial content. I haven't seen a guide for exactly what I'm looking to do, so was hoping an experienced user could provide some advice.
What I need to do is split a PDF into smaller PDFs grouped by the entity ID numbers that are on each page. I can't just split on a fixed increment because some entities have more pages of data than others. Generally the PDF will be about 700 pages and will be split into about 300 PDFs grouped by entity. Currently this is a labor-intensive process, and automating it would be incredible.
I'm looking into doing it with an Encodian "split PDF by text" action, but that requires the text to be provided up front.
Does anyone have any experience doing something similar?
I have tried putting this together, but so far have only found operations that will let me split when I find a specific text string that must be provided to the operation. What I need is a way to find the entity IDs on each page, group the pages for each entity together, and split each group into its own smaller PDF file.
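Not a Power Automate answer, but if a small script can sit alongside the flow, the grouping logic itself is straightforward. Here is a sketch in Python with pypdf; the entity-ID regex is an assumption and would need to match the real ID format.

```python
import re
from pypdf import PdfReader, PdfWriter

reader = PdfReader("combined.pdf")  # the ~700-page input
groups: dict[str, PdfWriter] = {}

# Placeholder pattern; adjust to however the entity IDs actually appear.
ID_PATTERN = re.compile(r"Entity ID:\s*(\d+)")

for page in reader.pages:
    match = ID_PATTERN.search(page.extract_text() or "")
    if match is None:
        continue  # or attach to the previous ID if it only appears on some pages
    groups.setdefault(match.group(1), PdfWriter()).add_page(page)

# One output PDF per entity, e.g. ~300 files for ~300 entities.
for entity_id, writer in groups.items():
    with open(f"entity_{entity_id}.pdf", "wb") as f:
        writer.write(f)
```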

Setting text to be read column-wise by screen-reader in iText 7

I have a page in my PDF that consists of several columns. I would like the screen-reader to read each column individually before moving on to the next column. Currently it just reads the text that appears from left to right. Is there any way to do this in iText 7?
The answer depends on whether you are creating this document yourself with iText or you want to fix the issue in an already existing PDF document.
In the first case, you simply need to specify that you want the document's logical structure to be created along with the document content. To achieve this, call the PdfDocument#setTagged() method when you create the PdfDocument instance. The logical structure is what tools like screen readers rely on to determine the correct logical order of the contents.
In the second scenario, when you already have a document with several columns but its reading order is messed up, it is most likely that the document doesn't have a proper logical structure (in other words, it is not tagged properly). Fixing this issue in an existing PDF document (a task sometimes called structure recognition) is extremely difficult in the general case and cannot currently be performed fully automatically. There are several tools that let you fix such documents manually or semi-automatically (like Adobe Acrobat), but iText 7 doesn't provide structure-recognition functionality right now.

Is there a way to automatically import data into a form field in Adobe Acrobat Pro?

I'm open to other solutions as well.
My issue is this: we have 500+ (and growing) different PDFs that need to have certain information (company info, phone numbers, etc.) added to form fields dynamically. This needs to be dynamic because the information changes regularly, and we do not want to update all 500 PDFs each time it changes. So I am looking for a way to set up the PDFs so that they all read from a single external source (which could be something as simple as a text file) dynamically upon opening the PDF in Acrobat Pro.
I have done some on-the-fly PDF creation through PHP in the past; however, that does not seem like the best solution here, as the PDFs need to be edited a lot by non-programmers. I'd prefer not to go that route and instead find a way to get a few lines of data into the PDFs they create.
I've researched this a bit and it seems... possible, but confusing? This is the best thing I could find so far:
http://www.pdfscripting.com/public/department48.cfm
But the three solutions that it offers near the bottom all sound convoluted. Just wondering if there is something simpler that I am missing. All I really need to do is have the PDF import a few small chunks of text. Seems like it should be easy...
I think you can give http://www.codeproject.com/Tips/679606/Filling-PDF-Form-using-iText-PDF-Library a try. Hopefully it fulfills your needs.
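That article uses the iText library in Java. If a scripted batch step is acceptable instead of filling truly dynamically at open time, the same idea looks roughly like this in Python with pypdf; the field names and the data file are assumptions, and you would rerun the script over all the PDFs whenever the shared info changes.

```python
import json
from pypdf import PdfReader, PdfWriter

# The single external source: a JSON file mapping form-field names to
# values (both the filename and the field names are assumptions).
with open("company_info.json") as f:
    data = json.load(f)  # e.g. {"company_name": "...", "phone": "..."}

reader = PdfReader("template.pdf")
writer = PdfWriter()
writer.append(reader)  # copy all pages, including the form fields

# Fill matching fields on every page that carries annotations.
for page in writer.pages:
    if "/Annots" in page:
        writer.update_page_form_field_values(page, data)

with open("filled.pdf", "wb") as f:
    writer.write(f)
```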

itextsharp: solution on how to display a report

I have a report which looks like this; it will be in PDF format:
(screenshot: http://img52.imageshack.us/img52/3324/fullscreencapture121420.png)
The user will input all the different foods, so every section (NONE, MODERATE, SEVERE) will be a different size, and I need to be able to expand the sections at run time. To do that I should probably slice up the image and add the different sections at run time, but I don't know the proper way to do it.
Please suggest how to go about fitting the text into the appropriate sections (keeping in mind that I have no control over how many foods are in each section; the user decides this at run time).
I would create an iTextSharp table for each of your results (None, Moderate, Severe) and write the tables out sequentially, in the order you want them to appear in your PDF. Each row in your tables would have four columns. (A rough sketch of this pattern follows the links below.)
I found these articles useful for creating tables in iTextSharp:
iTextSharp - Introducing Tables
SourceForge Table Tutorial
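The linked tutorials cover the iTextSharp specifics. Purely to illustrate the pattern (one table per severity section, written out sequentially, four columns per row), here is the same idea sketched in Python with reportlab instead, using made-up data:

```python
from reportlab.lib import colors
from reportlab.lib.pagesizes import letter
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.platypus import Paragraph, SimpleDocTemplate, Table, TableStyle

styles = getSampleStyleSheet()

# Placeholder data; each section grows with however many foods the user enters.
sections = {
    "NONE": [["Milk", "0", "-", "-"]],
    "MODERATE": [["Eggs", "2", "-", "-"], ["Wheat", "3", "-", "-"]],
    "SEVERE": [["Peanuts", "4", "-", "-"]],
}

flowables = []
for heading, rows in sections.items():
    flowables.append(Paragraph(heading, styles["Heading2"]))
    table = Table(rows, colWidths=[140, 90, 90, 90])  # four columns per row
    table.setStyle(TableStyle([("GRID", (0, 0), (-1, -1), 0.5, colors.grey)]))
    flowables.append(table)

# Writing the flowables sequentially lets each section take whatever
# vertical space its row count requires.
SimpleDocTemplate("report.pdf", pagesize=letter).build(flowables)
```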
Edit
Sorry, I didn't see the vb.net tag on your question. The pages I linked are in C# - I hope you can translate. I found that most of the iTextSharp samples you'll find are in C#.
It might be worth using a reporting tool rather than iTextSharp for formatted/tabular data?
We use Active Reports from http://www.datadynamics.com/ but I am sure there are others.
EDIT:
It looks like iTextSharp supports HTML-to-PDF conversion? Maybe that's easier to render?
Just did a search and found this: http://somewebguy.wordpress.com/2009/05/08/itextsharp-simplify-your-html-to-pdf-creation/

optical character recognition of PDFs of parliamentary debates

For contract work, I need to digitize a lot of old, scanned-image-only plenary debate protocol PDFs from the Federal Parliament of Germany.
The problem is that most of these files have a two-column format:
(sample protocol: http://sert.homedns.org/img/btp12001.png)
I would love to read your answers to the following questions:
How can I split the two columns before feeding them into OCR?
Which commercial or open-source OCR software or framework do you recommend, and why?
Please note that any tool, programming language, framework, etc. is fine. Don't hesitate to recommend esoteric products or libraries if you think they are cut out for the job ^__^!!
UPDATE: These documents have already been scanned by the parliament o_O: sample (same as the image above). There are lots of them, and I want to deliver on the contract ASAP, so I can't go fetch print copies of the same documents and cut and scan them myself. There are just too many of them.
Best Regards,
Cetin Sert
Cut the pages down the middle before you scan.
It depends on what OCR software you are using. A few years ago I did some work with an OCR API; I can't quite remember the name, but I think there are lots of alternatives. Anyway, this API allowed me to define regions on the page to OCR. If you always know roughly where the columns are, you could use an SDK to map out parts of the page.
I use OmniPage 17 for such things. It also has a batch mode: you put the documents into one folder, it picks them up from there, and it puts the results into another folder.
It auto-recognizes the layout, including columns, or you can set the default layout to columns.
You can set many options for how the output should look.
But try a demo first to check that it works correctly. At the moment I have problems with ligatures in some of my documents, so words like "fliegen" come out as "fl iegen" and have to be fixed by spell-checking.
Take a look at http://www.wisetrend.com/wisetrend_ocr_cloud.shtml (an online REST API for OCR). It is based on the powerful ABBYY OCR engine. You can get a free account and try it with a few of your images to see if it handles the two-column format (it should be able to). There are also a bunch of settings you can play with (see the API documentation); you may have to tweak some of them before it works with two columns. Finally, as a solution of last resort, if the two-column split is always in the same place, you can first create a program that splits the input image into two images (which shouldn't be very difficult to write with a standard image processing library; see the sketch below), and then feed the resulting images to the OCR process.
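A sketch of that last-resort splitting step, in Python with Pillow, assuming the gutter sits at a fixed midpoint (filenames are placeholders):

```python
from PIL import Image

# Split a two-column scan into left and right halves before OCR.
# A fixed midpoint is an assumption; it works when the gutter position
# is stable across scans, otherwise it would need to be detected.
page = Image.open("btp_page.png")
width, height = page.size
mid = width // 2

page.crop((0, 0, mid, height)).save("left.png")   # (left, top, right, bottom)
page.crop((mid, 0, width, height)).save("right.png")
```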