PDF data extraction - pdf

Is there a way for me to take a scanned PDF image and extract data from the image by highlighting the fields that are needed? We scan thousands of PDF images of real estate deeds daily and would like to be able to automate the data entry process. The problem that we are facing is that no two deeds are the same.

It has been said in the comments that Stack Overflow is mainly about programming issues.
Nevertheless, there are possibilities, depending on the actual documents, and the volumes to be processed.
On the high end, there is a product called Teleform, originally developed by Cardiff, and now owned by HP, which is used to process paper forms; you may also look at the Business Process application Cardiff LiquidOffice, now HP LiquidOffice.
On the low end, I have developed an application in PDF, running under Acrobat, which can take a scanned and OCR'd form and transfer the data to a specially prepared fillable form, from which the data can be exported to a database, for example. For more information, a demo and a quote, feel free to contact me in private.
If you want to develop something using Acrobat, you could also begin with an OCR'd document, then use the Redaction function (or the industrial-strength redaction tool Redax by Appligent) to find keywords, and then use the positional information of those keywords to extract more data.
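As a rough illustration of the keyword-position idea, here is a minimal Python sketch (not Acrobat JavaScript, and not how Redax works internally). It assumes the OCR step has already produced a list of words with bounding boxes; the `Word` type, the `value_right_of` function, and the gap and tolerance defaults are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Word:
    """One OCR'd word with its bounding box in page units (hypothetical type)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def value_right_of(words: List[Word], keyword: str,
                   max_gap: float = 200.0, line_tol: float = 5.0) -> Optional[str]:
    """Return the words to the right of `keyword` on the same line, or None."""
    for kw in words:
        # Match the label case-insensitively, ignoring a trailing colon.
        if kw.text.rstrip(":").lower() != keyword.lower():
            continue
        # Keep words on roughly the same baseline that start within
        # `max_gap` units of the keyword's right edge.
        same_line = [
            w for w in words
            if w is not kw
            and abs(w.y0 - kw.y0) <= line_tol
            and kw.x1 <= w.x0 <= kw.x1 + max_gap
        ]
        same_line.sort(key=lambda w: w.x0)
        if same_line:
            return " ".join(w.text for w in same_line)
    return None


# Made-up word boxes standing in for real OCR output.
words = [
    Word("Grantor:", 50, 100, 110, 112),
    Word("Jane", 120, 100, 150, 112),
    Word("Doe", 155, 101, 180, 113),
    Word("Grantee:", 50, 130, 112, 142),
    Word("John", 120, 130, 150, 142),
    Word("Smith", 155, 130, 190, 142),
]
print(value_right_of(words, "Grantor"))
```

Because no two deeds are laid out the same way, anchoring on a label such as "Grantor" tends to be more robust than fixed page coordinates, but real documents would need per-field tuning and manual review of misses.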

Related

Quickly inspect OCR text layer on PDF file

Is there any program that will allow me to superimpose the text (OCR) layer of a PDF on top of the PDF rendering?
I want to quickly see if the text layer has errors or not.
It would be more convenient if that could be done with a program; if not, a CLI command or script would also work.
Superimpose? That implies you'd like to add text, while I believe you'd like access to the text for detection and possibly further analysis of the OCR'd text quality. Perhaps this needs further clarification.
Our developers worked for some time on algorithms to detect the presence of text in PDFs and then evaluate its quality. There are many cases that can trick a basic algorithm - a Bates number or imprinter stamp added to an image-only PDF makes it seem like the PDF has high-quality text while it has no actual text. Some copiers produce "searchable PDFs" using very low-quality OCR that contains many errors, but not necessarily on the first page, which is typically some kind of title page with large fonts, so the first line of text an algorithm encounters seems high quality. Or the first page may have text while other pages do not, yet the algorithm may believe the whole PDF has text.
In our commercial high-volume server-based OCR software (used by service bureaus, SaaS platforms, libraries, backlog conversions, etc.) we now have advanced detection of PDFs with an existing text layer and "smart decisions" which can filter out many of these false-positive situations. Our OCR can skip re-OCRing PDFs that already have high-quality text. If you are looking for a high-quality, inexpensive OCR platform, this detection is one of its features, but it can't be used separately from our OCR, because the OCR workflow is used as part of that filter. Our developers wrote and integrated these algorithms without external tools.
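To make those pitfalls concrete, here is a minimal Python sketch of a per-page check. It is not the algorithm described in this answer, only an illustration under assumptions: the page texts are assumed to have been extracted already by some PDF library, and the function names and thresholds (`min_words`, `min_wordlike_ratio`) are made up for the example.

```python
import re
from typing import List

# A token counts as "word-like" if it is letters only, optionally with an
# apostrophe or hyphen inside and one trailing punctuation mark.
WORDLIKE = re.compile(r"[A-Za-z][A-Za-z'\-]*[.,;:]?")


def page_has_real_text(text: str, min_words: int = 20,
                       min_wordlike_ratio: float = 0.7) -> bool:
    """Heuristic: does this page's text layer look like genuine, usable text?"""
    tokens = text.split()
    if len(tokens) < min_words:
        # Too little text: could be just a Bates number or imprinter stamp.
        return False
    wordlike = sum(1 for t in tokens if WORDLIKE.fullmatch(t))
    return wordlike / len(tokens) >= min_wordlike_ratio


def classify_pdf(page_texts: List[str]) -> str:
    """Judge every page, not only the first one."""
    flags = [page_has_real_text(t) for t in page_texts]
    if flags and all(flags):
        return "searchable"
    if any(flags):
        return "partial"
    return "image-only"


good = "the quick brown fox jumps over the lazy dog " * 3
garbled = "t#3 q~1ck br0wn f0x jvmp5 0v3r 7h3 l@zy d0g " * 3
print(classify_pdf(["ABC0001234", "ABC0001235"]))
print(classify_pdf([good, garbled]))
```

A ratio of word-like tokens is a crude quality signal; a dictionary lookup or language model would be stricter, at the cost of more dependencies.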
I am with www.wisetrend.com where we provide software solutions and consulting for various OCR projects.

How do I search a scanned PDF using vb.net, and then print it?

I've been asked by the company I'm interning at to design a system that could search a scanned PDF file using the SO number and date as unique keys. How would I do this, as I believe a scanned PDF is not searchable?
You're right, you can't do this - at least not quite like that. You'd need to run the scanned PDF through an optical character recognition app first (there are plenty of free ones on the web) and convert it to text, at which point you can start manipulating data.
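Once the OCR step has produced text, the lookup itself is straightforward. Below is a minimal sketch in Python (the same idea carries over to the `Regex` class in VB.NET). The SO number format (`SO` followed by at least four digits) and the DD/MM/YYYY date format are assumptions, as are the function names; adjust the patterns to whatever actually appears on your documents.

```python
import re
from datetime import date
from typing import Dict, Optional, Tuple

# Both patterns are assumptions about how the keys are printed.
SO_PATTERN = re.compile(r"\bSO[-\s]?(\d{4,})\b")
DATE_PATTERN = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")  # DD/MM/YYYY

Key = Tuple[str, date]


def extract_keys(ocr_text: str) -> Optional[Key]:
    """Pull the first SO number and date out of one document's OCR text."""
    so = SO_PATTERN.search(ocr_text)
    d = DATE_PATTERN.search(ocr_text)
    if not so or not d:
        return None
    day, month, year = map(int, d.groups())
    try:
        return so.group(1), date(year, month, day)
    except ValueError:
        # OCR misreads can produce impossible dates such as 45/13/2021.
        return None


def build_index(ocr_by_file: Dict[str, str]) -> Dict[Key, str]:
    """Map (SO number, date) to the file it was found in."""
    index: Dict[Key, str] = {}
    for filename, text in ocr_by_file.items():
        key = extract_keys(text)
        if key is not None:
            index[key] = filename
    return index


# Made-up OCR output for two scanned files.
texts = {
    "scan_001.pdf": "Sales Order SO-10234 Date: 05/03/2021 Customer: ACME",
    "scan_002.pdf": "blank page",
}
index = build_index(texts)
print(index.get(("10234", date(2021, 3, 5))))
```

Files where no key is found are simply left out of the index, so they can be listed separately for manual review.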

High quality alternatives to PDF

We are trying to figure out the best way to create a web service that delivers high-quality textbooks to remote tablets and desktop clients. The books are copyrighted and sold to users, so the delivery must be protected as much as possible against copying. The books' layout is very complicated, with lots of images, pictures, textures, tables, diagrams and the like. They are produced by InDesign in PDF format.
So far, our best guess is to store the PDF in single pages (a PDF per page) and scramble them with asymmetric keys, so all the decryption can be processed in memory with no temporary file generated.
Our concern is that PDF is a proprietary format and sometimes the file is too big (quality is an important concern for the client).
Is there any Open Source alternative to PDF, capable of delivering high quality, complicated layouts in smaller files?
If it is to be viewable offline, your only way around this is to encrypt the document and issue licence keys for viewing it.
There are commercial packages that will allow you to do this, letting you limit the licence to a machine, user or time period.
Ultimately you can't stop people coming up with ingenious ways of copying it; you can only make it more difficult.
You can use high-quality raster images as an alternative to PDF.

PDF tools to analyze PDF attributes

Are there any PDF tools that generate information about the loading time and memory usage when displaying a PDF in a browser, and also the total number of elements inside the PDF?
Unfortunately not really. I've done some of this research, not for PDF in a browser but (and perhaps this is what you are looking at as well) PDF on mobile devices.
There are a number of factors that contribute and that to some extent can be tested for:
Whether or not big images exist in the PDF and what resolution they are. This is linked directly to memory usage.
What compression method is used for image compression. Decompressing JPEG-2000 images specifically can increase load time significantly. Even worse, as JPEG-2000 can be progressively decompressed, it can give the appearance of a really bad PDF until the images have been fully decompressed and loaded (this is especially ugly on somewhat older tablets, for example).
How complex the transparency effects are that are used in the document.
How many fonts are used in the document.
How many line-art objects (vector elements) with a large number of nodes (points) are used on a page.
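To illustrate the first factor, the memory an image needs once decoded can be estimated from its pixel dimensions alone, regardless of how small the compressed stream is in the file. This is a back-of-the-envelope sketch; the function name is made up, and real viewers may downsample, tile or cache, so treat the result as a rough upper bound rather than a measurement.

```python
def decoded_image_bytes(width_px: int, height_px: int,
                        channels: int = 3, bits_per_channel: int = 8) -> int:
    """Approximate size of the uncompressed bitmap in bytes."""
    return width_px * height_px * channels * bits_per_channel // 8


# A 3000 x 2000 pixel image in three common colour modes.
rgb = decoded_image_bytes(3000, 2000)                  # 8-bit RGB
cmyk = decoded_image_bytes(3000, 2000, channels=4)     # 8-bit CMYK
bilevel = decoded_image_bytes(3000, 2000, channels=1,
                              bits_per_channel=1)      # 1-bit black and white

for label, size in [("RGB", rgb), ("CMYK", cmyk), ("1-bit", bilevel)]:
    print(f"{label}: {size / 1_000_000:.2f} MB")
```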
You can test what is in the document using Acrobat Pro to some extent (there is a well-hidden tool when you save an optimised PDF file that can audit which objects use how much of the space in a PDF document). You can also use a preflight solution such as pdfToolbox from callas (I'm affiliated with this company) or PitStop from Enfocus; these tools let you get a report with the results of custom checks such as image resolution, compression, vector objects, color spaces etc.

Overlaying one PDF over another

I am working on a new in-house automated artwork workflow system. The new system delivers a stepped up PDF ready for offset print. The artwork is black only with barcodes and variable data text. There might be between 4 and 50 stepped up artworks per SRA3 sheet.
However, the new system is unable to add tick marks, gripper marks and other information around the edge of the sheet that our production teams would like. Our attempts to add these are not always fit for purpose, as the off-the-shelf software used to generate the variable data was intended for thermal printers.
What would be better is if we had the tick marks saved on another series of template PDFs. We could then superimpose these on the automated artwork.
We are talking about hundreds of PDFs per day, and we would need this process to be as simple as possible, requiring little skill on the part of the operator, or even automated/scripted.
I have read a similar post where adding a watermark with Acrobat was recommended. This worked a treat for me, but considering the high volumes of artwork, even if made into an Acrobat Batch Process/Action, this would be too involved.
Any ideas welcome: Rip Software, AppleScript, MSDOS!!! Whatever!
Cheers
Tim
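One scriptable option, offered as a sketch rather than a tested production workflow, is the open-source pypdf library for Python, whose `merge_page` method draws the content of one page on top of another. The example assumes pypdf is installed, that each template is a single-page PDF with the same page size as the artwork, and that the folder layout and the `_marked` file suffix are acceptable; `output_path`, `overlay_marks` and `process_folder` are hypothetical names.

```python
from pathlib import Path


def output_path(artwork: Path, out_dir: Path) -> Path:
    """Name the result after the artwork file, with a '_marked' suffix."""
    return out_dir / f"{artwork.stem}_marked.pdf"


def overlay_marks(artwork: Path, template: Path, out: Path) -> None:
    """Draw the template's first page on top of every artwork page."""
    # Imported here so the rest of the script loads without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    marks = PdfReader(str(template)).pages[0]
    writer = PdfWriter()
    for page in PdfReader(str(artwork)).pages:
        page.merge_page(marks)  # template content is placed over the artwork
        writer.add_page(page)
    with open(out, "wb") as fh:
        writer.write(fh)


def process_folder(in_dir: Path, template: Path, out_dir: Path) -> None:
    """Apply the same marks template to every PDF in a folder."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for artwork in sorted(in_dir.glob("*.pdf")):
        overlay_marks(artwork, template, output_path(artwork, out_dir))


# Example call (paths are placeholders):
# process_folder(Path("incoming"), Path("templates/sra3_marks.pdf"), Path("marked"))
```

Run from a scheduled task or a watched folder, this would need no operator input. Before relying on it for offset print, check the output in your preflight tool, since page boxes and colour spaces should be verified on real files.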