quickly inspect OCR text layer on PDF file - pdf

Is there any program that will allow me to superimpose the text (OCR) layer of a PDF on top of the PDF rendering?
I want to quickly see if the text layer has errors or not.
A program would be the most convenient option; failing that, a CLI command or script would also work.

Superimpose? That implies you'd like to add text, whereas I believe you want access to the existing text in order to detect it and possibly analyze the quality of the OCRed text further. This may need further clarification.
Our developers worked for some time on algorithms to detect the presence of text in PDFs and then evaluate its quality. There are many cases that can trick a basic algorithm. A Bates number or imprint added to an image-only PDF makes it seem like the PDF has high-quality text while it has no actual text. Some copiers produce "searchable PDFs" using very low-quality OCR that contains many errors, but not necessarily on the first page, which is typically some kind of title page with large fonts, so the first line of text encountered by an algorithm seems high quality. Or the first page may have text while other pages do not, yet an algorithm may believe the whole PDF has text.
In our commercial high-volume server-based OCR software (used by service bureaus, SaaS platforms, libraries, backlog conversions, etc.) we now have advanced detection of PDFs with an existing text layer and "smart decisions" which can filter out many of these false-positive situations. Our OCR can skip re-OCRing PDFs that already have high-quality text. If you are looking for a high-quality, inexpensive OCR platform, such detection is a feature of it, but it can't be used separately from our OCR, because the OCR workflow is used as part of that filter. Our developers wrote and integrated these algorithms without external tools.
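To make those pitfalls concrete, here is a minimal sketch of a per-page check. It is not the algorithm used in any product. It assumes you have already extracted the text layer of each page into a list of strings with whatever tool you prefer. The thresholds and the "clean token" rule are hypothetical and only make sense for English-like text.

```python
import re

# Hypothetical thresholds; tune them on your own documents.
MIN_WORDS_PER_PAGE = 20   # fewer words than this looks like a stamp or Bates number
MIN_CLEAN_RATIO = 0.7     # minimum share of tokens that look like ordinary words

# Assumption: a "clean" token is purely alphabetic, optionally followed by
# one punctuation mark. Real OCR scoring would need a dictionary or language model.
CLEAN_TOKEN = re.compile(r"^[A-Za-z]+[.,;:!?]?$")


def classify_page(text):
    """Label one page's text layer as 'no-text', 'stamp-only', 'suspect' or 'ok'."""
    tokens = text.split()
    if not tokens:
        return "no-text"
    if len(tokens) < MIN_WORDS_PER_PAGE:
        return "stamp-only"
    clean = sum(1 for t in tokens if CLEAN_TOKEN.match(t))
    if clean / len(tokens) < MIN_CLEAN_RATIO:
        return "suspect"
    return "ok"


def classify_document(page_texts):
    """Classify every page, so a good first page cannot hide bad later pages."""
    return [classify_page(t) for t in page_texts]


def needs_ocr(page_texts):
    """True if any page lacks a plausible text layer."""
    return any(label != "ok" for label in classify_document(page_texts))
```

The point of the design is that every page is judged separately, so a clean title page or a stamped number does not make the whole file look searchable.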
I am with www.wisetrend.com where we provide software solutions and consulting for various OCR projects.

Related

How to increase the page turn speed

Are there some general rules how to create pdf documents in order to achieve an optimal page turn speed?
I recently created few pdf documents (without graphics) using Microsoft Word and realized that my ebook reader (SONY) leafs through them at a slower pace compared with some pdf documents containing more pages and graphics.
What features of pdf documents or ebook readers affect the page turn speed?
How should these features be configured to increase the page turn speed?
Thanks
There are a few major items that could affect PDF document rendering speeds in the general case. These items are, in order of severity:
Vector graphics - this is by far the worst offender. Too many vector graphics can bring document rendering to a crawl.
High-resolution images - even if the images in the document appear small, the originals stored in the document could be high resolution. This leads to slow rendering as well.
Font problems - this category is expanded on below
I imagine issues with fonts are what comes into play here. The rendering speed can be affected by the number of fonts in a PDF document. Also, the use of multi-byte and CID fonts can sometimes trip up a mobile viewer.
Another problem could be the version of Word used. Older versions don't do as good a job of generating PDF documents as newer versions.

PDF data extraction

Is there a way for me to take a scanned PDF image and extract data from the image by highlighting the fields that are needed? We scan thousands of PDF images of real estate deeds daily and would like to be able to automate the data entry process. The problem that we are facing is that no two deeds are the same.
It has been said in comments that Stackoverflow is mainly about programming issues.
Nevertheless, there are possibilities, depending on the actual documents, and the volumes to be processed.
On the high end, there is a product called Teleform, originally developed by Cardiff, and now owned by HP, which is used to process paper forms; you may also look at the Business Process application Cardiff LiquidOffice, now HP LiquidOffice.
On the low end, I have developed an application in PDF, running under Acrobat, which can take a scanned and OCR'd form and transfer the data to a specially prepared fillable form, from where the data can be exported to a database, for example. For more information, a demo and a quote, feel free to contact me in private.
If you want to develop something using Acrobat, you could also begin with an OCR'd document, and then use the capabilities of the Redaction function (or use the industrial-strength redaction tool Redax by Appligent) to find keywords, and then use the positional information of those keywords to extract more data.
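The keyword-plus-position idea can be sketched independently of Acrobat. The snippet below is only an illustration. It assumes your OCR step gives you each word with a bounding box. The `Word` tuple, the `value_right_of` helper and the `max_gap` parameter are hypothetical names, not part of any product mentioned here.

```python
from typing import NamedTuple


class Word(NamedTuple):
    """One OCR'd word with its bounding box (hypothetical layout, page units)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def value_right_of(words, keyword, max_gap=200.0):
    """Return the words to the right of `keyword` on the same line, or None."""
    # Find the first word matching the keyword, ignoring case and a trailing colon.
    anchor = next(
        (w for w in words if w.text.rstrip(":").lower() == keyword.lower()),
        None,
    )
    if anchor is None:
        return None
    mid_y = (anchor.y0 + anchor.y1) / 2
    same_line = [
        w for w in words
        if w.y0 <= mid_y <= w.y1            # box spans the keyword's midline
        and w.x0 >= anchor.x1               # starts to the right of the keyword
        and w.x0 - anchor.x1 <= max_gap     # and is not too far away
    ]
    same_line.sort(key=lambda w: w.x0)
    return " ".join(w.text for w in same_line) or None
```

For example, with words for a line reading "Grantor: Jane Doe", `value_right_of(words, "grantor")` would return "Jane Doe". Because no two deeds are laid out the same way, a real system would need several such rules and a manual review step.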

Pdf tools to analyze pdf attributes

Are there any PDF tools that report the loading time and memory usage needed to display a PDF in a browser, and also the total number of elements inside the PDF?
Unfortunately not really. I've done some of this research, not for PDF in a browser but (and perhaps this is what you are looking at as well) PDF on mobile devices.
There are a number of factors that contribute and that to some extent can be tested for:
Whether or not big images exist in the PDF and what resolution they are. This is linked directly to memory usage.
What compression method is used for image compression. Decompressing JPEG-2000 images specifically can increase load time significantly. Even worse, as JPEG-2000 can be progressively decompressed, it can give the appearance of a really bad PDF until the images have been fully decompressed and loaded (this is especially ugly on somewhat older tablets, for example).
How complex the transparency effects are that are used in the document.
How many fonts are used in the document.
How many line-art objects (vector elements) with a large number of nodes (points) are used on a page.
You can test what is in the document using Acrobat Pro to some extent (there is a well-hidden tool when you save an optimised PDF file that can audit what objects use how much of the space in a PDF document). You can also use a preflight solution such as pdfToolbox from callas (I'm affiliated with this company) or pitstop from enfocus; these tools would allow you to get a report with the results of custom checks such as image resolution, compression, vector objects, color spaces etc.
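If you only want a rough first look before reaching for those tools, a crude byte-level scan can count some of the factors listed above. This is a sketch under strong assumptions, not a replacement for a preflight tool. It only matches PDF name tokens in the raw file, so anything stored inside compressed object streams (fonts in particular) may be undercounted, and a soft-mask reference is only a hint that transparency is used. The function names are hypothetical.

```python
import re

# Raw PDF name tokens to look for. /JPXDecode is the filter for JPEG-2000
# image data and /DCTDecode the filter for ordinary JPEG data.
PATTERNS = {
    "images": rb"/Subtype\s*/Image\b",
    "jpeg2000_images": rb"/JPXDecode\b",
    "jpeg_images": rb"/DCTDecode\b",
    "fonts": rb"/Type\s*/Font\b",      # \b excludes /FontDescriptor
    "soft_masks": rb"/SMask\b",        # a hint that transparency is used
}


def audit_pdf_bytes(data):
    """Count occurrences of each pattern in the raw bytes of a PDF."""
    return {name: len(re.findall(pattern, data)) for name, pattern in PATTERNS.items()}


def audit_pdf_file(path):
    """Read a PDF from disk and run the same crude scan."""
    with open(path, "rb") as fh:
        return audit_pdf_bytes(fh.read())
```

High counts of JPEG-2000 images or soft masks are a reason to run a proper preflight report; low counts do not prove the file is fast to render.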

Reducing the size of pdf generated from software using proprietary fonts

I am trying to bring an Indian magazine online. This magazine is typeset in CorelDraw using the proprietary Devanagari font Shree-Lipi (http://www.modular-infotech.com/html/shreelipi.html). The vendor provides a USB dongle that has to be attached to the machine whenever you want to access the fonts, and this software has been in use for the past 10 years.
To put the magazine online, we've tried to convert it to PDF (by printing). The resulting PDF is of the order of 30-50MB, even though it does not contain a single image. I am guessing it converts the whole text into an image.
It would be really difficult for users to read this magazine given its size. When I convert it to .swf format (to add flipbook-style functionality), the size drops to 5-6MB. But there are people who like to download the magazine and then read it. I have had no luck reducing the size of the PDF.
I have done a lot of research on the web. PostScript and PrimoPDF do not help much. The best I could get was a 30% reduction using the DocuCom PDF printer, but that is still 20MB. I have tried to play with resolution, compression and quality, but the best I could get was 18MB.
Ideally I would like to reduce it to less than 2MB.
I would be really grateful if you could help me reduce the size of the pdf! Considering that it has no images, I am hopeful that I can get some really good compression.
The (35MB) magazine can be downloaded from: http://merajhola.in/jin-march.pdf
I can't see any easy way to reduce the size of this PDF. There are no embedded fonts and all the text is drawn using vector graphics primitives. No amount of tweaking the resolution, compression and quality will make a significant difference.
One possible option would be to embed the font as a subset rather than use vector graphics. That will almost certainly make a big difference, however I doubt the proprietary font license will allow it.
I'm sorry, but this Shree-Lipi thing just sounds wrong in 2012. It would be much better to use proper OpenType fonts with modern (say InDesign) or free (say LuaTeX) software.

PostScript versus PDF as an output format

I'm currently writing a typesetting application and I'm using PSG as the backend for producing postscript files. I'm now wondering whether that choice makes sense. It seems the ReportLab Toolkit offers all the features PSG offers, and more. ReportLab outputs PDF however.
Advantages PDF offers:
transparency
better support for character encodings (Unicode, for example)
ability to embed TrueType and even OpenType fonts
hyperlinks and bookmarks
Is there any reason to use Postscript instead of directly outputting to PDF? While Postscript is a full programming language as opposed to PDF, as a basic output format for documents, that doesn't seem to offer any advantage. I assume a PDF can be readily converted to PostScript for printing?
Some useful links:
Wikipedia: PDF
Adobe: PostScript vs. PDF
If you're planning on only outputting to a PostScript printer, then use PostScript. Otherwise, use PDF.
PDF is more widely supported by non-printer devices. And for your purposes, there aren't any technical advantages of PS over PDF (other than being able to dump the file directly to a printer).
Here are some things to consider:
gzipped postscript is often much smaller than an equivalent PDF
PDF is basically a generalized container format; if you didn't know that you can embed videos in PDF, that should give you pause
PDF contains scripts that have been used for exploits (though this may be more the fault of bad PDF reader software)
PDF is a much more self-contained format with a high level of functionality. It also has more tools. Unless you specifically need PostScript, stick to PDF.
Avoid PDF like the plague. Adobe invented PDF and pushed it to consumers to make more money from suckers who believed all the hype that Adobe told its users. PDF is a bloated format that requires a slow and non-free reader to read and process correctly. Most free readers do not support 100% of Adobe's features and likely support only a subset of features that is also found in PostScript. For instance, ReportLab does not support 100% of PDF features.
Historical fake technical arguments for using PDF have been:
PDF has no loops, which helps processing. False, as other formats without loops, such as XML, also have memory and processing issues.
PDF is more fully featured. A false argument, as PostScript is more powerful and can do what PDF can do with fewer features.
PostScript has to load all of the pages because it is a language. This is of course not true, as C, C++, Java and many other languages can load code at runtime.
PostScript is missing feature X. True, but mostly because Adobe invented a new format to make money, not because feature X cannot be added to PostScript.
The real reason to use PDF instead of Postscript is that PDF readers are more common than Postscript readers.