OCR PDF Files Using Google Cloud Vision?

Are there currently any services or software tools that use Google Cloud Vision as a backend for OCRing scanned PDF files?
If not, how would one use Google Cloud Vision to turn PDFs into OCRed PDFs? As far as I know, Cloud Vision currently supports PDF files, but it outputs the recognized text only as a JSON file. So it seems one would need to place the recognized text on top of the image inside the PDF in a separate step, outside of Google Cloud Vision.
Background:
I often have to convert scanned-document PDF files into PDF files containing an OCRed text layer. So far, I've been using software like OCRKit or ABBYY FineReader. I tested the accuracy of these solutions against the text recognition abilities of Google Cloud Vision, and the latter came out far ahead.

As others have mentioned, you need third-party tools to do this.
First, convert the Google Cloud Vision response JSON to an hOCR file using gcv2hocr:
gcv2hocr test.jpg.json output.hocr
Then use hocr-tools to stitch the hOCR data into the PDF file. The command below looks in the 'imgdir' folder and merges .hocr and .jpg files with the same name into pages in out.pdf.
hocr-pdf --savefile out.pdf <imgdir>
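For a folder with many pages, the two commands above could be scripted. The following is only a minimal sketch: the function name plan_ocr_commands and the assumption that each page.jpg has its Vision response saved next to it as page.jpg.json are illustrative, not part of gcv2hocr or hocr-tools.

```python
from pathlib import Path


def plan_ocr_commands(imgdir, savefile="out.pdf"):
    """Return the argv lists for gcv2hocr (one per page) and hocr-pdf."""
    imgdir = Path(imgdir)
    commands = []
    # Assumed naming: page.jpg has its Vision response in page.jpg.json.
    for json_path in sorted(imgdir.glob("*.jpg.json")):
        image_path = json_path.with_suffix("")       # page.jpg
        hocr_path = image_path.with_suffix(".hocr")  # page.hocr
        commands.append(["gcv2hocr", str(json_path), str(hocr_path)])
    # hocr-pdf pairs page.hocr with page.jpg inside imgdir.
    commands.append(["hocr-pdf", "--savefile", savefile, str(imgdir)])
    return commands
```

Each returned list can be passed to subprocess.run(cmd, check=True), provided both tools are installed and on the PATH.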

As you mentioned, the Vision API returns its responses only in JSON format. Your solution therefore needs an additional step that uses third-party libraries to create a PDF file from the response's content.
If this doesn't cover your needs, you can use the Send Feedback button, located at the lower left and upper right corners of the service's public documentation, or use the Issue Tracker to raise a Vision API feature request and let Google know about the desired functionality.
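As a rough illustration of where that additional step starts, the sketch below walks one parsed Vision response and yields each word with its bounding box. It assumes the usual fullTextAnnotation layout (pages, blocks, paragraphs, words, symbols); the function name iter_words is hypothetical, and the comment about normalized coordinates for PDF input reflects my understanding rather than anything stated in this thread.

```python
def iter_words(response):
    """Yield (text, vertices) for each word in one Vision response dict."""
    annotation = response.get("fullTextAnnotation", {})
    for page in annotation.get("pages", []):
        for block in page.get("blocks", []):
            for paragraph in block.get("paragraphs", []):
                for word in paragraph.get("words", []):
                    text = "".join(s.get("text", "") for s in word.get("symbols", []))
                    box = word.get("boundingBox", {})
                    # Assumption: PDF input reports normalizedVertices (0..1),
                    # while image input reports pixel vertices.
                    vertices = box.get("normalizedVertices") or box.get("vertices", [])
                    yield text, [(v.get("x", 0), v.get("y", 0)) for v in vertices]
```

The yielded boxes are what a text-layer writer needs in order to position invisible text over the scanned page.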

Solution for starting with a PDF and using Vision's document text detection:
gcv2hocr works only with one very specific Vision JSON format, not with the output of document text detection. I refactored that code to create the correct hOCR.
The second issue is that hocr-pdf expects a JPEG image as input, not a PDF. I refactored hocr-pdf and included pdf2image:
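import io
from pdf2image import convert_from_path
from reportlab.lib.pagesizes import letter
from reportlab.lib.utils import ImageReader
from reportlab.pdfgen.canvas import Canvas

# Rasterize the first PDF page; pdf2image renders at 200 DPI by default.
image = convert_from_path(imageFilePath)[0]
image_obj = io.BytesIO()
image.save(image_obj, format='jpeg')
image_obj.seek(0)
image_obj = ImageReader(image_obj)
can = Canvas(savefile, pagesize=letter)
# Convert the pixel size to PDF points (72 per inch) at 200 DPI.
width, height = image.size
width *= 72 / 200
height *= 72 / 200
can.setPageSize((width, height))
can.drawImage(image_obj, 0, 0, width=width, height=height)
# load_invisible_font and add_text_layer are the refactored hocr-pdf helpers.
load_invisible_font()
add_text_layer(can, height)
can.showPage()
can.save()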
This is better than a PyPDF2 solution because converting the page to an image first discards any existing hidden text layer.
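The 72 / 200 factor in the code above converts pixels to PDF points: a point is 1/72 inch, and 200 is the DPI the page was rendered at. A small sketch of that arithmetic, with a hypothetical helper name:

```python
def pixels_to_points(size_px, dpi=200):
    """Convert a (width, height) in pixels to PDF points (1/72 inch)."""
    width_px, height_px = size_px
    return (width_px * 72 / dpi, height_px * 72 / dpi)


# A US Letter page (8.5 x 11 in) rendered at 200 DPI is 1700 x 2200 pixels.
print(pixels_to_points((1700, 2200)))  # (612.0, 792.0)
```

If the page is rendered at a different DPI, the same value has to be used in the conversion, otherwise the image and the text layer end up at the wrong scale.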

Related

How do I use Google's Vision API to convert a PDF (non-searchable) to a searchable PDF?

From what I've seen, Google's Vision API lets you perform OCR on a PDF, but it returns only the detected text in a JSON format. What I need is a searchable (OCR'd) PDF file in return. Is this possible?
Notice that the OutputConfig type doesn't have any metadata field to configure the resulting file's format. As you are already aware, the API returns a JSON response. You could either get the JSON data from the API first and then explore any of the following repositories for JSON-to-PDF conversion, or run a specialized tool such as OCRmyPDF directly on your source PDF and avoid the API altogether.
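For the OCRmyPDF route, a minimal sketch of assembling the command line from Python might look like this. The helper name and the file names in the usage note are placeholders. Keep in mind that OCRmyPDF uses Tesseract for recognition, so accuracy may differ from Cloud Vision.

```python
def build_ocrmypdf_command(src, dst, language="eng", force_ocr=False):
    """Build the argv list for an OCRmyPDF run on one PDF."""
    cmd = ["ocrmypdf", "--language", language]
    if force_ocr:
        # Rasterize every page and redo OCR, discarding any existing text layer.
        cmd.append("--force-ocr")
    cmd += [src, dst]
    return cmd
```

With OCRmyPDF installed, subprocess.run(build_ocrmypdf_command("scanned.pdf", "searchable.pdf"), check=True) would write the searchable copy.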

Multiple fileType for Google Custom Search API during image searching

Currently, I'm using the Google Custom Search API to perform image searches over REST:
https://developers.google.com/custom-search/json-api/v1/reference/cse/list#request
I was wondering what the correct way is to specify multiple file types.
fileType is mentioned only briefly in the API documentation.
For instance, the following GET request searches only for BMP images.
https://www.googleapis.com/customsearch/v1?key=GOOGLE_API_KEY&cx=GOOGLE_SEARCH_ENGINE_ID&q=picasso&searchType=image&fileType=bmp
However, what if I want to search for both PNG and BMP images?
At first, I thought
fileType=png,bmp
might work as mentioned in http://codigogenerativo.com/code/google-custom-search-api/
However, if I test using
https://www.googleapis.com/customsearch/v1?key=GOOGLE_API_KEY&cx=GOOGLE_SEARCH_ENGINE_ID&q=picasso&searchType=image&fileType=png,bmp
JPEG images are being returned.
It seems my assumption is wrong. Does anyone know how to specify multiple fileType values for the Google Custom Search API during image searching?
P.S. A similar question was raised a few years ago, but it has no concrete answer yet: Multiple file types search using Google Custom Search API
https://www.googleapis.com/customsearch/v1?fileType=jpg,png,AND OTHER TYPES YOU NEED&key=GOOGLE_API_KEY&cx=GOOGLE_SEARCH_ENGINE_ID&q=picasso&searchType=image
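If a combined value does not filter as expected (the question reports JPEG results for fileType=png,bmp), one possible fallback is to issue one request per file type and merge the results on the client. This is an assumption on my part, not something the thread confirms. The sketch below only builds the URLs; the function name is hypothetical.

```python
from urllib.parse import urlencode

ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_image_search_urls(api_key, engine_id, query, file_types):
    """Return one image-search URL per requested file type."""
    urls = []
    for file_type in file_types:
        params = {
            "key": api_key,
            "cx": engine_id,
            "q": query,
            "searchType": "image",
            "fileType": file_type,  # a single type per request
        }
        urls.append(f"{ENDPOINT}?{urlencode(params)}")
    return urls
```

Calling it with ["png", "bmp"] gives two URLs whose responses can be fetched and concatenated, at the cost of one API call per type.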

Windward Document Generation - PDF Copy Protected

In our application, we generate a few reports and documents through Windward. The documents are generated based on specific user conditions and the user is able to download the document.
As part of a new requirement, we would like to enable copy protection on the generated PDF -- basically, users should not be able to copy the contents of the document.
Is there any way we can achieve this through Windward? Or do we have to integrate with external third-party software like LockLizard or Win2PDF?
We did think of converting the document to an image and recreating the PDF, but this is unacceptable because the document formatting came out wrong.
Appreciate any insights or alternate solutions.
Thanks,
Aravind
Windward does this. If you're using the Java engine use the following calls (javadoc):
ProcessPdfAPI.setOwnerPassword()
ProcessPdfAPI.setUserPassword()
For the .NET engine use the following calls (api docs):
ReportPdf.OwnerPassword
ReportPdf.Security
ReportPdf.UserPassword
Is this what you need?

Dynamically Update a Map

I have a bit of a situation. I was assigned a task to create a system that will take a KML file and update markers dynamically on a map. I'm currently generating the KML from a Wireshark Dissection and now need a way to take said data into a mapping tool. There are a few situations:
The PC that will be running the system will not have internet access, so I will need to cache the map data.
Each marker might move location so I need to erase said marker's previous location and update it with a new marker location. I do have a sequence ID I can identify it with, but I don't know how I'll update the new location.
It needs to be dynamically updated. A system will send data, my Wireshark Dissector will dissect the data and export it into a KML. This KML will need to be dynamically loaded into the system.
The basic idea in mind is like looking at Google Maps and watching your car move as it tracks your GPS location. But I need to make this tracking system work for a lot more targets than just one.
I'm sorry that I currently have no foundation to start from, which is why I'm asking for your guidance. I've looked into ArcGIS, QGIS, Google Earth, and Google Maps, but I haven't found a way to load data dynamically or refresh the system.
Anything that could help me start finding a solution for this task will be appreciated.
Thank you for your time.
I have experience using Leaflet (leaflet.js), which lets you use Bing Maps, Google Maps, or open-source MapQuest to display mobile tracking and car tracking (for GM OnStar). I am also currently writing KML to display flight tracking on Google Earth.
First, I am not sure whether this combination is possible:
your machine is not connected to the internet
you want to use map resources that live on the internet
So I will assume that your machine can access the internet. In that case, there are many solutions.
You could start with the simple tutorial at http://leafletjs.com/
It will give you an idea of how to do it.
You can also search for examples for Google Earth (on which I can display a 3D tracking route).
Hope this helps.
Besides maps, see my sample on "Dynamic update data on Google Earth" at the following link:
https://sites.google.com/site/canadadennischen888/home/kml/auto-refresh-3d-tracking
Hope this helps.
(The following is copied from my link, which talks about KML for 3D Google Earth. But I believe you can adapt it to 2D if you cannot use Google Earth.)
...
How to make a dynamic auto-refresh 3D tracking:
Prepare a RESTful service that generates the KML file from the DB
(sample at https://sites.google.com/site/canadadennischen888/home/kml/3d-tracking)
Other JSP code of mine generates a KMZ file that links to my RESTful service. The KMZ file uses onInterval (as shown at the bottom).
A JSP web page lets the user download the KMZ file.
When Google Earth opens the KMZ file, it auto-refreshes to get new data from that RESTful service.
On every refresh, the server sends the latest KML data to Google Earth.
KMZ sample:
<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2" xmlns:gx="http://www.google.com/kml/ext/2.2"
     xmlns:kml="http://www.opengis.net/kml/2.2" xmlns:atom="http://www.w3.org/2005/Atom">
  <NetworkLink>
    <name>Dennis_Chen_Canada#Hotmail.com</name>
    <open>1</open>
    <Link>
      <href>http://localhost:9080/google-earth-project/rest/kml/10001/20002</href>
      <refreshMode>onInterval</refreshMode>
    </Link>
  </NetworkLink>
</kml>
see result
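On the server side, the service behind the href has to return a fresh KML document on every refresh. A minimal sketch of generating one, assuming the latest positions are kept in a dictionary keyed by the sequence ID from the question (the function name, the "target-" id prefix, and the data layout are all illustrative):

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"


def build_kml(positions):
    """Build a KML document with one Placemark per tracked target.

    positions maps a sequence ID to (longitude, latitude).
    """
    ET.register_namespace("", KML_NS)
    kml = ET.Element(f"{{{KML_NS}}}kml")
    document = ET.SubElement(kml, f"{{{KML_NS}}}Document")
    for seq_id, (lon, lat) in sorted(positions.items()):
        placemark = ET.SubElement(
            document, f"{{{KML_NS}}}Placemark", id=f"target-{seq_id}"
        )
        ET.SubElement(placemark, f"{{{KML_NS}}}name").text = str(seq_id)
        point = ET.SubElement(placemark, f"{{{KML_NS}}}Point")
        # KML coordinates are ordered longitude,latitude[,altitude].
        ET.SubElement(point, f"{{{KML_NS}}}coordinates").text = f"{lon},{lat},0"
    return ET.tostring(kml, encoding="unicode")
```

Because each response contains only the latest position per ID, the reloaded document should replace the previous markers rather than accumulate them. The refresh period can be set with a refreshInterval element (in seconds) next to refreshMode inside Link; as far as I know it defaults to 4 seconds when omitted.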

PDF to HTML and OCR solution for information extraction

I'm looking for a PDF-to-HTML and OCR solution, either as a cloud service or as an SDK. From my searches, I can see there are a bunch of services out there. I tried some of them and got some impressions. I'd like to know whether any of you use such a service.
My biggest concern is having an automated pipeline that produces HTML output I can use for information extraction. I'd like structured output, such as tables. (Most of the services provide HTML output in a per-character format (a CSS/HTML tag for each character) or a per-paragraph format (CSS/HTML for each line).)
So far I have checked:
ABBYY Cloud SDK (They don't have a PDF-to-HTML service, but they have PDF-to-XML, which may be convertible to HTML with XSLT. Their OCR service with text output is also quite good.)
cloudconvert.org (It gives the same results as the Ubuntu pdftohtml command, which is based on poppler-Xpdf3.0.)
pdftohtml command (tested on Ubuntu): I got a result full of <p> tags.
Aspose.PDF (They don't have a PDF-to-HTML service in the cloud, but they have good integration with GDrive, Dropbox, and Amazon S3.)
PDFNet by PDFTron: I got a result with complex CSS and HTML structure, with almost one tag per character.
My question is whether you know of any other service worth trying that produces structured HTML output for data extraction.
Thanks in advance.