How do I use Google's Vision API to convert a PDF (non-searchable) to a searchable PDF? - pdf

From what I've seen, Google's Vision API lets you perform OCR on a PDF, but it returns only the detected text in a JSON format. What I need is a searchable (OCR'd) PDF file in return. Is this possible?

Note that the OutputConfig type has no field for choosing the format of the resulting file, so, as you are already aware, the API returns only a JSON response. You have two options: get the JSON data from the API and convert it to PDF with a separate JSON-to-PDF tool, or skip the API altogether and run a specialized tool such as OCRmyPDF, which is built for exactly this purpose, directly on your source PDF.
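To make the point about OutputConfig concrete, here is a minimal sketch of the request body for the asynchronous PDF endpoint (files:asyncBatchAnnotate). The bucket paths and the function name are placeholders of mine, and the field names follow the public REST reference as I understand it. The output side only takes a Cloud Storage destination and a batch size; there is nothing to request a PDF result.

```python
import json


def build_pdf_ocr_request(source_uri, destination_prefix, batch_size=2):
    """Build the JSON body for an asynchronous Vision PDF OCR request.

    source_uri and destination_prefix are gs:// paths (placeholders here).
    """
    return {
        "requests": [
            {
                "inputConfig": {
                    "gcsSource": {"uri": source_uri},
                    "mimeType": "application/pdf",
                },
                "features": [{"type": "DOCUMENT_TEXT_DETECTION"}],
                # Only a destination and a batch size: no output format option.
                "outputConfig": {
                    "gcsDestination": {"uri": destination_prefix},
                    "batchSize": batch_size,
                },
            }
        ]
    }


body = build_pdf_ocr_request("gs://my-bucket/scan.pdf", "gs://my-bucket/ocr-output/")
print(json.dumps(body, indent=2))
```

The JSON files written under the destination prefix are what you would then feed into a separate step that builds the text layer.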

Related

Can't send a binary attachment to Slack using LogicApp

I'm trying to use a Logic App to get the content of an email and post it to Slack. By content I mean:
the body of the email and other elements like From:, Subject:
any attachment in the email (which usually are binary like PDF, Excel, image)
the email itself saved in a blob as .eml file
Slack's chat.postMessage API works without any problem for sending text elements. This API has an attachment argument, but it doesn't seem to be designed for binary files (or for files at all, only strings).
I've tried Slack's files.upload instead but couldn't figure out the syntax, especially for a regular HTTP POST. I could find examples online using curl, Python, JS and the C# SDK, but I don't know how to translate them into a plain HTTP POST like the one I use with chat.postMessage.
I've tried the API in SoapUI, using file as the argument as per the documentation, and I've put it in different places: in the header, in the body, and in the Attachment tab. None of them work, and I always get the same error message: no_file_data
Unfortunately, the Slack documentation lacks detail. Here's what it says about files.upload:
You must provide either a file or content parameter.
The content of the file can either be posted using an enctype of multipart/form-data (with the file parameter named file), in the usual way that files are uploaded via the browser, or the content of the file can be sent as a POST var called content. The latter should be used for creating a "file" from a long message/paste and forces "editable" mode.
In both cases, the type of data in the file will be intuited from the filename and the magic bytes in the file, for supported formats.
I could use alternatives such as saving the attachments in blobs and using Azure Functions to send the file, but I want to understand what the limitations are before changing the method.
Any clue?
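For reference, this is roughly what "an enctype of multipart/form-data (with the file parameter named file)" means at the raw HTTP level. It is a minimal sketch using only the standard library: the function name is mine, and the `channels` field in the usage note is an assumption about what else the endpoint expects. Only the `file` part name comes from the documentation quoted above.

```python
import uuid


def build_multipart_body(fields, file_field, filename, file_bytes,
                         content_type="application/octet-stream"):
    """Return (body, content_type_header) for a multipart/form-data POST."""
    boundary = uuid.uuid4().hex
    lines = []
    # Plain text fields: one part each.
    for name, value in fields.items():
        lines.append(f"--{boundary}".encode())
        lines.append(f'Content-Disposition: form-data; name="{name}"'.encode())
        lines.append(b"")
        lines.append(str(value).encode())
    # The binary part; its name must match what the server expects ("file").
    lines.append(f"--{boundary}".encode())
    lines.append(
        f'Content-Disposition: form-data; name="{file_field}"; '
        f'filename="{filename}"'.encode()
    )
    lines.append(f"Content-Type: {content_type}".encode())
    lines.append(b"")
    lines.append(file_bytes)
    # Closing boundary, followed by a final CRLF.
    lines.append(f"--{boundary}--".encode())
    lines.append(b"")
    body = b"\r\n".join(lines)
    return body, f"multipart/form-data; boundary={boundary}"
```

Calling `build_multipart_body({"channels": "C123"}, "file", "report.pdf", pdf_bytes, "application/pdf")` gives the bytes to send as the request body and the value for the Content-Type header. The file is one part of the body, not a header or a separate attachment.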

How to work with PDF in PDF/A1-a format using PHP

I'd like to generate a PDF file strictly in the PDF/A1-a format (to integrate with a government service).
This service does not support PDF/A1-b or PDF/A1-x format. Only PDF/A1-a.
Previously, I used mPDF (https://mpdf.github.io/) in my work, but this library supports PDF/A1-b at most.
A Google search turned up nothing, but I doubt I am the first person to need this format. Please tell me if there is something convenient for working with this "rare" format (PDF/A1-a).
Regards, Alexey.

REST API method that accepts multiple file uploads and additional arguments

I'm attempting to create a REST API method that accepts multiple file uploads with some additional arguments. This API method will be called from web forms, web services, and mobile apps.
Is there a standard I should be following with regards to how the method takes these parameters in?
So far, I've considered the following two approaches:
JSON body: file data to be included as base64 encoded fields within the JSON object. Fine if being called from other web services, but troublesome when calling from a HTML form?
multipart/form-data: easy to use with HTML forms, but problematic when calling from web services or mobile apps?
I know that either of the two approaches would work, but I'd like to implement this the correct way (if there is one) according to current standards. Any ideas?
Do modern JS libraries/frameworks make it easy to POST HTML forms to web APIs as JSON objects?
Yes, there are plenty of libraries for converting a file to base64.
In my opinion, choose based on your requirements. First, exchanging data in multipart format should be more efficient than a base64 JSON string, although this article shows that the difference in size is small.
But if you use JSON, you can pass multiple other variables in the same JSON object and read them easily.
Besides, if your file is an image, browsers understand data URIs (base64-encoded images), so there is no need to transform them if the client is a browser.
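To put a number on the efficiency point, here is a small sketch that measures the raw size cost of base64 (before any transport compression, which is outside what this shows). The function name is just illustrative.

```python
import base64
import os


def base64_overhead(raw: bytes) -> float:
    """Ratio of base64-encoded size to raw size."""
    return len(base64.b64encode(raw)) / len(raw)


payload = os.urandom(3000)  # stand-in for a file's bytes
print(len(payload), len(base64.b64encode(payload)))
print(round(base64_overhead(payload), 3))
```

Base64 emits 4 characters for every 3 input bytes, so the encoded field is about one third larger than the file itself, whereas a multipart part carries the bytes as they are.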

OCR PDF Files Using Google Cloud Vision?

Are there currently any services or software tools that use Google Cloud Vision as backend for OCRing scanned PDF files?
If not, how would one be able to use Google Cloud Vision to turn PDFs into OCRed PDFs? As far as I know, Cloud Vision currently supports PDF files, but it will output recognized text only as a JSON file. So it seems one would need to do the additional step of placing this converted text on top of the image inside the PDF outside of Google Cloud Vision, in a separate step.
Background:
I often have to convert scanned-document PDF files into PDF files containing an OCRed text layer. So far, I've been using software like OCRKit or ABBYY FineReader. I tested the accuracy of these solutions against the text recognition abilities of Google Cloud Vision, and the latter came out far ahead.
As others have mentioned, you need to use third party tools to do this.
First, convert the Google Cloud Vision response JSON to an hOCR file using gcv2hocr:
gcv2hocr test.jpg.json output.hocr
Then use hocr-tools to stitch the hOCR data onto the PDF file. The command below looks in the 'imgdir' folder and merges .hocr and .jpg files with the same name into pages in out.pdf.
hocr-pdf --savefile out.pdf <imgdir>
As you rightly mentioned, the responses retrieved from the Vision API are available only in JSON format; therefore, your solution needs an additional step that uses third-party libraries to create a PDF file from the response's content.
In case this feature doesn't cover your current needs, you can use the Send Feedback button, located at the lower left and upper right corners of the service's public documentation, or take a look at the Issue Tracker tool to raise a Vision API feature request and notify Google about this desired functionality.
Solution for starting with a PDF and using Vision's document text detection:
gcv2hocr works only for a very specific Vision JSON format, not the output of document text detection, so I refactored that code to produce correct hOCR.
The second issue is that hocr-pdf starts from a JPEG image, not a PDF, so I refactored hocr-pdf and included pdf2image.
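import io
from pdf2image import convert_from_path
from reportlab.lib.pagesizes import letter
from reportlab.lib.utils import ImageReader
from reportlab.pdfgen.canvas import Canvas

# Rasterize the first PDF page (pdf2image renders at 200 DPI by default).
image = convert_from_path(imageFilePath)[0]
image_obj = io.BytesIO()
image.save(image_obj, format='jpeg')
image_obj.seek(0)
image_obj = ImageReader(image_obj)
can = Canvas(savefile, pagesize=letter)
# Convert pixel dimensions at 200 DPI to PDF points (72 per inch).
width, height = image.size
width *= (72 / 200)
height *= (72 / 200)
can.setPageSize((width, height))
can.drawImage(image_obj, 0, 0, width=width, height=height)
# load_invisible_font() and add_text_layer() come from the refactored hocr-pdf script.
load_invisible_font()
add_text_layer(can, height)
can.showPage()
can.save()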
This is better than a PyPDF2 solution because converting each page to an image first discards any existing hidden text layer.
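The refactored gcv2hocr code is not shown above, so the following is only a sketch of what the JSON-to-hOCR step could look like for a single word. It is not the author's code. The field names (`symbols`, `boundingBox`, `vertices`, `normalizedVertices`) are how I recall the document text detection response being laid out, the function name is hypothetical, and both coordinate styles are handled because I am not certain which one a given response uses.

```python
from html import escape


def word_to_hocr(word, page_width, page_height, word_id):
    """Render one Vision 'word' dict as an hOCR ocrx_word span."""
    text = "".join(symbol["text"] for symbol in word["symbols"])
    box = word["boundingBox"]
    if "normalizedVertices" in box:
        # Normalized coordinates (0..1) are scaled to page pixels.
        points = [(v.get("x", 0) * page_width, v.get("y", 0) * page_height)
                  for v in box["normalizedVertices"]]
    else:
        # Absolute coordinates; zero values may be omitted from the JSON.
        points = [(v.get("x", 0), v.get("y", 0)) for v in box["vertices"]]
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    # hOCR expects an axis-aligned box: left, top, right, bottom.
    bbox = f"bbox {round(min(xs))} {round(min(ys))} {round(max(xs))} {round(max(ys))}"
    return f"<span class='ocrx_word' id='{word_id}' title='{bbox}'>{escape(text)}</span>"
```

A full converter would walk pages, blocks, paragraphs and words and wrap these spans in the corresponding ocr_page, ocr_carea, ocr_par and ocr_line elements.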

Loading dynamically generated KML into google maps api

I have a bit of an issue with loading a dynamically generated KML into the Google Maps API.
The KML file is generated by Oracle and its URL has the format
http://server/oracleservioce.method?parm1=100&parm2=100
If I try to load that URL (encoded or decoded), I always get a KmlLayerStatus of INVALID_DOCUMENT.
If I save the resulting output to a local file with a .kml extension it works fine; otherwise I get errors.
I even tried renaming the file to .xml and .dat (arbitrary extensions) and they all fail. It seems that the Google API needs the file to have a .kml extension, which will not work in a dynamic environment. Can anybody suggest a way forward?
Thanks,
PS: I need to use the Google Maps API; I cannot use OpenLayers or any other solution. The file needs to be loaded into a google.maps.KmlLayer object.
I did this regardless of the extension, but you have to set the MIME type on the HTTP response: https://developers.google.com/kml/documentation/kml_tut#kml_server
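As an illustration of setting the MIME type, here is a minimal sketch using Python's built-in HTTP server. The question's KML comes from Oracle, so `generate_kml` is a hypothetical stand-in for that generator, and the handler name and port are placeholders. The part that matters is the Content-Type header, which uses the registered KML media type.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Registered media type for KML documents.
KML_MIME_TYPE = "application/vnd.google-earth.kml+xml"


def generate_kml(name="Example"):
    """Hypothetical stand-in for the dynamically generated KML."""
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<kml xmlns="http://www.opengis.net/kml/2.2">\n'
        f"  <Document><name>{name}</name></Document>\n"
        "</kml>\n"
    ).encode("utf-8")


class KmlHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = generate_kml()
        self.send_response(200)
        # The URL has no .kml extension, so the header identifies the format.
        self.send_header("Content-Type", KML_MIME_TYPE)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Run the sketch server until interrupted."""
    HTTPServer(("", port), KmlHandler).serve_forever()
```

Whatever generates the response on the Oracle side would need to emit the same header for the dynamic URL.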