Azure Computer Vision API - OCR to Text on PDF files


I'm trying to use the Computer Vision API to OCR a PDF file. It is a scanned document, so the PDF contains only page images.
When I submit it, the API returns "InvalidImageFormat": "Input data is not a valid image". When I test with a PNG, it works perfectly.
Is there any way to use the API against an image PDF, or is there an Azure API that I could use in conjunction to go PDF > PNG > Text?

Edit
Since I wrote this answer, additional services have become available. I have not personally tried all of them, but they may suit this purpose:
https://learn.microsoft.com/en-us/azure/search/cognitive-search-concept-intro
And, at some point in the future when it reaches GA:
https://aws.amazon.com/textract/
Original Answer
Unfortunately, Azure has no PDF integration for its Computer Vision API. To use Azure Computer Vision, you need to convert the PDF to images (JPG, PNG, BMP, GIF) yourself.
Google now offers PDF integration, and I have seen really good results from it in my testing so far.
This is done through the asyncBatchAnnotateFiles method of the Vision client (I have been using the Node.js variant of the API).
It can handle files of up to 2000 pages. Results are divided into 20-page segments and written to Google Cloud Storage.
https://cloud.google.com/vision/docs/pdf
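As a rough sketch of what such a request looks like, the snippet below builds the JSON body for the REST form of this call (files:asyncBatchAnnotate), as I understand it from the page linked above. The helper name build_pdf_ocr_request and the gs:// URIs are placeholders invented for illustration; the code only constructs the payload and does not contact the service.

```python
import json


def build_pdf_ocr_request(source_uri, destination_uri, batch_size=20):
    """Build a request body for Vision's files:asyncBatchAnnotate REST call.

    source_uri and destination_uri are Cloud Storage paths (hypothetical
    here); batch_size is the number of pages written per output JSON file.
    """
    return {
        "requests": [
            {
                "inputConfig": {
                    "gcsSource": {"uri": source_uri},
                    "mimeType": "application/pdf",
                },
                "features": [{"type": "DOCUMENT_TEXT_DETECTION"}],
                "outputConfig": {
                    "gcsDestination": {"uri": destination_uri},
                    "batchSize": batch_size,
                },
            }
        ]
    }


if __name__ == "__main__":
    # Placeholder bucket names; replace with your own.
    body = build_pdf_ocr_request(
        "gs://my-bucket/scan.pdf", "gs://my-bucket/ocr-output/"
    )
    print(json.dumps(body, indent=2))
```

The batchSize value controls how many pages land in each output file, which is where the 20-page segments mentioned above come from.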

The latest OCR service from Microsoft Azure is called Recognize Text, which significantly outperforms the previous OCR engine. Recognize Text can now be used through the Read operation, which reads and digitizes PDF documents of up to 200 pages.

There is a new Cognitive Services API called Azure Form Recognizer (in preview as of November 2019) that should do the job:
https://azure.microsoft.com/en-gb/services/cognitive-services/form-recognizer/
It can process the file formats you asked about:
Format must be JPG, PNG, or PDF (text or scanned). Text-embedded PDFs
are best because there's no possibility of error in character
extraction and location.
https://learn.microsoft.com/en-us/azure/cognitive-services/form-recognizer/overview
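Since the accepted formats differ between services, it can help to check what a file really is before uploading it. The following is a minimal sketch that inspects the leading "magic" bytes of a file; detect_format is a hypothetical helper, not part of any Azure SDK.

```python
def detect_format(data: bytes) -> str:
    """Return 'pdf', 'png', 'jpg', or 'unknown' based on the leading bytes."""
    if data.startswith(b"%PDF-"):
        return "pdf"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if data.startswith(b"\xff\xd8\xff"):
        return "jpg"
    return "unknown"


if __name__ == "__main__":
    # The first few bytes are enough; no need to read the whole file.
    print(detect_format(b"%PDF-1.7\n..."))
```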
Here is the link to the official Form Recognizer API documentation:
https://westus2.dev.cognitive.microsoft.com/docs/services/form-recognizer-api/operations/AnalyzeWithCustomModel
Note:
Form Recognizer is currently available in English, with additional language availability growing (as of 4.12.2019).
Form Recognizer is available in the following Azure regions (as of 4.12.2019):
Canada Central, North Europe, West Europe, UK South, Central US, East US, East US 2, South Central US, West US
https://azure.microsoft.com/en-in/global-infrastructure/services/?products=cognitive-services

Sorry, but you have to break the PDF pages into images (JPG or PNG) and then send the images to Computer Vision. Splitting the document also means you don't have to OCR every page, only the ones that matter.

There is a new Read API that works with PDFs:
https://learn.microsoft.com/en-us/azure/cognitive-services/computer-vision/concept-recognizing-text
Computer Vision’s Read API is Microsoft’s latest OCR technology that extracts
printed text (seven languages), handwritten text (English only), digits, and
currency symbols from images and multi-page PDF documents.
Read API reference: https://westcentralus.dev.cognitive.microsoft.com/docs/services/computer-vision-v3-ga/operations/5d986960601faab4bf452005
It works well enough, but does not support many languages yet.
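The Read operation is asynchronous: you submit the document, receive an operation URL, and poll it until the result is ready. Below is a minimal sketch using only the standard library. The path /vision/v3.0/read/analyze, the Operation-Location header, and the analyzeResult.readResults layout reflect my understanding of the v3.0 API and should be checked against the reference linked above; the function names are made up for this example.

```python
import json
import time
import urllib.request


def submit_read(endpoint, key, pdf_bytes):
    """POST the document and return the URL to poll (Operation-Location)."""
    req = urllib.request.Request(
        endpoint.rstrip("/") + "/vision/v3.0/read/analyze",  # assumed path
        data=pdf_bytes,
        headers={
            "Ocp-Apim-Subscription-Key": key,
            "Content-Type": "application/octet-stream",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.headers["Operation-Location"]


def poll_result(operation_url, key, delay=1.0, max_tries=60):
    """GET the operation URL until the status is 'succeeded' or 'failed'."""
    req = urllib.request.Request(
        operation_url, headers={"Ocp-Apim-Subscription-Key": key}
    )
    for _ in range(max_tries):
        with urllib.request.urlopen(req) as resp:
            result = json.load(resp)
        if result.get("status") in ("succeeded", "failed"):
            return result
        time.sleep(delay)
    raise TimeoutError("Read operation did not finish in time")


def extract_lines(result):
    """Flatten a finished result into a list of (page number, text) tuples."""
    pages = result.get("analyzeResult", {}).get("readResults", [])
    return [
        (page.get("page"), line["text"])
        for page in pages
        for line in page.get("lines", [])
    ]
```

Typical use would be extract_lines(poll_result(submit_read(endpoint, key, data), key)), where endpoint and key come from your own Computer Vision resource.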

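You can convert each page of the PDF to an image using fitz (PyMuPDF).
# import packages
import fitz  # PyMuPDF
import numpy as np
import cv2
# set path to pdf (placeholder)
path2doc = "<path to pdf>"
# open pdf with fitz
doc = fitz.open(path2doc)
# determine number of pages
# (newer PyMuPDF releases rename these calls to snake_case:
#  doc.page_count, page.get_pixmap(), pixmap.tobytes())
pagecount = doc.pageCount
# loop over all pages and convert each one to an image (here JPEG)
images = []
for i in range(pagecount):
    page = doc[i]
    # render the page and encode it as JPEG bytes
    pix = page.getPixmap().getImageData(output='JPEG')
    # decode the JPEG bytes into an OpenCV image array
    jpg_as_np = np.frombuffer(pix, dtype=np.uint8)
    image = cv2.imdecode(jpg_as_np, flags=1)
    images.append(image)
Once this is done, you can send the images to the API.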
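For the sending step, note that the encoded JPEG bytes for each page (the pix variable in the loop above) can be posted directly, without decoding them through OpenCV first. Here is a hedged sketch that builds such a request with the standard library. The /vision/v3.0/ocr path and its query parameters are my assumption about the synchronous OCR endpoint, and build_ocr_request is a hypothetical helper; verify both against the API reference for your resource.

```python
import urllib.parse
import urllib.request


def build_ocr_request(endpoint, key, jpeg_bytes):
    """Build (but do not send) a POST request carrying one page image."""
    query = urllib.parse.urlencode(
        {"language": "unk", "detectOrientation": "true"}
    )
    url = endpoint.rstrip("/") + "/vision/v3.0/ocr?" + query  # assumed path
    return urllib.request.Request(
        url,
        data=jpeg_bytes,
        headers={
            "Ocp-Apim-Subscription-Key": key,
            "Content-Type": "application/octet-stream",
        },
        method="POST",
    )


# Sending is then one call per page, for example:
#   with urllib.request.urlopen(build_ocr_request(endpoint, key, pix)) as r:
#       result = r.read()
```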

Related

API to clean scanned documents

I have a system in PHP that receives photos from users: scanned photos of their personal documents. I want to automatically crop away the borders of each photo so that ONLY the personal document (such as a driver's license) remains.
I spent hours looking but was not able to find an API service that does that. All I could find were SDKs for iOS or Android.
Does anyone have a suggestion?
I would like to replicate what Dropbox did:
https://blogs.dropbox.com/tech/2016/08/fast-and-accurate-document-detection-for-scanning/
Thanks

Filestack provides these services. In particular, the Document Detection endpoint has the following description:
You can detect your document in the image, transform it to fully fit the image...

ArcGIS: load KML files up to 80 megabytes total (10 megabytes per file)

Hi, I am using the ArcGIS API for JavaScript version 4.9, and I am trying to load up to 8 KML files (each about 10 megabytes).
The KML files load successfully, but interaction with the map (pan and zoom) is very slow and not smooth.
I have several questions regarding this issue:
Can Esri load that many KML files? If not, are there any alternatives?
Why can I do this smoothly in OpenLayers, while it is more problematic in pure ArcGIS?
Can I upload raw KML data rather than loading it from a hosted URL?
I would appreciate any kind of help, thanks in advance!

KML is a vector format. One option would be to generalize the geometry of the features. You could also decrease the precision of the coordinates. For instance, a coordinate in meters with 5-6 digits after the decimal point is not necessary for visualization in a web map. You can round it to the meter.
Finally, if by raw data you mean parsing and loading a KML from a string, that's not possible with the ArcGIS API. The kml/kmz must be a separate file accessible on the internet.
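A minimal sketch of the precision reduction, assuming the files use plain <coordinates> elements. KML stores longitude and latitude in decimal degrees, where five decimal places correspond to roughly one meter, so the default below is 5. The function names are invented for this example, and the regex approach is a shortcut rather than a full XML parse.

```python
import re


def round_coordinate_string(coords: str, ndigits: int = 5) -> str:
    """Round every number in a KML coordinates string to ndigits decimals."""

    def fmt(value: str) -> str:
        text = f"{float(value):.{ndigits}f}"
        # Drop trailing zeros (and a dangling decimal point) to save bytes.
        return text.rstrip("0").rstrip(".") if "." in text else text

    tuples = coords.split()
    return " ".join(",".join(fmt(v) for v in t.split(",")) for t in tuples)


def round_kml(kml_text: str, ndigits: int = 5) -> str:
    """Apply the rounding to every <coordinates> element in a KML document."""
    pattern = re.compile(r"(<coordinates>)(.*?)(</coordinates>)", re.DOTALL)
    return pattern.sub(
        lambda m: m.group(1)
        + round_coordinate_string(m.group(2), ndigits)
        + m.group(3),
        kml_text,
    )
```

Rounding shrinks the file and the parsing work, but it does not reduce the number of vertices; generalizing the geometry is what cuts the drawing cost.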
The KMLLayer uses a utility service from ArcGIS.com, therefore your
kml/kmz files must be publicly accessible on the internet. If the
kml/kmz files are behind a firewall you must set the
esriConfig.kmlServiceUrl to your own utility service (requires ArcGIS
Enterprise).
Source: the KMLLayer documentation

Streaming large PDFs from SharePoint

I have a client that wants to store large PDFs (>700MB) on SharePoint 2013. The problem is that viewing a PDF currently requires the entire file to be downloaded before the first page is displayed. I need the browser to display each page of the PDF as it downloads, a feature I believe Adobe calls "Fast Web View" or "Byte Streaming". Here is what I know:
"Fast Web View" is enabled on the PDF document in the Document Properties window.
I can verify that the PDF is "Linearized" by reading the ASCII content.
I have checked the PDF reading options from the PDF Accessibility.
The client has SharePoint 2013 on premise installed.
SharePoint's File Handling is set to permissive.
I have verified that PDF is in the AllowedInlineDownloadedMimeTypes of the Web Application.
Anything else I should check or configure?

It is not enough for the PDF files to be linearized (the technical term in PDF parlance) or optimized for fast web view (the marketing term for that feature).
Two conditions must be met before fast web view works for the end user:
The PDF viewer needs to be able to make use of the linearized/optimized PDF file features.
The remote host serving the PDF (in this case SharePoint) needs to be properly configured to honor 'byte range requests' from the viewer, so that chunks of the PDF file can be delivered "out of order".
However,...
...I do not know if SharePoint servers in general do support the second requirement;
...if SharePoint is not the problem, you may want to check which PDF viewer is actually in use in that environment (test it with Adobe Reader -- that one takes advantage of linearized PDF features for sure).
See also this answer to a question from today, which gives a few more technical details:
How are PDF files able to be partially displayed while downloading?

A co-worker identified the problem after comparing the download from SharePoint to that of a working site using Wireshark. The SharePoint site didn't include "Byte ranging" in the response headers. To enable that feature in SharePoint, you have to enable BlobCache. Beware: BlobCache is not supported in SharePoint Foundation.
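The same check can be done without a packet capture by requesting a small byte range and looking at the response. This is a sketch under simple assumptions: the URL is reachable without authentication (a real SharePoint site usually needs credentials, which are omitted here), and the function names are hypothetical.

```python
import urllib.request


def honors_byte_ranges(status: int, headers: dict) -> bool:
    """Decide from a response whether the server served a partial range.

    A server that honors a Range request replies 206 Partial Content with a
    Content-Range header; a plain 200 means it sent the whole file instead.
    """
    lowered = {name.lower(): value for name, value in headers.items()}
    return status == 206 and "content-range" in lowered


def probe(url: str) -> bool:
    """Request the first 1 KiB of url and report whether ranges are honored."""
    req = urllib.request.Request(url, headers={"Range": "bytes=0-1023"})
    with urllib.request.urlopen(req) as resp:
        return honors_byte_ranges(resp.status, dict(resp.headers.items()))
```

If probe returns False for a linearized PDF, the viewer has no way to fetch pages out of order, whatever the file's settings are.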

PDF viewer without download option

I have some PDF files which I'd like to upload to my association's site, but I'd like visitors not to be able to download them, as they may contain some slightly sensitive information.
OK, they could still Ctrl+C, but not having the files locally would reduce the spread of the information.
PHP/JS, whatever.
Thanks

Quoted answer from a similar question, if you're using the Adobe PDF viewer:
You can NOT prevent users from saving ANY TYPE of document from the web - PDF, HTML, JPEG, etc. It's a "feature" of the web.
What you CAN DO is prevent users from being able to use the PDF once it hits their own disk. To do this, you use powerful Digital Rights Management solutions...:
https://forums.adobe.com/message/5158866

Embedding PDF documents into websites

I need to embed some PDF documents into a website. The last time I did this, I used a jQuery lightbox to popup an iFrame with the PDF document as the URL. The client's PDF viewer would then take care of the rest.
Apparently, though, that was a bit buggy on some other people's browsers. I guess it was due to the large PDF file sizes and the effort it took for their computers to fire up Adobe.
So I'm after ideas on how to go about this. How do you guys embed your PDFs into websites? Or do you just stick to adding a download link?

I often use Scribd to solve this issue.
You have to upload your document (PDF, DOC or something else) to your Scribd account, and the service lets you view the document in a Flash environment (perfectly embeddable with a lightbox).
This solution relies on a third-party service (Scribd) for your documents, but with their API it's possible to include all Scribd functionality in your own website.

We have used Docuter.
They let you embed and track.

I've used Google Docs in Flash: http://trajctrl.tyblu.ca/?page_id=2
It's a bit buggy, but I find it works if you wiggle the image a bit, i.e. zoom, click, etc. A download link is nearby just in case, too. I'm not exactly sure how it was done, as it's a WordPress plugin (Google Doc Embedder), but I imagine Google has an API somewhere.
