Using Google Cloud Document AI Processors for PDF analysis and document generation

Is it plausible to train a Document AI processor to analyze a PDF file containing instructions for a document outline and content (such as a government Request for Proposals), and output a new text document with an outline and draft content based on the input document?

Kolban mentioned in the comments that it is not possible to generate semantic output with Google Document AI, which extracts structured data from dark or unstructured data. Document AI is not for generating output; it is for analyzing and extracting/parsing data.
Regarding your other question in the comments, it may be possible with a custom Document AI processor, where you can build models suited to your document types. You can train custom models from scratch and evaluate them on your own data.
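If it helps to see the extraction side in code, here is a minimal sketch using the Python client library for Document AI (google-cloud-documentai); the project ID, location, processor ID, and file name are placeholder assumptions to replace with your own values.

```python
# Minimal sketch: send a PDF to an existing Document AI processor and read
# back the parsed text and entities. All IDs and the file name are placeholders.
from google.cloud import documentai

project_id = "your-project-id"      # placeholder: your GCP project
location = "us"                     # placeholder: processor region; for "eu" pass
                                    # client_options={"api_endpoint": "eu-documentai.googleapis.com"}
processor_id = "your-processor-id"  # placeholder: an existing processor

client = documentai.DocumentProcessorServiceClient()
name = client.processor_path(project_id, location, processor_id)

with open("rfp.pdf", "rb") as f:
    raw_document = documentai.RawDocument(content=f.read(), mime_type="application/pdf")

result = client.process_document(
    request=documentai.ProcessRequest(name=name, raw_document=raw_document)
)

document = result.document
print(document.text[:500])            # the OCR/extracted text
for entity in document.entities:      # structured fields, if the processor emits any
    print(entity.type_, entity.mention_text, entity.confidence)
```

The output is always this parsed Document object; turning it into a new outline or draft text would be a separate generation step outside Document AI.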

Related

Google Document AI Labeling Task

Hi everyone, I am fairly new to the Google Cloud console. I am trying to customize a Google Document AI model that will learn to extract different sections of a document into various fields. As you can see in the image, training the model fails, and I have been running the Labeling Task for several days now without seeing any progress. Can you please advise on the right way to customize a Google Document AI model?
I tried manually labeling the different sections of the document. It took a while, so I only produced around 20 test and training documents, which I suspect is why the model did not train. I then started the Labeling Task as an alternative to manually labeling the dataset.
Here is the information about Labeling Tasks for Document AI
https://cloud.google.com/document-ai/docs/workbench/label-documents#labeling-tasks
Labeling Tasks use Human-in-the-Loop to have human labelers label documents for training data or production review. You can either set up your own labeling team or apply for access to the Google-Managed Workforce.
However, it doesn't seem like this is the correct course of action for what you are trying to do, since the labeling has already been completed.
Could you provide more clarification on what you are trying to accomplish with the Custom Document Extractor?
Note: Some of the error messages that are output from Document AI Workbench are not very descriptive (e.g. Internal Error Occurred) but the product development team is working to surface more helpful errors when possible.

Extract inconsistently formatted PDF data into Excel

I need to build an AI model to extract data from PDFs into Excel. I can use Power Automate AI Builder, but the issue is that the PDFs are all in different formats. There are more than 5k PDFs.
I have tried Power Automate AI Builder but failed to extract the required data.
Could you please suggest the best way to build something automated for long-term use?
Many thanks!
Did you try creating a custom document processing model with unstructured documents to see if you get better results?
You can also post your message to the Power Automate Community:
https://powerusers.microsoft.com/t5/AI-Builder/bd-p/AIBuilder
Hope it helps!

quickly inspect OCR text layer on PDF file

Is there any program that will allow me to superimpose the text (OCR) layer of a PDF on top of the PDF rendering?
I want to quickly see if the text layer has errors or not.
It would be more convenient if that could be done with a program; if not, a CLI command or script would also work.
Superimpose? That implies you'd like to add text, while I believe you'd like access to the text for detection and possibly further analysis of the OCRed text quality. Perhaps this needs further clarification.
Our developers worked for some time on algorithms to detect the presence of text in PDFs and then evaluate its quality. There are many cases that can trick a basic algorithm: a Bates number or imprinter stamp added to an image-only PDF makes it seem like the PDF has high-quality text when it has no actual text. Some copiers produce "searchable PDFs" using very low-quality OCR that contains many errors, but not necessarily on the first page, which is typically some kind of title page with large fonts, so the first line of text encountered by an algorithm seems high quality. Or the first page may have text while other pages do not, yet the algorithm may believe the whole PDF has text.
In our commercial high-volume, server-based OCR software (used by service bureaus, SaaS platforms, libraries, backlog conversions, etc.) we now have advanced detection of PDFs with an existing text layer and "smart decisions" that can filter out many of these false-positive situations, so our OCR can skip re-OCRing PDFs that already contain high-quality text. If you are looking for an inexpensive, high-quality OCR platform, such detection is a feature of it, but it cannot be used separately from our OCR; the OCR workflow is part of that filter. Our developers wrote and integrated these algorithms without external tools.
I am with www.wisetrend.com where we provide software solutions and consulting for various OCR projects.
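If all you need is a quick visual check, one possible approach (a sketch, assuming pdfplumber is acceptable and that "scanned.pdf" stands in for your file) is to rasterize each page and draw the bounding boxes of the words found in the existing text layer on top of it; pages with missing, sparse, or misplaced boxes stand out immediately.

```python
# Sketch: render each page and overlay the word boxes from the embedded
# text layer, so gaps or misaligned OCR are visible at a glance.
# Requires: pip install pdfplumber; "scanned.pdf" is a placeholder name.
import pdfplumber

with pdfplumber.open("scanned.pdf") as pdf:
    for i, page in enumerate(pdf.pages, start=1):
        words = page.extract_words()           # words from the existing text layer
        print(f"page {i}: {len(words)} words in the text layer")
        im = page.to_image(resolution=150)     # rasterized page
        im.draw_rects(words)                   # one rectangle per word
        im.save(f"page-{i:03d}-overlay.png")
```

Comparing the printed per-page word counts also catches the case mentioned above where only the first page has text.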

PDF data extraction

Is there a way for me to take a scanned PDF image and extract data from the image by highlighting the fields that are needed? We scan thousands of PDF images of real estate deeds daily and would like to be able to automate the data entry process. The problem that we are facing is that no two deeds are the same.
It has been said in the comments that Stack Overflow is mainly about programming issues.
Nevertheless, there are possibilities, depending on the actual documents, and the volumes to be processed.
On the high end, there is a product called Teleform, originally developed by Cardiff, and now owned by HP, which is used to process paper forms; you may also look at the Business Process application Cardiff LiquidOffice, now HP LiquidOffice.
On the low end, I have developed an application in PDF, running under Acrobat, which can take a scanned and OCRed form and transfer the data to a specially prepared fillable form, from where the data can be exported to a database, for example. For more information, a demo, and a quote, feel free to contact me in private.
If you want to develop something using Acrobat, you could also begin with an OCRed document, use the capabilities of the Redaction function (or the industrial-strength redaction tool Redax by Appligent) to find keywords, and then use the positional information of those keywords to extract more data.
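Outside Acrobat, that keyword-and-position idea can also be scripted; below is a rough sketch (Python with pdfplumber, assuming the deeds already carry an OCR text layer; the label "Grantor" and the 300-point window are illustrative placeholders) that finds a label word and reads the text immediately to its right.

```python
# Sketch of keyword-anchored extraction: locate a label in the text layer,
# then read whatever sits to its right in the same vertical band.
# Requires: pip install pdfplumber; label and offsets are placeholders.
import pdfplumber

def value_right_of(page, label, window=300):
    for w in page.extract_words():
        if w["text"].rstrip(":").lower() == label.lower():
            box = (
                w["x1"],                            # start just right of the label
                max(w["top"] - 2, 0),
                min(w["x1"] + window, page.width),  # look up to `window` points right
                min(w["bottom"] + 2, page.height),
            )
            return page.crop(box).extract_text()
    return None

with pdfplumber.open("deed.pdf") as pdf:
    for page in pdf.pages:
        value = value_right_of(page, "Grantor")
        if value:
            print("Grantor:", value.strip())
```

Because no two deeds are the same, keying on labels rather than fixed coordinates is the point; it is the same idea as the Redaction-based keyword search described above.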

import/embed xml ocr/text info from one pdf to a different pdf

I am trying to optimize the quality/file size of an image-scanned PDF while retaining OCR quality.
I could downsample the high-quality PDF after OCR, but the tools I'm using (primarily Acrobat) do not produce file sizes as small as exporting lower-DPI/optimized pages from Photoshop and using those pages to create a PDF.
A better solution, if possible, would be to take an image PDF (800 MB in the current case) that has been OCRed and apply the OCR layer to a lower-resolution, downsampled document.
I can successfully extract the OCR info with coordinates as XML with pdfminer, but I would like to apply it to the same file after it has been downsampled in Photoshop. I thought I read this was possible with pdftk, but I can no longer find that information.
Any suggestions would be greatly appreciated.
jack
Can you describe the current way you create your PDFs?
With iText it's possible to set the compression level of added images.
It may be useful.
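If a Python route is acceptable, one possible sketch of re-applying an extracted text layer is to draw the recognized words as invisible text (text render mode 3) at their original coordinates on an overlay and merge it with the downsampled PDF using reportlab and pypdf; the `words` list below is a placeholder for whatever you parse out of the pdfminer XML, and the coordinates would need rescaling if the downsampled pages have different dimensions.

```python
# Sketch: stamp an invisible text layer onto a downsampled, image-only PDF.
# `words` is a placeholder for the word/coordinate data parsed from the
# pdfminer XML (per page, in practice); coordinates are PDF points,
# origin at the bottom-left. Requires: pip install pypdf reportlab
import io
from pypdf import PdfReader, PdfWriter
from reportlab.pdfgen import canvas

words = [  # (text, x, y) placeholders taken from your OCR XML
    ("Hello", 72, 700),
    ("world", 120, 700),
]

base = PdfReader("downsampled.pdf")   # the Photoshop-downsampled file
writer = PdfWriter()

for page in base.pages:
    w, h = float(page.mediabox.width), float(page.mediabox.height)

    # Build a one-page overlay containing only invisible text.
    buf = io.BytesIO()
    c = canvas.Canvas(buf, pagesize=(w, h))
    text = c.beginText()
    text.setTextRenderMode(3)  # 3 = neither fill nor stroke, i.e. invisible
    for word, x, y in words:   # a real version would also match font size per word
        text.setTextOrigin(x, y)
        text.textOut(word)
    c.drawText(text)
    c.save()

    overlay = PdfReader(buf).pages[0]
    page.merge_page(overlay)   # searchable text sits on top of the page image
    writer.add_page(page)

with open("downsampled-searchable.pdf", "wb") as f:
    writer.write(f)
```

This is essentially how "searchable PDFs" are built, so the result stays selectable and searchable while the page images come from the smaller, downsampled file.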