PDF to HTML and OCR solution for information extraction - pdf

I'm looking for a PDF-to-HTML and OCR solution, either as a cloud service or as an SDK. My searches turned up a bunch of services on the internet; I tried some of them and got a rough idea of what they offer. I'd like to know whether any of you use such a service.
My biggest concern is automation: I need HTML output that I can feed into information extraction, ideally with structured data such as tables. (Most of the services produce HTML either in a per-character format (a CSS/HTML tag for each character) or in a per-paragraph format (CSS/HTML for each line).)
So far I have checked:
Abbyy Cloud SDK (They don't have a PDF-to-HTML service, but they do have PDF-to-XML, which may be convertible to HTML with XSLT. Their OCR service with text output is also quite good.)
cloudconvert.org (They produce the same results as the Ubuntu pdftohtml command, which is based on poppler-Xpdf3.0.)
pdftohtml command (tested on Ubuntu): I got a result full of <p> tags.
aspose.PDF (They don't have a PDF-to-HTML service in the cloud, but they have good integration with GDrive, Dropbox and Amazon S3.)
PDFNet from PDFTron: I got a result with a complex CSS and HTML structure, with almost one tag per character.
My question is whether you know of any other service worth trying that produces structured HTML output for data extraction.
Thanks in advance.

Related

How to work with PDF in PDF/A1-a format using PHP

I'd like to generate a PDF file strictly in the PDF/A1-a format (to integrate with a government service).
This service does not support PDF/A1-b or PDF/A1-x format. Only PDF/A1-a.
Previously, I used mPDF (https://mpdf.github.io/) in my work, but this library supports PDF/A1-b at most.
A Google search gave no results, but I doubt I am the first one to need this format. Please tell me if there is something convenient for working with this "rare" format (PDF/A1-a).
Regards, Alexey.

OCR PDF Files Using Google Cloud Vision?

Are there currently any services or software tools that use Google Cloud Vision as backend for OCRing scanned PDF files?
If not, how would one be able to use Google Cloud Vision to turn PDFs into OCRed PDFs? As far as I know, Cloud Vision currently supports PDF files, but it will output recognized text only as a JSON file. So it seems one would need to do the additional step of placing this converted text on top of the image inside the PDF outside of Google Cloud Vision, in a separate step.
Background:
I often have to convert scanned-document PDF files into PDF files containing an OCRed text layer. So far, I've been using software like OCRKit or ABBYY FineReader. I tested the accuracy of these solutions against the text recognition abilities of Google Cloud Vision, and the latter came out far ahead.
As others have mentioned, you need to use third party tools to do this.
First, convert the Google Cloud Vision response JSON to an hOCR file using gcv2hocr:
gcv2hocr test.jpg.json output.hocr
Then use hocr-tools to stitch the hOCR data onto the PDF file. The command below looks in the 'imgdir' folder and merges .hocr and .jpg files with the same name into pages of out.pdf.
hocr-pdf --savefile out.pdf <imgdir>
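The two commands above could be chained from a script. The following is only a minimal sketch: it assumes gcv2hocr and hocr-pdf are installed and on the PATH, and the function names and example paths are hypothetical. Since hocr-pdf pairs .hocr and .jpg files by name, the hOCR file would need to be written into the image folder.

```python
import subprocess


def build_ocr_commands(json_path, hocr_path, img_dir, out_pdf):
    """Return the two command lines from the answer as argument lists."""
    # Step 1: Vision response JSON -> hOCR file.
    to_hocr = ["gcv2hocr", json_path, hocr_path]
    # Step 2: merge same-named .hocr/.jpg files in img_dir into one PDF.
    to_pdf = ["hocr-pdf", "--savefile", out_pdf, img_dir]
    return [to_hocr, to_pdf]


def run_ocr_pipeline(json_path, hocr_path, img_dir, out_pdf):
    """Run both steps in order, stopping at the first failure."""
    for cmd in build_ocr_commands(json_path, hocr_path, img_dir, out_pdf):
        subprocess.run(cmd, check=True)
```

For example, run_ocr_pipeline("imgdir/test.jpg.json", "imgdir/test.hocr", "imgdir", "out.pdf") would run both tools in sequence.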
As you rightly mentioned, the responses returned by the Vision API are available only in JSON format; therefore, your solution needs an additional step that uses third-party libraries to create a PDF file from the response's content.
If this feature doesn't cover your current needs, you can use the Send Feedback button, located at the lower left and upper right corners of the service's public documentation, or take a look at the Issue Tracker tool to raise a Vision API feature request and notify Google about this desired functionality.
Solution for starting with a PDF and using Vision's document text detection:
gcv2hocr works only for a very specific Vision JSON format, not for the output of document text detection. I refactored that code to create the correct hOCR.
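The refactored converter itself is not shown, but its core step might look roughly like the sketch below. This is not the author's code: it assumes each word arrives with its text and a list of vertex dicts in pixel coordinates, and that a missing "x" or "y" key means 0. The function names are hypothetical.

```python
from html import escape


def vertices_to_bbox(vertices):
    """Collapse a polygon of {"x": .., "y": ..} dicts into (x0, y0, x1, y1)."""
    # Assumption: a coordinate that is absent from the dict is treated as 0.
    xs = [v.get("x", 0) for v in vertices]
    ys = [v.get("y", 0) for v in vertices]
    return min(xs), min(ys), max(xs), max(ys)


def word_to_hocr(word_id, text, vertices):
    """Format one recognized word as an hOCR ocrx_word span."""
    x0, y0, x1, y1 = vertices_to_bbox(vertices)
    return "<span class='ocrx_word' id='%s' title='bbox %d %d %d %d'>%s</span>" % (
        word_id, x0, y0, x1, y1, escape(text))
```

If the response carries normalized (0 to 1) coordinates instead of pixels, they would have to be scaled by the page size before this step.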
The second issue is that hocr-pdf starts from a JPEG image, not a PDF. I refactored hocr-pdf and included pdf2image.
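import io
from pdf2image import convert_from_path
from reportlab.lib.pagesizes import letter
from reportlab.lib.utils import ImageReader
from reportlab.pdfgen.canvas import Canvas

# Render the first PDF page to a PIL image (pdf2image defaults to 200 dpi).
image = convert_from_path(imageFilePath)[0]
# Re-encode the page as an in-memory JPEG that reportlab can draw.
image_obj = io.BytesIO()
image.save(image_obj, format='jpeg')
image_obj.seek(0)
image_obj = ImageReader(image_obj)
can = Canvas(savefile, pagesize=letter)
# Convert pixel dimensions at 200 dpi to PDF points (72 per inch).
width, height = image.size
width *= (72 / 200)
height *= (72 / 200)
can.setPageSize((width, height))
can.drawImage(image_obj, 0, 0, width=width, height=height)
# Helpers from the refactored hocr-pdf script (not shown here).
load_invisible_font()
add_text_layer(can, height)
can.showPage()
can.save()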
This is better than a PyPDF2 solution because converting the page to an image first removes any existing hidden text layers.
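The 72 / 200 factor in the code above converts pixels to PDF points: pdf2image renders at 200 dpi by default, and a PDF point is 1/72 inch. A small sketch of that arithmetic follows; the helper name is hypothetical and the dpi must match whatever was passed to convert_from_path.

```python
def pixels_to_points(width_px, height_px, dpi=200):
    """Convert rendered pixel dimensions to PDF points (72 points per inch)."""
    return width_px * 72 / dpi, height_px * 72 / dpi


# A US Letter page rendered at 200 dpi is 1700 x 2200 pixels.
print(pixels_to_points(1700, 2200))  # (612.0, 792.0)
```

If the page were rendered at a different resolution, say dpi=300, the same value would have to be used here, otherwise the text layer and the image would not line up.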

Windward Document Generation - PDF Copy Protected

In our application, we generate a few reports and documents through Windward. The documents are generated based on specific user conditions and the user is able to download the document.
As part of a new requirement, we would like to enable copy protection of the generated PDF -- basically, users would not be able to Copy the contents of the document.
Is there any way we can achieve this through Windward? Or do we have to integrate with external third-party software like LockLizard or Win2PDF?
We did think of converting the document to an image and recreating the PDF, but this is unacceptable because the document formatting came out wrong.
Appreciate any insights or alternate solutions.
Thanks,
Aravind
Windward does this. If you're using the Java engine use the following calls (javadoc):
ProcessPdfAPI.setOwnerPassword()
ProcessPdfAPI.setUserPassword()
For the .NET engine use the following calls (api docs):
ReportPdf.OwnerPassword
ReportPdf.Security
ReportPdf.UserPassword
Is this what you need?

Adobe API to convert Office documents to PDF

Is there any API available by Adobe that would enable me to convert Office Documents (docx, xlsx, pptx, etc.) files to a PDF file format?
I would prefer to use .NET to do so, but if I have to I can resort to C/C++.
I've already tried using Adobe SDK, but it seems to me it works to automate the Acrobat application instead of giving me access to underlying functionality. If it's possible and anyone would care to give me an example, I'd be very thankful - after many hours googling it I was unable to find a good answer (a lot of samples doing the contrary, though - converting from PDF to Word).
One last thing: I need it to be a library from Adobe. So PDFCreator, BCL EasyPDF, Aspose.Words/Cells/Slides etc., unfortunately, won't help me.
UPDATE 1:
I decided to ask this question in the forum because, first, I can't believe that Adobe wouldn't have a library to do this; Of course, it may be the case, but it's very strange.
UPDATE 2:
I also looked already into AdobePDFMakerX.Word interface. I tried calling the CreatePDF(string in, string out) interface, but to no avail. It always returns false, and there is no error description that I can use.
To convert a doc file to a PDF file, you can use the Adobe PDF Services API.
In short, it has two parts:
Make a POST request (providing the required parameters) and take x-request-id from the response headers.
Make a GET request (providing the required parameters); the response will be your PDF document.
It works fine.
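The two-step flow described above can be sketched without tying it to any concrete endpoint. Everything below is an assumption made for illustration: post and get are caller-supplied callables that wrap the real HTTP requests (the URLs, credentials and parameters are not given in the answer), and the polling loop assumes the PDF may not be ready on the first GET.

```python
import time


def convert_via_two_requests(post, get, document_bytes,
                             poll_seconds=0, max_polls=10):
    """POST the document, then GET the result using the returned request id.

    post(document_bytes) -> dict of response headers (hypothetical wrapper)
    get(request_id)      -> PDF bytes, or None if not ready (hypothetical wrapper)
    """
    headers = post(document_bytes)
    request_id = headers["x-request-id"]
    for _ in range(max_polls):
        pdf = get(request_id)
        if pdf is not None:
            return pdf
        time.sleep(poll_seconds)
    raise TimeoutError("conversion did not finish for request " + request_id)
```

Passing the HTTP calls in as functions keeps the sketch independent of the HTTP library and of the exact request parameters, which have to come from Adobe's documentation.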
Are you sure Aspose.Words didn't work for you? I tested the code sample below and it works fine.
string filePdf = @"D:\Projects\original.pdf";
string fileDocX = @"D:\Projects\New.docx";
Aspose.Words.Document doc = new Aspose.Words.Document(fileDocX);
doc.Save(filePdf, Aspose.Words.SaveFormat.Pdf);

What is the formatting of Solr Cell/Tika output? And how to fix it?

I am using Solr to index DOC, DOCX and PDF files. I enabled stored for the text field and checked what it contains. Here's the result from a sample DOC file:
, a mobile user interface (UI) software development company, based in Cambridge,
UK. After integrating the company, Qualcomm re-branded their interface
markup language and its accompanying integrated development
environment (IDE) as HYPERLINK
"http://en.wikipedia.org/w/index.php?title=UiOne&action=edit&redlink=1"
*\o "UiOne (page does not exist)" uiOne** . In March 2009, Qualcomm
informed their Cambridge engineering staff, mostly from the division
working on HYPERLINK "http://en.wikipedia.org
The DOC contains material from Wikipedia. I captured the full output at http://pastebin.com/8FL9eHJv
So Solr Cell/Tika inserts its own formatting, and the results of that formatting show up in the search output. How can I fix the problem so that the search results (text snippets) do not contain the formatting?
Googling around tells me that Tika has several output formats, so is that the approach? Or is there a plugin that can filter the text before rendering the results?
Relevant details: My configuration is close to stock:
My upload command is a python variation of
curl "http://localhost:8983/solr/update/extract?literal.id=doc-qualcomm&commit=true" -F "myfile=@11qualcomm.doc"
My schema.xml http://pastebin.com/VLz2uuDQ
My SolrConfig.xml http://pastebin.com/X2J2jj64
Are you asking about the extra hyperlink items in the search results? If yes, try updating the extract request handler in your solrconfig.xml to
<str name="captureAttr">false</str><str name="fmap.a">ignored_</str>
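Since the upload is done from Python, the same two settings could also be sent as request parameters instead of handler defaults. The sketch below only builds the URL; it assumes the handler does not lock these parameters as invariants and that the schema has a dynamic field matching ignored_*. The function name is hypothetical.

```python
from urllib.parse import urlencode


def build_extract_url(base_url, doc_id):
    """Build an /update/extract URL that drops link attributes from the text."""
    params = {
        "literal.id": doc_id,
        "commit": "true",
        "captureAttr": "false",  # do not index tag attributes such as href
        "fmap.a": "ignored_",    # map <a> element content to an ignored field
    }
    return base_url + "/update/extract?" + urlencode(params)


print(build_extract_url("http://localhost:8983/solr", "doc-qualcomm"))
```

The file itself would still be posted as multipart form data, as in the curl command from the question.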