How to work with PDF in PDF/A1-a format using PHP

I'd like to generate a PDF file strictly in the PDF/A1-a format (to integrate with a government service).
The service does not support PDF/A1-b or PDF/A1-x, only PDF/A1-a.
Previously, I used mPDF (https://mpdf.github.io/) in my work, but that library supports PDF/A1-b at most.
A Google search turned up nothing, but I doubt I am the first person to need this format. Is there a convenient library for working with this "rare" format (PDF/A1-a)?
Regards, Alexey.

Related

How do I use Google's Vision API to convert a PDF (non-searchable) to a searchable PDF?

From what I've seen, Google's Vision API lets you perform OCR on a PDF, but it returns only the detected text in a JSON format. What I need is a searchable (OCR'd) PDF file in return. Is this possible?
Notice that the OutputConfig type doesn't have any field to configure the resulting file's format. As you are already aware, the API returns a JSON response. You could either get the JSON data from the API first and then explore a third-party repository for JSON-to-PDF conversion, or skip the API altogether and run a specialized tool such as OCRmyPDF, which serves exactly this purpose, directly on your source PDF.
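For the OCRmyPDF route, a minimal sketch of driving the command-line tool from Python could look like the following. It assumes OCRmyPDF is installed and on your PATH; the helper names `build_ocrmypdf_command` and `make_searchable` are made up for this illustration and are not part of OCRmyPDF.

```python
import shutil
import subprocess


def build_ocrmypdf_command(src, dst, language="eng", skip_text=False):
    """Build the argument list for one OCRmyPDF run (hypothetical helper)."""
    cmd = ["ocrmypdf", "--language", language]
    if skip_text:
        # --skip-text leaves pages that already contain text untouched.
        cmd.append("--skip-text")
    cmd += [src, dst]
    return cmd


def make_searchable(src, dst, **options):
    """Run OCRmyPDF on src and write the searchable PDF to dst."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf executable not found on PATH")
    # check=True raises CalledProcessError if OCRmyPDF exits with an error.
    subprocess.run(build_ocrmypdf_command(src, dst, **options), check=True)
```

Calling `make_searchable("scan.pdf", "scan_ocr.pdf")` would then produce a PDF with a text layer, without going through the Vision API at all.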

OCR PDF Files Using Google Cloud Vision?

Are there currently any services or software tools that use Google Cloud Vision as backend for OCRing scanned PDF files?
If not, how would one be able to use Google Cloud Vision to turn PDFs into OCRed PDFs? As far as I know, Cloud Vision currently supports PDF files, but it will output recognized text only as a JSON file. So it seems one would need to do the additional step of placing this converted text on top of the image inside the PDF outside of Google Cloud Vision, in a separate step.
Background:
I often have to convert scanned-document PDF files into PDF files containing an OCRed text layer. So far, I've been using software like OCRKit or ABBYY FineReader. I tested the accuracy of these solutions against the text recognition abilities of Google Cloud Vision, and the latter came out far ahead.
As others have mentioned, you need to use third party tools to do this.
First, convert the Google Cloud Vision response JSON to an hOCR file using gcv2hocr:
gcv2hocr test.jpg.json output.hocr
Then use hocr-tools to stitch the hOCR data into the PDF file. The command below looks in the 'imgdir' folder and merges each .hocr and .jpg file with the same name into a page of out.pdf.
hocr-pdf --savefile out.pdf <imgdir>
As you correctly mentioned, the Vision API returns its responses only in JSON format. You therefore need an additional step in your solution that uses third-party libraries to create a PDF file from the response's content.
If this doesn't cover your needs, you can use the Send Feedback button, located at the lower left and upper right corners of the service's public documentation, or use the Issue Tracker to raise a Vision API feature request and let Google know about the functionality you want.
Solution for starting with a PDF and using Vision's document text detection:
gcv2hocr works for a very specific Vision JSON format, not for the output of document text detection. I refactored that code to create the correct hOCR.
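This is not my actual refactored code, but a rough sketch of the idea, assuming the response contains a fullTextAnnotation whose pages hold blocks, paragraphs, words and symbols. Bounding boxes may come as pixel `vertices` or as `normalizedVertices` (0 to 1), and coordinates equal to zero can be missing from the JSON. The function names are made up for this example, and the output is only an hOCR fragment with word boxes; a complete hOCR file also needs the surrounding HTML and line elements.

```python
from html import escape


def _bbox(box, width, height):
    """Return (x0, y0, x1, y1) in pixels, or None if the box has no points."""
    if box.get("normalizedVertices"):
        points, sx, sy = box["normalizedVertices"], width, height
    else:
        points, sx, sy = box.get("vertices", []), 1, 1
    if not points:
        return None
    # Missing x or y keys mean the coordinate is 0.
    xs = [p.get("x", 0) * sx for p in points]
    ys = [p.get("y", 0) * sy for p in points]
    return (round(min(xs)), round(min(ys)), round(max(xs)), round(max(ys)))


def word_boxes(page):
    """Yield (text, bbox) for every word on one fullTextAnnotation page."""
    width, height = page.get("width", 1), page.get("height", 1)
    for block in page.get("blocks", []):
        for paragraph in block.get("paragraphs", []):
            for word in paragraph.get("words", []):
                text = "".join(s.get("text", "") for s in word.get("symbols", []))
                bbox = _bbox(word.get("boundingBox", {}), width, height)
                if bbox is not None:
                    yield text, bbox


def page_to_hocr(page):
    """Render one page as an hOCR fragment containing ocrx_word spans."""
    spans = [
        "<span class='ocrx_word' title='bbox %d %d %d %d'>%s</span>"
        % (*bbox, escape(text))
        for text, bbox in word_boxes(page)
    ]
    return "<div class='ocr_page' title='bbox 0 0 %d %d'>\n%s\n</div>" % (
        page.get("width", 0), page.get("height", 0), "\n".join(spans))
```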
The second issue is that hocr-pdf expects a JPEG image, not a PDF, as its starting point. I refactored hocr-pdf and added pdf2image:
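import io
from pdf2image import convert_from_path
from reportlab.lib.pagesizes import letter
from reportlab.lib.utils import ImageReader
from reportlab.pdfgen.canvas import Canvas

# imageFilePath (source PDF) and savefile (output PDF) are defined elsewhere.
# Render the first PDF page to an image; this drops any existing hidden text layer.
image = convert_from_path(imageFilePath)[0]
image_obj = io.BytesIO()
image.save(image_obj, format='jpeg')
image_obj.seek(0)
image_obj = ImageReader(image_obj)
can = Canvas(savefile, pagesize=letter)
# pdf2image renders at 200 DPI by default; convert pixels to PDF points (72 per inch).
width, height = image.size
width *= (72 / 200)
height *= (72 / 200)
can.setPageSize((width, height))
can.drawImage(image_obj, 0, 0, width=width, height=height)
# load_invisible_font() and add_text_layer() are helpers from the refactored hocr-pdf script (not shown).
load_invisible_font()
add_text_layer(can, height)
can.showPage()
can.save()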
This is better than a PyPDF2 solution because converting the page to an image first deletes any existing hidden text layers.

Windward Document Generation - PDF Copy Protected

In our application, we generate a few reports and documents through Windward. The documents are generated based on specific user conditions and the user is able to download the document.
As part of a new requirement, we would like to enable copy protection of the generated PDF -- basically, users would not be able to Copy the contents of the document.
Is there any way we can achieve this through Windward? Or do we have to integrate with external third-party software like LockLizard or Win2PDF?
We did think of converting the document to an image and recreating the PDF, but this is unacceptable because the document formatting came out wrong.
Appreciate any insights or alternate solutions.
Thanks,
Aravind
Windward does this. If you're using the Java engine use the following calls (javadoc):
ProcessPdfAPI.setOwnerPassword()
ProcessPdfAPI.setUserPassword()
For the .NET engine use the following calls (api docs):
ReportPdf.OwnerPassword
ReportPdf.Security
ReportPdf.UserPassword
Is this what you need?

PDF to HTML and OCR solution for information extraction

I'm looking for a PDF-to-HTML and OCR solution, either as a cloud service or as an SDK. From my searches I can see there are a bunch of services out there. I tried some of them and got a rough idea of what they offer. I'd like to know whether any of you use such a service.
My biggest concern is having an automated pipeline that produces HTML output I can use for information extraction. I'd like structured data output such as tables. (Most of the services provide HTML output in either a per-character format, with a CSS/HTML tag for each character, or a per-paragraph format, with CSS/HTML for each line.)
So far I have checked:
ABBYY Cloud SDK (They don't have a PDF-to-HTML service, but they have PDF-to-XML, which may be convertible to HTML with XSLT. Their OCR service with text output is also quite good.)
cloudconvert.org (They provide the same results as the Ubuntu pdftohtml command, which is based on poppler-Xpdf3.0.)
pdftohtml command (tested on Ubuntu): I got a result full of <p> tags.
Aspose.PDF (They don't have a PDF-to-HTML service in the cloud, but they have good integration with GDrive, Dropbox and Amazon S3.)
PDFNet from PDFTron: I got a result with complex CSS and HTML structure, with almost one tag per character.
My question is whether you know of any other service worth trying that produces structured HTML output for data extraction.
Thanks in advance.

Adobe API to convert Office documents to PDF

Is there any API available from Adobe that would let me convert Office documents (docx, xlsx, pptx, etc.) to PDF?
I would prefer to use .NET to do so, but if I have to I can resort to C/C++.
I've already tried using the Adobe SDK, but it seems to automate the Acrobat application instead of giving me access to the underlying functionality. If it's possible and anyone would care to give me an example, I'd be very thankful. After many hours of googling I was unable to find a good answer (there are a lot of samples doing the opposite, though: converting from PDF to Word).
One last thing: I need it to be a library from Adobe. So PDFCreator, BCL EasyPDF, Aspose.Words/Cells/Slides etc., unfortunately, won't help me.
UPDATE 1:
I decided to ask this question in the forum because I can't believe that Adobe wouldn't have a library to do this. Of course, that may be the case, but it would be very strange.
UPDATE 2:
I have also looked into the AdobePDFMakerX.Word interface. I tried calling CreatePDF(string in, string out), but to no avail. It always returns false, and there is no error description that I can use.
To convert a doc file to a PDF file, you can use the Adobe PDF Services API.
In short, it has two parts:
Make a POST request (providing the required parameters) and take the x-request-id from the response header.
Make a GET request (providing the required parameters); the response will contain your PDF document.
It is working fine for me.
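The two steps can be sketched in Python roughly as follows. This is only an outline of the flow, not Adobe's actual client: the URLs, headers and payload are supplied by the caller, and the `http` callable, `convert_document` and the `{request_id}` placeholder are all hypothetical names for this example. Only the x-request-id header comes from the description above.

```python
def convert_document(http, submit_url, result_url_template, payload, headers):
    """POST the conversion job, then GET the result using x-request-id.

    `http` is a caller-supplied callable with the (assumed) shape
    (method, url, headers, body) -> (status, response_headers, response_body).
    """
    # Step 1: submit the document and read the request id from the headers.
    status, response_headers, _ = http("POST", submit_url, headers, payload)
    if status >= 400:
        raise RuntimeError("submit failed with HTTP %d" % status)
    # HTTP header names are case-insensitive, so normalise before the lookup.
    lowered = {name.lower(): value for name, value in response_headers.items()}
    request_id = lowered.get("x-request-id")
    if not request_id:
        raise RuntimeError("response did not include an x-request-id header")

    # Step 2: fetch the converted PDF for that request id.
    result_url = result_url_template.format(request_id=request_id)
    status, _, body = http("GET", result_url, headers, None)
    if status != 200:
        raise RuntimeError("fetching the result failed with HTTP %d" % status)
    return body
```

Passing the transport in as a function keeps the sketch independent of any particular HTTP library; in practice the result may not be ready immediately, so the GET step might need to be retried.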
Are you sure Aspose.Words didn't work for you? I tested the code sample below and it works fine.
string filePdf = @"D:\Projects\original.pdf";
string fileDocX = @"D:\Projects\New.docx";
Aspose.Words.Document doc = new Aspose.Words.Document(fileDocX);
doc.Save(filePdf, Aspose.Words.SaveFormat.Pdf);