Tesseract OCR multipage PDF hangs

We are using Tesseract's Java library, called Tess4J, to convert PDF files to text.
It works nicely with TIFF files as well as single-page PDF files. With multi-page PDFs it does generate the output file, but after the last page control never comes back to the application that invoked the doOCR call. It just hangs there without doing anything.
Is it an issue with the native call not returning? I have no clue.
Please let me know if there is a solution to this issue.
Regards
Vish

Tess4J does support multi-page PDF and multi-page TIFF. Substitute your own PDF file in the unit test case and give it a try.
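
If the call still never returns, one possible stopgap is to run it on a worker thread with a timeout so the calling application stays responsive. This is only a sketch using the standard java.util.concurrent classes. The class name OcrWithTimeout and the helper runWithTimeout are made up for illustration, and the lambda stands in for a real call such as tesseract.doOCR(pdfFile). A native call that is truly stuck may ignore the interrupt, so this works around the hang rather than fixing it.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of Tess4J.
final class OcrWithTimeout {

    // Runs a blocking task on a daemon worker thread and gives up after the timeout.
    static String runWithTimeout(Callable<String> task, long timeout, TimeUnit unit)
            throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "ocr-worker");
            t.setDaemon(true); // a stuck worker must not keep the JVM alive
            return t;
        });
        Future<String> future = executor.submit(task);
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            future.cancel(true); // best effort; native code may ignore the interrupt
            throw e;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for: () -> tesseract.doOCR(new File("scan.pdf"))
        String text = runWithTimeout(() -> "page text", 5, TimeUnit.SECONDS);
        System.out.println(text);
    }
}
```

The caller gets a TimeoutException instead of blocking forever, which at least lets it log the problem file and move on.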

Related

Google Cloud Vision PDF Gibberish

I'm trying to extract text from PDF files using the Google Cloud Vision API. It works most of the time, but I get gibberish in a few cases. I tried both DOCUMENT_TEXT_DETECTION and TEXT_DETECTION, and I tried forcing the language in the languageHints, but it didn't help.
Then I tried with a screenshot saved as TIFF and this did work, so I'm guessing that Google tries to use the text embedded in the PDF if it's not just a picture. Indeed, when I select all "text" in the PDF and copy it, I get gibberish.
When I print the TIFF back into a PDF, text extraction works. So it's really something weird with the PDF. But other extraction software (such as ABBYY) works well with the original PDF.
Has anyone had the same kind of issues?
One thing that could help would be an option to force the API to treat the PDF as an "image PDF". Is there such an option?
Thanks for your help!
FYI, I am unfortunately not allowed to show the PDF, and I use the dotnet library.
Edit:
The info on the PDF is:
Creator: "PScript5.dll Version 5.2.2"
Producer: Acrobat Distiller 10.1.16 (Windows)

How can I scrape PDF created using PDF.js using selenium?

I have successfully managed to download the PDF file from a site that uses PDF.js to create and show PDFs (using selenium).
The downloaded PDF file does not open on my desktop (mac & linux).
It seems like the PDF is encoded, or encrypted.
On closer inspection, right after the PDF is downloaded, the network tab also shows pdf.js.worker. It seems like pdf.js.worker is decoding this file to show on the site.
How can I replicate, or follow the same flow of pdf.js.worker and decode this PDF?
Update
I have tried looking at the pdf.js.worker code to follow the code execution, but it seems like a really hard task. I am hoping there is a simpler way.
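
One quick check before digging into the worker code is to see whether the downloaded bytes are a plain PDF at all. A PDF file normally begins with the ASCII marker %PDF-, so a file that starts with something else (HTML, JSON, Base64 text) was probably wrapped or transformed by the site. The sketch below is only a diagnostic under that assumption; the class and method names are made up for illustration, and it needs Java 11 or later.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical diagnostic helper.
final class PdfHeaderCheck {
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // True if the given bytes begin with the "%PDF-" marker.
    static boolean startsWithPdfHeader(byte[] firstBytes) {
        return firstBytes.length >= PDF_MAGIC.length
                && Arrays.equals(Arrays.copyOf(firstBytes, PDF_MAGIC.length), PDF_MAGIC);
    }

    // Reads only the first few bytes of the file and checks them.
    static boolean looksLikePdf(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return startsWithPdfHeader(in.readNBytes(PDF_MAGIC.length));
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length > 0) {
            // Pass the path of the downloaded file as the first argument.
            System.out.println(looksLikePdf(Path.of(args[0])));
        } else {
            // No file given: demonstrate on in-memory samples instead.
            byte[] pdf = "%PDF-1.7\n".getBytes(StandardCharsets.US_ASCII);
            byte[] html = "<html>".getBytes(StandardCharsets.US_ASCII);
            System.out.println(startsWithPdfHeader(pdf));
            System.out.println(startsWithPdfHeader(html));
        }
    }
}
```

If the check fails, opening the file in a text editor usually shows what the server actually sent.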

How to generate PDF file using X++?

Can I create a simple PDF file in X++? In this PDF I would like to have, for example, the result of a select from one table or some simple static text.
MorphX reports can be saved to PDF by using the proper print settings beforehand.
SSRS reports can do this also using similar tricks.
Another way is to generate RTF, then let Word do the PDF creation. Silly, but maybe the PDF is smaller or better looking.
It is possible, but not simple, to generate PDF directly by using third-party .NET components.
Some weeks ago, I used the Evo HTML to PDF library http://www.evopdf.com/ to convert simple HTML templates to PDF and it worked great. It can convert plain text as well, so maybe it could be useful for you.
Natively, AX has nothing for creating PDF files.

Convert Pdf to Bitmap via code

I know there are tons of threads about this "out there", but all I can find is bitmap to PDF and how to add images to a PDF.
I have a PDF which I would like to convert to JPEG. I've tried to use iTextSharp, but I can only find info about making a PDF, not the other way around. Any ideas or links to actual code?
ImageMagick uses Ghostscript to handle PDFs so if this is your only task I'd recommend just using Ghostscript. There's a managed wrapper here and you can get the Ghostscript binaries from here. They come in an installer but you can just extract them using 7-Zip. See this discussion on what you need to deploy in your app. You might have to play around with 32-bit vs 64-bit. Also, on the Ghostscript download page please read the "Which license is right for me?" section.
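
For reference, the Ghostscript route boils down to one command line using its jpeg output device. The sketch below builds that command and, optionally, runs it through ProcessBuilder. It is written in Java purely for illustration (the same arguments apply from any language or a shell), the class and method names are made up, and the executable name is an assumption: typically gs on Linux and macOS, gswin64c or gswin32c on Windows.

```java
import java.io.IOException;
import java.util.List;

// Hypothetical wrapper around the Ghostscript command line.
final class GhostscriptJpegExport {

    // Builds the argument list; outputPattern such as "page-%03d.jpg" yields one file per page.
    static List<String> buildCommand(String gsExecutable, String inputPdf,
                                     String outputPattern, int dpi) {
        return List.of(
                gsExecutable,
                "-dSAFER",          // restrict file access from within the PDF
                "-dBATCH",          // exit after processing instead of prompting
                "-dNOPAUSE",        // do not pause between pages
                "-sDEVICE=jpeg",    // render to JPEG
                "-r" + dpi,         // resolution in dots per inch
                "-sOutputFile=" + outputPattern,
                inputPdf);
    }

    // Runs Ghostscript and returns its exit code (0 means success).
    static int convert(String gsExecutable, String inputPdf,
                       String outputPattern, int dpi)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(
                buildCommand(gsExecutable, inputPdf, outputPattern, dpi))
                .inheritIO()
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) {
        // Only prints the command; call convert(...) to actually run it.
        System.out.println(String.join(" ",
                buildCommand("gs", "input.pdf", "page-%03d.jpg", 150)));
    }
}
```

Shelling out like this avoids the 32-bit versus 64-bit native library question, at the cost of requiring the Ghostscript executable on the machine.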

Print pdf in applet using itext

I have an issue printing a PDF file in an applet. I get the input over HTTP and the stream is constructed using PdfStamper. The problem is that I want to send the resulting stream to a printer, but I did not find how to do that.
Unless the printer supports PDF you cannot send it directly to the printer. You need to rasterize it. I wrote a blog article on printing PDFs from Java at http://www.jpedal.org/PDFblog/2010/01/printing-pdf-files-from-java/
PDFBox might manage it. I'm not aware of any other Java-specific PDF renderers out there, though I wouldn't be shocked to find there's a couple more out there.
Basically, any app that can convert a PDF to an image can probably act as a print driver.
Ghostscript perhaps?
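
To make the "rasterize, then print" idea concrete, here is a rough sketch of the printing half using the standard java.awt.print API. It assumes the pages have already been rendered to BufferedImage objects by whatever renderer you pick (the rendering step is not shown), and the class name RasterPagePrintable is made up for illustration.

```java
import java.awt.Graphics;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.awt.print.PageFormat;
import java.awt.print.Printable;
import java.util.List;

// Hypothetical Printable that prints one pre-rendered image per page.
final class RasterPagePrintable implements Printable {
    private final List<BufferedImage> pages;

    RasterPagePrintable(List<BufferedImage> pages) {
        this.pages = pages;
    }

    @Override
    public int print(Graphics g, PageFormat format, int pageIndex) {
        if (pageIndex < 0 || pageIndex >= pages.size()) {
            return NO_SUCH_PAGE; // tells the print job there is nothing more to print
        }
        BufferedImage page = pages.get(pageIndex);
        Graphics2D g2 = (Graphics2D) g;
        // Scale uniformly so the image fits inside the printable area.
        double scale = Math.min(
                format.getImageableWidth() / page.getWidth(),
                format.getImageableHeight() / page.getHeight());
        g2.translate(format.getImageableX(), format.getImageableY());
        g2.scale(scale, scale);
        g2.drawImage(page, 0, 0, null);
        return PAGE_EXISTS;
    }

    public static void main(String[] args) {
        // Dry run against an off-screen image instead of a real printer.
        BufferedImage fakePage = new BufferedImage(600, 800, BufferedImage.TYPE_INT_RGB);
        RasterPagePrintable printable = new RasterPagePrintable(List.of(fakePage));
        Graphics2D target = new BufferedImage(600, 800, BufferedImage.TYPE_INT_RGB).createGraphics();
        System.out.println(printable.print(target, new PageFormat(), 0) == PAGE_EXISTS);
        target.dispose();
    }
}
```

To send it to a real printer, hand the object to a java.awt.print.PrinterJob via setPrintable and call print() on the job. In an applet this generally needs the applet to be signed or otherwise granted print permission.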