Read from a searchable pdf, without ocr - pdf

I'm currently using my scanner to turn my PDFs into searchable PDFs. The OCR is already taken care of, since I can use Ctrl+F within the PDF.
How can I get at the OCR'd content from my program, though?
I'm open to using Java or Ruby; the question is fairly language-agnostic. Is the OCR'd text openly accessible by reading the file?

I'm not sure how your OCR software creates the PDF, but could you use a third-party library such as JPedal or iText, or a tool such as Xpdf, to extract the text from the resulting PDF?
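On whether the text is accessible by reading the file: in a searchable PDF the recognized text is typically stored as ordinary text-showing operators (`Tj`, `TJ`) inside the page content streams, and that is what extraction libraries parse. A minimal Python sketch of the idea, using only the standard library, might look like the following. The function name `extract_text_naive` is made up for illustration, and the sketch assumes Flate-compressed or uncompressed streams, literal (not hex) strings, and a simple single-byte encoding; it ignores fonts, octal escapes, reading order, and object streams, so a real library is still the right tool for real documents.

```python
import re
import zlib

# A PDF literal string: "(...)" with backslash escapes, no nested parentheses.
_STR = rb"\((?:\\.|[^\\()])*\)"
# "(text) Tj" shows one string; "[(te) -20 (xt)] TJ" shows an array of strings.
_SHOW = re.compile(
    rb"(" + _STR + rb")\s*Tj|\[((?:" + _STR + rb"|[^\]()])*)\]\s*TJ", re.DOTALL
)
# Every stream body sits between the "stream" and "endstream" keywords.
_STREAM = re.compile(rb"stream\r?\n(.*?)\r?\nendstream", re.DOTALL)


def _decode(literal: bytes) -> str:
    """Strip the outer parentheses and undo \\( \\) \\\\ escapes (assumption: Latin-1 text)."""
    body = re.sub(rb"\\([()\\])", rb"\1", literal[1:-1])
    return body.decode("latin-1")


def extract_text_naive(pdf_bytes: bytes) -> str:
    """Return the strings shown by Tj/TJ operators, one per line, in file order."""
    lines = []
    for stream in _STREAM.finditer(pdf_bytes):
        data = stream.group(1)
        try:
            # FlateDecode is the most common filter; decompressobj tolerates
            # a missing trailing checksum byte.
            data = zlib.decompressobj().decompress(data)
        except zlib.error:
            pass  # not Flate-compressed: scan the bytes as they are
        for shown in _SHOW.finditer(data):
            if shown.group(1) is not None:
                lines.append(_decode(shown.group(1)))
            else:
                parts = re.findall(_STR, shown.group(2), re.DOTALL)
                lines.append("".join(_decode(p) for p in parts))
    return "\n".join(lines)
```

Calling `extract_text_naive(open("scan.pdf", "rb").read())` (with a hypothetical file name) returns whatever text the streams expose; an empty result suggests the file is image-only or uses features this sketch does not handle.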

Related

How to convert a Scanned PDF file using Foxit Libraries

My question is simple and clear, yet I find that nobody else seems to have this problem. I insist on using the Foxit libraries because they support Farsi text recognition. My programming language is VBA.
It seems that neither PhantomPDF nor the Foxit SDK libraries offer a method to OCR a PDF file without saving it as an Excel file afterwards.

Include custom fonts in PDF

I have a question about generating PDFs with wkhtmltopdf. I know it's possible to use custom fonts in my HTML, but I think the operating system used to view the PDF must have these fonts installed. Is that correct?
My question is whether it's possible to include these fonts in the PDF, so that when the PDF is generated I can send it to a print office to print 50 copies and they see the PDF exactly as I do, without having these fonts installed.
This is certainly possible.
It's called "embedding a font" in PDF lingo.
Most PDF generation libraries should support this.
PDF comes in different flavors (standards). One of them, PDF/A, is meant for long-term storage (the A stands for archiving). The idea is that the document's look and feel should be preserved as much as possible. To achieve this without depending on the operating system (and the fonts it ships with), the PDF/A standard requires that fonts be embedded.
https://en.wikipedia.org/wiki/PDF/A
I don't know how to do this in the library you are using, but I do know it's possible with iText.
This is a great tutorial on it; aside from giving you more information about iText, it also illustrates the problem with custom fonts in a very visual way.
https://developers.itextpdf.com/tutorial/using-fonts-pdf-and-itext
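Whichever generator is used, one rough way to check the result is to look for embedded font programs in the output file. In the PDF format, a font descriptor points to an embedded font program through a `/FontFile`, `/FontFile2`, or `/FontFile3` entry, and subset-embedded fonts carry a six-letter tag plus `+` in front of their `/BaseFont` name. The Python sketch below scans the raw bytes for those keys; `font_report` is a hypothetical helper, and the scan will miss anything stored inside compressed object streams, so treat a negative result as inconclusive.

```python
import re

# Keys that reference an embedded font program in a font descriptor.
_EMBEDDED = re.compile(rb"/FontFile[23]?\b")
# Font names as declared by /BaseFont entries.
_BASEFONT = re.compile(rb"/BaseFont\s*/([^\s/<>\[\]()]+)")


def font_report(pdf_bytes: bytes):
    """Return (sorted font names, number of embedded-font references found)."""
    names = sorted(
        {m.group(1).decode("latin-1") for m in _BASEFONT.finditer(pdf_bytes)}
    )
    embedded_refs = len(_EMBEDDED.findall(pdf_bytes))
    return names, embedded_refs
```

A name such as `ABCDEF+MyFont` next to a non-zero reference count is a good sign that the font travels with the file; a plain name with no `/FontFile*` reference usually means the viewer has to supply the font.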

Creating a PDF viewer using iTextSharp

I am trying to create a PDF viewer using the iTextSharp library, but there doesn't seem to be any documentation anywhere about how I can accomplish this. I don't need to create a PDF file, just display one and give users the option to save the file or export it to a CSV file.
Can somebody please point me in the right direction?
Neither iText nor iTextSharp is a PDF viewer, but either could be used to examine a PDF document. See, for instance, iText RUPS, a tool that lets you look under the hood of a PDF, more specifically at the PDF objects stored in it as well as at the content streams.
This would be the first step towards writing a PDF viewer. However, iTextSharp doesn't interpret the content stream of a page, nor the resources that belong to that page (such as image streams, glyph descriptions, etc.). If that's what you want to build, you need to consult ISO-32000-1. Note that it will probably take several man-years to create a decent viewer.
As for the requirement to export a PDF document to a CSV, this may be possible if your original PDF is a Tagged PDF, but it will be impossible for the majority of PDF documents, including documents that consist of scanned images and documents with no machine-recognizable structure.
Please understand that this is a general answer. A more specific answer cannot be given since your question is too broad for Stack Overflow. All the answers you need can be found by using iText RUPS and reading ISO-32000-1 (there's a copy of ISO-32000-1 available on Adobe's web site).

Using Extracttext from cfpdf to get text of a pdf in coldfusion

Currently, I have a PDF that is not searchable, and I am wondering what the best process is for preparing the file for ColdFusion so I can index it.
In particular, I am wondering whether a PDF file needs to be readable before using extracttext in cfpdf to pull the text from it.
I really appreciate the advice, and I hope it helps other people who are interested in indexing PDF files with ColdFusion.
I was considering extracting the text with Tesseract, as suggested here:
Performing Optical Character Recognition on PDF's from ColdFusion using a Java or .NET Library?
But if there is a built-in feature in ColdFusion, I would much rather use that, and I think it would be more helpful to other people to know whether ColdFusion can natively handle this task.

HowTo extract embedded OCR data from a PDF?

I have PDF files with embedded OCR data (I have already OCR'd them), so they are searchable. Now I want to extract this OCR data because I want to put it in my Tomcat 6 search server. For this, I need the plain OCR data.
So my question is: is it possible to extract this embedded OCR data from the PDF files?
It would be nice to get files with coordinates, but plain-text files would also be sufficient.
You should be able to do this with iText or iTextSharp. iTextSharp has essentially no documentation, however, and a good number of its functions are not equivalent to those found in iText.
PDFSharp does not support iref streams (cross-reference streams). Those are pretty much the only comprehensive open-source solutions. If you do not mind paying, vista solutions may have something for you; they mostly handle workflow, but they have some pretty extensive PDF libraries as well.
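As for the coordinates asked about in the question: text positions live in the same content streams as the text itself, set by operators such as `Td` (move relative to the start of the current line) and `Tm` (set the text matrix). The Python sketch below illustrates that for one already-decompressed content stream. `text_with_positions` is a made-up name, and the sketch assumes unscaled, unrotated text matrices, ignores the page transformation matrix, and handles only `Td`, `Tm`, and `Tj`, so the numbers are line-start positions in text space rather than exact glyph boxes.

```python
import re

# Tokens are either literal strings "(...)" or runs of non-space characters.
_TOKEN = re.compile(rb"\((?:\\.|[^\\()])*\)|[^\s()]+", re.DOTALL)
# Anything made only of these characters is treated as an operator.
_OPERATOR = re.compile(rb"[A-Za-z'\"*]+")


def text_with_positions(content: bytes):
    """Return a list of (x, y, text) for each Tj in one decoded content stream."""
    results = []
    operands = []
    x = y = 0.0  # start of the current text line
    for tok in _TOKEN.findall(content):
        if tok == b"BT":  # a new text object resets the position
            x = y = 0.0
            operands.clear()
        elif tok == b"Td" and len(operands) >= 2:
            x += float(operands[-2])
            y += float(operands[-1])
            operands.clear()
        elif tok == b"Tm" and len(operands) >= 6:
            # a b c d e f Tm: e and f are the translation (assumes a=d=1, b=c=0)
            x, y = float(operands[-2]), float(operands[-1])
            operands.clear()
        elif tok == b"Tj" and operands and operands[-1].startswith(b"("):
            raw = re.sub(rb"\\([()\\])", rb"\1", operands[-1][1:-1])
            results.append((x, y, raw.decode("latin-1")))
            operands.clear()
        elif _OPERATOR.fullmatch(tok):  # any other operator: drop its operands
            operands.clear()
        else:
            operands.append(tok)
    return results
```

Libraries such as iText expose the same information through their own text-extraction hooks with proper font and matrix handling, which is what you would want for anything beyond a quick look.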