Where do I find the field that determines whether a PDF is a fillable form or a scanned image? - pdf

A very quick question, but I am at my wits' end trying to find it, so I hope someone has crossed this path before. I have thousands of PDFs and need to know which of them are fillable forms and which are scanned images. I know I have seen a field that shows form: AcroForm or something to that effect. Might this be in the info dictionary, or do I need to go the XMP route? Is this the same location where I might be able to tell if a PDF was a scanned image or page? (I recognize that might have to be split into a separate question.)
The goal is to loop through a series of pdfs and extract that data for a table.
Thank you
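For what it's worth, the form marker is neither in the info dictionary nor in XMP: a PDF with an interactive form has an /AcroForm entry in its document catalog (the /Root object). Below is a minimal sketch, standard library only, that scans the raw bytes of each file for that name and collects rows for a table. The function names `has_acroform_marker` and `build_table` are made up for illustration. Treat it as a rough first pass, not a reliable test: it misses catalogs stored in compressed object streams (common from PDF 1.5 on), and an /AcroForm entry with an empty /Fields array is not really a fillable form. A PDF library that exposes the document catalog would be the sturdier route.

```python
import re
from pathlib import Path

# The name /AcroForm followed by a delimiter, so /AcroFormat does not match.
_ACROFORM = re.compile(rb"/AcroForm\b")


def has_acroform_marker(data: bytes) -> bool:
    """Heuristic: True if the raw bytes contain the /AcroForm name.

    Assumption: the catalog is stored uncompressed. Catalogs inside
    compressed object streams will not be found by this byte scan.
    """
    return _ACROFORM.search(data) is not None


def build_table(folder):
    """Return (file name, has_acroform_marker) rows for every *.pdf in folder."""
    rows = []
    for path in sorted(Path(folder).glob("*.pdf")):
        rows.append((path.name, has_acroform_marker(path.read_bytes())))
    return rows
```

Calling `build_table("some/folder")` gives a list of tuples that can be written out as CSV or loaded into whatever table you are building.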

Related

Adding a pdf footer conditionally on certain pages in a multi-page pdf document

I have a web application that generates PDFs for each request. The data in these PDFs differs based on user information, and the number of pages can vary from 6 to 9. To construct the PDFs, I have multiple PdfPTables, and each table has its own cells. Once I have constructed all the PdfPTables, as a final step I add the tables to the document.
Recently I got a requirement that whenever a particular text occurs, we need to add a footer indicating the occurrence of this text on the respective pages. This can be on page 3, on page 6, or on both. I have been trying to figure out a way to do this.
One approach is to identify this text at the time of adding it to the PdfPCell and then generate a footer. But at this stage I don't know which page of the document the cell will end up on, because I let the table flow onto the next page if it doesn't fit on the current page.
Another approach is to parse the entire PDF before sending the response back: take the pages one by one, get the text, compare it against the search text, and add a footer if it exists. Somehow I feel this is a costly operation.
Please let me know if any of you have any suggestions.
Any help would be highly appreciated.
Thanks,
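
The second approach boils down to a per-page substring check once the text of each page has been extracted. Here is a rough sketch of just that step in Python, assuming the page texts are already available as a list of strings (the extraction itself depends on the PDF library and is not shown). The function name `pages_needing_footer` and the case-insensitive match are assumptions for illustration.

```python
def pages_needing_footer(page_texts, search_text):
    """Return the 1-based page numbers whose text contains search_text.

    Assumption: page_texts[i] holds the extracted text of page i + 1,
    and matching should ignore case.
    """
    needle = search_text.casefold()
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if needle in text.casefold()
    ]
```

The check itself is cheap. Whatever cost there is comes from extracting the text of 6 to 9 pages, and extracted text can split words in unexpected places, so the search text may need some normalization.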

How can I edit the search text of a searchable PDF?

I have access to a scanner at my library which can create "searchable PDFs." These are PDFs that show the exact image of a scanned document, but there is a kind of hidden text in the PDF that can be selected when you try to select a portion of the image that contains text. In this way you can copy and paste text or search for text in the scanned document. This is VERY useful. It's an awesome improvement over raw scanned images. I also have several apps on my Mac that can create this kind of searchable PDF from a scanned document or a raw image.
Now it's obvious to anyone who has ever used OCR that the process of converting images to text is not 100% accurate, so the text that you search or copy will not be correct in some places.
So I searched for quite some time to find an application that would load a searchable PDF and allow me to repair the hidden searchable text without reformatting or modifying the original scanned image.
Does anyone know of a tool (or library API) that would allow this?
It's worth saying here that I tried the latest version of Adobe Acrobat DC for Mac, and it doesn't seem to even allow me to view the hidden searchable text, much less edit it. It does allow me to replace the scanned image with the results of its own OCR process so that I could edit and save the document. But this would produce horrible results for any of the scanned documents that I am using. It seems designed for editing a "native PDF", not for editing a scanned document.
I have also tried ABBYY FineReader with no luck.
I'm using ABBYY FineReader 12 Professional (not open source).
Just open a scanned image or scanned PDF and press Verify Text (or Ctrl+F7), then go over all the spelling errors and low-confidence characters and fix them.
The program is very good: it shows you the exact place in the image/PDF to correct and the OCR guess side by side for convenience. It iterates over all of them.
[By the way, I'm using the shortcuts to speed things up:
Alt+Enter to add the unrecognized word to the dictionary.
Ctrl+Delete to skip a word, or to confirm it in case you fixed it.]
Then save the document as a PDF file (Menu: File > Save Document As > PDF File), and you can search it in any PDF reader. The saved file looks the same as the scanned one, but 'behind' it there is text.
It's weird that you tried ABBYY with no luck... it's working great for me. Maybe you didn't try the Professional version.
Hope it helps you.
Creating a searchable PDF from images is not what the poster is after. He wants to start with an already searchable PDF and modify its text (e.g. because a searchable PDF was made initially, but later an overlooked recognition error was found and needs correction). I see no way and no tool that assists in doing this.

Is it possible to make a Telerik report based on an existing PDF?

I have a lengthy PDF time tracking document that was printed out and used in a paper process to schedule appointments. Now this paper process is being converted to an online application, and this application needs to generate reports in the same format as the PDF document (this time programmatically inserting values into rows instead of having someone write them on the piece of paper).
My question is this: is it possible to somehow import the layout of that PDF document into Telerik reporter's designer? Otherwise, is there some sort of intermediary tool that I can use to make the layout more exportable?
Just to clarify, I am not trying to save my reports as PDF but trying to use a given PDF's layout to create a similar looking report in Telerik.
Any tips would be very welcome.
Thank you very much!
There are numerous tools for extracting text or images from PDF files, but I am pretty sure nothing exists to extract the layout of a PDF. The PDF format is just text and symbols with coordinates. There is no layout to extract.

How can I programmatically verify that a PDF file is first-generation?

I'm working on a project that involves the Fannie Mae/Freddie Mac Uniform Appraisal Dataset. The specification requires that the embedded appraisal PDF file be first-generation.
I understand conceptually what a first-generation PDF file is (printing of a document directly to PDF, rather than a scanned copy or printed and scanned copy). However, I've done some research and haven't found anything that specifies the properties of a first-generation PDF that could be verified programmatically.
I found a product that allows one to check if a PDF contains text, images, or both: Aspose.Pdf.Kit for .NET, but I'm looking for a way to program this myself, for budgetary and other reasons. Also, I'm not sure that determining that the file contains text will be sufficient to verify that it's first-generation.
Given that this is an industry requirement of a very large industry, I feel like someone must have already tackled this issue, but I'm having a hard time finding anything.
Thanks in advance for any help.
There is no way to know for certain if a PDF is "first generation". Technically, a scanned PDF is just a PDF that contains images and perhaps OCR'ed text on top of that. A "first generation" PDF could easily have the same characteristics, so you have to use some heuristics.
For example, a PDF that contains only images and invisible text (from OCR) is likely to be scanned, while a PDF that has visible text or vector graphics is probably "first generation". (OCR for scanned PDFs works by overlaying invisible text on top of the original image, so that text selection works but the original document's fidelity is preserved.)
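As a rough illustration of that heuristic, here is a sketch in Python that inspects one already-decoded page content stream. It relies on a few facts from the PDF operator set: text rendering mode 3 (`3 Tr`) means invisible text, `Tj`, `TJ`, `'` and `"` show text, `Do` paints an XObject (usually an image on scanned pages, though it can also be a form XObject), and `S`, `f`, `B` and their variants paint vector paths. The function names are made up for illustration, the tokenizer is deliberately naive (nested parentheses in strings and inline images are not handled), and getting the decoded content stream out of the file is left to whatever PDF library you use.

```python
import re

TEXT_SHOW_OPS = {"Tj", "TJ", "'", '"'}
PATH_PAINT_OPS = {"S", "s", "f", "F", "f*", "B", "B*", "b", "b*"}

# Strip string operands so their contents are not mistaken for operators.
_LITERAL = re.compile(r"\((?:\\.|[^\\()])*\)", re.DOTALL)
_HEX = re.compile(r"<[0-9A-Fa-f\s]*>")


def summarize_content(content: str) -> dict:
    """Record which kinds of marks a decoded content stream produces."""
    stripped = _HEX.sub(" ", _LITERAL.sub(" ", content))
    tokens = stripped.replace("[", " ").replace("]", " ").split()
    seen = {"visible_text": False, "invisible_text": False,
            "xobject": False, "vector": False}
    render_mode = 0   # default text rendering mode is 0 (fill)
    saved = []        # q/Q save and restore the graphics state
    for i, tok in enumerate(tokens):
        if tok == "q":
            saved.append(render_mode)
        elif tok == "Q" and saved:
            render_mode = saved.pop()
        elif tok == "Tr" and i > 0 and tokens[i - 1].isdigit():
            render_mode = int(tokens[i - 1])
        elif tok in TEXT_SHOW_OPS:
            key = "invisible_text" if render_mode == 3 else "visible_text"
            seen[key] = True
        elif tok == "Do":
            seen["xobject"] = True
        elif tok in PATH_PAINT_OPS:
            seen["vector"] = True
    return seen


def guess_origin(content: str) -> str:
    """Apply the heuristic; the result is a guess, not a verdict."""
    seen = summarize_content(content)
    if seen["visible_text"] or seen["vector"]:
        return "probably first-generation"
    if seen["xobject"]:
        return "likely scanned"
    return "unknown"
```

For a whole document you would run this over every page and decide how to treat mixed results, e.g. a first-generation report with a scanned signature page.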
Open the PDF, press Ctrl+F, and type in "Appraisal". If you get a hit for the word, you have a first-generation appraisal. Or rather, the dataset exists.

Is it possible to select and copy text in a PDF?

I want to create a PDF page from which I can copy some text and paste it into another document. I have gone through many PDF examples, but I haven't seen any app that supports selecting text in a PDF. So I want to know whether it is possible, or whether I need to try some format other than PDF.
For this you need the CGPDF class. Here is a link that might help you:
http://www.random-ideas.net/posts/42