Parse a PDF and identify the page a phrase is on

I want to programmatically parse a PDF file, look for certain phrases, and find out the page number each phrase is on. Is this possible (I understand that PDF is not like a text file)? If so, are there libraries out there that can help?

Apache Tika, which grew out of the Apache Lucene project, uses PDFBox under the hood and will pull out the text so you can work with it.
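As a minimal sketch of the phrase-to-page lookup, here it is with pypdf (one of the libraries mentioned in the research links further down) rather than Tika or PDFBox; the file name and phrases are placeholders:

    from pypdf import PdfReader

    PHRASES = ["first phrase", "second phrase"]   # placeholders

    reader = PdfReader("document.pdf")
    for page_number, page in enumerate(reader.pages, start=1):
        text = page.extract_text() or ""
        for phrase in PHRASES:
            if phrase in text:
                print(f"'{phrase}' found on page {page_number}")

Because pypdf exposes pages in document order, the enumeration index is exactly the page number you are after.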

Related

How to get text from all pages in a PDF using textract?

I have to use Python to automate extracting text from files that could have any number of different file extensions, which I don't know ahead of time. textract seems well suited to this purpose, as it accepts a lot of file extensions, but when I use it to read PDFs it only returns the text from the first page. How do I get it to return the text from all pages of the PDF in one string?
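For what it's worth, textract's default PDF backend (pdftotext) normally reads the whole document, so a first-page-only result usually points at the backend in use. A sketch that decodes the returned bytes and pins a specific backend explicitly (the file name is a placeholder):

    import textract

    # textract returns bytes; decode once to get the whole file as one string.
    text = textract.process("document.pdf", method="pdfminer").decode("utf-8")
    print(text)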

Using PDFBox or something else, is it possible to know if a pdf contains no scanned pages?

I'm looking for a way to detect whether a PDF document contains some non-searchable text. Think of a scenario where a multi-page PDF contains some plain-text pages (with or without images, it doesn't matter) and one or more pages containing non-searchable text.
So I would like a method returning true/false that can detect whether a PDF contains some non-searchable text (or vice versa). In your opinion, is this possible with PDFBox or something else?
Thanks
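As far as I know none of the common libraries has a built-in flag for this, but a workable heuristic is to walk the pages and report any page whose text layer comes back empty. A sketch of that heuristic using pypdf instead of PDFBox (the file name is a placeholder):

    from pypdf import PdfReader

    def has_non_searchable_pages(path):
        # Heuristic: a page with an empty text layer is likely a scanned image.
        reader = PdfReader(path)
        for page in reader.pages:
            if not (page.extract_text() or "").strip():
                return True
        return False

    print(has_non_searchable_pages("document.pdf"))

Note that this can misfire on genuinely blank pages, so you may also want to check whether such a page contains images before calling it scanned.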

Grails: find/read text from a PDF file

We are using Grails 2.1.1 and we want to search for contact numbers in an uploaded PDF file. We have already done this with doc files, but now we want to search and extract contacts from PDF files as well.
Is there any way to search and extract text from PDF files in Grails?
Have you looked at Apache Tika?
It should handle both of these formats and save you time handling each type separately.
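In a Grails app you would call Tika's Java API directly, but as a quick sketch of the extract-then-regex idea, here it is via the tika Python bindings; the file name and the phone-number pattern are only illustrative assumptions:

    import re
    from tika import parser  # Python bindings for Apache Tika

    parsed = parser.from_file("upload.pdf")   # placeholder file name
    text = parsed.get("content") or ""

    # Illustrative pattern only; tune it to the formats you expect.
    contacts = re.findall(r"\+?\d[\d\s().-]{7,}\d", text)
    print(contacts)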

How to delete the first page from multiple PDFs

I have a collection of PDFs where the first page of the document is sometimes an info page that I want to remove.
Is there a quick way to delete this info page from all of my PDFs, or at least a way to list all PDFs that have more than one page so I can better find the ones that need to be fixed?
Do you know of any program that can do this? Or a way to do this with Python?
Note: the info page has text on it that always remains the same: "LAND TITLE OFFICE".
Using Windows 7.
Thanks
Some research turned up the following:
http://www.python.org/workshops/2002-02/papers/17/index.htm
http://www.unixuser.org/~euske/python/pdfminer/index.html
https://pypi.org/project/pypdf/
You can try these two ways:
pdftk is a utility to manipulate PDFs. Check this link; they are doing something similar to what you need (in the comments someone also posted a script for Windows).
PDFsam is a powerful graphical tool to manipulate PDFs in bulk. The split+merge sections should do the trick.
Both of them are free. I'd suggest studying the first if you want to write a "recipe" you can reuse often, and the latter if you only have to do this once.
You can use the open-source PDFBox as a command-line utility to split PDFs.
The link for PDFBox is here: link
The documentation for splitting a PDF using PDFBox is here: link
You could use PDFBox's extract-text functionality from a batch script and combine it with grep to identify pages that contain the text you are looking for. The extract-text documentation is here: link
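Tying this together in Python with pypdf (one of the libraries from the research links above): a sketch that drops the first page of every PDF in a folder when that page contains the known marker text. The folder name and output naming are assumptions; run it on copies first.

    from pathlib import Path
    from pypdf import PdfReader, PdfWriter

    MARKER = "LAND TITLE OFFICE"

    for pdf_path in Path("pdfs").glob("*.pdf"):    # folder name is a placeholder
        reader = PdfReader(pdf_path)
        first_text = reader.pages[0].extract_text() or ""
        if len(reader.pages) > 1 and MARKER in first_text:
            writer = PdfWriter()
            for page in reader.pages[1:]:          # keep everything but page 1
                writer.add_page(page)
            out = pdf_path.with_name(pdf_path.stem + "_trimmed.pdf")
            with open(out, "wb") as fh:
                writer.write(fh)
            print(f"Removed info page from {pdf_path.name}")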

Extract "cover image" from CHM and EPUB files

How can I programmatically and reliably create PNG images from CHM and EPUB files? Only the first page is needed, as in "cover image thumbnail generation".
Could this even be done just from the command line?
I have already looked at the open-source CHM QuickLook plug-in for Mac OS X for source code that does this, and at Calibre, the latter to no avail.
In CHM, the default page is a web page (an .html file). Of course it can only contain one page.
An extractor program is easy to write based on chmlib or Free Pascal's CHM libraries, but it will need to parse that HTML to also find the names of the other files to extract. Roughly, the algorithm would be (a sketch follows the list):
use some "list" function of a command-line extraction tool to get the default page's name (this is stored in an internal record).
extract it, and parse it for img and other referencing tags.
extract those.
The biggest picture extracted in the previous step is probably "it"!
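A rough sketch of that algorithm in Python, assuming the pychm bindings for chmlib (the CHMFile.home / ResolveObject / RetrieveObject calls follow pychm's API as I understand it; treat this as an untested starting point, not a finished tool):

    from html.parser import HTMLParser
    from chm import chm  # pychm, Python bindings for chmlib (assumption)

    class ImgCollector(HTMLParser):
        # Collect the src attribute of every <img> tag on the default page.
        def __init__(self):
            super().__init__()
            self.sources = []
        def handle_starttag(self, tag, attrs):
            if tag == "img":
                src = dict(attrs).get("src")
                if src:
                    self.sources.append(src)

    book = chm.CHMFile()
    book.LoadCHM("book.chm")                 # placeholder file name

    # Step 1: the default page's name, stored in an internal record.
    ok, ui = book.ResolveObject(book.home)
    size, html = book.RetrieveObject(ui)

    # Step 2: parse it for img and other referencing tags.
    collector = ImgCollector()
    collector.feed(html.decode("utf-8", errors="replace"))

    # Steps 3-4: extract each referenced image; the biggest is probably "it".
    # Note: relative paths may need resolving against the page's directory,
    # and some pychm versions want bytes rather than str paths.
    biggest, biggest_data = None, b""
    for src in collector.sources:
        ok, ui = book.ResolveObject("/" + src.lstrip("/"))
        if ok == 0:                          # 0 == CHM_RESOLVE_SUCCESS in chmlib
            size, data = book.RetrieveObject(ui)
            if size > len(biggest_data):
                biggest, biggest_data = src, data

    book.CloseCHM()
    print("cover candidate:", biggest)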