We are using Grails 2.1.1 and want to search for contact numbers in an uploaded PDF file. We have already done this with DOC files, but now we want to search and extract contacts from PDF files as well.
Is there any way to search and extract text from PDF files in Grails?
Have you looked at Apache Tika?
It should handle both of these formats and save you from handling each type separately.
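Once Tika (or any other extractor) has given you plain text, finding the contact numbers is mostly a regular-expression job. Here is a minimal sketch in Python, assuming the text has already been extracted. The pattern, the digit-count limits, and the `find_contact_numbers` name are illustrative assumptions, and the pattern would need tuning for the number formats you actually see:

```python
import re

# Assumed pattern: optional leading "+", then digits separated by spaces,
# dots, hyphens or parentheses. Tune it for your real data.
PHONE_PATTERN = re.compile(r"\+?\d[\d\s().-]{6,}\d")


def find_contact_numbers(text):
    """Return candidate phone numbers found in already-extracted text."""
    numbers = []
    for match in PHONE_PATTERN.finditer(text):
        candidate = match.group().strip()
        digits = re.sub(r"\D", "", candidate)
        # Keep only candidates with a plausible digit count (assumption: 7-15)
        # and skip duplicates while preserving the order of first appearance.
        if 7 <= len(digits) <= 15 and candidate not in numbers:
            numbers.append(candidate)
    return numbers
```

The same function can be run over text coming from DOC and PDF files alike, since it only sees the extracted string.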
Related
Is it possible to use Meilisearch to search contents of PDF and DOCX files? If yes What is the process of indexing and searching?
It's currently not possible to index PDF or DOCX files directly with MeiliSearch; you have to extract the text from your files and push the content into MeiliSearch. The content types currently accepted are JSON, CSV, and NDJSON.
Here you can find a discussion where a user explains his approach: https://github.com/meilisearch/product/discussions/164
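As a rough illustration of the "extract, then push" step, the sketch below turns already-extracted text into an NDJSON payload, one of the accepted content types. The field names (`id`, `filename`, `content`) and the `build_ndjson` helper are assumptions for illustration, not something MeiliSearch prescribes; the text extraction itself and the HTTP upload are left out.

```python
import hashlib
import json


def build_ndjson(extracted):
    """Build an NDJSON payload from {filename: extracted_text} pairs."""
    lines = []
    for filename, text in extracted.items():
        document = {
            # Hypothetical primary key: a hex digest of the file name,
            # which keeps the id to plain alphanumeric characters.
            "id": hashlib.sha1(filename.encode("utf-8")).hexdigest(),
            "filename": filename,
            "content": text,
        }
        # json.dumps escapes newlines inside the text, so each document
        # stays on exactly one line of the payload.
        lines.append(json.dumps(document, ensure_ascii=False))
    return "\n".join(lines)
```

The resulting string would then be sent to the index as the request body, and searches run against the `content` field.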
I'm writing a Markdown document and want to make references back to specific pages in a local PDF document. I can achieve this with PDF documents on the web by appending #page=<page number> to the end of the URL. Is there an analogous way to do this with a local PDF file? I've got this Markdown document and the corresponding PDF in a repo on GitHub. I'd love to be able to view the Markdown file there, click on the links to the corresponding PDF, and have the referenced page load. I'd appreciate any suggestions you may have!
Is it possible to convert MHTML (.mht) files to PDF with proper CSS rendering?
I tried using pd4ml, but the external CSS and links referenced in the .mht file fail to load in the generated PDF.
You could try unpacking the MHTML to HTML and separate files, then using your pd4ml method to generate the PDF.
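For the unpacking step, an .mht file is a MIME multipart message, so a generic MIME parser can split it into its parts. Here is a minimal sketch using Python's standard `email` package. The `unpack_mhtml` name and the fallback part names are assumptions, and rewriting the links inside the HTML so that they point at the unpacked files is not shown:

```python
import email
from email import policy


def unpack_mhtml(raw):
    """Split raw MHTML bytes into {content_location: decoded_bytes}."""
    message = email.message_from_bytes(raw, policy=policy.default)
    parts = {}
    for index, part in enumerate(message.walk()):
        if part.is_multipart():
            continue  # container parts carry no payload of their own
        # Content-Location holds the original URL of the resource; fall back
        # to a made-up name when the header is missing.
        location = part.get("Content-Location") or f"part-{index}"
        # decode=True undoes base64 / quoted-printable transfer encoding.
        parts[str(location)] = part.get_payload(decode=True)
    return parts
```

Each entry can then be written to disk, with the HTML part handed to pd4ml once its stylesheet and image references resolve to the unpacked files.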
Chilkasoft Java MHT is one solution you can look into, although after the 30 day trial you will need a license.
I need to find an open source or Linux-based utility that allows me to set an x,y coordinate in a setup file. I would then like to open PDFs sequentially, look in each document for the first name, last name and account number, and save the file with a file name consisting of the last name and account number.
You may want to read some of these answers first:
A Java Library for text extraction from PDF documents preserving empty spaces and lines
How to extract text from a PDF?
How-to extract text from a pdf doc within a specific rectangular region?
The answers above are not Linux specific.
Most PDF documents do not need to be OCR'ed, as the text is contained within the PDF. The hard part is extracting it. The Java version of iText (http://itextpdf.com/) is probably the best toolkit under Linux for extracting the PDF text strings. Another option may be http://pdfbox.apache.org/
If the text you need to extract is actually an image, then you will probably need to convert the whole PDF page to an image format such as TIFF and pass that into an OCR engine such as Google Tesseract OCR.
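Here is a sketch of how the pieces could fit together on Linux. It swaps in the `pdftotext` command-line tool from poppler-utils (not one of the libraries named above) because it can crop to a rectangle with its `-x`, `-y`, `-W` and `-H` options. The setup-file keys, the helper names, and the file-name layout are all assumptions for illustration; the sketch only builds the command and the target name, it does not run anything.

```python
import json
import re


def region_command(pdf_path, region, page=1):
    """Build a pdftotext call that prints one rectangle of one page to stdout."""
    return [
        "pdftotext",
        "-f", str(page), "-l", str(page),                # first and last page
        "-x", str(region["x"]), "-y", str(region["y"]),  # top-left corner
        "-W", str(region["width"]), "-H", str(region["height"]),  # crop size
        pdf_path,
        "-",  # write the text to stdout instead of a file
    ]


def target_name(last_name, account_number):
    """Build a file name from the last name and account number."""
    # Assumed rule: anything that is not a letter, digit or hyphen becomes "_".
    safe_last = re.sub(r"[^A-Za-z0-9-]+", "_", last_name.strip())
    safe_account = re.sub(r"\D", "", account_number)
    return f"{safe_last}_{safe_account}.pdf"


# Hypothetical setup file content: one rectangle, passed to pdftotext as-is.
setup = json.loads('{"x": 50, "y": 100, "width": 300, "height": 40}')
```

The command list could be passed to `subprocess.run(..., capture_output=True, text=True)`, and the captured text parsed for the name and account number before the file is renamed. How to parse that text depends entirely on the layout of your documents.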
I want to programmatically parse a PDF file, look for certain phrases and find out the page number that each phrase is on. Is this possible (I understand that PDF is not like a text file)? If so, are there libraries out there that can help?
Apache Tika, which you can find at the Apache Lucene project, includes PDFBox, which will pull out the text where you can work with it.
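Once the text is available page by page, the lookup itself is straightforward. Here is a minimal sketch in Python, assuming the extractor has already produced one string per page. The `find_phrase_pages` name and the case-insensitive, whitespace-normalised matching are assumptions:

```python
import re


def _normalise(text):
    """Lower-case and collapse whitespace so line breaks do not hide a phrase."""
    return re.sub(r"\s+", " ", text).strip().lower()


def find_phrase_pages(pages, phrases):
    """Map each phrase to the 1-based page numbers whose text contains it."""
    normalised_pages = [_normalise(page) for page in pages]
    result = {}
    for phrase in phrases:
        needle = _normalise(phrase)
        result[phrase] = [
            number
            for number, page in enumerate(normalised_pages, start=1)
            if needle in page
        ]
    return result
```

A phrase that straddles a page boundary would not be found this way, since each page is searched on its own.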