MeiliSearch for PDF and Docx files - indexing

Is it possible to use Meilisearch to search the contents of PDF and DOCX files? If so, what is the process for indexing and searching them?

It's currently not possible to index PDF or DOCX files directly with MeiliSearch. You have to extract the text from each file and push that content into MeiliSearch. The content types currently accepted are JSON, CSV, and NDJSON.
Here you can find a discussion where a user explains his approach: https://github.com/meilisearch/product/discussions/164
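As a rough illustration of the "extract the text, then push it" approach, here is a minimal Python sketch. It only covers DOCX, because a .docx file is a ZIP archive whose body text lives in word/document.xml and can be read with the standard library; PDF text would need a separate extraction tool. The index name "files", the field names (id, filename, content), the local URL, and the API key are all assumptions made for the example, not something MeiliSearch prescribes.

```python
import json
import urllib.request
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements inside word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_docx_text(path_or_file):
    """Return the plain text of a .docx file, one line per non-empty paragraph."""
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> runs.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        if text:
            paragraphs.append(text)
    return "\n".join(paragraphs)


def build_documents(files):
    """Turn {doc_id: (filename, text)} into a list of JSON-ready documents.

    The field names are hypothetical; pick whatever suits your index.
    """
    return [
        {"id": doc_id, "filename": name, "content": text}
        for doc_id, (name, text) in files.items()
    ]


def build_index_request(base_url, index_uid, api_key, documents):
    """Build (but do not send) the POST request that adds documents to an index."""
    return urllib.request.Request(
        url=f"{base_url}/indexes/{index_uid}/documents",
        data=json.dumps(documents).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
```

To actually send the documents, pass the request to urllib.request.urlopen, for example urlopen(build_index_request("http://localhost:7700", "files", "YOUR_KEY", docs)). After that, the content field is searchable like any other text attribute.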

Related

Link to load specific page of local PDF from Markdown document

I'm writing a Markdown document and want to make references back to specific pages in a local PDF document. I can achieve this with PDF documents on the web by appending #page=<page number> to the end of the URL. Is there an analogous way to do this with a local PDF file? I've got this Markdown document and the corresponding PDF in a repo on GitHub. I'd love to be able to examine the Markdown file on there and click on the links to the corresponding PDF and have the referenced page load. Appreciate any suggestions you may have!
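For reference, the #page=<page number> convention described above can be applied to a relative path in the same way as to a web URL. The small Python sketch below only generates such Markdown links; the helper name and the example path are hypothetical, and whether a given viewer (a local PDF reader, or GitHub's file view) honors the fragment for a local file is not established here.

```python
from urllib.parse import quote


def pdf_page_link(label, pdf_path, page):
    """Return a Markdown link to a given page of a PDF at a relative path.

    Only the link text is built; the viewer decides whether "#page=N" is honored.
    """
    if page < 1:
        raise ValueError("page numbers start at 1")
    # Percent-encode spaces and other unsafe characters, keeping "/" intact.
    return f"[{label}]({quote(pdf_path)}#page={page})"


print(pdf_page_link("see p. 12", "docs/My Manual.pdf", 12))
```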

Is there a way to convert docx to image-only pdf with convertapi?

I'm trying to find a way to convert docx to image-only pdf, so I could put a watermark on the pdf document right after conversion.
I've looked through convertapi documentation and I can't find any available options.
First convert the .docx file to a .jpg:
https://www.convertapi.com/docx-to-jpg
then feed the .jpg to
https://www.convertapi.com/jpg-to-pdf
to generate .pdf
You can chain the API calls to get your desired output.

Grails find/read text from pdf file

We are using Grails 2.1.1 and want to search for contact numbers in an uploaded PDF file. We have already done this with DOC files, but now we want to search for and extract contacts from PDF files as well.
Is there any way to search and extract text from PDF files in Grails?
Have you looked at Apache Tika?
It should handle both of these formats and save you the time of handling each type separately.
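Once an extractor such as Tika has returned plain text, finding the contact numbers is a pattern-matching step that is independent of the file format. The sketch below shows that step in Python purely for illustration (the same regular expression idea carries over to Groovy). The pattern and the 8 to 15 digit limits are assumptions about what a contact number looks like and will need tuning for real data.

```python
import re

# Assumed shape: optional "+", then digits mixed with spaces, dots,
# hyphens or parentheses, starting and ending on a digit.
PHONE_PATTERN = re.compile(r"\+?\d[\d ().-]{7,}\d")


def find_contact_numbers(text, min_digits=8, max_digits=15):
    """Return normalized phone-like numbers found in already-extracted text."""
    numbers = []
    for match in PHONE_PATTERN.finditer(text):
        raw = match.group()
        digits = re.sub(r"\D", "", raw)  # drop separators, keep digits only
        if min_digits <= len(digits) <= max_digits:
            prefix = "+" if raw.startswith("+") else ""
            numbers.append(prefix + digits)
    return numbers


sample = "Call +1 (555) 123-4567 or 020 7946 0958. Invoice 42."
print(find_contact_numbers(sample))
```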

Show MigraDoc/PdfSharp Document on screen

I want to use MigraDoc/PdfSharp to create and store PDF documents.
Is there a way to show these documents in an application on-screen? I'd like to show the print in my program rather than starting Acrobat Reader with the document name.
I considered storing the print using XPS instead of PDF, but then I'd need a way to convert XPS to PDF for mailing it to customers. And I don't want to save the same print in two formats, for space reasons.
MigraDoc can save files in its own format "MigraDoc DDL". You can preview MDDDL on the screen, create PDF or RTF from it or print it.
Disadvantage: images are not included in the MDDDL file (OTOH this can be an advantage as images can be shared between several documents).
You can ZIP document plus images for storage.
PDFsharp can create PDF files from XPS (but this is in a beta state and not fully operational).

OCR within an x,y window of a pdf

I need to find an open source or Linux-based utility that allows me to set an x,y coordinate in a setup file. I would then like to sequentially open PDFs, look in the documents for first name, last name and account number, and save each file with a file name consisting of last name and file number.
You may want to read some of these answers first:
A Java Library for text extraction from PDF documents preserving empty spaces and lines
How to extract text from a PDF?
How-to extract text from a pdf doc within a specific rectangular region?
The answers above are not Linux specific.
Most PDF documents do not need to be OCR'ed, as the text is contained within the PDF. The hard part is extracting it. The Java version of iText (http://itextpdf.com/) is probably the best toolkit under Linux to extract the PDF text strings. Another option may be http://pdfbox.apache.org/
If the text you need to extract is actually an image then you will probably need to convert the whole PDF page to image format such as TIFF and pass that into an OCR engine such as Google Tesseract OCR.
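The "x,y window" part of the question can be sketched independently of the extraction toolkit. The Python example below assumes the toolkit (or the OCR engine) can return words together with their bounding boxes as (x0, y0, x1, y1, text) tuples; that tuple layout, the window values, and both helper names are hypothetical and only illustrate the filtering and renaming steps.

```python
import re


def words_in_window(words, window):
    """Join the words whose bounding boxes lie fully inside the window.

    words:  iterable of (x0, y0, x1, y1, text) tuples in page coordinates.
    window: (x0, y0, x1, y1) rectangle, e.g. read from a setup file.
    """
    wx0, wy0, wx1, wy1 = window
    inside = [
        w for w in words
        if w[0] >= wx0 and w[1] >= wy0 and w[2] <= wx1 and w[3] <= wy1
    ]
    # Top-to-bottom, then left-to-right, as a rough reading order.
    inside.sort(key=lambda w: (round(w[1]), w[0]))
    return " ".join(w[4] for w in inside)


def build_filename(last_name, account_number):
    """Build a file name such as 'Doe_AC_1234.pdf' from the extracted fields."""
    def safe(value):
        # Replace anything that is not a letter or digit with an underscore.
        return re.sub(r"[^A-Za-z0-9]+", "_", value.strip()).strip("_")
    return f"{safe(last_name)}_{safe(account_number)}.pdf"


# Made-up word boxes standing in for real extractor output.
page_words = [
    (10, 10, 40, 20, "Jane"),
    (45, 10, 80, 20, "Doe"),
    (300, 10, 360, 20, "AC-1234"),
    (10, 700, 60, 710, "Footer"),
]
name = words_in_window(page_words, (0, 0, 100, 50))
account = words_in_window(page_words, (250, 0, 400, 50))
print(build_filename(name.split()[-1], account))
```

Keeping the window coordinates in a setup file means the same script can be reused for another document layout by changing only the rectangles.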