I have a website where users upload documents in .doc and .pdf format. I am using Sphinx to conduct full text searches on my SQL database (MySQL). What is the best way to index these file formats with Sphinx?
The method I use for this is pdftotext and antiword. I use both of these to dump the contents of the PDFs and Word documents into the database. From there it's easy to crawl with Sphinx.
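A minimal sketch of that approach in Java, assuming a hypothetical documents(filename, content) table, made-up connection details, and that the pdftotext and antiword binaries are installed; Sphinx then indexes the content column:

    import java.nio.charset.StandardCharsets;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    public class ExtractAndStore {

        // Run an external extractor (pdftotext or antiword) and capture its stdout.
        static String extract(String... command) throws Exception {
            Process p = new ProcessBuilder(command).redirectErrorStream(true).start();
            String text = new String(p.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
            p.waitFor();
            return text;
        }

        public static void main(String[] args) throws Exception {
            Path file = Paths.get(args[0]);
            String name = file.getFileName().toString().toLowerCase();

            // "pdftotext file.pdf -" writes the extracted text to stdout;
            // antiword prints the text of a .doc file to stdout by default.
            String text = name.endsWith(".pdf")
                    ? extract("pdftotext", file.toString(), "-")
                    : extract("antiword", file.toString());

            // Hypothetical documents(filename, content) table; needs the MySQL JDBC driver.
            try (Connection conn = DriverManager.getConnection(
                         "jdbc:mysql://localhost/mydb", "user", "password");
                 PreparedStatement ps = conn.prepareStatement(
                         "INSERT INTO documents (filename, content) VALUES (?, ?)")) {
                ps.setString(1, name);
                ps.setString(2, text);
                ps.executeUpdate();
            }
        }
    }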
Unfortunately, Sphinx can't index those file types directly. You'll need to either import the textual contents into a database, or into an XML format that Sphinx can understand.
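The XML route is Sphinx's xmlpipe2 data source: the indexer runs a command you configure and reads a document stream shaped roughly like this (the field names are just illustrative):

    <?xml version="1.0" encoding="utf-8"?>
    <sphinx:docset>
      <sphinx:schema>
        <sphinx:field name="title"/>
        <sphinx:field name="content"/>
      </sphinx:schema>

      <sphinx:document id="1">
        <title>quarterly-report.pdf</title>
        <content>Text extracted from the PDF goes here...</content>
      </sphinx:document>
    </sphinx:docset>

Either way, something still has to extract the text from the .doc and .pdf files first, e.g. the pdftotext/antiword approach above, or Apache Tika.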
Has anyone used Tika to index other types of documents, much like the Solr plugin?
Apache Tika
Some links:
pdftotext is in the poppler / poppler-utils package on Linux
antiword -- handles the older binary .doc format, not the newer .docx
I'm using Apache Lucene 3.0.3 on Windows 7. I can index files successfully with any file extension (.doc, .ppt, .pdf, .txt, .rtf, etc.). But I can only search for words (in any human language, Indian or foreign) in the indexed text documents, not in the indexed Word/PowerPoint/PDF documents. Why is this? Is it possible for Lucene to do this directly?
Do I need to use a newer version of Lucene? I'm aware of Lucene 4.8.1. Do I need that to achieve the task stated above, or is it not possible for Lucene 3 to do the same?
Lucene doesn't interpret content. It indexes the content you give it and makes it searchable. If you hand it binary garbage, it will happily index it and make it searchable; it just won't be searchable in any meaningful, human-readable way. The .doc, .ppt, .pdf, and .rtf formats are not plain text, so they won't index well if you just read them and throw the raw bytes into Lucene.
You need to extract the content from the documents in order to search them meaningfully. I'd recommend using Tika.
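For example, here is a rough sketch of that pipeline, using the Tika facade to pull the text out of each file and a recent Lucene release to index it (the 3.0.3 API from the question differs slightly; index path and field names are placeholders):

    import java.io.File;
    import java.nio.file.Paths;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.*;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.tika.Tika;

    public class TikaLuceneIndexer {
        public static void main(String[] args) throws Exception {
            Tika tika = new Tika();  // auto-detects .doc, .ppt, .pdf, .rtf, ...

            try (IndexWriter writer = new IndexWriter(
                    FSDirectory.open(Paths.get("index")),
                    new IndexWriterConfig(new StandardAnalyzer()))) {

                for (String path : args) {
                    // Tika does the content extraction; Lucene only ever sees plain text.
                    String text = tika.parseToString(new File(path));

                    Document doc = new Document();
                    doc.add(new StringField("path", path, Field.Store.YES));
                    doc.add(new TextField("contents", text, Field.Store.NO));
                    writer.addDocument(doc);
                }
            }
        }
    }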
I currently have a number of documents uploaded to my website on a daily basis (.doc, .docx, .odt, .pdf), and these docs are stored in a SQL database (mediumblob).
Currently I open the docs from the database and cut and paste a text version into a field in the database for a quick reference and search function.
I'm looking to automate this "cut & paste" process - formatting isn't a real concern as long as I can extract the text - and I was hoping that some people might be able to suggest a good route to go down.
I've tried manipulating the content of the blob field using regex but it is not really working.
I've been looking at Apache POI with a view to extracting the text at the point of upload, but I can't help thinking that this may be a bit of overkill given my relatively simple needs.
Given the various document formats I encounter, and the fact that the content is currently stored in a blob field, would Apache POI be the best solution to use in this instance, or can anybody suggest an alternative?
Help and suggestions greatly appreciated.
Chris
Apache POI will only work for the Microsoft Office formats (.xls, .docx, .msg, etc.). For these formats it provides classes for working with the files (read support always, write support for many of them), as well as text extractors.
For a general text extraction framework, you should look at Apache Tika. Tika uses POI internally to handle the Microsoft formats, and uses a number of other libraries to handle different formats. Tika will, for example, handle both PDF and ODF/ODT, which are the other two file formats you mentioned in the question.
There are some quick-start tutorials and examples on the Apache Tika website, which I'd suggest you have a look through. It's very quick to get started with, and you should be able to easily change your code to send the document through Tika during upload to get a plain text version, or even XHTML if that's more helpful to you.
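As a rough sketch, extracting plain text from the uploaded bytes (the same bytes you already write to the mediumblob column) with the Tika facade could look like this; the method name is just illustrative:

    import java.io.ByteArrayInputStream;
    import java.io.InputStream;

    import org.apache.tika.Tika;

    public class UploadTextExtractor {

        // Called at upload time with the raw bytes that go into the mediumblob column.
        public static String toPlainText(byte[] uploadedBytes) throws Exception {
            Tika tika = new Tika();  // detects .doc, .docx, .odt and .pdf automatically
            try (InputStream in = new ByteArrayInputStream(uploadedBytes)) {
                return tika.parseToString(in);
            }
        }
    }

The returned string can go straight into your existing text column alongside the blob; if you want XHTML instead, Tika's AutoDetectParser with a ToXMLContentHandler produces that from the same input.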
Can anyone please suggest a method by which a CHM file can be indexed, in the same way that, for example, PDFBox is used for PDF?
If you also have other document formats that you need to index, you might find a better and more general solution in Apache Tika.
They just added a CHM parser recently (for reference: Support of CHM Format), and it will be in the next version.
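Once you are on a Tika release that includes the CHM parser, extraction looks the same as for any other format it supports; a minimal sketch:

    import java.io.FileInputStream;
    import java.io.InputStream;

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.sax.BodyContentHandler;

    public class ChmTextExtractor {
        public static void main(String[] args) throws Exception {
            BodyContentHandler handler = new BodyContentHandler(-1);  // -1 = no write limit
            Metadata metadata = new Metadata();

            try (InputStream in = new FileInputStream(args[0])) {     // e.g. manual.chm
                new AutoDetectParser().parse(in, handler, metadata);
            }
            System.out.println(handler.toString());  // plain text, ready for indexing
        }
    }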
If you're talking about Microsoft Compiled HTML Help files, you can just extract text from them with JChm and then index it in a normal way.
I have lots of .txt files in a single directory, and I want to give them to Lucene for indexing.
I read all the files in the directory, build a Document for each file, and then use IndexWriter.addDocument(Document) to give these files to Lucene.
Is it possible to build all the documents and give all of them to Lucene at once? I mean, does Lucene support this feature?
This feature was added in Lucene 3.2 (IndexWriter.addDocuments, which accepts a collection of documents).
No, you will have to add each document on its own.
Furthermore, I recommend using a configurable batch size: load that many .txt files, index them, and carry on as long as there are more text files. This way you will not run into memory problems when you have bigger files.
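A sketch of that batching idea with a recent Lucene release; the directory name, batch size, and field names are all placeholders:

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.DirectoryStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.*;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;

    public class TxtDirectoryIndexer {
        public static void main(String[] args) throws IOException {
            final int batchSize = 1000;  // tune to the available heap

            try (IndexWriter writer = new IndexWriter(
                         FSDirectory.open(Paths.get("index")),
                         new IndexWriterConfig(new StandardAnalyzer()));
                 DirectoryStream<Path> files = Files.newDirectoryStream(Paths.get("txt-dir"), "*.txt")) {

                int inBatch = 0;
                for (Path file : files) {
                    Document doc = new Document();
                    doc.add(new StringField("filename", file.getFileName().toString(), Field.Store.YES));
                    doc.add(new TextField("contents",
                            new String(Files.readAllBytes(file), StandardCharsets.UTF_8),
                            Field.Store.NO));
                    writer.addDocument(doc);  // still one document at a time

                    if (++inBatch == batchSize) {
                        writer.commit();      // flush the current batch before reading more files
                        inBatch = 0;
                    }
                }
            }
        }
    }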
Is there a library (or executable) that can OCR a PDF (typically a PDF created by scanning a paper document), and inject the recognized text back into the PDF? Probably as invisible text behind the scanned images.
Preferably open source.
(Goal: I have a huge library of PDF files indexed by Lucene. It would be much easier for Lucene to find what PDFs are relevant if the PDFs contained text.)
One of the best options is probably ABBYY FineReader, as it gives you lots of options including the creation of hidden text (www.abbyy.com). I had a quick look at their site and also came across their PDF Transformer product, which is probably even more suitable for your needs.
http://www.abbyy.com.au/pdftransformer/product_features/
If the PDFs don't contain text, what is Lucene indexing?
Take a look at Docsplit (https://github.com/documentcloud/docsplit); it can use Tesseract to perform OCR. You will get plain text files that reflect the content of the PDFs. You can then build your Lucene index on top of these text files and store a reference to the original PDF in the Lucene index. After querying the Lucene index you will get a list of documents with references to the original PDFs.
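For the query side, a sketch assuming the index was built from the OCR'd text files with a searchable "contents" field and a stored "pdf" field holding the path of the original file (both field names are assumptions):

    import java.nio.file.Paths;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;

    public class FindRelevantPdfs {
        public static void main(String[] args) throws Exception {
            try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("index")))) {
                IndexSearcher searcher = new IndexSearcher(reader);

                // Search the OCR'd text; each hit carries a stored "pdf" field with the original file.
                TopDocs hits = searcher.search(
                        new QueryParser("contents", new StandardAnalyzer()).parse(args[0]), 10);

                for (ScoreDoc hit : hits.scoreDocs) {
                    Document doc = searcher.doc(hit.doc);
                    System.out.println(doc.get("pdf"));  // path of a relevant PDF
                }
            }
        }
    }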