Lucene search query in different file formats - apache

I'm using Apache's Lucene 3.0.3 on Windows 7. I'm able to index files successfully with any file extension (.doc, .ppt, .pdf, .txt, .rtf, etc.). But I'm only able to search for a word (in any human language, Indian or foreign) in the indexed text documents, not in the indexed Word/PowerPoint/PDF documents. Why is this? Is it possible for Lucene to do this directly?
Do I need to use a newer version of Lucene? I'm aware of Lucene 4.8.1. Do I need to use that to achieve the task stated above, or is it not possible for Lucene 3 to do the same?

Lucene doesn't interpret content. It indexes the content you give it and makes it searchable. If you hand it binary garbage, it will happily index it and make it searchable; it just won't be searchable via human language. The .doc, .ppt, .pdf, and .rtf formats are not plain text, and so won't index well if you just read them and chuck them directly into Lucene.
You need to extract the content from the documents in order to search them meaningfully. I'd recommend using Tika.
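A minimal sketch of that extract-then-index pipeline, assuming Tika is on the classpath and using the Lucene 3.x API from the question (the "docs" and "index" directories and the field names are placeholders):

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.apache.tika.Tika;

public class TikaIndexer {
    public static void main(String[] args) throws Exception {
        Tika tika = new Tika(); // auto-detects .doc, .ppt, .pdf, .rtf, ...
        IndexWriter writer = new IndexWriter(
                FSDirectory.open(new File("index")),
                new StandardAnalyzer(Version.LUCENE_30),
                IndexWriter.MaxFieldLength.UNLIMITED);
        for (File f : new File("docs").listFiles()) {
            String text = tika.parseToString(f); // extracted plain text, whatever the format
            Document doc = new Document();
            doc.add(new Field("path", f.getPath(), Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.add(new Field("contents", text, Field.Store.NO, Field.Index.ANALYZED));
            writer.addDocument(doc);
        }
        writer.close();
    }
}

Searching the "contents" field then works the same for all of the formats, because what got indexed is the extracted text, not the raw bytes.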

Related

How do I use the MoreLikeThis function in Solr to find documents similar to a text file?

I am trying to use solr to do the following:
Read some text from a txt file, and use MoreLikeThis on the text to find similar documents to that text. How can I do this with Solr?
From what I know so far I think I have to use a content stream, but I do not know how to configure it...
If you were forming a MoreLikeThisQuery from a document stored in the index, it would form the query by retrieving the TermVector info from the index.
Since you want to find documents similar to a text file you have, you've got to iterate over the text file and form a BooleanQuery from the terms in the file, matched the way you want.
The above is true for Lucene, and I believe it's the same for Solr as well, considering that the MoreLikeThisQuery works based on TermVector info.
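For the Lucene side, a rough sketch of building that query from a text file (assuming Lucene 3.x, a made-up "contents" field, and the same analyzer that was used at index time; on 3.0 you would use TermAttribute instead of CharTermAttribute):

import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

// Tokenize the text file with the index-time analyzer and OR the
// resulting terms together into one BooleanQuery.
static BooleanQuery buildQueryFromFile(File textFile, Analyzer analyzer) throws IOException {
    BooleanQuery query = new BooleanQuery();
    TokenStream ts = analyzer.tokenStream("contents", new FileReader(textFile));
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
        query.add(new TermQuery(new Term("contents", term.toString())),
                  BooleanClause.Occur.SHOULD);
    }
    ts.close();
    return query;
}

Keep in mind that BooleanQuery defaults to a maximum of 1024 clauses, so a large input file needs either that limit raised or the term list trimmed to the most interesting terms, which is roughly what MoreLikeThis does for you (the contrib MoreLikeThis class also has a like(Reader) method that builds such a query directly).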

How to give a batch of documents to Lucene?

I have lots of ".txt" files in a single directory and I want to give them to Lucene for indexing.
I read all the files in the directory, build a Document for each file, and then use IndexWriter.addDocument(Document) to give these files to Lucene.
Is it possible to build all the documents and hand them to Lucene in one go? I mean, does Lucene support this feature?
This feature was added in Lucene 3.2.
No, you will have to add each document on its own.
Furthermore, I recommend using a configurable batch size: load that many .txt files, index them, and carry on as long as there are more text files. This way you will not run into memory problems when you have bigger files.
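A sketch of that batching loop, assuming you pass in an already-opened IndexWriter (the field names and batch size are placeholders):

import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

// Index the .txt files one batch at a time, committing after each batch
// so memory use stays bounded even with many or large files.
static void indexDirectory(IndexWriter writer, File dir, int batchSize) throws IOException {
    int count = 0;
    for (File f : dir.listFiles()) {
        Document doc = new Document();
        doc.add(new Field("path", f.getPath(), Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("contents", new FileReader(f))); // Reader-backed field, streamed rather than loaded whole
        writer.addDocument(doc);
        if (++count % batchSize == 0) {
            writer.commit(); // flush this batch before carrying on
        }
    }
    writer.commit();
}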

OCR library that can insert OCR'd text back into the source PDF

Is there a library (or executable) that can OCR a PDF (typically a PDF created by scanning a paper), and inject the recognized text back into the PDF? Probably as invisible text behind the scanned images.
Preferably open source.
(Goal: I have a huge library of PDF files indexed by Lucene. It would be much easier for Lucene to find what PDFs are relevant if the PDFs contained text.)
One of the best options is probably Abbyy FineReader (www.abbyy.com), as it will give you lots of options, including the creation of hidden text. I had a quick look at their site and also came across their Transformer product, which is probably even more suitable for your needs.
http://www.abbyy.com.au/pdftransformer/product_features/
If the PDFs don't contain text, what is being indexed by Lucene?
Take a look at Docsplit (https://github.com/documentcloud/docsplit); it can use Tesseract to perform OCR. You will get plain text files that reflect the content of the PDFs. You can then build your Lucene index on top of these text files and store a reference to the original PDF in the Lucene index. After querying the Lucene index you will get a list of documents with references to the original PDFs.
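A rough sketch of that last step, assuming one OCR'd .txt file per PDF, an already-opened IndexWriter, and made-up field names:

import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

// One Lucene document per OCR'd text file: index the text, store only the
// path back to the original PDF so search results can link to it.
static void indexOcrOutput(IndexWriter writer, File ocrDir) throws IOException {
    for (File txt : ocrDir.listFiles()) {
        String pdfPath = txt.getName().replace(".txt", ".pdf"); // assumes matching file names
        Document doc = new Document();
        doc.add(new Field("pdfPath", pdfPath, Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("contents", new FileReader(txt)));
        writer.addDocument(doc);
    }
    writer.commit();
}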

Search MS Word binary file for specific content

I have some .doc binary files stored in my database and I would now like to search them all (without converting them to .doc) to see which one contains the word "hello", for instance.
Is there any way to do this search in the binary file?
You could go down the route of using commercial tools. Aspose.Words can load a document from a stream and has all sorts of methods for finding text within the document.
If you have the stream from the DB, then your code would look like this:
Aspose.Words.Document doc = new Aspose.Words.Document(streamObjectFromDatabase);
if (doc.GetText().ToLower().Contains("hello world"))
    MessageBox.Show("Hello World exists");
Note: The benefit of this tool is that it does not require Word objects to be installed and it can work with streams in memory.
Not without a lot of pain, as far as I can tell. According to Wikipedia, Microsoft has within the past few years finally released the .doc specification. So you could create a parser based on the spec if you have the time, assuming all of your documents are in the same version of the .doc format.
Of course you could just search for the text you're looking for amid all the binary data, on the assumption that the actual text is stored as plain text. But even if that assumption were true, how could you be sure that the plain text you found was the actual document text, and not some of the document meta data that's also stored in plain text? And there's always the off chance that the binary data will match your text pattern.
If the Word libraries are available to you, I would go that route. If not, a homegrown parser may be your least bad option.
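If you do try the raw search over the binary data described above, a sketch (in Java here) might look like the following. It checks both an 8-bit and a UTF-16LE rendering of the word, since the .doc binary format can store text either way; the encodings are assumptions, the match is case-sensitive as written, and it shares all the false-positive caveats already mentioned:

import java.nio.charset.Charset;

// Naive scan of the raw .doc bytes for a word. Word's binary format can hold
// text as 8-bit (Windows-1252) or UTF-16LE, so try both; hits may also come
// from metadata or coincidental byte patterns, as noted above.
static boolean mightContain(byte[] docBytes, String word) {
    return indexOf(docBytes, word.getBytes(Charset.forName("windows-1252"))) >= 0
        || indexOf(docBytes, word.getBytes(Charset.forName("UTF-16LE"))) >= 0;
}

// Plain byte-pattern search: index of needle in haystack, or -1.
static int indexOf(byte[] haystack, byte[] needle) {
    outer:
    for (int i = 0; i <= haystack.length - needle.length; i++) {
        for (int j = 0; j < needle.length; j++) {
            if (haystack[i + j] != needle[j]) continue outer;
        }
        return i;
    }
    return -1;
}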

Indexing Word Documents and PDFs with Sphinx

I have a website where users upload documents in .doc and .pdf format. I am using Sphinx to conduct full text searches on my SQL database (MySQL). What is the best way to index these file formats with Sphinx?
The method I use for this is pdf2text and antiword. I use both of these to dump the contents of the pdfs and word documents into the database. From there it's easy to crawl with Sphinx.
Unfortunately, Sphinx can't index those file types directly. You'll need to either import the textual contents into a database, or into an XML format that Sphinx can understand.
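A hedged sketch of that dump step from Java, shelling out to the command-line tools and capturing their output so it can be inserted into the MySQL column that Sphinx indexes (tool invocations are assumptions; poppler's binary is named pdftotext):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

// Run an external extractor and capture its stdout.
//   PDFs:  runExtractor("pdftotext", "report.pdf", "-")   ("-" writes the text to stdout)
//   .doc:  runExtractor("antiword", "letter.doc")
static String runExtractor(String... command) throws IOException, InterruptedException {
    ProcessBuilder pb = new ProcessBuilder(command);
    pb.redirectErrorStream(true);
    Process p = pb.start();
    StringBuilder out = new StringBuilder();
    BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()));
    for (String line; (line = r.readLine()) != null; ) {
        out.append(line).append('\n');
    }
    p.waitFor();
    return out.toString();
}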
Has anyone used Tika to index other types of documents, much like the SOLR plugin?
Apache Tika
Some links:
PDF2TEXT is in poppler or poppler-utils on Linux
ANTIWORD -- seems to be for old .doc, not newer .docx