Opening a Lucene index stored in HDFS - Lucene

How do I read a Lucene index directory stored on HDFS, i.e. how do I get an IndexReader for an index stored on HDFS? The IndexReader is to be opened in a map task.
Something like: IndexReader reader = IndexReader.open("hdfs/path/to/index/directory");
Thanks,
Akhil

If you want to open a Lucene index that's stored in HDFS for the purpose of searching, you're out of luck. AFAIK, there is no implementation of Directory for HDFS that supports search operations. One reason for this is that HDFS is optimized for sequential reads of large blocks, not the small, random reads that Lucene performs.
In the Nutch project there is an implementation, HDFSDirectory, which you can use to create an IndexReader, but only delete operations work; Nutch uses HDFSDirectory solely for document deduplication.
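A common workaround, if the index fits on the task's local disk, is to copy it out of HDFS and open it locally. A minimal sketch, assuming Lucene 3.x and the standard Hadoop FileSystem API (the class and method names here are illustrative; in a map task you would do the copy once, in configure()/setup()):

    import java.io.File;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.store.FSDirectory;

    public class HdfsIndexCopier {
        public static IndexReader openFromHdfs(String hdfsPath, String localPath) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // Pull the whole index directory out of HDFS onto local disk;
            // Lucene can then do its small random reads against the local copy.
            fs.copyToLocalFile(new Path(hdfsPath), new Path(localPath));
            return IndexReader.open(FSDirectory.open(new File(localPath)));
        }
    }

The obvious caveats apply: the copy costs startup time and local disk space, and the local copy goes stale if the HDFS index is rebuilt.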

I think the Katta project might be what you are looking for. I haven't used it myself, but I have been researching this kind of solution recently and it seems to fit the bill.
It's a distributed version of Lucene using sharded indexes.
http://katta.sourceforge.net/

Related

Solr indexing Parquet file

I have a Solr instance up and running, and it needs to index Parquet files. Right now I am converting the Parquet files to flat text files and having Solr index those. I'd like to know whether it is possible for Solr to consume the Parquet files directly.
Thanks
Directly: no, not possible.
If you want something more integrated than what you are doing now (converting to text and indexing may already be good enough), there are two ways:
Write specialized code around DIH; you can probably write a custom DataSource so that DIH does the indexing.
Write some Java code using SolrJ that reads your file and indexes it to Solr, as in the sketch below.
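For the SolrJ route, here is a rough sketch. It assumes a flat Parquet schema, a core named mycore, and the parquet-avro reader (AvroParquetReader); treat the field mapping as illustrative, since your schema dictates it:

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetReader;
    import org.apache.parquet.hadoop.ParquetReader;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class ParquetToSolr {
        public static void main(String[] args) throws Exception {
            HttpSolrClient solr =
                new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();
            try (ParquetReader<GenericRecord> reader =
                     AvroParquetReader.<GenericRecord>builder(new Path(args[0])).build()) {
                GenericRecord record;
                while ((record = reader.read()) != null) {
                    SolrInputDocument doc = new SolrInputDocument();
                    // Map every Parquet/Avro field onto a Solr field of the same name.
                    for (Schema.Field field : record.getSchema().getFields()) {
                        // Avro returns strings as Utf8 objects, so stringify the values.
                        doc.addField(field.name(), String.valueOf(record.get(field.name())));
                    }
                    solr.add(doc);
                }
            }
            solr.commit();
            solr.close();
        }
    }

For anything beyond a one-off load you would batch the add() calls and handle nested Parquet types explicitly.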

Creating Lucene Index in a Database - Apache Lucene

I am using the Grails Searchable plugin. It creates index files at a given location. Is there any way to have the Searchable plugin create the Lucene index in a database?
Generally, no.
You can probably attempt to implement your own format but this would require a lot of effort.
I am no expert in Lucene, but I know that it is optimized for fast search over the filesystem. It would be theoretically possible to build a Lucene index over a database, but the main feature of Lucene (being a VERY fast search engine) would be lost.
As a point of interest, Compass supported storage of a Lucene index in a database, using a JdbcDirectory. This was, as far as I can figure, just a bad idea.
Compass, by the way, is now defunct, having been replaced by ElasticSearch.

How does Lucene use skip lists in the inverted index?

From some blogs and the Lucene website, I know that Lucene uses the skip list data structure in its inverted index. But I am puzzled about two things:
1: In general a skip list is kept in memory, but the inverted index is stored on disk. So how does Lucene use it when searching the index? Does it scan it on disk or load it into memory?
2: A skip list's insert operation usually uses random(0,1) to decide whether to promote an entry to the next level, but in the Lucene documentation the skip entries appear at a fixed interval for every term. So does Lucene create its "skip list" differently, or not?
Please correct me if I am wrong.
Lucene uses memory in a couple of different ways, even though the index is persisted on disk: when the IndexReader is created for searching, and for operations like sorting (the field cache):
http://blog.mikemccandless.com/2010/07/lucenes-ram-usage-for-searching.html
Basically, some of those binary files get copied into RAM for much faster scanning and reduced I/O. The link above hints at how searching with certain parameters can force Lucene to skip terms while searching, which is where that data structure comes into play.
Lucene is open source, so you can read the code yourself to see what is being used, in Java or in Lucene.NET for the C# implementation.
See also the Lucene file-format documentation: "To accelerate posting list skips, Lucene uses skip lists."
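On question 2: because a posting list is immutable once it has been written, Lucene can lay down skip entries at fixed intervals at index time; no random promotion is needed. A toy, single-level sketch of the idea (Lucene's real skip data is multi-level and is stored on disk next to the postings):

    /** Single-level, fixed-interval skip structure over a sorted posting list. */
    public class FixedIntervalSkipList {
        private final int[] postings;     // sorted doc IDs
        private final int[] skipDocs;     // doc ID at every interval-th posting
        private final int[] skipOffsets;  // position of that posting

        public FixedIntervalSkipList(int[] postings, int interval) {
            this.postings = postings;
            int n = postings.length / interval;
            skipDocs = new int[n];
            skipOffsets = new int[n];
            // Index time: record a skip entry at every interval-th posting.
            // No coin flip needed because the list is written once, never mutated.
            for (int i = 0; i < n; i++) {
                skipOffsets[i] = (i + 1) * interval - 1;
                skipDocs[i] = postings[skipOffsets[i]];
            }
        }

        /** First doc ID >= target, like Lucene's PostingsEnum.advance. */
        public int advance(int target) {
            // Use the skip entries to jump close to the target ...
            int start = 0;
            for (int i = 0; i < skipDocs.length && skipDocs[i] < target; i++) {
                start = skipOffsets[i] + 1;
            }
            // ... then scan only the short remaining run linearly.
            for (int i = start; i < postings.length; i++) {
                if (postings[i] >= target) return postings[i];
            }
            return Integer.MAX_VALUE; // exhausted, like NO_MORE_DOCS
        }

        public static void main(String[] args) {
            int[] docs = {2, 5, 9, 14, 21, 30, 44, 58, 71, 90, 105, 120};
            FixedIntervalSkipList list = new FixedIntervalSkipList(docs, 4);
            System.out.println(list.advance(60));   // 71
            System.out.println(list.advance(121));  // 2147483647, i.e. exhausted
        }
    }

On the sample data, advance(60) uses two skip entries to jump past the first eight postings before scanning, and returns 71; the textbook coin flip never appears because the structure is built once, at write time.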

Hive with Lucene

Is it possible to use Hive to query a Lucene index that is distributed over Hadoop?
Hadapt is a startup whose software bridges Hadoop with a SQL front-end (like Hive) and hybrid storage engines. They offer an archival text-search capability that may meet your needs.
Disclaimer: I work for Hadapt.
As far as I know, you can essentially write custom "row-extraction" code in Hive, so I would guess that you could. I've never used Lucene and have barely used Hive, so I can't be sure. If you find a more conclusive answer to your question, please post it!
I know this is a fairly old post, but I thought I could offer a better alternative.
In your case, instead of going through the hassle of mapping your HDFS Lucene index to a Hive schema, it is better to push the data through Pig, because Pig can read flat files. Unless you want a relational way of storing your data, you could process it with Pig and use HBase as your DB.
You could write a custom InputFormat for Hive to access a Lucene index in Hadoop; a sketch follows.
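Here is a rough sketch of what such an InputFormat could look like, using the old org.apache.hadoop.mapred API that Hive expects and the Lucene 3.x reader API. LuceneHiveInputFormat is a hypothetical name; a real implementation would also need a matching SerDe, and would override getSplits() to emit exactly one split per index directory (assumed done here):

    import java.io.File;
    import java.io.IOException;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.store.FSDirectory;

    public class LuceneHiveInputFormat extends FileInputFormat<LongWritable, Text> {

        @Override
        protected boolean isSplitable(FileSystem fs, Path file) {
            return false; // a Lucene index directory has to be read as one unit
        }

        @Override
        public RecordReader<LongWritable, Text> getRecordReader(
                InputSplit split, JobConf job, Reporter reporter) throws IOException {
            // Assumes getSplits() was overridden to emit one split per index directory.
            Path indexDir = ((FileSplit) split).getPath();
            // Copy the index out of HDFS first; Lucene needs local random access.
            File local = new File(System.getProperty("java.io.tmpdir"),
                                  "lucene-" + System.nanoTime());
            FileSystem.get(job).copyToLocalFile(indexDir, new Path(local.getPath()));
            final IndexReader reader = IndexReader.open(FSDirectory.open(local));

            return new RecordReader<LongWritable, Text>() {
                private int doc = 0;

                public boolean next(LongWritable key, Text value) throws IOException {
                    while (doc < reader.maxDoc() && reader.isDeleted(doc)) doc++;
                    if (doc >= reader.maxDoc()) return false;
                    key.set(doc);
                    // Emit the stored fields of the document as one text row.
                    value.set(reader.document(doc).toString());
                    doc++;
                    return true;
                }

                public LongWritable createKey() { return new LongWritable(); }
                public Text createValue() { return new Text(); }
                public long getPos() { return doc; }
                public float getProgress() {
                    return reader.maxDoc() == 0 ? 1f : (float) doc / reader.maxDoc();
                }
                public void close() throws IOException { reader.close(); }
            };
        }
    }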

Working with Katta (Lucene, Hadoop)

Can anyone provide me with some sample Java code showing how to store a Lucene index in HDFS (the Hadoop File System) using Katta?
Katta is a wrapper for Lucene indexes: a folder containing several Lucene indexes forms a Katta index. I worked on this a long time back and don't have the code handy.
You have to install Katta on each node along with Hadoop. It wouldn't be that difficult if you try. One of my colleagues has written an article on Lucene indexing and searching using Hadoop; it may help you.
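To get started, note that Katta itself does not build indexes: you build ordinary Lucene indexes (one per shard), put the containing folder into HDFS, and then register it with Katta (its CLI has a command for adding an index; check the Katta documentation for the exact invocation). A minimal sketch of the build-and-upload step, assuming Lucene 3.x and the Hadoop FileSystem API; all paths and field names here are made up:

    import java.io.File;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class BuildAndUploadShard {
        public static void main(String[] args) throws Exception {
            File local = new File("/tmp/shard1");

            // 1. Build an ordinary Lucene index on local disk.
            IndexWriter writer = new IndexWriter(
                    FSDirectory.open(local),
                    new StandardAnalyzer(Version.LUCENE_36),
                    IndexWriter.MaxFieldLength.UNLIMITED);
            Document doc = new Document();
            doc.add(new Field("content", "hello katta",
                              Field.Store.YES, Field.Index.ANALYZED));
            writer.addDocument(doc);
            writer.close();

            // 2. Copy the finished shard into HDFS; a Katta index is just a
            //    folder containing such Lucene indexes, one per shard.
            FileSystem fs = FileSystem.get(new Configuration());
            fs.copyFromLocalFile(new Path(local.getPath()), new Path("/katta/myindex/shard1"));
        }
    }

From there, pointing Katta's add-index command at /katta/myindex should deploy the shard to the nodes; again, verify the exact command syntax against the Katta docs.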