Can anyone provide some sample Java code showing how to store a Lucene index in HDFS (Hadoop Distributed File System) using Katta?
Katta is a wrapper for Lucene indexes. A folder containing several Lucene indexes forms a Katta index. I worked on this a long time ago and don't have the code handy.
You have to install Katta on each node along with Hadoop. It shouldn't be that difficult if you try. One of my colleagues has written an article on Lucene indexing and searching using Hadoop. It may help you.
I want to upgrade my Alfresco server to 5.2, and all my custom webscripts use Lucene queries. Since Lucene indexing has been removed as of Alfresco 5.x and Solr indexing is not instantaneous, I am planning to use fts_alfresco search. While testing, I found that a few Lucene queries can be used for fts_alfresco search without modification. So my question is: will I be able to run fts_alfresco searches using Lucene queries? If not, is there a better way to migrate all my Lucene queries to fts_alfresco?
Thanks in advance.
You will need to test/check your queries since there are small differences (for instance, date range query is not the same), but in general there's no reason why you would not be able to use FTS.
I'm not sure comprehensive documentation exists that lists all those small differences, though. If you find it, please share.
"Alfresco FTS is compatible with most, if not all of the examples here.."
https://community.alfresco.com/docs/DOC-4673-search
I am using the Grails Searchable plugin. It creates index files at a given location. Is there any way to have the Searchable plugin create the Lucene index in a database?
Generally, no.
You can probably attempt to implement your own format but this would require a lot of effort.
I am no expert in Lucene, but I know that it is optimized for fast search over the filesystem. So it would be theoretically possible to build a Lucene index on top of a database, but the main feature of Lucene (being a VERY fast search engine) would be lost.
As a point of interest, Compass supported storage of a Lucene index in a database, using a JdbcDirectory. This was, as far as I can figure, just a bad idea.
Compass, by the way, is now defunct, having been replaced by ElasticSearch.
Is it possible to use Hive to query a Lucene index that is distributed over Hadoop?
Hadapt is a startup whose software bridges Hadoop with a SQL front-end (like Hive) and hybrid storage engines. They offer an archival text search capability that may meet your needs.
Disclaimer: I work for Hadapt.
As far as I know you can essentially write custom "row-extraction" code in Hive so I would guess that you could. I've never used Lucene and barely used Hive, so I can't be sure. If you find a more conclusive answer to your question, please post it!
I know this is a fairly old post, but thought I could offer a better alternative.
In your case, instead of going through the hassle of mapping your HDFS Lucene index to a Hive schema, it's better to push the data into Pig, because Pig can read flat files. Unless you want a relational way of storing your data, you could process it through Pig and use HBase as your database.
You could write a custom input format for Hive to access a Lucene index in Hadoop.
How do I read a Lucene index directory stored on HDFS, i.e. how do I get an IndexReader for an index stored on HDFS? The IndexReader is to be opened in a map task.
Something like: IndexReader reader = IndexReader.open("hdfs/path/to/index/directory");
Thanks,
Akhil
If you want to open a Lucene index that's stored in HDFS for the purpose of searching, you're out of luck. AFAIK, there is no implementation of Directory for HDFS that allows for search operations. One reason is that HDFS is optimized for sequential reads of large blocks, not the small, random reads that Lucene performs.
In the Nutch project, there is an implementation of HDFSDirectory which you can use to create an IndexReader, but only delete operations work. Nutch only uses HDFSDirectory to perform document deduplication.
I think the Katta project might be what you are looking for. I haven't used it myself, but I've been researching these kinds of solutions recently and it seems to fit the bill.
It's a distributed version of Lucene using sharded indexes.
http://katta.sourceforge.net/
I'm looking into using Lucene and/or Solr to provide search in an RDBMS-powered web application. Unfortunately for me, all the documentation I've skimmed deals with how to get the data out of the index; I'm more concerned with how to build a useful index. Are there any "best practices" for doing this?
Will multiple applications be writing to the database? If so, it's a bit tricky; you have to have some mechanism to identify new records to feed to the Lucene indexer.
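One possible mechanism for identifying new records, sketched here as an assumption rather than something this answer prescribes, is a last-modified column combined with a stored high-water mark. The sketch below uses in-memory rows in place of a real table; the names `ChangeTracker`, `Row`, and `modifiedAt` are hypothetical.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: pick the rows changed since the last indexing run. */
final class ChangeTracker {

    /** Stand-in for a database row; assumes the table has a last-modified column. */
    static final class Row {
        final long id;
        final Instant modifiedAt;

        Row(long id, Instant modifiedAt) {
            this.id = id;
            this.modifiedAt = modifiedAt;
        }
    }

    // High-water mark: the latest modification time already fed to the indexer.
    private Instant highWaterMark = Instant.EPOCH;

    /**
     * Returns the rows modified strictly after the high-water mark, then
     * advances the mark to the newest timestamp seen in this pass.
     */
    List<Row> takeChanged(List<Row> rows) {
        List<Row> changed = new ArrayList<>();
        Instant newest = highWaterMark;
        for (Row row : rows) {
            if (row.modifiedAt.isAfter(highWaterMark)) {
                changed.add(row);
                if (row.modifiedAt.isAfter(newest)) {
                    newest = row.modifiedAt;
                }
            }
        }
        highWaterMark = newest;
        return changed;
    }
}
```

Against a real database the loop would become a query that filters on the last-modified column. Note that a strict "after" comparison can miss a row committed later with the same timestamp as the mark, so a real implementation would need to account for that.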
Another point to consider: do you want one index that covers all of your tables, or one index per table? In general, I recommend one index, with a field in that index to indicate which table the record came from.
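To illustrate the single-index idea, here is a toy sketch in plain Java. It is not Lucene and does not use the Lucene API; it only shows how a discriminator field (called `table` here, a hypothetical name) lets one index serve both cross-table and per-table searches.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Toy stand-in for a search index; not Lucene, just the single-index idea. */
final class SingleIndexSketch {

    /** One indexed record; "table" is the discriminator field. */
    static final class Doc {
        final String table;
        final long id;
        final String text;

        Doc(String table, long id, String text) {
            this.table = table;
            this.id = id;
            this.text = text;
        }
    }

    // Inverted index: lower-cased term -> documents containing that term.
    private final Map<String, List<Doc>> postings = new HashMap<>();

    /** Splits the text on non-word characters and records each term once per document. */
    void add(Doc doc) {
        for (String term : doc.text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (term.isEmpty()) {
                continue;
            }
            List<Doc> docs = postings.computeIfAbsent(term, k -> new ArrayList<>());
            if (!docs.contains(doc)) {
                docs.add(doc);
            }
        }
    }

    /** Searches all tables when table is null, otherwise only the named table. */
    List<Doc> search(String term, String table) {
        List<Doc> hits = new ArrayList<>();
        List<Doc> candidates =
                postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptyList());
        for (Doc doc : candidates) {
            if (table == null || table.equals(doc.table)) {
                hits.add(doc);
            }
        }
        return hits;
    }
}
```

In a real Lucene index the same effect would presumably come from storing the table name as an untokenized field and adding it as a required clause when a search should be limited to one table.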
Hibernate has support for full text search, if you want to search persistent objects rather than unstructured documents.
There's an OpenSymphony project called Compass of which you should be aware. I have stayed away from it myself, primarily because it seems to be way more complicated than search needs to be. Also, as far as I can tell from the documentation (I confess I haven't found the time necessary to read it all), it stores Lucene segments as blobs in the database. If you're familiar with the Lucene architecture, Compass implements a Lucene Directory on top of the database. I think this is the wrong approach. I would leverage the database's built-in support for indexing and implement a Lucene IndexReader instead. The same criticism applies to distributed cache implementations, etc.
I haven't explored this at all, but take a look at LuSql.
Using Solr would be straightforward as well but there'll be some DRY-violations with the Solr schema.xml and your actual database schema. (FYI, Solr does support wildcards, though.)
We are rolling out our first application that uses Solr tonight. With Solr 1.3, they've included the DataImportHandler, which allows you to specify your database tables (they call them entities) along with their relationships. Once defined, a simple HTTP request will trigger an import of your data.
Take a look at the Solr wiki page for DataImportHandler for details.
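As a rough sketch of that HTTP request: the snippet below builds and sends a full-import command. It assumes the handler is registered at `/dataimport` in solrconfig.xml and that the base URL (for example `http://localhost:8983/solr`) matches your deployment; both are assumptions, and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

/** Sketch: trigger a DataImportHandler full import over HTTP. */
final class DataImportTrigger {

    /**
     * Builds the import URL. Assumes the handler path is "/dataimport",
     * which depends on how it was registered in solrconfig.xml.
     */
    static URI fullImportUri(String solrBaseUrl) {
        // Drop a trailing slash so the path is not doubled up.
        String base = solrBaseUrl.endsWith("/")
                ? solrBaseUrl.substring(0, solrBaseUrl.length() - 1)
                : solrBaseUrl;
        return URI.create(base + "/dataimport?command=full-import");
    }

    /** Sends the request and returns the HTTP status code. */
    static int triggerFullImport(String solrBaseUrl) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) fullImportUri(solrBaseUrl).toURL().openConnection();
        conn.setRequestMethod("GET");
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }
}
```

Calling `DataImportTrigger.triggerFullImport("http://localhost:8983/solr")` would issue the request against a local Solr instance, if one is running at that address.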
As an introduction:
Brian McCallister wrote a nice blog post: Using Lucene with OJB.