I have a system that uses Lucene. For a few reasons, I would now like to add a distributed search feature on top of it.
The question is: can I search the existing Lucene index (created by Lucene's IndexWriter) with Elasticsearch, or do I have to create a new index with Elasticsearch?
P.S. I found on the web that this is possible with Solr, but as far as I can tell there is nothing concrete for Elasticsearch. Any help would be appreciated.
You need to reindex into Elasticsearch; you can't reuse an existing Lucene index.
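In practice, reindexing means reading the documents back (ideally from the original data source, since fields that were indexed but not stored in Lucene cannot be recovered verbatim) and sending them to Elasticsearch. Below is a minimal sketch of building a request body for the `_bulk` endpoint. The index name `myindex`, the `id` field, and the sample documents are made up for illustration, and whether a `_type` entry is expected in the action line depends on your Elasticsearch version.

```python
import json

def build_bulk_body(docs, index, doc_type=None, id_field="id"):
    """Build a newline-delimited JSON body for Elasticsearch's _bulk endpoint.

    docs: iterable of dicts, e.g. field values read back from the old system.
    doc_type: optional; older Elasticsearch versions expect a _type here.
    """
    lines = []
    for doc in docs:
        meta = {"_index": index, "_id": str(doc[id_field])}
        if doc_type is not None:
            meta["_type"] = doc_type
        # Each document needs an action line followed by its source line.
        lines.append(json.dumps({"index": meta}, sort_keys=True))
        lines.append(json.dumps(doc, sort_keys=True))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"

# Hypothetical documents, standing in for whatever you read from your data.
docs = [
    {"id": 1, "title": "first"},
    {"id": 2, "title": "second"},
]
body = build_bulk_body(docs, index="myindex")
print(body)
```

The resulting string would be sent as the body of a POST to `/_bulk`; the sketch stops short of sending it so that it runs without a cluster.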
I want to upgrade my Alfresco server to 5.2, and all my custom webscripts use Lucene queries. Since Lucene indexing has been removed as of Alfresco 5.x and Solr indexing is not instantaneous, I am planning to use fts_alfresco search. While testing, I found that a few Lucene queries can be used for fts_alfresco search without modification. So my question is: will I be able to run fts_alfresco searches using Lucene queries? If not, is there a better way to migrate all my Lucene queries to fts_alfresco?
Thanks in advance.
You will need to test/check your queries since there are small differences (for instance, date range query is not the same), but in general there's no reason why you would not be able to use FTS.
I'm not sure comprehensive documentation exists that lists all those small differences, though. If you find it, please share.
"Alfresco FTS is compatible with most, if not all of the examples here.."
https://community.alfresco.com/docs/DOC-4673-search
I've been working with Nutch 1.10 to run some small web crawls and index the crawl data with Elasticsearch 1.4.1. It seems that the only way to optimize the index mapping is to crawl first, review the mapping that ES generated on its own, and then change it accordingly (if necessary) with the mapping API.
Does anyone know of a more effective solution to optimize the mappings within an ES index for web crawling?
UPDATE:
Is it even possible to update an ES mapping from a Nutch web crawl?
There are two things to consider here:
What data is indexed?
How do you index it correctly into ES?
Regarding the indexed data, the index plugins you use affect this. For example, index-basic will add content, host, url, etc. for every doc. You can either check the plugins' documentation or simply look at the output (like you did).
Once you know the indexed data and how you want to handle it in the ES cluster, you can create a new index in ES with the correct/optimized mappings and make sure Nutch indexes into that index.
Of course, you can also re-index what you already crawled (see this ES article).
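To make the "create the index with explicit mappings first" step concrete, here is a rough sketch that builds the create-index request. The field names `content`, `host`, and `url` come from the answer above; `title`, the type name `doc`, the index name `nutch`, and the host are assumptions for illustration, so check what your Nutch configuration actually writes. The `string` / `not_analyzed` syntax matches Elasticsearch 1.x; later versions use different field types.

```python
import json
import urllib.request

def build_index_settings(doc_type="doc"):
    """Explicit mapping in Elasticsearch 1.x syntax (field list is an assumption)."""
    properties = {
        # Exact-match fields: keep the whole value as a single term.
        "url": {"type": "string", "index": "not_analyzed"},
        "host": {"type": "string", "index": "not_analyzed"},
        # Full-text fields: analyzed with the default analyzer.
        "title": {"type": "string"},
        "content": {"type": "string"},
    }
    return {"mappings": {doc_type: {"properties": properties}}}

def build_create_index_request(base_url, index_name, settings):
    """Prepare (but do not send) a PUT request that creates the index."""
    data = json.dumps(settings).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index_name}",
        data=data,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )

settings = build_index_settings()
request = build_create_index_request("http://localhost:9200", "nutch", settings)
# urllib.request.urlopen(request) would send it; omitted so this runs offline.
print(request.get_method(), request.full_url)
```

Creating the index before the first crawl means Elasticsearch applies these mappings instead of guessing them from the first documents it sees.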
I am using the Grails Searchable plugin. It creates index files at a given location. Is there any way to have the Searchable plugin create the Lucene index in a database?
Generally, no.
You can probably attempt to implement your own format but this would require a lot of effort.
I am no expert in Lucene, but I know that it is optimized to offer fast search over the filesystem. So it would be theoretically possible to build a Lucene index over the database, but the main feature of Lucene (being a VERY fast search engine) would be lost.
As a point of interest, Compass supported storage of a Lucene index in a database, using a JdbcDirectory. This was, as far as I can figure, just a bad idea.
Compass, by the way, is now defunct, having been replaced by Elasticsearch.
I have set up a new instance of Solr to index a website. I want Solr NOT to index certain URL patterns. Is there any way to specify such an exclude pattern?
Regards,
Paras
It can be done in the program: index a document only if its URL does not match the exclude pattern.
You can do the filtering in Solr using an UpdateRequestProcessor. In that UpdateRequestProcessor, you can decide whether or not to index the document depending on whether it matches your regex.
Do you have a crawler going about and collecting data? I would lean towards doing that logic in the crawler. Solr is more of a repository, and I don't think it is the best place to put lots of indexing logic.
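As a rough illustration of filtering in the program before anything is sent to Solr, something along these lines would work. The patterns and URLs are invented for the example; substitute your own.

```python
import re

# Hypothetical exclude patterns; replace with the ones you need.
EXCLUDE_PATTERNS = [
    re.compile(r"/private/"),
    re.compile(r"\.pdf$"),
    re.compile(r"[?&]sessionid="),
]

def should_index(url):
    """Return True only if the URL matches none of the exclude patterns."""
    return not any(pattern.search(url) for pattern in EXCLUDE_PATTERNS)

urls = [
    "http://example.com/docs/intro.html",
    "http://example.com/private/admin.html",
    "http://example.com/files/report.pdf",
    "http://example.com/search?q=solr&sessionid=abc",
]
# Only the URLs that pass the filter would be handed to the indexer.
to_index = [url for url in urls if should_index(url)]
print(to_index)
```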
Eric
I'm looking into using Lucene and/or Solr to provide search in an RDBMS-powered web application. Unfortunately for me, all the documentation I've skimmed deals with how to get the data out of the index; I'm more concerned with how to build a useful index. Are there any "best practices" for doing this?
Will multiple applications be writing to the database? If so, it's a bit tricky; you have to have some mechanism to identify new records to feed to the Lucene indexer.
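One simple mechanism for identifying new records is a "high-water mark": remember the largest primary key already indexed and only fetch rows above it. The sketch below uses an in-memory SQLite database with a made-up `articles` table purely for illustration; it is not tied to any particular indexer.

```python
import sqlite3

def fetch_new_rows(conn, last_seen_id):
    """Return rows whose id is greater than the last id already indexed."""
    cursor = conn.execute(
        "SELECT id, title FROM articles WHERE id > ? ORDER BY id",
        (last_seen_id,),
    )
    return cursor.fetchall()

# Hypothetical table and data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO articles (title) VALUES (?)", [("a",), ("b",), ("c",)]
)

last_seen_id = 0
first_batch = fetch_new_rows(conn, last_seen_id)   # everything so far
last_seen_id = first_batch[-1][0]                  # remember the high-water mark

conn.execute("INSERT INTO articles (title) VALUES (?)", ("d",))
second_batch = fetch_new_rows(conn, last_seen_id)  # only the new row
print(first_batch, second_batch)
```

Note that this only catches inserts. Picking up updates and deletes needs something more, such as a last-modified timestamp column or a change log table.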
Another point to consider is whether you want one index that covers all of your tables or one index per table. In general, I recommend one index, with a field in that index to indicate which table the record came from.
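A sketch of the single-index approach, assuming every table has an integer `id` column: each row becomes a flat document with a `table` field and a combined key so that ids from different tables cannot collide. The table names, column names, and the `uid` field are all hypothetical.

```python
import sqlite3

def rows_to_docs(conn, table, columns):
    """Turn rows of one table into index documents tagged with their table.

    table and columns must come from trusted code, not user input,
    because they are interpolated into the SQL string.
    """
    column_list = ", ".join(columns)
    cursor = conn.execute(f"SELECT id, {column_list} FROM {table} ORDER BY id")
    docs = []
    for row in cursor:
        # uid combines table name and row id; "table" allows filtering later.
        doc = {"uid": f"{table}:{row[0]}", "table": table}
        doc.update(zip(columns, row[1:]))
        docs.append(doc)
    return docs

# Hypothetical schema and data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("INSERT INTO authors (name) VALUES ('Ann')")
conn.execute("INSERT INTO books (title) VALUES ('Search Basics')")

docs = rows_to_docs(conn, "authors", ["name"]) + rows_to_docs(conn, "books", ["title"])
print(docs)
```

A search can then cover everything at once, or be restricted to one source by filtering on the `table` field.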
Hibernate has support for full text search, if you want to search persistent objects rather than unstructured documents.
There's an OpenSymphony project called Compass of which you should be aware. I have stayed away from it myself, primarily because it seems to be way more complicated than search needs to be. Also, as far as I can tell from the documentation (I confess I haven't found the time necessary to read it all), it stores Lucene segments as blobs in the database. If you're familiar with the Lucene architecture, Compass implements a Lucene Directory on top of the database. I think this is the wrong approach. I would leverage the database's built-in support for indexing and implement a Lucene IndexReader instead. The same criticism applies to distributed cache implementations, etc.
I haven't explored this at all, but take a look at LuSql.
Using Solr would be straightforward as well, but there will be some DRY violations between the Solr schema.xml and your actual database schema. (FYI, Solr does support wildcards, though.)
We are rolling out our first application that uses Solr tonight. Solr 1.3 includes the DataImportHandler, which allows you to specify your database tables (they call them entities) along with their relationships. Once these are defined, a simple HTTP request will trigger an import of your data.
Take a look at the Solr wiki page for DataImportHandler for details.
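For illustration, the HTTP request that triggers an import is typically just a URL with a `command` parameter. The sketch below only builds that URL; the base address and the `/dataimport` handler path are assumptions that depend on how the handler is registered in your solrconfig.xml.

```python
from urllib.parse import urlencode

def dataimport_url(solr_base, command="full-import", **params):
    """Build a DataImportHandler URL (handler path /dataimport is assumed)."""
    query = {"command": command}
    query.update(params)
    return f"{solr_base}/dataimport?{urlencode(query)}"

# Hypothetical Solr location; clean and commit are optional DIH parameters.
url = dataimport_url("http://localhost:8983/solr", clean="true", commit="true")
print(url)
```

Fetching that URL (with a browser, curl, or `urllib.request.urlopen`) would start the import on a running Solr instance with the handler configured.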
As an introduction:
Brian McCallister wrote a nice blog post: Using Lucene with OJB.