Is there any benchmark comparing Lucene Vs PyLucene Vs Whoosh?
Lucene seems to be far ahead in terms of popularity, but I'm looking for something more Pythonic, so I just want to get a rough idea of the tradeoffs.
I am going to migrate from Lucene 3.5 to 4.7, and since my index is really large I am wondering whether it is worth reindexing it.
Mostly I am interested in whether reindexing would pay off in terms of performance.
Any suggestions?
Regards
As usual, there is no simple answer to this.
The big change is that v4.0 introduced the ability to provide a custom codec/postings format. Michael McCandless (one of the Lucene authors) explains the difference between 3.x and 4.0:
By default, Lucene uses the StandardCodec, which writes and reads in nearly the same format as the current stable branch (3.x). Details for a given term are stored in terms dictionary files, while the docs and positions where that term occurs are stored in separate files.
That said, there are different codecs and each of them focuses on different things.
This presentation covers some postings formats and offers insights into which format is tuned for which scenario. If you are going to stay with StandardCodec, I'd guess you won't gain much by reindexing.
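For concreteness, here is a minimal sketch, assuming the Lucene 4.7 API (the index path is hypothetical), of where a codec is chosen when building an index; a custom codec/postings format would be plugged in at the same spot:

```java
// A minimal sketch, assuming the Lucene 4.7 API; the index path is
// hypothetical. "Lucene46" is the name of the default codec in 4.7,
// and a custom Codec would be passed in the same way.
import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.codecs.Codec;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class CodecExample {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(new File("/path/to/index"));
        IndexWriterConfig config = new IndexWriterConfig(
                Version.LUCENE_47, new StandardAnalyzer(Version.LUCENE_47));
        // The codec decides how postings are written; swap in a custom
        // implementation here to change the postings format.
        config.setCodec(Codec.forName("Lucene46"));
        IndexWriter writer = new IndexWriter(dir, config);
        writer.close();
    }
}
```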
Consider using the IndexUpgrader of 4.7 to upgrade your index, as there are some changes in the index format (the postings format, to be precise) from 3.x to 4.x. The default codec of Lucene 4.7 may not be able to read Lucene 3.x index files.
IndexUpgrader is a utility provided by Lucene.
http://lucene.apache.org/core/4_7_0/core/org/apache/lucene/index/IndexUpgrader.html
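As a rough sketch (the jar name and index path are placeholders), the upgrade can be run from the command line or through the Java API:

```java
// Command-line form, per the IndexUpgrader javadoc:
//   java -cp lucene-core-4.7.0.jar org.apache.lucene.index.IndexUpgrader \
//        -delete-prior-commits /path/to/index
//
// Equivalent Java API usage: rewrites all segments into the current
// format so the 4.7 default codec can read them.
import java.io.File;

import org.apache.lucene.index.IndexUpgrader;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class UpgradeIndex {
    public static void main(String[] args) throws Exception {
        new IndexUpgrader(FSDirectory.open(new File("/path/to/index")),
                Version.LUCENE_47).upgrade();
    }
}
```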
I'm indexing some English texts in a Java application with Lucene 4.1.0, and I need to lemmatize them. I've found stemming (PorterStemFilter and SnowballFilter), but it is not enough.
After lemmatization I want to use a thesaurus for query expansion; does Lucene include a thesaurus?
If that is not possible, I will use StanfordCoreNLP and WordNet instead.
Do you think lemmatization may affect searching with the Lucene library?
Thanks
As far as I know, you will need to build synonym support yourself.
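That said, Lucene's analyzers-common module does give you the building blocks, SynonymFilter and SynonymMap; you supply the thesaurus data yourself (there is also a WordnetSynonymParser for WordNet's prolog files). A minimal sketch, assuming the Lucene 4.1 API, with a hard-coded entry standing in for real thesaurus data:

```java
// A minimal sketch, assuming the Lucene 4.1 analyzers-common API. The
// synonym entry below is illustrative; in practice you would load the
// mappings from WordNet or another thesaurus.
import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.synonym.SynonymFilter;
import org.apache.lucene.analysis.synonym.SynonymMap;
import org.apache.lucene.util.CharsRef;
import org.apache.lucene.util.Version;

public class SynonymAnalyzer extends Analyzer {
    private final SynonymMap synonyms;

    public SynonymAnalyzer() throws IOException {
        SynonymMap.Builder builder = new SynonymMap.Builder(true);
        // Map "quick" to "fast", keeping the original token as well.
        builder.add(new CharsRef("quick"), new CharsRef("fast"), true);
        this.synonyms = builder.build();
    }

    @Override
    protected TokenStreamComponents createComponents(String field, Reader reader) {
        StandardTokenizer source = new StandardTokenizer(Version.LUCENE_41, reader);
        TokenStream result = new LowerCaseFilter(Version.LUCENE_41, source);
        result = new SynonymFilter(result, synonyms, true);
        return new TokenStreamComponents(source, result);
    }
}
```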
Can we customize the Lucene that is embedded in Solr just as we can customize raw Lucene, so that we have "everything" Lucene offers available in Solr?
I am asking because we are stuck deciding between Solr and Lucene, reasoning as follows:
Argument 1:
"We might hit a dead end in the future if we choose Solr, and Lucene is the better choice anyway, so we might as well start writing HTTP wrappers and almost half of Solr ourselves on top of Lucene, to be on the safe side."
Argument 2:
"Solr already has all the features we want to use, so why not just use it? Since the people who commit to Lucene are also responsible for committing to Solr, all features of Lucene are available to Solr too..."
I went through many blogs and posts that say something like:
For situations where you have very customized requirements requiring low-level access to the Lucene API classes, Solr would be more a hindrance than a help, since it is an extra layer of indirection.
(from http://www.lucenetutorial.com/lucene-vs-solr.html)
One way of defending Argument 2 is by confirming that we can customize the underlying Lucene in Solr just as we would if we had only Lucene.
Can someone suggest a better way of settling this argument? :)
PS: We need fast search, with indexing and sharding of terabytes of data...
Can we customize Lucene which is embedded in Solr?
Yes, you can. But keep this in mind:
Lucene and Solr committers are some of the foremost experts in the field of full-text search, with many years of experience. If you think you can do better than them, then go ahead and adapt Solr to your needs (it's Apache-licensed, so there are no commercial restrictions), and if you do, try to do it in a way that lets you contribute the changes back to the project, so everyone benefits and the project moves forward.
For the vast majority of Solr users though, the stock product is more than enough and satisfies all needs.
In other words, before jumping in to change the code, ask on a mailing list (Stack Overflow or solr-user); there's a good chance that you don't really need to change any code.
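For concreteness, this is roughly what the plugin point looks like: custom Lucene analysis components are exposed to Solr through factory classes that schema.xml references by class name. Below is a sketch assuming the Lucene/Solr 4.4+ factory API; the class name is a made-up example, and the factory merely wraps Lucene's stock PorterStemFilter, but a real customization would return your own TokenFilter in the same spot.

```java
import java.util.Map;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.util.TokenFilterFactory;

// Hypothetical factory class; Solr instantiates it by the class name
// given in schema.xml.
public class MyStemFilterFactory extends TokenFilterFactory {
    public MyStemFilterFactory(Map<String, String> args) {
        super(args); // the superclass consumes common args such as luceneMatchVersion
        if (!args.isEmpty()) {
            throw new IllegalArgumentException("Unknown parameters: " + args);
        }
    }

    @Override
    public TokenStream create(TokenStream input) {
        // A real customization would return a hand-written TokenFilter here.
        return new PorterStemFilter(input);
    }
}
```

A fieldType in schema.xml would then reference the fully qualified factory class name inside its analyzer chain, so the custom Lucene code runs in Solr exactly as it would in an embedded Lucene application.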
"Fast search with indexing and sharding terabytes of data" is precisely what Solr was built for. It would be a bad case of Not-Invented-Here not to use it or any of the other similar solutions, such as ElasticSearch, Sphinx, Xapian, etc. If you think you'll need to customize or extend the search server in any way, consider the license and underlying code of each one. Solr and ElasticSearch are both Apache-licensed so they don't have commercial restrictions and are built on top of Lucene, a well-known library.
Can anyone provide a simple comparative analysis of Lucene and mg4j? What advantages does either framework have?
BTW, I've seen the following basic reasons for choosing mg4j in several academic papers:
combining indices over the same collection
multi-index queries
Update:
These slides (from mir2ed.org) contain a more recent overview of open source search engines, including Lucene and mg4j, benchmarking various aspects: memory and CPU usage, index size, search performance, search quality, etc.
Jeff Dalton reviewed many open source search engines including Lucene and mg4j in 2007, and updated the comparison in 2009.
I have not used mg4j. I have used Lucene, though. The number one feature of Lucene, IMO, is its wide adoption and wonderful community of users/developers/committers. This means there is a fair chance that somebody has already worked on a use case similar to yours using Lucene.
Current weak points of Lucene are its scoring model and its ability to scale to large collections of text. The Lucene developers are working on these issues.
I believe that the choice of a search library is very dependent on your (academic or industrial) setting, the other parts of your application and your use case.
I am using Java Lucene and I am moving my code from Java to C++ for some reason, so I need to know about the performance of CLucene.
Can anyone explain?
According to a benchmark posted on CLucene's SourceForge wiki, CLucene outperforms Java Lucene by a factor of 2 to 3 during indexing, but search performance is only about 10% better.
The data Michael linked to is quite old and incomplete. The answer is yes, CLucene is faster, mainly because C++ has no GC threads and memory allocations are made by hand. Even the reference counting will be performed faster in C++, since it is compiled to machine code, unlike Java, which runs on a VM.
For more info see the free chapter on CLucene from Lucene In Action, available from http://www.code972.com/blog/2010/06/lucene-in-action-free-chapter-coupon-code/