Both MongoDB and CouchDB use B-trees as the underlying data structure for storing indexes. Does anyone know what the equivalent is for RavenDB? Nothing is mentioned about this in the documentation. Thanks!
RavenDB uses a Lucene index.
In order to allow fast queries over your indexes, RavenDB processes
them in the background, executing the queries against the stored
documents and persisting the results to a Lucene index. Lucene is a
full text search engine library (Raven uses the .NET version) which
allows us to perform lightning fast full text searches.
You can read more about indexing in the documentation: How the indexes work
Related
As you know, there are different techniques for indexing documents for search engines, such as inverted index, Distributed Dynamic Indexing, Semantic Indexing, NGram Indexing, Context Indexing, Big Data, Multilingual Indexing and so on.
I am working with Solr now. Which of these techniques does Solr use to index documents, and how does Solr (or Lucene) use them?
First - this is a very wide area, and most of the terms you're listing aren't index types. They describe product features (or buzzwords) that could be supported regardless of how the index is built behind the scenes.
Solr uses Lucene - which at the core is an inverted index.
The index stores statistics about terms in order to make term-based search more efficient. Lucene's index falls into the family of indexes known as an inverted index. This is because it can list, for a term, the documents that contain it. This is the inverse of the natural relationship, in which documents list terms.
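To make the "term lists its documents" idea concrete, here is a minimal sketch of an inverted index in plain Python. It is not Lucene's implementation (Lucene adds analysis, term statistics, compressed postings and much more); the function names and the sample documents are made up for illustration.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive analysis: lowercase and split on whitespace (an assumption,
        # real analyzers do tokenizing, stemming, stop words, etc.).
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def search(index, query):
    """Return ids of documents containing every query term (AND semantics)."""
    postings = [set(index.get(term, ())) for term in query.lower().split()]
    if not postings:
        return []
    return sorted(set.intersection(*postings))

# Hypothetical documents keyed by document id.
docs = {
    1: "Solr uses Lucene",
    2: "Lucene is an inverted index",
    3: "RavenDB uses Lucene too",
}
index = build_inverted_index(docs)
```

Looking up a term is a single dictionary access, and a multi-term query is an intersection of the posting lists, which is why term-based search stays cheap even when there are many documents.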
There are also many support structures in place to make Lucene even more efficient for certain queries and features. One such feature is the DocValues support - which can be described as a column-oriented store with document -> term mappings to speed up things like faceting.
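As a rough illustration of why a document -> value column helps faceting, here is a small sketch in Python. It only mimics the idea; it is not how Lucene stores DocValues on disk, and the column contents and function name are hypothetical.

```python
from collections import Counter

# Hypothetical column for one field: the list position is the document id,
# the value is that document's category.
category_column = ["java", "dotnet", "java", "python", "java"]

def facet_counts(column, matching_doc_ids):
    """Count field values over the matching documents by reading the column directly."""
    return Counter(column[doc_id] for doc_id in matching_doc_ids)

# Suppose a query matched documents 0, 2 and 3.
counts = facet_counts(category_column, [0, 2, 3])
```

With only the inverted index you would have to go from terms back to documents to get these counts; with the column, each matching document costs one lookup.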
You can see most of these support features in the Codecs API Doc for Lucene 6.3.0. As it's quite a large list, I'll leave it out from the comment itself.
To answer which techniques: under the hood, Solr uses the Lucene APIs, and Lucene's indexing technique is inverted indexing. Solr is a complete application that wraps Lucene with supporting infrastructure, but the underlying document indexing technique is the one provided by the Lucene APIs.
How does Solr (or Lucene) use these techniques?
Here is a nice overview of Lucene indexing for beginners. It's just a very simplistic overview, but it explains the basics.
Since Solr is a product, most of its available documentation is functional (it does not explain the actual indexing techniques). And since raw usage of Lucene is minimal, the Lucene documentation is not up to the mark, so most of the time one needs to dig into the Lucene code or API documentation to understand how Lucene works.
Hope it helps!
I am using the Grails Searchable plugin. It creates index files at a given location. Is there any way in the Searchable plugin to create the Lucene index in a database?
Generally, no.
You can probably attempt to implement your own format but this would require a lot of effort.
I am no expert in Lucene, but I know that it is optimized to offer fast search over the filesystem. So it would be theoretically possible to build a Lucene index over the database, but the main feature of Lucene (being a VERY fast search engine) would be lost.
As a point of interest, Compass supported storage of a Lucene index in a database, using a JdbcDirectory. This was, as far as I can figure, just a bad idea.
Compass, by the way, is now defunct, having been replaced by ElasticSearch.
If I don't have access to the file system, but do have access to a MySQL instance, can I store the Lucene index in a MySQL database? I was able to find the DbDirectory and thought that it might do the trick. However, it looks like it works with a Berkeley DB rather than an RDBMS.
There are some contributions that store a Lucene index in simpler datastores (from the perspective of the data model), for example Berkeley DB and Cassandra. So technically it is possible to write an implementation of Directory that stores the index via JDBC. There is one in the Compass framework.
I don't believe you can; it would defeat the purpose of Lucene. If your indexing does not take too long, you could consider a RAMDirectory, which I believe stores the index in memory.
I'm working on a job portal using ASP.NET 3.5.
I've used Lucene for job and resume search functionality.
I'd like to know any tips or recommendations with respect to Lucene performance optimization, scalability, etc.
Thanks a ton!
I've documented how I used Lucene.NET (in BugTracker.NET) here:
http://www.ifdefined.com/blog/post/2009/02/Full-Text-Search-in-ASPNET-using-LuceneNET.aspx
One thing you should keep in mind is that it is very hard to cluster or replicate Lucene indexes in large installations, such as failover scenarios or distributed systems. So you should either have a good way to replicate your index jobs or the whole database.
If you use a sort, watch out for the size of the comparators. When sorts are used, for each document returned by the searcher there will be a comparator object stored for each SortField in the Sort object. Depending on the size of the documents and the number of fields you want to sort on, this can become a big headache.
I'm looking into using Lucene and/or Solr to provide search in an RDBMS-powered web application. Unfortunately for me, all the documentation I've skimmed deals with how to get the data out of the index; I'm more concerned with how to build a useful index. Are there any "best practices" for doing this?
Will multiple applications be writing to the database? If so, it's a bit tricky; you have to have some mechanism to identify new records to feed to the Lucene indexer.
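One possible mechanism, sketched here in Python with an in-memory SQLite database: keep a "high-water mark" of the last primary key you indexed and fetch only rows above it on each run. This assumes an auto-incrementing key and that rows are only inserted, never updated (updates would need something like a last-modified column instead). The `articles` table and the function name are hypothetical.

```python
import sqlite3

def fetch_new_rows(conn, last_seen_id):
    """Return rows added since the previous indexing run, plus the new high-water mark."""
    rows = conn.execute(
        "SELECT id, title FROM articles WHERE id > ? ORDER BY id",
        (last_seen_id,),
    ).fetchall()
    new_mark = rows[-1][0] if rows else last_seen_id
    return rows, new_mark

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO articles (title) VALUES (?)",
    [("Lucene basics",), ("Solr setup",)],
)

# First run: nothing indexed yet, so both rows are new.
first_batch, mark = fetch_new_rows(conn, 0)

# Some other application inserts a row in the meantime.
conn.execute("INSERT INTO articles (title) VALUES (?)", ("Faceting",))

# Later run: only the row above the stored mark comes back.
second_batch, mark = fetch_new_rows(conn, mark)
```

The rows in each batch are what you would hand to the indexer; the mark has to be persisted somewhere between runs.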
Another point to consider is whether you want one index that covers all of your tables or one index per table. In general, I recommend one index, with a field in that index to indicate which table the record came from.
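A minimal sketch of the single-index layout in Python, using plain dictionaries in place of real Lucene documents. The table names, the `table` field name, and the idea of prefixing ids with the table name (so that primary keys from different tables don't collide) are my assumptions for illustration.

```python
def to_index_docs(table_name, rows):
    """Turn rows from one table into index documents tagged with their source table."""
    return [
        {"id": f"{table_name}:{row['id']}", "table": table_name, "text": row["text"]}
        for row in rows
    ]

# Hypothetical rows from two tables; note both use primary key 1.
articles = [{"id": 1, "text": "Lucene basics"}]
comments = [{"id": 1, "text": "Great Lucene article"}]

combined = to_index_docs("articles", articles) + to_index_docs("comments", comments)

def search(docs, term, table=None):
    """Naive scan: match the term, optionally restricted to one source table."""
    return [
        d["id"]
        for d in docs
        if term.lower() in d["text"].lower() and (table is None or d["table"] == table)
    ]
```

Searching without `table` covers everything at once, while passing `table="comments"` gives you the per-table behaviour without maintaining a second index.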
Hibernate has support for full text search, if you want to search persistent objects rather than unstructured documents.
There's an OpenSymphony project called Compass of which you should be aware. I have stayed away from it myself, primarily because it seems to be way more complicated than search needs to be. Also, as far as I can tell from the documentation (I confess I haven't found the time necessary to read it all), it stores Lucene segments as blobs in the database. If you're familiar with the Lucene architecture, Compass implements a Lucene Directory on top of the database. I think this is the wrong approach. I would leverage the database's built-in support for indexing and implement a Lucene IndexReader instead. The same criticism applies to distributed cache implementations, etc.
I haven't explored this at all, but take a look at LuSql.
Using Solr would be straightforward as well but there'll be some DRY-violations with the Solr schema.xml and your actual database schema. (FYI, Solr does support wildcards, though.)
We are rolling out our first application that uses Solr tonight. Solr 1.3 includes the DataImportHandler, which allows you to specify your database tables (they call them entities) along with their relationships. Once defined, a simple HTTP request will trigger an import of your data.
Take a look at the Solr wiki page for DataImportHandler for details.
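For illustration, the HTTP request is just a GET with a `command` parameter. The small Python sketch below only builds the URL; the host, port and the `/dataimport` handler path are assumptions, since the path is whatever name the handler was registered under in your solrconfig.xml.

```python
from urllib.parse import urlencode

def dataimport_url(base_url, command="full-import"):
    """Build the URL that asks the DataImportHandler to run the given command."""
    # "/dataimport" is an assumed handler path; check your solrconfig.xml.
    return f"{base_url.rstrip('/')}/dataimport?{urlencode({'command': command})}"

# Hypothetical local Solr instance.
url = dataimport_url("http://localhost:8983/solr")
```

Requesting that URL (from a browser, curl, or a scheduled job) should kick off the import.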
As an introduction:
Brian McCallister wrote a nice blog post: Using Lucene with OJB.