Indexing engine - indexing

I'm developing context discover system - which is mix of searching and suggestions.
Currently I'm looking for library for indexing.
After some investigation I stayed on Lucene and Terrier and found Indri not comfortable.
What are the downsides of both? What problem I can meet while using them?
Is it true that Terrier doesn't have incremental indexing (every time new document is added, I need to rebuild and reindex everything)?
My requirements are:
- easy adding new documents
- easy score methods injection
- quiet well defined model
And one more thing: is Terrier still active? I haven't seen any update since 10/03/2010 terrier changelog

What sort of database are you going to be using? Lucene, in my experience, is much better documented than Terrier.
Here's an article comparing Lucene and Terrier:
http://text-analytics.blogspot.com/2011/05/java-based-retrieval-toolkits.html

Related

How to index relationship properties in Neo4J in an easy way

I want to index the existing relationship properties in Neo4J (2.0.1) and also to set up automatic indexing for the ones that will appear in the future.
I found out that it's possible to do that in Neo4J documentation through the legacy auto-indexing as well as the examples of some Java code.
However, as I'm neither an expert in Java, nor want to use "legacy" functionality, I wanted to ask you if there is an easy way to index relationships on a specific property using Cypher command or any other way (rest API?) that wouldn't involve me having to write some Java program and run it (I don't know how to do that).
Thank you for your help!
My original answer was wrong. Editing so that it doesn't generate confusion to others looking for a solution.
Please refer to Relationship Labels and Index in Neo4J a for correct answer, as #deemeetree pointed out in the comments.
Since Neo4j 4.3 (released June 17, 2021), creating relationship property indexes can be done directly with Cypher, as discussed on the Neo4j blog and the 4.3 release notes.
Example from the blog:
CREATE INDEX officerRelationshipProperty
FOR ()-[r:OFFICER_OF]-()
ON (r.role);
You can't do indexing on relationship. Indexing is done only on nodes.

lucene: space requirements for optimising the database?

It is my understanding that there are several options when it comes to database optimisation in Lucene:
optimise the whole thing into one segment, space hungry by at least 2× ?
optimise into several segments
remove deleted entries — expungeDeletes(), without changing the number of segments?
Consider that a database is not held on a platter disc (mfs is in use). Do each of these operations have some bound on space requirements?
I noticed that expungeDeletes() is no longer documented for Lucene 4.6.0 — has it been removed? I'm coming from Lucene 3.0.2 / December 2011, although I'm open to upgrading to 4.6 sometime.
Manual optimization methods have now been removed in favour of Tiered Merge Policy. You may read about this in the blog post of one of the authors of Lucene. In short, merge will happen automatically as it is believed that the algorithm (which knows the internal state of the index) will do a better job than the user.
p.s. I think you need to get the nomenclature right. There's no such thing as "database" in Lucene (you probably meant index?)

Solr on a .NET site

I've got an ASP.NET site backed with a SQL Server database. I'm been using Lucene.NET to index and search the database. I'm adding faceted search navigation to the results page (the facets are a hiarchical category tree). I asked yesterday to make sure I was using the right technique for faceting. All I've gotten so far is a suggestion to use Solr, but Solr does a lot of things I don't need.
I would really like to know from anyone who is familiar with the Solr's source code if Solr's facet processing is terribly different from the one described here by Bert Willems. Bascially you have a Lucene filter for each facet, you get the bits array from it, and you count the set bits in the array.
I'm thinking since mine is hiarchical to begin with I should be able to optimize this pretty well, but I'm afraid I might be grossly under-estimating the impact of this design on search performance. If Solr is no quicker, I'm not going to gain anything by using it.
I'd recommend creating a prototype project modeling your faceting needs with Solr and benchmark it against Lucene.net.
Even though faceting in Solr is very optimized (and gets new optimizations all the time, like the parallel per-segment faceting method), when using Solr there is some overhead, for example network roundtrips and response parsing.
If your code already implements Lucene.NET, performs adequately and you don't need any of Solr's additional features, then there is no need to switch to Solr. But also consider that if you choose Solr you will get faceting performance boosts for free with each new version.

Tips/recommendations for using Lucene

I'm working on a job portal using asp.net 3.5
I've used Lucene for job and resume search functionality.
Would like to know tips/recommendations if any with respect to Lucene performance optimization, scalability, etc.
Thanks a ton!
I've documented how I used Lucene.NET (in BugTracker.NET) here:
http://www.ifdefined.com/blog/post/2009/02/Full-Text-Search-in-ASPNET-using-LuceneNET.aspx
One thing you should keep in mind is that it is very hard to cluster or replicate lucene indexes in large installations, like fail over scenarios or distributed systems. So you should either have a good way to replicate your index jobs or the whole database.
If you use a sort, watch out for the size of the comparators. When sorts are used, for each document returned by the searcher there will be a comparator object stored for each SortField in the Sort object. Depending on the size of the documents and the number of fields you want to sort on, this can become a big headache.

Is there a set of best practices for building a Lucene index from a relational DB?

I'm looking into using Lucene and/or Solr to provide search in an RDBMS-powered web application. Unfortunately for me, all the documentation I've skimmed deals with how to get the data out of the index; I'm more concerned with how to build a useful index. Are there any "best practices" for doing this?
Will multiple applications be writing to the database? If so, it's a bit tricky; you have to have some mechanism to identify new records to feed to the Lucene indexer.
Another point to consider is do you want one index that covers all of your tables, or one index per table. In general, I recommend one index, with a field in that index to indicate which table the record came from.
Hibernate has support for full text search, if you want to search persistent objects rather than unstructured documents.
There's an OpenSymphony project called Compass of which you should be aware. I have stayed away from it myself, primarily because it seems to be way more complicated than search needs to be. Also, as I can tell from the documentation (I confess I haven't found the time necessary to read it all), it stores Lucene segments as blobs in the database. If you're familiar with the Lucene architecture, Compass implements a Lucene Directory on top of the database. I think this is the wrong approach. I would leverage the database's built-in support for indexing and implement a Lucene IndexReader instead. The same criticism applies to distributed cache implementations, etc.
I haven't explored this at all, but take a look at LuSql.
Using Solr would be straightforward as well but there'll be some DRY-violations with the Solr schema.xml and your actual database schema. (FYI, Solr does support wildcards, though.)
We are rolling out our first application that uses Solr tonight. With Solr 1.3, they've included the DataImportHandler that allows you to specify your database tables (they call them entities) along with their relationships. Once defined, a simple HTTP request will tirgger an import of your data.
Take a look at the Solr wiki page for DataImportHandler for details.
As introduction:
Brian McCallister wrote a nice blog post: Using Lucene with OJB.