Is there a set of best practices for building a Lucene index from a relational DB? - lucene

I'm looking into using Lucene and/or Solr to provide search in an RDBMS-powered web application. Unfortunately for me, all the documentation I've skimmed deals with how to get the data out of the index; I'm more concerned with how to build a useful index. Are there any "best practices" for doing this?

Will multiple applications be writing to the database? If so, it's a bit tricky; you have to have some mechanism to identify new records to feed to the Lucene indexer.
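One common mechanism for spotting new records is a "high-water mark" on a modification timestamp. Here is a minimal sketch, assuming every writer maintains an `updated_at` column; the `posts` table, its columns, and the function names are made up for illustration, and SQLite stands in for whatever RDBMS you actually use.

```python
import sqlite3


def fetch_changed_rows(conn, last_indexed_at):
    """Return rows modified after the previous indexing run, oldest first."""
    cur = conn.execute(
        "SELECT id, title, updated_at FROM posts "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_indexed_at,),
    )
    return cur.fetchall()


def next_high_water_mark(rows, previous_mark):
    """The new mark is the newest timestamp just processed (or the old one)."""
    return max((row[2] for row in rows), default=previous_mark)


def make_demo_db():
    """Hypothetical table with three rows, used only to exercise the sketch."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT, updated_at INTEGER)"
    )
    conn.executemany(
        "INSERT INTO posts VALUES (?, ?, ?)",
        [(1, "first post", 100), (2, "second post", 200), (3, "third post", 300)],
    )
    return conn
```

Each run would feed the returned rows to the indexer and then persist the new mark. Deletions are not covered by this sketch; they would need a separate signal, such as a tombstone table.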
Another point to consider is whether you want one index that covers all of your tables or one index per table. In general, I recommend one index, with a field in that index to indicate which table the record came from.
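To make the single-index idea concrete, here is a toy sketch in plain Python. It is not Lucene code; `TinyIndex`, `row_to_doc`, and the field names are hypothetical. The point is only that each document carries a `table` field and an id that stays unique across tables, so a search can cover everything or be narrowed to one table.

```python
from collections import defaultdict


def row_to_doc(table, pk, text):
    """Turn one database row into an index document tagged with its source table."""
    return {"id": f"{table}:{pk}", "table": table, "text": text}


class TinyIndex:
    """A toy inverted index standing in for a real Lucene index."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of document ids
        self.docs = {}                    # document id -> document

    def add(self, doc):
        self.docs[doc["id"]] = doc
        for term in doc["text"].lower().split():
            self.postings[term].add(doc["id"])

    def search(self, term, table=None):
        """Return matching ids, optionally restricted to one source table."""
        hits = sorted(self.postings.get(term.lower(), set()))
        if table is not None:
            hits = [h for h in hits if self.docs[h]["table"] == table]
        return hits
```

With a real index the table restriction would typically be an extra term clause or a filter on the table field rather than a Python list comprehension.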
Hibernate has support for full text search, if you want to search persistent objects rather than unstructured documents.
There's an OpenSymphony project called Compass of which you should be aware. I have stayed away from it myself, primarily because it seems to be way more complicated than search needs to be. Also, as far as I can tell from the documentation (I confess I haven't found the time necessary to read it all), it stores Lucene segments as blobs in the database. If you're familiar with the Lucene architecture, Compass implements a Lucene Directory on top of the database. I think this is the wrong approach. I would leverage the database's built-in support for indexing and implement a Lucene IndexReader instead. The same criticism applies to distributed cache implementations, etc.

I haven't explored this at all, but take a look at LuSql.
Using Solr would be straightforward as well but there'll be some DRY-violations with the Solr schema.xml and your actual database schema. (FYI, Solr does support wildcards, though.)

We are rolling out our first application that uses Solr tonight. With Solr 1.3, they've included the DataImportHandler, which allows you to specify your database tables (they call them entities) along with their relationships. Once defined, a simple HTTP request will trigger an import of your data.
Take a look at the Solr wiki page for DataImportHandler for details.
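For a rough idea of what that HTTP request looks like, here is a small sketch. It assumes the handler is registered at the conventional `/dataimport` path in solrconfig.xml and that Solr listens at the base URL you pass in; both are assumptions about your setup, and the helper names are made up.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def dataimport_url(base_url, command="full-import", **params):
    """Build the DataImportHandler URL for a given command."""
    query = urlencode({"command": command, **params})
    return f"{base_url.rstrip('/')}/dataimport?{query}"


def trigger_import(base_url, command="full-import"):
    """Send the request and return Solr's raw response body as text."""
    with urlopen(dataimport_url(base_url, command)) as resp:
        return resp.read().decode("utf-8")
```

Calling `trigger_import("http://localhost:8983/solr")` would start a full import against a Solr instance at that (assumed) address; passing `command="status"` to the same handler reports progress.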

As an introduction:
Brian McCallister wrote a nice blog post: Using Lucene with OJB.

Related

Can we use lucene query to have fts_alfresco search?

I want to upgrade my Alfresco server to 5.2, and all my custom webscripts use Lucene queries. Since Lucene indexing was removed in Alfresco 5.x and Solr indexing is not instantaneous, I am planning to use fts_alfresco search. While testing, I found that a few Lucene queries can be used for fts_alfresco search without modification. So my question is: will I be able to do fts_alfresco searches using Lucene queries? If not, is there a better way to migrate all my Lucene queries to fts_alfresco?
Thanks in advance.
You will need to test/check your queries since there are small differences (for instance, date range query is not the same), but in general there's no reason why you would not be able to use FTS.
I'm not sure comprehensive documentation exists that lists all those small differences, though. If you find it, please share.
"Alfresco FTS is compatible with most, if not all of the examples here.."
https://community.alfresco.com/docs/DOC-4673-search

Creating Lucene Index in a Database - Apache Lucene

I am using the Grails Searchable plugin. It creates index files at a given location. Is there any way for the Searchable plugin to create the Lucene index in a database?
Generally, no.
You can probably attempt to implement your own format but this would require a lot of effort.
I am no expert in Lucene, but I know that it is optimized to offer fast search over the filesystem. So it would be theoretically possible to build a Lucene index over the database, but the main feature of lucene (being a VERY fast search engine) would be lost.
As a point of interest, Compass supported storage of a Lucene index in a database, using a JdbcDirectory. This was, as far as I can figure, just a bad idea.
Compass, by the way, is now defunct, having been replaced by ElasticSearch.

Solr on a .NET site

I've got an ASP.NET site backed by a SQL Server database. I've been using Lucene.NET to index and search the database. I'm adding faceted search navigation to the results page (the facets are a hierarchical category tree). I asked yesterday to make sure I was using the right technique for faceting. All I've gotten so far is a suggestion to use Solr, but Solr does a lot of things I don't need.
I would really like to know from anyone who is familiar with Solr's source code whether Solr's facet processing is terribly different from the one described here by Bert Willems. Basically, you have a Lucene filter for each facet, you get the bits array from it, and you count the set bits in the array.
I'm thinking that since mine is hierarchical to begin with I should be able to optimize this pretty well, but I'm afraid I might be grossly underestimating the impact of this design on search performance. If Solr is no quicker, I'm not going to gain anything by using it.
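For readers unfamiliar with the bit-counting approach described in the question, here is a toy sketch in Python, using plain integers as bitsets instead of Lucene filters. The intersection with the query's matching documents is my reading of the step; the function names and the sample facets are hypothetical.

```python
def bitset(doc_ids):
    """Pack a collection of document numbers into an int used as a bitset."""
    bits = 0
    for doc_id in doc_ids:
        bits |= 1 << doc_id
    return bits


def facet_counts(query_bits, facet_filters):
    """For each facet, AND its bits with the query's bits and count the set bits."""
    return {
        name: bin(query_bits & facet_bits).count("1")
        for name, facet_bits in facet_filters.items()
    }
```

The cost is one AND plus one population count per facet, so it grows with the number of facet values rather than with the number of hits, which is why a large category tree is the case worth benchmarking.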
I'd recommend creating a prototype project that models your faceting needs with Solr and benchmarking it against Lucene.NET.
Even though faceting in Solr is very optimized (and gets new optimizations all the time, like the parallel per-segment faceting method), when using Solr there is some overhead, for example network roundtrips and response parsing.
If your code already uses Lucene.NET, performs adequately, and you don't need any of Solr's additional features, then there is no need to switch to Solr. But also consider that if you choose Solr you will get faceting performance boosts for free with each new version.

Why are document stores like Lucene / Solr not included in NoSQL conversations?

All of us have come across the recent hype around no-SQL solutions. MongoDB, CouchDB, BigTable, Cassandra, and others have been listed as no-SQL options. Here's an example:
http://architects.dzone.com/articles/what-nosql-store-should-i-use
However, three years ago a co-worker and I were using Lucene.NET in a way that seems to fit the description of no-SQL. We did not use it just for user-inputted search queries; we used it to make some reindexed RDBMS table data extremely performant. We implemented our own .NET sort-of-equivalent-to-Solr service to manage these indexes and make them callable. When I left the company, the team switched to Solr itself. (For those not in the know, Solr is a web service that wraps Lucene with REST-callable queries and index dumps.)
What I don't understand is, why is Solr not counted in the typical lists of no-SQL solution options? Am I missing something here? I assume that there are technical reasons why Solr is not comparable to the likes of CouchDB, etc., and in fact I understand that CouchDB uses Lucene as its data store (yes?), but what disqualifies Solr?
I'm not asking as some kind of Solr fanboy or anything, I just don't understand why Solr and the like don't fit the definition of no-SQL, and if Solr technically does fit the definition then what about it likely makes people pooh-pooh it? I'm asking because I'm having difficulty determining whether I should continue using Lucene-based solutions (like Solr) for solutions that I build or if I should really do more research with these other options.
I once listened to an interview with author Ursula K. Le Guin about fiction writing. The interviewer asked her about authors who work in different genres of writing. What makes one author a romance writer, and another a mystery writer, and another a science fiction writer? Le Guin responded by explaining:
Genre is about marketing, not about content.
It was an eye-opening statement.
I think the same applies to technology solutions. The NoSQL movement is attracting attention because it's full of marketing energy right now. NoSQL data stores like Hadoop, CouchDB, MongoDB, have commercial ventures backing them, pushing their solutions as new and innovative and exciting so they can grow their business. The term "NoSQL" is a marketing brand that helps them to explain their value.
You're right that Lucene/Solr is technically very similar to a NoSQL document store: it's a denormalized bag of documents (their term) with fields that aren't necessarily consistent across the collection of documents. It's indexed in a sophisticated way to allow you to search across all fields or by specific fields.
But that's not the genre Lucene uses to explain its value. They don't have the same mission to grow a market and a business, since they're managed by the Apache Software Foundation. They're happy to focus on the use case of fulltext search, even though the technology could be used in other ways. They're following a tenet of software success: do one thing, and do it well.
After doing more Google-searching, I think this document sums it up pretty well:
https://web.archive.org/web/20100504055638/http://www.lucidimagination.com/blog/2010/04/30/nosql-lucene-and-solr/
Case in point, Lucene/Solr is NoSql and could be considered one of NoSql's more mature "forefathers". It just does not get the NoSql hype it deserves because it didn't invent the term "no-SQL" and its users don't use the term, so the hype machine overlooked it.
I think the most relevant reason Solr/Lucene drops off the NoSQL list is that, until recently, making Lucene work as a real-time system was a pain. The usual workflow for any performant application was to index the incremental updates in batches, updating the index every 5 minutes, for example.
I think that stimpy77 is partly right that NoSQL is a branding thing. But NoSQL also means a data storage platform that is simpler/easier than SQL-based solutions. And while Solr/Lucene shares some aspects (it stores data), it really misses the mark to think that Solr/Lucene could be used as primary data storage for anything that has relationships. Sure, lots of documents can be thrown into it, and a powerful search can pull them back. But as soon as you want relationships, stores such as CouchDB, which have a query syntax of some kind, do much better. Search is a band-aid solution in that case. Think about the use case "find all documents tagged with the word 'car'". If I have some structure in my data, then it's easy for me to get the document for the tag car and pull everything back, versus relying on a search query that includes fq=tag:'car'. Search is more powerful the fewer relationships you have; the more relationships you have, the better a datastore like CouchDB and its brethren are. That's why you still see CouchDB and friends paired with Solr, and vice versa! Let each one do what it does best.
Of course, that isn't to say you can't store your source data in Solr; that can be a powerful tool to use!
In my opinion, the main operational differences between a NoSQL store and Solr are the following.
Solr requires an intermediate data store (a database or XML files), whereas a NoSQL store is itself a direct data store.
You cannot do constant writes to Solr (Solr 4.0 seems to bring that support); you can only index at most every 2 minutes and 200 records, which is very slow for high-throughput writes and forces you to use intermediate storage.
You are required to change or define the schema when you alter what is stored in a document. NoSQL has no such definitions.
Solr's performance suffers as its index size grows, whereas NoSQL is optimized for growth (or claims to be :) ).
Solr comes with the underlying Lucene search algorithms bundled, but in NoSQL you need to build them yourself. This applies to the magnificent faceted search and blazing-fast document search that Solr provides.
One last point: this is about the actual differences that put Solr outside NoSQL, not the marketing strategy mentioned in other answers here.
Lucene/Solr: I'm going to talk about Solr, since Solr uses Lucene internally and has additional features. Solr is basically an upgrade to Lucene in a new costume.
Solr is mainly used for creating facets and indexing plain text for search engines.
Solr can work with most databases as the store for its data. Keeping the data only in Solr risks inconsistency, since Solr writes directly to disk.
NoSQL databases are easy to learn compared to Solr. Solr has a lot of configuration and concepts (for example, fields).
Performance is something we have to consider when choosing between them. Solr provides high performance compared to other NoSQL databases.
Note: Combining Solr with a database provides the best performance.
Summary: Solr is also a NoSQL datastore, and a predecessor of the other NoSQL databases. It didn't get the hype of the others, but it is still in the field because of its performance and power.

Fulltext Search with InnoDB

I'm developing a high-volume web application, where part of it is a MySQL database of discussion posts that will need to grow to 20M+ rows, smoothly.
I was originally planning on using MyISAM for the tables (for the built-in fulltext search capabilities), but the thought of the entire table being locked due to a single write operation makes me shudder. Row-level locks make so much more sense (not to mention InnoDB's other speed advantages when dealing with huge tables). So, for this reason, I'm pretty determined to use InnoDB.
The problem is... InnoDB doesn't have built-in fulltext search capabilities.
Should I go with a third-party search system? Like Lucene(c++) / Sphinx? Do any of you database ninjas have any suggestions/guidance? LinkedIn's zoie (based off Lucene) looks like the best option at the moment... having been built around realtime capabilities (which is pretty critical for my application.) I'm a little hesitant to commit yet without some insight...
(FYI: going to be on EC2 with high-memory rigs, using PHP to serve the frontend)
Along with the general phasing out of MyISAM, InnoDB full-text search (FTS) is finally available as of the MySQL 5.6.4 release.
Lots of juicy details at https://dev.mysql.com/doc/refman/5.6/en/innodb-fulltext-index.html.
While other engines have lots of different features, this one is InnoDB, so it's native (which means there's an upgrade path), and that makes it a worthwhile option.
I can vouch for MyISAM fulltext being a bad option - even leaving aside the various problems with MyISAM tables in general, I've seen the fulltext stuff go off the rails and start corrupting itself and crashing MySQL regularly.
A dedicated search engine is definitely going to be the most flexible option here: store the post data in MySQL/InnoDB, and then export the text to your search engine. You can set up a periodic full index build/publish pretty easily, and add real-time index updates if you feel the need and want to spend the time.
Lucene and Sphinx are good options, as is Xapian, which is nice and lightweight. If you go the Lucene route, don't assume that CLucene will be better, even if you'd prefer not to wrestle with Java, although I'm not really qualified to discuss the pros and cons of either.
You should spend an hour and go through installation and test-drive of Sphinx and Lucene. See if either meets your needs, with respect to data updates.
One of the things that disappointed me about Sphinx is that it doesn't support incremental inserts very well. That is, it's very expensive to reindex after an insert, so expensive that their recommended solution is to split your data into older, unchanging rows and newer, volatile rows. So every search your app does would have to search twice: once on the larger index for old rows and once on the smaller index for recent rows. If that doesn't fit your usage patterns, then Sphinx is not a good solution (at least not in its current implementation).
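The "search twice" pattern can be sketched roughly as below. This is plain Python over two dictionaries, not Sphinx's API; the scoring (term occurrences) and the rule that a row present in the small index overrides its stale copy in the large one are assumptions made for the illustration.

```python
def search_index(index, term):
    """index maps doc_id -> text; score is the number of occurrences of term."""
    term = term.lower()
    hits = []
    for doc_id, text in index.items():
        score = text.lower().split().count(term)
        if score:
            hits.append((doc_id, score))
    return hits


def search_main_and_delta(main, delta, term):
    """Query both indexes and merge, letting recent rows shadow older copies."""
    # Drop main-index hits for rows that were re-indexed into the delta.
    merged = [(d, s) for d, s in search_index(main, term) if d not in delta]
    merged.extend(search_index(delta, term))
    # Best score first, ties broken by document id.
    return sorted(merged, key=lambda hit: (-hit[1], hit[0]))
```

Periodically the delta would be folded into the main index and emptied, which is the expensive rebuild the split is meant to make rare.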
I'd like to point out another possible solution you could consider: Google Custom Search. If you can apply some SEO to your web application, then outsource the indexing and search function to Google, and embed a Google search textfield into your site. It could be the most economical and scalable way to make your site searchable.
Perhaps you shouldn't dismiss MySQL's FT so quickly. Craigslist used to use it.
"MySQL’s speed and Full Text Search has enabled craigslist to serve their users .. craigslist uses MySQL to serve approximately 50 million searches per month at a rate of up to 60 searches per second."
edit
As commented below, Craigslist seems to have switched to Sphinx some time in early 2009.
Sphinx, as you point out, is quite nice for this stuff. All the work is in the configuration file. Make sure whatever your table is with the strings has some unique integer id key, and you should be fine.
Try this:
ROUND((LENGTH(text) - LENGTH(REPLACE(text, 'searchtext', ''))) / LENGTH('searchtext'),0)!=0
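The arithmetic in the expression above may be easier to see outside SQL. Removing every occurrence of the search string shortens the text by (number of occurrences) times (length of the search string), so dividing the difference by that length gives the occurrence count, and a nonzero count means a match. A small Python equivalent, with made-up function names:

```python
def occurrence_count(text, needle):
    """Count non-overlapping occurrences using only length arithmetic."""
    if not needle:
        raise ValueError("needle must be non-empty")
    return (len(text) - len(text.replace(needle, ""))) // len(needle)


def contains(text, needle):
    """Mirror of the `!= 0` test in the SQL expression."""
    return occurrence_count(text, needle) != 0
```

In SQL this is evaluated against every row and cannot use an index, so it behaves like a substring scan rather than a real fulltext search.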
You should take a look at Sphinx. It is worth a try. Its indexing is super fast and it is distributed. You should take a look at this webinar (http://www.percona.com/webinars/2012-08-22-full-text-search-throwdown). It talks about searching and has some neat benchmarks. You may find it helpful.
If everything else fails, there's always soundex_match, which sadly isn't really fast or accurate.
For anyone stuck on an older version of MySQL / MariaDB (i.e. CentOS users) where InnoDB doesn't support Fulltext searches, my solution when using InnoDB tables was to create a separate MyISAM table for the thing I wanted to search.
For example, my main InnoDB table was products, with various keys and referential integrity. I then created a simple MyISAM table called product_search containing two fields, product_id and product_name, with a FULLTEXT index on the latter. Both fields are effectively a copy of what's in the main products table.
I then search on the MyISAM table using fulltext, and do an inner join back to the InnoDB table.
The contents of the MyISAM table can be kept up-to-date via either triggers or the application's model.
I wouldn't recommend this if you have multiple tables that require fulltext, but for a single table it seems like an adequate workaround until you can upgrade.
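Here is a rough sketch of the trigger variant of this workaround. It uses SQLite from Python purely so the example is self-contained: SQLite has no MyISAM or InnoDB, and a `LIKE` stands in for MySQL's `MATCH ... AGAINST`, so treat the SQL as an illustration of the sync-and-join shape rather than MySQL syntax. The `price` column, trigger names, and helper functions are made up.

```python
import sqlite3

SCHEMA = """
CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT NOT NULL, price REAL);
CREATE TABLE product_search (product_id INTEGER PRIMARY KEY, product_name TEXT NOT NULL);

CREATE TRIGGER products_ai AFTER INSERT ON products BEGIN
    INSERT INTO product_search (product_id, product_name) VALUES (NEW.id, NEW.name);
END;
CREATE TRIGGER products_au AFTER UPDATE OF name ON products BEGIN
    UPDATE product_search SET product_name = NEW.name WHERE product_id = NEW.id;
END;
CREATE TRIGGER products_ad AFTER DELETE ON products BEGIN
    DELETE FROM product_search WHERE product_id = OLD.id;
END;
"""


def make_db():
    """Create the main table, the search copy, and the triggers that sync them."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def search_products(conn, word):
    """Search the copy table, then join back to the main table for full rows."""
    cur = conn.execute(
        "SELECT p.id, p.name, p.price FROM product_search s "
        "INNER JOIN products p ON p.id = s.product_id "
        "WHERE s.product_name LIKE ? ORDER BY p.id",
        (f"%{word}%",),
    )
    return cur.fetchall()
```

The application only ever writes to products; the triggers keep product_search in step, so the search copy cannot drift unless someone writes to it directly.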