Search index replication - lucene

I am developing an application that requires a CLucene index to be created in a desktop application, but replicated for (read-only) searching on iOS devices and efficiently updated when the index is updated.
Aside from simply re-downloading the entire index whenever it changes, what are my options here? CLucene does not support replication on its own, but Solr (which is built on top of Lucene) does, so it's clearly possible. Does anybody know how Solr does this and how one would approach implementing similar functionality?
If this is not possible, are there any (non-Java-based) full-text search implementations that would meet my needs better than CLucene?
Querying the desktop application is not an option - the mobile applications must be able to search offline.

A Lucene index is based on write-once, read-many segments. This means that when new documents have been committed to a Lucene index, all you need to retrieve is:
the new segments,
the merged segments (old segments which have been merged in a single segment, if any),
the segments file (which stores information about the current segments).
Once all these new files have been downloaded, the segments files which have been merged can be safely removed. To take the changes into account, just reopen an IndexReader.
Solr has a Java implementation to do this, but given how simple it is, a synchronization tool such as rsync would do the trick too. Incidentally, this is how Solr replication worked before Solr 1.4; you can still find some documentation about rsync replication on the wiki.
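As a rough illustration of the steps above, the sync can be planned by comparing file names only. This is a minimal sketch, not Solr's or rsync's actual logic: the function name plan_segment_sync, the always_fetch parameter, and the example file names are hypothetical. It assumes index files are write-once, so a file already present locally never needs to be fetched again; any file that your Lucene or CLucene version rewrites in place would have to be listed in always_fetch.

```python
def plan_segment_sync(local_files, remote_files, always_fetch=()):
    """Return (to_download, to_delete) for a one-way index sync.

    Assumes write-once files: the same name implies the same content.
    Names in always_fetch are re-downloaded even if present locally.
    """
    local, remote = set(local_files), set(remote_files)
    needed = (remote - local) | (remote & set(always_fetch))
    # Fetch data files first and the segments file last, so the local copy
    # never has a segments file that points at segments not yet downloaded.
    to_download = sorted(needed, key=lambda name: (name.startswith("segments"), name))
    # Anything the remote index no longer lists (e.g. merged-away segments)
    # can be removed once the downloads have completed.
    to_delete = sorted(local - remote)
    return to_download, to_delete


# Hypothetical example: _0 and _1 were merged into _2, and _3 is new.
local = ["_0.cfs", "_1.cfs", "segments_2"]
remote = ["_2.cfs", "_3.cfs", "segments_3"]
downloads, deletions = plan_segment_sync(local, remote)
```

After the downloads and deletions have been applied, the reader on the device would be reopened to pick up the new segments file.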

Related

Memory consumption in my Jersey application keeps growing with time

Memory consumption in my application keeps growing over time. The app uses Lucene, and searches are performed through REST endpoints that search Lucene directories. Around 10 different directories are created, and multiple users can search one or more directories at the same time. During a search, the app also checks whether any record has been added or modified in the DB; if so, the directories are updated by deleting and re-adding the documents. I could not find anything wrong in the Lucene configuration or in the search code; for example, the IndexWriter on each directory is flushed and committed after documents are deleted and added. I am wondering whether searching can also consume memory. I can provide more details if required.
I would appreciate any clue about what exactly might be wrong.

Two-directional replication of two separate Solr servers

I have read about multi-core and master/slave setups in Solr, but I am looking for complete two-directional replication between two separate Solr servers. Where can I find a manual for doing that?
The two or more separate Solr servers can have internal replication or not.
The primary reason I expect you'd want bi-directional replication would be to support something like a cross-datacenter situation. That is, you want to isolate queries to particular places, but keep things in sync across a high-latency link.
If you don't need this, just use SolrCloud and let it handle replication. You can shard your index and get whatever update throughput you need. Any update can go to any node, and Solr will make sure it gets written to the right places.
If you are really thinking about datacenters, Solr added some brand new data center support in 6.0, which you can read about here: https://sematext.com/blog/2016/04/20/solr-6-datacenter-replication/
However, this still assumes that updates go to a single data center and that the other one just follows along.
Apple also did a talk about their (internal) bidirectional replication system you can watch here: https://www.youtube.com/watch?v=_Erkln5WWLw
That said, the simplest thing would be to just write the updates to both places.

How to handle index files in a distributed Lucene cluster?

We are using Lucene in our application, and the index files are saved on the disk of the same server where the application runs.
The index files are almost 2 GB at the moment, and they may be updated from time to time. For example, when new data is inserted into the database, we may have to rebuild that part of the index and add it.
So far so good, since there is only one application server. Now we have to add another one to make a cluster, so I wonder how to handle the index files.
BTW, our application should be platform independent, since our clients use different operating systems such as Linux, and some of them even use cloud platforms with different storage such as Amazon EFS or Azure Storage.
It seems I have two options:
1 Every server holds a copy of the index files, and the copies are kept synchronized with each other.
But the synchronization mechanism will depend on the OS, which we are trying to avoid. And I am not sure whether it will cause conflicts if two servers update the index files with different documents at the same time.
2 Make the index files shared.
As with 1), the file sharing mechanism is platform dependent. Saving them to the database might be an alternative, but what about the performance? I have thought about using memcached to store them, but I have not found any examples.
How do you handle this kind of problem?
You could look into the Compass project. Compass allowed a Lucene index to be stored in a database or in distributed in-memory data grids such as GigaSpaces, Coherence and Terracotta. Unfortunately, this project is outdated and its last version was released in 2009, but you could try to adapt it for your purpose.
Another option is to look at HdfsDirectory, which supports storing an index in an HDFS file system. I see only 5 classes in the package org.apache.solr.store.hdfs, so it should be relatively easy to adapt them to store the index in in-memory caches such as memcached or Redis.
I also found a RedisDirectory project on GitHub, but it is at an early stage and its last commit was in 2012. I can recommend it only as a reference.
I hope this helps you find the right solution.

Best way to keep index real time?

I have a Solr/Lucene index of approximately 700 GB. The documents that I need to index arrive in real time: roughly 1000 docs are submitted every 30 minutes and need to be indexed. In my scenario, a script runs every 30 minutes and indexes the documents that are not yet indexed, since it is a requirement that new documents be searchable as soon as possible. However, this process slows down searching.
Is this the best way to index the latest documents, or is there a better way?
First, remember that Solr is not a real-time search engine (yet). There is still work to be done.
You can use a master/slave setup, where indexing is done on the master and searching on the slave. With this, indexing does not affect search performance. After the commit is done on the master, force the slave to fetch the latest index from the master. While the new index is being replicated to the slave, the slave keeps processing queries with the previous index.
Also, check your cache warming settings. Remember that warming might slow down searches if those settings are too aggressive. Also check the queries launched on the new searcher event.
You can do this with Lucene easily. Split the index into multiple parts (or, to be precise, create "smaller" parts while building the index). Create a searcher for each part and store a reference to it. You can then create a MultiSearcher on top of these individual searchers.
Now, there will be only one index that will get the new documents. At regular intervals, add documents to this index, commit and re-open this searcher.
After the last index is updated, you can create a new multi-searcher again, using the previously opened searchers.
Thus, at any point, you will be re-opening only one searcher and that will be quite fast.
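The idea can be modelled in a few lines. This is a toy in-memory sketch, not the Lucene API: the Searcher and MultiSearcher classes below are hypothetical stand-ins, and a "part" is just a dict of document id to text. It only shows that the searchers over the frozen parts are opened once and reused, while only the searcher over the single live part is reopened after a commit.

```python
class Searcher:
    """Immutable point-in-time view of one index part (toy stand-in)."""

    def __init__(self, docs):
        self._docs = dict(docs)  # snapshot taken when the searcher is opened

    def search(self, term):
        return sorted(doc_id for doc_id, text in self._docs.items()
                      if term in text.split())


class MultiSearcher:
    """Fans a query out to several searchers and combines the hits."""

    def __init__(self, searchers):
        self._searchers = list(searchers)

    def search(self, term):
        hits = []
        for searcher in self._searchers:
            hits.extend(searcher.search(term))
        return sorted(hits)


# Frozen parts never change; the live part is the only one receiving documents.
frozen_parts = [{1: "red apple", 2: "green pear"}, {3: "red plum"}]
live_part = {4: "yellow banana"}

frozen_searchers = [Searcher(part) for part in frozen_parts]  # opened once
multi = MultiSearcher(frozen_searchers + [Searcher(live_part)])

live_part[5] = "red cherry"        # new document added to the live part
before = multi.search("red")       # old view does not see document 5 yet

# Reopen only the live searcher and reuse the frozen ones.
multi = MultiSearcher(frozen_searchers + [Searcher(live_part)])
after = multi.search("red")
```

The cost of each refresh is then proportional to the size of the live part rather than to the size of the whole index.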
Check out Zoie (http://code.google.com/p/zoie/), a wrapper around Lucene that makes it real time. The code was donated by LinkedIn.
^^ I do this with plain Lucene (no Solr), and it works really nicely. However, I am not sure whether there is a Solr way to do that at the moment. Twitter recently went with Lucene for searching and has effectively real-time search by simply writing to the index on every update. Their index resides completely in memory, so updating and reading the index costs next to nothing and happens instantly. A Lucene index can always be read while it is being written to, as long as there is only one writer at a time.
Check out this wiki page

Index replication and Load balancing

I am using the Lucene API in my web portal, which is going to have thousands of concurrent users.
Our web server will call the Lucene API, which will sit on an app server. We plan to use 2 app servers for load balancing.
Given this, what should our strategy be for replicating the Lucene indexes to the 2nd app server? Any tips, please?
You could use Solr, which has built-in replication. This is possibly the best and easiest solution, since it would probably take quite a lot of work to implement your own replication scheme.
That said, I'm about to do exactly that myself for a project I'm working on. The difference is that, since we're using PHP for the frontend, we've implemented Lucene in a socket server that accepts queries and returns a list of DB primary keys. My plan is to push changes to the server and store them in a queue, where I'll first write them into the memory index and then flush the memory index to disk when the load is low enough.
Still, it's a complex thing to do and I'm set on doing quite a lot of work before we have a stable final solution that's reliable enough.
From experience, Lucene should have no problem scaling to thousands of users. That said, if you're only using your second app server for load balancing and not for failover, you should be fine hosting Lucene on only one of those servers and accessing it from the second server via NFS (if you have a Unix environment) or a shared directory (in a Windows environment).
Again, this depends on your specific situation. If you're talking about having millions of documents (5 million or more) in your index and needing your Lucene index to support failover, you may want to look into Solr or Katta.
We are working on a similar implementation to what you are describing as a proof of concept. What we see as an end-product for us consists of three separate servers to accomplish this.
There is a "publication" server, that is responsible for generating the indices that will be used. There is a service implementation that handles the workflows used to build these indices, as well as being able to signal completion (a custom management API exposed via WCF web services).
There are two "site-facing" Lucene.NET servers. Access to the API is provided to the site via WCF services. They sit behind a physical load balancer and periodically "ping" the publication server to see whether there is a more current set of indices than the one currently running. If there is, the server requests a lock from the publication server and updates its local indices by initiating a transfer to a local "incoming" folder. Once the files are there, it is just a matter of suspending the searcher while the index is attached. The server then releases its lock, and the other server is free to do the same.
Like I said, we are only approaching the proof of concept stage with this, as a replacement for our current solution, which is a load balanced Endeca cluster. The size of the indices and the amount of time it will take to actually complete the tasks required are the larger questions that have yet to be proved out.
Just some random things that we are considering:
The downtime of a given server could be reduced if two local folders are used on each machine receiving data to achieve a "round-robin" approach.
We are looking to see if the load balancer allows programmatic access to have a node remove and add itself from the cluster. This would lessen the chance that a user experiences a hang if he/she accesses during an update.
We are looking at "request forwarding" in the event that cluster manipulation is not possible.
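The two-folder "round-robin" idea from the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not the Lucene.NET implementation described in this answer: the class name DoubleBufferedIndex, the folder names slot_a and slot_b, and the segments_1 file are all hypothetical, and a real server would reopen its searcher on the new folder at the point where the sketch just flips an integer.

```python
import os
import shutil
import tempfile


class DoubleBufferedIndex:
    """Keeps two index folders: one is served while the other is refreshed."""

    def __init__(self, root):
        self.slots = [os.path.join(root, "slot_a"), os.path.join(root, "slot_b")]
        for slot in self.slots:
            os.makedirs(slot, exist_ok=True)
        self.active = 0  # position of the folder currently being searched

    @property
    def active_dir(self):
        return self.slots[self.active]

    def publish(self, incoming_dir):
        """Copy a new index into the standby folder, then switch over to it."""
        standby = 1 - self.active
        target = self.slots[standby]
        shutil.rmtree(target)                  # discard the stale standby copy
        shutil.copytree(incoming_dir, target)  # searches continue on active_dir
        self.active = standby                  # the switch itself is instantaneous
        return target


# Hypothetical usage with a temporary folder standing in for the server disk.
root = tempfile.mkdtemp()
incoming = os.path.join(root, "incoming")
os.makedirs(incoming)
with open(os.path.join(incoming, "segments_1"), "w") as handle:
    handle.write("v1")

index = DoubleBufferedIndex(root)
first = index.active_dir
index.publish(incoming)
second = index.active_dir
```

With this layout the searcher only needs to be suspended for the switch, not for the duration of the file transfer.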
We looked at Solr, too. While a lot of it just works out of the box, we have some bench time to explore this path as a learning exercise: learning things like Lucene.NET, improving our WF and WCF skills, and implementing ASP.NET MVC for a management front end. Worst case, we go with something like Solr but have gained experience in skills we are looking to improve.
I create the indices in the filesystem on the publishing backend machines and replicate them over to the marketing machines.
That way every single load-balanced, failover-capable node has its own index without network latency.
The only drawback is that you shouldn't try to recreate the index within the replicated folder, as you'll have the lock file lying around on every node, blocking the IndexReader until your reindex has finished.