How to handle index files in a distributed Lucene cluster? - indexing

We are using Lucene in our application, and the index files are saved on the disk of the same server where the application runs.
The index files are almost 2 GB at the moment, and they may be updated from time to time; for example, when new data is inserted into the database, we may have to rebuild that part of the index and add it.
So far so good, since there is only one application server. Now we have to add another to make a cluster, so I wonder how to handle the index files.
BTW, our application should be platform independent, since our clients use different operating systems such as Linux, and some of them even use cloud platforms with different storage such as Amazon EFS or Azure Storage.
It seems I have two options:
1 Every server holds a copy of the index files, and the copies are kept synchronized with each other.
But the synchronization mechanism will depend on the OS, which we are trying to avoid. And I am not sure whether it will cause conflicts if two servers update the index files with different documents at the same time.
2 Make the index files shared.
Like 1), the file sharing mechanism is platform aware. Maybe saving them to the database is an alternative, but how about the performance? I have thought of using memcached to store them, but I have not found any examples.
How do you handle this kind of problem?

You could look into the Compass project. Compass allowed storing a Lucene index in a database and in distributed in-memory data grids such as GigaSpaces, Coherence and Terracotta. Unfortunately this project is outdated and the last version was released in 2009. But you can try to adapt it for your purpose.
Another option is to look at HdfsDirectory, which supports storing an index in an HDFS file system. I see only 5 classes in the package org.apache.solr.store.hdfs, so it should be relatively easy to adapt them to store the index in in-memory caches like memcached or Redis.
Also, I found a project on GitHub for RedisDirectory, but it is at an initial stage and the last commit was in 2012. I can recommend it only for reference.
Hope this helps you find the right solution.
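To make the "adapt a Directory to a cache" idea a bit more concrete, here is a minimal sketch in Python (not Lucene's Java API). It assumes the index files are write-once and that the cache behaves like a dictionary; the class name, key layout and chunk size are all made up for illustration. Files are split into chunks because caches such as memcached limit the size of a single value.

```python
class KeyValueDirectory:
    """Stores named index files as fixed-size chunks in a key-value store.

    Hypothetical illustration only: `store` is any dict-like object; in
    practice it would be a thin wrapper around a cache client.
    """

    def __init__(self, store, chunk_size=4096):
        self.store = store
        self.chunk_size = chunk_size

    def write_file(self, name, data):
        # Split the file into chunks so no single value exceeds chunk_size.
        chunks = [data[i:i + self.chunk_size]
                  for i in range(0, len(data), self.chunk_size)]
        for n, chunk in enumerate(chunks):
            self.store[f"{name}:chunk:{n}"] = chunk
        # Write the metadata last, so a reader never sees a half-written file.
        self.store[f"{name}:meta"] = (len(data), len(chunks))

    def file_length(self, name):
        return self.store[f"{name}:meta"][0]

    def read(self, name, offset, length):
        """Random-access read that touches only the chunks it needs."""
        total = self.file_length(name)
        end = min(offset + length, total)
        out = bytearray()
        pos = offset
        while pos < end:
            n, start = divmod(pos, self.chunk_size)
            chunk = self.store[f"{name}:chunk:{n}"]
            take = min(end - pos, self.chunk_size - start)
            out += chunk[start:start + take]
            pos += take
        return bytes(out)

    def list_all(self):
        suffix = ":meta"
        return sorted(k[:-len(suffix)] for k in self.store if k.endswith(suffix))
```

A real adaptation would also need locking and deletion, and whether the performance is acceptable is exactly the open question raised above.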

Related

archiving some redis data to disk

I have been using redis a lot lately, and really am loving it. I am mostly familiar with persistence (rdb and aof). I do have one concern. I would like to be able to selectively "archive" some of my data to disk (or cheaper storage) once it is no longer important. I don't really want to delete it because it might be valuable at some point.
All of my keys are named id_<id>_<someattribute>. So when I am done with id 4, I want to "archive" all keys that match id_4_*. I can view them quite easily with the command line, but I can't do anything with them, per se. I have quite a bit of data (very large bitmaps) associated with this data set, and frankly I can't afford the space once the id is no longer relevant or important.
If this were MySQL, I would have my different tables and would very easily just dump one to a .sql file and then drop the table. The actual .sql file isn't directly useful to me, but I could reimport the data if/when I need it. Or maybe I have two MySQL databases and I want to move one table to the other. Are there redis equivalents of these processes? Is there some way to make an rdb or aof file that is a subset of the data?
Any help or input on this matter would be appreciated! Thanks!
@Hoseong Hwang recently asked what I did, so I'm posting what I ended up doing.
It was really quite simple, actually. I benefited from the fact that my key space is segmented by user. All of my keys had the structure user_<USERID>_<OTHERVALUES>. My archival needs were on a per-user basis: some users' data no longer needed to be kept in redis.
So, I started up another instance of redis-server, on another port locally (6380?) or on another machine; it makes no difference. Then I wrote a short script that basically just called KEYS user_<USERID>_* (I understand the blocking nature of KEYS; my key space is so small that it didn't matter, and you can use SCAN if that is an issue for you). Then, for each key, I used MIGRATE to move it to the new redis-server instance. After they were all done, I ran SAVE to ensure that the rdb file for that instance was up to date. And now I have that rdb, which contains just the content that I wanted to archive. I then terminated the temporary redis-server and the memory was reclaimed.
Now, keep that rdb file somewhere cheap and safe. If you ever need it again, doing the reverse of the process above to get those keys back into your main redis-server would be fairly straightforward.
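For anyone who wants to script this, here is a rough sketch of the steps above in Python. It is not my original script. It assumes clients that expose redis-py style scan_iter, migrate and save methods; the function name, host, port and timeout are placeholders.

```python
def archive_keys(source, archive, pattern, archive_host, archive_port,
                 timeout_ms=5000):
    """Move every key matching `pattern` to the archive instance, then snapshot it.

    `source` and `archive` are redis-py style clients (assumed interface:
    scan_iter / migrate on the source, save on the archive).
    """
    moved = 0
    # Collect the SCAN results first, because MIGRATE removes keys as we go.
    for key in list(source.scan_iter(match=pattern)):
        # MIGRATE copies the key to the target (db 0 here) and, by default,
        # deletes it from the source.
        source.migrate(archive_host, archive_port, key, 0, timeout_ms)
        moved += 1
    # SAVE forces the archive instance to write its rdb file.
    archive.save()
    return moved
```

With redis-py this would be called with two redis.Redis clients, one for the main server and one for the temporary instance, for example archive_keys(main, temp, "user_4_*", "localhost", 6380). After it returns, copy the temporary instance's rdb file away and shut that instance down.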
Instead of trying to extract data from a live Redis instance for archiving purposes, my suggestion would be to extract the data from a dump file.
Run a bgsave command to generate a dump, and then use redis-rdb-tools to extract the keys you are interested in - you can easily get the result as a json file.
See https://github.com/sripathikrishnan/redis-rdb-tools
You can keep the json data in flat files, or store it in a relational database or a document store if you need it to be indexed for retrieval.
A few suggestions for you...
I would like to be able to selectively "archive" some of my data to
disk (or cheaper storage) once it is no longer important. I don't
really want to delete it because it might be valuable at some point.
If such data is that valuable, use a traditional database for storage. Despite redis supporting snap-shotting to disk and AOF logs, you should view it as mostly volatile storage. The primary use case for redis is reducing latency, not persistence of valuable data.
So when I am done with id 4, I want to "archive" all all keys that
match id_4_*
What constitutes done? You need to ask yourself this question; does it mean after 1 day the data can fall out of redis? If so, just use TTL and expiration to let redis remove the object from memory. If you need it again, fall back to the database and pull the object back into redis. That first client will take the hit of pulling from the db, but subsequent requests will be cached. If done means something not associated with a specific duration, then you'll have to remove items from redis manually to conserve memory space.
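The TTL-plus-fallback pattern looks roughly like the sketch below. To keep it self-contained it uses a tiny in-memory stand-in instead of a real redis client, and load_from_db is a hypothetical callback for your database lookup; with redis you would set the expiry when writing the key instead.

```python
import time


class TTLCache:
    """In-memory stand-in for a cache whose entries expire after `ttl` seconds."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._items = {}

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            # Expired: drop it, just as the cache would on its own.
            del self._items[key]
            return None
        return value

    def set(self, key, value):
        self._items[key] = (value, self.clock() + self.ttl)


def fetch(key, cache, load_from_db):
    """Return the cached value, falling back to the database on a miss."""
    value = cache.get(key)
    if value is None:
        # The first caller after expiry pays for the database round trip.
        value = load_from_db(key)
        cache.set(key, value)
    return value
```

The database stays the source of truth, so nothing is lost when a key expires; it only costs one slower request.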
If this were mysql, I would have my different tables and would very
easily just dump it to a .sql file and then drop the table. The actual
.sql file isn't directly useful to me, but I could reimport the data
if/when I need it.
We do the same at my firm. Important data is imported into redis from the RDBMS by an on-demand job. We don't drop tables; we just selectively import data from the database into redis. Nothing wrong with that.
Is there someway to make an rdb or aof file that is a subset of the
data?
I don't believe there is a way to do selective archiving; it's either all or none.
IMO, spend more time playing with redis. I highly recommend leveraging out-of-box features instead of reinventing and/or over-engineering solutions to suit your needs.
Hope that helps!...

Search index replication

I am developing an application that requires a CLucene index to be created in a desktop application, but replicated for (read-only) searching on iOS devices and efficiently updated when the index is updated.
Aside from simply re-downloading the entire index whenever it changes, what are my options here? CLucene does not support replication on its own, but Solr (which is built on top of Lucene) does, so it's clearly possible. Does anybody know how Solr does this and how one would approach implementing similar functionality?
If this is not possible, are there any (non-Java-based) full-text search implementations that would meet my needs better than CLucene?
Querying the desktop application is not an option - the mobile applications must be able to search offline.
A Lucene index is based on write-once, read-many segments. This means that when new documents have been committed to a Lucene index, all you need to retrieve is:
the new segments,
the merged segments (old segments which have been merged into a single segment, if any),
the segments file (which stores information about the current segments).
Once all these new files have been downloaded, the old segment files that have been merged can be safely removed. To take the changes into account, just reopen an IndexReader.
Solr has a Java implementation to do this, but given how simple it is, using a synchronization tool such as rsync would do the trick too. By the way, this is how Solr replication worked before Solr 1.4; you can still find some documentation on the wiki about rsync replication.
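If you end up writing the sync step yourself, the planning part could look like the following Python sketch. It is not what Solr does internally. It assumes you can get a name-to-size listing of both index directories, and the function name and the "fetch segments files last" ordering are my own choices for illustration.

```python
def plan_segment_sync(local, remote):
    """Plan how to bring a local index copy in line with the remote one.

    `local` and `remote` map file name -> size in bytes.
    Returns (to_download, to_delete): download in the given order, then delete.
    """
    # A file is needed if it is missing locally or its size differs.
    needed = [name for name, size in remote.items() if local.get(name) != size]
    # Fetch segment data first and the "segments*" files last, so the local
    # copy never points at segment data that has not arrived yet.
    needed.sort(key=lambda name: (name.startswith("segments"), name))
    # Anything the remote no longer lists was merged away and can be removed.
    to_delete = sorted(name for name in local if name not in remote)
    return needed, to_delete
```

For example, if the desktop merged _0 and _1 into _2, the plan downloads _2 and the new segments file and then deletes the old files. After applying it, reopen the IndexReader as described above.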

What kind of server for operational transform operations?

I am hoping to use the Diff-Match-Patch algorithms available from Google as part of the Google MobWrite real-time collaborative text editor protocol in order to embed a real-time collaborative text editor in my program.
Anyway, I was wondering what might be the most efficient way of storing "global" copies of each document that users are editing. I would like to have each document stored on a server that is not local to any user, so that each time a user performs an "operation" (delete, insert, paste, cut) the diff is computed between their copy and the server's copy and then patched, etc. If you know the Google MobWrite protocol you probably understand what I am saying.
Should the server's text be stored as a file that is changed, or inside an SQL database as a long string, or what? Should I be using WebSockets to communicate with the server? I am honestly kind of an amateur when it comes to this but am generally a fast learner. Does anyone have any tips or resources I could follow? Thanks a lot.
This would be a big project to tackle from scratch, so I suggest you use one of the many open source projects in this area. For example, Etherpad:
https://code.google.com/p/etherpad/
MobWrite uses the Differential Synchronization technique, which is totally different from the Operational Transformation technique.
Differential Synchronization is supposed to have a communication cycle that always starts from the client (the browser), which means you can't use WebSockets to send diffs from the server directly. The browser needs to poll the server frequently to get the updates (let's say every 2 seconds), otherwise your shadow copies will get out of sync.
For storing your shadow copies while the user is active, you can use whatever you want, but it's better to use an in-memory DB (Redis), since you need fast access to do the diffs and patches. And when the user leaves the session you don't need their copy anymore. But if you need persistence in your app, you should persist only the server copy, not the shadow copy (shadow copies are only used to find the diffs); for that you can use MySQL or whatever you like.
But for the Operational Transformation technique there are some nice libs out there:
NodeJS:
ShareJS (sharejs.org): supports all operations for JSON.
RacerJS: synchronization model built on top of ShareJS
DerbyJS: Complete framework that uses RacerJS as its model.
OpenCoweb (opencoweb.org):
The server is either Java or Python, the client is built with Dojo

Database of images and text

background:
I'm in the design phase of building an app.
I want the app to display text and images; the problem is that I will have A LOT of them, hundreds to thousands.
This is my largest app so far, and I am unsure how to handle all the data.
The question:
What would be the best way to store and access these images and text?
Would I use a formal database approach like SQL?
Or would it be better to navigate files/folders e.g. dropping all the files in res/drawable?
potentially useful facts:
The database will be stored and accessed natively so it can be accessed off-line.
The user will not be adding to the database in any way, only accessing the data.
The database will be updated every 6 months.
The application 'page' will display 1-5 images along with several blocks of text.
Concept:
The app will be like a recipe app: the user will pick some parameters (e.g. ingredients, type, diet) and then select a recipe. Several images and blocks of text will then be displayed, showing and detailing the process for that recipe.
I apologize if this is repeated but I didn't see a specific answer for my purposes.
The "Best" approach will depend on the functionality of the database server in question.
Generally, you should store the images "In" the database until that becomes a performance issue. Once you start storing images "Outside" of the database you will have to handle all the issues that are normally taken care of by the database: disk space management, orphan records, file name conflicts, folder file limits, to name just a few. Depending on your situation these may be big issues or they may be nothing to worry about.
I've seen several applications where images (or attachments) were kept "Outside" the database, and in each case it was done poorly. There are just so many issues to handle, and most developers don't even think of half of them. In many cases the performance of storing the images "In" the database was acceptable, but the developers decided against it because they just knew it would not perform well.
If you're using SQL Server 2008, the FILESTREAM storage option is ideal for your case. It stores the binary files outside of the database, but the column behaves as a normal field. You are also able to read/write the files using a stream instead of getting/setting the whole file as a byte array (as when using plain varbinary(max)).
If you don't have this functionality in your database, I would recommend storing the images outside of the DB.
It's probably a better idea to use a file-based approach for deployed static resources.
At the very least because taking a dependency on the file system is typically easier to manage than taking a dependency on a DB.
Also, this line indicates some sort of non-web client:
"The database will be stored and accessed natively so it can be accessed off-line."
This means that if you go with the DB approach you'll have a couple of other interesting problems:
Deployment
Depending on your target platform, deploying a DB can be a real bear. What happens if they already have the engine, but it's a different version?
Resources
Is your DB going to be client/server based (like MySQL, SQL Server, etc.)? If so, then your app now has to manage the current state of that process. If not, then you'll be using a file-based DB such as SQLite or MS Access, at which point I would question whether using a static DB is worth doing at all.
One final note: there's nothing stopping your content production environment from using a DB. It's quite common for content producers to maintain a database for their content, which you will later use to produce the files for publishing/deployment.

Index replication and Load balancing

I am using the Lucene API in my web portal, which is going to have thousands of concurrent users.
Our web server will call the Lucene API, which will be sitting on an app server. We plan to use 2 app servers for load balancing.
Given this, what should be our strategy for replicating the Lucene indexes on the 2nd app server? Any tips, please?
You could use solr, which contains built in replication. This is possibly the best and easiest solution, since it probably would take quite a lot of work to implement your own replication scheme.
That said, I'm about to do exactly that myself, for a project I'm working on. The difference is that since we're using PHP for the frontend, we've implemented Lucene in a socket server that accepts queries and returns a list of db primary keys. My plan is to push changes to the server and store them in a queue, where I'll first store them in the memory index, and then flush the memory index to disk when the load is low enough.
Still, it's a complex thing to do and I'm set on doing quite a lot of work before we have a stable final solution that's reliable enough.
From experience, Lucene should have no problem scaling to thousands of users. That said, if you're only using your second app server for load balancing and not for failover situations, you should be fine hosting Lucene on only one of those servers and accessing it via NFS (if you have a Unix environment) or a shared directory (in a Windows environment) from the second server.
Again, this is dependent on your specific situation. If you're talking about having millions (5 or more) of documents in your index and needing your lucene index to be failoverable, you may want to look into Solr or Katta.
We are working on a similar implementation to what you are describing as a proof of concept. What we see as an end-product for us consists of three separate servers to accomplish this.
There is a "publication" server, that is responsible for generating the indices that will be used. There is a service implementation that handles the workflows used to build these indices, as well as being able to signal completion (a custom management API exposed via WCF web services).
There are two "site-facing" Lucene.NET servers. Access to the API is provided via WCF services to the site. They sit behind a physical load balancer and will periodically "ping" the publication server to see if there is a more current set of indices than what is currently running. If there is, the server requests a lock from the publication server and updates its local indices by initiating a transfer to a local "incoming" folder. Once the files are there, it is just a matter of suspending the searcher while the index is attached. It then releases its lock and the other server is free to do the same.
Like I said, we are only approaching the proof of concept stage with this, as a replacement for our current solution, which is a load balanced Endeca cluster. The size of the indices and the amount of time it will take to actually complete the tasks required are the larger questions that have yet to be proved out.
Just some random things that we are considering:
The downtime of a given server could be reduced if two local folders are used on each machine receiving data to achieve a "round-robin" approach.
We are looking to see if the load balancer allows programmatic access to have a node remove and add itself from the cluster. This would lessen the chance that a user experiences a hang if he/she accesses during an update.
We are looking at "request forwarding" in the event that cluster manipulation is not possible.
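As a rough illustration of the two-folder "round-robin" idea from the list above, here is a small Python sketch (our real code would be .NET, and none of this is from our proof of concept). The class name, folder names and integer version numbers are invented for the example.

```python
import shutil
from pathlib import Path


class TwoSlotIndex:
    """Keeps two local index folders and serves from whichever is active."""

    def __init__(self, root):
        self.slots = [Path(root) / "slot_a", Path(root) / "slot_b"]
        self.active = 0
        self.version = None
        for slot in self.slots:
            slot.mkdir(parents=True, exist_ok=True)

    def active_dir(self):
        return self.slots[self.active]

    def update(self, published_dir, published_version):
        """Copy a newer published index into the standby slot, then switch."""
        if self.version is not None and published_version <= self.version:
            return False  # nothing newer to fetch
        standby = 1 - self.active
        target = self.slots[standby]
        # Searches keep using the active slot while the standby is rebuilt.
        shutil.rmtree(target)
        shutil.copytree(published_dir, target)
        # Flip the pointer; the searcher is reopened on active_dir() afterwards.
        self.active = standby
        self.version = published_version
        return True
```

The only moment searches are affected is the pointer flip and reopening the searcher, which is the downtime reduction we are hoping for.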
We looked at solr, too. While a lot of it just works out of the box, we have some bench time to explore this path as a learning exercise - learning things like Lucene.NET, improving our WF and WCF skills, and implementing ASP.NET MVC for a management front-end. Worst case scenario, we go with something like solr, but have gained experience in some skills we are looking to improve on.
I create the indices in the filesystem on the publishing backend machines and replicate them over to the marketing nodes.
That way every single load-balanced, failover-capable node has its own index without network latency.
The only drawback is that you shouldn't try to recreate the index within the replicated folder, as you'll have the lock file lying around on every node, blocking the IndexReader until your reindex has finished.