Solr approaches to re-indexing large document corpus - indexing

We are looking for some recommendations around systematically re-indexing in Solr an ever growing corpus of documents (tens of millions now, hundreds of millions in than a year) without taking the currently running index down. Re-indexing is needed on a periodic bases because:
New features are introduced around
searching the existing corpus that
require additional schema fields
which we can't always anticipate in
advance
The corpus is indexed across multiple
shards. When it grows past a certain
threshold, we need to create more
shards and re-balance documents
evenly across all of them (which
SolrCloud does not seem to yet
support).
The current index receives very frequent updates and additions, which need to be available for search within minutes. Therefore, approaches where the corpus is re-indexed in batch offline don't really work as by the time the batch is finished, new documents will have been made available.
The approaches we are looking into at the moment are:
Create a new cluster of shards and
batch re-index there while the old
cluster is still available for
searching. New documents that are not
part of the re-indexed batch are sent
to both the old cluster and the new
cluster. When ready to switch, point
the load balancer to the new cluster.
Use CoreAdmin: spawn a new core per
shard and send the re-indexed batch
to the new cores. New documents that
are not part of the re-indexed batch
are sent to both the old cores and
the new cores. When ready to switch,
use CoreAdmin to dynamically swap
cores.
We'd appreciate if folks can either confirm or poke holes in either or all these approaches. Is one more appropriate than the other? Or are we completely off? Thank you in advance.

This may not be applicable to you guys, but I'll offer my approach to this problem.
Our Solr setup is currently a single core. We'll be adding more cores in the future, but the overwhelming majority of the data is written to a single core.
With this in mind, sharding wasn't really applicable to us. I looked into distributed searches - cutting up the data and having different slices of it running on different servers. This, to me, just seemed to complicate things too much. It would make backup/restores more difficult and you end up losing out on certain features when performing distributed searches.
The approach we ended up going with was a very simple clustered master/slave setup.
Each cluster consists of a master database, and two solr slaves that are load balanced. All new data is written to the master database and the slaves are configured to sync new data every 5 minutes. Under normal circumstances this is a very nice setup. Re-indexing operations occur on the master, and while this is happening the slaves can still be read from.
When a major re-indexing operation is happening, I remove one slave from the load balancer and turn off polling on the other. So, the customer facing Solr database is now not syncing with the master, while the other is being updated. Once the re-index is complete and the offline slave database is in sync, I add it back to the load balancer, remove the other slave database from the load balancer, and re-configure it to sync with the master.
So far this has worked very well. We currently have around 5 million documents in our database and this number will scale much higher across multiple clusters.
Hope this helps!

Related

which system should I choose to make it easier to transfer it to a cluster later?

we have a small project, and we want to start using a non-clustered version of either keydb or redis. I've read a lot of reviews. I would like to hear more. Which system will be easier to turn into a cluster in the future, and maybe transfer to kubernetes?
Regarding scaling/simplicity, I would point out both Redis and KeyDB are able to turn into sharded clusters, or add replica nodes, KeyDB also offers active replication (some limits, but avoids sentinel). Both are also compatible with RESP protocol so can use any Redis client.
A few points relevant to both KeyDB and Redis when trying to simplify scaling in the future (ie. moving to a sharded data set):
Ensure you use a client that is compatible with cluster-mode enabled as not all are
Be careful of how you use transactions. If you rely heavily on transactions that hit multiple keys, you may need to rethink this when spreading data across multiple shards.
The point above also applies to certain commands that can hit multiple shards such as SCAN, KEYS, batch requests (ie. MGET), SUNION, etc. Planning how you structure your data may make this easier when you decide to scale up.

Does Redis persist data?

I understand that Redis serves all data from memory, but does it persist as well across server reboot so that when the server reboots it reads into memory all the data from disk. Or is it always a blank store which is only to store data while apps are running with no persistence?
I suggest you read about this on http://redis.io/topics/persistence . Basically you lose the guaranteed persistence when you increase performance by using only in-memory storing. Imagine a scenario where you INSERT into memory, but before it gets persisted to disk lose power. There will be data loss.
Redis supports so-called "snapshots". This means that it will do a complete copy of whats in memory at some points in time (e.g. every full hour). When you lose power between two snapshots, you will lose the data from the time between the last snapshot and the crash (doesn't have to be a power outage..). Redis trades data safety versus performance, like most NoSQL-DBs do.
Most NoSQL-databases follow a concept of replication among multiple nodes to minimize this risk. Redis is considered more a speedy cache instead of a database that guarantees data consistency. Therefore its use cases typically differ from those of real databases:
You can, for example, store sessions, performance counters or whatever in it with unmatched performance and no real loss in case of a crash. But processing orders/purchase histories and so on is considered a job for traditional databases.
Redis server saves all its data to HDD from time to time, thus providing some level of persistence.
It saves data in one of the following cases:
automatically from time to time
when you manually call BGSAVE command
when redis is shutting down
But data in redis is not really persistent, because:
crash of redis process means losing all changes since last save
BGSAVE operation can only be performed if you have enough free RAM (the amount of extra RAM is equal to the size of redis DB)
N.B.: BGSAVE RAM requirement is a real problem, because redis continues to work up until there is no more RAM to run in, but it stops saving data to HDD much earlier (at approx. 50% of RAM).
For more information see Redis Persistence.
It is a matter of configuration. You can have none, partial or full persistence of your data on Redis. The best decision will be driven by the project's technical and business needs.
According to the Redis documentation about persistence you can set up your instance to save data into disk from time to time or on each query, in a nutshell. They provide two strategies/methods AOF and RDB (read the documentation to see details about then), you can use each one alone or together.
If you want a "SQL like persistence", they have said:
The general indication is that you should use both persistence methods if you want a degree of data safety comparable to what PostgreSQL can provide you.
The answer is generally yes, however a fuller answer really depends on what type of data you're trying to store. In general, the more complete short answer is:
Redis isn't the best fit for persistent storage as it's mainly performance focused
Redis is really more suitable for reliable in-memory storage/cacheing of current state data, particularly for allowing scalability by providing a central source for data used across multiple clients/servers
Having said this, by default Redis will persist data snapshots at a periodic interval (apparently this is every 1 minute, but I haven't verified this - this is described by the article below, which is a good basic intro):
http://qnimate.com/redis-permanent-storage/
TL;DR
From the official docs:
RDB persistence [the default] performs point-in-time snapshots of your dataset at specified intervals.
AOF persistence [needs to be explicitly configured] logs every write operation received by the server, that will be played again at server startup, reconstructing the
original dataset.
Redis must be explicitly configured for AOF persistence, if this is required, and this will result in a performance penalty as well as growing logs. It may suffice for relatively reliable persistence of a limited amount of data flow.
You can choose no persistence at all.Better performance but all the data lose when Redis shutting down.
Redis has two persistence mechanisms: RDB and AOF.RDB uses a scheduler global snapshooting and AOF writes update to an apappend-only log file similar to MySql.
You can use one of them or both.When Redis reboots,it constructes data from reading the RDB file or AOF file.
All the answers in this thread are talking about the possibility of redis to persist the data: https://redis.io/topics/persistence (Using AOF + after every write (change)).
It's a great link to get you started, but it is defenently not showing you the full picture.
Can/Should You Really Persist Unrecoverable Data/State On Redis?
Redis docs does not talk about:
Which redis providers support this (AOF + after every write) option:
Almost none of them - redis labs on the cloud does NOT provide this option. You may need to buy the on-premise version of redis-labs to support it. As not all companies are willing to go on-premise, then they will have a problem.
Other Redis Providers does not specify if they support this option at all. AWS Cache, Aiven,...
AOF + after every write - This option is slow. you will have to test it your self on your production hardware to see if it fits your requirements.
Redis enterpice provide this option and from this link: https://redislabs.com/blog/your-cloud-cant-do-that-0-5m-ops-acid-1msec-latency/ let's see some banchmarks:
1x x1.16xlarge instance on AWS - They could not achieve less than 2ms latency:
where latency was measured from the time the first byte of the request arrived at the cluster until the first byte of the ‘write’ response was sent back to the client
They had additional banchmarking on a much better harddisk (Dell-EMC VMAX) which results < 1ms operation latency (!!) and from 70K ops/sec (write intensive test) to 660K ops/sec (read intensive test). Pretty impresive!!!
But it defenetly required a (very) skilled devops to help you create this infrastructure and maintain it over time.
One could (falsy) argue that if you have a cluster of redis nodes (with replicas), now you have full persistency. this is false claim:
All DBs (sql,non-sql,redis,...) have the same problem - For example, running set x 1 on node1, how much time it takes for this (or any) change to be made in all the other nodes. So additional reads will receive the same output. well, it depends on alot of fuctors and configurations.
It is a nightmare to deal with inconsistency of a value of a key in multiple nodes (any DB type). You can read more about it from Redis Author (antirez): http://antirez.com/news/66. Here is a short example of the actual ngihtmare of storing a state in redis (+ a solution - WAIT command to know how much other redis nodes received the latest change change):
def save_payment(payment_id)
redis.rpush(payment_id,”in progress”) # Return false on exception
if redis.wait(3,1000) >= 3 then
redis.rpush(payment_id,”confirmed”) # Return false on exception
if redis.wait(3,1000) >= 3 then
return true
else
redis.rpush(payment_id,”cancelled”)
return false
end
else
return false
end
The above example is not suffeint and has a real problem of knowing in advance how much nodes there actually are (and alive) at every moment.
Other DBs will have the same problem as well. Maybe they have better APIs but the problem still exists.
As far as I know, alot of applications are not even aware of this problem.
All in all, picking more dbs nodes is not a one click configuration. It involves alot more.
To conclude this research, what to do depends on:
How much devs your team has (so this task won't slow you down)?
Do you have a skilled devops?
What is the distributed-system skills in your team?
Money to buy hardware?
Time to invest in the solution?
And probably more...
Many Not well-informed and relatively new users think that Redis is a cache only and NOT an ideal choice for Reliable Persistence.
The reality is that the lines between DB, Cache (and many more types) are blurred nowadays.
It's all configurable and as users/engineers we have choices to configure it as a cache, as a DB (and even as a hybrid).
Each choice comes with benefits and costs. And this is NOT an exception for Redis but all well-known Distributed systems provide options to configure different aspects (Persistence, Availability, Consistency, etc). So, if you configure Redis in default mode hoping that it will magically give you highly reliable persistence then it's team/engineer fault (and NOT that of Redis).
I have discussed these aspects in more detail on my blog here.
Also, here is a link from Redis itself.

Does Redis Replication help in load balancing?

We keep continuously writing and updating events into redis and so when we ever we want to read data(which is a lot of data , upwards of for 500000 key value pairs), redis has performance issues. So, we decided to get the data via multiple threads. But because of single instance redis , the performance issues persisted .Will replication help us? As in, by making master and slave redis's , will our reads of the events be distributed to the slaves . We are thinking of making the master write only.
Any other suggestion for performance improvements?
(one of) Replication's declared purposes is to help in scaling reads, so yes to the topic.
Note that after you've set up the slave, you'll need to specify its address for your reader threads and processes. Make sure that you start with read-slaves if you don't have a clear separation between writers and readers.
If a single slave isn't enough, you can actually add more slaves. If you add them directly to the master, you'll get fresher reads but there'll eventually be a performance impact on the master. Alternatively, replication chaining is a great solution for most use cases, i.e. 1 master -> 1 slave -> n slaves.
There are probably other ways to scale Redis for your use case (e.g. clustering), but that really depends on what you're trying/wanting to do :)

Is Infinispan an improvement of JBoss Cache?

According to this link which belongs to JBoss documentation, I understood that Infinispan is a better product than JBoss Cache and kind of improvement the reason for which they recommend to migrate from JBoss Cache to Infinispan, that is supported by JBoss as well. Am I right in what I understood? Otherwise, are there differences?
One more question : Talking about replication and distribution, can any one of them be better than the other according to the need?
Thank you
Question:
Talking about replication and distribution, can any one of them be better than the other according to the need?
Answer:
I am taking a reference directly from Clustering modes - Infinispan
Distributed:
Number of copies represents the tradeoff between performance and durability of data
The more copies you maintain, the lower performance will be, but also the lower the risk of losing data due to server outages
use of a consistent hash algorithm to determine where in a cluster entries should be stored
No need to replicate data on each node that takes more time than just communicating hash code
Best suitable if no of nodes are high
Best suitable if size of data stored in cache is high.
Replicated:
Entries added to any of these cache instances will be replicated to all other cache instances in the cluster
This clustered mode provides a quick and easy way to share state across a cluster
replication practically only performs well in small clusters (under 10 servers), due to the number of replication messages that need to happen - as the cluster size increases
Practical Experience:
I are using Infinispan cache in my running live application on Jboss server having 8 nodes. Initially I used replicated cache but it took much longer time to respond due to large size of data. Finally we come back to Distributed and now its working fine.
Use replicated or distributed cache only for data specific to any user session. If data is common regardless of any user than prefer Local cache that's created separately for each node.

how to keep visited urls and maintain the job queue when writing a crawler

I'm writing a crawler. I keep the visited urls in redis set,and maintain the job queue using redis list. As data grows,memory is used up, my memory is 4G. How to maintain these without redis? I have no idea,if I store these in files,they also need to be in memory.
If I use a mysql to store that,I think it maybe much slower than redis.
I have 5 machines with 4G memory,if anyone has some material to set up a redis cluster,it also helps a lot. I have some material to set up a cluster to be failover ,but what I need is to set a load balanced cluster.
thx
If you are just doing the basic operations of adding/removing from sets and lists, take a look at twemproxy/nutcracker. With it you can use all of the nodes.
Regarding your usage pattern itself, are you removing or expiring jobs and URLs? How much repetition is there in the system? For example, are you repeatedly crawling the same URLs? If so, perhaps you only need a mapping of URLs to their last crawl time, and instead of a job queue you pull URLs that are new or outside a given window since their last run.
Without the details on how your crawler actually runs or interacts with Redis, that is about what I can offer. If memory grows continually, it likely means you aren't cleaning up the DB.