How to restart redis cluster node after failure - redis

I am experimenting with Redis Cluster, following the documentation, and one thing confuses me.
Initial Configuration
35edd8052caf37149b4f9cc800fcd2ba60018ab5 127.0.0.1:30005@40005 slave bd76f831d34ed265a964e5f5caff2c0807c96b85 0 1524390407263 5 connected
d9e92c606f1fddebf84bbbc6f76485e418647683 127.0.0.1:30003@40003 master - 0 1524390407263 8 connected 10923-16383
edf62838d10b99018a0ecb7698c1b9ac52aa3bbb 127.0.0.1:30002@40002 myself,master - 0 1524390407000 2 connected 5461-10922
bd76f831d34ed265a964e5f5caff2c0807c96b85 127.0.0.1:30001@40001 master - 0 1524390407062 1 connected 0-5460
55a72ea5b4d0a77e2b18ca2b3f74b20d3550244c 127.0.0.1:30006@40006 slave edf62838d10b99018a0ecb7698c1b9ac52aa3bbb 0 1524390407562 6 connected
26788ce4523c95a93bd63907c1c75827fe61476a 127.0.0.1:30004@40004 slave d9e92c606f1fddebf84bbbc6f76485e418647683 0 1524390407263 8 connected
To test what happens when a master fails, I crashed one manually with the following command:
redis-cli -p 30001 debug segfault
Now the configuration looks like this (30001 has failed and 30005 has been promoted to master):
35edd8052caf37149b4f9cc800fcd2ba60018ab5 127.0.0.1:30005@40005 master - 0 1524390694964 9 connected 0-5460
d9e92c606f1fddebf84bbbc6f76485e418647683 127.0.0.1:30003@40003 master - 0 1524390695064 8 connected 10923-16383
edf62838d10b99018a0ecb7698c1b9ac52aa3bbb 127.0.0.1:30002@40002 myself,master - 0 1524390694000 2 connected 5461-10922
bd76f831d34ed265a964e5f5caff2c0807c96b85 127.0.0.1:30001@40001 master,fail - 1524390636966 1524390636165 1 disconnected
55a72ea5b4d0a77e2b18ca2b3f74b20d3550244c 127.0.0.1:30006@40006 slave edf62838d10b99018a0ecb7698c1b9ac52aa3bbb 0 1524390694964 6 connected
26788ce4523c95a93bd63907c1c75827fe61476a 127.0.0.1:30004@40004 slave d9e92c606f1fddebf84bbbc6f76485e418647683 0 1524390695164 8 connected
How can I add 30001 to the cluster again? And how can I start only that node?
I am following this document: https://redis.io/topics/cluster-tutorial
It contains the statement "I restarted the crashed instance so that it rejoins the cluster as a slave", but it does not say how to do that.

Creating a cluster using redis-trib.rb requires running Redis instances, each of which should be started with its own config file:
../redis-server redis.conf
where redis.conf contains the config for that node.
For instance:
port 7000
cluster-enabled yes
cluster-config-file nodes.conf
cluster-node-timeout 5000
appendonly yes
The redis cluster is created as below,
./redis-trib.rb create --replicas 1 host1:port1 host2:port2 host3:port3 host4:port4 host5:port5 host6:port6
The Ruby script assigns the master and slave roles among these instances, and each instance keeps a nodes.conf file (the one named in redis.conf) that holds the node information.
When you start the server again using ../redis-server redis.conf, it picks up the node information, such as its ID and its master/slave role, from nodes.conf and reconnects to the cluster.
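To check whether a restarted node has actually rejoined, it can help to read the node table programmatically. The following is only a minimal sketch: it parses text in the CLUSTER NODES format shown in the question, and the names ClusterNode and parse_cluster_nodes are made up for illustration. It assumes you have already captured the output as a string, for example from redis-cli.

```python
from typing import List, NamedTuple, Optional


class ClusterNode(NamedTuple):
    node_id: str
    address: str               # ip:port, optionally followed by @cluster-bus-port
    flags: frozenset           # e.g. {"master", "fail"}
    master_id: Optional[str]   # None when the field is "-" (the node is a master)
    link_state: str            # "connected" or "disconnected"
    slots: tuple               # slot ranges served; empty for slaves


def parse_cluster_nodes(text: str) -> List[ClusterNode]:
    """Parse CLUSTER NODES style output into ClusterNode records."""
    nodes = []
    for line in text.strip().splitlines():
        parts = line.split()
        if len(parts) < 8:
            continue  # skip blank or truncated lines
        nodes.append(ClusterNode(
            node_id=parts[0],
            address=parts[1],
            flags=frozenset(parts[2].split(",")),
            master_id=None if parts[3] == "-" else parts[3],
            link_state=parts[7],
            slots=tuple(parts[8:]),
        ))
    return nodes


# Two lines taken from the "after the crash" output in the question.
SAMPLE = """\
35edd8052caf37149b4f9cc800fcd2ba60018ab5 127.0.0.1:30005@40005 master - 0 1524390694964 9 connected 0-5460
bd76f831d34ed265a964e5f5caff2c0807c96b85 127.0.0.1:30001@40001 master,fail - 1524390636966 1524390636165 1 disconnected
"""

nodes = parse_cluster_nodes(SAMPLE)
failed = [n.address for n in nodes if "fail" in n.flags]
print("failed nodes:", failed)
```

After restarting the instance, running the same check on fresh output should show the node without the fail flag and with a master ID in place of the "-".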

You can restart the Redis instance on the required port using the same command you used to start it earlier, i.e.
cd 30001
../redis-server redis.conf

Assuming that you followed the tutorial and created the cluster using the create-cluster script, i.e.
# pwd: redis/utils/create-cluster
./create-cluster start
./create-cluster create
To bring back the node that you failed, start it again using
./create-cluster start
This will start the failed node. Currently running nodes won't be affected.
https://github.com/antirez/redis/blob/unstable/utils/create-cluster/create-cluster#L25

Related

Redis cluster failover: slave won't become master

I am trying to test my software's behavior during a cluster failover, and for that reason I want to configure the simplest possible cluster: one master and two slaves. I have three files, 7000.conf to 7002.conf, with the following content:
port 7000
cluster-config-file nodes.7000.conf
appendfilename appendonly.7000.aof
dbfilename dump.7000.rdb
pidfile /var/run/redis_7000.pid
include cluster.conf
The content of cluster.conf:
cluster-enabled yes
appendonly yes
maxclients 100
daemonize yes
cluster-node-timeout 2000
cluster-slave-validity-factor 0
I then configured 7000 to serve all slots from 0 to 16383, with 7001 and 7002 as replicas of 7000:
XXX 127.0.0.1:7002 slave YYY 0 1511389011347 4 connected
YYY 127.0.0.1:7000 myself,master - 0 0 4 connected 0-16383
ZZZ 127.0.0.1:7001 slave YYY 0 1511389011246 4 connected
Then I try to get rid of 7000, either via the shutdown command or by killing the process. One of the slaves should promote itself to master, but none does:
ZZZ 127.0.0.1:7001 slave YYY 0 0 3 connected
YYY 127.0.0.1:7000 master,fail? - 1511389104442 1511389103933 4 disconnected 0-16383
XXX 127.0.0.1:7002 myself,slave YYY 0 1511389116543 4 connected
I've waited for several minutes, and my slaves do not want to become master. If I force a slave to become master via cluster failover takeover, it's more than happy to do so (and if I restart the old master, it becomes a slave), but this never happens automatically.
I've tried to play with cluster-node-timeout - does not help.
Am I doing something wrong? Redis version is 3.2.11.
The issue is that a Redis Cluster needs at least 3 masters for automatic failover to work. It is the master nodes that watch each other and detect the failure, so with a single master in the cluster there is no process left that can detect that your one master is down. The minimum of three exists because, when any node goes down, a majority of the masters needs to agree on the failure; you therefore need at least 3 masters so that more than half of them are still around to reach a majority view.
The Redis-cluster tutorial mentions this in the following section: https://redis.io/topics/cluster-tutorial#creating-and-using-a-redis-cluster
"Note that the minimal cluster that works as expected requires to contain at least three master nodes."
Please note that even with 3 masters, automatic failover is not guaranteed if the failures happen as described below (M = master, S = slave):
Node-1: M1 S3
Node-2: M2 S1
Node-3: M3 S2
Now if Node-3 fails, its slave S3 on Node-1 is promoted to master automatically. All is well, with the following status after Node-3 recovers:
Node-1: M1 M3 <----- Please note 2 Masters in Node-1 now with S3 become M3 in prev step.
Node-2: M2 S1
Node-3: S3 S2 <----- Please note that the redis-server came up as Slave(was M3 before)
Now you might think the cluster will continue to handle failures easily, since there are still 3 masters in this setup. However, if Node-1 fails, the cluster is DOWN because the quorum is not satisfied, and it never comes back up unless we make some manual adjustments.
Hope this helps.
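The majority argument can be checked with a few lines of arithmetic. This is a simplified sketch, not how Redis implements it: it assumes only masters vote, that a failover needs more than half of all masters to be reachable, and it uses the hypothetical helper names majority and failover_possible. Hosts and roles follow the Node-1/Node-2/Node-3 layout above.

```python
from typing import Dict, List


def majority(total_masters: int) -> int:
    """Smallest number of masters that is more than half of the total."""
    return total_masters // 2 + 1


def failover_possible(layout: Dict[str, List[str]], failed_host: str) -> bool:
    """True if enough masters survive the loss of failed_host to form a majority."""
    all_masters = [r for roles in layout.values() for r in roles
                   if r.startswith("M")]
    reachable = [r for host, roles in layout.items() if host != failed_host
                 for r in roles if r.startswith("M")]
    return len(reachable) >= majority(len(all_masters))


# Layout right after cluster creation.
initial = {"Node-1": ["M1", "S3"], "Node-2": ["M2", "S1"], "Node-3": ["M3", "S2"]}
# Layout after Node-3 failed, S3 was promoted, and Node-3 came back as a slave.
after_recovery = {"Node-1": ["M1", "M3"], "Node-2": ["M2", "S1"], "Node-3": ["S3", "S2"]}
# The one-master, two-slave setup from the question (one entry per instance).
single_master = {"7000": ["M"], "7001": ["S"], "7002": ["S"]}

print(failover_possible(initial, "Node-3"))         # 2 of 3 masters remain
print(failover_possible(after_recovery, "Node-1"))  # only 1 of 3 masters remains
print(failover_possible(single_master, "7000"))     # no master remains
```

In this model the second and third calls report that no failover can happen, which matches both the two-masters-on-one-host case and the single-master case from the question.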

redis-cli redirected to 127.0.0.1

I started a Redis cluster on PC1, then connected to it from PC2. When the client needs to be redirected to another cluster node, it shows Redirected to slot [7785] located at 127.0.0.1, but it should show Redirected to slot [7785] located at [IP of PC1, like 192.168.1.20]. Then it shows an error. What is happening? What can I do?
The output:
[admin@localhost ~]$ redis-cli -c -h 192.168.1.20 -p 30001
192.168.1.20:30001> get foo
-> Redirected to slot [12182] located at 127.0.0.1:30003
Could not connect to Redis at 127.0.0.1:30003: Connection refused
Could not connect to Redis at 127.0.0.1:30003: Connection refused
not connected>
Output of redis-cli -h 192.168.1.20 -p 30001 cluster nodes:
5f6d6f1319318233917aba92b6ab0e244b3260d7 127.0.0.1:30004 slave 4c7b046ecaeb2dc689cbad21ee3466fb43b48fb9 0 1463984410573 4 connected
e04d5b461cb6a2b48cb2a607e2140b7c1d32af25 127.0.0.1:30006 slave 3fc25c3851f7a9afd09b60739434118c25cd9243 0 1463984410473 6 connected
3fc25c3851f7a9afd09b60739434118c25cd9243 127.0.0.1:30003 master - 0 1463984410573 3 connected 10923-16383
4c7b046ecaeb2dc689cbad21ee3466fb43b48fb9 127.0.0.1:30001 myself,master - 0 0 1 connected 0-5460
7383830ac84f199db346da3112b5aaf9e124d3cf 127.0.0.1:30005 slave 1eeeb51522aed364fcf9623d6045fa3df2748579 0 1463984410573 5 connected
1eeeb51522aed364fcf9623d6045fa3df2748579 127.0.0.1:30002 master - 0 1463984410473 2 connected 5461-10922
Could you try binding your Redis cluster instances to the server's IP?
Update your redis.conf to add
bind 172.31.28.76
PS: update the IP as required.
That is because all your Redis nodes have registered their IP addresses as 127.0.0.1, and they believe the other nodes are located at 127.0.0.1 too. That is not wrong as long as the nodes in a cluster only communicate with each other, but it is definitely improper when a client on another host wants to learn about the cluster.
In that situation, your client asked a node for a key that node is not in charge of, and the node told the client to redirect to 127.0.0.1:30003. The client took that address literally, tried to connect to port 30003 on its own localhost, and naturally found nothing there.
To fix it, try sending cluster meet with the right IP to each node in the cluster. I ran an experiment like this:
# initial, Redis doesn't know its IP before a meet
127.0.0.1:7000> cluster nodes
8af9e47cb96f3bd8fff3800c38da11601157605d :7000 myself,master - 0 0 0 connected
# meet from 127.0.0.1, and their IP addresses updated to 127.0.0.1
127.0.0.1:7000> cluster meet 127.0.0.1 7001
OK
127.0.0.1:7000> cluster nodes
8af9e47cb96f3bd8fff3800c38da11601157605d 127.0.0.1:7000 myself,master - 0 0 0 connected
2c3d9b6c29f21ecd846f42bcfb238099d88b57df 127.0.0.1:7001 master - 0 1463987186714 1 connected
# send another meet, use the eth0 IP other than lo
127.0.0.1:7000> cluster meet 172.31.28.76 7001
OK
127.0.0.1:7000> cluster nodes
8af9e47cb96f3bd8fff3800c38da11601157605d 127.0.0.1:7000 myself,master - 0 0 0 connected
2c3d9b6c29f21ecd846f42bcfb238099d88b57df 172.31.28.76:7001 master - 0 1463987192672 1 connected
# connect to :7001, its cluster nodes are what we expect
127.0.0.1:7001> cluster nodes
2c3d9b6c29f21ecd846f42bcfb238099d88b57df 172.31.28.76:7001 myself,master - 0 0 1 connected
8af9e47cb96f3bd8fff3800c38da11601157605d 172.31.28.76:7000 master - 0 1463987203631 0 connected
# send another meet to fix
127.0.0.1:7001> cluster meet 172.31.28.76 7000
OK
# back to :7000, its address updated
127.0.0.1:7000> cluster nodes
8af9e47cb96f3bd8fff3800c38da11601157605d 172.31.28.76:7000 myself,master - 0 0 0 connected
2c3d9b6c29f21ecd846f42bcfb238099d88b57df 172.31.28.76:7001 master - 0 1463987210539 1 connected
In your case you may need to send several cluster meet commands to each node to make sure its IP is updated at all of its peers.
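To see which nodes still advertise a loopback address after sending the meet commands, the cluster nodes output can be scanned for them. This is a rough sketch with a hypothetical helper name, loopback_nodes; it only inspects the address column of text you have already captured and does not talk to Redis.

```python
import ipaddress
from typing import List


def loopback_nodes(cluster_nodes_text: str) -> List[str]:
    """Return the ip:port entries whose IP is a loopback address."""
    result = []
    for line in cluster_nodes_text.strip().splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue
        # Drop the optional @cluster-bus-port suffix used by newer versions.
        address = parts[1].split("@")[0]
        host, _, _port = address.rpartition(":")
        if not host:
            continue  # e.g. ":7000", a node that does not know its IP yet
        try:
            if ipaddress.ip_address(host).is_loopback:
                result.append(address)
        except ValueError:
            pass  # not an IP literal (a hostname, for instance)
    return result


# Lines taken from the experiment above, after the first pair of meets.
SAMPLE = """\
8af9e47cb96f3bd8fff3800c38da11601157605d 127.0.0.1:7000 myself,master - 0 0 0 connected
2c3d9b6c29f21ecd846f42bcfb238099d88b57df 172.31.28.76:7001 master - 0 1463987192672 1 connected
"""

print(loopback_nodes(SAMPLE))
```

An empty result on every node's output suggests that remote clients will be redirected to reachable addresses.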
You said you are running the Redis server on PC1.
In that case, use PC1's IP address (192.168.1.20 in your case) in the bind option of the Redis node config files.
Example of a node config file for a cluster:
bind 192.168.1.20
port 6000
cluster-enabled yes
cluster-config-file "nodes.conf"
cluster-node-timeout 5000
appendonly yes
You have to use the -c option.
For example, if you want to use the client on port 6379:
$ service redis-server start
$ redis-cli -c -p 6379

Redis Server Cluster Not Working

In the src directory, I am running the command below:
./redis-trib.rb create --replicas 1 127.0.0.1:7000 127.0.0.1:7001 127.0.0.1:7002 127.0.0.1:7003 127.0.0.1:7004 127.0.0.1:7005
but I get the error below.
Creating cluster
[ERR] Sorry, can't connect to node 127.0.0.1:7000
However, if I start the node on port 7000 using the command "redis-server redis.conf", where redis.conf is as below,
port 7000
cluster-enabled yes
cluster-config-file nodes.conf
cluster-node-timeout 10
cluster-slave-validity-factor 0
appendonly yes
and similarly start Redis on all the other ports, the instances come up successfully.
Now when I run
./redis-trib.rb create --replicas 1 127.0.0.1:7000 127.0.0.1:7001 127.0.0.1:7002 127.0.0.1:7003 127.0.0.1:7004 127.0.0.1:7005
I get another error.
Creating cluster
[ERR] Node 127.0.0.1:7000 is not empty. Either the node already knows other nodes (check with CLUSTER NODES) or contains some key in database 0.
Please help.
The first error occurs because redis-trib create attempts to connect to the Redis instances while creating the cluster, but you did not have any Redis instance running at 127.0.0.1:7000.
The second error suggests that you started your Redis instance, but the cluster cannot be created because you had already tried to create a cluster on node 7000 (which probably allocated slots to your node) before you got the first error message. To wipe the node clean, run
$ redis-cli -p 7000
127.0.0.1:7000> flushall
127.0.0.1:7000> cluster reset
127.0.0.1:7000> exit
then your redis-trib create will work.
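Before re-running the create step, you can check each node against the two conditions named in the error message. The sketch below is an assumption-laden illustration rather than what redis-trib does internally: is_empty_node is a made-up name, and it expects you to pass in the text returned by CLUSTER NODES and the number returned by DBSIZE on database 0.

```python
def is_empty_node(cluster_nodes_text: str, db0_key_count: int) -> bool:
    """True if the node knows only itself and database 0 holds no keys."""
    known_nodes = [line for line in cluster_nodes_text.splitlines()
                   if line.strip()]
    knows_only_itself = len(known_nodes) <= 1
    return knows_only_itself and db0_key_count == 0


# A freshly reset node lists only itself and has no keys.
fresh = "8af9e47cb96f3bd8fff3800c38da11601157605d :7000 myself,master - 0 0 0 connected\n"
# A node that already met a peer is not empty, even with zero keys.
met_peer = fresh + "2c3d9b6c29f21ecd846f42bcfb238099d88b57df 127.0.0.1:7001 master - 0 1463987186714 1 connected\n"

print(is_empty_node(fresh, 0))
print(is_empty_node(met_peer, 0))
print(is_empty_node(fresh, 3))
```

Only a node for which this returns True on both counts should pass the "is not empty" check described in the error.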
Perform the steps on the servers in the following order:
stop -> clean -> start -> create

Reconnect Shutdown Redis Instance back to Cluster

Given a redis cluster with six nodes (3M/3S) on ports 7000-7005 with master nodes on ports 7000-7002 and slave nodes on the rest, master node 7000 is shut down, so node 7003 becomes the new master:
$ redis-cli -p 7003 cluster nodes
2a23385e94f8a27e54ac3b89ed3cabe394826111 127.0.0.1:7004 slave 1108ef4cf01ace085b6d0f8fd5ce5021db86bdc7 0 1452648964358 5 connected
5799de96ff71e9e49fd58691ce4b42c07d2a0ede 127.0.0.1:7000 master,fail - 1452648178668 1452648177319 1 disconnected
dad18a1628ded44369c924786f3c920fc83b59c6 127.0.0.1:7002 master - 0 1452648964881 3 connected 10923-16383
dfcb7b6cd920c074cafee643d2c631b3c81402a5 127.0.0.1:7003 myself,master - 0 0 7 connected 0-5460
1108ef4cf01ace085b6d0f8fd5ce5021db86bdc7 127.0.0.1:7001 master - 0 1452648965403 2 connected 5461-10922
bf60041a282929cf94a4c9eaa203a381ff6ffc33 127.0.0.1:7005 slave dad18a1628ded44369c924786f3c920fc83b59c6 0 1452648965926 6 connected
How does one go about [automatically] reconnecting/restarting node 7000 as a slave instance of 7003?
The question "Redis Cluster: Re-adding a failed over node" has a detailed explanation of what happens.
Basically, once restarted, the node becomes a slave of the former slave (now a master) that replaced it during the failover.
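If you want to know which node the restarted instance will end up replicating, you can compare the slot owners before and after the failover. The following is only a sketch with hypothetical helper names (slot_owners, replacement_for). The "before" line is reconstructed from the description in the question (7000 as a master serving 0-5460) and is an assumption, since the question only shows the output after the failover.

```python
from typing import Dict, List


def slot_owners(cluster_nodes_text: str) -> Dict[str, str]:
    """Map each slot range to the address of the master serving it."""
    owners = {}
    for line in cluster_nodes_text.strip().splitlines():
        parts = line.split()
        if len(parts) < 9 or "master" not in parts[2].split(","):
            continue  # slaves and masters without slots own nothing
        for slot_range in parts[8:]:
            if slot_range.startswith("["):
                continue  # skip migrating/importing slot markers
            owners[slot_range] = parts[1]
    return owners


def replacement_for(before: str, after: str, failed_address: str) -> List[str]:
    """Addresses that now serve the slot ranges the failed master used to serve."""
    old, new = slot_owners(before), slot_owners(after)
    return sorted({new[r] for r, addr in old.items()
                   if addr == failed_address and r in new})


# Assumed state before the shutdown (not shown in the question).
BEFORE = "5799de96ff71e9e49fd58691ce4b42c07d2a0ede 127.0.0.1:7000 master - 0 0 1 connected 0-5460\n"
# Two lines from the output in the question, after the failover.
AFTER = """\
5799de96ff71e9e49fd58691ce4b42c07d2a0ede 127.0.0.1:7000 master,fail - 1452648178668 1452648177319 1 disconnected
dfcb7b6cd920c074cafee643d2c631b3c81402a5 127.0.0.1:7003 myself,master - 0 0 7 connected 0-5460
"""

print(replacement_for(BEFORE, AFTER, "127.0.0.1:7000"))
```

This sketch matches slot ranges by their exact text, so it would miss a replacement if the ranges had been resharded between the two snapshots.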
Have you seen the Redis Sentinel Documentation?
Redis Sentinel provides high availability for Redis. In practical
terms this means that using Sentinel you can create a Redis deployment
that resists without human intervention to certain kind of failures.

Redis Cluster: No automatic failover for master failure

I am trying to implement a Redis cluster on six machines.
I have a Vagrant cluster of six machines:
192.168.56.101
192.168.56.102
192.168.56.103
192.168.56.104
192.168.56.105
192.168.56.106
all running redis-server
I edited the /etc/redis/redis.conf file on all of the above servers, adding this:
cluster-enabled yes
cluster-config-file nodes.conf
cluster-node-timeout 5000
cluster-slave-validity-factor 0
appendonly yes
I then ran this on one of the six machines:
./redis-trib.rb create --replicas 1 192.168.56.101:6379 192.168.56.102:6379 192.168.56.103:6379 192.168.56.104:6379 192.168.56.105:6379 192.168.56.106:6379
A Redis cluster is up and running. I checked manually: a value set on one machine shows up on another machine.
$ redis-cli -p 6379 cluster nodes
3c6ffdddfec4e726f29d06a6da550f94d976f859 192.168.56.105:6379 master - 0 1450088598212 5 connected
47d04bc98ab42fc793f9f382855e5c54ab8f2e20 192.168.56.102:6379 slave caf2cec45114dc8f4cbc6d96c6dbb20b62a39f90 0 1450088598716 7 connected
040d4bb6a00569fc44eec05440a5fe0796952ccf 192.168.56.101:6379 myself,slave 5318e48e9ef0fc68d2dc723a336b791fc43e23c8 0 0 4 connected
caf2cec45114dc8f4cbc6d96c6dbb20b62a39f90 192.168.56.104:6379 master - 0 1450088599720 7 connected 0-10922
d78293d0821de3ab3d2bca82b24525e976e7ab63 192.168.56.106:6379 slave 5318e48e9ef0fc68d2dc723a336b791fc43e23c8 0 1450088599316 8 connected
5318e48e9ef0fc68d2dc723a336b791fc43e23c8 192.168.56.103:6379 master - 0 1450088599218 8 connected 10923-16383
My problem is that when I shut down or stop redis-server on any one machine that is a master, the whole cluster goes down, but if all three slaves die the cluster still works properly.
What should I do so that a slave becomes a master if a master fails (fault tolerance)?
I am under the assumption that Redis handles all of this and that I need not worry about it after deploying the cluster. Am I right, or do I have to do something myself?
Another question: let's say I have six machines with 16GB of RAM each. How much data in total would I be able to handle on this Redis cluster with three masters and three slaves?
Thank you.
The setting cluster-slave-validity-factor 0 may be the culprit here.
From redis.conf:
# A slave of a failing master will avoid to start a failover if its data
# looks too old.
If this check applies in your setup, the slave of the terminated master considers itself unfit to be elected master, because the time since it last contacted its master is greater than the computed value of:
(node-timeout * slave-validity-factor) + repl-ping-slave-period
Therefore, even with a redundant slave, the cluster state changes to DOWN and the cluster becomes unavailable.
You can try a different value, for example the suggested default:
cluster-slave-validity-factor 10
This should allow the cluster to tolerate the failure of any one Redis instance, whether it is a slave or a master.
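To get a feel for the numbers, the formula above can be evaluated for a given configuration. This is a small sketch with a made-up function name, max_master_silence_ms. It assumes a repl-ping-slave-period of 10 seconds, which is the documented default. Returning None for a factor of 0 reflects my reading of the redis.conf comments, which describe 0 as meaning that slaves always try to fail over regardless of data age; please verify that against the redis.conf shipped with your version.

```python
from typing import Optional


def max_master_silence_ms(node_timeout_ms: int,
                          validity_factor: int,
                          repl_ping_period_s: int = 10) -> Optional[int]:
    """Longest time (ms) a slave may have been out of touch with its master
    and still attempt a failover; None means the age check is not applied."""
    if validity_factor == 0:
        # Assumption: a factor of 0 disables the data-age check entirely.
        return None
    return node_timeout_ms * validity_factor + repl_ping_period_s * 1000


# Settings from the question with the suggested factor of 10.
print(max_master_silence_ms(5000, 10))
# The worked example in redis.conf: 30 s timeout, factor 10.
print(max_master_silence_ms(30000, 10))
```

With cluster-node-timeout 5000 and a factor of 10 this gives a window of 60 seconds, and the redis.conf example of a 30 second timeout gives 310 seconds.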
For your second question: six machines with 16GB of RAM each can function as a Redis Cluster of 3 master instances and 3 slave instances, so the theoretical maximum is 16GB x 3 = 48GB of data. Such a cluster can tolerate at most one node failure if cluster-require-full-coverage is turned on; otherwise it may still be able to serve data from the shards that remain available on the functioning instances.