Redis Sentinel: slaves not following new master after failover

I installed Redis Server + Sentinel version 7.0.4 on 3 nodes:
redis1 (192.168.69.91): cluster node 1 (Redis Server + Sentinel)
redis2 (192.168.69.92): cluster node 2 (Redis Server + Sentinel)
redis3 (192.168.69.93): cluster node 3 (Redis Server + Sentinel)
Sentinel performs the failover when the master Redis goes down, but the surviving slaves do not follow the new master and still try to connect to the old master (which is no longer available).
Currently the master node is redis1; redis2 and redis3 are slaves of redis1:
redis1:6379> info replication
# Replication
role:master
connected_slaves:2
slave0:ip=192.168.69.93,port=6379,state=online,offset=332569377,lag=0
slave1:ip=192.168.69.92,port=6379,state=online,offset=332569230,lag=1
master_failover_state:no-failover
master_replid:6735b3ef1c29926ee9f6c0ace75bc169150c2fa6
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:332569377
second_repl_offset:-1
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:331511956
repl_backlog_histlen:1057422
redis2:6379> info replication
# Replication
role:slave
master_host:192.168.69.91
master_port:6379
master_link_status:up
master_last_io_seconds_ago:0
master_sync_in_progress:0
slave_read_repl_offset:332556357
slave_repl_offset:332556357
slave_priority:100
slave_read_only:1
replica_announced:1
connected_slaves:0
master_failover_state:no-failover
master_replid:6735b3ef1c29926ee9f6c0ace75bc169150c2fa6
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:332556357
second_repl_offset:-1
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:332555021
repl_backlog_histlen:1337
redis3:6379> info replication
# Replication
role:slave
master_host:192.168.69.91
master_port:6379
master_link_status:up
master_last_io_seconds_ago:0
master_sync_in_progress:0
slave_read_repl_offset:332549259
slave_repl_offset:332549259
slave_priority:100
slave_read_only:1
replica_announced:1
connected_slaves:1
slave0:ip=192.168.69.92,port=6379,state=online,offset=332549098,lag=0
master_failover_state:no-failover
master_replid:6735b3ef1c29926ee9f6c0ace75bc169150c2fa6
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:332549259
second_repl_offset:-1
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:331492545
repl_backlog_histlen:1056715
I get the correct master host from Sentinel:
root@redis1:~# redis-cli -p 26379 sentinel get-master-addr-by-name mycluster
1) "192.168.69.91"
2) "6379"
When I shut down the master Redis Server host, redis3 automatically becomes the new master:
root@redis1:/etc/redis# redis-cli -p 26379 sentinel get-master-addr-by-name mycluster
1) "192.168.69.93"
2) "6379"
redis3:6379> info replication
# Replication
role:master
connected_slaves:0
master_failover_state:no-failover
master_replid:c97cca5168a874341c372b4e7e5eead1a0d30f1a
master_replid2:6735b3ef1c29926ee9f6c0ace75bc169150c2fa6
master_repl_offset:332740970
second_repl_offset:332733598
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:331676505
repl_backlog_histlen:1064466
but the redis2 slave (which is still alive) does not follow the new master; it keeps trying to follow the old one:
redis2:6379> info replication
# Replication
role:slave
master_host:192.168.69.91
master_port:6379
master_link_status:down
master_last_io_seconds_ago:-1
master_sync_in_progress:0
slave_read_repl_offset:332733597
slave_repl_offset:332733597
master_link_down_since_seconds:29
slave_priority:100
slave_read_only:1
replica_announced:1
connected_slaves:0
master_failover_state:no-failover
master_replid:6735b3ef1c29926ee9f6c0ace75bc169150c2fa6
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:332733597
second_repl_offset:-1
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:332555021
repl_backlog_histlen:178577
These are the Sentinel logs during the failover:
1591:X 16 Aug 2022 14:41:18.199 # +sdown master mycluster 192.168.69.91 6379
1591:X 16 Aug 2022 14:41:18.267 # +odown master mycluster 192.168.69.91 6379 #quorum 2/2
1591:X 16 Aug 2022 14:41:18.267 # +new-epoch 23
1591:X 16 Aug 2022 14:41:18.267 # +try-failover master mycluster 192.168.69.91 6379
1591:X 16 Aug 2022 14:41:18.273 * Sentinel new configuration saved on disk
1591:X 16 Aug 2022 14:41:18.273 # +vote-for-leader c3d7456b7e06e85f449cfc452ac46a4bb7310a59 23
1591:X 16 Aug 2022 14:41:18.288 # ebe43f2f88aba2d3015f05ba960fd7a609bb67e0 voted for c3d7456b7e06e85f449cfc452ac46a4bb7310a59 23
1591:X 16 Aug 2022 14:41:18.332 # +elected-leader master mycluster 192.168.69.91 6379
1591:X 16 Aug 2022 14:41:18.332 # +failover-state-select-slave master mycluster 192.168.69.91 6379
1591:X 16 Aug 2022 14:41:18.395 # +selected-slave slave 192.168.69.93:6379 192.168.69.93 6379 # mycluster 192.168.69.91 6379
1591:X 16 Aug 2022 14:41:18.395 * +failover-state-send-slaveof-noone slave 192.168.69.93:6379 192.168.69.93 6379 # mycluster 192.168.69.91 6379
1591:X 16 Aug 2022 14:41:18.453 * +failover-state-wait-promotion slave 192.168.69.93:6379 192.168.69.93 6379 # mycluster 192.168.69.91 6379
1591:X 16 Aug 2022 14:41:19.310 * Sentinel new configuration saved on disk
1591:X 16 Aug 2022 14:41:19.310 # +promoted-slave slave 192.168.69.93:6379 192.168.69.93 6379 # mycluster 192.168.69.91 6379
1591:X 16 Aug 2022 14:41:19.310 # +failover-state-reconf-slaves master mycluster 192.168.69.91 6379
1591:X 16 Aug 2022 14:41:19.382 # +failover-end master mycluster 192.168.69.91 6379
1591:X 16 Aug 2022 14:41:19.382 # +switch-master mycluster 192.168.69.91 6379 192.168.69.93 6379
1591:X 16 Aug 2022 14:41:19.382 * +slave slave 192.168.69.92:6379 192.168.69.92 6379 # mycluster 192.168.69.93 6379
1591:X 16 Aug 2022 14:41:19.382 * +slave slave 192.168.69.91:6379 192.168.69.91 6379 # mycluster 192.168.69.93 6379
1591:X 16 Aug 2022 14:41:19.387 * Sentinel new configuration saved on disk
1591:X 16 Aug 2022 14:41:24.397 # +sdown slave 192.168.69.91:6379 192.168.69.91 6379 # mycluster 192.168.69.93 6379
1591:X 16 Aug 2022 14:41:24.397 # +sdown slave 192.168.69.92:6379 192.168.69.92 6379 # mycluster 192.168.69.93 6379
As you can see, Sentinel marks the slave 192.168.69.92 (redis2) as down, even though it is still alive, and I cannot understand why.
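(Not from the original post: since Sentinel flags a replica that is actually alive, a quick check would be whether the host running that Sentinel can reach redis2's data port at all, for example with:
redis-cli -h 192.168.69.92 -p 6379 ping
redis-cli -h 192.168.69.92 -p 6379 config get protected-mode
redis-cli -h 192.168.69.92 -p 6379 config get bind
A refused connection, or protected-mode enabled without a suitable bind, would be consistent with the +sdown.)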
This is the Redis Sentinel configuration before the failover:
protected-mode no
port 26379
sentinel monitor mycluster 192.168.69.91 6379 2
sentinel down-after-milliseconds mycluster 5000
# Generated by CONFIG REWRITE
latency-tracking-info-percentiles 50 99 99.9
user default on nopass ~* &* +#all
sentinel myid c3d7456b7e06e85f449cfc452ac46a4bb7310a59
sentinel config-epoch mycluster 22
sentinel leader-epoch mycluster 22
sentinel current-epoch 22
sentinel known-replica mycluster 192.168.69.93 6379
sentinel known-replica mycluster 192.168.69.92 6379
sentinel known-sentinel mycluster 192.168.69.92 26379 584784ea586a5954576eaa39be12542d4fae3175
sentinel known-sentinel mycluster 192.168.69.93 26379 ebe43f2f88aba2d3015f05ba960fd7a609bb67e0
And this is the Sentinel configuration after the failover:
protected-mode no
port 26379
sentinel monitor mycluster 192.168.69.91 6379 2
sentinel down-after-milliseconds mycluster 5000
# Generated by CONFIG REWRITE
latency-tracking-info-percentiles 50 99 99.9
user default on nopass ~* &* +#all
sentinel myid c3d7456b7e06e85f449cfc452ac46a4bb7310a59
sentinel config-epoch mycluster 23
sentinel leader-epoch mycluster 23
sentinel current-epoch 23
sentinel known-replica mycluster 192.168.69.91 6379
sentinel known-replica mycluster 192.168.69.92 6379
sentinel known-sentinel mycluster 192.168.69.92 26379 584784ea586a5954576eaa39be12542d4fae3175
sentinel known-sentinel mycluster 192.168.69.93 26379 ebe43f2f88aba2d3015f05ba960fd7a609bb67e0
I cannot understand why Sentinel cannot instruct redis2 to follow the new master redis3, and why it marks redis2 as down instead.
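One way to inspect how Sentinel itself sees redis2 (a suggestion, not something shown in the post) is to dump its replica table, which lists each replica's flags (for example s_down or disconnected) and last-ping timings:
redis-cli -p 26379 sentinel replicas mycluster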

Related

Redis sentinel can not fail over the slave service

I'm going to deploy a simple master-slave Redis cluster with two servers: 192.168.0.101 and 192.168.0.103, where 101 is the master.
Here is the sentinel.conf on the 103 server:
port 26379
bind 192.168.0.103 127.0.0.1
sentinel myid 49f552d5540fdcb8aa60be25208c56b689d3c0b0
sentinel monitor mymaster 192.168.0.101 6379 2
sentinel down-after-milliseconds mymaster 60000
sentinel failover-timeout mymaster 900000
sentinel auth-pass mymaster arsenal
sentinel config-epoch mymaster 0
# Generated by CONFIG REWRITE
dir "/etc/redis"
sentinel leader-epoch mymaster 3
sentinel known-slave mymaster 192.168.0.103 6379
sentinel current-epoch 3
And here is my redis.conf on the 103 server:
bind 127.0.0.1 ::1
protected-mode yes
port 6379
tcp-backlog 511
timeout 0
daemonize yes
supervised no
dbfilename dump.rdb
dir /var/lib/redis
slaveof device1 6379
masterauth arsenal
slave-serve-stale-data yes
slave-read-only yes
slave-priority 100
requirepass arsenal
slave-lazy-flush no
appendonly no
appendfilename "appendonly.aof"
appendfsync everysec
no-appendfsync-on-rewrite no
activerehashing yes
aof-rewrite-incremental-fsync yes
I started the sentinel on 192.168.0.103 with redis-server sentinel.conf --sentinel:
7951:X 14 Mar 14:19:48.479 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
7951:X 14 Mar 14:19:48.479 # Sentinel ID is 49f552d5540fdcb8aa60be25208c56b689d3c0b0
7951:X 14 Mar 14:19:48.479 # +monitor master mymaster 192.168.0.101 6379 quorum 2
7951:X 14 Mar 14:20:48.480 # +sdown slave 192.168.0.103:6379 192.168.0.103 6379 # mymaster 192.168.0.101 6379
7951:X 14 Mar 14:21:11.577 # +sdown master mymaster 192.168.0.101 6379
My Sentinel client call looks like this:
sentinel = Sentinel([('device3', 26379)], password='arsenal')
sentinel.discover_master('mymaster')
MasterNotFoundError: No master found for 'mymaster'
The problem is that after I stop the redis-server service on 101, Sentinel cannot switch the 103 server to master.
Does anyone have an idea? Thanks.
In your config sentinel monitor mymaster 192.168.0.101 6379 2 the quorum is 2, which means a failover can only start when at least two Sentinels agree that the master is down.
See the Redis Sentinel docs: a deployment is only stable with three or more Sentinels. With a single Sentinel it cannot elect a leader (which needs a majority of votes) to start a failover.
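A minimal sketch of such a deployment, reusing the monitor settings from the question: run one Sentinel on each of at least three hosts, all with the same monitor block, so a majority can elect a failover leader.
port 26379
sentinel monitor mymaster 192.168.0.101 6379 2
sentinel down-after-milliseconds mymaster 60000
sentinel failover-timeout mymaster 900000
sentinel auth-pass mymaster arsenal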

Redis+Sentinel cannot execute client-reconfig-script (VIP drift) when failover happens

I have set up a Redis master-slave(s) cluster with Sentinel monitoring for HA on Linux CentOS 7.0 (Redis v4.0.2).
Sentinel is working well: when I shut down one of the three nodes, another node is elected as the new master.
Now I am trying to set up a reconfig script to notify clients of the new master.
I created a readable and executable (chmod a+rx) script at /usr/opt/notify_master.sh, then added the following line on my 3 Sentinel nodes in /etc/sentinel.conf:
sentinel client-reconfig-script mymaster /usr/opt/notify_master.sh
Looking at the Sentinel config with the sentinel master mymaster command, I can confirm that client-reconfig-script is correctly configured:
10.0.0.41:26379> sentinel master mymaster
...
41) "client-reconfig-script"
42) "/usr/opt/notify_master.sh"
However, when a failover occurs, my reconfig script is not triggered, and I wonder why. Here is the sentinel log:
3314:X 02 Apr 10:14:07.069 # +sdown master mymaster 10.0.0.41 6379
3314:X 02 Apr 10:14:07.159 # +new-epoch 254
3314:X 02 Apr 10:14:07.161 # +vote-for-leader 8ed286e02e81d6946e0d007f569e164a9404c03f 254
3314:X 02 Apr 10:14:07.679 # +config-update-from sentinel 8ed286e02e81d6946e0d007f569e164a9404c03f 10.0.0.40 26379 # mymaster 10.0.0.41 6379
3314:X 02 Apr 10:14:07.679 # +switch-master mymaster 10.0.0.41 6379 10.0.0.42 6379
3314:X 02 Apr 10:14:07.680 * +slave slave 10.0.0.40:6379 10.0.0.40 6379 # mymaster 10.0.0.42 6379
3314:X 02 Apr 10:14:07.680 * +slave slave 10.0.0.41:6379 10.0.0.41 6379 # mymaster 10.0.0.42 6379
3314:X 02 Apr 10:14:37.742 # +sdown slave 10.0.0.41:6379 10.0.0.41 6379 # mymaster 10.0.0.42 6379
3314:X 02 Apr 10:48:36.099 # -sdown slave 10.0.0.41:6379 10.0.0.41 6379 # mymaster 10.0.0.42 6379
3314:X 02 Apr 10:48:46.056 * +convert-to-slave slave 10.0.0.41:6379 10.0.0.41 6379 # mymaster 10.0.0.42 6379
Here is the notify_master.sh script (which does the VIP drift):
#!/bin/bash
MASTER_IP=$6
LOCAL_IP=$(ifconfig -a | grep inet | grep -v 127.0.0.1 | grep -v inet6 | awk '{print $2}' | tr -d "addr:")
VIP='10.0.0.31'
NETMASK='24'
INTERFACE='eno16777736'
if [ ${MASTER_IP} = ${LOCAL_IP} ]; then
    sudo /usr/sbin/ip addr add ${VIP}/${NETMASK} dev ${INTERFACE}
    sudo /usr/sbin/arping -q -c 3 -A ${VIP} -I ${INTERFACE}
    exit 0
else
    sudo /usr/sbin/ip addr del ${VIP}/${NETMASK} dev ${INTERFACE}
    exit 0
fi
exit 1
This script works when run manually.
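To test it closer to the way Sentinel invokes it (a sketch under my own assumptions, not from the post), you can pass arguments in the documented order <master-name> <role> <state> <from-ip> <from-port> <to-ip> <to-port> and run it as the service user:
# hypothetical manual run; the script only reads $6 (the new master IP)
sudo -u redis /usr/opt/notify_master.sh mymaster leader failover 10.0.0.41 6379 10.0.0.42 6379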

redis sentinel client-reconfig-script not triggered

I have set up a Redis master-slave(s) cluster with Sentinel monitoring for HA on Linux Debian (using stretch backports: Redis v4.0.2).
Sentinel is working well: when I shut down one of the three nodes, another node is elected as the new master.
Now I am trying to set up a reconfig script to notify clients of the new master.
I created a readable and executable (chmod a+rx) script at /var/redis/test.sh, then added the following line on my 3 Sentinel nodes in /etc/redis/sentinel.conf:
sentinel client-reconfig-script mymaster /var/redis/test.sh
Looking at the Sentinel config with the sentinel master mymaster command, I can confirm that client-reconfig-script is correctly configured:
10.2.0.6:26379> sentinel master mymaster
...
43) "client-reconfig-script"
44) "/var/redis/test.sh"
However, when a failover occurs, my reconfig script is not triggered, and I wonder why. Here is the sentinel log:
29765:X 16 Oct 23:03:11.724 # Executing user requested FAILOVER of 'mymaster'
29765:X 16 Oct 23:03:11.724 # +new-epoch 480
29765:X 16 Oct 23:03:11.724 # +try-failover master mymaster 10.2.0.7 6379
29765:X 16 Oct 23:03:11.777 # +vote-for-leader 5a0661a5982701465a387b4872cfa4c576edbd38 480
29765:X 16 Oct 23:03:11.777 # +elected-leader master mymaster 10.2.0.7 6379
29765:X 16 Oct 23:03:11.777 # +failover-state-select-slave master mymaster 10.2.0.7 6379
29765:X 16 Oct 23:03:11.854 # +selected-slave slave 10.2.0.8:6379 10.2.0.8 6379 # mymaster 10.2.0.7 6379
29765:X 16 Oct 23:03:11.854 * +failover-state-send-slaveof-noone slave 10.2.0.8:6379 10.2.0.8 6379 # mymaster 10.2.0.7 6379
29765:X 16 Oct 23:03:11.910 * +failover-state-wait-promotion slave 10.2.0.8:6379 10.2.0.8 6379 # mymaster 10.2.0.7 6379
29765:X 16 Oct 23:03:12.838 # +promoted-slave slave 10.2.0.8:6379 10.2.0.8 6379 # mymaster 10.2.0.7 6379
29765:X 16 Oct 23:03:12.838 # +failover-state-reconf-slaves master mymaster 10.2.0.7 6379
29765:X 16 Oct 23:03:12.893 * +slave-reconf-sent slave 10.2.0.6:6379 10.2.0.6 6379 # mymaster 10.2.0.7 6379
29765:X 16 Oct 23:03:13.865 * +slave-reconf-inprog slave 10.2.0.6:6379 10.2.0.6 6379 # mymaster 10.2.0.7 6379
29765:X 16 Oct 23:03:13.865 * +slave-reconf-done slave 10.2.0.6:6379 10.2.0.6 6379 # mymaster 10.2.0.7 6379
29765:X 16 Oct 23:03:13.937 # +failover-end master mymaster 10.2.0.7 6379
29765:X 16 Oct 23:03:13.937 # +switch-master mymaster 10.2.0.7 6379 10.2.0.8 6379
29765:X 16 Oct 23:03:13.937 * +slave slave 10.2.0.6:6379 10.2.0.6 6379 # mymaster 10.2.0.8 6379
29765:X 16 Oct 23:03:13.937 * +slave slave 10.2.0.7:6379 10.2.0.7 6379 # mymaster 10.2.0.8 6379
Am I missing a configuration option?
Additional information: I installed a similar architecture a few weeks ago (Redis 4.0.1) and it worked (I mean it fired my reconfig script), but I did not keep the configuration, so I may have missed something. Or... could it be a bug introduced in v4.0.2?!
The 'chroot-like environment' for me was the systemd setup that comes with the default apt install redis-sentinel.
Changing these options in /etc/systemd/system/sentinel.service:
PrivateTmp=no
ReadWriteDirectories=-/tmp
will make writing a test file to /tmp work as expected.
Sending emails from the command line involves switching most of the other options off (or swapping the service to run as root...).
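A sketch of applying those two overrides as a systemd drop-in (assuming the unit really is named sentinel.service as above; on Debian the packaged unit is often redis-sentinel.service):
sudo mkdir -p /etc/systemd/system/sentinel.service.d
printf '[Service]\nPrivateTmp=no\nReadWriteDirectories=-/tmp\n' | sudo tee /etc/systemd/system/sentinel.service.d/override.conf
sudo systemctl daemon-reload
sudo systemctl restart sentinel.service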
I finally solved my problem.
The "reconfig.sh" script WAS fired by the failover, but I didn't realize it was because:
sentinel logging (even in debug mode) is not very clear about the reconfig script execution
reconfig script seems to be run in a chroot-like environment that made my tests non-relevant!
Here is the sentinel log when a client-reconfig-script is triggered ("script-child" lines):
32711:X 18 Oct 16:06:42.615 # +failover-state-reconf-slaves master mymaster 10.2.0.6 6379
32711:X 18 Oct 16:06:42.671 * +slave-reconf-sent slave 10.2.0.8:6379 10.2.0.8 6379 # mymaster 10.2.0.6 6379
32711:X 18 Oct 16:06:42.671 . +script-child 397
32711:X 18 Oct 16:06:42.813 . -script-child 397 0 0
Then my reconfig.sh looked like this:
#!/bin/bash
touch /tmp/reconfig
exit 0
=> Don't expect to find a /tmp/reconfig file when this script is called by Sentinel!
However, I still do not know exactly how it works internally...
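One way to confirm execution without depending on /tmp (my own sketch, not from the answer) is to log to syslog, which is still reachable from the sandboxed environment:
#!/bin/bash
# write the arguments Sentinel passed to syslog instead of a file in /tmp
logger -t sentinel-reconfig "client-reconfig-script fired with args: $*"
exit 0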
If you run Redis as the user 'root', the client-reconfig-script will be triggered.

Redis sentinel marks slaves as down

I'm trying to set up a typical Redis Sentinel configuration, with three machines that will run three Redis servers and three Redis sentinels. The Master/Slave part of the Redis servers is working OK, but the sentinels are not. When I start two sentinels, the sentinel with the master detects the slaves, but marks them as down after the specified amount of time. I'm running Redis 3.0.5 64-bit on Debian Jessie machines.
8319:X 22 Dec 14:06:17.855 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
8319:X 22 Dec 14:06:17.855 # Sentinel runid is cdd5bbd5b84c876982dbca9d45ecc4bf8500e7a2
8319:X 22 Dec 14:06:17.855 # +monitor master mymaster xxxxxxxx0 6379 quorum 2
8319:X 22 Dec 14:06:18.857 * +slave slave xxxxxxxx2:6379 xxxxxxx2 6379 # mymaster xxxxxxx0 6379
8319:X 22 Dec 14:06:18.858 * +slave slave xxxxxx1:6380 xxxxxxx1 6380 # mymaster xxxxxxx0 6379
8319:X 22 Dec 14:07:18.862 # +sdown slave xxxxxxxx1:6380 xxxxxxx1 6380 # mymaster xxxxxx0 6379
8319:X 22 Dec 14:07:18.862 # +sdown slave xxxxxx2:6379 xxxxxxx2 6379 # mymaster xxxxxx0 6379
Sentinel config file:
daemonize yes
pidfile "/var/run/redis/redis-sentinel.pid"
logfile "/var/log/redis/redis-sentinel.log"
bind 127.0.0.1 xxxxxxx0
port 26379
sentinel monitor mymaster xxxxxxx0 6379 2
sentinel down-after-milliseconds mymaster 60000
sentinel config-epoch mymaster 0
sentinel leader-epoch mymaster 0
dir "/var/lib/redis"
Of course, there is connectivity between these machines, as the slaves are working OK:
7553:S 22 Dec 13:46:33.285 * Connecting to MASTER xxxxxxxx0:6379
7553:S 22 Dec 13:46:33.286 * MASTER <-> SLAVE sync started
7553:S 22 Dec 13:46:33.286 * Non blocking connect for SYNC fired the event.
7553:S 22 Dec 13:46:33.287 * Master replied to PING, replication can continue...
7553:S 22 Dec 13:46:33.288 * Partial resynchronization not possible (no cached master)
7553:S 22 Dec 13:46:33.291 * Full resync from master: f637ca8fe003acd09c6d021aed3f89a0d9994c9b:98290
7553:S 22 Dec 13:46:33.350 * MASTER <-> SLAVE sync: receiving 18 bytes from master
7553:S 22 Dec 13:46:33.350 * MASTER <-> SLAVE sync: Flushing old data
7553:S 22 Dec 13:46:33.350 * MASTER <-> SLAVE sync: Loading DB in memory
7553:S 22 Dec 13:46:33.350 * MASTER <-> SLAVE sync: Finished with success
7553:S 22 Dec 14:01:33.072 * 1 changes in 900 seconds. Saving...
I can answer this myself. The problem was that the first IP in the Sentinel conf's bind line was the localhost IP; the node's own (binding) IP needs to come first. Just in case it helps anyone.
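A sketch of the corrected order, with the node's own address first (same placeholder addresses as above):
bind xxxxxxx0 127.0.0.1
port 26379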

Redis Sentinel doesn't resurrect master -> slave until it is restarted

I'm having trouble with resurrecting a master node with Sentinel. Specifically, slaves are promoted properly when the master is lost, but the master upon reboot is never demoted. However, if I restart Sentinel immediately the master node is demoted. Is my configuration bad, or am I missing something basic?
EDIT: Xpost with https://groups.google.com/forum/#!topic/redis-db/4AnGNssqYTw
I set up a few VMs as follows, all with Redis 3.1.999:
192.168.0.101 - Redis Slave
192.168.0.102 - Redis Slave
192.168.0.103 - Redis Master
192.168.0.201 - Sentinel
192.168.0.202 - Sentinel
My Sentinel configuration, for both sentinels:
loglevel verbose
logfile "/tmp/sentinel.log"
sentinel monitor redisA01 192.168.0.101 6379 2
sentinel down-after-milliseconds redisA01 30000
sentinel failover-timeout redisA01 120000
I stop redis on the master node; as expected Sentinel catches it and promotes a slave to master.
3425:X 08 Sep 23:47:43.839 # +sdown master redisA01 192.168.0.103 6379
3425:X 08 Sep 23:47:43.896 # +odown master redisA01 192.168.0.103 6379 #quorum 2/2
3425:X 08 Sep 23:47:43.896 # +new-epoch 53
3425:X 08 Sep 23:47:43.896 # +try-failover master redisA01 192.168.0.103 6379
3425:X 08 Sep 23:47:43.898 # +vote-for-leader 71de0d8f6250e436e1f76800cbe8cbae56c1be7c 53
3425:X 08 Sep 23:47:43.901 # 192.168.0.201:26379 voted for 71de0d8f6250e436e1f76800cbe8cbae56c1be7c 53
3425:X 08 Sep 23:47:43.975 # +elected-leader master redisA01 192.168.0.103 6379
3425:X 08 Sep 23:47:43.976 # +failover-state-select-slave master redisA01 192.168.0.103 6379
3425:X 08 Sep 23:47:44.077 # +selected-slave slave 192.168.0.102:6379 192.168.0.102 6379 # redisA01 192.168.0.103 6379
3425:X 08 Sep 23:47:44.078 * +failover-state-send-slaveof-noone slave 192.168.0.102:6379 192.168.0.102 6379 # redisA01 192.168.0.103 6379
3425:X 08 Sep 23:47:44.977 * +failover-state-wait-promotion slave 192.168.0.102:6379 192.168.0.102 6379 # redisA01 192.168.0.103 6379
3425:X 08 Sep 23:47:44.980 - -role-change slave 192.168.0.102:6379 192.168.0.102 6379 # redisA01 192.168.0.103 6379 new reported role is master
3425:X 08 Sep 23:47:44.981 # +promoted-slave slave 192.168.0.102:6379 192.168.0.102 6379 # redisA01 192.168.0.103 6379
3425:X 08 Sep 23:47:44.981 # +failover-state-reconf-slaves master redisA01 192.168.0.103 6379
3425:X 08 Sep 23:47:45.068 * +slave-reconf-sent slave 192.168.0.101:6379 192.168.0.101 6379 # redisA01 192.168.0.103 6379
3425:X 08 Sep 23:47:46.031 * +slave-reconf-inprog slave 192.168.0.101:6379 192.168.0.101 6379 # redisA01 192.168.0.103 6379
3425:X 08 Sep 23:47:46.032 * +slave-reconf-done slave 192.168.0.101:6379 192.168.0.101 6379 # redisA01 192.168.0.103 6379
3425:X 08 Sep 23:47:46.101 # -odown master redisA01 192.168.0.103 6379
3425:X 08 Sep 23:47:46.101 # +failover-end master redisA01 192.168.0.103 6379
3425:X 08 Sep 23:47:46.102 # +switch-master redisA01 192.168.0.103 6379 192.168.0.102 6379
3425:X 08 Sep 23:47:46.103 * +slave slave 192.168.0.101:6379 192.168.0.101 6379 # redisA01 192.168.0.102 6379
3425:X 08 Sep 23:47:46.103 * +slave slave 192.168.0.103:6379 192.168.0.103 6379 # redisA01 192.168.0.102 6379
I wait a few minutes and restart Redis on the former master node. Unexpectedly (to me) the node is not demoted to slave.
3425:X 08 Sep 23:48:16.105 # +sdown slave 192.168.0.103:6379 192.168.0.103 6379 # redisA01 192.168.0.102 6379
3425:X 08 Sep 23:50:09.131 # -sdown slave 192.168.0.103:6379 192.168.0.103 6379 # redisA01 192.168.0.102 6379
After waiting a few more minutes, I restart one of the sentinels. Immediately it detects the dangling former master node and demotes it.
3425:signal-handler (1441758237) Received SIGTERM scheduling shutdown...
...
3670:X 09 Sep 00:23:57.687 # Sentinel ID is 71de0d8f6250e436e1f76800cbe8cbae56c1be7c
3670:X 09 Sep 00:23:57.687 # +monitor master redisA01 192.168.0.102 6379 quorum 2
3670:X 09 Sep 00:23:57.690 - -role-change slave 192.168.0.103:6379 192.168.0.103 6379 # redisA01 192.168.0.102 6379 new reported role is master
3670:X 09 Sep 00:23:58.708 - Accepted 192.168.0.201:49731
3670:X 09 Sep 00:24:07.778 * +convert-to-slave slave 192.168.0.103:6379 192.168.0.103 6379 # redisA01 192.168.0.102 6379
3670:X 09 Sep 00:24:17.801 - +role-change slave 192.168.0.103:6379 192.168.0.103 6379 # redisA01 192.168.0.102 6379 new reported role is slave
I would check for multiple processes on the master, and for possible circular replication. If you look at the end of the first log batch, you will see it already detects the 103 IP as a slave via the +slave entry. I would try to find out why, upon promotion, the new master already shows the old master as a slave.
Upon restart, according to the logs provided, the reconfiguration happens as part of slave rediscovery, when Sentinel detects the slave reporting itself as master.
Try it again, but directly interrogate each node before restarting Sentinel to see what each one reports for its master and slaves. That might illuminate the underlying issue.
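For example (a sketch using the addresses from the listing above, not commands from the answer):
for ip in 192.168.0.101 192.168.0.102 192.168.0.103; do
  echo "== $ip =="
  redis-cli -h "$ip" -p 6379 info replication | grep -E 'role:|master_host|slave[0-9]'
done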
Edit: the sentinel configuration you describe is inconsistent. You list the master as 103, but the sentinel config file you posted monitors 101, which is a slave according to your listing.
Also, add a third sentinel. Two make it easy to end up with split brain, which may well be what you are seeing.