Why does Redis forcibly demote master to slave?

I run a container from the Docker image redis:latest. After about 30 minutes the master demotes itself to a slave and is no longer writable.
The new slave also logs an error about once per second saying it cannot reach its master.
1:M 08 Jul 2022 03:10:55.899 * DB saved on disk
1:M 08 Jul 2022 03:15:56.087 * 100 changes in 300 seconds. Saving...
1:M 08 Jul 2022 03:15:56.089 * Background saving started by pid 61
61:C 08 Jul 2022 03:15:56.091 * DB saved on disk
61:C 08 Jul 2022 03:15:56.092 * Fork CoW for RDB: current 0 MB, peak 0 MB, average 0 MB
1:M 08 Jul 2022 03:15:56.189 * Background saving terminated with success
1:S 08 Jul 2022 03:20:12.258 * Before turning into a replica, using my own master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer.
1:S 08 Jul 2022 03:20:12.258 * Connecting to MASTER 178.20.40.200:8886
1:S 08 Jul 2022 03:20:12.258 * MASTER <-> REPLICA sync started
1:S 08 Jul 2022 03:20:12.259 * REPLICAOF 178.20.40.200:8886 enabled (user request from 'id=39 addr=95.182.123.66:36904 laddr=172.31.9.234:6379 fd=11 name= age=1 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=47 qbuf-free=20427 argv-mem=24 multi-mem=0 rbs=1024 rbp=0 obl=0 oll=0 omem=0 tot-mem=22320 events=r cmd=slaveof user=default redir=-1 resp=2')
1:S 08 Jul 2022 03:20:12.524 * Non blocking connect for SYNC fired the event.
1:S 08 Jul 2022 03:20:12.791 * Master replied to PING, replication can continue...
1:S 08 Jul 2022 03:20:13.335 * Trying a partial resynchronization (request 6743ff015583c86f3ac7a4305026c42991a1ca18:1).
1:S 08 Jul 2022 03:20:13.603 * Full resync from master: ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ:1
1:S 08 Jul 2022 03:20:13.603 * MASTER <-> REPLICA sync: receiving 54976 bytes from master to disk
1:S 08 Jul 2022 03:20:14.138 * Discarding previously cached master state.
1:S 08 Jul 2022 03:20:14.138 * MASTER <-> REPLICA sync: Flushing old data
1:S 08 Jul 2022 03:20:14.139 * MASTER <-> REPLICA sync: Loading DB in memory
1:S 08 Jul 2022 03:20:14.140 # Wrong signature trying to load DB from file
1:S 08 Jul 2022 03:20:14.140 # Failed trying to load the MASTER synchronization DB from disk: Invalid argument
1:S 08 Jul 2022 03:20:14.140 * Reconnecting to MASTER 178.20.40.200:8886 after failure
1:S 08 Jul 2022 03:20:14.140 * MASTER <-> REPLICA sync started
...
1:S 08 Jul 2022 05:09:50.010 * MASTER <-> REPLICA sync started
1:S 08 Jul 2022 05:09:50.298 * Non blocking connect for SYNC fired the event.
1:S 08 Jul 2022 05:09:50.587 # Failed to read response from the server: Connection reset by peer
1:S 08 Jul 2022 05:09:50.587 # Master did not respond to command during SYNC handshake
1:S 08 Jul 2022 05:09:51.013 * Connecting to MASTER 178.20.40.200:8886
1:S 08 Jul 2022 05:09:51.014 * MASTER <-> REPLICA sync started
1:S 08 Jul 2022 05:09:51.294 * Non blocking connect for SYNC fired the event.
1:S 08 Jul 2022 05:09:51.581 # Failed to read response from the server: Connection reset by peer
1:S 08 Jul 2022 05:09:51.581 # Master did not respond to command during SYNC handshake
1:S 08 Jul 2022 05:09:52.017 * Connecting to MASTER 178.20.40.200:8886
1:S 08 Jul 2022 05:09:52.017 * MASTER <-> REPLICA sync started
1:S 08 Jul 2022 05:09:52.297 * Non blocking connect for SYNC fired the event.
1:S 08 Jul 2022 05:09:52.578 # Failed to read response from the server: Connection reset by peer
1:S 08 Jul 2022 05:09:52.578 # Master did not respond to command during SYNC handshake
1:S 08 Jul 2022 05:09:53.021 * Connecting to MASTER 178.20.40.200:8886
1:S 08 Jul 2022 05:09:53.021 * MASTER <-> REPLICA sync started
1:S 08 Jul 2022 05:09:53.308 * Non blocking connect for SYNC fired the event.
1:S 08 Jul 2022 05:09:53.594 # Failed to read response from the server: Connection reset by peer
1:S 08 Jul 2022 05:09:53.594 # Master did not respond to command during SYNC handshake
1:S 08 Jul 2022 05:09:54.025 * Connecting to MASTER 178.20.40.200:8886
1:S 08 Jul 2022 05:09:54.025 * MASTER <-> REPLICA sync started
1:S 08 Jul 2022 05:09:54.316 * Non blocking connect for SYNC fired the event.
1:S 08 Jul 2022 05:09:54.608 # Failed to read response from the server: Connection reset by peer
1:S 08 Jul 2022 05:09:54.608 # Master did not respond to command during SYNC handshake
1:S 08 Jul 2022 05:09:55.028 * Connecting to MASTER 178.20.40.200:8886
1:S 08 Jul 2022 05:09:55.028 * MASTER <-> REPLICA sync started
1:S 08 Jul 2022 05:09:55.309 * Non blocking connect for SYNC fired the event.
1:S 08 Jul 2022 05:09:55.588 # Failed to read response from the server: Connection reset by peer
1:S 08 Jul 2022 05:09:55.588 # Master did not respond to command during SYNC handshake
1:S 08 Jul 2022 05:09:56.031 * Connecting to MASTER 178.20.40.200:8886
1:S 08 Jul 2022 05:09:56.031 * MASTER <-> REPLICA sync started
1:S 08 Jul 2022 05:09:56.311 * Non blocking connect for SYNC fired the event.
1:S 08 Jul 2022 05:09:56.592 # Failed to read response from the server: Connection reset by peer
1:S 08 Jul 2022 05:09:56.592 # Master did not respond to command during SYNC handshake
1:S 08 Jul 2022 05:09:57.035 * Connecting to MASTER 178.20.40.200:8886
1:S 08 Jul 2022 05:09:57.035 * MASTER <-> REPLICA sync started
1:S 08 Jul 2022 05:09:57.321 * Non blocking connect for SYNC fired the event.
1:S 08 Jul 2022 05:09:57.610 * Master replied to PING, replication can continue...
...
SLAVEOF NO ONE
config set slave-read-only no
If I force the slave back to being writable with the commands above and try to write, all data is flushed again after about 5 seconds.
I don't want my master turned into a slave.
This happens on a clean EC2 Amazon Linux instance, and the machine has plenty of free memory, so I don't know what is causing the error.
Why does Redis forcibly demote my master to a slave?
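For context, the REPLICAOF log line above shows the command arriving as a user request from an outside address (95.182.123.66), a pattern commonly seen when an unprotected instance is reachable from the internet. A minimal hardening sketch using standard redis.conf directives (the password is a placeholder, and renaming commands is optional):

```
# redis.conf — minimal hardening sketch (placeholder values)
bind 127.0.0.1 -::1          # listen only on loopback / trusted interfaces
protected-mode yes           # refuse outside connections when no password is set
requirepass use-a-long-random-password-here
rename-command SLAVEOF ""    # optionally disable remote re-mastering entirely
rename-command REPLICAOF ""
```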

Related

Cannot restart redis-sentinel unit

I'm trying to configure 3 Redis instances and 6 sentinels (3 of them running on the Redis hosts and the rest on different hosts). But when I install the redis-sentinel package, put my configuration under /etc/redis/sentinel.conf, and restart the service using systemctl restart redis-sentinel, I get this error:
Job for redis-sentinel.service failed because a timeout was exceeded.
See "systemctl status redis-sentinel.service" and "journalctl -xe" for details.
Here is the output of journalctl -u redis-sentinel:
Jan 01 08:07:07 redis1 systemd[1]: Starting Advanced key-value store...
Jan 01 08:07:07 redis1 redis-sentinel[16269]: 16269:X 01 Jan 2020 08:07:07.263 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
Jan 01 08:07:07 redis1 redis-sentinel[16269]: 16269:X 01 Jan 2020 08:07:07.263 # Redis version=5.0.7, bits=64, commit=00000000, modified=0, pid=16269, just started
Jan 01 08:07:07 redis1 redis-sentinel[16269]: 16269:X 01 Jan 2020 08:07:07.263 # Configuration loaded
Jan 01 08:07:07 redis1 systemd[1]: redis-sentinel.service: Can't open PID file /var/run/sentinel/redis-sentinel.pid (yet?) after start: No such file or directory
Jan 01 08:08:37 redis1 systemd[1]: redis-sentinel.service: Start operation timed out. Terminating.
Jan 01 08:08:37 redis1 systemd[1]: redis-sentinel.service: Failed with result 'timeout'.
Jan 01 08:08:37 redis1 systemd[1]: Failed to start Advanced key-value store.
Jan 01 08:08:37 redis1 systemd[1]: redis-sentinel.service: Service hold-off time over, scheduling restart.
Jan 01 08:08:37 redis1 systemd[1]: redis-sentinel.service: Scheduled restart job, restart counter is at 5.
Jan 01 08:08:37 redis1 systemd[1]: Stopped Advanced key-value store.
Jan 01 08:08:37 redis1 systemd[1]: Starting Advanced key-value store...
Jan 01 08:08:37 redis1 redis-sentinel[16307]: 16307:X 01 Jan 2020 08:08:37.738 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
Jan 01 08:08:37 redis1 redis-sentinel[16307]: 16307:X 01 Jan 2020 08:08:37.739 # Redis version=5.0.7, bits=64, commit=00000000, modified=0, pid=16307, just started
Jan 01 08:08:37 redis1 redis-sentinel[16307]: 16307:X 01 Jan 2020 08:08:37.739 # Configuration loaded
Jan 01 08:08:37 redis1 systemd[1]: redis-sentinel.service: Can't open PID file /var/run/sentinel/redis-sentinel.pid (yet?) after start: No such file or directory
and my sentinel.conf file:
port 26379
daemonize yes
sentinel myid 851994c7364e2138e03ee1cd346fbdc4f1404e4c
sentinel deny-scripts-reconfig yes
sentinel monitor mymaster 172.28.128.11 6379 2
sentinel down-after-milliseconds mymaster 5000
# Generated by CONFIG REWRITE
dir "/"
protected-mode no
sentinel failover-timeout mymaster 60000
sentinel config-epoch mymaster 0
sentinel leader-epoch mymaster 0
sentinel current-epoch 0
If you are running your Redis servers on a Debian-based distribution, add the following pidfile directives to your Redis configurations:
pidfile /var/run/redis/redis-sentinel.pid in /etc/redis/sentinel.conf
pidfile /var/run/redis/redis-server.pid in /etc/redis/redis.conf
What's the output in the sentinel log file?
I had a similar issue where Sentinel received a lot of sigterms.
In that case you need to make sure that if you use the daemonize yes setting, the systemd unit file must be using Type=forking.
Also make sure that the location of the PID file specified in the sentinel config matches the location specified in the systemd unit file.
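Putting those two points together, a unit file consistent with daemonize yes might look like the sketch below (paths are assumptions and must match your sentinel.conf; distributions ship their own unit file, so check the existing one first):

```
# /etc/systemd/system/redis-sentinel.service — sketch for daemonize yes
[Unit]
Description=Advanced key-value store (Sentinel)
After=network.target

[Service]
Type=forking
ExecStart=/usr/bin/redis-sentinel /etc/redis/sentinel.conf
# Must match the pidfile directive in sentinel.conf, or systemd times out
PIDFile=/var/run/sentinel/redis-sentinel.pid
Restart=on-failure

[Install]
WantedBy=multi-user.target
```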
If you see the error below in journalctl or systemctl logs:
Jun 26 10:13:02 x systemd[1]: redis-server.service: Failed with result 'exit-code'.
Jun 26 10:13:02 x systemd[1]: redis-server.service: Scheduled restart job, restart counter is at 5.
Jun 26 10:13:02 x systemd[1]: Stopped Advanced key-value store.
Jun 26 10:13:02 x systemd[1]: redis-server.service: Start request repeated too quickly.
Jun 26 10:13:02 x systemd[1]: redis-server.service: Failed with result 'exit-code'.
Jun 26 10:13:02 x systemd[1]: Failed to start Advanced key-value store.
then check /var/log/redis/redis-server.log for more information; in most cases the underlying issue is reported there.
For example, if a dump.rdb file is placed in /var/lib/redis, the issue might be a mismatch in database count or Redis version.
In another scenario, disabled IPv6 was the cause.

Redis node won't go into MASTER mode

I have a simple Redis deployment with MASTER, SLAVE, and 2 SENTINEL services running on Docker Swarm. I run the stack and all services come up. redis-master starts as MASTER, and I kill it to test SENTINEL and SLAVE recovery. redis-master then recovers and becomes a new SLAVE. If I exec into it and run SLAVEOF NO ONE, the following happens:
1:M 31 Oct 2019 06:28:32.741 * MASTER MODE enabled (user request from 'id=3907 addr=127.0.0.1:39302 fd=36 name= age=0 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=34 qbuf-free=32734 obl=0 oll=0 omem=0 events=r cmd=slaveof')
1:S 31 Oct 2019 06:28:43.060 * Before turning into a replica, using my master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer.
1:S 31 Oct 2019 06:28:43.060 * REPLICAOF 10.0.21.49:6379 enabled (user request from 'id=1085 addr=10.0.21.54:34360 fd=16 name=sentinel-592f3b97-cmd age=945 idle=0 flags=x db=0 sub=0 psub=0 multi=3 qbuf=150 qbuf-free=32618 obl=36 oll=0 omem=0 events=r cmd=exec')
1:S 31 Oct 2019 06:28:43.701 * Connecting to MASTER 10.0.21.49:6379
1:S 31 Oct 2019 06:28:43.702 * MASTER <-> REPLICA sync started
1:S 31 Oct 2019 06:28:43.702 * Non blocking connect for SYNC fired the event.
1:S 31 Oct 2019 06:28:43.702 * Master replied to PING, replication can continue...
1:S 31 Oct 2019 06:28:43.703 * Trying a partial resynchronization (request a056665afb95a1e3a4227ae7fcb1c9b2e2f3b222:244418).
1:S 31 Oct 2019 06:28:43.703 * Full resync from master: adde2c9daee4fa1e62d3494d74d08dfb7110c798:241829
1:S 31 Oct 2019 06:28:43.703 * Discarding previously cached master state.
1:S 31 Oct 2019 06:28:43.715 * MASTER <-> REPLICA sync: receiving 2229 bytes from master
1:S 31 Oct 2019 06:28:43.715 * MASTER <-> REPLICA sync: Flushing old data
1:S 31 Oct 2019 06:28:43.715 * MASTER <-> REPLICA sync: Loading DB in memory
1:S 31 Oct 2019 06:28:43.715 * MASTER <-> REPLICA sync: Finished with success
MASTER MODE kicks in, but it is immediately taken over by REPLICAOF again! How can I force redis-master to always stay master?
Yes, I think this makes sense.
Sentinel always remembers who joined the master-slave group.
When you manually promote the current slave to master, Sentinel cannot tell whether you did this on purpose or a network partition happened, so the sentinels convert the node back to a slave to avoid two masters existing in one group (i.e., split-brain).
To remove a node from the group:
Check the docs. In short, you need to send SENTINEL RESET mastername to all sentinels so they forget the lost node. Then start the lost node as a master; it won't rejoin the sentinel group.
To make the previous (failed) master node stay a master:
After the lost master comes back as a slave, you can run SENTINEL FAILOVER <master name>; the sentinels will perform a failover and swap master and slave. But I don't think you can appoint a specific master when there are more than 3 nodes.
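The two procedures above can be sketched as redis-cli invocations (the sentinel hostnames, port 26379, and the master name mymaster are assumptions; run the RESET against every sentinel):

```
# Make all sentinels forget the lost node, then restart it as a standalone master
redis-cli -h sentinel1 -p 26379 SENTINEL RESET mymaster
redis-cli -h sentinel2 -p 26379 SENTINEL RESET mymaster
redis-cli -h sentinel3 -p 26379 SENTINEL RESET mymaster

# Or, once the old master is back as a slave, ask one sentinel to fail over to it
redis-cli -h sentinel1 -p 26379 SENTINEL FAILOVER mymaster
```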

Redis Master Slave Switch after Aof rewrite

This Redis Cluster has 240 nodes (120 masters and 120 slaves) and has worked well for a long time. But now a master/slave switch happens every few hours.
I collected some logs from the Redis servers.
5c541d3a765e087af7775ba308f51ffb2aa54151
10.12.28.165:6502
13306:M 08 Mar 18:55:02.597 * Background append only file rewriting started by pid 15396
13306:M 08 Mar 18:55:41.636 # Cluster state changed: fail
13306:M 08 Mar 18:55:45.321 # Connection with slave client id #112948 lost.
13306:M 08 Mar 18:55:46.243 # Configuration change detected. Reconfiguring myself as a replica of afb6e012db58bd26a7c96182b04f0a2ba6a45768
13306:S 08 Mar 18:55:47.134 * AOF rewrite child asks to stop sending diffs.
15396:C 08 Mar 18:55:47.134 * Parent agreed to stop sending diffs. Finalizing AOF...
15396:C 08 Mar 18:55:47.134 * Concatenating 0.02 MB of AOF diff received from parent.
15396:C 08 Mar 18:55:47.135 * SYNC append only file rewrite performed
15396:C 08 Mar 18:55:47.186 * AOF rewrite: 4067 MB of memory used by copy-on-write
13306:S 08 Mar 18:55:47.209 # Cluster state changed: ok
5ac747878f881349aa6a62b179176ddf603e034c
10.12.30.107:6500
22825:M 08 Mar 18:55:30.534 * FAIL message received from da493af5bb3d15fc563961de09567a47787881be about 5c541d3a765e087af7775ba308f51ffb2aa54151
22825:M 08 Mar 18:55:31.440 # Failover auth granted to afb6e012db58bd26a7c96182b04f0a2ba6a45768 for epoch 323
22825:M 08 Mar 18:55:41.587 * Background append only file rewriting started by pid 23628
22825:M 08 Mar 18:56:24.200 # Cluster state changed: fail
22825:M 08 Mar 18:56:30.002 # Connection with slave client id #382416 lost.
22825:M 08 Mar 18:56:30.830 * FAIL message received from 0decbe940c6f4d4330fae5a9c129f1ad4932405d about 5ac747878f881349aa6a62b179176ddf603e034c
22825:M 08 Mar 18:56:30.840 # Failover auth denied to d46f95da06cfcd8ea5eaa15efabff5bd5e99df55: its master is up
22825:M 08 Mar 18:56:30.843 # Configuration change detected. Reconfiguring myself as a replica of d46f95da06cfcd8ea5eaa15efabff5bd5e99df55
22825:S 08 Mar 18:56:31.030 * Clear FAIL state for node 5ac747878f881349aa6a62b179176ddf603e034c: slave is reachable again.
22825:S 08 Mar 18:56:31.030 * Clear FAIL state for node 5c541d3a765e087af7775ba308f51ffb2aa54151: slave is reachable again.
22825:S 08 Mar 18:56:31.294 # Cluster state changed: ok
22825:S 08 Mar 18:56:31.595 * Connecting to MASTER 10.12.30.104:6404
22825:S 08 Mar 18:56:31.671 * MASTER SLAVE sync started
22825:S 08 Mar 18:56:31.671 * Non blocking connect for SYNC fired the event.
22825:S 08 Mar 18:56:31.672 * Master replied to PING, replication can continue...
22825:S 08 Mar 18:56:31.673 * Partial resynchronization not possible (no cached master)
22825:S 08 Mar 18:56:31.691 * AOF rewrite child asks to stop sending diffs.
It appears that the master/slave switch happens right after AOF rewriting.
Here is the config of this cluster.
daemonize no
tcp-backlog 511
timeout 0
tcp-keepalive 60
loglevel notice
databases 16
dir "/var/cachecloud/data"
stop-writes-on-bgsave-error no
repl-timeout 60
repl-ping-slave-period 10
repl-disable-tcp-nodelay no
repl-backlog-size 10000000
repl-backlog-ttl 7200
slave-serve-stale-data yes
slave-read-only yes
slave-priority 100
lua-time-limit 5000
slowlog-log-slower-than 10000
slowlog-max-len 128
hash-max-ziplist-entries 512
hash-max-ziplist-value 64
list-max-ziplist-entries 512
list-max-ziplist-value 64
set-max-intset-entries 512
zset-max-ziplist-entries 128
zset-max-ziplist-value 64
activerehashing yes
client-output-buffer-limit normal 0 0 0
client-output-buffer-limit slave 512mb 128mb 60
client-output-buffer-limit pubsub 32mb 8mb 60
hz 10
port 6401
maxmemory 13000mb
maxmemory-policy volatile-lru
appendonly yes
appendfsync no
appendfilename "appendonly-6401.aof"
dbfilename "dump-6401.rdb"
aof-rewrite-incremental-fsync yes
no-appendfsync-on-rewrite yes
auto-aof-rewrite-min-size 62500kb
auto-aof-rewrite-percentage 86
rdbcompression yes
rdbchecksum yes
repl-diskless-sync no
repl-diskless-sync-delay 5
maxclients 10000
hll-sparse-max-bytes 3000
min-slaves-to-write 0
min-slaves-max-lag 10
aof-load-truncated yes
notify-keyspace-events ""
bind 10.12.26.226
protected-mode no
cluster-enabled yes
cluster-node-timeout 15000
cluster-slave-validity-factor 10
cluster-migration-barrier 1
cluster-config-file "nodes-6401.conf"
cluster-require-full-coverage no
rename-command FLUSHDB ""
rename-command FLUSHALL ""
rename-command KEYS ""
In my opinion, an AOF rewrite should not affect the Redis main thread, but it seems to make this node stop responding to other nodes' pings.
Check THP (Transparent Huge Pages) in the Linux kernel parameters:
the AOF diff is only 0.02 MB, yet copy-on-write used 4067 MB.
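Checking and disabling THP can be sketched as below (these are the standard sysfs paths; disabling requires root, persists only until reboot unless added to a boot script, and Redis must be restarted afterwards):

```
# Check the current THP setting; Redis wants [never]
cat /sys/kernel/mm/transparent_hugepage/enabled

# Disable it for the running kernel (as root), then restart Redis
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
```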

Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis

I ran Test-1 and Test-2 below as long-running performance tests with the Redis configuration values specified, and we still see the highlighted Error-1 and Error-2 messages; the cluster fails for some time and a few of our processing jobs fail. How can I solve this problem?
Does anyone have a suggestion for avoiding cluster failures that last longer than 10 seconds? The cluster does not come back within 3 retry attempts (we use a Spring RetryTemplate with the retry count set to 3, retrying after 5 s with exponential backoff on subsequent attempts) with the Jedis client.
Error-1: Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
Error-2: Marking node a523100ddfbf844c6d1cc7e0b6a4b3a2aa970aba as failing (quorum reached).
Test-1:
Run the test with Redis Setting:
"appendfsync"="yes"
"appendonly"="no"
[root@rdcapdev1-redis-cache3 redis-3.2.5]# src/redis-cli -p 6379
127.0.0.1:6379> CONFIG GET *aof*
1) "auto-aof-rewrite-percentage"
2) "30"
3) "auto-aof-rewrite-min-size"
4) "67108864"
5) "aof-rewrite-incremental-fsync"
6) "yes"
7) "aof-load-truncated"
8) "yes"
127.0.0.1:6379> exit
[root@rdcapdev1-redis-cache3 redis-3.2.5]# src/redis-cli -p 6380
127.0.0.1:6380> CONFIG GET *aof*
1) "auto-aof-rewrite-percentage"
2) "30"
3) "auto-aof-rewrite-min-size"
4) "67108864"
5) "aof-rewrite-incremental-fsync"
6) "yes"
7) "aof-load-truncated"
8) "yes"
127.0.0.1:6380> clear
Observation:
1. The Redis failover lasted ~40 sec.
2. Around 20 documents failed at the FX and OCR level, due to the inability to write/read files to Redis.
3. This happened when ~50% of RAM was utilized.
4. The master-slave configuration was reshuffled as shown below after this failover.
5. Below are a few highlights of the Redis log; please refer to the attached log for more detail.
6. I have logs for this; for more details: 30Per_AofRW_2.zip
Redis1 Master log:
2515:C 05 May 11:06:30.343 * DB saved on disk
2515:C 05 May 11:06:30.379 * RDB: 15 MB of memory used by copy-on-write
837:S 05 May 11:06:30.429 * Background saving terminated with success
837:S 05 May 11:11:31.024 * 10 changes in 300 seconds. Saving...
837:S 05 May 11:11:31.067 * Background saving started by pid 2534
837:S 05 May 11:12:24.802 * FAIL message received from 6b8d49e9db288b13071559c667e95e3691ce8bd0 about ce62a26102ef54f43fa7cca64d24eab45cf42a61
837:S 05 May 11:12:27.049 * Clear FAIL state for node ce62a26102ef54f43fa7cca64d24eab45cf42a61: slave is reachable again.
2534:C 05 May 11:12:31.110 * DB saved on disk
Redis2 Master log:
837:M 05 May 10:30:22.216 * Marking node a523100ddfbf844c6d1cc7e0b6a4b3a2aa970aba as failing (quorum reached).
837:M 05 May 10:30:22.216 # Cluster state changed: fail
837:M 05 May 10:30:23.148 # Failover auth granted to 6b8d49e9db288b13071559c667e95e3691ce8bd0 for epoch 12
837:M 05 May 10:30:23.188 # Cluster state changed: ok
837:M 05 May 10:30:27.227 * Clear FAIL state for node a523100ddfbf844c6d1cc7e0b6a4b3a2aa970aba: slave is reachable again.
837:M 05 May 10:35:22.017 * 10 changes in 300 seconds. Saving...
.
.
.
837:M 05 May 11:12:23.592 * FAIL message received from 6b8d49e9db288b13071559c667e95e3691ce8bd0 about ce62a26102ef54f43fa7cca64d24eab45cf42a61
837:M 05 May 11:12:27.045 * Clear FAIL state for node ce62a26102ef54f43fa7cca64d24eab45cf42a61: slave is reachable again.
Redis3 Master Log:
833:M 05 May 10:30:22.217 * FAIL message received from 83f6a9589aa1bce8932a367894fa391edd0ce269 about a523100ddfbf844c6d1cc7e0b6a4b3a2aa970aba
833:M 05 May 10:30:22.217 # Cluster state changed: fail
833:M 05 May 10:30:23.149 # Failover auth granted to 6b8d49e9db288b13071559c667e95e3691ce8bd0 for epoch 12
833:M 05 May 10:30:23.189 # Cluster state changed: ok
1822:C 05 May 10:30:27.397 * DB saved on disk
1822:C 05 May 10:30:27.428 * RDB: 8 MB of memory used by copy-on-write
833:M 05 May 10:30:27.528 * Background saving terminated with success
833:M 05 May 10:30:27.828 * Clear FAIL state for node a523100ddfbf844c6d1cc7e0b6a4b3a2aa970aba: slave is reachable again.
HOST: localhost PORT: 6379
machine master slave
10.2.1.233 0.00 2.00
10.2.1.46 2.00 0.00
10.2.1.202 1.00 1.00
MASTER SLAVE INFO
hashCode master slave hashSlot
81ae2d757f57f36fa1df6e930af3b072084ba3e8 10.2.1.202:6379 10.2.1.233:6380, 10923-16383
6b8d49e9db288b13071559c667e95e3691ce8bd0 10.2.1.46:6380 10.2.1.233:6379, 0-5460
83f6a9589aa1bce8932a367894fa391edd0ce269 10.2.1.46:6379 10.2.1.202:6380, 5461-10922
6b8d49e9db288b13071559c667e95e3691ce8bd0 10.2.1.46:6380 master - 0 1493981044497 12 connected 0-5460
81ae2d757f57f36fa1df6e930af3b072084ba3e8 10.2.1.202:6379 master - 0 1493981045500 3 connected 10923-16383
ce62a26102ef54f43fa7cca64d24eab45cf42a61 10.2.1.202:6380 slave 83f6a9589aa1bce8932a367894fa391edd0ce269 0 1493981043495 10 connected
ac630108d1556786a4df74945cfe35db981d15fa 10.2.1.233:6380 slave 81ae2d757f57f36fa1df6e930af3b072084ba3e8 0 1493981042492 11 connected
83f6a9589aa1bce8932a367894fa391edd0ce269 10.2.1.46:6379 master - 0 1493981044497 2 connected 5461-10922
a523100ddfbf844c6d1cc7e0b6a4b3a2aa970aba 10.2.1.233:6379 myself,slave 6b8d49e9db288b13071559c667e95e3691ce8bd0 0 0 1 connected
Test-2:
Run the test with Redis Setting:
"appendfsync"="no"
"appendonly"="yes"
Observation:
1. The Redis failover lasted ~40 sec.
2. Around 20 documents failed at the FX and OCR level, due to the inability to write/read files to Redis.
3. This happened when ~50% of RAM was utilized.
4. The master-slave configuration was reshuffled as shown below after this failover.
5. Below are a few highlights of the Redis log; please refer to the attached log for more detail.
30Per_AofRW_2.zip
Redis1 Master log:
2515:C 05 May 11:06:30.343 * DB saved on disk
2515:C 05 May 11:06:30.379 * RDB: 15 MB of memory used by copy-on-write
837:S 05 May 11:06:30.429 * Background saving terminated with success
837:S 05 May 11:11:31.024 * 10 changes in 300 seconds. Saving...
837:S 05 May 11:11:31.067 * Background saving started by pid 2534
837:S 05 May 11:12:24.802 * FAIL message received from 6b8d49e9db288b13071559c667e95e3691ce8bd0 about ce62a26102ef54f43fa7cca64d24eab45cf42a61
837:S 05 May 11:12:27.049 * Clear FAIL state for node ce62a26102ef54f43fa7cca64d24eab45cf42a61: slave is reachable again.
2534:C 05 May 11:12:31.110 * DB saved on disk
Redis2 Master log:
5306:M 03 Apr 09:02:36.947 * Background saving terminated with success
5306:M 03 Apr 09:02:49.574 * Starting automatic rewriting of AOF on 3% growth
5306:M 03 Apr 09:02:49.583 * Background append only file rewriting started by pid 12864
5306:M 03 Apr 09:02:54.050 * AOF rewrite child asks to stop sending diffs.
12864:C 03 Apr 09:02:54.051 * Parent agreed to stop sending diffs. Finalizing AOF...
12864:C 03 Apr 09:02:54.051 * Concatenating 0.00 MB of AOF diff received from parent.
12864:C 03 Apr 09:02:54.051 * SYNC append only file rewrite performed
12864:C 03 Apr 09:02:54.058 * AOF rewrite: 2 MB of memory used by copy-on-write
5306:M 03 Apr 09:02:54.098 * Background AOF rewrite terminated with success
5306:M 03 Apr 09:02:54.098 * Residual parent diff successfully flushed to the rewritten AOF (0.00 MB)
5306:M 03 Apr 09:02:54.098 * Background AOF rewrite finished successfully
5306:M 03 Apr 09:04:01.843 * Starting automatic rewriting of AOF on 3% growth
5306:M 03 Apr 09:04:01.853 * Background append only file rewriting started by pid 12867
5306:M 03 Apr 09:04:11.657 * AOF rewrite child asks to stop sending diffs.
12867:C 03 Apr 09:04:11.657 * Parent agreed to stop sending diffs. Finalizing AOF...
12867:C 03 Apr 09:04:11.657 * Concatenating 0.00 MB of AOF diff received from parent.
12867:C 03 Apr 09:04:11.657 * SYNC append only file rewrite performed
12867:C 03 Apr 09:04:11.664 * AOF rewrite: 2 MB of memory used by copy-on-write
5306:M 03 Apr 09:04:11.675 * Background AOF rewrite terminated with success
5306:M 03 Apr 09:04:11.675 * Residual parent diff successfully flushed to the rewritten AOF (0.00 MB)
5306:M 03 Apr 09:04:11.675 * Background AOF rewrite finished successfully
5306:M 03 Apr 09:04:48.054 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
5306:M 03 Apr 09:05:28.571 * Starting automatic rewriting of AOF on 3% growth
5306:M 03 Apr 09:05:28.581 * Background append only file rewriting started by pid 12873
5306:M 03 Apr 09:05:33.300 * AOF rewrite child asks to stop sending diffs.
12873:C 03 Apr 09:05:33.300 * Parent agreed to stop sending diffs. Finalizing AOF...
12873:C 03 Apr 09:05:33.300 * Concatenating 2.09 MB of AOF diff received from parent.
12873:C 03 Apr 09:05:33.329 * SYNC append only file rewrite performed
12873:C 03 Apr 09:05:33.336 * AOF rewrite: 11 MB of memory used by copy-on-write
5306:M 03 Apr 09:05:33.395 * Background AOF rewrite terminated with success
5306:M 03 Apr 09:05:33.395 * Residual parent diff successfully flushed to the rewritten AOF (0.00 MB)
5306:M 03 Apr 09:05:33.395 * Background AOF rewrite finished successfully
5306:M 03 Apr 09:07:37.082 * 10 changes in 300 seconds. Saving...
5306:M 03 Apr 09:07:37.092 * Background saving started by pid 12875
12875:C 03 Apr 09:07:47.016 * DB saved on disk
12875:C 03 Apr 09:07:47.024 * RDB: 5 MB of memory used by copy-on-write
5306:M 03 Apr 09:07:47.113 * Background saving terminated with success
5306:M 03 Apr 09:07:51.622 * Starting automatic rewriting of AOF on 3% growth
5306:M 03 Apr 09:07:51.632 * Background append only file rewriting started by pid 12876
5306:M 03 Apr 09:07:56.559 * AOF rewrite child asks to stop sending diffs.
12876:C 03 Apr 09:07:56.559 * Parent agreed to stop sending diffs. Finalizing AOF...
12876:C 03 Apr 09:07:56.559 * Concatenating 0.00 MB of AOF diff received from parent.
12876:C 03 Apr 09:07:56.559 * SYNC append only file rewrite performed
12876:C 03 Apr 09:07:56.567 * AOF rewrite: 2 MB of memory used by copy-on-write
5306:M 03 Apr 09:07:56.645 * Background AOF rewrite terminated with success
5306:M 03 Apr 09:07:56.645 * Residual parent diff successfully flushed to the rewritten AOF (0.00 MB)
5306:M 03 Apr 09:07:56.645 * Background AOF rewrite finished successfully
5306:M 03 Apr 09:12:48.071 * 10 changes in 300 seconds. Saving...
5306:M 03 Apr 09:12:48.081 * Background saving started by pid 12882
12882:C 03 Apr 09:12:58.381 * DB saved on disk
12882:C 03 Apr 09:12:58.389 * RDB: 5 MB of memory used by copy-on-write
5306:M 03 Apr 09:12:58.403 * Background saving terminated with success
5306:M 03 Apr 10:17:33.005 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
5306:M 03 Apr 10:22:42.042 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
5306:M 03 Apr 10:27:51.039 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
5306:M 03 Apr 11:10:10.606 * Marking node a523100ddfbf844c6d1cc7e0b6a4b3a2aa970aba as failing (quorum reached).
5306:M 03 Apr 11:10:10.607 # Cluster state changed: fail
5306:M 03 Apr 11:10:10.608 * FAIL message received from 83f6a9589aa1bce8932a367894fa391edd0ce269 about ac630108d1556786a4df74945cfe35db981d15fa
5306:M 03 Apr 11:10:11.594 # Failover auth granted to 6b8d49e9db288b13071559c667e95e3691ce8bd0 for epoch 7
HOST: localhost PORT: 6379
machine master slave
10.2.1.233 0.00 2.00
10.2.1.46 2.00 0.00
10.2.1.202 1.00 1.00
MASTER SLAVE INFO
hashCode master slave hashSlot
81ae2d757f57f36fa1df6e930af3b072084ba3e8 10.2.1.202:6379 10.2.1.233:6380, 10923-16383
6b8d49e9db288b13071559c667e95e3691ce8bd0 10.2.1.46:6380 10.2.1.233:6379, 0-5460
83f6a9589aa1bce8932a367894fa391edd0ce269 10.2.1.46:6379 10.2.1.202:6380, 5461-10922
6b8d49e9db288b13071559c667e95e3691ce8bd0 10.2.1.46:6380 master - 0 1493981044497 12 connected 0-5460
81ae2d757f57f36fa1df6e930af3b072084ba3e8 10.2.1.202:6379 master - 0 1493981045500 3 connected 10923-16383
ce62a26102ef54f43fa7cca64d24eab45cf42a61 10.2.1.202:6380 slave 83f6a9589aa1bce8932a367894fa391edd0ce269 0 1493981043495 10 connected
ac630108d1556786a4df74945cfe35db981d15fa 10.2.1.233:6380 slave 81ae2d757f57f36fa1df6e930af3b072084ba3e8 0 1493981042492 11 connected
83f6a9589aa1bce8932a367894fa391edd0ce269 10.2.1.46:6379 master - 0 1493981044497 2 connected 5461-10922
a523100ddfbf844c6d1cc7e0b6a4b3a2aa970aba 10.2.1.233:6379 myself,slave 6b8d49e9db288b13071559c667e95e3691ce8bd0 0 0 1 connected

Redis Sentinel manual failover command times out

I have one Redis master, one slave, and one Sentinel monitoring them. Everything seems to work properly, including failover when the master is killed. But when I issue the SENTINEL FAILOVER command, Sentinel gets stuck in the state +failover-state-wait-promotion for a few minutes. It seems like the slave is not getting the promotion command.
This doesn't make sense, because there doesn't seem to be any trouble with network communication from the Sentinel host to either of the Redis hosts. I'm running all 3 processes in Docker containers, but I'm not sure how that could cause the problem. I can run redis-cli from the Sentinel host (i.e. from inside the Docker container) and can remotely execute the slaveof command. I can also monitor both Redis instances and see Sentinel pings and info requests. I looked at the logs for the master and slave and see nothing abnormal. Looking at THIS post, there does not seem to be any reason why Sentinel would consider the Redis instances invalid.
I'm fairly experienced with Sentinel, but rather new to Docker. I'm not sure how to proceed in determining what the problem is. Any ideas?
Sentinel Log
[8] 01 Jul 01:36:57.317 # Sentinel runid is c337f6f0dfa1d41357338591cd0181c07cb026d0
[8] 01 Jul 01:38:13.135 # +monitor master redis-holt-overflow 10.19.8.2 6380 quorum 1
[8] 01 Jul 01:38:13.135 # +set master redis-holt-overflow 10.19.8.2 6380 down-after-milliseconds 3100
[8] 01 Jul 01:38:13.199 * +slave slave 10.19.8.3:6381 10.19.8.3 6381 # redis-holt-overflow 10.19.8.2 6380
[8] 01 Jul 01:38:42.288 # Executing user requested FAILOVER of 'redis-holt-overflow'
[8] 01 Jul 01:38:42.288 # +new-epoch 1
[8] 01 Jul 01:38:42.288 # +try-failover master redis-holt-overflow 10.19.8.2 6380
[8] 01 Jul 01:38:42.352 # +vote-for-leader c337f6f0dfa1d41357338591cd0181c07cb026d0 1
[8] 01 Jul 01:38:42.352 # +elected-leader master redis-holt-overflow 10.19.8.2 6380
[8] 01 Jul 01:38:42.352 # +failover-state-select-slave master redis-holt-overflow 10.19.8.2 6380
[8] 01 Jul 01:38:42.404 # +selected-slave slave 10.19.8.3:6381 10.19.8.3 6381 # redis-holt-overflow 10.19.8.2 6380
[8] 01 Jul 01:38:42.404 * +failover-state-send-slaveof-noone slave 10.19.8.3:6381 10.19.8.3 6381 # redis-holt-overflow 10.19.8.2 6380
[8] 01 Jul 01:38:42.488 * +failover-state-wait-promotion slave 10.19.8.3:6381 10.19.8.3 6381 # redis-holt-overflow 10.19.8.2 6380
[8] 01 Jul 01:41:42.565 # -failover-abort-slave-timeout master redis-holt-overflow 10.19.8.2 6380
Redis Master Log
[17] 01 Jul 01:13:58.251 # Server started, Redis version 2.8.21
[17] 01 Jul 01:13:58.252 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
[17] 01 Jul 01:13:58.252 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
[17] 01 Jul 01:13:58.252 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
[17] 01 Jul 01:13:58.252 * DB loaded from disk: 0.000 seconds
[17] 01 Jul 01:13:58.252 * The server is now ready to accept connections on port 6380
[17] 01 Jul 01:34:45.796 * Slave 10.196.88.30:6381 asks for synchronization
[17] 01 Jul 01:34:45.796 * Full resync requested by slave 10.196.88.30:6381
[17] 01 Jul 01:34:45.796 * Starting BGSAVE for SYNC with target: disk
[17] 01 Jul 01:34:45.797 * Background saving started by pid 20
[20] 01 Jul 01:34:45.798 * DB saved on disk
[20] 01 Jul 01:34:45.799 * RDB: 0 MB of memory used by copy-on-write
[17] 01 Jul 01:34:45.808 * Background saving terminated with success
[17] 01 Jul 01:34:45.808 * Synchronization with slave 10.196.88.30:6381 succeeded
[17] 01 Jul 01:38:42.343 # Connection with slave 10.196.88.30:6381 lost.
[17] 01 Jul 01:38:43.275 * Slave 10.196.88.30:6381 asks for synchronization
[17] 01 Jul 01:38:43.275 * Full resync requested by slave 10.196.88.30:6381
[17] 01 Jul 01:38:43.275 * Starting BGSAVE for SYNC with target: disk
[17] 01 Jul 01:38:43.275 * Background saving started by pid 21
[21] 01 Jul 01:38:43.277 * DB saved on disk
[21] 01 Jul 01:38:43.277 * RDB: 0 MB of memory used by copy-on-write
[17] 01 Jul 01:38:43.368 * Background saving terminated with success
[17] 01 Jul 01:38:43.368 * Synchronization with slave 10.196.88.30:6381 succeeded
Redis Slave Log
[14] 01 Jul 01:15:51.435 # Server started, Redis version 2.8.21
[14] 01 Jul 01:15:51.435 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
[14] 01 Jul 01:15:51.435 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
[14] 01 Jul 01:15:51.435 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
[14] 01 Jul 01:15:51.435 * DB loaded from disk: 0.000 seconds
[14] 01 Jul 01:15:51.435 * The server is now ready to accept connections on port 6381
[14] 01 Jul 01:34:45.088 * SLAVE OF 10.196.88.29:6380 enabled (user request)
[14] 01 Jul 01:34:45.947 * Connecting to MASTER 10.196.88.29:6380
[14] 01 Jul 01:34:45.947 * MASTER <-> SLAVE sync started
[14] 01 Jul 01:34:45.948 * Non blocking connect for SYNC fired the event.
[14] 01 Jul 01:34:45.948 * Master replied to PING, replication can continue...
[14] 01 Jul 01:34:45.948 * Partial resynchronization not possible (no cached master)
[14] 01 Jul 01:34:45.948 * Full resync from master: b912b647401917d52742c0eac3ae2f795f59f48f:1
[14] 01 Jul 01:34:45.960 * MASTER <-> SLAVE sync: receiving 18 bytes from master
[14] 01 Jul 01:34:45.960 * MASTER <-> SLAVE sync: Flushing old data
[14] 01 Jul 01:34:45.960 * MASTER <-> SLAVE sync: Loading DB in memory
[14] 01 Jul 01:34:45.960 * MASTER <-> SLAVE sync: Finished with success
[14] 01 Jul 01:38:42.495 # Connection with master lost.
[14] 01 Jul 01:38:42.495 * Caching the disconnected master state.
[14] 01 Jul 01:38:42.495 * Discarding previously cached master state.
[14] 01 Jul 01:38:42.495 * MASTER MODE enabled (user request)
[14] 01 Jul 01:38:42.495 # CONFIG REWRITE executed with success.
[14] 01 Jul 01:38:42.506 * SLAVE OF 10.196.88.29:6380 enabled (user request)
[14] 01 Jul 01:38:43.425 * Connecting to MASTER 10.196.88.29:6380
[14] 01 Jul 01:38:43.426 * MASTER <-> SLAVE sync started
[14] 01 Jul 01:38:43.426 * Non blocking connect for SYNC fired the event.
[14] 01 Jul 01:38:43.427 * Master replied to PING, replication can continue...
[14] 01 Jul 01:38:43.427 * Partial resynchronization not possible (no cached master)
[14] 01 Jul 01:38:43.427 * Full resync from master: b912b647401917d52742c0eac3ae2f795f59f48f:10930
[14] 01 Jul 01:38:43.520 * MASTER <-> SLAVE sync: receiving 18 bytes from master
[14] 01 Jul 01:38:43.520 * MASTER <-> SLAVE sync: Flushing old data
[14] 01 Jul 01:38:43.520 * MASTER <-> SLAVE sync: Loading DB in memory
[14] 01 Jul 01:38:43.520 * MASTER <-> SLAVE sync: Finished with success
Sentinel Config
port 26379
pidfile "/var/run/redis-sentinel.pid"
logfile ""
daemonize no
# Generated by CONFIG REWRITE
dir "/data"
sentinel monitor redis-holt-overflow 10.19.8.2 6380 1
sentinel down-after-milliseconds redis-holt-overflow 3100
sentinel config-epoch redis-holt-overflow 0
sentinel leader-epoch redis-holt-overflow 1
sentinel known-slave redis-holt-overflow 10.19.8.3 6381
sentinel current-epoch 1
Redis & Sentinel Info:
redis_version:2.8.21
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:551c16ab9d912477
redis_mode:standalone
os:Linux 3.10.0-123.8.1.el7.x86_64 x86_64
arch_bits:64
multiplexing_api:epoll
gcc_version:4.7.2
process_id:13
run_id:7e1a1b6c844a969424d16f3efa116707ea7a60bf
tcp_port:6380
uptime_in_seconds:1312
uptime_in_days:0
hz:10
lru_clock:9642428
config_file:/usr/local/etc/redis/redis.conf
It appears you are running into the "docker network" issue. Notice that your logs show different IPs for the same instances (10.19.8.x in the Sentinel log vs. 10.196.88.x in the Redis logs): Sentinel records whatever IP an instance appears to connect from during discovery, not necessarily an address that is reachable from outside the container. Are these on different Docker hosts?
From the documentation:
Since Sentinels auto detect slaves using masters INFO output information, the detected slaves will not be reachable, and Sentinel will never be able to failover the master, since there are no good slaves from the point of view of the system, so there is currently no way to monitor with Sentinel a set of master and slave instances deployed with Docker, unless you instruct Docker to map the port 1:1.
A Sentinel Docker image at https://registry.hub.docker.com/u/joshula/redis-sentinel/ shows how announce-ip and bind are used to set this up.
For more details, see http://redis.io/topics/sentinel, specifically the Docker section, which explains how to configure Sentinel to handle this situation under Docker.
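As a sketch of the two fixes the documentation describes (the IPs below are the externally reachable addresses from your logs and are assumptions; note the announce directives for data nodes require a newer Redis than the 2.8.21 you are running):

```
# Option 1: map ports 1:1 so the container-internal and host ports match,
# e.g. "docker run --net=host ..." or "-p 6380:6380" on every container.

# Option 2: tell each process which externally reachable address to advertise.
# In the Sentinel config:
sentinel announce-ip 10.19.8.1
sentinel announce-port 26379

# In each Redis replica config (Redis >= 3.2):
slave-announce-ip 10.19.8.3
slave-announce-port 6381
```

With either option, the addresses Sentinel learns from the master's INFO output become reachable, so discovery and failover work across container boundaries.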
Dog gone it, yeah, it was one of the scripts. It was triggering during the interim period when both Redis instances are masters and preemptively reverting the promoted slave back to slave status. It's been a long week.