redis-cli 'sentinel slaves redis-cluster' returns an empty list with a password-protected master

[Redis] [redis-db] 'sentinel slaves' returns an empty list with a password-protected master.
Dear All,
My current redis-cluster setup is the following:
3 Different linux servers
srv 1 => redis master + sentinel 1
srv 2 => redis slave + sentinel 2
srv 3 => sentinel 3 (sentinel only, to avoid a split-brain situation)
The Redis version:
redis_version:3.2.3
redis_mode:sentinel
os:Linux 3.10.0-514.21.2.el7.x86_64 x86_64
tcp_port:26379
For some reason Sentinel can't find a suitable slave to promote to master in case of a failover.
The redis-cli command "sentinel slaves redis-cluster" returns an empty list :/ (see my terminal output below), BUT the 3 sentinels can "talk" to each other.
Here are the 3 redis-cli sentinel commands I used to get this information:
ip-10-0-0-118.eu-west-1.compute.internal:26379> sentinel slaves redis-cluster
(empty list or set)
ip-10-0-0-118.eu-west-1.compute.internal:26379> sentinel ckquorum redis-cluster
OK 3 usable Sentinels. Quorum and failover authorization can be reached
ip-10-0-0-118.eu-west-1.compute.internal:26379> sentinel failover redis-cluster
(error) NOGOODSLAVE No suitable slave to promote
The configuration files (redis and sentinel) are basic and I use authentication.
Any idea what I might have misconfigured so far? :/
Thanks in advance.
kr,
Orsius.
documentation:
https://redis.io/topics/sentinel
http://download.redis.io/redis-stable/sentinel.conf
here are my sentinel logs:
. . .
2361:X 17 Jul 09:20:55.159 # 04ffbe62cec24e9635abbf8985c804e27bb8899b voted for 2cd4dce89889baadc178ba8909b894cf42f184d9 23
2361:X 17 Jul 09:20:55.170 # f5e93cc7c1a109ca8aa4588b92156f7fb5c29c72 voted for 2cd4dce89889baadc178ba8909b894cf42f184d9 23
2361:X 17 Jul 09:20:55.221 # +elected-leader master redis-cluster 10.0.0.223 6379
2361:X 17 Jul 09:20:55.221 # +failover-state-select-slave master redis-cluster 10.0.0.223 6379
2361:X 17 Jul 09:20:55.304 # -failover-abort-no-good-slave master redis-cluster 10.0.0.223 6379
2361:X 17 Jul 09:20:55.357 # Next failover delay: I will not start a failover before Mon Jul 17 09:26:55 2017
2361:X 17 Jul 09:21:41.876 # +new-epoch 24
2361:X 17 Jul 09:21:41.878 # +vote-for-leader f5e93cc7c1a109ca8aa4588b92156f7fb5c29c72 24
2361:X 17 Jul 09:21:41.920 # Next failover delay: I will not start a failover before Mon Jul 17 09:27:42 2017
2361:X 17 Jul 09:27:42.092 # +new-epoch 25
2361:X 17 Jul 09:27:42.092 # +try-failover master redis-cluster 10.0.0.223 6379
2361:X 17 Jul 09:27:42.099 # +vote-for-leader 2cd4dce89889baadc178ba8909b894cf42f184d9 25
2361:X 17 Jul 09:27:42.102 # f5e93cc7c1a109ca8aa4588b92156f7fb5c29c72 voted for 2cd4dce89889baadc178ba8909b894cf42f184d9 25
2361:X 17 Jul 09:27:42.103 # 04ffbe62cec24e9635abbf8985c804e27bb8899b voted for 2cd4dce89889baadc178ba8909b894cf42f184d9 25
2361:X 17 Jul 09:27:42.165 # +elected-leader master redis-cluster 10.0.0.223 6379
2361:X 17 Jul 09:27:42.165 # +failover-state-select-slave master redis-cluster 10.0.0.223 6379
2361:X 17 Jul 09:27:42.248 # -failover-abort-no-good-slave master redis-cluster 10.0.0.223 6379
2361:X 17 Jul 09:27:42.314 # Next failover delay: I will not start a failover before Mon Jul 17 09:33:42 2017
. . .
If I trust the following GitHub issue, Sentinel only promotes "good" slaves to master.
source: https://github.com/antirez/redis/issues/1796
A slave is considered a good candidate for promotion when it satisfies these rules:
its slave-priority is not 0;
it is not being demoted (i.e. it is not the old master);
its last PING reply was received within the info validity time;
its last INFO reply was received within the info validity time;
it is not in sdown or odown state, and it is not disconnected.
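One way to see which of these rules is the problem (a suggestion, not something from the original thread): Sentinel discovers slaves from the master's INFO output, so it is worth checking whether the master itself still lists the slave. With a password-protected master that check would look something like this (IP and password are placeholders):

redis-cli -h <master-ip> -p 6379 -a <master-password> info replication
# expect role:master, connected_slaves:1 and a slave0:ip=...,state=online line;
# if connected_slaves is 0, Sentinel has nothing to discover or promote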

My problem was actually a misconfiguration in my redis-cluster files (redis.conf & redis-sentinel.conf), which launched my two Redis instances in 'standalone' mode.
I put the working configuration on my GitHub repository: [github.com/orsius/redis-cluster][1]
Hope it'll help someone one day.
Keep calm and continue using redis-cluster ;)
Orsius.
  [1]: https://github.com/orsius/redis-cluster
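For anyone hitting the same symptom with a password-protected master: the exact files are in the repository above, but as a rough sketch (placeholders, not the author's actual settings), the replication and authentication directives involved usually look like this:

# master redis.conf
requirepass <password>
masterauth <password>          # so the old master can resync after a failover

# slave redis.conf
slaveof <master-ip> 6379
requirepass <password>
masterauth <password>

# sentinel.conf on all three sentinels
sentinel monitor redis-cluster <master-ip> 6379 2
sentinel auth-pass redis-cluster <password>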

Related

Redis service automatically stops after a few minutes of running

On my Ubuntu machine, the Redis server was running fine and then suddenly stopped. When I start it again, it automatically stops after a few minutes, so I start it again, and so on. Why is this happening?
Here are the logs when I start redis:
21479:C 29 Apr 21:59:10.986 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
21479:C 29 Apr 21:59:10.987 # Redis version=4.0.9, bits=64, commit=00000000, modified=0, pid=21479, just started
21479:C 29 Apr 21:59:10.987 # Configuration loaded
21480:M 29 Apr 21:59:10.990 * Increased maximum number of open files to 10032 (it was originally set to 1024).
21480:M 29 Apr 21:59:10.991 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
21480:M 29 Apr 21:59:10.992 # Server initialized
21480:M 29 Apr 21:59:14.588 * DB loaded from disk: 3.596 seconds
21480:M 29 Apr 21:59:14.591 * Ready to accept connections
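Nothing in that startup log explains the stop, so (just a guess, not something given in the question) the next thing worth checking is what actually terminates the process, for example whether systemd or the kernel OOM killer killed it:

journalctl -u redis-server --since "1 hour ago"    # how systemd saw the shutdown
dmesg | grep -i -E "oom|killed process"            # kernel OOM killer activity
tail -n 50 /var/log/redis/redis-server.log         # Redis's own last log lines (path may differ)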

Cannot restart redis-sentinel unit

I'm trying to configure 3 Redis instances and 6 Sentinels (3 of them running on the Redis hosts and the rest on different hosts). But when I install the redis-sentinel package, put my configuration under /etc/redis/sentinel.conf and restart the service using systemctl restart redis-sentinel, I get this error:
Job for redis-sentinel.service failed because a timeout was exceeded.
See "systemctl status redis-sentinel.service" and "journalctl -xe" for details.
Here is the output of journalctl -u redis-sentinel:
Jan 01 08:07:07 redis1 systemd[1]: Starting Advanced key-value store...
Jan 01 08:07:07 redis1 redis-sentinel[16269]: 16269:X 01 Jan 2020 08:07:07.263 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
Jan 01 08:07:07 redis1 redis-sentinel[16269]: 16269:X 01 Jan 2020 08:07:07.263 # Redis version=5.0.7, bits=64, commit=00000000, modified=0, pid=16269, just started
Jan 01 08:07:07 redis1 redis-sentinel[16269]: 16269:X 01 Jan 2020 08:07:07.263 # Configuration loaded
Jan 01 08:07:07 redis1 systemd[1]: redis-sentinel.service: Can't open PID file /var/run/sentinel/redis-sentinel.pid (yet?) after start: No such file or directory
Jan 01 08:08:37 redis1 systemd[1]: redis-sentinel.service: Start operation timed out. Terminating.
Jan 01 08:08:37 redis1 systemd[1]: redis-sentinel.service: Failed with result 'timeout'.
Jan 01 08:08:37 redis1 systemd[1]: Failed to start Advanced key-value store.
Jan 01 08:08:37 redis1 systemd[1]: redis-sentinel.service: Service hold-off time over, scheduling restart.
Jan 01 08:08:37 redis1 systemd[1]: redis-sentinel.service: Scheduled restart job, restart counter is at 5.
Jan 01 08:08:37 redis1 systemd[1]: Stopped Advanced key-value store.
Jan 01 08:08:37 redis1 systemd[1]: Starting Advanced key-value store...
Jan 01 08:08:37 redis1 redis-sentinel[16307]: 16307:X 01 Jan 2020 08:08:37.738 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
Jan 01 08:08:37 redis1 redis-sentinel[16307]: 16307:X 01 Jan 2020 08:08:37.739 # Redis version=5.0.7, bits=64, commit=00000000, modified=0, pid=16307, just started
Jan 01 08:08:37 redis1 redis-sentinel[16307]: 16307:X 01 Jan 2020 08:08:37.739 # Configuration loaded
Jan 01 08:08:37 redis1 systemd[1]: redis-sentinel.service: Can't open PID file /var/run/sentinel/redis-sentinel.pid (yet?) after start: No such file or directory
and my sentinel.conf file:
port 26379
daemonize yes
sentinel myid 851994c7364e2138e03ee1cd346fbdc4f1404e4c
sentinel deny-scripts-reconfig yes
sentinel monitor mymaster 172.28.128.11 6379 2
sentinel down-after-milliseconds mymaster 5000
# Generated by CONFIG REWRITE
dir "/"
protected-mode no
sentinel failover-timeout mymaster 60000
sentinel config-epoch mymaster 0
sentinel leader-epoch mymaster 0
sentinel current-epoch 0
If you are trying to run your Redis servers on a Debian-based distribution, add the following pidfile directives to your Redis configurations:
pidfile /var/run/redis/redis-sentinel.pid to /etc/redis/sentinel.conf
pidfile /var/run/redis/redis-server.pid to /etc/redis/redis.conf
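In config terms, that means the two files end up containing (paths exactly as above):

# /etc/redis/sentinel.conf
pidfile /var/run/redis/redis-sentinel.pid

# /etc/redis/redis.conf
pidfile /var/run/redis/redis-server.pid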
What's the output in the Sentinel log file?
I had a similar issue where Sentinel received a lot of SIGTERMs.
In that case you need to make sure that, if you use the daemonize yes setting, the systemd unit file uses Type=forking.
Also make sure that the location of the PID file specified in the Sentinel config matches the location specified in the systemd unit file.
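A minimal sketch of how the two sides have to line up when daemonize yes is used (the PID file path below is taken from the journal output in the question; adjust it to your packaging):

# /etc/redis/sentinel.conf (excerpt)
daemonize yes
pidfile /var/run/sentinel/redis-sentinel.pid

# redis-sentinel.service (excerpt)
[Service]
Type=forking
PIDFile=/var/run/sentinel/redis-sentinel.pid    # must match the pidfile directive above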
If you see the error below in the journalctl or systemctl logs:
Jun 26 10:13:02 x systemd[1]: redis-server.service: Failed with result 'exit-code'.
Jun 26 10:13:02 x systemd[1]: redis-server.service: Scheduled restart job, restart counter is at 5.
Jun 26 10:13:02 x systemd[1]: Stopped Advanced key-value store.
Jun 26 10:13:02 x systemd[1]: redis-server.service: Start request repeated too quickly.
Jun 26 10:13:02 x systemd[1]: redis-server.service: Failed with result 'exit-code'.
Jun 26 10:13:02 x systemd[1]: Failed to start Advanced key-value store.
then check /var/log/redis/redis-server.log for more information; in most cases the issue is mentioned there.
For example, if a dump.rdb file is placed in /var/lib/redis, the issue might be with the database count or the Redis version.
In another scenario, disabled IPv6 might be the issue.
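To illustrate the IPv6 case (an illustration, not the asker's actual file): if the bind line in /etc/redis/redis.conf includes an IPv6 address such as ::1 while IPv6 is disabled on the host, Redis cannot create that listening socket and exits at startup, so removing the IPv6 address lets it start:

# /etc/redis/redis.conf (excerpt)
# bind 127.0.0.1 ::1    # fails at startup when IPv6 is disabled on the host
bind 127.0.0.1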

Kubernetes Redis Cluster PubSub Channels not getting synched on replica

I have set up a Redis cluster on Kubernetes; the cluster state is OK and the replica is connected to the master. As per the logs, the full synchronization has also completed. The logs are as follows:
9:M 22 Oct 12:24:18.209 * Slave 192.168.1.41:6379 asks for synchronization
9:M 22 Oct 12:24:18.209 * Partial resynchronization not accepted: Replication ID mismatch (Slave asked for '794b9c74abe40ac90c752f32a102078e063ff636', my replication IDs are '0f499740a46665d12fab921838297273279ad136' and '0000000000000000000000000000000000000000')
9:M 22 Oct 12:24:18.209 * Starting BGSAVE for SYNC with target: disk
9:M 22 Oct 12:24:18.211 * Background saving started by pid 231
231:C 22 Oct 12:24:18.215 * DB saved on disk
231:C 22 Oct 12:24:18.216 * RDB: 4 MB of memory used by copy-on-write
9:M 22 Oct 12:24:18.224 * Background saving terminated with success
9:M 22 Oct 12:24:18.224 * Synchronization with slave 192.168.1.41:6379 succeeded
Still, when I check the list of PubSub channels on the replica, it does not show the channels, and this breaks the PubSub flow.
Any help/advice is appreciated.
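For reference, the per-node channel list mentioned above can be compared with something like this (host names are placeholders):

redis-cli -h <master-host> -p 6379 PUBSUB CHANNELS '*'     # channels with subscribers on the master
redis-cli -h <replica-host> -p 6379 PUBSUB CHANNELS '*'    # same check on the replica

As far as I know, PUBSUB CHANNELS only lists clients subscribed on that particular node; subscriptions themselves are not replicated, only published messages are propagated, so a difference between the two lists is not by itself a replication failure.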

Redis Master Slave Switch after Aof rewrite

This Redis cluster has 240 nodes (120 masters and 120 slaves) and has worked well for a long time. But now a master-slave switch happens roughly every few hours.
I got some logs from the Redis servers.
5c541d3a765e087af7775ba308f51ffb2aa54151
10.12.28.165:6502
13306:M 08 Mar 18:55:02.597 * Background append only file rewriting started by pid 15396
13306:M 08 Mar 18:55:41.636 # Cluster state changed: fail
13306:M 08 Mar 18:55:45.321 # Connection with slave client id #112948 lost.
13306:M 08 Mar 18:55:46.243 # Configuration change detected. Reconfiguring myself as a replica of afb6e012db58bd26a7c96182b04f0a2ba6a45768
13306:S 08 Mar 18:55:47.134 * AOF rewrite child asks to stop sending diffs.
15396:C 08 Mar 18:55:47.134 * Parent agreed to stop sending diffs. Finalizing AOF...
15396:C 08 Mar 18:55:47.134 * Concatenating 0.02 MB of AOF diff received from parent.
15396:C 08 Mar 18:55:47.135 * SYNC append only file rewrite performed
15396:C 08 Mar 18:55:47.186 * AOF rewrite: 4067 MB of memory used by copy-on-write
13306:S 08 Mar 18:55:47.209 # Cluster state changed: ok
5ac747878f881349aa6a62b179176ddf603e034c
10.12.30.107:6500
22825:M 08 Mar 18:55:30.534 * FAIL message received from da493af5bb3d15fc563961de09567a47787881be about 5c541d3a765e087af7775ba308f51ffb2aa54151
22825:M 08 Mar 18:55:31.440 # Failover auth granted to afb6e012db58bd26a7c96182b04f0a2ba6a45768 for epoch 323
22825:M 08 Mar 18:55:41.587 * Background append only file rewriting started by pid 23628
22825:M 08 Mar 18:56:24.200 # Cluster state changed: fail
22825:M 08 Mar 18:56:30.002 # Connection with slave client id #382416 lost.
22825:M 08 Mar 18:56:30.830 * FAIL message received from 0decbe940c6f4d4330fae5a9c129f1ad4932405d about 5ac747878f881349aa6a62b179176ddf603e034c
22825:M 08 Mar 18:56:30.840 # Failover auth denied to d46f95da06cfcd8ea5eaa15efabff5bd5e99df55: its master is up
22825:M 08 Mar 18:56:30.843 # Configuration change detected. Reconfiguring myself as a replica of d46f95da06cfcd8ea5eaa15efabff5bd5e99df55
22825:S 08 Mar 18:56:31.030 * Clear FAIL state for node 5ac747878f881349aa6a62b179176ddf603e034c: slave is reachable again.
22825:S 08 Mar 18:56:31.030 * Clear FAIL state for node 5c541d3a765e087af7775ba308f51ffb2aa54151: slave is reachable again.
22825:S 08 Mar 18:56:31.294 # Cluster state changed: ok
22825:S 08 Mar 18:56:31.595 * Connecting to MASTER 10.12.30.104:6404
22825:S 08 Mar 18:56:31.671 * MASTER SLAVE sync started
22825:S 08 Mar 18:56:31.671 * Non blocking connect for SYNC fired the event.
22825:S 08 Mar 18:56:31.672 * Master replied to PING, replication can continue...
22825:S 08 Mar 18:56:31.673 * Partial resynchronization not possible (no cached master)
22825:S 08 Mar 18:56:31.691 * AOF rewrite child asks to stop sending diffs.
It appears that the master-slave switch happened after AOF rewriting.
Here is the config of this cluster:
daemonize no
tcp-backlog 511
timeout 0
tcp-keepalive 60
loglevel notice
databases 16
dir "/var/cachecloud/data"
stop-writes-on-bgsave-error no
repl-timeout 60
repl-ping-slave-period 10
repl-disable-tcp-nodelay no
repl-backlog-size 10000000
repl-backlog-ttl 7200
slave-serve-stale-data yes
slave-read-only yes
slave-priority 100
lua-time-limit 5000
slowlog-log-slower-than 10000
slowlog-max-len 128
hash-max-ziplist-entries 512
hash-max-ziplist-value 64
list-max-ziplist-entries 512
list-max-ziplist-value 64
set-max-intset-entries 512
zset-max-ziplist-entries 128
zset-max-ziplist-value 64
activerehashing yes
client-output-buffer-limit normal 0 0 0
client-output-buffer-limit slave 512mb 128mb 60
client-output-buffer-limit pubsub 32mb 8mb 60
hz 10
port 6401
maxmemory 13000mb
maxmemory-policy volatile-lru
appendonly yes
appendfsync no
appendfilename "appendonly-6401.aof"
dbfilename "dump-6401.rdb"
aof-rewrite-incremental-fsync yes
no-appendfsync-on-rewrite yes
auto-aof-rewrite-min-size 62500kb
auto-aof-rewrite-percentage 86
rdbcompression yes
rdbchecksum yes
repl-diskless-sync no
repl-diskless-sync-delay 5
maxclients 10000
hll-sparse-max-bytes 3000
min-slaves-to-write 0
min-slaves-max-lag 10
aof-load-truncated yes
notify-keyspace-events ""
bind 10.12.26.226
protected-mode no
cluster-enabled yes
cluster-node-timeout 15000
cluster-slave-validity-factor 10
cluster-migration-barrier 1
cluster-config-file "nodes-6401.conf"
cluster-require-full-coverage no
rename-command FLUSHDB ""
rename-command FLUSHALL ""
rename-command KEYS ""
In my opinion, the AOF rewrite should not affect the Redis main thread, BUT it seems to make this node stop responding to other nodes' pings.
Check the THP (Transparent Huge Pages) Linux kernel parameter,
because the AOF diff is only 0.02 MB while copy-on-write used several GB of memory (4067 MB in the log above).
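A huge copy-on-write figure during an AOF rewrite is the classic THP symptom; the setting can be checked and disabled with the same commands the Redis startup warning suggests (run as root, and restart the Redis nodes afterwards):

cat /sys/kernel/mm/transparent_hugepage/enabled            # e.g. [always] madvise never
echo never > /sys/kernel/mm/transparent_hugepage/enabled   # add to /etc/rc.local to persist across reboots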

Redis Sentinel manual failover command times out

I have one Redis master, one slave, and one Sentinel monitoring them. Everything seems to be working properly, including failover when the master is killed. But when I issue the SENTINEL FAILOVER command, Sentinel gets stuck in the state +failover-state-wait-promotion for a few minutes. It seems like the slave is not getting the promotion command. This doesn't make sense, because there doesn't seem to be any trouble with network communication from the Sentinel host to either of the Redis hosts. I'm running all 3 of the procs in Docker containers, but I'm not sure how that could cause the problem. I can run redis-cli from the Sentinel host (i.e. from inside the Docker container) and can remotely execute the slaveof command. I can also monitor both Redis instances and see Sentinel pings and info requests. I looked at the logs for the master and slave and see nothing abnormal. Looking at THIS post, there does not seem to be any reason why Sentinel would consider the Redis instances invalid.
I'm fairly experienced with Sentinel, but rather new to Docker. Not sure how to proceed in determining what the problem is. Any ideas?
Sentinel Log
[8] 01 Jul 01:36:57.317 # Sentinel runid is c337f6f0dfa1d41357338591cd0181c07cb026d0
[8] 01 Jul 01:38:13.135 # +monitor master redis-holt-overflow 10.19.8.2 6380 quorum 1
[8] 01 Jul 01:38:13.135 # +set master redis-holt-overflow 10.19.8.2 6380 down-after-milliseconds 3100
[8] 01 Jul 01:38:13.199 * +slave slave 10.19.8.3:6381 10.19.8.3 6381 # redis-holt-overflow 10.19.8.2 6380
[8] 01 Jul 01:38:42.288 # Executing user requested FAILOVER of 'redis-holt-overflow'
[8] 01 Jul 01:38:42.288 # +new-epoch 1
[8] 01 Jul 01:38:42.288 # +try-failover master redis-holt-overflow 10.19.8.2 6380
[8] 01 Jul 01:38:42.352 # +vote-for-leader c337f6f0dfa1d41357338591cd0181c07cb026d0 1
[8] 01 Jul 01:38:42.352 # +elected-leader master redis-holt-overflow 10.19.8.2 6380
[8] 01 Jul 01:38:42.352 # +failover-state-select-slave master redis-holt-overflow 10.19.8.2 6380
[8] 01 Jul 01:38:42.404 # +selected-slave slave 10.19.8.3:6381 10.19.8.3 6381 # redis-holt-overflow 10.19.8.2 6380
[8] 01 Jul 01:38:42.404 * +failover-state-send-slaveof-noone slave 10.19.8.3:6381 10.19.8.3 6381 # redis-holt-overflow 10.19.8.2 6380
[8] 01 Jul 01:38:42.488 * +failover-state-wait-promotion slave 10.19.8.3:6381 10.19.8.3 6381 # redis-holt-overflow 10.19.8.2 6380
[8] 01 Jul 01:41:42.565 # -failover-abort-slave-timeout master redis-holt-overflow 10.19.8.2 6380
Redis Master Log
[17] 01 Jul 01:13:58.251 # Server started, Redis version 2.8.21
[17] 01 Jul 01:13:58.252 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
[17] 01 Jul 01:13:58.252 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
[17] 01 Jul 01:13:58.252 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
[17] 01 Jul 01:13:58.252 * DB loaded from disk: 0.000 seconds
[17] 01 Jul 01:13:58.252 * The server is now ready to accept connections on port 6380
[17] 01 Jul 01:34:45.796 * Slave 10.196.88.30:6381 asks for synchronization
[17] 01 Jul 01:34:45.796 * Full resync requested by slave 10.196.88.30:6381
[17] 01 Jul 01:34:45.796 * Starting BGSAVE for SYNC with target: disk
[17] 01 Jul 01:34:45.797 * Background saving started by pid 20
[20] 01 Jul 01:34:45.798 * DB saved on disk
[20] 01 Jul 01:34:45.799 * RDB: 0 MB of memory used by copy-on-write
[17] 01 Jul 01:34:45.808 * Background saving terminated with success
[17] 01 Jul 01:34:45.808 * Synchronization with slave 10.196.88.30:6381 succeeded
[17] 01 Jul 01:38:42.343 # Connection with slave 10.196.88.30:6381 lost.
[17] 01 Jul 01:38:43.275 * Slave 10.196.88.30:6381 asks for synchronization
[17] 01 Jul 01:38:43.275 * Full resync requested by slave 10.196.88.30:6381
[17] 01 Jul 01:38:43.275 * Starting BGSAVE for SYNC with target: disk
[17] 01 Jul 01:38:43.275 * Background saving started by pid 21
[21] 01 Jul 01:38:43.277 * DB saved on disk
[21] 01 Jul 01:38:43.277 * RDB: 0 MB of memory used by copy-on-write
[17] 01 Jul 01:38:43.368 * Background saving terminated with success
[17] 01 Jul 01:38:43.368 * Synchronization with slave 10.196.88.30:6381 succeeded
Redis Slave Log
[14] 01 Jul 01:15:51.435 # Server started, Redis version 2.8.21
[14] 01 Jul 01:15:51.435 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
[14] 01 Jul 01:15:51.435 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
[14] 01 Jul 01:15:51.435 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
[14] 01 Jul 01:15:51.435 * DB loaded from disk: 0.000 seconds
[14] 01 Jul 01:15:51.435 * The server is now ready to accept connections on port 6381
[14] 01 Jul 01:34:45.088 * SLAVE OF 10.196.88.29:6380 enabled (user request)
[14] 01 Jul 01:34:45.947 * Connecting to MASTER 10.196.88.29:6380
[14] 01 Jul 01:34:45.947 * MASTER <-> SLAVE sync started
[14] 01 Jul 01:34:45.948 * Non blocking connect for SYNC fired the event.
[14] 01 Jul 01:34:45.948 * Master replied to PING, replication can continue...
[14] 01 Jul 01:34:45.948 * Partial resynchronization not possible (no cached master)
[14] 01 Jul 01:34:45.948 * Full resync from master: b912b647401917d52742c0eac3ae2f795f59f48f:1
[14] 01 Jul 01:34:45.960 * MASTER <-> SLAVE sync: receiving 18 bytes from master
[14] 01 Jul 01:34:45.960 * MASTER <-> SLAVE sync: Flushing old data
[14] 01 Jul 01:34:45.960 * MASTER <-> SLAVE sync: Loading DB in memory
[14] 01 Jul 01:34:45.960 * MASTER <-> SLAVE sync: Finished with success
[14] 01 Jul 01:38:42.495 # Connection with master lost.
[14] 01 Jul 01:38:42.495 * Caching the disconnected master state.
[14] 01 Jul 01:38:42.495 * Discarding previously cached master state.
[14] 01 Jul 01:38:42.495 * MASTER MODE enabled (user request)
[14] 01 Jul 01:38:42.495 # CONFIG REWRITE executed with success.
[14] 01 Jul 01:38:42.506 * SLAVE OF 10.196.88.29:6380 enabled (user request)
[14] 01 Jul 01:38:43.425 * Connecting to MASTER 10.196.88.29:6380
[14] 01 Jul 01:38:43.426 * MASTER <-> SLAVE sync started
[14] 01 Jul 01:38:43.426 * Non blocking connect for SYNC fired the event.
[14] 01 Jul 01:38:43.427 * Master replied to PING, replication can continue...
[14] 01 Jul 01:38:43.427 * Partial resynchronization not possible (no cached master)
[14] 01 Jul 01:38:43.427 * Full resync from master: b912b647401917d52742c0eac3ae2f795f59f48f:10930
[14] 01 Jul 01:38:43.520 * MASTER <-> SLAVE sync: receiving 18 bytes from master
[14] 01 Jul 01:38:43.520 * MASTER <-> SLAVE sync: Flushing old data
[14] 01 Jul 01:38:43.520 * MASTER <-> SLAVE sync: Loading DB in memory
[14] 01 Jul 01:38:43.520 * MASTER <-> SLAVE sync: Finished with success
Sentinel Config
port 26379
pidfile "/var/run/redis-sentinel.pid"
logfile ""
daemonize no
# Generated by CONFIG REWRITE
dir "/data"
sentinel monitor redis-holt-overflow 10.19.8.2 6380 1
sentinel down-after-milliseconds redis-holt-overflow 3100
sentinel config-epoch redis-holt-overflow 0
sentinel leader-epoch redis-holt-overflow 1
sentinel known-slave redis-holt-overflow 10.19.8.3 6381
sentinel current-epoch 1
Redis & Sentinel Info:
redis_version:2.8.21
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:551c16ab9d912477
redis_mode:standalone
os:Linux 3.10.0-123.8.1.el7.x86_64 x86_64
arch_bits:64
multiplexing_api:epoll
gcc_version:4.7.2
process_id:13
run_id:7e1a1b6c844a969424d16f3efa116707ea7a60bf
tcp_port:6380
uptime_in_seconds:1312
uptime_in_days:0
hz:10
lru_clock:9642428
config_file:/usr/local/etc/redis/redis.conf
It appears you are running into the "docker network" issue. If you look in your logs, they show different IPs. This is because, during discovery, the slave's address is detected from the IP the connection comes from. Are these on different Docker hosts?
From the documentation:
Since Sentinels auto detect slaves using masters INFO output information, the detected slaves will not be reachable, and Sentinel will never be able to failover the master, since there are no good slaves from the point of view of the system, so there is currently no way to monitor with Sentinel a set of master and slave instances deployed with Docker, unless you instruct Docker to map the port 1:1.
For Sentinel, a Docker image can be found at https://registry.hub.docker.com/u/joshula/redis-sentinel/ which shows the use of announce-ip and bind to set it up.
For more details, see http://redis.io/topics/sentinel, specifically the Docker section, where it goes into detail on how to set things up in Docker to handle this situation.
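A rough sketch of the announce directives mentioned there (values are illustrative; the point is that they must be an address the other hosts can reach and a port that Docker publishes 1:1):

# sentinel.conf additions when Sentinel runs inside Docker (illustrative)
bind 0.0.0.0
sentinel announce-ip <host-reachable-ip>    # the host's address, not the container's internal IP
sentinel announce-port 26379                # must match the port published by Docker (map it 1:1)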
Dog gone it, yeah, it was one of the scripts. It was essentially triggering in the interim period when both Redis instances are masters and preemptively reverting the promoted slave back to slave status. It's been a long week.