Redis node won't go into MASTER mode

I have a simple Redis deployment on Docker Swarm: a MASTER, a SLAVE, and 2 SENTINEL services. I run the stack and all services come up. redis-master starts as MASTER, and I kill it to test SENTINEL and SLAVE recovery. redis-master then recovers and becomes a new SLAVE. If I exec into it and run SLAVEOF NO ONE, the following happens:
1:M 31 Oct 2019 06:28:32.741 * MASTER MODE enabled (user request from 'id=3907 addr=127.0.0.1:39302 fd=36 name= age=0 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=34 qbuf-free=32734 obl=0 oll=0 omem=0 events=r cmd=slaveof')
1:S 31 Oct 2019 06:28:43.060 * Before turning into a replica, using my master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer.
1:S 31 Oct 2019 06:28:43.060 * REPLICAOF 10.0.21.49:6379 enabled (user request from 'id=1085 addr=10.0.21.54:34360 fd=16 name=sentinel-592f3b97-cmd age=945 idle=0 flags=x db=0 sub=0 psub=0 multi=3 qbuf=150 qbuf-free=32618 obl=36 oll=0 omem=0 events=r cmd=exec')
1:S 31 Oct 2019 06:28:43.701 * Connecting to MASTER 10.0.21.49:6379
1:S 31 Oct 2019 06:28:43.702 * MASTER <-> REPLICA sync started
1:S 31 Oct 2019 06:28:43.702 * Non blocking connect for SYNC fired the event.
1:S 31 Oct 2019 06:28:43.702 * Master replied to PING, replication can continue...
1:S 31 Oct 2019 06:28:43.703 * Trying a partial resynchronization (request a056665afb95a1e3a4227ae7fcb1c9b2e2f3b222:244418).
1:S 31 Oct 2019 06:28:43.703 * Full resync from master: adde2c9daee4fa1e62d3494d74d08dfb7110c798:241829
1:S 31 Oct 2019 06:28:43.703 * Discarding previously cached master state.
1:S 31 Oct 2019 06:28:43.715 * MASTER <-> REPLICA sync: receiving 2229 bytes from master
1:S 31 Oct 2019 06:28:43.715 * MASTER <-> REPLICA sync: Flushing old data
1:S 31 Oct 2019 06:28:43.715 * MASTER <-> REPLICA sync: Loading DB in memory
1:S 31 Oct 2019 06:28:43.715 * MASTER <-> REPLICA sync: Finished with success
MASTER MODE kicks in, but it is then immediately overridden by a REPLICAOF issued by Sentinel. How can I force redis-master to always be MASTER?

Yes, I think this makes sense.
Sentinel always remembers which nodes joined the master-slave group.
When you manually promote the current slave to master, Sentinel doesn't know whether you did this on purpose or a network partition happened, so the Sentinels convert the node back to a slave to avoid having two masters in one group (i.e. split-brain).
To remove a node from the group:
Check the docs. In short, you need to send SENTINEL RESET <master name> to all Sentinels so they forget the lost node. Then start the lost node as a master; it won't rejoin the Sentinel group.
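As a minimal sketch of that step (the Sentinel hostnames, port 26379, and the master name mymaster are placeholders, not values from this setup):
redis-cli -h sentinel-1 -p 26379 SENTINEL RESET mymaster
redis-cli -h sentinel-2 -p 26379 SENTINEL RESET mymaster
After the reset, start the detached node without any replicaof/slaveof directive so it comes up as a standalone master.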
To make the previous (failed) master node stay as the master:
After the lost master comes back as a slave, you can run SENTINEL FAILOVER <master name>; the Sentinels will perform a failover and switch the master and slave. But I don't think you can appoint a specific master when there are more than 3 nodes.
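A sketch of triggering that manual failover, again with a placeholder Sentinel address and master name:
redis-cli -h sentinel-1 -p 26379 SENTINEL FAILOVER mymaster
Note that the command only takes the master name; Sentinel itself chooses which slave to promote.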

Related

Why does redis forcibly demote master to slave?

I run the container using the docker redis:latest image, and after about 30 minutes the master changes to a slave and is no longer writable.
Also, the slave outputs an error once per second saying it cannot reach the master.
1:M 08 Jul 2022 03:10:55.899 * DB saved on disk
1:M 08 Jul 2022 03:15:56.087 * 100 changes in 300 seconds. Saving...
1:M 08 Jul 2022 03:15:56.089 * Background saving started by pid 61
61:C 08 Jul 2022 03:15:56.091 * DB saved on disk
61:C 08 Jul 2022 03:15:56.092 * Fork CoW for RDB: current 0 MB, peak 0 MB, average 0 MB
1:M 08 Jul 2022 03:15:56.189 * Background saving terminated with success
1:S 08 Jul 2022 03:20:12.258 * Before turning into a replica, using my own master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer.
1:S 08 Jul 2022 03:20:12.258 * Connecting to MASTER 178.20.40.200:8886
1:S 08 Jul 2022 03:20:12.258 * MASTER <-> REPLICA sync started
1:S 08 Jul 2022 03:20:12.259 * REPLICAOF 178.20.40.200:8886 enabled (user request from 'id=39 addr=95.182.123.66:36904 laddr=172.31.9.234:6379 fd=11 name= age=1 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=47 qbuf-free=20427 argv-mem=24 multi-mem=0 rbs=1024 rbp=0 obl=0 oll=0 omem=0 tot-mem=22320 events=r cmd=slaveof user=default redir=-1 resp=2')
1:S 08 Jul 2022 03:20:12.524 * Non blocking connect for SYNC fired the event.
1:S 08 Jul 2022 03:20:12.791 * Master replied to PING, replication can continue...
1:S 08 Jul 2022 03:20:13.335 * Trying a partial resynchronization (request 6743ff015583c86f3ac7a4305026c42991a1ca18:1).
1:S 08 Jul 2022 03:20:13.603 * Full resync from master: ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ:1
1:S 08 Jul 2022 03:20:13.603 * MASTER <-> REPLICA sync: receiving 54976 bytes from master to disk
1:S 08 Jul 2022 03:20:14.138 * Discarding previously cached master state.
1:S 08 Jul 2022 03:20:14.138 * MASTER <-> REPLICA sync: Flushing old data
1:S 08 Jul 2022 03:20:14.139 * MASTER <-> REPLICA sync: Loading DB in memory
1:S 08 Jul 2022 03:20:14.140 # Wrong signature trying to load DB from file
1:S 08 Jul 2022 03:20:14.140 # Failed trying to load the MASTER synchronization DB from disk: Invalid argument
1:S 08 Jul 2022 03:20:14.140 * Reconnecting to MASTER 178.20.40.200:8886 after failure
1:S 08 Jul 2022 03:20:14.140 * MASTER <-> REPLICA sync started
...
1:S 08 Jul 2022 05:09:50.010 * MASTER <-> REPLICA sync started
1:S 08 Jul 2022 05:09:50.298 * Non blocking connect for SYNC fired the event.
1:S 08 Jul 2022 05:09:50.587 # Failed to read response from the server: Connection reset by peer
1:S 08 Jul 2022 05:09:50.587 # Master did not respond to command during SYNC handshake
1:S 08 Jul 2022 05:09:51.013 * Connecting to MASTER 178.20.40.200:8886
1:S 08 Jul 2022 05:09:51.014 * MASTER <-> REPLICA sync started
1:S 08 Jul 2022 05:09:51.294 * Non blocking connect for SYNC fired the event.
1:S 08 Jul 2022 05:09:51.581 # Failed to read response from the server: Connection reset by peer
1:S 08 Jul 2022 05:09:51.581 # Master did not respond to command during SYNC handshake
1:S 08 Jul 2022 05:09:52.017 * Connecting to MASTER 178.20.40.200:8886
1:S 08 Jul 2022 05:09:52.017 * MASTER <-> REPLICA sync started
1:S 08 Jul 2022 05:09:52.297 * Non blocking connect for SYNC fired the event.
1:S 08 Jul 2022 05:09:52.578 # Failed to read response from the server: Connection reset by peer
1:S 08 Jul 2022 05:09:52.578 # Master did not respond to command during SYNC handshake
1:S 08 Jul 2022 05:09:53.021 * Connecting to MASTER 178.20.40.200:8886
1:S 08 Jul 2022 05:09:53.021 * MASTER <-> REPLICA sync started
1:S 08 Jul 2022 05:09:53.308 * Non blocking connect for SYNC fired the event.
1:S 08 Jul 2022 05:09:53.594 # Failed to read response from the server: Connection reset by peer
1:S 08 Jul 2022 05:09:53.594 # Master did not respond to command during SYNC handshake
1:S 08 Jul 2022 05:09:54.025 * Connecting to MASTER 178.20.40.200:8886
1:S 08 Jul 2022 05:09:54.025 * MASTER <-> REPLICA sync started
1:S 08 Jul 2022 05:09:54.316 * Non blocking connect for SYNC fired the event.
1:S 08 Jul 2022 05:09:54.608 # Failed to read response from the server: Connection reset by peer
1:S 08 Jul 2022 05:09:54.608 # Master did not respond to command during SYNC handshake
1:S 08 Jul 2022 05:09:55.028 * Connecting to MASTER 178.20.40.200:8886
1:S 08 Jul 2022 05:09:55.028 * MASTER <-> REPLICA sync started
1:S 08 Jul 2022 05:09:55.309 * Non blocking connect for SYNC fired the event.
1:S 08 Jul 2022 05:09:55.588 # Failed to read response from the server: Connection reset by peer
1:S 08 Jul 2022 05:09:55.588 # Master did not respond to command during SYNC handshake
1:S 08 Jul 2022 05:09:56.031 * Connecting to MASTER 178.20.40.200:8886
1:S 08 Jul 2022 05:09:56.031 * MASTER <-> REPLICA sync started
1:S 08 Jul 2022 05:09:56.311 * Non blocking connect for SYNC fired the event.
1:S 08 Jul 2022 05:09:56.592 # Failed to read response from the server: Connection reset by peer
1:S 08 Jul 2022 05:09:56.592 # Master did not respond to command during SYNC handshake
1:S 08 Jul 2022 05:09:57.035 * Connecting to MASTER 178.20.40.200:8886
1:S 08 Jul 2022 05:09:57.035 * MASTER <-> REPLICA sync started
1:S 08 Jul 2022 05:09:57.321 * Non blocking connect for SYNC fired the event.
1:S 08 Jul 2022 05:09:57.610 * Master replied to PING, replication can continue...
...
SLAVEOF NO ONE
config set slave-read-only no
If I force the slave to be writable with the above commands and try to write, all the data is flushed after about 5 seconds.
I don't want the master to be turned into a slave.
I am getting this on a clean EC2 Amazon Linux instance.
I don't know what's causing it; there is plenty of free memory.

redis unable to write to AOF due to disk space (AOF rewrite enabled)

redis version: 4.0.14
OS: FreeBSD 12.0-RELEASE-p10 amd64
The Redis documentation on AOF says that after version 2.6 it should automatically trigger BGREWRITEAOF:
Redis is able to automatically rewrite the AOF in background when it gets too big. The rewrite is completely safe as while Redis continues appending to the old file, a completely new one is produced with the minimal set of operations needed to create the current data set, and once this second file is ready Redis switches the two and starts appending to the new one.
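For reference, that automatic trigger is controlled by two redis.conf directives; the values below are the shipped defaults, not necessarily what this server is using:
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb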
It doesn't seem to be doing this. logs:
774:S 04 Mar 17:21:19.992 # MASTER timeout: no data nor PING received...
774:S 04 Mar 17:21:19.992 # Connection with master lost.
774:S 04 Mar 17:21:19.992 * Caching the disconnected master state.
774:S 04 Mar 17:21:19.992 * Connecting to MASTER 192.168.0.100:6412
774:S 04 Mar 17:21:19.992 * MASTER <-> SLAVE sync started
774:S 04 Mar 17:21:19.992 * Non blocking connect for SYNC fired the event.
774:S 04 Mar 17:21:21.596 # Error reply to PING from master: '-MISCONF Errors writing to the AOF file: No space left on device'
774:S 04 Mar 17:21:22.021 * Connecting to MASTER 192.168.0.100:6412
774:S 04 Mar 17:21:22.021 * MASTER <-> SLAVE sync started
774:S 04 Mar 17:21:22.021 * Non blocking connect for SYNC fired the event.
774:S 04 Mar 17:21:23.238 # Error reply to PING from master: '-MISCONF Errors writing to the AOF file: No space left on device'
774:S 04 Mar 17:21:24.050 * Connecting to MASTER 192.168.0.100:6412
774:S 04 Mar 17:21:24.050 * MASTER <-> SLAVE sync started
774:S 04 Mar 17:21:24.050 * Non blocking connect for SYNC fired the event.
774:S 04 Mar 17:21:25.682 # Error reply to PING from master: '-MISCONF Errors writing to the AOF file: No space left on device'
$ info
# Server
redis_version:4.0.14
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:b8839b63a68a4b99
redis_mode:cluster
os:FreeBSD 12.0-RELEASE-p10 amd64
arch_bits:64
multiplexing_api:kqueue
atomicvar_api:atomic-builtin
gcc_version:4.2.1
process_id:766
run_id:bf9d05b240cd776e4547729d062dd6f5a5e0f60d
tcp_port:6412
uptime_in_seconds:40930179
uptime_in_days:473
hz:10
lru_clock:4266405
executable:/usr/local/bin/redis-server
config_file:/usr/local/etc/redis_cluster/6412/redis.conf
# Clients
connected_clients:115
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:0
# Memory
used_memory:11588787188
used_memory_human:10.79G
used_memory_rss:11588984900
used_memory_rss_human:10.79G
used_memory_peak:16230019258
used_memory_peak_human:15.12G
used_memory_peak_perc:71.40%
used_memory_overhead:301057836
used_memory_startup:1928680
used_memory_dataset:11287729352
used_memory_dataset_perc:97.42%
total_system_memory:0
total_system_memory_human:0B
used_memory_lua:37888
used_memory_lua_human:37.00K
maxmemory:27106127360
maxmemory_human:25.24G
maxmemory_policy:allkeys-lru
mem_fragmentation_ratio:1.00
mem_allocator:libc
active_defrag_running:0
lazyfree_pending_objects:0
# Persistence
loading:0
rdb_changes_since_last_save:11838701
rdb_bgsave_in_progress:0
rdb_last_save_time:1614662573
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:87
rdb_current_bgsave_time_sec:-1
rdb_last_cow_size:0
aof_enabled:1
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:196
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:err
aof_last_cow_size:0
aof_current_size:6817578976
aof_base_size:5413636704
aof_pending_rewrite:0
aof_buffer_length:630994
aof_rewrite_buffer_length:0
aof_pending_bio_fsync:0
aof_delayed_fsync:56
# Stats
total_connections_received:456772
total_commands_processed:4953546081
instantaneous_ops_per_sec:205
total_net_input_bytes:685638114456
total_net_output_bytes:260383113923
instantaneous_input_kbps:51.50
instantaneous_output_kbps:31.54
rejected_connections:0
sync_full:62
sync_partial_ok:0
sync_partial_err:23
expired_keys:176647087
expired_stale_perc:0.00
expired_time_cap_reached_count:0
evicted_keys:0
keyspace_hits:863832269
keyspace_misses:891573350
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:1641619
migrate_cached_sockets:0
slave_expires_tracked_keys:0
active_defrag_hits:0
active_defrag_misses:0
active_defrag_key_hits:0
active_defrag_key_misses:0
# Replication
role:master
connected_slaves:0
master_replid:90da82c2d4a4ff31b64721286b17b8716170be6c
master_replid2:a40e37f9dca7fab0b151a632564d4e5bd3163321
master_repl_offset:310200952366
second_repl_offset:308626210072
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:310199903791
repl_backlog_histlen:1048576
# CPU
used_cpu_sys:334704.88
used_cpu_user:192502.30
used_cpu_sys_children:1007.51
used_cpu_user_children:12385.73
# Cluster
cluster_enabled:1
# Keyspace
db0:keys=5665785,expires=0,avg_ttl=0
(0.50s)
When I try to run BGREWRITEAOF I get the following message:
/: write failed, filesystem is full
Please let me know if I need to put any more information here.
Thanks for your help.

Kubernetes Redis Cluster PubSub channels not getting synced on replica

I have set up a Redis cluster on Kubernetes. The cluster state is OK and the replica is connected to the master. As per the logs, the full synchronization has also completed. The logs are as follows:
9:M 22 Oct 12:24:18.209 * Slave 192.168.1.41:6379 asks for synchronization
9:M 22 Oct 12:24:18.209 * Partial resynchronization not accepted: Replication ID mismatch (Slave asked for '794b9c74abe40ac90c752f32a102078e063ff636', my replication IDs are '0f499740a46665d12fab921838297273279ad136' and '0000000000000000000000000000000000000000')
9:M 22 Oct 12:24:18.209 * Starting BGSAVE for SYNC with target: disk
9:M 22 Oct 12:24:18.211 * Background saving started by pid 231
231:C 22 Oct 12:24:18.215 * DB saved on disk
231:C 22 Oct 12:24:18.216 * RDB: 4 MB of memory used by copy-on-write
9:M 22 Oct 12:24:18.224 * Background saving terminated with success
9:M 22 Oct 12:24:18.224 * Synchronization with slave 192.168.1.41:6379 succeeded
Still, when I check the list of PubSub channels on the replica, it does not show the channels, and this breaks the PubSub flow.
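The check itself is just the standard introspection command run against the replica with redis-cli (the pattern is a placeholder):
PUBSUB CHANNELS *
On the replica this returns an empty list.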
Any help/advice is appreciated.

Redis master-slave setup failing

I have started a server on port 6001 as a master with AOF persistence turned off, and a slave on port 6002 with 6001 as its master. However, on startup of the slave I get the error below in an infinite loop, and I am also not able to find any error logs for it.
Slave infinite loop logs:
[5556] 20 Aug 21:34:28.499 # Server started, Redis version 3.2.100
[5556] 20 Aug 21:34:28.500 * DB loaded from disk: 0.001 seconds
[5556] 20 Aug 21:34:28.500 * The server is now ready to accept connections on port 6002
[5556] 20 Aug 21:34:28.501 * Connecting to MASTER localhost:6001
[5556] 20 Aug 21:34:28.513 * MASTER <-> SLAVE sync started
[5556] 20 Aug 21:34:29.513 * Non blocking connect for SYNC fired the event.
[5556] 20 Aug 21:34:29.513 # Sending command to master in replication handshake: -Writing to master: Unknown error
[5556] 20 Aug 21:34:29.516 * Connecting to MASTER localhost:6001
[5556] 20 Aug 21:34:29.517 * MASTER <-> SLAVE sync started
Issue resolved: redis.conf contained 127.0.0.1 as the bind value, and in the slave's redis.conf I had SLAVEOF localhost. Replacing localhost with 127.0.0.1 resolved the issue.
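For illustration, a minimal sketch of the two config files after the fix, assuming the ports described above (only the directives in play are shown; everything else is omitted):
master redis.conf (port 6001):
port 6001
bind 127.0.0.1
appendonly no
slave redis.conf (port 6002):
port 6002
slaveof 127.0.0.1 6001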

Redis Sentinel manual failover command times out

I have one Redis master, one slave, and one Sentinel monitoring them. Everything seems to be working properly, including failover when the master is killed. But when I issue the SENTINEL FAILOVER command, Sentinel gets stuck in the state +failover-state-wait-promotion for a few minutes. It seems like the slave is not getting the promotion command. This doesn't make sense, because there doesn't seem to be any trouble with network communication from the Sentinel host to either of the Redis hosts. I'm running all 3 of the processes in Docker containers, but I'm not sure how that could cause the problem. I can run redis-cli from the Sentinel host (i.e. from inside the Docker container) and can remotely execute the slaveof command. I can also monitor both Redis instances and see Sentinel pings and info requests. I looked at logs for the master and slave and see nothing abnormal. Looking at THIS post, there does not seem to be any reason why Sentinel would consider the Redis instances invalid.
I'm fairly experienced with Sentinel, but rather new to Docker. Not sure how to proceed determining what the problem is. Any ideas?
Sentinel Log
[8] 01 Jul 01:36:57.317 # Sentinel runid is c337f6f0dfa1d41357338591cd0181c07cb026d0
[8] 01 Jul 01:38:13.135 # +monitor master redis-holt-overflow 10.19.8.2 6380 quorum 1
[8] 01 Jul 01:38:13.135 # +set master redis-holt-overflow 10.19.8.2 6380 down-after-milliseconds 3100
[8] 01 Jul 01:38:13.199 * +slave slave 10.19.8.3:6381 10.19.8.3 6381 # redis-holt-overflow 10.19.8.2 6380
[8] 01 Jul 01:38:42.288 # Executing user requested FAILOVER of 'redis-holt-overflow'
[8] 01 Jul 01:38:42.288 # +new-epoch 1
[8] 01 Jul 01:38:42.288 # +try-failover master redis-holt-overflow 10.19.8.2 6380
[8] 01 Jul 01:38:42.352 # +vote-for-leader c337f6f0dfa1d41357338591cd0181c07cb026d0 1
[8] 01 Jul 01:38:42.352 # +elected-leader master redis-holt-overflow 10.19.8.2 6380
[8] 01 Jul 01:38:42.352 # +failover-state-select-slave master redis-holt-overflow 10.19.8.2 6380
[8] 01 Jul 01:38:42.404 # +selected-slave slave 10.19.8.3:6381 10.19.8.3 6381 # redis-holt-overflow 10.19.8.2 6380
[8] 01 Jul 01:38:42.404 * +failover-state-send-slaveof-noone slave 10.19.8.3:6381 10.19.8.3 6381 # redis-holt-overflow 10.19.8.2 6380
[8] 01 Jul 01:38:42.488 * +failover-state-wait-promotion slave 10.19.8.3:6381 10.19.8.3 6381 # redis-holt-overflow 10.19.8.2 6380
[8] 01 Jul 01:41:42.565 # -failover-abort-slave-timeout master redis-holt-overflow 10.19.8.2 6380
Redis Master Log
[17] 01 Jul 01:13:58.251 # Server started, Redis version 2.8.21
[17] 01 Jul 01:13:58.252 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
[17] 01 Jul 01:13:58.252 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
[17] 01 Jul 01:13:58.252 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
[17] 01 Jul 01:13:58.252 * DB loaded from disk: 0.000 seconds
[17] 01 Jul 01:13:58.252 * The server is now ready to accept connections on port 6380
[17] 01 Jul 01:34:45.796 * Slave 10.196.88.30:6381 asks for synchronization
[17] 01 Jul 01:34:45.796 * Full resync requested by slave 10.196.88.30:6381
[17] 01 Jul 01:34:45.796 * Starting BGSAVE for SYNC with target: disk
[17] 01 Jul 01:34:45.797 * Background saving started by pid 20
[20] 01 Jul 01:34:45.798 * DB saved on disk
[20] 01 Jul 01:34:45.799 * RDB: 0 MB of memory used by copy-on-write
[17] 01 Jul 01:34:45.808 * Background saving terminated with success
[17] 01 Jul 01:34:45.808 * Synchronization with slave 10.196.88.30:6381 succeeded
[17] 01 Jul 01:38:42.343 # Connection with slave 10.196.88.30:6381 lost.
[17] 01 Jul 01:38:43.275 * Slave 10.196.88.30:6381 asks for synchronization
[17] 01 Jul 01:38:43.275 * Full resync requested by slave 10.196.88.30:6381
[17] 01 Jul 01:38:43.275 * Starting BGSAVE for SYNC with target: disk
[17] 01 Jul 01:38:43.275 * Background saving started by pid 21
[21] 01 Jul 01:38:43.277 * DB saved on disk
[21] 01 Jul 01:38:43.277 * RDB: 0 MB of memory used by copy-on-write
[17] 01 Jul 01:38:43.368 * Background saving terminated with success
[17] 01 Jul 01:38:43.368 * Synchronization with slave 10.196.88.30:6381 succeeded
Redis Slave Log
[14] 01 Jul 01:15:51.435 # Server started, Redis version 2.8.21
[14] 01 Jul 01:15:51.435 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
[14] 01 Jul 01:15:51.435 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
[14] 01 Jul 01:15:51.435 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
[14] 01 Jul 01:15:51.435 * DB loaded from disk: 0.000 seconds
[14] 01 Jul 01:15:51.435 * The server is now ready to accept connections on port 6381
[14] 01 Jul 01:34:45.088 * SLAVE OF 10.196.88.29:6380 enabled (user request)
[14] 01 Jul 01:34:45.947 * Connecting to MASTER 10.196.88.29:6380
[14] 01 Jul 01:34:45.947 * MASTER <-> SLAVE sync started
[14] 01 Jul 01:34:45.948 * Non blocking connect for SYNC fired the event.
[14] 01 Jul 01:34:45.948 * Master replied to PING, replication can continue...
[14] 01 Jul 01:34:45.948 * Partial resynchronization not possible (no cached master)
[14] 01 Jul 01:34:45.948 * Full resync from master: b912b647401917d52742c0eac3ae2f795f59f48f:1
[14] 01 Jul 01:34:45.960 * MASTER <-> SLAVE sync: receiving 18 bytes from master
[14] 01 Jul 01:34:45.960 * MASTER <-> SLAVE sync: Flushing old data
[14] 01 Jul 01:34:45.960 * MASTER <-> SLAVE sync: Loading DB in memory
[14] 01 Jul 01:34:45.960 * MASTER <-> SLAVE sync: Finished with success
[14] 01 Jul 01:38:42.495 # Connection with master lost.
[14] 01 Jul 01:38:42.495 * Caching the disconnected master state.
[14] 01 Jul 01:38:42.495 * Discarding previously cached master state.
[14] 01 Jul 01:38:42.495 * MASTER MODE enabled (user request)
[14] 01 Jul 01:38:42.495 # CONFIG REWRITE executed with success.
[14] 01 Jul 01:38:42.506 * SLAVE OF 10.196.88.29:6380 enabled (user request)
[14] 01 Jul 01:38:43.425 * Connecting to MASTER 10.196.88.29:6380
[14] 01 Jul 01:38:43.426 * MASTER <-> SLAVE sync started
[14] 01 Jul 01:38:43.426 * Non blocking connect for SYNC fired the event.
[14] 01 Jul 01:38:43.427 * Master replied to PING, replication can continue...
[14] 01 Jul 01:38:43.427 * Partial resynchronization not possible (no cached master)
[14] 01 Jul 01:38:43.427 * Full resync from master: b912b647401917d52742c0eac3ae2f795f59f48f:10930
[14] 01 Jul 01:38:43.520 * MASTER <-> SLAVE sync: receiving 18 bytes from master
[14] 01 Jul 01:38:43.520 * MASTER <-> SLAVE sync: Flushing old data
[14] 01 Jul 01:38:43.520 * MASTER <-> SLAVE sync: Loading DB in memory
[14] 01 Jul 01:38:43.520 * MASTER <-> SLAVE sync: Finished with success
Sentinel Config
port 26379
pidfile "/var/run/redis-sentinel.pid"
logfile ""
daemonize no
# Generated by CONFIG REWRITE
dir "/data"
sentinel monitor redis-holt-overflow 10.19.8.2 6380 1
sentinel down-after-milliseconds redis-holt-overflow 3100
sentinel config-epoch redis-holt-overflow 0
sentinel leader-epoch redis-holt-overflow 1
sentinel known-slave redis-holt-overflow 10.19.8.3 6381
sentinel current-epoch 1
Redis & Sentinel Info:
redis_version:2.8.21
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:551c16ab9d912477
redis_mode:standalone
os:Linux 3.10.0-123.8.1.el7.x86_64 x86_64
arch_bits:64
multiplexing_api:epoll
gcc_version:4.7.2
process_id:13
run_id:7e1a1b6c844a969424d16f3efa116707ea7a60bf
tcp_port:6380
uptime_in_seconds:1312
uptime_in_days:0
hz:10
lru_clock:9642428
config_file:/usr/local/etc/redis/redis.conf
It appears you are running into the "docker network" issue. If you look in your logs, they show different IPs. This is because Sentinel detects each node's address from the IP it connects from during discovery. Are these on different Docker hosts?
From the documentation:
Since Sentinels auto detect slaves using masters INFO output information, the detected slaves will not be reachable, and Sentinel will never be able to failover the master, since there are no good slaves from the point of view of the system, so there is currently no way to monitor with Sentinel a set of master and slave instances deployed with Docker, unless you instruct Docker to map the port 1:1.
For sentinel a docker image can be found at https://registry.hub.docker.com/u/joshula/redis-sentinel/ which shows the use of announce-ip and bind to set it up.
For more details, see http://redis.io/topics/sentinel specifically the Docker section where it goes into detail on how to set things up in Docker to handle the situation.
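As a rough sketch of that announce approach, the directives go into each Sentinel's sentinel.conf; the IP and port below are placeholders for whatever address the other hosts can actually reach:
sentinel announce-ip 10.19.8.100
sentinel announce-port 26379
Alongside that, the Redis and Sentinel ports need to be published 1:1 by Docker (e.g. docker run -p 6380:6380 for the master and -p 26379:26379 for Sentinel) so the announced ports match the reachable ones.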
Dog gone it, yeah it was one of the scripts. It was essentially triggering in the interim period when both Redis instances are masters and preemptively reverting the promoted slave back to slave-status. It's been a long week.