Redis Sentinel manual failover command times out
I have one Redis master, one slave, and one Sentinel monitoring them. Everything seems to be working properly, including failover when the master is killed. But when I issue the SENTINEL FAILOVER command, Sentinel gets stuck in the state +failover-state-wait-promotion for a few minutes. It seems like the slave is not getting the promotion command. This doesn't make sense, because there doesn't seem to be any trouble with network communication from the Sentinel host to either of the Redis hosts. I'm running all 3 of the processes in Docker containers, but I'm not sure how that could cause the problem. I can run redis-cli from the Sentinel host (i.e. from inside the Docker container) and can remotely execute the slaveof command. I can also monitor both Redis instances and see Sentinel's pings and INFO requests. I looked at the logs for the master and slave and see nothing abnormal. Looking at THIS post, there does not seem to be any reason why Sentinel would consider the Redis instances invalid.
I'm fairly experienced with Sentinel, but rather new to Docker, so I'm not sure how to go about determining what the problem is. Any ideas?
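For reference, these are roughly the checks I ran from inside the Sentinel container (a sketch using the addresses from the Sentinel config below, not an exact transcript):
redis-cli -h 10.19.8.2 -p 6380 ping
redis-cli -h 10.19.8.3 -p 6381 info replication
redis-cli -p 26379 sentinel master redis-holt-overflow
redis-cli -p 26379 sentinel slaves redis-holt-overflow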
Sentinel Log
[8] 01 Jul 01:36:57.317 # Sentinel runid is c337f6f0dfa1d41357338591cd0181c07cb026d0
[8] 01 Jul 01:38:13.135 # +monitor master redis-holt-overflow 10.19.8.2 6380 quorum 1
[8] 01 Jul 01:38:13.135 # +set master redis-holt-overflow 10.19.8.2 6380 down-after-milliseconds 3100
[8] 01 Jul 01:38:13.199 * +slave slave 10.19.8.3:6381 10.19.8.3 6381 # redis-holt-overflow 10.19.8.2 6380
[8] 01 Jul 01:38:42.288 # Executing user requested FAILOVER of 'redis-holt-overflow'
[8] 01 Jul 01:38:42.288 # +new-epoch 1
[8] 01 Jul 01:38:42.288 # +try-failover master redis-holt-overflow 10.19.8.2 6380
[8] 01 Jul 01:38:42.352 # +vote-for-leader c337f6f0dfa1d41357338591cd0181c07cb026d0 1
[8] 01 Jul 01:38:42.352 # +elected-leader master redis-holt-overflow 10.19.8.2 6380
[8] 01 Jul 01:38:42.352 # +failover-state-select-slave master redis-holt-overflow 10.19.8.2 6380
[8] 01 Jul 01:38:42.404 # +selected-slave slave 10.19.8.3:6381 10.19.8.3 6381 # redis-holt-overflow 10.19.8.2 6380
[8] 01 Jul 01:38:42.404 * +failover-state-send-slaveof-noone slave 10.19.8.3:6381 10.19.8.3 6381 # redis-holt-overflow 10.19.8.2 6380
[8] 01 Jul 01:38:42.488 * +failover-state-wait-promotion slave 10.19.8.3:6381 10.19.8.3 6381 # redis-holt-overflow 10.19.8.2 6380
[8] 01 Jul 01:41:42.565 # -failover-abort-slave-timeout master redis-holt-overflow 10.19.8.2 6380
Redis Master Log
[17] 01 Jul 01:13:58.251 # Server started, Redis version 2.8.21
[17] 01 Jul 01:13:58.252 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
[17] 01 Jul 01:13:58.252 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
[17] 01 Jul 01:13:58.252 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
[17] 01 Jul 01:13:58.252 * DB loaded from disk: 0.000 seconds
[17] 01 Jul 01:13:58.252 * The server is now ready to accept connections on port 6380
[17] 01 Jul 01:34:45.796 * Slave 10.196.88.30:6381 asks for synchronization
[17] 01 Jul 01:34:45.796 * Full resync requested by slave 10.196.88.30:6381
[17] 01 Jul 01:34:45.796 * Starting BGSAVE for SYNC with target: disk
[17] 01 Jul 01:34:45.797 * Background saving started by pid 20
[20] 01 Jul 01:34:45.798 * DB saved on disk
[20] 01 Jul 01:34:45.799 * RDB: 0 MB of memory used by copy-on-write
[17] 01 Jul 01:34:45.808 * Background saving terminated with success
[17] 01 Jul 01:34:45.808 * Synchronization with slave 10.196.88.30:6381 succeeded
[17] 01 Jul 01:38:42.343 # Connection with slave 10.196.88.30:6381 lost.
[17] 01 Jul 01:38:43.275 * Slave 10.196.88.30:6381 asks for synchronization
[17] 01 Jul 01:38:43.275 * Full resync requested by slave 10.196.88.30:6381
[17] 01 Jul 01:38:43.275 * Starting BGSAVE for SYNC with target: disk
[17] 01 Jul 01:38:43.275 * Background saving started by pid 21
[21] 01 Jul 01:38:43.277 * DB saved on disk
[21] 01 Jul 01:38:43.277 * RDB: 0 MB of memory used by copy-on-write
[17] 01 Jul 01:38:43.368 * Background saving terminated with success
[17] 01 Jul 01:38:43.368 * Synchronization with slave 10.196.88.30:6381 succeeded
Redis Slave Log
[14] 01 Jul 01:15:51.435 # Server started, Redis version 2.8.21
[14] 01 Jul 01:15:51.435 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
[14] 01 Jul 01:15:51.435 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
[14] 01 Jul 01:15:51.435 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
[14] 01 Jul 01:15:51.435 * DB loaded from disk: 0.000 seconds
[14] 01 Jul 01:15:51.435 * The server is now ready to accept connections on port 6381
[14] 01 Jul 01:34:45.088 * SLAVE OF 10.196.88.29:6380 enabled (user request)
[14] 01 Jul 01:34:45.947 * Connecting to MASTER 10.196.88.29:6380
[14] 01 Jul 01:34:45.947 * MASTER <-> SLAVE sync started
[14] 01 Jul 01:34:45.948 * Non blocking connect for SYNC fired the event.
[14] 01 Jul 01:34:45.948 * Master replied to PING, replication can continue...
[14] 01 Jul 01:34:45.948 * Partial resynchronization not possible (no cached master)
[14] 01 Jul 01:34:45.948 * Full resync from master: b912b647401917d52742c0eac3ae2f795f59f48f:1
[14] 01 Jul 01:34:45.960 * MASTER <-> SLAVE sync: receiving 18 bytes from master
[14] 01 Jul 01:34:45.960 * MASTER <-> SLAVE sync: Flushing old data
[14] 01 Jul 01:34:45.960 * MASTER <-> SLAVE sync: Loading DB in memory
[14] 01 Jul 01:34:45.960 * MASTER <-> SLAVE sync: Finished with success
[14] 01 Jul 01:38:42.495 # Connection with master lost.
[14] 01 Jul 01:38:42.495 * Caching the disconnected master state.
[14] 01 Jul 01:38:42.495 * Discarding previously cached master state.
[14] 01 Jul 01:38:42.495 * MASTER MODE enabled (user request)
[14] 01 Jul 01:38:42.495 # CONFIG REWRITE executed with success.
[14] 01 Jul 01:38:42.506 * SLAVE OF 10.196.88.29:6380 enabled (user request)
[14] 01 Jul 01:38:43.425 * Connecting to MASTER 10.196.88.29:6380
[14] 01 Jul 01:38:43.426 * MASTER <-> SLAVE sync started
[14] 01 Jul 01:38:43.426 * Non blocking connect for SYNC fired the event.
[14] 01 Jul 01:38:43.427 * Master replied to PING, replication can continue...
[14] 01 Jul 01:38:43.427 * Partial resynchronization not possible (no cached master)
[14] 01 Jul 01:38:43.427 * Full resync from master: b912b647401917d52742c0eac3ae2f795f59f48f:10930
[14] 01 Jul 01:38:43.520 * MASTER <-> SLAVE sync: receiving 18 bytes from master
[14] 01 Jul 01:38:43.520 * MASTER <-> SLAVE sync: Flushing old data
[14] 01 Jul 01:38:43.520 * MASTER <-> SLAVE sync: Loading DB in memory
[14] 01 Jul 01:38:43.520 * MASTER <-> SLAVE sync: Finished with success
Sentinel Config
port 26379
pidfile "/var/run/redis-sentinel.pid"
logfile ""
daemonize no
# Generated by CONFIG REWRITE
dir "/data"
sentinel monitor redis-holt-overflow 10.19.8.2 6380 1
sentinel down-after-milliseconds redis-holt-overflow 3100
sentinel config-epoch redis-holt-overflow 0
sentinel leader-epoch redis-holt-overflow 1
sentinel known-slave redis-holt-overflow 10.19.8.3 6381
sentinel current-epoch 1
Redis & Sentinel Info:
redis_version:2.8.21
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:551c16ab9d912477
redis_mode:standalone
os:Linux 3.10.0-123.8.1.el7.x86_64 x86_64
arch_bits:64
multiplexing_api:epoll
gcc_version:4.7.2
process_id:13
run_id:7e1a1b6c844a969424d16f3efa116707ea7a60bf
tcp_port:6380
uptime_in_seconds:1312
uptime_in_days:0
hz:10
lru_clock:9642428
config_file:/usr/local/etc/redis/redis.conf

It appears you are running into the "Docker network" issue. If you look at your logs, they show different IPs for the same instances. This is because discovery records the IP the connection appears to come from. Are these on different Docker hosts?
From the documentation:
Since Sentinels auto detect slaves using masters INFO output information, the detected slaves will not be reachable, and Sentinel will never be able to failover the master, since there are no good slaves from the point of view of the system, so there is currently no way to monitor with Sentinel a set of master and slave instances deployed with Docker, unless you instruct Docker to map the port 1:1.
For Sentinel, a Docker image can be found at https://registry.hub.docker.com/u/joshula/redis-sentinel/ which shows the use of announce-ip and bind to set this up.
For more details, see http://redis.io/topics/sentinel, specifically the Docker section, where it goes into detail on how to set things up in Docker to handle this situation.
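A minimal sketch of that approach (the announced address is a placeholder for the Docker host, and it assumes the container ports are published 1:1, e.g. docker run -p 6380:6380 ... and -p 26379:26379 ...):
# sentinel.conf
sentinel announce-ip "<docker-host-ip>"
sentinel announce-port 26379
sentinel monitor redis-holt-overflow <docker-host-ip> 6380 1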

Dog gone it, yeah it was one of the scripts. It was essentially triggering in the interim period when both Redis instances are masters and preemptively reverting the promoted slave back to slave-status. It's been a long week.
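In case it helps anyone else, a rough sketch of the guard I added to the script (the Sentinel host is a placeholder; the real script is more involved). It asks Sentinel who the current master is instead of assuming, and only re-slaves an instance that Sentinel does not consider the master:
# ask Sentinel for the current master before touching replication
MASTER_IP=$(redis-cli -h <sentinel-host> -p 26379 sentinel get-master-addr-by-name redis-holt-overflow | head -1)
if [ "$MASTER_IP" != "10.19.8.3" ]; then
  redis-cli -h 10.19.8.3 -p 6381 slaveof "$MASTER_IP" 6380
fi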

Related

Why does redis forcibly demote master to slave?

I run the container using the Docker redis:latest image, and after about 30 minutes the master changes to a slave and is no longer writable.
Also, the slave outputs an error once per second saying it cannot find the master.
1:M 08 Jul 2022 03:10:55.899 * DB saved on disk
1:M 08 Jul 2022 03:15:56.087 * 100 changes in 300 seconds. Saving...
1:M 08 Jul 2022 03:15:56.089 * Background saving started by pid 61
61:C 08 Jul 2022 03:15:56.091 * DB saved on disk
61:C 08 Jul 2022 03:15:56.092 * Fork CoW for RDB: current 0 MB, peak 0 MB, average 0 MB
1:M 08 Jul 2022 03:15:56.189 * Background saving terminated with success
1:S 08 Jul 2022 03:20:12.258 * Before turning into a replica, using my own master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer.
1:S 08 Jul 2022 03:20:12.258 * Connecting to MASTER 178.20.40.200:8886
1:S 08 Jul 2022 03:20:12.258 * MASTER <-> REPLICA sync started
1:S 08 Jul 2022 03:20:12.259 * REPLICAOF 178.20.40.200:8886 enabled (user request from 'id=39 addr=95.182.123.66:36904 laddr=172.31.9.234:6379 fd=11 name= age=1 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=47 qbuf-free=20427 argv-mem=24 multi-mem=0 rbs=1024 rbp=0 obl=0 oll=0 omem=0 tot-mem=22320 events=r cmd=slaveof user=default redir=-1 resp=2')
1:S 08 Jul 2022 03:20:12.524 * Non blocking connect for SYNC fired the event.
1:S 08 Jul 2022 03:20:12.791 * Master replied to PING, replication can continue...
1:S 08 Jul 2022 03:20:13.335 * Trying a partial resynchronization (request 6743ff015583c86f3ac7a4305026c42991a1ca18:1).
1:S 08 Jul 2022 03:20:13.603 * Full resync from master: ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ:1
1:S 08 Jul 2022 03:20:13.603 * MASTER <-> REPLICA sync: receiving 54976 bytes from master to disk
1:S 08 Jul 2022 03:20:14.138 * Discarding previously cached master state.
1:S 08 Jul 2022 03:20:14.138 * MASTER <-> REPLICA sync: Flushing old data
1:S 08 Jul 2022 03:20:14.139 * MASTER <-> REPLICA sync: Loading DB in memory
1:S 08 Jul 2022 03:20:14.140 # Wrong signature trying to load DB from file
1:S 08 Jul 2022 03:20:14.140 # Failed trying to load the MASTER synchronization DB from disk: Invalid argument
1:S 08 Jul 2022 03:20:14.140 * Reconnecting to MASTER 178.20.40.200:8886 after failure
1:S 08 Jul 2022 03:20:14.140 * MASTER <-> REPLICA sync started
...
1:S 08 Jul 2022 05:09:50.010 * MASTER <-> REPLICA sync started
1:S 08 Jul 2022 05:09:50.298 * Non blocking connect for SYNC fired the event.
1:S 08 Jul 2022 05:09:50.587 # Failed to read response from the server: Connection reset by peer
1:S 08 Jul 2022 05:09:50.587 # Master did not respond to command during SYNC handshake
1:S 08 Jul 2022 05:09:51.013 * Connecting to MASTER 178.20.40.200:8886
1:S 08 Jul 2022 05:09:51.014 * MASTER <-> REPLICA sync started
1:S 08 Jul 2022 05:09:51.294 * Non blocking connect for SYNC fired the event.
1:S 08 Jul 2022 05:09:51.581 # Failed to read response from the server: Connection reset by peer
1:S 08 Jul 2022 05:09:51.581 # Master did not respond to command during SYNC handshake
1:S 08 Jul 2022 05:09:52.017 * Connecting to MASTER 178.20.40.200:8886
1:S 08 Jul 2022 05:09:52.017 * MASTER <-> REPLICA sync started
1:S 08 Jul 2022 05:09:52.297 * Non blocking connect for SYNC fired the event.
1:S 08 Jul 2022 05:09:52.578 # Failed to read response from the server: Connection reset by peer
1:S 08 Jul 2022 05:09:52.578 # Master did not respond to command during SYNC handshake
1:S 08 Jul 2022 05:09:53.021 * Connecting to MASTER 178.20.40.200:8886
1:S 08 Jul 2022 05:09:53.021 * MASTER <-> REPLICA sync started
1:S 08 Jul 2022 05:09:53.308 * Non blocking connect for SYNC fired the event.
1:S 08 Jul 2022 05:09:53.594 # Failed to read response from the server: Connection reset by peer
1:S 08 Jul 2022 05:09:53.594 # Master did not respond to command during SYNC handshake
1:S 08 Jul 2022 05:09:54.025 * Connecting to MASTER 178.20.40.200:8886
1:S 08 Jul 2022 05:09:54.025 * MASTER <-> REPLICA sync started
1:S 08 Jul 2022 05:09:54.316 * Non blocking connect for SYNC fired the event.
1:S 08 Jul 2022 05:09:54.608 # Failed to read response from the server: Connection reset by peer
1:S 08 Jul 2022 05:09:54.608 # Master did not respond to command during SYNC handshake
1:S 08 Jul 2022 05:09:55.028 * Connecting to MASTER 178.20.40.200:8886
1:S 08 Jul 2022 05:09:55.028 * MASTER <-> REPLICA sync started
1:S 08 Jul 2022 05:09:55.309 * Non blocking connect for SYNC fired the event.
1:S 08 Jul 2022 05:09:55.588 # Failed to read response from the server: Connection reset by peer
1:S 08 Jul 2022 05:09:55.588 # Master did not respond to command during SYNC handshake
1:S 08 Jul 2022 05:09:56.031 * Connecting to MASTER 178.20.40.200:8886
1:S 08 Jul 2022 05:09:56.031 * MASTER <-> REPLICA sync started
1:S 08 Jul 2022 05:09:56.311 * Non blocking connect for SYNC fired the event.
1:S 08 Jul 2022 05:09:56.592 # Failed to read response from the server: Connection reset by peer
1:S 08 Jul 2022 05:09:56.592 # Master did not respond to command during SYNC handshake
1:S 08 Jul 2022 05:09:57.035 * Connecting to MASTER 178.20.40.200:8886
1:S 08 Jul 2022 05:09:57.035 * MASTER <-> REPLICA sync started
1:S 08 Jul 2022 05:09:57.321 * Non blocking connect for SYNC fired the event.
1:S 08 Jul 2022 05:09:57.610 * Master replied to PING, replication can continue...
...
SLAVEOF NO ONE
config set slave-read-only no
If I force the slave to be writable with the above commands and try to write, all data is flushed after about 5 seconds.
I don't want the master to be turned into a slave.
I am getting this on a clean EC2 Amazon Linux instance.
I don't know what's causing this, because I also have enough memory.
Why does Redis forcibly demote the master to a slave?
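For what it's worth, the REPLICAOF line in the log above already records the client that issued it (addr=95.182.123.66). A quick way to see who is connected and what role the instance currently reports (just a sketch, run against the local instance):
redis-cli -p 6379 client list
redis-cli -p 6379 info replication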

redis unable to write to AOF due to disk space (AOF rewrite enabled)

redis version: 4.0.14
OS: FreeBSD 12.0-RELEASE-p10 amd64
In the Redis documentation on AOF, it says that after version 2.6 it should automatically trigger BGREWRITEAOF:
Redis is able to automatically rewrite the AOF in background when it gets too big. The rewrite is completely safe as while Redis continues appending to the old file, a completely new one is produced with the minimal set of operations needed to create the current data set, and once this second file is ready Redis switches the two and starts appending to the new one.
It doesn't seem to be doing this. logs:
774:S 04 Mar 17:21:19.992 # MASTER timeout: no data nor PING received...
774:S 04 Mar 17:21:19.992 # Connection with master lost.
774:S 04 Mar 17:21:19.992 * Caching the disconnected master state.
774:S 04 Mar 17:21:19.992 * Connecting to MASTER 192.168.0.100:6412
774:S 04 Mar 17:21:19.992 * MASTER <-> SLAVE sync started
774:S 04 Mar 17:21:19.992 * Non blocking connect for SYNC fired the event.
774:S 04 Mar 17:21:21.596 # Error reply to PING from master: '-MISCONF Errors writing to the AOF file: No space left on device'
774:S 04 Mar 17:21:22.021 * Connecting to MASTER 192.168.0.100:6412
774:S 04 Mar 17:21:22.021 * MASTER <-> SLAVE sync started
774:S 04 Mar 17:21:22.021 * Non blocking connect for SYNC fired the event.
774:S 04 Mar 17:21:23.238 # Error reply to PING from master: '-MISCONF Errors writing to the AOF file: No space left on device'
774:S 04 Mar 17:21:24.050 * Connecting to MASTER 192.168.0.100:6412
774:S 04 Mar 17:21:24.050 * MASTER <-> SLAVE sync started
774:S 04 Mar 17:21:24.050 * Non blocking connect for SYNC fired the event.
774:S 04 Mar 17:21:25.682 # Error reply to PING from master: '-MISCONF Errors writing to the AOF file: No space left on device'
$ info
# Server
redis_version:4.0.14
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:b8839b63a68a4b99
redis_mode:cluster
os:FreeBSD 12.0-RELEASE-p10 amd64
arch_bits:64
multiplexing_api:kqueue
atomicvar_api:atomic-builtin
gcc_version:4.2.1
process_id:766
run_id:bf9d05b240cd776e4547729d062dd6f5a5e0f60d
tcp_port:6412
uptime_in_seconds:40930179
uptime_in_days:473
hz:10
lru_clock:4266405
executable:/usr/local/bin/redis-server
config_file:/usr/local/etc/redis_cluster/6412/redis.conf
# Clients
connected_clients:115
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:0
# Memory
used_memory:11588787188
used_memory_human:10.79G
used_memory_rss:11588984900
used_memory_rss_human:10.79G
used_memory_peak:16230019258
used_memory_peak_human:15.12G
used_memory_peak_perc:71.40%
used_memory_overhead:301057836
used_memory_startup:1928680
used_memory_dataset:11287729352
used_memory_dataset_perc:97.42%
total_system_memory:0
total_system_memory_human:0B
used_memory_lua:37888
used_memory_lua_human:37.00K
maxmemory:27106127360
maxmemory_human:25.24G
maxmemory_policy:allkeys-lru
mem_fragmentation_ratio:1.00
mem_allocator:libc
active_defrag_running:0
lazyfree_pending_objects:0
# Persistence
loading:0
rdb_changes_since_last_save:11838701
rdb_bgsave_in_progress:0
rdb_last_save_time:1614662573
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:87
rdb_current_bgsave_time_sec:-1
rdb_last_cow_size:0
aof_enabled:1
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:196
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:err
aof_last_cow_size:0
aof_current_size:6817578976
aof_base_size:5413636704
aof_pending_rewrite:0
aof_buffer_length:630994
aof_rewrite_buffer_length:0
aof_pending_bio_fsync:0
aof_delayed_fsync:56
# Stats
total_connections_received:456772
total_commands_processed:4953546081
instantaneous_ops_per_sec:205
total_net_input_bytes:685638114456
total_net_output_bytes:260383113923
instantaneous_input_kbps:51.50
instantaneous_output_kbps:31.54
rejected_connections:0
sync_full:62
sync_partial_ok:0
sync_partial_err:23
expired_keys:176647087
expired_stale_perc:0.00
expired_time_cap_reached_count:0
evicted_keys:0
keyspace_hits:863832269
keyspace_misses:891573350
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:1641619
migrate_cached_sockets:0
slave_expires_tracked_keys:0
active_defrag_hits:0
active_defrag_misses:0
active_defrag_key_hits:0
active_defrag_key_misses:0
# Replication
role:master
connected_slaves:0
master_replid:90da82c2d4a4ff31b64721286b17b8716170be6c
master_replid2:a40e37f9dca7fab0b151a632564d4e5bd3163321
master_repl_offset:310200952366
second_repl_offset:308626210072
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:310199903791
repl_backlog_histlen:1048576
# CPU
used_cpu_sys:334704.88
used_cpu_user:192502.30
used_cpu_sys_children:1007.51
used_cpu_user_children:12385.73
# Cluster
cluster_enabled:1
# Keyspace
db0:keys=5665785,expires=0,avg_ttl=0
(0.50s)
When I try to run BGREWRITEAOF I get the following message:
/: write failed, filesystem is full
Please let me know if I need to put any more information here.
Thanks for your help.
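For reference, the automatic rewrite described in that documentation is controlled by two directives; my understanding is that typical values look like this (a sketch with the stock defaults, not my exact config):
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb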

Kubernetes Redis Cluster PubSub Channels not getting synched on replica

I have set up a Redis cluster on Kubernetes; the cluster state is OK and the replica is connected to the master. As per the logs, the full synchronization has also completed. The logs are as follows:
9:M 22 Oct 12:24:18.209 * Slave 192.168.1.41:6379 asks for synchronization
9:M 22 Oct 12:24:18.209 * Partial resynchronization not accepted: Replication ID mismatch (Slave asked for '794b9c74abe40ac90c752f32a102078e063ff636', my replication IDs are '0f499740a46665d12fab921838297273279ad136' and '0000000000000000000000000000000000000000')
9:M 22 Oct 12:24:18.209 * Starting BGSAVE for SYNC with target: disk
9:M 22 Oct 12:24:18.211 * Background saving started by pid 231
231:C 22 Oct 12:24:18.215 * DB saved on disk
231:C 22 Oct 12:24:18.216 * RDB: 4 MB of memory used by copy-on-write
9:M 22 Oct 12:24:18.224 * Background saving terminated with success
9:M 22 Oct 12:24:18.224 * Synchronization with slave 192.168.1.41:6379 succeeded
Still, when I check the list of Pub/Sub channels on the replica, it does not show the channels, and this breaks the Pub/Sub flow.
Any help/advice is appreciated.
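For completeness, this is roughly how I compared the two nodes (the master address is a placeholder; 192.168.1.41 is the replica from the log above):
redis-cli -h <master-pod-ip> -p 6379 pubsub channels
redis-cli -h 192.168.1.41 -p 6379 pubsub channels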

Redis Master Slave Switch after Aof rewrite

This Redis cluster has 240 nodes (120 masters and 120 slaves) and has worked well for a long time. But now it gets a master-slave switch every few hours.
I got some logs from the Redis servers.
5c541d3a765e087af7775ba308f51ffb2aa54151
10.12.28.165:6502
13306:M 08 Mar 18:55:02.597 * Background append only file rewriting started by pid 15396
13306:M 08 Mar 18:55:41.636 # Cluster state changed: fail
13306:M 08 Mar 18:55:45.321 # Connection with slave client id #112948 lost.
13306:M 08 Mar 18:55:46.243 # Configuration change detected. Reconfiguring myself as a replica of afb6e012db58bd26a7c96182b04f0a2ba6a45768
13306:S 08 Mar 18:55:47.134 * AOF rewrite child asks to stop sending diffs.
15396:C 08 Mar 18:55:47.134 * Parent agreed to stop sending diffs. Finalizing AOF...
15396:C 08 Mar 18:55:47.134 * Concatenating 0.02 MB of AOF diff received from parent.
15396:C 08 Mar 18:55:47.135 * SYNC append only file rewrite performed
15396:C 08 Mar 18:55:47.186 * AOF rewrite: 4067 MB of memory used by copy-on-write
13306:S 08 Mar 18:55:47.209 # Cluster state changed: ok
5ac747878f881349aa6a62b179176ddf603e034c
10.12.30.107:6500
22825:M 08 Mar 18:55:30.534 * FAIL message received from da493af5bb3d15fc563961de09567a47787881be about 5c541d3a765e087af7775ba308f51ffb2aa54151
22825:M 08 Mar 18:55:31.440 # Failover auth granted to afb6e012db58bd26a7c96182b04f0a2ba6a45768 for epoch 323
22825:M 08 Mar 18:55:41.587 * Background append only file rewriting started by pid 23628
22825:M 08 Mar 18:56:24.200 # Cluster state changed: fail
22825:M 08 Mar 18:56:30.002 # Connection with slave client id #382416 lost.
22825:M 08 Mar 18:56:30.830 * FAIL message received from 0decbe940c6f4d4330fae5a9c129f1ad4932405d about 5ac747878f881349aa6a62b179176ddf603e034c
22825:M 08 Mar 18:56:30.840 # Failover auth denied to d46f95da06cfcd8ea5eaa15efabff5bd5e99df55: its master is up
22825:M 08 Mar 18:56:30.843 # Configuration change detected. Reconfiguring myself as a replica of d46f95da06cfcd8ea5eaa15efabff5bd5e99df55
22825:S 08 Mar 18:56:31.030 * Clear FAIL state for node 5ac747878f881349aa6a62b179176ddf603e034c: slave is reachable again.
22825:S 08 Mar 18:56:31.030 * Clear FAIL state for node 5c541d3a765e087af7775ba308f51ffb2aa54151: slave is reachable again.
22825:S 08 Mar 18:56:31.294 # Cluster state changed: ok
22825:S 08 Mar 18:56:31.595 * Connecting to MASTER 10.12.30.104:6404
22825:S 08 Mar 18:56:31.671 * MASTER <-> SLAVE sync started
22825:S 08 Mar 18:56:31.671 * Non blocking connect for SYNC fired the event.
22825:S 08 Mar 18:56:31.672 * Master replied to PING, replication can continue...
22825:S 08 Mar 18:56:31.673 * Partial resynchronization not possible (no cached master)
22825:S 08 Mar 18:56:31.691 * AOF rewrite child asks to stop sending diffs.
It appears that the Redis master-slave switch happened after AOF rewriting.
Here is the config of this cluster.
daemonize no
tcp-backlog 511
timeout 0
tcp-keepalive 60
loglevel notice
databases 16
dir "/var/cachecloud/data"
stop-writes-on-bgsave-error no
repl-timeout 60
repl-ping-slave-period 10
repl-disable-tcp-nodelay no
repl-backlog-size 10000000
repl-backlog-ttl 7200
slave-serve-stale-data yes
slave-read-only yes
slave-priority 100
lua-time-limit 5000
slowlog-log-slower-than 10000
slowlog-max-len 128
hash-max-ziplist-entries 512
hash-max-ziplist-value 64
list-max-ziplist-entries 512
list-max-ziplist-value 64
set-max-intset-entries 512
zset-max-ziplist-entries 128
zset-max-ziplist-value 64
activerehashing yes
client-output-buffer-limit normal 0 0 0
client-output-buffer-limit slave 512mb 128mb 60
client-output-buffer-limit pubsub 32mb 8mb 60
hz 10
port 6401
maxmemory 13000mb
maxmemory-policy volatile-lru
appendonly yes
appendfsync no
appendfilename "appendonly-6401.aof"
dbfilename "dump-6401.rdb"
aof-rewrite-incremental-fsync yes
no-appendfsync-on-rewrite yes
auto-aof-rewrite-min-size 62500kb
auto-aof-rewrite-percentage 86
rdbcompression yes
rdbchecksum yes
repl-diskless-sync no
repl-diskless-sync-delay 5
maxclients 10000
hll-sparse-max-bytes 3000
min-slaves-to-write 0
min-slaves-max-lag 10
aof-load-truncated yes
notify-keyspace-events ""
bind 10.12.26.226
protected-mode no
cluster-enabled yes
cluster-node-timeout 15000
cluster-slave-validity-factor 10
cluster-migration-barrier 1
cluster-config-file "nodes-6401.conf"
cluster-require-full-coverage no
rename-command FLUSHDB ""
rename-command FLUSHALL ""
rename-command KEYS ""
In my opinion, the AOF rewrite should not affect the Redis main thread, but it seems to make this node stop responding to other nodes' pings.
Check the THP (Transparent Huge Pages) kernel parameter on Linux, because the AOF diff is only 0.02 MB while copy-on-write used 4067 MB (see the rewrite log above).
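For example, to check and change it (the same commands the Redis startup warning suggests):
cat /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/enabled
Redis must be restarted after THP is disabled.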

redis-cli 'sentinel slaves redis-cluster' returns an empty list with a password protected master

Dear All,
My current redis-cluster setup is the following:
3 different Linux servers
srv 1 => redis master + sentinel 1
srv 2 => redis slave + sentinel 2
srv 3 => sentinel 3 (sentinel only, to avoid a split-brain situation)
The Redis version:
redis_version:3.2.3
redis_mode:sentinel
os:Linux 3.10.0-514.21.2.el7.x86_64 x86_64
tcp_port:26379
For some reason Sentinel can't find a suitable slave to promote to master in case of failover.
The redis-cli command "sentinel slaves redis-cluster" returns an empty list :/ (see my terminal output below), BUT the 3 sentinels can "talk" to each other.
These are the 3 redis-cli sentinel commands I used to get this information:
ip-10-0-0-118.eu-west-1.compute.internal:26379> sentinel slaves redis-cluster
(empty list or set)
ip-10-0-0-118.eu-west-1.compute.internal:26379> sentinel ckquorum redis-cluster
OK 3 usable Sentinels. Quorum and failover authorization can be reached
ip-10-0-0-118.eu-west-1.compute.internal:26379> sentinel failover redis-cluster
(error) NOGOODSLAVE No suitable slave to promote
The configuration files (redis and sentinel) are basic, and I use authentication.
Any idea what I might have misconfigured so far? :/
Thanks in advance.
kr,
Orsius.
documentation:
https://redis.io/topics/sentinel
http://download.redis.io/redis-stable/sentinel.conf
here are my sentinel logs:
. . .
2361:X 17 Jul 09:20:55.159 # 04ffbe62cec24e9635abbf8985c804e27bb8899b voted for 2cd4dce89889baadc178ba8909b894cf42f184d9 23
2361:X 17 Jul 09:20:55.170 # f5e93cc7c1a109ca8aa4588b92156f7fb5c29c72 voted for 2cd4dce89889baadc178ba8909b894cf42f184d9 23
2361:X 17 Jul 09:20:55.221 # +elected-leader master redis-cluster 10.0.0.223 6379
2361:X 17 Jul 09:20:55.221 # +failover-state-select-slave master redis-cluster 10.0.0.223 6379
2361:X 17 Jul 09:20:55.304 # -failover-abort-no-good-slave master redis-cluster 10.0.0.223 6379
2361:X 17 Jul 09:20:55.357 # Next failover delay: I will not start a failover before Mon Jul 17 09:26:55 2017
2361:X 17 Jul 09:21:41.876 # +new-epoch 24
2361:X 17 Jul 09:21:41.878 # +vote-for-leader f5e93cc7c1a109ca8aa4588b92156f7fb5c29c72 24
2361:X 17 Jul 09:21:41.920 # Next failover delay: I will not start a failover before Mon Jul 17 09:27:42 2017
2361:X 17 Jul 09:27:42.092 # +new-epoch 25
2361:X 17 Jul 09:27:42.092 # +try-failover master redis-cluster 10.0.0.223 6379
2361:X 17 Jul 09:27:42.099 # +vote-for-leader 2cd4dce89889baadc178ba8909b894cf42f184d9 25
2361:X 17 Jul 09:27:42.102 # f5e93cc7c1a109ca8aa4588b92156f7fb5c29c72 voted for 2cd4dce89889baadc178ba8909b894cf42f184d9 25
2361:X 17 Jul 09:27:42.103 # 04ffbe62cec24e9635abbf8985c804e27bb8899b voted for 2cd4dce89889baadc178ba8909b894cf42f184d9 25
2361:X 17 Jul 09:27:42.165 # +elected-leader master redis-cluster 10.0.0.223 6379
2361:X 17 Jul 09:27:42.165 # +failover-state-select-slave master redis-cluster 10.0.0.223 6379
2361:X 17 Jul 09:27:42.248 # -failover-abort-no-good-slave master redis-cluster 10.0.0.223 6379
2361:X 17 Jul 09:27:42.314 # Next failover delay: I will not start a failover before Mon Jul 17 09:33:42 2017
. . .
If I trust the following GitHub issue, Sentinel only promotes good slaves to be the new master.
source: https://github.com/antirez/redis/issues/1796
A slave is only considered a good slave if it follows the rules below:
its slave-priority is not 0;
it is not being demoted (i.e. it was not the old master);
its last PING reply is not older than info_validity_time;
its last INFO reply is not older than info_validity_time;
it is not in sdown or odown state, and not disconnected.
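To check these conditions against a live setup, you can ask Sentinel what it has recorded and query the slave directly (host and password are placeholders, since the master is password protected):
redis-cli -p 26379 sentinel master redis-cluster
redis-cli -p 26379 sentinel slaves redis-cluster
redis-cli -h <slave-host> -p 6379 -a <password> info replication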
My problem was actually a misconfiguration in my redis-cluster files (redis.conf & redis-sentinel.conf) which launched my two Redis instances in 'standalone' mode.
I put the working configuration on my GitHub repository: [github.com/orsius/redis-cluster][1]
Hope it'll help someone one day.
Keep calm and continue using redis-cluster;)
Orsius.
  [1]: https://github.com/orsius/redis-cluster