Redis closing connections to clients while saving changed keys in background

I have one Redis server on EC2 with 2 app servers connected to it, all small/medium instances. Traffic is not high; only 10 changed keys in 300 seconds. I started noticing connection errors to the Redis machine on the app servers:
Cannot get Jedis connection; nested exception is redis.clients.jedis.exceptions.JedisConnectionException: Could not get a resource from the pool
At first I thought it was a problem with my pool configuration or the Java client I am using to interface with Redis. However, I quickly debunked this theory when I noticed that both app servers would always generate these exceptions at exactly the same time, and that they always came in bunches. Then I looked at redis.log and noticed the following output while the errors were appearing:
[13939] 08 May 22:31:05.051 * 10 changes in 300 seconds. Saving...
[13939] 08 May 22:31:05.342 * Background saving started by pid 13945
[13939] 08 May 22:31:09.357 - DB 0: 606477 keys (0 volatile) in 1048576 slots HT.
[13939] 08 May 22:31:09.357 - 3 clients connected (0 slaves), 764180208 bytes in use
[13939] 08 May 22:31:14.542 - DB 0: 606477 keys (0 volatile) in 1048576 slots HT.
[13939] 08 May 22:31:25.947 - 3 clients connected (0 slaves), 764180208 bytes in use
[13939] 08 May 22:31:25.947 - Error writing to client: Connection reset by peer
[13939] 08 May 22:31:25.947 - Error writing to client: Connection reset by peer
[13939] 08 May 22:31:25.947 - Error writing to client: Connection reset by peer
[13939] 08 May 22:31:25.947 - Accepted 10.123.29.90:56301
[13939] 08 May 22:31:25.947 - Accepted 10.42.105.60:35315
[13939] 08 May 22:31:25.947 - Error writing to client: Connection reset by peer
[13939] 08 May 22:31:25.947 - Accepted 10.123.29.90:56302
[13939] 08 May 22:31:25.947 - Error writing to client: Connection reset by peer
[13939] 08 May 22:31:25.947 - Accepted 10.42.105.60:35317
[13939] 08 May 22:31:25.947 - Error writing to client: Connection reset by peer
[13939] 08 May 22:31:25.947 - Accepted 10.123.29.90:56306
[13939] 08 May 22:31:25.948 - Error writing to client: Connection reset by peer
[13939] 08 May 22:31:25.948 - Accepted 10.42.105.60:35318
[13939] 08 May 22:31:25.948 - Error writing to client: Connection reset by peer
[13939] 08 May 22:31:25.948 - Accepted 10.123.29.90:56308
[13939] 08 May 22:31:25.948 - Error writing to client: Connection reset by peer
[13939] 08 May 22:31:25.948 - Accepted 10.42.105.60:35319
[13939] 08 May 22:31:25.948 - Error writing to client: Connection reset by peer
[13939] 08 May 22:31:25.948 - Accepted 10.42.105.60:35320
[13939] 08 May 22:31:25.948 - Error writing to client: Connection reset by peer
[13939] 08 May 22:31:25.948 - Accepted 10.123.29.90:56310
[13939] 08 May 22:31:25.948 - Error writing to client: Connection reset by peer
[13939] 08 May 22:31:25.948 - Accepted 10.42.105.60:35322
[13939] 08 May 22:31:27.652 - Accepted 10.42.105.60:35327
[13939] 08 May 22:31:27.872 - Accepted 10.42.105.60:35329
[13945] 08 May 22:31:27.926 * DB saved on disk
The errors only occur while Redis is saving new data in the background. I am using Redis 2.6. Any help is appreciated.
EDIT: Redis connection pool configuration below, using spring-data:
<bean id="redisPoolConfig" class="redis.clients.jedis.JedisPoolConfig" lazy-init="false"
p:maxTotal="500"
p:maxIdle="20"
p:testOnBorrow="true"
p:testOnCreate="true"
p:testOnReturn="true"
p:maxWaitMillis="30000"
/>
<bean id="jedisConnectionFactory" class="org.springframework.data.redis.connection.jedis.JedisConnectionFactory"
p:hostName="${REDIS_HOST}"
p:port="${REDIS_PORT}"
p:usePool="true"
p:poolConfig-ref="redisPoolConfig"
/>
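For reference, the same pool settings expressed programmatically — a sketch assuming Jedis 2.x with Spring Data Redis 1.x APIs; the host and port literals below are placeholders for ${REDIS_HOST}/${REDIS_PORT}. Note that with maxTotal=500 on each of the two app servers, up to 1000 sockets can be open against Redis at once, each consuming a file descriptor on the server:

import org.springframework.data.redis.connection.jedis.JedisConnectionFactory;
import redis.clients.jedis.JedisPoolConfig;

public class RedisPoolSetup {
    public static void main(String[] args) {
        JedisPoolConfig poolConfig = new JedisPoolConfig();
        poolConfig.setMaxTotal(500);        // up to 500 pooled connections per app server
        poolConfig.setMaxIdle(20);
        poolConfig.setTestOnBorrow(true);   // each validation adds a PING round-trip
        poolConfig.setTestOnCreate(true);
        poolConfig.setTestOnReturn(true);
        poolConfig.setMaxWaitMillis(30000); // wait up to 30 s for a free connection before failing

        JedisConnectionFactory factory = new JedisConnectionFactory();
        factory.setHostName("redis.internal.example"); // placeholder for ${REDIS_HOST}
        factory.setPort(6379);                         // placeholder for ${REDIS_PORT}
        factory.setUsePool(true);
        factory.setPoolConfig(poolConfig);
        factory.afterPropertiesSet();                  // normally called by the Spring container
    }
}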

I think I found the answer. After upgrading to 2.8.9 and restarting, the log alerted me that my OS was limiting the number of open file descriptors to 1024, which a running Redis instance can easily exceed. So what was probably happening was that the background save process was opening new file descriptors, which prevented new connections from being accepted, since each client connection also consumes a descriptor. After raising the limit to 10000 via ulimit -n, everything seems to be functioning correctly.
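For anyone else hitting this: ulimit -n only raises the limit for the current shell/session, so it is worth persisting it. A rough sketch of how to check and persist the limit — exact files vary by distro and by how Redis is started (init script vs. systemd), so treat the paths below as assumptions:

# What limit does the running Redis process actually have?
grep 'open files' /proc/$(pidof redis-server)/limits

# Raise it for the current shell only (what was done above)
ulimit -n 10000

# Persist it for the redis user via PAM limits (path assumed; a systemd unit would use LimitNOFILE instead)
echo 'redis soft nofile 10000' | sudo tee -a /etc/security/limits.conf
echo 'redis hard nofile 10000' | sudo tee -a /etc/security/limits.conf

# Redis itself caps accepted clients with maxclients (and reserves roughly 32 descriptors for internal use);
# check the effective value, and raise it in redis.conf if it should grow along with the fd limit
redis-cli config get maxclients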

Related

MISCONF Redis is configured to save RDB snapshots, but it is currently not able to persist on disk

I use the Redisson client, but the client reports the error “MISCONF Redis is configured to save RDB snapshots, but it is currently not able to persist on disk”:
Unable to send PING command over channel: [id: 0x04130153, L:/171.20.0.8:38080 - R:10.3.236.102/10.3.236.102:6379]
org.redisson.client.RedisException: MISCONF Redis is configured to save RDB snapshots, but it is currently not able to persist on disk. Commands that may modify the data set are disabled, because this instance is configured to report errors during writes if RDB snapshotting fails (stop-writes-on-bgsave-error option). Please check the Redis logs for details about the RDB error.. channel: [id: 0x04130153, L:/171.20.0.8:38080 - R:10.3.236.102/10.3.236.102:6379] command: (PING), params: []
The Redis server itself shows no error:
{"log":"3443340:C 09 Apr 00:12:41.648 * DB saved on disk\n","stream":"stdout","time":"2022-04-09T00:12:41.649083457Z"}
{"log":"3443340:C 09 Apr 00:12:41.772 * RDB: 38 MB of memory used by copy-on-write\n","stream":"stdout","time":"2022-04-09T00:12:41.77335587Z"}
{"log":"7:M 09 Apr 00:12:42.024 * Background saving terminated with success\n","stream":"stdout","time":"2022-04-09T00:12:42.025019006Z"}
{"log":"7:M 09 Apr 00:12:45.027 *
because the server time is not correct.

Aerospike Cluster Nodes intermittently going down and coming back up

I have an Aerospike cluster of 15 nodes. This cluster performs fairly well under a normal load of 10k TPS. I did some tests today at a higher TPS, raising it to around 130k-150k TPS.
I observed that some nodes intermittently went down and automatically came back up after a few seconds. Because of these nodes going down, we are getting heartbeat timeouts and hence read timeouts.
One cluster node configuration: 8 cores, 120 GB RAM. I am storing data in memory.
All nodes have sufficient space remaining. Out of a total cluster capacity of 1.2 TB (15 × 120 GB), only 275 GB is used.
Also, the network is not at all flaky. All these machines are in a data centre and are high-bandwidth machines.
Some observations made by monitoring AMC:
Saw some nodes (around 5-6) become inactive for a few seconds
There was a high number of client connections on a few of the nodes that went down. For example, there were 6000-7000 client connections on all other nodes, while one node had an unusual 25000 client connections.
Some error logs in cluster nodes:
Sep 15 2020 17:00:43 GMT: WARNING (hb): (hb.c:4864) (repeated:5) could not create heartbeat connection to node {10.33.162.134:2057}
Sep 15 2020 17:00:43 GMT: WARNING (socket): (socket.c:808) (repeated:5) Error while connecting socket to 10.33.162.134:2057
Sep 15 2020 17:00:53 GMT: WARNING (socket): (socket.c:740) (repeated:3) Timeout while connecting
Sep 15 2020 17:00:53 GMT: WARNING (hb): (hb.c:4864) (repeated:3) could not create heartbeat connection to node {10.33.162.134:2057}
Sep 15 2020 17:00:53 GMT: WARNING (socket): (socket.c:808) (repeated:3) Error while connecting socket to 10.33.162.134:2057
Sep 15 2020 17:01:03 GMT: WARNING (socket): (socket.c:740) (repeated:1) Timeout while connecting
Sep 15 2020 17:01:03 GMT: WARNING (hb): (hb.c:4864) (repeated:1) could not create heartbeat connection to node {10.33.162.134:2057}
Sep 15 2020 17:01:03 GMT: WARNING (socket): (socket.c:808) (repeated:1) Error while connecting socket to 10.33.162.134:2057
Sep 15 2020 17:01:13 GMT: WARNING (socket): (socket.c:740) (repeated:2) Timeout while connecting
Sep 15 2020 17:01:13 GMT: WARNING (hb): (hb.c:4864) (repeated:2) could not create heartbeat connection to node {10.33.162.134:2057}
Sep 15 2020 17:01:13 GMT: WARNING (socket): (socket.c:808) (repeated:2) Error while connecting socket to 10.33.162.134:2057
Sep 15 2020 17:02:44 GMT: WARNING (socket): (socket.c:740) Timeout while connecting
Sep 15 2020 17:02:44 GMT: WARNING (socket): (socket.c:808) Error while connecting socket to 10.33.54.144:2057
Sep 15 2020 17:02:44 GMT: WARNING (hb): (hb.c:4864) could not create heartbeat connection to node {10.33.54.144:2057}
Sep 15 2020 17:02:53 GMT: WARNING (socket): (socket.c:740) (repeated:1) Timeout while connecting
We also saw some of these error logs in nodes that were going down:
Sep 15 2020 17:08:58 GMT: WARNING (hb): (hb.c:5122) sending mesh message to bb9280f220a0102 on fd 4155 failed : Broken pipe
Sep 15 2020 17:08:58 GMT: WARNING (hb): (hb.c:5122) sending mesh message to bb9b676220a0102 on fd 4149 failed : Broken pipe
Sep 15 2020 17:08:58 GMT: WARNING (hb): (hb.c:5122) sending mesh message to bb9fbd6200a0102 on fd 42 failed : Broken pipe
Sep 15 2020 17:08:58 GMT: WARNING (hb): (hb.c:5122) sending mesh message to bb96d3d220a0102 on fd 4444 failed : Broken pipe
Sep 15 2020 17:08:58 GMT: WARNING (hb): (hb.c:5122) sending mesh message to bb99036210a0102 on fd 4278 failed : Broken pipe
Sep 15 2020 17:08:58 GMT: WARNING (hb): (hb.c:5122) sending mesh message to bb9f102220a0102 on fd 4143 failed : Broken pipe
Sep 15 2020 17:08:58 GMT: WARNING (hb): (hb.c:5122) sending mesh message to bb91822210a0102 on fd 4515 failed : Broken pipe
Sep 15 2020 17:08:58 GMT: WARNING (hb): (hb.c:5122) sending mesh message to bb9e5ff200a0102 on fd 4173 failed : Broken pipe
Sep 15 2020 17:08:58 GMT: WARNING (hb): (hb.c:5122) sending mesh message to bb93f65200a0102 on fd 38 failed : Broken pipe
Sep 15 2020 17:08:58 GMT: WARNING (hb): (hb.c:5122) sending mesh message to bb9132f220a0102 on fd 4414 failed : Connection reset by peer
Sep 15 2020 17:08:58 GMT: WARNING (hb): (hb.c:5122) sending mesh message to bb939be210a0102 on fd 4567 failed : Connection reset by peer
Sep 15 2020 17:08:58 GMT: WARNING (hb): (hb.c:5122) sending mesh message to bb9b19a220a0102 on fd 4165 failed : Broken pipe
Attaching the aerospike.conf file here:
service {
user root
group root
service-threads 12
transaction-queues 12
transaction-threads-per-queue 4
proto-fd-max 50000
migrate-threads 1
pidfile /var/run/aerospike/asd.pid
}
logging {
file /var/log/aerospike/aerospike.log {
context any info
context migrate debug
}
}
network {
service {
address any
port 3000
}
heartbeat {
mode mesh
port 2057
mesh-seed-address-port 10.34.154.177 2057
mesh-seed-address-port 10.34.15.40 2057
mesh-seed-address-port 10.32.255.229 2057
mesh-seed-address-port 10.33.54.144 2057
mesh-seed-address-port 10.32.190.157 2057
mesh-seed-address-port 10.32.101.63 2057
mesh-seed-address-port 10.34.2.241 2057
mesh-seed-address-port 10.32.214.251 2057
mesh-seed-address-port 10.34.30.114 2057
mesh-seed-address-port 10.33.162.134 2057
mesh-seed-address-port 10.33.190.57 2057
mesh-seed-address-port 10.34.61.109 2057
mesh-seed-address-port 10.34.47.19 2057
mesh-seed-address-port 10.33.34.24 2057
mesh-seed-address-port 10.34.118.182 2057
interval 150
timeout 20
}
fabric {
port 3001
}
info {
port 3003
}
}
namespace PS1 {
replication-factor 2
memory-size 70G
single-bin false
data-in-index false
storage-engine memory
stop-writes-pct 90
high-water-memory-pct 75
}
namespace LS1 {
replication-factor 2
memory-size 30G
single-bin false
data-in-index false
storage-engine memory
stop-writes-pct 90
high-water-memory-pct 75
}
Any possible explanations for this?
It seems like the nodes are having network connectivity issues at such higher throughput. This can have different root causes, from a simple network-related bottleneck (bandwidth, packets per second) to something on the node itself getting in the way of interfacing properly with the network (soft-interrupt surges, improper distribution of network queues, CPU thrashing). This would prevent heartbeat connections/messages from going through, resulting in nodes leaving the cluster until they recover. If running in a cloud/virtualized environment, some hosts may have noisier neighbors than others, etc.
The increase in the number of connections is a symptom, as any slowdown on a node would cause the client to compensate by increasing the throughput (which increases the number of connections, which can in turn lead to a downward spiral).
Finally, a single node leaving or joining the cluster shouldn't impact read transactions much. Check your policy and make sure you have socketTimeout / totalTimeout / maxRetries, etc. set correctly so that a read can quickly retry against a different replica (see the sketch below).
This article can help on this last point: https://discuss.aerospike.com/t/understanding-timeout-and-retry-policies/2852/3
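As a rough illustration of that last point, here is a sketch of a read policy using the Aerospike Java client (field names and defaults vary between client versions; the timeout values and the record key are placeholders, and the seed IP/namespace are taken from the aerospike.conf above):

import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Key;
import com.aerospike.client.Record;
import com.aerospike.client.policy.Policy;
import com.aerospike.client.policy.Replica;

public class ReadPolicyExample {
    public static void main(String[] args) {
        // first mesh seed node from the config above; service port 3000
        AerospikeClient client = new AerospikeClient("10.34.154.177", 3000);

        Policy readPolicy = new Policy();
        readPolicy.socketTimeout = 100;        // ms per network attempt (placeholder)
        readPolicy.totalTimeout = 500;         // ms budget across all retries (placeholder)
        readPolicy.maxRetries = 2;             // retry a couple of times...
        readPolicy.sleepBetweenRetries = 0;    // ...without extra back-off for reads
        readPolicy.replica = Replica.SEQUENCE; // retry against the other replica rather than the same node

        Key key = new Key("PS1", null, "some-user-key"); // namespace PS1 from aerospike.conf; key is a placeholder
        Record record = client.get(readPolicy, key);
        System.out.println(record);

        client.close();
    }
}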

Failed opening .rdb for saving: Permission denied - started after a while of running successfully

I have had a Node web service running successfully on an AWS Ubuntu server for over a month, with the requests cached using Redis.
Yesterday I started getting the following error from some of my routes:
MISCONF Redis is configured to save RDB snapshots, but is currently not able to persist on disk. Commands that may modify the data set are disabled. Please check Redis logs for details about the error.
I was able to stop the error occurring by using:
config set stop-writes-on-bgsave-error no
as suggested in the answers to this question, but it doesn't actually solve the underlying problem.
To find the underlying problem I checked the logs and found the following had started happening:
[1105] 09 Aug 13:17:14.800 - 0 clients connected (0 slaves), 797680 bytes in use
[1105] 09 Aug 13:17:15.101 * 1 changes in 900 seconds. Saving...
[1105] 09 Aug 13:17:15.101 * Background saving started by pid 28090
[28090] 09 Aug 13:17:15.101 # Failed opening .rdb for saving: Permission denied
[1105] 09 Aug 13:17:15.201 # Background saving error
Over the weekend no one had been using the server, but before the weekend the logs were fine, and we were getting no errors:
[12521] 06 Aug 04:49:27.308 - 0 clients connected (0 slaves), 803352 bytes in use
[12521] 06 Aug 04:49:29.012 * 1 changes in 900 seconds. Saving...
[12521] 06 Aug 04:49:29.012 * Background saving started by pid 26663
[26663] 06 Aug 04:49:29.014 * DB saved on disk
[26663] 06 Aug 04:49:29.014 * RDB: 2 MB of memory used by copy-on-write
[12521] 06 Aug 04:49:29.112 * Background saving terminated with success
As I said, no one has touched this server in the intervening time.
Looking around for people having the same problem I found this question. I checked the ownership and permissions on the directory and db file as suggested in the answers there:
drwxr-xr-x 2 redis redis 26 Aug 6 06:55 redis
-rw-r--r-- 1 redis redis 18 Aug 6 06:55 dump-6379.rdb
The permissions and ownership both look OK to me, but I have noticed that the date on the file and folder falls between the last time I saw the service working and the first time it failed. Unfortunately, that hasn't really helped me figure out what to do next, and I am at a bit of a loss.
I am looking for suggestions for next steps to find the cause of the problem, or at least a way of making redis able to write again.
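A few commands that might help narrow down where Redis is trying to write, and as which user — a hedged sketch; the port, directory, and user below are assumptions to be replaced with what your instance reports:

# Where is Redis actually trying to write the dump?
redis-cli -p 6379 config get dir
redis-cli -p 6379 config get dbfilename

# What does Redis think happened on the last background save?
redis-cli -p 6379 info persistence | grep rdb_last_bgsave_status

# Which user is the server running as, and can that user write to the dump directory?
ps -o user= -p "$(pidof redis-server)"
ls -ld /var/lib/redis   # substitute the directory reported by 'config get dir'
sudo -u redis touch /var/lib/redis/write-test && sudo -u redis rm /var/lib/redis/write-test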

Redis Sentinel manual failover command times out

I have one Redis master, one slave, and one Sentinel monitoring them. Everything seems to be working properly, including failover when the master is killed. But when I issue the SENTINEL FAILOVER command, Sentinel gets stuck in the state +failover-state-wait-promotion for a few minutes. It seems like the slave is not getting the promotion command. This doesn't make sense, because there doesn't seem to be any trouble with network communication from the Sentinel host to either of the Redis hosts. I'm running all 3 of the processes in Docker containers, but I'm not sure how that could cause the problem. I can run redis-cli from the Sentinel host (i.e. from inside the Docker container) and can remotely execute the slaveof command. I can also monitor both Redis instances and see SENTINEL pings and info requests. I looked at the logs for the master and slave and see nothing abnormal. Looking at THIS post, there does not seem to be any reason why Sentinel would consider the Redis instances invalid.
I'm fairly experienced with Sentinel, but rather new to Docker. Not sure how to proceed determining what the problem is. Any ideas?
Sentinel Log
[8] 01 Jul 01:36:57.317 # Sentinel runid is c337f6f0dfa1d41357338591cd0181c07cb026d0
[8] 01 Jul 01:38:13.135 # +monitor master redis-holt-overflow 10.19.8.2 6380 quorum 1
[8] 01 Jul 01:38:13.135 # +set master redis-holt-overflow 10.19.8.2 6380 down-after-milliseconds 3100
[8] 01 Jul 01:38:13.199 * +slave slave 10.19.8.3:6381 10.19.8.3 6381 # redis-holt-overflow 10.19.8.2 6380
[8] 01 Jul 01:38:42.288 # Executing user requested FAILOVER of 'redis-holt-overflow'
[8] 01 Jul 01:38:42.288 # +new-epoch 1
[8] 01 Jul 01:38:42.288 # +try-failover master redis-holt-overflow 10.19.8.2 6380
[8] 01 Jul 01:38:42.352 # +vote-for-leader c337f6f0dfa1d41357338591cd0181c07cb026d0 1
[8] 01 Jul 01:38:42.352 # +elected-leader master redis-holt-overflow 10.19.8.2 6380
[8] 01 Jul 01:38:42.352 # +failover-state-select-slave master redis-holt-overflow 10.19.8.2 6380
[8] 01 Jul 01:38:42.404 # +selected-slave slave 10.19.8.3:6381 10.19.8.3 6381 # redis-holt-overflow 10.19.8.2 6380
[8] 01 Jul 01:38:42.404 * +failover-state-send-slaveof-noone slave 10.19.8.3:6381 10.19.8.3 6381 # redis-holt-overflow 10.19.8.2 6380
[8] 01 Jul 01:38:42.488 * +failover-state-wait-promotion slave 10.19.8.3:6381 10.19.8.3 6381 # redis-holt-overflow 10.19.8.2 6380
[8] 01 Jul 01:41:42.565 # -failover-abort-slave-timeout master redis-holt-overflow 10.19.8.2 6380
Redis Master Log
[17] 01 Jul 01:13:58.251 # Server started, Redis version 2.8.21
[17] 01 Jul 01:13:58.252 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
[17] 01 Jul 01:13:58.252 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
[17] 01 Jul 01:13:58.252 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
[17] 01 Jul 01:13:58.252 * DB loaded from disk: 0.000 seconds
[17] 01 Jul 01:13:58.252 * The server is now ready to accept connections on port 6380
[17] 01 Jul 01:34:45.796 * Slave 10.196.88.30:6381 asks for synchronization
[17] 01 Jul 01:34:45.796 * Full resync requested by slave 10.196.88.30:6381
[17] 01 Jul 01:34:45.796 * Starting BGSAVE for SYNC with target: disk
[17] 01 Jul 01:34:45.797 * Background saving started by pid 20
[20] 01 Jul 01:34:45.798 * DB saved on disk
[20] 01 Jul 01:34:45.799 * RDB: 0 MB of memory used by copy-on-write
[17] 01 Jul 01:34:45.808 * Background saving terminated with success
[17] 01 Jul 01:34:45.808 * Synchronization with slave 10.196.88.30:6381 succeeded
[17] 01 Jul 01:38:42.343 # Connection with slave 10.196.88.30:6381 lost.
[17] 01 Jul 01:38:43.275 * Slave 10.196.88.30:6381 asks for synchronization
[17] 01 Jul 01:38:43.275 * Full resync requested by slave 10.196.88.30:6381
[17] 01 Jul 01:38:43.275 * Starting BGSAVE for SYNC with target: disk
[17] 01 Jul 01:38:43.275 * Background saving started by pid 21
[21] 01 Jul 01:38:43.277 * DB saved on disk
[21] 01 Jul 01:38:43.277 * RDB: 0 MB of memory used by copy-on-write
[17] 01 Jul 01:38:43.368 * Background saving terminated with success
[17] 01 Jul 01:38:43.368 * Synchronization with slave 10.196.88.30:6381 succeeded
Redis Slave Log
[14] 01 Jul 01:15:51.435 # Server started, Redis version 2.8.21
[14] 01 Jul 01:15:51.435 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
[14] 01 Jul 01:15:51.435 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
[14] 01 Jul 01:15:51.435 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
[14] 01 Jul 01:15:51.435 * DB loaded from disk: 0.000 seconds
[14] 01 Jul 01:15:51.435 * The server is now ready to accept connections on port 6381
[14] 01 Jul 01:34:45.088 * SLAVE OF 10.196.88.29:6380 enabled (user request)
[14] 01 Jul 01:34:45.947 * Connecting to MASTER 10.196.88.29:6380
[14] 01 Jul 01:34:45.947 * MASTER <-> SLAVE sync started
[14] 01 Jul 01:34:45.948 * Non blocking connect for SYNC fired the event.
[14] 01 Jul 01:34:45.948 * Master replied to PING, replication can continue...
[14] 01 Jul 01:34:45.948 * Partial resynchronization not possible (no cached master)
[14] 01 Jul 01:34:45.948 * Full resync from master: b912b647401917d52742c0eac3ae2f795f59f48f:1
[14] 01 Jul 01:34:45.960 * MASTER <-> SLAVE sync: receiving 18 bytes from master
[14] 01 Jul 01:34:45.960 * MASTER <-> SLAVE sync: Flushing old data
[14] 01 Jul 01:34:45.960 * MASTER <-> SLAVE sync: Loading DB in memory
[14] 01 Jul 01:34:45.960 * MASTER <-> SLAVE sync: Finished with success
[14] 01 Jul 01:38:42.495 # Connection with master lost.
[14] 01 Jul 01:38:42.495 * Caching the disconnected master state.
[14] 01 Jul 01:38:42.495 * Discarding previously cached master state.
[14] 01 Jul 01:38:42.495 * MASTER MODE enabled (user request)
[14] 01 Jul 01:38:42.495 # CONFIG REWRITE executed with success.
[14] 01 Jul 01:38:42.506 * SLAVE OF 10.196.88.29:6380 enabled (user request)
[14] 01 Jul 01:38:43.425 * Connecting to MASTER 10.196.88.29:6380
[14] 01 Jul 01:38:43.426 * MASTER <-> SLAVE sync started
[14] 01 Jul 01:38:43.426 * Non blocking connect for SYNC fired the event.
[14] 01 Jul 01:38:43.427 * Master replied to PING, replication can continue...
[14] 01 Jul 01:38:43.427 * Partial resynchronization not possible (no cached master)
[14] 01 Jul 01:38:43.427 * Full resync from master: b912b647401917d52742c0eac3ae2f795f59f48f:10930
[14] 01 Jul 01:38:43.520 * MASTER <-> SLAVE sync: receiving 18 bytes from master
[14] 01 Jul 01:38:43.520 * MASTER <-> SLAVE sync: Flushing old data
[14] 01 Jul 01:38:43.520 * MASTER <-> SLAVE sync: Loading DB in memory
[14] 01 Jul 01:38:43.520 * MASTER <-> SLAVE sync: Finished with success
Sentinel Config
port 26379
pidfile "/var/run/redis-sentinel.pid"
logfile ""
daemonize no
# Generated by CONFIG REWRITE
dir "/data"
sentinel monitor redis-holt-overflow 10.19.8.2 6380 1
sentinel down-after-milliseconds redis-holt-overflow 3100
sentinel config-epoch redis-holt-overflow 0
sentinel leader-epoch redis-holt-overflow 1
sentinel known-slave redis-holt-overflow 10.19.8.3 6381
sentinel current-epoch 1
Redis & Sentinel Info:
redis_version:2.8.21
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:551c16ab9d912477
redis_mode:standalone
os:Linux 3.10.0-123.8.1.el7.x86_64 x86_64
arch_bits:64
multiplexing_api:epoll
gcc_version:4.7.2
process_id:13
run_id:7e1a1b6c844a969424d16f3efa116707ea7a60bf
tcp_port:6380
uptime_in_seconds:1312
uptime_in_days:0
hz:10
lru_clock:9642428
config_file:/usr/local/etc/redis/redis.conf
It appears you are running into the "Docker network" issue. If you look in your logs, they show different IPs. This is due to detecting which IP the connection comes from during discovery. Are these on different Docker hosts?
From the documentation:
Since Sentinels auto detect slaves using masters INFO output information, the detected slaves will not be reachable, and Sentinel will never be able to failover the master, since there are no good slaves from the point of view of the system, so there is currently no way to monitor with Sentinel a set of master and slave instances deployed with Docker, unless you instruct Docker to map the port 1:1.
For Sentinel, a Docker image can be found at https://registry.hub.docker.com/u/joshula/redis-sentinel/ which shows the use of announce-ip and bind to set it up.
For more details, see http://redis.io/topics/sentinel, specifically the Docker section, where it goes into detail on how to set things up in Docker to handle this situation.
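For illustration, a sentinel.conf fragment along the lines of what that image sets up — announce-ip/announce-port exist in recent Sentinel releases for exactly this NAT/Docker case; the announce address below is a placeholder for an IP reachable from outside the container:

port 26379
# Advertise an address that the Redis instances, other Sentinels, and clients can actually reach,
# instead of the container-internal one (placeholder IP)
sentinel announce-ip 10.19.8.4
sentinel announce-port 26379
sentinel monitor redis-holt-overflow 10.19.8.2 6380 1
sentinel down-after-milliseconds redis-holt-overflow 3100

The alternative, as the quoted documentation says, is to run the containers with 1:1 port mapping (or host networking) so that the addresses Sentinel discovers are the same ones it can connect to.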
Dog gone it, yeah it was one of the scripts. It was essentially triggering in the interim period when both Redis instances are masters and preemptively reverting the promoted slave back to slave-status. It's been a long week.

What's the hard limit for apache ThreadsPerChild parameter in httpd.conf?

I'm using IBM HTTP Server, which is based on Apache. When I try to increase the ThreadsPerChild parameter above 1000, the HTTP server still only starts 1000 worker threads. Below is the related information:
error log:
[Thu Jul 05 10:50:45 2012] [debug] mpm_winnt.c(564): Child 9040: retrieved 2 listeners from parent
[Thu Jul 05 10:50:45 2012] [notice] Child 9040: Acquired the start mutex.
[Thu Jul 05 10:50:45 2012] [notice] Child 9040: Starting 1000 worker threads.
[Thu Jul 05 10:50:45 2012] [notice] Child 9040: Starting thread to listen on port 81.
[Thu Jul 05 10:50:45 2012] [notice] Child 9040: Starting thread to listen on port 80.
httpd.conf
<IfModule mpm_winnt.c>
    ThreadLimit 2048
    ThreadsPerChild 2000
    MaxRequestsPerChild 0
</IfModule>
IHS 7.0.0.0
OS winNT
BTW, another concern with ThreadsPerChild is whether one Apache thread handles exactly one client connection here, or whether one thread can take care of more than one client connection.
Please help me out.
Thanks very much
On the limits of the ThreadsPerChild setting, quoting from IBM HTTP Server Performance Tuning:
On 64-bit Windows OSes, each instance of IBM HTTP Server is limited to approximately 2500 ThreadsPerChild. On 32-bit Windows, this number is closer to 5000. These numbers are not exact limits, because the real limits are the sum of the fixed startup cost of memory for each thread + the maximum runtime memory usage per thread, which varies based on configuration and workload. Raising ThreadsPerChild and approaching these limits risks child process crashes when runtime memory usage puts the process address space over the 2GB or 3GB barrier.
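Applying that guidance, a minimal mpm_winnt fragment staying under the ~2500 mark might look like the following (the values are illustrative, not a recommendation for your workload):

<IfModule mpm_winnt.c>
    # ThreadLimit must be >= ThreadsPerChild, otherwise ThreadsPerChild is reduced to match;
    # changing ThreadLimit requires a full stop/start of the server, not a graceful restart
    ThreadLimit 2400
    ThreadsPerChild 2400
    MaxRequestsPerChild 0
</IfModule>

After a full stop/start, check the error log for the "Starting N worker threads" notice to confirm the new value actually took effect.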
The interesting thing to note here is that ThreadsPerChild is not the only parameter for tuning concurrent connections to IHS. You may find information about other parameters (like MaxClients) and tuning methodology at the following link:
Tuning IBM HTTP Server to maximize the number of client connections to WebSphere Application Server