How to prevent RabbitMQ from dead-lettering messages on connection loss - rabbitmq
We use RabbitMQ to communicate between our components. For some reason, some components could no longer communicate with the server tonight. On the server the issue looked like this:
...
=ERROR EVENT==== Tue, 06 Dec 2022 21:18:15 GMT ===
2022-12-06 21:18:15.144351+00:00 [erro] <0.9186.1093> closing AMQP connection <0.9186.1093> (123.123.123.123:26688 -> 234.234.234.234:5671 - rabbitConnectionFactory#1f48fa72:0):
2022-12-06 21:18:15.144351+00:00 [erro] <0.9186.1093> missed heartbeats from client, timeout: 60s
=ERROR EVENT==== Tue, 06 Dec 2022 21:18:17 GMT ===
2022-12-06 21:18:17.549345+00:00 [erro] <0.4340.1090> closing AMQP connection <0.4340.1090> (123.123.123.123:39360 -> 234.234.234.234:5671 - rabbitConnectionFactory#e4348c0:0):
2022-12-06 21:18:17.549345+00:00 [erro] <0.4340.1090> missed heartbeats from client, timeout: 60s
=ERROR EVENT==== Tue, 06 Dec 2022 21:18:27 GMT ===
2022-12-06 21:18:27.177133+00:00 [erro] <0.5137.403> closing AMQP connection <0.5137.403> (123.123.123.123:39360 -> 234.234.234.234:5671 - rabbitConnectionFactory#6f071d0c:0):
2022-12-06 21:18:27.177133+00:00 [erro] <0.5137.403> missed heartbeats from client, timeout: 60s
=ERROR EVENT==== Tue, 06 Dec 2022 21:18:31 GMT ===
2022-12-06 21:18:31.701088+00:00 [erro] <0.25212.1093> closing AMQP connection <0.25212.1093> (123.123.123.123:30528 -> 234.234.234.234:5671 - some-component):
2022-12-06 21:18:31.701088+00:00 [erro] <0.25212.1093> missed heartbeats from client, timeout: 60s
=ERROR EVENT==== Tue, 06 Dec 2022 21:18:36 GMT ===
2022-12-06 21:18:36.922101+00:00 [erro] <0.10334.685> closing AMQP connection <0.10334.685> (123.123.123.123:24960 -> 234.234.234.234:5671 - SpringAMQP#1a61721e:0):
2022-12-06 21:18:36.922101+00:00 [erro] <0.10334.685> missed heartbeats from client, timeout: 60s
...
The components reconnected and continued their operation, though for one queue some messages were dead-lettered. It took me some time to find out why they were dead-lettered, and their processing was delayed by a few hours because they had to be manually shoveled back to the in-queue after the investigation.
I assume that these dead-lettered messages might have been unacked at the time of the disconnect.
My question: how can I prevent that behaviour, so that the messages are requeued and then processed normally (either by some other instance that still has a connection, or by the same instance once it re-establishes the connection), analogous to the consumer nacking the message with requeue=true?
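To illustrate the behaviour I am after, here is a minimal sketch using the plain Java amqp-client API (our components actually use Spring AMQP; the host, queue name and processing logic below are made up for illustration): on a processing failure the message is nacked with requeue=true so that another instance, or this one after reconnecting, can pick it up again.

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.DeliverCallback;

public class RequeueingConsumer {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("rabbitmq.example.com"); // hypothetical host
        try (Connection connection = factory.newConnection();
             Channel channel = connection.createChannel()) {
            boolean autoAck = false; // manual acknowledgements
            DeliverCallback onDeliver = (consumerTag, delivery) -> {
                long tag = delivery.getEnvelope().getDeliveryTag();
                try {
                    process(delivery.getBody());  // hypothetical business logic
                    channel.basicAck(tag, false); // done: remove from the queue
                } catch (Exception e) {
                    // requeue=true puts the message back on the queue
                    // instead of dead-lettering it
                    channel.basicNack(tag, false, true);
                }
            };
            channel.basicConsume("in-queue", autoAck, onDeliver, consumerTag -> { });
            Thread.currentThread().join(); // keep the consumer alive (sketch only)
        }
    }

    private static void process(byte[] body) { /* ... */ }
}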
Related
Aerospike Cluster Nodes intermittently going down and coming back up
I have an Aerospike cluster of 15 nodes. This cluster performs fairly well under a normal load of 10k TPS. I did some tests today at a higher TPS. I raised the TPS to around 130k-150k TPS. I observed that some nodes intermittently went down and automatically came back up after a few seconds. Due to these nodes going down, we are getting heartbeat timeouts and hence read timeouts.

One cluster node configuration: 8 cores, 120GB RAM. I am storing data in memory. All nodes have sufficient space remaining. Out of a total cluster space of 1.2TB (15*120), only 275 GB of space is used up. Also, the network is not at all flaky. All these machines are in a data centre and are high-bandwidth machines.

Some observations made by monitoring AMC:
Saw some nodes (around 5-6) become inactive for a few seconds.
There was a high number of client connections on a few of the nodes that went down. For example: there were 6000-7000 client connections on all other nodes, while one of the nodes had an unusual 25000 client connections.

Some error logs in cluster nodes:
Sep 15 2020 17:00:43 GMT: WARNING (hb): (hb.c:4864) (repeated:5) could not create heartbeat connection to node {10.33.162.134:2057}
Sep 15 2020 17:00:43 GMT: WARNING (socket): (socket.c:808) (repeated:5) Error while connecting socket to 10.33.162.134:2057
Sep 15 2020 17:00:53 GMT: WARNING (socket): (socket.c:740) (repeated:3) Timeout while connecting
Sep 15 2020 17:00:53 GMT: WARNING (hb): (hb.c:4864) (repeated:3) could not create heartbeat connection to node {10.33.162.134:2057}
Sep 15 2020 17:00:53 GMT: WARNING (socket): (socket.c:808) (repeated:3) Error while connecting socket to 10.33.162.134:2057
Sep 15 2020 17:01:03 GMT: WARNING (socket): (socket.c:740) (repeated:1) Timeout while connecting
Sep 15 2020 17:01:03 GMT: WARNING (hb): (hb.c:4864) (repeated:1) could not create heartbeat connection to node {10.33.162.134:2057}
Sep 15 2020 17:01:03 GMT: WARNING (socket): (socket.c:808) (repeated:1) Error while connecting socket to 10.33.162.134:2057
Sep 15 2020 17:01:13 GMT: WARNING (socket): (socket.c:740) (repeated:2) Timeout while connecting
Sep 15 2020 17:01:13 GMT: WARNING (hb): (hb.c:4864) (repeated:2) could not create heartbeat connection to node {10.33.162.134:2057}
Sep 15 2020 17:01:13 GMT: WARNING (socket): (socket.c:808) (repeated:2) Error while connecting socket to 10.33.162.134:2057
Sep 15 2020 17:02:44 GMT: WARNING (socket): (socket.c:740) Timeout while connecting
Sep 15 2020 17:02:44 GMT: WARNING (socket): (socket.c:808) Error while connecting socket to 10.33.54.144:2057
Sep 15 2020 17:02:44 GMT: WARNING (hb): (hb.c:4864) could not create heartbeat connection to node {10.33.54.144:2057}
Sep 15 2020 17:02:53 GMT: WARNING (socket): (socket.c:740) (repeated:1) Timeout while connecting

We also saw some of these error logs in nodes that were going down:
Sep 15 2020 17:08:58 GMT: WARNING (hb): (hb.c:5122) sending mesh message to bb9280f220a0102 on fd 4155 failed : Broken pipe
Sep 15 2020 17:08:58 GMT: WARNING (hb): (hb.c:5122) sending mesh message to bb9b676220a0102 on fd 4149 failed : Broken pipe
Sep 15 2020 17:08:58 GMT: WARNING (hb): (hb.c:5122) sending mesh message to bb9fbd6200a0102 on fd 42 failed : Broken pipe
Sep 15 2020 17:08:58 GMT: WARNING (hb): (hb.c:5122) sending mesh message to bb96d3d220a0102 on fd 4444 failed : Broken pipe
Sep 15 2020 17:08:58 GMT: WARNING (hb): (hb.c:5122) sending mesh message to bb99036210a0102 on fd 4278 failed : Broken pipe
Sep 15 2020 17:08:58 GMT: WARNING (hb): (hb.c:5122) sending mesh message to bb9f102220a0102 on fd 4143 failed : Broken pipe
Sep 15 2020 17:08:58 GMT: WARNING (hb): (hb.c:5122) sending mesh message to bb91822210a0102 on fd 4515 failed : Broken pipe
Sep 15 2020 17:08:58 GMT: WARNING (hb): (hb.c:5122) sending mesh message to bb9e5ff200a0102 on fd 4173 failed : Broken pipe
Sep 15 2020 17:08:58 GMT: WARNING (hb): (hb.c:5122) sending mesh message to bb93f65200a0102 on fd 38 failed : Broken pipe
Sep 15 2020 17:08:58 GMT: WARNING (hb): (hb.c:5122) sending mesh message to bb9132f220a0102 on fd 4414 failed : Connection reset by peer
Sep 15 2020 17:08:58 GMT: WARNING (hb): (hb.c:5122) sending mesh message to bb939be210a0102 on fd 4567 failed : Connection reset by peer
Sep 15 2020 17:08:58 GMT: WARNING (hb): (hb.c:5122) sending mesh message to bb9b19a220a0102 on fd 4165 failed : Broken pipe

Attaching the aerospike.conf file here:
service {
    user root
    group root
    service-threads 12
    transaction-queues 12
    transaction-threads-per-queue 4
    proto-fd-max 50000
    migrate-threads 1
    pidfile /var/run/aerospike/asd.pid
}
logging {
    file /var/log/aerospike/aerospike.log {
        context any info
        context migrate debug
    }
}
network {
    service {
        address any
        port 3000
    }
    heartbeat {
        mode mesh
        port 2057
        mesh-seed-address-port 10.34.154.177 2057
        mesh-seed-address-port 10.34.15.40 2057
        mesh-seed-address-port 10.32.255.229 2057
        mesh-seed-address-port 10.33.54.144 2057
        mesh-seed-address-port 10.32.190.157 2057
        mesh-seed-address-port 10.32.101.63 2057
        mesh-seed-address-port 10.34.2.241 2057
        mesh-seed-address-port 10.32.214.251 2057
        mesh-seed-address-port 10.34.30.114 2057
        mesh-seed-address-port 10.33.162.134 2057
        mesh-seed-address-port 10.33.190.57 2057
        mesh-seed-address-port 10.34.61.109 2057
        mesh-seed-address-port 10.34.47.19 2057
        mesh-seed-address-port 10.33.34.24 2057
        mesh-seed-address-port 10.34.118.182 2057
        interval 150
        timeout 20
    }
    fabric {
        port 3001
    }
    info {
        port 3003
    }
}
namespace PS1 {
    replication-factor 2
    memory-size 70G
    single-bin false
    data-in-index false
    storage-engine memory
    stop-writes-pct 90
    high-water-memory-pct 75
}
namespace LS1 {
    replication-factor 2
    memory-size 30G
    single-bin false
    data-in-index false
    storage-engine memory
    stop-writes-pct 90
    high-water-memory-pct 75
}

Any possible explanations for this?
It seems like the nodes are having network connectivity issues at such higher throughput. This can have different root causes, from a simple network-related bottleneck (bandwidth, packets per second) to something on the node itself getting in the way of interfacing properly with the network (a surge in soft interrupts, improper distribution of network queues, CPU thrashing). This would prevent heartbeat connections/messages from going through, resulting in nodes leaving the cluster until the situation recovers. If running in a cloud/virtualized environment, some hosts may also have noisier neighbors than others.

The increase in the number of connections is a symptom: any slowdown on a node causes the client to compensate by increasing throughput, which increases the number of connections and can lead to a downward-spiraling effect.

Finally, a single node leaving or joining the cluster shouldn't impact read transactions much. Check your client policy and make sure you have socketTimeout / totalTimeout / maxRetries, etc. set correctly so that a read can quickly retry against a different replica. This article can help on this last point: https://discuss.aerospike.com/t/understanding-timeout-and-retry-policies/2852/3
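As a rough sketch of that last point, the timeout/retry knobs look something like this in the Aerospike Java client (field names may differ between client versions; the values, set name and key are placeholders, and the seed host/port are taken from the question's config only for illustration):

import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Key;
import com.aerospike.client.Record;
import com.aerospike.client.policy.Policy;
import com.aerospike.client.policy.Replica;

public class ReadWithRetries {
    public static void main(String[] args) {
        try (AerospikeClient client = new AerospikeClient("10.34.154.177", 3000)) {
            Policy readPolicy = new Policy();
            readPolicy.socketTimeout = 100;        // ms per network attempt
            readPolicy.totalTimeout = 500;         // ms overall, across retries
            readPolicy.maxRetries = 2;             // retry a couple of times...
            readPolicy.sleepBetweenRetries = 10;   // ...with a short pause
            readPolicy.replica = Replica.SEQUENCE; // try master first, then replica

            Key key = new Key("PS1", "demo-set", "some-user-key"); // set/key are made up
            Record record = client.get(readPolicy, key);
            System.out.println(record);
        }
    }
}

With a policy like this, a read that times out against a slow node is retried against another replica instead of surfacing as a client-side timeout.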
httpd.service active failed
Can someone tell me what's wrong? I tried to install Let's Encrypt, but when I run "status httpd.service" this error occurs:

httpd.service - The Apache HTTP Server
   Loaded: loaded (/usr/lib/systemd/system/httpd.service; disabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Mon 2018-05-07 10:32:15 CST; 5min ago
     Docs: man:httpd(8)
           man:apachectl(8)
  Process: 21861 ExecStop=/bin/kill -WINCH ${MAINPID} (code=exited, status=1/FAILURE)
  Process: 21859 ExecStart=/usr/sbin/httpd $OPTIONS -DFOREGROUND (code=exited, status=1/FAILURE)
 Main PID: 21859 (code=exited, status=1/FAILURE)

May 07 10:32:15 JobSmile httpd[21859]: AH00558: httpd: Could not reliably determine the server's fully qualified domain name, using 10.24.141.99. Set the 'ServerName' directive globally to suppress this message
May 07 10:32:15 JobSmile httpd[21859]: (98)Address already in use: AH00072: make_sock: could not bind to address 0.0.0.0:80
May 07 10:32:15 JobSmile httpd[21859]: no listening sockets available, shutting down
May 07 10:32:15 JobSmile httpd[21859]: AH00015: Unable to open logs
May 07 10:32:15 JobSmile systemd[1]: httpd.service: main process exited, code=exited, status=1/FAILURE
May 07 10:32:15 JobSmile kill[21861]: kill: cannot find process ""
May 07 10:32:15 JobSmile systemd[1]: httpd.service: control process exited, code=exited status=1
May 07 10:32:15 JobSmile systemd[1]: Failed to start The Apache HTTP Server.
May 07 10:32:15 JobSmile systemd[1]: Unit httpd.service entered failed state.
May 07 10:32:15 JobSmile systemd[1]: httpd.service failed.
Restoring rabbitmq-server database to a different host
I'm running rabbitmq-server-3.6.6-1.el7.noarch on a CentOS 7.4.1708 server. The /var/lib/ directory, on an ext4 LVM partition, had insufficient storage, which required us to extend the LVM online in order to make more space. This seemed to fix the problems at the time, but a rabbitmq-server service restart was needed, which, when attempted, hung. The service never started up again. In order to get RabbitMQ working again, the old mnesia directory was backed up and a new one was created.

In order to recover the messages in the broken service, I have moved the old mnesia directory to a new server, added NODENAME=rabbit@oldserver to /etc/rabbitmq/rabbitmq-env.conf on the new server and tried to start it, but it keeps failing to start. How can I start the old RabbitMQ database on the new host?

[root@newserver]# cat /etc/rabbitmq/rabbitmq-env.conf
NODENAME=rabbit@oldserver

When I try to start the service on the new server:

[root@newserver rabbitmq]# systemctl status rabbitmq-server.service -l
rabbitmq-server.service - RabbitMQ broker
   Loaded: loaded (/usr/lib/systemd/system/rabbitmq-server.service; disabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Mon 2018-05-07 16:25:23 WAT; 1h 32min ago
  Process: 6484 ExecStop=/usr/sbin/rabbitmqctl stop (code=exited, status=0/SUCCESS)
  Process: 5997 ExecStart=/usr/sbin/rabbitmq-server (code=exited, status=1/FAILURE)
 Main PID: 5997 (code=exited, status=1/FAILURE)
   Status: "Exited."

May 07 16:25:23 newserver.dom.local rabbitmqctl[6484]: * epmd reports: node 'rabbit' not running at all
May 07 16:25:23 newserver.dom.local rabbitmqctl[6484]: other nodes on oldserver: ['rabbitmq-cli-20']
May 07 16:25:23 newserver.dom.local rabbitmqctl[6484]: * suggestion: start the node
May 07 16:25:23 newserver.dom.local rabbitmqctl[6484]: current node details:
May 07 16:25:23 newserver.dom.local rabbitmqctl[6484]: - node name: 'rabbitmq-cli-20@newserver'
May 07 16:25:23 newserver.dom.local rabbitmqctl[6484]: - home dir: .
May 07 16:25:23 newserver.dom.local rabbitmqctl[6484]: - cookie hash: edMXQlaNlKXH72ZvAXFhbw==
May 07 16:25:23 newserver.dom.local systemd[1]: Failed to start RabbitMQ broker.
May 07 16:25:23 newserver.dom.local systemd[1]: Unit rabbitmq-server.service entered failed state.
May 07 16:25:23 newserver.dom.local systemd[1]: rabbitmq-server.service failed.
=ERROR REPORT==== 7-May-2018::16:16:43 === ** Generic server <0.145.0> terminating ** Last message in was {'$gen_cast', {submit_async, #Fun<rabbit_queue_index.32.103862237>}} ** When Server state == undefined ** Reason for termination == ** {function_clause, [{rabbit_queue_index,parse_segment_entries, [<<1,23,0,255,54,241,95,251,201,20,69,202,0,0,0,0,0,0,0,0,0,0,0, 240,0,0,1,176,131,104,6,100,0,13,98,97,115,105,99,95,109,101, 115,115,97,103,101,104,4,100,0,8,114,101,115,111,117,114,99,101, 109,0,0,0,8,47,98,105,108,108,105,110,103,100,0,8,101,120,99, 104,97,110,103,101,109,0,0,0,7,98,105,108,108,105,110,103,108,0, 0,0,1,109,0,0,0,13,98,105,108,108,105,110,103,95,113,117,101, 117,101,106,104,6,100,0,7,99,111,110,116,101,110,116,97,60,100, 0,4,110,111,110,101,109,0,0,0,7,48,0,0,0,0,0,2,100,0,25,114,97, 98,98,105,116,95,102,114,97,109,105,110,103,95,97,109,113,112, 95,48,95,57,95,49,108,0,0,0,1,109,0,0,0,240,123,34,115,104,111, 114,116,67,111,100,101,34,58,34,52,50,54,95,109,101,110,117,115, 34,44,34,116,105,109,101,115,116,97,109,112,34,58,34,50,48,49, 56,45,48,52,45,49,56,84,49,48,58,53,57,58,50,55,46,57,51,53,90, 34,44,34,109,115,105,115,100,110,34,58,34,50,51,52,57,48,57,51, 52,52,57,50,57,51,34,44,34,105,100,34,58,34,50,55,50,49,56,95, 50,51,52,57,48,57,51,52,52,57,50,57,51,95,49,53,50,52,48,52,57, 48,52>>, ------snip-------goes-on-forever---- 100,0,4,116,114,117,101>>}, no_del,no_ack}, undefined,undefined,undefined,undefined,undefined, undefined,undefined,undefined}, 10,10,10,10}, 100,100,100,100,100}, 1000,1000,1000,1000,1000,1000,1000}, 10000,10000,10000,10000,10000,10000,10000,10000,10000}}, 8988}], [{file,"src/rabbit_queue_index.erl"},{line,1067}]}, {rabbit_queue_index,'-recover_journal/1-fun-0-',1, [{file,"src/rabbit_queue_index.erl"},{line,863}]}, {lists,map,2,[{file,"lists.erl"},{line,1224}]}, {rabbit_queue_index,segment_map,2, [{file,"src/rabbit_queue_index.erl"},{line,989}]}, {rabbit_queue_index,recover_journal,1, [{file,"src/rabbit_queue_index.erl"},{line,856}]}, {rabbit_queue_index,scan_segments,3, [{file,"src/rabbit_queue_index.erl"},{line,676}]}, {rabbit_queue_index,queue_index_walker_reader,2, [{file,"src/rabbit_queue_index.erl"},{line,664}]}, {rabbit_queue_index,'-queue_index_walker/1-fun-0-',2, [{file,"src/rabbit_queue_index.erl"},{line,645}]}]}
How to setup an aerospike cluster with a single node?
I currently have a working cluster with two nodes. Following is the content of /etc/aerospike/aerospike.conf:

network {
    service {
        address any
        port 3000
    }
    heartbeat {
        mode mesh
        port 3002 # Heartbeat port for this node.
        # List one or more other nodes, one ip-address & port per line:
        mesh-seed-address-port <existing server's ip> 3002
        mesh-seed-address-port <other server's ip> 3002
        interval 250
        timeout 10
    }
    fabric {
        port 3001
    }
    info {
        port 3003
    }
}

I tried changing the heartbeat setting by removing the address/port of the other node:

heartbeat {
    mode mesh
    port 3002 # Heartbeat port for this node.
    # List one or more other nodes, one ip-address & port per line:
    mesh-seed-address-port <existing server's ip> 3002
    interval 250
    timeout 10
}

Then I restarted the aerospike and the amc services:

service aerospike restart
service amc restart

However, the /var/log/aerospike/aerospike.log file still shows two nodes present:

Mar 07 2017 13:16:28 GMT: INFO (info): (ticker.c:249) system-memory: free-kbytes 125756260 free-pct 99 heap-kbytes (2343074,2344032,2417664) heap-efficiency-pct 96.9
Mar 07 2017 13:16:28 GMT: INFO (info): (ticker.c:263) in-progress: tsvc-q 0 info-q 0 nsup-delete-q 0 rw-hash 0 proxy-hash 0 tree-gc-q 0
Mar 07 2017 13:16:28 GMT: INFO (info): (ticker.c:285) fds: proto (20,23,3) heartbeat (1,1,0) fabric (19,19,0)
Mar 07 2017 13:16:28 GMT: INFO (info): (ticker.c:294) heartbeat-received: self 0 foreign 1488
Mar 07 2017 13:16:28 GMT: INFO (info): (ticker.c:348) {FC} objects: all 0 master 0 prole 0
Mar 07 2017 13:16:28 GMT: INFO (info): (ticker.c:409) {FC} migrations: complete
Mar 07 2017 13:16:28 GMT: INFO (info): (ticker.c:428) {FC} memory-usage: total-bytes 0 index-bytes 0 sindex-bytes 0 data-bytes 0 used-pct 0.00
Mar 07 2017 13:16:28 GMT: INFO (info): (ticker.c:348) {TARGETPARAMS} objects: all 0 master 0 prole 0
Mar 07 2017 13:16:28 GMT: INFO (info): (ticker.c:409) {TARGETPARAMS} migrations: complete
Mar 07 2017 13:16:28 GMT: INFO (info): (ticker.c:428) {TARGETPARAMS} memory-usage: total-bytes 0 index-bytes 0 sindex-bytes 0 data-bytes 0 used-pct 0.00
Mar 07 2017 13:16:38 GMT: INFO (info): (ticker.c:169) NODE-ID bb93c00b70b0022 CLUSTER-SIZE 2
Mar 07 2017 13:16:38 GMT: INFO (info): (ticker.c:249) system-memory: free-kbytes 125756196 free-pct 99 heap-kbytes (2343073,2344032,2417664) heap-efficiency-pct 96.9

So does the AMC console.
This should help: http://www.aerospike.com/docs/operations/manage/cluster_mng/removing_node
Once the node is removed properly, you can restart it with the different heartbeat config so that it doesn't join the other node. To check the version, simply run asd --version. You can also use asinfo -v build. The version is also printed in asadm / AMC and in the logs right at startup.
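If querying it programmatically is more convenient, the same "build" info command that asinfo -v build issues can also be requested from the Java client; a small sketch (host/port are placeholders, and the Info API may vary slightly between client versions):

import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Info;
import com.aerospike.client.cluster.Node;

public class PrintBuild {
    public static void main(String[] args) {
        // Placeholder seed host/port for one node of the cluster.
        try (AerospikeClient client = new AerospikeClient("127.0.0.1", 3000)) {
            for (Node node : client.getNodes()) {
                // "build" is the info command behind asinfo -v build.
                String build = Info.request(null, node, "build");
                System.out.println(node.getName() + " -> " + build);
            }
        }
    }
}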
Redis closing connections to clients while saving changed keys in background
I have one Redis server on EC2 with two app servers connected to it, all small/medium instances. Traffic is not high; only 10 changed keys in 300 seconds. I started noticing connection errors to the Redis machine on the app servers:

Cannot get Jedis connection; nested exception is redis.clients.jedis.exceptions.JedisConnectionException: Could not get a resource from the pool

At first I thought it was a problem with my pool configuration or the Java client I am using to interface with Redis, but I quickly debunked this theory when I noticed that both app servers would always generate these exceptions at exactly the same time, and they always came in bunches. Then I looked at redis.log and noticed the following output while the errors were appearing:

[13939] 08 May 22:31:05.051 * 10 changes in 300 seconds. Saving...
[13939] 08 May 22:31:05.342 * Background saving started by pid 13945
[13939] 08 May 22:31:09.357 - DB 0: 606477 keys (0 volatile) in 1048576 slots HT.
[13939] 08 May 22:31:09.357 - 3 clients connected (0 slaves), 764180208 bytes in use
[13939] 08 May 22:31:14.542 - DB 0: 606477 keys (0 volatile) in 1048576 slots HT.
[13939] 08 May 22:31:25.947 - 3 clients connected (0 slaves), 764180208 bytes in use
[13939] 08 May 22:31:25.947 - Error writing to client: Connection reset by peer
[13939] 08 May 22:31:25.947 - Error writing to client: Connection reset by peer
[13939] 08 May 22:31:25.947 - Error writing to client: Connection reset by peer
[13939] 08 May 22:31:25.947 - Accepted 10.123.29.90:56301
[13939] 08 May 22:31:25.947 - Accepted 10.42.105.60:35315
[13939] 08 May 22:31:25.947 - Error writing to client: Connection reset by peer
[13939] 08 May 22:31:25.947 - Accepted 10.123.29.90:56302
[13939] 08 May 22:31:25.947 - Error writing to client: Connection reset by peer
[13939] 08 May 22:31:25.947 - Accepted 10.42.105.60:35317
[13939] 08 May 22:31:25.947 - Error writing to client: Connection reset by peer
[13939] 08 May 22:31:25.947 - Accepted 10.123.29.90:56306
[13939] 08 May 22:31:25.948 - Error writing to client: Connection reset by peer
[13939] 08 May 22:31:25.948 - Accepted 10.42.105.60:35318
[13939] 08 May 22:31:25.948 - Error writing to client: Connection reset by peer
[13939] 08 May 22:31:25.948 - Accepted 10.123.29.90:56308
[13939] 08 May 22:31:25.948 - Error writing to client: Connection reset by peer
[13939] 08 May 22:31:25.948 - Accepted 10.42.105.60:35319
[13939] 08 May 22:31:25.948 - Error writing to client: Connection reset by peer
[13939] 08 May 22:31:25.948 - Accepted 10.42.105.60:35320
[13939] 08 May 22:31:25.948 - Error writing to client: Connection reset by peer
[13939] 08 May 22:31:25.948 - Accepted 10.123.29.90:56310
[13939] 08 May 22:31:25.948 - Error writing to client: Connection reset by peer
[13939] 08 May 22:31:25.948 - Accepted 10.42.105.60:35322
[13939] 08 May 22:31:27.652 - Accepted 10.42.105.60:35327
[13939] 08 May 22:31:27.872 - Accepted 10.42.105.60:35329
[13945] 08 May 22:31:27.926 * DB saved on disk

The errors only occur while Redis is saving new data in the background. I am using Redis 2.6. Any help is appreciated.
EDIT: Redis connection pool configuration below, using spring-data:

<bean id="redisPoolConfig" class="redis.clients.jedis.JedisPoolConfig" lazy-init="false"
      p:maxTotal="500"
      p:maxIdle="20"
      p:testOnBorrow="true"
      p:testOnCreate="true"
      p:testOnReturn="true"
      p:maxWaitMillis="30000" />

<bean id="jedisConnectionFactory" class="org.springframework.data.redis.connection.jedis.JedisConnectionFactory"
      p:hostName="${REDIS_HOST}"
      p:port="${REDIS_PORT}"
      p:usePool="true"
      p:poolConfig-ref="redisPoolConfig" />
I think I found the answer. After upgrading to 2.8.9 and restarting, the log alerted me to my OS limiting the number of open file descriptors to 1024, which can easily be exceeded with a running Redis instance. So what probably was happening was the background save process was opening up new file descriptors, which prevented new connections from being accepted since each new client connection opens a file. After increasing the limit to 10000 via ulimit -n everything seems to be functioning correctly.
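For context, the pool settings from the question translate roughly to the plain-Jedis setup below (host and port are placeholders standing in for ${REDIS_HOST}/${REDIS_PORT}; the setters match the Jedis 2.x-era API used with spring-data). With maxTotal=500 on each of the two app servers, up to about 1000 client connections are possible, which, together with Redis's own files and the background-save child, sits close to a 1024-descriptor limit.

import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPool;
import redis.clients.jedis.JedisPoolConfig;

public class RedisPoolExample {
    public static void main(String[] args) {
        JedisPoolConfig poolConfig = new JedisPoolConfig();
        poolConfig.setMaxTotal(500);        // 500 per app server, 2 servers => ~1000 sockets
        poolConfig.setMaxIdle(20);
        poolConfig.setTestOnBorrow(true);
        poolConfig.setTestOnCreate(true);
        poolConfig.setTestOnReturn(true);
        poolConfig.setMaxWaitMillis(30000); // wait up to 30s for a free connection

        // Placeholder host/port.
        try (JedisPool pool = new JedisPool(poolConfig, "redis.example.com", 6379);
             Jedis jedis = pool.getResource()) {
            jedis.ping();
        }
    }
}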