Redis OOM issue

We are using Redis 6.0.0. After an OOM, I see the log below. What does "RDB memory usage" mean here?
3489 MB is very close to the maxmemory we have configured. Does it indicate that we are storing a lot of data in Redis, or is it just RDB overhead?
1666:M 01 Jun 2022 19:23:32.268 # Server initialized
1666:M 01 Jun 2022 19:23:32.270 * Loading RDB produced by version 6.0.6
1666:M 01 Jun 2022 19:23:32.270 * RDB age 339 seconds
1666:M 01 Jun 2022 19:23:32.270 * RDB memory usage when created 3489.20 Mb
Can we rule out fragmentation, given that the RDB memory usage itself was 3489 MB?
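One way to separate dataset size from fragmentation is to compare used_memory against used_memory_rss on the running instance; a minimal check, assuming redis-cli can reach it on the default port:

redis-cli -p 6379 INFO memory | grep -E 'used_memory_human|used_memory_rss_human|used_memory_peak_human|maxmemory_human|mem_fragmentation_ratio'

If used_memory alone is close to maxmemory, the dataset itself is large; a mem_fragmentation_ratio well above 1.5 would point at fragmentation instead.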

Related

Redis on GKE is running out of disk space

I just installed Redis (actually a reinstall and upgrade) on GKE via Helm. It was a pretty standard install, nothing out of the norm. Unfortunately, my "redis-master" container logs are showing sync errors over and over again:
Info 2022-02-01 12:58:22.733 MST redis1:M 01 Feb 2022 19:58:22.733 * Waiting for end of BGSAVE for SYNC
Info 2022-02-01 12:58:22.733 MST redis 8085:C 01 Feb 2022 19:58:22.733 # Write error saving DB on disk: No space left on device
Info 2022-02-01 12:58:22.830 MST redis 1:M 01 Feb 2022 19:58:22.829 # Background saving error
Info 2022-02-01 12:58:22.830 MST redis 1:M 01 Feb 2022 19:58:22.829 # Connection with replica redis-replicas-0.:6379 lost.
Info 2022-02-01 12:58:22.830 MST redis 1:M 01 Feb 2022 19:58:22.829 # SYNC failed. BGSAVE child returned an error
Info 2022-02-01 12:58:22.830 MST redis 1:M 01 Feb 2022 19:58:22.829 # Connection with replica redis-replicas-1.:6379 lost.
Info 2022-02-01 12:58:22.830 MST redis 1:M 01 Feb 2022 19:58:22.829 # SYNC failed. BGSAVE child returned an error
Info 2022-02-01 12:58:22.832 MST redis 1:M 01 Feb 2022 19:58:22.832 * Replica redis-replicas-0.:6379 asks for synchronization
Info 2022-02-01 12:58:22.832 MST redis 1:M 01 Feb 2022 19:58:22.832 * Full resync requested by replica redis-replicas-0.:6379
Info 2022-02-01 12:58:22.832 MST redis 1:M 01 Feb 2022 19:58:22.832 * Starting BGSAVE for SYNC with target: disk
Info 2022-02-01 12:58:22.833 MST redis 1:M 01 Feb 2022 19:58:22.833 * Background saving started by pid 8086
I then looked at my persistent volume claim "redis-data" and it is stuck in the "Pending" phase and never seems to get out of it. If I look at all my PVCs, though, they are all bound and appear to be healthy.
Clearly something isn't as healthy as it seems, but I am not sure how to diagnose it. Any help would be appreciated.
I know I'm late to the party, but to add more: if anyone gets stuck in the same scenario and can't delete the PVC, they can increase the size of the PVC in GKE.
Check the StorageClass:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  …
provisioner: kubernetes.io/gce-pd
allowVolumeExpansion: true
Edit the PVC:
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
The field you need to update in the PVC:
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:          # <== make sure it is under the requests section
      storage: 30Gi    # <=========
Once the changes to the PVC are applied and saved, just restart the pod.
Sharing the link below: https://medium.com/@harsh.manvar111/resizing-pvc-disk-in-gke-c5b882c90f7b
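For a quick command-line version, the same edit can be applied with kubectl patch; a sketch, assuming the claim created by the chart is named redis-data-redis-master-0 (adjust to whatever kubectl get pvc shows):

kubectl patch pvc redis-data-redis-master-0 -p '{"spec":{"resources":{"requests":{"storage":"30Gi"}}}}'
kubectl delete pod redis-master-0   # recreate the pod so it picks up the resized volume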
So I was pretty close on the heels of it. In my case, when I uninstalled Redis it didn't remove the PVC (which makes some sense), and when I reinstalled, it tried to use the same PVC.
Unfortunately, that PVC had run out of space.
I was able to manually delete the PVCs that previously existed (we didn't need to keep the data) and then reinstall Redis via Helm. At that point, it created new PVCs and worked fine.
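A rough sketch of that cleanup, assuming a Bitnami-style release named redis in the default namespace (the label selector and chart name are assumptions):

helm uninstall redis
kubectl get pvc -l app.kubernetes.io/name=redis      # confirm the leftover claims
kubectl delete pvc -l app.kubernetes.io/name=redis   # only if the data really is disposable
helm install redis bitnami/redis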

Redis crashing without any log errors

I'm debugging some weird behavior in my Redis instance, where it's crashing every 2 days or so but not showing any errors whatsoever, only this in the logs:
1:C 10 Sep 2020 15:44:14.517 # Configuration loaded
1:M 10 Sep 2020 15:44:14.522 * Running mode=standalone, port=6379.
1:M 10 Sep 2020 15:44:14.522 # Server initialized
1:M 10 Sep 2020 15:44:14.524 * Ready to accept connections
1:C 12 Sep 2020 13:20:23.751 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:C 12 Sep 2020 13:20:23.751 # Redis version=6.0.5, bits=64, commit=00000000, modified=0, pid=1, just started
1:C 12 Sep 2020 13:20:23.751 # Configuration loaded
1:M 12 Sep 2020 13:20:23.757 * Running mode=standalone, port=6379.
1:M 12 Sep 2020 13:20:23.757 # Server initialized
1:M 12 Sep 2020 13:20:23.758 * Ready to accept connections
That's all Redis says to me.
I have lots of RAM available, but I have Redis running as a single instance in a Docker container. Could a lack of processing power cause this? Should I use multiple nodes? I don't want to set up a cluster just to find out the problem was something else, so how can I trace down the actual cause of the problem?
So, in the end, it was exactly what I thought it was not: a memory leak!
I had 16 GB that was slowly being consumed until Redis crashed, with no warning from Redis, the operating system, or Docker. I fixed the app that caused the leak and the problem was gone.
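For anyone tracing a similar silent restart, a few checks that usually surface an out-of-memory kill; a sketch, with the container name redis as a placeholder:

docker inspect --format '{{.State.OOMKilled}} {{.State.ExitCode}}' redis   # true / 137 suggests the kernel killed it
dmesg -T | grep -i 'killed process'                                        # OOM-killer entries on the host
redis-cli INFO memory | grep used_memory_human                             # watch growth over time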

Kubernetes Redis Cluster PubSub Channels not getting synched on replica

I have set up a Redis cluster on Kubernetes; the cluster state is OK and the replica is connected to the master. Also, per the logs, the full synchronization completed. The logs are as follows:
9:M 22 Oct 12:24:18.209 * Slave 192.168.1.41:6379 asks for synchronization
9:M 22 Oct 12:24:18.209 * Partial resynchronization not accepted: Replication ID mismatch (Slave asked for '794b9c74abe40ac90c752f32a102078e063ff636', my replication IDs are '0f499740a46665d12fab921838297273279ad136' and '0000000000000000000000000000000000000000')
9:M 22 Oct 12:24:18.209 * Starting BGSAVE for SYNC with target: disk
9:M 22 Oct 12:24:18.211 * Background saving started by pid 231
231:C 22 Oct 12:24:18.215 * DB saved on disk
231:C 22 Oct 12:24:18.216 * RDB: 4 MB of memory used by copy-on-write
9:M 22 Oct 12:24:18.224 * Background saving terminated with success
9:M 22 Oct 12:24:18.224 * Synchronization with slave 192.168.1.41:6379 succeeded
Still, when I check the list of Pub/Sub channels on the replica (as shown below), it does not show the channels, and that breaks the Pub/Sub flow.
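For reference, this is roughly how I list the channels on the replica (the replica address is a placeholder):

redis-cli -h <replica-ip> -p 6379 PUBSUB CHANNELS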
Any help/advice is appreciated.

Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis

I ran Test-1 and Test-2 below as longer performance runs with the Redis configuration values specified, and we still see the highlighted error messages 1 and 2: the cluster fails for some time and a few of our processing jobs fail. How can we solve this problem?
Does anyone have a suggestion for avoiding a cluster failure that lasts longer than 10 seconds? The cluster does not come back up within 3 retry attempts (we use Spring's RetryTemplate with the retry count set to 3, retrying after 5 seconds, with exponential backoff for subsequent attempts) using the Jedis client.
Error-1: Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
Error-2: Marking node a523100ddfbf844c6d1cc7e0b6a4b3a2aa970aba as failing (quorum reached).
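For context, the settings that usually govern these two messages can be inspected and adjusted at runtime; this is only a sketch (the values shown are illustrative, not the ones used in the tests below):

src/redis-cli -p 6379 CONFIG SET appendfsync everysec            # fsync once per second instead of on every write
src/redis-cli -p 6379 CONFIG SET no-appendfsync-on-rewrite yes   # skip fsync while a BGSAVE/AOF-rewrite child is running
src/redis-cli -p 6379 CONFIG GET cluster-node-timeout            # raising this makes FAIL marking less aggressive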
Test-1:
Run the test with Redis Setting:
"appendfsync"="yes"
"appendonly"="no"
[root@rdcapdev1-redis-cache3 redis-3.2.5]# src/redis-cli -p 6379
127.0.0.1:6379> CONFIG GET *aof*
1) "auto-aof-rewrite-percentage"
2) "30"
3) "auto-aof-rewrite-min-size"
4) "67108864"
5) "aof-rewrite-incremental-fsync"
6) "yes"
7) "aof-load-truncated"
8) "yes"
127.0.0.1:6379> exit
[root@rdcapdev1-redis-cache3 redis-3.2.5]# src/redis-cli -p 6380
127.0.0.1:6380> CONFIG GET *aof*
1) "auto-aof-rewrite-percentage"
2) "30"
3) "auto-aof-rewrite-min-size"
4) "67108864"
5) "aof-rewrite-incremental-fsync"
6) "yes"
7) "aof-load-truncated"
8) "yes"
127.0.0.1:6380> clear
Observation:
1. The Redis failover lasted ~40 sec.
2. Around 20 documents failed at the FX and OCR level, due to the inability to write/read files to Redis.
3. This happened when ~50% of RAM was utilized.
4. The master-slave configuration was reshuffled as below after this failover.
5. Below are a few highlights of the Redis log; please refer to the attached log for more detail.
6. I have logs for this, for more details: 30Per_AofRW_2.zip
Redis1 Master log:
2515:C 05 May 11:06:30.343 * DB saved on disk
2515:C 05 May 11:06:30.379 * RDB: 15 MB of memory used by copy-on-write
837:S 05 May 11:06:30.429 * Background saving terminated with success
837:S 05 May 11:11:31.024 * 10 changes in 300 seconds. Saving...
837:S 05 May 11:11:31.067 * Background saving started by pid 2534
837:S 05 May 11:12:24.802 * FAIL message received from 6b8d49e9db288b13071559c667e95e3691ce8bd0 about ce62a26102ef54f43fa7cca64d24eab45cf42a61
837:S 05 May 11:12:27.049 * Clear FAIL state for node ce62a26102ef54f43fa7cca64d24eab45cf42a61: slave is reachable again.
2534:C 05 May 11:12:31.110 * DB saved on disk
Redis2 Master log:
837:M 05 May 10:30:22.216 * Marking node a523100ddfbf844c6d1cc7e0b6a4b3a2aa970aba as failing (quorum reached).
837:M 05 May 10:30:22.216 # Cluster state changed: fail
837:M 05 May 10:30:23.148 # Failover auth granted to 6b8d49e9db288b13071559c667e95e3691ce8bd0 for epoch 12
837:M 05 May 10:30:23.188 # Cluster state changed: ok
837:M 05 May 10:30:27.227 * Clear FAIL state for node a523100ddfbf844c6d1cc7e0b6a4b3a2aa970aba: slave is reachable again.
837:M 05 May 10:35:22.017 * 10 changes in 300 seconds. Saving...
.
.
.
837:M 05 May 11:12:23.592 * FAIL message received from 6b8d49e9db288b13071559c667e95e3691ce8bd0 about ce62a26102ef54f43fa7cca64d24eab45cf42a61
837:M 05 May 11:12:27.045 * Clear FAIL state for node ce62a26102ef54f43fa7cca64d24eab45cf42a61: slave is reachable again.
Redis3 Master Log:
833:M 05 May 10:30:22.217 * FAIL message received from 83f6a9589aa1bce8932a367894fa391edd0ce269 about a523100ddfbf844c6d1cc7e0b6a4b3a2aa970aba
833:M 05 May 10:30:22.217 # Cluster state changed: fail
833:M 05 May 10:30:23.149 # Failover auth granted to 6b8d49e9db288b13071559c667e95e3691ce8bd0 for epoch 12
833:M 05 May 10:30:23.189 # Cluster state changed: ok
1822:C 05 May 10:30:27.397 * DB saved on disk
1822:C 05 May 10:30:27.428 * RDB: 8 MB of memory used by copy-on-write
833:M 05 May 10:30:27.528 * Background saving terminated with success
833:M 05 May 10:30:27.828 * Clear FAIL state for node a523100ddfbf844c6d1cc7e0b6a4b3a2aa970aba: slave is reachable again.
HOST: localhost PORT: 6379
machine master slave
10.2.1.233 0.00 2.00
10.2.1.46 2.00 0.00
10.2.1.202 1.00 1.00
MASTER SLAVE INFO
hashCode master slave hashSlot
81ae2d757f57f36fa1df6e930af3b072084ba3e8 10.2.1.202:6379 10.2.1.233:6380, 10923-16383
6b8d49e9db288b13071559c667e95e3691ce8bd0 10.2.1.46:6380 10.2.1.233:6379, 0-5460
83f6a9589aa1bce8932a367894fa391edd0ce269 10.2.1.46:6379 10.2.1.202:6380, 5461-10922
6b8d49e9db288b13071559c667e95e3691ce8bd0 10.2.1.46:6380 master - 0 1493981044497 12 connected 0-5460
81ae2d757f57f36fa1df6e930af3b072084ba3e8 10.2.1.202:6379 master - 0 1493981045500 3 connected 10923-16383
ce62a26102ef54f43fa7cca64d24eab45cf42a61 10.2.1.202:6380 slave 83f6a9589aa1bce8932a367894fa391edd0ce269 0 1493981043495 10 connected
ac630108d1556786a4df74945cfe35db981d15fa 10.2.1.233:6380 slave 81ae2d757f57f36fa1df6e930af3b072084ba3e8 0 1493981042492 11 connected
83f6a9589aa1bce8932a367894fa391edd0ce269 10.2.1.46:6379 master - 0 1493981044497 2 connected 5461-10922
a523100ddfbf844c6d1cc7e0b6a4b3a2aa970aba 10.2.1.233:6379 myself,slave 6b8d49e9db288b13071559c667e95e3691ce8bd0 0 0 1 connected
Test-2:
Run the test with Redis Setting:
"appendfsync"="no"
"appendonly"="yes"
Observation:
1. The Redis failover lasted ~40 sec.
2. Around 20 documents failed at the FX and OCR level, due to the inability to write/read files to Redis.
3. This happened when ~50% of RAM was utilized.
4. The master-slave configuration was reshuffled as below after this failover.
5. Below are a few highlights of the Redis log; please refer to the attached log for more detail.
30Per_AofRW_2.zip
Redis1 Master log:
2515:C 05 May 11:06:30.343 * DB saved on disk
2515:C 05 May 11:06:30.379 * RDB: 15 MB of memory used by copy-on-write
837:S 05 May 11:06:30.429 * Background saving terminated with success
837:S 05 May 11:11:31.024 * 10 changes in 300 seconds. Saving...
837:S 05 May 11:11:31.067 * Background saving started by pid 2534
837:S 05 May 11:12:24.802 * FAIL message received from 6b8d49e9db288b13071559c667e95e3691ce8bd0 about ce62a26102ef54f43fa7cca64d24eab45cf42a61
837:S 05 May 11:12:27.049 * Clear FAIL state for node ce62a26102ef54f43fa7cca64d24eab45cf42a61: slave is reachable again.
2534:C 05 May 11:12:31.110 * DB saved on disk
Redis2 Master log:
5306:M 03 Apr 09:02:36.947 * Background saving terminated with success
5306:M 03 Apr 09:02:49.574 * Starting automatic rewriting of AOF on 3% growth
5306:M 03 Apr 09:02:49.583 * Background append only file rewriting started by pid 12864
5306:M 03 Apr 09:02:54.050 * AOF rewrite child asks to stop sending diffs.
12864:C 03 Apr 09:02:54.051 * Parent agreed to stop sending diffs. Finalizing AOF...
12864:C 03 Apr 09:02:54.051 * Concatenating 0.00 MB of AOF diff received from parent.
12864:C 03 Apr 09:02:54.051 * SYNC append only file rewrite performed
12864:C 03 Apr 09:02:54.058 * AOF rewrite: 2 MB of memory used by copy-on-write
5306:M 03 Apr 09:02:54.098 * Background AOF rewrite terminated with success
5306:M 03 Apr 09:02:54.098 * Residual parent diff successfully flushed to the rewritten AOF (0.00 MB)
5306:M 03 Apr 09:02:54.098 * Background AOF rewrite finished successfully
5306:M 03 Apr 09:04:01.843 * Starting automatic rewriting of AOF on 3% growth
5306:M 03 Apr 09:04:01.853 * Background append only file rewriting started by pid 12867
5306:M 03 Apr 09:04:11.657 * AOF rewrite child asks to stop sending diffs.
12867:C 03 Apr 09:04:11.657 * Parent agreed to stop sending diffs. Finalizing AOF...
12867:C 03 Apr 09:04:11.657 * Concatenating 0.00 MB of AOF diff received from parent.
12867:C 03 Apr 09:04:11.657 * SYNC append only file rewrite performed
12867:C 03 Apr 09:04:11.664 * AOF rewrite: 2 MB of memory used by copy-on-write
5306:M 03 Apr 09:04:11.675 * Background AOF rewrite terminated with success
5306:M 03 Apr 09:04:11.675 * Residual parent diff successfully flushed to the rewritten AOF (0.00 MB)
5306:M 03 Apr 09:04:11.675 * Background AOF rewrite finished successfully
5306:M 03 Apr 09:04:48.054 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
5306:M 03 Apr 09:05:28.571 * Starting automatic rewriting of AOF on 3% growth
5306:M 03 Apr 09:05:28.581 * Background append only file rewriting started by pid 12873
5306:M 03 Apr 09:05:33.300 * AOF rewrite child asks to stop sending diffs.
12873:C 03 Apr 09:05:33.300 * Parent agreed to stop sending diffs. Finalizing AOF...
12873:C 03 Apr 09:05:33.300 * Concatenating 2.09 MB of AOF diff received from parent.
12873:C 03 Apr 09:05:33.329 * SYNC append only file rewrite performed
12873:C 03 Apr 09:05:33.336 * AOF rewrite: 11 MB of memory used by copy-on-write
5306:M 03 Apr 09:05:33.395 * Background AOF rewrite terminated with success
5306:M 03 Apr 09:05:33.395 * Residual parent diff successfully flushed to the rewritten AOF (0.00 MB)
5306:M 03 Apr 09:05:33.395 * Background AOF rewrite finished successfully
5306:M 03 Apr 09:07:37.082 * 10 changes in 300 seconds. Saving...
5306:M 03 Apr 09:07:37.092 * Background saving started by pid 12875
12875:C 03 Apr 09:07:47.016 * DB saved on disk
12875:C 03 Apr 09:07:47.024 * RDB: 5 MB of memory used by copy-on-write
5306:M 03 Apr 09:07:47.113 * Background saving terminated with success
5306:M 03 Apr 09:07:51.622 * Starting automatic rewriting of AOF on 3% growth
5306:M 03 Apr 09:07:51.632 * Background append only file rewriting started by pid 12876
5306:M 03 Apr 09:07:56.559 * AOF rewrite child asks to stop sending diffs.
12876:C 03 Apr 09:07:56.559 * Parent agreed to stop sending diffs. Finalizing AOF...
12876:C 03 Apr 09:07:56.559 * Concatenating 0.00 MB of AOF diff received from parent.
12876:C 03 Apr 09:07:56.559 * SYNC append only file rewrite performed
12876:C 03 Apr 09:07:56.567 * AOF rewrite: 2 MB of memory used by copy-on-write
5306:M 03 Apr 09:07:56.645 * Background AOF rewrite terminated with success
5306:M 03 Apr 09:07:56.645 * Residual parent diff successfully flushed to the rewritten AOF (0.00 MB)
5306:M 03 Apr 09:07:56.645 * Background AOF rewrite finished successfully
5306:M 03 Apr 09:12:48.071 * 10 changes in 300 seconds. Saving...
5306:M 03 Apr 09:12:48.081 * Background saving started by pid 12882
12882:C 03 Apr 09:12:58.381 * DB saved on disk
12882:C 03 Apr 09:12:58.389 * RDB: 5 MB of memory used by copy-on-write
5306:M 03 Apr 09:12:58.403 * Background saving terminated with success
5306:M 03 Apr 10:17:33.005 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
5306:M 03 Apr 10:22:42.042 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
5306:M 03 Apr 10:27:51.039 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
5306:M 03 Apr 11:10:10.606 * Marking node a523100ddfbf844c6d1cc7e0b6a4b3a2aa970aba as failing (quorum reached).
5306:M 03 Apr 11:10:10.607 # Cluster state changed: fail
5306:M 03 Apr 11:10:10.608 * FAIL message received from 83f6a9589aa1bce8932a367894fa391edd0ce269 about ac630108d1556786a4df74945cfe35db981d15fa
5306:M 03 Apr 11:10:11.594 # Failover auth granted to 6b8d49e9db288b13071559c667e95e3691ce8bd0 for epoch 7
HOST: localhost PORT: 6379
machine master slave
10.2.1.233 0.00 2.00
10.2.1.46 2.00 0.00
10.2.1.202 1.00 1.00
MASTER SLAVE INFO
hashCode master slave hashSlot
81ae2d757f57f36fa1df6e930af3b072084ba3e8 10.2.1.202:6379 10.2.1.233:6380, 10923-16383
6b8d49e9db288b13071559c667e95e3691ce8bd0 10.2.1.46:6380 10.2.1.233:6379, 0-5460
83f6a9589aa1bce8932a367894fa391edd0ce269 10.2.1.46:6379 10.2.1.202:6380, 5461-10922
6b8d49e9db288b13071559c667e95e3691ce8bd0 10.2.1.46:6380 master - 0 1493981044497 12 connected 0-5460
81ae2d757f57f36fa1df6e930af3b072084ba3e8 10.2.1.202:6379 master - 0 1493981045500 3 connected 10923-16383
ce62a26102ef54f43fa7cca64d24eab45cf42a61 10.2.1.202:6380 slave 83f6a9589aa1bce8932a367894fa391edd0ce269 0 1493981043495 10 connected
ac630108d1556786a4df74945cfe35db981d15fa 10.2.1.233:6380 slave 81ae2d757f57f36fa1df6e930af3b072084ba3e8 0 1493981042492 11 connected
83f6a9589aa1bce8932a367894fa391edd0ce269 10.2.1.46:6379 master - 0 1493981044497 2 connected 5461-10922
a523100ddfbf844c6d1cc7e0b6a4b3a2aa970aba 10.2.1.233:6379 myself,slave 6b8d49e9db288b13071559c667e95e3691ce8bd0 0 0 1 connected

Failed opening .rdb for saving: Permission denied - started after a while of running successfully

I have had a Node web service running successfully on an AWS Ubuntu server for over a month, with the requests cached using Redis.
Yesterday I started getting the following error from some of my routes:
MISCONF Redis is configured to save RDB snapshots, but is currently not able to persist on disk. Commands that may modify the data set are disabled. Please check Redis logs for details about the error.
I was able to stop the error occurring by using:
config set stop-writes-on-bgsave-error no
as suggested in the answers to this question, but it doesn't actually solve the underlying problem.
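Once the underlying save problem is fixed, the guard can be turned back on and a save verified by hand; a minimal sketch using redis-cli on the default port:

redis-cli CONFIG SET stop-writes-on-bgsave-error yes
redis-cli BGSAVE
redis-cli LASTSAVE    # the timestamp should advance once the background save succeeds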
To find the underlying problem I checked the logs and found the following had started happening:
[1105] 09 Aug 13:17:14.800 - 0 clients connected (0 slaves), 797680 bytes in use
[1105] 09 Aug 13:17:15.101 * 1 changes in 900 seconds. Saving...
[1105] 09 Aug 13:17:15.101 * Background saving started by pid 28090
[28090] 09 Aug 13:17:15.101 # Failed opening .rdb for saving: Permission denied
[1105] 09 Aug 13:17:15.201 # Background saving error
Over the weekend no one had been using the server, but before the weekend the logs were fine, and we were getting no errors:
[12521] 06 Aug 04:49:27.308 - 0 clients connected (0 slaves), 803352 bytes in use
[12521] 06 Aug 04:49:29.012 * 1 changes in 900 seconds. Saving...
[12521] 06 Aug 04:49:29.012 * Background saving started by pid 26663
[26663] 06 Aug 04:49:29.014 * DB saved on disk
[26663] 06 Aug 04:49:29.014 * RDB: 2 MB of memory used by copy-on-write
[12521] 06 Aug 04:49:29.112 * Background saving terminated with success
As I said, no one has touched this server in the intervening time.
Looking around for people having the same problem I found this question. I checked the ownership and permissions on the directory and db file as suggested in the answers there:
drwxr-xr-x 2 redis redis 26 Aug 6 06:55 redis
-rw-r--r-- 1 redis redis 18 Aug 6 06:55 dump-6379.rdb
The permissions and ownership both look OK to me, but I have noticed that the date on the file and folder falls between the last time I saw the service working and the first time it failed. Unfortunately, that hasn't really helped me figure out what to do next, and I am at a bit of a loss.
I am looking for suggestions for next steps to find the cause of the problem, or at least a way of making redis able to write again.
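For what it's worth, these are the checks I plan to run next; a sketch assuming a default Ubuntu layout (the /var/lib/redis path is an assumption, and the real directory should come from CONFIG GET dir):

redis-cli CONFIG GET dir                        # the directory Redis actually saves the RDB into
redis-cli CONFIG GET dbfilename
sudo -u redis touch /var/lib/redis/write-test   # can the redis user still write there?
df -h /var/lib/redis                            # is the filesystem full?
mount | grep '(ro'                              # has a filesystem been remounted read-only?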