Redis on GKE is running out of disk space - redis

I just installed redis (actually a reinstall and upgrade) on GKE via helm. It was a pretty standard install and nothing too out of the norm. Unfortunately my "redis-master" container logs are showing sync errors over and over again:
Info 2022-02-01 12:58:22.733 MST redis 1:M 01 Feb 2022 19:58:22.733 * Waiting for end of BGSAVE for SYNC
Info 2022-02-01 12:58:22.733 MST redis 8085:C 01 Feb 2022 19:58:22.733 # Write error saving DB on disk: No space left on device
Info 2022-02-01 12:58:22.830 MST redis 1:M 01 Feb 2022 19:58:22.829 # Background saving error
Info 2022-02-01 12:58:22.830 MST redis 1:M 01 Feb 2022 19:58:22.829 # Connection with replica redis-replicas-0.:6379 lost.
Info 2022-02-01 12:58:22.830 MST redis 1:M 01 Feb 2022 19:58:22.829 # SYNC failed. BGSAVE child returned an error
Info 2022-02-01 12:58:22.830 MST redis 1:M 01 Feb 2022 19:58:22.829 # Connection with replica redis-replicas-1.:6379 lost.
Info 2022-02-01 12:58:22.830 MST redis 1:M 01 Feb 2022 19:58:22.829 # SYNC failed. BGSAVE child returned an error
Info 2022-02-01 12:58:22.832 MST redis 1:M 01 Feb 2022 19:58:22.832 * Replica redis-replicas-0.:6379 asks for synchronization
Info 2022-02-01 12:58:22.832 MST redis 1:M 01 Feb 2022 19:58:22.832 * Full resync requested by replica redis-replicas-0.:6379
Info 2022-02-01 12:58:22.832 MST redis 1:M 01 Feb 2022 19:58:22.832 * Starting BGSAVE for SYNC with target: disk
Info 2022-02-01 12:58:22.833 MST redis 1:M 01 Feb 2022 19:58:22.833 * Background saving started by pid 8086
I then looked at my persistent volume claim "redis-data", and it is in the "Pending" phase and never seems to leave it. If I list all my PVCs, though, they are all bound and appear to be healthy.
Clearly something isn't as healthy as it seems, but I am not sure how to diagnose it. Any help would be appreciated.

I know I'm late to the party, but to add more: if anyone gets stuck in the same scenario and can't delete the PVC, they can increase the size of the PVC in GKE instead.
First check that the StorageClass allows volume expansion:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
…
provisioner: kubernetes.io/gce-pd
allowVolumeExpansion: true
Then edit the PVC. The current spec:
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
The field that you need to update in the PVC:
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:          # make sure the change is in the requests section
      storage: 30Gi    # the new, larger size
Once the change to the PVC has been applied and saved, restart the pod.
Sharing the link below: https://medium.com/#harsh.manvar111/resizing-pvc-disk-in-gke-c5b882c90f7b
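The PVC edit above can also be expressed as a patch rather than an interactive edit. Below is a minimal sketch in Python that only builds the patch body; the function name build_pvc_resize_patch is made up for illustration, and the PVC name and target size depend on your own install.

```python
import json

def build_pvc_resize_patch(new_size):
    """Return a JSON patch body that sets spec.resources.requests.storage.

    new_size is a Kubernetes quantity string such as "30Gi". Only a very
    light sanity check is done here; the API server does the real validation.
    """
    if not new_size or not new_size[0].isdigit():
        raise ValueError(f"unexpected size value: {new_size!r}")
    patch = {"spec": {"resources": {"requests": {"storage": new_size}}}}
    return json.dumps(patch)

# 30Gi is the size used in the answer above.
print(build_pvc_resize_patch("30Gi"))
```

The printed string could then be passed to something like kubectl patch pvc <your-pvc-name> -p '<patch>'. Note that a PVC can only be grown this way, not shrunk, and the resize only works if the StorageClass has allowVolumeExpansion: true.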

So I was pretty close on the heels of it. In my case, uninstalling redis didn't remove the PVC (which makes some sense), so when I reinstalled, it tried to use the same PVC.
Unfortunately, that PVC had run out of disk space.
I was able to manually delete the previously existing PVCs (we didn't need to keep the data) and then reinstall redis via helm. At that point it created new PVCs and worked fine.
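For anyone trying to confirm the same diagnosis from exported logs, here is a rough sketch in Python that counts the tell-tale messages. The marker strings are copied from the log excerpt in the question; the label names and both function names are made up for illustration, and the "loop" check is only a heuristic.

```python
from collections import Counter

# Substrings copied from the log excerpt in the question. The labels on the
# right are illustrative names, not Redis terminology.
MARKERS = {
    "No space left on device": "disk_full",
    "Background saving error": "bgsave_failed",
    "SYNC failed": "sync_failed",
    "Full resync requested": "full_resync",
}

def summarize_redis_log(lines):
    """Count how often each known marker appears in the given log lines."""
    counts = Counter()
    for line in lines:
        for needle, label in MARKERS.items():
            if needle in line:
                counts[label] += 1
    return counts

def looks_like_disk_full_loop(counts):
    """Heuristic: a disk-full write error seen together with failed syncs."""
    return counts["disk_full"] > 0 and counts["sync_failed"] > 0

sample = [
    "redis 8085:C 01 Feb 2022 19:58:22.733 # Write error saving DB on disk: No space left on device",
    "redis 1:M 01 Feb 2022 19:58:22.829 # Background saving error",
    "redis 1:M 01 Feb 2022 19:58:22.829 # SYNC failed. BGSAVE child returned an error",
    "redis 1:M 01 Feb 2022 19:58:22.832 * Full resync requested by replica redis-replicas-0.:6379",
]
counts = summarize_redis_log(sample)
print(dict(counts), looks_like_disk_full_loop(counts))
```

If the disk-full marker shows up alongside repeated sync failures, the volume backing the master's data directory is the first thing to check.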

Related

Redis OOM issue

We are using Redis 6.0.0. After an OOM, I see the log below. What is "RDB memory usage"?
3489 MB is very close to the max memory that we have. Does it indicate that we are storing a lot of data in Redis, or is it just being caused by RDB overhead?
1666:M 01 Jun 2022 19:23:32.268 # Server initialized
1666:M 01 Jun 2022 19:23:32.270 * Loading RDB produced by version 6.0.6
1666:M 01 Jun 2022 19:23:32.270 * RDB age 339 seconds
1666:M 01 Jun 2022 19:23:32.270 * RDB memory usage when created 3489.20 Mb
Can we rule out fragmentation? Given that RDB memory usage itself indicated 3489 MB.
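One way to look at the fragmentation question is to compare used_memory with used_memory_rss from the output of INFO memory; their ratio is what Redis reports as mem_fragmentation_ratio. The sketch below, in Python, parses that text output. The numbers in the sample are invented for illustration and are not taken from the server in the question.

```python
def parse_info(text):
    """Parse the 'key:value' lines of Redis INFO output into a dict of strings."""
    fields = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and section headers such as "# Memory"
        key, _, value = line.partition(":")
        fields[key] = value
    return fields

def fragmentation_ratio(info):
    """RSS divided by allocator-level usage; values well above 1 suggest fragmentation."""
    return int(info["used_memory_rss"]) / int(info["used_memory"])

# Invented sample values, for illustration only.
sample_info = """# Memory
used_memory:3658700000
used_memory_rss:4024570000
"""
info = parse_info(sample_info)
print(round(fragmentation_ratio(info), 2))
```

A ratio close to 1 would point towards the data set itself being large rather than fragmentation.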

Redis crashing without any log errors

I'm debugging some weird behavior in my redis: it crashes roughly every two days without showing any errors whatsoever, only this in the logs:
1:C 10 Sep 2020 15:44:14.517 # Configuration loaded
1:M 10 Sep 2020 15:44:14.522 * Running mode=standalone, port=6379.
1:M 10 Sep 2020 15:44:14.522 # Server initialized
1:M 10 Sep 2020 15:44:14.524 * Ready to accept connections
1:C 12 Sep 2020 13:20:23.751 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:C 12 Sep 2020 13:20:23.751 # Redis version=6.0.5, bits=64, commit=00000000, modified=0, pid=1, just started
1:C 12 Sep 2020 13:20:23.751 # Configuration loaded
1:M 12 Sep 2020 13:20:23.757 * Running mode=standalone, port=6379.
1:M 12 Sep 2020 13:20:23.757 # Server initialized
1:M 12 Sep 2020 13:20:23.758 * Ready to accept connections
That's all redis says to me.
I have lots of RAM available, but I have redis running as a single instance in a docker container. Could a lack of processing power cause this? Should I use multiple nodes? I don't want to set up a cluster just to find out the problem was something else. How can I track down the actual cause of the problem?
So, in the end, it was exactly what I thought it was not: a memory leak!
I had 16GB that was slowly being consumed until redis crashed, with no warnings from redis, the operating system, or docker. I fixed the app that caused the leak and the problem was gone.

Redis timeout with almost no data in the database, using the .NET client

I received this error:
StackExchange.Redis.RedisTimeoutException: Timeout performing GET (5000ms),
next: GET RetryCount, inst: 3, qu: 0, qs: 1, aw: False, rs: ReadAsync, ws: Idle, in: 7, in-pipe: 0, out-pipe: 0,
serverEndpoint: redis:6379, mc: 1/1/0, mgr: 10 of 10 available, clientName: 18745af38fec,
IOCP: (Busy=0,Free=1000,Min=1,Max=1000),
WORKER: (Busy=6,Free=32761,Min=1,Max=32767), v: 2.1.58.34321
(Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)
We can see that there is only a single message in the queue (qs=1) and that there are only 7 bytes waiting to be read (in=7). Redis is used by 2 processes; it holds settings for the system and stores logs.
It was a re-install, so no logs had been written yet and the database probably holds 2-3 KB of data :)
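To read the counters in that message without eyeballing them, here is a small sketch in Python that pulls out the queue fields and the thread pool stats. The function names are made up for illustration. The "busy above min" check reflects my reading of the StackExchange.Redis timeouts article linked in the error, which describes a busy count above the pool minimum as a possible sign of thread pool throttling, so treat it as a hint rather than a verdict.

```python
import re

# Numeric counters that appear as "name: value" in the timeout message.
COUNTER_RE = re.compile(r"(?<![\w-])(inst|qu|qs|in|in-pipe|out-pipe): (\d+)")
# Thread pool stats, e.g. "WORKER: (Busy=6,Free=32761,Min=1,Max=32767)".
POOL_RE = re.compile(r"(IOCP|WORKER): \(Busy=(\d+),Free=(\d+),Min=(\d+),Max=(\d+)\)")

def parse_timeout_message(message):
    """Return (counters, pools) extracted from a RedisTimeoutException message."""
    counters = {name: int(value) for name, value in COUNTER_RE.findall(message)}
    pools = {
        name: {"busy": int(busy), "free": int(free), "min": int(mn), "max": int(mx)}
        for name, busy, free, mn, mx in POOL_RE.findall(message)
    }
    return counters, pools

def pools_over_minimum(pools):
    """Names of thread pools whose busy count exceeds the configured minimum."""
    return [name for name, p in pools.items() if p["busy"] > p["min"]]

message = (
    "Timeout performing GET (5000ms), "
    "next: GET RetryCount, inst: 3, qu: 0, qs: 1, aw: False, rs: ReadAsync, ws: Idle, "
    "in: 7, in-pipe: 0, out-pipe: 0, "
    "serverEndpoint: redis:6379, mc: 1/1/0, mgr: 10 of 10 available, clientName: 18745af38fec, "
    "IOCP: (Busy=0,Free=1000,Min=1,Max=1000), "
    "WORKER: (Busy=6,Free=32761,Min=1,Max=32767), v: 2.1.58.34321"
)
counters, pools = parse_timeout_message(message)
print(counters)
print(pools_over_minimum(pools))
```

For the message above, the WORKER pool is the one with more busy threads (6) than its minimum (1), which may be worth looking into on the client side.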
This is the only output from Redis:
1:C 12 Sep 2020 15:20:49.293 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:C 12 Sep 2020 15:20:49.293 # Redis version=6.0.8, bits=64, commit=00000000, modified=0, pid=1, just started
1:C 12 Sep 2020 15:20:49.293 # Configuration loaded
1:M 12 Sep 2020 15:20:49.296 * Running mode=standalone, port=6379.
1:M 12 Sep 2020 15:20:49.296 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
1:M 12 Sep 2020 15:20:49.296 # Server initialized
1:M 12 Sep 2020 15:20:49.296 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
1:M 12 Sep 2020 15:20:49.296 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo madvise > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled (set to 'madvise' or 'never').
1:M 12 Sep 2020 15:20:49.305 * DB loaded from append only file: 0.000 seconds
1:M 12 Sep 2020 15:20:49.305 * Ready to accept connections
So it looks like nothing went wrong on that side.
The 2 processes accessing it are in docker containers, and so is Redis. Everything runs on a single AWS instance with a lot of RAM and disk available.
This is also a one-time event; it has never happened before with the same config.
I'm not very experienced with Redis; is there anything in the error message that looks suspicious?

Kubernetes Redis Cluster PubSub Channels not getting synched on replica

I have set up a Redis cluster on Kubernetes; the cluster state is OK and the replica is connected to the master. According to the logs, the full synchronization has also completed. The logs are as follows:
9:M 22 Oct 12:24:18.209 * Slave 192.168.1.41:6379 asks for synchronization
9:M 22 Oct 12:24:18.209 * Partial resynchronization not accepted: Replication ID mismatch (Slave asked for '794b9c74abe40ac90c752f32a102078e063ff636', my replication IDs are '0f499740a46665d12fab921838297273279ad136' and '0000000000000000000000000000000000000000')
9:M 22 Oct 12:24:18.209 * Starting BGSAVE for SYNC with target: disk
9:M 22 Oct 12:24:18.211 * Background saving started by pid 231
231:C 22 Oct 12:24:18.215 * DB saved on disk
231:C 22 Oct 12:24:18.216 * RDB: 4 MB of memory used by copy-on-write
9:M 22 Oct 12:24:18.224 * Background saving terminated with success
9:M 22 Oct 12:24:18.224 * Synchronization with slave 192.168.1.41:6379 succeeded
Still, when I check the list of PubSub channels on the replica, it does not show the channels, and this breaks the PubSub flow.
Any help/advice is appreciated.

Failed opening .rdb for saving: Permission denied - started after a while of running successfully

I have had a Node web service running successfully on an AWS Ubuntu server for over a month, with the requests cached using redis.
Yesterday I started getting the following error from some of my routes:
MISCONF Redis is configured to save RDB snapshots, but is currently not able to persist on disk. Commands that may modify the data set are disabled. Please check Redis logs for details about the error.
I was able to stop the error occurring by using:
config set stop-writes-on-bgsave-error no
as suggested in the answers to this question, but it doesn't actually solve the underlying problem.
To find the underlying problem I checked the logs and found the following had started happening:
[1105] 09 Aug 13:17:14.800 - 0 clients connected (0 slaves), 797680 bytes in use
[1105] 09 Aug 13:17:15.101 * 1 changes in 900 seconds. Saving...
[1105] 09 Aug 13:17:15.101 * Background saving started by pid 28090
[28090] 09 Aug 13:17:15.101 # Failed opening .rdb for saving: Permission denied
[1105] 09 Aug 13:17:15.201 # Background saving error
Over the weekend no one had been using the server, but before the weekend the logs were fine, and we were getting no errors:
[12521] 06 Aug 04:49:27.308 - 0 clients connected (0 slaves), 803352 bytes in use
[12521] 06 Aug 04:49:29.012 * 1 changes in 900 seconds. Saving...
[12521] 06 Aug 04:49:29.012 * Background saving started by pid 26663
[26663] 06 Aug 04:49:29.014 * DB saved on disk
[26663] 06 Aug 04:49:29.014 * RDB: 2 MB of memory used by copy-on-write
[12521] 06 Aug 04:49:29.112 * Background saving terminated with success
As I said, no one has touched this server in the intervening time.
Looking around for people having the same problem I found this question. I checked the ownership and permissions on the directory and db file as suggested in the answers there:
drwxr-xr-x 2 redis redis 26 Aug 6 06:55 redis
-rw-r--r-- 1 redis redis 18 Aug 6 06:55 dump-6379.rdb
The permissions and ownership both look ok to me, but I have noticed that the date on the file and folder is between the last time I saw the service working and the first time it failed. Unfortunately that hasn't really helped me with what to do next and I am at a bit of a loss.
I am looking for suggestions for next steps to find the cause of the problem, or at least a way of making redis able to write again.
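As a possible next step, it may help to check write access to the location Redis is actually using, which CONFIG GET dir and CONFIG GET dbfilename will report (it may not be the directory listed above). The Python sketch below reports whether a directory and file are writable for the user running the script, so it would need to be run as the redis user to be meaningful. The function name is made up for illustration, and the demo uses a temporary directory rather than a real Redis path.

```python
import os
import tempfile

def describe_write_access(directory, filename):
    """Report whether the current user could write a dump file in directory."""
    path = os.path.join(directory, filename)
    file_exists = os.path.exists(path)
    return {
        "dir_exists": os.path.isdir(directory),
        # Creating a file needs both write and search (execute) permission.
        "dir_writable": os.access(directory, os.W_OK | os.X_OK),
        "file_exists": file_exists,
        "file_writable": os.access(path, os.W_OK) if file_exists else None,
    }

# Demo against a throwaway directory; substitute the values from CONFIG GET.
with tempfile.TemporaryDirectory() as demo_dir:
    report = describe_write_access(demo_dir, "dump-6379.rdb")
    print(report)
```

If dir_writable comes back False for the redis user, that would be consistent with the "Failed opening .rdb for saving: Permission denied" line in the log.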