Diagnosing unexpected redis-server failure - redis

One of my redis servers is repeatedly going down today without any overt, diagnosable cause. My users all end up getting Error 111 connecting to unix socket: /var/run/redis/redis2.sock. Connection refused errors.
Looking into the logs at /var/log/redis, the last few lines capture nothing more nefarious than a scheduled backup:
[8248] 09 Mar 07:48:17.090 * 10 changes in 21600 seconds. Saving...
[8248] 09 Mar 07:48:17.374 * Background saving started by pid 47613
[47613] 09 Mar 07:51:02.257 * DB saved on disk
[47613] 09 Mar 07:51:02.486 * RDB: 526 MB of memory used by copy-on-write
[8248] 09 Mar 07:51:02.920 * Background saving terminated with success
The pid file still exists too. Which implies the server wasn't formally shut down, and redis was still daemonized?
I logged into my system and did sudo service redis-server restart twice to get it up and running. Apart from these logs, how else can I diagnose what might have gone wrong?
Update: I noticed that at the time of the first crash, disk swapping started taking place. This hasn't happened before. Moreover, cat /proc/sys/vm/swappiness confirms swappiness is set to 2.
free -m shows (after normal operation):
total used free shared buffers cached
Mem: 28136 27015 1120 305 80 6586
-/+ buffers/cache: 20349 7787
Swap: 1023 991 32
free -m shows (after the redis server goes down):
total used free shared buffers cached
Mem: 28136 8770 19365 305 60 441
-/+ buffers/cache: 8268 19868
Swap: 1023 1022 1

This sounds like the work of the OS' OOM killer - you can verify/discredit the hypothesis by reviewing the /var/log/syslog.
In this case, the persistence job's overhead triggered the killer. You need to provision for that by setting maxmemory and allocating enough RAM to accommodate persistence's requirements, including COW.
Note that free isn't useful after the fact - you need to monitor your resources continuously.
As for swap, if you don't care about latency then you can certainly do that.

Related

Apache Intermittant Hang is it Network Lag?

I have an intermittent lag on the web applications I am serving from Apache on a Debian box. Apache and MySQL check out. I am far from fully utilizing the box CPU/Memory. Still there is an intermittent lag. My theory is there is a network rate limit needing to be tweaked. Stats below.
Apache Server Status
Current Time: Tuesday, 02-Jun-2020 14:36:53 EDT
Restart Time: Monday, 01-Jun-2020 01:00:03 EDT
Parent Server Config. Generation: 1
Parent Server MPM Generation: 0
Server uptime: 1 day 13 hours 36 minutes 50 seconds
Server load: 2.95 3.23 3.09
Total accesses: 1213060 - Total Traffic: 22.0 GB - Total Duration: 32311929295
CPU Usage: u396.94 s164.31 cu2065.15 cs789.27 - 2.52% CPU load
8.96 requests/sec - 170.5 kB/second - 19.0 kB/request - 26636.7 ms/request
296 requests currently being processed, 66 idle workers
WR.WWWW.KWW_W._W_KWWWWWWKWWWWW_WWWWK_WK_WWW_WW_RWWWWWKCWWWWWW._W
_WW_R_W_.__K_WWWW__WWWWWWKKWWWWWWKWWWW_W____WWWWWWWW_WWW_KWWWWWW
WWWWWWWW_.WWWWWK_WWW_WWKWWWWWWKWWKWK_WWWWWRKWWW.WW_KKWKWWWKW_WWW
WW.W_.K._WWWK_WW_K_K._WW..WWWWWWW_.W_WWWW_W_W.W_WWWW_.WWKWK_WKWW
_W_WWWW_W.WWWWWW.WWWW_K__..W.WW_WWWWWWWWKRW_WWW_C.W_KW_WWW_KW.._
..WWWWWWWCWWW.WWW_WKKWWWW_._WWW.....WWW.W_W.W._.KW...W...WWW.WWW
W..W..K..WW_.W._................W..._W.W.....K.W.K_...R..K...W.W
...W..W.............................................
top
top - 14:31:14 up 79 days, 21:39, 3 users, load average: 2.26, 2.57, 2.86
Tasks: 717 total, 1 running, 716 sleeping, 0 stopped, 0 zombie
%Cpu(s): 3.3 us, 0.7 sy, 0.2 ni, 95.7 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st
MiB Mem : 64365.1 total, 539.8 free, 8847.0 used, 54978.4 buff/cache
MiB Swap: 65477.0 total, 63810.0 free, 1667.0 used. 54580.5 avail Mem
ss -s
Total: 1934
TCP: 2362 (estab 1233, closed 1105, orphaned 2, timewait 1104)
Transport Total IP IPv6
RAW 0 0 0
UDP 0 0 0
TCP 1257 430 827
INET 1257 430 827
FRAG 0 0 0
ulimit -n
1024
ss -ntu | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -n
1 Local
6 192.XXX.XXX.XXX
100 127.0.0.1
340 10.0.0.XX
866 [
ss -ntu | awk '{print $6}' | cut -d: -f1 | sort | uniq -c | sort -n
..........
lists # of ip connections. Besides 127.0.0.1 and [ there are 2 ips over 50.
74 104.xxx.xxx.xxx
91 12.xxx.xxx.xxx
MySQL
No processes running more than a second. Number of processes well within limits.
I do not know what stats would be relevant beyond these in diagnosing network rate limiting issues. Any pointers would be appreciated.
EDITED
CPU
lscpu https://pastebin.com/Jha6F7J8
Apache Config
apachectl -t -D DUMP_RUN_CFG https://pastebin.com/i1L2hnjH
Mysql
SHOW GLOBAL STATUS https://pastebin.com/aQX4D01k
SHOW GLOBAL VARIABLES https://pastebin.com/L8EfmHfn
SHOW FULL PROCESSLIST https://pastebin.com/GtqK2tET
mysqltuner https://pastebin.com/GLhhKA9q
Optional Very Helpful Information
top -bn1 https://pastebin.com/r94vpXe6
iostat -xm 5 3 https://pastebin.com/R8YLK3QU
ulimit -a https://pastebin.com/KUC3wqxU
Dorothy, Your system is very busy with activity. Not knowing the frequency and duration of the intermittent hangs puts us at a disadvantage. One possible cause is com_drop_table had 3,318 uses in your 83 days of uptime. Another possible cause is volume of data read and written. It appears innodb_data_written was 484TB in 83 days and yet MySQLTuner reports only 800K of data in 10 tables. Our General Log Analysis could likely identify the cause of this high activity. These suggestions will be a starting effort, more analysis and changes should be accomplished.
From your OS command prompt,
ulimit -n 96000 would enable many more Open Files (handles) above today's 1024 limit.
This is a dynamic operation in Linux and does not require OS restart to be implemented.
For this change to persist across OS stop/start the following URL could be used as a guide.
Please use 96000, not 500000 - as in their example documentation.
https://glassonionblog.wordpress.com/2013/01/27/increase-ulimit-and-file-descriptors-limit/
Rate Per Second = RPS
Suggestions to consider for your my.cnf [mysqld] section
innodb_io_capacity=1900 # from 200 if you have SSD, 900 if you have magnetic storage to improve IOPS
net_buffer_length=32K # from 16K to reduce malloc operations
innodb_lru_scan_depth=100 # from 1024 to conserve 90% of CPU cycles used for function
key_cache_segments=16 # from 0 to reduce mutex contention with MyISAM opens
key_cache_division_limit=50 # from 100 for Hot/Warm storage to reduce key_page_reads RPS of 18
aria_pagecache_division_limit=50 # from 100 for Hot/Warm storage to reduce aria_pagecache_reads RPS of 5K
read_rnd_buffer_size=64K # from 256K to reduce handler_read_rnd_next RPS of 27,707
These changes should reduce elapsed time to complete most queries.
Additional areas to consider include the use of Slow Query Log analysis to find where an index could avoid a table scan. MySQLTuner reported more than 4 million joins performed without indexes. Our FAQ page includes information on how you could find the tables needing indexes to avoid scans. Let us know how these suggestions work for you.
Skype Talk works very well if you have the flexibility to use that form of communication.

QEMU KVM disk IO/SQL replication issue, on one of two identical clone VM's

I have a system running two QEMU KVM virtual machines, identical clones of one another. Both VM's are replicating from the same Master MySQL DB. One VM (vm-01) is carrying an active load, and is running fine. However, the other (standby) VM (vm-02) suddenly fell behind with replication, at 08:00 this morning, and even though replication is running properly, it keeps falling further behind at a slow rate (1s behind for every 10s of real time). vm-02 has been running perfectly for months to date.
After checking all the usual suspects (CPU load, disk space, SQL query errors etc. etc.) it turns out that everything is just fine... except for the virtual disk IO - specifically the write requests (WRRQ). On the host machine:
virt-top 16:01:35 - x86_64 16/16CPU 1596MHz 128915MB
3 domains, 2 active, 2 running, 0 sleeping, 0 paused, 1 inactive D:0 O:0 X:0
CPU: 1.8% Mem: 32768 MB (32768 MB by guests)
ID S RDRQ WRRQ RXBY TXBY %CPU %MEM TIME NAME
3 R 3 1 113K 20K 1.3 12.0 62d21:21 vm-01-ubuntu
9 R 0 563 97K 11K 0.5 12.0 83:09:51 vm-02-ubuntu
- (vm-Clone-ubuntu)
Both VM's have bin-logs disabled, so they only write the relay-bin-log. The active machine (vm-01-ubuntu) is running thousands of radius requests just fine, in addition to the exact same master SQL commands... and it is happily running with a few write requests. But the standby machine falls behind, with hundreds of write requests... perhaps related to replication catching-up... but so slowly?
Checking disk IO on the VM's:
vm-01:~# iostat -x
Linux 4.4.0-141-generic (vm-finrad01) 18/09/2019 _i686_ (1 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
12,04 0,02 9,85 13,87 0,13 64,09
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
vda 0,00 13,91 0,91 147,67 5,20 16,05 0,29 0,11 0,72 0,57 0,73 0,04 0,65
vm-02:~# iostat -x
Linux 4.4.0-141-generic (vm-finrad02) 18/09/2019 _i686_ (1 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
0,26 0,01 0,25 6,46 0,09 92,93
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
vda 0,00 1,22 0,00 34,19 0,20 21,43 1,26 0,00 0,14 0,96 0,14 0,03 0,09
Doesn't yield any glaring issues, especially since the busier VM (vm-01) is doing more as expected.
The host machine has 128Gb of RAM, tons of SSD drive space, and is only running at 30% CPU usage. There are no RAID or drive issues.
Any suggestions on where to check next, given that the WRRQ count is the only evidence to date of vm-02 falling behind. Or am I chasing a red herring?
The issue is related to the guest OS, not the VM setup.
On Ubuntu the apt auto-update feature is quite aggressive, and in the case of the two suspect VM's, apt was attempting to constantly update the repos, writing at 16mB/s constantly. This is probably related to the fact that the Guest OS is Ubuntu 14.04, and the repos are no longer maintained.
The solution was to disable auto-updates, and rather run updates manually.
As root:
service unattended-upgrades stop
echo manual | tee /etc/init/unattended-upgrades.override
Then, edit apt configs to disable packages auto-refresh. Replace "APT::Periodic::Update-Package-Lists "1";" with "0":
cd /etc/apt/apt.conf.d/
cp 10periodic 10periodic.original
cat 10periodic | awk -F" " '$1=="APT::Periodic::Update-Package-Lists" {printf "%s %s\n",$1,"\"0\";"; next}1' > 10periodic
And lastly, disable the repos from the auto-upgrade list:
nano /etc/apt/apt.conf.d/50unattended-upgrades
Find section "Unattended-Upgrade::Allowed-Origins" and comment out the lines:
//"${distro_id}:${distro_codename}-security";
//"${distro_id}ESM:${distro_codename}";
I then rebooted the VM, and all has been well.

Redis – Failed opening .rdb for saving: Permission denied

I am using redis version 3.0.6. The redis-server process is being run by the redis user.
Suddenly from 5 days after 24 hours redis began failing "opening .rdb for saving." It was working properly before this.
As you can see in the snippet from the logs below, Redis was behaving normally, and then started failing. Power-cycling the server later resolved the issue.
1427:M 24 May 01:09:05.102 * Background saving started by pid 2493
2493:C 24 May 01:09:34.916 * DB saved on disk
2493:C 24 May 01:09:34.917 * RDB: 310 MB of memory used by copy-on-write
1427:M 24 May 01:09:34.950 * Background saving terminated with success
1427:M 24 May 01:14:35.026 * 10 changes in 300 seconds. Saving...
1427:M 24 May 01:14:35.036 * Background saving started by pid 2494
2494:C 24 May 01:15:04.329 * DB saved on disk
2494:C 24 May 01:15:04.330 * RDB: 298 MB of memory used by copy-on-write
1427:M 24 May 01:15:04.408 * Background saving terminated with success
1427:M 24 May 01:20:05.008 * 10 changes in 300 seconds. Saving...
1427:M 24 May 01:20:05.018 * Background saving started by pid 2499
2499:C 24 May 01:20:33.830 * DB saved on disk
2499:C 24 May 01:20:33.831 * RDB: 330 MB of memory used by copy-on-write
1427:M 24 May 01:20:33.843 * Background saving terminated with success
1427:M 24 May 01:23:46.966 # Failed opening .rdb for saving: Read-only file system
1427:M 24 May 01:25:34.029 * 10 changes in 300 seconds. Saving...
1427:M 24 May 01:25:34.038 * Background saving started by pid 2500
2500:C 24 May 01:25:34.038 # Failed opening .rdb for saving: Read-only file system
1427:M 24 May 01:25:34.139 # Background saving error
1427:M 24 May 01:25:40.059 * 10 changes in 300 seconds. Saving...
1427:M 24 May 01:25:40.064 * Background saving started by pid 2501
2501:C 24 May 01:25:40.064 # Failed opening .rdb for saving: Read-only file system
1427:M 24 May 01:25:40.165 # Background saving error
1427:M 24 May 01:25:46.080 * 10 changes in 300 seconds. Saving...
1427:M 24 May 01:25:46.085 * Background saving started by pid 2502
2502:C 24 May 01:25:46.085 # Failed opening .rdb for saving: Read-only file system
1427:M 24 May 01:25:46.186 # Background saving error
1427:M 24 May 01:25:52.100 * 10 changes in 300 seconds. Saving...
1427:M 24 May 01:25:52.105 * Background saving started by pid 2503
2503:C 24 May 01:25:52.105 # Failed opening .rdb for saving: Read-only file system
1427:M 24 May 01:25:52.206 # Background saving error
So, my question: how could this happen?
Please give me proper solution for this.
The "Read-only file system" I think is the key here. It's possible the device it's trying to write to is mounted incorrectly but since it happened randomly, the system may have forced the filesystem into readonly mode. There's a number of conditions that can trigger the operating system to put the filesystem into a read-only mode. This can mean that the filesystem became corrupt or there was some other filesystem consistency issue. If you're hosting on a cloud provider and the disk is network-backed like EBS in AWS, this can be triggered by a temporary network issue. Sometimes the issues are momentary and either force remounting the partition (or power cycling the server) will fix the issue. Other times it's permanent, but since your server came back up just fine, that would appear to not be the case. But the true fix for this would lie in your hardware setup which wasn't detailed.
This answer is related albeit thin on the "why": Failed opening the RDB file ... Read-only file system
After an upgrade.. (Ubuntu 14.04 LTS)
I had redis complain of this.. the file system was not RO. It was fine.
kill -9 REDIS-PROCESS # Otherwise it wouldn't die. looping on the error.
Deleted the dump.rdb file that already existed..
Started REDIS again, and the problem appeared to go away. (I only just did it.. so things may come back..)
It looks like it may have been an upgrade issue..
you can check your redis.conf, in this configure file you can find where the dbfilename is,
give the permission 755 'dir' which include dbfilename, it is /var/lib/redis (centos),
and the user and group to 'redis', but it should be 644 for the files in the dir.
restart redis.

What could cause Redis RDB Snapshoting to Stall?

I have a redis install on Ubuntu 14.04, and I seem to have nearly weekly issues with RDB snapshots completing. Redis version is 3.0.4 64 bit.
3838:M 24 Feb 09:46:28.826 * Background saving terminated with success
3838:M 24 Feb 09:47:29.088 * 100000 changes in 60 seconds. Saving...
3838:M 24 Feb 09:47:29.230 * Background saving started by pid 17281 17281:signal-handler (1456338079) Received SIGTERM scheduling shutdown...
3838:M 24 Feb 13:24:19.358 # Background saving terminated by signal 9
3838:M 24 Feb 13:24:19.622 * 10 changes in 900 seconds. Saving...
3838:M 24 Feb 13:24:19.730 * Background saving started by pid 17477
What you see there is that at 9:47am the background save started, but when I found it at 1:24pm it appeared to be completely stalled. I found the forked process to have basically no activity - the amount of memory it was consuming wasn't increasing. I tried to "kill" the child process, but it never actually quit, so i had to kill it with extreme prejudice (-9).
When things are getting bad, I get the following errors in my app:
2016-02-24 13:11:12,046 [2344] ERROR kCollectors.Main - Error while adding to Redis: No connection is available to service this operation: SADD ALLCH
My redis config is to do rdb snapshots only (no AOF). The load is modification heavy, with thousands of writes per second.
Currently I'm at the point where no redis background save is succeeding, and the background process becomes so much larger than the regular process that my VM starts swapping. Here's my TOP. 3838 is my redis instance, and 17477 is the background save process (as noted above):
top - 14:06:42 up 118 days, 2:05, 1 user, load average: 1.07, 1.07, 1.13
Tasks: 81 total, 3 running, 78 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.8 us, 1.5 sy, 0.0 ni, 45.8 id, 51.3 wa, 0.0 hi,
0.5 si, 0.0 st
KiB Mem: 8176996 total, 8036792 used, 140204 free, 120 buffers
KiB Swap: 6289404 total, 3968236 used, 2321168 free. 4044 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
36 root 20 0 0 0 0 S 2.3 0.0
288:05.05 kswapd0
3838 rrr 20 0 7791836 3.734g 612 S 2.0
47.9 330:08.65 redis-server
17477 rrr 20 0 7792228 6.606g 364 D 1.0 84.7 0:43.49 redis-server
This is very interesting since I don't remember to ever read of such issues, so to discover the root cause could be very useful.
So here you are reporting a child process that stays a long time active, and even continues to allocate memory. I've no explanation for this if not a data corruption in the process memory, causing the RDB process to find unexpected conditions and looping forever in some way.
A few questions:
Does this happen even if you restart the process? (However please DON'T DO IT if you can avoid restarting and you did not restated yet, otherwise we may no longer understand the root cause).
While the RDB saving is active, do you see the CPU usage to be high and the process running with ps/top?
Could you try to interrupt the process with gdb -p <pid> and obtain a stack trace of the process?
Could you provide Redis INFO output to check version and other configuration things and state?
Could you check free output while this happens?
TLDR: is it possible the system is out of memory and is swapping a lot? So the child process while saving the RDB file visited all the pages and forced everything to be in the Resident Set. The system can't cope with so much I/O so it takes ages to complete the RDB saving.
EDIT: I just noticed you reported memory info:
KiB Mem: 8176996 total, 8036792 used, 140204 free, 120 buffers
So the system is out of memory and is swapping like crazy, and this results in the above behavior. As RDB saving starts, COW will use a lot of additional memory pushing the server on the memory limits.
Thanks.

redis config question?

I am using redis for caching but recently I ran into a problem with the amount of memory used - had to restart my server since all ram had been consumed.
It's not the biggest machine but how should I configure redis to avoid the same problem again?
free -m
total used free shared buffers cached
Mem: 240 222 17 0 6 38
-/+ buffers/cache: 177 62
Swap: 255 46 209
I have changed the following settings:
timeout 60
databases 1
save 300 1
save 60 100
maxmemory 104857600
top
top - 14:15:28 up 1:19, 1 user, load average: 0.00, 0.00, 0.00
Tasks: 49 total, 1 running, 48 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 245956k total, 228420k used, 17536k free, 6916k buffers
Swap: 262136k total, 47628k used, 214508k free, 39540k cached
you can use the "maxmemory" directive in the config file: when this amount of memory is exceeded then Redis will expire earlier keys having already an expire set (the keys that would expire sooner are the first that will be removed).
Unlike memcached, redis is supposed to be a databse; so it won't automatically remove old values to make room for new ones.
You have to explicitly set the expire time for each key/value, and even then you could overflow if you create key/values faster than that.
Use Redis virtual memory in Redis 2.0:
http://antirez.com/post/redis-virtual-memory-story.html