Are there any infinite loops? - valgrind

I ran the program without valgrind and it finished fine in about a minute.
When I run it through callgrind (valgrind --tool=callgrind), the program never finishes (it has been running for at least six hours). Here is the top output from two samples:
PID USER PR NI %CPU TIME+ %MEM VIRT RES SHR S COMMAND
3722 vhovhann 17 0 75 52:38.95 13.5 10.4g 9.6g 34m R callgrind-amd64
PID USER PR NI %CPU TIME+ %MEM VIRT RES SHR S COMMAND
3722 vhovhann 17 0 100 53:21.40 13.6 10.4g 9.6g 34m R callgrind-amd64
Why does the program not finish under valgrind?

It depends on the program. For instance, if you are using threads, the program might be stuck in a deadlock, since threads behave differently under Valgrind (Valgrind serializes threaded applications).


QEMU KVM disk IO/SQL replication issue, on one of two identical clone VM's

I have a system running two QEMU KVM virtual machines, identical clones of one another. Both VM's are replicating from the same Master MySQL DB. One VM (vm-01) is carrying an active load, and is running fine. However, the other (standby) VM (vm-02) suddenly fell behind with replication, at 08:00 this morning, and even though replication is running properly, it keeps falling further behind at a slow rate (1s behind for every 10s of real time). vm-02 has been running perfectly for months to date.
After checking all the usual suspects (CPU load, disk space, SQL query errors etc. etc.) it turns out that everything is just fine... except for the virtual disk IO - specifically the write requests (WRRQ). On the host machine:
virt-top 16:01:35 - x86_64 16/16CPU 1596MHz 128915MB
3 domains, 2 active, 2 running, 0 sleeping, 0 paused, 1 inactive D:0 O:0 X:0
CPU: 1.8% Mem: 32768 MB (32768 MB by guests)
ID S RDRQ WRRQ RXBY TXBY %CPU %MEM TIME NAME
3 R 3 1 113K 20K 1.3 12.0 62d21:21 vm-01-ubuntu
9 R 0 563 97K 11K 0.5 12.0 83:09:51 vm-02-ubuntu
- (vm-Clone-ubuntu)
Both VM's have bin-logs disabled, so they only write the relay-bin-log. The active machine (vm-01-ubuntu) is running thousands of radius requests just fine, in addition to the exact same master SQL commands... and it is happily running with a few write requests. But the standby machine falls behind, with hundreds of write requests... perhaps related to replication catching-up... but so slowly?
Checking disk IO on the VM's:
vm-01:~# iostat -x
Linux 4.4.0-141-generic (vm-finrad01) 18/09/2019 _i686_ (1 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
12,04 0,02 9,85 13,87 0,13 64,09
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
vda 0,00 13,91 0,91 147,67 5,20 16,05 0,29 0,11 0,72 0,57 0,73 0,04 0,65
vm-02:~# iostat -x
Linux 4.4.0-141-generic (vm-finrad02) 18/09/2019 _i686_ (1 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
0,26 0,01 0,25 6,46 0,09 92,93
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
vda 0,00 1,22 0,00 34,19 0,20 21,43 1,26 0,00 0,14 0,96 0,14 0,03 0,09
This doesn't reveal any glaring issues, especially since the busier VM (vm-01) is doing more, as expected.
The host machine has 128 GB of RAM, tons of SSD drive space, and is only running at 30% CPU usage. There are no RAID or drive issues.
Any suggestions on where to check next, given that the WRRQ count is the only evidence to date of vm-02 falling behind? Or am I chasing a red herring?
The issue is related to the guest OS, not the VM setup.
On Ubuntu the apt auto-update feature is quite aggressive, and on the two suspect VMs apt was constantly trying to update the repos, writing at 16 MB/s the whole time. This is probably because the guest OS is Ubuntu 14.04, whose repos are no longer maintained.
The solution was to disable auto-updates, and rather run updates manually.
As root:
service unattended-upgrades stop
echo manual | tee /etc/init/unattended-upgrades.override
Then, edit apt configs to disable packages auto-refresh. Replace "APT::Periodic::Update-Package-Lists "1";" with "0":
cd /etc/apt/apt.conf.d/
cp 10periodic 10periodic.original
awk -F" " '$1=="APT::Periodic::Update-Package-Lists" {printf "%s %s\n",$1,"\"0\";"; next}1' 10periodic.original > 10periodic
And lastly, disable the repos from the auto-upgrade list:
nano /etc/apt/apt.conf.d/50unattended-upgrades
Find section "Unattended-Upgrade::Allowed-Origins" and comment out the lines:
//"${distro_id}:${distro_codename}-security";
//"${distro_id}ESM:${distro_codename}";
I then rebooted the VM, and all has been well.

Use ssh into machine and get back load info

I have ssh access to a list of ~20 machines. I need to find the load status for all of them in a list. The program 'top' does a good job giving info on the machine status in its header.
Example:
top - 13:29:53 up 107 days, 20:13, 47 users, load average: 3.80, 3.74, 3.62
Tasks: 794 total, 2 running, 787 sleeping, 3 stopped, 2 zombie
Cpu(s): 2.6%us, 0.8%sy, 0.0%ni, 84.7%id, 11.9%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 99055876k total, 47947572k used, 51108304k free, 697684k buffers
Swap: 26148860k total, 17145136k used, 9003724k free, 35844820k cached
Today I manually ssh into each machine, run 'top', copy the data and store it. I was wondering if this task can be automated. I found out that ssh can take a unix command as an argument to be executed on the remote machine. But how do I capture the output from 'top'? Or is there a batch tool giving the same header output? It would be great to have just one script that builds the table for me.
Thanks,
Gert
For Ubuntu:
[12:15 AM] borlaze@mac: /tmp $ ssh USER@HOST 'top -b -n 1 | head -n 5' >123.txt
[12:15 AM] borlaze@mac: /tmp $ cat 123.txt
top - 00:16:06 up 35 days, 10:58, 1 user, load average: 0,34, 0,36, 0,29
Tasks: 277 total, 1 running, 274 sleeping, 0 stopped, 2 zombie
%Cpu(s): 7,1 us, 5,7 sy, 0,0 ni, 87,0 id, 0,1 wa, 0,0 hi, 0,0 si, 0,0 st
KiB Mem : 24671340 total, 1066056 free, 12822724 used, 10782560 buff/cache
KiB Swap: 16756732 total, 16094308 free, 662424 used. 11208916 avail Mem
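To cover the whole list of machines, the same command can be wrapped in a loop. This is only a sketch under some assumptions: the hosts live one per line in a file you pass in, key-based login already works, and the names `collect_load` and `SSH_CMD` are made up for this example.

```shell
#!/bin/sh
# Sketch: collect the `top` header from several hosts over ssh.
# The host-file layout and the function name are illustrative assumptions.

# Command used to reach a host; override SSH_CMD to try this without a network.
SSH_CMD=${SSH_CMD:-ssh}

# collect_load FILE: for each host listed in FILE (one per line), print a
# separator line followed by the first five lines of batch-mode top.
collect_load() {
    local hosts_file=$1 host
    while IFS= read -r host; do
        # Skip blank lines and comment lines.
        case $host in ''|'#'*) continue ;; esac
        printf '=== %s ===\n' "$host"
        # -n keeps ssh from swallowing the rest of the host list on stdin;
        # BatchMode fails fast instead of prompting for a password.
        $SSH_CMD -n -o BatchMode=yes -o ConnectTimeout=5 "$host" 'top -b -n 1 | head -n 5' \
            || printf 'unreachable\n'
    done < "$hosts_file"
}
```

Something like `collect_load hosts.txt > load-report.txt` would then give one file with a block per machine. If only the load averages are needed, replacing the remote command with `uptime` should give a one-line summary per host.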

Erlang VM killed when creating millions of processes

So, after Joe Armstrong's claims that Erlang processes are cheap and that the VM can handle millions of them, I decided to test it on my machine:
process_galore(N) ->
    io:format("process limit: ~p~n", [erlang:system_info(process_limit)]),
    %% Reset the runtime and wall-clock counters before spawning.
    statistics(runtime),
    statistics(wall_clock),
    L = for(0, N, fun() -> spawn(fun() -> wait() end) end),
    {_, Rt} = statistics(runtime),
    {_, Wt} = statistics(wall_clock),
    lists:foreach(fun(Pid) -> Pid ! die end, L),
    io:format("Processes created: ~p~n"
              "Run time ms: ~p~n"
              "Wall time ms: ~p~n"
              "Average run time: ~p microseconds!~n",
              [N, Rt, Wt, (Rt / N) * 1000]).

%% Each spawned process blocks here until it is told to die.
wait() ->
    receive
        die -> done
    end.

%% Call Fun N times and collect the results in a list.
for(N, N, _) ->
    [];
for(I, N, Fun) when I < N ->
    [Fun() | for(I + 1, N, Fun)].
The results are impressive for a million processes: I get an average spawn time of approx. 6.6 microseconds(!). But when starting 3 million processes, the OS shell prints "Killed" and the Erlang runtime is gone.
I run erl with the +P 5000000 flag. The system is Arch Linux with a quad-core i7 and 8 GB of RAM.
Erlang processes are cheap, but they're not free. Erlang processes spawned by spawn use 338 words of memory, which is 2704 bytes on a 64 bit system. Spawning 3 million processes will use at least 8112 MB of RAM, not counting the overhead of creating the linked list of pids and the anonymous function created for each process (I'm not sure if they're shared if they're created like you're creating.) You'll probably need 10-12GB of free RAM to spawn and keep alive 3 million (almost) empty processes.
As I pointed out in the comments (and you later verified), the "Killed" message was printed by the Linux Kernel when it killed the Erlang VM, most likely for using up too much RAM. More information here.
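To make the arithmetic above easy to rerun with other process counts, here is a quick shell sketch. The 338-word figure is the one quoted in this answer, the 8-byte word size assumes a 64-bit VM, and the variable names are just for illustration.

```shell
#!/bin/sh
# Back-of-the-envelope memory estimate for N idle Erlang processes.
# 338 words per process is the figure quoted above; 8 bytes per word
# assumes a 64-bit VM.
words_per_process=338
bytes_per_word=8
processes=3000000

bytes_per_process=$((words_per_process * bytes_per_word))
total_bytes=$((bytes_per_process * processes))

echo "$bytes_per_process bytes per process"        # 2704
echo "$((total_bytes / 1000000)) MB total"         # 8112 (decimal megabytes)
echo "$((total_bytes / 1024 / 1024)) MiB total"    # 7736 (binary mebibytes)
```

Either way you count it, that is roughly all of an 8 GB machine before the pid list, the closures, and the rest of the system are accounted for, which fits with the kernel killing the VM.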

What could cause Redis RDB Snapshotting to Stall?

I have a redis install on Ubuntu 14.04, and I seem to have nearly weekly issues with RDB snapshots completing. Redis version is 3.0.4 64 bit.
3838:M 24 Feb 09:46:28.826 * Background saving terminated with success
3838:M 24 Feb 09:47:29.088 * 100000 changes in 60 seconds. Saving...
3838:M 24 Feb 09:47:29.230 * Background saving started by pid 17281
17281:signal-handler (1456338079) Received SIGTERM scheduling shutdown...
3838:M 24 Feb 13:24:19.358 # Background saving terminated by signal 9
3838:M 24 Feb 13:24:19.622 * 10 changes in 900 seconds. Saving...
3838:M 24 Feb 13:24:19.730 * Background saving started by pid 17477
What you see there is that the background save started at 9:47am, but when I found it at 1:24pm it appeared to be completely stalled. The forked process showed basically no activity, and the amount of memory it was consuming wasn't increasing. I tried to "kill" the child process, but it never actually quit, so I had to kill it with extreme prejudice (-9).
When things are getting bad, I get the following errors in my app:
2016-02-24 13:11:12,046 [2344] ERROR kCollectors.Main - Error while adding to Redis: No connection is available to service this operation: SADD ALLCH
My redis config is to do rdb snapshots only (no AOF). The load is modification heavy, with thousands of writes per second.
Currently I'm at the point where no redis background save is succeeding, and the background process becomes so much larger than the regular process that my VM starts swapping. Here's my TOP. 3838 is my redis instance, and 17477 is the background save process (as noted above):
top - 14:06:42 up 118 days, 2:05, 1 user, load average: 1.07, 1.07, 1.13
Tasks: 81 total, 3 running, 78 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.8 us, 1.5 sy, 0.0 ni, 45.8 id, 51.3 wa, 0.0 hi, 0.5 si, 0.0 st
KiB Mem: 8176996 total, 8036792 used, 140204 free, 120 buffers
KiB Swap: 6289404 total, 3968236 used, 2321168 free. 4044 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
36 root 20 0 0 0 0 S 2.3 0.0 288:05.05 kswapd0
3838 rrr 20 0 7791836 3.734g 612 S 2.0 47.9 330:08.65 redis-server
17477 rrr 20 0 7792228 6.606g 364 D 1.0 84.7 0:43.49 redis-server
This is very interesting, since I don't remember ever reading about such issues, so discovering the root cause could be very useful.
So here you are reporting a child process that stays active for a long time and even continues to allocate memory. The only explanation I have for this is data corruption in the process memory, causing the RDB process to hit unexpected conditions and loop forever in some way.
A few questions:
Does this happen even if you restart the process? (However, please DON'T DO IT if you can avoid restarting and have not restarted yet, otherwise we may no longer be able to understand the root cause.)
While the RDB saving is active, do you see high CPU usage, and is the process shown as running in ps/top?
Could you try to interrupt the process with gdb -p <pid> and obtain a stack trace of the process?
Could you provide Redis INFO output to check version and other configuration things and state?
Could you check free output while this happens?
TLDR: is it possible the system is out of memory and is swapping a lot? While saving the RDB file, the child process visits all the pages and forces everything into the resident set. The system can't cope with that much I/O, so the RDB save takes ages to complete.
EDIT: I just noticed you reported memory info:
KiB Mem: 8176996 total, 8036792 used, 140204 free, 120 buffers
So the system is out of memory and is swapping like crazy, and this results in the behavior above. Once RDB saving starts, copy-on-write (COW) uses a lot of additional memory, pushing the server up against its memory limit.
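If you want to keep an eye on this while a background save is running, a small helper along these lines could be used. It is only a sketch: `swap_used_kb` is a name made up here, and it assumes the usual `SwapTotal:` and `SwapFree:` lines of /proc/meminfo on Linux.

```shell
#!/bin/sh
# swap_used_kb: read /proc/meminfo-style text on stdin and print the
# amount of swap currently in use, in KiB (SwapTotal minus SwapFree).
swap_used_kb() {
    awk '$1 == "SwapTotal:" { total = $2 }
         $1 == "SwapFree:"  { free = $2 }
         END { print total - free }'
}
```

Running `swap_used_kb < /proc/meminfo` every few seconds during a save should show whether swap usage climbs as the child touches its pages. With the numbers from your top output (6289404 total, 2321168 free) it would report 3968236.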
Thanks.

mount and loop0 preempting user processes for long time

I created some processes in user space and tried to visualize how they run in kernelshark, using a trace recorded with trace-cmd. But processes like the ones shown below preempt my processes, which have real-time priority 98, for as long as 4 seconds. These kinds of processes don't even have a real-time priority:
PID PPID CLS PRI RTPRIO CMD
359 1 TS 19 - mount.ntfs /dev/disk/by-uuid/AC72763B72760A7A /root
366 2 TS 39 - [loop0]
What can I do to run my processes uninterrupted by these kinds of kernel processes?