How to collect hardware event counts with perf stat on KVM host/guest?


On a 6.4 host (kernel 2.6.32-358) on Sandy Bridge, I am trying to collect hardware event counts for guest activity with perf stat. Although virt-top reports healthy activity in the guests, I get the following with the ":G" modifier:
# perf stat -e cycles:G sleep 10
Performance counter stats for 'sleep 10':
0 cycles:G # 0.000 GHz
I tried collecting inside the guest, but get the following:
# perf stat -e cycles -A -a sleep 1
Performance counter stats for 'sleep 1':
CPU0 <not supported> cycles
I see that there is a perf kvm, but this only has top/record/report and seems to be intended for profiling an application using sampling, not collecting hardware counts.
On the host, how do I get perf stat to count the guest activity; and on the guest, what is needed to expose hardware event counting to perf stat?

Related

Apache Intermittent Hang: Is It Network Lag?

I have an intermittent lag on the web applications I am serving from Apache on a Debian box. Apache and MySQL check out. I am far from fully utilizing the box CPU/Memory. Still there is an intermittent lag. My theory is there is a network rate limit needing to be tweaked. Stats below.
Apache Server Status
Current Time: Tuesday, 02-Jun-2020 14:36:53 EDT
Restart Time: Monday, 01-Jun-2020 01:00:03 EDT
Parent Server Config. Generation: 1
Parent Server MPM Generation: 0
Server uptime: 1 day 13 hours 36 minutes 50 seconds
Server load: 2.95 3.23 3.09
Total accesses: 1213060 - Total Traffic: 22.0 GB - Total Duration: 32311929295
CPU Usage: u396.94 s164.31 cu2065.15 cs789.27 - 2.52% CPU load
8.96 requests/sec - 170.5 kB/second - 19.0 kB/request - 26636.7 ms/request
296 requests currently being processed, 66 idle workers
WR.WWWW.KWW_W._W_KWWWWWWKWWWWW_WWWWK_WK_WWW_WW_RWWWWWKCWWWWWW._W
_WW_R_W_.__K_WWWW__WWWWWWKKWWWWWWKWWWW_W____WWWWWWWW_WWW_KWWWWWW
WWWWWWWW_.WWWWWK_WWW_WWKWWWWWWKWWKWK_WWWWWRKWWW.WW_KKWKWWWKW_WWW
WW.W_.K._WWWK_WW_K_K._WW..WWWWWWW_.W_WWWW_W_W.W_WWWW_.WWKWK_WKWW
_W_WWWW_W.WWWWWW.WWWW_K__..W.WW_WWWWWWWWKRW_WWW_C.W_KW_WWW_KW.._
..WWWWWWWCWWW.WWW_WKKWWWW_._WWW.....WWW.W_W.W._.KW...W...WWW.WWW
W..W..K..WW_.W._................W..._W.W.....K.W.K_...R..K...W.W
...W..W.............................................
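As a sanity check on the status summary above, the derived figures can be recomputed from the totals. This is a minimal sketch in Python; the Little's law step assumes a roughly steady load and is not something the status page itself reports.

```python
# Totals copied from the server-status output above.
total_accesses = 1_213_060
total_duration_ms = 32_311_929_295
uptime_s = ((1 * 24 + 13) * 60 + 36) * 60 + 50  # 1 day 13 h 36 min 50 s

requests_per_sec = total_accesses / uptime_s
ms_per_request = total_duration_ms / total_accesses

# Little's law (steady-state assumption):
# average requests in flight = arrival rate x time spent per request.
avg_in_flight = requests_per_sec * (ms_per_request / 1000)

print(f"{requests_per_sec:.2f} requests/sec")
print(f"{ms_per_request:.1f} ms/request")
print(f"{avg_in_flight:.0f} requests in flight on average")
```

If the average request really takes about 26 seconds, a few hundred busy workers is what this arithmetic would predict even at under 9 requests per second, so the long request duration may deserve as much attention as the network.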
top
top - 14:31:14 up 79 days, 21:39, 3 users, load average: 2.26, 2.57, 2.86
Tasks: 717 total, 1 running, 716 sleeping, 0 stopped, 0 zombie
%Cpu(s): 3.3 us, 0.7 sy, 0.2 ni, 95.7 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st
MiB Mem : 64365.1 total, 539.8 free, 8847.0 used, 54978.4 buff/cache
MiB Swap: 65477.0 total, 63810.0 free, 1667.0 used. 54580.5 avail Mem
ss -s
Total: 1934
TCP: 2362 (estab 1233, closed 1105, orphaned 2, timewait 1104)
Transport Total IP IPv6
RAW 0 0 0
UDP 0 0 0
TCP 1257 430 827
INET 1257 430 827
FRAG 0 0 0
ulimit -n
1024
ss -ntu | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -n
1 Local
6 192.XXX.XXX.XXX
100 127.0.0.1
340 10.0.0.XX
866 [
ss -ntu | awk '{print $6}' | cut -d: -f1 | sort | uniq -c | sort -n
..........
This lists the number of connections per IP. Besides 127.0.0.1 and [, two IPs have more than 50 connections:
74 104.xxx.xxx.xxx
91 12.xxx.xxx.xxx
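The "[" bucket in the counts above presumably comes from bracketed IPv6 peers such as [::ffff:a.b.c.d]:port, where cutting on the first colon leaves only the opening bracket. A minimal Python sketch of splitting on the last colon instead; the sample addresses are hypothetical and use documentation ranges.

```python
from collections import Counter


def peer_host(addr_port: str) -> str:
    """Split 'host:port' on the last colon so bracketed IPv6 hosts stay intact."""
    host, _, _port = addr_port.rpartition(":")
    return host


# Hypothetical peer-address column values, for illustration only.
peers = [
    "192.0.2.10:51234",
    "192.0.2.10:51240",
    "[::ffff:192.0.2.10]:443",
    "[::ffff:192.0.2.10]:8443",
    "[::ffff:198.51.100.7]:443",
]

naive = Counter(p.split(":")[0] for p in peers)  # mimics cut -d: -f1
by_host = Counter(peer_host(p) for p in peers)

print(naive)
print(by_host)
```

With the last-colon split, the IPv6 peers are counted per address rather than lumped into one bucket.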
MySQL
No processes running more than a second. Number of processes well within limits.
I do not know what stats would be relevant beyond these in diagnosing network rate limiting issues. Any pointers would be appreciated.
EDITED
CPU
lscpu https://pastebin.com/Jha6F7J8
Apache Config
apachectl -t -D DUMP_RUN_CFG https://pastebin.com/i1L2hnjH
Mysql
SHOW GLOBAL STATUS https://pastebin.com/aQX4D01k
SHOW GLOBAL VARIABLES https://pastebin.com/L8EfmHfn
SHOW FULL PROCESSLIST https://pastebin.com/GtqK2tET
mysqltuner https://pastebin.com/GLhhKA9q
Optional Very Helpful Information
top -bn1 https://pastebin.com/r94vpXe6
iostat -xm 5 3 https://pastebin.com/R8YLK3QU
ulimit -a https://pastebin.com/KUC3wqxU

Dorothy, your system is very busy. Not knowing the frequency and duration of the intermittent hangs puts us at a disadvantage. One possible cause is that com_drop_table had 3,318 uses in your 83 days of uptime. Another possible cause is the volume of data read and written: innodb_data_written was 484TB in 83 days, yet MySQLTuner reports only 800K of data in 10 tables. Our General Log Analysis could likely identify the cause of this high activity. These suggestions are a starting point; more analysis and changes should follow.
From your OS command prompt,
ulimit -n 96000 would enable many more Open Files (handles) above today's 1024 limit.
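This takes effect immediately for the current shell and any process started from it, and does not require an OS restart. Services that are already running keep their old limit until they are restarted.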
For this change to persist across OS stop/start the following URL could be used as a guide.
Please use 96000, not 500000 - as in their example documentation.
https://glassonionblog.wordpress.com/2013/01/27/increase-ulimit-and-file-descriptors-limit/
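The open-files limit is a per-process setting, so it can be useful to check what a given process actually sees. A minimal sketch using Python's standard resource module (Unix only); it reports the limits of the Python process itself, not of Apache or MySQL.

```python
import resource

# Soft limit is what the process is held to; hard limit is the ceiling
# the soft limit can be raised to without extra privileges.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)


def show(limit: int) -> str:
    """Render a limit value, mapping RLIM_INFINITY to 'unlimited'."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


print(f"open files: soft={show(soft)}, hard={show(hard)}")
```

For a service that is already running, the limits in effect can be read from /proc/<pid>/limits on Linux.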
Rate Per Second = RPS
Suggestions to consider for your my.cnf [mysqld] section
innodb_io_capacity=1900 # from 200 if you have SSD, 900 if you have magnetic storage to improve IOPS
net_buffer_length=32K # from 16K to reduce malloc operations
innodb_lru_scan_depth=100 # from 1024 to conserve 90% of CPU cycles used for function
key_cache_segments=16 # from 0 to reduce mutex contention with MyISAM opens
key_cache_division_limit=50 # from 100 for Hot/Warm storage to reduce key_page_reads RPS of 18
aria_pagecache_division_limit=50 # from 100 for Hot/Warm storage to reduce aria_pagecache_reads RPS of 5K
read_rnd_buffer_size=64K # from 256K to reduce handler_read_rnd_next RPS of 27,707
These changes should reduce elapsed time to complete most queries.
Additional areas to consider include the use of Slow Query Log analysis to find where an index could avoid a table scan. MySQLTuner reported more than 4 million joins performed without indexes. Our FAQ page includes information on how you could find the tables needing indexes to avoid scans. Let us know how these suggestions work for you.
Skype Talk works very well if you have the flexibility to use that form of communication.

QEMU KVM disk IO/SQL replication issue on one of two identical clone VMs

I have a system running two QEMU KVM virtual machines, identical clones of one another. Both VM's are replicating from the same Master MySQL DB. One VM (vm-01) is carrying an active load, and is running fine. However, the other (standby) VM (vm-02) suddenly fell behind with replication, at 08:00 this morning, and even though replication is running properly, it keeps falling further behind at a slow rate (1s behind for every 10s of real time). vm-02 has been running perfectly for months to date.
After checking all the usual suspects (CPU load, disk space, SQL query errors etc. etc.) it turns out that everything is just fine... except for the virtual disk IO - specifically the write requests (WRRQ). On the host machine:
virt-top 16:01:35 - x86_64 16/16CPU 1596MHz 128915MB
3 domains, 2 active, 2 running, 0 sleeping, 0 paused, 1 inactive D:0 O:0 X:0
CPU: 1.8% Mem: 32768 MB (32768 MB by guests)
ID S RDRQ WRRQ RXBY TXBY %CPU %MEM TIME NAME
3 R 3 1 113K 20K 1.3 12.0 62d21:21 vm-01-ubuntu
9 R 0 563 97K 11K 0.5 12.0 83:09:51 vm-02-ubuntu
- (vm-Clone-ubuntu)
Both VM's have bin-logs disabled, so they only write the relay-bin-log. The active machine (vm-01-ubuntu) is running thousands of radius requests just fine, in addition to the exact same master SQL commands... and it is happily running with a few write requests. But the standby machine falls behind, with hundreds of write requests... perhaps related to replication catching-up... but so slowly?
Checking disk IO on the VM's:
vm-01:~# iostat -x
Linux 4.4.0-141-generic (vm-finrad01) 18/09/2019 _i686_ (1 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
12,04 0,02 9,85 13,87 0,13 64,09
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
vda 0,00 13,91 0,91 147,67 5,20 16,05 0,29 0,11 0,72 0,57 0,73 0,04 0,65
vm-02:~# iostat -x
Linux 4.4.0-141-generic (vm-finrad02) 18/09/2019 _i686_ (1 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
0,26 0,01 0,25 6,46 0,09 92,93
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
vda 0,00 1,22 0,00 34,19 0,20 21,43 1,26 0,00 0,14 0,96 0,14 0,03 0,09
Doesn't yield any glaring issues, especially since the busier VM (vm-01) is doing more as expected.
The host machine has 128 GB of RAM, tons of SSD drive space, and is only running at 30% CPU usage. There are no RAID or drive issues.
Any suggestions on where to check next, given that the WRRQ count is the only evidence to date of vm-02 falling behind. Or am I chasing a red herring?

The issue is related to the guest OS, not the VM setup.
On Ubuntu the apt auto-update feature is quite aggressive, and in the case of the two suspect VMs, apt was constantly attempting to update the repos, writing at 16 MB/s. This is probably related to the fact that the guest OS is Ubuntu 14.04, whose repos are no longer maintained.
The solution was to disable auto-updates, and rather run updates manually.
As root:
service unattended-upgrades stop
echo manual | tee /etc/init/unattended-upgrades.override
Then, edit apt configs to disable packages auto-refresh. Replace "APT::Periodic::Update-Package-Lists "1";" with "0":
cd /etc/apt/apt.conf.d/
cp 10periodic 10periodic.original
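# Read from the backup copy: redirecting output onto the file being read would truncate it first.
awk -F" " '$1=="APT::Periodic::Update-Package-Lists" {printf "%s %s\n",$1,"\"0\";"; next}1' 10periodic.original > 10periodic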
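The same line rewrite can be sketched in Python, which makes it explicit that the whole input is read before anything is written back. The function name and the sample file contents below are hypothetical and only for illustration.

```python
def disable_list_refresh(config_text: str) -> str:
    """Set APT::Periodic::Update-Package-Lists to "0"; leave other lines alone."""
    key = "APT::Periodic::Update-Package-Lists"
    out = []
    for line in config_text.splitlines():
        fields = line.split()
        if fields and fields[0] == key:
            line = f'{key} "0";'
        out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical contents of 10periodic, for illustration only.
original = (
    'APT::Periodic::Update-Package-Lists "1";\n'
    'APT::Periodic::Download-Upgradeable-Packages "0";\n'
)

print(disable_list_refresh(original), end="")
```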
And lastly, disable the repos from the auto-upgrade list:
nano /etc/apt/apt.conf.d/50unattended-upgrades
Find section "Unattended-Upgrade::Allowed-Origins" and comment out the lines:
//"${distro_id}:${distro_codename}-security";
//"${distro_id}ESM:${distro_codename}";
I then rebooted the VM, and all has been well.

How to count the return instructions executed over a period of a program's run (using hardware counters)

I've been working on ROP recently. When using perf to count hardware events, I want to measure the number of return instructions executed by a given piece of code. But the perf interface only provides branch instructions.

If you're only on x86 with a recent Intel CPU:
perf list on my Skylake shows there's a hardware counter for br_inst_retired.near_return. That will count only ret instructions, not other branches. But see erratum SKL091 for branch-instruction counters.
perf stat -e instructions,br_inst_retired.near_return,... ./a.out may be what you're looking for. Or maybe attach perf stat to an already-running program with -p PID, or use -I 1000 to print accumulated counts over intervals.
But note that if you're looking for ROP gadgets, you can find a C3 opcode inside what normally decodes as some other instruction. So restricting yourself only to ret instructions that actually run during the target program's normal execution is more limiting than it needs to be.
e.g. a 4-byte immediate might usefully decode as something + ret if you jump to the immediate.
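A minimal Python sketch of that idea: scan raw machine-code bytes for the C3 opcode without regard to instruction boundaries. The byte string is a hand-built example, not taken from any real binary, and the decoding in the comment assumes 64-bit mode.

```python
RET = 0xC3  # x86 near-return opcode


def ret_offsets(code: bytes) -> list[int]:
    """Offsets of every 0xC3 byte, regardless of instruction boundaries."""
    return [i for i, b in enumerate(code) if b == RET]


# b8 58 5b c3 90 decodes from offset 0 as: mov eax, 0x90c35b58
# (the 4-byte immediate is stored little-endian).
# Entering at offset 1 instead decodes as: pop rax; pop rbx; ret.
code = bytes.fromhex("b8585bc390")

print(ret_offsets(code))
```

A hardware counter for retired returns would never see this C3 during normal execution, because the CPU only ever decodes it as part of the mov immediate.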

Erlang VM killed when creating millions of processes

After Joe Armstrong's claims that Erlang processes are cheap and the VM can handle millions of them, I decided to test it on my machine:
process_galore(N)->
io:format("process limit: ~p~n", [erlang:system_info(process_limit)]),
statistics(runtime),
statistics(wall_clock),
L = for(0, N, fun()-> spawn(fun() -> wait() end) end),
{_, Rt} = statistics(runtime),
{_, Wt} = statistics(wall_clock),
lists:foreach(fun(Pid)-> Pid ! die end, L),
io:format("Processes created: ~p~n
Run time ms: ~p~n
Wall time ms: ~p~n
Average run time: ~p microseconds!~n", [N, Rt, Wt, (Rt/N)*1000]).
wait()->
receive die ->
done
end.
for(N, N, _)->
[];
for(I, N, Fun) when I < N ->
[Fun()|for(I+1, N, Fun)].
Results are impressive for a million processes: I get approx. 6.6 microseconds average spawn time! But when starting 3 million processes, the OS shell prints "Killed" and the Erlang runtime is gone.
I run erl with the +P 5000000 flag; the system is Arch Linux with a quad-core i7 and 8 GB of RAM.

Erlang processes are cheap, but they're not free. Erlang processes spawned by spawn use 338 words of memory, which is 2704 bytes on a 64 bit system. Spawning 3 million processes will use at least 8112 MB of RAM, not counting the overhead of creating the linked list of pids and the anonymous function created for each process (I'm not sure if they're shared if they're created like you're creating.) You'll probably need 10-12GB of free RAM to spawn and keep alive 3 million (almost) empty processes.
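The arithmetic above can be reproduced with a short Python sketch. It uses only the 338-word figure and the 8-byte word size quoted in this answer, and it gives a lower bound that ignores the pid list and closures.

```python
WORDS_PER_PROCESS = 338  # per-process figure quoted in the answer above
BYTES_PER_WORD = 8       # 64-bit system


def min_bytes(n_processes: int) -> int:
    """Lower bound on memory for n freshly spawned, idle processes."""
    return n_processes * WORDS_PER_PROCESS * BYTES_PER_WORD


for n in (1_000_000, 3_000_000):
    print(f"{n:>9,} processes: at least {min_bytes(n) / 1e6:,.0f} MB")
```

On a machine with 8 GB of RAM, the one-million case fits comfortably while the three-million case does not.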
As I pointed out in the comments (and you later verified), the "Killed" message was printed by the Linux Kernel when it killed the Erlang VM, most likely for using up too much RAM. More information here.

How to understand redis-cli's result vs redis-benchmark's result

First, I am new to Redis.
So, I measure latency with redis-cli:
$ redis-cli --latency
min: 0, max: 31, avg: 0.55 (5216 samples)^C
OK, on average I get response in 0.55 milliseconds. From this I assume that using only one connection in 1 second I can get: 1000ms / 0.55ms = 1800 requests per second.
Then on the same computer I run redis-benchmark using only one connection and get more than 6000 requests per second:
$ redis-benchmark -q -n 100000 -c 1 -P 1
PING_INLINE: 5953.80 requests per second
PING_BULK: 6189.65 requests per second
So having measured latency, I expected to get around 2000 requests per second at best. However, I got 6000 requests per second. I cannot find an explanation for it. Am I correct when I calculate: 1000ms / 0.55ms = 1800 requests per second?

Yes, your maths are correct.
IMO, the discrepancy comes from scheduling artifacts (i.e. the behavior of the operating system scheduler or the network loopback).
redis-cli latency is implemented by a loop which only sends a PING command before waiting for 10 ms. Let's try an experiment and compare the result of redis-cli --latency with the 10 ms wait state and without.
In order to be accurate, we first make sure the client and server are always scheduled on deterministic CPU cores. Note: it is generally a good idea to do this for benchmarking purposes on NUMA boxes. Also, make sure the frequency of the CPUs is locked to a given value (i.e. no power-management mode).
# Starting Redis
numactl -C 2 src/redis-server redis.conf
# Running benchmark
numactl -C 4 src/redis-benchmark -n 100000 -c 1 -q -P 1 -t PING
PING_INLINE: 26336.58 requests per second
PING_BULK: 27166.53 requests per second
Now let's look at the latency (with the 10 ms wait state):
numactl -C 4 src/redis-cli --latency
min: 0, max: 1, avg: 0.17761 (2376 samples)
It seems too high compared to the throughput result of redis-benchmark: 0.17761 ms per request would imply only about 5,630 requests per second.
Then, we alter the source code of redis-cli.c to remove the wait state, and we recompile. The code has also been modified to display more accurate figures (but less frequently, because there is no wait state anymore).
Here is the diff against redis 3.0.5:
1123,1128c1123
< avg = ((double) tot)/((double)count);
< }
< if ( count % 1024 == 0 ) {
< printf("\x1b[0G\x1b[2Kmin: %lld, max: %lld, avg: %.5f (%lld samples)",
< min, max, avg, count);
< fflush(stdout);
---
> avg = (double) tot/count;
1129a1125,1127
> printf("\x1b[0G\x1b[2Kmin: %lld, max: %lld, avg: %.2f (%lld samples)",
> min, max, avg, count);
> fflush(stdout);
1135a1134
> usleep(LATENCY_SAMPLE_RATE * 1000);
Note that this patch should not be used against a real system, since it makes the redis-cli --latency feature expensive and intrusive for the performance of the server. Its purpose is just to illustrate my point for the current discussion.
Here we go again:
numactl -C 4 src/redis-cli --latency
min: 0, max: 1, avg: 0.03605 (745280 samples)
Surprise! The average latency is now much lower. Furthermore, 1000/0.03605=27739.25, which is completely in line with the result of redis-benchmark.
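The latency-to-throughput conversion used throughout this discussion can be sketched in Python. The function name is made up for illustration, and the inputs are the averages reported above.

```python
def single_connection_rps(avg_latency_ms: float) -> float:
    """Requests/sec for one connection issuing requests back to back."""
    return 1000.0 / avg_latency_ms


# Average latencies (ms) reported in the question and in this answer.
cases = [
    ("question, 10 ms wait", 0.55),
    ("pinned cores, 10 ms wait", 0.17761),
    ("pinned cores, no wait", 0.03605),
]

for label, avg in cases:
    print(f"{label}: {single_connection_rps(avg):.2f} req/s")
```

Only the no-wait figure lands near the roughly 26K to 27K requests per second that redis-benchmark reported on the same pinned cores.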
Moral: the more often the client loop is scheduled by the OS, the lower the average latency. It is wise to trust redis-benchmark over redis-cli --latency if your Redis clients are active enough. In any case, keep in mind that the average latency does not say much about the performance of a system (i.e. you should also look at the latency distribution, the high percentiles, etc.).
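To illustrate why the average alone says little, here is a minimal Python sketch comparing mean, median and 99th percentile. The samples are made up (not measured) and describe a workload that is mostly fast with a slow 2% tail.

```python
import statistics

# Hypothetical latency samples in ms: 98% fast, 2% slow.
samples = [0.03] * 980 + [5.0] * 20

mean = statistics.fmean(samples)
median = statistics.median(samples)
# quantiles(n=100) returns 99 cut points; index 98 is the 99th percentile.
p99 = statistics.quantiles(samples, n=100)[98]

print(f"mean={mean:.4f} ms, median={median:.2f} ms, p99={p99:.2f} ms")
```

In this made-up data set the mean is several times the median, and neither reveals that one request in fifty takes 5 ms.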
