How can a specific application be monitored by perf inside a KVM guest?

I have an application that I want to monitor via perf stat while it runs inside a KVM VM.
After some Googling I found that perf kvm stat should do this. However, running the command:
sudo perf kvm stat record -p appPID
only prints the usage text:
usage: perf kvm stat record [<options>]
-p, --pid <pid> record events on existing process id
-t, --tid <tid> record events on existing thread id
-r, --realtime <n> collect data with this RT SCHED_FIFO priority
--no-buffering collect data without buffering
-a, --all-cpus system-wide collection from all CPUs
-C, --cpu <cpu> list of cpus to monitor
-c, --count <n> event period to sample
-o, --output <file> output file name
-i, --no-inherit child tasks do not inherit counters
-m, --mmap-pages <pages[,pages]>
number of mmap data pages and AUX area tracing mmap pages
-v, --verbose be more verbose (show counter open errors, etc)
-q, --quiet don't print any message
Does anyone know what the problem is?

Use KVM with vPMU (virtualization of the PMU counters); see section "2.2. Virtual Performance Monitoring Unit (vPMU)" of the Red Hat guide: https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Virtualization_Tuning_and_Optimization_Guide/sect-Virtualization_Tuning_Optimization_Guide-Monitoring_Tools-vPMU.html. Then run perf record -p $pid and perf stat -p $pid inside the guest.
The host system has no knowledge of guest processes (their process tables are managed by the guest kernel, which may not be Linux at all, or may be a different Linux version with an incompatible table format), so the host kernel can't profile one specific guest process. It can only profile the guest as a whole, which is what the perf kvm command is for - https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Virtualization_Tuning_and_Optimization_Guide/chap-Virtualization_Tuning_Optimization_Guide-Monitoring_Tools.html#sect-Virtualization_Tuning_Optimization_Guide-Monitoring_Tools-perf_kvm
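A sketch of the host and guest sides, assuming a libvirt-managed guest named myguest (the guest name, and using virsh at all, are assumptions; adapt to your setup):

```shell
# Host: expose the host CPU (and with it the PMU) to the guest,
# e.g. by adding <cpu mode='host-passthrough'/> to the domain XML:
virsh edit myguest
# (with plain QEMU this corresponds to the -cpu host flag)

# Guest: once the vPMU is visible, profile the process as usual,
# here sampling for 10 seconds:
perf stat -p "$appPID" sleep 10
```

Note that perf kvm stat on the host, by contrast, aggregates guest-wide events such as VM exits; it cannot target a single guest PID.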

Related

monitor bash script execution using monit

We have just started using monit for process monitoring and are pretty new to it. I have a bash script at /home/ubuntu/launch_example.sh that runs continuously. Is it possible to monitor it using monit, so that monit restarts the script if it terminates? What should the syntax be? I tried the syntax below, but the commands are not executed as the ubuntu user (the shell script calls some Python scripts).
check process launch_example
matching "launch_example"
start program = "/bin/bash -c '/home/ubuntu/launch_example.sh'"
as uid ubuntu and gid ubuntu
stop program = "/bin/bash -c '/home/ubuntu/launch_example.sh'"
as uid ubuntu and gid ubuntu
The simple answer is "no". Monit is just for monitoring and is not some kind of supervisor/process manager. So if you want to monitor your long running executable, you have to wrap it.
check process launch_example with pidfile /run/launch.pid
start program = "/bin/bash -c 'nohup /home/ubuntu/launch_example.sh &'"
as uid ubuntu and gid ubuntu
stop program = "/bin/bash -c 'kill $(cat /run/launch.pid)'"
as uid ubuntu and gid ubuntu
This quick'n'dirty way also needs an additional line in your launch_example.sh to write the pidfile (pidfile matching should always be preferred over string matching) - it can be the first line after the shebang. It simply writes the current process ID to the pidfile. Nothing fancy here ;)
echo $$ > /run/launch.pid
In fact, it's not even hard to convert your script into a systemd unit. User binding, restarts, the pidfile, and start-on-boot can then all be managed through systemd (e.g. start program = "/usr/bin/systemctl start my_unit").
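For reference, a minimal unit sketch under the assumptions of the question (same script path; the unit name launch_example.service is made up):

```ini
# /etc/systemd/system/launch_example.service  (unit name is an assumption)
[Unit]
Description=launch_example wrapper

[Service]
User=ubuntu
Group=ubuntu
ExecStart=/bin/bash /home/ubuntu/launch_example.sh
Restart=always

[Install]
WantedBy=multi-user.target
```

Enable it with sudo systemctl enable --now launch_example; monit (or nothing at all) can then drive it through systemctl start/stop.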

Rabbitmq File Descriptor Limit

The Rabbitmq documentation says that we need to do some configuration before using it in production. One of the settings concerns the maximum number of open files (an OS parameter).
The rabbitmq server we use runs on Ubuntu 16.04, and following resources I found on the web, I raised the open files limit to 500k. When I check it from the command line, I get the following output:
root@madeleine:~# ulimit -n
500000
However when I look at the rabbitmq server status, I see another number.
root@madeleine:~# rabbitmqctl status | grep 'file_descriptors' -A 4
{file_descriptors,
[{total_limit,924},
{total_used,19},
{sockets_limit,829},
{sockets_used,10}]},
It seems I managed to increase the limit on the OS side, but rabbitmq still thinks the total file descriptor limit is 924.
What might be causing this problem?
You might want to look at this page
Apparently, this depends on the OS version. If you have systemd, create a drop-in file at /etc/systemd/system/rabbitmq-server.service.d/limits.conf containing:
[Service]
LimitNOFILE=300000
Note that this service configuration might live somewhere else depending on the operating system you are using. You can find it with:
find / -name "*rabbitmq-server.service*"
On the other hand, if you do not have systemd, try this in your rabbitmq-env.conf file:
ulimit -S -n 4096
Increase / Set maximum number of open files
sudo sysctl -w fs.file-max=65536
These limits are defined in /etc/security/limits.conf
sudo nano /etc/security/limits.conf
and set (the first field is the domain the limit applies to):
* soft nofile 65536
* hard nofile 65536
Per user settings for rabbitmq process can also be set in
/etc/default/rabbitmq-server
sudo nano /etc/default/rabbitmq-server
and set
ulimit -n 65536
Then reboot the server for changes to take effect.
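Whichever route you take, it helps to check the limit the running process actually inherited rather than what your shell reports. A small sketch; the pgrep pattern beam.smp (the Erlang VM that RabbitMQ runs in) is an assumption about how it appears in your process list:

```shell
# Print the "Max open files" limit a given PID actually runs with.
fd_limit() {
    grep 'Max open files' "/proc/$1/limits"
}

# 'beam.smp' is an assumption; adjust for your setup.
pid=$(pgrep -f beam.smp | head -n1)
# Fall back to the current shell so the sketch is runnable anywhere:
fd_limit "${pid:-$$}"
```

If this still shows the old limit after editing /etc/security/limits.conf, the service is likely started by systemd, which ignores that file; use the LimitNOFILE drop-in instead.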

Never successfully built a large Hadoop & Spark cluster

I was wondering if anybody could help me with this issue in deploying a spark cluster using the bdutil tool.
When the total number of cores increases (>= 1024), it fails every time for one of the following reasons:
Some machines never become sshable, e.g. "Tue Dec 8 13:45:14 PST 2015: 'hadoop-w-5' not yet sshable (255); sleeping"
Some nodes fail with an "Exited 100" error when deploying spark worker nodes, like "Tue Dec 8 15:28:31 PST 2015: Exited 100 : gcloud --project=cs-bwamem --quiet --verbosity=info compute ssh hadoop-w-6 --command=sudo su -l -c "cd ${PWD} && ./deploy-core-setup.sh" 2>>deploy-core-setup_deploy.stderr 1>>deploy-core-setup_deploy.stdout --ssh-flag=-tt --ssh-flag=-oServerAliveInterval=60 --ssh-flag=-oServerAliveCountMax=3 --ssh-flag=-oConnectTimeout=30 --zone=us-central1-f"
In the log file, it says:
hadoop-w-40: ==> deploy-core-setup_deploy.stderr <==
hadoop-w-40: dpkg-query: package 'openjdk-7-jdk' is not installed and no information is available
hadoop-w-40: Use dpkg --info (= dpkg-deb --info) to examine archive files,
hadoop-w-40: and dpkg --contents (= dpkg-deb --contents) to list their contents.
hadoop-w-40: Failed to fetch http://httpredir.debian.org/debian/pool/main/x/xml-core/xml-core_0.13+nmu2_all.deb Error reading from server. Remote end closed connection [IP: 128.31.0.66 80]
hadoop-w-40: E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?
I tried 16-core 128-node, 32-core 64-node, 32-core 32-node, and other over-1024-core configurations, but either reason 1 or 2 above shows up.
I also tried modifying the ssh-flag to change the ConnectTimeout to 1200s, and changed bdutil_env.sh to set the polling interval to 30s, 60s, ..., but none of this works. There are always some nodes that fail.
Here is one of the configurations that I used:
time ./bdutil \
--bucket $BUCKET \
--force \
--machine_type n1-highmem-32 \
--master_machine_type n1-highmem-32 \
--num_workers 64 \
--project $PROJECT \
--upload_files ${JAR_FILE} \
--env_var_files hadoop2_env.sh,extensions/spark/spark_env.sh \
deploy
To summarize some of the information from a separate email discussion: as IP mappings change and different Debian mirrors get assigned, the concurrent apt-get install calls during a bdutil deployment can occasionally overload some unbalanced servers or trigger DDoS protections, leading to deployment failures. These tend to be transient, and at the moment it appears I can again deploy large clusters in zones like us-east1-c and us-east1-d successfully.
There are a few options you can take to reduce the load on the debian mirrors:
Set MAX_CONCURRENT_ASYNC_PROCESSES to a much smaller value than the default 150 inside bdutil_env.sh, such as 10 to only deploy 10 at a time; this will make the deployment take longer, but would lighten the load as if you just did several back-to-back 10-node deployments.
If the VMs were successfully created but the deployment steps fail, instead of needing to retry the whole delete/deploy cycle, you can try ./bdutil <all your flags> run_command -t all -- 'rm -rf /home/hadoop' followed by ./bdutil <all your flags> run_command_steps to just run through the whole deployment attempt.
Incrementally build your cluster using resize_env.sh; initially set --num_workers 10 and deploy your cluster, and then edit resize_env.sh to set NEW_NUM_WORKERS=20, and run ./bdutil <all your flags> -e extensions/google/experimental/resize_env.sh deploy and it will only deploy the new workers 10-20 without touching those first 10. Then you just repeat, adding another 10 workers to NEW_NUM_WORKERS each time. If a resize attempt fails, you simply ./bdutil <all your flags> -e extensions/google/experimental/resize_env.sh delete to only delete those extra workers without affecting the ones you already deployed successfully.
Finally, if you're looking for more reproducible and optimized deployments, you should consider using Google Cloud Dataproc, which lets you use the standard gcloud CLI to deploy clusters, submit jobs, and further manage/delete clusters without needing to remember your bdutil flags or keep track of which clusters you have on your client machine. You can SSH into Dataproc clusters and use them basically the same way as bdutil clusters, with some minor differences, e.g. the Dataproc DEFAULT_FS is HDFS, so any GCS paths you use should fully specify the complete gs://bucket/object name.
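For example, a roughly equivalent Dataproc deployment could look like the following sketch (the cluster name and zone are placeholders, not taken from the question):

```shell
gcloud dataproc clusters create my-cluster \
    --zone us-east1-c \
    --master-machine-type n1-highmem-32 \
    --worker-machine-type n1-highmem-32 \
    --num-workers 64
```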

strace -f strace /bin/ls failed with PTRACE_TRACEME EPERM (Operation not permitted)

When I run
strace -f strace /bin/ls
to see how strace works, it fails with
ptrace(PTRACE_TRACEME, 0, 0, 0) = -1 EPERM (Operation not permitted)
even as root.
Is there any solution for this?
Docker
When running strace within a Docker container, enable ptrace by starting the container with the SYS_PTRACE capability:
docker run -it --cap-add SYS_PTRACE ubuntu
See: Running Strace in Docker.
The ptrace system call allows only one tracing application per traced process.
man ptrace:
EPERM  The specified process cannot be traced. This could be because the tracer has insufficient privileges (the required capability is CAP_SYS_PTRACE); unprivileged processes cannot trace processes that they cannot send signals to or those running set-user-ID/set-group-ID programs, for obvious reasons. Alternatively, the process may already be being traced, or (on kernels before 2.6.26) be init(1) (PID 1).
This means only one debugger can attach to a given process. When you pass -f, you tell strace to attach to all processes started by the program being debugged. In your case, the first strace calls fork to create a new process, sets the child up for debugging via the ptrace system call, and then calls exec with the parameters you provided. That starts the second strace, which tries to fork and ptrace again - but the second ptrace fails with EPERM because the first strace is already attached to the process.
Running the first strace without the -f parameter lets you trace the main thread of the second strace while the second strace traces ls:
strace strace -f ls
There is -b to detach from an lwp when a specific syscall is made, but it only supports execve. If it supported the ptrace call as well it would be perfect; that means strace would need a small patch to support detaching on ptrace.
Alternative potential hacks include a preloaded library that implements detaching with some trickery.
A better alternative would be a tracing tool such as systemtap or trace-cmd, which can use the kernel-provided tracing infrastructure instead of ptrace.
I mention this and more helpful tips in a recent blog post about strace.
You need to enable support for gdb, strace, and similar tools to attach to other processes on the system.
You can do this temporarily by writing to a proc setting:
sudo bash -c 'echo 0 > /proc/sys/kernel/yama/ptrace_scope'
You can persist that setting between system reboots by modifying /etc/sysctl.d/10-ptrace.conf and setting kernel.yama.ptrace_scope = 0.
If your system does not have /etc/sysctl.d/10-ptrace.conf, you can modify /etc/sysctl.conf and set kernel.yama.ptrace_scope = 0.
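To see which mode you are currently in before changing anything, you can read the value directly; a quick sketch (on kernels built without the Yama LSM the file simply does not exist):

```shell
# 0 = classic ptrace, 1 = restricted (Ubuntu's default), 2 = admin-only, 3 = off
ptrace_mode() {
    if [ -f /proc/sys/kernel/yama/ptrace_scope ]; then
        cat /proc/sys/kernel/yama/ptrace_scope
    else
        echo "Yama LSM not enabled on this kernel"
    fi
}
ptrace_mode
```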

Why doesn't setting the SUID bit in OpenBSD set effective and saved UIDs to executable file owner?

I am using a fresh install of OpenBSD 5.3 as a guest OS on Parallels for Mac:
$ uname -a
OpenBSD openbsd.localdomain 5.3 GENERIC#53 amd64
To my surprise, a binary file owned by root with its SUID bit set runs with UIDs as if the SUID was not set. That is, when UID 1000 runs such a program, the program starts in state:
<real_uid, effective_uid, saved_uid> = <1000, 1000, 1000>
and not in state:
<real_uid, effective_uid, saved_uid> = <1000, 0, 0>
as expected.
Why is this the case?
Here are the details regarding how I found the issue:
I have written an interactive C program (compiled as setuid_min.bin) for evaluating setuid behaviour in different Unix systems. The program lives in a subdirectory of UID 1000's home directory, and the sudo command is used to change ownership and SUID; then the program is run and I enter the uid to report the real, effective, and saved UIDs of the process:
$ sudo chown root:staff setuid_min.bin
$ ls -l | grep 'setuid_min\.bin$'
-rwxr-xr-x 1 root staff [...] setuid_min.bin
$ sudo chmod a+s setuid_min.bin
$ ls -l | grep 'setuid_min\.bin$'
-rwsr-sr-x 1 root staff [...] setuid_min.bin
$ ./setuid_min.bin
uid
1000 1000 1000 some_pid
exit
$
Note that some_pid above is the pid of the setuid_min.bin process. The program reports the real UID, effective UID, and saved UID by reporting the output of the following shell command:
ps -ao ruid,uid,svuid,pid | grep '[ ]my_pid$'
where my_pid is the pid reported by getpid(). My only guess as to why this might happen is that OpenBSD has some underlying permission structure that uses the ownership/permissions of the directory where setuid_min.bin resides, or that it does not actually change the ownership/SUID bit when an unprivileged user uses sudo to change file permissions.
Most likely your binary is on one of the default partitions that are mounted "nosuid". The default fstab the install script creates mounts everything nosuid unless it's known to contain suid binaries.
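You can check this directly from the guest; a quick sketch (the suggestion of /usr/local as a destination is an assumption - it is not mounted nosuid in a default OpenBSD install, but verify on your system):

```shell
# List partitions mounted nosuid; if the binary lives on one of these,
# its setuid bit is silently ignored.
mount | grep nosuid
# Moving the binary to a partition without the flag makes the bit take effect:
sudo mv setuid_min.bin /usr/local/bin/
```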