I am emulating Linux x86-64 with QEMU. Inside the QEMU virtual machine, I am using
taskset -c 0 prc1 & taskset -c 1 prc2 & taskset -c 2 prc3 & taskset -c 3 prc4;
to launch four processes simultaneously and bind them to four cores (prc is short for process). However, I find that once they start running, some cores (say 1 and 2) sometimes do not execute those processes but either idle or do something else. Can you suggest what the reason for this could be, or a way to ensure the processes don't migrate from one core to another?
The processes aren't migrating from one core to another. Whenever they need CPU time, they will only get the core you bound them to. That won't prevent the CPUs from doing other work, nor will it somehow force a process to use a core when it cannot run, say because it's waiting for I/O.
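To verify that the pinning is actually in effect and to see what each core is doing, a quick check from inside the guest can help (PID 1234 is a placeholder; mpstat assumes the sysstat package is installed):
taskset -cp 1234   # print the CPU affinity list of an already-running process
mpstat -P ALL 1    # per-core utilisation, showing whether a core is idle or busy with other work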
Related
A server with 4 GPUs is used for deep learning.
It often happens that GPU memory is not freed after the training process has been terminated (killed). The results shown by nvidia-smi are:
Nvidia-smi results
CUDA device 2 is in use (possibly by a process launched with CUDA_VISIBLE_DEVICES=2).
Some sub-processes are still alive and thus occupy the memory.
One brute-force solution is to kill all Python processes owned by the user:
pkill -u user_name python
This should be helpful if there is only one process to be cleaned up.
Another solution is proposed in the official PyTorch thread "My GPU memory isn't freed properly".
One may find them via
ps -elf | grep python.
However, if multiple processes are running and we only want to kill the ones related to a certain GPU, we can group the processes by GPU device (nvidia0, nvidia1, ...) with:
fuser -v /dev/nvidia*
fuser -v results
As we can see, /dev/nvidia3 is used by some Python threads; thus /dev/nvidia3 corresponds to CUDA device 2.
The problem is: I want to kill certain processes that were launched with CUDA_VISIBLE_DEVICES=2, but I do not know which device node (/dev/nvidia0, /dev/nvidia1, ...) they correspond to.
How can I find the mapping between CUDA_VISIBLE_DEVICES={0,1,2,3} and /dev/nvidia{0,1,2,3}?
If you set the CUDA_DEVICE_ORDER=PCI_BUS_ID environment variable, then the device order should be consistent between CUDA and nvidia-smi.
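A minimal sketch of launching with a consistent ordering (train.py is a hypothetical script name):
export CUDA_DEVICE_ORDER=PCI_BUS_ID   # enumerate GPUs in the same PCI-bus order as nvidia-smi
export CUDA_VISIBLE_DEVICES=2         # expose only the GPU that nvidia-smi lists as index 2
python train.py                       # hypothetical training script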
There is also another option (if you are sure that the limiting to a specific GPU is done via the CUDA_VISIBLE_DEVICES env var). Every process's environment can be examined in /proc/${PID}/environ. The format is partially binary (NUL-separated), but grepping through the output usually works if you force grep to treat the file as text. This might require root privileges.
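A hedged sketch of such a check (reading another user's environ usually requires root; looping over fuser output is just one way to collect candidate PIDs):
# print each process holding a /dev/nvidia* device together with its CUDA_VISIBLE_DEVICES setting
for pid in $(fuser /dev/nvidia* 2>/dev/null); do
    echo "PID $pid: $(tr '\0' '\n' < /proc/$pid/environ | grep CUDA_VISIBLE_DEVICES)"
done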
Due to the increasing space consumption of WSL, I was forced to move my WSL distros to another disk:
Ubuntu
docker-desktop
docker-desktop-data
I used these commands.
wsl --shutdown
wsl --export (on all three of those distros)
wsl --import (already on another disk)
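Roughly, the sequence was of this form (the distro name and E:\ paths below are illustrative placeholders, not the exact ones used):
wsl --shutdown
wsl --export Ubuntu E:\wsl\ubuntu.tar (repeat for docker-desktop and docker-desktop-data)
wsl --import Ubuntu E:\wsl\Ubuntu E:\wsl\ubuntu.tar (re-creates the distro at its new location)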
Now my environment is running fine, but the ext4.vhdx in AppData\Local\Docker\wsl\data is still present and I can't remove it because it is still in use.
When I look at the process handles, it's still being used by System, which doesn't tell me much.
If I run wsl --shutdown, all virtual disks on disk E: lose their handles, but the one on disk C: is still being used.
Would you know how to find out what part of WSL is using it, or whether it even is WSL at all?
Since shutting down WSL does not remove that handle, it might be used by something else.
It's not docker-for-desktop; that one uses a different disk.
Thanks for your suggestions.
Docker Desktop for Windows, which uses WSL2, stores all image and container files in a separate virtual volume (vhdx). This virtual hard disk file can automatically grow when it needs more space (to a certain limit). Unfortunately, if you reclaim some space, i.e. by removing unused images, vhdx doesn't shrink automatically. Luckily, you can reduce its size manually by calling this command in PowerShell (as Administrator):
Optimize-VHD -Path $Env:LOCALAPPDATA\Docker\wsl\data\ext4.vhdx -Mode Full
If the above command fails with
The system failed to compact 'C:\Users\Maxx\AppData\Local\Docker\wsl\data\ext4.vhdx':
The process cannot access the file because it is being used by another process. (0x80070020).
exit from Docker Desktop or stop the services and tasks using that file:
net stop com.docker.service
taskkill /IM "docker.exe" /F
taskkill /IM "Docker Desktop.exe" /F
wsl --shutdown
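If Optimize-VHD is not available (it ships with the Hyper-V PowerShell module, which e.g. Windows Home editions lack), a commonly used alternative, not part of the original answer, is diskpart's compact vdisk, run from an elevated prompt after the services above are stopped (adjust the path to your own ext4.vhdx):
diskpart
rem inside diskpart:
select vdisk file="C:\Users\<you>\AppData\Local\Docker\wsl\data\ext4.vhdx"
attach vdisk readonly
compact vdisk
detach vdisk
exit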
I reclaimed 15 GB out of 40 GB.
Origin of the solution.
You can just clean the data from the interface: Troubleshooting -> Clean/Purge data.
Upgrading from WSL1 to WSL2 made it a bit messy, but resetting docker-desktop to its default settings and then purging data from WSL (using the docker-desktop troubleshoot menu) cleared it for me.
I have access to 4 GPUs (I am not the root user). One of the GPUs (no. 2) behaves weirdly: some memory is blocked, but the power consumption and temperature are very low (as if nothing is running on it). See the nvidia-smi details in the image below:
How can I reset the GPU 2 without disturbing the processes running on the other GPUs?
PS: I am not a root user, but I think I can get hold of a root user if needed.
Resetting a GPU can resolve your problem, though it may be impossible due to your GPU configuration:
nvidia-smi --gpu-reset -i "gpu ID"
For example, if you have NVLink enabled between GPUs, the reset does not always go through. Also, it seems that nvidia-smi in your case is unable to find the process running on your GPU, so the solution in your case is to find and kill the process associated with that GPU by running the following commands, filling in the PID with the one you find via fuser:
fuser -v /dev/nvidia*
kill -9 "PID"
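Equivalently, fuser can send the signal itself; a one-liner sketch (be careful: this kills every process that holds /dev/nvidia2):
fuser -vk /dev/nvidia2   # -k sends SIGKILL to all processes using the device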
I am currently setting up a web service powered by Apache, running on CentOS 6.4.
This service uses Perl scripts (cgi-bin) that launch, in particular, external homemade compiled Fortran binaries.
Here is the issue: when I boot my server, everything goes well except that one of my binaries systematically crashes (with a segfault reported in the kernel log) when called by my Perl scripts.
If I manually restart the httpd service (at the command line: service httpd restart), the issue is completely fixed.
I examined the Apache and system logs and found nothing suspicious.
It appears that the problem occurs only when httpd is launched by the /etc/rc[0-6].d startup scripts. I tried changing the launch order of httpd (S85httpd by default) to other positions, without success.
To summarize, my web service is only functional (with no external binary crash) when httpd is launched at the command line once the server has fully booted up!
[EDIT] This issue is now resolved:
My fortran binary handles very large arrays and complex functions requiring an unlimited stack size.
Although the stack size limit was defined system-wide (in /etc/security/limits.conf), for some reason the "apache/perl/fortran binary" chain was not aware of it (causing my binary to crash each time it was called).
By contrast, when I manually restarted Apache at the shell prompt, the stack size limit was correctly inherited (my .bashrc contains 'ulimit -S -s unlimited').
As a workaround, I used the BSD::Resource module (http://metacpan.org/pod/BSD::Resource) to set the stack size directly in my Perl script, e.g. setrlimit(RLIMIT_STACK, $softlimit, $hardlimit);
Thus, the new stack size limit is now passed directly from my Perl script to my binary.
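A related shell-level sketch (an assumption about the setup, not what was actually used here): on CentOS 6 the httpd init script sources /etc/sysconfig/httpd, so raising the limit there also makes the boot-time daemon inherit it:
# append the limit to the file sourced by /etc/init.d/httpd, then restart the service
echo 'ulimit -s unlimited' >> /etc/sysconfig/httpd
service httpd restart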
I've run into similar problems before. Maybe one way to solve this is to put the binary on a 'delayed start', so that it starts after everything else on your system is running. One way to do this is to put an at job in your /etc/rc.local script, to start the binary in X minutes.
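A minimal sketch of such an at job (the 5-minute delay is an arbitrary example):
# in /etc/rc.local: schedule a delayed restart of the service once the system is up
echo "service httpd restart" | at now + 5 minutes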
I've recently become quite fond of Upstart. Previously I've been using God, Monit and Bluepill but I don't really like these solutions so I'm giving Upstart a try.
I've been using the Foreman gem to generate some basic Upstart configuration files for my processes in /etc/init. However, these generated files only handle the respawning of a crashed process. I was wondering whether it's possible to tell Upstart to restart a process that's consuming for example > 150mb of memory, as you would with Monit, God or Bluepill.
I read through the Upstart docs, and this looks like the thing I'm looking for, though I have no clue how to configure something like this.
What I basically want is quite simple: I want to restart my web process if its memory usage exceeds 150 MB of RAM. These are the files I have:
|-- myapp-web-1.conf
|-- myapp-web-2.conf
|-- myapp-web-3.conf
|-- myapp-web.conf
|-- myapp.conf
And their contents are:
myapp.conf
pre-start script
bash << "EOF"
mkdir -p /var/log/myapp
chown -R deployer /var/log/myapp
EOF
end script
myapp-web.conf
start on starting myapp
stop on stopping myapp
myapp-web-1.conf / myapp-web-2.conf / myapp-web-3.conf
start on starting myapp-web
stop on stopping myapp-web
respawn
exec su - deployer -c 'cd /var/applications/releases/20110607140607; cd myapp && bundle exec unicorn -p $PORT >> /var/log/myapp/web-1.log 2>&1'
Any help much appreciated!
Appending this to the end of myapp-web-*.conf will cause any allocation call that tries to allocate more than 150 MB of memory to return ENOMEM:
limit rss 157286400 157286400
The process might crash at this point, or it might not. That's up to the process!
Here's a test for this in the Upstart Source.
From the Upstart docs, the limits come from the rlimit system call options. (http://upstart.ubuntu.com/cookbook/#limit)
Since Linux 2.4, setting the rss (Resident Set Size) limit has no effect.
An alternative, already suggested in other answers, is as, which sets the virtual memory (address space) size limit. This has a very different effect from setting 'real' (resident) memory limits.
limit as <soft limit> <hard limit>
Excerpt from man pages for setrlimit:
RLIMIT_AS
The maximum size of the process's virtual memory (address space) in bytes. This limit affects calls to brk(2), mmap(2), and mremap(2), which fail with the error ENOMEM upon exceeding this limit. Also automatic stack expansion will fail (and generate a SIGSEGV that kills the process if no alternate stack has been made available via sigaltstack(2)). Since the value is a long, on machines with a 32-bit long either this limit is at most 2 GiB, or this resource is unlimited.
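Putting it together, a sketch of what one of the job files from the question could look like with the address-space limit added (reusing the 150 MB figure from above; the exec line is the one already shown):
start on starting myapp-web
stop on stopping myapp-web
respawn
# cap the address space at roughly 150 MB (soft and hard); allocations beyond this fail with ENOMEM
limit as 157286400 157286400
exec su - deployer -c 'cd /var/applications/releases/20110607140607; cd myapp && bundle exec unicorn -p $PORT >> /var/log/myapp/web-1.log 2>&1'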