I have access to 4 GPUs (I am not the root user). One of the GPUs (no. 2) behaves weirdly: some memory is blocked, but the power consumption and temperature are very low (as if nothing is running on it). See the details from nvidia-smi in the image below:
How can I reset GPU 2 without disturbing the processes running on the other GPUs?
PS: I am not a root user, but I think I can get hold of a root user if needed.
Resetting a GPU can resolve your problem, although it may be impossible depending on your GPU configuration:
nvidia-smi --gpu-reset -i <GPU ID>
For example, if you have NVLink enabled between GPUs, the reset does not always go through. It also seems that, in your case, nvidia-smi is unable to find the process running on your GPU. The solution then is to find and kill the process associated with that GPU by running the following commands, filling in the PID with the one you find via fuser:
fuser -v /dev/nvidia*
kill -9 "PID"
Related
A server with 4 GPUs is used for deep learning.
It often happens that GPU memory is not freed after the training process is terminated (killed). The results shown by nvidia-smi are:
Nvidia-smi results
CUDA device 2 is in use (possibly by a process launched with CUDA_VISIBLE_DEVICES=2).
Some sub-processes are still alive and thus occupy the memory.
One brute-force solution is to kill all processes created by python using:
pkill -u user_name python
This should be helpful if there is only one process to be cleaned up.
Another solution is proposed in the official PyTorch thread My GPU memory isn’t freed properly.
One may find them via
ps -elf | grep python
However, if multiple processes are launched and we only want to kill the ones related to a certain GPU, we can group processes by GPU device index (nvidia0, nvidia1, ...) with:
fuser -v /dev/nvidia*
fuser -v results
As we can see, /dev/nvidia3 is used by some Python threads. Thus /dev/nvidia3 corresponds to CUDA device 2.
The problem is: I want to kill certain processes launched with CUDA_VISIBLE_DEVICES=2 set, but I do not know the corresponding device index (/dev/nvidia0, /dev/nvidia1, ...).
How do I find the mapping between CUDA_VISIBLE_DEVICES={0,1,2,3} and /dev/nvidia{0,1,2,3}?
If you set the CUDA_DEVICE_ORDER=PCI_BUS_ID environment variable, then the device order should be consistent between CUDA and nvidia-smi.
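For instance, a minimal sketch (train.py is a hypothetical script name):

export CUDA_DEVICE_ORDER=PCI_BUS_ID   # enumerate GPUs in PCI bus order, like nvidia-smi
export CUDA_VISIBLE_DEVICES=2         # "2" now refers to the same physical GPU in both views
python train.py                       # hypothetical training script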
There is also another option (if you are sure that limiting to a specific GPU is done via the CUDA_VISIBLE_DEVICES env var). Every process's environment can be examined in /proc/${PID}/environ. The format is partially binary (NUL-separated), but grepping through the output usually works (if you force grep to treat the file as text). This might require root privileges.
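A sketch of that check (12345 is a placeholder PID, e.g. one reported by fuser; reading another user's environ may require root):

tr '\0' '\n' < /proc/12345/environ | grep '^CUDA_VISIBLE_DEVICES='
# or force grep to treat the binary file as text:
grep -a CUDA_VISIBLE_DEVICES /proc/12345/environ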
VirtualBox (Version 5.2.24 r128163 (Qt5.6.2)) user with a Xubuntu guest (Ubuntu 18.04.2 LTS) and a Windows 10 host here.
I recently tried to resize my VDI from ~100 GB to 200 GB. On Windows I used the command:
./VBoxManage modifyhd "D:\xub2\xub2.vdi" --resize 200000
That went fine. Then I used a GParted live CD to create a VM, attached the VDI, and resized the partitions:
gparted gui
All looks good. If I then run 'fdisk -l' while in the GParted VM, the increased partition sizes are visible as expected.
fdisk -l results for vdi attached to gparted vm
If I try to resize the file system on one of the newly resized logical drives with 'resize2fs /dev/sda5', I am told it is already 46265856 blocks long and there is nothing to do.
However....
If I then re-attach this VDI to an Ubuntu VM and boot from it, 'fdisk -l' gives different results, basically telling me that the drive is still 100 GB in size.
fdisk -l results for the same vdi attached to ubuntu vm
The 'df' command confirms that it is not resized.
df command output with same vdi attached to ubuntu vm
If I try the command 'resize2fs /dev/sda5' I get the result:
The filesystem is already 22003712 (4k) blocks long. Nothing to do!
How can I fix this and make the Ubuntu VM see that the disk and partitions have been increased in size?
OK, I will answer my own question (thank you for the negative vote, anonymous internet).
This issue occurs when you have existing snapshots of the drive that you are trying to expand associated with a VirtualBox VM.
I found this described in the VirtualBox forums:
https://forums.virtualbox.org/viewtopic.php?f=35&t=50661
One suggested solution is to delete the snapshots; however, I got an error message when I attempted that.
The solution that worked for me was to clone my VM. The cloned VM (which did not have any snapshots associated with it), behaved as expected and showed the correct size for the resized disk.
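For reference, the clone can also be made from the command line; a sketch (the VM name "xub2" and the clone name are assumptions based on the file name above):

./VBoxManage clonevm "xub2" --name "xub2-clone" --register
# the default clone mode copies only the current machine state, without snapshots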
To be clear: the situation I described above is 100% true.
Hope that helps someone.
When running a notebook on Colab I get the message:
"desactivated execution environment"
and the execution stops.
Note that I disconnected from Colab and then reconnected again just before running this notebook.
Any idea what the problem is?
Thanks.
I think we are having a similar issue, and it's probably due to memory. The session will just disconnect and you have to start the run all over again.
The two things I did to somewhat reduce how often this occurs are:
Before I start or restart running my code, I run the "kill" command below. This basically clears anything left behind in memory:
!kill -9 -1
I use Python, so every now and then I also run the garbage collector:
import gc
gc.collect()
This helps, but it is not bulletproof.
I am facing a problem where my camera image-fetching program stops working. When the program becomes unresponsive, I captured the following info with the ps command:
What is the first process cfinteractive?
cfinteractive is a kernel thread for the Interactive governor of the CPUFreq driver. You can verify that your system is using this governor with:
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
And you can temporarily disable CPU frequency scaling with the command:
echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
You need to do that for each of your CPUs.
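A sketch that applies this to every CPU at once (writing to these files requires root):

for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > "$g"   # switch each core to the performance governor
done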
I am emulating Linux x86-64 with QEMU. In the QEMU virtual machine, I am using
taskset -c 0 prc1 & taskset -c 1 prc2 & taskset -c 2 prc3 & taskset -c 3 prc4;
to launch 4 processes simultaneously and bind them to four cores (prc is short for process). However, I find that once they start running, some cores (say 1 and 2) intermittently stop executing those processes and either idle or do something else. Can you suggest what the reason for this could be, or a way to ensure the processes don't migrate from one core to another?
The processes aren't migrating from one core to another. Whenever they need CPU, they will only get the core you bound them to. That won't prevent the CPUs from doing other work, nor will it somehow force a process to use a core even when it cannot run, say because it's waiting for I/O.
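If you want to confirm the pinning is still in effect while the processes run, a sketch (prc1 is the process name from the question):

for pid in $(pgrep -x prc1); do
    taskset -cp "$pid"             # print the allowed CPU list for the process
    ps -o pid,psr,comm -p "$pid"   # the PSR column shows the core it last ran on
done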