I'm using Google Colab Pro+ to train a computer vision model. Sometimes execution takes up to 8 hours so after completion I'm still allocating memory and a GPU. I wish I had a line of code to execute after the training finishes. Is there any way to terminate the environment using code?
You should be able to disconnect your runtime by running the following command:
!kill $(ps aux | awk '{print $2}')
This command extracts the PID of every running process and kills them all, including the notebook kernel itself, which is what disconnects the runtime.
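If you only want to free the VM after training finishes rather than kill every process indiscriminately, a gentler pattern is to terminate the kernel's own process as the last step of the cell. A minimal sketch (the dry-run guard is just for safe testing; nothing here is Colab-specific API):

```python
import os
import signal

def release_runtime(dry_run=True):
    """Kill this kernel's own process so the VM (GPU included) is released.

    With dry_run=True it only reports the PID it would signal, which is
    useful for checking the logic without actually disconnecting."""
    pid = os.getpid()
    if dry_run:
        return pid
    os.kill(pid, signal.SIGKILL)  # the runtime disconnects immediately

# After training completes, call release_runtime(dry_run=False)
# as the last line of the notebook cell.
print(release_runtime() == os.getpid())
```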
A server with 4 GPUs is used for deep learning.
It often happens that GPU memory is not freed after the training process is terminated (killed). The results shown by nvidia-smi are:
[nvidia-smi output shown as an image]
CUDA device 2 is in use (it might be a process launched with CUDA_VISIBLE_DEVICES=2).
Some sub-processes are still alive and thus occupy the memory.
One brute-force solution is to kill all processes created by python using:
pkill -u user_name python
This should be helpful if there is only one process to be cleaned up.
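If you want to inspect which processes pkill would hit before actually killing anything, the same selection can be sketched in Python (a sketch; it assumes a procps-style ps and matches any command name containing "python", e.g. python3):

```python
import getpass
import subprocess

def python_pids(user):
    """List PIDs of the given user's python processes,
    roughly what `pkill -u user python` would target."""
    out = subprocess.run(["ps", "-u", user, "-o", "pid=,comm="],
                         capture_output=True, text=True).stdout
    pids = []
    for line in out.splitlines():
        pid, comm = line.split(None, 1)
        if "python" in comm:
            pids.append(int(pid))
    return pids

# The interpreter running this snippet should itself appear in the list.
print(isinstance(python_pids(getpass.getuser()), list))
```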
Another solution, proposed in the official PyTorch forum thread "My GPU memory isn't freed properly", is to find the leftover processes and kill them individually. One may find them via:
ps -elf | grep python
However, if multiple processes are launched and we only want to kill the ones related to a certain GPU, we can group processes by GPU device file (nvidia0, nvidia1, ...) as:
fuser -v /dev/nvidia*
[fuser -v output shown as an image]
As we can see, /dev/nvidia3 is used by some python threads. Thus /dev/nvidia3 corresponds to cuda device 2.
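fuser's device-to-process mapping can also be reproduced for your own processes without root by scanning open file descriptors under /proc; a sketch (on a machine with no NVIDIA devices in use it simply returns an empty dict):

```python
import os

def nvidia_users():
    """Map /dev/nvidiaN device paths to the PIDs holding them open,
    by scanning each process's open file descriptors, similar to
    `fuser -v /dev/nvidia*`. Processes we cannot read (other users,
    already exited) are silently skipped."""
    usage = {}
    for pid in filter(str.isdigit, os.listdir("/proc")):
        fd_dir = f"/proc/{pid}/fd"
        try:
            for fd in os.listdir(fd_dir):
                target = os.readlink(os.path.join(fd_dir, fd))
                if target.startswith("/dev/nvidia"):
                    usage.setdefault(target, set()).add(int(pid))
        except OSError:
            continue  # process exited or no permission
    return usage

print(type(nvidia_users()) is dict)
```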
The problem is: I want to kill certain processes launched with setting of CUDA_VISIBLE_DEVICES=2, but I do not know the gpu index (/dev/nvidia0, /dev/nvidia1, ...).
How can I find the mapping between CUDA_VISIBLE_DEVICES={0,1,2,3} and /dev/nvidia{0,1,2,3}?
If you set CUDA_DEVICE_ORDER=PCI_BUS_ID environment variable, then the order should be consistent between CUDA and nvidia-smi.
There is also another option (if you are sure the limiting to a specific GPU is done via the CUDA_VISIBLE_DEVICES env var). Every process's environment can be examined in /proc/${PID}/environ. The format is partially binary, but grepping through the output usually works (if you force grep to treat the file as a text file). This might require root privileges.
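A sketch of that /proc/${PID}/environ approach in Python: the entries are NUL-separated KEY=VALUE pairs frozen at exec time, so later putenv calls in the target process do not show up. The sleeping child below is just a stand-in for a training process:

```python
import os
import subprocess
import sys
import time

def proc_environ(pid):
    """Parse /proc/<pid>/environ into a dict of KEY=VALUE pairs."""
    with open(f"/proc/{pid}/environ", "rb") as f:
        raw = f.read()
    env = {}
    for entry in raw.split(b"\0"):
        if b"=" in entry:
            key, val = entry.split(b"=", 1)
            env[key.decode(errors="replace")] = val.decode(errors="replace")
    return env

# Demo: launch a child with CUDA_VISIBLE_DEVICES=2 and read it back from /proc.
child = subprocess.Popen(
    [sys.executable, "-c", "import time; time.sleep(10)"],
    env={**os.environ, "CUDA_VISIBLE_DEVICES": "2"},
)
time.sleep(0.3)  # give the child a moment to exec so /proc shows its environment
print(proc_environ(child.pid).get("CUDA_VISIBLE_DEVICES"))  # → 2
child.kill()
```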
I know how to modify my image, reboot, and re-run it, but that would make my experiments very slow, since boot takes a few minutes.
Is there a way to quickly switch:
command line options
the executable
that is being run after boot?
This is not trivial because the Linux kernel knows about:
the state of the root filesystem
the state of memory, and therefore of kernel CLI options that could be used to modify init
so I can't just switch those after a checkpoint.
This question is inspired from: https://www.mail-archive.com/gem5-users@gem5.org/msg16959.html
Here is a fully automated setup that can help you to do it.
The basic workflow is as follows:
run your benchmark from the init executable that gets passed to the Linux kernel
https://unix.stackexchange.com/questions/122717/how-to-create-a-custom-linux-distro-that-runs-just-one-program-and-nothing-else/238579#238579
https://unix.stackexchange.com/questions/174062/can-the-init-process-be-a-shell-script-in-linux/395375#395375
to run a single benchmark with different parameters without rebooting, do in your init script:
m5 checkpoint
m5 readfile | sh
and set the contents that m5 readfile will return via a file on the host filesystem before restoring the checkpoint with:
echo 'm5 resetstats && ./run-benchmark && m5 dumpstats' > path/to/script
build/ARM/gem5.opt configs/example/fs.py --script path/to/script
This is what "configs/boot/hack_back_ckpt.rcS" does, but I think that script is overly complicated.
to modify the executable without having to reboot, attach a second disk image and mount it after the checkpoint is restored:
How to attach multiple disk images in a simulation with gem5 fs.py?
Another possibility would be 9P, but it is not currently working: http://gem5.org/WA-gem5
to count only benchmark instructions, do:
m5 resetstats
./run-benchmark
m5 dumpstats
If that is not precise enough, modify the source of your benchmark with m5ops magic instructions that do resetstats and dumpstats from within the benchmark as shown at: How to count the number of CPU clock cycles between the start and end of a benchmark in gem5?
to make the first boot faster, you can boot with a simple CPU model like the default AtomicSimpleCPU and then switch to a more detailed, slower model after the checkpoint is restored: How to switch CPU models in gem5 after restoring a checkpoint and then observe the difference?
I have access to 4 GPUs (not as a root user). One of the GPUs (no. 2) behaves weirdly: some memory is blocked, but the power consumption and temperature are very low (as if nothing is running on it). See details from nvidia-smi in the image below:
How can I reset the GPU 2 without disturbing the processes running on the other GPUs?
PS: I am not a root user, but I think I can get hold of a root user as well.
Resetting a GPU can sometimes resolve your problem, though it may be impossible due to your GPU configuration:
nvidia-smi --gpu-reset -i "gpu ID"
For example, if you have NVLink enabled between GPUs, the reset does not always go through. Also, it seems that nvidia-smi in your case is unable to find the process running on your GPU. The solution then is to find and kill the process associated with that GPU by running the following commands, filling in the PID with the one you find via fuser:
fuser -v /dev/nvidia*
kill -9 "PID"
When running a notebook on Colab I get the message:
"deactivated execution environment"
and the execution stops.
Note that I disconnected from Colab and then reconnected again just before running this notebook.
Any idea what the problem is?
Thanks.
I think we are having a similar issue, and it's probably due to memory. The session will just disconnect and you have to start the run all over again.
The 2 things I did to somewhat reduce how often this occurs are:
Before I start or restart running my code, I run the "kill" command below. This basically clears any potential items left behind in memory.
!kill -9 -1
I use Python, so every now and then I run the garbage collector:
import gc
gc.collect()
This helps, but it is not bulletproof.
I am using a monkeyrunner Jython script to automate some UI tests. I want to confirm that the previous step is complete before doing the next step, based on the current CPU usage of the OS (of the PC the emulator is running on). Hence I need a way to get the current CPU usage in a monkeyrunner Jython script.
I've done some research, but it looks like a monkeyrunner Jython script does not work with psutil: Monkeyrunner doesnt find my module
Could anyone tell me the easiest way to get the current CPU usage in a monkeyrunner Jython script?
Thanks.
You can invoke shell commands directly from MonkeyDevice:
top10 = device.shell('top -n 1 -m 10')
Try the top command to get CPU usage.
1. Try this if you want to get the CPU usage of the Android device:
import os
top_10_ps_list=os.popen('adb shell top -n 1 -m 10').read()
2. Try this if you want to get the CPU usage of the PC's OS:
import os
top_10_ps_list=os.popen('top -b -n 1').read()
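To turn the raw top output into a single number you can poll on, parse the idle percentage from the summary line; a minimal sketch, assuming GNU procps formatting and using a canned sample line instead of live output:

```python
import re

# Example summary line as printed by procps top in batch mode (-b).
SAMPLE = "%Cpu(s): 12.5 us,  3.1 sy,  0.0 ni, 83.2 id,  1.2 wa,  0.0 hi,  0.0 si,  0.0 st"

def cpu_usage_from_top(text):
    """Return the busy CPU percentage (100 - idle) parsed from top's
    summary line, or None if no idle figure is found."""
    match = re.search(r"(\d+\.\d+)\s*id", text)
    if match is None:
        return None
    return round(100.0 - float(match.group(1)), 1)

print(cpu_usage_from_top(SAMPLE))  # → 16.8
```

In the script you would feed it the string from os.popen('top -b -n 1').read() and loop until the usage drops below your threshold before starting the next step.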