A server with 4 GPUs is used for deep learning.
It often happens that GPU memory is not freed after the training process is terminated (killed). The output of nvidia-smi is:
(Image: nvidia-smi results)
CUDA device 2 is in use. (It might be a process launched with CUDA_VISIBLE_DEVICES=2.)
Some subprocesses are still alive and therefore still occupy the memory.
One brute-force solution is to kill all processes created by python using:
pkill -u user_name python
This should be helpful if there is only one process to be cleaned up.
Another solution is proposed in the official PyTorch FAQ entry "My GPU memory isn't freed properly":
One may find them via
ps -elf | grep python
However, if multiple processes are launched and we only want to kill the ones that are related to a certain GPU, we can group processes by the GPU index (nvidia0, nvidia1, ...) with:
fuser -v /dev/nvidia*
(Image: fuser -v results)
As we can see, /dev/nvidia3 is used by some python threads. Thus /dev/nvidia3 corresponds to cuda device 2.
The problem is: I want to kill certain processes launched with CUDA_VISIBLE_DEVICES=2, but I do not know the GPU index (/dev/nvidia0, /dev/nvidia1, ...).
How can I find the mapping between CUDA_VISIBLE_DEVICES={0,1,2,3} and /dev/nvidia{0,1,2,3}?
If you set the CUDA_DEVICE_ORDER=PCI_BUS_ID environment variable, then the order should be consistent between CUDA and nvidia-smi.
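As a minimal sketch of that suggestion, the two variables can be set together in the environment handed to the training process. The helper name `gpu_env` and the `train.py` script mentioned below are hypothetical, used only for illustration; the two environment variable names are the ones discussed above.

```python
import os


def gpu_env(index, base=None):
    """Return a copy of an environment that pins CUDA to one GPU.

    `index` is the GPU index as listed by nvidia-smi. `base` defaults to
    the current process environment and is never modified in place.
    """
    env = dict(os.environ if base is None else base)
    # Ask CUDA to enumerate devices in PCI bus order, like nvidia-smi does.
    env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    # Expose only the requested GPU to the child process.
    env["CUDA_VISIBLE_DEVICES"] = str(index)
    return env
```

The result could then be passed to something like `subprocess.Popen(["python", "train.py"], env=gpu_env(2))`, so that the index used at launch time is the same one nvidia-smi reports.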
There is also another option (if you are sure that the restriction to a specific GPU is done via the CUDA_VISIBLE_DEVICES env var). Every process' environment can be examined in /proc/${PID}/environ. The entries are separated by NUL bytes, so the file looks partially binary, but grepping through the output usually works (if you force grep to treat the file as a text file). Reading the environment of another user's process might require root privileges.
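The /proc approach can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function names and the `proc_root` parameter are made up for the example, and processes whose environ file cannot be read (for instance because they belong to another user) are silently skipped. Note also that the file normally reflects the environment the process was started with, so a variable set later from inside the program may not show up.

```python
import os


def parse_environ(raw):
    """Split the NUL-separated KEY=VALUE records of /proc/<pid>/environ."""
    env = {}
    for record in raw.split(b"\0"):
        key, sep, value = record.partition(b"=")
        if sep:  # ignore empty or malformed records
            env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env


def pids_with_visible_devices(wanted, proc_root="/proc"):
    """Return PIDs whose CUDA_VISIBLE_DEVICES equals `wanted` (a string)."""
    pids = []
    for name in os.listdir(proc_root):
        if not name.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, name, "environ"), "rb") as f:
                env = parse_environ(f.read())
        except OSError:
            continue  # process exited, or we lack permission
        if env.get("CUDA_VISIBLE_DEVICES") == wanted:
            pids.append(int(name))
    return sorted(pids)
```

Calling `pids_with_visible_devices("2")` would then list the candidate PIDs to inspect before killing anything.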
I know how to modify my image, reboot, and re-run it, but that would make my experiments very slow, since boot takes a few minutes.
Is there a way to quickly switch:
command line options
the executable
that is being run after boot?
This is not trivial because the Linux kernel knows about:
the state of the root filesystem
the state of memory, and therefore of kernel CLI options that could be used to modify init
so I can't just switch those after a checkpoint.
This question is inspired by: https://www.mail-archive.com/gem5-users@gem5.org/msg16959.html
Here is a fully automated setup that can help you to do it.
The basic workflow is as follows:
run your benchmark from the init executable that gets passed to the Linux kernel
https://unix.stackexchange.com/questions/122717/how-to-create-a-custom-linux-distro-that-runs-just-one-program-and-nothing-else/238579#238579
https://unix.stackexchange.com/questions/174062/can-the-init-process-be-a-shell-script-in-linux/395375#395375
to run a single benchmark with different parameters without rebooting, do in your init script:
m5 checkpoint
m5 readfile | sh
and set the contents of m5 readfile on a host filesystem file before restoring the checkpoint with:
echo 'm5 resetstats && ./run-benchmark && m5 dumpstats' > path/to/script
build/ARM/gem5.opt configs/example/fs.py --script path/to/script
This is what "configs/boot/hack_back_ckpt.rcS" does, but I think that script is overly complicated.
to modify the executable without having to reboot, attach a second disk image and mount it after the checkpoint is restored:
How to attach multiple disk images in a simulation with gem5 fs.py?
Another possibility would be 9P but it is not working currently: http://gem5.org/WA-gem5
to count only benchmark instructions, do:
m5 resetstats
./run-benchmark
m5 dumpstats
If that is not precise enough, modify the source of your benchmark with m5ops magic instructions that do resetstats and dumpstats from within the benchmark as shown at: How to count the number of CPU clock cycles between the start and end of a benchmark in gem5?
to make the first boot faster, you can boot with a simple CPU model like the default AtomicSimpleCPU and then switch to a more detailed, slower model after the checkpoint is restored: How to switch CPU models in gem5 after restoring a checkpoint and then observe the difference?
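To vary the benchmark parameters between restores, the host-side file that m5 readfile returns can be regenerated before each run. The following is a rough sketch of that step only; the helper name `write_readfile_script` and the example arguments are hypothetical, while the m5 resetstats / m5 dumpstats wrapping is the one shown above.

```python
import shlex
from pathlib import Path


def write_readfile_script(path, benchmark="./run-benchmark", args=()):
    """Write the script that the guest will fetch with `m5 readfile | sh`.

    The benchmark command is wrapped in `m5 resetstats` / `m5 dumpstats`
    so that only the benchmark itself is counted in the stats.
    """
    # Quote every word so that arguments with spaces survive the guest shell.
    cmd = " ".join(shlex.quote(word) for word in (benchmark, *args))
    script = f"m5 resetstats && {cmd} && m5 dumpstats\n"
    Path(path).write_text(script)
    return script
```

For example, `write_readfile_script("path/to/script", args=["--size", "1000"])` would be run on the host, followed by the fs.py invocation with `--script path/to/script` as above.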
I have access to 4 GPUs (not as a root user). One of the GPUs (no. 2) behaves strangely: there is some memory blocked, but the power consumption and temperature are very low (as if nothing is running on it). See the details from nvidia-smi in the image below:
How can I reset GPU 2 without disturbing the processes running on the other GPUs?
PS: I am not a root user, but I think I can get help from a root user as well.
Resetting the GPU can resolve your problem, although it may be impossible depending on your GPU configuration:
nvidia-smi --gpu-reset -i "gpu ID"
For example, if NVLink is enabled between the GPUs, the reset does not always go through. It also seems that nvidia-smi is unable to find the process running on your GPU in your case. The solution is then to find and kill the processes associated with that GPU by running the following commands, replacing PID with the one reported by fuser:
fuser -v /dev/nvidia*
kill -9 "PID"
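If fuser is not available, a similar lookup can be approximated by scanning the open file descriptors under /proc. This is a sketch rather than a replacement for fuser: it only sees descriptors (not memory mappings), it skips processes whose fd directory is unreadable without sufficient privileges, and the function name and `proc_root` parameter are assumptions made for the example.

```python
import os


def pids_using_device(device, proc_root="/proc"):
    """Return PIDs that hold an open file descriptor on `device`."""
    pids = []
    for name in os.listdir(proc_root):
        if not name.isdigit():
            continue  # not a process directory
        fd_dir = os.path.join(proc_root, name, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or we lack permission
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # descriptor was closed in the meantime
            if target == device:
                pids.append(int(name))
                break
    return sorted(pids)
```

`pids_using_device("/dev/nvidia2")` would return the PIDs to review before sending any of them a kill signal.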
System Information:
Debian 4.5.5
TF installed from binary (pip3 install tensorflow-gpu==1.0.1 --user)
TF version: v1.0.0-65-g4763edf-dirty 1.0.1
Bazel version: N.A.
CUDA 8.0 cuDNN v5.1
Steps to reproduce
Make a directory and download the following files into it:
training.py run.sh
Run the command ./run.sh to simply reproduce this issue.
Detailed descriptions for the bug
Recently, I tried to deploy synchronous distributed TensorFlow training on the cluster. I followed the tutorial and the inception example to write my own program. The training.py is from another user's implementation, which follows the same API usage as the official example. I modified it so that it runs on a single machine with multiple GPUs by making the processes communicate through localhost and mapping each worker to see only one GPU.
The run.sh launched three processes. One of them is the parameter server and the others are two workers implemented by between-graph replication. I created the training supervisor by tf.train.Supervisor() to manage multiple sessions in the distributed training for the initialization and synchronization.
I expected these two workers to synchronize on each batch and work in the same epoch. However, worker 0, which is launched before worker 1, completed the whole training set without waiting for worker 1. After that, the worker 0 process finished training and exited normally, while worker 1 behaved as if it had fallen into a deadlock and stayed at near 0% CPU and GPU utilization for several hours.
Based on my observation, I suspect these two workers didn't communicate and synchronize at all for the data they passed. I report this problem as a bug because I create the optimizer tf.train.SyncReplicasOptimizer as suggested by the official website and the inception example. However, it seems that the synchronization behaviors, if any, are very strange and the program can not exit normally.
Source code / logs
Two files:
training.py: This file contains the source code for the parameter server and workers created to use synchronous distributed optimizers (tf.train.SyncReplicasOptimizer).
run.sh: This file launched the parameter server and the workers.
Log:
Please reproduce by following the steps above and look at worker_0_log and worker_1_log.
I am working on uclinux porting on coldfire board M5272C3. Right now I have kernel running from RAM with romfs as my rootfile system.
I am not clear about what a few terms mean and when to use them.
Please explain them to me in the simplest possible manner:
Q1: What is initrd? Why do we need it?
Q2: What is a ramdisk? Why and where do we need it?
Q3: What is initramfs? Why and where do we use it?
Q4: What is ramfs? Why and where do we use it?
Also, please point me to a document or reference book for in-depth knowledge of these terms.
Thanks
Phogat
A ramdisk merely refers to an in-memory disk image. It is implemented using the ramfs VFS driver in the kernel. The contents of the ramdisk would be wiped on the next reboot or power-cycle.
I'll give you details about initrd and initramfs next.
In simple terms, both initrd and initramfs refer to an early-stage userspace root filesystem (aka rootfs) that lets you run a very minimal filesystem in memory.
The documentation at Documentation/filesystems/ramfs-rootfs-initramfs.txt, part of the Linux kernel source tree, also gives a lengthy description of what these are.
What is initrd ?
One common case where there is the need for such an early-stage filesystem is to load driver modules for hard disk controllers. If the drivers were present on the hard drive, it becomes a chicken-and-egg problem. Having these drivers as part of this early-stage rootfs helps the kernel load the drivers for any detected hard disk controllers, before it can mount the actual root filesystem from the hard drive. Another solution to this problem would be to have all the driver modules built into the kernel, but you're going to increase the size of the kernel binary this way. This kind of filesystem image is commonly referred to as initrd. It is implemented using either ramfs or tmpfs. It is emulated using a loopback block device.
The bootloader loads the kernel image into a memory address, the initrd image into another memory address, and tells the kernel where to find the initrd, passes the boot arguments to the kernel, and passes control to the kernel to let it continue the boot process.
So how is it different from initramfs then ?
initramfs is an even earlier-stage filesystem than initrd, and it is built into the kernel (controlled by the kernel config, of course).
As far as I know, both initrd and initramfs are controlled by this single kernel config, but it could have been changed in the recent kernels.
config BLK_DEV_INITRD
I'm not going deep into how to build your own initramfs, but I can tell you it just uses cpio format to store the files and can be configured using usr/Kconfig while building the kernel. Even if you do not specify your own initramfs image, but have turned on support for initramfs, kernel automatically embeds a very simple initramfs containing /dev/console, /root and some other files/directories.
In addition there is also a newer tmpfs filesystem which is commonly used to implement in-memory filesystems. In fact newer kernels implement initrd using tmpfs instead of ramfs.
UPDATE:
Just happened to stumble upon a similar question
This might also be useful
I am using QEMU to emulate Linux x86-64. In the QEMU virtual machine, I am using
taskset -c 0 prc1 & taskset -c 1 prc2 & taskset -c 2 prc3 & taskset -c 3 prc4;
to launch 4 processes simultaneously and bind them to four cores (prc is short for process). However, I find that once they are running, some cores (say 1 and 2) at times do not execute those processes but either idle or do something else. Can you suggest what the reason for this could be, or a way to make sure the processes don't migrate from one core to another?
The processes aren't migrating from one core to another. Whenever they need CPU, they will only get the core you bound them to. That won't prevent the CPUs from doing other work, nor will it somehow force a process to use a core even when it cannot run, say because it's waiting for I/O.
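To check this from inside a process, the affinity mask can be read back after it is set. The sketch below uses Python's os.sched_setaffinity and os.sched_getaffinity, which are available on Linux only; the helper name `pin_to_cpu` is made up for the example. It shows that the binding is just the set of cores the scheduler is allowed to use, not a guarantee that the core is kept busy.

```python
import os


def pin_to_cpu(cpu, pid=0):
    """Restrict `pid` (0 means the calling process) to a single core.

    Returns the affinity mask reported by the kernel afterwards, i.e. the
    set of cores the process is allowed to run on.
    """
    os.sched_setaffinity(pid, {cpu})
    return os.sched_getaffinity(pid)
```

Calling `pin_to_cpu(1)` has roughly the same effect on the current process as starting it with `taskset -c 1`: if the process later blocks on I/O, core 1 is free to idle or run something else.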