Show GPU memory usage and utilization for a Slurm job

I am using Slurm to access GPU resources. Is it possible to show GPU usage for a running Slurm job, just as I would with nvidia-smi in a normal interactive shell?

You can use ssh to log in to your job's node and then run nvidia-smi. It works for me.
For example, I use squeue to check that my job xxxxxx is currently running on node x-x-x. Then I use ssh x-x-x to access that node. After that, you can run nvidia-smi to check the usage of the GPUs.
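A minimal sketch of that workflow (the job ID 123456 and node name node042 are hypothetical, and direct ssh to compute nodes must be permitted by your cluster):

squeue -u $USER          # find your job and the node it is running on
ssh node042              # log in to that node
nvidia-smi               # show GPU memory usage and utilization
watch -n 1 nvidia-smi    # optionally refresh the view every second

On newer Slurm versions you may also be able to attach to the job's own allocation instead of ssh-ing, for example srun --jobid=123456 --overlap --pty nvidia-smi, but whether that is allowed depends on your cluster configuration.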

I suggest launching your application manually in Jupyter and accessing the terminal shell in Jupyter.

Related

Stopping and starting a deep learning google cloud VM instance causes tensorflow to stop recognizing GPU

I am using the pre-built deep learning VM instances offered by Google Cloud, with an NVIDIA Tesla K80 GPU attached. I chose to have TensorFlow 2.5 and CUDA 11.0 automatically installed. When I start the instance, everything works great - I can run:
import tensorflow as tf
tf.config.list_physical_devices()
and the function returns the CPU, accelerated CPU, and the GPU. Similarly, if I run tf.test.is_gpu_available(), the function returns True.
However, if I log out, stop the instance, and then restart it, running the exact same code only sees the CPU, and tf.test.is_gpu_available() returns False. I get an error that looks like the driver initialization is failing:
E tensorflow/stream_executor/cuda/cuda_driver.cc:355] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error
Running nvidia-smi shows that the machine still sees the GPU, but TensorFlow can't see it.
Does anyone know what could be causing this? I don't want to have to reinstall everything every time I restart the instance.
Some people (sadly not me) are able to resolve this by setting the following at the beginning of their script/main:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
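If that workaround helps, an equivalent shell-level sketch is to set the variable before launching your script (the GPU index 0 and the script name train.py are placeholders):

export CUDA_VISIBLE_DEVICES=0    # expose only the first GPU to the process
python train.py                  # launch your own script here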
I had to reinstall the CUDA drivers, and from then on it worked even after restarting the instance. You can configure your system settings on NVIDIA's website and it will provide the commands you need to follow to install CUDA. It also asks you whether you want to uninstall the previous CUDA version (yes!). This is luckily also very fast.
I fixed the same issue with the commands below, taken from https://issuetracker.google.com/issues/191612865?pli=1
gsutil cp gs://dl-platform-public-nvidia/b191551132/restart_patch.sh /tmp/restart_patch.sh
chmod +x /tmp/restart_patch.sh
sudo /tmp/restart_patch.sh
sudo service jupyter restart
Option-1:
Upgrade a Notebooks instance's environment. Refer to the link to upgrade.
Notebooks instances that can be upgraded are dual-disk, with one boot disk and one data disk. The upgrade process upgrades the boot disk to a new image while preserving your data on the data disk.
Option-2:
Connect to the notebook VM via SSH and run the commands in the link.
After execution of the commands, the CUDA version will update to 11.3 and the NVIDIA driver version to 465.19.01 (a quick version check is sketched after this answer).
Restart the notebook VM.
Note: The issue has been solved in the GPU images. New notebooks will be created with image version M74. The new image version is not yet mentioned in the google-public-issue-tracker, but you can find image version M74 in the console.
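A minimal way to confirm the versions mentioned in Option-2 after the upgrade and restart, assuming the CUDA toolkit is on your PATH:

nvidia-smi --query-gpu=driver_version --format=csv,noheader   # should report 465.19.01
nvcc --version                                                # should report release 11.3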

WSL2- $nvidia-smi command not running

I have Ubuntu 18.04 LTS installed inside WSL2 and I was able to use the GPU. I can run
nvidia-smi from the Windows terminal.
However, I get no result when I run nvidia-smi inside WSL2.
The fix is now available in the NVIDIA docs:
cp /usr/lib/wsl/lib/nvidia-smi /usr/bin/nvidia-smi
chmod ogu+x /usr/bin/nvidia-smi
Source: https://docs.nvidia.com/cuda/wsl-user-guide/index.html#known-limitations
From the known limitations in the NVIDIA documentation:
NVIDIA Management Library (NVML) APIs are not supported. Consequently,
nvidia-smi may not be functional in WSL 2.
However, you should be able to run what is described at https://docs.nvidia.com/cuda/wsl-user-guide/index.html#unique_1238660826
EDIT: since this answer was written, nvidia-smi has been supported since driver 465.42.
It works well for me with driver 470.57.02.
When I was installing CUDA 11.7.1 in WSL2, the same kind of error was raised, telling me:
Failed to initialize NVML: GPU access blocked by the operating system
Failed to properly shut down NVML: GPU access blocked by the operating system
I fixed it by updating the Windows system to build 19044 (21H2).
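A quick way to check which build you are on from inside WSL2, assuming Windows interop is enabled (otherwise run winver from Windows itself):

cmd.exe /c ver    # prints the Windows version string, e.g. 10.0.19044.xxxx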

Keep jupyter lab notebook running when SSH is terminated on local machine?

I would like to be able to turn off my local machine while having my code continuously running in jupyter lab and come back to it running later; however, as soon as the SSH session is terminated, the jupyter lab kernel is stopped. My code also stops executing when I close the jupyter lab browser tab.
From the Google Cloud Platform marketplace I'm using a 'Deep Learning VM'. From there, I SSH to it through the suggested gcloud command (Cloud SDK): gcloud compute ssh --project projectname --zone zonename vmname -- -L 8080:localhost:8080. It then opens a PuTTY connection to the VM and automatically has jupyter lab running, which I can now access on localhost.
What can I do to be able to run my code with my local machine off in this case?
I usually use "nohup" when running a jupyter notebook through ssh!
:~$ nohup jupyter notebook --ip=0.0.0.0 --port=xxxx --no-browser &
You can learn more about it here.
Hope it helps!
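A slightly fuller sketch of the same idea (the port 8888 and the log file name are just examples):

nohup jupyter lab --ip=0.0.0.0 --port=8888 --no-browser > jupyter.log 2>&1 &
disown                  # detach the job from the current shell so it survives logout
tail -f jupyter.log     # watch the log to grab the access URL/token

After reconnecting over SSH with the same port forwarding, the server and its running kernels are still there.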
You can use Notebook remote execution.
Basically, your Notebook code will run on a remote machine and the results will be stored there or in GCS for later viewing.
You have the following options:
nbconvert-based options:
nbconvert: Provides a convenient way to execute the input cells of an .ipynb notebook file and save the results, both input and output cells, as a .ipynb file.
papermill: A Python package for parameterizing and executing Jupyter Notebooks; see the sketch after this list. (Uses nbconvert --execute under the hood.)
notebook executor: A tool that can be used to schedule the execution of Jupyter notebooks from anywhere (local, GCE, GCP Notebooks) to the Cloud AI Deep Learning VM. You can read more about the usage of this tool here. (Uses gcloud sdk and papermill under the hood.)
Notebook training tool
A Python package that allows users to run a Jupyter notebook on Google Cloud AI Platform Training Jobs.
AI Platform Notebook Scheduler
This is in Alpha (Beta soon) with AI Platform Notebooks and is the recommended option. It lets you schedule a Notebook for recurring runs; it follows the exact same sequence of steps but requires a crontab-formatted schedule option.
There are other options which allow you to execute Notebooks remotely:
tensorflow_cloud (Keras for GCP): Provides APIs that allow you to easily go from debugging and training your Keras and TensorFlow code in a local environment to distributed training in the cloud.
GCP runner: Allows running any Jupyter notebook function on Google Cloud Platform.
Unlike all the other solutions listed above, it allows you to run training for the whole project, not a single Python file or Jupyter notebook.
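As a small illustration of the papermill option mentioned above (the notebook names and the alpha parameter are hypothetical):

papermill input.ipynb output.ipynb -p alpha 0.1    # executes input.ipynb with alpha=0.1 and saves all cell outputs to output.ipynb

The executed output.ipynb can then be opened later to inspect the results.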

How to use a remote machine's GPU in jupyter notebook

I am trying to run tensorflow on a remote machine's GPU through Jupyter notebook. However, if I print the available devices using tf, I only get CPUs. I have never used a GPU before and am relatively new at using conda / jupyter notebook remotely as well, so I am not sure how to set up using the GPU in jupyter notebook.
I am using an environment set up by someone else who already executed the same code on the same GPU, but they did it via python script, not in a jupyter notebook.
This is the only code in the other person's file that has to do with the GPU:
config = tf.ConfigProto()
config.gpu_options.allow_growth = True   # allocate GPU memory on demand instead of all at once
set_session(tf.Session(config=config))   # make this session the Keras backend session
I think the problem was that I had tensorflow in my environment instead of tensorflow-gpu. But now I get the message "cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version" and I don't know how to update the driver through the terminal.
How is your environment set up? Specifically, what is your remote environment, and what is your local environment? It sounds like your CUDA drivers are out of date, but it could be more than just that. If you are just getting started, I would recommend finding an environment that requires little to no configuration work on your part, so you can get started more easily and quickly.
For example, you can run GPUs in the cloud and connect to them via a local terminal. You can also have your "local" frontend be Colab by connecting it to a local runtime. (This video explains that particular setup, but there are lots of other options.)
You may also want to try running nvidia-smi on the remote machine to see if the GPUs are visible.
Here is another solution that describes how to set up a GPU-enabled JupyterLab instance with Docker.
To update your drivers via the terminal, run:
ubuntu-drivers devices            # list your GPU and the recommended driver
sudo ubuntu-drivers autoinstall   # install the recommended driver
sudo reboot
Are your CUDA paths set appropriately? Like this?
export PATH=/usr/local/cuda/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
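To check that the driver and the CUDA runtime are compatible after fixing the paths (this is what the "CUDA driver version is insufficient for CUDA runtime version" error is about), a minimal sketch:

nvidia-smi       # the header shows the driver version and the highest CUDA version that driver supports
nvcc --version   # shows the CUDA toolkit version; it must not exceed what the driver supports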

100% GPU utilization on a GCE without any processes

I've just started an instance on Google Compute Engine with 2 GPUs (NVIDIA Tesla K80), and straight away after the start I can see via nvidia-smi that one of them is already fully utilized.
I've checked the list of running processes and there is nothing running at all. Does this mean that Google has rented out that same GPU to someone else?
It's all running on this machine:
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04.5 LTS
Release: 16.04
Codename: xenial
Enabling "persistence mode" with nvidia-smi -pm 1 might solve the problem.
ECC in combination with non-persistence mode can lead to 100% GPU utilization.
Alternatively, you can disable ECC with nvidia-smi -e 0.
Note: I'm not sure whether performance is actually worse. I remember that I was able to train an ML model despite the 100% GPU utilization, but I don't know if it was slower.
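A minimal sketch of the commands mentioned above (both typically require root, and disabling ECC only takes effect after a reboot):

sudo nvidia-smi -pm 1    # enable persistence mode on all GPUs
sudo nvidia-smi -e 0     # alternatively, disable ECC
nvidia-smi -q -d ECC     # verify the current and pending ECC mode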
I would suggest reporting this issue on the Google Issue Tracker, as it needs investigation. Please provide your project number and instance name there. Please follow this URL, which lets you file the issue as private in the Google Issue Tracker.