CUDA_ERROR_NOT_INITIALIZED on A100 after server reset

I'm running on a server with an A100 GPU. When I try to run TensorFlow code after a server reset, TensorFlow does not recognize the GPU. Running tf.config.list_physical_devices('GPU') yields CUDA_ERROR_NOT_INITIALIZED:
2021-09-09 07:41:42.956917: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2021-09-09 07:41:43.899014: E tensorflow/stream_executor/cuda/cuda_driver.cc:313] failed call to cuInit: CUDA_ERROR_NOT_INITIALIZED: initialization error
2021-09-09 07:41:43.899148: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: f42a3aa12bd1
2021-09-09 07:41:43.899169: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: f42a3aa12bd1
2021-09-09 07:41:43.899890: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 460.32.3
2021-09-09 07:41:43.899955: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 460.32.3
2021-09-09 07:41:43.899969: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:310] kernel version seems to match DSO: 460.32.3
Running nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-PCIE-40GB      Off  | 00000000:00:06.0 Off |                   On |
| N/A   46C    P0    40W / 250W |      0MiB / 40536MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  No MIG devices found                                                       |
+-----------------------------------------------------------------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Why do I get CUDA_ERROR_NOT_INITIALIZED? The server ran perfectly well before the reset, and nvidia-smi is clearly working.

It seems NVIDIA Multi-Instance GPU (MIG) is enabled on your GPU, but you haven't defined any GPU instances. This can be seen from the fact that nvidia-smi shows a MIG devices table, but it's empty (No MIG devices found).
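The check described above can be sketched in Python. This is only a heuristic under stated assumptions: it matches two strings that appear in the nvidia-smi output shown in the question ("Enabled" in the MIG M. column and "No MIG devices found"), and the human-readable layout may differ between driver versions. The function name mig_enabled_without_devices is hypothetical.

```python
def mig_enabled_without_devices(smi_text: str) -> bool:
    """Heuristic: True if nvidia-smi text shows MIG mode on but no MIG devices.

    Assumes the layout from the question's output; not an official API.
    """
    lines = [line.strip() for line in smi_text.splitlines()]
    # The per-GPU block ends its "MIG M." row with "Enabled |" when MIG mode is on.
    mig_on = any(line.endswith("Enabled |") for line in lines)
    # An empty MIG devices table contains this literal message.
    no_devices = any("No MIG devices found" in line for line in lines)
    return mig_on and no_devices


if __name__ == "__main__":
    # Two lines taken from the nvidia-smi output in the question.
    sample = "| | | Enabled |\n| No MIG devices found |"
    print(mig_enabled_without_devices(sample))
```

In practice the text would come from running nvidia-smi and capturing its standard output.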
The MIG documentation states:
Without creating GPU instances (and corresponding compute instances),
CUDA workloads cannot be run on the GPU. In other words, simply
enabling MIG mode on the GPU is not sufficient. Also note that, the
created MIG devices are not persistent across system reboots. Thus,
the user or system administrator needs to recreate the desired MIG
configurations if the GPU or system is reset.
You probably had a MIG configuration defined before the reset, but the server reset removed that configuration. You need to re-configure the GPU instances to get the GPU working again. If you just want a basic configuration, in which you have only one GPU instance that uses all the resources, you can run:
sudo nvidia-smi mig -cgi 0 -C
If you need a fancier configuration than that, you should consult the documentation.
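Because MIG devices do not survive a reboot, it can help to wrap the command in a small script that runs at startup. The following is a minimal sketch, not a definitive implementation: the function name and the dry_run parameter are hypothetical, and the only command it issues is the one given above. It assumes nvidia-smi is on the PATH and that sudo does not prompt for a password.

```python
import subprocess
from typing import List


def recreate_full_gpu_mig_instance(dry_run: bool = True) -> List[str]:
    """Build (and optionally run) the basic single-instance MIG command."""
    # Same command as in the answer: create GPU instance profile 0 and,
    # via -C, the matching compute instance.
    cmd = ["sudo", "nvidia-smi", "mig", "-cgi", "0", "-C"]
    if not dry_run:
        # check=True raises CalledProcessError if nvidia-smi reports a failure.
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # Dry run by default: only print the command that would be executed.
    print(" ".join(recreate_full_gpu_mig_instance(dry_run=True)))
```

Calling it with dry_run=False executes the command; keeping the default makes it safe to try on a machine without a MIG-capable GPU.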
After configuring the GPU instances, nvidia-smi should show a populated MIG devices table. In our case, it should have one entry:
+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  0    0   0   0  |      0MiB / 40536MiB | 98      0 |  7   0    5    1    1 |
|                  |      1MiB / 65536MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+

Related

Google Colab not detecting GPU with 'sudo'

I've been using Google Colab to produce Blender renders for a few months, but today my scripts stopped working without any changes. I run my scripts with sudo and for whatever reason Google Colab is not giving GPU access to commands run with sudo.
This is the output for nvidia-smi:
/content# nvidia-smi
Fri Jun 3 13:46:21 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 Off | 00000000:00:04.0 Off | 0 |
| N/A 40C P8 9W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
But the same command throws an error if run with sudo:
/content# sudo nvidia-smi
Failed to initialize NVML: Driver/library version mismatch
sudo is important for me because my Blender commands, for whatever reason, don't work without sudo.
/content# ./blender-3.1.0-linux-x64/blender -b --python-console -noaudio
src/tcmalloc.cc:283] Attempt to free invalid pointer 0x7fe46122b000
Aborted (core dumped)

CUDA_ERROR_UNKNOWN: unknown error in Colab instance

I'm connected to Google Colab through SSH (using this method). I get the following error when trying to use the GPU.
python lstm_example.py
Num GPUs Available: 1
(25000,)
(25000,)
2022-03-21 12:43:53.301917: W tensorflow/stream_executor/cuda/cuda_driver.cc:374] A non-primary context 0x559ed434d210 for device 0 exists before initializing the StreamExecutor. The primary context is now 0. We haven't verified StreamExecutor works with that.
2022-03-21 12:43:53.302331: F tensorflow/core/platform/statusor.cc:33] Attempting to fetch value instead of handling error INTERNAL: failed initializing StreamExecutor for CUDA device ordinal 0: INTERNAL: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_UNKNOWN: unknown error
Aborted (core dumped)
GPU info
nvidia-smi
Mon Mar 21 13:00:24 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.54 Driver Version: 460.32.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 00000000:00:04.0 Off | 0 |
| N/A 50C P0 59W / 149W | Function Not Found | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
/usr/local/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0
I've added the following lines
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"]="0"
The same code works when run in a notebook cell. I also notice that Memory-Usage is reported when running nvidia-smi from a notebook, and that the CUDA version shown is different (11.2).
Tue Mar 22 10:52:20 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 00000000:00:04.0 Off | 0 |
| N/A 43C P8 31W / 149W | 3MiB / 11441MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-

Using GPU for reinforcement learning with Keras

I am using this code (please excuse its messiness) to run on my CPU. I have a custom RL environment that I have created myself and I am using DQN agent.
But when I run this code on GPU, it doesn't utilize much of it and in fact it is slower than my CPU.
This is the output of nvidia-smi. As you can see my processes are running on GPU but the speed is much slower than I would expect.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.82 Driver Version: 440.82 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN Xp Off | 00000000:00:05.0 Off | N/A |
| 23% 37C P2 60W / 250W | 11619MiB / 12196MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 TITAN Xp Off | 00000000:00:06.0 Off | N/A |
| 23% 29C P8 9W / 250W | 157MiB / 12196MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 25540 C python3 11609MiB |
| 1 25540 C python3 147MiB |
+-----------------------------------------------------------------------------+
Can anyone point out what I can do to change my code to take advantage of the GPU?
PS: Notice that I have two GPUs and my process is running on both of them. Even if I use just one of the two GPUs, the GPU is barely utilized and the speed is slower than on the CPU, so having two GPUs is not the issue.

How to access to GPUs on different nodes in a cluster with Slurm?

I have access to a cluster that's run by Slurm, in which each node has 4 GPUs.
I have code that needs 8 GPUs.
So the question is: how can I request 8 GPUs on a cluster where each node has only 4 GPUs?
This is the job that I tried to submit via sbatch:
#!/bin/bash
#SBATCH --gres=gpu:8
#SBATCH --nodes=2
#SBATCH --mem=16000M
#SBATCH --time=0-01:00
But then I get the following error:
sbatch: error: Batch job submission failed: Requested node configuration is not available
Then I changed the settings to this and submitted again:
#!/bin/bash
#SBATCH --gres=gpu:4
#SBATCH --nodes=2
#SBATCH --mem=16000M
#SBATCH --time=0-01:00
nvidia-smi
and the result shows only 4 GPUs, not 8.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66 Driver Version: 375.66 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P100-PCIE... Off | 0000:03:00.0 Off | 0 |
| N/A 32C P0 31W / 250W | 0MiB / 12193MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla P100-PCIE... Off | 0000:04:00.0 Off | 0 |
| N/A 37C P0 29W / 250W | 0MiB / 12193MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla P100-PCIE... Off | 0000:82:00.0 Off | 0 |
| N/A 35C P0 28W / 250W | 0MiB / 12193MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla P100-PCIE... Off | 0000:83:00.0 Off | 0 |
| N/A 33C P0 26W / 250W | 0MiB / 12193MiB | 4% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Thanks.

Slurm does not support what you need. It can only assign GPUs to your job per node, not per cluster.
So, unlike CPUs or other consumable resources, GPUs are not consumable and are bound to the node that hosts them.
If you are interested in this topic, there is a research effort to turn GPUs into consumable resources; check this paper.
There you'll find how to do it using GPU virtualization technologies.

Job script: you are requesting 2 nodes, each with 4 GPUs, so a total of 8 GPUs is assigned to you. You then run "nvidia-smi", which is not aware of Slurm or MPI. It runs only on the first node assigned to you, so it shows only 4 GPUs. That result is normal.
If you run a GPU-based engineering application such as Ansys HFSS or CST, it can use all 8 GPUs.
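To see what each node actually received, one option is to launch a small reporting script on every node through srun instead of calling nvidia-smi directly in the batch script. The sketch below assumes the cluster's GPU plugin exports CUDA_VISIBLE_DEVICES to job steps, which is common but depends on the site configuration. SLURM_PROCID and SLURM_NODEID are standard Slurm variables; the function name describe_task and the file name report_gpus.py are hypothetical.

```python
import os
import socket


def describe_task(env=os.environ) -> str:
    """Summarize which node, task, and GPUs this process was given."""
    host = socket.gethostname()
    # Slurm sets these for each task of a job step; "?" means not under Slurm.
    node = env.get("SLURM_NODEID", "?")
    task = env.get("SLURM_PROCID", "?")
    # Assumption: Slurm's GPU plugin exports the GPUs assigned on this node.
    gpus = env.get("CUDA_VISIBLE_DEVICES", "(unset)")
    return f"host={host} node={node} task={task} gpus={gpus}"


if __name__ == "__main__":
    print(describe_task())
```

Inside the batch script it could be started with something like srun --ntasks-per-node=1 python report_gpus.py, which runs one copy per allocated node, so a 2-node job should print two lines.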

GPU load in tensorflow

I just built TensorFlow v1.0 and I am trying to run the MNIST test just to see if it's working. It seems to be, but I am observing weird behaviour.
My system has two Tesla P100, and nvidia-smi shows the following:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 361.107 Driver Version: 361.107 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P100-SXM2... Off | 0002:01:00.0 Off | 0 |
| N/A 34C P0 114W / 300W | 15063MiB / 16280MiB | 51% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla P100-SXM2... Off | 0006:01:00.0 Off | 0 |
| N/A 27C P0 35W / 300W | 14941MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 67288 C python3 15061MiB |
| 1 67288 C python3 14939MiB |
+-----------------------------------------------------------------------------+
As shown, python3 took all the memory on both GPUs, but the computational load is placed only on the first.
By exporting CUDA_VISIBLE_DEVICES I can limit which GPUs are used, but it doesn't affect the computation time, so there is no gain from adding a second GPU.
Single GPU:
real 2m23.496s
user 4m26.597s
sys 0m12.587s
Two GPUs:
real 2m18.165s
user 4m18.625s
sys 0m12.958s
So the question is, how to load both GPUs?
