MXNet on Colab Pro: allocating memory on the GPU gets stuck

I bought Colab Pro to train a model written with MXNet. In the standard version of Colab with a GPU, everything works fine. But in the Pro version, a simple array allocation on the GPU runs forever: mx.nd.array([1, 2, 3], ctx=mx.gpu(0)).
Versions:
CUDA 10.1 installed via !apt-get -y install cuda-10-1
MXNet with CUDA 10.1 support installed via %pip install mxnet-cu101==1.5.0
The cell started executing at 13:50 and has now been running for 6 minutes. Six minutes to allocate an array of 3 values? Why is this happening?
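For anyone debugging this, here is a minimal sanity check; it uses only stock MXNet 1.5 APIs and forces the normally asynchronous allocation to block, so either the hang or a clear CUDA error surfaces immediately:

import mxnet as mx

# 0 here would explain the hang: MXNet cannot see the GPU at all
# (e.g. a mismatch between the driver and the cuda-10-1 toolkit).
print(mx.context.num_gpus())

a = mx.nd.array([1, 2, 3], ctx=mx.gpu(0))
a.wait_to_read()  # MXNet ops are asynchronous; this forces the copy to finish
print(a)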

Related

TensorFlow for GPU without Conda not working

I wanted to use TensorFlow with GPU support for my current project, so I followed a YouTube tutorial to install it without Conda (https://www.youtube.com/watch?v=-Q6SM_usn84).
It turns out there is no CUDA 11.2 for Windows 11, so I installed the latest version (CUDA 12) and followed the rest of the tutorial.
I then created a virtualenv with Python 3.10.10 and installed TensorFlow using:
pip3 install --upgrade https://storage.googleapis.com/tensorflow/windows/cpu/tensorflow_cpu-2.11.0-cp310-cp310-win_amd64.whl
It installed successfully and imports, but it does not list my GPU as an available physical device.
After a bit of research I came across https://forums.developer.nvidia.com/t/how-do-i-install-cuda-11-0-on-windows-11-not-wsl2-windows-itself/192251 and reinstalled the Windows 10 build of CUDA 11.2 on my Windows 11 machine. Everything is still the same.
It imports, but still does not list my GPU as an available physical device.
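Worth checking first: the wheel URL above points at the tensorflow_cpu build, which can never expose a GPU regardless of the CUDA setup. A quick check using stock TensorFlow APIs distinguishes "CPU-only wheel" from "broken CUDA install":

import tensorflow as tf

# False means the installed wheel was compiled without CUDA support,
# so no driver/CUDA/cuDNN setup can make a GPU appear.
print(tf.test.is_built_with_cuda())
print(tf.config.list_physical_devices('GPU'))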

TensorFlow loss function is NaN when using GPU

I am trying to train a custom object-detection model using a pre-trained model from the TensorFlow 1 Model Zoo.
I am using the model ssd_mobilenet_v2_coco_2018_03_29.
I created a suitable environment for training by following this tutorial: https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/tensorflow-1.14/training.html
The thing is, when I tried to train the model using tensorflow-gpu==1.14.0, I always got an error saying Model diverged with loss = NaN.
Then I uninstalled tensorflow-gpu==1.14.0 and installed tensorflow==1.14.0 (so it did not use my GPU), and all of a sudden it started to work!
I have no idea how that is possible...
The command I am using:
python model_main.py --alsologtostderr --model_dir=models\ssd_mobilenet_v2_coco_2018_03_29\export --pipeline_config_path=models\ssd_mobilenet_v2_coco_2018_03_29\pipeline.config --num_train_steps=2000
Python version is 3.7
OS is Windows 10
My graphics card is an NVIDIA GeForce RTX 3050; I used CUDA v10.0 and cuDNN v7.4.1.
Any ideas?
This is because RTX 30-series cards don't support CUDA 10. If you need TF v1 (1.15), you can install NVIDIA's TensorFlow 1.15 build, which runs on CUDA 11.
pip install nvidia-pyindex
pip install nvidia-tensorflow[horovod]
Note: it only supports Python 3.6 or 3.8 (not 3.7).
https://developer.nvidia.com/blog/accelerating-tensorflow-on-a100-gpus/
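After installing the NVIDIA build, a minimal TF 1.x check (standard tf.test APIs) confirms the GPU is actually usable:

import tensorflow as tf

# Both are TF 1.x APIs; an empty device name means the GPU build
# loaded but could not initialize CUDA.
print(tf.test.is_gpu_available())
print(tf.test.gpu_device_name())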

What is the proper configuration for a Quadro RTX 3000 to run TensorFlow with GPU?

My laptop runs Windows 10 with an NVIDIA Quadro RTX 3000 GPU.
While trying to set up TensorFlow with GPU support, it never recognizes my GPU.
What is the proper configuration for CUDA/cuDNN/TensorFlow, etc.?
I struggled for a while before getting it to work.
Here is my configuration:
Win10
RTX 3000
Nvidia driver version 456.71
cuda_11.0.3_451.82_win10 (doesn't work with the 11.1 version, not sure why)
cuDNN v8.0.4.30
Python 3.8.7
TensorFlow 2.5.0-dev20210106 (2.4 doesn't support CUDA 11.x)
For future reference, you could have simply installed Anaconda on Windows and run conda install -c anaconda tensorflow-gpu, which installs matching versions of CUDA, cuDNN, and TensorFlow in a separate environment.
It's the easiest solution, one that works out of the box and automates all of the setup.
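Whichever route you take, a short check with plain TF 2.x APIs shows whether ops really land on the GPU once the environment is set up:

import tensorflow as tf

tf.debugging.set_log_device_placement(True)  # log the device chosen for each op
a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
b = tf.matmul(a, a)  # should be placed on /device:GPU:0 if the setup works
print(b)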

TensorFlow GPU build on a CPU-only machine

Can the TensorFlow GPU build be used on CPU-only machines?
I am getting errors with TensorFlow 1.5 about CUDA libraries not being found.
Has anybody tried this?
tensorflow-gpu 1.5 and above requires the CUDA 9 libraries, and CUDA 9 in turn requires a CUDA-capable GPU, so the GPU build will not run on a CPU-only machine. You should install the CPU version of TensorFlow instead.
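If you switch to the CPU package, the same TF 1.x code runs unchanged; a minimal sketch that pins an op to the CPU explicitly:

import tensorflow as tf

# TF 1.x style: works with the CPU-only package, no CUDA libraries needed.
with tf.device('/cpu:0'):
    x = tf.constant([1.0, 2.0, 3.0])
    y = x * 2

with tf.Session() as sess:
    print(sess.run(y))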

Keras with TensorFlow backend on GPU. MKL ERROR: Parameter 4 was incorrect on entry to DLASCL

I installed TensorFlow with GPU support and Keras into an Anaconda environment (v1.6.5) using the following commands:
conda install -n EnvName tensorflow-gpu
conda install -n EnvName -c conda-forge keras-gpu
I have an NVIDIA Quadro 2200K on my machine with driver v384.66, CUDA 8.0, and cuDNN 7.0.
When I run Python code with Keras, at the training stage I get the following:
Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.
and later
File "/home/User/anaconda3/envs/keras_gpu/lib/python3.6/site-packages/numpy/linalg/linalg.py", line 99, in _raise_linalgerror_svd_nonconvergence
    raise LinAlgError("SVD did not converge")
numpy.linalg.linalg.LinAlgError: SVD did not converge
Other sources suggest checking the data for NaNs and Infs, but my data is definitely clean. Also, the CPU version of the installation works fine; the issue occurs only when running on the GPU.
I tried reinstalling Anaconda, CUDA, and numpy, but it didn't help.
The problem was the mkl package (2018.0.0): it had recently been released and conflicts with the versions of some packages supplied with TensorFlow (1.3.0) and Keras (2.0.5) via conda.
So I manually downgraded mkl to v11.3.3 using Anaconda Navigator, which automatically downgraded the other affected packages, and everything works well now.
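For reference, the command-line equivalent of that Navigator downgrade (assuming the environment is the EnvName from the question) would be:

conda install -n EnvName mkl=11.3.3

and numpy can confirm which MKL build it is actually linked against:

import numpy as np
np.__config__.show()  # prints the BLAS/LAPACK (MKL) configuration numpy was built with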