CUDNN_STATUS_INTERNAL_ERROR when using both Pytorch and TensorFlow - tensorflow

I'm running a program to process some data, and I run inference with both a TensorFlow model and a PyTorch model.
When running inference with either of the models on its own, everything works fine. However, as soon as I add the PyTorch import, my program crashes with this error:
2018-05-14 12:55:05.525251: E tensorflow/stream_executor/cuda/cuda_dnn.cc:385] could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2018-05-14 12:55:05.525280: F tensorflow/core/kernels/conv_ops.cc:717] Check failed: stream->parent()->GetConvolveAlgorithms( conv_parameters.ShouldIncludeWinogradNonfusedAlgo<T>(), &algorithms)
Note that this already happens before I do anything with PyTorch. No models are loaded, nothing is put on the GPU, no devices are checked.
Does anyone know what might be going wrong, how to fix it, and if there are some parameters I can change?
Something I already tried is disabling the PyTorch cuDNN backend using this code:
import torch.backends.cudnn as cudnn
cudnn.enabled = False
But unfortunately this does not help...

You'll find references in the NVIDIA forums to cuBLAS not playing well with several Python processes interacting with it at the same time. This is referenced in a year-old issue for TensorFlow, but it should be the same for any application interfacing with the GPU through CUDA - and cuBLAS, to be more specific - whether it goes through TensorFlow, PyTorch, or both. The cuBLAS handles weren't being properly initialized, apparently due to a mixture of issues related to on-disk caching and RAM utilization being too large.
The solution was both to delete the on-disk cache for cuBLAS,
sudo rm -rf ~/.nv
and to restrict the amount of GPU memory the networks are allowed to use.
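On the TensorFlow side, restricting memory in TF 1.x is usually done through the session config. A minimal sketch (the 0.5 fraction is an arbitrary example, not a recommended value):
import tensorflow as tf

# Cap TensorFlow's share of GPU memory so it doesn't grab the whole card,
# leaving room for the PyTorch model on the same GPU.
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.5  # example value
sess = tf.Session(config=config)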

Related

pickle.load cannot open up a (Stylegan2 network) pickle model on my local machine, but can on the cloud

Stylegan2 uses network pickle files to store ML models. I transfer-trained one model, which I am able to open on cloud servers. I have been generating images from this model fine with the following setup:
Google Colab: Python 3.6.9, CUDA 10.1, tensorflow-gpu 1.15, CuDNN 7.6.5
However, I cannot open the network pickle file on my local machine, even though I've been trying to replicate that cloud setup the best I can. (I have the right GPU hardware/drivers/etc.)
Local (Windows 10): Python 3.6.9, CUDA 10.1, tensorflow-gpu 1.15, CuDNN 7.6.5
Requires a library 'dnnlib' to be in the PYTHONPATH and for a tf.Session() to be initialized
I get an assertion error about the pickle.
**Assertion error**: `assert state["version"] in [2,3]`
I find this error very odd because the network pickle works on the cloud, so it was saved properly. Additionally, my local setup can open other network pickles (i.e. ones downloaded from the internet through GET requests), which makes me think that I have properly set up my PYTHONPATH and initialized a tf.Session. These are the prerequisites listed in the Stylegan repo:
"You can import the networks in your own Python code using pickle.load(). For this to work, you need to include the dnnlib source directory in PYTHONPATH and create a default TensorFlow session by calling dnnlib.tflib.init_tf()"
I'm not sure why I cannot open up this pickle in one environment, but can in another. Does anyone have any suggestions as to where I might start looking?
Actually, I figured it out by printing out the version that was failing the assertion. The version printed was '4'. I realized that this matched pickle.HIGHEST_PROTOCOL and that what I needed was the newest pull of the Stylegan2 repository, which includes format version 4 in its allowed versions.
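For anyone who lands here, a minimal loading sketch along the lines of the repo's instructions (the .pkl path is a placeholder, and the unpacking into (G, D, Gs) assumes a standard Stylegan2 network pickle):
import pickle
import dnnlib.tflib as tflib

tflib.init_tf()  # create the default tf.Session() the loader expects

with open('network-snapshot.pkl', 'rb') as f:  # placeholder path
    G, D, Gs = pickle.load(f)  # Stylegan2 pickles typically hold (G, D, Gs)
If the assert still fires, printing state['version'] from the loader, as described above, is the quickest way to see which format version the pickle was written with.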

Blas GEMM launch failed: what does this error mean?

I am having problems executing a simple TensorFlow model that worked well yesterday. I suspect the problem relates entirely to this error:
Blas GEMM launch failed
In the console it says,
tensorflow/core/common_runtime/gpu/gpu_util.cc:343] CPU->GPU Memcpy failed
My impression is that this may relate to my CUDA installation, based on this question:
TensorFlow: Blas GEMM launch failed
however, I can't see how to run the simpleCUBLAS examples. I am completely new to CUDA.
I have 4 1080ti GPUs (Ubuntu 16.04, TensorFlow 1.3.0) and I have not identified any zombie processes taking up GPU memory. Any help is greatly appreciated.
So I found the answer after days of going mad. I first did this:
cd /usr/local/cuda/samples/7_CUDALibraries/simpleCUBLAS
make
./simpleCUBLAS
to check my CUBLAS installation. It returned CUBLAS INITIALIZATION FAILED!!!
So next I did this (based on advice):
sudo rm -rf ~/.nv
And it worked. Hope this saves someone else. Seems easy when you see it.
The other thing that is worth mentioning is that this problem also threw this error occasionally:
tensorflow/stream_executor/cuda/cuda_dnn.cc:385] could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
tensorflow/stream_executor/cuda/cuda_dnn.cc:352] could not destroy cudnn handle: CUDNN_STATUS_BAD_PARAM
tensorflow/core/kernels/conv_ops.cc:605] Check failed: stream->parent()->GetConvolveAlgorithms(&algorithms)
This was cryptic - everybody suggested it was a memory issue and, sure enough, my GPUs got hogged by Python during the initialization of my TF model. But it was the CUBLAS error that led me to the solution.
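For anyone hitting the memory side of this on a multi-GPU box, a small sketch of keeping TF 1.x from pre-allocating everything (the device index '0' is just an example):
import os
import tensorflow as tf

os.environ['CUDA_VISIBLE_DEVICES'] = '0'  # example: expose only the first 1080 Ti
config = tf.ConfigProto()
config.gpu_options.allow_growth = True    # allocate GPU memory lazily instead of all at once
sess = tf.Session(config=config)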

How to run a project Caffe with CPU only

I'm using Ubuntu 14.04 without a GPU and I want to run this code with CPU only (not with a GPU): https://github.com/smallcorgi/Faster-RCNN_TF. What should I do?
The Github repository you are referring to is a TensorFlow implementation of Faster-RCNN, not Caffe.
If you want to use the Caffe implementation, you have to use this repository: https://github.com/rbgirshick/py-faster-rcnn
You have to edit the Python scripts that are used to train and test the model, e.g. train_faster_rcnn_alt_opt.py, so that the line caffe.set_mode_gpu() is replaced by caffe.set_mode_cpu(). You might also have to recompile Caffe after editing the Makefile.config file to remove cuDNN and CUDA support.
Note that Caffe on CPU will be very slow compared to GPU computing.
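As a rough sketch of what the CPU-only path looks like (the file names below are placeholders, not the repo's exact paths), after setting CPU_ONLY := 1 and commenting out USE_CUDNN := 1 in Makefile.config and rebuilding Caffe:
import caffe

# Force CPU execution wherever the scripts would otherwise call caffe.set_mode_gpu()
caffe.set_mode_cpu()

# Placeholder paths - use the prototxt/caffemodel that ship with py-faster-rcnn
net = caffe.Net('faster_rcnn_test.pt', 'VGG16_faster_rcnn_final.caffemodel', caffe.TEST)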

Synchronous distributed tensorflow training runs asynchronously

System Information:
Debian 4.5.5
TF installed from binary (pip3 install tensorflow-gpu==1.0.1 --user)
TF version: v1.0.0-65-g4763edf-dirty 1.0.1
Bazel version: N.A.
CUDA 8.0 cuDNN v5.1
Steps to reproduce
Make a directory and download the following files into it:
training.py run.sh
Run the command ./run.sh to reproduce this issue.
Detailed description of the bug
Recently, I tried to deploy synchronous distributed TensorFlow training on a cluster. I followed the tutorial and the Inception example to write my own program. The training.py is from another user's implementation, which follows the same API usage as the official example. I modified it to run on a single machine with multiple GPUs by making the processes communicate through localhost and mapping each worker to see only one GPU.
The run.sh launches three processes: one is the parameter server and the other two are workers implemented with between-graph replication. I created the training supervisor with tf.train.Supervisor() to manage the sessions in the distributed training for initialization and synchronization.
I expected these two workers to synchronize on each batch and work within the same epoch. However, worker 0, which is launched before worker 1, completed the whole training set without waiting for worker 1. After that, worker 0 finished the training process and exited normally, while worker 1 behaved as if it had fallen into a deadlock, keeping CPU and GPU utilization near 0% for several hours.
Based on my observations, I suspect these two workers didn't communicate or synchronize at all on the data they passed. I am reporting this problem as a bug because I create the optimizer with tf.train.SyncReplicasOptimizer as suggested by the official website and the Inception example. However, the synchronization behavior, if any, is very strange and the program cannot exit normally.
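For comparison, here is a sketch of the TF 1.0-era pattern from the Inception example (the toy loss, placeholder ports, and two-worker layout are assumptions, not the contents of training.py). The chief starting the sync queue runner and filling the token queue is the part that, if missing, often leaves a non-chief worker stuck at 0% utilization:
import sys
import tensorflow as tf

job_name, task_index = sys.argv[1], int(sys.argv[2])
cluster = tf.train.ClusterSpec({
    'ps': ['localhost:2222'],
    'worker': ['localhost:2223', 'localhost:2224']})
server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)

if job_name == 'ps':
    server.join()
else:
    is_chief = (task_index == 0)
    with tf.device(tf.train.replica_device_setter(
            worker_device='/job:worker/task:%d' % task_index, cluster=cluster)):
        global_step = tf.Variable(0, trainable=False, name='global_step')
        w = tf.Variable(1.0)
        loss = tf.square(w - 3.0)  # toy loss just for the sketch

        opt = tf.train.SyncReplicasOptimizer(
            tf.train.GradientDescentOptimizer(0.1),
            replicas_to_aggregate=2,   # wait for gradients from both workers each step
            total_num_replicas=2)
        train_op = opt.minimize(loss, global_step=global_step)

    # The chief must start this queue runner and run init_tokens_op,
    # otherwise the non-chief worker blocks waiting for sync tokens.
    chief_queue_runner = opt.get_chief_queue_runner()
    init_tokens_op = opt.get_init_tokens_op()

    sv = tf.train.Supervisor(is_chief=is_chief,
                             init_op=tf.global_variables_initializer(),
                             global_step=global_step)
    with sv.prepare_or_wait_for_session(server.target) as sess:
        if is_chief:
            sv.start_queue_runners(sess, [chief_queue_runner])
            sess.run(init_tokens_op)
        for _ in range(100):
            sess.run(train_op)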
Source code / logs
Two files:
training.py: This file contains the source code for the parameter server and workers created to use synchronous distributed optimizers (tf.train.SyncReplicasOptimizer).
run.sh: This file launches the parameter server and the workers.
Log:
Please reproduce according to the steps above and look at worker_0_log and worker_1_log.

tensorflow inception for retraining / fine tuning with a pretrained model: inception_train

I tried to retrain (new images, new classes) on top of the pretrained Inception model, so I followed the instructions in the Inception README:
https://github.com/tensorflow/models/tree/master/inception#how-to-construct-a-new-dataset-for-retraining
I successfully built and ran build_image_data using Bazel, as described in the tutorial. Afterwards, I successfully built inception_train using Bazel:
~/tensorflowmodels/models/inception# bazel build inception/inception_train
INFO: Found 1 target...
Target //inception:inception_train up-to-date (nothing to build)
INFO: Elapsed time: 0.073s, Critical Path: 0.00s
However, running bazel-bin/inception/inception_train I always get the following:
~/tensorflowmodels/models/inception# bazel-bin/inception/inception_train --train_dir="/" --validation_dir="/" --data_dir="/images_jpg/" --pretrained_model_checkpoint_path="/tensorflowmodels/models/inception/inception-v3/" --fine_tune=True --initial_learning_rate=0.001 --input_queue_memory_factor=1 --num_gpus=1
-bash: bazel-bin/inception/inception_train: No such file or directory
Naturally I would say there's a 99.9999% chance it's a typo. So then I tried to run inception_train.py with Python directly. I had to change some import locations, and it finally ran with the given parameters. However, the script stops without any error messages after the initialization of the CUDA drivers.
Any help on how to solve this (or perform fine tuning / retraining with inception) would be very much appreciated.
tensorflow version: 0.9rc0
CPU: Xeon 5, 24 cores
GPU: Grid K2 8 GB
OS: Ubuntu 14.04
BTW, I already posted this as a GitHub issue (which was closed, since it would be more of a case for Stack Overflow).