Keras model generates NaN on predict on GCP - tensorflow

I have a Keras model (YOLOv3) and I want to run inference on several images in a loop.
I am using Debian 10, TensorFlow 2.4, CUDA 11, and a Tesla K80 GPU on a GCP virtual machine.
Here is the code on Colab: code
Note that the same code runs fine on my local GPU (an RTX 2070) with TensorFlow 2.3.
Running exactly the same code on the GCP instance fails at the second iteration: predict() returns NaN values.
I checked the input images (no infinite or NaN values), tried a list of copies of the same image, and so on; it still gives the same result.
I also checked the model weights and they are identical across iterations (roughly the check sketched below).
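(This is roughly the check I mean; only a sketch, with image_batch and model standing in for my actual preprocessed batch and the loaded YOLOv3 model.)
import numpy as np

def check_finite(image_batch, model):
    # Input sanity check: every pixel in the batch should be a finite number.
    assert np.all(np.isfinite(image_batch)), "input contains NaN or inf"
    # Weight sanity check: every weight tensor should be finite
    # (and identical across iterations, which is what I observe).
    for w in model.get_weights():
        assert np.all(np.isfinite(w)), "model weights contain NaN or inf"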
Can anyone from the GCP team help?

Related

Tensorflow Object Detection API GPU memory issue

I'm currently trying to train a model from the object detection model zoo. Running the setup on the CPU works as expected, but the same setup on my GPU results in the following error.
2021-03-10 11:46:54.286051: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2021-03-10 11:46:54.751423: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2021-03-10 11:46:54.764147: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2021-03-10 11:46:54.764233: W tensorflow/stream_executor/stream.cc:1455] attempting to perform BLAS operation using StreamExecutor without BLAS support
Monitoring the GPU in Task Manager, it seems that TensorFlow (as expected, as far as I understand it) tries to allocate all of the memory. Shortly after reaching a peak (roughly 7.3 GB of 8 GB), TF crashes with the error shown in the snippet above.
Solutions for this specific error on the internet / Stack Overflow suggest enabling dynamic memory growth (sketched just below). Doing this seems to help, and TF manages to create at least one new checkpoint, but in the end it crashes with an error of a similar category, in this case CUDA_OUT_OF_MEMORY.
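(For reference, what I mean by enabling dynamic memory growth is roughly the following TF 2.x snippet; a minimal sketch that has to run before any GPU work starts.)
import tensorflow as tf

# Ask TensorFlow to grow GPU memory on demand instead of reserving it all
# up front; this must run before the GPU is first used.
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)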
System Information:
Ryzen 5
16 GB RAM
RTX 2060 Super with 8Gb VRAM
Training Setup:
Tensorflow 2.4
CUDA 11.0 (also tried several combinations of CUDA cuDNN versions)
cuDNN 8.0.4
Originally I wanted to use the pretrained EfficientDet D6 model, but I also tried several others such as EfficientDet D4, CenterNet HourGlass 512x512 and SSD MobileNet V2 FPNLite. All of these models were started with different batch sizes, but even with a batch size of 1 the problem still occurs. The training images aren't large either (on average 600 x 800). Currently there are a total of 30 images, 15 per class, for training (I'm aware that the training data set should be bigger, but it's just to test the setup).
Now my question is whether anybody has an educated guess or another approach for finding the cause of this error, as I cannot imagine that my 2060 isn't capable of at least training an SSD with a batch size of 1 and rather small images. Could it be a hardware fault? If so, is there a way to check that?
I've done a complete reinstallation of every component involved. I might have done something differently this time, but I cannot say what. At least I'm now able to use the GPU for training.

StyleGAN 2 images completely black after Tick 0

I am training StyleGAN 2 on my own dataset - https://github.com/NVlabs/stylegan2
It works fine on a single P100 in Google Colab, but when I move the model to Vast.ai and try it on multiple GPUs, an odd issue happens.
Everything works up to Tick 0, and after Tick 1, the fake images all come out completely black.
My environment:
Tensorflow 1.15
CUDA 10.0
My training command:
python3 run_training.py --num-gpus=4 --data-dir="/root/data/" --config=config-f --dataset=images1_tf --mirror-augment=true --metrics=none
In rare instances it works and generates proper fakes, but if I interrupt the training with ^C and resume, it starts generating the all-black images.
I have tried changing datasets and different machine instances, but the problem persists.
I had the exact same problem with 2 GPUs (GTX 1080 8 GB cards in my case) running TensorFlow 1.15 and CUDA 10.2... It would train for exactly 1 tick, as you mentioned, and then all subsequent fakes would be pure black images. On a whim, I upgraded my NVIDIA driver from 440 to 450, which also bumped CUDA up to 11. It then began working and generating proper images after tick 1.

TF Keras NaN Loss when using multiple GPUs

System:
Ubuntu 18.04 LTS
(2) NVIDIA GTX 1080Ti GPUs 11GB
Driver Version: 440.33.01
CUDA Version: 10.0
I am currently using Tensorflow 2.0 (Python) and the tf.keras library to train a CNN.
However, I am encountering an issue when I try to train my model by calling model.fit(). After I begin training, the loss is normal for the first 1-2 steps of the first epoch, but after that it suddenly becomes NaN. If I try to stop the kernel that is running the training script, the whole computer freezes.
This issue only happens when using multiple GPUs; the code works perfectly fine on a single GPU. I have wrapped all of my code inside the scope of a tf.distribute.MirroredStrategy using with strategy.scope(), roughly as sketched below. I am feeding my network with data from a tf.data.Dataset (though the error occurs regardless of the data I use to train).
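(The structure is roughly the following; a minimal sketch with a toy placeholder model and placeholder dataset name, not my actual CNN.)
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # picks up all visible GPUs
with strategy.scope():
    # Toy placeholder network; the real model is a CNN.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(16,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
# model.fit(train_dataset, epochs=10) is then where the loss turns to NaN.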
I then ran some tests:
1) I tried replacing the data in my dataset with random numbers drawn from a distribution, but the loss still went to NaN.
2) I also tried feeding the numpy arrays directly to .fit(), but that didn't solve the issue.
3) I tried using different optimizers (Adam, RMSprop, SGD), batch sizes (4, 8, 16, 32), and learning rates, none of which helped to solve this problem.
4) I swapped out my network for a simple Multi-layer Perceptron, but the error persisted.
This doesn't appear to be an OOM issue, since the data is relatively small and running watch -n0.1 nvidia-smi reveals that memory usage never exceeds 30% on either of my GPUs. There doesn't seem to be any warning or error in the console output that might hint at the issue either.
Any help is appreciated.

Can we run training and validation on separate GPUs using tensorflow object detection API running on tensorflow 1.12?

I have two NVIDIA Titan X cards on my machine and want to fine-tune a COCO-pretrained Inception V2 model on a single specific class. I have created the train/val TFRecords and changed the config to run the TensorFlow object detection training pipeline.
I am able to start training, but it hangs (without any OOM) whenever it tries to evaluate a checkpoint. Currently it is using only GPU 0, with the other resources (RAM, CPU, IO, etc.) in the normal range, so I am guessing that the GPU is the bottleneck. I wanted to try splitting training and validation onto separate GPUs to see if that works.
I looked for a place where I could set "CUDA_VISIBLE_DEVICES" differently for the two processes (along the lines of the sketch below), but unfortunately the latest TensorFlow Object Detection API code (using TensorFlow 1.12) makes it very difficult to do so. I am also unable to verify my assumption that training and validation run in the same process, since my machine hangs. Could someone please suggest where to look to solve this?
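(What I had in mind is something like the following, one process per GPU; only a sketch, and it assumes training and evaluation can be launched as two separate Python processes, which is exactly what I could not find a clean way to do in the TF 1.12 code.)
import os

# Pin the current process to one GPU; the variable must be set before
# TensorFlow is imported anywhere in this process.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # would be "1" in the evaluation process

import tensorflow as tf  # TensorFlow now only sees the selected GPU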

Running multiple GPUs in Theano Jupyter notebooks, implementing theano.gpuarray.use

I have a Linux system with three GPUs. I am using Keras with Theano to run CNNs. In the past, when I was using Theano 0.8.x, I was able to assign a particular GPU to a Jupyter notebook window using the following:
import theano.sandbox.cuda
theano.sandbox.cuda.use("gpu2")
This allowed me to run three versions of the same CNN model with different hyper-parameters.
I very recently updated both Keras (to 2.0) and Theano (to 0.9). This required me to set up the gpuarray backend.
Running just one Jupyter notebook with a model works fine, and gpu1 is selected by Theano. However, when I start up a second notebook with the same model, Theano tries to use the GPU assigned to the first notebook, causing a memory problem and ultimately forcing the CNN model to run on the CPU rather than on one of the two remaining GPUs.
Is there a way to select the GPU I want each Jupyter notebook to run on in Theano 0.9, as I was able to in Theano 0.8.x?
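(For comparison, my understanding is that the gpuarray backend names devices cuda0/cuda1/cuda2; below is a sketch of the selection approaches I am aware of, not verified in the multi-notebook case.)
# Select the device through THEANO_FLAGS before Theano is first imported.
import os
os.environ["THEANO_FLAGS"] = "device=cuda2"  # "cuda2" is only an example name
import theano

# Alternative (untested here): bind explicitly via the gpuarray backend:
#   import theano.gpuarray
#   theano.gpuarray.use("cuda2")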