StyleGAN 2 images completely black after Tick 0 - tensorflow

I am training StyleGAN 2 on my own dataset - https://github.com/NVlabs/stylegan2
It works fine on a single P100 in Google Colab, but when I move the model to Vast.ai and try it on multiple GPUs, an odd issue happens.
Everything works up to tick 0, but from tick 1 onward the fake images all come out completely black.
My environment:
Tensorflow 1.15
CUDA 10.0
My training command:
python3 run_training.py --num-gpus=4 --data-dir="/root/data/" --config=config-f --dataset=images1_tf --mirror-augment=true --metrics=none
In rare instances it works and generates proper fakes, but if I interrupt the training with ^C and resume again, it starts generating the all-black images again.
I have tried changing datasets and tried different machine instances, but the problem persists.

I had the exact same problem with 2 GPUs (GTX 1080 8 GB cards in my case) running TensorFlow 1.15 and CUDA 10.2. It would train for exactly 1 tick, as you mentioned, and then all subsequent fakes would be a pure black image. On a whim, I upgraded my NVIDIA driver from 440 to 450, which also bumped CUDA up to 11. It then began working and generating proper images after tick 1.
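After the upgrade, a quick sanity check along these lines (TF 1.15; an illustrative snippet, separate from the StyleGAN 2 code) confirms that TensorFlow actually sees all of the GPUs before restarting training:
import tensorflow as tf
from tensorflow.python.client import device_lib

print("Built with CUDA:", tf.test.is_built_with_cuda())
print("GPU available:", tf.test.is_gpu_available())

# Print every GPU TensorFlow has registered (name, PCI bus id, compute capability).
for dev in device_lib.list_local_devices():
    if dev.device_type == "GPU":
        print(dev.name, "->", dev.physical_device_desc)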

Related

Tensorflow Object Detection API GPU memory issue

I'm currently trying to train a model based on the detection model zoo for object detection. Running the setup on the CPU works as expected, but trying the same on my GPU results in the following error.
2021-03-10 11:46:54.286051: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2021-03-10 11:46:54.751423: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2021-03-10 11:46:54.764147: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2021-03-10 11:46:54.764233: W tensorflow/stream_executor/stream.cc:1455] attempting to perform BLAS operation using StreamExecutor without BLAS support
Monitoring the GPU in Task Manager, it seems that TensorFlow (as expected, as far as I've understood it) tries to allocate the whole memory. Shortly after reaching a certain peak (roughly 7.3 GB of 8 GB), TF crashes with the error seen in the snippet above.
Solutions for this specific error on the internet / Stack Overflow mention that the problem can be solved by allowing dynamic memory growth, as shown below. Doing this seems to work and TF manages to create at least one new checkpoint, but in the end it crashes with an error of a similar category, in that case CUDA_OUT_OF_MEMORY.
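For reference, dynamic memory growth can be enabled like this (standard TF 2.x calls, placed before any GPU op runs):
import tensorflow as tf

# Ask TF to allocate GPU memory on demand instead of reserving
# (nearly) all 8 GB of VRAM up front.
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)

# Alternative: put a hard cap on how much VRAM TF may use (here ~6 GB),
# leaving headroom for the display.
# tf.config.experimental.set_virtual_device_configuration(
#     tf.config.list_physical_devices("GPU")[0],
#     [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=6144)])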
System Information:
Ryzen 5
16 GB RAM
RTX 2060 Super with 8Gb VRAM
Training Setup:
Tensorflow 2.4
CUDA 11.0 (also tried several combinations of CUDA cuDNN versions)
cuDNN 8.0.4
Originally I wanted to use the pretrained EfficientDet D6 model, but I also tried several others like EfficientDet D4, CenterNet HourGlass 512x512 and SSD MobileNet V2 FPNLite. All those models were started with different batch sizes, but even with a batch size of 1 the problem still occurs. The training images aren't large either (on average 600 x 800). Currently there are a total of 30 images, 15 per class, for training (I'm aware that the training data set should be bigger, but it's just to test the setup).
Now my question would be whether anybody has an educated guess or another approach for finding the cause of this error, as I cannot imagine that my 2060 isn't capable of at least training an SSD with a batch size of 1 and rather small images. Could it be a hardware fault? If so, is there a way to check that?
I've done a complete reinstallation of every component involved. I might have done something different this time, but I cannot say what. At least I'm now able to utilize the GPU for training.

Keras model generates Nan on predict on gcp

I have a Keras model (Yolo3) and I want to run inference on several images in a loop.
I am using Debian 10, TensorFlow 2.4, CUDA 11, and a Tesla K80 GPU on a GCP virtual machine.
Here is the code on Colab.
The code runs very well on my local GPU (RTX 2070) with TensorFlow 2.3.
Running exactly the same code on the GCP instance fails at the second iteration, predicting NaN values.
I checked the images (for infinite/NaN values), tried a list of the same image, etc., and it still gives the same error.
I also checked the model weights and found that they stay the same across iterations.
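A minimal sketch of the kind of checks I mean (yolo_model and images are placeholders for my actual model and data):
import numpy as np

# Per-iteration checks: inputs, weights, and predictions are all tested for NaN/Inf.
for i, image in enumerate(images):
    batch = np.expand_dims(image, axis=0).astype("float32")
    assert np.isfinite(batch).all(), f"non-finite input at iteration {i}"

    preds = yolo_model.predict(batch)

    for w in yolo_model.get_weights():
        assert np.isfinite(w).all(), f"non-finite weight at iteration {i}"

    if not np.all([np.isfinite(p).all() for p in preds]):
        print(f"NaN/Inf in predictions at iteration {i}")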
Can anyone from the GCP team help?

TF Keras NAN Loss when using multiple GPUs

System:
Ubuntu 18.04 LTS
(2) NVIDIA GTX 1080Ti GPUs 11GB
Driver Version: 440.33.01
CUDA Version: 10.0
I am currently using Tensorflow 2.0 (Python) and the tf.keras library to train a CNN.
However, I am encountering an issue when I try to train my model by calling model.fit(). After I begin training, the loss is normal for the first 1-2 steps of the first epoch, but then it suddenly becomes NaN. If I try to stop the kernel that is running the training script, the whole computer freezes.
This issue only happens when using multiple GPUs. The code I'm using works perfectly fine on a single GPU. I have wrapped all of my code inside the scope of a tf.distribute.MirroredStrategy using with strategy.scope():. I am feeding my network with data from a tf.data.Dataset (though this error occurs regardless of the data I'm using to train).
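For context, the setup looks roughly like this (a simplified sketch with random data and a tiny model standing in for my real CNN and dataset):
import numpy as np
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

# Random stand-in data, batched as a tf.data.Dataset.
x = np.random.rand(1024, 32).astype("float32")
y = np.random.randint(0, 10, size=(1024,))
dataset = tf.data.Dataset.from_tensor_slices((x, y)).shuffle(1024).batch(32)

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

model.fit(dataset, epochs=2)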
I then ran some tests:
1) I tried to replace the data in my dataset with random numbers from a distribution, but the loss still went to NaN.
2) I also tried feeding the numpy arrays directly to .fit(), but that didn't solve the issue.
3) I tried using different optimizers (Adam, RMSprop, SGD), batch sizes (4, 8, 16, 32), and learning rates, none of which helped to solve this problem.
4) I swapped out my network for a simple Multi-layer Perceptron, but the error persisted.
This doesn't appear to be an OOM issue, since the data is relatively small and running watch -n0.1 nvidia-smi reveals that memory usage never exceeds 30% on either of my GPUs. There doesn't seem to be any warning or error in the console output that might hint at the issue either.
Any help is appreciated

Google Colab GPU speed-up works with 2.x, but not with 1.x

In https://colab.research.google.com/notebooks/gpu.ipynb, which I assume is an official demonstration of GPU speed-up by Google, if I follow the steps, the GPU speed-up (around 60 times faster than with the CPU) using TensorFlow 2.x works. However, if I use version 1.15 as in https://colab.research.google.com/drive/12dduH7y0GPztxSM0AFlfpjj8FU5x8YSv (the only change compared to the notebook from the first link is removing "%tensorflow_version 2.x" both times), tf.test.gpu_device_name() returns the string /device:GPU:0, but there is no speed-up. I would really love to use a TensorFlow version between 1.5 and 1.15, though, as the code I want to run uses functions removed in TensorFlow 2.x. Does anyone know how to use TensorFlow 1.x while still getting the GPU speed-up?
In your notebook the code is not actually executed, since you called neither session.run() nor tf.enable_eager_execution().
Add tf.enable_eager_execution() at the top of your code and you'll see the real difference between CPU and GPU times.
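For example, a rough version of the notebook's convolution benchmark under eager execution (shapes and names are illustrative, not the exact Colab code):
import time
import tensorflow as tf  # TensorFlow 1.15

tf.enable_eager_execution()  # must be called before any other TF op

def conv_benchmark(device, iters=10):
    # Time a batch of 7x7 convolutions on the given device.
    with tf.device(device):
        images = tf.random.normal((100, 100, 100, 3))
        kernel = tf.random.normal((7, 7, 3, 32))
        start = time.time()
        for _ in range(iters):
            tf.nn.conv2d(images, kernel, strides=[1, 2, 2, 1], padding="SAME")
        return time.time() - start

print("CPU (s):", conv_benchmark("/cpu:0"))
print("GPU (s):", conv_benchmark("/device:GPU:0"))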

tensorflow does not recognise 2nd GPU (/gpu:1)

I am trying to use 2 GPUs, but TensorFlow does not recognise the 2nd one. The 2nd GPU is working fine (in a Windows environment).
When I set CUDA_VISIBLE_DEVICES=0 and run the program, I see the RTX 2070 as GPU 0.
When I set CUDA_VISIBLE_DEVICES=1 and run the program, I see the GTX 1050 as GPU 0.
When I set CUDA_VISIBLE_DEVICES=0,1 and run the program, I see the RTX 2070 as GPU 0.
So basically, TF does not recognise GPU 1; it only ever sees one GPU at a time (GPU 0).
Is there any command to manually define GPU1?
I uninstalled and re-installed cuDNN, Python 3.7, TensorFlow and Keras (GPU versions). I am using Anaconda on Windows 10. I tried changing CUDA_VISIBLE_DEVICES to 0,1. I don't see any error, but the 2nd GPU does not appear anywhere in Python.
The main GPU is an RTX 2070 (8 GB) and the 2nd GPU is a GTX 1050 (2 GB). Before submitting I spent some time searching for a solution and did whatever I could find on the internet. Drivers are up to date, and the latest 64-bit versions of the software are installed. I don't see any issue besides the 2nd GPU not appearing.
The code works fine on the first GPU; both GPUs have compute capability > 3.5.
Providing the solution here (Answer Section), even though it is present in the Comments Section (thanks to M Student for sharing the solution), for the benefit of the community.
Adding this at the beginning of the code resolved the issue:
import os
# Lower TF's minimum streaming-multiprocessor threshold so the smaller GTX 1050
# is not filtered out, and expose both cards. Set these before importing TensorFlow.
os.environ["TF_MIN_GPU_MULTIPROCESSOR_COUNT"]="2"
os.environ["CUDA_VISIBLE_DEVICES"]="0,1"