TF Keras NAN Loss when using multiple GPUs - tensorflow

System:
Ubuntu 18.04 LTS
(2) NVIDIA GTX 1080Ti GPUs 11GB
Driver Version: 440.33.01
CUDA Version: 10.0
I am currently using Tensorflow 2.0 (Python) and the tf.keras library to train a CNN.
However, I encounter an issue when I train my model by calling model.fit(). After I begin training, the loss is normal for the first 1 to 2 steps of the first epoch, but then it suddenly becomes NaN. If I try to stop the kernel that is running the training script, the whole computer freezes.
This issue only happens when using multiple GPUs. The code I'm using works perfectly fine on a single GPU. I have wrapped all of my code inside the scope of a tf.distribute.MirroredStrategy using with strategy.scope():. I am feeding my network with data from a tf.data.Dataset (though this error occurs regardless of the data I'm using to train).
I then ran some tests:
1) I tried to replace the data in my dataset with random numbers from a distribution, but the loss still went to NaN.
2) I also tried feeding the numpy arrays directly to .fit(), but that didn't solve the issue.
3) I tried using different optimizers (Adam, RMSprop, SGD), batch sizes (4, 8, 16, 32), and learning rates, none of which helped to solve this problem.
4) I swapped out my network for a simple Multi-layer Perceptron, but the error persisted.
This doesn't appear to be an OOM issue, since the data is relatively small and running watch -n0.1 nvidia-smi reveals that memory usage never exceeds 30% on either of my GPUs. There doesn't seem to be any warning or error in the console output that might hint at the issue either.
Any help is appreciated
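One way to narrow this down is to log the loss after every batch and find the exact step at which it stops being finite. The sketch below is a minimal, framework-free illustration of that check; first_non_finite_step and the sample loss values are hypothetical and not taken from the question. In tf.keras, the built-in tf.keras.callbacks.TerminateOnNaN callback stops training on the same condition.

```python
import math


def first_non_finite_step(losses):
    """Return the index of the first NaN/Inf loss, or None if all are finite."""
    for step, loss in enumerate(losses):
        if not math.isfinite(loss):
            return step
    return None


# Hypothetical per-batch losses: normal for two steps, then NaN.
history = [2.31, 2.28, float("nan"), float("nan")]
step = first_non_finite_step(history)
if step is not None:
    print(f"loss became non-finite at step {step}")
```

Knowing the exact step makes it easier to compare the single-GPU and multi-GPU runs batch by batch.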

Related

Keras model generates Nan on predict on gcp

I have a Keras model (Yolo3) and I want to run it on several images in a loop.
I am using Debian 10, TensorFlow 2.4, CUDA 11, and a Tesla K80 GPU on a GCP virtual machine.
Here is the code on Colab.
Note that the code runs fine on my local GPU (RTX 2070) with TensorFlow 2.3.
Running exactly the same code on the GCP instance fails at the second iteration: the model predicts NaN values.
I checked the image for infinite and NaN values, tried a list containing the same image repeated, etc., but it still gives the same error.
I also inspected the model weights and found that they are identical across iterations.
Can anyone from the GCP team help?

Keras getting frozen when using regularizer in CNN model

I had a custom CNN implementation in Keras running with the TensorFlow backend. To improve generalizability, I was working on adding regularization to the CNN model. The model works fine without any activity/kernel regularization. The moment I add an activity/kernel regularizer, the model freezes: training typically stops between batches/iterations of a single epoch (e.g. at batch 67/172). The issue is reproducible on my system, and I was able to localize it to the regularization. It was strange to see this behavior, and I could not find similar issues reported by others. If someone can tell me what additional information is needed, I would be more than happy to provide it; any guidance on the issue would be greatly appreciated.
Some helpful information about the libraries/dependencies:
Keras 2.4.3
Tensorflow 2.3.1
GPU: NVIDIA 1070 TI (8GB)
cudart64_101.dll was successfully opened
The code was written in Spyder running on Python 3.8
Input: batch size 32, input size (32, 256, 64, 1)
Using model.fit function to train the model
100,277 parameters, 99,523 trainable
Actually, I think this issue was fixed after I updated the NVIDIA software to the latest version (11.1) and added the most recent ones to the PATH.

How to get the exact GPU memory usage for Keras

I recently started learning Keras and TensorFlow. I am testing out a few models currently on the MNIST dataset (pretty basic stuff). I wanted to know, exactly how much my model is consuming memory-wise, during training and inference. I tried googling but did not find much info.
I came across nvidia-smi. I tried using the config.gpu_options.allow_growth = True option, but I am still not able to get the exact memory python.exe is consuming, due to some issues with nvidia-smi. I know that I could run a separate pass of training and inference, but this is too cumbersome. It would be very easy if I could just find the right API to do the job.
Tensorflow being such a well known and well-used library, I am hoping to find a better and faster way to get to these numbers.
Finally, once again my question is:
How to get the exact memory usage for a Keras model during training and inference.
Relevant specs:
OS: Windows 10
GPU: GTX 1050
TensorFlow version: 1.14
Please let me know if any other details are required.
Thanks!
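As a starting point, nvidia-smi can be queried from Python and its CSV output parsed. The sketch below is a minimal example under a few assumptions: parse_gpu_memory, query_gpu_memory, and the sample output string are hypothetical names and data, and nvidia-smi is assumed to be on the PATH. Note that this reports memory reserved on the device, which for TensorFlow reflects what its allocator has claimed rather than what the model strictly needs, so treat the number as an upper bound.

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'used, total' MiB pairs, one line per GPU, into a list of tuples."""
    usage = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        usage.append((used, total))
    return usage


def query_gpu_memory():
    """Run nvidia-smi and return [(used_mib, total_mib), ...], one per GPU."""
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_memory(out)


# Hypothetical nvidia-smi output for a single GPU, used here so the
# parser can be tried without a GPU present.
sample = "1377, 4096\n"
for used, total in parse_gpu_memory(sample):
    print(f"{used} MiB of {total} MiB in use ({100 * used / total:.1f}%)")
```

Calling query_gpu_memory() before and after building the model, and again during training, gives a rough picture of how the reservation grows when allow_growth is enabled.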

Non Deterministic Results Using GPUs with Tensorflow and Tensorflow Serving ... Why?

We have an object detection model developed in TensorFlow (1.10 and 1.13) that uses a standard CNN and some extra layers. We host the model in TensorFlow Serving 1.13.0 using the SavedModel format, on NVIDIA Tesla V100 GPUs with CUDA 10 and cuDNN 7.4.x. (We use the Google container images and/or Dockerfiles for TensorFlow Serving.)
We run unit tests to ensure that prediction results are what we expect. These all work great on CPU. But when we run them on the above GPU/CUDA/CUDNN configuration, we get differences in the prediction probabilities ranging from .001 to .0005.
Our goals are to understand:
- Why does this happen?
- Is there anything we can do to prevent it?
- If there is something we can do to prevent it, does that entail some sort of trade-off, such as performance?
We have tried the following experiments:
- Multiple runs of the same model on TensorFlow GPU using a checkpoint with a batch size of 1: results identical.
- Multiple runs of the same model on GPU using a checkpoint with various batch sizes: results off by .001.
- Multiple runs of the same model on CPU using a checkpoint with various batch sizes: results identical.
- Multiple runs of the same model on TensorFlow Serving GPU using a checkpoint with a batch size of 1: results identical.
- Comparing runs with a checkpoint to runs with a saved model on GPU: results off by .005.
- Comparing runs with a checkpoint to runs with a saved model on CPU: results identical.
- Changing the batch_size and setting TF_CUDNN_USE_AUTOTUNE=0 on GPU: reduces the max difference from .001 to .0005.
- Adding intra_op_parallelism_threads=1 and inter_op_parallelism_threads=1 together with TF_CUDNN_USE_AUTOTUNE=0: results no different from the above.
IN SUMMARY: We have a few cases where the results of running inference on GPU are different:
- Using a checkpoint versus a saved model.
- Batch size = 1 versus various batch sizes.
- Setting TF_CUDNN_USE_AUTOTUNE=0 reduces the difference when using various batch sizes.
This happens with TF 1.10 AND 1.13.1
Again, our goals are to understand:
- Why does this happen?
- Is there anything we can do to prevent it?
- If there is something we can do to prevent it, does that entail some sort of trade-off, such as performance?
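One commonly cited contributor to this kind of batch-size-dependent difference is that floating-point addition is not associative, so a reduction that groups its operands differently can return a slightly different value. The sketch below is only an illustration in plain Python, not a diagnosis of the setup above; sum_in_chunks and predictions_match are hypothetical helpers, and the 1e-3 tolerance is an assumption based on the differences reported in the question.

```python
import math

# Floating-point addition is not associative: the grouping changes the result.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False


def sum_in_chunks(values, chunk_size):
    """Sum `values` by first summing fixed-size chunks, then the partial sums.

    Different chunk sizes group the additions differently, so for floats the
    result may differ in the last few bits.
    """
    partials = [sum(values[i:i + chunk_size])
                for i in range(0, len(values), chunk_size)]
    return sum(partials)


def predictions_match(a, b, abs_tol=1e-3):
    """Compare two probability lists element-wise within an absolute tolerance."""
    return len(a) == len(b) and all(
        math.isclose(x, y, abs_tol=abs_tol) for x, y in zip(a, b))
```

If bit-exact results are not required, comparing predictions with a tolerance such as predictions_match(expected, actual) in the unit tests sidesteps the issue without any performance cost.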
I had some crazy nondeterministic behavior that didn't occur on my laptop's GPU but did happen on the server's GPUs.
Solution: I now call cudaDeviceSynchronize() after every call to a cuBLAS, cuSOLVER, etc. function, and the nondeterminism disappeared! :) It drove me really crazy, but apparently, because those libraries use streams, you can end up using the contents of a device pointer before those libraries' functions have finished writing the results.

data generator with tensorflow on the gpu

I am building a neural network using TensorFlow, and I ran into a problem when using a generator to split my data up: it is too slow.
My training data consists of 52x52 numpy arrays. I need to split each array into a 52x52x3 array before I input it into my NN. As mentioned I have a generator working that does this, but I noticed that even though my NN is running on the GPU my GPU usage is very low (under 10% usually). I think this might be caused by me doing the generator on the CPU.
Is there any way of running my generator on the GPU?
What I tried:
- I thought of trying to use PyCUDA in order to program the generator on the GPU, but found that TensorFlow and PyCUDA don't support each other.
- I tried using the from_generator function from the Dataset API as mentioned here:
https://www.tensorflow.org/api_docs/python/tf/contrib/data/Dataset
But while having issues with it I ran into this github thread mentioning that this function isn't supported to run on the GPU anyway:
https://github.com/tensorflow/tensorflow/issues/13610
Any help would be greatly appreciated.
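If the low GPU usage comes from the training loop waiting on the generator, one option that does not require moving the generator to the GPU is to run it in a background thread and keep a small buffer of ready batches. The sketch below is a minimal standard-library illustration of that idea; prefetch and slow_batches are hypothetical names, and the buffer size is an arbitrary assumption. The tf.data API offers a comparable built-in via Dataset.prefetch.

```python
import queue
import threading

_SENTINEL = object()  # marks the end of the stream


def prefetch(generator, buffer_size=4):
    """Yield items from `generator` while a background thread produces them."""
    buffer = queue.Queue(maxsize=buffer_size)

    def producer():
        try:
            for item in generator:
                buffer.put(item)  # blocks while the buffer is full
        finally:
            # Always signal the end, even if the generator raised.
            buffer.put(_SENTINEL)

    # Daemon thread: it will not keep the process alive if the consumer stops early.
    threading.Thread(target=producer, daemon=True).start()

    while True:
        item = buffer.get()
        if item is _SENTINEL:
            return
        yield item


def slow_batches(n):
    """Hypothetical stand-in for the CPU-side generator in the question."""
    for i in range(n):
        yield [i] * 3  # placeholder for real preprocessing work


for batch in prefetch(slow_batches(3)):
    print(batch)
```

This only helps when the preprocessing and the training step can actually overlap, so it is worth timing the generator on its own first to confirm it is the bottleneck.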