OOM error only after many epochs of training a tacotron model - tensorflow

I was checking out Google's Tacotron 2 model and slightly modified it to fit my data. The training runs successfully until about epoch 9000, but then throws an OOM error (I repeated the training, and it stops at exactly the same spot every time I try).
I added the swap_memory=True option to the tf.nn.bidirectional_dynamic_rnn call to see if it would help. After that change the training runs a bit slower and survives more epochs, but it still throws an OOM error at around epoch 10000.
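For reference, this is roughly how I pass that option (a minimal TF 1.x sketch, not the actual Tacotron 2 code; the cell sizes and placeholder tensors are made up for illustration):

import tensorflow as tf  # TF 1.x

cell_fw = tf.nn.rnn_cell.LSTMCell(256)
cell_bw = tf.nn.rnn_cell.LSTMCell(256)
inputs = tf.placeholder(tf.float32, [None, None, 512])  # [batch, time, features]
seq_len = tf.placeholder(tf.int32, [None])

# swap_memory=True lets TensorFlow swap activations to host RAM during
# backprop, trading some speed for GPU memory.
outputs, states = tf.nn.bidirectional_dynamic_rnn(
    cell_fw, cell_bw, inputs,
    sequence_length=seq_len,
    dtype=tf.float32,
    swap_memory=True)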
I'm using a 12 GB Titan X GPU. The model checkpoint files (three files per checkpoint) are only about 500 MB, with roughly 80 MB for the meta and data files. I don't know enough about checkpoints, but if they represent all the model parameters and all the variables needed for training, that is much smaller than 12 GB, and I don't understand why the OOM error occurs.
Does anybody have a clue what might be causing the OOM error? How do I check whether stray variables or graph nodes keep accumulating? Or does the dynamic RNN somehow cause the problem?
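(A minimal sketch of one way to check for such accumulation in TF 1.x, assuming a plain graph/session training loop: count the ops in the default graph between steps, or finalize the graph so that late additions raise instead of silently growing.)

import tensorflow as tf  # TF 1.x

graph = tf.get_default_graph()

# The op count should stay constant once the training loop is running;
# a steadily growing number means something keeps adding nodes to the graph.
print('ops in graph:', len(graph.get_operations()))

# Alternatively, freeze the graph: any later attempt to add an op
# (e.g. building tensors or summaries inside the loop) raises a RuntimeError.
graph.finalize()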

I haven't run into this error myself. Maybe you can upgrade your TensorFlow version or CUDA driver, or simply reduce the batch size.

Related

TensorFlow OOM when looping over multiple experiments

Using TensorFlow 2.4.1, I'm running multiple experiments (a grid search) one after another:
creating models
training on them
and then losing their reference
After a certain number of iterations, I run into OOM errors and no subsequent experiment manages to allocate the memory required to run.
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor
The models are all the same; the grid search is only over the learning rate. The tensor for which TensorFlow fails to allocate memory was allocated successfully in the previous experiment iterations.
I tried running each experiment in a multiprocessing.Process, as recommended in https://github.com/tensorflow/tensorflow/issues/36465#issuecomment-582749350.
I tried calling tf.keras.backend.clear_session() at the beginning of each process, as recommended at https://www.tensorflow.org/api_docs/python/tf/keras/backend/clear_session, without success.
I tried enabling memory growth with tf.config.experimental.set_memory_growth, but according to nvidia-smi the allocated memory stays identical between runs that succeed and runs that fail with OOM (43474/45556 MB).
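For reference, the memory-growth setup I mean is roughly this (it has to run before anything touches the GPU):

import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    # Must be called before the GPU is initialized, i.e. before any op runs on it.
    tf.config.experimental.set_memory_growth(gpu, True)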
I'm running out of ideas on how to prevent this out-of-memory error with TensorFlow.
Are there any recommendations on how to run multiple successive experiments with TensorFlow?
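For completeness, the per-process setup I'm using looks roughly like this (a sketch; build_and_train is a placeholder for the actual model construction and training code):

import multiprocessing as mp
import tensorflow as tf

def run_experiment(learning_rate):
    # Each process gets a fresh CUDA context; clear_session() also resets Keras state.
    tf.keras.backend.clear_session()
    build_and_train(learning_rate)  # placeholder for the real experiment code

if __name__ == '__main__':
    # 'spawn' avoids inheriting an already-initialized GPU context from the parent.
    mp.set_start_method('spawn')
    for lr in [1e-2, 1e-3, 1e-4]:
        p = mp.Process(target=run_experiment, args=(lr,))
        p.start()
        p.join()  # GPU memory held by the child is released when it exits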

Tensorflow Object Detection API low loss low confidence - checkpoint not saving weights

A few months ago I trained a custom object detector on the Stanford Dogs dataset using efficientdet_d0_512x512 with only 2 classes of dogs, and it worked. Without changing the code, I tried doing it again, and now the model produces really low confidence scores (<1%), even though the loss during training was low.
I then tried resuming training from the checkpoint generated by the initial run, and the loss starts high, as if the checkpoint did not exist.
I also tried working with Faster R-CNN and got the same results. Here's the code:
https://colab.research.google.com/drive/1fE3TYRyRrvKI2sVSQOPUaOzA9JItkuFL?usp=sharing
I'm thinking that the export step is not working and the trained weights are not being saved. Any ideas?
It does indeed seem that your checkpoint cannot be loaded, as shown by the many warnings you're getting:
WARNING:tensorflow:Unresolved object in checkpoint:
(root). .......
After some research I found this issue on the TensorFlow Object Detection API models repository: https://github.com/tensorflow/models/issues/8892#issuecomment-680207038.
Basically, it says that you should change:
fine_tune_checkpoint_type = "detection"
to:
fine_tune_checkpoint_type = "fine_tune"
so that the loaded checkpoint is used for fine-tuning and the number of classes doesn't cause a mismatch between your configuration file and the one you're starting from.
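If you're editing the config from the Colab notebook, a simple way is a text substitution over pipeline.config (a sketch; the path is a placeholder, and note that in the actual file the field is written with a colon, fine_tune_checkpoint_type: "detection"):

import re

pipeline_path = 'pipeline.config'  # placeholder path to your config file

with open(pipeline_path) as f:
    config_text = f.read()

# Switch the checkpoint restore mode from "detection" to "fine_tune".
config_text = re.sub(r'fine_tune_checkpoint_type:\s*"detection"',
                     'fine_tune_checkpoint_type: "fine_tune"',
                     config_text)

with open(pipeline_path, 'w') as f:
    f.write(config_text)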
It also suggests checking whether your model directory (where the custom checkpoints and TensorBoard events will be saved) is empty, for the same reasons, but it looks like you're fine on that point in your Colab notebook.
On a different subject, you should also be careful with your learning rate. Right now you're using a cosine_decay_learning_rate, which requires some warmup steps, 2500 in this case. However, you are training for only 800 steps, so the warmup is not even finished when you stop training! If for some reason you want to keep the number of steps low, you should switch to a manual_step_learning_rate or exponential_decay_learning_rate; otherwise you should keep training for much longer.
EDIT: After further investigation, the problem might go a bit deeper, according to this issue on GitHub: https://github.com/tensorflow/models/issues/9229
You might want to keep an eye on this to see where this is going.

Tensorflow fails to run on GPU from time to time

I solved this problem myself. It was because there were too many images in the CelebA dataset and my data loader was very inefficient. The data loading took too much time and caused the low speed.
Still, this does not explain why the code appeared to run on the CPU while GPU memory was also taken up. In the end I just switched to PyTorch.
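(For anyone who wants to stay with TensorFlow: the usual fix for this kind of input bottleneck in TF 1.x is a tf.data pipeline that decodes and batches on CPU threads in parallel with training, instead of feeding everything through feed_dict. A rough sketch, with the file pattern, image size and batch size as placeholders:)

import tensorflow as tf  # TF 1.x

def _parse(path):
    image = tf.image.decode_jpeg(tf.read_file(path), channels=3)
    image = tf.image.resize_images(image, [128, 128])  # placeholder size
    return image / 127.5 - 1.0                          # scale to [-1, 1]

files = tf.data.Dataset.list_files('celeba/*.jpg')      # placeholder pattern
dataset = (files
           .map(_parse, num_parallel_calls=4)  # decode on several CPU threads
           .batch(16)
           .prefetch(2))                        # overlap loading with training

next_batch = dataset.make_one_shot_iterator().get_next()
# Use next_batch directly in the graph instead of a feed_dict.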
My environment: Windows 10, CUDA 9.0, cuDNN 7.0.5, tensorflow-gpu 1.8.0.
I am working on a CycleGAN model. At first it worked fine with my toy dataset and could run on the GPU without major problems (though the first 10 iterations took an extremely long time, which suggests it might have been running on the CPU).
I later tried the CelebA dataset and changed only the folder name used to load the data (I load all the data into memory at once, then use my own next_batch function and feed_dict to train the model). Then the problem arose: GPU memory was still taken up according to GPU-Z, but the GPU load was low (less than 10%) and training was very slow (more than 10 times slower than normal), which suggests the code was running on the CPU.
Would anyone please give me some advice? Any help is appreciated, thanks.
What batch size were you trying? If it's too low (something like 2-8) for a small model, the memory consumed will not be much. It all depends on your batch size, the number of parameters in your model, and so on. It also depends on the model architecture and how much of the model can run in parallel. Maybe try increasing your batch size and re-running it?

Object detection training becomes slower over time and uses more CPU than GPU as training progresses

System information
What is the top-level directory of the model you are using: research/object_detection
Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes (just a VGG-16 implementation for Faster R-CNN)
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
TensorFlow version (use command below): 1.4.0
CUDA/cuDNN version: 8 and 6
GPU model and memory: NVIDIA 1060, 6 GB
I am trying to train a Faster R-CNN with VGG-16 as the feature extractor (paper) on my custom dataset using the API.
Training parameters are the same as described in the paper, except that I am running for only 15k steps and resizing the images to 1200x1200 with a batch size of 1.
Training runs fine, but it becomes slower as time progresses, and it keeps shifting between CPU and GPU.
Steps that take around 1 sec are running on the GPU, and the ones with high numbers like ~20 secs are running on the CPU; I cross-verified this using 'top' and 'nvidia-smi'. Why is it shifting between CPU and GPU in the middle of training? I can understand the shift when the model and logs are being saved, but otherwise I don't understand why.
PS: I am running only the train script; I am not running the eval script.
Update:
This becomes worse over time.
The secs/step keeps increasing, which also slows down the rate at which the checkpoints and logs are stored.
It should run at less than 1 sec/step, because that was the speed for the first 2k steps when I started training. And my dataset is pretty small (300 images for training).
In my experience, it is possible that your input images are too large. If you look at TensorBoard during the training session, you can see that all the reshape calculations are running on the GPU. So you could write a Python script to resize your input images without changing the aspect ratio, and at the same time set your batch size a bit higher (maybe 4 or 8). Then you can train on your dataset faster and still get a relatively good result (mAP).
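A minimal version of such a resize script, assuming Pillow is installed (the directory names and target size are placeholders):

import os
from PIL import Image

src_dir, dst_dir = 'images', 'images_resized'  # placeholder directories
max_side = 600                                  # placeholder size for the longer side

os.makedirs(dst_dir, exist_ok=True)
for name in os.listdir(src_dir):
    if not name.lower().endswith(('.jpg', '.jpeg', '.png')):
        continue
    img = Image.open(os.path.join(src_dir, name))
    # thumbnail() shrinks the image in place while preserving the aspect ratio.
    img.thumbnail((max_side, max_side))
    img.save(os.path.join(dst_dir, name))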

Deep Learning: Out of Memory error for data that's too wide

I'm trying to build a model (using TensorFlow) that makes use of LSTMs. My training set isn't too big: I have only about 500 examples. But each example is a vector of size 2500. I keep getting out-of-memory errors every time I attempt to train the model. I tried batching with small batch sizes (32 and 64) to reduce the amount of data per step, but the OOM error still persists.
What does one do when the training data is too wide?
Thank you very much!