Keras OOM error while fine-tuning but not while training - tensorflow

I have an autoencoder model that I initialized and trained on a GPU machine with a dataset of 128x128 images and a batch size of 64. I saved the trained model to a file. Then I load this saved model file and restart training on a new dataset, also with 128x128 images and a batch size of 64, but this time I get an OOM error. When the scenario is exactly the same as before, why does it throw an OOM error? (GPU memory is at 0% usage every time I start training.)
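For context, a minimal sketch of the workflow described above, assuming tf.keras in a TF2-style setup; `build_autoencoder` and the `train_images_*` arrays are placeholders for the asker's own code, and the memory-growth setting is just one thing worth checking, not a confirmed fix:

```python
import tensorflow as tf

# Optional: let TensorFlow grow GPU memory on demand instead of
# pre-allocating it all; this sometimes changes OOM behaviour.
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)

# First run: build and train the autoencoder on 128x128 images.
autoencoder = build_autoencoder(input_shape=(128, 128, 3))  # placeholder builder
autoencoder.fit(train_images_a, train_images_a, batch_size=64, epochs=50)
autoencoder.save('autoencoder.h5')

# Second run: load the saved model and continue training on a new dataset
# with the same image size and batch size; this is where the OOM appears.
autoencoder = tf.keras.models.load_model('autoencoder.h5')
autoencoder.fit(train_images_b, train_images_b, batch_size=64, epochs=50)
```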

Related

Tensorflow object detection API only evaluates the latest checkpoint

I've trained an SSD MobileNet v2 320x320 model for around 4k steps, which produced quite a few checkpoints that are saved in my training folder. The issue I am experiencing now is that evaluation only uses the latest checkpoint, but I'd like to evaluate all of them at once.
Ideally I would like to see the results in TensorBoard as a graph showing the validation accuracy (mAP) of the different checkpoints - which it does already, but just for the one checkpoint.
I have tried to run my evaluation code to generate a graph of my mAP, but it shows my mAP as a single dot.
Each checkpoint refers to a previous state of your model during training. The mAP graph you see on TensorBoard is, at certain points, the same as the dots produced when you run the evaluation once on a checkpoint, because the checkpoints are not actually different models but your model at different times during training. So the graph of the last model is what you need.
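If you do want a separate mAP point for every saved checkpoint, one possible approach (a sketch, not a built-in feature of the Object Detection API; `evaluate_checkpoint` is a placeholder for your existing evaluation code) is to loop over the paths listed in the checkpoint state file and run the evaluation once per checkpoint:

```python
import tensorflow as tf

# Reads the 'checkpoint' file in the training folder.
ckpt_state = tf.train.get_checkpoint_state('training/')

# all_model_checkpoint_paths lists every checkpoint still referenced there,
# not just the latest one (older ones may already have been deleted).
for ckpt_path in ckpt_state.all_model_checkpoint_paths:
    print('Evaluating', ckpt_path)
    evaluate_checkpoint(ckpt_path)  # placeholder: restore weights and compute mAP
```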

CNN model for deployment: how to optimize

It's my first time deploying a model. I've created a CNN model using TensorFlow, Keras and Xception, and the saved model is about 80 MB. When I load it from a function and do a prediction, it takes about 4-5 seconds. Is there a way to reduce this time? Does the model have to be loaded for every prediction?
The model is loaded only once in your program; for each prediction you reuse the already-loaded model, which may still take some time. TensorFlow doesn't reload the model on every prediction. A better approach is to save only the weights after training and, for inference, recreate the model architecture and then load the saved weights.
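A minimal sketch of the load-once pattern the answer describes, assuming tf.keras; the file names, NUM_CLASSES and preprocessing are placeholders:

```python
import tensorflow as tf

# Load the model once, at startup, not inside the prediction function.
model = tf.keras.models.load_model('xception_classifier.h5')

def predict(image_batch):
    # Reuses the already-loaded model, so only inference time is paid here.
    return model.predict(image_batch)

# Alternative suggested in the answer: rebuild the architecture in code and
# load only the saved weights instead of a full saved model.
# base = tf.keras.applications.Xception(weights=None, classes=NUM_CLASSES)
# base.load_weights('xception_weights.h5')
```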

OOM error only after many epochs of training a tacotron model

I was checking out Google's Tacotron 2 model and slightly modified it to fit my data. The training runs successfully until about epoch 9000, but then throws an OOM error (I repeated the training, and it stops at the exact same spot every time I try).
I added the swap_memory=True option to the tf.nn.bidirectional_dynamic_rnn call to see if it resolves the problem. After that change the training runs a bit slower but manages more epochs; it still throws an OOM error at about epoch 10000, though.
I'm using a 12 GB Titan X GPU. The model checkpoint files (3 files per checkpoint) are only 500 MB, with about 80 MB for the meta and data files. I don't know enough about checkpoints, but if they represent all the model parameters and all the variables necessary for training, that seems much smaller than 12 GB and I don't understand why the OOM error occurs.
Does anybody have a clue what might cause the OOM error? How do I check whether stray variables/graphs keep accumulating? Or does the dynamic RNN somehow cause the problem?
I haven't seen this error myself. Maybe you can upgrade your TensorFlow version or CUDA driver, or just reduce the batch size.
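One way to check the question's suspicion that stray ops keep accumulating in the graph is to finalize the graph before the training loop. This is a TF1-style sketch (matching the tf.nn.bidirectional_dynamic_rnn usage) with a tiny stand-in graph rather than the actual Tacotron model: once the graph is finalized, any code that silently adds new ops each step raises an error instead of slowly growing memory.

```python
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# Tiny stand-in graph; in the real script this is the Tacotron training graph.
x = tf.placeholder(tf.float32, shape=[None, 4])
w = tf.get_variable('w', shape=[4, 1])
loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)
init_op = tf.global_variables_initializer()

# Freeze the graph: any later attempt to add ops (a common cause of slow
# memory growth over many epochs) now raises a RuntimeError immediately.
tf.get_default_graph().finalize()

with tf.Session() as sess:
    sess.run(init_op)
    for step in range(1000):
        # If something inside this loop creates new ops each iteration,
        # finalize() makes it fail loudly instead of leaking memory.
        sess.run(train_op, feed_dict={x: [[0.1, 0.2, 0.3, 0.4]]})
```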

Why LOSS is constantly increasing in tensorflow Object Detection API?

I am using the faster_rcnn_resnet50_coco model from the TensorFlow model zoo. I trained the model on my own data (solar images), and as I observe the training (in TensorBoard) the total loss is constantly increasing.
This is surprising to me because the model seems to be pretty successful at detecting the objects.

Training object detectors from scratch leads to really bad performance

I am trying to train a Faster R-CNN network with the Inception-v3 architecture (reference: Google's paper) as my fixed feature extractor, using Keras, on my own dataset (number of classes = 4), which is very different from ImageNet. Still, I initialized it with ImageNet weights because this paper gives evidence that initializing with pre-trained weights is always better than random initialization.
After training for 60 epochs my training accuracy is at 96% and my validation accuracy is at 84%: over-fit (severe, maybe?). But what is more worrying is that my loss did not converge at all. Upon testing, the network failed miserably; it didn't even detect anything.
Then I took a slightly different approach and did a two-step training. First I trained Inception-v3 on my dataset as a classification problem (still initialized with ImageNet weights), and it converged well. Then I used those weights to initialize the Faster R-CNN network. This worked! But I am confused why this two-staged approach works while training from scratch didn't, given that I initialized both methods with the pre-trained ImageNet weights.
Is there a way to train Faster R-CNN from scratch?
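A rough sketch of the two-stage approach described above, assuming tf.keras; `build_faster_rcnn` and the dataset objects are placeholders for the asker's own code, not a library API:

```python
import tensorflow as tf

# Stage 1: fine-tune Inception-v3 as a plain classifier on the 4-class dataset.
backbone = tf.keras.applications.InceptionV3(weights='imagenet', include_top=False)
x = tf.keras.layers.GlobalAveragePooling2D()(backbone.output)
head = tf.keras.layers.Dense(4, activation='softmax')(x)
classifier = tf.keras.Model(backbone.input, head)
classifier.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
classifier.fit(classification_dataset, epochs=20)            # placeholder dataset
backbone.save_weights('inception_v3_domain_weights.h5')

# Stage 2: build the Faster R-CNN and initialize its feature extractor with
# the weights fine-tuned in stage 1 instead of the raw ImageNet weights.
detector_backbone = tf.keras.applications.InceptionV3(weights=None, include_top=False)
detector_backbone.load_weights('inception_v3_domain_weights.h5')
detector = build_faster_rcnn(feature_extractor=detector_backbone)  # placeholder builder
detector.fit(detection_dataset, epochs=60)                   # placeholder dataset
```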