Recovering a checkpoint after reaching NaN loss? - tensorflow

I'm training an RNN and sometime overnight the loss function reached NaN. I've been reading that a solution to this is to decrease the learning rate. When attempting to restart training from the (only) checkpoint I have and using a smaller learning rate, I still get NaN. Does this mean my checkpoint is beyond repair? Is there a way to either recover this one OR use tf.train.Saver in such a way that I am guaranteed a version of the model before it reaches a point of no return?

If your checkpoint has NaN values in it, there is probably not a lot you can do to recover it. I guess you could replace the NaNs with something else, but that isn't very principled.
You probably want to see if there is an earlier checkpoint without NaN values. tf.train.Saver keeps up to 5 previous checkpoints by default, for precisely this sort of reason:
https://www.tensorflow.org/api_docs/python/tf/train/Saver
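For example, something along these lines keeps more history around (a sketch assuming a TF 1.x training loop; the variable, paths and numbers are illustrative, not taken from the question):

    import tensorflow as tf

    # Illustrative variable so the Saver has something to track.
    w = tf.Variable(tf.zeros([10]), name='w')

    saver = tf.train.Saver(
        max_to_keep=10,                    # keep the 10 most recent checkpoints (default is 5)
        keep_checkpoint_every_n_hours=2.0  # additionally keep one checkpoint every 2 hours
    )

    # Inside the training loop, e.g. every few hundred steps:
    # with tf.Session() as sess:
    #     sess.run(tf.global_variables_initializer())
    #     saver.save(sess, 'checkpoints/model', global_step=step)

That way, even if the latest checkpoint is already poisoned with NaNs, an older one is likely still usable.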
Hope this helps!

Related

Is there any way in TensorFlow to iterate over all the training errors during training?

I am trying to design a new loss function that iterates over all the training errors for each sample in a training batch and computes the loss based on the magnitude of those errors.
Is there any way to achieve this? When designing the loss function, error.shape[0] is None, so the usual ways of iterating over the errors cannot be used.
error=Ypred-Ytrue, and shape[0] is None for all of them, so I don't know how to iterate over the errors. I need to see the errors during training and compare their magnitudes with a specific value to know how many errors are larger than it, and then calculate the loss based on that.
In short, I want to calculate the mean of the errors larger than 0.5 and the mean of the errors smaller than 0.5 in a batch respectively, and then use their sum as the loss function.
Is there any way to achieve this?
Larger_Error=Error[Error>0.5] can work.
It turned out to be a fairly simple problem, but I'll leave it here for other deep learning beginners like me :-)
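For reference, a minimal sketch of such a loss using tf.boolean_mask; the function name, the absolute error, and the guard against an empty group are my own additions, not from the original post:

    import tensorflow as tf

    def split_mean_loss(y_true, y_pred, threshold=0.5):
        # Element-wise errors; works even when the batch dimension is None.
        error = tf.abs(y_pred - y_true)
        large_mask = error > threshold
        small_mask = tf.logical_not(large_mask)

        def masked_mean(mask):
            values = tf.boolean_mask(error, mask)
            # Divide by at least 1 so an empty group contributes 0 instead of NaN.
            count = tf.maximum(tf.cast(tf.size(values), error.dtype), 1.0)
            return tf.reduce_sum(values) / count

        # Mean of the errors above the threshold plus mean of those below it.
        return masked_mean(large_mask) + masked_mean(small_mask)

Because tf.boolean_mask flattens the result, the unknown (None) batch dimension is not a problem here.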

Tensorflow GradientTape returns NaNs when using MultivariateNormalTriL distribution

I'm using network outputs to instantiate a MultivariateNormalTriL distribution from tensorflow_probability and GradientTape to record gradients after calculating the loss.
After a random number of training steps, GradientTape returns NaN values for the gradients of some layers of the network. All the values in the outputs of the network and the distribution look fine; they are just ordinary numbers, not too small, not too big. All the calculated loss values are also fine. There are no NaNs anywhere else except in the gradients from the GradientTape.
Also, everything works fine when using a Normal distribution.
As I understand it, GradientTape should only return NaNs when the function is not differentiable, so it seems that MultivariateNormalTriL is not differentiable for some specific values.
Am I missing something? Do you have any idea how to solve this or at least where to look?
It turned out that the problem was with the scale (standard deviation) matrix. I used tfp.bijectors.FillScaleTriL to transform the matrix, and it seems to be working for now.
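A minimal sketch of that fix, assuming the network emits event_size * (event_size + 1) / 2 unconstrained values per sample; the shapes, names and random input here are illustrative stand-ins for the real network output:

    import tensorflow as tf
    import tensorflow_probability as tfp

    tfd = tfp.distributions
    tfb = tfp.bijectors

    event_size = 3
    batch_size = 8

    # Pretend these are the unconstrained outputs of the network.
    raw_scale = tf.random.normal([batch_size, event_size * (event_size + 1) // 2])
    loc = tf.zeros([batch_size, event_size])

    # FillScaleTriL maps an unconstrained vector to a lower-triangular matrix
    # with a strictly positive diagonal, so the scale is always valid.
    scale_tril = tfb.FillScaleTriL()(raw_scale)

    dist = tfd.MultivariateNormalTriL(loc=loc, scale_tril=scale_tril)
    log_prob = dist.log_prob(dist.sample())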

how to manage batches for model.provide_groundtruth

I'm trying to use the TensorFlow 2 Object Detection API with a custom multi-class dataset to train an SSD. I took as a base the example provided by the documentation: https://github.com/tensorflow/models/blob/master/research/object_detection/colab_tutorials/eager_few_shot_od_training_tf2_colab.ipynb
My current problem is when I start the fine tuning:
InvalidArgumentError: The first dimension of paddings must be the rank of inputs[2,2] [6] [Op:Pad]
That seems to be related to the model.provide_groundtruth section of train_step_fn. As I mentioned, I took my data from a TFRecord, mapped it to a dataset, and divided it into batches using padded_batch on the tf.data.TFRecordDataset. That seems to be the correct way to feed the images into training, but now my problem is the ground truth, because it has also been converted into batches of shape [batch_size, num_detections, coordinate_bbox]. Is this the problem? Any idea on how to fix this issue?
Thanks
P.S. I tried the approach of modifying the pipeline.config file and running model_main_tf2.py, as was done with TensorFlow 1, but that method is buggy.
Just to share with everyone: what resolved my issue was that I had split the images and ground truth into batches correctly, but I had never converted my labels to one-hot encoding.
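A minimal sketch of that conversion, in the style of the eager few-shot colab linked above; gt_classes, gt_boxes, num_classes and detection_model are illustrative placeholders, not the actual dataset or model:

    import tensorflow as tf

    num_classes = 3
    gt_classes = tf.constant([0, 2, 1])  # zero-based class indices for one image

    # One row per box, one column per class.
    gt_classes_one_hot = tf.one_hot(gt_classes, depth=num_classes)  # [num_boxes, num_classes]

    # Per image, the ground truth is then provided along these lines:
    # detection_model.provide_groundtruth(
    #     groundtruth_boxes_list=[gt_boxes],              # float32 [num_boxes, 4]
    #     groundtruth_classes_list=[gt_classes_one_hot])  # float32 [num_boxes, num_classes]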

Avoiding overfitting while training a neural network with Tensorflow

I am training a neural network using TensorFlow's Object Detection API to detect cars. I used the following YouTube video to learn and execute the process.
https://www.youtube.com/watch?v=srPndLNMMpk&t=65s
Part 1 to 6 of his series.
Now in his video, he mentions stopping the training when the loss value reaches ~1 or below on average, and says that this takes about 10,000 steps.
In my case, I am at 7,500 steps right now and the loss keeps fluctuating between 0.6 and 1.3.
A lot of people complained in the comment section about false positives with this series, but I think this happened because of unnecessarily prolonged training (maybe they didn't know when to stop?), which caused overfitting!
I would like to avoid this problem. I don't need the most optimal weights, just fairly good ones, while avoiding false detections and overfitting. I am also watching the 'Total Loss' section of TensorBoard; it fluctuates between 0.8 and 1.2. When do I stop the training process?
I would also like to know in general: which factors does 'stopping the training' depend on? Is it always about an average loss of 1 or less?
Additional information:
My training data has ~300 images
Test data ~ 20 images
Since I am using transfer learning, I chose the ssd_mobilenet_v1 model.
Tensorflow version 1.9 (on CPU)
Python version 3.6
Thank you!
You should use a validation set, different from the training set and the test set.
At each epoch, compute the loss on both the training and validation sets.
If the validation loss begins to increase, stop your training. You can then test your model on your test set.
The validation set is usually the same size as the test set. For example, the training set is 70% and the validation and test sets are 15% each.
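As an illustration of the idea only (the Object Detection API has its own training loop and does not take Keras callbacks), early stopping on the validation loss might look like this in plain Keras on a recent TF version:

    import tensorflow as tf

    early_stop = tf.keras.callbacks.EarlyStopping(
        monitor='val_loss',         # watch the validation loss
        patience=3,                 # tolerate a few noisy epochs before stopping
        restore_best_weights=True)  # roll back to the best weights seen

    # model.fit(train_data, validation_data=val_data, epochs=100, callbacks=[early_stop])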
Also, please note that 300 images does not seem like enough data. You should increase your dataset size.
For your other question:
The loss is the sum of your errors and thus depends on the problem and your data. A loss of 1 does not mean much in this regard; never rely on it alone to stop your training.

Tensorflow: Multiplying a tensor by 1 changes the result significantly

Replacing a tensor t in my model with t*1 changes the numerical behavior significantly and causes NaNs to propagate to the loss.
Before I take this to GitHub issues, I was wondering if someone knows a workaround for this "magic".
I am using version 1.0