Replacing a tensor t in my model with t*1 changes the numerical behavior significantly and causes NaNs to propagate to the loss.
Before I take this to GitHub issues, I was wondering if someone knows a workaround for this "magic".
I am using version 1.0.
Related
I am trying to design a new loss function that iterates over all the training errors for each sample in a training batch and calculates the new loss based on the magnitude of the different errors.
Is there any way to achieve this? When you design the loss function, error.shape[0] is None, so the traditional ways of iterating over the errors cannot be used.
error = Ypred - Ytrue, and shape[0] is None for all of them, so I don't know how to iterate over the errors. I need to know the errors during training and compare their magnitude with one specific value to find out how many errors are larger than it, and then calculate the loss based on that.
In short, I want to calculate the mean of the errors larger than 0.5 and the mean of the errors smaller than 0.5 in a batch, respectively, and then use their sum as the loss.
Is there any way to achieve this?
Larger_Error=Error[Error>0.5] can work.
It was a rather silly problem, but I'll keep it here for deep learning beginners like me :-)
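For reference, here is a minimal sketch of that kind of loss written purely with tensor ops; the names threshold_split_loss, y_true and y_pred are mine, not from the original code, and the 0.5 threshold follows the description above.

import tensorflow as tf

def threshold_split_loss(y_true, y_pred, threshold=0.5):
    # Sketch: mean of the errors above the threshold plus mean of the errors below it.
    error = tf.abs(y_pred - y_true)                     # per-element absolute error
    large = tf.boolean_mask(error, error > threshold)   # errors larger than 0.5
    small = tf.boolean_mask(error, error <= threshold)  # errors at most 0.5
    # divide_no_nan keeps the loss finite when one of the two groups is empty
    mean_large = tf.math.divide_no_nan(
        tf.reduce_sum(large), tf.cast(tf.size(large), error.dtype))
    mean_small = tf.math.divide_no_nan(
        tf.reduce_sum(small), tf.cast(tf.size(small), error.dtype))
    return mean_large + mean_small

Because everything is expressed as tensor operations rather than a Python loop, the undefined batch dimension (shape[0] being None) is not a problem, and the function can be passed straight to model.compile(loss=threshold_split_loss).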
I'm using network outputs to instantiate a MultivariateNormalTriL distribution from tensorflow_probability, and GradientTape to record gradients after calculating the loss.
After a random number of training steps, GradientTape returns NaN values for the gradients of some layers of the network. All the values in the outputs of the network and the distribution look fine. They are just ordinary numbers, not too small, not too large. All the calculated loss values are also fine. There are no NaNs anywhere else except in the gradients from GradientTape.
Also, everything works fine when using the Normal distribution.
As I understand it, GradientTape should only return NaNs when the function is not differentiable, so it seems that MultivariateNormalTriL is not differentiable for some specific values.
Am I missing something? Do you have any idea how to solve this or at least where to look?
It seems that the problem was with the standard deviation (scale) matrix. I used tfp.bijectors.FillScaleTriL to transform it, and it seems to be working for now.
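For anyone running into the same thing, here is a minimal sketch of how the bijector can be wired in; the event size, batch size and variable names below are illustrative assumptions, not the original code.

import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions
tfb = tfp.bijectors

event_size = 3                                    # dimensionality of the Gaussian
param_size = event_size * (event_size + 1) // 2   # entries of a lower-triangular matrix

# raw_scale would come from a Dense(param_size) head of the network
raw_scale = tf.random.normal([8, param_size])     # batch of unconstrained outputs
loc = tf.random.normal([8, event_size])

# FillScaleTriL packs the vector into a lower-triangular matrix and applies a
# softplus (plus a small shift) to the diagonal, keeping it strictly positive,
# which avoids the NaN gradients that a degenerate scale matrix can cause.
scale_tril = tfb.FillScaleTriL(diag_shift=1e-5)(raw_scale)

dist = tfd.MultivariateNormalTriL(loc=loc, scale_tril=scale_tril)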
I'm training an RNN, and sometime overnight the loss function reached NaN. I've been reading that a solution to this is to decrease the learning rate, but when attempting to restart training from the (only) checkpoint I have with a smaller learning rate, I still get NaN. Does this mean my checkpoint is beyond repair? Is there a way to either recover this one OR use tf.train.Saver in such a way that I am guaranteed a version of the model from before it reached the point of no return?
If your checkpoint has NaN values in it, there is probably not a lot you can do to recover it. You could replace the NaNs with something else, but that isn't very principled.
You probably want to see if there is an earlier checkpoint without NaN values. tf.train.Saver keeps up to 5 previous checkpoints by default, for precisely this sort of reason:
https://www.tensorflow.org/api_docs/python/tf/train/Saver
Hope this helps!
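If you want a longer safety net than the default, something along these lines works; the max_to_keep and keep_checkpoint_every_n_hours values here are just examples, and the dummy variable stands in for the real model.

import os
import tensorflow as tf

w = tf.Variable(0.0, name='w')   # stand-in for the model's variables

# Keep the 10 most recent checkpoints and additionally retain one checkpoint
# every 2 hours, so a pre-NaN version of the model survives a long overnight run.
saver = tf.train.Saver(max_to_keep=10, keep_checkpoint_every_n_hours=2)

os.makedirs('checkpoints', exist_ok=True)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(10000):
        # ... run a training step here ...
        if step % 1000 == 0:
            saver.save(sess, 'checkpoints/model', global_step=step)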
I've written the following segmentor and I can't get the accuracy to work. In fact, I'm always getting an accuracy of 0.0, whatever the size of my sample.
I think the problem is the sigmoid layer at the end of the U() function, where a tensor of continuous values between 0 and 1 (conv10) is compared against a binary tensor, so there is no chance of the two ever being equal.
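To make the suspected comparison issue concrete, here is a rough sketch of how an equality-based accuracy is usually computed from a sigmoid output, by thresholding at 0.5 first; apart from conv10, the shapes and names are assumptions, not the original code.

import tensorflow as tf

# conv10: sigmoid output in [0, 1]; labels: binary ground-truth mask
# (the placeholders are only stand-ins for the network's real tensors)
conv10 = tf.placeholder(tf.float32, [None, 128, 128, 1])
labels = tf.placeholder(tf.float32, [None, 128, 128, 1])

# Comparing continuous probabilities to {0, 1} directly almost never matches,
# so threshold the predictions before the equality check.
predictions = tf.cast(conv10 > 0.5, tf.float32)
accuracy = tf.reduce_mean(tf.cast(tf.equal(predictions, labels), tf.float32))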
UPDATE: The code can be found as a git repo here
I've resolved the issue. The problem was the conversion of the numpy arrays to placeholders at the feed level. The updated code can be found as a git repo at: https://github.com/JulienBelanger/TensorFlow-Image-Segmentation
The current implementation of the recommendation system uses TF 1.8 and the WALS algorithm. The model was trained using self.fit(input_fn=input_fn) on ML Engine with runtime version 1.8. The data set was formed following the example, using tensorflow.train.Example(...). An extract from the training logs is shown below.
The fit was performed with some default parameters. The loss value did decrease on the second evaluation, but it did not change after that. The final root weighted squared error (RWSE) in this training was 0.126.
Hyperparameter tuning was performed later, and the best parameter set was used in the following training. The result of that training is shown below.
Three things to note here. First, the loss value at the beginning is lower than at later evaluation steps; the low initial value is most likely due to the choice of parameters from the hyperparameter tuning results, and the increase of the loss later on looks strange. Second, the loss value is unchanged after the second evaluation; this pattern remains the same as long as self.fit(input_fn=input_fn) is used for model training. Third, the final RWSE in this training was 0.487, while during hyperparameter tuning with the same parameter set it was 0.015.
The question is: has anyone observed something similar? Is it possible to improve the performance of the algorithm using the WALSMatrixFactorization class and self.fit(input_fn=input_fn, steps=train_steps)? Thanks in advance for your help.
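For context, here is a minimal sketch of serializing one row of the data set with tensorflow.train.Example; the feature names ('key', 'indices', 'values') are assumptions, not necessarily the ones used in the original pipeline.

import tensorflow as tf

def make_example(row_id, item_ids, ratings):
    # One user's ratings row as a tf.train.Example, ready for a TFRecord file.
    return tf.train.Example(features=tf.train.Features(feature={
        'key':     tf.train.Feature(int64_list=tf.train.Int64List(value=[row_id])),
        'indices': tf.train.Feature(int64_list=tf.train.Int64List(value=item_ids)),
        'values':  tf.train.Feature(float_list=tf.train.FloatList(value=ratings)),
    }))

serialized = make_example(42, [3, 17, 256], [4.0, 2.5, 5.0]).SerializeToString()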