Strange behavior of loss function in an implementation of TensorFlow matrix factorization model for recommendation system - tensorflow

The current implementation of the recommendation system uses TF 1.8 and the WALS algorithm. The model was trained with self.fit(input_fn=input_fn) on ML Engine with runtime version 1.8. The dataset was formed following the example, using tensorflow.train.Example(...). An extract from the training logs is shown below.
The fit was performed with some default parameters. The loss value decreased at the second evaluation, but it did not change after that. The final root weighted squared error (RWSE) in this training was 0.126.
Hyperparameter tuning was performed later and the best parameter set was used in the following training. The result of that training is shown below.
Three things to note here. First, the loss value at the beginning is lower than at later evaluation steps. The low initial value is most likely due to the parameters chosen from the hyperparameter tuning results; the later increase in the loss looks strange. Second, the loss value is again unchanged after the second evaluation. This pattern remains the same whenever self.fit(input_fn=input_fn) is used for model training. Third, the final RWSE in this training was 0.487, whereas during hyperparameter tuning with the same parameter set RWSE = 0.015.
Has anyone observed something similar? Is it possible to improve the performance of the algorithm using the WALSMatrixFactorization class and self.fit(input_fn=input_fn, steps=train_steps)? Thanks in advance for your help.
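For reference, a minimal sketch of how such tf.train.Example records might be built and written to a TFRecord file in TF 1.x; the feature names (user_id, item_id, rating) and the file name are assumptions, not taken from the actual pipeline:

    import tensorflow as tf

    # Serialize one (user, item, rating) triple as a tf.train.Example.
    def make_example(user_id, item_id, rating):
        return tf.train.Example(features=tf.train.Features(feature={
            "user_id": tf.train.Feature(int64_list=tf.train.Int64List(value=[user_id])),
            "item_id": tf.train.Feature(int64_list=tf.train.Int64List(value=[item_id])),
            "rating":  tf.train.Feature(float_list=tf.train.FloatList(value=[rating])),
        }))

    # Write the serialized examples to a TFRecord file for training.
    with tf.python_io.TFRecordWriter("ratings.tfrecord") as writer:
        writer.write(make_example(42, 7, 3.5).SerializeToString())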

Related

Different optimization with different TF versions

I'm trying to train a convolutional neural network with Keras on TensorFlow 2.6; I also trained it on TensorFlow 1.11. I think the migration went okay (both networks converged), but the results are very different and worse in TF 2.6. I used the Adam optimizer in both cases with the same hyperparameters (learning_rate = 0.001), yet the loss is optimized better in TF 1.11 than in TF 2.6.
I'm trying to find out where the differences could come from. What should be taken into account when working with different TF versions? Can there be significant numerical differences? I know that in TF 1.x the default mode is graph and in TF 2 the default is eager, but I don't know whether this could lead to different training behavior.
It surprises me how much the loss decreases in the first epochs, reaching a lower value at the end of training.
You are right that they run in different execution modes (eager vs. graph), but the loss function itself is defined by the optimization method you configured, not by the execution mode.
You cannot directly compare the training history of one model to another. Running it several times, you will find that TF 1.x is faster; for changes in the loss functions themselves you would need to review the changelog.
Loss functions have been updated between releases. The graph is a powerful technique, but TF 2.x also gives you access to values at runtime, which is why you have convenient mechanisms such as callbacks and dynamic functions that can read and update values during training. (It is a useful exercise for students or users to compare both versions on the same task.)
Identical methods should not, by themselves, produce different results.
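For what it's worth, here is a minimal sketch of the kind of runtime access the answer alludes to: reading the loss through a Keras callback in TF 2.x (the model and data are placeholders):

    import tensorflow as tf

    # Read the loss value at runtime through a callback instead of
    # inspecting the graph.
    class LossLogger(tf.keras.callbacks.Callback):
        def on_epoch_end(self, epoch, logs=None):
            # `logs` holds the metrics Keras computed for this epoch.
            print("epoch {}: loss = {:.4f}".format(epoch, logs["loss"]))

    # Usage (model and data are placeholders):
    # model.fit(x_train, y_train, epochs=10, callbacks=[LossLogger()])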

What does stateful mean in tensorflow metrics in my case?

I don't really understand the explanation of a stateful metric here: Keras metrics with TF backend vs tensorflow metrics
Now, if I split my evaluation data into batches and use tf.metrics.precision on each batch, does that mean the variables from the previous batch (the false-positive counters, etc.) are carried over into the calculation for the next batch? That would be really bad, since I want a separate evaluation for each batch (that is why I split the data in the first place!).
If this is the case, how can I reset the variables for each batch?
I need the individual values from each batch to compute a mean afterwards.
The reason tf.metrics.Precision and the like (Recall, etc.) store true/false positive counts is that we do not want to estimate them batch-wise (unlike Accuracy or Loss, etc.). The original implementation of Precision in Keras (note, not tf.keras) did exactly what you describe (a separate evaluation for each batch, aggregated afterward), but it was removed in version 2.0.0 because this way of computing a global metric is "more misleading than helpful" (https://github.com/keras-team/keras/issues/5794).
But you can still do what you want: subclass tf.metrics.Metric and implement the logic of Precision in the update_state method. The Metric API documentation on TensorFlow has an example of custom metrics: https://www.tensorflow.org/api_docs/python/tf/keras/metrics/Metric
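If all you need is a per-batch precision, one possible sketch (assuming TF 2.x's tf.keras.metrics.Precision; model and dataset are placeholders) is to reset the metric's counters before each batch so no state leaks between batches:

    import tensorflow as tf

    # Reset the metric's internal true/false-positive counters before each
    # batch so every batch is evaluated independently, then average the
    # per-batch results afterwards.
    precision = tf.keras.metrics.Precision()

    batch_precisions = []
    for x_batch, y_batch in dataset:              # dataset yields (features, labels)
        y_pred = model(x_batch, training=False)   # forward pass only
        precision.reset_states()                  # clear accumulated counters
        precision.update_state(y_batch, y_pred)
        batch_precisions.append(float(precision.result()))

    mean_precision = sum(batch_precisions) / len(batch_precisions)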
I hope this is helpful!

Difference between fit and evaluate in Keras

I have used 100,000 samples to train a general model in Keras and achieved good performance. Then, for a particular sample, I want to use the trained weights as initialization and continue optimizing them to further reduce the loss on that particular sample.
However, a problem occurred. First, I load the trained weights easily through the Keras API, then I evaluate the loss on that one particular sample, and the loss is close to the validation loss seen during training of the model. I think that is normal. However, when I use the trained weights as the initialization and further optimize them on the one sample with model.fit(), the loss is really strange: it is much higher than the evaluate result and only gradually becomes normal after several epochs.
I find it strange that, for the same single sample and the same loaded model weights, model.fit() and model.evaluate() return different results. I use batch normalization layers in my model and I wonder whether that may be the reason. The result of model.evaluate() seems normal, as it is close to what I saw on the validation set before.
So what causes the difference between fit and evaluate? How can I solve it?
I think your core issue is that you are observing two different loss values during fit and evaluate. This has been extensively discussed here, here, here and here.
The fit() function loss includes contributions from:
Regularizers: L1/L2 regularization loss will be added during training, increasing the loss value
Batch norm variations: during batch norm, running mean and variance of the batch will be collected and then those statistics will be used to perform normalization irrespective of whether batch norm is set to trainable or not. See here for more discussion on that.
Multiple batches: Of course, the training loss will be averaged over multiple batches. So if you take average of first 100 batches and evaluate on the 100th batch only, the results will be different.
Whereas evaluate() just does a forward pass and returns the loss value; there is nothing random here.
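As a small illustration of the batch-norm and averaging effects above, here is a toy sketch with random data and a hypothetical model; the exact numbers are meaningless, only the mismatch between the two printed losses matters:

    import numpy as np
    import tensorflow as tf

    # Toy data and a hypothetical model with batch norm and dropout.
    x = np.random.rand(256, 10).astype("float32")
    y = np.random.rand(256, 1).astype("float32")

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

    history = model.fit(x, y, epochs=1, batch_size=32, verbose=0)
    print("fit loss      :", history.history["loss"][0])      # averaged over batches while weights change, training mode
    print("evaluate loss :", model.evaluate(x, y, verbose=0))  # single pass with final weights, inference mode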
The bottom line is that you should not compare the training and validation loss (or the fit and evaluate loss); those functions do different things. Look at other metrics to check whether your model is training fine.

Avoiding overfitting while training a neural network with Tensorflow

I am training a neural network using TensorFlow's Object Detection API to detect cars. I used the following YouTube video to learn and execute the process.
https://www.youtube.com/watch?v=srPndLNMMpk&t=65s
Part 1 to 6 of his series.
Now in his video, he mentions stopping the training when the loss value reaches roughly 1 or below on average, which takes about 10,000 or so steps.
In my case, I am at 7,500 steps right now and the loss values keep fluctuating between 0.6 and 1.3.
A lot of people complained in the comment section about false positives in this series, but I think this happened because of unnecessarily prolonged training (maybe because they didn't know when to stop?), which caused overfitting!
I would like to avoid this problem. I don't need the most optimal weights, just fairly good ones, while avoiding false detections and overfitting. I am also watching the 'Total Loss' section of TensorBoard; it fluctuates between 0.8 and 1.2. When should I stop the training process?
I would also like to know, in general, which factors the decision to stop training depends on. Is it always about an average loss of 1 or less?
Additional information:
My training data has ~300 images
Test data ~ 20 images
Since I am using the concept of transfer learning, I chose the ssd_mobilenet_v1 model.
Tensorflow version 1.9 (on CPU)
Python version 3.6
Thank you!
You should use a validation set, distinct from the training set and the test set.
At each epoch, compute the loss on both the training and validation sets.
If the validation loss begins to increase, stop your training. You can then test your model on your test set.
The validation set is usually the same size as the test set. For example, the training set is 70% of the data and the validation and test sets are 15% each.
Also, please note that 300 images does not seem like enough data. You should increase the size of your dataset.
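The Object Detection API does not go through model.fit, but the early-stopping idea itself can be sketched with a recent tf.keras callback (model, train_ds and val_ds are placeholders):

    import tensorflow as tf

    # Watch the validation loss and stop once it has not improved for a few
    # epochs, keeping the weights from the best epoch.
    early_stop = tf.keras.callbacks.EarlyStopping(
        monitor="val_loss",         # quantity to watch
        patience=5,                 # epochs without improvement before stopping
        restore_best_weights=True,  # roll back to the best epoch
    )

    # model.fit(train_ds, validation_data=val_ds, epochs=100, callbacks=[early_stop])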
For your other question :
The loss is the sum of your errors and thus depends on the problem and on your data. A loss of 1 does not mean much in this regard. Never rely on it alone to stop your training.

Convergence in Logistic Regression in distributed tensorflow

I'm trying to develop logistic regression in distributed TensorFlow and I want to integrate a convergence check into my algorithm, in addition to the upper bound on iterations. The convergence criterion I am about to use is
||prevW - currW|| < E
where prevW is the previous values of the model weights and currW the current ones. E is the convergence tolerance.
My question is about the previous model weights. Since I am using between-graph replication and asynchronous training, I don't know when each worker of the cluster will update the weights. So let's say a worker has computed new weights using a batch and wants to check whether the algorithm has converged in order to stop. Should I use the weights available in the local replica (i.e. use the corresponding tensor), or should I evaluate the tensor to get the last updated value before continuing with the current computation? I tried the approach described above, but the algorithm did not converge and only stopped after the upper bound on iterations was reached.
Thanks in advance for your help :D
I would do the convergence check on the same device where the variables are. This way you avoid copying too much data over the network. This can be done by putting it in a with tf.device(variable.device): block.
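A rough TF 1.x-style sketch of that idea (names, shapes and the tolerance are illustrative, not from the question): keep a snapshot of the previous weights next to the weight variable and build the comparison on the same device.

    import tensorflow as tf

    EPSILON = 1e-4
    NUM_FEATURES = 10  # illustrative shape

    # Model weights (in a real setup these live on the parameter server).
    w = tf.get_variable("weights", shape=[NUM_FEATURES, 1])

    with tf.device(w.device):
        # Snapshot of the weights from the previous convergence check.
        prev_w = tf.get_variable("prev_weights", shape=w.shape,
                                 initializer=tf.zeros_initializer(),
                                 trainable=False)
        delta = tf.norm(w - prev_w)          # ||prevW - currW||
        converged = delta < EPSILON          # convergence flag
        remember_w = tf.assign(prev_w, w)    # store current weights for the next check

    # In the worker's training loop, roughly:
    #   _, done = sess.run([train_op, converged])
    #   sess.run(remember_w)
    #   if done: break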