Get TotalLoss of a Checkpoint file - tensorflow

I have 100s of checkpoint files and I need to pick one that will have the TotalLoss best suited to my needs.
How can I get that information through the python classes?
Edit: TF 1.x

TotalLoss is a value calculated on your data. The tensorflow model and it's weights do not store any dataset information. The rule of thumb would be: if you have to feed data to calculate a value, such as loss, gradient, accuracy etc. this can not be derived from the model alone and would not be visible from checkpoint, unless designed otherwise.
To calculate total loss of all the checkpoints you have to load each of them and run loss evaluation on the dataset you have.
To avoid reruns you can store the loss value in a variable and save it along with the model, so you can read it later.

Related

Tensorflow / Keras - Using both ModelCheckpoint: save_best_only and EarlyStopping: restore_best_weights

ModelCheckpoint
save_best_only: if save_best_only=True, it only saves when the model is considered the "best" and the latest best model according to the quantity monitored will not be overwritten. If filepath doesn't contain formatting options like {epoch} then filepath will be overwritten by each new better model.
EarlyStopping
restore_best_weights: Whether to restore model weights from the epoch with the best value of the monitored quantity. If False, the model weights obtained at the last step of training are used. An epoch will be restored regardless of the performance relative to the baseline. If no epoch improves on baseline, training will run for patience epochs and restore weights from the best epoch in that set.
If I train my model and save the best model and restore the weights of the best epoch... - am I not doing the same thing twice? Would it not just produce two model files, one for the epoch and one for the final model but both actually being the same?
Then if this is correct which would be the preferred method to use?
(As I understand, models are sometimes held in memory EarlyStopping for but not sure about model_checkpoint ModelCheckpoint)
The former saves the weights of the model at the epoch where it performed the best on the validation set, while the latter restores those saved weights into the model and use it for predictions.
When you save the weights of a model using the ModelCheckpoint callback during training, the weights are saved to disk (e.g., to a .h5 file) at specified checkpoints (e.g., after every epoch). The purpose of saving the weights is to be able to restore them later for predictions, in case you need to stop the training for some reason, or if you want to use the weights for inference on a different dataset.
Once the training is complete, you can restore the weights of the best performing model by loading them back into the model architecture, and then use the model for predictions.
The difference between early stopping and saving the weights using ModelCheckpoint is that early stopping saves the weights automatically based on a criterion (the performance on the validation set), while ModelCheckpoint saves the weights at specified intervals (e.g., after every epoch).
So, in the case of early stopping, you don't have to specify when to save the weights, because the algorithm stops training automatically and saves the weights when the performance on the validation set stops improving. On the other hand, with ModelCheckpoint, you have more control over when to save the weights, but you have to manually stop the training when the performance is no longer improving.
In summary, saving the weights during training allows you to persist the state of the model, so that you can continue training or use the model for predictions later.
In terms of preferred method, it depends on your use case. If you have limited memory, you may only keep the best model's weights in memory, and use the ModelCheckpoint to periodically save the best weights to disk. If memory is not a concern, you could keep all intermediate models in memory and use the EarlyStopping to stop training once the performance on the validation set stops improving.

Does .sample() of a Tensorflow Probability model repeatedly call the model parameters?

I build a BNN model similar to the Keras example, which accounts for aleatory and epistemic uncertainties:
https://keras.io/examples/keras_recipes/bayesian_neural_networks/
However, I want to compute the uncertainties separately. In an outer loop I compute several predictions, each sampling new weights and biases their distribution functions. Between these predictions I compute the epistemic (model) uncertainty.
For the determination of the aleatory uncertainties I use a tfp.layers.IndependentNormal() layer, through which the model returns a distribution object. Since my data is normalized, I can't just use the .stdv() method, since the inverse transformation would not return the correct inverse normalized outputs (this would only be the case for .mean()). Instead, I need to sample from the distribution object, then inverse normalize the samples, and afterwards I can calculate my standard deviation in the initial range.
Now I'm not sure if the .sample() method e.g. for n=100 samples -> .sample(100) samples 100 times new model parameters of the distribution function of the weights and biases. That would contaminate the aleatory uncertainty with the epistemic uncertainty.
I also saw that .sample() has a seed attribute, but am not completely sure if this prevents re-sampling of the model parameter distribution function for computations on a GPU.

What is the last state of the model after training?

After fitting the model with model.fit(...), you can use .evaluate() or .predict() methods with the model.
The problem arises when I use Checkpoint during training.
(Let's say 30 checkpoints, with checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(filepath, save_weights_only=True))
Then I can't quite figure out what do I have left, the last state of this model.
Is it the best one? or the latest one?
If the former is the case, one of 30 checkpoints should be same with the model I have left.
If the latter is the case, the latest checkpoint should be same with the model I have left.
Of course, I checked both the cases and neither one is right.
If you set save_best_only=True the checkpoint saves the model weights for the epoch that had the "best" performance. For example if your were monitoring 'val_loss' then it will save the model for the epoch with the lowest validation loss. If save_best_only=False then the model is saved at the end of each epoch regardless of the value of the metric being monitored. Of course if you do not use special formatting for the model save path then the save weights will be over written at the end of each epoch.

Different between fit and evaluate in keras

I have used 100000 samples to train a general model in Keras and achieve good performance. Then, for a particular sample, I want to use the trained weights as initialization and continue to optimize the weights to further optimize the loss of the particular sample.
However, the problem occurred. First, I load the trained weight by the keras API easily, then, I evaluate the loss of the one particular sample, and the loss is close to the loss of the validation loss during the training of the model. I think it is normal. However, when I use the trained weight as the inital and further optimize the weight over the one sample by model.fit(), the loss is really strange. It is much higher than the evaluate result and gradually became normal after several epochs.
I think it is strange that, for the same one simple and loading the same model weight, why the model.fit() and model.evaluate() return different results. I used batch normalization layers in my model and I wonder that it may be the reason. The result of model.evaluate() seems normal, as it is close to what I seen in the validation set before.
So what cause the different between fit and evaluation? How can I solve it?
I think your core issue is that you are observing two different loss values during fit and evaluate. This has been extensively discussed here, here, here and here.
The fit() function loss includes contributions from:
Regularizers: L1/L2 regularization loss will be added during training, increasing the loss value
Batch norm variations: during batch norm, running mean and variance of the batch will be collected and then those statistics will be used to perform normalization irrespective of whether batch norm is set to trainable or not. See here for more discussion on that.
Multiple batches: Of course, the training loss will be averaged over multiple batches. So if you take average of first 100 batches and evaluate on the 100th batch only, the results will be different.
Whereas for evaluate, just do forward propagation and you get the loss value, nothing random here.
Bottomline is, you should not compare train and validation loss (or fit and evaluate loss). Those functions do different things. Look for other metrics to check if your model is training fine.

Tensorflow input pipeline

I have an input pipeline where samples are generated on fly. I use keras and custom ImageDataGenerator and corresponding Iterator to get samples in memory.
Under assumption that keras in my setup is using feed_dict (and that assumption is a question to me) I am thinking of speeding things up by switching to raw tensorflow + Dataset.from_generator().
Here I see that suggested solution for input pipelines that generate data on fly in the most recent Tensorflow is to use Dataset.from_generator().
Questions:
Does keras with Tensorflow backend use feed_dict method?
If I switch to raw tensorflow + Dataset.from_generator(my_sample_generator) will that cut feed_dict memory copy overhead and buy me performance?
During predict (evaluation) phase apart from batch_x, batch_y I have also opaque index vector from my generator output. That vector corresponds to sample ids in the batch_x. Does that mean that I'm stuck with feed_dict approach for predict phase because I need that extra batch_z output from iterator?
The new tf.contrib.data.Dataset.from_generator() can potentially speed up your input pipeline by overlapping the data preparation with training. However, you will tend to get the best performance by switching over to TensorFlow ops in your input pipeline wherever possible.
To answer your specific questions:
The Keras TensorFlow backend uses tf.placeholder() to represent compiled function inputs, and feed_dict to pass arguments to a function.
With the recent optimizations to tf.py_func() and feed_dict copy overhead, I suspect the amount of time spent in memcpy() will be the same. However, you can more easily use Dataset.from_generator() with Dataset.prefetch() to overlap the training on one batch with preprocessing on the next batch.
It sounds like you can define a separate iterator for the prediction phase. The tf.estimator.Estimator class does something similar by instantiating different "input functions" with different signatures for training and evaluation, then building a separate graph for each role.
Alternatively, you could add a dummy output to your training iterator (for the batch_z values) and switch between training and evaluation iterators using a "feedable iterator".