What is the difference between saving a summary and saving the model in the logdir? - tensorflow

Using TensorFlow (tf.contrib.slim in particular), we need to set a few parameters to produce the graphs we want in TensorBoard.
What the summary save interval does is fairly clear to us: it saves the value (or an average of values?) of a particular point in the graph at the interval provided.
But why are checkpoints, which save the model itself, needed during training? Does the model change? We're not sure how this works.

You save the model to checkpoints because the Variables in the model, including neural network weights and biases and the global_step counter, keep changing during the training process. The structure of the model doesn't change. The saved checkpoints allow you to load the trained model for serving and to resume training later.
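To make the distinction concrete, here is a minimal TF 1.x-style sketch (in the spirit of the tf.contrib.slim era the question refers to; the toy model, the 'logdir' path, and the save intervals are placeholders, not from the original post): the summary writer records scalar values for the TensorBoard curves, while the Saver checkpoints persist the Variable values so training can be resumed or the model served.

import tensorflow as tf  # TF 1.x style API

# Toy model: a single trainable weight fit to a constant target.
w = tf.Variable(0.0, name='w')
loss = tf.square(w - 3.0)
global_step = tf.Variable(0, trainable=False, name='global_step')
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss, global_step=global_step)

tf.summary.scalar('loss', loss)        # a summary logs a point-in-time scalar for TensorBoard
merged = tf.summary.merge_all()
saver = tf.train.Saver(max_to_keep=3)  # a checkpoint persists the current Variable values

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    writer = tf.summary.FileWriter('logdir', sess.graph)
    for step in range(200):
        _, s = sess.run([train_op, merged])
        if step % 10 == 0:
            writer.add_summary(s, step)                              # point on the TensorBoard curve
        if step % 50 == 0:
            saver.save(sess, 'logdir/model.ckpt', global_step=step)  # resumable model state
    writer.close()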

Related

Tensorflow / Keras - Using both ModelCheckpoint: save_best_only and EarlyStopping: restore_best_weights

ModelCheckpoint
save_best_only: if save_best_only=True, it only saves when the model is considered the "best" and the latest best model according to the quantity monitored will not be overwritten. If filepath doesn't contain formatting options like {epoch} then filepath will be overwritten by each new better model.
EarlyStopping
restore_best_weights: Whether to restore model weights from the epoch with the best value of the monitored quantity. If False, the model weights obtained at the last step of training are used. An epoch will be restored regardless of the performance relative to the baseline. If no epoch improves on baseline, training will run for patience epochs and restore weights from the best epoch in that set.
If I train my model, save the best model with ModelCheckpoint, and also restore the weights of the best epoch with EarlyStopping, am I not doing the same thing twice? Wouldn't it just produce two model files, one for the best epoch and one for the final model, both actually being the same?
If that is correct, which is the preferred method to use?
(As I understand it, EarlyStopping keeps the weights in memory, but I am not sure how ModelCheckpoint handles them.)
The former saves the model's weights to disk at the epoch where it performed best on the validation set, while the latter restores the weights from the best epoch back into the model at the end of training so it can be used for predictions.
When you save the weights of a model using the ModelCheckpoint callback during training, the weights are saved to disk (e.g., to a .h5 file) at specified checkpoints (e.g., after every epoch). The purpose of saving the weights is to be able to restore them later for predictions, in case you need to stop the training for some reason, or if you want to use the weights for inference on a different dataset.
Once the training is complete, you can restore the weights of the best performing model by loading them back into the model architecture, and then use the model for predictions.
The difference between early stopping and saving the weights with ModelCheckpoint is that EarlyStopping keeps track of the best weights in memory and restores them automatically based on a criterion (the performance on the validation set), while ModelCheckpoint writes the weights to disk at specified points (e.g., after every epoch, or only when the monitored metric improves).
So, with early stopping, you don't have to decide when training ends: it stops automatically once the performance on the validation set stops improving, and the best weights are put back into the model. With ModelCheckpoint, you have more control over when the weights are saved, but the callback by itself does not stop training; the run continues for the number of epochs you requested unless you stop it some other way.
In summary, saving the weights during training allows you to persist the state of the model, so that you can continue training or use the model for predictions later.
In terms of the preferred method, it depends on your use case. If you need the weights to persist beyond the training run (for later inference, or in case training is interrupted), use ModelCheckpoint to write the best weights to disk. If you only need the best weights in the current session, EarlyStopping with restore_best_weights=True keeps them in memory and restores them when training stops. In practice the two are complementary and are often used together, as in the sketch below.
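A rough sketch of how the two callbacks are typically combined (the toy model, data, and the 'best.h5' file name are placeholders, not from the question):

import numpy as np
import tensorflow as tf

# Toy regression problem, purely illustrative.
x = np.random.rand(1000, 8).astype('float32')
y = x.sum(axis=1, keepdims=True)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation='relu', input_shape=(8,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer='adam', loss='mse')

callbacks = [
    # Writes the best weights so far to disk; they survive the Python process.
    tf.keras.callbacks.ModelCheckpoint('best.h5', monitor='val_loss',
                                       save_best_only=True, save_weights_only=True),
    # Stops training and puts the best in-memory weights back into `model`.
    tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=5,
                                     restore_best_weights=True),
]

model.fit(x, y, validation_split=0.2, epochs=100, callbacks=callbacks, verbose=0)
# After fit(), `model` already holds the best weights (thanks to EarlyStopping),
# and best.h5 holds the same weights on disk (thanks to ModelCheckpoint).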

Tensorflow Keras - Does Model.save() save the best model?

I have been training several models using 10-fold CV and added the ModelCheckpoint callback, which saves the model with the lowest validation loss to an HDF5 file. However, for a while I would then also call model.save(filepath) right after training.
I only came to realize that this last call probably saves the model as it was after the very last epoch, and that the saved checkpoint is not being used at all. Is my assumption correct? If so, is it normal that the best models from the checkpoint files score lower than the ones saved with model.save()?
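For reference, the distinction this question turns on can be sketched as follows (the toy model and file names are placeholders): model.save() after fit() writes the state from the final epoch, while the ModelCheckpoint file holds the state from the best epoch, so evaluating the two files can give different scores.

import numpy as np
import tensorflow as tf

x = np.random.rand(200, 4).astype('float32')
y = x.sum(axis=1, keepdims=True)

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
model.compile(optimizer='adam', loss='mse')

ckpt = tf.keras.callbacks.ModelCheckpoint('best_fold.h5', monitor='val_loss',
                                          save_best_only=True)
model.fit(x, y, validation_split=0.2, epochs=20, callbacks=[ckpt], verbose=0)

model.save('last_epoch.h5')                        # state after the final epoch
best = tf.keras.models.load_model('best_fold.h5')  # state from the best epoch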

What is the last state of the model after training?

After fitting the model with model.fit(...), you can use .evaluate() or .predict() methods with the model.
The problem arises when I use Checkpoint during training.
(Let's say 30 checkpoints, with checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(filepath, save_weights_only=True))
Then I can't quite figure out what I am left with as the last state of this model.
Is it the best one, or the latest one?
If it's the former, one of the 30 checkpoints should be the same as the model I have left.
If it's the latter, the latest checkpoint should be the same as the model I have left.
Of course, I checked both cases, and neither is right.
If you set save_best_only=True, the checkpoint saves the model weights for the epoch that had the "best" performance. For example, if you were monitoring 'val_loss', it saves the model for the epoch with the lowest validation loss. If save_best_only=False, the model is saved at the end of each epoch regardless of the value of the metric being monitored. Of course, if you do not use special formatting for the model save path, the saved weights will be overwritten at the end of each epoch.
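For example, a filepath with formatting tokens keeps one file per epoch instead of overwriting a single file (the path pattern below is illustrative):

import tensorflow as tf

# With {epoch} / {val_loss} tokens in the path, each save gets its own file;
# without them, the same file is overwritten at every save.
checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    'ckpt-{epoch:02d}-{val_loss:.4f}.h5',
    monitor='val_loss',
    save_weights_only=True,
    save_best_only=False)   # save every epoch; set True to keep only the best so far
# model.fit(..., validation_data=..., callbacks=[checkpoint_callback])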

TensorFlow.Keras ModelCheckpoint Saving model while training, why?

I was wondering why we need to save the model while training.
Isn't it enough to save it once at the beginning of training and then only save the weights during training?
I mean, the model isn't changing during training, so why is this boolean needed?
class ModelCheckpoint(Callback):
...
save_weights_only: if True, then only the model's weights will be saved.
...
Thanks!
It's not a need or requirement, it's just a convenience. In a typical DL/DS workflow, you train a lot of models with different configurations, and it is quite easy to get lost. Maybe you have saved the weights for the best model, but you don't remember which model configuration was used. That information is not part of the weights and has to be recorded separately.
Keras provides a simple solution: store the model configuration (which takes less than 10 KB) along with the weights, so that if you lose the original model configuration, it is still saved in the same HDF5 file.
Also consider the case where you send the model weights to someone else without the model configuration: how can they load the weights without a model? Again, it's just a convenience.
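A short sketch of the difference (the toy architecture and file names are placeholders): weights alone can only be loaded back into an architecture that you rebuild in code, while a full model save can be reloaded on its own.

import tensorflow as tf

def build_model():
    # The architecture must be recreated in code when only weights were saved.
    m = tf.keras.Sequential([
        tf.keras.layers.Dense(8, activation='relu', input_shape=(4,)),
        tf.keras.layers.Dense(1),
    ])
    m.compile(optimizer='adam', loss='mse')
    return m

model = build_model()
model.save_weights('weights_only.h5')  # weights, no architecture
model.save('full_model.h5')            # architecture + weights + optimizer state

# Weights-only file: you must rebuild the same architecture first.
restored = build_model()
restored.load_weights('weights_only.h5')

# Full save: the file alone is enough.
restored_full = tf.keras.models.load_model('full_model.h5')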

Training trained seq2seq model on additional training data

I have trained a seq2seq model with 1M samples and saved the latest checkpoint. Now, I have some additional training data of 50K sentence pairs which has not been seen in previous training data. How can I adapt the current model to this new data without starting the training from scratch?
You do not have to re-run the whole network initialization. You can run incremental training.
Training from pre-trained parameters
Another use case is to use a base model and train it further with new training options (in particular the optimization method and the learning rate). Using -train_from without -continue will start a new training run with parameters initialized from a pre-trained model.
Remember to tokenize your 50K corpus the same way you tokenized the previous one.
Also, beginning with OpenNMT 0.9 you do not have to use the same vocabulary. See the Updating the vocabularies section and use the appropriate value for the -update_vocab option.
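The same idea expressed in Keras terms, for readers not using OpenNMT (this is only an analogue of the -train_from workflow, not the OpenNMT command; the toy model, file names, learning rate, and placeholder data are assumptions for illustration):

import numpy as np
import tensorflow as tf

# Stand-in for the model trained on the original 1M samples: a toy model saved
# to disk so the example runs end to end.
pretrained = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
pretrained.compile(optimizer='adam', loss='mse')
pretrained.save('pretrained.h5')

# Incremental training: reload the saved model and keep fitting on the new data
# only, usually with a smaller learning rate so the new pairs fine-tune the model
# rather than overwrite what it learned from the original corpus.
model = tf.keras.models.load_model('pretrained.h5')
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4), loss='mse')

new_x = np.random.rand(100, 4).astype('float32')  # placeholder for the new 50K pairs
new_y = new_x.sum(axis=1, keepdims=True)
model.fit(new_x, new_y, epochs=3, verbose=0)
model.save('pretrained_updated.h5')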