Does Keras ModelCheckpoint save the best model across multiple fitting sessions? - tensorflow

If I have a Keras model fitted with the ModelCheckpoint callback and fit it in several 'fitting sessions' (i.e. I call model.fit() multiple times), will the callback save the best model in the most recent fitting session or the best model out of all fitting sessions?

Good question. I ran an experiment with an existing model and data set. I created a checkpoint callback as shown and used it in model.fit():
mchk = tf.keras.callbacks.ModelCheckpoint(filepath=file_path1, monitor="val_loss", verbose=1,
    save_best_only=True, save_weights_only=True, mode="auto", save_freq="epoch")
history = model.fit(X_train, Y_train, validation_data=val_data,
    batch_size=128, epochs=5, verbose=1, callbacks=[mchk])
I saved only the weights, and only for the epoch with the lowest validation loss. I set verbose=1 in the callback so I could see the validation loss on each epoch. Next I ran essentially the same code again, but changed the filepath to file_path2. The code for that is below:
mchk = tf.keras.callbacks.ModelCheckpoint(filepath=file_path2, monitor="val_loss", verbose=1,
    save_best_only=True, save_weights_only=True, mode="auto", save_freq="epoch")
history = model.fit(X_train, Y_train, validation_data=val_data,
    batch_size=128, epochs=5, verbose=1, callbacks=[mchk])
Now model.fit() preserves the model's state at the end of a session, so if you run it a second time, training starts from where it left off. However, it does not preserve the state of the callback. On the second run, the newly created callback initializes the best validation loss to np.inf, so it will certainly save the weights at the end of the first epoch. If you don't change the file name, it will overwrite the file saved during the first run.
If the lowest validation loss in the second run is LOWER than the lowest in the first run, you wind up with the best weights overall. However, if it is higher than in the first run, you end up not saving the OVERALL best weights.
That's how it works when the callback has save_weights_only=True. I thought it might behave differently if you save the entire model, because that might preserve the state of the callback. So I reran the experiment with save_weights_only=False. The results indicate that saving the entire model does not save the state of the callback. Note that I am using TensorFlow 2.0; the results may differ in other versions, so I would run this experiment on your version and see whether it behaves the same way.

It will save the best model in the most recent fitting session.

It would save the model for the last fit() as you are essentially overwriting the same file.
If you want to find the best model over N calls to fit(), save each session's checkpoint with the session number in the file name. That way the callback saves the best model for each particular fit() and you can easily compare them later. You could just add the number manually, i.e. 1, 2, 3, ..., N, for each fit().
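A minimal sketch of the per-session naming idea. The helper `session_checkpoint_path` and the base name `best_model` are hypothetical (they are not part of Keras). Each call to fit() would get its own ModelCheckpoint pointed at the returned path, so files from earlier sessions are never overwritten.

```python
def session_checkpoint_path(session, base="best_model", ext=".h5"):
    """Return a distinct checkpoint file name for fitting session `session`."""
    # Zero-padding keeps the files sorted in session order.
    return f"{base}_session{session:02d}{ext}"


# One path per call to fit(); pass each one as `filepath` to a new ModelCheckpoint.
paths = [session_checkpoint_path(n) for n in (1, 2, 3)]
print(paths)
```

After the last session you would compare the best monitored value recorded for each file and keep the winner.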

Yes, a checkpoint will only be saved if the performance is better than the best seen over all calls to fit. In other words, if none of your epochs in the latest call to fit had better performance than an epoch in a previous call to fit, that previous checkpoint won't be overwritten.
There is one proviso: you must remember to create the callback outside of the call to fit. That is, do this:
checkpoint_callback = keras.callbacks.ModelCheckpoint(
    "checkpoint.h5", save_best_only=True)
model.fit(..., callbacks=[checkpoint_callback])
...
model.fit(..., callbacks=[checkpoint_callback])
not this:
model.fit(..., callbacks=[keras.callbacks.ModelCheckpoint(
    "checkpoint.h5", save_best_only=True)])
...
model.fit(..., callbacks=[keras.callbacks.ModelCheckpoint(
    "checkpoint.h5", save_best_only=True)])
The checkpoint callback object has a best attribute which stores the best monitored value so far (and is initially set to the worst possible value, e.g. infinity if lower is good). This is not reset when the object is passed to fit. However, if you instantiate a new callback object within the call to fit, as in the latter code, naturally best will be initialised to the worst possible value, not the best monitored value stored by other callback objects in previous calls to fit.
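The effect of reusing versus recreating the callback can be imitated without Keras. The sketch below is a toy stand-in, not the library's code: `BestTracker` and `run_sessions` are hypothetical names, validation losses are hard-coded, and "saving" just records which loss value is in the single checkpoint file.

```python
import math


class BestTracker:
    """Toy model of save_best_only bookkeeping for a loss (lower is better)."""

    def __init__(self):
        self.best = math.inf  # worst possible starting value
        self.saved = None     # loss of the weights currently in the file

    def on_epoch_end(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss
            self.saved = val_loss  # overwrite the single checkpoint file


def run_sessions(sessions, reuse_tracker):
    tracker = BestTracker()
    for losses in sessions:
        if not reuse_tracker:
            on_disk = tracker.saved
            tracker = BestTracker()   # new callback object: best is inf again
            tracker.saved = on_disk   # the old file is still on disk
        for loss in losses:
            tracker.on_epoch_end(loss)
    return tracker.saved


sessions = [[0.50, 0.30, 0.40], [0.45, 0.35, 0.38]]
print(run_sessions(sessions, reuse_tracker=True))   # file keeps the 0.30 weights
print(run_sessions(sessions, reuse_tracker=False))  # file ends with the 0.35 weights
```

With a fresh tracker in the second session, the first epoch always overwrites the file, so the overall best (0.30) is lost.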


Tensorflow / Keras - Using both ModelCheckpoint: save_best_only and EarlyStopping: restore_best_weights

save_best_only: if save_best_only=True, it only saves when the model is considered the "best" and the latest best model according to the quantity monitored will not be overwritten. If filepath doesn't contain formatting options like {epoch} then filepath will be overwritten by each new better model.
restore_best_weights: Whether to restore model weights from the epoch with the best value of the monitored quantity. If False, the model weights obtained at the last step of training are used. An epoch will be restored regardless of the performance relative to the baseline. If no epoch improves on baseline, training will run for patience epochs and restore weights from the best epoch in that set.
If I train my model and save the best model and restore the weights of the best epoch... - am I not doing the same thing twice? Would it not just produce two model files, one for the epoch and one for the final model but both actually being the same?
Then if this is correct which would be the preferred method to use?
(As I understand it, the best weights are held in memory for EarlyStopping, but I am not sure about ModelCheckpoint.)
The two do related but different jobs. ModelCheckpoint with save_best_only=True writes the weights (or the full model) to disk at the epoch where the model performed best on the monitored quantity, while EarlyStopping with restore_best_weights=True keeps a copy of the best weights in memory and puts them back into the live model when it stops training.
When you save the weights of a model using the ModelCheckpoint callback during training, the weights are saved to disk (e.g., to a .h5 file) at specified checkpoints (e.g., after every epoch, or only when the monitored value improves). The purpose of saving the weights is to be able to restore them later, for example to resume if training is interrupted, or to load them for inference in another process.
Once the training is complete, you can restore the weights of the best performing model by loading them back into the model architecture, and then use the model for predictions.
The difference between the two is that EarlyStopping decides when to stop training based on a criterion (the monitored value failing to improve for patience epochs) and writes nothing to disk, while ModelCheckpoint writes to disk but never stops training.
So with EarlyStopping alone, the best weights end up in the in-memory model, and you still have to save the model yourself if you want a file. With ModelCheckpoint alone, the best weights end up on disk, but training runs for all the requested epochs and the in-memory model holds the last epoch's weights.
In summary, saving the weights during training allows you to persist the state of the model, so that you can continue training or use the model for predictions later.
In terms of preferred method, it depends on your use case. If both callbacks monitor the same quantity, the checkpoint file and the restored in-memory weights normally correspond to the same epoch, so using both is redundant but harmless. Use ModelCheckpoint if you need a file that survives a crash or a restart, and use EarlyStopping if you mainly want to avoid wasting epochs once the validation performance stops improving.
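To make the overlap concrete, here is a toy training loop with no Keras dependency. It is only a sketch under simplifying assumptions: `train` is a hypothetical function, "weights" are represented by the (0-based) epoch index that produced them, and `patience=None` means no early stopping at all.

```python
def train(val_losses, patience=None):
    """Return (weights left in memory, weights in the checkpoint file)."""
    best_loss = float("inf")
    best_epoch = None     # copy that early stopping would keep in memory
    file_on_disk = None   # what a save-best-only checkpoint would have written
    live_weights = None   # weights of the model object itself
    wait = 0
    for epoch, loss in enumerate(val_losses):
        live_weights = epoch
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
            file_on_disk = epoch
        else:
            wait += 1
            if patience is not None and wait >= patience:
                live_weights = best_epoch  # restore the best weights, then stop
                break
    return live_weights, file_on_disk


# Early stopping triggers: memory and file agree on epoch 2.
print(train([0.9, 0.6, 0.5, 0.55, 0.58, 0.4], patience=2))
# Checkpoint only: the file has epoch 1, the live model has the last epoch.
print(train([0.9, 0.5, 0.7], patience=None))
```

In the first run the two mechanisms point at the same epoch, which is why combining them looks like doing the same thing twice; in the second run only the file holds the best weights.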

What is the last state of the model after training?

After fitting the model with model.fit(), you can use the .evaluate() or .predict() methods on the model.
The problem arises when I use a checkpoint callback during training.
(Let's say 30 checkpoints, with checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(filepath, save_weights_only=True))
Then I can't quite figure out what I have left as the last state of this model.
Is it the best one, or the latest one?
If the former is the case, one of the 30 checkpoints should be the same as the model I have left.
If the latter is the case, the latest checkpoint should be the same as the model I have left.
Of course, I checked both cases and neither one is right.
If you set save_best_only=True, the checkpoint saves the model weights for the epoch that had the "best" performance. For example, if you were monitoring 'val_loss', it will save the model for the epoch with the lowest validation loss. If save_best_only=False, the model is saved at the end of each epoch regardless of the value of the metric being monitored. Of course, if you do not use special formatting in the save path, the saved weights will be overwritten at the end of each epoch. The callback only writes files; it does not change the model in memory, which after fit() holds the weights from the last epoch.
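The "special formatting" is ordinary Python string formatting: the path template is filled with the epoch number and the values in the epoch's logs. A small sketch, with made-up log values, shows how distinct file names arise:

```python
# Template in the style used for ModelCheckpoint file paths.
template = "weights.{epoch:02d}-{val_loss:.2f}.hdf5"

# Made-up metrics for one epoch; keys the template does not use are ignored.
logs = {"val_loss": 0.4567, "val_acc": 0.81}

name = template.format(epoch=3, **logs)
print(name)  # weights.03-0.46.hdf5
```

Because the epoch number is part of the name, each save goes to a different file instead of overwriting the previous one.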

Early stopping based on AUC

I am fairly new to ML and am currently implementing a simple 3D CNN in python using tensorflow and keras. I want to optimize based on the AUC and would also like to use early stopping/save the best network in terms of AUC score. I have been using tensorflow's AUC function for this as shown below, and it works well for the training. However, the hdf5 file is not saved (despite the checkpoint save_best_only=True) and hence I cannot get the best weights for the evaluation.
Here are the relevant lines of code:
filepath = mypath
check = tf.keras.callbacks.ModelCheckpoint(filepath, monitor=tf.keras.metrics.AUC(), save_best_only=True,
earlyStopping = tf.keras.callbacks.EarlyStopping(monitor=tf.keras.metrics.AUC(), patience=hyperparams['pat'],mode='auto')
history = model.fit(X_trn, y_trn,
callbacks=[check, earlyStopping],
validation_data=(X_val, y_val),
Interestingly, if I only change monitor='val_loss' in the early stopping and checkpoint (not the 'metrics' in model.compile), the hdf5 file is saved but obviously gives the best result in terms of validation loss. I have also tried using mode='max' but the problem is the same.
I would very much appreciate your advice, or any other constructive ideas on how to work around this problem.
It turns out that even if you add a non-keyword metric, you still need to use its handle (the metric's name string) to refer to it when you want to monitor it. In your case you can do this:
auc = tf.keras.metrics.AUC()  # instantiate it here to have a shorter handle
check = tf.keras.callbacks.ModelCheckpoint(filepath,
    monitor='auc',  # use the generated handle to monitor the training AUC
    mode='max')     # determine better models according to "max" AUC
If you want to monitor the validation AUC (which makes more sense), simply add val_ at the beginning of the handle:
check = tf.keras.callbacks.ModelCheckpoint(filepath,
    monitor='val_auc',  # validation AUC
    mode='max')
Another problem is that your ModelCheckpoint is saving the weights based on the minimum AUC instead of the maximum, which is what you want.
This can be changed by setting mode='max'.
What does mode='auto' do?
This setting essentially checks whether the monitor argument contains 'acc' and, if so, uses mode='max'. In any other case it uses mode='min', which is what is happening in your case.
You can confirm this here
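The rule as described in this answer can be written out in a few lines. This is a sketch of the description above, not the library's source; `infer_mode` is a hypothetical name, and the exact rule differs between Keras/TensorFlow versions, so check the one you have installed.

```python
def infer_mode(monitor):
    """Mimic the described 'auto' rule: 'acc' in the name means maximise."""
    return "max" if "acc" in monitor else "min"


for name in ("val_acc", "val_accuracy", "val_loss", "val_auc"):
    print(name, "->", infer_mode(name))
```

Under this rule 'val_auc' falls through to 'min', which is why passing mode='max' explicitly is the safe choice for AUC.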
The answer posted by Djib2011 should solve your problem. I just wanted to address the use of early stopping. Typically it is used to stop training when overfitting starts to cause the validation loss to increase. I think it is more effective to address the overfitting directly, which should enable you to achieve a lower loss. You did not list your model, so it is not clear how best to address overfitting, but some simple guidelines are as follows. If you have several dense hidden layers at the top of the model, delete most of them and keep just the final top dense layer. The more complex the model, the more prone it is to overfitting. If that leads to lower training accuracy, keep the layers but add dropout layers. You might also try regularization in the hidden dense layers. I also find it beneficial to use the ReduceLROnPlateau callback. Set it up to monitor AUC and reduce the learning rate if it fails to improve.

How can I modify ModelCheckPoint in keras to monitor both val_acc and val_loss and save accordingly the best model?

ModelCheckpoint gives options to save based on val_acc and on val_loss separately.
I want to modify this so that if val_acc is improving, the model is saved; if val_acc is equal to the previous best val_acc, then val_loss is checked, and the model is saved if val_loss is less than the previous best val_loss.
if val_acc(epoch i) > best_val_acc:
    save model
else if val_acc(epoch i) == best_val_acc:
    if val_loss(epoch i) < best_val_loss:
        save model
else:
    do not save model
You can just add two callbacks:
callbacks = [ModelCheckpoint(filepathAcc, monitor='val_acc', ...),
             ModelCheckpoint(filepathLoss, monitor='val_loss', ...)]
model.fit(..., callbacks=callbacks)
Using custom callbacks
You can do anything you want in a LambdaCallback(on_epoch_end=saveModel).
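import sys
from tensorflow.keras.callbacks import LambdaCallback

best_val_acc = 0
best_val_loss = sys.float_info.max

def saveModel(epoch, logs):
    # The best values persist across epochs, so they must be declared global.
    global best_val_acc, best_val_loss
    val_acc = logs['val_acc']
    val_loss = logs['val_loss']
    if val_acc > best_val_acc:
        best_val_acc = val_acc
        best_val_loss = val_loss  # loss to beat at this accuracy level
        model.save(filepath)      # model and filepath are assumed to be defined elsewhere
    elif val_acc == best_val_acc:
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            model.save(filepath)

callbacks = [LambdaCallback(on_epoch_end=saveModel)]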
But this is nothing different from a single ModelCheckpoint with val_acc. You won't really be getting identical accuracies unless you're using very few samples, or you have a custom accuracy that doesn't vary much.
You can actually check in their documentation!
To save you some time though, the ModelCheckpoint callback accepts an argument called save_best_only, which does what you want; just set it to True. Here's the link to the documentation.
I misunderstood your question. I guess if you want a more complex type of callback you could always subclass the base Callback class, which gives you more power since you can access both params and model. Check the documentation. You can start by testing it out, printing the params, and determining which ones you'd want to take note of.
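The save decision itself can be kept separate from Keras, which makes it easy to test. The sketch below assumes a hypothetical helper class `AccThenLossTracker`; an instance could be consulted from a custom callback's on_epoch_end, reading logs['val_acc'] and logs['val_loss'] and saving the model whenever `should_save` returns True.

```python
class AccThenLossTracker:
    """Save when accuracy improves, or when it ties and the loss improves."""

    def __init__(self):
        self.best_acc = float("-inf")
        self.best_loss = float("inf")

    def should_save(self, val_acc, val_loss):
        improved_acc = val_acc > self.best_acc
        tie_better_loss = val_acc == self.best_acc and val_loss < self.best_loss
        if improved_acc or tie_better_loss:
            # Remember the pair that is now stored in the checkpoint.
            self.best_acc, self.best_loss = val_acc, val_loss
            return True
        return False


tracker = AccThenLossTracker()
history = [(0.80, 0.50), (0.80, 0.45), (0.80, 0.47), (0.85, 0.60), (0.82, 0.30)]
decisions = [tracker.should_save(acc, loss) for acc, loss in history]
print(decisions)  # [True, True, False, True, False]
```

Keeping the state on an object avoids module-level globals, and the same instance can be reused across several calls to fit.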
Check out ModelCheckpoint in the documentation. The model.fit() method takes the callback list as a parameter. Make sure you have something like model.fit(..., callbacks=[mcp]), where mcp = ModelCheckpoint() as defined.
Note: You may have multiple callbacks in the callback list.
For clarity I am adding some details but effectively this will do the same as function:
class ModelCheckpoint(Callback):
    """Save the model after every epoch.

    `filepath` can contain named formatting options,
    which will be filled the value of `epoch` and
    keys in `logs` (passed in `on_epoch_end`).

    For example: if `filepath` is `weights.{epoch:02d}-{val_loss:.2f}.hdf5`,
    then the model checkpoints will be saved with the epoch number and
    the validation loss in the filename.

    # Arguments
        filepath: string, path to save the model file.
        monitor: quantity to monitor.
        verbose: verbosity mode, 0 or 1.
        save_best_only: if `save_best_only=True`,
            the latest best model according to
            the quantity monitored will not be overwritten.
        mode: one of {auto, min, max}.
            If `save_best_only=True`, the decision
            to overwrite the current save file is made
            based on either the maximization or the
            minimization of the monitored quantity. For `val_acc`,
            this should be `max`, for `val_loss` this should
            be `min`, etc. In `auto` mode, the direction is
            automatically inferred from the name of the monitored quantity.
        save_weights_only: if True, then only the model's weights will be
            saved (`model.save_weights(filepath)`), else the full model
            is saved (`model.save(filepath)`).
        period: Interval (number of epochs) between checkpoints.
    """

Saving the state of the AdaGrad algorithm in Tensorflow

I am trying to train a word2vec model, and want to use the embeddings for another application. As there might be extra data later, and my computer is slow when training, I would like my script to stop and resume training later.
To do this, I created a saver:
saver = tf.train.Saver({"embeddings": embeddings,"embeddings_softmax_weights":softmax_weights,"embeddings_softmax_biases":softmax_biases})
I save the embeddings, and softmax weights and biases so I can resume training later. (I assume that this is the correct way, but please correct me if I'm wrong).
Unfortunately when resuming training with this script the average loss seems to go up again.
My idea is that this can be attributed to the AdaGradOptimizer I'm using. Initially the accumulator (the outer product matrix) will probably be set to all zeros, whereas after my training it will be filled, leading to a lower effective learning rate.
Is there a way to save the optimizer state to resume learning later?
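The hypothesis in the question can be illustrated with a scalar toy version of Adagrad. This is a sketch of the textbook update (accumulate squared gradients, divide the step by their square root), not TensorFlow's implementation, which differs in details such as the initial accumulator value; `adagrad_step` is a hypothetical helper.

```python
import math


def adagrad_step(grad, accum, lr=0.1, eps=1e-8):
    """One scalar Adagrad update: return (parameter change, new accumulator)."""
    accum = accum + grad * grad
    return -lr * grad / (math.sqrt(accum) + eps), accum


# 100 steps of "earlier training" with a constant gradient of 1.0.
accum = 0.0
for _ in range(100):
    _, accum = adagrad_step(1.0, accum)

resumed_with_state, _ = adagrad_step(1.0, accum)  # accumulator restored
resumed_without, _ = adagrad_step(1.0, 0.0)       # accumulator reset to zero
print(abs(resumed_with_state), abs(resumed_without))
```

With the accumulator reset, the first step after resuming is roughly ten times larger here, which is consistent with the loss jumping when only the embeddings and softmax parameters are restored.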
While TensorFlow seems to complain when you attempt to serialize an optimizer object directly (e.g. via tf.add_to_collection("optimizers", optimizer) and a subsequent call to tf.train.Saver().save()), you can save and restore the training update operation which is derived from the optimizer:
# init
if not load_model:
    optimizer = tf.train.AdamOptimizer(1e-4)
    train_step = optimizer.minimize(loss)
    tf.add_to_collection("train_step", train_step)
else:
    # Rebuild the graph from the saved meta file and restore all variables.
    saver = tf.train.import_meta_graph(modelfile + '.meta')
    saver.restore(sess, tf.train.latest_checkpoint('./'))
    train_step = tf.get_collection("train_step")[0]
# training loop
while training:
    if iteration % save_interval == 0:
        saver = tf.train.Saver()
        save_path = saver.save(sess, filepath)
I do not know of a way to get or set the parameters specific to an existing optimizer, so I do not have a direct way of verifying that the optimizer's internal state was restored, but training resumes with loss and accuracy comparable to when the snapshot was created.
I would also recommend using the parameterless call to Saver() so that state variables not specifically mentioned will still be saved, although this might not be strictly necessary.
You may also wish to save the iteration or epoch number for later restoring, as detailed in this example: