I'm facing a problem with restoring training from the last checkpoint that I saved. I'm following exactly this code, only changing the dataset and increasing the number of epochs to 100: Machine Translation French-English notebook
What do I add in order to continue training? It won't finish in one day, and every time it restarts from epoch 1.
I've found a similar question but the answer didn't solve the problem: Resume training from a certain checkpoint.
I know this is late but I wanted to share the code of a possible solution to this.
Saving a checkpoint and restoring the model from it is pretty easy according to the TensorFlow documentation. The saving can be done with the ModelCheckpoint callback every epoch (or, via the additional save_freq argument, every given number of batches):
import tensorflow as tf

model.compile(..., metrics=['accuracy'])

EPOCHS = 10
checkpoint_filepath = '/path/to/checkpoint'
model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_filepath,
    save_weights_only=True,        # only the weights are written, not the whole model
    monitor='val_accuracy',
    mode='max',
    save_best_only=True            # if this is not the best epoch so far it is not saved
)

model.fit(..., epochs=EPOCHS, callbacks=[model_checkpoint_callback])
Then, before starting a new training run or making predictions, the weights from the saved checkpoint can be loaded like this:
model.load_weights(checkpoint_filepath)
That's it.
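To actually resume across sessions, here is a minimal self-contained sketch of the pattern (a hypothetical toy model and random data, just to make it concrete): load the weights if a previous run already wrote a checkpoint, then call fit() again with the same callback.

import os
import numpy as np
import tensorflow as tf

# Hypothetical toy data and model, only to illustrate the resume pattern.
x = np.random.rand(200, 8).astype("float32")
y = np.random.randint(0, 2, size=(200, 1))

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

checkpoint_filepath = "/path/to/checkpoint"   # same placeholder path as above
callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_filepath, save_weights_only=True,
    monitor="val_accuracy", mode="max", save_best_only=True)

# Weights-only checkpoints are written as <path>.index plus data shards;
# if a previous run left them behind, restore before training again.
if os.path.exists(checkpoint_filepath + ".index"):
    model.load_weights(checkpoint_filepath)

model.fit(x, y, validation_split=0.2, epochs=10, callbacks=[callback])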
Related
If I have a Keras model fitted with the ModelCheckpoint callback and fit it in several 'fitting sessions' (i.e. I call model.fit() multiple times), will the callback save the best model in the most recent fitting session or the best model out of all fitting sessions?
Thanks.
Good question. I did an experiment with an existing model and dataset. I created a checkpoint callback as shown below and used it in model.fit:
file_path1 = r'c:\temp\file1'
mchk = tf.keras.callbacks.ModelCheckpoint(
    filepath=file_path1, monitor="val_loss", verbose=1,
    save_best_only=True, save_weights_only=True, mode="auto", save_freq="epoch")
history = model.fit(X_train, Y_train, validation_data=val_data,
                    batch_size=128, epochs=5, verbose=1, callbacks=[mchk])
I saved only the weights, and only for the epoch with the lowest validation loss. I set verbose=1 in the callback so I could see the validation loss on each epoch. Next I ran essentially the same code again, changing only the filepath to file2. The code for that is below:
file_path2 = r'c:\temp\file2'
mchk = tf.keras.callbacks.ModelCheckpoint(
    filepath=file_path2, monitor="val_loss", verbose=1,
    save_best_only=True, save_weights_only=True, mode="auto", save_freq="epoch")
history = model.fit(X_train, Y_train, validation_data=val_data,
                    batch_size=128, epochs=5, verbose=1, callbacks=[mchk])
Now model.fit preserves the model's state at the end of a session, so if you run it a second time it picks up where it left off. However, it does not preserve the state of the callback.
So on the second run the callback initializes the validation loss to np.inf, which means it will certainly save the weights at the end of the first epoch. If you don't change the filename, it will overwrite the file saved by the first run. If the validation loss at which the weights are saved in the second run is LOWER than the validation loss of the first run, you end up with the best saved weights overall. However, if the validation loss in the second run is higher than in the first run, you do not end up saving the OVERALL best weights.
That is how it works for the case where the callback has save_weights_only=True. I thought it might behave differently when saving the entire model, because in that case the state of the callback might be preserved. So I reran the experiment with save_weights_only=False. The results indicate that saving the entire model does not save the state of the callback either. I am using Tensorflow 2.0; the results may differ for other versions, so I would run this experiment on your version and see if it behaves similarly.
It will save the best model in the most recent fitting session
It would save the model for the last fit() since you are essentially overwriting the same file.
If you want to find the best model across N fitting sessions, save each with a prefix N in the file name. That way it saves the best model for a particular fit() and you can easily compare them later. You could just manually set N (i.e., 1, 2, 3, ...) for each fit(); a looped sketch follows the example below.
# Example
ModelCheckpoint(
    '/home/jupyter/checkpoint/best_model_{N}.h5',
    monitor="val_loss",
    save_best_only=True,
    save_weights_only=False,
    mode="min")
Yes, a checkpoint will only be saved if the performance is better than in all previous calls to fit. In other words, if none of the epochs in your latest call to fit had better performance than an epoch in a previous call to fit, that earlier checkpoint won't be overwritten.
There is one proviso: you must remember to create the callback outside of the call to fit. That is, do this:
checkpoint_callback = keras.callbacks.ModelCheckpoint(
    "checkpoint.h5", save_best_only=True)
model.fit(..., callbacks=checkpoint_callback)
...
model.fit(..., callbacks=checkpoint_callback)
not this:
model.fit(..., callbacks=keras.callbacks.ModelCheckpoint(
    "checkpoint.h5", save_best_only=True))
...
model.fit(..., callbacks=keras.callbacks.ModelCheckpoint(
    "checkpoint.h5", save_best_only=True))
The checkpoint callback object has a best attribute which stores the best monitored value so far (and is initially set to the worst possible value, e.g. infinity if lower is good). This is not reset when the object is passed to fit. However, if you instantiate a new callback object within the call to fit, as in the latter code, naturally best will be initialised to the worst possible value, not the best monitored value stored by other callback objects in previous calls to fit.
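A rough sketch of how that can be used, reusing the checkpoint_callback object from above; the best attribute exists on ModelCheckpoint in recent tf.keras versions, but it is worth checking against your version, and the 0.1234 value below is purely hypothetical.

# After the first fit(), the callback remembers the best monitored value so far.
print(checkpoint_callback.best)          # e.g. the lowest val_loss seen up to now

# If a fresh callback has to be created anyway (for example in a new process),
# it can be seeded with a previously recorded best value so a worse epoch in
# the next fit() does not overwrite the earlier checkpoint:
new_callback = keras.callbacks.ModelCheckpoint("checkpoint.h5", save_best_only=True)
new_callback.best = 0.1234               # hypothetical best val_loss from the earlier run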
I noticed while editing an old notebook that model.fit (in Keras), where model = Sequential(), always printed the number of train and validation samples (for example: Train on 2508 samples, validate on 250 samples) just before showing the epoch progress. Yet I don't see it when I run the training again; I immediately see the epoch progress. (Note: verbose is set to 1.)
I even checked keras.io/guides; none of the example outputs for Sequential.fit() show this line either.
Did this happen due to a new update, or do I need to add a certain parameter?
tf.compat.v1.disable_eager_execution()
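Presumably this works because the sample-count line comes from the TF1-style training loop, which fit() falls back to once eager execution is disabled (an assumption, not stated in the answer). A minimal sketch of where the call has to go, i.e. right after the import and before any model is built:

import numpy as np
import tensorflow as tf

# Disable eager execution before building or compiling any model.
tf.compat.v1.disable_eager_execution()

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
model.compile(optimizer="adam", loss="mse")

x = np.random.rand(100, 4).astype("float32")
y = np.random.rand(100, 1).astype("float32")
# With eager execution disabled, this should again print something like
# "Train on 80 samples, validate on 20 samples" before the epoch progress.
model.fit(x, y, validation_split=0.2, epochs=1, verbose=1)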
Previously I implemented two CNN models (ResNet50V2 and InceptionResNetV2) with a dataset containing 3662 images. Both worked fine in Google Colab during training and validation. Now I re-run exactly the same code and the training samples per epoch have dropped to only 92 (before it was 2929 per epoch). The two models use separate notebooks and both behave like this now.
I thought it might be due to limited RAM (after a month of Google Colab it seemed to be reduced to half), so I upgraded to Colab Pro with 25 GB of RAM. That didn't solve the problem.
Has anyone had the same issue? Can anyone give a clue as to the reason and a fix? Many thanks!
Some code from the end of the workflow (it worked well before):
model = tf.keras.applications.InceptionResNetV2(
    include_top=True, weights=None, input_tensor=None, input_shape=None,
    pooling=None, classes=5)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

log_dir = "logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)

model.fit(X, y_orig, epochs=20, batch_size=32, validation_split=0.2, callbacks=[tensorboard_callback])
So I think I found the reason: the number displayed during training is the number of batches, not the number of samples. In my case: 2929 (train samples) / 32 (batch_size) = 91.5, which rounds up to the 92 now shown during training.
To test it, I changed the batch size to 8 and got 366 per epoch. The overall training time also stays the same, suggesting the number of training samples is actually unchanged.
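To make the arithmetic concrete, a small sketch using the numbers from the question (in TF2 the progress counter shows steps, i.e. batches, rounded up):

import math

num_train_samples = 2929   # 80% of the 3662 images, with validation_split=0.2
batch_size = 32
steps_per_epoch = math.ceil(num_train_samples / batch_size)
print(steps_per_epoch)     # 92 -- the figure shown in the progress bar per epoch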
Are you using TensorFlow v1 or v2?
Does the problem persist if you switch to 1.x by running a cell with %tensorflow_version 1.x before importing TensorFlow?
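For reference, a sketch of what that Colab cell would look like; the %tensorflow_version magic is Colab-specific and must run before TensorFlow is imported anywhere in the notebook.

# In a Colab notebook cell, before any `import tensorflow`:
%tensorflow_version 1.x
import tensorflow as tf
print(tf.__version__)   # should report a 1.x release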
I'm trying to train multiple models in parallel on a single graphics card. To achieve that I need to resume training of models from saved weights, which is not a problem. The model.fit() method even has a parameter initial_epoch that lets me tell the model which epoch the loaded model is on. However, when I pass a TensorBoard callback to the fit() method in order to monitor training, all data is shown at x=0 on TensorBoard.
Is there a way to overcome this and adjust the epoch on TensorBoard?
By the way: I'm running Keras 2.0.6 and Tensorflow 1.3.0.
self.callbacks = [TensorBoardCallback(log_dir='./../logs/' + self.model_name, histogram_freq=0,
                                      write_graph=True, write_images=False, start_epoch=self.step_num)]

self.model.fit(x=self.data['X_train'], y=self.data['y_train'], batch_size=self.input_params[-1]['batch_size'],
               epochs=1, validation_data=(self.data['X_test'], self.data['y_test']), verbose=verbose,
               callbacks=self.callbacks, shuffle=self.hyperparameters['shuffle_data'], initial_epoch=self.step_num)
self.model.save_weights('./weights/%s.hdf5' % self.model_name)

self.model.load_weights('./weights/%s.hdf5' % self.model_name)
self.model.fit(x=self.data['X_train'], y=self.data['y_train'], batch_size=self.input_params[-1]['batch_size'],
               epochs=1, validation_data=(self.data['X_test'], self.data['y_test']), verbose=verbose,
               callbacks=self.callbacks, shuffle=self.hyperparameters['shuffle_data'], initial_epoch=self.step_num)
self.model.save_weights('./weights/%s.hdf5' % self.model_name)
The resulting graph on TensorBoard looks like this, which is not what I was hoping for:
Update:
When passing epochs=10 to the first model.fit(), the results of the 10 epochs are displayed in TensorBoard (see picture).
However, when reloading the model and running it again (with the same callback attached), the on_epoch_end method of the callback never gets called.
It turns out that the number of epochs I pass to model.fit() to tell it how long to train is interpreted as the final epoch, counted FROM the initial_epoch specified. So if initial_epoch=self.step_num, then epochs must be self.step_num+10 if I want to train for 10 more epochs.
Say we just started fitting our model and our first run trains for 30 epochs
(please ignore the other parameters, just look at epochs and initial_epoch):
model.fit(train_dataloader, validation_data=test_dataloader, epochs=30,
          steps_per_epoch=len(train_dataloader), callbacks=callback_list)
Now say that after 30 epochs we want to resume from the 31st epoch (you can see this in TensorBoard), changing the learning rate of our Adam optimizer (or any other optimizer).
What we can do is:
model.optimizer.learning_rate = 0.0005
model.fit(train_dataloader, validation_data=test_dataloader, initial_epoch=30, epochs=55,
          steps_per_epoch=len(train_dataloader), callbacks=callback_list)
So here initial_epoch = the epoch where training left off last time, and
epochs = initial_epoch + the number of epochs we want to run in this second fit.
For research purposes, I am training a neural network that updates its weights differently depending on the parity of the epoch:
1) If the epoch is even, change the weights of the NN with backpropagation.
2) If the epoch is odd, only update the model with update_weights_with_custom_function(), i.e. freeze the network.
Here is a simplified part of the code that implements this (notice the epochs=1):
for epoch in range(nb_epoch):
    if epoch % 2 == 0:
        model.trainable = True   # Unfreeze the model
    else:
        model.trainable = False  # Freeze the model

    model.compile(optimizer=optim, loss=gaussian_loss, metrics=['accuracy'])

    hist = model.fit(X_train, Y_train,
                     batch_size=batch_size,
                     epochs=1,
                     shuffle=True,
                     verbose=1,
                     callbacks=[tbCallBack, csv_epochs, early_stop],
                     validation_data=(X_val, Y_val))

    if epoch % 2 == 1:
        update_weights_with_custom_function()
Problem: after a few epochs, Keras throws a ResourceExhaustedError, but only with the TensorFlow backend, not with Theano. It seems that looping over compile() creates models without releasing them.
Therefore, what should I do? I know that K.clear_session() releases memory, but it requires saving the model and reloading it (see), which gives me some issues as load_model() does not work out of the box in my case.
I'm also open to other ways of achieving what I'm trying to do (i.e. freezing a NN model depending on the parity of the epoch).
Summary: Keras with the TensorFlow backend throws a ResourceExhaustedError because I am looping over compile().
As Marcin Możejko pointed out, using eval() does exactly what I was trying to achieve.
I added a custom callback (inspiration was here), which avoids the loop over compile().
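The exact callback from that answer isn't shown, but as a hedged sketch of the general idea (written against tf.keras, with custom_update as a hypothetical stand-in for update_weights_with_custom_function()): a Keras Callback can apply a custom weight update at the end of every other epoch by reading and writing the weights directly, so compile() is never called in a loop. Note that this sketch does not by itself disable backpropagation on odd epochs; it only illustrates the no-recompile callback pattern.

from tensorflow.keras.callbacks import Callback

class OddEpochWeightUpdate(Callback):
    """Apply a custom weight update at the end of every odd epoch,
    reading and writing weights directly so no recompilation is needed."""

    def __init__(self, custom_update):
        super().__init__()
        self.custom_update = custom_update   # hypothetical: takes and returns a list of weight arrays

    def on_epoch_end(self, epoch, logs=None):
        if epoch % 2 == 1:
            new_weights = self.custom_update(self.model.get_weights())
            self.model.set_weights(new_weights)

# Usage sketch: a single fit() over all epochs replaces the compile() loop.
# model.compile(optimizer=optim, loss=gaussian_loss, metrics=['accuracy'])
# model.fit(X_train, Y_train, epochs=nb_epoch,
#           callbacks=[OddEpochWeightUpdate(update_weights_with_custom_function)])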
The problem is now solved, even though the TensorFlow issue itself was not addressed directly.