TensorBoard Callback in Keras does not respect initial_epoch of fit? - tensorflow

I'm trying to train multiple models in parallel on a single graphics card. To achieve that I need to resume training of models from saved weights, which is not a problem. The model.fit() method even has a parameter, initial_epoch, that lets me tell the model which epoch the loaded model is on. However, when I pass a TensorBoard callback to the fit() method in order to monitor the training of the models, all data is shown at x=0 on TensorBoard.
Is there a way to overcome this and adjust the epoch on TensorBoard?
By the way, I'm running Keras 2.0.6 and TensorFlow 1.3.0.
self.callbacks = [TensorBoardCallback(log_dir='./../logs/' + self.model_name, histogram_freq=0,
                                      write_graph=True, write_images=False, start_epoch=self.step_num)]
self.model.fit(x=self.data['X_train'], y=self.data['y_train'], batch_size=self.input_params[-1]['batch_size'],
               epochs=1, validation_data=(self.data['X_test'], self.data['y_test']), verbose=verbose,
               callbacks=self.callbacks, shuffle=self.hyperparameters['shuffle_data'], initial_epoch=self.step_num)
self.model.save_weights('./weights/%s.hdf5' % (self.model_name))
self.model.load_weights('./weights/%s.hdf5' % (self.model_name))
self.model.fit(x=self.data['X_train'], y=self.data['y_train'], batch_size=self.input_params[-1]['batch_size'],
               epochs=1, validation_data=(self.data['X_test'], self.data['y_test']), verbose=verbose,
               callbacks=self.callbacks, shuffle=self.hyperparameters['shuffle_data'], initial_epoch=self.step_num)
self.model.save_weights('./weights/%s.hdf5' % (self.model_name))
The resulting graph on TensorBoard looks like this, which is not what I was hoping for:
Update:
When passing epochs=10 to the first model.fit(), the results of all 10 epochs are displayed in TensorBoard (see picture).
However, when reloading the model and running it (with the same callback attached), the on_epoch_end method of the callback never gets called.

Turns out that the epoch count I pass to model.fit() to tell it how long to train has to be counted FROM the initial_epoch specified, not from zero. So if initial_epoch=self.step_num, then epochs=self.step_num+10 if I want to train for 10 more epochs.
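In other words, epochs is the index of the last epoch to run, not the number of additional epochs. A minimal sketch of the pattern, assuming a compiled model, x_train/y_train arrays, and the standard keras.callbacks.TensorBoard callback (the names here are placeholders, not taken from the code above):

from keras.callbacks import TensorBoard

tb_callback = TensorBoard(log_dir='./logs/my_model', histogram_freq=0,
                          write_graph=True, write_images=False)

# First run: epoch indices 0-9.
model.fit(x_train, y_train, epochs=10, callbacks=[tb_callback])
model.save_weights('./weights/my_model.hdf5')

# Resumed run: continue at epoch 10. To train 10 more epochs,
# epochs must be 20 (initial_epoch + 10), and TensorBoard then
# continues the curves at x=10 instead of restarting at x=0.
model.load_weights('./weights/my_model.hdf5')
model.fit(x_train, y_train, initial_epoch=10, epochs=20, callbacks=[tb_callback])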

Say we have just started fitting our model and our first run's epoch count is 30 (please ignore the other parameters and just look at epochs and initial_epoch):
model.fit(train_dataloader, validation_data=test_dataloader, epochs=30,
          steps_per_epoch=len(train_dataloader), callbacks=callback_list)
Now say, after 30 epochs, we want to start again from the 31st epoch (you can see this in TensorBoard) after changing the learning rate of our Adam optimizer (or any other optimizer).
What we can do is:
model.optimizer.learning_rate = 0.0005
model.fit(train_dataloader, validation_data=test_dataloader, initial_epoch=30, epochs=55,
          steps_per_epoch=len(train_dataloader), callbacks=callback_list)
So here initial_epoch is where we left off training last time, and epochs is initial_epoch plus the number of epochs we want to run in this second fit.

Related

I am trying to resume training from a certain checkpoint

I'm facing a problem with restoring training from the last checkpoint that I saved. I'm following this code exactly, except that I'm changing the dataset and increasing the number of epochs to 100: Machine Translation French-English notebook
What do I add in order to keep the training going? It won't finish in one day, and every time it restarts from epoch 1.
I've found a similar question but the answer didn't solve the problem: Resume training from a certain checkpoint.
I know this is late but I wanted to share the code of a possible solution to this.
Saving a checkpoint and restoring the model from it is pretty easy according to the TensorFlow documentation. The saving can be done with the TensorFlow ModelCheckpoint callback every epoch (or, with the additional save_freq argument, every x epochs):
model.compile(..., metrics=['accuracy'])

EPOCHS = 10
checkpoint_filepath = '/path/to/checkpoint'
model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_filepath,
    save_weights_only=True,
    monitor='val_accuracy',
    mode='max',
    save_best_only=True)  # if this is not the best epoch so far, it is not saved

model.fit(epochs=EPOCHS, callbacks=[model_checkpoint_callback])
Then, before starting a new training run or doing prediction, the weights of the saved checkpoint can be loaded like this:
model.load_weights(checkpoint_filepath)
That's it.
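If the goal is also to keep the epoch numbering going (so that callbacks such as ModelCheckpoint and TensorBoard see the right epoch index), the resumed fit() can be combined with initial_epoch as discussed in the answers above. A minimal sketch, assuming the first run trained 10 epochs and that x_train/y_train are the same training arrays (these names are placeholders):

model.load_weights(checkpoint_filepath)
model.fit(x_train, y_train,
          initial_epoch=10,   # the previous run trained epoch indices 0-9
          epochs=20,          # train up to epoch 20 in total
          callbacks=[model_checkpoint_callback])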

Number of train and validation samples is not shown as a return of model.fit

While editing an old notebook, I noticed that model.fit (in Keras, where model = Sequential()) always used to print the number of train and validation samples (for example: Train on 2508 samples, validate on 250 samples) just before showing the epoch progress. Yet I don't see it when I run the training process again; I immediately see the epoch progress. (Note: verbose is set to 1.)
I even checked keras.io/guides; none of the outputs for the Sequential.fit() method show this line either.
Did that happen due to a new update or do I need to add a certain parameter?
tf.compat.v1.disable_eager_execution()
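For context (this reasoning is not stated in the answer above, so treat it as an assumption): in TF 2.x eager mode, Keras uses the newer training loop, which does not print the "Train on N samples, validate on M samples" header; disabling eager execution makes Keras fall back to the TF1-style loop that does. The call has to happen before the model is built. A self-contained sketch:

import numpy as np
import tensorflow as tf

# Must run before any model or tensor is created.
tf.compat.v1.disable_eager_execution()

x_train, y_train = np.random.rand(100, 4), np.random.rand(100, 1)
x_test, y_test = np.random.rand(20, 4), np.random.rand(20, 1)

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
model.compile(optimizer='adam', loss='mse')

# With eager execution disabled, Keras prints
# "Train on 100 samples, validate on 20 samples" again before the progress bar.
model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=2, verbose=1)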

Unknown number of steps - Training a convolutional neural network on Google Colab Pro

I am trying to train my CNN on Google Colab Pro. When I run my code everything is fine, but it does not know the number of steps, so an infinite loop is created.
Mounted at /content/drive
2.2.0-rc3
Found 10018 images belonging to 2 classes.
Found 1336 images belonging to 2 classes.
WARNING:tensorflow:`period` argument is deprecated. Please use `save_freq` to specify the frequency in number of batches seen.
Epoch 1/300
8/Unknown - 364s 45s/step - loss: 54.9278 - accuracy: 0.5410
I am using ImageDataGenerator() for loading images. How can I fix it?
An iterator does not store anything; it generates the data dynamically. When you are using a dataset or dataset iterator, you must provide steps_per_epoch, because the length of an iterator is unknown until you iterate through it. You could explicitly pass len(datafiles) into the .fit function. So you need to provide steps_per_epoch as shown below.
model.fit_generator(
    train_data_gen,
    steps_per_epoch=total_train // batch_size,
    epochs=epochs,
    validation_data=val_data_gen,
    validation_steps=total_val // batch_size
)
More details are mentioned here
steps_per_epoch: Integer or None. Total number of steps (batches of samples) before declaring one epoch finished and starting the next epoch. When training with input tensors such as TensorFlow data tensors, the default None is equal to the number of samples in your dataset divided by the batch size, or 1 if that cannot be determined. If x is a tf.data dataset and steps_per_epoch is None, the epoch will run until the input dataset is exhausted. This argument is not supported with array inputs.
I notice you are using binary classification. One more thing to remember when you use ImageDataGenerator is to provide class_mode as shown below. Otherwise, there will be a bug (in keras) or 50% accuracy (in tf.keras).
train_data_gen = train_image_generator.flow_from_directory(
    batch_size=batch_size,
    directory=train_dir,
    shuffle=True,
    target_size=(IMG_HEIGHT, IMG_WIDTH),
    class_mode='binary')
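As a side note (not part of the original answer): in more recent TensorFlow 2.x releases, Model.fit_generator is deprecated and Model.fit accepts the generator directly with the same arguments, so the call above could equivalently be written as:

model.fit(
    train_data_gen,
    steps_per_epoch=total_train // batch_size,
    epochs=epochs,
    validation_data=val_data_gen,
    validation_steps=total_val // batch_size
)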

Keras with tensorflow throws ResourceExhaustedError

For research purposes, I am training a neural network that is updating its weights differently depending on the parity of the epoch:
1) If the epoch is even, change the weights of the NN with backpropagation
2) If the epoch is odd, only update the model with update_weights_with_custom_function(), and therefore freeze the network.
Here is a simplified part of the code that implements this (notice the epochs=1):
for epoch in range(nb_epoch):
    if epoch % 2 == 0:
        model.trainable = True   # Unfreeze the model
    else:
        model.trainable = False  # Freeze the model

    model.compile(optimizer=optim, loss=gaussian_loss, metrics=['accuracy'])

    hist = model.fit(X_train, Y_train,
                     batch_size=batch_size,
                     epochs=1,
                     shuffle=True,
                     verbose=1,
                     callbacks=[tbCallBack, csv_epochs, early_stop],
                     validation_data=(X_val, Y_val))

    if epoch % 2 == 1:
        update_weights_with_custom_function()
Problem: after a few epochs, Keras throws a ResourceExhaustedError, but only with TensorFlow, not with Theano. It seems that looping over compile() is creating models without releasing them.
Therefore, what should I do? I know that K.clear_session() releases memory, but it requires saving the model and reloading it (see), which gives me some issues because load_model() does not work out of the box in my case.
I'm also open to other ways to do what I am trying to achieve (i.e. freezing a NN model depending on the parity of the epoch).
Summary: Keras with the TensorFlow backend is throwing a ResourceExhaustedError because I am looping over compile().
As Marcin Możejko pointed out, using eval() does exactly what I was trying to achieve.
I added a custom callback (inspiration was here), which avoids the loop over compile().
The problem is now solved, even though the underlying TensorFlow issue was not addressed directly.
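The callback itself is not shown in the answer, so the following is only an illustrative sketch of one way to get a similar effect inside a single compile()/fit() pair. It is not the author's actual implementation: instead of toggling model.trainable, it sets the learning rate to 0 on odd epochs (so backprop leaves the weights essentially untouched, although optimizer slot variables such as Adam's moments still update) and calls the custom update afterwards. update_weights_with_custom_function() is assumed to modify the model's weights in place.

from keras import backend as K
from keras.callbacks import Callback

class ParityFreezeCallback(Callback):
    def __init__(self, base_lr):
        super(ParityFreezeCallback, self).__init__()
        self.base_lr = base_lr

    def on_epoch_begin(self, epoch, logs=None):
        # Even epochs train normally; odd epochs get a zero learning rate.
        lr = self.base_lr if epoch % 2 == 0 else 0.0
        K.set_value(self.model.optimizer.lr, lr)

    def on_epoch_end(self, epoch, logs=None):
        if epoch % 2 == 1:
            update_weights_with_custom_function()  # the asker's custom update

# compile() is now called exactly once, outside any loop:
model.compile(optimizer=optim, loss=gaussian_loss, metrics=['accuracy'])
model.fit(X_train, Y_train,
          batch_size=batch_size,
          epochs=nb_epoch,
          shuffle=True,
          verbose=1,
          callbacks=[ParityFreezeCallback(base_lr=0.001), tbCallBack, csv_epochs, early_stop],
          validation_data=(X_val, Y_val))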

How can I compute model metrics during training with canned estimators?

Using Keras, one typically gets metrics (e.g. accuracy) as part of the progress bar for free. Using the example here:
https://github.com/fchollet/keras/blob/master/examples/mnist_mlp.py
After running e.g.
history = model.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=1,
                    validation_data=(x_test, y_test))
Keras will start fitting the model, and will show progress output with something like:
3584/60000 [>.............................] - ETA: 10s - loss: 0.0308 - acc: 0.9905
Suppose I wanted to accomplish the same thing using a TensorFlow canned estimator -- extract the current accuracy for a classifier, and display that as part of a progress bar (done by e.g. a SessionRunHook).
It seems like accuracy metrics aren't provided as part of the default set of operations on a graph. Is there a way I can manually add it myself with a session run hook?
(It looks like it's possible to add operations to the graph as part of the begin() hook, but I'm not sure how I can e.g. request the computation of the model accuracy there.)
Accuracy is one of the default metrics in canned classifiers, but it is calculated by the Estimator.evaluate call, not by Estimator.train. You can create a for loop to do what you want:
for _ in range(num_rounds):  # num_rounds is a placeholder for however many train/evaluate cycles you want
    estimator.train(training_data)
    metrics = estimator.evaluate(evaluation_data)
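If the alternation between training and evaluation should be handled by TensorFlow rather than by a manual loop, tf.estimator.train_and_evaluate is another option; a sketch, assuming train_input_fn and eval_input_fn are the usual Estimator input functions (these names are placeholders):

import tensorflow as tf

train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn, max_steps=10000)
eval_spec = tf.estimator.EvalSpec(input_fn=eval_input_fn, steps=100,
                                  throttle_secs=60)  # wait at least 60s between evaluations

# Alternates training and evaluation; the evaluation metrics (including accuracy
# for canned classifiers) are written to the estimator's model_dir, where
# TensorBoard can display them while training runs.
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)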