TF/Keras: ModelCheckpoint "period" and "save_best_only" - tensorflow

If I use Keras callback ModelCheckpoint, and I put save_best_only = True and period=3, how will the model be saved? After 3 period it saves the best result from that 3 period, or it just saves the best one of all epochs?
Piece of code that I used:
mcp = tf.keras.callbacks.ModelCheckpoint("my_model.h5", monitor="val_accuracy",
save_best_only=True, period=3)

First of all, according to documentation, period argument is deprecated in favor of save_freq argument (which if assigned to an int, it would consider number of seen batches and not the epochs). But for backwards compatibility, the period argument is still working.
But to find out the answer to your question, we need to inspect the source code for ModelCheckpoint callback. Actually, the best value of monitored metric seen so far is updated only if period epochs has passed (since the last checkpoint). And also. since the best metric value seen so far is compared with the monitored metric value of only current epoch, therefore we can conclude that only the best performing model in epochs period, 2*period, 3*period, etc. are compared and saved, and the performance of the model between those epochs is ignored.

Setting period=3 will attempt to save a model every 3 batches. If you want it to save at the end of every epoch, set period='epoch'.
If save_best_only=True it will check to see if the validation accuracy is higher this time than last time and only save that model. If the validation accuracy is not this high, it will not save the model.
Source: https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/ModelCheckpoint#arguments_1

Related

Saving model weights after each forward pass

I am doing research in explainable AI by looking at the patterns in weights as a function of model hyperparameters and input data. One of the things I'm examining is how weights progress from randomness (or initializer starting values) to stabilization, after learning completes. I'd like to, instead of saving weights every epoch, save them at every other or third forward pass. How would I do that? Must I tweak the 'period' argument to the Keras model checkpoint method? If so, what's an easy formula to set that argument? Thx and have a great day.
just pass save_freq=3 when instantiating the tf.keras.callbacks.ModelCheckpoint.
Quoting the docs:
https://keras.io/api/callbacks/model_checkpoint/
save_freq: 'epoch' or integer. When using 'epoch', the callback saves the model after each epoch. When using integer, the callback saves the model at end of this many batches. If the Model is compiled with steps_per_execution=N, then the saving criteria will be checked every Nth batch. Note that if the saving isn't aligned to epochs, the monitored metric may potentially be less reliable (it could reflect as little as 1 batch, since the metrics get reset every epoch). Defaults to 'epoch'.

What is the last state of the model after training?

After fitting the model with model.fit(...), you can use .evaluate() or .predict() methods with the model.
The problem arises when I use Checkpoint during training.
(Let's say 30 checkpoints, with checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(filepath, save_weights_only=True))
Then I can't quite figure out what do I have left, the last state of this model.
Is it the best one? or the latest one?
If the former is the case, one of 30 checkpoints should be same with the model I have left.
If the latter is the case, the latest checkpoint should be same with the model I have left.
Of course, I checked both the cases and neither one is right.
If you set save_best_only=True the checkpoint saves the model weights for the epoch that had the "best" performance. For example if your were monitoring 'val_loss' then it will save the model for the epoch with the lowest validation loss. If save_best_only=False then the model is saved at the end of each epoch regardless of the value of the metric being monitored. Of course if you do not use special formatting for the model save path then the save weights will be over written at the end of each epoch.

Defining assignment function as variable in tensroflow?

I am training a neural network by SGD (batch size = 1). The inputs are randomly generated, and the labels are calculated based on the input. AKA the data does not have to be realistic, but the relationships between inputs and labels are specific. I will train my NN only 1 epoch, but with many batches.
I have the following code:
training_input = tf.Variable(tf.zeros(...))
assign_training_input_with_random_values = training_input.assign(tf.random_normal(...))
//Create a session, initialize a bunch of variables, construct a neural network...
for batch in range(batch_number):
sess.run(assign_training_input_with_random_values)
//Train my neural network...
However I noticed that if I write the above code differently the speed goes down by a lot:
//Run the assignment operation directly without defining it as a variable
for batch in range(batch_size)
sess.run(training_input.assign(tf.random_normal(...)))
//Train my neural network...
The first snippet being significantly faster makes me worry that tensorflow is only randomizing when I define the assign_training_input_with_random_values variable, and the same training examples are fed to the NN over every batch afterwards. In this case, the NN will probably not generalize well. Meanwhile, the second snippet is slow because it is randomizing every batch. Is this actually the case or is there another reason for this?
First the explanation to your observations
Computational difference between 1st and 2nd solutions
It makes sense that your first solution is faster than the second. You define the assign operation once and then execute that for 100 epochs. However in the 2nd solution you create an op every epoch, growing the computational graph over time which causes your program to slow down.
Observation about the 1st solution
(After #Y.Z.'s finding) Apparently the first solution does evaluate to different random number arrays every time you run it. Therefore, the first solution is also valid.
Another way to implement this
The correct way to implement your solution would be to use a tf.placeholder to feed values in every epoch the following way.
import tensorflow as tf
import numpy as np
training_input = tf.Variable(tf.zeros(shape=[3, 2]))
tf_random = tf.placeholder(shape=[3, 2], dtype=tf.float32)
assign_training_input_with_random_values = training_input.assign(tf_random)
#Create a session, initialize a bunch of variables, construct a neural network...
epoch=0
with tf.Session() as sess:
while epoch < 10:
epoch+= 1
sess.run(assign_training_input_with_random_values, feed_dict={tf_random:np.random.normal(size=(3,2))})
Comparing Solution 1 vs My solution
So turns out, both your first solution and my solution will not grow the graph. If you run the line
print([n.name for n in tf.get_default_graph().as_graph_def().node])
for your first solution and my solution (Be careful to run tf.reset_default_graph() at the beginning) you'll see that the number of tensors remain constant regardless of the number of iterations. Appears that TensorFlow is smart enough to prune those old tf.random tensors no longer used.

Does steps_per_epoch use up the whole dataset

I have a large training dataset created by a generator, about 60,000 batches (size 32). Due to the time required for training, I need to use a callback to save the model periodically. However, I want to save it more frequently than once per epoch of 60,000, because that takes about 2 hours on Colab.
As I understand it, setting steps_per_epoch will give me smaller epochs, Say, 10,000. What is not clear to me from the documentation is will this still cycle through my whole 60k batches, or will it stop at 10k and just repeat that 10k? i.e. Does a new epoch start from where the last one left off when using steps_per_epoch?
Thanks, Julian
While I don't know about that option specifically, it wouldn't reuse old data because datasets are only meant to be processed forwards. If it repeated the data, it would have to store a copy of everything it's already processed somewhere since you can't reset a generator. That wouldn't be practical on a large dataset.

Why and when do I need to use the global step in tensorflow

I am using tensorflow, but I am not sure why I even need the global_step variable or if it is even necessary for training. I have sth like this:
gradients_and_vars = optimizer.compute_gradients(value)
train_op = optimizer.apply_gradients(gradients_and_vars)
and then in my loop inside a session I do this:
_ = sess.run([train_op])
I am using a Queue to feed my data the the graph. Do I even have to instantiate a global_step variable?
My loop looks like this:
while not coord.should_stop():
So this loop stops, when it should stop. So why do I need the global_step at all?
You don't need the global step in all cases. But sometimes people want to stop training, tweak some code and then continue training with the saved and restored model. Then often it is nice to know how long (=for how many time steps) this model had been trained so far. Thus the global step.
Also sometimes your learning rate regime might depend on the time the model already had been trained. Say you want to decay your learning rate every 100.000 steps. If you don't keep track of the number of steps already taken this might be difficult if you interrupted training in between and didn't keep track of the number of steps already taken.
And furthermore if you are using tensorboard the global step is the central parameter for your x-axis of the charts.