I am doing research in explainable AI, looking at the patterns in weights as a function of model hyperparameters and input data. One of the things I'm examining is how weights progress from randomness (or the initializer's starting values) to stabilization once learning completes. Instead of saving weights every epoch, I'd like to save them at every second or third forward pass. How would I do that? Do I need to tweak the 'period' argument of the Keras ModelCheckpoint callback? If so, what's an easy formula for setting that argument? Thanks, and have a great day.
Just pass save_freq=3 when instantiating tf.keras.callbacks.ModelCheckpoint.
Quoting the docs:
https://keras.io/api/callbacks/model_checkpoint/
save_freq: 'epoch' or integer. When using 'epoch', the callback saves the model after each epoch. When using integer, the callback saves the model at end of this many batches. If the Model is compiled with steps_per_execution=N, then the saving criteria will be checked every Nth batch. Note that if the saving isn't aligned to epochs, the monitored metric may potentially be less reliable (it could reflect as little as 1 batch, since the metrics get reset every epoch). Defaults to 'epoch'.
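As a rough sketch of what that looks like in practice (the model, data, and file name pattern below are made up for illustration, not taken from the question):

import numpy as np
import tensorflow as tf

# Tiny stand-in model and data, purely for illustration.
x = np.random.rand(256, 8).astype("float32")
y = np.random.rand(256, 1).astype("float32")
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(8,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# save_freq=3 -> a checkpoint every 3 batches instead of every epoch.
# Note that only {epoch} (and logged metrics) are available as filepath
# placeholders, so checkpoints within the same epoch overwrite each other;
# a custom callback would be needed to get one file per batch.
ckpt = tf.keras.callbacks.ModelCheckpoint(
    filepath="weights_epoch{epoch:02d}.weights.h5",
    save_weights_only=True,
    save_freq=3,
)
model.fit(x, y, epochs=2, batch_size=32, callbacks=[ckpt])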
I have created a model testing pipeline for use in my internship, which runs on Google Colab. The pipeline allows multiple sets of models and parameters to be tested back-to-back. It spins up a model and a set of parameters in a user-defined manner, trains for 15 epochs, and validates after every epoch. It uses two ModelCheckpoints to save models as h5 files: one saves every epoch, and the other saves only the best epoch under a known name in a different folder, so that it can be easily loaded later.
For reference, every model/parameter set tested is identified by a unique tester ID and a model count number, which is incremented for every model. The checkpoints saved every epoch also have the epoch number appended to the filename.
After all 15 epochs, the best model is loaded and evaluated on our testing set. Then the next model and set of parameters is spun up and the process repeats until it hits a user-defined stopping point.
At least, that is how it is supposed to work.
What happens instead is that the first model to be run goes according to plan. Then the next model is loaded up and trains and validates for one epoch. However, when it comes time to save the checkpoint for the first epoch, the following is thrown:
RuntimeError: Unable to create link (name already exists)
After that occurs, the only way I have found to avoid the error at the end of the first epoch is to reset the Colab runtime, at which point I get one more model out of it before the error occurs again. (Note: this is not the same model I got out before; I adjusted the method parameters to start at the next model that needed to run.)
Finally, to firmly lay to rest the most common causes of this error, I have tried running both model.summary() and for i, w in enumerate(model.weights): print(i, w.name). I do not have duplicate names indicated by either of these.
I am unsure why this behavior is occurring; my best guess is that it is some combination of Colab's caching behavior and whatever mechanism ModelCheckpoint uses to save the files, causing it to see a name overlap where there is none.
Any further insight that can be provided as to why this is occurring and how to solve it would be greatly appreciated.
I eventually solved this, but the issue was not what I had thought it was. The issue was that I was using a class to wrap the parameters I passed to model.fit() and model.predict(). For the optimizer parameter, I had a default value of Adam(learning_rate=0.0001). Because that default is a mutable object created only once, the same optimizer instance was being reused in every run, which was causing the error. I changed the default value to None, initialized the variable inside the method when the value was None, and the problem disappeared.
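For anyone who hits the same thing, here is a minimal sketch of the anti-pattern and the fix; the class and parameter names are made up for illustration:

import tensorflow as tf

class FitParams:
    # Anti-pattern: a default like optimizer=Adam(learning_rate=0.0001) is
    # evaluated once, when the function is defined, so the very same optimizer
    # object (with all its internal state) gets reused by every run.

    # Fix: default to None and create a fresh optimizer per run.
    def __init__(self, optimizer=None):
        if optimizer is None:
            optimizer = tf.keras.optimizers.Adam(learning_rate=0.0001)
        self.optimizer = optimizer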
After fitting the model with model.fit(...), you can use .evaluate() or .predict() methods with the model.
The problem arises when I use Checkpoint during training.
(Let's say 30 checkpoints, with checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(filepath, save_weights_only=True))
Then I can't quite figure out what I am left with as the final state of the model.
Is it the best one, or the latest one?
If the former, one of the 30 checkpoints should be the same as the model I have left.
If the latter, the latest checkpoint should be the same as the model I have left.
Of course, I checked both cases, and neither is right.
If you set save_best_only=True, the checkpoint saves the model weights for the epoch that had the "best" performance. For example, if you were monitoring 'val_loss', it saves the model for the epoch with the lowest validation loss. If save_best_only=False, the model is saved at the end of each epoch regardless of the value of the monitored metric. Of course, if you do not use special formatting for the model save path, the saved weights are overwritten at the end of each epoch.
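A short sketch of both variants (the file names and the monitored metric are just examples):

import tensorflow as tf

# Keep only the single best epoch, judged by the lowest validation loss.
best_only = tf.keras.callbacks.ModelCheckpoint(
    "best_model.h5", monitor="val_loss", save_best_only=True)

# Save every epoch; the formatted path gives each epoch its own file,
# so earlier checkpoints are not overwritten.
every_epoch = tf.keras.callbacks.ModelCheckpoint(
    "model_epoch{epoch:02d}_vl{val_loss:.4f}.h5", save_best_only=False)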
If I use the Keras ModelCheckpoint callback with save_best_only=True and period=3, how will the model be saved? Does it save the best result within each 3-epoch period, or simply the best one of all epochs?
Piece of code that I used:
mcp = tf.keras.callbacks.ModelCheckpoint(
    "my_model.h5", monitor="val_accuracy", save_best_only=True, period=3)
First of all, according to the documentation, the period argument is deprecated in favor of the save_freq argument (which, if assigned an integer, counts the number of batches seen rather than epochs). For backwards compatibility, however, the period argument still works.
But to answer your question, we need to inspect the source code of the ModelCheckpoint callback. The best value of the monitored metric seen so far is updated only once period epochs have passed since the last checkpoint. And since that best value is compared only against the monitored metric of the current epoch, we can conclude that only the models at epochs period, 2*period, 3*period, etc. are compared and saved; the model's performance in the epochs between those checkpoints is ignored.
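A simplified sketch of that behavior (this is not the actual Keras source, just the logic described above, assuming a "min" metric such as val_loss):

# Rough model of how ModelCheckpoint behaves with period and save_best_only.
epochs_since_last_save = 0
best = float("inf")

def on_epoch_end(epoch, current_value, period=3):
    global epochs_since_last_save, best
    epochs_since_last_save += 1
    if epochs_since_last_save < period:
        return                      # epochs between checkpoints are ignored entirely
    epochs_since_last_save = 0
    if current_value < best:        # only this epoch's value is compared to the best so far
        best = current_value
        print(f"saving model at epoch {epoch}")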
Setting save_freq=3 will attempt to save the model every 3 batches (the deprecated period argument, by contrast, counts epochs). If you want it to save at the end of every epoch, set save_freq='epoch'.
If save_best_only=True, it checks whether the monitored metric (here, validation accuracy) has improved on the best value seen so far and only saves the model in that case. If the metric has not improved, the model is not saved.
Source: https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/ModelCheckpoint#arguments_1
Is it possible to get the value of steps_per_epoch with which fit was executed?
Please do not suggest dividing the number of training examples by the batch size. I am wondering whether this can be retrieved from a property of the model or something similar.
Yes, all models in Keras have a default History callback, which basically holds the training history of the model. One piece of information stored in this callback is the value of steps_per_epoch, which can be accessed via model.history.params['steps'].
Note that if the model has been saved after training, and then is loaded back later, then it may not be possible to access its History callback; that's because, as far as I know, the callback(s) is not persisted when saving the model.
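For example (with a tiny made-up model and dataset, just to show where the value lives):

import numpy as np
import tensorflow as tf

x = np.random.rand(320, 4).astype("float32")
y = np.random.rand(320, 1).astype("float32")

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(optimizer="adam", loss="mse")

history = model.fit(x, y, epochs=2, batch_size=32)  # fit() returns the History callback
print(history.params["steps"])        # 10, i.e. 320 samples / batch size of 32
print(model.history.params["steps"])  # same object, reachable from the model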
I am using TensorFlow, but I am not sure why I even need the global_step variable, or whether it is even necessary for training. I have something like this:
gradients_and_vars = optimizer.compute_gradients(value)
train_op = optimizer.apply_gradients(gradients_and_vars)
and then in my loop inside a session I do this:
_ = sess.run([train_op])
I am using a queue to feed my data to the graph. Do I even have to instantiate a global_step variable?
My loop looks like this:
while not coord.should_stop():
So the loop stops when it should stop. Why do I need the global_step at all?
You don't need the global step in all cases. But sometimes people want to stop training, tweak some code, and then continue training from the saved and restored model. Then it is often nice to know how long (i.e., for how many steps) the model has been trained so far. Hence the global step.
Also, your learning rate schedule might depend on how long the model has already been trained. Say you want to decay your learning rate every 100,000 steps; if you interrupted training somewhere in between and did not keep track of the number of steps already taken, that becomes difficult.
Furthermore, if you are using TensorBoard, the global step is the central parameter for the x-axis of your charts.
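To make that concrete, here is a minimal sketch in the TF1-style API the question uses, written against tf.compat.v1; the loss and the decay settings are placeholders:

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# Placeholder loss, just so the example builds.
w = tf.get_variable("w", initializer=1.0)
loss = tf.square(w - 3.0)

# The global step counts how many training steps have been applied so far.
global_step = tf.train.get_or_create_global_step()

# A learning-rate schedule that depends on the global step,
# e.g. decay by 4% every 100,000 steps.
learning_rate = tf.train.exponential_decay(
    0.1, global_step, decay_steps=100000, decay_rate=0.96, staircase=True)

optimizer = tf.train.GradientDescentOptimizer(learning_rate)
gradients_and_vars = optimizer.compute_gradients(loss)
# Passing global_step here makes apply_gradients increment it on every run,
# so it is saved and restored along with the other variables and shows up
# as the x-axis in TensorBoard.
train_op = optimizer.apply_gradients(gradients_and_vars, global_step=global_step)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(5):
        _, step = sess.run([train_op, global_step])
    print("trained for", step, "steps")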