Add new layers and restore training from checkpoint in TensorFlow

I have trained a model with TensorFlow, and the training took weeks. Now I want to add a new layer to the current model and continue training from the trained weights. However, if I restore the checkpoint, an error like "xxx not found in checkpoint" occurs.
So how can I restore from a checkpoint into a modified model in TensorFlow?
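One approach (a TF1-style sketch; the checkpoint path and the graph itself are placeholders for your own) is to build the modified graph first, then restore only the variables that actually exist in the checkpoint, so the new layer keeps its fresh initialization:
import tensorflow as tf

ckpt_path = 'model.ckpt'  # placeholder: path to your existing checkpoint
reader = tf.train.NewCheckpointReader(ckpt_path)
saved_shapes = reader.get_variable_to_shape_map()

# Restore only variables whose name and shape match the checkpoint;
# the new layer's variables are skipped and keep their initial values.
restore_vars = [v for v in tf.global_variables()
                if v.op.name in saved_shapes
                and v.shape.as_list() == saved_shapes[v.op.name]]

saver = tf.train.Saver(var_list=restore_vars)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())  # initializes everything, including the new layer
    saver.restore(sess, ckpt_path)               # overwrites the matched, trained weights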

Related

Tensorflow Keras - Does Model.save() save the best model?

I have been training several models using 10-fold CV and added the ModelCheckpoint callback, which saves the model with the lowest validation loss to an HDF5 file. However, for a while I would then call model.save(filepath) right after training.
I only just came to the realization that this last call probably saves the model as trained on the very last epoch, and that the saved checkpoint is not being used at all. Is my assumption correct? If so, is it normal that the best models from the checkpoint files score lower than the ones saved with model.save()?
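The assumption is correct: model.save() right after fit() stores the weights from the final epoch, not the best one. A minimal sketch of relying on the checkpoint file alone (the file name and data variables are placeholders):
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.models import load_model

# Keep only the epoch with the lowest validation loss on disk
checkpoint = ModelCheckpoint('best_model.h5', monitor='val_loss',
                             save_best_only=True)
model.fit(x_train, y_train, validation_data=(x_val, y_val),
          epochs=50, callbacks=[checkpoint])

# Load the best epoch's weights instead of calling model.save() here
best_model = load_model('best_model.h5')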

Partially restore weights in TF2

I trained a ResNet model whose final layer has 3 outputs (multiclass classification). I want to use these model weights to pretrain a regression model that has the exact same architecture except for the last layer, which has 1 output.
This seems like a very basic use case, but I do not see how to do it. Restoring the checkpoint gives an error since the architectures are not the same (mismatched shapes). All other solutions I have found are either for TF1 (e.g. https://innerpeace-wu.github.io/2017/12/13/Tensorflow-Restore-partial-weights/) or use the Keras .h5 format.
How can I do this in TF2?
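One option (a sketch, assuming the checkpoint was written with tf.train.Checkpoint and that build_model() below stands in for your own model-building code) is to restore the weights into the original 3-output model first, then copy every layer except the head into the 1-output model:
import tensorflow as tf

def build_model(num_outputs):
    # Placeholder for your ResNet architecture; only the head differs
    backbone = tf.keras.applications.ResNet50(include_top=False,
                                              pooling='avg', weights=None)
    outputs = tf.keras.layers.Dense(num_outputs)(backbone.output)
    return tf.keras.Model(backbone.input, outputs)

classifier = build_model(3)
ckpt = tf.train.Checkpoint(model=classifier)
ckpt.restore('ckpt_dir/ckpt-1').expect_partial()  # placeholder checkpoint path

regressor = build_model(1)
# Copy weights layer by layer, skipping the mismatched final Dense layer
for src, dst in zip(classifier.layers[:-1], regressor.layers[:-1]):
    dst.set_weights(src.get_weights())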

Loading a checkpoint from a trained model using estimator

I want to do a very simple task. Let us assume that I have trained a model and saved multiple checkpoints and metadata for it using tf.estimator. Assume I have 3 checkpoints: 1, 2, and 3. While evaluating the trained results on TensorBoard, I realize that checkpoint 2 provides the best weights for my objective.
Therefore I want to load checkpoint 2 and make my predictions. Simply put: is it possible to delete checkpoint 3 from the model dir and let the estimator load checkpoint 2 automatically, or is there anything else I can do to load a specific checkpoint for my predictions?
Thank you.
Yes, you can. By default, the Estimator loads the latest available checkpoint in model_dir, so you can either delete files manually or specify the checkpoint file with
warm_start = tf.estimator.WarmStartSettings(ckpt_to_initialize_from='file.ckpt')
and pass this to the Estimator:
estimator = tf.estimator.Estimator(model_fn=model_fn,
                                   config=run_config,
                                   model_dir='dir',
                                   warm_start_from=warm_start)
The latter option will not mess up the TensorBoard summaries, so it's generally cleaner.

In Keras, what's the best way to save a model checkpoint and reload it for training on a different data set?

So let's say I have a trained Keras neural network called model. I want to save the model before further training so I can go back to that checkpoint as I train it on different datasets.
model.save('checkpoint_1.h5')
model.fit(data_1, labels_1)
model.save('checkpoint_2.h5')
Now here's where my question comes in. I want to free up the GPU memory so I can reload checkpoint_1 before further training of the model. What I'm currently doing is ending the current TensorFlow session and starting a new one.
import tensorflow as tf
from keras import backend as K

# End the current TensorFlow session and start a new one to free up
# GPU memory before the next network is trained.
K.get_session().close()
sess = tf.Session()
K.set_session(sess)
I then load checkpoint_1 and continue training.
model = load_model('checkpoint_1.h5')
model.fit(data_2, labels_2)
Is there a better way of doing this? Stopping and starting TensorFlow sessions takes a lot of time.
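One lighter-weight alternative (a sketch; whether it is actually faster depends on your setup) is keras.backend.clear_session(), which destroys the current graph and frees its resources in a single call:
from keras import backend as K
from keras.models import load_model

K.clear_session()  # discards the old graph and its GPU state
model = load_model('checkpoint_1.h5')
model.fit(data_2, labels_2)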

What is the difference between saving a summary and saving the model in the logdir?

Using TensorFlow (tf.contrib.slim in particular), we are required to calibrate a few parameters to produce the graphs that we want in TensorBoard.
What saving a summary at an interval does is fairly clear to us: it saves the value (or an average of values?) of a particular point in the graph at the interval provided.
But why should checkpoints, which save the model itself, be required during the training process? Does the model change? I'm not sure how this works.
You save the model to checkpoints because the Variables in the model, including the neural network weights and biases and the global_step counter, keep changing during the training process. The structure of the model does not change. The saved checkpoints allow you to load the trained model for serving and to resume training later.
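For example (a minimal TF1-style sketch with a toy Variable standing in for real weights; the logdir path is a placeholder):
import tensorflow as tf

w = tf.Variable(0.0, name='weight')  # the Variable's value changes...
train_op = tf.assign_add(w, 1.0)     # ...on every "training" step
saver = tf.train.Saver()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(100):
        sess.run(train_op)
    saver.save(sess, 'logdir/model.ckpt')  # checkpoints the current values

# Later: restoring brings back the trained values, not the initial ones
with tf.Session() as sess:
    saver.restore(sess, 'logdir/model.ckpt')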