Saving the state of the AdaGrad algorithm in Tensorflow - tensorflow

I am trying to train a word2vec model, and want to use the embeddings for another application. As there might be extra data later, and my computer is slow when training, I would like my script to stop and resume training later.
To do this, I created a saver:
saver = tf.train.Saver({"embeddings": embeddings,"embeddings_softmax_weights":softmax_weights,"embeddings_softmax_biases":softmax_biases})
I save the embeddings and the softmax weights and biases so I can resume training later. (I assume this is the correct way, but please correct me if I'm wrong.)
Unfortunately when resuming training with this script the average loss seems to go up again.
My guess is that this can be attributed to the AdaGradOptimizer I'm using. Initially its gradient accumulator will probably be all zeros, whereas after my training it will be filled (leading to a lower effective learning rate).
Is there a way to save the optimizer state to resume learning later?

While TensorFlow seems to complain when you attempt to serialize an optimizer object directly (e.g. via tf.add_to_collection("optimizers", optimizer) and a subsequent call to tf.train.Saver().save()), you can save and restore the training update operation which is derived from the optimizer:
# init
if not load_model:
    optimizer = tf.train.AdamOptimizer(1e-4)
    train_step = optimizer.minimize(loss)
    tf.add_to_collection("train_step", train_step)
else:
    saver = tf.train.import_meta_graph(modelfile + '.meta')
    saver.restore(sess, tf.train.latest_checkpoint('./'))
    train_step = tf.get_collection("train_step")[0]

# training loop
while training:
    if iteration % save_interval == 0:
        saver = tf.train.Saver()
        save_path = saver.save(sess, filepath)
I do not know of a way to get or set the parameters specific to an existing optimizer, so I do not have a direct way of verifying that the optimizer's internal state was restored, but training resumes with loss and accuracy comparable to when the snapshot was created.
I would also recommend using the parameterless call to Saver() so that state variables not specifically mentioned will still be saved, although this might not be strictly necessary.
You may also wish to save the iteration or epoch number for later restoring, as detailed in this example:
http://www.seaandsailor.com/tensorflow-checkpointing.html
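To illustrate the parameterless-Saver point above: in graph-mode TensorFlow 1.x, AdaGrad creates its accumulators as ordinary slot variables, so a Saver constructed without an explicit var_list should pick them up along with the model parameters. A minimal hedged sketch (the variable shape and the toy loss below are made up purely for illustration):
import tensorflow as tf

# Hypothetical tiny setup, just to show that AdaGrad's accumulator slots are
# ordinary variables picked up by a parameterless Saver.
embeddings = tf.Variable(tf.random_uniform([1000, 64], -1.0, 1.0), name="embeddings")
loss = tf.reduce_sum(tf.square(embeddings))  # toy loss for illustration only

optimizer = tf.train.AdagradOptimizer(1.0)
train_step = optimizer.minimize(loss)

# The accumulator slot appears alongside the model variables, with a name
# along the lines of "embeddings/Adagrad".
print([v.name for v in tf.global_variables()])

saver = tf.train.Saver()  # no var_list: saves model variables and optimizer slots

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(train_step)
    saver.save(sess, "./model.ckpt")
    # Later, saver.restore(sess, "./model.ckpt") resumes with the accumulator intact.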

Related

Does Keras ModelCheckpoint save the best model across multiple fitting sessions?

If I have a Keras model fitted with the ModelCheckpoint callback and fit it in several 'fitting sessions' (i.e. I call model.fit() multiple times), will the callback save the best model in the most recent fitting session or the best model out of all fitting sessions?
Thanks.
Good question. I did an experiment with an existing model and data set. I created a checkpoint callback as shown and used it in model.fit
file_path1 = r'c:\temp\file1'
mchk = tf.keras.callbacks.ModelCheckpoint(filepath=file_path1, monitor="val_loss", verbose=1,
                                           save_best_only=True, save_weights_only=True,
                                           mode="auto", save_freq="epoch")
history = model.fit(X_train, Y_train, validation_data=val_data,
                    batch_size=128, epochs=5, verbose=1, callbacks=[mchk])
This saves only the weights, and only for the epoch with the lowest validation loss. I set verbose=1 in the callback so I could see the validation loss on each epoch. Next I ran essentially the same code again, but changed the filepath to file2. The code for that is below:
file_path2 = r'c:\temp\file2'
mchk = tf.keras.callbacks.ModelCheckpoint(filepath=file_path2, monitor="val_loss", verbose=1,
                                           save_best_only=True, save_weights_only=True,
                                           mode="auto", save_freq="epoch")
history = model.fit(X_train, Y_train, validation_data=val_data,
                    batch_size=128, epochs=5, verbose=1, callbacks=[mchk])
Now model.fit preserves its state at the end of a session, so if you run it a second time it starts from where it left off. However, it does not preserve the state of the callback. So on the second run the callback initializes the validation loss to np.inf and will therefore save the weights at the end of the first epoch for sure. If you don't change the filename, it will overwrite the file saved by the first run.
If in the second run the validation loss at which the weights were saved is LOWER than the best validation loss of the first run, you wind up with the best saved weights overall. However, if in the second run the validation loss is higher than in the first run, you end up not saving the OVERALL best weights. So that's how it works for the case where the callback has save_weights_only=True.
I thought it might behave differently if you save the entire model, because in that case it might preserve the state of the callback. So I reran the experiment with save_weights_only=False. The results indicate that saving the entire model does not save the state of the callback either. I am using Tensorflow 2.0; the results may be different for other versions, so I would run this experiment on your version and see if it behaves similarly.
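One hedged workaround for the two-run scenario above, assuming tf.keras's ModelCheckpoint exposes its best monitored value through a best attribute (as described in another answer below): seed the second callback with the best validation loss from the first run, so it only writes a new checkpoint when the new run actually improves on it. X_train, Y_train, val_data and file_path1 are the same placeholders as above.
# First run: save the best weights to file_path1.
mchk1 = tf.keras.callbacks.ModelCheckpoint(filepath=file_path1, monitor="val_loss",
                                            save_best_only=True, save_weights_only=True)
history1 = model.fit(X_train, Y_train, validation_data=val_data,
                     batch_size=128, epochs=5, callbacks=[mchk1])

# Second run: start the new callback from the first run's best val_loss instead of
# np.inf, so it only overwrites file_path1 when it genuinely improves on run one.
mchk2 = tf.keras.callbacks.ModelCheckpoint(filepath=file_path1, monitor="val_loss",
                                            save_best_only=True, save_weights_only=True)
mchk2.best = min(history1.history["val_loss"])  # assumes the best attribute is honored
history2 = model.fit(X_train, Y_train, validation_data=val_data,
                     batch_size=128, epochs=5, callbacks=[mchk2])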
It will save the best model in the most recent fitting session
It would save the model for the last fit() as you are essentially overwriting the same file.
If you want to find the best model over N fitting sessions, you should save them with an index N in the file name. This way it will save the best model for a particular fit() and you can easily compare them later. You could just manually substitute N, i.e. 1, 2, 3, ..., for each fit():
# Example
ModelCheckpoint(
    '/home/jupyter/checkpoint/best_model_{N}.h5',
    monitor="val_loss",
    save_best_only=True,
    save_weights_only=False,
    mode="min")
Yes, a checkpoint will only be saved if the performance is better than across all previous calls to fit. In other words, if none of your epochs in the latest call to fit had better performance than an epoch in a previous call to fit, that previous checkpoint won't be overwritten.
There is one proviso: you must remember to create the callback outside of the call to fit. That is, do this:
checkpoint_callback = keras.callbacks.ModelCheckpoint(
    "checkpoint.h5", save_best_only=True)
model.fit(..., callbacks=checkpoint_callback)
...
model.fit(..., callbacks=checkpoint_callback)
not this:
model.fit(..., callbacks=keras.callbacks.ModelCheckpoint(
    "checkpoint.h5", save_best_only=True))
...
model.fit(..., callbacks=keras.callbacks.ModelCheckpoint(
    "checkpoint.h5", save_best_only=True))
The checkpoint callback object has a best attribute which stores the best monitored value so far (and is initially set to the worst possible value, e.g. infinity if lower is good). This is not reset when the object is passed to fit. However, if you instantiate a new callback object within the call to fit, as in the latter code, naturally best will be initialised to the worst possible value, not the best monitored value stored by other callback objects in previous calls to fit.

Tensorflow load pre-trained model use different optimizer

I want to load a pre-trained model (optimized by AdadeltaOptimizer) and continue training with SGD (GradientDescentOptimizer). The models are saved and loaded with tensorlayer API:
save model:
import tensorlayer as tl
tl.files.save_npz(network.all_params,
                  name=model_dir + "model-%d.npz" % global_step)
load model:
load_params = tl.files.load_npz(path=resume_dir + '/', name=model_name)
tl.files.assign_params(sess, load_params, network)
If I continue training with Adadelta, the training loss (cross entropy) looks normal (it starts at a value close to that of the loaded model). However, if I change the optimizer to SGD, the training loss is as large as that of a newly initialized model.
I took a look at the model-xxx.npz file from tl.files.save_npz. It only saves all the model parameters as ndarrays. I'm not sure how the optimizer or learning rate is involved here.
You would probably have to import into a variable the loss/cross-entropy tensor that previously fed into your Adadelta optimizer. Then just feed it through your SGD optimizer instead:
saver = tf.train.import_meta_graph('filename.meta')
saver.restore(sess,tf.train.latest_checkpoint('./'))
graph = tf.get_default_graph()
cross_entropy = graph.get_tensor_by_name("entropy:0") #Tensor to import
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cross_entropy)
In this case, I had tagged the cross-entropy Tensor with the name entropy before training my pre-trained model, like so:
tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y_conv), name = 'entropy')
If you are unable to make changes to your pretrained model, you can obtain the list of Tensors in your model (after you have imported it) from graph and deduce which Tensor you require. I have no experience with Tensorlayer, so this guide is meant to provide more of an understanding. You can take a look at Tensorlayer-Layers; it should explain how to obtain your Tensor. As Tensorlayer is built on top of Tensorflow, most of the functions should still be available.
You can specify the parameters you want to save in your checkpoint file.
save_npz([save_list, name, sess])
In save_list you specify only the network parameters, which do not include the optimizer parameters: no learning rate or any other optimizer state.
If you want to save the current learning rate (in order to use the same exact learning rate when you restore the model) you have to add it to the save_list, like that:
save_npz(network.all_params + [learning_rate])
(I suppose that all_params is a list; concatenation is used rather than .extend(), since extend() returns None rather than the extended list.)
Since you want to change the optimizer, I suggest you save only the learning_rate as an optimizer parameter and not any other variables the optimizer creates.
That way you'll be able to change the optimizer and still restore the model; otherwise (if you put any other optimizer variables in your checkpoint) the graph you try to restore won't find the variables into which to place the saved values, and you won't be able to change it.
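A hedged sketch of that idea, using only the TensorLayer calls already shown in the question (save_npz, load_npz, assign_params); the learning_rate variable, file name and cross_entropy tensor are placeholders, and the exact save_npz signature may differ between TensorLayer versions:
import tensorflow as tf
import tensorlayer as tl

# While training with Adadelta: save the network parameters plus the current
# learning rate as the last entry of the list.
learning_rate = tf.Variable(0.01, trainable=False, name="learning_rate")
tl.files.save_npz(network.all_params + [learning_rate], name="model_and_lr.npz", sess=sess)

# Later, to resume with plain SGD: restore the network parameters only, read back
# the saved learning-rate value, and build a fresh SGD train op.
load_params = tl.files.load_npz(name="model_and_lr.npz")
tl.files.assign_params(sess, load_params[:-1], network)  # network weights only
restored_lr = float(load_params[-1])                     # saved learning rate

train_op = tf.train.GradientDescentOptimizer(restored_lr).minimize(cross_entropy)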
For the 2.x version, use the pre-trained model API instead, as in this example from https://tensorlayer.readthedocs.io/en/latest/user/get_start_advance.html#pre-trained-cnn:
vgg = tl.models.vgg16(pretrained=True)
img = tl.vis.read_image('data/tiger.jpeg')
img = tl.prepro.imresize(img, (224, 224)).astype(np.float32) / 255
output = vgg(img, is_train=False)

How to use Test data on saved model with queue approach (without feed_dict) #tensorflow?

I am new to tensorflow. I have built a convnet for MNIST image classification as follows: I use queues to read images (png) from disk, batch them, and pass them to the train op (I am quite comfortable with this now). Everything is fine up to training, and I evaluate my accuracy op at regular step intervals while training.
I am saving the model with a Saver object and can see the meta and checkpoint files being written to disk.
Now the real challenge is to restore the model once it has finished training and use it for predictions on new images.
One of the first steps in my (training) graph is the one below, which takes x_image (images from the train queue): h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)
As I am not using the feed-dictionary approach, I cannot just restore the accuracy op with a saver and pass it new data. I have to define a queue for the test data and rebuild the graph (exactly as before) with x_image changed to point to the test data queue.
How can I now restore the weights learned during training and use them with this new graph to simply run my predict/accuracy op?
I tried to follow the CIFAR-10 tutorial (https://github.com/tensorflow/models/blob/master/tutorials/image/cifar10/cifar10.py) but got lost in the eval code.
Also, if I add a dummy constant to my training graph and then try to retrieve its value, I am able to retrieve it.
Can anyone please help? Thanks.
OK, So I have found the answer.
The original challenge was to toggle between train and test data during the training and validation phases when using queues.
Now as queues are part of graph structure, we can't simply modify them.
I found an article about using tf.case to toggle between the train and test queues, but I wasn't able to use shuffle batch along with it.
The real task at hand was to save the model post training and use the saved model to predict in production.
So here is the flow:
Training
Create a method that builds your graph (it will take an image tensor as input).
Build a training graph by passing in training image batches.
Perform training and save the model with a saver object.
Evaluation
Now reconstruct the same graph with test image batches.
In the session, use the saver object to restore the weights (note you don't need to specify which variables to restore; by default it restores all restorable variables).
Don't run the global variable initializer at this time.
Run your predict op (generated from the newly constructed graph).
Also make sure you switch off the dropout functionality in eval, as it would keep varying the output for the same input (see the sketch after the evaluation code below).
Below is the pseudocode
train_op, y_predict, accuracy = create_graph(train_input, train_label)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)
    model_saver = tf.train.Saver()
    for i in range(2000):
        if i % 100 == 0:
            train_accuracy = sess.run(accuracy)
            print("step %d, training accuracy %f" % (i, train_accuracy))
        sess.run(train_op)
    print(sess.run(accuracy))
    model_saver.save(sess, 'model/simple_model', global_step=100)
    coord.request_stop()
    coord.join(threads)
For evaluation
_, y_predict, accuracy = create_graph(test_input, test_label)
saver = tf.train.Saver()
with tf.Session() as sess:
    saver.restore(sess, tf.train.latest_checkpoint("./model/"))
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)
    label_predict = sess.run([y_predict])
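Regarding the note about switching off dropout during eval: since the graph is rebuilt for evaluation anyway, one hedged option is to make the keep probability a Python argument of the graph-building function and pass 1.0 when constructing the evaluation graph. The create_graph body below is a made-up stand-in for the actual convnet, just to show where the argument goes.
import tensorflow as tf

def create_graph(image_batch, label_batch, keep_prob=1.0):
    # Hypothetical graph builder; keep_prob=1.0 makes dropout a no-op for eval.
    flat = tf.reshape(image_batch, [-1, 28 * 28])
    hidden = tf.layers.dense(flat, 128, activation=tf.nn.relu)
    hidden = tf.nn.dropout(hidden, keep_prob)
    logits = tf.layers.dense(hidden, 10)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=label_batch, logits=logits)
    train_op = tf.train.AdamOptimizer(1e-4).minimize(loss)
    y_predict = tf.argmax(logits, axis=1)
    correct = tf.equal(y_predict, tf.cast(label_batch, tf.int64))
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
    return train_op, y_predict, accuracy

# In the training script: dropout enabled.
train_op, y_predict, accuracy = create_graph(train_input, train_label, keep_prob=0.5)

# In the evaluation script (before saver.restore): dropout disabled.
_, y_predict, accuracy = create_graph(test_input, test_label, keep_prob=1.0)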

"rewind" tensorflow training step

I occasionally hit a problem with training in tensorflow and stochastic gradient descent where I load a mini-batch that wreaks havoc on my optimization op, pushing it to NaNs. This, of course, throws an error in the training process and forces me to start over. Even if I wrap the optimization op in a try statement, by the time an exception is raised, the damage is done and I need to restart.
Does anyone have a good way of, essentially, rewinding optimization back to a valid state when it hits an error? I would think you could use checkpoints for this, but the docs on saving/restoring are so spotty that I'm not sure...
As you suggest, checkpoints are the way to do it. The key steps for your case are as follows:
First create a saver object after you've defined your graph:
saver = tf.train.Saver(max_to_keep=5, keep_checkpoint_every_n_hours=1)
Next, write out check points intermittently during training:
for step in range(max_steps):
    # ... some training steps here ...
    # Save the model every 100 iterations
    if step % 100 == 0:
        saver.save(sess, checkpoint_dir, global_step=step)
Finally, when you catch an error, reload the last good checkpoint:
# this next command restores the latest checkpoint or explicitly specify the filename if you want to use some other logic
restore_fn = tf.train.latest_checkpoint(FLAGS.restore_dir)
print('Restoring from %s' % restore_fn)
saver.restore(sess, restore_fn)
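Putting the pieces together, a hedged sketch of the "rewind" loop; train_op, loss, max_steps and the directories are placeholders, and the NaN check on the loss is just one way to detect a bad step:
import numpy as np
import tensorflow as tf

saver = tf.train.Saver(max_to_keep=5, keep_checkpoint_every_n_hours=1)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(max_steps):
        try:
            _, loss_value = sess.run([train_op, loss])
            if not np.isfinite(loss_value):
                raise ValueError("loss became NaN/Inf at step %d" % step)
            if step % 100 == 0:
                saver.save(sess, checkpoint_dir, global_step=step)
        except (tf.errors.InvalidArgumentError, ValueError) as e:
            # Rewind: reload the most recent good checkpoint and keep training.
            restore_fn = tf.train.latest_checkpoint(FLAGS.restore_dir)
            print('Hit "%s"; restoring from %s' % (e, restore_fn))
            saver.restore(sess, restore_fn)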
Answering a different question:
Which optimizer are you using?
Big jumps, like you can get with simple gradient descent, shouldn't be possible with gradient clipping or an optimizer with a limited step size (like Adam).
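For reference, a hedged sketch of gradient clipping in graph-mode TensorFlow 1.x (loss and the learning rate are placeholders): compute the gradients explicitly, clip them by global norm, and apply them instead of calling minimize directly.
import tensorflow as tf

optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.1)

# minimize(loss) is just compute_gradients followed by apply_gradients;
# splitting the two lets us clip in between.
grads_and_vars = optimizer.compute_gradients(loss)
grads, variables = zip(*grads_and_vars)
clipped_grads, _ = tf.clip_by_global_norm(grads, clip_norm=5.0)
train_op = optimizer.apply_gradients(zip(clipped_grads, variables))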

Tensorflow: Saving and restoring the model parameters

I am a beginner in TensorFlow, currently training a CNN.
I am using Saver in order to save the parameters used by the model, but I am unsure whether this by itself stores all the Variables used by the model, and whether it is sufficient to restore the values so I can re-run the program to perform classification/testing on the trained network.
Let us look at the famous example MNIST given by TensorFlow.
In the example, we have a bunch of convolutional blocks, all of which have weight and bias variables that get initialised when the program is run.
W_conv1 = init_weight([5,5,1,32])
b_conv1 = init_bias([32])
After having processed several layers, we create a session, and initialise all the variables added to the graph.
sess = tf.Session()
sess.run(tf.initialize_all_variables())
saver = tf.train.Saver()
Here, is it possible to comment out the saver.save code and replace it with saver.restore(sess, file_path) after training, in order to restore the weight, bias, etc. parameters back to the graph? Is this how it should be done?
for i in range(1000):
    ...
    if i % 500 == 0:
        saver.save(sess, "model%d.ckpt" % (i))
I am currently training on a large dataset, so terminating and restarting the training is a waste of time and resources, so I request that someone please clarify before I start the training.
If you want to save the final result only once, you can do this:
with tf.Session() as sess:
    for i in range(1000):
        ...
    path = saver.save(sess, "model.ckpt")  # out of the loop
    print("Saved:", path)
In other programs, you can load the model using the path returned from saver.save for prediction or something. You can see some examples at https://github.com/sugyan/tensorflow-mnist.
Based on the explanation here and Sung Kim's solution, I wrote a very simple model exactly for this problem. Basically, you need to create an object from the same class and restore its variables from the saver. You can find an example of this solution here.
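To answer the original question directly, a hedged sketch of the restore-for-testing flow with the same era API used above; the checkpoint name and the accuracy, x, y_, keep_prob ops are placeholders borrowed from the MNIST example:
import tensorflow as tf

# Rebuild the same graph definition used for training (same variable names and
# shapes, e.g. W_conv1 = init_weight([5, 5, 1, 32]), b_conv1 = init_bias([32]), ...).

sess = tf.Session()
saver = tf.train.Saver()

# Instead of initializing and training, restore the previously saved values;
# there is no need to run the variable initializer for restored variables.
saver.restore(sess, "model500.ckpt")

# Now run the classification/accuracy op on the test data.
test_accuracy = sess.run(accuracy, feed_dict={x: X_test, y_: Y_test, keep_prob: 1.0})
print("test accuracy %g" % test_accuracy)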