Restoring a model with TensorFlow does not lead to the same prediction

My question revolves around TensorFlow and its tf.train.Saver() class, used to save the state of a network and restore it afterwards. My code is very long and messy, but it is structured as follows:
1. Generate a set of parameter settings for the network
2. Build the network
3. Either train the network or restore it from a saved model file
4. Predict on some external data
Now to my question. When I run my model with the exact same parameters several times I get the exact same performance. If I restore the same model several times I also get the same performance. However, the performance when I train the model and when I restore model are not the same, even if the restored model comes directly from the trained model. I have tried to save either the whole model (so tf.train.Saver() which creates huge files) or just trainable variables (so tf.train.Saver(tf.trainable_variables()) which makes much smaller files) and both give the same result, but it is still not identical to directly training the model. Keep in mind that the differences are generally very small, and for some of the individual tasks in my network they are identical, but the difference bugs me.
I have seen several questions about TensorFlow model saving and restoring, but none seems to answer mine. As far as I can see, the model restores correctly (I checked that the weights are the same, and they seem to be), but it does not give exactly the same results. I think I can rule out random events in my code, because both when training and when restoring I can reproduce the same results given the same parameters. I am not sure how to proceed.
Does anyone know why I have this problem?
I have added small snippets of the code below. First, how I restore the model:
def load_network_state(model_file):
    print('Restoring model')
    # saver = tf.train.Saver(tf.trainable_variables())
    saver = tf.train.Saver()
    sess = tf.Session()
    saver.restore(sess, model_file)
    return sess
In the training function:
# saver = tf.train.Saver(tf.trainable_variables())
saver = tf.train.Saver()
if model_file is not None:
    save_path = saver.save(sess, model_file + '_{0}'.format(model_suffix))
In the main loop:
if args['restore_network']:
    params = create_parameters_from_file(args['arg_file'], args, j)
else:
    params = create_random_parameter(args)
start = time.time()
train_data, test_data = transform(train_data, test_data, params)
kwargs = generate_kwargs_dictionary(generate_network, params)
features, targets, predictions, train_op = generate_network(train_data, test_data, model_suffix=j, **kwargs)
if args['restore_network']:
    session = load_network_state(params['model_file'] + '_{0}'.format(j), params['seed'])
else:
    kwargs = generate_kwargs_dictionary(network_training, params)
    session = network_training(train_data, features, targets, train_op, model_suffix=j, **kwargs)
kwargs = generate_kwargs_dictionary(network_prediction, params)
training_results = network_prediction(train_data, features, targets, predictions, session, model_suffix=j, **kwargs)
test_results = network_prediction(test_data, features, targets, predictions, session, model_suffix=j, **kwargs)
elapsed = time.time() - start
Thank you for any help you could provide
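Since the question describes the differences as "very small", it may help to measure them before looking for a cause. The following is a minimal sketch, not part of the original code: compare_predictions and its tolerances are hypothetical names and values chosen for illustration, and the predictions are assumed to be available as flat lists of floats.

```python
import math

def compare_predictions(a, b, rel_tol=1e-6, abs_tol=1e-8):
    """Compare two equal-length sequences of float predictions.

    Returns (exactly_equal, close, max_abs_diff).
    """
    if len(a) != len(b):
        raise ValueError("prediction lists differ in length")
    max_abs_diff = max((abs(x - y) for x, y in zip(a, b)), default=0.0)
    exactly_equal = all(x == y for x, y in zip(a, b))
    close = all(math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol)
                for x, y in zip(a, b))
    return exactly_equal, close, max_abs_diff

# Made-up values standing in for predictions from the trained
# session and from the restored session.
trained = [0.25, 0.5, 0.75]
restored = [0.25, 0.5, 0.7500001]
print(compare_predictions(trained, restored))
```

If the predictions are close but not exactly equal, the gap is at rounding level; a gap well above the tolerance would instead suggest that some state used at prediction time is not the same in the two sessions.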

Related

tensorflow restores different values for weights each time (from the same file!)

So I'm training a model on a machine with a GPU. Of course I save it at the end of training:
a = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES)
saver = tf.train.Saver(a)
saver.save(sess, save_path)
Now I have one file, but every time I restore the model from the same file I get different numbers in the matrices, and different predictions for the same examples.
I restore the model like this:
saver = tf.train.import_meta_graph('{}.meta'.format(save_path))
sess.run(tf.global_variables_initializer())
saver.restore(sess, save_path)
What is happening here?
When you call sess.run(tf.global_variables_initializer()) after importing the meta graph, you probably reinitialise some variables that you should not.
Instead, restore first and then initialise only the variables that are still uninitialised. One way to do it would be (credit to this answer)
uninitialized_vars = []
for var in tf.global_variables():  # tf.all_variables() in older releases
    try:
        sess.run(var)
    except tf.errors.FailedPreconditionError:
        # Reading an uninitialised variable raises this error.
        uninitialized_vars.append(var)
init_new_vars_op = tf.variables_initializer(uninitialized_vars)  # formerly tf.initialize_variables
sess.run(init_new_vars_op)

How to use Test data on saved model with queue approach (without feed_dict) #tensorflow?

I am new to TensorFlow. I have built a convnet for MNIST image classification. I am using queues to read images (PNG) from disk, batch them, and pass them to the train op (I am quite comfortable with this now). Everything works through training, and I evaluate my accuracy op every certain number of steps while training.
I am saving the model with a Saver object and can see the meta and checkpoint files being written to disk.
Now the real challenge is to restore the model once it has finished training and use it for predictions on new images.
One of the first steps in my graph (for training) is shown below; it takes x_image (images from the train queue):
h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)
As I am not using the feed dictionary approach, I cannot just restore the accuracy op using the saver and pass in the new data. I have to define a queue for the test data and rebuild the graph (exactly as before) with the reference x_image changed to point to the test data queue.
How can I now restore the weights learned during training and use them with this new graph to simply run my predict/accuracy op?
I tried to follow the tutorial at https://github.com/tensorflow/models/blob/master/tutorials/image/cifar10/cifar10.py but got lost in the eval code.
Also, if I add a dummy constant to my training graph and then try to retrieve its value, I am able to retrieve it.
Can anyone please help? Thanks
OK, so I have found the answer.
The original challenge was to toggle between train and test data during the training and validation phases when using queues.
Since queues are part of the graph structure, we can't simply modify them.
I found an article on using tf.case to toggle between the train and test queues, but I wasn't able to use shuffle batch along with it.
The real task at hand was to save the model after training and use the saved model to predict in production.
So here is the flow:
Training
1. Create a method that builds your graph (it takes the image tensor as input).
2. Build a training graph by passing in training image batches.
3. Perform training and save the model with a Saver object.
Evaluation
1. Reconstruct the same graph with test image batches.
2. In the session, use a Saver object to restore the weights (note that you don't need to specify which variables to restore; by default it restores all restorable variables).
3. Don't run the global variables initializer at this point, since it would overwrite the restored weights.
4. Run your predict op (generated from the newly constructed graph).
Also make sure you switch off dropout during evaluation, as it would otherwise keep varying the output for the same input.
Below is the pseudocode
train_op, y_predict, accuracy = create_graph(train_input, train_label)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)
    model_saver = tf.train.Saver()
    for i in range(2000):
        if i % 100 == 0:
            train_accuracy = sess.run(accuracy)
            print("step %d, training accuracy %f" % (i, train_accuracy))
        sess.run(train_op)
    print(sess.run(accuracy))
    model_saver.save(sess, 'model/simple_model', global_step=100)
    coord.request_stop()
    coord.join(threads)
For evaluation
_, y_predict, accuracy = create_graph(test_input, test_label)
saver = tf.train.Saver()
with tf.Session() as sess:
    saver.restore(sess, tf.train.latest_checkpoint("./model/"))
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)
    label_predict = sess.run([y_predict])
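The advice about switching off dropout during evaluation can be illustrated without TensorFlow. The sketch below is a plain-Python stand-in, not the answer's code: the dropout function here is hypothetical and only mirrors the keep_prob convention of TF 1.x dropout (keep each value with probability keep_prob and rescale the kept ones by 1 / keep_prob).

```python
import random

def dropout(values, keep_prob, rng):
    """Inverted dropout on a list of floats.

    Each value is kept with probability keep_prob and scaled by
    1 / keep_prob; otherwise it is replaced by 0.0.
    """
    if keep_prob >= 1.0:
        return list(values)  # nothing is dropped, output equals input
    return [v / keep_prob if rng.random() < keep_prob else 0.0
            for v in values]

activations = [1.0, 2.0, 3.0, 4.0]
rng = random.Random(0)

# Training-style calls: the mask is redrawn on every call, so two
# passes over the same input can produce different outputs.
train_a = dropout(activations, 0.5, rng)
train_b = dropout(activations, 0.5, rng)

# Evaluation-style calls: keep_prob = 1.0 makes the layer the identity.
eval_a = dropout(activations, 1.0, rng)
eval_b = dropout(activations, 1.0, rng)
print(train_a, train_b)
print(eval_a == eval_b == activations)
```

In a graph built by a function such as create_graph, one common way to get this behaviour is to make the keep probability an argument or placeholder, so that the evaluation graph can be built with a value of 1.0.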

Python TensorFlow: How to restart training with optimizer and import_meta_graph?

I'm trying to restart a model training in TensorFlow by picking up where it left off. I'd like to use the recently added (0.12+ I think) import_meta_graph() so as to not reconstruct the graph.
I've seen solutions for this, e.g. Tensorflow: How to save/restore a model?, but I run into issues with AdamOptimizer, specifically I get a ValueError: cannot add op with name <my weights variable name>/Adam as that name is already used error. This can be fixed by initializing, but then my model values are cleared!
There are other answers and some full examples out there, but they always seem older and so don't include the newer import_meta_graph() approach, or don't have a non-tensor optimizer. The closest question I could find is tensorflow: saving and restoring session but there is no final clear cut solution and the example is pretty complicated.
Ideally I'd like a simple run-able example starting from scratch, stopping, then picking up again. I have something that works (below), but do also wonder if I'm missing something. Surely I'm not the only one doing this?
Here is what I came up with from reading the docs, other similar solutions, and trial and error. It's a simple autoencoder on random data. If it is run, then run again, it will continue from where it left off (i.e. the cost function on the first run goes from ~0.5 to ~0.3, and the second run starts at ~0.3). Unless I missed something, all of the saving, constructors, model building, and add_to_collection calls there are needed, and in a precise order, but there may be a simpler way.
And yes, loading the graph with import_meta_graph isn't really needed here since the code is right above, but is what I want in my actual application.
from __future__ import print_function
import tensorflow as tf
import os
import math
import numpy as np
output_dir = "/root/Data/temp"
model_checkpoint_file_base = os.path.join(output_dir, "model.ckpt")
input_length = 10
encoded_length = 3
learning_rate = 0.001
n_epochs = 10
n_batches = 10
if not os.path.exists(model_checkpoint_file_base + ".meta"):
    print("Making new")
    brand_new = True
    x_in = tf.placeholder(tf.float32, [None, input_length], name="x_in")
    W_enc = tf.Variable(tf.random_uniform([input_length, encoded_length],
                                          -1.0 / math.sqrt(input_length),
                                          1.0 / math.sqrt(input_length)), name="W_enc")
    b_enc = tf.Variable(tf.zeros(encoded_length), name="b_enc")
    encoded = tf.nn.tanh(tf.matmul(x_in, W_enc) + b_enc, name="encoded")
    W_dec = tf.transpose(W_enc, name="W_dec")
    b_dec = tf.Variable(tf.zeros(input_length), name="b_dec")
    decoded = tf.nn.tanh(tf.matmul(encoded, W_dec) + b_dec, name="decoded")
    cost = tf.sqrt(tf.reduce_mean(tf.square(decoded - x_in)), name="cost")
    saver = tf.train.Saver()
else:
    print("Reloading existing")
    brand_new = False
    saver = tf.train.import_meta_graph(model_checkpoint_file_base + ".meta")
    g = tf.get_default_graph()
    x_in = g.get_tensor_by_name("x_in:0")
    cost = g.get_tensor_by_name("cost:0")
sess = tf.Session()
if brand_new:
    optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)
    init = tf.global_variables_initializer()
    sess.run(init)
    tf.add_to_collection("optimizer", optimizer)
else:
    saver.restore(sess, model_checkpoint_file_base)
    optimizer = tf.get_collection("optimizer")[0]
for epoch_i in range(n_epochs):
    for batch in range(n_batches):
        batch = np.random.rand(50, input_length)
        _, curr_cost = sess.run([optimizer, cost], feed_dict={x_in: batch})
        print("batch_cost:", curr_cost)
    save_path = tf.train.Saver().save(sess, model_checkpoint_file_base)
The problem might be in how you create the saver object in the restoring session.
I got the same error as you when using the code below in the restoring session.
saver = tf.train.import_meta_graph('tmp/hsmodel.meta')
saver.restore(sess, tf.train.latest_checkpoint('tmp/'))
But when I changed it to this,
saver = tf.train.Saver()
saver.restore(sess, "tmp/hsmodel")
the error went away.
"tmp/hsmodel" is the path that I passed to saver.save(sess, "tmp/hsmodel") in the saving session.
A simple example of saving and restoring a session while training an MNIST network (with the Adam optimizer) is available here. It helped me to compare it with my code and fix the problem.
https://github.com/aymericdamien/TensorFlow-Examples/blob/master/examples/4_Utils/save_restore_model.py
I had the same issue and I just figured out what was wrong, at least in my code.
In the end, I used the wrong file name in saver.restore(). This function must be given the file name without the file extension, just like the saver.save() function:
saver.restore(sess, 'model-1')
instead of
saver.restore(sess, 'model-1.data-00000-of-00001')
With this I do exactly what you wish to do: starting from scratch, stopping, then picking up again. I don't need to initialize a second saver from a meta file using the tf.train.import_meta_graph() function, and I don't need to explicitly state tf.initialize_all_variables() after initializing the optimizer.
My complete model restore looks like this:
with tf.Session() as sess:
    saver = tf.train.Saver()
    sess.run(tf.global_variables_initializer())
    saver.restore(sess, 'model-1')
I think in protocol V1 you still had to add the .ckpt to the file name, and for import_meta_graph() you still need to add the .meta, which might cause some confusion among users. Maybe this should be pointed out more explicitly in the documentation.
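The file-name confusion described above can be guarded against with a small helper. This is a sketch under assumptions: checkpoint_prefix is a hypothetical name, not a TensorFlow function, and the suffix pattern assumes the file naming seen in this thread (.index, .meta, and .data-00000-of-00001 style shards next to the prefix).

```python
import re

# Suffixes assumed to be written next to a checkpoint prefix.
_CKPT_SUFFIX = re.compile(r"\.(index|meta|data-\d{5}-of-\d{5})$")

def checkpoint_prefix(path):
    """Strip a checkpoint file suffix and return the prefix that
    was originally passed to saver.save()."""
    return _CKPT_SUFFIX.sub("", path)

print(checkpoint_prefix("model-1.data-00000-of-00001"))
print(checkpoint_prefix("model-1.index"))
print(checkpoint_prefix("model-1"))
```

The result is what would be passed to saver.restore(sess, ...); the .meta path itself is still what import_meta_graph() expects.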
The Saver class allows us to save a session via:
saver.save(sess, "checkpoints.ckpt")
and to restore it, passing the directory that contains the checkpoint to tf.train.latest_checkpoint():
saver.restore(sess, tf.train.latest_checkpoint("."))

Saving the state of the AdaGrad algorithm in Tensorflow

I am trying to train a word2vec model, and want to use the embeddings for another application. As there might be extra data later, and my computer is slow when training, I would like my script to stop and resume training later.
To do this, I created a saver:
saver = tf.train.Saver({"embeddings": embeddings,"embeddings_softmax_weights":softmax_weights,"embeddings_softmax_biases":softmax_biases})
I save the embeddings, and softmax weights and biases so I can resume training later. (I assume that this is the correct way, but please correct me if I'm wrong).
Unfortunately when resuming training with this script the average loss seems to go up again.
My guess is that this can be attributed to the AdaGradOptimizer I'm using. On resuming, its accumulated gradient statistics (the outer product matrix) will probably be reset to zeros, whereas after my earlier training they were filled in (leading to a lower learning rate).
Is there a way to save the optimizer state to resume learning later?
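The hypothesis in the question can be checked on a toy problem without TensorFlow. The sketch below assumes the common per-parameter form of AdaGrad (an accumulator of squared gradients, started at 0.1 here, dividing the learning rate); adagrad_step, train, and the quadratic loss are illustrative choices and not the asker's model.

```python
import math

def adagrad_step(w, grad, accum, lr=1.0):
    """One AdaGrad update for a scalar parameter.

    accum is the running sum of squared gradients; a larger accum
    means a smaller effective step.
    """
    accum = accum + grad * grad
    w = w - lr * grad / math.sqrt(accum)
    return w, accum

def train(w, accum, steps):
    """Minimise loss(w) = w**2, whose gradient is 2 * w."""
    for _ in range(steps):
        w, accum = adagrad_step(w, 2.0 * w, accum)
    return w, accum

# First training run, then "save" both the weight and the accumulator.
w, accum = train(5.0, 0.1, 30)
loss_before = w * w

# Resume with the saved accumulator: the step stays small.
w_resumed, _ = adagrad_step(w, 2.0 * w, accum)

# Resume with only the weight restored and a fresh accumulator:
# the step is large again and overshoots the minimum.
w_fresh, _ = adagrad_step(w, 2.0 * w, 0.1)

print("loss before resume:  ", loss_before)
print("loss, state restored:", w_resumed ** 2)
print("loss, state reset:   ", w_fresh ** 2)
```

In this toy setting the loss keeps falling when the accumulator is restored and jumps up when it is reset, which matches the behaviour described in the question when only the embeddings and softmax parameters are saved.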
While TensorFlow seems to complain when you attempt to serialize an optimizer object directly (e.g. via tf.add_to_collection("optimizers", optimizer) and a subsequent call to tf.train.Saver().save()), you can save and restore the training update operation which is derived from the optimizer:
# init
if not load_model:
    optimizer = tf.train.AdamOptimizer(1e-4)
    train_step = optimizer.minimize(loss)
    tf.add_to_collection("train_step", train_step)
else:
    saver = tf.train.import_meta_graph(modelfile + '.meta')
    saver.restore(sess, tf.train.latest_checkpoint('./'))
    train_step = tf.get_collection("train_step")[0]
# training loop
while training:
    if iteration % save_interval == 0:
        saver = tf.train.Saver()
        save_path = saver.save(sess, filepath)
I do not know of a way to get or set the parameters specific to an existing optimizer, so I do not have a direct way of verifying that the optimizer's internal state was restored, but training resumes with loss and accuracy comparable to when the snapshot was created.
I would also recommend using the parameterless call to Saver() so that state variables not specifically mentioned will still be saved, although this might not be strictly necessary.
You may also wish to save the iteration or epoch number for later restoring, as detailed in this example:
http://www.seaandsailor.com/tensorflow-checkpointing.html
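One simple way to keep the iteration or epoch number across restarts is a small side file next to the checkpoint. This is a minimal sketch under assumptions: save_progress, load_progress, and the progress.json file name are hypothetical and are not taken from the linked example.

```python
import json
import os
import tempfile

def save_progress(path, step, epoch):
    """Write the current training position to a JSON file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, "epoch": epoch}, f)
    # Swap the file in one operation so an interrupted run does not
    # leave a half-written progress file behind.
    os.replace(tmp, path)

def load_progress(path):
    """Return the saved position, or zeros when starting from scratch."""
    if not os.path.exists(path):
        return {"step": 0, "epoch": 0}
    with open(path) as f:
        return json.load(f)

# Usage with a throwaway directory standing in for the model directory.
model_dir = tempfile.mkdtemp()
progress_file = os.path.join(model_dir, "progress.json")
print(load_progress(progress_file))   # fresh start
save_progress(progress_file, step=1500, epoch=3)
print(load_progress(progress_file))   # resumed position
```

Within TensorFlow itself, a non-trainable global step variable passed to minimize() is saved together with the other variables by a parameterless Saver(), which may make a side file unnecessary.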

Tensorflow: Saving and restoring the model parameters

I am a beginner in TensorFlow, currently training a CNN.
I am using Saver in order to save the parameters used by the model, but I am having concerns whether this would itself store all the Variables used by the model, and is sufficient to restore the values to re-run the program for performing classification/testing on the trained network.
Let us look at the famous example MNIST given by TensorFlow.
In the example, we have a bunch of convolutional blocks, all of which have weight and bias variables that get initialised when the program is run.
W_conv1 = init_weight([5,5,1,32])
b_conv1 = init_bias([32])
After having processed several layers, we create a session, and initialise all the variables added to the graph.
sess = tf.Session()
sess.run(tf.initialize_all_variables())
saver = tf.train.Saver()
Here, is it possible to comment out the saver.save code and replace it with saver.restore(sess, file_path) after training, in order to restore the weights, biases, and other parameters back into the graph? Is this how it should be done?
for i in range(1000):
    ...
    if i % 500 == 0:
        saver.save(sess, "model%d.cpkt" % (i))
I am currently training on a large dataset, so terminating and restarting the training is a waste of time and resources. I would appreciate it if someone could clarify this before I start the training.
If you want to save the final result only once, you can do this:
with tf.Session() as sess:
    for i in range(1000):
        ...
    path = saver.save(sess, "model.ckpt")  # outside the loop
    print("Saved: %s" % path)
In other programs, you can load the model using the path returned from saver.save for prediction or something. You can see some examples at https://github.com/sugyan/tensorflow-mnist.
Based on the explanation in here and Sung Kim's solution, I wrote a very simple model for exactly this problem. Basically, you need to create an object of the same class and restore its variables with the saver. You can find an example of this solution here.