Error upon restoring variable - tensorflow

I stumbled across an error that I am unable to resolve. What I am trying to do is the following thing:
I want to train a (dummy) model that adds a to b on every iteration. When finished, I want to save the variables as checkpoint. The first time I run it, it shall build the model from scratch. Every time I re-run the model, it should start from the last checkpoint and do the additions again. Hereby, I load the complete graph from the .meta file. The global step variable is there to keep track of the total number of steps I have trained.
import tensorflow as tf
from tensorflow.python.tools.inspect_checkpoint import print_tensors_in_checkpoint_file

# List ALL tensors.
print_tensors_in_checkpoint_file(tf.train.latest_checkpoint('./'), all_tensors=True, tensor_name='')

tf.reset_default_graph()
global_step = tf.get_variable('global_step', shape=[], dtype=tf.int32, initializer=tf.constant_initializer(0), trainable=False)

def model(a, b):
    b = tf.assign_add(b, a)
    return b

with tf.Session() as sess:
    ckpt = tf.train.latest_checkpoint('./')
    if ckpt:
        saver = tf.train.import_meta_graph('./my_test_model-1.meta')
        saver.restore(sess, ckpt)
    else:
        a = tf.Variable(3.0, name='a')
        b = tf.Variable(5.0, name='b')
        b = model(a, b)

        ### before EDIT
        saver = tf.train.Saver()
        sess.run(tf.global_variables_initializer())
        ###

        ### after EDIT
        sess.run(tf.global_variables_initializer())
        saver = tf.train.Saver()
        ###

    for step in range(5):
        global_step.assign_add(1).eval()
        print(global_step.eval())
        print(b.eval())

    saver.save(sess, './my_test_model', global_step=global_step)
The script runs fine for the first time, outputting this:
1 # step
8.0 # value of b
2
11.0
3
14.0
4
17.0
5
20.0
The second time I run the program, I get this output followed by an error:
tensor_name: a
3.0
tensor_name: b
20.0
tensor_name: global_step
0
tensor_name: global_step_1
5
INFO:tensorflow:Restoring parameters from ./my_test_model-5
Traceback (most recent call last): ... FailedPreconditionError:
Attempting to use uninitialized value global_step [[Node:
AssignAdd_2 = AssignAdd[T=DT_INT32, use_locking=false,
_device="/job:localhost/replica:0/task:0/device:CPU:0"](global_step, AssignAdd_2/value)]] ...
The first time, it's clear that it won't throw an error as I run the initializer for all variables. But I thought that restoring a model counts as some sort of initialization? I really cannot wrap my head around this concept. I also tried defining global_step after defining a and b, but this resulted in another error when loading for the first time:
ValueError: Cannot use the default session to evaluate tensor: the
tensor's graph is different from the session's graph. Pass an explicit
session to eval(session=sess).
The error refers to the line that increments global_step (global_step.assign_add(1).eval()).
What am I doing wrong? Where should I define the variable?
I appreciate any help on this problem! Thank you for reading this far.
EDIT:
Thanks to @Diana, the precondition error vanished. Unfortunately, another error occurred. Whenever the script runs with loading from a checkpoint, it throws a name error:
NameError: name 'global_step' is not defined.
This also happens for the variable 'b'. Shouldn't the names be available after restoring the checkpoint? The tensors seem to have the right names and values when I inspect the checkpoint file.

You should declare the saver after you run the initializer. Otherwise no value gets saved for the variable, because the saver does not know about it.
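A self-contained sketch of the ordering this answer suggests (the path and values here are illustrative, not taken from the question):

import tensorflow as tf

a = tf.Variable(3.0, name='a')
b = tf.Variable(5.0, name='b')
update_b = tf.assign_add(b, a)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())   # initialize first ...
    saver = tf.train.Saver()                       # ... then construct the Saver
    sess.run(update_b)
    saver.save(sess, '/tmp/ordering_demo')         # b is saved with its updated value (8.0)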

Related

Wrong output for restored variable in tensorflow graph

I'm currently toying around with saving and restoring of variables. For this purpose, I created two scripts. One of them saves a simple graph while the other restores it. Here the test script for saving the graph:
import tensorflow as tf

a = tf.Variable(3.0, name='a')
b = tf.Variable(5.0, name='b')
b = tf.assign_add(b, a)

n_steps = 5
global_step = tf.Variable(0, name='global_step', trainable=False)

saver = tf.train.Saver()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(n_steps):
        print(sess.run(b))
        global_step.assign_add(1).eval()
        print(global_step.eval())
    saver.save(sess, './my_test_model', global_step=global_step)
Basically, I want to run through a loop 5 times, and every time I do this, I add a to b. I also want to keep track of the number of steps via global_step. This works as intended. The output is:
8.0 # value of b
1 # step
11.0
2
14.0
3
17.0
4
20.0
5
Now when restoring the variables, I try to get all three of them. Script is:
import tensorflow as tf
from tensorflow.python.tools.inspect_checkpoint import print_tensors_in_checkpoint_file

# List ALL tensors.
print_tensors_in_checkpoint_file(tf.train.latest_checkpoint('./'), all_tensors=True, tensor_name='')

tf.reset_default_graph()

a = tf.get_variable('a', shape=[])
b = tf.get_variable('b', shape=[])
global_step = tf.get_variable('global_step', shape=[])

saver = tf.train.Saver()

with tf.Session() as sess:
    ckpt = tf.train.latest_checkpoint('./')
    if ckpt:
        print(ckpt)
        saver.restore(sess, ckpt)
    else:
        print('Nothing restored')
    print(a.eval())
    print(b.eval())
    print(global_step.eval())
The output of this is
tensor_name: a
3.0
tensor_name: b
20.0
tensor_name: global_step
5
./my_test_model-5
INFO:tensorflow:Restoring parameters from ./my_test_model-5
3.0
20.0
7e-45
How is it possible that the value for global_step is stored correctly in the checkpoint, but upon evaluation I get this small 7e-45? Also, upon restoring, I seem to be unable to define any additional variables as it states it cannot find the variable in the checkpoint. How can I, for example, define a variable and add it to the b of the restored graph?
Thank you for your help!
This doesn't appear to be well documented in the TF docs, but you should specify the dtype for the global_step variable.
Incorrect
global_step = tf.get_variable('global_step', shape=[], dtype=tf.float32)
results in global_step=7e-45. The type is assumed to be tf.float32 by default.
Correct
global_step = tf.get_variable('global_step', shape=[], dtype=tf.int32)
results in global_step=5
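If you are not sure which dtype a checkpoint actually holds, one way to check it before declaring the variable is to read the stored tensor back with a NewCheckpointReader (a sketch; ckpt_path is assumed to point at an existing checkpoint prefix):

reader = tf.train.NewCheckpointReader(ckpt_path)
stored_step = reader.get_tensor('global_step')      # returned as a numpy value
print(stored_step, stored_step.dtype)                # e.g. 5 int32
global_step = tf.get_variable('global_step', shape=[],
                              dtype=tf.as_dtype(stored_step.dtype))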

TF LSTM: Save State from training session for prediction session later

I am trying to save the latest LSTM state from training so it can be reused during the prediction stage later. The problem I am encountering is that in the TF LSTM model the state is passed around from one training iteration to the next via a combination of a placeholder and a numpy array -- neither of which seems to be included in the graph by default when the session is saved.
To work around this, I am creating a dedicated TF variable to hold the latest version of the state so as to add it to the Session graph, like so:
# latest State from last training iteration:
_, y, ostate, smm = sess.run([train_step, Y, H, summaries], feed_dict=feed_dict)
# now add to TF variable:
savedState = tf.Variable(ostate, dtype=tf.float32, name='savedState')
tf.variables_initializer([savedState]).run()
save_path = saver.save(sess, pathModel + '/my_model.ckpt')
This seems to add the savedState variable to the saved session graph well, and is easily recoverable later with the rest of the Session.
The problem, though, is that the only way I have managed to actually use that variable later in the restored session is to initialize all variables in the session AFTER I recover it (which seems to reset all trained variables, including the weights/biases/etc.!). If I initialize variables first and THEN recover the session (which works fine in terms of preserving the trained variables), then I get an error that I'm trying to access an uninitialized variable.
I know there is a way to initialize a specific individual variable (which I am using while saving it originally), but the problem is that when we recover them, we refer to them by name as strings; we don't just pass the variable itself?!
# This produces an error: 'trying to use an uninitialized variable'
gInit = tf.global_variables_initializer().run()
new_saver = tf.train.import_meta_graph(pathModel + 'my_model.ckpt.meta')
new_saver.restore(sess, pathModel + 'my_model.ckpt')
fullState = sess.run('savedState:0')
What is the right way to get this done? As a workaround, I am currently saving the State to CSV just as a numpy array and then recover it the same way. It works OK, but clearly not the cleanest solution given that every other aspect of saving/restoring the TF session works perfectly.
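(For reference, the CSV workaround mentioned above amounts to roughly the following; the file name is just an example.)

import numpy as np

# after training: ostate is the numpy array holding the last LSTM state
np.savetxt('saved_state.csv', ostate, delimiter=',')

# later, in the prediction script:
restored_state = np.loadtxt('saved_state.csv', delimiter=',')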
Any suggestions appreciated!
EDIT:
Here's the code that works well, as described in the accepted answer below:
# make sure to define the State variable before the Saver variable:
savedState = tf.get_variable('savedState', shape=[BATCHSIZE, CELL_SIZE * LAYERS])
saver = tf.train.Saver(max_to_keep=1)
# last training iteration:
_, y, ostate, smm = sess.run([train_step, Y, H, summaries], feed_dict=feed_dict)
# now save the State and the whole model:
assignOp = tf.assign(savedState, ostate)
sess.run(assignOp)
save_path = saver.save(sess, pathModel + '/my_model.ckpt')
# later on, in some other program, recover the model and the State:
# make sure to initialize all variables BEFORE recovering the model!
gInit = tf.global_variables_initializer().run()
local_saver = tf.train.import_meta_graph(pathModel + 'my_model.ckpt.meta')
local_saver.restore(sess, pathModel + 'my_model.ckpt')
# recover the state from training and get its last dimension
fullState = sess.run('savedState:0')
h = fullState[-1]
h = np.reshape(h, [1, -1])
I haven't tested yet whether this approach unintentionally initializes any other variables in the saved Session, but don't see why it should, since we only run the specific one.
The issue is that creating a new tf.Variable after the Saver was constructed means that the Saver has no knowledge of the new variable. It still gets saved in the metagraph, but not saved in the checkpoint:
import tensorflow as tf

with tf.Graph().as_default():
    var_a = tf.get_variable("a", shape=[])
    saver = tf.train.Saver()
    var_b = tf.get_variable("b", shape=[])
    print(saver._var_list)  # [<tf.Variable 'a:0' shape=() dtype=float32_ref>]
    initializer = tf.global_variables_initializer()
    with tf.Session() as session:
        session.run([initializer])
        saver.save(session, "/tmp/model", global_step=0)

with tf.Graph().as_default():
    new_saver = tf.train.import_meta_graph("/tmp/model-0.meta")
    print(saver._var_list)  # [<tf.Variable 'a:0' shape=() dtype=float32_ref>]
    with tf.Session() as session:
        new_saver.restore(session, "/tmp/model-0")  # Only var_a gets restored!
I've annotated the quick reproduction of your issue above with the variables that the Saver knows about.
Now, the solution is relatively easy. I would suggest creating the Variable before the Saver, then using tf.assign to update its value (make sure you run the op returned by tf.assign). The assigned value will be saved in checkpoints and restored just like other variables.
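A bare-bones sketch of that pattern (the shape, names, and path are illustrative):

import numpy as np
import tensorflow as tf

state_var = tf.get_variable('saved_state', shape=[2, 4])   # created before the Saver ...
saver = tf.train.Saver()                                    # ... so the Saver tracks it
new_state = np.zeros([2, 4], dtype=np.float32)              # stand-in for the real LSTM state
assign_op = tf.assign(state_var, new_state)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(assign_op)                                     # run the assign op before saving
    saver.save(sess, '/tmp/model_with_state')               # the assigned value ends up in the checkpoint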
This could be handled better by the Saver as a special case when None is passed to its var_list constructor argument (i.e. it could pick up new variables automatically). Feel free to open a feature request on Github for this.

Python TensorFlow: How to restart training with optimizer and import_meta_graph?

I'm trying to restart a model training in TensorFlow by picking up where it left off. I'd like to use the recently added (0.12+ I think) import_meta_graph() so as to not reconstruct the graph.
I've seen solutions for this, e.g. Tensorflow: How to save/restore a model?, but I run into issues with AdamOptimizer; specifically, I get a "ValueError: cannot add op with name <my weights variable name>/Adam as that name is already used" error. This can be fixed by initializing, but then my model values are cleared!
There are other answers and some full examples out there, but they always seem older and so don't include the newer import_meta_graph() approach, or don't have a non-tensor optimizer. The closest question I could find is tensorflow: saving and restoring session but there is no final clear cut solution and the example is pretty complicated.
Ideally I'd like a simple run-able example starting from scratch, stopping, then picking up again. I have something that works (below), but do also wonder if I'm missing something. Surely I'm not the only one doing this?
Here is what I came up with from reading the docs, other similar solutions, and trial and error. It's a simple autoencoder on random data. If run, then run again, it will continue from where it left off (i.e. the cost function on the first run goes from ~0.5 to ~0.3; the second run starts at ~0.3). Unless I missed something, all of the saving, constructors, model building, and add_to_collection calls are needed and in a precise order, but there may be a simpler way.
And yes, loading the graph with import_meta_graph isn't really needed here since the code is right above, but is what I want in my actual application.
from __future__ import print_function
import tensorflow as tf
import os
import math
import numpy as np

output_dir = "/root/Data/temp"
model_checkpoint_file_base = os.path.join(output_dir, "model.ckpt")

input_length = 10
encoded_length = 3
learning_rate = 0.001
n_epochs = 10
n_batches = 10

if not os.path.exists(model_checkpoint_file_base + ".meta"):
    print("Making new")
    brand_new = True

    x_in = tf.placeholder(tf.float32, [None, input_length], name="x_in")
    W_enc = tf.Variable(tf.random_uniform([input_length, encoded_length],
                                          -1.0 / math.sqrt(input_length),
                                          1.0 / math.sqrt(input_length)), name="W_enc")
    b_enc = tf.Variable(tf.zeros(encoded_length), name="b_enc")
    encoded = tf.nn.tanh(tf.matmul(x_in, W_enc) + b_enc, name="encoded")
    W_dec = tf.transpose(W_enc, name="W_dec")
    b_dec = tf.Variable(tf.zeros(input_length), name="b_dec")
    decoded = tf.nn.tanh(tf.matmul(encoded, W_dec) + b_dec, name="decoded")
    cost = tf.sqrt(tf.reduce_mean(tf.square(decoded - x_in)), name="cost")
    saver = tf.train.Saver()
else:
    print("Reloading existing")
    brand_new = False
    saver = tf.train.import_meta_graph(model_checkpoint_file_base + ".meta")
    g = tf.get_default_graph()
    x_in = g.get_tensor_by_name("x_in:0")
    cost = g.get_tensor_by_name("cost:0")

sess = tf.Session()

if brand_new:
    optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)
    init = tf.global_variables_initializer()
    sess.run(init)
    tf.add_to_collection("optimizer", optimizer)
else:
    saver.restore(sess, model_checkpoint_file_base)
    optimizer = tf.get_collection("optimizer")[0]

for epoch_i in range(n_epochs):
    for batch in range(n_batches):
        batch = np.random.rand(50, input_length)
        _, curr_cost = sess.run([optimizer, cost], feed_dict={x_in: batch})
        print("batch_cost:", curr_cost)
    save_path = tf.train.Saver().save(sess, model_checkpoint_file_base)
There might be a problem with how you are creating the saver object in the restoring session.
I got the same error as you when using the code below in the restoring session:
saver = tf.train.import_meta_graph('tmp/hsmodel.meta')
saver.restore(sess, tf.train.latest_checkpoint('tmp/'))
But when I changed it to this,
saver = tf.train.Saver()
saver.restore(sess, "tmp/hsmodel")
the error went away.
"tmp/hsmodel" is the path that I give to saver.save(sess, "tmp/hsmodel") in the saving session.
A simple example of storing and restoring a session for training an MNIST network (including an Adam optimizer) is here. It was helpful for me to compare with my code and fix the problem.
https://github.com/aymericdamien/TensorFlow-Examples/blob/master/examples/4_Utils/save_restore_model.py
I had the same issue and I just figured out what was wrong, at least in my code.
In the end, I used the wrong file name in saver.restore(). This function must be given the file name without the file extension, just like the saver.save() function:
saver.restore(sess, 'model-1')
instead of
saver.restore(sess, 'model-1.data-00000-of-00001')
With this I do exactly what you wish to do: starting from scratch, stopping, then picking up again. I don't need to initialize a second saver from a meta file using the tf.train.import_meta_graph() function, and I don't need to explicitly state tf.initialize_all_variables() after initializing the optimizer.
My complete model restore looks like this:
with tf.Session() as sess:
    saver = tf.train.Saver()
    sess.run(tf.global_variables_initializer())
    saver.restore(sess, 'model-1')
I think in protocol V1 you still had to add the .ckpt to the file name, and for import_meta_graph() you still need to add the .meta, which might cause some confusion among users. Maybe this should be pointed out more explicitly in the documentation.
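A small illustration of the naming convention just described ('model-1' is the prefix that a saver.save call with global_step=1 would produce):

saver.restore(sess, 'model-1')                           # V2 checkpoints: pass the prefix, no extension
new_saver = tf.train.import_meta_graph('model-1.meta')   # the meta graph file does need its .meta extension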
The saver class allows us to save a session via:
saver.save(sess, "checkpoints.ckpt")
and allows us to restore the session; note that tf.train.latest_checkpoint takes the directory holding the checkpoint files, not a single file name:
saver.restore(sess, tf.train.latest_checkpoint("checkpoints"))

Tensorflow freeze_graph script failing on model defined with Keras

I'm attempting to export a model built and trained with Keras to a protobuffer that I can load in a C++ script (as in this example). I've generated a .pb file containing the model definition and a .ckpt file containing the checkpoint data. However, when I try to merge them into a single file using the freeze_graph script I get the error:
ValueError: Fetch argument 'save/restore_all' of 'save/restore_all' cannot be interpreted as a Tensor. ("The name 'save/restore_all' refers to an Operation not in the graph.")
I'm saving the model like this:
with tf.Session() as sess:
    model = nndetector.architecture.models.vgg19((3, 50, 50))
    model.load_weights('/srv/nn/weights/scratch-vgg19.h5')
    init_op = tf.initialize_all_variables()
    sess.run(init_op)
    graph_def = sess.graph.as_graph_def()
    tf.train.write_graph(graph_def=graph_def, logdir='.', name='model.pb', as_text=False)
    saver = tf.train.Saver()
    saver.save(sess, 'model.ckpt')
nndetector.architecture.models.vgg19((3, 50, 50)) is simply a vgg19-like model defined in Keras.
I'm calling the freeze_graph script like this:
bazel-bin/tensorflow/python/tools/freeze_graph --input_graph=[path-to-model.pb] --input_checkpoint=[path-to-model.ckpt] --output_graph=[output-path] --output_node_names=sigmoid --input_binary=True
If I run the freeze_graph_test script everything works fine.
Does anyone know what I'm doing wrong?
Thanks.
Best regards
Philip
EDIT
I've tried printing tf.train.Saver().as_saver_def().restore_op_name which returns save/restore_all.
Additionally, I've tried a simple pure tensorflow example and still get the same error:
a = tf.Variable(tf.constant(1), name='a')
b = tf.Variable(tf.constant(2), name='b')
add = tf.add(a, b, 'sum')

with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())
    tf.train.write_graph(graph_def=sess.graph.as_graph_def(), logdir='.', name='simple_as_binary.pb', as_text=False)
    tf.train.Saver().save(sess, 'simple.ckpt')
And I'm actually also unable to restore the graph in python. Using the following code throws ValueError: No variables to save if I execute it separately from saving the graph (that is, if I both save and restore the model in the same script, everything works fine).
with gfile.FastGFile('simple_as_binary.pb') as f:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())

with tf.Session() as sess:
    tf.import_graph_def(graph_def)
    saver = tf.train.Saver()
    saver.restore(sess, 'simple.ckpt')
I'm not sure if the two problems are related, or if I'm simply not restoring the model correctly in python.
The problem is the order of these two lines in your original program:
tf.train.write_graph(graph_def=sess.graph.as_graph_def(), logdir='.', name='simple_as_binary.pb', as_text=False)
tf.train.Saver().save(sess, 'simple.ckpt')
Calling tf.train.Saver() adds a set of nodes to the graph, including one called "save/restore_all". However, this program calls it after writing out the graph, so the file you pass to freeze_graph.py doesn't contain those nodes, which are necessary for doing the rewriting.
Reversing the two lines should make the script work as intended:
tf.train.Saver().save(sess, 'simple.ckpt')
tf.train.write_graph(graph_def=sess.graph.as_graph_def(), logdir='.', name='simple_as_binary.pb', as_text=False)
So, I got it working. Sort of.
By using tensorflow.python.client.graph_util.convert_variables_to_constants directly instead of first saving GraphDef and a checkpoint to disk and then using the freeze_graph tool/script, I have been able to save a GraphDef containing both graph definition and variables converted to constants.
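A rough sketch of that approach (the output node name 'sigmoid' is taken from the freeze_graph invocation above; adapt it to your model, and this assumes the model is already built in the default graph):

from tensorflow.python.client import graph_util
import tensorflow as tf

with tf.Session() as sess:
    # build the model and load its weights here, as in the original saving code
    sess.run(tf.initialize_all_variables())
    frozen_graph_def = graph_util.convert_variables_to_constants(
        sess, sess.graph.as_graph_def(), ['sigmoid'])
    tf.train.write_graph(graph_def=frozen_graph_def, logdir='.',
                         name='frozen_model.pb', as_text=False)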
EDIT
mrry updated his answer, which solved my issue of freeze_graph not working, but I'll leave this answer as well, in case anyone else could find it useful.

How to get the global_step when restoring checkpoints in Tensorflow?

I'm saving my session state like so:
self._saver = tf.saver()
self._saver.save(self._session, '/network', global_step=self._time)
When I later restore I want to get the value of the global_step for the checkpoint I restore from. This is in order to set some hyper parameters from it.
The hacky way to do this would be to run through and parse the file names in the checkpoint directory. But surely there has to be a better, built-in way to do this?
The general pattern is to have a global_step variable to keep track of steps:
global_step = tf.Variable(0, name='global_step', trainable=False)
train_op = optimizer.minimize(loss, global_step=global_step)
Then you can save with
saver.save(sess, save_path, global_step=global_step)
When you restore, the value of global_step is restored as well
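A minimal restore-side sketch (assuming the checkpoints above were written under a directory checkpoint_dir and the same saver and global_step objects are in scope):

with tf.Session() as sess:
    saver.restore(sess, tf.train.latest_checkpoint(checkpoint_dir))
    print(sess.run(global_step))   # comes back with the value stored in the checkpoint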
This is a bit of a hack, but the other answers did not work for me at all.
ckpt = tf.train.get_checkpoint_state(checkpoint_dir)
# Extract the step from the checkpoint filename
step = int(os.path.basename(ckpt.model_checkpoint_path).split('-')[1])
Update 9/2017
I'm not sure if this started working due to updates, but the following method seems to be effective in getting global_step to update and load properly:
Create two ops. One to hold global_step and another to increment it:
global_step = tf.Variable(0, trainable=False, name='global_step')
increment_global_step = tf.assign_add(global_step, 1,
                                      name='increment_global_step')
Now in your training loop run the increment op every time you run your training op.
sess.run([train_op,increment_global_step],feed_dict=feed_dict)
If you ever want to retrieve your global step value as an integer at any point, just use the following command after loading the model:
sess.run(global_step)
This can be useful for creating filenames or calculating what your current epoch is without having a second tensorflow Variable for holding that value. For instance, calculating the current epoch on loading would be something like:
loaded_epoch = sess.run(global_step) // (num_train_records // batch_size)
I had the same issue as Lawrence Du: I could not find a way to get the global_step by restoring the model, so I applied his hack to the inception v3 training code in the Tensorflow/models github repo I'm using. The code below also contains a fix related to the pretrained_model_checkpoint_path.
If you have a better solution, or know what I'm missing please leave a comment!
In any case, this code works for me:
...
# When not restoring start at 0
last_step = 0
if FLAGS.pretrained_model_checkpoint_path:
    # A model consists of three files, use the base name of the model in
    # the checkpoint path. E.g. my-model-path/model.ckpt-291500
    #
    # Because we need to give the base name you can't assert (will always fail)
    # assert tf.gfile.Exists(FLAGS.pretrained_model_checkpoint_path)

    variables_to_restore = tf.get_collection(
        slim.variables.VARIABLES_TO_RESTORE)
    restorer = tf.train.Saver(variables_to_restore)
    restorer.restore(sess, FLAGS.pretrained_model_checkpoint_path)
    print('%s: Pre-trained model restored from %s' %
          (datetime.now(), FLAGS.pretrained_model_checkpoint_path))

    # HACK: global step is not restored for some unknown reason
    last_step = int(os.path.basename(FLAGS.pretrained_model_checkpoint_path).split('-')[1])
    # assign to global step
    sess.run(global_step.assign(last_step))
...
for step in range(last_step + 1, FLAGS.max_steps):
    ...
You can use the global_step variable to keep track of steps, but if your code initializes or assigns this value to another step variable, the two may not be consistent.
For instance, you define your global_step using:
global_step = tf.Variable(0, name='global_step', trainable=False)
Assign to your training operation:
train_op = optimizer.minimize(loss, global_step=global_step)
Save in your checkpoint:
saver.save(sess, checkpoint_path, global_step=global_step)
And restore from your checkpoint:
saver.restore(sess, checkpoint_path)
The value of global_step is restored as well, but if you are assigning it to another variable, say step, then you must do something like:
step = global_step.eval(session=sess)
The variable step, contains the last saved global_step in the checkpoint.
It is also nicer to obtain the global_step from the graph rather than defining it as a zero-initialized variable (as above):
global_step = tf.train.get_or_create_global_step()
This will get your last global_step if exist or create one if not.
TL;DR
As tensorflow variable (will be evaluated in the session)
global_step = tf.train.get_or_create_global_step()
# use the global_step variable to calculate your hyperparameter
# this variable will be evaluated later in the session
saver = tf.train.Saver()
with tf.Session() as sess:
    # restore all variables from checkpoint
    saver.restore(sess, checkpoint_path)
    # then init tables and local variables and start training/evaluation ...
Or: As numpy integer (without any session):
reader = tf.train.NewCheckpointReader(absolute_checkpoint_path)
global_step = reader.get_tensor('global_step')
Long Answer
There are at least two ways of retrieving the global step from a checkpoint: as a tensorflow variable or as a numpy integer. Parsing the filename will not work if the global_step was not provided as a parameter to the Saver's save method. For pretrained models, see the remark at the end of the answer.
As Tensorflow variable
If you need the global_step variable to calculate some hyperparameters, you can just use tf.train.get_or_create_global_step(). This will return a tensorflow variable. Because the variable will be evaluated later in the session, you can only use tensorflow operations to calculate your hyperparameters. So, e.g., max(global_step, 100) will not work; you have to use the tensorflow equivalent, tf.maximum(global_step, 100), which can be evaluated later in the session.
Within the session you can initialize the global step variable with a checkpoint using saver.restore(sess, checkpoint_path)
global_step = tf.train.get_or_create_global_step()
# use the global_step variable to calculate your hyperparameter
# this variable will be evaluated later in the session
hyper_parameter = tf.maximum(global_step, 100)
saver = tf.train.Saver()
with tf.Session() as sess:
    # restore all variables from checkpoint
    saver.restore(sess, checkpoint_path)
    # then init tables and local variables and start training/evaluation ...
    # for verification you can print the global step and your hyper parameter
    print(sess.run([global_step, hyper_parameter]))
Or: As numpy integer (without session)
If you need the global step variable as a scalar without starting a session, you can also read this variable directly from your checkpoint file(s). You just need a NewCheckpointReader. Because of a bug in older tensorflow versions, you should convert the path of the checkpoint file to an absolute path. With the reader you can get all the tensors of the model as numpy variables.
The name of the global step variable is the constant string tf.GraphKeys.GLOBAL_STEP, defined as 'global_step'.
absolute_checkpoint_path = os.path.abspath(checkpoint_path)
reader = tf.train.NewCheckpointReader(absolute_checkpoint_path)
global_step = reader.get_tensor(tf.GraphKeys.GLOBAL_STEP)
Remark on pretrained models: in most pretrained models available online, the global step is reset to zero, so these models can be used to initialize the model parameters for finetuning without overwriting the global step.
The current 0.10rc0 version seems to be different: there's no tf.saver() any more; now it's tf.train.Saver(). Also, the save command adds info onto the save_path filename for the global_step, so we can't just call restore on the same save_path, since that's not the actual save file.
The easiest way I see right now is to use the SessionManager along with a saver like this:
my_checkpoint_dir = "/tmp/checkpoint_dir"
# make a saver to use with SessionManager for restoring
saver = tf.train.Saver()
# Build an initialization operation to run below.
init = tf.initialize_all_variables()
# use a SessionManager to help with automatic variable restoration
sm = tf.train.SessionManager()
# try to find the latest checkpoint in my_checkpoint_dir, then create a session with that restored
# if no such checkpoint, then call the init_op after creating a new session
sess = sm.prepare_session("", init_op=init, saver=saver, checkpoint_dir=my_checkpoint_dir)
That's it. Now you have a session that's either restored from the my_checkpoint_dir (make sure that directory exists before calling this), or if there's no checkpoint there then it creates a new session and calls the init_op to initialize your variables.
When you want to save, you just save to any name you want in that directory and pass the global_step in. Here's an example where I save the step variable in a loop as the global_step, so it comes back to that point if you kill the program and restart it so it restores the checkpoint:
checkpoint_path = os.path.join(my_checkpoint_dir, 'model.ckpt')
saver.save(sess, checkpoint_path, global_step=step)
This creates files in my_checkpoint_dir like "model.ckpt-1000" where 1000 is the global_step passed in. If it keeps running, then you get more like "model.ckpt-2000". The SessionManager above picks up the latest one of these when the program is restarted. The checkpoint_path can be whatever file name you want, as long as it's in the checkpoint_dir. The save() will create that file with the global_step appended (as shown above). It also creates a "checkpoint" index file, which is how the SessionManager then finds the latest save checkpoint.
Just a note on my solution for saving and restoring the global step.
Save:
global_step = tf.Variable(0, trainable=False, name='global_step')
saver.save(sess, model_path + model_name, global_step=global_step)
Restore:
if os.path.exists(model_path):
    saver.restore(sess, tf.train.latest_checkpoint(model_path))
    print("Model restore finished, current global step: %d" % global_step.eval())
The reason that a variable is not restored as expected is most likely that it was created after your tf.train.Saver() object was created.
The place where you create the tf.train.Saver() object matters when you don't explicitly specify a var_list, or specify None for var_list. The expected behavior for many programmers is that all variables in the graph are saved when the save() method is called, but this is not the case, and it should perhaps be documented as such. The list of variables to save is captured at the time the Saver object is created.
Unless you're having any performance issues, it's safest to create the saver object right when you decide to save your progress. Otherwise, make sure to create the saver object after you create all your variables.
Also, the global_step that is passed to saver.save(sess, save_path, global_step=global_step) is merely a counter used for creating the filename and has nothing to do with whether it will be restored as a global_step variable. This is a parameter misnomer IMO since if you're saving your progress at the end of each epoch, it's probably best to pass your epoch number for this parameter.
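A tiny illustration of that point (the prefix 'my-model' is arbitrary):

saver.save(sess, 'my-model', global_step=1000)   # writes my-model-1000.index, my-model-1000.data-*, etc.
saver.save(sess, 'my-model', global_step=2000)   # writes my-model-2000.*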