How to restore my loss from a saved meta graph? - tensorflow

I have built a simple tensorflow model that is working fine.
While training I save the meta_graph and also some parameters at different steps.
After that (in a new script) I want to restore the saved meta_graph and restore variables and operations.
Everything works fine, except for the loss defined by
with tf.name_scope('MSE'):
    error = tf.losses.mean_squared_error(Y, yhat, scope="error")
which I cannot retrieve. The following line
mse_error = graph.get_tensor_by_name("MSE/error:0")
fails with this error message:
"The name 'MSE/error:0' refers to a Tensor which does not exist. The operation, 'MSE/error', does not exist in the graph."
Since I follow exactly the same procedure for the other variables and ops, and they are restored without any error, I don't know how to deal with this. The only difference is that tf.losses.mean_squared_error takes a scope argument instead of a name argument.
So how do I restore the loss operation with the scope?
Here is the code I use to save and load the model.
Saving:
# define network ...
saver = tf.train.Saver(max_to_keep=10)
sess = tf.Session()
sess.run(tf.global_variables_initializer())
for i in range(NUM_EPOCHS):
    # do training ..., save the model every 1000 optimization steps
    if (i + 1) % 1000 == 0:
        saver.save(sess, "L:/model/mlp_model", global_step=(i+1))
Restore:
# start a session
sess=tf.Session()
# load meta graph
saver = tf.train.import_meta_graph('L:\\model\\mlp_model-1000.meta')
# restore weights
saver.restore(sess, tf.train.latest_checkpoint('L:\\model\\'))
# access network nodes
graph = tf.get_default_graph()
X = graph.get_tensor_by_name("Input/X:0")
Y = graph.get_tensor_by_name("Input/Y:0")
# restore output-generating operation used for prediction
yhat_op = graph.get_tensor_by_name("OutputLayer/yhat:0")
mse_error = graph.get_tensor_by_name("MSE/error:0") # this one doesn't work

To get your training step back, the documentation suggests adding it to a collection before saving, so that you can retrieve it after restoring your graph.
Saving:
saver = tf.train.Saver(max_to_keep=10)
# put op in collection
tf.add_to_collection('train_op', train_op)
...
Restore:
saver = tf.train.import_meta_graph('L:\\model\\mlp_model-1000.meta')
saver.restore(sess, tf.train.latest_checkpoint('L:\\model\\'))
# recover op through collection
train_op = tf.get_collection('train_op')[0]
Why did your attempt at recovering the tensor by name fail?
You can indeed get the tensor by its name; the catch is that you need the correct name. Notice that the "error" string you pass to tf.losses.mean_squared_error is a scope name, not the name of the returned tensor. This can be confusing, as other operations, such as tf.nn.l2_loss, accept a name argument.
In the end, the name of your error operation is MSE/error/value:0, which you can use to get it by name.
That is, until it breaks again in the future when you update tensorflow. tf.losses.mean_squared_error does not give you any guarantee on the name of its output, so it very well may change for some reason.
I think this is what motivates the use of collections: the lack of guarantee on the names of the operators you don't control yourself.
Alternatively, if for some reason you really want to use names, you could rename your operator like this:
with tf.name_scope('MSE'):
    error = tf.losses.mean_squared_error(Y, yhat, scope='error')
    # let me stick my own name on it
    error = tf.identity(error, 'my_error')
Then you can rely on graph.get_tensor_by_name('MSE/my_error:0') safely.
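A tensor name such as MSE/my_error:0 is an operation name followed by a colon and the index of one of that operation's outputs, which is why get_tensor_by_name wants the ":0" suffix. The following is a minimal plain-Python sketch of that convention; split_tensor_name is a hypothetical helper written for illustration and is not part of TensorFlow.

```python
def split_tensor_name(tensor_name):
    """Split '<op_name>:<output_index>' into (op_name, output_index).

    Hypothetical helper for illustration only; it just applies the
    naming convention and does not look anything up in a graph.
    """
    op_name, sep, index = tensor_name.rpartition(":")
    if not sep or not index.isdigit():
        raise ValueError("expected '<op_name>:<output_index>', got %r" % tensor_name)
    return op_name, int(index)


op_name, output_index = split_tensor_name("MSE/my_error:0")
print(op_name, output_index)
```

Passing a bare scope such as "MSE/error" raises ValueError here, mirroring the fact that a scope on its own is neither a tensor name nor necessarily an operation name.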

tf.losses.mean_squared_error is an operation, not a Tensor, so you should load it with get_operation_by_name:
mse_error = graph.get_operation_by_name("MSE/error")
That should work. Note that there is no need for ":0".

Related

Tensorflow save/restore batch norm

I trained a model with batch norm in Tensorflow. I would like to save the model and restore it for further using. The batch norm is done by
def batch_norm(input, phase):
    return tf.layers.batch_normalization(input, training=phase)
where the phase is True during training and False during testing.
It seems like simply calling
saver = tf.train.Saver()
saver.save(sess, savedir + "ckpt")
does not work well: when I restore the model it first reports that it was restored successfully, but then raises "Attempting to use uninitialized value batch_normalization_585/beta" as soon as I run a single node in the graph. Is this related to not saving the model properly, or is it something else that I've missed?
I also had the "Attempting to use uninitialized value batch_normalization_585/beta" error. This comes from the fact that by declaring the saver with the empty brackets like this:
saver = tf.train.Saver()
The saver will save the variables contained in tf.trainable_variables(), which do not include the moving averages of the batch normalization. To include these variables in the saved ckpt you need to do:
saver = tf.train.Saver(tf.global_variables())
This saves ALL the variables, so the checkpoint can get large. Alternatively, you must identify the variables that hold the moving average or variance and pass them explicitly, like this:
saver = tf.train.Saver(tf.trainable_variables() + list_of_extra_variables)
Not sure if this needs to be explained, but just in case (and for other potential viewers).
Whenever you create an operation in TensorFlow, a new node is added to the graph. No two nodes in a graph can have the same name. You can define the name of any node you create, but if you don't give a name, TensorFlow will pick one for you in a deterministic way (that is, not randomly, but always following the same sequence). If you add two numbers, the node will probably be named Add, but if you do another addition, since no two nodes can have the same name, it will be something like Add_1. Once a node is created in a graph its name cannot be changed. Many functions create several subnodes in turn; for example, tf.layers.batch_normalization creates some internal variables beta and gamma.
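To make the naming rule concrete, here is a toy model of deterministic name uniquification in plain Python. NameRegistry is a hypothetical class that only imitates the observable behaviour described above (a numeric suffix on repeated names); it is not TensorFlow's implementation and skips details such as scopes.

```python
class NameRegistry:
    """Toy stand-in for a graph's name bookkeeping (not TensorFlow code)."""

    def __init__(self):
        # Maps a requested base name to how many times it has been requested.
        self._uses = {}

    def unique_name(self, name):
        """Return `name` the first time, then name_1, name_2, ..."""
        count = self._uses.get(name, 0)
        self._uses[name] = count + 1
        return name if count == 0 else "%s_%d" % (name, count)


registry = NameRegistry()
names = [registry.unique_name("Add") for _ in range(3)]
print(names)
```

Building the same sequence of nodes in a fresh registry yields the same names, which is the property that lets a checkpoint written from one graph be restored into an identically constructed one.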
Saving and restoring works in the following way:
1. You create a graph representing the model that you want. This graph contains the variables that will be saved by the saver.
2. You initialize, train or do whatever you want with that graph, and the variables in the model get assigned some values.
3. You call save on the saver to, well, save the values of the variables to a file.
4. Now you recreate the model in a different graph (it can be a different Python session altogether or just another graph coexisting with the first one). The model must be created in exactly the same way the first one was.
5. You call restore on the saver to retrieve the values of the variables.
In order for this to work, the names of the variables in the first and the second graph must be exactly the same.
In your example, TensorFlow is complaining about the variable batch_normalization_585/beta. It seems that you have called tf.layers.batch_normalization nearly 600 times in the same graph, so you have that many beta variables hanging around. I doubt that you actually need that many, so I guess you are just experimenting with the API and ended up with that many copies.
Here's a draft of something that should work:
import tensorflow as tf
def make_model():
    input = tf.placeholder(...)
    phase = tf.placeholder(...)
    input_norm = tf.layers.batch_normalization(input, training=phase)
    # Do some operations with input_norm
    output = ...
    saver = tf.train.Saver()
    return input, output, phase, saver
# We work with one graph first
g1 = tf.Graph()
with g1.as_default():
    input, output, phase, saver = make_model()
    with tf.Session() as sess:
        # Do your training or whatever...
        saver.save(sess, savedir + "ckpt")
# We work with a second different graph now
g2 = tf.Graph()
with g2.as_default():
    input, output, phase, saver = make_model()
    with tf.Session() as sess:
        saver.restore(sess, savedir + "ckpt")
        # Continue using your model...
Again, the typical case is not to have two graphs side by side, but rather have one graph and then recreate it in another Python session later, but in the end both things are the same. The important part is that the model is created in the same way (and therefore with the same node names) in both cases.

How to get the output of a maxpool layer in a pre-trained model in TensorFlow?

I have a model that I trained. I wish to extract from the model the output of an intermediate maxpool layer.
I tried the following
saver = tf.train.import_meta_graph(BASE_DIR + LOG_DIR + '/model.ckpt.meta')
saver.restore(sess,tf.train.latest_checkpoint(BASE_DIR + LOG_DIR))
sess.run("maxpool/maxpool",feed_dict=feed_dict)
Here, feed_dict contains the placeholders and their contents for this run in a dictionary.
I keep getting the following error:
InvalidArgumentError (see above for traceback): You must feed a value for placeholder tensor 'Placeholder_1_1' with dtype float and shape...
What can be the cause of this? I created all of the placeholders and put them in the feed dictionary.
I ran into a similar issue and it was frustrating. What got me around it was filling out the name field for every variable and operation that I wanted to call later. You may also need to add your maxpool/maxpool op to a collection with tf.add_to_collection('name_for_maxpool_op', maxpool_op_handle). You can then restore the ops and named tensors with:
# Restore from metagraph.
saver = tf.train.import_meta_graph(...)
sess = tf.Session()
saver.restore(sess, ...)
graph = sess.graph
# Restore your ops and tensors.
maxpool_op = tf.get_collection('name_for_maxpool_op')[0] # returns a list, you want the first element
a_tensor = graph.get_tensor_by_name('tensor_name:0') # need the :0 added to your name
Then you would build your feed_dict using your restored tensors. Also, as you mentioned in your comment, you need to pass the op itself to sess.run, not its name:
sess.run(maxpool_op, feed_dict=feed_dict)
You can access your tensors and ops from a restored metagraph even if you did not name them (to avoid retraining the model with new fancy tensor names, for instance), but it can be a bit of a pain. The names given to the tensors automatically are not always the most transparent. You can list the names of all variables in your graph with:
print([v.name for v in tf.global_variables()])
You can hopefully find the name that you are looking for there and then restore that tensor using graph.get_tensor_by_name as described above.

TF LSTM: Save State from training session for prediction session later

I am trying to save the latest LSTM State from training to be reused during the prediction stage later. The problem I am encountering is that in the TF LSTM model the State is passed around from one training iteration to next via a combination of a placeholder and a numpy array -- neither of which seems to be included in the Graph by default when the session is saved.
To work around this, I am creating a dedicated TF variable to hold the latest version of the state so as to add it to the Session graph, like so:
# latest State from last training iteration:
_, y, ostate, smm = sess.run([train_step, Y, H, summaries], feed_dict=feed_dict)
# now add to TF variable:
savedState = tf.Variable(ostate, dtype=tf.float32, name='savedState')
tf.variables_initializer([savedState]).run()
save_path = saver.save(sess, pathModel + '/my_model.ckpt')
This seems to add the savedState variable to the saved session graph well, and is easily recoverable later with the rest of the Session.
The problem, though, is that the only way I have managed to actually use that variable later in the restored Session is to initialize all variables in the session AFTER I recover it (which seems to reset all trained variables, including the weights, biases, etc.!). If I initialize variables first and THEN recover the session (which works fine in terms of preserving the trained variables), I get an error saying that I'm trying to access an uninitialized variable.
I know there is a way to initialize a specific individual variable (which I am using while saving it originally), but the problem is that when we recover variables, we refer to them by name as strings; we don't just pass the variable itself?!
# This produces an error: 'trying to use an uninitialized variable'
gInit = tf.global_variables_initializer().run()
new_saver = tf.train.import_meta_graph(pathModel + 'my_model.ckpt.meta')
new_saver.restore(sess, pathModel + 'my_model.ckpt')
fullState = sess.run('savedState:0')
What is the right way to get this done? As a workaround, I am currently saving the State to CSV just as a numpy array and then recover it the same way. It works OK, but clearly not the cleanest solution given that every other aspect of saving/restoring the TF session works perfectly.
Any suggestions appreciated!
EDIT:
Here's the code that works well, as described in the accepted answer below:
# make sure to define the State variable before the Saver variable:
savedState = tf.get_variable('savedState', shape=[BATCHSIZE, CELL_SIZE * LAYERS])
saver = tf.train.Saver(max_to_keep=1)
# last training iteration:
_, y, ostate, smm = sess.run([train_step, Y, H, summaries], feed_dict=feed_dict)
# now save the State and the whole model:
assignOp = tf.assign(savedState, ostate)
sess.run(assignOp)
save_path = saver.save(sess, pathModel + '/my_model.ckpt')
# later on, in some other program, recover the model and the State:
# make sure to initialize all variables BEFORE recovering the model!
gInit = tf.global_variables_initializer().run()
local_saver = tf.train.import_meta_graph(pathModel + 'my_model.ckpt.meta')
local_saver.restore(sess, pathModel + 'my_model.ckpt')
# recover the state from training and get its last dimension
fullState = sess.run('savedState:0')
h = fullState[-1]
h = np.reshape(h, [1, -1])
I haven't tested yet whether this approach unintentionally initializes any other variables in the saved Session, but don't see why it should, since we only run the specific one.
The issue is that creating a new tf.Variable after the Saver was constructed means that the Saver has no knowledge of the new variable. It still gets saved in the metagraph, but not saved in the checkpoint:
import tensorflow as tf
with tf.Graph().as_default():
    var_a = tf.get_variable("a", shape=[])
    saver = tf.train.Saver()
    var_b = tf.get_variable("b", shape=[])
    print(saver._var_list) # [<tf.Variable 'a:0' shape=() dtype=float32_ref>]
    initializer = tf.global_variables_initializer()
    with tf.Session() as session:
        session.run([initializer])
        saver.save(session, "/tmp/model", global_step=0)
with tf.Graph().as_default():
    new_saver = tf.train.import_meta_graph("/tmp/model-0.meta")
    print(new_saver._var_list) # [<tf.Variable 'a:0' shape=() dtype=float32_ref>]
    with tf.Session() as session:
        new_saver.restore(session, "/tmp/model-0") # Only var_a gets restored!
I've annotated the quick reproduction of your issue above with the variables that the Saver knows about.
Now, the solution is relatively easy. I would suggest creating the Variable before the Saver, then using tf.assign to update its value (make sure you run the op returned by tf.assign). The assigned value will be saved in checkpoints and restored just like other variables.
This could be handled better by the Saver as a special case when None is passed to its var_list constructor argument (i.e. it could pick up new variables automatically). Feel free to open a feature request on GitHub for this.

Save and load Tensorflow model

I want to save a Tensorflow (0.12.0) model, including graph and variable values, then later load and execute it. I have read the docs and other posts on this but cannot get the basics to work. I am using the technique from this page in the Tensorflow docs. Code:
Save a simple model:
myVar = tf.Variable(7.1)
tf.add_to_collection('modelVariables', myVar) # why?
init_op = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init_op)
    print sess.run(myVar)
    saver0 = tf.train.Saver()
    saver0.save(sess, './myModel.ckpt')
    saver0.export_meta_graph('./myModel.meta')
Later, load and execute the model:
with tf.Session() as sess:
    saver1 = tf.train.import_meta_graph('./myModel.meta')
    saver1.restore(sess, './myModel.meta')
    print sess.run(myVar)
Question 1: The saving code seems to work but the loading code produces this error:
W tensorflow/core/util/tensor_slice_reader.cc:95] Could not open ./myModel.meta: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?
How do I fix this?
Question 2: I included this line to follow the pattern in the TF docs...
tf.add_to_collection('modelVariables', myVar)
... but why is that line necessary? Doesn't export_meta_graph export the entire graph by default? If not, then does one need to add every variable in the graph to the collection before saving? Or do we just add to the collection those variables that will be accessed after the restore?
---------------------- Update January 12 2017 -----------------------------
Partial success based on Kashyap's suggestion below but a mystery still exists. The code below works but only if I include the lines containing tf.add_to_collection and tf.get_collection. Without those lines, 'load' mode throws an error in the last line:
NameError: name 'myVar' is not defined. My understanding was that by default Saver.save saves and restores all variables in the graph, so why is it necessary to specify the name of variables that will be used in the collection? I assume this has to do with mapping Tensorflow's variable names to Python names, but what are the rules of the game here? For which variables does this need to be done?
mode = 'load' # or 'save'
if mode == 'save':
    myVar = tf.Variable(7.1)
    init_op = tf.global_variables_initializer()
    saver0 = tf.train.Saver()
    tf.add_to_collection('myVar', myVar) ### WHY NECESSARY?
    with tf.Session() as sess:
        sess.run(init_op)
        print sess.run(myVar)
        saver0.save(sess, './myModel')
if mode == 'load':
    with tf.Session() as sess:
        saver1 = tf.train.import_meta_graph('./myModel.meta')
        saver1.restore(sess, tf.train.latest_checkpoint('./'))
        myVar = tf.get_collection('myVar')[0] ### WHY NECESSARY?
        print sess.run(myVar)
Question 1
This question has already been answered thoroughly elsewhere. You don't have to call export_meta_graph explicitly. Just call the save method, which also generates the .meta file (save calls export_meta_graph internally).
For example
saver0.save(sess, './myModel.ckpt')
will produce the myModel.ckpt file and also the myModel.ckpt.meta file.
Then you can restore the model using
with tf.Session() as sess:
    saver1 = tf.train.import_meta_graph('./myModel.ckpt.meta')
    # restore takes the same prefix that was passed to save
    saver1.restore(sess, './myModel.ckpt')
    print sess.run(myVar)
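The "not an sstable" error in the question comes from handing the .meta file to restore, which expects the checkpoint prefix that was passed to save. Assuming the .meta file was written by save itself (so its name is the prefix plus ".meta"), the prefix can be derived as in this small sketch; checkpoint_prefix is a hypothetical helper, not a TensorFlow function.

```python
def checkpoint_prefix(meta_path):
    """Derive the checkpoint prefix from a '<prefix>.meta' path.

    Hypothetical helper: assumes the .meta file was produced by
    Saver.save, so stripping the suffix gives the prefix to restore from.
    """
    suffix = ".meta"
    if not meta_path.endswith(suffix):
        raise ValueError("not a .meta path: %r" % meta_path)
    return meta_path[:-len(suffix)]


print(checkpoint_prefix("./myModel.ckpt.meta"))
```

The returned string is what would go to restore, while the original .meta path goes to tf.train.import_meta_graph.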
Question 2
Collections are used to store custom information, such as the learning rate or the regularisation factor that you have used, and this information is stored when you export the graph. Tensorflow itself defines some collections like "TRAINABLE_VARIABLES", which are used to get all the trainable variables of the model you built. You can choose to export all the collections in your graph, or you can specify which collections to export in the export_meta_graph function.
Yes, tensorflow will export all the variables that you define. But if you need any other information to be exported with the graph, you can add it to a collection.
I've been trying to figure out the same thing and was able to successfully do it by using Supervisor. It automatically loads all variables and your graph etc. Here is the documentation - https://www.tensorflow.org/programmers_guide/supervisor. Below is my code -
sv = tf.train.Supervisor(logdir="/checkpoint", save_model_secs=60)
with sv.managed_session() as sess:
    if not sv.should_stop():
        # Do run/eval/train ops on sess as needed. The above works for both saving and loading.
        pass
As you can see, this is much simpler than using the Saver object and dealing with individual variables etc., as long as the graph stays the same (my understanding is that Saver comes in handy when we want to reuse a pre-trained model for a different graph).

How to get the global_step when restoring checkpoints in Tensorflow?

I'm saving my session state like so:
self._saver = tf.saver()
self._saver.save(self._session, '/network', global_step=self._time)
When I later restore I want to get the value of the global_step for the checkpoint I restore from. This is in order to set some hyper parameters from it.
The hacky way to do this would be to run through and parse the file names in the checkpoint directory. But surely there has to be a better, built-in way to do this?
General pattern is to have a global_step variable to keep track of steps
global_step = tf.Variable(0, name='global_step', trainable=False)
train_op = optimizer.minimize(loss, global_step=global_step)
Then you can save with
saver.save(sess, save_path, global_step=global_step)
When you restore, the value of global_step is restored as well
This is a bit of a hack, but the other answers did not work for me at all
import os
ckpt = tf.train.get_checkpoint_state(checkpoint_dir)
# Extract the step from the checkpoint filename (the part after the last '-')
step = int(os.path.basename(ckpt.model_checkpoint_path).split('-')[-1])
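A slightly more defensive version of the filename hack can be sketched in plain Python. It assumes the checkpoint was saved with a global_step argument, so the path ends in "-<step>"; step_from_checkpoint_path is a hypothetical helper name. Splitting on the last hyphen keeps it working when the model name itself contains hyphens.

```python
import os


def step_from_checkpoint_path(checkpoint_path):
    """Parse the trailing '-<step>' from a checkpoint path.

    Hypothetical helper: only meaningful when global_step was passed
    to Saver.save, which is what appends the suffix.
    """
    base = os.path.basename(checkpoint_path)
    _, sep, step = base.rpartition("-")
    if not sep or not step.isdigit():
        raise ValueError("no '-<step>' suffix in %r" % checkpoint_path)
    return int(step)


print(step_from_checkpoint_path("my-model-path/model.ckpt-291500"))
```

A path without the suffix raises ValueError instead of silently returning a wrong number.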
Update 9/2017
I'm not sure if this started working due to updates, but the following method seems to be effective in getting global_step to update and load properly:
Create two ops. One to hold global_step and another to increment it:
global_step = tf.Variable(0, trainable=False, name='global_step')
increment_global_step = tf.assign_add(global_step, 1,
                                      name='increment_global_step')
Now in your training loop run the increment op every time you run your training op.
sess.run([train_op,increment_global_step],feed_dict=feed_dict)
If you ever want to retrieve your global step value as an integer at any point, just use the following command after loading the model:
sess.run(global_step)
This can be useful for creating filenames or calculating what your current epoch is without having a second tensorflow Variable for holding that value. For instance, calculating the current epoch on loading would be something like:
loaded_epoch = sess.run(global_step) * batch_size // num_train_records
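The epoch arithmetic can be checked in isolation. This sketch assumes global_step is incremented once per batch and that every batch holds batch_size records; under those assumptions the number of records seen is global_step * batch_size, and dividing by the dataset size gives the number of completed epochs. epoch_from_step is a hypothetical helper name.

```python
def epoch_from_step(global_step, batch_size, num_train_records):
    """Completed epochs, assuming one global_step increment per batch."""
    examples_seen = global_step * batch_size
    return examples_seen // num_train_records


# 1000 steps of 32 records over a dataset of 8000 records.
print(epoch_from_step(1000, 32, 8000))
```

If the last batch of each epoch is smaller than batch_size, the result is only approximate near epoch boundaries.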
I had the same issue as Lawrence Du, I could not find a way to get the global_step by restoring the model. So I applied his hack to the inception v3 training code in the Tensorflow/models github repo I'm using. The code below also contains a fix related to the pretrained_model_checkpoint_path.
If you have a better solution, or know what I'm missing please leave a comment!
In any case, this code works for me:
...
# When not restoring start at 0
last_step = 0
if FLAGS.pretrained_model_checkpoint_path:
    # A model consists of three files, use the base name of the model in
    # the checkpoint path. E.g. my-model-path/model.ckpt-291500
    #
    # Because we need to give the base name you can't assert (will always fail)
    # assert tf.gfile.Exists(FLAGS.pretrained_model_checkpoint_path)
    variables_to_restore = tf.get_collection(
        slim.variables.VARIABLES_TO_RESTORE)
    restorer = tf.train.Saver(variables_to_restore)
    restorer.restore(sess, FLAGS.pretrained_model_checkpoint_path)
    print('%s: Pre-trained model restored from %s' %
          (datetime.now(), FLAGS.pretrained_model_checkpoint_path))
    # HACK : global step is not restored for some unknown reason
    last_step = int(os.path.basename(FLAGS.pretrained_model_checkpoint_path).split('-')[1])
    # assign to global step
    sess.run(global_step.assign(last_step))
...
for step in range(last_step + 1, FLAGS.max_steps):
    ...
You can use the global_step variable to keep track of steps, but if in your code, you are initializing or assigning this value to another step variable, it may not be consistent.
For instance, you define your global_step using:
global_step = tf.Variable(0, name='global_step', trainable=False)
Assign to your training operation:
train_op = optimizer.minimize(loss, global_step=global_step)
Save in your checkpoint:
saver.save(sess, checkpoint_path, global_step=global_step)
And restore from your checkpoint:
saver.restore(sess, checkpoint_path)
The value of global_step is restored as well, but if you are assigning it to another variable, say step, then you must do something like:
step = global_step.eval(session=sess)
The variable step then contains the last global_step saved in the checkpoint.
It is also nicer to obtain global_step from the graph rather than defining it as a zero-initialized variable (as done earlier):
global_step = tf.train.get_or_create_global_step()
This returns the existing global_step if there is one, or creates it if not.
TL;DR
As tensorflow variable (will be evaluated in the session)
global_step = tf.train.get_or_create_global_step()
# use global_step variable to calculate your hyperparameter
# this variable will be evaluated later in the session
saver = tf.train.Saver()
with tf.Session() as sess:
    # restore all variables from checkpoint
    saver.restore(sess, checkpoint_path)
    # then init table and local variables and start training/evaluation ...
Or: As numpy integer (without any session):
reader = tf.train.NewCheckpointReader(absolute_checkpoint_path)
global_step = reader.get_tensor('global_step')
Long Answer
There are at least two ways of retrieving the global step from a checkpoint: as a tensorflow variable or as a numpy integer. Parsing the filename will not work if the global_step was not provided as a parameter to the save method of the Saver. For pretrained models, see the remark at the end of the answer.
As Tensorflow variable
If you need the global_step variable to calculate some hyperparameters you can just use tf.train.get_or_create_global_step(). This will return a tensorflow variable. Because the variable will be evaluated later in the session you can only use tensorflow operations to calculate your hyperparameters. So e.g.: max(global_step, 100) will not work. You have to use tensorflow equivalent tf.maximum(global_step, 100) that can be evaluated later in the session.
Within the session you can initialize the global step variable with a checkpoint using saver.restore(sess, checkpoint_path)
global_step = tf.train.get_or_create_global_step()
# use global_step variable to calculate your hyperparameter
# this variable will be evaluated later in the session
hyper_parameter = tf.maximum(global_step, 100)
saver = tf.train.Saver()
with tf.Session() as sess:
    # restore all variables from checkpoint
    saver.restore(sess, checkpoint_path)
    # then init table and local variables and start training/evaluation ...
    # for verification you can print the global step and your hyper parameter
    print(sess.run([global_step, hyper_parameter]))
Or: As numpy integer (without session)
If you need the global step variable as scalar without starting a session you can also read this variable directly from your checkpoint file(s). You just need a NewCheckpointReader. Because of a bug in older tensorflow versions you should convert the path of the checkpoint file to an absolute path. With the reader you can get all the tensors of the model as numpy variables.
The name of the global step variable is a constant string tf.GraphKeys.GLOBAL_STEP defined as 'global_step'.
absolute_checkpoint_path = os.path.abspath(checkpoint_path)
reader = tf.train.NewCheckpointReader(absolute_checkpoint_path)
global_step = reader.get_tensor(tf.GraphKeys.GLOBAL_STEP)
Remark on pretrained models: In most pretrained models that are available online the global step is reset to zero. So, these models can be used to initialize the model parameters for finetuning without overwriting the global step.
The current 0.10rc0 version seems to be different: there's no tf.saver() any more. Now it's tf.train.Saver(). Also, the save command appends the global_step to the save_path filename, so we can't just call restore on the same save_path, since that is not the actual save file.
The easiest way I see right now is to use the SessionManager along with a saver like this:
my_checkpoint_dir = "/tmp/checkpoint_dir"
# make a saver to use with SessionManager for restoring
saver = tf.train.Saver()
# Build an initialization operation to run below.
init = tf.initialize_all_variables()
# use a SessionManager to help with automatic variable restoration
sm = tf.train.SessionManager()
# try to find the latest checkpoint in my_checkpoint_dir, then create a session with that restored
# if no such checkpoint, then call the init_op after creating a new session
sess = sm.prepare_session("", init_op=init, saver=saver, checkpoint_dir=my_checkpoint_dir)
That's it. Now you have a session that's either restored from the my_checkpoint_dir (make sure that directory exists before calling this), or if there's no checkpoint there then it creates a new session and calls the init_op to initialize your variables.
When you want to save, you just save to any name you want in that directory and pass the global_step in. Here's an example where I save the step variable in a loop as the global_step, so it comes back to that point if you kill the program and restart it so it restores the checkpoint:
checkpoint_path = os.path.join(my_checkpoint_dir, 'model.ckpt')
saver.save(sess, checkpoint_path, global_step=step)
This creates files in my_checkpoint_dir like "model.ckpt-1000", where 1000 is the global_step passed in. If it keeps running, then you get more, like "model.ckpt-2000". The SessionManager above picks up the latest one of these when the program is restarted. The checkpoint_path can be whatever file name you want, as long as it's in the checkpoint_dir. The save() call will create that file with the global_step appended (as shown above). It also creates a "checkpoint" index file, which is how the SessionManager then finds the latest saved checkpoint.
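For illustration, the lookup through the "checkpoint" index file can be sketched in plain Python. This assumes the index is the small text file with a model_checkpoint_path: "..." line, which is what I have seen TensorFlow 1.x write; the sample contents and the latest_from_index helper are hypothetical. In real code tf.train.latest_checkpoint does this lookup for you.

```python
import re

# Hypothetical contents of a "checkpoint" index file after two saves.
SAMPLE_INDEX = '''model_checkpoint_path: "model.ckpt-2000"
all_model_checkpoint_paths: "model.ckpt-1000"
all_model_checkpoint_paths: "model.ckpt-2000"
'''


def latest_from_index(index_text):
    """Return the path on the model_checkpoint_path line, or None."""
    match = re.search(r'^model_checkpoint_path:\s*"([^"]*)"',
                      index_text, re.MULTILINE)
    return match.group(1) if match else None


print(latest_from_index(SAMPLE_INDEX))
```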
Here is my solution for saving and restoring the global step.
Save:
global_step = tf.Variable(0, trainable=False, name='global_step')
saver.save(sess, model_path + model_name, global_step=global_step)
Restore:
if os.path.exists(model_path):
    saver.restore(sess, tf.train.latest_checkpoint(model_path))
    print("Model restore finished, current global step: %d" % global_step.eval())
The reason that a variable is not restored as expected is most likely that it was created after your tf.train.Saver() object was created.
The place where you create the tf.train.Saver() object matters when you don't explicitly specify a var_list, or specify None for var_list. Many programmers expect that all variables in the graph are saved when the save() method is called, but this is not the case, and it should perhaps be documented as such. Instead, the saver takes a snapshot of the variables in the graph at the time it is constructed.
Unless you're having any performance issues, it's safest to create the saver object right when you decide to save your progress. Otherwise, make sure to create the saver object after you create all your variables.
Also, the global_step that is passed to saver.save(sess, save_path, global_step=global_step) is merely a counter used for creating the filename and has nothing to do with whether it will be restored as a global_step variable. This is a parameter misnomer IMO since if you're saving your progress at the end of each epoch, it's probably best to pass your epoch number for this parameter.