After upgrading TensorFlow to r1.0, the restore command does not seem to work.
For example, can anyone tell me what is wrong with the following?
import tensorflow as tf

def foo():
    v1 = tf.Variable(1., name="v1")
    v2 = tf.Variable(2., name="v2")
    v3 = v1 + v2
    saver = tf.train.Saver()
    with tf.Session() as sess:
        tf.global_variables_initializer().run()
        saver.save(sess, "temp")
        # do something
        saver.restore(sess, "temp")
From the last line, I got an error:
tensorflow.python.framework.errors_impl.NotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for temp
[[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save/Const_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]
The TensorFlow documentation still describes the behaviour of older versions for this case.
TensorFlow 1.0 has a bug where it doesn't recognize tf.Saver.restore() filenames that contain only a filename (and no path component). This will be fixed in the next version, but for now you should be able to use the following workaround to add a path component:
saver.restore(sess, "./temp")
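Another way to avoid the bare-filename issue is to build an absolute checkpoint prefix, which always carries a path component. A minimal sketch, reusing saver and sess from the snippet above (the resulting path is simply your current working directory plus "temp"):

import os

ckpt_prefix = os.path.abspath("temp")
saver.save(sess, ckpt_prefix)
# do something
saver.restore(sess, ckpt_prefix)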
Related
I stumbled across an error that I am unable to resolve. What I am trying to do is the following:
I want to train a (dummy) model that adds a to b on every iteration. When finished, I want to save the variables as a checkpoint. The first time I run the script, it should build the model from scratch. Every time I re-run it, it should start from the last checkpoint and do the additions again; for this, I load the complete graph from the .meta file. The global step variable is there to keep track of the total number of steps I have trained.
import tensorflow as tf
from tensorflow.python.tools.inspect_checkpoint import print_tensors_in_checkpoint_file

# List ALL tensors.
print_tensors_in_checkpoint_file(tf.train.latest_checkpoint('./'), all_tensors=True, tensor_name='')

tf.reset_default_graph()
global_step = tf.get_variable('global_step', shape=[], dtype=tf.int32, initializer=tf.constant_initializer(0), trainable=False)

def model(a, b):
    b = tf.assign_add(b, a)
    return b

with tf.Session() as sess:
    ckpt = tf.train.latest_checkpoint('./')
    if ckpt:
        saver = tf.train.import_meta_graph('./my_test_model-1.meta')
        saver.restore(sess, ckpt)
    else:
        a = tf.Variable(3.0, name='a')
        b = tf.Variable(5.0, name='b')
        b = model(a, b)

        ### before EDIT
        saver = tf.train.Saver()
        sess.run(tf.global_variables_initializer())
        ###

        ### after EDIT
        sess.run(tf.global_variables_initializer())
        saver = tf.train.Saver()
        ###

    for step in range(5):
        global_step.assign_add(1).eval()
        print(global_step.eval())
        print(b.eval())

    saver.save(sess, './my_test_model', global_step=global_step)
The script runs fine for the first time, outputting this:
1 # step
8.0 # value of b
2
11.0
3
14.0
4
17.0
5
20.0
The second time I run the program, I get this output followed by an error:
tensor_name: a
3.0
tensor_name: b
20.0
tensor_name: global_step
0
tensor_name: global_step_1
5
INFO:tensorflow:Restoring parameters from ./my_test_model-5
Traceback (most recent call last): ... FailedPreconditionError:
Attempting to use uninitialized value global_step [[Node:
AssignAdd_2 = AssignAdd[T=DT_INT32, use_locking=false,
_device="/job:localhost/replica:0/task:0/device:CPU:0"](global_step, AssignAdd_2/value)]] ...
The first time, it's clear that it won't throw an error as I run the initializer for all variables. But I thought that restoring a model counts as some sort of initialization? I really cannot wrap my head around this concept. I also tried defining global_step after defining a and b, but this resulted in another error when loading for the first time:
ValueError: Cannot use the default session to evaluate tensor: the
tensor's graph is different from the session's graph. Pass an explicit
session to eval(session=sess).
The error refers to the line that increments global_step (global_step.assign_add(1).eval()).
What am I doing wrong? Where should I define the variable?
I appreciate any help on this problem! Thank you for reading this far.
EDIT:
Thanks to @Diana, the precondition error vanished. Unfortunately, another error occurred. Whenever the script runs and loads a checkpoint, it throws a name error:
NameError: name 'global_step' is not defined.
This also happens for the variable 'b'. Shouldn't the names be available after restoring the checkpoint? The tensors seem to have the right names and values when I inspect the checkpoint file.
You should declare the saver after you have run the initializer. Otherwise the saver does not know about the values and will not save them.
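A minimal sketch of the ordering this answer suggests (the variables a and b and the checkpoint prefix './my_test_model' are taken from the question; the rest is illustrative):

import tensorflow as tf

a = tf.Variable(3.0, name='a')
b = tf.Variable(5.0, name='b')
add_op = tf.assign_add(b, a)

with tf.Session() as sess:
    # Initialize first, then construct the Saver, as suggested above.
    sess.run(tf.global_variables_initializer())
    saver = tf.train.Saver()
    sess.run(add_op)
    saver.save(sess, './my_test_model')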
I'm trying to generate a pb file using the method given in this tutorial,
http://cv-tricks.com/how-to/freeze-tensorflow-models/
import tensorflow as tf
saver = tf.train.import_meta_graph('/Users/pr/tensorflow/dogs-cats-model.meta', clear_devices=True)
graph = tf.get_default_graph()
input_graph_def = graph.as_graph_def()
sess = tf.Session()
saver.restore(sess, "./dogs-cats-model")
When I try to run this code I get this error -
DataLossError (see above for traceback): Unable to open table file ./dogs-cats-model: Data loss: file is too short to be an sstable: perhaps your file is in a different file format and you need to use a different restore operator?
When I googled this error, most results recommend generating the meta file using the version 2 format. Is that the right approach?
Tensorflow version used -
1.3.0
Apparently, you are using both '/Users/pr/tensorflow/dogs-cats-model.meta' and './dogs-cats-model.meta'. Are you sure they point to the same file?
The following code works well on my machine:
import tensorflow as tf
saver = tf.train.import_meta_graph('./dogs-cats-model.meta', clear_devices=True)
graph = tf.get_default_graph()
input_graph_def = graph.as_graph_def()
sess = tf.Session()
saver.restore(sess, "./dogs-cats-model")
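If the two paths might not match, one option is to derive the restore prefix from the checkpoint directory instead of hard-coding it. A sketch, assuming the checkpoint files (plus the 'checkpoint' state file) actually live in /Users/pr/tensorflow/ and the .meta file shares the checkpoint prefix, as in the question:

import tensorflow as tf

ckpt_prefix = tf.train.latest_checkpoint('/Users/pr/tensorflow')  # e.g. '/Users/pr/tensorflow/dogs-cats-model'

saver = tf.train.import_meta_graph(ckpt_prefix + '.meta', clear_devices=True)
sess = tf.Session()
saver.restore(sess, ckpt_prefix)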
With tensorflow 1.2.0, I am trying to restore a saved model but I receive the error:
DataLossError (see above for traceback): Unable to open table file checkpoints/saved_2/saved_2_model_1.meta: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?
[[Node: save/RestoreV2_185 = RestoreV2[dtypes=[DT_INT32], _device="/job:localhost/replica:0/task:0/cpu:0"](_arg_save/Const_0_0, save/RestoreV2_185/tensor_names, save/RestoreV2_185/shape_and_slices)]]
I am using the same tensorflow version for saving and restoring.
For saving:
saver = tf.train.Saver()
ckpt_dir = os.path.join(params['CHK_PATH'], folder)
if not os.path.exists(ckpt_dir):
    os.makedirs(ckpt_dir)
ckpt_file = os.path.join(ckpt_dir, '{}'.format(name))
path = saver.save(sess, ckpt_file)
For restoring:
saver.restore(sess, ckpt_file)
I tried: model_saver = tf.train.Saver(write_version = saver_pb2.SaverDef.V1)
But the same problem remains.
The following works:
saver.restore(sess, tf.train.latest_checkpoint(ckpt_dir))
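For context, a minimal self-contained sketch of this pattern (the variable v and the checkpoint names are placeholders loosely based on the paths in the question):

import os
import tensorflow as tf

v = tf.Variable(1.0, name='v')
ckpt_dir = 'checkpoints/saved_2'
if not os.path.exists(ckpt_dir):
    os.makedirs(ckpt_dir)

saver = tf.train.Saver()
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # save() takes a checkpoint *prefix*; TensorFlow writes the .data/.index/.meta files next to it.
    saver.save(sess, os.path.join(ckpt_dir, 'saved_2_model_1'))
    # latest_checkpoint() reads the 'checkpoint' state file in ckpt_dir and returns that prefix,
    # so restore() never touches the .meta file directly.
    saver.restore(sess, tf.train.latest_checkpoint(ckpt_dir))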
I am trying to import a saved neural network in Tensorflow. I saved it after training with:
saver = tf.train.Saver()
saver.save(sess, filename)
and in the script I use for inference, I restore it with:
sess = tf.Session()
saver = tf.train.import_meta_graph(filename + '.meta')
saver.restore(sess, tf.train.latest_checkpoint('./'))
But during the import_meta_graph line, I get this error:
KeyError: "The name 'dropout1/cond/dropout/Shape/Switch:1' refers to a Tensor which does not exist. The operation, 'dropout1/cond/dropout/Shape/Switch', does not exist in the graph."
I looked at the names of the tensors and operations in the original notebook in which I trained the model, and the names mentioned in the error message do exist. Moreover, I used the same code for saving and importing other models and it works. The only difference is that I trained those on an AWS machine with an older version of TensorFlow, while I trained the problematic one on my computer.
I have run the distributed mnist example:
https://github.com/tensorflow/tensorflow/blob/r0.12/tensorflow/tools/dist_test/python/mnist_replica.py
Though I have set
saver = tf.train.Saver(max_to_keep=0)
In previous releases, like r0.11, I was able to iterate over each checkpoint model and evaluate the precision of the model. This gave me a plot of the precision versus global steps (or iterations).
Prior to r0.12, TensorFlow checkpoint models were saved in two files, model.ckpt-1234 and model.ckpt-1234.meta. One could restore a model by passing the model.ckpt-1234 filename, like so: saver.restore(sess, 'model.ckpt-1234').
However, I've noticed that in r0.12 there are now three output files: model.ckpt-1234.data-00000-of-00001, model.ckpt-1234.index, and model.ckpt-1234.meta.
I see that the restore documentation says that a path such as /train/path/model.ckpt should be given to restore instead of a filename. Is there any way to load one checkpoint file at a time to evaluate it? I have tried passing the model.ckpt-1234.data-00000-of-00001, model.ckpt-1234.index, and model.ckpt-1234.meta files, but get errors like the ones below:
W tensorflow/core/util/tensor_slice_reader.cc:95] Could not open logdir/2016-12-08-13-54/model.ckpt-0.data-00000-of-00001: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?
NotFoundError (see above for traceback): Tensor name "hid_b" not found in checkpoint files logdir/2016-12-08-13-54/model.ckpt-0.index
[[Node: save/RestoreV2_1 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save/Const_0, save/RestoreV2_1/tensor_names, save/RestoreV2_1/shape_and_slices)]]
W tensorflow/core/util/tensor_slice_reader.cc:95] Could not open logdir/2016-12-08-13-54/model.ckpt-0.meta: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?
I'm running on OSX Sierra with tensorflow r12 installed via pip.
Any guidance would be helpful.
Thank you.
I also used TensorFlow r0.12 and I don't think there is any issue with saving and restoring a model. The following is a simple snippet you can try:
import tensorflow as tf

# Create some variables.
v1 = tf.Variable(tf.random_normal([784, 200], stddev=0.35), name="v1")
v2 = tf.Variable(tf.random_normal([784, 200], stddev=0.35), name="v2")

# Add an op to initialize the variables.
init_op = tf.global_variables_initializer()

# Add ops to save and restore all the variables.
saver = tf.train.Saver()

# Later, launch the model, initialize the variables, do some work, save the
# variables to disk.
with tf.Session() as sess:
    sess.run(init_op)
    # Do some work with the model.
    # Save the variables to disk.
    save_path = saver.save(sess, "/tmp/model.ckpt")
    print("Model saved in file: %s" % save_path)

# Later, launch the model, use the saver to restore variables from disk, and
# do some work with the model.
with tf.Session() as sess:
    # Restore variables from disk.
    saver.restore(sess, "/tmp/model.ckpt")
    print("Model restored.")
    # Do some work with the model
Although in r0.12 the checkpoint is stored in multiple files, you can restore it by using the common prefix, which is 'model.ckpt' in your case.
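Applied to the checkpoints from the question's log directory, the restore call takes only that prefix (the exact path below is a sketch based on the paths in the error messages):

# The prefix names the .data-00000-of-00001/.index/.meta triplet on disk;
# none of those individual files is ever passed to restore().
saver.restore(sess, 'logdir/2016-12-08-13-54/model.ckpt-0')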
TensorFlow r0.12 changed the checkpoint format. To keep the old behaviour, save the model in the old V1 format:
import tensorflow as tf
from tensorflow.core.protobuf import saver_pb2
...
saver = tf.train.Saver(write_version = saver_pb2.SaverDef.V1)
saver.save(sess, './model.ckpt', global_step = step)
According to the TensorFlow v0.12.0 RC0’s release note:
New checkpoint format becomes the default in tf.train.Saver. Old V1
checkpoints continue to be readable; controlled by the write_version
argument, tf.train.Saver now by default writes out in the new V2
format. It significantly reduces the peak memory required and latency
incurred during restore.
see details in my blog.
You can restore the model like this:
saver = tf.train.import_meta_graph('./src/models/20170512-110547/model-20170512-110547.meta')
saver.restore(sess, './src/models/20170512-110547/model-20170512-110547.ckpt-250000')
where the path './src/models/20170512-110547/' contains three files:
model-20170512-110547.meta
model-20170512-110547.ckpt-250000.index
model-20170512-110547.ckpt-250000.data-00000-of-00001
And if one directory contains more than one checkpoint, e.g. the path ./20170807-231648/ contains these files:
checkpoint
model-20170807-231648-0.data-00000-of-00001
model-20170807-231648-0.index
model-20170807-231648-0.meta
model-20170807-231648-100000.data-00000-of-00001
model-20170807-231648-100000.index
model-20170807-231648-100000.meta
You can see that there are two checkpoints, so you can use this:
saver = tf.train.import_meta_graph('/home/tools/Tools/raoqiang/facenet/models/facenet/20170807-231648/model-20170807-231648-0.meta')
saver.restore(sess,tf.train.latest_checkpoint('/home/tools/Tools/raoqiang/facenet/models/facenet/20170807-231648/'))
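If you want a specific checkpoint rather than the latest one, you can also pass its prefix directly (a sketch using the step-100000 checkpoint listed above):

saver.restore(sess, '/home/tools/Tools/raoqiang/facenet/models/facenet/20170807-231648/model-20170807-231648-100000')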
OK, I can answer my own question. What I found was that my python script was adding an extra '/' to my path so I was executing:
saver.restore(sess,'/path/to/train//model.ckpt-1234')
Somehow that extra slash was causing a problem with TensorFlow.
When I removed it and called:
saver.restore(sess, '/path/to/train/model.ckpt-1234')
it worked as expected.
Use only model.ckpt-1234 (the prefix, without any suffix); at least it works for me.
I'm new to TF and hit the same issue. After reading Yuan Ma's comments, I copied the '.index' file into the same 'train\ckpt' folder as the '.data-00000-of-00001' file. Then it worked!
So the .index file is also required when restoring a model.
I used TF on Win7, r12.