How to rewrite a TensorFlow checkpoint file? - tensorflow

I want to change the values of tensors in a ckpt file using tensors from several other ckpt files, and then use the modified ckpt file to restart TF training jobs.
Hoping for some advice!
Thanks!

There are standalone utilities for reading checkpoint files (search for CheckpointReader or NewCheckpointReader), but not for modifying them. The easiest approach is probably to load the checkpoint into your model, assign a new value to the variable you want to change, and save the resulting checkpoint.
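A minimal sketch of that approach in graph mode, assuming a single variable named "w" that you want to overwrite; the paths and the variable name are placeholders, and the replacement value is pulled from another checkpoint with a checkpoint reader:
import tensorflow as tf
# Rebuild (or import) the graph that produced the checkpoint; here just one
# hypothetical variable named "w".
w = tf.Variable(tf.zeros([10]), name="w")
saver = tf.train.Saver()
with tf.Session() as sess:
    # Load the checkpoint you want to modify.
    saver.restore(sess, "/path/to/model.ckpt-1234")
    # Pull the replacement value out of another checkpoint.
    reader = tf.train.NewCheckpointReader("/path/to/other/model.ckpt-5678")
    sess.run(w.assign(reader.get_tensor("w")))
    # Write the modified checkpoint and restart training from it.
    saver.save(sess, "/path/to/modified/model.ckpt")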

Related

How do I load a non-latest Tensorflow checkpoint?

I made checkpoints every 1000 steps of training, and I have 16 files in my checkpoints directory. However, it seems that when I try to retrieve the latest one, the model reverts to its pre-trained state. I am assuming it has something to do with the summary logs not documenting that later checkpoints exist.
chkpt.restore(tf.train.latest_checkpoint(chkpt_dir))
# fit(train_ds, test_ds, steps=100000)
for i in range(10):
    ex_input, ex_output = next(iter(test_ds.take(1)))
    generate_images(generator, ex_input, ex_output, i, test=True)
How can I manually ask the checkpoint manager to retrieve this or that particular checkpoint file, as opposed to .latest_checkpoint()?
Edit: Solved it myself, open the checkpoints.txt file in your checkpoint folder and set the suffix number to whichever checkpoint you want to load.
You can use the checkpoint.restore() method to restore a checkpoint of your choice. For example, if you want to load the checkpoint from iteration 1000, you write:
checkpoint.restore('./test/model.ckpt-1000')
For more details, please refer to this documentation. Thank you.
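As a further option, tf.train.CheckpointManager can list every checkpoint it is tracking, so a particular one can be picked programmatically instead of editing the state file by hand. A minimal sketch, reusing the chkpt and chkpt_dir names from the question:
import tensorflow as tf
# Wrap the existing Checkpoint object in a manager pointed at the same directory.
manager = tf.train.CheckpointManager(chkpt, chkpt_dir, max_to_keep=20)
print(manager.checkpoints)              # e.g. ['.../ckpt-14', '.../ckpt-15', '.../ckpt-16']
chkpt.restore(manager.checkpoints[-3])  # restore a specific, non-latest checkpoint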

Tensorflow eager choose checkpoint max to keep

I'm writing a process-based implementation of a3c with tensorflow in eager mode. After every gradient update, my general model writes its parameters as checkpoints to a folder. The workers then update their parameters by loading the last checkpoints from this folder. However, there is a problem.
Oftentimes, while a worker is reading the last available checkpoint from the folder, the master network will write new checkpoints to the folder and sometimes erase the very checkpoint the worker is reading. A simple solution would be raising the maximum number of checkpoints to keep. However, tfe.Checkpoint and tfe.Saver don't have a parameter to choose the max to keep.
Is there a way to achieve this?
For tf.train.Saver you can specify max_to_keep:
tf.train.Saver(max_to_keep=10)
and max_to_keep seems to be present in both tfe.Saver and its underlying tf.train.Saver.
I haven't tried whether it works, though.
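For reference, a minimal graph-mode sketch of what max_to_keep does with tf.train.Saver (the variable and paths are placeholders); whether tfe.Saver forwards the same argument is the assumption made above:
import tensorflow as tf
# A hypothetical variable so the Saver has something to write.
w = tf.Variable(tf.zeros([10]), name="w")
# Keep the 10 most recent checkpoints instead of the default 5.
saver = tf.train.Saver(max_to_keep=10)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(1000):
        # ... run one training step here ...
        if step % 100 == 0:
            saver.save(sess, "/tmp/model/model.ckpt", global_step=step)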
It seems the suggested way of doing checkpoint deletion is to use the CheckpointManager.
import tensorflow as tf
checkpoint = tf.train.Checkpoint(optimizer=optimizer, model=model)
manager = tf.contrib.checkpoint.CheckpointManager(
    checkpoint, directory="/tmp/model", max_to_keep=5)
status = checkpoint.restore(manager.latest_checkpoint)
while True:
    # train
    manager.save()

Loading a checkpoint from a trained model using estimator

I want to do a very simple task. Let us assume that I have trained a model and saved multiple checkpoints and metadata for it using tf.estimator. We can also assume that I have 3 checkpoints: 1, 2 and 3. While evaluating the results on TensorBoard, I realized that checkpoint 2 provides the better weights for my objective.
Therefore I want to load checkpoint 2 and make my predictions. What I simply want to ask is: is it possible to delete checkpoint 3 from the model dir and let the estimator load checkpoint 2 automatically, or is there anything I can do to load a specific checkpoint for my predictions?
Thank you.
Yes, you can. By default, Estimator will load the latest available checkpoint in model_dir. So you can either delete files manually, or specify the checkpoint file with
warm_start = tf.estimator.WarmStartSettings(ckpt_to_initialize_from='file.ckpt')
and pass this to estimator
tf.estimator.Estimator(model_fn=model_fn,
                       config=run_config,
                       model_dir='dir',
                       warm_start_from=warm_start)
The latter option will not mess up the TensorBoard summaries, so it's generally cleaner.
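If the goal is only to run predictions from one particular checkpoint, the predict call itself also accepts an explicit checkpoint path in later Estimator releases, which avoids both deleting files and warm-starting; treat its availability in your version as an assumption. A sketch, with model_fn, run_config, predict_input_fn and the step number as placeholders:
estimator = tf.estimator.Estimator(model_fn=model_fn,
                                   config=run_config,
                                   model_dir='dir')
# checkpoint_path points the Estimator at checkpoint 2 instead of the latest one.
predictions = estimator.predict(input_fn=predict_input_fn,
                                checkpoint_path='dir/model.ckpt-2000')
for p in predictions:
    print(p)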

How to restore a model by filename in Tensorflow r12?

I have run the distributed mnist example:
https://github.com/tensorflow/tensorflow/blob/r0.12/tensorflow/tools/dist_test/python/mnist_replica.py
Though I have set the
saver = tf.train.Saver(max_to_keep=0)
In a previous release, like r11, I was able to run over each checkpoint model and evaluate its precision. This gave me a plot of precision versus global steps (or iterations).
Prior to r12, TensorFlow checkpoint models were saved in two files, model.ckpt-1234 and model.ckpt-1234.meta. One could restore a model by passing the model.ckpt-1234 filename, like so: saver.restore(sess, 'model.ckpt-1234').
However, I've noticed that in r12 there are now three output files: model.ckpt-1234.data-00000-of-00001, model.ckpt-1234.index, and model.ckpt-1234.meta.
I see that the restore documentation says that a path such as /train/path/model.ckpt should be given to restore instead of a filename. Is there any way to load one checkpoint file at a time to evaluate it? I have tried passing the model.ckpt-1234.data-00000-of-00001, model.ckpt-1234.index, and model.ckpt-1234.meta files, but get errors like below:
W tensorflow/core/util/tensor_slice_reader.cc:95] Could not open logdir/2016-12-08-13-54/model.ckpt-0.data-00000-of-00001: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?
NotFoundError (see above for traceback): Tensor name "hid_b" not found in checkpoint files logdir/2016-12-08-13-54/model.ckpt-0.index
[[Node: save/RestoreV2_1 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save/Const_0, save/RestoreV2_1/tensor_names, save/RestoreV2_1/shape_and_slices)]]
W tensorflow/core/util/tensor_slice_reader.cc:95] Could not open logdir/2016-12-08-13-54/model.ckpt-0.meta: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?
I'm running on OSX Sierra with tensorflow r12 installed via pip.
Any guidance would be helpful.
Thank you.
I also used TensorFlow r0.12 and I don't think there is any issue with saving and restoring a model. The following is some simple code that you can try:
import tensorflow as tf
# Create some variables.
v1 = tf.Variable(tf.random_normal([784, 200], stddev=0.35), name="v1")
v2 = tf.Variable(tf.random_normal([784, 200], stddev=0.35), name="v2")
# Add an op to initialize the variables.
init_op = tf.global_variables_initializer()
# Add ops to save and restore all the variables.
saver = tf.train.Saver()
# Later, launch the model, initialize the variables, do some work, save the
# variables to disk.
with tf.Session() as sess:
    sess.run(init_op)
    # Do some work with the model.
    # Save the variables to disk.
    save_path = saver.save(sess, "/tmp/model.ckpt")
    print("Model saved in file: %s" % save_path)
# Later, launch the model, use the saver to restore variables from disk, and
# do some work with the model.
with tf.Session() as sess:
    # Restore variables from disk.
    saver.restore(sess, "/tmp/model.ckpt")
    print("Model restored.")
    # Do some work with the model
Although in r0.12 the checkpoint is stored in multiple files, you can restore it by using the common prefix, which is 'model.ckpt' in your case.
r0.12 changed the checkpoint format. You should save the model in the old format:
import tensorflow as tf
from tensorflow.core.protobuf import saver_pb2
...
saver = tf.train.Saver(write_version=saver_pb2.SaverDef.V1)
saver.save(sess, './model.ckpt', global_step=step)
According to the TensorFlow v0.12.0 RC0’s release note:
New checkpoint format becomes the default in tf.train.Saver. Old V1 checkpoints continue to be readable; controlled by the write_version argument, tf.train.Saver now by default writes out in the new V2 format. It significantly reduces the peak memory required and latency incurred during restore.
see details in my blog.
You can restore the model like this:
saver = tf.train.import_meta_graph('./src/models/20170512-110547/model-20170512-110547.meta')
saver.restore(sess, './src/models/20170512-110547/model-20170512-110547.ckpt-250000')
Where the path './src/models/20170512-110547/' contains three files:
model-20170512-110547.meta
model-20170512-110547.ckpt-250000.index
model-20170512-110547.ckpt-250000.data-00000-of-00001
And if one directory contains more than one checkpoint, e.g. there are checkpoint files in the path
./20170807-231648/:
checkpoint
model-20170807-231648-0.data-00000-of-00001
model-20170807-231648-0.index
model-20170807-231648-0.meta
model-20170807-231648-100000.data-00000-of-00001
model-20170807-231648-100000.index
model-20170807-231648-100000.meta
You can see that there are two checkpoints, so you can use this:
saver = tf.train.import_meta_graph('/home/tools/Tools/raoqiang/facenet/models/facenet/20170807-231648/model-20170807-231648-0.meta')
saver.restore(sess,tf.train.latest_checkpoint('/home/tools/Tools/raoqiang/facenet/models/facenet/20170807-231648/'))
OK, I can answer my own question. What I found was that my Python script was adding an extra '/' to my path, so I was executing:
saver.restore(sess, '/path/to/train//model.ckpt-1234')
Somehow that was causing a problem with TensorFlow.
When I removed it, calling:
saver.restore(sess, '/path/to/train/model.ckpt-1234')
it worked as expected.
Use only model.ckpt-1234.
At least it works for me.
I'm new to TF and met the same issue. After reading Yuan Ma's comments, I copied the '.index' file into the same 'train\ckpt' folder together with the '.data-00000-of-00001' file. Then it worked!
So the .index file, together with the .data file, is sufficient for restoring the model.
I used TF on Win7, r12.

Fail to read the new format of tensorflow checkpoint?

I pip installed tensorflow 0.12. I am able to resume training by loading old checkpoints which end with .ckpt. However, tensorflow 0.12 dumps new checkpoints in a different format consisting of *.index, *.data-00000-of-00001 and *.meta files. After that, I am not able to restore from the new checkpoint.
What is the proper way of loading the new format? Also, how do I read the *.index file?
Mostly duplicate of How to restore a model by filename in Tensorflow r12?
Troubleshooting:
Use the common prefix of the checkpoint files
- i.e. drop the suffix and stop before the first dot after the step number (model.ckpt-1234, not model.ckpt-1234.index)
Check the model path
- absolute:
saver.restore(sess, "/full/path/to/model.ckpt")
- or relative:
saver.restore(sess, "./model.ckpt")
Regarding reading the .index file, as the name suggests, it is the first file to be opened by the restore function. No .index file, no restore (you could still restore without a .meta file).
The .index file needs the data-xxxx-of-xxxx shards, so it would be kind of pointless to read only the .index file without any tensor data restored. What are you trying to achieve?
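If the goal is just to inspect what a checkpoint contains, a checkpoint reader can list the stored tensors without rebuilding any graph. A minimal sketch; the prefix path and tensor name are placeholders, and you pass the common prefix, not the .index file itself:
import tensorflow as tf
reader = tf.train.NewCheckpointReader("/path/to/model.ckpt-1234")
# Map of tensor name -> shape for everything stored in the checkpoint shards.
for name, shape in reader.get_variable_to_shape_map().items():
    print(name, shape)
# Read one stored tensor as a numpy array ("some_variable" is a placeholder name).
value = reader.get_tensor("some_variable")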