Restoring a checkpoint if ckpt.index file is missing - tensorflow

Is it possible to restore a checkpoint if the ckpt.index file is missing and only the ckpt.data, ckpt.meta and .pb (the frozen model corresponding to this checkpoint) files are available?
Context: I want to load the model from the checkpoint and resume training.

No, you need to have the ckpt.index file as well; it is what tells TensorFlow where each saved variable is located in the data shards, so the checkpoint cannot be restored without it.

Related

How do I load a non-latest Tensorflow checkpoint?

I made checkpoints every 1000 steps of training, and I have 16 files in my checkpoints directory. However, it seems that when I try to retrieve the latest one, the model reverts to its pre-trained state. I assume it has something to do with the summary logs not documenting that later checkpoints exist.
chkpt.restore(tf.train.latest_checkpoint(chkpt_dir))
# fit(train_ds, test_ds, steps=100000)
for i in range(10):
    ex_input, ex_output = next(iter(test_ds.take(1)))
    generate_images(generator, ex_input, ex_output, i, test=True)
How can I manually ask the checkpoint manager to retrieve this or that particular checkpoint file, as opposed to .latest_checkpoint()?
Edit: Solved it myself: open the checkpoint file (the plain-text file TensorFlow writes in your checkpoint folder) and set the suffix number to whichever checkpoint you want to load.
You can use the checkpoint.restore() method to restore whichever checkpoint you prefer. For example, if you want to load the checkpoint from iteration 1000, write:
checkpoint.restore('./test/model.ckpt-1000')
For more details, please refer to the tf.train.Checkpoint documentation. Thank you.
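For completeness, here is a minimal sketch using tf.train.CheckpointManager (TF 2.x) to pick a particular checkpoint instead of the latest one; chkpt and chkpt_dir are the objects from the question, while max_to_keep and the list index are illustrative:
import tensorflow as tf

manager = tf.train.CheckpointManager(chkpt, chkpt_dir, max_to_keep=16)
print(manager.checkpoints)               # prefixes of the checkpoints still on disk
chkpt.restore(manager.checkpoints[4])    # restore a specific, non-latest checkpoint
# or restore directly from an explicit prefix:
chkpt.restore('./test/model.ckpt-1000')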

How to convert a checkpoint to a Keras .h5 model?

I have a TensorFlow model that saves checkpoints, but I need to load the weights and save a Keras .h5 model. How can I do that?
I am assuming you need to convert your previous checkpoint into .h5.
Given an already trained model, you want to load its weights and save them as .h5. I am assuming you have it saved as a .model file; let's say it was called first.model.
In your script, you will want to use load_model, loading your checkpoint with
from tensorflow.keras.models import load_model  # or: from keras.models import load_model
model = load_model('first.model')
then you will simply need to use
model.save('goal.h5')
to save as a .h5 file.
For future reference, you can avoid this conversion process by saving checkpoints as .h5:
When using the Checkpoints feature, you have the option to save as either a .model, .h5, or .hdf5 file. The line might look something like this:
checkpoint = ModelCheckpoint("**FILE_NAME_HERE**.model", monitor='val_loss', verbose=1, mode='min', save_best_only=True, save_weights_only=False, period=1)
That is how you save your checkpoint as a .model, but to save it as a h5 as you are looking to do:
checkpoint = ModelCheckpoint("**FILE_NAME_HERE**.h5", monitor='val_loss', verbose=1, mode='min', save_best_only=True, save_weights_only=False, period=1)
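If what you have is a TensorFlow-format checkpoint rather than a .model file, here is a hedged sketch of the conversion, assuming you can rebuild the same architecture in tf.keras (build_model and the paths below are placeholders, not from the original answer):
import tensorflow as tf

model = build_model()                           # placeholder: must recreate the training architecture
model.load_weights('checkpoints/weights.ckpt')  # TF-format checkpoint prefix
model.save('goal.h5')                           # write architecture + weights as HDF5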

How to rewrite TensorFlow checkpoint files?

I want to change the values of tensors in a ckpt file using tensors taken from several other ckpt files, and then use the modified ckpt file to restart TF training jobs.
Hoping for some advice!
Thanks!
There are standalone utilities for reading checkpoint files (search for CheckpointReader or NewCheckpointReader) but not modifying them. The easiest approach is probably to load the checkpoint into your model, assign a new value to the variable you want to change, and save this new checkpoint.
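A minimal sketch of that approach with the TF 1.x-style API; the variable name "w" and the paths are purely illustrative, and in practice you would build your full model graph so every other variable is restored and re-saved as well:
import tensorflow as tf

# Read the replacement value from another checkpoint.
reader = tf.train.NewCheckpointReader('/path/to/other.ckpt')
new_value = reader.get_tensor('w')

# Restore the checkpoint you want to modify, overwrite the variable, and re-save.
w = tf.get_variable('w', shape=new_value.shape)  # only "w" is shown here for brevity
saver = tf.train.Saver()
with tf.Session() as sess:
    saver.restore(sess, '/path/to/original.ckpt')
    sess.run(tf.assign(w, new_value))
    saver.save(sess, '/path/to/modified.ckpt')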

Tensorflow can't save model

I encountered this weird problem... I use this code to construct the TensorFlow saver:
tf.train.Saver(tf.all_variables(), max_to_keep=FLAGS.keep)
which is supposed to be very standard. However, when I point the saving directory to my custom directory (under my username) instead of "/tmp", all of a sudden, the saved models are files like
translate.ckpt-329.data-00000-of-00001
translate.ckpt-329.index
translate.ckpt-329.meta
I can't find the file "translate.ckpt-329".
The generated checkpoint file is pointing to:
model_checkpoint_path: "/Users/.../train_dir/translate.ckpt-329"
all_model_checkpoint_paths: "/Users/.../train_dir/translate.ckpt-329"
while this file does not exist, which creates problems when I try to restore my model.
Can someone shed some light on this? What could possibly be the problem?
Thanks for the first answer! I guess my bigger problem is the restore method:
The original code restores a session like this:
ckpt = tf.train.get_checkpoint_state(FLAGS.train_dir)
model.saver.restore(session, ckpt.model_checkpoint_path)
Which failed with V2 saving :(
if ckpt and tf.gfile.Exists(ckpt.model_checkpoint_path):
    logging.info("Reading model parameters from %s" % ckpt.model_checkpoint_path)
    model.saver.restore(session, ckpt.model_checkpoint_path)
else:
    logging.info("Created model with fresh parameters.")
    session.run(tf.global_variables_initializer())
TL;DR: In the new checkpoint format, the "filename" that you pass to the saver is actually used as the prefix of several filenames, and no file with that exact name is written. You can use the old checkpoint format by constructing your tf.train.Saver with the optional argument write_version=tf.train.SaverDef.V1.
From the names of the saved files, it appears that you are using the "V2" checkpoint format, which became the default in TensorFlow 0.12. This format stores the checkpoint data in multiple files: one or more data files (e.g. translate.ckpt-329.data-00000-of-00001 in your case) and an index file (translate.ckpt-329.index) that tells TensorFlow where each saved variable is located in the data files. The tf.train.Saver uses the "filename" that you pass as the prefix for these files' names, but doesn't produce a file with that exact name.
Although there is no file with the exact name you gave, you can use the value returned from saver.save() as the argument to a subsequent saver.restore(), and the other checkpoint locating mechanisms should continue to work as before.
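A minimal sketch of both options, reusing FLAGS, model and session from the question; dropping the tf.gfile.Exists() check in favor of get_checkpoint_state() alone is my suggestion, not something stated in the answer above:
import tensorflow as tf

# Option 1: keep the V2 format and drop the tf.gfile.Exists() check, which fails
# because no file has the exact prefix name.
ckpt = tf.train.get_checkpoint_state(FLAGS.train_dir)
if ckpt and ckpt.model_checkpoint_path:
    model.saver.restore(session, ckpt.model_checkpoint_path)
else:
    session.run(tf.global_variables_initializer())

# Option 2: fall back to the old single-file V1 format when constructing the saver.
saver = tf.train.Saver(tf.all_variables(), max_to_keep=FLAGS.keep,
                       write_version=tf.train.SaverDef.V1)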

Fail to read the new format of tensorflow checkpoint?

I pip installed tensorflow 0.12. I am able to resume training by loading old checkpoints, which end with .ckpt. However, tensorflow 0.12 dumps new checkpoints in a different format consisting of *.index, *.data-00000-of-00001 and *.meta files. After that, I am not able to restore from the new checkpoints.
What is the proper way of loading the new format? Besides, how do I read the *.index file?
Mostly duplicate of How to restore a model by filename in Tensorflow r12?
Troubleshooting:
- Use the common prefix of the checkpoint files (stop before the first dot after "ckpt").
- Check the model path, either absolute:
  saver.restore(sess, "/full/path/to/model.ckpt")
  or relative:
  saver.restore(sess, "./model.ckpt")
Regarding reading the .index file, as the name suggests, it is the first file to be opened by the restore function. No .index file, no restore (you could still restore without a .meta file).
The .index file needs the .data-xxxx-of-xxxx shards, so it would be rather pointless to read only the .index file without restoring any tensor data. What are you trying to achieve?
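If the goal is simply to see what a V2 checkpoint contains, here is a small sketch using the reader API (pass the common prefix, not the .index file itself; the path is illustrative):
import tensorflow as tf

reader = tf.train.NewCheckpointReader('./model.ckpt-1000')   # prefix, without .index/.data suffix
for name, shape in reader.get_variable_to_shape_map().items():
    print(name, shape)                                        # every saved variable and its shape
    # reader.get_tensor(name) would return the actual values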