TensorFlow eager: choose checkpoint max to keep

I'm writing a process-based implementation of A3C with TensorFlow in eager mode. After every gradient update, my global model writes its parameters as checkpoints to a folder. The workers then update their parameters by loading the latest checkpoint from this folder. However, there is a problem.
Often, while a worker is reading the latest available checkpoint from the folder, the master network writes new checkpoints to the folder and sometimes erases the very checkpoint the worker is reading. A simple solution would be to raise the maximum number of checkpoints to keep. However, tfe.Checkpoint and tfe.Saver don't have a parameter to choose the max to keep.
Is there a way to achieve this?

For tf.train.Saver you can specify max_to_keep:
tf.train.Saver(max_to_keep=10)
max_to_keep also seems to be present in both tfe.Saver and the tf.train.Saver it builds on.
I haven't tried whether it works, though.

It seems the suggested way of doing checkpoint deletion is to use the CheckpointManager.
import tensorflow as tf

checkpoint = tf.train.Checkpoint(optimizer=optimizer, model=model)
manager = tf.contrib.checkpoint.CheckpointManager(
    checkpoint, directory="/tmp/model", max_to_keep=5)
status = checkpoint.restore(manager.latest_checkpoint)
while True:
    # train
    manager.save()
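For the original A3C setup, the worker processes could then poll the same directory with tf.train.latest_checkpoint. A minimal sketch of the worker side, assuming placeholder worker_model and worker_optimizer objects that mirror what the master saves (these names are not from the question):

import tensorflow as tf

# Worker side: mirror the object structure the master process saves.
worker_ckpt = tf.train.Checkpoint(optimizer=worker_optimizer, model=worker_model)

# Find the most recent checkpoint written by the master.
latest = tf.train.latest_checkpoint("/tmp/model")
if latest is not None:
    worker_ckpt.restore(latest)

With max_to_keep raised on the manager, the file the worker is reading is far less likely to be deleted mid-read.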

Related

How do I load a non-latest Tensorflow checkpoint?

I made checkpoints every 1000 steps of training, and I have 16 files in my checkpoints directory. However, when I try to retrieve the latest one, the model seems to revert to its pre-trained state. I am assuming this has something to do with the summary logs not documenting that the later checkpoints exist.
chkpt.restore(tf.train.latest_checkpoint(chkpt_dir))
# fit(train_ds, test_ds, steps=100000)
for i in range(10):
    ex_input, ex_output = next(iter(test_ds.take(1)))
    generate_images(generator, ex_input, ex_output, i, test=True)
How can I manually ask the checkpoint manager to retrieve a particular checkpoint file, as opposed to .latest_checkpoint()?
Edit: Solved it myself. Open the plain-text checkpoint index file in your checkpoint folder and set the suffix number to whichever checkpoint you want to load.
You can use the checkpoint.restore() method to restore any checkpoint you prefer. For example, if you want to load the checkpoint from iteration 1000, you write:
checkpoint.restore('./test/model.ckpt-1000')
For more details, please refer to the documentation. Thank you.
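If a tf.train.CheckpointManager is in use, its checkpoints property lists the checkpoint paths it still keeps, so a particular one can also be picked programmatically. A minimal sketch, reusing the chkpt and chkpt_dir names from the question (the max_to_keep value is just an example):

import tensorflow as tf

manager = tf.train.CheckpointManager(chkpt, directory=chkpt_dir, max_to_keep=20)
print(manager.checkpoints)   # e.g. ['.../ckpt-14', '.../ckpt-15', '.../ckpt-16']

# Restore the second-to-last checkpoint still on disk instead of the latest one.
chkpt.restore(manager.checkpoints[-2])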

Loading a checkpoint from a trained model using estimator

I want to do a very simple task. Assume that I have trained a model and saved multiple checkpoints and metadata for it using tf.estimator. Assume further that I have 3 checkpoints: 1, 2, and 3. While evaluating the trained results on TensorBoard, I realize that checkpoint 2 provides the better weights for my objective.
Therefore I want to load checkpoint 2 and make my predictions. Simply put: is it possible to delete checkpoint 3 from the model dir and let the estimator load checkpoint 2 automatically, or is there anything else I can do to load a specific checkpoint for my predictions?
Thank you.
Yes, you can. By default, Estimator will load the latest available checkpoint in model_dir, so you can either delete files manually or specify the checkpoint file with
warm_start = tf.estimator.WarmStartSettings(ckpt_to_initialize_from='file.ckpt')
and pass it to the estimator:
tf.estimator.Estimator(model_fn=model_fn,
                       config=run_config,
                       model_dir='dir',
                       warm_start_from=warm_start)
The latter option will not mess up the TensorBoard summaries, so it's generally cleaner.
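If the goal is only to predict from a specific checkpoint, Estimator.predict (and evaluate) also accepts a checkpoint_path argument, which avoids touching the files at all. A small sketch, with the estimator/input_fn names and the exact checkpoint file name assumed:

# Predict from a specific checkpoint without deleting the newer ones.
predictions = estimator.predict(
    input_fn=predict_input_fn,
    checkpoint_path='dir/model.ckpt-2000')   # e.g. the file belonging to checkpoint 2

for p in predictions:
    print(p)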

How to control amount of checkpoint kept by tensorflow estimator?

I've noticed that the new Estimator API automatically saves checkpoints during training and automatically restarts from the last checkpoint when training is interrupted. Unfortunately, it seems to keep only the last 5 checkpoints.
Do you know how to control the number of checkpoints that are kept during the training?
TensorFlow's tf.estimator.Estimator takes config as an optional argument, which can be a tf.estimator.RunConfig object that configures runtime settings. You can achieve this as follows:
# Change the maximum number of checkpoints kept to 25
run_config = tf.estimator.RunConfig()
run_config = run_config.replace(keep_checkpoint_max=25)

# Build your estimator
estimator = tf.estimator.Estimator(model_fn,
                                   model_dir=job_dir,
                                   config=run_config,
                                   params=None)
The config parameter is available in all classes (DNNClassifier, DNNLinearCombinedClassifier, LinearClassifier, etc.) that extend estimator.Estimator.
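The same limit can also be passed straight to the RunConfig constructor, optionally together with how often checkpoints are written. A small sketch (the save interval is just an example value):

run_config = tf.estimator.RunConfig(
    keep_checkpoint_max=25,          # keep the last 25 checkpoints
    save_checkpoints_steps=1000)     # write a checkpoint every 1000 steps

estimator = tf.estimator.Estimator(model_fn,
                                   model_dir=job_dir,
                                   config=run_config)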
As a side note, I would like to add that in TensorFlow 2 the situation is a little simpler. To keep a certain number of checkpoint files you can modify the model_main_tf2.py source code. First, you can add and define an integer flag as
# Keep last 25 checkpoints
flags.DEFINE_integer('checkpoint_max_to_keep', 25,
                     'Integer defining how many checkpoint files to keep.')
Then use this pre-defined value in a call to model_lib_v2.train_loop:
# Ensure training loop keeps last 25 checkpoints
model_lib_v2.train_loop(...,
                        checkpoint_max_to_keep=FLAGS.checkpoint_max_to_keep,
                        ...)
The symbol ... above denotes other options to model_lib_v2.train_loop.
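Outside of that script, plain TensorFlow 2 exposes the same knob directly on tf.train.CheckpointManager. A minimal sketch, with the model/optimizer objects, directory name, and step counts as placeholders:

import tensorflow as tf

ckpt = tf.train.Checkpoint(model=model, optimizer=optimizer)
manager = tf.train.CheckpointManager(ckpt, directory='checkpoints', max_to_keep=25)

for step in range(num_steps):
    # ... one training step here ...
    if step % 1000 == 0:
        manager.save(checkpoint_number=step)   # files beyond the last 25 are deleted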

How to add selective nodes in the log file

I am trying to perform a testing on a trained model. I have restored the checkpoint in the session. The model has a lot of operations but I am only testing part of it.
I would like to see the op network that I am testing in TensorBoard, but currently it's all entangled with the other operations that I do not want.
Is there any way to store the tf.summary for the selective operations that you are performing in the session?
Instead of merging all summaries with merged_summary = tf.summary.merge_all(), you can merge only the ops you want, e.g. merged_summary_group1 = tf.summary.merge([op1, op2, ...]). After that, replace merged_summary in your sess.run calls with merged_summary_group1.
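A minimal sketch of that pattern in graph mode; the summary names, the loss/accuracy tensors, the log directory, and the sess/test_feed/step variables are placeholders:

import tensorflow as tf

# Summaries only for the part of the graph being tested.
loss_summary = tf.summary.scalar('test_loss', loss)
acc_summary = tf.summary.scalar('test_accuracy', accuracy)

merged_summary_group1 = tf.summary.merge([loss_summary, acc_summary])
writer = tf.summary.FileWriter('logs/test')

summary_str = sess.run(merged_summary_group1, feed_dict=test_feed)
writer.add_summary(summary_str, global_step=step)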

How to properly train TensorFlow on one machine and evaluate on another?

I'm training a TensorFlow (1.2) model on one machine and attempting to evaluate it on another. Everything works fine when I stay local to one machine.
I am not using placeholders and feed-dicts to get data to the model, but rather TF file queues and batch generators. I suspect this would be much easier with placeholders, but I am trying to make the TF batch generator machinery work.
In my evaluation code I have lines like:
saver = tf.train.Saver()
ckpt = tf.train.get_checkpoint_state(os.path.dirname(ckpt_dir))
if ckpt and ckpt.model_checkpoint_path:
    saver.restore(sess, ckpt.model_checkpoint_path)
This produces errors like:
2017-08-16 12:29:06.387435: W tensorflow/core/framework/op_kernel.cc:1158] Invalid argument: Unsuccessful TensorSliceReader constructor: Failed to get matching files on /data/perdue/minerva/tensorflow/models/11/20170816/checkpoints-20: Not found: /data/perdue/minerva/tensorflow/models/11/20170816
The referenced directory (/data/...) exists on my training machine but not the evaluation machine. I have tried things like
saver = tf.train.import_meta_graph(
    '/local-path/checkpoints-XXX.meta',
    clear_devices=True
)
saver.restore(sess, '/local-path/checkpoints-XXX')
but this produces a different error:
tensorflow.python.framework.errors_impl.FailedPreconditionError: Attempting to use uninitialized value train_file_queue/limit_epochs/epochs
or, if I explicitly call the initializer functions immediately after the restore,
AttributeError: 'Tensor' object has no attribute 'initializer'
Here, train_file_queue/limit_epochs/epochs is an element of the training graph that I would like the evaluation function to ignore (I have another, new element test_file_queue that is pointing at a different file queue with the evaluation data files in it).
I think in the second case, when I'm calling the initializers right after the restore, there is something in the local variables that doesn't work quite like a "normal" Tensor, but I'm not sure exactly what the issue is.
If I just use a generic Saver and restore, TF does the right thing on the original machine: it restores the model parameters and then uses my new file queue for evaluation. But I can't be restricted to that machine; I need to be able to evaluate the model on other machines.
I've also tried freezing a protobuf and a few other options, and there are always difficulties associated with the fact that I need to use file queues as the most upstream inputs.
What is the proper way to train using TensorFlow's file queues and batch generators and then deploy the model on a different machine / in a different environment? I suspect this would be fairly simple if I were using feed-dicts to get data to the graph, but it isn't as clear when using the built-in file queues and batch generators.
Thanks for any comments or suggestions!
At least part of the answer to this dilemma arrived in TF 1.2 or 1.3. There is a new flag for the Saver() constructor:
saver = tf.train.Saver(save_relative_paths=True)
which makes it so that when you save the checkpoint directory, move it to another machine, and use it to restore() a model, everything works without errors about nonexistent data paths (the paths from the old machine where training was performed).
It isn't clear that my use of the API is really idiomatic in this case, but at least the code works, and I can export trained models from one machine to another.
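A small sketch of how that flag could be used on both ends; the directory variables, the step counter, and the sessions below are placeholders, not from the original post:

import os
import tensorflow as tf

# Training machine: store only paths relative to the checkpoint directory.
saver = tf.train.Saver(save_relative_paths=True)
saver.save(sess, os.path.join(ckpt_dir, 'checkpoints'), global_step=step)

# Evaluation machine: copy the checkpoint directory over, then point at it locally.
eval_saver = tf.train.Saver()
latest = tf.train.latest_checkpoint(local_ckpt_dir)
eval_saver.restore(eval_sess, latest)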