Nothing running after loading weights with training checkpoints - tensorflow

I am trying to load weights from checkpoints after training a model and then stopping before end of training (still lots of steps). I seem to be able to load the weights by doing:
latest = tf.train.latest_checkpoint(checkpoint_dir)
model.load_weights(latest)
The checkpoint dir has files like [checpoint, ckpt-44.data-00000-of-00001, ckpt-44.inde, ...] and by running the load weights I get a bunch af arrays that seem to be the weights as output but also warning (just a portion of the warning):
WARNING:tensorflow:Value in checkpoint could not be found in the restored object: (root).optimizer's state 'v' for (root).model.layer_with_weights-86.kernel
WARNING:tensorflow:Value in checkpoint could not be found in the restored object: (root).optimizer's state 'v' for (root).model.layer_with_weights-86.bias
WARNING:tensorflow:Value in checkpoint could not be found in the restored object: (root).optimizer's state 'v' for (root).model.layer_with_weights-87.alpha
WARNING:tensorflow:Value in checkpoint could not be found in the restored object: (root).optimizer's state 'v' for (root).model.layer_with_weights-88.kernel
WARNING:tensorflow:Value in checkpoint could not be found in the restored object: (root).optimizer's state 'v' for (root).model.layer_with_weights-88.bias
and after this has run the program just ends. Nothing else is run. If I try to print or show image after this then nothing happens (or if I try to do what I actually want to do). Does anybody have any idea what is happening? Thanks and sorry if similar, have not seen and will remove if I find similar.
I tried a bunch of different ways to load the checkpoint and they didn't seem to work but this seems to load it but not let me do much after loading. I was expecting to be able to continue training or do what I want with the loaded weights.

I was able to make predictions after loading weights from the check points file which has a checkpoint for each epoch
latest = tf.train.latest_checkpoint('/content/training')
model1.load_weights(latest)
For complete code you can refer to this gist. Thank You,

Related

How do I load a non-latest Tensorflow checkpoint?

I made checkpoints every 1000 steps of training, and I have 16 files in my checkpoints directory. However it seems that when I want to retrieve the latest one it's reverting to its pre-trained state. I am assuming something to do with the summary logs not documenting that later checkpoints exist.
chkpt.restore(tf.train.latest_checkpoint(chkpt_dir))
# fit(train_ds, test_ds, steps=100000)
for i in range(10):
ex_input, ex_output = next(iter(test_ds.take(1)))
generate_images(generator, ex_input, ex_output, i, test=True)
How can I manually ask the checkpoint manager to retrieve this or that particular checkpoint file, as oppossed to .latest_checkpoint()?
Edit: Solved it myself, open the checkpoints.txt file in your checkpoint folder and set the suffix number to whichever checkpoint you want to load.
you can use the checkpoints.restore() method to restore checkpoints of your preference. For example, if you want to load checkpoint at iteration 1000, then you write:
checkpoint.restore('./test/model.ckpt-1000')
For more details please refer to this documentation. Thank You.

Tensorflow eager choose checkpoint max to keep

I'm writing a process-based implementation of a3c with tensorflow in eager mode. After every gradient update, my general model writes its parameters as checkpoints to a folder. The workers then update their parameters by loading the last checkpoints from this folder. However, there is a problem.
Often times, while the worker is reading the last available checkpoint from the folder, the master network will write new checkpoints to the folder and sometimes will erase the checkpoint that the worker is reading. A simple solution would be raising the maximum of checkpoints to keep. However, tfe.Checkpoint and tfe.Saver don't have a parameter to choose the max to keep.
Is there a way to achieve this?
For the tf.train.Saver you can specify max_to_keep:
tf.train.Saver(max_to_keep = 10)
and max_to_keep seems to be present in the both fte.Saver and it's tf.training.Saver.
I haven't tried if it works though.
It seems the suggested way of doing checkpoint deletion is to use the CheckpointManager.
import tensorflow as tf
checkpoint = tf.train.Checkpoint(optimizer=optimizer, model=model)
manager = tf.contrib.checkpoint.CheckpointManager(
checkpoint, directory="/tmp/model", max_to_keep=5)
status = checkpoint.restore(manager.latest_checkpoint)
while True:
# train
manager.save()

Loading a checkpoint from a trained model using estimator

I want to do very simple task. Let us assume that I have executed a model and saved multiple checkpoints and metada for this model using tf.estimator. We can again assume that I have 3 checkpoints. 1, 2 and 3. While I am evaluating the trained results on the tensorboard, I am realizing that checkpoint 2 is providing the better weights for my objective.
Therefore I want to load checkpoint 2 and make my predictions. What I want to ask simply is that, is it possible to delete checkpoint 3 from the model dir and let the estimator load it automatically from checkpoint 2 or is there anything I can do to load a specific checkpoint for.my predictions?
Thank you.
Yes, You can. By default, Estimator will load latest available checkpoint in model_dir. So you can either delete files manually, or specify checkpoint file with
warm_start = tf.estimator.WarmStartSettings(ckpt_to_initialize_from='file.ckpt')
and pass this to estimator
tf.estimator.Estimator(model_fn=model_fn,
config=run_config,
model_dir='dir',
warm_start_from=warm_start)
The latter option will not mess tensorboard summaries, so it's generally cleaner

Estimator's model_fn includes params argument, but params are not passed to Estimator

I'm trying to run Object Detection API locally.
I believe I have everything set up as described in the TensorFlow Object Detection API documents, however, when I'm trying to run model_main.py, this warning shows and model doesn't train. (I can't really tell if model is training or not, because the process isn't terminated, but no further logs appear)
WARNING:tensorflow:Estimator's model_fn (.model_fn at 0x0000024BDBB3D158>) includes
params argument, but params are not passed to Estimator.
The code I'm passing in is:
python tensorflow-models/research/object_detection/model_main.py \
--model_dir=training \
--pipeline_config_path=ssd_mobilenet_v1_coco.config \
--checkpoint_dir=ssd_mobilenet_v1_coco_2017_11_17/model.ckpt \
--num_tain_steps=2000 \
--num_eval_steps=200 \
--alsologtostderr
What could be causing this warning?
Why would the code seem stuck?
Please help!
I met the same problem, and I found that this warning has nothing to do with the problem that the model doesn't work. I can make the model work as this warning showing.
My mistake was that I misunderstood the line in the document of running_locally.md
"${MODEL_DIR} points to the directory in which training checkpoints and events will be written to"
I changed the MODEL_DIR to the {project directory}/models/model where the structure of the directory is:
+data
-label_map file
-train TFRecord file
-eval TFRecord file
+models
+ model
-pipeline config file
+train
+eval
And it worked. Hoping this can help you.
Edit: while this may work, in this case model_dir does not contain any saved checkpoint files, if you stop the training after some checkpoint files are saved and restart again, the training would still be skipped. The doc specifies the recommended directory structure, but it is not necessary to be the same structure as all paths to tfrecord, pretrained checkpoints can be configured in the config file.
The actual reason is when model_dir contains checkpoint files which already reached the NUM_TRAIN_STEP, the script will assume the training is finished and exit. Remove the checkpoint files and restart training will work.
In my case, I had the same error because I had inside of the folder where my .cpkt files were, the checkpoint of the pre-trained models too.
Removing that file came inside of the .tar.gz file, the training worked.
I also received this error, and it was because I had previously trained a model on a different dataset/model/config file, and the previous ckpt files still existed in the directory I was working with, moving the old ckpt training data to a different directory fixed the issue
Your script seems good.
One thing we should notice is that, the new model_main.py will not print the log of training(like training step, lr, loss and so on.) It only print the evaluation result after one or multi-epoches, which will be a long time.
So "the process isn't terminated, but no further logs appear" is normal. You can confirm its running by using "nvidia-smi" to check the gpu situation, or use tensorboard to check.
I also encountered this warning message. I checked nvidia-smi and it seemed training wasn't started. Also tried re-organizing output directory and it didn't work out. After checking out Configuring the Object Detection Training Pipeline (tensorflow official), I found it was configuration problem. Solved the problem by adding load_all_detection_checkpoint_vars: true.

How are weights saved in the CIFAR10 tutorial for tensorflow?

In the TensorFlow tutorial to train a network on CIFAR-10, where and how do they save the weights/parameters between running training and evaluation? I cannot see any files saved to my project directory.
Here are the links to the tutorial and the code:
https://www.tensorflow.org/versions/r0.11/tutorials/deep_cnn/index.html
https://github.com/tensorflow/tensorflow/tree/master/tensorflow/models/image/cifar10
It saves the logs and checkpoints to the /tmp/ folder by default.
The weights are included in the checkpoint files.
As you can see in both eval and train files, it does take a checkpoint dir as parameter.
cifar10_train.py:
tf.app.flags.DEFINE_string('train_dir', '/tmp/cifar10_train',
"""Directory where to write event logs """
"""and checkpoint.""")
cifar10_eval.py:
tf.app.flags.DEFINE_string('eval_dir', '/tmp/cifar10_eval',
"""Directory where to write event logs.""")
tf.app.flags.DEFINE_string('eval_data', 'test',
"""Either 'test' or 'train_eval'.""")
tf.app.flags.DEFINE_string('checkpoint_dir', '/tmp/cifar10_train',
"""Directory where to read model checkpoints.""")
You can call those scripts with custom values for those. For my project using Inception I have to change it since the main hard drive does not have enough space for the bottlenecks created by inception.
It might be a good practice to explicitly set those values since the /tmp/ folder is not persistent and thus you might lose your training data.
The following code will save the training data into a custom folder.
python cifar10_train.py --train_dir="/home/username/train_folder"
and then, to evaluate:
python cifar10_eval.py --checkpoint_dir="/home/username/train_folder"
It also applies to the other examples.
Let's assume you're running cifar10_train, saving happens on this line:
https://github.com/tensorflow/tensorflow/blob/r0.11/tensorflow/models/image/cifar10/cifar10_train.py#L122
And the default location is defined in this line (it's "/tmp/cifar10_train"):
https://github.com/tensorflow/tensorflow/blob/r0.11/tensorflow/models/image/cifar10/cifar10_train.py#L51
In cifar10_eval, restoring the weights happens on this line:
https://github.com/tensorflow/tensorflow/blob/r0.11/tensorflow/models/image/cifar10/cifar10_eval.py#L75