I have a model which runs in a distributed mode for 4000 steps. After every 120s the accuracies are calculated (as is done in the provided examples). However, at times the last checkpoint file is not found.
Error:
Couldn't match files for checkpoint gs://path-on-gcs/train/model.ckpt-1485
The checkpoint file is present at the location. A local run for 2000 steps runs perfectly.
last_checkpoint = tf.train.latest_checkpoint(train_dir(FLAGS.output_path))
I assume that the checkpoint is still being saved and the files have not actually been written yet. I tried introducing a wait before the accuracies are calculated, as shown here. Although this seemed to work at first, the model still failed with a similar issue.
saver.save(session, sv.save_path, global_step)
time.sleep(2) #wait for gcs to be updated
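A fixed sleep can be replaced by polling until files for the checkpoint prefix show up. The following is only a minimal sketch: `wait_for_checkpoint` and its parameters are hypothetical names, and by default it uses the standard library `glob.glob`, which only sees local paths. For gs:// paths one would presumably pass a GCS-aware glob such as `tf.io.gfile.glob` (or `tf.gfile.Glob` on older TensorFlow 1.x releases) as `glob_fn`. Seeing a matching file does not prove the write has finished, so this narrows the race rather than removing it.

```python
import glob
import time


def wait_for_checkpoint(prefix, glob_fn=glob.glob, timeout_s=60.0, poll_s=1.0):
    """Poll until files matching `prefix*` exist, or until the timeout expires.

    Returns True if at least one matching file was seen, False otherwise.
    `glob_fn` is an assumption of this sketch: any callable that takes a
    pattern and returns a list of matching paths.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        # A prefix such as "model.ckpt-1485" expands to several files
        # (index, data shards, meta), so match on the prefix itself.
        if glob_fn(prefix + "*"):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

The call would go between `saver.save(...)` and the accuracy computation, with the evaluation skipped when the function returns False.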
From your comment I think I understand what is going on. I may be wrong.
The cloud_ml distributed sample
https://github.com/GoogleCloudPlatform/cloudml-samples/blob/master/mnist/hptuning/trainer/task.py#L426
uses a temporary directory by default, so it works locally under /tmp. Once the training is complete, it copies the result to gs://, but it does not correct the checkpoint file, which still contains references to the local model files under /tmp. Basically, this is a bug.
To avoid this, launch the training process with --write_to_tmp 0, or modify task.py directly to disable this option. TensorFlow will then work directly on gs://, and the resulting checkpoint will therefore be consistent. At least it worked for me.
One way to check whether my assumptions are correct is to copy the resulting checkpoint file from gs:// to your local filesystem using gsutil and then print its contents.
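Once the `checkpoint` file has been copied locally, the paths it records can be inspected programmatically. This sketch assumes the usual text layout of that file (lines such as `model_checkpoint_path: "..."` and `all_model_checkpoint_paths: "..."`). The helper names and the `/tmp/tmpabc` sample path are hypothetical and not part of the Cloud ML sample.

```python
import re

# Matches lines such as: model_checkpoint_path: "model.ckpt-1485"
_PATH_LINE = re.compile(r'^\s*(\w+):\s*"(.*)"\s*$')


def read_checkpoint_paths(text):
    """Return (field, path) pairs found in the text of a `checkpoint` file."""
    pairs = []
    for line in text.splitlines():
        match = _PATH_LINE.match(line)
        if match:
            pairs.append((match.group(1), match.group(2)))
    return pairs


def paths_outside(pairs, expected_prefix):
    """Return absolute recorded paths that do not start with expected_prefix."""
    foreign = []
    for _, path in pairs:
        # Relative paths are resolved against the checkpoint directory,
        # so only absolute paths (or URLs) can point somewhere else.
        absolute = path.startswith("/") or "://" in path
        if absolute and not path.startswith(expected_prefix):
            foreign.append(path)
    return foreign


# Hypothetical file content illustrating the suspected bug.
sample = (
    'model_checkpoint_path: "/tmp/tmpabc/train/model.ckpt-1485"\n'
    'all_model_checkpoint_paths: "/tmp/tmpabc/train/model.ckpt-1485"\n'
)
stale = paths_outside(read_checkpoint_paths(sample), "gs://path-on-gcs/")
```

If `stale` is non-empty, the file still points at the machine that wrote it, which would match the explanation above.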
I saved a checkpoint every 1000 steps of training, and I have 16 files in my checkpoints directory. However, when I try to retrieve the latest one, the model seems to revert to its pre-trained state. I assume this has something to do with the summary logs not recording that later checkpoints exist.
chkpt.restore(tf.train.latest_checkpoint(chkpt_dir))
# fit(train_ds, test_ds, steps=100000)
for i in range(10):
    ex_input, ex_output = next(iter(test_ds.take(1)))
    generate_images(generator, ex_input, ex_output, i, test=True)
How can I manually ask the checkpoint manager to retrieve a particular checkpoint file, as opposed to using .latest_checkpoint()?
Edit: Solved it myself: open the checkpoints.txt file in your checkpoint folder and set the suffix number to whichever checkpoint you want to load.
You can use the checkpoint.restore() method to restore the checkpoint of your choice. For example, to load the checkpoint at iteration 1000, write:
checkpoint.restore('./test/model.ckpt-1000')
For more details please refer to this documentation. Thank You.
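To see which prefixes can be passed to `restore()`, the checkpoint directory can be scanned for index files. This is a sketch that assumes the common `<name>-<number>.index` file naming; `list_checkpoints` is a hypothetical helper, not a TensorFlow function. The number in the file name is whatever counter the saver used, which may be a save count rather than a training step.

```python
import os
import re

# Index files look like "<name>-<number>.index", e.g. "model.ckpt-1000.index".
_INDEX_FILE = re.compile(r"^(?P<prefix>.+-(?P<step>\d+))\.index$")


def list_checkpoints(checkpoint_dir):
    """Return {number: prefix_path} for every checkpoint found in the directory."""
    found = {}
    for name in os.listdir(checkpoint_dir):
        match = _INDEX_FILE.match(name)
        if match:
            step = int(match.group("step"))
            # The prefix (without ".index") is what restore() expects.
            found[step] = os.path.join(checkpoint_dir, match.group("prefix"))
    return found
```

With this, `checkpoint.restore(list_checkpoints('./test')[1000])` would be equivalent to the hard-coded path above, provided that checkpoint exists.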
I'm trying to run Object Detection API locally.
I believe I have everything set up as described in the TensorFlow Object Detection API documents. However, when I try to run model_main.py, this warning appears and the model doesn't train. (I can't really tell whether the model is training, because the process isn't terminated, but no further logs appear.)
WARNING:tensorflow:Estimator's model_fn (.model_fn at 0x0000024BDBB3D158>) includes
params argument, but params are not passed to Estimator.
The code I'm passing in is:
python tensorflow-models/research/object_detection/model_main.py \
--model_dir=training \
--pipeline_config_path=ssd_mobilenet_v1_coco.config \
--checkpoint_dir=ssd_mobilenet_v1_coco_2017_11_17/model.ckpt \
--num_train_steps=2000 \
--num_eval_steps=200 \
--alsologtostderr
What could be causing this warning?
Why would the code seem stuck?
Please help!
I met the same problem and found that this warning has nothing to do with the model not training. The model can train even while this warning is shown.
My mistake was that I misunderstood this line in running_locally.md:
"${MODEL_DIR} points to the directory in which training checkpoints and events will be written to"
I changed the MODEL_DIR to the {project directory}/models/model where the structure of the directory is:
+data
  -label_map file
  -train TFRecord file
  -eval TFRecord file
+models
  + model
    -pipeline config file
    +train
    +eval
And it worked. Hoping this can help you.
Edit: while this may work, in this case model_dir does not contain any saved checkpoint files. If you stop the training after some checkpoint files have been saved and then restart, the training would still be skipped. The doc specifies a recommended directory structure, but you do not have to follow it, since the paths to the TFRecord files and pre-trained checkpoints can all be configured in the config file.
The actual reason is that when model_dir contains checkpoint files that have already reached NUM_TRAIN_STEP, the script assumes the training is finished and exits. Removing the checkpoint files and restarting the training will work.
In my case, I had the same error because the folder holding my .ckpt files also contained the checkpoint file of the pre-trained model.
After removing that file, which came inside the .tar.gz archive, the training worked.
I also received this error. It happened because I had previously trained a model with a different dataset/model/config file, and the previous ckpt files still existed in the directory I was working with. Moving the old ckpt training data to a different directory fixed the issue.
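Before launching a run, it can help to check whether model_dir already holds checkpoints at or beyond the requested step count. The sketch below approximates this from file names only, assuming the `model.ckpt-<step>.*` naming; the function names are hypothetical. The training script itself decides based on the global step stored in the latest checkpoint, and a pre-trained checkpoint without a step suffix would not be detected here.

```python
import os
import re

# Matches e.g. "model.ckpt-2000.index" or "model.ckpt-2000.meta".
_STEP = re.compile(r"model\.ckpt-(\d+)\.")


def latest_saved_step(model_dir):
    """Return the highest step among model.ckpt-<step>.* files, or None."""
    steps = []
    for name in os.listdir(model_dir):
        match = _STEP.match(name)
        if match:
            steps.append(int(match.group(1)))
    return max(steps) if steps else None


def training_would_be_skipped(model_dir, num_train_steps):
    """True if an existing checkpoint already meets the requested step count."""
    step = latest_saved_step(model_dir)
    return step is not None and step >= num_train_steps
```

For example, `training_would_be_skipped("training", 2000)` returning True would suggest moving the old files away or pointing --model_dir at a fresh directory.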
Your script seems good.
One thing to note is that the new model_main.py does not print training logs (training step, learning rate, loss and so on). It only prints the evaluation result after one or more epochs, which can take a long time.
So "the process isn't terminated, but no further logs appear" is normal. You can confirm that it is running by using "nvidia-smi" to check GPU usage, or by checking TensorBoard.
I also encountered this warning message. I checked nvidia-smi and it seemed that training hadn't started. I also tried reorganizing the output directory, which didn't help. After reading Configuring the Object Detection Training Pipeline (official TensorFlow documentation), I found it was a configuration problem. I solved it by adding load_all_detection_checkpoint_vars: true.
I'm training a TensorFlow (1.2) model on one machine and attempting to evaluate it on another. Everything works fine when I stay local to one machine.
I am not using placeholders and feed-dict's to get data to the model but rather TF file queues and batch generators. I suspect with placeholders this would be much easier but I am trying to make the TF batch generator machinery work.
In my evaluation code I have lines like:
saver = tf.train.Saver()
ckpt = tf.train.get_checkpoint_state(os.path.dirname(ckpt_dir))
if ckpt and ckpt.model_checkpoint_path:
    saver.restore(sess, ckpt.model_checkpoint_path)
This produces errors like:
2017-08-16 12:29:06.387435: W tensorflow/core/framework/op_kernel.cc:1158] Invalid argument: Unsuccessful TensorSliceReader constructor: Failed to get matching files on /data/perdue/minerva/tensorflow/models/11/20170816/checkpoints-20: Not found: /data/perdue/minerva/tensorflow/models/11/20170816
The referenced directory (/data/...) exists on my training machine but not the evaluation machine. I have tried things like
saver = tf.train.import_meta_graph(
    '/local-path/checkpoints-XXX.meta',
    clear_devices=True
)
saver.restore(
    sess, '/local-path/checkpoints-XXX',
)
but this produces a different error:
tensorflow.python.framework.errors_impl.FailedPreconditionError: Attempting to use uninitialized value train_file_queue/limit_epochs/epochs
or, if I explicitly call the initializer functions immediately after the restore,
AttributeError: 'Tensor' object has no attribute 'initializer'
Here, train_file_queue/limit_epochs/epochs is an element of the training graph that I would like the evaluation function to ignore (I have another, new element test_file_queue that is pointing at a different file queue with the evaluation data files in it).
I think that in the second case, when I call the initializers right after the restore, something in the local variables doesn't work quite like a "normal" Tensor, but I'm not sure exactly what the issue is.
If I just use a generic Saver and restore TF does the right thing on the original machine - it just restores model parameters and then uses my new file queue for evaluation. But I can't be restricted to that machine, I need to be able to evaluate the model on other machines.
I've also tried freezing a protobuf and a few other options and there are always difficulties associated with the fact that I need to use file queues as the most upstream inputs.
What is the proper way to train using TensorFlow's file queues and batch generators and then deploy the model on a different machine / in a different environment? I suspect if I were using feed-dict's to get data to the graph this would be fairly simple, but it isn't as clear when using the built in file queues and batch generators.
Thanks for any comments or suggestions!
At least part of this dilemma was addressed in TF 1.2 or 1.3. There is a new flag for the Saver() constructor:
saver = tf.train.Saver(save_relative_paths=True)
that makes it such that when you save the checkpoint directory and move it to another machine, and use it to restore() a model, everything works without errors relating to nonexistent paths for the data (the paths from the old machine where training was performed).
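The reason relative paths make the directory portable can be illustrated without TensorFlow. As far as I understand it, a path recorded in the `checkpoint` state file is used as-is when it is absolute and is otherwise joined onto the directory being read. The sketch below mirrors that rule; `resolve_checkpoint_path` is a hypothetical helper and not part of the TensorFlow API.

```python
import os


def resolve_checkpoint_path(checkpoint_dir, recorded_path):
    """Resolve a path recorded in the `checkpoint` file against its directory."""
    if os.path.isabs(recorded_path):
        # Absolute paths keep pointing at the machine that wrote them.
        return recorded_path
    # Relative paths follow the directory wherever it is copied.
    return os.path.join(checkpoint_dir, recorded_path)


# Recorded with save_relative_paths=True: resolves under the new location.
portable = resolve_checkpoint_path("/local-path", "checkpoints-20")

# Recorded as an absolute path: still refers to the training machine.
stale = resolve_checkpoint_path(
    "/local-path",
    "/data/perdue/minerva/tensorflow/models/11/20170816/checkpoints-20",
)
```

On a POSIX system, `portable` lands under `/local-path`, while `stale` keeps the original `/data/...` location, which is the path the evaluation machine could not find.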
It isn't clear my use of the API is really idiomatic in this case, but at least the code works such that I can export trained models from one machine to another.
In the TensorFlow tutorial to train a network on CIFAR-10, where and how do they save the weights/parameters between running training and evaluation? I cannot see any files saved to my project directory.
Here are the links to the tutorial and the code:
https://www.tensorflow.org/versions/r0.11/tutorials/deep_cnn/index.html
https://github.com/tensorflow/tensorflow/tree/master/tensorflow/models/image/cifar10
It saves the logs and checkpoints to the /tmp/ folder by default.
The weights are included in the checkpoint files.
As you can see in both eval and train files, it does take a checkpoint dir as parameter.
cifar10_train.py:
tf.app.flags.DEFINE_string('train_dir', '/tmp/cifar10_train',
                           """Directory where to write event logs """
                           """and checkpoint.""")
cifar10_eval.py:
tf.app.flags.DEFINE_string('eval_dir', '/tmp/cifar10_eval',
                           """Directory where to write event logs.""")
tf.app.flags.DEFINE_string('eval_data', 'test',
                           """Either 'test' or 'train_eval'.""")
tf.app.flags.DEFINE_string('checkpoint_dir', '/tmp/cifar10_train',
                           """Directory where to read model checkpoints.""")
You can call those scripts with custom values for those. For my project using Inception I have to change it since the main hard drive does not have enough space for the bottlenecks created by inception.
It might be a good practice to explicitly set those values since the /tmp/ folder is not persistent and thus you might lose your training data.
The following code will save the training data into a custom folder.
python cifar10_train.py --train_dir="/home/username/train_folder"
and then, to evaluate:
python cifar10_eval.py --checkpoint_dir="/home/username/train_folder"
It also applies to the other examples.
Let's assume you're running cifar10_train, saving happens on this line:
https://github.com/tensorflow/tensorflow/blob/r0.11/tensorflow/models/image/cifar10/cifar10_train.py#L122
And the default location is defined in this line (it's "/tmp/cifar10_train"):
https://github.com/tensorflow/tensorflow/blob/r0.11/tensorflow/models/image/cifar10/cifar10_train.py#L51
In cifar10_eval, restoring the weights happens on this line:
https://github.com/tensorflow/tensorflow/blob/r0.11/tensorflow/models/image/cifar10/cifar10_eval.py#L75
Using a setup similar to https://github.com/tensorflow/models/tree/master/inception, the chief worker automatically saves a checkpoint file periodically on the node it is running on. I'm running two ps on two different nodes. Two workers also run on each of the two nodes, and one of the four workers is the chief.
When restarting training without any modification, the Supervisor automatically tries to restore the last checkpoint file, but ends up giving an error that it could not find the ckpt on the second node (the node other than the chief worker), because the chief never saved the ckpt on the second node.
W tensorflow/core/framework/op_kernel.cc:936] Not found: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for /home/muneebs/tf_train/model.ckpt-275
If I copy the ckpt directory to the second node, it restores fine. Is this a bug? Should the saver be initialized with sharded=True? If so, is that the only way, and can't we keep the ckpt as a single file in case the number of nodes changes later on?
A distributed file system like hdfs would help.
You can save the model (ckpt) to a directory in HDFS, which avoids the restore problem altogether.
Another option is to launch the ps and the worker whose task_index=0 on the same machine.