How to use the fine_tune_checkpoint in TensorFlow Object Detection API

I'm trying to load a fine_tune_checkpoint to start the training with.
I added the appropriate field in the config file, and set the value to be a checkpoint I have from a model previously trained from scratch.
The way tfod saves checkpoints is by using 3 files with different suffixes (data, index, meta), and I set the value to the name of the checkpoint without the suffix.
fine_tune_checkpoint: "/path/to/my/checkpoint/dir/model.ckpt-190000"
There's no indication whatsoever that what I'm doing is either right or wrong. No logs stating the checkpoint is loaded and no errors, warnings or indications that it is not.
How can I tell if the checkpoint was actually loaded?
I'm happy for any suggestions to verify it either way.
Thanks in advance.

You need to do two things to solve this:
1- Give this parameter an absolute path, like this:
fine_tune_checkpoint: "c:/user/desktop/path/to/my/checkpoint/dir/model.ckpt-190000"
2- The fine-tune checkpoint should be in the same folder you used for the previous training run that produced "model.ckpt-190000"; otherwise the model will start from step 0.
Regards.
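Before looking for log output, it can help to rule out a wrong prefix. The following is a minimal sketch, assuming the usual on-disk layout in which a checkpoint prefix is accompanied by a ".index" file and one or more ".data-*" shards. The helper names are hypothetical and not part of the Object Detection API, and the check only confirms that the files exist, not that training actually restored them.

```python
import glob
import os


def checkpoint_files_for_prefix(prefix):
    """Return the index file and data shards found for a checkpoint prefix."""
    index_path = prefix + ".index"
    # glob.escape guards against special characters in the directory name.
    data_shards = sorted(glob.glob(glob.escape(prefix) + ".data-*"))
    return {
        "index": index_path if os.path.isfile(index_path) else None,
        "data": data_shards,
    }


def looks_like_valid_prefix(prefix):
    """True when the .index file and at least one .data shard both exist."""
    found = checkpoint_files_for_prefix(prefix)
    return found["index"] is not None and len(found["data"]) > 0
```

For example, looks_like_valid_prefix("/path/to/my/checkpoint/dir/model.ckpt-190000") returning False would suggest the value in the config file does not point at a saved checkpoint.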

Related

How do I load a non-latest Tensorflow checkpoint?

I made checkpoints every 1000 steps of training, and I have 16 files in my checkpoints directory. However it seems that when I want to retrieve the latest one it's reverting to its pre-trained state. I am assuming something to do with the summary logs not documenting that later checkpoints exist.
chkpt.restore(tf.train.latest_checkpoint(chkpt_dir))
# fit(train_ds, test_ds, steps=100000)
for i in range(10):
    ex_input, ex_output = next(iter(test_ds.take(1)))
    generate_images(generator, ex_input, ex_output, i, test=True)
How can I manually ask the checkpoint manager to retrieve a particular checkpoint file, as opposed to using .latest_checkpoint()?
Edit: Solved it myself, open the checkpoints.txt file in your checkpoint folder and set the suffix number to whichever checkpoint you want to load.
You can use the restore() method of your checkpoint object to restore any checkpoint you prefer. For example, if you want to load the checkpoint saved at iteration 1000, you write:
checkpoint.restore('./test/model.ckpt-1000')
For more details please refer to this documentation. Thank You.
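To find out which prefixes can be passed to restore(), one option is to scan the checkpoint directory for ".index" files. This is a minimal sketch that assumes prefixes end in "-<number>" (for example "model.ckpt-1000" or "ckpt-16"); the function names are hypothetical. Note that the number may be a save counter rather than a training step, depending on how the checkpoints were written.

```python
import os
import re

# Matches names such as "model.ckpt-1000.index" or "ckpt-16.index".
_INDEX_RE = re.compile(r"^(?P<prefix>.+-(?P<number>\d+))\.index$")


def list_checkpoint_prefixes(checkpoint_dir):
    """Map checkpoint number -> prefix for every *.index file found."""
    prefixes = {}
    for name in os.listdir(checkpoint_dir):
        match = _INDEX_RE.match(name)
        if match:
            number = int(match.group("number"))
            prefixes[number] = os.path.join(checkpoint_dir, match.group("prefix"))
    return prefixes


def prefix_for_number(checkpoint_dir, number):
    """Return the prefix saved under `number`, or raise if none exists."""
    prefixes = list_checkpoint_prefixes(checkpoint_dir)
    if number not in prefixes:
        raise FileNotFoundError(
            f"No checkpoint numbered {number}; available: {sorted(prefixes)}")
    return prefixes[number]
```

The returned string could then be handed to the restore call, e.g. checkpoint.restore(prefix_for_number('./test', 1000)).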

INFO:tensorflow:Waiting for new checkpoint at models/faster_rcnn

I used the transfer learning approach to develop a detection model using the faster_rcnn algorithm.
To evaluate my model, I used the following commands-
!python model_main_tf2.py --model_dir=models/faster_rcnn_inception_resnet_v2 --pipeline_config_path=models/faster_rcnn_inception_resnet_v2/pipeline.config --checkpoint_dir=models/faster_rcnn_inception_resnet_v2
However, I keep getting the following error/info message:
INFO:tensorflow:Waiting for new checkpoint at models/faster_rcnn_inception_resnet_v2
I0331 23:23:11.699681 140426971481984 checkpoint_utils.py:139] Waiting for new checkpoint at models/faster_rcnn_inception_resnet_v2
I checked that the path to the checkpoint_dir is correct. What could be the problem and how can I resolve it?
Thanks in advance.
You need to run a separate training process to generate new checkpoints. model_main_tf2.py does not do both at once, i.e., it won't train the model and evaluate it at the end of each epoch.
One way to get what you want is to modify checkpoint_max_to_keep in https://github.com/tensorflow/models/blob/13ec3c1460b928301d208115aed0c94fb47538b7/research/object_detection/model_lib_v2.py#L445
so that all checkpoints are kept, and then evaluate separately. This does not work exactly the way you want, but it does generate the curves.
A similar situation happened to me. I don't know if this is the solution or just a workaround but it did work for me. I simply exported my model and provided the path to that checkpoint folder.
finetune_checkpoint_model_directory
|
\---checkpoint (folder)
    |
    \---checkpoint (file with no extension)
    \---ckpt-1.data-00000-of-00001
    \---ckpt-1.index
and then simply ran the model_main_tf2.py file for evaluation.
If you trained your model for only a small number of steps, that can be a problem: with too few checkpoints, TensorFlow may not be able to run the evaluation, so try increasing the number of steps.
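A quick way to see why the evaluator keeps waiting is to look at what the directory passed as --checkpoint_dir actually contains. The sketch below assumes the common layout in which a text file named "checkpoint" records a model_checkpoint_path entry next to the ".index" and ".data-*" files; the function names are hypothetical, and the real evaluation loop may apply further conditions.

```python
import os
import re

# Lines in the state file look like: model_checkpoint_path: "ckpt-1"
_FIELD_RE = re.compile(r'^\s*(\w+)\s*:\s*"(.*)"\s*$')


def read_checkpoint_state(checkpoint_dir):
    """Return (latest_path, all_paths) parsed from the 'checkpoint' file."""
    state_path = os.path.join(checkpoint_dir, "checkpoint")
    latest, all_paths = None, []
    if not os.path.isfile(state_path):
        return latest, all_paths
    with open(state_path, encoding="utf-8") as handle:
        for line in handle:
            match = _FIELD_RE.match(line)
            if not match:
                continue  # e.g. timestamp fields, which are not quoted
            key, value = match.groups()
            if key == "model_checkpoint_path":
                latest = value
            elif key == "all_model_checkpoint_paths":
                all_paths.append(value)
    return latest, all_paths


def evaluation_can_start(checkpoint_dir):
    """True when the state file names a checkpoint whose .index file exists."""
    latest, _ = read_checkpoint_state(checkpoint_dir)
    if latest is None:
        return False
    # Relative entries are resolved against the checkpoint directory.
    prefix = latest if os.path.isabs(latest) else os.path.join(checkpoint_dir, latest)
    return os.path.isfile(prefix + ".index")
```

If evaluation_can_start("models/faster_rcnn_inception_resnet_v2") is False, the directory probably holds no finished checkpoint yet, which would match the "Waiting for new checkpoint" message.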

Estimator's model_fn includes params argument, but params are not passed to Estimator

I'm trying to run Object Detection API locally.
I believe I have everything set up as described in the TensorFlow Object Detection API documents. However, when I try to run model_main.py, this warning shows and the model doesn't train. (I can't really tell whether the model is training or not, because the process isn't terminated, but no further logs appear.)
WARNING:tensorflow:Estimator's model_fn (.model_fn at 0x0000024BDBB3D158>) includes
params argument, but params are not passed to Estimator.
The code I'm passing in is:
python tensorflow-models/research/object_detection/model_main.py \
--model_dir=training \
--pipeline_config_path=ssd_mobilenet_v1_coco.config \
--checkpoint_dir=ssd_mobilenet_v1_coco_2017_11_17/model.ckpt \
--num_train_steps=2000 \
--num_eval_steps=200 \
--alsologtostderr
What could be causing this warning?
Why would the code seem stuck?
Please help!
I ran into the same problem and found that this warning has nothing to do with the model not training. I can get the model to train even while this warning is shown.
My mistake was that I misunderstood this line in running_locally.md:
"${MODEL_DIR} points to the directory in which training checkpoints and events will be written to"
I changed MODEL_DIR to {project directory}/models/model, where the directory structure is:
+data
  -label_map file
  -train TFRecord file
  -eval TFRecord file
+models
  + model
    -pipeline config file
    +train
    +eval
And it worked. Hoping this can help you.
Edit: while this may work, in this case model_dir does not contain any saved checkpoint files. If you stop the training after some checkpoint files are saved and then restart, the training would still be skipped. The doc specifies a recommended directory structure, but you do not have to follow it exactly, since the paths to the TFRecord files and pretrained checkpoints can all be set in the config file.
The actual reason is that when model_dir contains checkpoint files that have already reached NUM_TRAIN_STEPS, the script assumes the training is finished and exits. Removing the checkpoint files and restarting the training will work.
In my case, I had the same error because the folder holding my .ckpt files also contained the checkpoint file of the pre-trained model.
After I removed that file, which came inside the .tar.gz archive, the training worked.
I also received this error. It was because I had previously trained a model with a different dataset/model/config file, and the previous ckpt files still existed in the directory I was working with. Moving the old ckpt training data to a different directory fixed the issue.
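The "leftover checkpoints" explanation above can be checked before launching training. This is a rough sketch, assuming checkpoints in model_dir are named "model.ckpt-<global_step>" with a matching ".index" file; the helper names are hypothetical and the comparison only approximates what the training script decides internally.

```python
import os
import re

# Captures the global step from names such as "model.ckpt-2000.index".
_STEP_RE = re.compile(r"-(\d+)\.index$")


def highest_saved_step(model_dir):
    """Return the largest step found among *.index files, or None."""
    steps = []
    for name in os.listdir(model_dir):
        match = _STEP_RE.search(name)
        if match:
            steps.append(int(match.group(1)))
    return max(steps) if steps else None


def training_would_be_skipped(model_dir, num_train_steps):
    """True when an existing checkpoint already reached the target step."""
    last_step = highest_saved_step(model_dir)
    return last_step is not None and last_step >= num_train_steps
```

With the command from the question, training_would_be_skipped("training", 2000) returning True would point to old checkpoints in the "training" directory that need to be moved or removed.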
Your script seems good.
One thing to be aware of is that the new model_main.py does not print training logs (training step, lr, loss and so on). It only prints the evaluation result after one or more epochs, which can take a long time.
So "the process isn't terminated, but no further logs appear" is normal. You can confirm it's running by using "nvidia-smi" to check the GPU usage, or by checking TensorBoard.
I also encountered this warning message. I checked nvidia-smi and it seemed training hadn't started. I also tried reorganizing the output directory, and that didn't work. After reading Configuring the Object Detection Training Pipeline (official TensorFlow documentation), I found it was a configuration problem. I solved it by adding load_all_detection_checkpoint_vars: true.

Checkpoint file not found, restoring evaluation graph

I have a model which runs in a distributed mode for 4000 steps. After every 120s the accuracies are calculated (as is done in the provided examples). However, at times the last checkpoint file is not found.
Error:
Couldn't match files for checkpoint gs://path-on-gcs/train/model.ckpt-1485
The checkpoint file is present at the location. A local run for 2000 steps runs perfectly.
last_checkpoint = tf.train.latest_checkpoint(train_dir(FLAGS.output_path))
I assume that the checkpoint is still being saved and the files are not actually written yet. I tried introducing a wait before the accuracies are calculated, as shown here. Although this seemed to work at first, the model still failed with a similar issue.
saver.save(session, sv.save_path, global_step)
time.sleep(2) #wait for gcs to be updated
From your comment I think I understand what is going on. I may be wrong.
The cloud_ml distributed sample
https://github.com/GoogleCloudPlatform/cloudml-samples/blob/master/mnist/hptuning/trainer/task.py#L426
uses a temporary directory by default. As a consequence, it works locally in /tmp. Once the training is complete, it copies the result to gs:// but does not correct the checkpoint file, which still contains references to local model files in /tmp. Basically, this is a bug.
To avoid this, you should launch the training process with --write_to_tmp 0, or modify the task.py file directly to disable this option. TensorFlow will then work directly on gs://, and the resulting checkpoint will therefore be consistent. At least it worked for me.
One way of checking whether my assumptions are correct is to copy the resulting checkpoint file from gs:// to your local filesystem using gsutil and then print its contents.
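Once the checkpoint file has been copied locally, its contents can be scanned for entries that still point at the local filesystem. The following is a minimal sketch, assuming the file is the usual text listing of model_checkpoint_path and all_model_checkpoint_paths entries; the function name and the bucket path in the usage note are made up for illustration.

```python
import re

# Picks up the quoted value of model_checkpoint_path and
# all_model_checkpoint_paths entries.
_PATH_RE = re.compile(
    r'^\s*(?:all_)?model_checkpoint_paths?\s*:\s*"([^"]*)"', re.MULTILINE)


def stale_checkpoint_paths(state_text, expected_root):
    """Return absolute paths in the state file that lie outside expected_root."""
    stale = []
    for path in _PATH_RE.findall(state_text):
        # Relative entries are resolved next to the state file, so skip them.
        is_absolute = path.startswith("/") or "://" in path
        if is_absolute and not path.startswith(expected_root):
            stale.append(path)
    return stale
```

Any "/tmp/..." entry returned by stale_checkpoint_paths(text, "gs://my-bucket/train") would support the explanation above (here "gs://my-bucket/train" is a placeholder for the real output path).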

How to use model.ckpt and inception v-3 to predict images?

I'm now facing a problem with Inception v3 and checkpoint data.
I have been working on updating Inception v3's checkpoint data with my own images by following the git page below, and I succeeded in creating new checkpoint data.
https://github.com/tensorflow/models/tree/master/inception
At first I thought that, with just a small change to the code, I could use that checkpoint data to recognise new images, as in the URL below.
https://www.tensorflow.org/versions/master/tutorials/image_recognition/index.html
I thought that "classify.py" or something similar would read the new checkpoint data, and that just running "python classify.py -image something.png" would recognise the image. But it doesn't...
I really need help.
thanks.
To get an input .pb file, also call tf.train.write_graph(sess.graph.as_graph_def(), 'path_to_folder', 'input_graph.pb', False) during training.
If you have downloaded the Inception v3 source code, add the line above to inception_train.py, under
saver.save(sess, checkpoint_path, global_step=step) (where you save the checkpoints).
Hope this helps!
To use your checkpoints and model in something like the label_image example, you'll need to run the tensorflow/python/tools/freeze_graph script to convert your variables into constants stored inside the GraphDef. That's how we created the graph file used in that sample code, for example.