I am using Ubuntu 16.04 with a GeForce 1080 GPU (8 GB of GPU memory).
I created the TF-record files properly and trained the model successfully. However, I still have two problems.
These are the steps I followed; please tell me what I am missing:-
I used VOCdevkit and I properly created two files which are:- pascal_train.record and pascal_val.record
Then,
1- From this link, I took the raccoon images and placed them in models/object_detection/VOCdevkit/VOC2012/JPEGImages (after deleting the previous images).
Then I took the raccoon annotations and placed them in models/object_detection/VOCdevkit/VOC2012/Annotations (after deleting the previous ones).
2- I modified models/object_detection/data/pascal_label_map.pbtxt so that it contains a single class name, 'raccoon'.
3- I used ssd_mobilenet_v1_pets.config and modified it so that the number of classes is one. I did not train from scratch; I used ssd_mobilenet_v1_coco_11_06_2017/model.ckpt:
fine_tune_checkpoint: "/home/jesse/abdu-py2/models/model/ssd_mobilenet_v1_coco_11_06_2017/model.ckpt"
from_detection_checkpoint: true
4- Following this link, I arranged my directory structure like this:-
models
1.1 model
1.1.1 ssd_mobilenet_v1_pets.config
1.1.2 train
1.1.3 evaluation
1.1.4 ssd_mobilenet_v1_coco_11_06_2017/model.ckpt
1.2 object_detection
1.2.1 data that contains (pascal_train.record, pascal_val.record, and pascal_label_map.pbtxt)
1.2.2 VOCdevkit
1.2.2.1 VOC2012
1.2.2.1.1 JPEGImages (my own images)
1.2.2.1.2 Annotations (raccoon annotation)
1.2.2.1.3 ImageSets
1.2.2.1.3.1 Main (raccoon_train.txt,raccoon_val.txt,raccoon_train_val.txt)
5- Now, I will train my model
(abdu-py2) jesse#jesse-System-Product-Name:~/abdu-py2/models$ python object_detection/train.py --logtostderr --pipeline_config_path=/home/jesse/abdu-py2/models/model/ssd_mobilenet_v1_pets.config --train_dir=/home/jesse/abdu-py2/models/model/train
Everything looks fine. After many thousands of training steps, training had created many files, such as checkpoint and events.out.tfevents.1503337171 (and others).
However, my two problems are:-
1- Based on this link, I cannot run the evaluation script eval.py at the same time as train.py (for memory reasons).
2- I tried to use the events.out.tfevents.1503337171 file created during training, but it seems that it was not created correctly.
So I don't know where my mistake is. I think my directory structure is not correct; I arranged it based on my own understanding.
Thanks in advance
Edit:-
Regarding Q2/
I figured out how to convert the events files and model.ckpt files (created by the training process) to inference_graph_.pb. The inference_graph_.pb can then be tested with object_detection_tutorial.ipynb. In my case I tried it, but nothing was detected, so I must have made a mistake somewhere in the train.py process.
The following command converts the trained files to a .pb file:
(abdu-py2) jesse#jesse-System-Product-Name:~/abdu-py2/models$ python object_detection/export_inference_graph.py \
--input_type image_tensor \
--pipeline_config_path /home/jesse/abdu-py2/models/model/ssd_mobilenet_v1_pets.config \
--trained_checkpoint_prefix /home/jesse/abdu-py2/models/model/train/model.ckpt-27688 \
--output_directory /home/jesse/abdu-py2/models/model
Question 1 - This is a problem you will run into because of your hardware. Once you reach a point where you'd like to evaluate the model, just stop training and run your eval command (it seems you have already evaluated your model successfully, so you know the command). It will give you some metrics for the most recent model checkpoint. You can repeat this process until you are comfortable with the model's performance.
Question 2 - The events files are used as input to TensorBoard. They are in a binary format and therefore not human-readable. Start TensorBoard while your model is training and/or evaluating. To do so, run something like this:
tensorboard --logdir=train:/home/grasp001/abdu-py2/models/object_detection/train1/train,eval:/home/grasp001/abdu-py2/models/object_detection/train1/eval
Once you have Tensorboard running, use your web browser to navigate to localhost:6006 to check out your metrics. You can use this during training as well to monitor loss and other metrics for each step of training.
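If you want a quick sanity check that an events file is not empty or corrupt before opening TensorBoard, here is a minimal sketch. It assumes the events file uses the standard TFRecord framing (8-byte length, 4-byte CRC, payload, 4-byte CRC) and it only counts records; it does not verify the CRCs or decode the summaries. The function name count_event_records is my own, not part of TensorFlow.

```python
import struct


def count_event_records(path):
    """Count TFRecord-framed records in an events.out.tfevents.* file.

    Assumed framing per record: uint64 length, uint32 CRC of the length,
    `length` bytes of payload, uint32 CRC of the payload.
    The CRCs are skipped here, not verified.
    """
    count = 0
    with open(path, "rb") as f:
        while True:
            header = f.read(12)  # 8-byte length + 4-byte length CRC
            if len(header) < 12:
                break  # end of file, or a truncated trailing header
            (length,) = struct.unpack("<Q", header[:8])
            body = f.read(length + 4)  # payload + 4-byte payload CRC
            if len(body) < length + 4:
                break  # last record is still being written
            count += 1
    return count
```

If the count keeps growing while train.py runs, the file is being written normally and the problem is more likely in how it is being viewed.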
In trainer.py, around line 370, right after session_config is created,
limit the fraction of GPU memory the process may use:
session_config.gpu_options.per_process_gpu_memory_fraction = 0.5
Then you can run eval.py at the same time. By default TensorFlow claims all the free GPU memory, whether or not it needs it.
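As a rough sketch of the same idea outside trainer.py: the helper below splits GPU memory between several processes and builds a session config with that cap. The names gpu_fraction_per_process and make_session_config, and the 10% reserve, are my own assumptions. tf.compat.v1.ConfigProto is the spelling in newer TensorFlow releases; in the 2017-era code from the question it would be tf.ConfigProto.

```python
def gpu_fraction_per_process(num_processes, reserve=0.1):
    """Split GPU memory evenly, leaving `reserve` of it unallocated."""
    if num_processes < 1:
        raise ValueError("num_processes must be at least 1")
    if not 0.0 <= reserve < 1.0:
        raise ValueError("reserve must be in [0, 1)")
    return (1.0 - reserve) / num_processes


def make_session_config(fraction):
    """Build a session config capped at `fraction` of GPU memory."""
    # Imported lazily so the helper above works without TensorFlow installed.
    import tensorflow as tf

    config = tf.compat.v1.ConfigProto()
    config.gpu_options.per_process_gpu_memory_fraction = fraction
    return config
```

For train.py plus eval.py on one card, gpu_fraction_per_process(2) gives each process a bit under half of the memory, which is close to the 0.5 suggested above.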
I was trying to follow this page https://www.tensorflow.org/tutorials/sequences/audio_recognition
I successfully executed the following command:
python tensorflow/examples/speech_commands/train.py
I used a virtual environment in Anaconda, with TensorFlow 1.14 and Python 3.6.
It took about 22 hours to train. It printed "/tmp/speech_commands_train/conv.ckpt-100" after every 100 iterations
(there were 18000 in total),
but now when I try to find conv.ckpt-18000.meta, or just speech_commands_train, I cannot find it.
I am very new to this. This is my first effort in deep learning.
[Screenshot: how the terminal looked when training ended]
Firstly, what do you mean by "where it saved"? Do you mean the logs, the trained model, or the weights?
In your case, you are only storing the weights at given checkpoints, so you can access them at the paths mentioned in the tutorial:
I0730 16:54:41.813438 55030 train.py:252] Saving to "/tmp/speech_commands_train/conv.ckpt-100"
*This is saving out the current trained weights to a checkpoint file. If your training script gets interrupted, you can look for the last saved checkpoint and then restart the script with -*
You can also store logs using a file writer, and the model using save_model or the TensorBoard callback with logdir.
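To find out which checkpoint was written last, you can read the small text file named checkpoint that the saver keeps next to the weights. A minimal sketch, assuming that file contains a line of the form model_checkpoint_path: "conv.ckpt-18000" (the function name is hypothetical; tf.train.latest_checkpoint does the same job if TensorFlow is importable):

```python
import os
import re


def latest_checkpoint_prefix(train_dir):
    """Return the newest checkpoint prefix recorded in `train_dir`, or None."""
    state_file = os.path.join(train_dir, "checkpoint")
    if not os.path.isfile(state_file):
        return None  # directory missing or no checkpoint written yet
    with open(state_file) as f:
        for line in f:
            match = re.match(r'\s*model_checkpoint_path:\s*"(.*)"', line)
            if match:
                prefix = match.group(1)
                # Relative entries are relative to the checkpoint directory.
                if not os.path.isabs(prefix):
                    prefix = os.path.join(train_dir, prefix)
                return prefix
    return None
```

Calling latest_checkpoint_prefix("/tmp/speech_commands_train") returns None when the directory or its checkpoint file no longer exists, which is a quick way to tell whether the training output is still there.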
I've been trying out the TensorFlow 2 alpha, and I have been trying to freeze and export a model to a .pb GraphDef file.
In Tensorflow 1 I could do something like this:
# Freeze the graph.
frozen_graph_def = tf.graph_util.convert_variables_to_constants(
    sess,
    sess.graph_def,
    output_node_names)
# Save the frozen graph to .pb file.
with open('model.pb', 'wb') as f:
    f.write(frozen_graph_def.SerializeToString())
However this doesn't seem possible anymore as convert_variables_to_constants is removed and use of sessions is discouraged.
I looked and found there is the freeze graph util
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/tools/freeze_graph.py that works with SavedModel exports.
Is there still some way to do it within Python, or am I meant to switch and use this tool now?
I also faced this problem while migrating from TensorFlow 1.x to TensorFlow 2.0 beta.
It can be solved in two ways:
The first is to go to the TensorFlow 2.0 docs, search for the methods you have used, and change the syntax line by line.
The second is to use Google's tf_upgrade_v2 script:
tf_upgrade_v2 --infile your_tf1_script_file --outfile converted_tf2_file
Try the above command to convert your TensorFlow 1.x script to TensorFlow 2.0; it should take care of the renamed symbols.
Alternatively, you can rename the method manually by referring to the documentation:
Rename 'tf.graph_util.convert_variables_to_constants' to 'tf.compat.v1.graph_util.convert_variables_to_constants'
The main problem is that much of the syntax and many functions changed in TensorFlow 2.0, so refer to the TensorFlow 2.0 docs or use Google's tf_upgrade_v2 script.
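To illustrate the manual rename, here is a small sketch that applies it to a script's source text. This is not how tf_upgrade_v2 works internally; it is only a regex-based stand-in for that one symbol, and rename_freeze_call is a name I made up.

```python
import re

OLD = "tf.graph_util.convert_variables_to_constants"
NEW = "tf.compat.v1.graph_util.convert_variables_to_constants"

# Match OLD only when it is not part of a longer dotted name or identifier.
_PATTERN = re.compile(r"(?<![\w.])" + re.escape(OLD) + r"(?!\w)")


def rename_freeze_call(source):
    """Return `source` with the TF1 symbol replaced by its compat.v1 name."""
    return _PATTERN.sub(NEW, source)
```

Running it twice is harmless, since the already-renamed form no longer contains the old dotted name. Note that the renamed function still expects a session, so it only helps for code that keeps using TF1-style graphs.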
Not sure if you've seen this Tensorflow 2.0 issue, but this response seems to be a work-around:
https://github.com/tensorflow/tensorflow/issues/29253#issuecomment-530782763
Note: this hasn't worked for my nlp model but maybe it will work for you. The suggested work-around is to use model.save_weights('weights.h5') while in TF 2.0 environment. Then create new environment with TF 1.14 and do all following steps in TF 1.14 env. Build your model model = create_model() and use model.load_weights('weights.h5') to load weights back into your model. Then save entire model with model.save('final_model.h5'). If you manage to have success with the above steps, then follow the rest of the steps in the link to use freeze_graph.
I'm trying to run Object Detection API locally.
I believe I have everything set up as described in the TensorFlow Object Detection API documentation. However, when I try to run model_main.py, this warning shows and the model doesn't train. (I can't really tell whether the model is training, because the process isn't terminated, but no further logs appear.)
WARNING:tensorflow:Estimator's model_fn (.model_fn at 0x0000024BDBB3D158>) includes
params argument, but params are not passed to Estimator.
The code I'm passing in is:
python tensorflow-models/research/object_detection/model_main.py \
--model_dir=training \
--pipeline_config_path=ssd_mobilenet_v1_coco.config \
--checkpoint_dir=ssd_mobilenet_v1_coco_2017_11_17/model.ckpt \
--num_train_steps=2000 \
--num_eval_steps=200 \
--alsologtostderr
What could be causing this warning?
Why would the code seem stuck?
Please help!
I ran into the same problem and found that this warning has nothing to do with the model not training. I can train the model even with this warning showing.
My mistake was that I misunderstood this line in running_locally.md:
"${MODEL_DIR} points to the directory in which training checkpoints and events will be written to"
I changed MODEL_DIR to {project directory}/models/model, where the directory structure is:
+data
  -label_map file
  -train TFRecord file
  -eval TFRecord file
+models
  + model
    -pipeline config file
    +train
    +eval
And it worked. I hope this helps.
Edit: while this may work, in this case model_dir does not contain any saved checkpoint files. If you stop the training after some checkpoint files have been saved and then restart, the training would still be skipped. The doc specifies the recommended directory structure, but you do not have to follow it, since all paths to the TFRecords and pretrained checkpoints can be configured in the config file.
The actual reason is that when model_dir contains checkpoint files that have already reached NUM_TRAIN_STEPS, the script assumes training is finished and exits. Removing the checkpoint files and restarting the training will work.
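A quick way to check for this before launching model_main.py is to look at the step numbers of the checkpoints already in model_dir. A minimal sketch, assuming checkpoints are named model.ckpt-<step>.index (both function names are hypothetical, and the exact comparison the Estimator performs is my assumption based on the explanation above):

```python
import os
import re

_STEP_RE = re.compile(r"^model\.ckpt-(\d+)\.index$")


def highest_saved_step(model_dir):
    """Return the largest step among model.ckpt-<step>.index files, or None."""
    steps = []
    for name in os.listdir(model_dir):
        match = _STEP_RE.match(name)
        if match:
            steps.append(int(match.group(1)))
    return max(steps) if steps else None


def training_would_be_skipped(model_dir, num_train_steps):
    """True if an existing checkpoint already reached `num_train_steps`."""
    step = highest_saved_step(model_dir)
    return step is not None and step >= num_train_steps
```

If training_would_be_skipped("training", 2000) is True, either point model_dir at an empty directory or move the old checkpoint files elsewhere.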
In my case, I had the same error because the folder containing my .ckpt files also contained the checkpoint file of the pre-trained model.
After I removed that file, which came from the .tar.gz archive, training worked.
I also received this error. It was because I had previously trained a model with a different dataset/model/config file, and the previous ckpt files still existed in the directory I was working with. Moving the old ckpt training data to a different directory fixed the issue.
Your script looks fine.
One thing to note is that the new model_main.py does not print training logs (training step, learning rate, loss, and so on). It only prints the evaluation result after one or more epochs, which can take a long time.
So "the process isn't terminated, but no further logs appear" is normal. You can confirm that it is running by using "nvidia-smi" to check GPU activity, or by using TensorBoard.
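If you'd rather check GPU activity from a script than watch nvidia-smi by hand, something like the sketch below can work. It assumes nvidia-smi is on the PATH and that both queried fields are reported as plain integers (on some cards a field may come back as a non-numeric string, which this sketch does not handle). The function names are my own.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_status(csv_text):
    """Parse lines like '37, 5120' into (utilization %, memory MiB) tuples."""
    status = []
    for line in csv_text.strip().splitlines():
        util, mem = (field.strip() for field in line.split(","))
        status.append((int(util), int(mem)))
    return status


def gpu_status():
    """Run nvidia-smi and return one (utilization, memory) tuple per GPU."""
    output = subprocess.check_output(QUERY, universal_newlines=True)
    return parse_gpu_status(output)
```

Non-zero utilization and several GB of used memory while model_main.py is running suggest that training has actually started.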
I also encountered this warning message. I checked nvidia-smi and it seemed training hadn't started. I also tried reorganizing the output directory, which didn't help. After reading Configuring the Object Detection Training Pipeline (official TensorFlow documentation), I found it was a configuration problem. I solved it by adding load_all_detection_checkpoint_vars: true.
I am training an Inception model from scratch on the flowers dataset, using the scripts provided by TensorFlow models. The output of the training is these files:
checkpoint
events.out.tfevents.xxxxxx
model.ckpt-xxxx.data-00000-of-00001
model.ckpt-xxxx.index
model.ckpt-xxxx.meta
model.ckpt-xxxx.data-00000-of-00001
model.ckpt-xxxx.index
model.ckpt-xxxx.meta
These were some of the files I got. Does someone have a script to convert these files into something I can use to classify my images? How can I use it to test my own image?
It is a three step process.
Step 1: You presumably already have the tensorflow models directory, since you trained from it. Run the following command, adjusting the paths to your models directory:
python models/research/slim/export_inference_graph.py --model_name=<MODEL_NAME> --output_file=<NAME_OF_PB_FILE_CREATED> --dataset_dir=<PATH_TO_TF_RECORDS_DIRECTORY>
Eg:
python models/research/slim/export_inference_graph.py --model_name=inception_v3 --output_file=/home/user1/inception_v3_inf_graph.pb --dataset_dir=/home/user1/tfRecords
Step 2: Clone the TensorFlow GitHub repository (git clone https://github.com/tensorflow/tensorflow.git).
Run the following command relative to the cloned tensorflow directory:
python tensorflow/tensorflow/python/tools/freeze_graph.py --input_graph=<PATH_TO_PB_FILE_CREATED_IN_STAGE1> --input_checkpoint=<PATH_TO_CKPT_FILES_GENERATED_DURING_TRAINING> --input_binary=true --output_graph=<PATH_TO_SAVE_OUTPUT_FROZEN_GRAPH> --output_node_names=<OUTPUT_NODE_NAMES_OF_MODEL>
Eg:
python tensorflow/tensorflow/python/tools/freeze_graph.py --input_graph=/home/user1/inception_v3_inf_graph.pb --input_checkpoint=/home/user1/model.ckpt-50000 --input_binary=true --output_graph=/home/user1/frozen_inception_v3.pb --output_node_names=InceptionV3/Predictions/Reshape_1
Please note the number 50000 in the example. This indicates the number of iterations. If you have trained your model for 10 iterations, then it would be 10. Also, even though there are 3 types of files for each checkpoint (meta, data & index), we just mention the first part. The rest will be parsed by the script automatically.
Step 3: Run the following command relative to the cloned tensorflow directory:
python tensorflow/tensorflow/examples/label_image/label_image.py --image=<PATH_TO_TEST_IMAGE_FILE> --input_layer=input --output_layer=<MODEL_OUTPUT_LAYER_NAME> --graph=<PATH_TO_FROZEN_GRAPH_CREATED_IN_STAGE2> --labels=<PATH_TO_LABELS_FILE> --input_mean=<MEAN> --input_std=<STD_DEVIATION>
Eg:
python tensorflow/tensorflow/examples/label_image/label_image.py --image=/home/user1/test_img.jpg --input_layer=input --output_layer=InceptionV3/Predictions/Reshape_1 --graph=/home/user1/frozen_inception_v3.pb --labels=/home/user1/labels.txt --input_mean=0 --input_std=255
The last command prints the prediction result for test_img.jpg.
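Regarding the note in Step 2 about passing only the first part of the checkpoint name: if you have one of the three file names at hand, the prefix can be derived as in this small sketch. It assumes the usual suffixes (.index, .meta, .data-XXXXX-of-XXXXX); checkpoint_prefix is a name I chose for illustration.

```python
import re

_SUFFIX_RE = re.compile(r"\.(index|meta|data-\d{5}-of-\d{5})$")


def checkpoint_prefix(filename):
    """Strip the .index / .meta / .data-XXXXX-of-XXXXX suffix, if present."""
    return _SUFFIX_RE.sub("", filename)
```

For example, /home/user1/model.ckpt-50000.meta maps to /home/user1/model.ckpt-50000, which is the value to pass as --input_checkpoint.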
In the TensorFlow tutorial to train a network on CIFAR-10, where and how do they save the weights/parameters between running training and evaluation? I cannot see any files saved to my project directory.
Here are the links to the tutorial and the code:
https://www.tensorflow.org/versions/r0.11/tutorials/deep_cnn/index.html
https://github.com/tensorflow/tensorflow/tree/master/tensorflow/models/image/cifar10
It saves the logs and checkpoints to the /tmp/ folder by default.
The weights are included in the checkpoint files.
As you can see in both eval and train files, it does take a checkpoint dir as parameter.
cifar10_train.py:
tf.app.flags.DEFINE_string('train_dir', '/tmp/cifar10_train',
                           """Directory where to write event logs """
                           """and checkpoint.""")
cifar10_eval.py:
tf.app.flags.DEFINE_string('eval_dir', '/tmp/cifar10_eval',
                           """Directory where to write event logs.""")
tf.app.flags.DEFINE_string('eval_data', 'test',
                           """Either 'test' or 'train_eval'.""")
tf.app.flags.DEFINE_string('checkpoint_dir', '/tmp/cifar10_train',
                           """Directory where to read model checkpoints.""")
You can call those scripts with custom values for these flags. For my project using Inception I had to change them, since the main hard drive does not have enough space for the bottlenecks created by Inception.
It is good practice to set those values explicitly, since the /tmp/ folder is not persistent and you might lose your training data.
The following command saves the training data to a custom folder:
python cifar10_train.py --train_dir="/home/username/train_folder"
and then, to evaluate:
python cifar10_eval.py --checkpoint_dir="/home/username/train_folder"
It also applies to the other examples.
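If you launch both scripts from Python, one way to make sure training and evaluation always point at the same folder is to build both commands from a single path. A minimal sketch; cifar10_commands is a hypothetical helper, and the script names and flags are the ones shown above.

```python
import os


def cifar10_commands(train_dir):
    """Return (train_cmd, eval_cmd) argument lists sharing one checkpoint dir."""
    train_dir = os.path.abspath(os.path.expanduser(train_dir))
    train_cmd = ["python", "cifar10_train.py", "--train_dir=" + train_dir]
    eval_cmd = ["python", "cifar10_eval.py", "--checkpoint_dir=" + train_dir]
    return train_cmd, eval_cmd
```

Each list can be passed to subprocess.run as is, for example subprocess.run(cifar10_commands("~/train_folder")[0]).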
Assuming you're running cifar10_train, saving happens on this line:
https://github.com/tensorflow/tensorflow/blob/r0.11/tensorflow/models/image/cifar10/cifar10_train.py#L122
And the default location is defined in this line (it's "/tmp/cifar10_train"):
https://github.com/tensorflow/tensorflow/blob/r0.11/tensorflow/models/image/cifar10/cifar10_train.py#L51
In cifar10_eval, restoring the weights happens on this line:
https://github.com/tensorflow/tensorflow/blob/r0.11/tensorflow/models/image/cifar10/cifar10_eval.py#L75