saturated contrast and low brightness in tensorboard when training TF2 object_detection API - tensorflow

I'm trying to fine-tune an EfficientDet model. Here is a recap of what I've done:
download coco dataset 2014
convert to tfrecord with a script from tensorflow
download EfficientDet D0 from the official model zoo
edit pipeline.config (batch_size: 1, sync_replicas: false, replicas_to_aggregate: 1, fine_tune_checkpoint_type: "detection", use_bfloat16: false) and adjust the paths.
clone github.com/tensorflow/models.git, docker-compose run object_detection.
inside the container:
python models/research/object_detection/model_main_tf2.py \
--pipeline_config_path=efficientdet_d0_coco17_tpu-32/pipeline.config \
--model_dir=foo/model/ \
--alsologtostderr
My problem is that, as seen in TensorBoard (i.e. after data preprocessing), the contrast is maxed out (or sometimes not maxed out, but still far too high), and the brightness is often too low.
I checked the content of the tfrecords with https://github.com/sulc/tfrecord-viewer, the colors are fine.
I tried on another machine with a different nvidia GPU model, same problem.
Any idea where the problem could come from? Thanks!

This seems to be a visualization issue, not a training issue. It can be solved by changing the normalization of the displayed images from (-1, 1) to (0, 1).
Follow the changes in this pull request:
https://github.com/tensorflow/models/pull/9019
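For illustration, the rescaling behind that fix can be sketched as below. This is a minimal, framework-free sketch and not the code from the pull request; the function name to_display_range is hypothetical, and it assumes the preprocessed pixel values lie in [-1, 1]. If such data is displayed as though it were already in [0, 1], every negative value is clipped to black, which would be consistent with the dark, high-contrast images described in the question.

```python
def to_display_range(pixels):
    """Map pixel values from [-1, 1] to [0, 1] for display.

    Assumption: the inputs were normalized to [-1, 1] during preprocessing.
    Values that fall outside that range are clipped after rescaling.
    """
    return [min(1.0, max(0.0, (p + 1.0) / 2.0)) for p in pixels]


# Black, mid-gray and white in the [-1, 1] convention.
print(to_display_range([-1.0, 0.0, 1.0]))
```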

Related

Different results in TfLite model vs model before quantization

I have taken an object detection model from the TF2 model zoo.
I took MobileNet and trained it on my own TFRecords.
I am using MobileNet because it often appears in examples of conversion to TFLite, which is what I need because I run it on an RPi3.
I am following ideas from the official example in the SageMaker docs and the GitHub repository you can find here.
Interestingly, the accuracy after step 2) training and 3) deploying is pretty good! My trucks are detected nicely with the custom-trained model.
However, when the model is converted to TFLite, the accuracy goes down, no matter whether I use the tflite_convert tool or the Python tf.lite.TFLiteConverter.
What is more, all detections are on the borders of the images, usually in the bottom-right corner. Maybe I am not preparing the images correctly? Or am I misunderstanding the results?
You can check the images I uploaded.
https://ibb.co/fSzfZvz
https://ibb.co/0GF101s
What could possibly go wrong?
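I was lacking proper preprocessing of the image.
I used the pipeline config to build the detection model, which has a preprocess function, and used that function to build the tensor before feeding it into the Interpreter.
from object_detection.builders import model_builder
from object_detection.utils import config_util

# pipeline_config is the path to the model's pipeline.config file
num_classes = 2
configs = config_util.get_configs_from_pipeline_file(pipeline_config)
model_config = configs['model']
model_config.ssd.num_classes = num_classes
model_config.ssd.freeze_batchnorm = True
# Build the model object so that its preprocess function can be reused
detection_model = model_builder.build(
    model_config=model_config, is_training=True)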
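To make the effect of that preprocessing step concrete, here is a minimal sketch of the pixel scaling only. It assumes the SSD MobileNet preprocess function maps 8-bit pixel values from [0, 255] to [-1, 1]; the real function also resizes the image, which is not shown. The name scale_pixels is hypothetical and this is not the Object Detection API's own code.

```python
def scale_pixels(pixels):
    """Scale 8-bit pixel values from [0, 255] to [-1.0, 1.0].

    Assumption: this mirrors the normalization the model saw in training.
    Feeding raw [0, 255] values to the interpreter instead would put the
    inputs far outside the range the model expects.
    """
    return [p / 127.5 - 1.0 for p in pixels]


print(scale_pixels([0, 127.5, 255]))
```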

MobileNet: High Accuracy On Validation and Poor Prediction Results

I am training MobileNet_v1_1.0_224 using TensorFlow. I am using the python scripts present in the TensorFlow-Slim image classification model library for training. My dataset distribution with 4 classes is as follows:
normal_faces: 42070
oncall_faces: 13563 (People faces with mobile in the image when they're on call)
smoking_faces: 5949
yawning_faces: 1630
All images in the dataset are square images and larger than 224x224
I am using train_image_classifier.py to train the model with following arguments,
python train_image_classifier.py \
--train_dir=${TRAIN_DIR} \
--dataset_name=custom \
--dataset_split_name=train \
--dataset_dir=${DATASET_DIR} \
--model_name=mobilenet_v1 \
--batch_size=32 \
--max_number_of_steps=25000
After training the model, eval_image_classifier.py shows an accuracy greater than 95% on Validation set but when I exported the frozen graph and used it for predictions, it performs very poorly.
I have also tried this notebook but this also produced similar results.
Log: Training Log
Plots: Loss and Accuracy
What is the reason for this? How do I fix this issue?
I have seen similar issues on SO but nothing related to MobileNets specifically.
Did you use a validation set? If so, what was the validation accuracy?
If you used a validation set, a good way to check whether you are doing predictions properly is to run model.predict on the validation set.
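One way to act on that suggestion is to score the exported model's own predictions against the validation labels and compare the result with the accuracy reported by the evaluation script. The helper below is a minimal sketch; the function name and the sample labels are made up for illustration. A large gap between the two numbers would point at the prediction path (for example, different image preprocessing) rather than at the training itself.

```python
def accuracy(predicted_labels, true_labels):
    """Return the fraction of predictions that match the true labels."""
    if len(predicted_labels) != len(true_labels):
        raise ValueError("predictions and labels must have the same length")
    if not true_labels:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == t for p, t in zip(predicted_labels, true_labels))
    return correct / len(true_labels)


# Hypothetical class indices for four validation images.
print(accuracy([0, 1, 2, 3], [0, 1, 2, 0]))
```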

Using ssd_inception_v2 to train on different resolution

The dataset contains images of different sizes.
The pretrained weights are trained on 300x300 resolution.
I am training on widerface dataset where objects are as small as 15x15.
Q1. I want to train with 800x800 resolution. Do I need to resize all the images manually, or will TensorFlow do this automatically?
I am using the following command to train:
python3 /opt/github/models/research/object_detection/legacy/train.py --logtostderr --train_dir=/opt/github/object_detection_retraining/wider_face_checkpoint/ --pipeline_config_path=/opt/github/object_detection_retraining/models/ssd_inception_v2_coco_2018_01_28/pipeline.config
Q2. I also tried training with model_main.py, but after 1000 iterations it evaluates the dataset on every iteration.
I am using the following command to train:
python3 /opt/github/models/research/object_detection/model_main.py --num_train_steps=200000 --logtostderr --model_dir=/opt/github/object_detection_retraining/wider_face_checkpoint/ --pipeline_config_path=/opt/github/object_detection_retraining/models/ssd_inception_v2_coco_2018_01_28/pipeline.config
Q3. Also, if you can suggest any model I should use for real-time face detection apart from MobileNet and Inception, please do.
Thanks.
Q1. No, you do not need to resize manually. See this detailed answer.
Q2. By 1000 iterations you mean steps, right? (A step processes one batch; a complete pass over the dataset is usually called an epoch.) Usually the model performs an evaluation after a certain amount of time, e.g. 10 minutes. So every 10 minutes a checkpoint is saved and the model is evaluated on the evaluation set.
Q3. SSD with MobileNet is one of the faster detectors; apart from that, you can try YOLO models for real-time detection.
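The distinction between steps and full passes over the dataset in Q2 comes down to simple arithmetic, sketched below. The dataset size and batch size are made-up numbers for illustration; they are not taken from the question.

```python
import math


def steps_per_epoch(num_examples, batch_size):
    """Number of training steps needed to see every example once."""
    return math.ceil(num_examples / batch_size)


def epochs_completed(num_steps, num_examples, batch_size):
    """How many passes over the dataset a given number of steps covers."""
    return num_steps / steps_per_epoch(num_examples, batch_size)


# Hypothetical numbers: 12,000 training images, batch size 24.
print(steps_per_epoch(12000, 24))
print(epochs_completed(1000, 12000, 24))
```

With these made-up numbers, 1000 steps cover two full passes over the data.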

How to get rid of additional ops added in the graph while fine-tuning Tensorflow Inception_V3 model?

I am trying to convert a fine-tuned tensorflow inception_v3 model to uff format which can be run on NVIDIA's Jetson TX2. For conversion to uff, certain ops are supported, some are not. I am able to successfully freeze and convert to uff inception_v3 model with imagenet checkpoint provided by tensorflow. However if I fine-tune the model, additional ops like Floor, RandomUniform, etc are added in the new graph which are not yet supported. These layers remain even after freezing the model. This is happening in the fine-tuning for flowers sample provided on tensorflow site as well.
I want to understand why additional ops are added in the graph, while fine-tuning is just supposed to modify the final layer to match number of outputs required.
If they are added while training, how can I get rid of them? What post-processing steps tensorflow team followed before releasing inception_v3 model for imagenet?
I can share the pbtxt files if needed. For now, model layers details are uploaded at https://github.com/shrutim90/TF_to_UFF_Issue. I am using Tensorflow 1.6 with GPU.
I am following the steps to freeze or fine-tune the model from: https://github.com/tensorflow/models/tree/master/research/slim#Pretrained. As described in the above link, to reproduce the issue, install TF-Slim image models library and follow these steps:
1. python export_inference_graph.py \
--alsologtostderr \
--model_name=inception_v3 \
--output_file=/tmp/inception_v3_inf_graph.pb
2. python freeze_graph.py \
--input_graph=/tmp/inception_v3_inf_graph.pb \
--input_checkpoint=/tmp/checkpoints/inception_v3.ckpt \
--input_binary=true --output_graph=/tmp/frozen_inception_v3.pb \
--output_node_names=InceptionV3/Predictions/Reshape_1
3. DATASET_DIR=/tmp/flowers
TRAIN_DIR=/tmp/flowers-models/inception_v3
CHECKPOINT_PATH=/tmp/my_checkpoints/inception_v3.ckpt
python train_image_classifier.py --train_dir=$TRAIN_DIR --dataset_dir=$DATASET_DIR --dataset_name=flowers --dataset_split_name=train --model_name=inception_v3 --checkpoint_path=${CHECKPOINT_PATH} --checkpoint_exclude_scopes=InceptionV3/Logits,InceptionV3/AuxLogits --trainable_scopes=InceptionV3/Logits,InceptionV3/AuxLogits
4. python freeze_graph.py \
--input_graph=/tmp/graph.pbtxt \
--input_checkpoint=/tmp/checkpoints/model.ckpt-2539 \
--input_binary=false --output_graph=/tmp/frozen_inception_v3_flowers.pb \
--output_node_names=InceptionV3/Predictions/Reshape_1
To check the layers, you can check out .pbtxt file or use NVIDIA's convert-to-uff utility.
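Run the steps in this order: training script -> export_inference_graph -> freeze_graph. That is, freeze the graph written by export_inference_graph.py rather than the graph.pbtxt written during training. The training graph contains training-only ops (dropout, for example, is a likely source of RandomUniform and Floor), while the exported inference graph does not. This gets rid of all the extra nodes, and the model can then be easily converted to UFF.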
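The order of the two post-training commands can be sketched as follows. This only assembles and prints the command lines; it does not run them. The flags are the ones used in the question, except --dataset_name=flowers, which is an assumption added so that the exported graph has the number of classes of the fine-tuned model. The inference-graph path is hypothetical.

```python
def build_commands(checkpoint, inference_graph, frozen_graph):
    """Return the export and freeze commands, in the order to run them."""
    export_cmd = [
        "python", "export_inference_graph.py",
        "--alsologtostderr",
        "--model_name=inception_v3",
        "--dataset_name=flowers",  # assumption: matches the fine-tuned classes
        f"--output_file={inference_graph}",
    ]
    freeze_cmd = [
        "python", "freeze_graph.py",
        f"--input_graph={inference_graph}",  # exported graph, not graph.pbtxt
        f"--input_checkpoint={checkpoint}",
        "--input_binary=true",
        f"--output_graph={frozen_graph}",
        "--output_node_names=InceptionV3/Predictions/Reshape_1",
    ]
    return [export_cmd, freeze_cmd]


commands = build_commands(
    checkpoint="/tmp/checkpoints/model.ckpt-2539",
    inference_graph="/tmp/inception_v3_flowers_inf_graph.pb",  # hypothetical
    frozen_graph="/tmp/frozen_inception_v3_flowers.pb",
)
for cmd in commands:
    print(" ".join(cmd))
```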

Train Tensorflow with my own images successfully, but still have problems

I am using ubuntu 16.04, with GPU Geforce 1080, 8 GB GPU memory.
I have properly created the TFRecord files and trained the model successfully. However, I still have two problems.
I followed the steps below; please tell me what I am missing.
I used VOCdevkit and properly created two files: pascal_train.record and pascal_val.record
Then,
1- From this link, I used the raccoon images and placed them in the directory models/object_detection/VOCdevkit/VOC2012/JPEGImages (after deleting the previous images).
Then I took the raccoon annotations and placed them in the directory models/object_detection/VOCdevkit/VOC2012/Annotations (after deleting the previous ones).
2- I modified models/object_detection/data/pascal_label_map.pbtxt so that it contains a single class name, 'raccoon'.
3- I used ssd_mobilenet_v1_pets.config. I modified it so that the number of classes is one, and I did not train from scratch; I used ssd_mobilenet_v1_coco_11_06_2017/model.ckpt:
fine_tune_checkpoint: "/home/jesse/abdu-py2/models/model/ssd_mobilenet_v1_coco_11_06_2017/model.ckpt"
from_detection_checkpoint: true
4- Following this link, I arranged my directory structure like this:
models
1.1 model
1.1.1 ssd_mobilenet_v1_pets.config
1.1.2 train
1.1.3 evaluation
1.1.4 ssd_mobilenet_v1_coco_11_06_2017/model.ckpt
1.2 object_detection
1.2.1 data that contains (pascal_train.record, pascal_val.record, and pascal_label_map.pbtxt)
1.2.2 VOCdevkit
1.2.2.1 VOC2012
1.2.2.1.1 JPEGImages (my own images)
1.2.2.1.2 Annotations (raccoon annotation)
1.2.2.1.3 ImageSets
1.2.2.1.3.1 Main (raccoon_train.txt,raccoon_val.txt,raccoon_train_val.txt)
5- Now, I will train my model
(abdu-py2) jesse#jesse-System-Product-Name:~/abdu-py2/models$ python object_detection/train.py --logtostderr --pipeline_config_path=/home/jesse/abdu-py2/models/model/ssd_mobilenet_v1_pets.config --train_dir=/home/jesse/abdu-py2/models/model/train
Everything looks fine: training created many files, such as checkpoint and events.out.tfevents.1503337171 (and others), after many thousands of training steps.
However, my two problems are:
1- Based on this link, I cannot run the evaluation script eval.py at the same time as train.py (for memory reasons).
2- I tried to use the events.out.tfevents.1503337171 file created during training, but it seems it was not created correctly.
So I don't know where I went wrong. I think my directory structure is not correct; I arranged it based on my understanding.
Thanks in advance
Edit:-
Regarding Q2/
I figured out how to convert the events files and model.ckpt files (created by the training process) to inference_graph_.pb. The inference_graph_.pb can be tested later with object_detection_tutorial.ipynb. I tried it in my case, but I could not detect anything, since I made a mistake somewhere in the train.py process.
The following command converts the trained files to a .pb file:
(abdu-py2) jesse#jesse-System-Product-Name:~/abdu-py2/models$ python object_detection/export_inference_graph.py \
--input_type image_tensor \
--pipeline_config_path /home/jesse/abdu-py2/models/model/ssd_mobilenet_v1_pets.config \
--trained_checkpoint_prefix /home/jesse/abdu-py2/models/model/train/model.ckpt-27688 \
--output_directory /home/jesse/abdu-py2/models/model
Question 1 - This is just a problem that you'll encounter because of your hardware. Once you get to a point where you'd like to evaluate the model, just stop your training and run your eval command (it seems as though you've successfully evaluated your model, so you know the command). It will provide you with some metrics for the most recent model checkpoint. You can iterate through this process until you're comfortable with the performance of your model.
Question 2 - These event files are used as input into Tensorboard. The events files are in binary format, thus are not human readable. Start a Tensorboard application while your model is training and/or evaluating. To do so, run something like this:
tensorboard --logdir=train:/home/grasp001/abdu-py2/models/object_detection/train1/train,eval:/home/grasp001/abdu-py2/models/object_detection/train1/eval
Once you have Tensorboard running, use your web browser to navigate to localhost:6006 to check out your metrics. You can use this during training as well to monitor loss and other metrics for each step of training.
In trainer.py (around line 370), after session_config is created, limit the fraction of GPU memory the process may use:
session_config.gpu_options.per_process_gpu_memory_fraction = 0.5
Then you can run eval.py at the same time. By default, TensorFlow allocates all the free GPU memory whether or not it needs it.