Difference between `train.py` and `model_main.py` in TensorFlow Object Detection API

I usually just use train.py to train with the TensorFlow Object Detection API. However, I read at https://www.kaggle.com/c/rsna-pneumonia-detection-challenge/discussion/68581 that you can also use model_main.py to train your model and see real-time plots and images on TensorBoard.
How exactly do you use model_main.py with TensorBoard?
What is the difference between train.py and model_main.py?

On TensorBoard, model_main.py outputs graphs similar to those of train.py, but it also measures the model's performance on the evaluation dataset.
model_main.py is the newer entry point in the TensorFlow Object Detection API. It is used for both training and evaluating the model. With train.py we have to run a separate program for evaluation (eval.py), while model_main.py does both: training runs for a certain interval (for example, 5 minutes or every 2000 steps), then training pauses and an evaluation pass is run. After the evaluation has finished, training continues, and the same cycle repeats.
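As a rough sketch of how this looks in practice (the directory paths below are placeholders, not from the question), you launch model_main.py and point TensorBoard at the same model directory; the evaluation metrics then show up next to the training curves:
python object_detection/model_main.py \
    --pipeline_config_path=path/to/pipeline.config \
    --model_dir=path/to/model_dir \
    --num_train_steps=200000 \
    --alsologtostderr
# in a second terminal
tensorboard --logdir=path/to/model_dir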

The newer version of the TensorFlow Object Detection API offers model_main.py, which trains as well as evaluates the model with the various pre-conditions and preprocessing, whereas the older versions of the API use train.py for training and eval.py for evaluation.
Reference: https://github.com/EdjeElectronics/TensorFlow-Object-Detection-API-Tutorial-Train-Multiple-Objects-Windows-10

Related

Tensorflow object detection API only evaluates the latest checkpoint

I've trained an SSD MobileNet V2 320x320 model for around 4k steps, which produced quite a few checkpoints that are saved in my training folder. The issue I am experiencing now is that only the latest checkpoint gets evaluated, but I'd like to evaluate all of them at once.
Ideally I would like to see the results in TensorBoard as a graph of validation accuracy (mAP) across the different checkpoints - which it does already, but just for the one checkpoint.
I have tried to run my evaluation code to generate a graph for my mAP, but it shows my mAP as a single dot.
Each checkpoint is a snapshot of your model's state at some point during training. The mAP graph you see on TensorBoard is made up of exactly the kind of dots that are produced when you run the evaluation once on a checkpoint, because the checkpoints are not actually different models but your model at different times during training. So the graph of the last model is what you need.
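If you want one mAP point per checkpoint rather than a single dot, the usual approach is to keep a separate evaluation job running next to the training job, so each new checkpoint is evaluated as it is written (this does not retroactively evaluate checkpoints saved before the job started). A sketch with model_main_tf2.py, where the paths are placeholders:
# continuous evaluation: watches the training folder and evaluates each new checkpoint
python object_detection/model_main_tf2.py \
    --pipeline_config_path=path/to/pipeline.config \
    --model_dir=path/to/training_folder \
    --checkpoint_dir=path/to/training_folder \
    --alsologtostderr
tensorboard --logdir=path/to/training_folder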

TensorFlow 2 Object Detection API batch non-max suppression in a trained Faster-RCNN network on TPU does not seem to work. Is this a bug?

I followed the new TensorFlow 2 Object Detection API documentation to train a Faster RCNN detector using transfer learning on a Google Cloud Platform TPU. After the training completed, I downloaded the result to my workstation and exported the model using the TensorFlow 2 implementation ('object_detection/exporter_main_v2.py'). I followed the official instructions and set up the environment locally (running on macOS Catalina, TensorFlow 2.2, Python 3.6, etc.).
However, the non-max suppression (NMS) part of the inference pipeline does not seem to be working, as there are cases where bounding boxes of different classes overlap almost completely. I debugged the code to ensure that the Object Detection API implementation of NMS (the batch_multiclass_non_max_suppression method in object_detection/core/post_processing.py) is called in the inference pipeline for the Faster-RCNN model. It is called twice at inference time, as expected for the Faster-RCNN architecture.
The instructions I used for GCP's AI Platform TPU are the ones on the official Object Detection API page: link. I adjusted the training parameters to use a TPU runtime and Python version that are supported on GCP, since the ones in the actual example are not. Instead I used:
gcloud ai-platform jobs submit training `whoami`_object_detection_`date +%m_%d_%Y_%H_%M_%S` \
    --job-dir=gs://${MODEL_DIR} \
    --package-path ./object_detection \
    --module-name object_detection.model_main_tf2 \
    --runtime-version 2.2 \
    --python-version 3.7 \
    --scale-tier BASIC_TPU \
    --region us-central1 \
    -- \
    --use_tpu true \
    --model_dir=gs://${MODEL_DIR} \
    --pipeline_config_path=gs://${PIPELINE_CONFIG_PATH}
The dataset I used for training was the Pets example from the official Object Detection API page: link. However, I exported it using the TensorFlow 2 Object Detection API methods for consistency.
The pre-trained network I used was Faster R-CNN ResNet101 V1 1024x1024, trained on TPU.
The configuration file I used was faster_rcnn_resnet101_v1_1024x1024_coco17_tpu-8.config for TPU training.
I changed the number of classes to 37.
I also changed the number of batches to batch_size: 32, as GCP on TPU v2 was crashing.
The fine_tune_checkpoint_type was changed to fine_tune_checkpoint_type: "detection" and the only data augmentation I used was random_horizontal_flip.
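For reference, the changes described above correspond to roughly the following fragments of the pipeline config (field names as in the stock faster_rcnn_resnet101_v1_1024x1024_coco17_tpu-8.config; only the touched fields are shown):
model {
  faster_rcnn {
    num_classes: 37
  }
}
train_config {
  batch_size: 32
  fine_tune_checkpoint_type: "detection"
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
}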
The official Object Detection API 2 model zoo reports results for TPU-trained architectures other than SSD.
However, the official Object Detection API TPU compatibility guide mentions that currently only SSD is supported and that non-max suppression is not.
Why is NMS not working?
I think that's because batch_multiclass_non_max_suppression is a class-aware NMS (or at least that is what I understood). This means that NMS is applied per class: among all the boxes that belong to the same class and have IoUs greater than a threshold, only the box with the highest score is retained, so boxes of different classes can still overlap.
I think you want a class-agnostic NMS (use_class_agnostic_nms: True). Moreover, if you want one class per detection you should also set max_classes_per_detection: 1.
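In the pipeline config, these options go under the batch_non_max_suppression block of the Faster-RCNN second-stage post-processing. A sketch (the threshold and detection-count values here are placeholders; keep your own):
second_stage_post_processing {
  batch_non_max_suppression {
    score_threshold: 0.0
    iou_threshold: 0.6
    max_detections_per_class: 100
    max_total_detections: 100
    use_class_agnostic_nms: true
    max_classes_per_detection: 1
  }
  score_converter: SOFTMAX
}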

How to convert model trained on custom data-set for the Edge TPU board?

I have trained a model on my custom data-set using the TensorFlow Object Detection API. I run my "prediction" script and it works fine on the GPU. Now, I want to convert the model to TensorFlow Lite and run it on the Google Coral Edge TPU board to detect my custom objects. I have gone through the documentation that the Google Coral board website provides, but I found it very confusing.
How to convert and run it on the Google Coral Edge TPU Board?
Thanks
Without reading the documentation, it will be very hard to continue. I'm not sure what your "prediction script" means, but I'm assuming that the script loads a .pb TensorFlow model, loads some image data, and runs inference on it to produce prediction results. That means you have a .pb TensorFlow model at the "Frozen graph" stage of the following pipeline:
Image taken from coral.ai.
The next step would be to convert your .pb model to a "fully quantized .tflite model" using the post-training quantization technique. The documentation for doing that is given here. I also created a github gist containing an example of post-training quantization here. Once you have produced the .tflite model, you'll need to compile the model via the edgetpu_compiler. Although everything you need to know about the edgetpu compiler is in that link, for your purpose, compiling a model is as simple as:
$ edgetpu_compiler your_model_name.tflite
This will create a your_model_name_edgetpu.tflite model that is compatible with the EdgeTPU. If at this stage you are getting some type of error instead of an edgetpu-compatible model, that means your model did not meet the requirements posted in the model-requirements section.
Once you have produced a compiled model, you can then deploy it on an edgetpu device. Currently there are 2 main APIs that can be used to run inference with the model:
EdgeTPU API (Python API or C++ API)
tflite API (C++ API or Python API)
Ultimately, there are many demo examples for running inference with the model here.
The previous answer works for general classification models, but not for models trained with the TF Object Detection API.
You cannot do post-training quantization with the TF Lite converter on TF Object Detection API models.
In order to run object detection models on EdgeTPUs:
You must train the model in quantization-aware training mode, with this addition in the model config:
graph_rewriter {
  quantization {
    delay: 48000
    weight_bits: 8
    activation_bits: 8
  }
}
This might not work with all the models provided in the model zoo; try a quantized model first.
After training, export the frozen graph with: object_detection/export_tflite_ssd_graph.py
Run the tensorflow/lite/toco tool on the frozen graph to make it TFLite compatible
And finally run edgetpu_compiler on the .tflite file (a rough end-to-end sketch of these steps is shown below)
You can find a more in-depth guide here:
https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/running_on_mobile_tensorflowlite.md
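Putting the export, TOCO conversion, and Edge TPU compilation steps together, a sketch of the commands (the paths, checkpoint number, input size, and tensor names are placeholders based on the SSD mobile guide above; adjust them for your model):
# 1. Export a TFLite-compatible frozen graph from the quantization-aware checkpoint
python object_detection/export_tflite_ssd_graph.py \
    --pipeline_config_path=path/to/pipeline.config \
    --trained_checkpoint_prefix=path/to/model.ckpt-XXXX \
    --output_directory=path/to/tflite_export \
    --add_postprocessing_op=true
# 2. Convert the frozen graph to a fully quantized .tflite file
tflite_convert \
    --graph_def_file=path/to/tflite_export/tflite_graph.pb \
    --output_file=path/to/tflite_export/detect.tflite \
    --input_shapes=1,300,300,3 \
    --input_arrays=normalized_input_image_tensor \
    --output_arrays='TFLite_Detection_PostProcess','TFLite_Detection_PostProcess:1','TFLite_Detection_PostProcess:2','TFLite_Detection_PostProcess:3' \
    --inference_type=QUANTIZED_UINT8 \
    --mean_values=128 \
    --std_dev_values=128 \
    --change_concat_input_ranges=false \
    --allow_custom_ops
# 3. Compile for the Edge TPU
edgetpu_compiler path/to/tflite_export/detect.tflite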

Can we run training and validation on separate GPUs using tensorflow object detection API running on tensorflow 1.12?

I have two Nvidia Titan X cards on my machine and want to finetune a COCO-pretrained Inception V2 model on a single specific class. I have created the train/val tfrecords and changed the config to run the TensorFlow object detection training pipeline.
I am able to start the training, but it hangs (without any OOM) whenever it tries to evaluate a checkpoint. Currently it is using only GPU 0, with the other resources (RAM, CPU, IO, etc.) in the normal range, so I am guessing that the GPU is the bottleneck. I wanted to try splitting training and validation across separate GPUs and see if that works.
I tried to look for a place where I could do something like setting "CUDA_VISIBLE_DEVICES" differently for the two processes, but unfortunately the latest TensorFlow Object Detection API code (using TensorFlow 1.12) makes it very difficult to do so. I am also unable to verify my assumption that training and validation run in the same process, because my machine hangs. Could someone please suggest where to look to solve this?
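One way to pin the two workloads to different GPUs (a sketch, not from the original thread, and it uses the legacy split train/eval scripts rather than model_main.py, which runs both in one process) is to launch them as two separate processes, each with its own CUDA_VISIBLE_DEVICES; the paths are placeholders:
# terminal 1: training on GPU 0
CUDA_VISIBLE_DEVICES=0 python object_detection/legacy/train.py \
    --pipeline_config_path=path/to/pipeline.config \
    --train_dir=path/to/train_dir \
    --logtostderr
# terminal 2: evaluation on GPU 1
CUDA_VISIBLE_DEVICES=1 python object_detection/legacy/eval.py \
    --pipeline_config_path=path/to/pipeline.config \
    --checkpoint_dir=path/to/train_dir \
    --eval_dir=path/to/eval_dir \
    --logtostderr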

Evaluate a model created using Tensorflow Object Detection API

I trained a model using the TensorFlow Object Detection API for detecting swimming pools in satellite images. I used the 'faster_rcnn_inception_v2_coco_2018_01_28' model for training and generated a frozen inference graph (.pb). I want to evaluate the precision and recall of the model. Can someone tell me how I can do that, preferably without using pycocotools, as I was facing some issues with it. Any suggestions are welcome :)
From the Object Detection API you can run "eval.py" from "models/research/object_detection/legacy/".
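A typical invocation (the directory paths here are placeholders) points eval.py at the training directory that holds your checkpoints; note that it evaluates checkpoints, not the exported frozen graph, and it writes its metrics as TensorBoard summaries to the eval directory:
python object_detection/legacy/eval.py \
    --logtostderr \
    --pipeline_config_path=path/to/pipeline.config \
    --checkpoint_dir=path/to/train_dir \
    --eval_dir=path/to/eval_dir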
You have to define an evaluation metric in your config file (see the supported evaluation protocols).
For example:
eval_config: {metrics_set: "coco_detection_metrics"}
The Pascal VOC metrics, for example, then give you the mean Average Precision (mAP).