Specify GPU to run training on (TF object detection API, model zoo) - tensorflow

I am training object detection on a device with multiple GPUs and want to run training on GPU 1 (keeping 0 and 2 free), but I cannot see an option to do so when starting training. I have looked through train.py and model_main.py and cannot find a line to change there either. Any suggestions?

Use the CUDA_VISIBLE_DEVICES environment variable.
You can do it by prefixing your command line:
CUDA_VISIBLE_DEVICES=1 python <your-python-script>
This exposes only GPU 1 to TensorFlow.
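If you prefer to set it inside the script rather than on the command line, a minimal equivalent sketch (the variable must be set before TensorFlow is imported; the GPU index is just the one from the question):
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # expose only GPU 1 to TensorFlow

import tensorflow as tf  # import only after the variable is set
Note that inside TensorFlow the remaining visible GPU is renumbered, so it will show up as device 0.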

Related

TensorFlow 2 Object Detection API: Batch Non Max Suppression in trained Faster-RCNN network on TPU does not seem to work. Is this a bug?

I followed the new TensorFlow 2 Object Detection API documentation to train a Faster R-CNN detector using transfer learning on a Google Cloud Platform TPU. After the training completed, I downloaded the result to my workstation and exported the model using the TensorFlow 2 exporter (object_detection/exporter_main_v2.py). I followed the official instructions and set up the environment locally (macOS Catalina, TensorFlow 2.2, Python 3.6, etc.).
However, the Non-Max-Suppression (NMS) part of the inference pipeline does not seem to be working, as there are cases where bounding boxes of different classes overlap almost completely. I debugged the code to ensure that the Object Detection API implementation of NMS (the batch_multiclass_non_max_suppression method in object_detection/core/post_processing.py) is called in the inference pipeline for the Faster R-CNN model. It is called twice at inference time, as expected for the Faster R-CNN architecture.
The instructions I used for GCP's AI Platform TPU are the ones on the official Object Detection API page: link. I adjusted the training parameters to a TPU runtime and Python version that are actually supported on GCP, since the ones in the example are not. Instead I used:
gcloud ai-platform jobs submit training `whoami`_object_detection_`date +%m_%d_%Y_%H_%M_%S` \
    --job-dir=gs://${MODEL_DIR} \
    --package-path ./object_detection \
    --module-name object_detection.model_main_tf2 \
    --runtime-version 2.2 \
    --python-version 3.7 \
    --scale-tier BASIC_TPU \
    --region us-central1 \
    -- \
    --use_tpu true \
    --model_dir=gs://${MODEL_DIR} \
    --pipeline_config_path=gs://${PIPELINE_CONFIG_PATH}
The dataset I used for training was the Pets example from the official Object Detection API page: link. However, I exported it using the TensorFlow 2 Object Detection API tooling for consistency.
The pre-trained network I used was Faster R-CNN ResNet101 V1 1024x1024, trained on TPU.
The configuration file I used was faster_rcnn_resnet101_v1_1024x1024_coco17_tpu-8.config for TPU training.
I changed the number of classes to 37.
I also changed the batch size to batch_size: 32, as GCP on TPU v2 was crashing otherwise.
The fine_tune_checkpoint_type was changed to fine_tune_checkpoint_type: "detection", and the only data augmentation I used was random_horizontal_flip.
The official Object Detection API 2 model zoo reports results for TPU-trained architectures other than SSD.
However, the official Object Detection TPU compatibility guide mentions that currently only SSD is supported, and that non-max suppression is not.
Why is NMS not working?
I think that's because the batch_multiclass_non_max_suppression method performs class-aware NMS (or at least that is how I understood it). This means that, for each class, among all boxes of that class whose IoU exceeds a threshold, only the box with the highest score is retained; boxes belonging to different classes are never suppressed against each other, which is why they can still overlap almost completely.
I think you want class-agnostic NMS (use_class_agnostic_nms: True). Moreover, if you want a single class per detection you should also set max_classes_per_detection: 1.
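If you want to apply those two settings without hand-editing the config file, here is a rough sketch using the Object Detection API's config utilities (it assumes your installed version exposes these proto fields; the paths are placeholders):
from object_detection.utils import config_util

# Load the existing pipeline config (path is a placeholder).
configs = config_util.get_configs_from_pipeline_file("pipeline.config")

# The second-stage post-processing block is where the final NMS runs for Faster R-CNN.
nms = configs["model"].faster_rcnn.second_stage_post_processing.batch_non_max_suppression
nms.use_class_agnostic_nms = True      # suppress overlapping boxes across classes
nms.max_classes_per_detection = 1      # keep a single class label per detection

# Write the modified config back out (output directory is a placeholder).
pipeline_proto = config_util.create_pipeline_proto_from_configs(configs)
config_util.save_pipeline_config(pipeline_proto, "modified_config_dir")
After re-exporting the model with exporter_main_v2.py and the modified config, the exported graph should run the class-agnostic NMS.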

TensorFlow Keras Sequential API GPU usage

When using TensorFlow's Keras Sequential API, is there any way to force my model to be trained on a certain piece of hardware? My understanding is that if there is a GPU available (and I have tensorflow-gpu installed) my training will, by default, run on the GPU.
Do I have to switch to a different API to gain more control over where my model is deployed?
I am a Keras user and I work on Ubuntu. I specify a certain GPU as follows:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
where 0 is the index of the GPU. By default, TensorFlow uses the first GPU (index 0) if there are several on your machine. You can list your GPUs by typing the following command in your terminal:
nvidia-smi
or
watch -n 1 -d nvidia-smi
if you want the output to refresh every second. The GPU index is shown in the leftmost column of the nvidia-smi output.
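If you'd rather keep everything inside the script, TensorFlow also lets you pin the model to a specific device with a device scope; a minimal sketch (the GPU index and the toy model are only illustrative):
import tensorflow as tf

# Build the model under an explicit device scope so its weights live on GPU 0.
with tf.device("/GPU:0"):
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(10,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
So you do not have to switch away from the Sequential API: CUDA_VISIBLE_DEVICES controls which GPUs TensorFlow can see at all, while tf.device chooses among the visible ones.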

How to specify which GPU to use when running tensorflow?

We have a DGX-1 in the lab.
I see many tasks running on different GPUs.
For the MLPerf Docker application, I can use NV_GPU=x to assign which GPU to use.
However, I have a Python Keras/TensorFlow script, and when I used the same approach, the load does not go to the specified GPU.
You could use CUDA_VISIBLE_DEVICES to specify the GPU to be used by your model:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"  # expose GPUs 0 and 1 to the model; the value must be a string
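This has to happen before TensorFlow initializes its devices, so set it at the very top of the script, before TensorFlow is imported. A quick way to check what TensorFlow actually sees afterwards (TF 2.x API; with the mask above, the two GPUs are renumbered as 0 and 1 inside TensorFlow):
import tensorflow as tf
print(tf.config.list_physical_devices("GPU"))  # should list exactly the two visible GPUs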

Can we run training and validation on separate GPUs using tensorflow object detection API running on tensorflow 1.12?

I have two Nvidia Titan X cards on my machine and want to fine-tune a COCO-pretrained Inception V2 model on a single specific class. I have created the train/val tfrecords and changed the config to run the TensorFlow object detection training pipeline.
I am able to start the training, but it hangs (without any OOM) whenever it tries to evaluate a checkpoint. Currently it is using only GPU 0, with the other resources (RAM, CPU, I/O, etc.) in the normal range, so I am guessing that the GPU is the bottleneck. I wanted to try splitting training and validation across separate GPUs and see if it works.
I tried to look for a place where I could do something like setting "CUDA_VISIBLE_DEVICES" differently for the two processes, but unfortunately the latest TensorFlow Object Detection API code (using TensorFlow 1.12) makes it very difficult to do so. I am also unable to verify my assumption that training and validation run in the same process, as my machine hangs. Could someone please suggest where to look to solve this?
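One way to act on that idea is to launch training and evaluation as two separate processes, each pinned to its own GPU through CUDA_VISIBLE_DEVICES. A rough, untested sketch (the paths are placeholders, and it assumes model_main.py's eval-only mode via the --checkpoint_dir flag):
import os
import subprocess

PIPELINE_CONFIG = "path/to/pipeline.config"  # placeholder
MODEL_DIR = "path/to/model_dir"              # placeholder

def launch(gpu_id, extra_args):
    # Each child process only sees the single GPU assigned to it.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))
    cmd = ["python", "model_main.py",
           "--pipeline_config_path=" + PIPELINE_CONFIG,
           "--model_dir=" + MODEL_DIR] + extra_args
    return subprocess.Popen(cmd, env=env)

train_proc = launch(0, [])                                  # training on GPU 0
eval_proc = launch(1, ["--checkpoint_dir=" + MODEL_DIR])    # eval-only loop on GPU 1
train_proc.wait()
eval_proc.terminate()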

Difference between `train.py` and `model_main.py` in Tensorflow Object Detection API

I usually just use train.py to train with the TensorFlow Object Detection API. However, I read from https://www.kaggle.com/c/rsna-pneumonia-detection-challenge/discussion/68581 that you can also use model_main.py to train your model and see real-time plots and images on TensorBoard.
How exactly do you use model_main.py with TensorBoard?
What is the difference between train.py and model_main.py?
On TensorBoard, model_main.py outputs graphs similar to train.py's, but model_main.py also measures the model's performance on the evaluation dataset.
model_main.py is the newer entry point in the TensorFlow Object Detection API. It is used for both training and evaluating the model. With train.py we have to run a separate program for evaluation (eval.py), while model_main.py handles both: training runs for a certain time (for example 5 minutes, or every 2000 steps), then training is paused and evaluation is run; after the evaluation has finished, training continues, and the same cycle repeats.
The newer version of the TensorFlow Object Detection API offers model_main.py, which trains as well as evaluates the model using the various pre-conditions and preprocessing, whereas the older versions use train.py for training and eval.py for evaluating.
Reference : https://github.com/EdjeElectronics/TensorFlow-Object-Detection-API-Tutorial-Train-Multiple-Objects-Windows-10
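As for the TensorBoard part of the question: in practice (flag names as in the TF1 model_main.py; paths are placeholders), training is typically started with something like python model_main.py --pipeline_config_path=path/to/pipeline.config --model_dir=path/to/model_dir --alsologtostderr, and the live loss curves and evaluation metrics are then viewed by pointing TensorBoard at the same directory with tensorboard --logdir=path/to/model_dir, since model_main.py writes both training and evaluation summaries there.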