Faster R-CNN (frozen inference graph, Inception v2 based) execution time is the same for 360p and 1080p. How is this possible? - object-detection

I just implemented a Faster R-CNN (Inception v2 based frozen inference graph) object detection model on a Jetson TX2 with JetPack 4.2 and TensorFlow 1.14. The model was given an input frame at 1080p resolution and later at 360p. Surprisingly, there was no change in execution time. What could possibly be the reason for this?

Faster R-CNN consists of 3 main blocks:
Base feature network (generates the feature map from the input image/frame),
Region proposal network (RPN) (generates/proposes/selects interesting regions from anchors for final bounding-box generation), and
Detection network (classifies each proposed region as background or foreground and refines its bounding box).
Most of the complexity of Faster R-CNN lies in the RPN and the detection network, and the detection network operates on a fixed input shape (fixed-size region features). Therefore, the execution time of the model is not affected significantly by the frame resolution.
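For reference, here is a minimal timing sketch in the spirit of the experiment above. It assumes a standard TF Object Detection API frozen graph with the usual image_tensor / detection_boxes tensor names; the graph path is a placeholder.

    import time
    import numpy as np
    import tensorflow as tf  # TF 1.14, as shipped for JetPack 4.2

    # Load the frozen inference graph (path is a placeholder).
    graph_def = tf.compat.v1.GraphDef()
    with tf.io.gfile.GFile('frozen_inference_graph.pb', 'rb') as f:
        graph_def.ParseFromString(f.read())

    graph = tf.Graph()
    with graph.as_default():
        tf.import_graph_def(graph_def, name='')

    image_tensor = graph.get_tensor_by_name('image_tensor:0')
    detections = graph.get_tensor_by_name('detection_boxes:0')

    with tf.compat.v1.Session(graph=graph) as sess:
        for h, w in [(360, 640), (1080, 1920)]:
            frame = np.random.randint(0, 255, size=(1, h, w, 3), dtype=np.uint8)
            sess.run(detections, feed_dict={image_tensor: frame})  # warm-up run
            start = time.time()
            for _ in range(10):
                sess.run(detections, feed_dict={image_tensor: frame})
            print('%dx%d input: %.3f s/frame' % (w, h, (time.time() - start) / 10))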

Related

Small batch size using Object Detection API

I've developed a custom model using tf object detection api for human keypoint estimation.
The architecture is MobileNetV3 + FPN + CenterNet. In the model zoo I saw there is an example using MobileNetV2 as the feature extractor instead, and the pipeline.config there seems to use a batch size of 512. I'm training on an Nvidia A100 80 GB GPU, and it can only fit a batch size of 32. I've only tried power-of-2 batch sizes because that makes adapting the number of training steps easy.
This would suggest that I might need 16 such GPUs to train the model with the suggested batch size of 512. Are the resources needed to train such a model expected to be this high?
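One rough sanity check on that arithmetic, and a common heuristic when the reference batch size doesn't fit, is to scale the number of training steps up and the learning rate down by the same factor. A sketch only, with hypothetical reference values (only the batch size of 512 comes from the config discussed above):

    # Hypothetical reference values; only the batch size of 512 is from the
    # config discussed above, the step count and learning rate are made up.
    ref_batch, ref_steps, ref_lr = 512, 50_000, 1e-3
    my_batch = 32

    scale = ref_batch / my_batch        # 512 / 32 = 16, the "16 GPUs" factor
    my_steps = int(ref_steps * scale)   # ~16x more steps to see as many examples
    my_lr = ref_lr / scale              # linear learning-rate scaling heuristic

    print(scale, my_steps, my_lr)       # 16.0 800000 6.25e-05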

Is it possible to significantly reduce the inference time of images by reducing the number of object classes?

I am using YOLOv4 to train my custom detector. Source: https://github.com/AlexeyAB/darknet
One of the issues while training is the computing power of the GPU and the available video RAM. What is the relationship between the number of object classes and the time it takes to train the model? Also, is it possible to significantly reduce the inference time on images by reducing the number of object classes? The goal is to run inference on a Raspberry Pi or a Jetson Nano.
Any help is much appreciated. Thanks.
A change in the number of classes doesn't have a significant impact on inference time.
For example, in the case of YOLOv4, which has 3 YOLO layers, a change in the number of classes changes the filter count of the conv layers preceding the YOLO layers and slightly reduces the computation within the YOLO layers, that's all. This is very small compared to the overall inference time, because the conv layers preceding the YOLO layers are late layers with very small width and height, and the time spent on class-dependent logic within a YOLO layer is also very small.
Here:
filters = (classes + 5) * 3
Note that the tiny version of YOLOv4, i.e. tiny-yolov4, has only two YOLO layers instead of 3.
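A quick worked example of that formula shows how little actually changes:

    # filters = (classes + 5) * 3 for the conv layer feeding each YOLO layer
    for classes in (80, 5):
        print(classes, 'classes ->', (classes + 5) * 3, 'filters')
    # 80 classes -> 255 filters, 5 classes -> 30 filters: only these few late
    # conv layers shrink, a tiny fraction of the network's total computation.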
If your intent is to reduce inference time, especially on a Raspberry Pi or a Jetson Nano, without losing accuracy/mAP, do the following:
Quantisation: run inference with INT8 instead of FP32. You can use this repo for this purpose. This works for both the Jetson Nano and the Raspberry Pi (a post-training quantization sketch follows this list).
Use an inference library such as tkDNN, which is a deep neural network library built with cuDNN and TensorRT primitives, specifically designed to work on NVIDIA Jetson boards. You can use this for the Jetson Nano. Note that with TensorRT you can use INT8 and FP16 instead of FP32 to reduce detection time.
The following techniques can also reduce inference time, but they come at the cost of a significant drop in accuracy/mAP:
Train the tinier versions of the models rather than the full YOLO versions.
Model pruning: if you can rank the neurons in the network according to how much they contribute, you can remove the low-ranking neurons, resulting in a smaller and faster network. See the pruned YOLOv3 research paper and its implementation; this is another pruned YOLOv3 implementation.
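As mentioned in the quantisation point above, here is a minimal post-training quantization sketch using TensorFlow Lite. This is an alternative route rather than the linked repo; it assumes you already have a SavedModel export of your detector, and the paths are placeholders:

    import tensorflow as tf

    # Dynamic-range post-training quantization: weights are stored as int8, which
    # shrinks the model and typically speeds up CPU inference (e.g. Raspberry Pi).
    # Full integer quantization would additionally need a representative dataset.
    converter = tf.lite.TFLiteConverter.from_saved_model('exported_model/saved_model')
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    tflite_model = converter.convert()

    with open('detector_quant.tflite', 'wb') as f:
        f.write(tflite_model)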
I tried reducing the number of classes from 80 to 5 (I was aiming to detect vehicles only) on YOLOv3 and found a reduction in time. For example, on an Intel Core i5-6300U CPU @ 2.40 GHz the time was reduced by 50%, and on an Nvidia GeForce 930M it was reduced by 20%. Generally, the stronger the processor, the smaller the reduction in time you get.

Large input image limitations for VGG19 transfer learning

I'm using TensorFlow (via the Keras API) in Python 3. I'm using the VGG19 pre-trained network to perform style transfer on an Nvidia RTX 2070.
The largest input image that I have is 4500x4500 pixels (I have removed the fully-connected layers from VGG19 to allow for a fully-convolutional network that handles arbitrary image sizes). If it helps, my batch size is currently just 1 image at a time.
1.) Is there an option for parallelizing the evaluation of the model on the image input, given that I am not training the model but just passing data through the pre-trained model?
2.) Is there any increase in capacity for handling larger images when going from 1 GPU to 2 GPUs? Is there a way for the memory to be shared across the GPUs?
I'm unsure whether larger images make my GPU compute-bound or memory-bound. I'm speculating that it's a compute issue, which is what started my search for parallel CNN evaluation discussions. I've seen some papers on tiling methods that seem to allow for larger images.
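For what it's worth, a minimal sketch of the fully-convolutional setup described above (Keras VGG19 without its fully-connected head, which accepts arbitrary spatial sizes, memory permitting):

    import numpy as np
    import tensorflow as tf

    # VGG19 feature extractor with the fully-connected layers removed.
    vgg = tf.keras.applications.VGG19(weights='imagenet', include_top=False,
                                      input_shape=(None, None, 3))

    # Probe increasing input sizes; at some point the GPU runs out of memory,
    # which is one way to tell whether you hit a memory limit before a compute limit.
    for size in (512, 1024, 2048):
        x = np.random.rand(1, size, size, 3).astype('float32')
        print(size, vgg.predict(x).shape)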

Faster RCNN + inception v2 input size

What is the input size of the Faster R-CNN RPN?
I'm using the TensorFlow Object Detection API with Faster R-CNN, which uses a region proposal network (RPN), and Inception as the feature extractor (according to the config file). The API uses the online approach in the prediction phase and detects each input image singly. However, I'm now trying to feed images to the network in batches by using the TensorFlow Dataset API.
As you know, to make a batch out of the data, we first need to resize all of the images to the same size. I think the best way of resizing the images is to resize them exactly to the input size of Faster R-CNN to avoid duplicate resizing. Now my question is: what is the input size of the Faster R-CNN RPN?
Thanks in advance.
It depends on the input resolution specified in the pipeline config file, under image_resizer.
For example, for Faster R-CNN over InceptionV2 trained on the COCO dataset, see this config file.
The specified resolution is 600x1024 (a keep-aspect-ratio resizer with min dimension 600 and max dimension 1024).
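To make that concrete, here is a rough Python sketch of the keep-aspect-ratio resize rule (min dimension 600, max dimension 1024). Note that under this rule a 640x360 frame and a 1920x1080 frame both end up at the same internal size:

    def keep_aspect_ratio_size(height, width, min_dim=600, max_dim=1024):
        # Scale so the short side reaches min_dim, unless that would push the
        # long side past max_dim, in which case cap the long side at max_dim.
        scale = min_dim / min(height, width)
        if max(height, width) * scale > max_dim:
            scale = max_dim / max(height, width)
        return round(height * scale), round(width * scale)

    print(keep_aspect_ratio_size(360, 640))     # (576, 1024)
    print(keep_aspect_ratio_size(1080, 1920))   # (576, 1024)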
On a side note, fully convolutional architectures (such as R-FCN, SSD, YOLO) are not restricted to a single resolution, i.e. you can apply them to different input resolutions without modifying the architecture.
But this doesn't mean the model will be robust to resolution changes if you train it on a single resolution.

Tensorflow object detection: why is the location in the image affecting detection accuracy when using ssd mobilenet v1?

I'm training a model to detect meteors within a picture of the night sky, and I have a fairly small dataset of about 85 images, each annotated with a bounding box. I'm using the transfer learning technique, starting from the ssd_mobilenet_v1_coco_11_06_2017 checkpoint, with Tensorflow 1.4. I'm resizing images to 600x600 pixels during training. I'm using data augmentation in the pipeline configuration to randomly flip the images horizontally and vertically and rotate them 90 deg. After 5000 steps, the model converges to a loss of about 0.3 and will detect meteors, but it seems to matter where in the image the meteor is located. Do I have to train the model by giving examples of every possible location? I've attached a sample of a detection run where I tiled a meteor over the entire image and received various levels of detection (filtered to 50%). How can I improve this? [attached image: detected meteors in image example]
It could very well be your data, and I think you are making a prudent move by improving the heterogeneity of your dataset, BUT it could also be your choice of model.
It is worth noting that ssd_mobilenet_v1_coco has the lowest COCO mAP relative to the other models in the TensorFlow Object Detection API model zoo. You aren't trying to detect a COCO object, but the mAP numbers are a reasonable approximation of generic model accuracy.
At the highest level, the choice of model is largely a tradeoff between speed and accuracy. The model you chose, ssd_mobilenet_v1_coco, favors speed over accuracy. Consequently, I would recommend you try one of the Faster RCNN models (e.g., faster_rcnn_inception_v2_coco) before you spend a significant amount of time preprocessing images.