suggestion how to train pre trained models by using Yolov4? - object-detection

I am looking the optimal way to train pre-trained models for YOLOv4
I have my local environment Debian 10 OS,
GeForce RTX 2060 SUPER
GeForce GTX 750 Ti
I planning to train the models based on different size of images, and the trained models I am going to use as part of microservices developed on java, or python.
Should I use any third party services like Google colab?
and the second question what the framework better to use ? (pytorch, tensorflow etc)
Thank you for suggestion

You can use this repo, https://github.com/AlexeyAB/darknet. It has all instructions for custom training, transfer learning and also colab training, inference script.
You can use the provided convolutional layer weights to improve results faster and on small dataset. Different dimension images can be used for training.
Custom training instructions,
https://github.com/AlexeyAB/darknet#how-to-train-to-detect-your-custom-objects
Training and inference in colab,
https://colab.research.google.com/drive/1_GdoqCJWXsChrOiY8sZMr_zbr_fH-0Fg

Related

Optimize Tensorflow Object Detection Model V2 Centernet Model for Evaluation

I am using the tensorflow centernet_resnet50_v2_512x512_kpts_coco17_tpu-8 object detection model on a Nvidia Tesla P100 to extract bounding boxes and keypoints for detecting people in a video. Using the pre-trained from tensorflow.org, I am able to process about 16 frames per second. Is there any way I can imporve the evaluation speed for this model? Here are some ideas I have been looking into:
Pruning the model graph since I am only detecting 1 type of object (people)
Have not been successful in doing this. Changing the label_map when building the model does not seem to improve performance.
Hard coding the input size
Have not found a good way to do this.
Compiling the model to an optimized form using something like TensorRT
Initial attempts to convert to TensorRT did not have any performance improvements.
Batching predictions
It looks like the pre-trained model has the batch size hard coded to 1, and so far when I try to change this using the model_builder I see a drop in performance.
My GPU utilization is about ~75% so I don't know if there is much to gain here.
TensorRT should in most cases give a large increase in frames per second compared to Tensorflow.
centernet_resnet50_v2_512x512_kpts_coco17_tpu-8 can be found in the TensorFlow Model Zoo.
Nvidia has released a blog post describing how to optimize models from the TensorFlow Model Zoo using Deepstream and TensorRT:
https://developer.nvidia.com/blog/deploying-models-from-tensorflow-model-zoo-using-deepstream-and-triton-inference-server/
Now regarding your suggestions:
Pruning the model graph: Pruning the model graph can be done by converting your tensorflow model to a TF-TRT model.
Hardcoding the input size: Use the static mode in TF-TRT. This is the default mode and enabled by: is_dynamic_op=False
Compiling the model: My advise would be to convert you model to TF-TRT or first to ONNX and then to TensorRT.
Batching: Specifying the batch size is also covered in the NVIDIA blog post.
Lastly, for my model a big increase in performance came from using FP16 in my inference engine. (mixed precision) You could even try INT8 although then you first have to callibrate.

Is it possible to significantly reduce the inference time of images by reducing the number of object classes?

I am using YOLOv4 to train my custom detector. Source: https://github.com/AlexeyAB/darknet
One of the issues while training is the computing power of GPU and available video RAM. What is the relationship between number of object classes and the time it takes to train the model? Also, is it possible to significantly reduce the inference time of images by reducing the number of object classes? The goal is to run inference on a Raspberry Pi or a Jetson Nano.
Any help is much appreciated. Thanks.
Change is number of classes doesn't have significant impact on
inference time.
For example in case of Yolov4, which has got 3 Yolo layers, change in classes leads to change in filter size for conv layers preceding Yolo layers and some computation reduction within Yolo layers that's all. This is very minute compared to overall inference time as conv layers preceding Yolo layers are bottom layers with very small width and hight and also time spent on logic that depends upon number of classes within Yolo layer is very less.
Here:
filters=(classes + 5)x3
Note that tinier version of yolov4 i.e tiny-yolov4 have got two Yolo layers only, instead of 3.
If your intent is to reduce inference time, especially on raspberry pi or a jetson nano, without losing on accuracy/mAP, do following things:
Quantisation: Run inference with INT8 instead of FP32. You can use this repo for this purpose. You can do this for both Jetson nano and raspberry pi.
Use inference library such as tkDNN, which is a Deep Neural Network library built with cuDNN and tensorRT primitives, specifically thought to work on NVIDIA Jetson Boards. You can use this for Jetson nano. Note that with TensorRT, you can use INT8 and FP16 instead of FP32 to reduce detection time.
Following techniques can be used to reduce inference time, but they come at the cost of significant drop in accuracy/mAP:
You can train the models with tinier versions rather than full Yolo versions.
Model Pruning - If you could rank the neurons in the network according to how much they contribute, you could then remove the low ranking neurons from the network, resulting in a smaller and faster network. Pruned yolov3 research paper and it's implementation. This is another pruned Yolov3 implementation.
I tried reducing the number of classes from 80 to 5 classes, I was aiming to detect vehicles only, on YOLOv3 and found a reduction in time. For example, using Intel Core i5-6300U CPU # 2.40 GHz, the time was reduced by 50%, and for Nvidia GeForce 930M, it was reduced by 20%. Generally, the stronger the processor, the less reduction in time you get.

Large input image limitations for VGG19 transfer learning

I'm using the Tensorflow (using the Keras API) in Python 3.0. I'm using the VGG19 pre-trained network to perform style transfer on an Nvidia RTX 2070.
The largest input image that I have is 4500x4500 pixels (I have removed the fully-connected layers in the VGG19 to allow for a fully-convolutional network that handles arbitrary image sizes.) If it helps, my batch size is just 1 image at a time currently.
1.) Is there an option for parallelizing the evaluation of the model on the image input given that I am not training the model, but just passing data through the pre-trained model?
2.) Is there any increase in capacity for handling larger images in going from 1 GPU to 2 GPUs? Is there a way for the memory to be shared across the GPUs?
I'm unsure if larger images make my GPU compute-bound or memory-bound. I'm speculating that it's a compute issue, which is what started my search for parallel CNN evaluation discussions. I've seen some papers on tiling methods that seem to allow for larger images

Is it possible to train a H2O model with GPU and predict with a CPU?

For trainining speed, it would be nice to be able to train a H2O model with GPUs, take the model file, and then predict on a machine without GPUs.
It seems like that should be possible in theory, but with the H2O release 3.13.0.341, that doesn't seem to happen, except for XGBoost model.
When I run gpustat -cup I can see the GPUs kick in when I train H2O's XGBoost model. This doesn't happen with DL, DRF, GLM, or GBM.
I wouldn't be surprised if a difference in float point size (16, 32, 64) could cause some inconsistency, not to mention the vagaries due to multiprocessor modeling, but I think I could live with that.
(This is related to my question here, but now that I understand the environment better I can see that the GPUs aren't used all the time.)
How can I tell if H2O 3.11.0.266 is running with GPUs?
The new XGBoost integration in H2O is the only GPU-capable algorithm in H2O (proper) at this time. So you can train an XGBoost model on GPUs and score on CPUs, but that's not true for the other H2O algorithms.
There is also the H2O Deep Water project, which provides integration between H2O and three third-party deep learning backends (MXNet, Caffe and TensorFlow), all of which are GPU-capable. So you can train those models using a GPU and score on a CPU as well. You can download the H2O Deep Water jar file (or R package, or Python module) at the Deep Water link above, and you can find out more info in the Deep Water GitHub repo README.
Yes, you do the heavy job of training on a GPU, save weights and then, your CPU will only do the matrix multiplication for predictions.
In Keras you can train your model and save Neural Network weights:
model.save_weights('your_model_weights.h5')
model.load_weights('your_model_weights.h5')

why is multi GPU tensorflow retraining not working

I have been training my tensorflow retraining algorithm using a single GTX Titan and it works just fine, but when I try to use multiple gpus in the flower of retraining example it does not work and seems to only utilize one GPU when I run it in Nvidia SMI.
Why is this happening as it does work with multiple gpus when retraining at Inception model from scratch but not during retraining?
TensorFlow's flower retraining example does not work with multiple GPUs at all, even if you set --num_gpus > 1. It should support a single GPU as you noted.
The model needs to be modified to utilize multiple GPUs in parallel. Unfortunately, a single TensorFlow operation like the flower retraining example can't automatically be split over multiple GPUs at this time.