YOLOv3 deployment on Jetson TX2 - TensorFlow

I am facing a problem with deploying YOLO object detection on a Jetson TX2.
I use a pre-trained YOLOv3 (trained on the COCO dataset) to detect a limited set of objects (I mostly care about five classes, not all of them). The speed is too low for real-time detection, and the accuracy is not perfect (but acceptable) on my laptop. I am thinking of making it faster with multithreading or multiprocessing on my laptop; is that possible for YOLO?
But my main problem is that the algorithm does not run on the Raspberry Pi or the NVIDIA TX2.
Here are my questions:
In general, is it possible to run YOLOv3 on the TX2 without any modifications such as accelerators or model compression techniques?
I cannot run the model on the TX2. First I got an error related to the camera, so I decided to run the model on a video; this time I got a 'cannot allocate memory in static TLS block' error. What causes this error? The model is too big: it uses 16 GB of GPU memory on my laptop, while the Raspberry Pi and the TX2 have less than 8 GB. As far as I know there are a few solutions: using a smaller model, using TensorRT, or pruning. Do you know of any other way?
If I use Tiny YOLO I will get lower accuracy, which is not what I want. Is there any way to run an object detection model in real time with good accuracy and speed (FPS) on a Raspberry Pi or an NVIDIA TX2?
If I filter the COCO data down to just the objects I care about and then train the same model, I would get higher accuracy and speed but the model size would not change. Am I correct?
In general, what is the best model for real-time detection in terms of accuracy, and what is the best in terms of speed?
How about MobileNet? Is it better than the YOLO models in terms of both accuracy and speed?

1- Yes, it is possible. I have already run YOLOv3 on a Jetson Nano.
2- It depends on the model and the input resolution of the data. You can decrease the input resolution. Input images are transferred to GPU VRAM to be used by the model, so large input sizes allocate a lot of memory. As far as I remember, I ran the full YOLOv3 on a Jetson Nano (which is weaker than the TX2) two years ago. Also, you can use YOLOv3-tiny and TensorRT as you mention. There are many sources on the web, like this & this.
3- I suggest you have a look here. With this repo you can do transfer learning with your dataset, optimize the model with TensorRT, and run it on the Jetson.
4- The size does not depend on the dataset; it depends on the model architecture (because the file contains the weights). Speed probably does not change either. Accuracy depends on your dataset; it can be better or worse. If any class in COCO is similar to a class in your dataset, I suggest transfer learning.
5- You have to find the right model with a small size, enough accuracy, and acceptable speed. There is no single best model, only the best model for your case, which also depends on your dataset. You can compare the accuracy and FPS of some models here.
6- Most people use MobileNet as the feature extractor. Read this paper; you will see that YOLOv3 has better accuracy while SSD with a MobileNet backbone has better FPS. I suggest using the jetson-inference repo.
Using the jetson-inference repo, I get sufficient accuracy with an SSD model and reach 30 FPS. Also, I suggest using a MIPI-CSI camera on the Jetson; it is faster than USB cameras.
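For reference, a minimal detection loop with the jetson-inference Python bindings looks roughly like the sketch below (the model name, camera URI, and threshold are illustrative, and the exact module names depend on the repo version):
import jetson.inference
import jetson.utils

# TensorRT-optimized SSD-MobileNet-v2 detector (the repo downloads the model on first use)
net = jetson.inference.detectNet("ssd-mobilenet-v2", threshold=0.5)
camera = jetson.utils.videoSource("csi://0")       # MIPI-CSI camera; use "/dev/video0" for USB
display = jetson.utils.videoOutput("display://0")

while display.IsStreaming():
    img = camera.Capture()
    detections = net.Detect(img)                   # inference runs on the GPU via TensorRT
    display.Render(img)
    display.SetStatus("FPS: {:.0f}".format(net.GetNetworkFPS()))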

I fixed problems 1 and 2 simply by swapping the import order of OpenCV and TensorFlow inside the script. Now I can run YOLOv3 without any modification on the TX2 and get an average of 3 FPS.
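For anyone hitting the same 'cannot allocate memory in static TLS block' error, the fix was nothing more than the relative order of these two imports; which order works can depend on the build, so if one order fails, try the other:
# only the relative order of these two imports changed; swap them if you still
# get the "cannot allocate memory in static TLS block" error on the TX2
import cv2
import tensorflow as tf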

Related

TRT versus TF-TRT

I need to convert some models to be able to deploy them on Jetson devices.
I have tried TensorRT for YOLOv3 trained on the 80-class COCO dataset, but I wasn't able to get inference working, so I decided to try TF-TRT instead. It worked on my laptop: the FPS increased, but the size and the GPU memory usage didn't change. The model was 300 MB and it got a bit bigger; before and after TF-TRT the model still uses 16 GB of GPU memory.
Is this normal? I mean, is it okay, or is something wrong? I expected a smaller size, lower GPU memory usage, and higher FPS (by the way, the number of nodes is reduced).
The important thing is that the FPS fluctuates a lot after TF-TRT. I got around 3 FPS before TF-TRT, but afterwards I get 4, 6, 7, 8, 9 FPS; the FPS does not change smoothly. For example, for one frame I get 4 FPS and for the next I get 9 FPS, and I can see these jumps in the visualization over the video as well. Why does this happen, and how can I fix it?
I have read that TRT has better performance than TF-TRT. Is that true?
What is the exact difference between them? I am confused.
I have another model that I need to convert to TRT, but it is a PyTorch model (an HourGlass CNN). Do you know how I can do that? Is there any valid/working repo on GitHub or tutorial on YouTube that you can share?
Is TensorFlow to TRT easier, or PyTorch to TRT?
Thank you very much
I hope my experience matches your needs.
1 - Yes, it is usual with models that are not prepared to be heavily optimized. YOLO is a very large model, no matter whether you translate it to TRT. TRT makes it work, and better than TF-TRT, because with TRT the model is either 100% optimized or the conversion fails. With TF-TRT the optimization occurs only on the layers that can be optimized, and the others are left as they are.
2 - Yes, you can fix it! For the Jetson Nano you have DeepStream, an optimized framework that runs all inference on the GPU without using the CPU to move memory (using TRT inside). DeepStream ships an optimized YOLO demo; on a Jetson Nano I have achieved 12 FPS for YOLOv3, and you have the option of Tiny YOLO for better performance.
https://www.reddit.com/r/learnmachinelearning/comments/hy50dl/a_tutorial_on_implementing_yolo_v3_with/
3 - As I mentioned before, if you translate your model to TRT from ONNX or etlt using trtexec or DeepStream, the system will optimize 100% of the layers or it will fail in the process. With TF-TRT the system "does its best" but does not guarantee that all layers are optimized for the specific hardware. TF-TRT is a better solution for custom/rare models or if you need to run quick tests.
4/5 - In the past, if you had a PyTorch model you first needed to convert it to ONNX and then to TRT with trtexec. Recently, with TRT 8.0, you also have the possibility of using PyTorch-TRT (Torch-TensorRT), analogous to TF-TRT, so today it is about the same either way. But if FPS performance is your concern, I recommend going from TensorFlow/PyTorch to ONNX and then to TRT with trtexec or DeepStream.
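As a rough sketch of the ONNX route for a PyTorch model (the backbone, input shape, and file names below are placeholders rather than the actual HourGlass network):
import torch
import torchvision

# placeholder network; substitute your own HourGlass model here
model = torchvision.models.resnet18(pretrained=True).eval()
dummy_input = torch.randn(1, 3, 224, 224)    # match your model's expected input shape

# export to ONNX
torch.onnx.export(
    model, dummy_input, "model.onnx",
    input_names=["input"], output_names=["output"],
    opset_version=11)

# then build a TensorRT engine on the Jetson, for example:
#   trtexec --onnx=model.onnx --saveEngine=model.trt --fp16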

Is it possible to significantly reduce the inference time of images by reducing the number of object classes?

I am using YOLOv4 to train my custom detector. Source: https://github.com/AlexeyAB/darknet
One of the issues while training is the computing power of the GPU and the available video RAM. What is the relationship between the number of object classes and the time it takes to train the model? Also, is it possible to significantly reduce the inference time of images by reducing the number of object classes? The goal is to run inference on a Raspberry Pi or a Jetson Nano.
Any help is much appreciated. Thanks.
A change in the number of classes doesn't have a significant impact on inference time.
For example, in the case of YOLOv4, which has three YOLO layers, a change in the number of classes changes the filter count of the conv layers preceding the YOLO layers and slightly reduces the computation within the YOLO layers, and that's all. This is minute compared to the overall inference time, because the conv layers preceding the YOLO layers are bottom layers with very small width and height, and the time spent on the class-dependent logic inside a YOLO layer is also very small.
Here:
filters = (classes + 5) * 3
Note that the tinier version of YOLOv4, i.e. tiny-yolov4, has only two YOLO layers instead of three.
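As a quick worked example (assuming the standard three YOLO layers and three anchors per layer), training on 5 classes gives filters = (5 + 5) * 3 = 30, so in the .cfg each [convolutional] layer immediately before a [yolo] block gets filters=30 and each [yolo] block gets classes=5.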
If your intent is to reduce inference time, especially on a Raspberry Pi or a Jetson Nano, without losing accuracy/mAP, do the following:
Quantisation: run inference with INT8 instead of FP32. You can use this repo for that purpose, on both the Jetson Nano and the Raspberry Pi (see the sketch after this list).
Use an inference library such as tkDNN, a deep neural network library built with cuDNN and TensorRT primitives, designed specifically to work on NVIDIA Jetson boards. You can use this for the Jetson Nano. Note that with TensorRT you can use INT8 or FP16 instead of FP32 to reduce detection time.
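The repo linked above isn't reproduced here; as a generic illustration of INT8 post-training quantization, the TensorFlow Lite converter flow (suitable for a Raspberry Pi) looks roughly like this, with the SavedModel path, input size, and calibration data as placeholders:
import numpy as np
import tensorflow as tf

# calibration data for INT8: a small set of representative inputs (use real frames in practice)
def representative_dataset():
    for _ in range(100):
        yield [np.random.rand(1, 416, 416, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("yolo_saved_model")  # placeholder path
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8

with open("yolo_int8.tflite", "wb") as f:
    f.write(converter.convert())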
The following techniques can also reduce inference time, but they come at the cost of a significant drop in accuracy/mAP:
Train the tiny versions of the models rather than the full YOLO versions.
Model pruning - if you can rank the neurons in the network by how much they contribute, you can remove the low-ranking neurons, resulting in a smaller and faster network. See the pruned YOLOv3 research paper and its implementation; this is another pruned YOLOv3 implementation.
I tried reducing the number of classes from 80 to 5 on YOLOv3 (I was aiming to detect vehicles only) and found a reduction in time. For example, on an Intel Core i5-6300U CPU @ 2.40 GHz the time was reduced by 50%, and on an NVIDIA GeForce 930M it was reduced by 20%. Generally, the stronger the processor, the smaller the reduction you get.

How to estimate how much GPU memory is required for deep learning?

We are trying to train our model for object recognition using TensorFlow. Since there are a lot of images (100 GB), I assume our current GPU server (1x 2080 Ti) will not be enough. We may need to purchase a more powerful one, but I am not sure how to estimate how much GPU memory we need. Is there an approach to estimating the requirements? Thanks!
Your 2080 Ti will do just fine for your task. The GPU memory needed for DL tasks depends on many factors, such as the number of trainable parameters in the network, the size of the images you are feeding in, the batch size, the floating-point type (FP16 or FP32), the number of activations, and so on. I think you are confused about loading all of the images into GPU memory at once; we do not do that. Instead, we use mini-batches so the images and parameters fit into memory. Throw any kind of network at your 2080 Ti and adjust the batch size, and your training will run smoothly. You could stick with your 2080 Ti, or get one or two more to increase training speed. This blog post provides great insights into creating optimal DL environments.
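For a back-of-the-envelope estimate, you can add up the bytes needed for the parameters and for the activations of one mini-batch. This is only a sketch (real usage also includes framework overhead and cuDNN workspaces), and all numbers below are illustrative:
def estimate_training_memory_gb(num_params, activations_per_sample,
                                batch_size, bytes_per_value=4):
    # weights, gradients and optimizer state (e.g. Adam keeps two extra copies per weight)
    param_mem = num_params * bytes_per_value * 4
    # forward activations stored for backprop, for one mini-batch
    activation_mem = activations_per_sample * batch_size * bytes_per_value
    return (param_mem + activation_mem) / 1024 ** 3

# e.g. a 25M-parameter network, ~30M activation values per image, batch size 32
print(estimate_training_memory_gb(25e6, 30e6, 32))   # about 4 GB, well within a 2080 Ti's 11 GB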

Is there a standard way to optimize models to run well on different mobile devices?

I’m working on a few side projects that involve deploying ML models to the edge. One of them is a photo-editing app that includes CNNs for facial recognition, object detection, classification, and style transfer. The other is an NLP app that assists in the writing process by suggesting words and sentence completions.
Once I have a trained model that’s accurate, it ends up being really slow on one or more of the mobile devices I'm testing on (usually the lower-end Android ones). I’ve read that there are optimizations one can do to speed models up, but I don’t know how. Is there a standard, go-to tool for optimizing models for mobile/edge?
I will be talking specifically about TensorFlow Lite, which is a platform for running TensorFlow ops on Android and iOS. There are several optimisation techniques mentioned on their website, but I will discuss the ones that feel most important to me.
Constructing relevant models for platforms:
The first step in model optimization is its construction from scratch, meaning in TensorFlow. We need to create a model that can be exported to a memory-constrained device.
We definitely need to train different models for different machines. A model constructed to work on a high-end TPU will never run efficiently on a mobile processor.
Create a model with the minimum number of layers and ops.
Do this without compromising the model's accuracy.
For this, you will need expertise in ML and in which ops are best for preprocessing the data.
Also, extra preprocessing of the input data brings down the model complexity to a great extent.
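As a small sketch of what "minimum layers and ops" can look like in practice (the backbone choice, input size, width multiplier, and class count are all illustrative), a lightweight Keras model could start from a reduced-width MobileNetV2:
import tensorflow as tf

# small input and a 0.35 width multiplier keep the parameter and op counts low
base = tf.keras.applications.MobileNetV2(
    input_shape=(128, 128, 3), alpha=0.35, include_top=False, weights=None)

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(5, activation="softmax"),   # 5 illustrative classes
])
model.summary()   # check the parameter count before exporting to TFLite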
Model quantization:
We convert the high-precision floats to lower-precision representations. This affects the model's accuracy slightly but greatly reduces the model size, so it takes up less memory.
"Post-training quantization is a general technique to reduce model size while also providing up to 3x lower latency with little degradation in model accuracy. Post-training quantization quantizes weights from floating point to 8 bits of precision" - from the TF docs.
Here is a TensorFlow Lite TFLiteConverter example:
import tensorflow as tf
# saved_model_dir is the path to your exported SavedModel
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
# quantize weights during conversion (in newer TF versions use tf.lite.Optimize.DEFAULT)
converter.optimizations = [tf.lite.Optimize.OPTIMIZE_FOR_SIZE]
tflite_quant_model = converter.convert()
Also, you should try the post_training_quantize flag, which reduces the model size considerably.
Hope it helps.

Why is the output video too slow in darknet?

I trained YOLOv2 on my own dataset in darknet. I am using Ubuntu 18.04 and have no GPU. When I run it on a test video (which I took with my smartphone), it is too slow. Is it because I don't have a GPU, or is it for some other reason?
Can someone help me?
Without a GPU, YOLOv2 is going to be very slow, and if you have a modern smartphone it's likely the video is high resolution with a high frame rate. I'm not sure of your implementation, but it's likely you're processing every frame in the video instead of skipping every other frame or only processing every 10th frame.
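A minimal frame-skipping loop with OpenCV looks roughly like this; the detect() call, the video path, and the skip interval are placeholders for whatever inference setup you use:
import cv2

cap = cv2.VideoCapture("test_video.mp4")    # placeholder path
frame_idx = 0
detections = []

while True:
    ok, frame = cap.read()
    if not ok:
        break
    if frame_idx % 10 == 0:                 # run the detector on every 10th frame only
        detections = detect(frame)          # placeholder for your YOLO inference call
    # reuse the last detections for the skipped frames when drawing or displaying
    frame_idx += 1

cap.release()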
If you don't have a GPU available (and aren't going to get one), another way to get GPU-like performance is Intel's OpenVINO, if you have a recent Core i-series processor. You'd be able to convert your YOLOv2 model to OpenVINO and run it on a CPU with really fast inference times (likely under 100 ms per frame). I will say, though, that I ran YOLOv3 through OpenVINO and it was slow compared to other object detectors, and especially compared to a MobileNet.
I also have some demos set up to compare YOLOv3 on a CPU and OpenVINO on a CPU; you can check those out on SugarKubes.
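The OpenVINO flow looks roughly like the sketch below, assuming the darknet weights have already been converted to a format the Model Optimizer accepts (for example a frozen TensorFlow graph); the file names are placeholders, and the Python API shown is the older Inference Engine one, which differs from the newer openvino.runtime API:
# after converting with the Model Optimizer, for example:
#   mo.py --input_model frozen_yolov2.pb --data_type FP16
import numpy as np
from openvino.inference_engine import IECore

ie = IECore()
net = ie.read_network(model="frozen_yolov2.xml", weights="frozen_yolov2.bin")
exec_net = ie.load_network(network=net, device_name="CPU")

input_name = next(iter(net.input_info))
image = np.zeros((1, 3, 416, 416), dtype=np.float32)   # replace with a preprocessed frame
result = exec_net.infer({input_name: image})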
One big reason is of course that you don't have a GPU. The other reason is the model you use: YOLOv2 is faster than YOLOv3 but still slow compared to Tiny YOLO or Tiny YOLOv3.
So this is the trade-off between accuracy and speed: the faster your model, the lower the accuracy. If you are going for speed, there are three solutions I can think of:
Use a GPU (I know it's expensive but worth the price; an NVIDIA GTX 1060 or better would be great).
Change your model to Tiny YOLO or Tiny YOLOv3. I recommend Tiny YOLOv3 for higher FPS:
Tiny YOLOv3: 220 FPS
Tiny YOLO: 207 FPS
YOLOv2: 67 FPS
Use OpenVINO as Andrew Pierno said.
Download models from here: https://pjreddie.com/darknet/yolo/
YOLOv2's link: https://pjreddie.com/darknet/yolov2/