I'm running a Mask R-CNN model on an edge device (with an NVIDIA GTX 1080). I am currently using the Detectron2 Mask R-CNN implementation and I achieve an inference speed of around 5 FPS.
To speed this up I have looked at other inference engines and model implementations, for example ONNX, but I was not able to gain any faster inference speed.
TensorRT looks very promising to me, but I did not find a ready "out-of-the-box" implementation for it.
Are there any other mature and fast inference engines or other techniques to speed up the inference?
It's almost impossible to get a higher inference speed for Mask R-CNN on a GTX 1080. You may check Detectron2 by Facebook AI Research.
Otherwise, I'd suggest using YOLACT (You Only Look At CoefficienTs), which can achieve real-time instance segmentation.
On the other hand, if you don't need instance segmentation, you can use YOLO, SSD, etc. for object detection.
OpenCV 4.5.0's DNN module with DNN_BACKEND_CUDA and DNN_TARGET_CUDA/DNN_TARGET_CUDA_FP16.
Mask R-CNN with a 1024 x 1024 input image:
Device | FPS
------------------ | -------
GTX 1080 Ti (FP32) | 29
RTX 2080 Ti (FP16) | 60
The measured FPS includes NMS but excludes other preprocessing and postprocessing. The network runs fully end-to-end on the GPU.
Benchmark code: https://gist.github.com/YashasSamaga/48bdb167303e10f4d07b754888ddbdcf
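For anyone wanting to reproduce this, a minimal sketch of selecting the CUDA backend and target with OpenCV's DNN module looks roughly like the following. The file names are placeholders, and the output layer names are the ones used by OpenCV's Mask R-CNN sample for a TensorFlow Object Detection API export, so adjust them to your own model:

```python
import cv2

# Placeholder files: a TensorFlow Object Detection API Mask R-CNN export
# plus the .pbtxt graph description generated for OpenCV.
net = cv2.dnn.readNetFromTensorflow("frozen_inference_graph.pb",
                                    "mask_rcnn.pbtxt")

# Requires an OpenCV build with CUDA support (DNN_BACKEND_CUDA is available
# from 4.2 onwards; 4.5.0 was used for the numbers above).
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA_FP16)  # DNN_TARGET_CUDA for FP32

img = cv2.imread("input.jpg")
blob = cv2.dnn.blobFromImage(img, size=(1024, 1024), swapRB=True, crop=False)
net.setInput(blob)

# Output names as in OpenCV's Mask R-CNN sample; FP16 only pays off on cards
# with fast half-precision support (e.g. Turing).
boxes, masks = net.forward(["detection_out_final", "detection_masks"])
```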
As @kkHarshit already mentioned, it is very hard to speed up Mask R-CNN any further.
The fastest instance segmentation model that I found is YolactEdge: Real-time Instance Segmentation on the Edge (Jetson AGX Xavier: 30 FPS, RTX 2080 Ti: 170 FPS).
Its performance is worse than Mask R-CNN or even YOLACT, but still very good.
I am using YOLOv4 to train my custom detector. Source: https://github.com/AlexeyAB/darknet
One of the issues while training is the computing power of the GPU and the available video RAM. What is the relationship between the number of object classes and the time it takes to train the model? Also, is it possible to significantly reduce the inference time on images by reducing the number of object classes? The goal is to run inference on a Raspberry Pi or a Jetson Nano.
Any help is much appreciated. Thanks.
A change in the number of classes doesn't have a significant impact on inference time.
For example, in the case of YOLOv4, which has 3 Yolo layers, a change in the number of classes changes the filter count of the conv layers preceding the Yolo layers and slightly reduces the computation within the Yolo layers, that's all. This is minute compared to the overall inference time, because the conv layers preceding the Yolo layers are bottom layers with very small width and height, and the time spent on the class-dependent logic within a Yolo layer is also very small.
Here:
filters=(classes + 5)x3
Note that the tiny version of YOLOv4, i.e. tiny-yolov4, has only two Yolo layers instead of 3.
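For illustration, the arithmetic behind that formula is tiny; a quick sketch (the function name is made up for this example):

```python
# filters = (classes + 5) * 3, i.e. 3 boxes per cell, each predicting
# x, y, w, h, objectness plus one score per class.
def filters_before_yolo(num_classes, boxes_per_cell=3):
    return (num_classes + 5) * boxes_per_cell

print(filters_before_yolo(80))  # 255 filters (default COCO config)
print(filters_before_yolo(5))   # 30 filters (a custom 5-class config)
```

So going from 80 to 5 classes only shrinks those few conv layers from 255 to 30 filters; the rest of the network is unchanged.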
If your intent is to reduce inference time, especially on a Raspberry Pi or a Jetson Nano, without losing accuracy/mAP, do the following:
Quantisation: Run inference with INT8 instead of FP32. You can use this repo for this purpose. You can do this on both the Jetson Nano and the Raspberry Pi.
Use an inference library such as tkDNN, which is a deep neural network library built with cuDNN and TensorRT primitives, specifically designed to work on NVIDIA Jetson boards. You can use this on the Jetson Nano. Note that with TensorRT you can use INT8 and FP16 instead of FP32 to reduce detection time (a minimal TensorRT build sketch follows below).
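If you go the TensorRT route (directly or via tkDNN), a minimal sketch of building an FP16 engine from an ONNX export could look like the following. This assumes the TensorRT 8.x Python bindings and uses a placeholder file name yolov4.onnx; INT8 additionally requires a calibration dataset and a calibrator, which is omitted here:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

# "yolov4.onnx" is a placeholder for your exported model.
with open("yolov4.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("ONNX parsing failed")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # INT8 would also need a calibrator

# Serialize the optimized engine so it can be loaded at inference time.
engine_bytes = builder.build_serialized_network(network, config)
with open("yolov4_fp16.engine", "wb") as f:
    f.write(engine_bytes)
```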
The following techniques can also be used to reduce inference time, but they come at the cost of a significant drop in accuracy/mAP:
You can train the tiny versions of the models rather than the full Yolo versions.
Model pruning - if you can rank the neurons in the network according to how much they contribute, you can remove the low-ranking neurons, resulting in a smaller and faster network. See the pruned YOLOv3 research paper and its implementation. This is another pruned YOLOv3 implementation.
I tried reducing the number of classes from 80 to 5 (I was aiming to detect vehicles only) on YOLOv3 and found a reduction in time. For example, on an Intel Core i5-6300U CPU @ 2.40 GHz the time was reduced by 50%, and on an NVIDIA GeForce 930M it was reduced by 20%. Generally, the stronger the processor, the less reduction in time you get.
I'm new to deep learning. I wanted to build an image classifier using a CNN to classify clothing images. I decided to train on the Fashion-MNIST dataset, which contains 60,000 images. But I'm aware that training is a very heavy task.
I wanted to know how long my PC will take to train on this dataset, and whether I should go for pre-trained models instead, with a compromise on accuracy.
My PC configurations are:
- Intel Core i5-6400 CPU @ 2.70 GHz
- 8GB RAM.
- NVIDIA GeForce GTX 1050 Ti.
Even though it depends on the dataset size and the number of epochs (I tried 50 epochs), the images here are small (around 32x32).
So for me, when I tried it on a machine with:
- Intel Core i7-6400 CPU @ 2.70 GHz
- 8GB RAM
- NVIDIA GeForce GTX 1050 Ti
with the 28x28 image size provided in the MNIST dataset on tensorflow.org, it took less than 5 minutes.
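If it helps to get a concrete number for your own machine, here is a minimal Keras sketch of a small CNN on Fashion-MNIST that you can simply time; the architecture and epoch count are only illustrative, not a recommended model:

```python
import time
import tensorflow as tf

# Fashion-MNIST: 60,000 training images of 28x28 greyscale clothing items.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()
x_train = x_train[..., None] / 255.0
x_test = x_test[..., None] / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

start = time.time()
model.fit(x_train, y_train, epochs=5, batch_size=128,
          validation_data=(x_test, y_test))
print(f"Training took {time.time() - start:.1f} s")
```

On a GTX 1050 Ti class GPU a model of this size takes only a few seconds per epoch, which lines up with the few-minutes figure above.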
I trained my own dataset for YOLOv2 in darknet. I am using Ubuntu 18.04 and have no GPU. When I play a video (which I took on my smartphone) for testing, it is too slow. Is it because I don't have a GPU? Or is it because of some other reason?
Can someone help me?
Without a GPU, YOLOv2 is going to be very slow, and if you have a modern smartphone it's likely that the video is high resolution with a high frame rate. I'm not sure of your implementation, but it's likely you're processing every frame of the video instead of skipping every other frame or only processing every 10th frame.
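A rough sketch of that frame-skipping idea with OpenCV (the video path and the detector call are placeholders):

```python
import cv2

cap = cv2.VideoCapture("phone_video.mp4")  # placeholder path
skip = 10          # only run detection on every 10th frame
frame_id = 0

while True:
    ok, frame = cap.read()
    if not ok:
        break
    frame_id += 1
    if frame_id % skip != 0:
        continue   # reuse the previous detections for the skipped frames

    # High-resolution phone video is much larger than the network input,
    # so downscale before handing the frame to the detector.
    small = cv2.resize(frame, (416, 416))
    # detections = run_yolo(small)  # placeholder for your YOLOv2 call

cap.release()
```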
If you don't have a GPU available (and aren't going to get one), another way to get GPU-like performance is Intel's OpenVINO, if you have a recent Core i-series processor. You'd be able to convert your YOLOv2 model to OpenVINO and run it on a CPU with really fast inference times (likely <100 ms per frame). I will say I ran YOLOv3 on OpenVINO, though, and it was really slow compared to other object detectors, and especially compared to a MobileNet.
I also have some demos set up to compare YOLOv3 on a CPU and OpenVINO on a CPU; you can check those out on SugarKubes.
One big reason is, of course, that you don't have a GPU. The other reason is the model that you use. You use YOLOv2, which is faster than YOLOv3 but still slower than TinyYolo or TinyYoloV3.
So this is the trade-off between accuracy and speed: the faster your model, the lower the accuracy. If you are going for speed, then there are 3 solutions that I can think of:
Use a GPU (I know it's expensive but worth the price; an NVIDIA GTX 1060 or better would be great)
Change your model to TinyYolo or TinyYoloV3. I recommend TinyYoloV3 for higher FPS:
TinyYoloV3 : 220 fps
TinyYolo : 207 fps
YoloV2 : 67 fps
Use OpenVINO, as Andrew Pierno said
Download model from here : https://pjreddie.com/darknet/yolo/
Yolov2's link : https://pjreddie.com/darknet/yolov2/
I am running a deep learning CNN model (4 convolutional layers and 3 fully-connected layers, written in Keras with TensorFlow as the backend) on two different machines.
I have 2 machines (A: with a GTX 960 graphics GPU with 2GB memory & clock speed 1.17 GHz, and B: with a Tesla K40 compute GPU with 12GB memory & clock speed 745 MHz).
But when I run the CNN model on A:
Epoch 1/35
50000/50000 [==============================] - 10s 198us/step - loss: 0.0851 - acc: 0.2323
on B:
Epoch 1/35
50000/50000 [==============================] - 43s 850us/step - loss: 0.0800 - acc: 0.3110
The numbers are not even comparable. I am quite new to deep learning and running code on GPUs. Could someone please help me understand why the numbers are so different?
Dataset: CIFAR-10 (32x32 RGB images)
Model batch size: 128
Model number of parameters: 1.2M
OS: Ubuntu 16.04
Nvidia driver version: 384.111
Cuda version: 7.5, V7.5.17
Please let me know if you need any more data.
Edit 1: (adding CPU info)
Machine A (GTX 960): 8 cores - Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
Machine B (Tesla K40c): 8 cores - Intel(R) Xeon(R) CPU E5-2637 v4 @ 3.50GHz
TL;DR: Measure again with a larger batch size.
Those results do not surprise me much. It's a common mistake to think that an expensive Tesla card (or a GPU for that matter) will automatically do everything faster. You have to understand how GPUs work in order to harness their power.
If you compare the base clock speeds of your devices, you will find that your Xeon CPU has the fastest one:
Nvidia K40c: 745MHz
Nvidia GTX 960: 1127MHz
Intel i7: 3400MHz
Intel Xeon: 3500MHz
This gives you a hint of the speeds at which these devices operate and gives a very rough estimate of how fast they can crunch numbers if they would only do one thing at a time, that is, with no parallelization.
So as you see, GPUs are not fast at all (for some definition of fast), in fact they're quite slow. Also note how the K40c is in fact slower than the GTX 960.
However, the real power of a GPU comes from its ability to process a lot of data simultaneously! If you now look at how much parallelization is possible on these devices, you will find that your K40c is not so bad after all:
Nvidia K40c: 2880 cuda cores
Nvidia GTX 960: 1024 cuda cores
Intel i7: 8 threads
Intel Xeon: 8 threads
Again, these numbers give you a very rough estimate of how many things these devices can do simultaneously.
Note: I am severely simplifying things: In absolutely no way is a CPU core comparable to a cuda core! They are very very different things. And in no way can base clock frequencies be compared like this! It's just to give an idea of what's happening.
So your devices need to be able to process a lot of data in parallel in order to maximize their throughput. Luckily TensorFlow already does this for you: it will automatically parallelize all those heavy matrix multiplications for maximum throughput. However, this is only going to be fast if the matrices have a certain size. Your batch size is set to 128, which means that almost all of these matrices will have their first dimension set to 128. I don't know the details of your model, but if the other dimensions are not large either, then I suspect that most of your K40c is staying idle during those matrix multiplications.

Try to increase the batch size and measure again. You should find that larger batch sizes make the K40c faster in comparison with the GTX 960. The same should be true for increasing the model's capacity: increase the number of units in the fully-connected layers and the number of filters in the convolutional layers. Adding more layers will probably not help here. The output of the nvidia-smi tool is also very useful to see how busy a GPU really is.
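To see the effect concretely, one rough way is to time one epoch of a stand-in model at a few batch sizes and compare the throughput; the architecture below is only a placeholder loosely matching the description (4 conv + 3 dense layers), not the actual model:

```python
import time
import numpy as np
import tensorflow as tf

# Stand-in model; random data is enough to compare raw throughput.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(32, 32, 3)),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

x = np.random.rand(10000, 32, 32, 3).astype("float32")
y = np.random.randint(0, 10, size=(10000,))

for batch_size in (128, 512, 2048):
    start = time.time()
    model.fit(x, y, epochs=1, batch_size=batch_size, verbose=0)
    elapsed = time.time() - start
    print(f"batch_size={batch_size}: {10000 / elapsed:.0f} samples/s")
```

If the K40c really is underutilized at batch size 128, its samples/s should scale up with the batch size much more than the GTX 960's does.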
Note, however, that changing the model's hyper-parameters and/or the batch size will of course have a huge impact on how well the model trains, and naturally you might also hit memory limitations.
Perhaps if increasing the batch size or changing the model is not an option, you could also try to train two models on the K40c at the same time to make use of the idle cores. However I have never tried this, so it might not work at all.
I went through the MNIST tutorial with conv nets and, during training, felt the need to use a GPU for the first time. I have a GeForce GTX 830M on my laptop and was wondering if I could use it with TensorFlow?
Should I invest the time to try to get it working or start searching for a low cost GPU with the right requirements?
[I've been reading about very expensive and highly specialized equipment like NVIDIA DIGITS, equipment with half precision, etc.]
Looking at this chart, the 830M has 5.0 compute capability, so in theory you'll be able to run TensorFlow (which requires 3.5). In practice you'll often hit problems with low memory on laptop GPUs, so you'll likely want to graduate to a desktop to do serious work, but it's a good way to get started.
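As a first sanity check once a CUDA-enabled build is installed, you can ask TensorFlow whether it sees the GPU at all (the first line assumes the TF 2.x API; the commented alternative is the older 1.x way):

```python
import tensorflow as tf

# Lists the GPUs TensorFlow can use; an empty list means it will fall back to CPU.
print(tf.config.list_physical_devices("GPU"))

# On older TensorFlow 1.x installs the equivalent check is:
# from tensorflow.python.client import device_lib
# print(device_lib.list_local_devices())
```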