Why is the output video too slow in darknet? - object-detection

I trained my own dataset for yolov2 in darknet. I am using ubuntu 18.04 and has no GPU. When I play a video(which i have taken in my smart phone) for testing, it is too slow. Is it because i don't have a GPU? Or is it because of some other reasons?
Can someone reply me.

Without a gpu, yolov2 is going to be very slow and if you have a modern smart phone it's likely that video is high resolution with a high frame rate. I'm not sure of your implementation but it's likely you're processing every frame in the video instead of skipping every other frame or only processing every 10th frame.
If you don't have a gpu available (and aren't going to) another way to get gpu type performance is using Intel's Openvino if you have a recent I-series processor. You'd be able to convert your yolov2 model to open vino and run it on a cpu with really fast inference times (likely <100ms per frame). I will say I ran yolov3 off of Openvino though and it was really slow compared to other object detectors and especially compared to a mobilenet.
I also have some demo's set up to test between yolov3 on a cpu and open vino on a cpu, you can check those out on SugarKubes

1 big reason is of course because you don't have GPU. The other reason is the model that you use. You use YoloV2 which is faster than YoloV3 but still slower compared to TinyYolo or TinyYoloV3.
So, this is the trade off between accuracy and speed, the faster your model the lower the accuracy. If you are going for speed, than there are 3 solutions that I can think of :
Use GPU (I know it's expensive but worth the price, nvidia gtx 1060++ would be great)
Change your model to TinyYolo or TinyYoloV3. I recommend using TinyYolov3 for higher fps
TinyYoloV3 : 220 fps
TinyYolo : 207 fps
YoloV2 : 67 fps
Use OpenVino as Andrew Pierno said
Download model from here : https://pjreddie.com/darknet/yolo/
Yolov2's link : https://pjreddie.com/darknet/yolov2/

Related

TRT versus TF-TRT

I need to convert some models to be able to deploy them on jetson devices.
I have tried the TensorRT for Yolov3 trained on coco 80, but I wasn't successful to inference it so I decided to do the TF-TRT. It worked on my laptop, the FPS is increased but the size and the GPU memory usage didn't changed. Size of model was 300MB, it gets abit bigger. Before and after TF-TRT model still using 16 GB GPU memory.
Is it sth usual? I mean is it ok or there is sth wrong? I expected to achieve lower size, lesser GPU memory usage and higher FPS (BTW nodes are reduced).
The important thing is that the FPS jumps hardly after TF-TRT. I got around 3FPS before TF-TRT but after that I am getting 4,6,7,8,9 FPSs, but the FPS is not changing smoothly, for example for the first frame I get 4, and for the second frame I get 9 FPS, I can see these jumps in the visualization over the video as well. why this happened? How can I fix it?
I have read that TRT has better performance than TF-TRT. Is it True?
What is the exact difference between them? I am confused
I have another model that I need to convert it to TRT but it is a pytorch model (HourGlass CNN). Do you know how I can do it? Is there any valid/working repo on github or tutorials on YouTube which you can share?
Tensorflow to TRT is easier or Pytorch to TRT?
Thank you very much
Hope my experience match your needs
1 - Yes it is usual with models that are not prepared to be optimized a lot. Yolo is a very huge model, no matters if you translate to TRT. TRT make it works and better than TF-TRT, because with TRT the model is optimized 100% or it fail. With TF-TRT the optimization ocurrs only on the layers that could be optimized and other are leave as it is.
2 - Yes you could fix it! For Jetson Nano you have deepstream, a optimized framwork to run all inference over GPU wthout using CPU to move memory (using TRT inside). For deepstream you have a YOlo demo optimized, in Jetson nano I have achive 12 FPS for YOlov3, and you have the option of tinyYolo for better performance.
https://www.reddit.com/r/learnmachinelearning/comments/hy50dl/a_tutorial_on_implementing_yolo_v3_with/
3 - As I mention before. IF you translate your model to TRT from ONNX or etlt using TRTexec or deepstream, the system will optimize 100% of the layers or it will fail in the process. With TF-TRT the system "do it best" but not guarantee that all layers are optmized to the specific hardware. TF-TRT is a better solution for custom/rare models or if you need to make quick test.
4/5 - In the past, if you have a Pytorch model you need first to convert it to ONNX and then to TRT with trtExec. In the last month, with TRT 8.0 you have the posibility yo use pytoch-TRT, like tensorflow-trt. So today is the same. but if performance FPS is your concern I recommend you to go from tensorflow/pytorch to ONNX and then to TRT with trtexec or deepstream.

YoloV3 deployment on JETSON TX2

I faced problem regarding Yolo object detection deployment on TX2.
I use pre-trained Yolo3 (trained on Coco dataset) to detect some limited objects (I mostly concern on five classes, not all the classes), The speed is low for real-time detection, and the accuracy is not perfect (but acceptable) on my laptop. I’m thinking to make it faster by multithreading or multiprocessing on my laptop, is it possible for yolo?
But my main problem is that algorithm is not running on raspberry pi and nvidia TX2.
Here are my questions:
In general, is it possible to run yolov3 on TX2 without any modification like accelerators and model compression techniques?
I cannot run the model on TX2. Firstly I got error regarding camera, so I decided to run the model on a video, this time I got the 'cannot allocate memory in static TLS block' error, what is the reason of getting this error? the model is too big. It uses 16 GB GPU memory on my laptop.The GPU memory of raspberry and TX2 are less than 8GB. As far as I know there are two solutions, using a smaller model or using tensor RT or pruning. Do you have any idea if there is any other way?
if I use tiny-yolo I will get lower accuracy and this is not what I want. Is there any way to run any object detection model with high performance for real-time in terms of both accuracy and speed (FPS) on raspberry pi or NVIDIA TX2?
If I clean the coco data for just the objects I concern and then train the same model, I would get higher accuracy and speed but the size would not change, Am I correct?
In general, what is the best model in terms of accuracy for real-time detection and what is the best in terms of speed?
How is Mobilenet? Is it better than YOLOs in terms of both accuracy and speed?
1- Yes it is possible. I already run Yolov3 on Jetson Nano.
2- It depends on model and input resolution of data. You can decrease input resolution. Input images are transferred to GPU VRAM to use on model. Big input sizes can allocate much memory. As far as I remember I have run normal Yolov3 on Jetson Nano(which is worse than tx2) 2 years ago. Also, you can use Yolov3-tiny and Tensorrt as you mention. There are many sources on the web like this & this.
3- I suggest you to have a look at here. In this repo, you can make transfer learning with your dataset & optimize the model with TensorRT & run it on Jetson.
4- Size not dependent to dataset. It depend the model architecture(because it contains weights). Speed probably does not change. Accuracy depends on your dataset. It can be better or worser. If any class on COCO is similiar to your dataset's any class, I suggest you to transfer learning.
5- You have to find right model with small size, enough accuracy, gracefully speed. There is not best model. There is best model for your case which depend on also your dataset. You can compare some of the model's accuracy and fps here.
6- Most people uses mobilenet as feature extractor. Read this paper. You will see Yolov3 have better accuracy, SSD with MobileNet backbone have better FPS. I suggest you to use jetson-inference repo.
By using jetson-inference repo, I get enough accuracy on SSD model & get 30 FPS. Also, I suggest you to use MIPI-CSI camera on Jetson. It is faster than USB cameras.
I fixed the problem 1 and 2 only by replacing import order of the opencv and tensorflow inside the script.Now I can run Yolov3 without any modification on tx2. I got average FPS of 3.

Low GPU usage by Keras / Tensorflow?

I'm using keras with tensorflow backend on a computer with a nvidia Tesla K20c GPU. (CUDA 8)
I'm tranining a relatively simple Convolutional Neural Network, during training I run the terminal program nvidia-smi to check the GPU use. As you can see in the following output, the GPU utilization commonly shows around 7%-13%
My question is: during the CNN training shouldn't the GPU usage be higher? is this a sign of a bad GPU configuration or usage by keras/tensorflow?
nvidia-smi output
Could be due to several reasons but most likely you're having a bottleneck when reading the training data. As your GPU has processed a batch it requires more data. Depending on your implementation this can cause the GPU to wait for the CPU to load more data resulting in a lower GPU usage and also a longer training time.
Try loading all data into memory if it fits or use a QueueRunner which will make an input pipeline reading data in the background. This will reduce the time that your GPU is waiting for more data.
The Reading Data Guide on the TensorFlow website contains more information.
You should find the bottleneck:
On windows use Task-Manager> Performance to monitor how you are using your resources
On Linux use nmon, nvidia-smi, and htop to monitor your resources.
The most possible scenarios are:
If you have a huge dataset, take a look at the disk read/write rates; if you are accessing your hard-disk frequently, most probably you need to change they way you are dealing with the dataset to reduce number of disk access
Use the memory to pre-load everything as much as possible.
If you are using a restful API or any similar services, make sure that you do not wait much for receiving what you need. For restful services, the number of requests per second might be limited (check your network usage via nmon/Task manager)
Make sure you do not use swap space in any case!
Reduce the overhead of preprocessing by any means (e.g. using cache, faster libraries, etc.)
Play with the bach_size (however, it is said that higher values (>512) for batch size might have negative effects on accuracy)
The reason may be that your network is "relatively simple". I had a MNIST network with 60k training examples.
with 100 neurons in 1 hidden layer, CPU training was faster and GPU utilization on GPU training was around 10%
with 2 hidden layers, 2000 neurons each, GPU was significantly faster(24s vs 452s on CPU) and its utilization was around 39%
I have a pretty old PC (24GB DDR3-1333, i7 3770k) but a modern graphic card(RTX 2070 + SSDs if that matters) so there is a memory-GPU data transfer bottleneck.
I'm not yet sure how much room for improvement is here. I'd have to train a bigger network and compare it with better CPU/memory configuration + same GPU.
I guess that for smaller networks it doesn't matter that much anyway because they are relatively easy for the CPU.
Measuring GPU performance and utilization is not as straightforward as CPU or Memory. GPU is an extreme parallel processing unit and there are many factors. The GPU utilization number shown by nvidia-smi means what percentage of the time at least one gpu multiprocessing group was active. If this number is 0, it is a sign that none of your GPU is being utilized but if this number is 100 does not mean that the GPU is being used at its full potential.
These two articles have lots of interesting information on this topic:
https://www.imgtec.com/blog/a-quick-guide-to-writing-opencl-kernels-for-rogue/
https://www.imgtec.com/blog/measuring-gpu-compute-performance/
Low GPU utilization might be due to the small batch size. Keras has a habit of occupying the whole memory size whether, for example, you use batch size x or batch size 2x. Try using a bigger batch size if possible and see if it changes.

What's the impact of using a GPU in the performance of serving a TensorFlow model?

I trained a neural network using a GPU (1080 ti). The training speed on GPU is far better than using CPU.
Currently, I want to serve this model using TensorFlow Serving. I just interested to know if using GPU in the serving process has a same impact on performance?
Since the training apply on batches but inferencing (serving) uses asynchronous requests, do you suggest using GPU in serving a model using TensorFlow serving?
You still need to do a lot of tensor operations on the graph to predict something. So GPU still provides performance improvement for inference. Take a look at this nvidia paper, they have not tested their stuff on TF, but it is still relevant:
Our results show that GPUs provide state-of-the-art inference
performance and energy efficiency, making them the platform of choice
for anyone wanting to deploy a trained neural network in the field. In
particular, the Titan X delivers between 5.3 and 6.7 times higher
performance than the 16-core Xeon E5 CPU while achieving 3.6 to 4.4
times higher energy efficiency.
The short answer is yes, you'll get roughly the same speedup for running on the GPU after training. With a few minor qualifications.
You're running 2 passes over the data in training, which all happens on the GPU, during the feedforward inference you're doing less work, so there will be more time spent transferring data to the GPU memory relative to computations than in training. This is probably a minor difference though. And you can now asynchronously load the GPU if that's an issue (https://github.com/tensorflow/tensorflow/issues/7679).
Whether you'll actually need a GPU to do inference depends on your workload. If your workload isn't overly demanding you might get away with using the CPU anyway, after all, the computation workload is less than half, per sample, so consider the number of requests per second you'll need to serve and test out whether you overload your CPU to achieve that. If you do, time to get the GPU out!

Strategies for improving performance when using Tensorflow w / C++?

I'm fairly new to Tensorflow in and ML in general and am wondering what strategies I can use to increase performance of an application I am building.
My app is using the Tensorflow C++ interface, with a source compiled TF 0.11 libtensorflow_cc.so (built with bazel build -c opt --copt=-mavx and optionally adding --config=cuda) for either AVX or AVX + CUDA on Mac OS X 10.12.1, on an MacBook Pro 2.8 GHz Intel Core i7 (2 cores 8 threads) with 16GB ram and a Nvidia 750m w/ 2GB VRam)
My application is using Inception V3 model and pulling feature vectors from pool_3 layer. I'm decoding video frames via native API's and passing those in memory buffers to the C++ interface for TF and running them into a session.
I'm not currently batching, but I am caching my session and re-using it for each individual decoded frame / tensor submission. Ive noticed that both CPU and GPU performance is about the same, taking about 40 to 50 seconds to process 222 frames, which seems very slow to me. Ive confirmed CUDA is being invoked, loaded, and the GPU is functioning (or appears so).
Some questions:
In general what should I expect for reasonable performance time wise of TF doing a frame of Inception on a consumer laptop?
How much of a difference does batching make for these operations? For tensors of 1x299x299x3 , I imagine I am doing more PCI transfer waiting than waiting on for meaningful work from the GPU?
if so Is there a good example of batching under C++ for InceptionV3?
Is there operations that cause additional CPU->GPU Syncronization that might otherwise be avoided?
Is there a way to ensure my sessions / graphs share resources ? Can I use nested scopes somehow in this manner? I couldn't quite get that to work but likely missed something.
Any good documentation of general strategies for things to do / avoid?
My code is below:
https://github.com/Synopsis/Synopsis/blob/TensorFlow/Synopsis/TensorFlowAnalyzer/TensorFlowAnalyzer.mm
Thank you very much
For reference, OpenCV analysis using perceptual hash, histogram, dense optical flow, sparse optical flow for point tracking, and simple saliency detection takes 4 to 5 seconds for the same 222 frames using CPU or CPU + OpenCL.
https://github.com/Synopsis/Synopsis/tree/TensorFlow/Synopsis/StandardAnalyzer
Answering your last question first, if there's documentation about performance optimization, yes:
The TensorFlow Performance Guide
The TensorFlow GPU profiling hints
Laptop performance is highly variable, and TF isn't particularly optimized for laptop GPUs. The numbers you're getting (222 frames in 40-50 seconds) ~= 5 fps don't seem crazy on a laptop platform, using the 2016 version of TensorFlow, with inception. With some of the performance improvements outlined in the performance guide above, that should probably be doubled in late 2017.
For batching, yes - the newer example inception model code allows a variable batch size at inference time. This is mostly about whether the model itself was defined to handle a batch size, which is something improved since 2016.
Batching for inference will make a pretty big difference on GPU. Whether it helps on CPU depends a lot -- for example, if you build with MKL-DNN support, batching should be considered mandatory, but basic TensorFlow may not benefit as much.