TRT versus TF-TRT - tensorflow

I need to convert some models to be able to deploy them on jetson devices.
I have tried the TensorRT for Yolov3 trained on coco 80, but I wasn't successful to inference it so I decided to do the TF-TRT. It worked on my laptop, the FPS is increased but the size and the GPU memory usage didn't changed. Size of model was 300MB, it gets abit bigger. Before and after TF-TRT model still using 16 GB GPU memory.
Is it sth usual? I mean is it ok or there is sth wrong? I expected to achieve lower size, lesser GPU memory usage and higher FPS (BTW nodes are reduced).
The important thing is that the FPS jumps hardly after TF-TRT. I got around 3FPS before TF-TRT but after that I am getting 4,6,7,8,9 FPSs, but the FPS is not changing smoothly, for example for the first frame I get 4, and for the second frame I get 9 FPS, I can see these jumps in the visualization over the video as well. why this happened? How can I fix it?
I have read that TRT has better performance than TF-TRT. Is it True?
What is the exact difference between them? I am confused
I have another model that I need to convert it to TRT but it is a pytorch model (HourGlass CNN). Do you know how I can do it? Is there any valid/working repo on github or tutorials on YouTube which you can share?
Tensorflow to TRT is easier or Pytorch to TRT?
Thank you very much

Hope my experience match your needs
1 - Yes it is usual with models that are not prepared to be optimized a lot. Yolo is a very huge model, no matters if you translate to TRT. TRT make it works and better than TF-TRT, because with TRT the model is optimized 100% or it fail. With TF-TRT the optimization ocurrs only on the layers that could be optimized and other are leave as it is.
2 - Yes you could fix it! For Jetson Nano you have deepstream, a optimized framwork to run all inference over GPU wthout using CPU to move memory (using TRT inside). For deepstream you have a YOlo demo optimized, in Jetson nano I have achive 12 FPS for YOlov3, and you have the option of tinyYolo for better performance.
https://www.reddit.com/r/learnmachinelearning/comments/hy50dl/a_tutorial_on_implementing_yolo_v3_with/
3 - As I mention before. IF you translate your model to TRT from ONNX or etlt using TRTexec or deepstream, the system will optimize 100% of the layers or it will fail in the process. With TF-TRT the system "do it best" but not guarantee that all layers are optmized to the specific hardware. TF-TRT is a better solution for custom/rare models or if you need to make quick test.
4/5 - In the past, if you have a Pytorch model you need first to convert it to ONNX and then to TRT with trtExec. In the last month, with TRT 8.0 you have the posibility yo use pytoch-TRT, like tensorflow-trt. So today is the same. but if performance FPS is your concern I recommend you to go from tensorflow/pytorch to ONNX and then to TRT with trtexec or deepstream.

Related

YoloV3 deployment on JETSON TX2

I faced problem regarding Yolo object detection deployment on TX2.
I use pre-trained Yolo3 (trained on Coco dataset) to detect some limited objects (I mostly concern on five classes, not all the classes), The speed is low for real-time detection, and the accuracy is not perfect (but acceptable) on my laptop. I’m thinking to make it faster by multithreading or multiprocessing on my laptop, is it possible for yolo?
But my main problem is that algorithm is not running on raspberry pi and nvidia TX2.
Here are my questions:
In general, is it possible to run yolov3 on TX2 without any modification like accelerators and model compression techniques?
I cannot run the model on TX2. Firstly I got error regarding camera, so I decided to run the model on a video, this time I got the 'cannot allocate memory in static TLS block' error, what is the reason of getting this error? the model is too big. It uses 16 GB GPU memory on my laptop.The GPU memory of raspberry and TX2 are less than 8GB. As far as I know there are two solutions, using a smaller model or using tensor RT or pruning. Do you have any idea if there is any other way?
if I use tiny-yolo I will get lower accuracy and this is not what I want. Is there any way to run any object detection model with high performance for real-time in terms of both accuracy and speed (FPS) on raspberry pi or NVIDIA TX2?
If I clean the coco data for just the objects I concern and then train the same model, I would get higher accuracy and speed but the size would not change, Am I correct?
In general, what is the best model in terms of accuracy for real-time detection and what is the best in terms of speed?
How is Mobilenet? Is it better than YOLOs in terms of both accuracy and speed?
1- Yes it is possible. I already run Yolov3 on Jetson Nano.
2- It depends on model and input resolution of data. You can decrease input resolution. Input images are transferred to GPU VRAM to use on model. Big input sizes can allocate much memory. As far as I remember I have run normal Yolov3 on Jetson Nano(which is worse than tx2) 2 years ago. Also, you can use Yolov3-tiny and Tensorrt as you mention. There are many sources on the web like this & this.
3- I suggest you to have a look at here. In this repo, you can make transfer learning with your dataset & optimize the model with TensorRT & run it on Jetson.
4- Size not dependent to dataset. It depend the model architecture(because it contains weights). Speed probably does not change. Accuracy depends on your dataset. It can be better or worser. If any class on COCO is similiar to your dataset's any class, I suggest you to transfer learning.
5- You have to find right model with small size, enough accuracy, gracefully speed. There is not best model. There is best model for your case which depend on also your dataset. You can compare some of the model's accuracy and fps here.
6- Most people uses mobilenet as feature extractor. Read this paper. You will see Yolov3 have better accuracy, SSD with MobileNet backbone have better FPS. I suggest you to use jetson-inference repo.
By using jetson-inference repo, I get enough accuracy on SSD model & get 30 FPS. Also, I suggest you to use MIPI-CSI camera on Jetson. It is faster than USB cameras.
I fixed the problem 1 and 2 only by replacing import order of the opencv and tensorflow inside the script.Now I can run Yolov3 without any modification on tx2. I got average FPS of 3.

Low FPS on tensorRT YoloV3 Jetson Nano

I converted my custom yolov3 model to onnx then onnx to tesnsorrt model on Jetson nano, it is taking 0.2 sec to predict on images i tried it on video and it is giving only 3 fps is there any way to increase this.
You can use FP16 inference mode instead of FP32 and speed up your inference around 2x. Another option is using larger batch size which I’m not sure if it works on Jetson Nano since it has resource limitations.
You can find helpful scripts and discussion here

Optimize Tensorflow Object Detection Model V2 Centernet Model for Evaluation

I am using the tensorflow centernet_resnet50_v2_512x512_kpts_coco17_tpu-8 object detection model on a Nvidia Tesla P100 to extract bounding boxes and keypoints for detecting people in a video. Using the pre-trained from tensorflow.org, I am able to process about 16 frames per second. Is there any way I can imporve the evaluation speed for this model? Here are some ideas I have been looking into:
Pruning the model graph since I am only detecting 1 type of object (people)
Have not been successful in doing this. Changing the label_map when building the model does not seem to improve performance.
Hard coding the input size
Have not found a good way to do this.
Compiling the model to an optimized form using something like TensorRT
Initial attempts to convert to TensorRT did not have any performance improvements.
Batching predictions
It looks like the pre-trained model has the batch size hard coded to 1, and so far when I try to change this using the model_builder I see a drop in performance.
My GPU utilization is about ~75% so I don't know if there is much to gain here.
TensorRT should in most cases give a large increase in frames per second compared to Tensorflow.
centernet_resnet50_v2_512x512_kpts_coco17_tpu-8 can be found in the TensorFlow Model Zoo.
Nvidia has released a blog post describing how to optimize models from the TensorFlow Model Zoo using Deepstream and TensorRT:
https://developer.nvidia.com/blog/deploying-models-from-tensorflow-model-zoo-using-deepstream-and-triton-inference-server/
Now regarding your suggestions:
Pruning the model graph: Pruning the model graph can be done by converting your tensorflow model to a TF-TRT model.
Hardcoding the input size: Use the static mode in TF-TRT. This is the default mode and enabled by: is_dynamic_op=False
Compiling the model: My advise would be to convert you model to TF-TRT or first to ONNX and then to TensorRT.
Batching: Specifying the batch size is also covered in the NVIDIA blog post.
Lastly, for my model a big increase in performance came from using FP16 in my inference engine. (mixed precision) You could even try INT8 although then you first have to callibrate.

Why is the output video too slow in darknet?

I trained my own dataset for yolov2 in darknet. I am using ubuntu 18.04 and has no GPU. When I play a video(which i have taken in my smart phone) for testing, it is too slow. Is it because i don't have a GPU? Or is it because of some other reasons?
Can someone reply me.
Without a gpu, yolov2 is going to be very slow and if you have a modern smart phone it's likely that video is high resolution with a high frame rate. I'm not sure of your implementation but it's likely you're processing every frame in the video instead of skipping every other frame or only processing every 10th frame.
If you don't have a gpu available (and aren't going to) another way to get gpu type performance is using Intel's Openvino if you have a recent I-series processor. You'd be able to convert your yolov2 model to open vino and run it on a cpu with really fast inference times (likely <100ms per frame). I will say I ran yolov3 off of Openvino though and it was really slow compared to other object detectors and especially compared to a mobilenet.
I also have some demo's set up to test between yolov3 on a cpu and open vino on a cpu, you can check those out on SugarKubes
1 big reason is of course because you don't have GPU. The other reason is the model that you use. You use YoloV2 which is faster than YoloV3 but still slower compared to TinyYolo or TinyYoloV3.
So, this is the trade off between accuracy and speed, the faster your model the lower the accuracy. If you are going for speed, than there are 3 solutions that I can think of :
Use GPU (I know it's expensive but worth the price, nvidia gtx 1060++ would be great)
Change your model to TinyYolo or TinyYoloV3. I recommend using TinyYolov3 for higher fps
TinyYoloV3 : 220 fps
TinyYolo : 207 fps
YoloV2 : 67 fps
Use OpenVino as Andrew Pierno said
Download model from here : https://pjreddie.com/darknet/yolo/
Yolov2's link : https://pjreddie.com/darknet/yolov2/

htop cpu almost red when running tensorflow, predict is very slow

I'm using tensorflow to train a model and predict, and use htop on ubuntu to monitor cpu usage. predict is very slow, I just can't bear it. htop shows that cpu color is almost red, which means almost all cpu resource is used by system kernel threads, but cpu usage is 0% before tensorflow start.
I have not changed the thread_num, I'm using tensorflow v0.11 on ubuntu14.04.
The problem is that default glibc malloc is not efficient for small allocations. Also, because Google develops/tests tensorflow with tcmalloc internally, bad interactions with regular malloc don't get ironed out. The solution is to run TensorFlow with tcmalloc.
sudo apt-get install google-perftools
export LD_PRELOAD="/usr/lib/libtcmalloc.so.4"
python ...
If you're looking for something to improve the inference performance, I could recommend trying OpenVINO. It improves your model's accuracy by converting it to Intermediate Representation (IR), conducting graph pruning, and fusing certain operations into others. Then, in runtime, it uses vectorization. OpenVINO is optimized for Intel hardware, although it should work with any CPU.
It's rather straightforward to convert the Tensorflow model to OpenVINO unless you have fancy custom layers. The full tutorial on how to do it can be found here. Some snippets are below.
Install OpenVINO
The easiest way to do it is using PIP. Alternatively, you can use this tool to find the best way in your case.
pip install openvino-dev[tensorflow]
Use Model Optimizer to convert SavedModel model
The Model Optimizer is a command-line tool that comes from OpenVINO Development Package. It converts the Tensorflow model to IR, which is a default format for OpenVINO. You can also try the precision of FP16, which should give you better performance without a significant accuracy drop (just change data_type). Run in the command line:
mo --saved_model_dir "model" --data_type FP32 --output_dir "model_ir"
Run the inference
The converted model can be loaded by the runtime and compiled for a specific device, e.g., CPU or GPU (integrated into your CPU like Intel HD Graphics). If you don't know what the best choice for you is, use AUTO. If you care about latency, I suggest adding a performance hint (as shown below) to use the device that fulfills your requirement. If you care about throughput, change the value to THROUGHPUT or CUMULATIVE_THROUGHPUT.
# Load the network
ie = Core()
model_ir = ie.read_model(model="model_ir/model.xml")
compiled_model_ir = ie.compile_model(model=model_ir, device_name="AUTO", config={"PERFORMANCE_HINT":"LATENCY"})
# Get output layer
output_layer_ir = compiled_model_ir.output(0)
# Run inference on the input image
result = compiled_model_ir([input_image])[output_layer_ir]
Disclaimer: I work on OpenVINO.