Optimize TensorFlow Object Detection V2 CenterNet Model for Evaluation

I am using the TensorFlow centernet_resnet50_v2_512x512_kpts_coco17_tpu-8 object detection model on an NVIDIA Tesla P100 to extract bounding boxes and keypoints for detecting people in a video. Using the pre-trained model from tensorflow.org, I am able to process about 16 frames per second. Is there any way I can improve the evaluation speed for this model? Here are some ideas I have been looking into:
- Pruning the model graph, since I am only detecting one type of object (people). I have not been successful in doing this; changing the label_map when building the model does not seem to improve performance.
- Hard-coding the input size. I have not found a good way to do this.
- Compiling the model to an optimized form using something like TensorRT. Initial attempts to convert to TensorRT did not yield any performance improvements.
- Batching predictions. It looks like the pre-trained model has the batch size hard-coded to 1, and so far when I try to change this using the model_builder I see a drop in performance.
My GPU utilization is about 75%, so I don't know if there is much to gain here.

TensorRT should in most cases give a large increase in frames per second compared to TensorFlow.
centernet_resnet50_v2_512x512_kpts_coco17_tpu-8 can be found in the TensorFlow Model Zoo.
NVIDIA has released a blog post describing how to optimize models from the TensorFlow Model Zoo using DeepStream and TensorRT:
https://developer.nvidia.com/blog/deploying-models-from-tensorflow-model-zoo-using-deepstream-and-triton-inference-server/
Now regarding your suggestions:
- Pruning the model graph: this can be done by converting your TensorFlow model to a TF-TRT model.
- Hardcoding the input size: use the static mode in TF-TRT. This is the default mode and is enabled with is_dynamic_op=False.
- Compiling the model: my advice would be to convert your model to TF-TRT, or first to ONNX and then to TensorRT.
- Batching: specifying the batch size is also covered in the NVIDIA blog post.
Lastly, for my model a big increase in performance came from using FP16 (mixed precision) in the inference engine. You could even try INT8, although then you first have to calibrate. A minimal TF-TRT sketch is shown below.
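Here is that sketch. The SavedModel path, the input dtype/shape, and the output directory are assumptions about how the model zoo archive is usually laid out, so adjust them for your setup.

import numpy as np
import tensorflow as tf
from tensorflow.python.compiler.tensorrt import trt_convert as trt

# Convert the SavedModel to TF-TRT with FP16 (mixed precision).
params = trt.TrtConversionParams(precision_mode=trt.TrtPrecisionMode.FP16)
converter = trt.TrtGraphConverterV2(
    input_saved_model_dir="centernet_resnet50_v2_512x512_kpts_coco17_tpu-8/saved_model",
    conversion_params=params)
converter.convert()

# Optionally pre-build the TensorRT engines for a fixed input shape; this is the
# TF2 counterpart of the static (is_dynamic_op=False) mode mentioned above.
def input_fn():
    yield (tf.constant(np.zeros((1, 512, 512, 3), dtype=np.uint8)),)

converter.build(input_fn=input_fn)
converter.save("centernet_trt_fp16")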

Related

YoloV3 deployment on JETSON TX2

I faced a problem regarding YOLO object detection deployment on the TX2.
I use a pre-trained YOLOv3 (trained on the COCO dataset) to detect a limited set of objects (I am mostly concerned with five classes, not all of them). The speed is low for real-time detection, and the accuracy is not perfect (but acceptable) on my laptop. I'm thinking of making it faster by multithreading or multiprocessing on my laptop; is that possible for YOLO?
But my main problem is that the algorithm does not run on the Raspberry Pi or the NVIDIA TX2.
Here are my questions:
1. In general, is it possible to run YOLOv3 on the TX2 without any modifications such as accelerators or model compression techniques?
2. I cannot run the model on the TX2. First I got an error regarding the camera, so I decided to run the model on a video; this time I got the 'cannot allocate memory in static TLS block' error. What is the reason for this error? Is the model too big? It uses 16 GB of GPU memory on my laptop, while the GPU memory of the Raspberry Pi and the TX2 is less than 8 GB. As far as I know there are two solutions: using a smaller model, or using TensorRT or pruning. Do you have any idea if there is any other way?
3. If I use tiny-YOLO I will get lower accuracy, and this is not what I want. Is there any way to run an object detection model with high real-time performance, in terms of both accuracy and speed (FPS), on the Raspberry Pi or the NVIDIA TX2?
4. If I clean the COCO data down to just the objects I am concerned with and then train the same model, I would get higher accuracy and speed, but the size would not change. Am I correct?
5. In general, what is the best model in terms of accuracy for real-time detection, and what is the best in terms of speed?
6. How is MobileNet? Is it better than the YOLOs in terms of both accuracy and speed?
1- Yes, it is possible. I have already run YOLOv3 on a Jetson Nano.
2- It depends on the model and the input resolution of the data. You can decrease the input resolution. Input images are transferred to GPU VRAM for use by the model, so large input sizes can allocate a lot of memory. As far as I remember, I ran normal YOLOv3 on a Jetson Nano (which is weaker than the TX2) two years ago. Also, you can use YOLOv3-tiny and TensorRT, as you mention. There are many sources on the web about this.
3- I suggest you have a look here. In this repo you can do transfer learning with your own dataset, optimize the model with TensorRT, and run it on the Jetson.
4- The size does not depend on the dataset; it depends on the model architecture (because the size comes from the weights). Speed probably does not change. Accuracy depends on your dataset; it can be better or worse. If any class in COCO is similar to a class in your dataset, I suggest transfer learning.
5- You have to find the right model with a small size, enough accuracy, and adequate speed. There is no single best model; there is a best model for your case, which also depends on your dataset. You can compare the accuracy and FPS of some models here.
6- Most people use MobileNet as the feature extractor. Read this paper; you will see that YOLOv3 has better accuracy, while SSD with a MobileNet backbone has better FPS. I suggest you use the jetson-inference repo.
By using the jetson-inference repo, I get good enough accuracy with an SSD model and reach 30 FPS. Also, I suggest using a MIPI-CSI camera on the Jetson; it is faster than USB cameras.
I fixed problems 1 and 2 only by swapping the import order of OpenCV and TensorFlow inside the script (see the sketch below). Now I can run YOLOv3 without any modification on the TX2. I got an average FPS of 3.
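For illustration, the import-order workaround mentioned above boils down to something like the following. Which order avoids the static TLS error seems to depend on the environment, so treat this as something to experiment with rather than a definitive fix.

# On some Jetson/ARM setups the 'cannot allocate memory in static TLS block'
# error goes away when OpenCV is imported before TensorFlow; on others the
# reverse order is needed. Swap these two lines if the error persists.
import cv2
import tensorflow as tf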

How to optimize a pre-trained TF2.0 model for inference

My goal is to optimize a pre-trained model from TFHub for inference. Therefore I would like to use an object detection model with multiple outputs:
https://tfhub.dev/tensorflow/ssd_mobilenet_v2/fpnlite_640x640/1
where the archive contains a SavedModel file
https://tfhub.dev/tensorflow/ssd_mobilenet_v2/fpnlite_640x640/1?tf-hub-format=compressed
I came across the methods optimize_for_inference and freeze_graph, but read in the following thread that they are no longer supported in TF2:
https://stackoverflow.com/a/56384808/11687201
So how is optimization for inference done with TF2?
The plan is to use one of these pre-trained networks for transfer learning and to run the network later on a hardware accelerator; the converter for this hardware requires a frozen graph as input.
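For reference, one commonly used TF2 replacement for freeze_graph is convert_variables_to_constants_v2. The sketch below assumes the TF Hub archive has been unpacked into a local directory called ssd_mobilenet_v2_fpnlite_640x640 and that the default serving signature is the one to freeze; whether the resulting .pb is accepted by a given hardware converter depends on that converter.

import tensorflow as tf
from tensorflow.python.framework.convert_to_constants import convert_variables_to_constants_v2

# Load the SavedModel and pick the serving signature to freeze.
model = tf.saved_model.load("ssd_mobilenet_v2_fpnlite_640x640")
concrete_func = model.signatures["serving_default"]

# Inline all variables as constants, yielding a single frozen GraphDef.
frozen_func = convert_variables_to_constants_v2(concrete_func)
tf.io.write_graph(frozen_func.graph.as_graph_def(),
                  logdir="frozen_out", name="frozen_graph.pb", as_text=False)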

"Model not quantized" even after post-training quantization

I downloaded a TensorFlow model from Custom Vision and want to run it on a Coral TPU. I therefore converted it to TensorFlow Lite and applied hybrid post-training quantization (as far as I know that's the only option, because I do not have access to the training data).
You can see the code here: https://colab.research.google.com/drive/1uc2-Yb9Ths6lEPw6ngRpfdLAgBHMxICk
When I then try to compile it for the Edge TPU, I get the following:
Edge TPU Compiler version 2.0.258810407
INFO: Initialized TensorFlow Lite runtime.
Invalid model: model.tflite
Model not quantized
Any idea what my problem might be?
TFLite models are not fully quantized using converter.optimizations = [tf.lite.Optimize.OPTIMIZE_FOR_SIZE]. Have a look at post-training full integer quantization using a representative dataset: https://www.tensorflow.org/lite/performance/post_training_quantization#full_integer_quantization_of_weights_and_activations Simply adapt your generator function to yield representative samples (e.g. images similar to what your image classification network should predict). Very few images are enough for the converter to identify min and max values and quantize your model. However, accuracy is typically a bit lower than with quantization-aware training. A sketch is shown below.
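The sketch assumes the model is available as a SavedModel and that you can collect a handful of representative images; the paths and the random calibration data below are placeholders, not part of the original question.

import numpy as np
import tensorflow as tf

# Replace this with ~100 real images that resemble what the model will see.
calibration_images = [np.random.rand(224, 224, 3).astype(np.float32) for _ in range(16)]

def representative_dataset():
    for image in calibration_images:
        yield [np.expand_dims(image, axis=0)]

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Restrict the model to int8 ops with uint8 I/O so the Edge TPU compiler accepts it.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8

tflite_model = converter.convert()
with open("model_int8.tflite", "wb") as f:
    f.write(tflite_model)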
I can't find the source, but I believe the Edge TPU currently only supports 8-bit quantized models and no hybrid operators.
EDIT: Coral's FAQ mentions that the model needs to be fully quantized:
You need to convert your model to TensorFlow Lite and it must be
quantized using either quantization-aware training (recommended) or
full integer post-training quantization.

TensorRT/TFlite sample implementation

Having a trained '.h5' Keras model file, I'm trying to optimize inference time:
I have explored two options:
- Accelerated inference via TensorRT
- 'int8' quantization
At this point I can convert the model file to TensorFlow protobuf '.pb' format, but as a side note, the model also contains custom objects for a few layers.
I have seen a few articles on TensorRT conversion and TFLite conversion, but I can't find a robust, readable implementation. Can someone explain how that's done (TFLite/Keras quantization or TensorRT) so the same model can be used for faster inference?
(Open to other suggestions to improve inference speed supported in TensorFlow and Keras)
This is the user guide on how to use TensorRT in TF: https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html
This talk explains how TensorRT works in TF: https://developer.nvidia.com/gtc/2019/video/S9431
Note that TensorRT also supports INT8-quantization (during training or post-training).
This blog post also has kind of the same content: https://medium.com/tensorflow/high-performance-inference-with-tensorrt-integration-c4d78795fbfe
This repository has a bunch of examples showing how to use it: https://github.com/tensorflow/tensorrt
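As a starting point for the .h5 file with custom layers mentioned in the question, here is a minimal sketch that loads the Keras model and exports a SavedModel, which both the TF-TRT route and the tf2onnx-to-TensorRT route accept as input. The layer class and file paths are hypothetical stand-ins.

import tensorflow as tf

# Hypothetical custom layer standing in for whatever custom objects the .h5 contains.
class ScaleLayer(tf.keras.layers.Layer):
    def call(self, inputs):
        return inputs * 2.0

# Pass the custom objects so deserialization of the .h5 file succeeds.
model = tf.keras.models.load_model("model.h5",
                                   custom_objects={"ScaleLayer": ScaleLayer})

# Export a SavedModel; from here you can apply TF-TRT or convert to ONNX/TFLite.
model.save("saved_model_dir")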

How to optimize a trained Tensorflow graph for execution speedup?

In order to do fast CPU inference of a frozen TensorFlow graph (.pb), I am currently using TensorFlow's C API. The inference speed is already fairly good; however, compared to CPU-specific tools like Intel's OpenVINO, I have so far had no way to optimize the graph before running it. I am interested in any sort of optimization that is suitable:
- device-specific optimization for CPU
- graph-specific optimization (fusing operations, dropping out nodes, ...)
- ... and everything else lowering the time required for inference.
Therefore I am looking for a way to optimize graphs after training and before execution. As mentioned, tools like Intel's OpenVINO (for CPUs) and NVIDIA's TensorRT (for GPUs) do this kind of thing. I am also working with OpenVINO, but I am currently waiting for a bug fix, so in the meantime I would like to try an additional approach.
I have thought about trying TensorFlow XLA, but I have no experience using it. Moreover, I have to make sure to end up with either a frozen graph (.pb) or something that I can convert to a frozen graph (e.g. .h5).
I would be grateful for recommendations!
Greets
Follow these steps:
1. Freeze your trained TensorFlow model into frozen_graph.pb; for that you may need the trained model .pb, the checkpoints, and the output node names (see the freezing sketch after these steps).
2. Optimize your frozen model with the Intel OpenVINO Model Optimizer:
python3 mo.py --input_model frozen_graph.pb
Additionally, you may need to pass --input_shape.
You will get .xml and .bin files as a result. With the help of benchmark_app you can check the inference optimization.
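That freezing sketch: a minimal example assuming a TF1-style checkpoint and a hypothetical output node name ("detection_output"); inspect your own graph to find the real node names.

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# Restore the graph and weights from the checkpoint files.
saver = tf.train.import_meta_graph("model.ckpt.meta")
with tf.Session() as sess:
    saver.restore(sess, "model.ckpt")
    # Replace "detection_output" with the real output node name(s) of your model.
    frozen_graph_def = tf.graph_util.convert_variables_to_constants(
        sess, sess.graph_def, ["detection_output"])

with open("frozen_graph.pb", "wb") as f:
    f.write(frozen_graph_def.SerializeToString())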