What are the main differences between TensorFlowLite, TendorFlow-TRT and TensorRT? - tensorflow

I am using the Coral devboard and the Nvidia Jetson TX2. And that is how I got to know about TensorFlow-Lite, TensorFlow-TRT and TensorRT.
I have some questions about them:
Between TensorFlow-TRT and TensorRT:
When using a fully optimised/compatible graph with TensorRT, which one is faster and why?
The pipeline to use TFlite in a Google Coral (When using TensorFlow 1.x...) is:
a. Use a model available in TensorFlow's zoo
b. Convert the model to frozen (.pb)
c. Use protobuff to serialize the graph
d. Convert to Tflite
e. Apply quantization (INT8)
f. Compile
what would be the pipeline when using TensorFlow-TRT and TensorRT?
Is there somewhere where I can find a good documentation about it?
So far I think TensorRT is closer to TensorFlow Lite because:
TFlite: after compilation you end up with a .quant.edtpu.tflite file which can be used to make inference in the devboard
TensorRT: you will end up with a .plan file to make inference in the devboard.
Thank you for the answers, and if you can point me to documentation which compares them, that will be appreciated.

TensorRT is a very fast CUDA runtime for GPU only. I am using an Nvidia Jetson Xavier NX with Tensorflow models converted to TensorRT, running on the Tensorflow-RT (TRT) runtime. The benefit of TRT runtime is any unsupported operations on TensorRT will fall back to using Tensorflow.
Have not tried Tensorflow-Lite, but I understand it as a reduced TF for inference-only on "small devices". It can support GPU but only limited operations and I think there are no python bindings (currently).

Related

How to optimize your tensorflow model by using TensorRT?

These are the instruction to solve the assignments?
Convert your TensorFlow model to UFF
Use TensorRT’s C++ API to parse your model to convert it to a CUDA engine.
TensorRT engine would automatically optimize your model and perform steps
like fusing layers, converting the weights to FP16 (or INT8 if you prefer) and
optimize to run on Tensor Cores, and so on.
Can anyone tell me how to proceed with this assignment because I don't have GPU in my laptop and is it possible to do this in google colab or AWS free account.
And what are the things or packages I have to install for running TensorRT in my laptop or google colab?
so I haven't used .uff but I used .onnx but from what I've seen the process is similar.
According to the documentation, with TensorFlow you can do something like:
from tensorflow.python.compiler.tensorrt import trt_convert as trt
converter = trt.TrtGraphConverter(
input_graph_def=frozen_graph,
nodes_blacklist=['logits', 'classes'])
frozen_graph = converter.convert()
In TensorFlow1.0, so they have it pretty straight forward, TrtGraphConverter has the option to serialized for FP16 like:
converter = trt.TrtGraphConverter(
input_saved_model_dir=input_saved_model_dir,
max_workspace_size_bytes=(11<32),
precision_mode=”FP16”,
maximum_cached_engines=100)
See the preciosion_mode part, once you have serialized you can load the networks easily on TensorRT, some good examples using cpp are here.
Unfortunately, you'll need a nvidia gpu with FP16 support, check this support matrix.
If I'm correct, Google Colab offered a Tesla K80 GPU which does not have FP16 support. I'm not sure about AWS but I'm certain the free tier does not have gpus.
Your cheapest option could be buying a Jetson Nano which is around ~90$, it's a very powerful board and I'm sure you'll use it in the future. Or you could rent some AWS gpu server, but that is a bit expensive and the setup progress is a pain.
Best of luck!
Export and convert your TensorFlow model into .onnx file.
Then, use this onnx-tensorrt tool to do the CUDA engine file conversion.

Dumping Weights in TensorflowLite

new Tensorflow 2.0 user. My project requires me to investigate the weights for the neural network i created in Tensorflow (super simple one). I think I know how to do it in the regular Tensorflow case. Namely I use the command model.save_weights(filename). I would like to repeat this effort for a .tflite model but I am having trouble. Instead of generating my own tensorflow lite model, I am using one of the many models which are provided online: https://www.tensorflow.org/lite/guide/hosted_model to avoid having to troubleshoot my use of the Tensorflow Lite converter. Any thoughts?

Is it possible to run yolo model in Jetson without optimizing it?

I had few issues converting yolo.weight model to tensorRT. So, Is it possible to run a yolo model in Jetson with optimizing it to TensorRT ? Will there be the same speed for detection? (Training won't be done in Jetson anyway).
Or is there any other suggestion alternative to TensorRT?
Yes, It is possible to run YOLO model in Jetson without optimizing it with TensorRT. The TensorRT only optimizes the inference time of your model.
You could try converting it to TF Lite format and work from there, but you might need to handle most of the back-end operations yourself.
Also, for both methods, the training is done on the PC and not on the edge device.
You could read more about their documentations here in these links.
Tensorflow RT Github
Tensorflow Lite

TensorRT/TFlite sample implementation

Having a trained '.h5' Keras model file, I'm trying to optimize inference time:
Explored 2 options:
Accelerated inference via TensorRT
'int8' Quantization.
At this point I can convert the model file to TensorFlow protobuf '.pb' format, but as a sidenote, it also contains custom objects of few layers.
Saw a few articles on TensorRT conversion and TFLite conversion, but I don't seem to find a robust implementation that's legible. Can someone explain how that's done (TFLite/Keras Quantization or TensorRT) to use the same model for faster inference.
(Open for other suggestions to improve inference speed supported in TensorFlow and Keras)
This is the user guide on how to use TensorRT in TF: https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html
This talk explains how TensorRT works in TF: https://developer.nvidia.com/gtc/2019/video/S9431
Note that TensorRT also supports INT8-quantization (during training or post-training).
This blog post also has kind of the same content: https://medium.com/tensorflow/high-performance-inference-with-tensorrt-integration-c4d78795fbfe
This repository has a bunch of examples showing how to use it: https://github.com/tensorflow/tensorrt

Which object detection pre-trained models are available and convertible with TensorRT?

I'm looking into converting a pre-trained object detection model with TensorRT to try it out on my NVIDIA Jetson TX2 but every model I find has layers that are not yet supported by TensorRT. So far I tried SSD with MobileNet and Faster R-CNN but they both have operations such as Identity that are not supported by TensorRT and I can't find many other TensorFlow models out there.
Thank you
This repo has the pre-trained models you are looking for, along with instructions on how to take a model and build a TensorRT engine: https://github.com/NVIDIA-Jetson/tf_to_trt_image_classification#download