How to optimize your tensorflow model by using TensorRT? - tensorflow

These are the instruction to solve the assignments?
Convert your TensorFlow model to UFF
Use TensorRT’s C++ API to parse your model to convert it to a CUDA engine.
TensorRT engine would automatically optimize your model and perform steps
like fusing layers, converting the weights to FP16 (or INT8 if you prefer) and
optimize to run on Tensor Cores, and so on.
Can anyone tell me how to proceed with this assignment because I don't have GPU in my laptop and is it possible to do this in google colab or AWS free account.
And what are the things or packages I have to install for running TensorRT in my laptop or google colab?

so I haven't used .uff but I used .onnx but from what I've seen the process is similar.
According to the documentation, with TensorFlow you can do something like:
from tensorflow.python.compiler.tensorrt import trt_convert as trt
converter = trt.TrtGraphConverter(
input_graph_def=frozen_graph,
nodes_blacklist=['logits', 'classes'])
frozen_graph = converter.convert()
In TensorFlow1.0, so they have it pretty straight forward, TrtGraphConverter has the option to serialized for FP16 like:
converter = trt.TrtGraphConverter(
input_saved_model_dir=input_saved_model_dir,
max_workspace_size_bytes=(11<32),
precision_mode=”FP16”,
maximum_cached_engines=100)
See the preciosion_mode part, once you have serialized you can load the networks easily on TensorRT, some good examples using cpp are here.
Unfortunately, you'll need a nvidia gpu with FP16 support, check this support matrix.
If I'm correct, Google Colab offered a Tesla K80 GPU which does not have FP16 support. I'm not sure about AWS but I'm certain the free tier does not have gpus.
Your cheapest option could be buying a Jetson Nano which is around ~90$, it's a very powerful board and I'm sure you'll use it in the future. Or you could rent some AWS gpu server, but that is a bit expensive and the setup progress is a pain.
Best of luck!

Export and convert your TensorFlow model into .onnx file.
Then, use this onnx-tensorrt tool to do the CUDA engine file conversion.

Related

What are the main differences between TensorFlowLite, TendorFlow-TRT and TensorRT?

I am using the Coral devboard and the Nvidia Jetson TX2. And that is how I got to know about TensorFlow-Lite, TensorFlow-TRT and TensorRT.
I have some questions about them:
Between TensorFlow-TRT and TensorRT:
When using a fully optimised/compatible graph with TensorRT, which one is faster and why?
The pipeline to use TFlite in a Google Coral (When using TensorFlow 1.x...) is:
a. Use a model available in TensorFlow's zoo
b. Convert the model to frozen (.pb)
c. Use protobuff to serialize the graph
d. Convert to Tflite
e. Apply quantization (INT8)
f. Compile
what would be the pipeline when using TensorFlow-TRT and TensorRT?
Is there somewhere where I can find a good documentation about it?
So far I think TensorRT is closer to TensorFlow Lite because:
TFlite: after compilation you end up with a .quant.edtpu.tflite file which can be used to make inference in the devboard
TensorRT: you will end up with a .plan file to make inference in the devboard.
Thank you for the answers, and if you can point me to documentation which compares them, that will be appreciated.
TensorRT is a very fast CUDA runtime for GPU only. I am using an Nvidia Jetson Xavier NX with Tensorflow models converted to TensorRT, running on the Tensorflow-RT (TRT) runtime. The benefit of TRT runtime is any unsupported operations on TensorRT will fall back to using Tensorflow.
Have not tried Tensorflow-Lite, but I understand it as a reduced TF for inference-only on "small devices". It can support GPU but only limited operations and I think there are no python bindings (currently).

GPU support for TensorFlow & PyTorch

Okay, so I've worked on a bunch of Deep Learning projects and internships now and I've never had to do heavy training. But lately I've been thinking of doing some Transfer Learning for which I'll need to run my code on a GPU. Now I have a system with Windows 10 and a dedicated NVIDIA GeForce 940M GPU. I've been doing a lot of research online, but I'm still a bit confused. I haven't installed the NVIDIA Cuda Toolkit or cuDNN or tensorflow-gpu on my system yet. I currently use tensorflow and pytorch to train my DL models. Here are my queries -
When I define a tensor in tf or pytorch, it is a cpu tensor by default. So, all the training I've been doing so far has been on the CPU. So, if I make sure to install the correct versions of Cuda and cuDNN and tensorflow-gpu (specifically for tensorflow), I can run my models on my GPU using tf-gpu and pytorch and that's it? (I'm aware of the torch.cuda.is_available() in pytorch to ensure pytorch can access my GPU and the device_lib module in tf to check if my gpu is visible to tensorflow)(I'm also aware of the fact that tf doesnt support all Nvidia GPUs)
Why does tf have a separate module for GPU support? PyTorch doesnt seem to have that and all you need to do is cast your tensor from cpu() to cuda() to switch between them.
Why install cuDNN? I know it is a high-level API CUDA built for support to train Deep Neural Nets on the GPU. But do tf-gpu and torch use these in the backend while training on the gpu?
After tf == 1.15, did they combine CPU and GPU support all into one package?
First of all unfortunately 940M is a kinda weak GPU for training. I suggest you use Google colab for faster training but of course, it would be faster than the CPU. So here my answers to your four questions.
1-) Yes if you install the requirements correctly, then you can run on GPU. You can manually place your data to your GPU as well. You can check implementations on TensorFlow. In PyTorch, you should specify the device that you want to use. As you said you should do device = torch.device("cuda" if args.cuda else "cpu") then for models and data you should always call .to(device) Then it will automatically use GPU if available.
2-) PyTorch also needs extra installation (module) for GPU support. However, with recent updates both TF and PyTorch are easy to use for GPU compatible code.
3-) Both Tensorflow and PyTorch is based on cuDNN. You can use them without cuDNN but as far as I know, it hurts the performance but I'm not sure about this topic.
4-) No they are still different packages. tensorflow-gpu==1.15 and tensorflow==1.15 what they did with tf2, was making the tensorflow more like Keras. So it is more simplified then 1.15 or before.
Rest was already answered by regarding 3) cudNN optimizes layer and such operations on hardware level and those implementations are pure black magic. It is incredibly hard to write CUDA code that properly utilizes your GPU (how load data into the GPU, how to actually perform them using matrices etc. )

Is it possible to run yolo model in Jetson without optimizing it?

I had few issues converting yolo.weight model to tensorRT. So, Is it possible to run a yolo model in Jetson with optimizing it to TensorRT ? Will there be the same speed for detection? (Training won't be done in Jetson anyway).
Or is there any other suggestion alternative to TensorRT?
Yes, It is possible to run YOLO model in Jetson without optimizing it with TensorRT. The TensorRT only optimizes the inference time of your model.
You could try converting it to TF Lite format and work from there, but you might need to handle most of the back-end operations yourself.
Also, for both methods, the training is done on the PC and not on the edge device.
You could read more about their documentations here in these links.
Tensorflow RT Github
Tensorflow Lite

htop cpu almost red when running tensorflow, predict is very slow

I'm using tensorflow to train a model and predict, and use htop on ubuntu to monitor cpu usage. predict is very slow, I just can't bear it. htop shows that cpu color is almost red, which means almost all cpu resource is used by system kernel threads, but cpu usage is 0% before tensorflow start.
I have not changed the thread_num, I'm using tensorflow v0.11 on ubuntu14.04.
The problem is that default glibc malloc is not efficient for small allocations. Also, because Google develops/tests tensorflow with tcmalloc internally, bad interactions with regular malloc don't get ironed out. The solution is to run TensorFlow with tcmalloc.
sudo apt-get install google-perftools
export LD_PRELOAD="/usr/lib/libtcmalloc.so.4"
python ...
If you're looking for something to improve the inference performance, I could recommend trying OpenVINO. It improves your model's accuracy by converting it to Intermediate Representation (IR), conducting graph pruning, and fusing certain operations into others. Then, in runtime, it uses vectorization. OpenVINO is optimized for Intel hardware, although it should work with any CPU.
It's rather straightforward to convert the Tensorflow model to OpenVINO unless you have fancy custom layers. The full tutorial on how to do it can be found here. Some snippets are below.
Install OpenVINO
The easiest way to do it is using PIP. Alternatively, you can use this tool to find the best way in your case.
pip install openvino-dev[tensorflow]
Use Model Optimizer to convert SavedModel model
The Model Optimizer is a command-line tool that comes from OpenVINO Development Package. It converts the Tensorflow model to IR, which is a default format for OpenVINO. You can also try the precision of FP16, which should give you better performance without a significant accuracy drop (just change data_type). Run in the command line:
mo --saved_model_dir "model" --data_type FP32 --output_dir "model_ir"
Run the inference
The converted model can be loaded by the runtime and compiled for a specific device, e.g., CPU or GPU (integrated into your CPU like Intel HD Graphics). If you don't know what the best choice for you is, use AUTO. If you care about latency, I suggest adding a performance hint (as shown below) to use the device that fulfills your requirement. If you care about throughput, change the value to THROUGHPUT or CUMULATIVE_THROUGHPUT.
# Load the network
ie = Core()
model_ir = ie.read_model(model="model_ir/model.xml")
compiled_model_ir = ie.compile_model(model=model_ir, device_name="AUTO", config={"PERFORMANCE_HINT":"LATENCY"})
# Get output layer
output_layer_ir = compiled_model_ir.output(0)
# Run inference on the input image
result = compiled_model_ir([input_image])[output_layer_ir]
Disclaimer: I work on OpenVINO.

TensorFlow Inference (Serving), is CPU sufficient?

Usually after I trained my model, I would use the same GPU to do so.
However, do we still need a GPU instance for inference if I were to want to serve it online as a service? Or would a CPU instance suffice?
Thanks.
The device would be cleared when you export a model. Here is the unit test for this feature: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/saved_model/saved_model_test.py#L564
Copy from comment: GPU is fast when processing a large batch. When making inference for a single input, CPU is fast enough.
In many cases, the CPU should be enough. But if it isn't, you can further optimize the inference with some toolkits such as OpenVINO. OpenVINO is optimized for Intel hardware but it should work with any CPU. It optimizes your model by converting to Intermediate Representation (IR), performing graph pruning and fusing some operations into others while preserving accuracy. Then it uses vectorization in runtime.
Here are some performance benchmarks for various models and CPUs.
It's rather straightforward to convert the Tensorflow model to OpenVINO unless you have fancy custom layers. The full tutorial on how to do it can be found here. Some snippets below.
Install OpenVINO
The easiest way to do it is using PIP. Alternatively, you can use this tool to find the best way in your case.
pip install openvino-dev[tensorflow2]
Use Model Optimizer to convert SavedModel model
The Model Optimizer is a command-line tool that comes from OpenVINO Development Package. It converts the Tensorflow model to IR, which is a default format for OpenVINO. You can also try the precision of FP16, which should give you better performance without a significant accuracy drop (just change data_type). Run in the command line:
mo --saved_model_dir "model" --input_shape "[1, 3, 224, 224]" --data_type FP32 --output_dir "model_ir"
Run the inference
The converted model can be loaded by the runtime and compiled for a specific device e.g. CPU or GPU (integrated into your CPU like Intel HD Graphics). If you don't know what is the best choice for you, just use AUTO.
# Load the network
ie = Core()
model_ir = ie.read_model(model="model_ir/model.xml")
compiled_model_ir = ie.compile_model(model=model_ir, device_name="CPU")
# Get output layer
output_layer_ir = compiled_model_ir.output(0)
# Run inference on the input image
result = compiled_model_ir([input_image])[output_layer_ir]
Disclaimer: I work on OpenVINO.