I am running an ONNX model through TensorRT.
I can verify that inference is running on the GPU through the results and nvsys profile logs.
However, I would like to see the corresponding PTX binary that TensorRT generates for my input model.
Is there a specific flag or argument to do so?
Related
I am using the Coral devboard and the Nvidia Jetson TX2. And that is how I got to know about TensorFlow-Lite, TensorFlow-TRT and TensorRT.
I have some questions about them:
Between TensorFlow-TRT and TensorRT:
When using a fully optimised/compatible graph with TensorRT, which one is faster and why?
The pipeline to use TFlite in a Google Coral (When using TensorFlow 1.x...) is:
a. Use a model available in TensorFlow's zoo
b. Convert the model to frozen (.pb)
c. Use protobuff to serialize the graph
d. Convert to Tflite
e. Apply quantization (INT8)
f. Compile
what would be the pipeline when using TensorFlow-TRT and TensorRT?
Is there somewhere where I can find a good documentation about it?
So far I think TensorRT is closer to TensorFlow Lite because:
TFlite: after compilation you end up with a .quant.edtpu.tflite file which can be used to make inference in the devboard
TensorRT: you will end up with a .plan file to make inference in the devboard.
Thank you for the answers, and if you can point me to documentation which compares them, that will be appreciated.
TensorRT is a very fast CUDA runtime for GPU only. I am using an Nvidia Jetson Xavier NX with Tensorflow models converted to TensorRT, running on the Tensorflow-RT (TRT) runtime. The benefit of TRT runtime is any unsupported operations on TensorRT will fall back to using Tensorflow.
Have not tried Tensorflow-Lite, but I understand it as a reduced TF for inference-only on "small devices". It can support GPU but only limited operations and I think there are no python bindings (currently).
Overview
I know this subject has been discussed many times, but I am having a hard time understanding the workflow, or rather, the variations of the workflow.
For example, imagine you are installing TensorFlow on Windows 10. The main goal being to train a custom model, convert to TensorFlow Lite, and copy the converted .tflite file to a Raspberry Pi running TensorFlow Lite.
The confusion for me starts with the conversion process. After following along with multiple guides, it seems TensorFlow is often install with pip, or Anaconda. But then I see detailed tutorials which indicate it needs to be built from source in order to convert from TensorFlow models to TFLite models.
To make things more interesting, I've also seen models which are converted via Python scripts as seen here.
Question
So far I have seen 3 ways to do this conversion, and it could just be that I don't have a grasp on the full picture. Below are the abbreviated methods I have seen:
Build from source, and use the TensorFlow Lite Optimizing Converter (TOCO):
bazel run --config=opt tensorflow/lite/toco:toco -- --input_file=$OUTPUT_DIR/tflite_graph.pb --output_file=$OUTPUT_DIR/detect.tflite ...
Use the TensorFlow Lite Converter Python API:
converter = tf.lite.TFLiteConverter.from_saved_model(export_dir)
tflite_model = converter.convert()
with tf.io.gfile.GFile('model.tflite', 'wb') as f:
f.write(tflite_model)
Use the tflite_convert CLI utilities:
tflite_convert --saved_model_dir=/tmp/mobilenet_saved_model --output_file=/tmp/mobilenet.tflite
I *think I understand that options 2/3 are the same, in the sense that the tflite_convert utility is installed, and can be invoked either from the command line, or through a Python script. But is there a specific reason you should choose one over the other?
And lastly, what really gets me confused is option 1. And maybe it's a version thing (1.x vs 2.x)? But what's the difference between the TensorFlow Lite Optimizing Converter (TOCO) and the TensorFlow Lite Converter. It appears that in order to use TOCO you would have to build TensorFlow from source, so is there is a reason you would use one over the other?
There is no difference in the output from different conversion methods, as long as the parameters remain the same. The Python API is better if you want to generate TFLite models in an automated way (for eg a Python script that's run periodically).
The TensorFlow Lite Optimizing Converter (TOCO) was the first version of the TF->TFLite converter. It was recently deprecated and replaced with a new converter that can handle more ops/models. So I wouldn't recommend using toco:toco via bazel, but rather use tflite_convert as mentioned here.
You should never have to build the converter from source, unless you are making some changes to it and want to test them out.
I have trained a model for detection, which is doing great when embedded in tensorflow sample app.
After freezing with export_tflite_ssd_graph and conversion to tflite using toco the results do perform rather bad and have a huge "variety".
Reading this answer on a similar problem with loss of accuracy I wanted to try tflite_diff_example_test on a tensorflow docker machine.
As the documentation is not that evolved right now, I build the tool referencing this SO Post
using:
bazel build tensorflow/contrib/lite/testing/tflite_diff_example_test.cc which ran smooth.
After figuring out all my needed input parameters I tried the testscript with following commands:
~/.cache/bazel/_bazel_root/68a62076e91007a7908bc42a32e4cff9/external/bazel_tools/tools/test/test-setup.sh tensorflow/contrib/lite/testing/tflite_diff_example_test '--tensorflow_model=/tensorflow/shared/exported/tflite_graph.pb' '--tflite_model=/tensorflow/shared/exported/detect.tflite' '--input_layer=a,b,c,d' '--input_layer_type=float,float,float,float' '--input_layer_shape=1,3,4,3:1,3,4,3:1,3,4,3:1,3,4,3' '--output_layer=x,y'
and
bazel-bin/tensorflow/contrib/lite/testing/tflite_diff_example_test --tensorflow_model="/tensorflow/shared/exported/tflite_graph.pb" --tflite_model="/tensorflow/shared/exported/detect.tflite" --input_layer=a,b,c,d --input_layer_type=float,float,float,float --input_layer_shape=1,3,4,3:1,3,4,3:1,3,4,3:1,3,4,3 --output_layer=x,y
Both ways are failing. Errors:
way:
tflite_diff_example_test.cc:line 1: /bazel: Is a directory
tflite_diff_example_test.cc: line 3: syntax error near unexpected token '('
tflite_diff_example_test.cc: line 3: 'Licensed under the Apache License, Version 2.0 (the "License");'
/root/.cache/bazel/_bazel_root/68a62076e91007a7908bc42a32e4cff9/external/bazel_tools/tools/test/test-setup.sh: line 184: /tensorflow/: Is a directory
/root/.cache/bazel/_bazel_root/68a62076e91007a7908bc42a32e4cff9/external/bazel_tools/tools/test/test-setup.sh: line 276: /tensorflow/: Is a directory
way:
2018-09-10 09:34:27.650473: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
Failed to create session. Op type not registered 'TFLite_Detection_PostProcess' in binary running on d36de5b65187. Make sure the Op and Kernel are registered in the binary running in this process. Note that if you are loading a saved graph which used ops from tf.contrib, accessing (e.g.)tf.contrib.resamplershould be done before importing the graph, as contrib ops are lazily registered when the module is first accessed.
I would really appreciate any help, that enables me to compare the output of two graphs using tensorflows given tests.
The second way you mentioned is the correct way to use tflite_diff. However, the object detection model containing the TFLite_Detection_PostProcess op cannot be run via tflite_diff.
tflite_diff runs the provided TensorFlow (.pb) model in the TensorFlow runtime and runs the provided TensorFlow Lite (.tflite) model in the TensorFlow Lite runtime. In order to run the .pb model in the TensorFlow runtime, all of the operations must be implemented in TensorFlow.
However, in the model you provided, the TFLite_Detection_PostProcess op is not implemented in TensorFlow runtime - it is only available in the TensorFlow Lite runtime. Therefore, TensorFlow cannot resolve the op. Therefore, you unfortunately cannot use the tflite_diff tool with this model.
I'm using tensorflow to train a model and predict, and use htop on ubuntu to monitor cpu usage. predict is very slow, I just can't bear it. htop shows that cpu color is almost red, which means almost all cpu resource is used by system kernel threads, but cpu usage is 0% before tensorflow start.
I have not changed the thread_num, I'm using tensorflow v0.11 on ubuntu14.04.
The problem is that default glibc malloc is not efficient for small allocations. Also, because Google develops/tests tensorflow with tcmalloc internally, bad interactions with regular malloc don't get ironed out. The solution is to run TensorFlow with tcmalloc.
sudo apt-get install google-perftools
export LD_PRELOAD="/usr/lib/libtcmalloc.so.4"
python ...
If you're looking for something to improve the inference performance, I could recommend trying OpenVINO. It improves your model's accuracy by converting it to Intermediate Representation (IR), conducting graph pruning, and fusing certain operations into others. Then, in runtime, it uses vectorization. OpenVINO is optimized for Intel hardware, although it should work with any CPU.
It's rather straightforward to convert the Tensorflow model to OpenVINO unless you have fancy custom layers. The full tutorial on how to do it can be found here. Some snippets are below.
Install OpenVINO
The easiest way to do it is using PIP. Alternatively, you can use this tool to find the best way in your case.
pip install openvino-dev[tensorflow]
Use Model Optimizer to convert SavedModel model
The Model Optimizer is a command-line tool that comes from OpenVINO Development Package. It converts the Tensorflow model to IR, which is a default format for OpenVINO. You can also try the precision of FP16, which should give you better performance without a significant accuracy drop (just change data_type). Run in the command line:
mo --saved_model_dir "model" --data_type FP32 --output_dir "model_ir"
Run the inference
The converted model can be loaded by the runtime and compiled for a specific device, e.g., CPU or GPU (integrated into your CPU like Intel HD Graphics). If you don't know what the best choice for you is, use AUTO. If you care about latency, I suggest adding a performance hint (as shown below) to use the device that fulfills your requirement. If you care about throughput, change the value to THROUGHPUT or CUMULATIVE_THROUGHPUT.
# Load the network
ie = Core()
model_ir = ie.read_model(model="model_ir/model.xml")
compiled_model_ir = ie.compile_model(model=model_ir, device_name="AUTO", config={"PERFORMANCE_HINT":"LATENCY"})
# Get output layer
output_layer_ir = compiled_model_ir.output(0)
# Run inference on the input image
result = compiled_model_ir([input_image])[output_layer_ir]
Disclaimer: I work on OpenVINO.
Usually after I trained my model, I would use the same GPU to do so.
However, do we still need a GPU instance for inference if I were to want to serve it online as a service? Or would a CPU instance suffice?
Thanks.
The device would be cleared when you export a model. Here is the unit test for this feature: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/saved_model/saved_model_test.py#L564
Copy from comment: GPU is fast when processing a large batch. When making inference for a single input, CPU is fast enough.
In many cases, the CPU should be enough. But if it isn't, you can further optimize the inference with some toolkits such as OpenVINO. OpenVINO is optimized for Intel hardware but it should work with any CPU. It optimizes your model by converting to Intermediate Representation (IR), performing graph pruning and fusing some operations into others while preserving accuracy. Then it uses vectorization in runtime.
Here are some performance benchmarks for various models and CPUs.
It's rather straightforward to convert the Tensorflow model to OpenVINO unless you have fancy custom layers. The full tutorial on how to do it can be found here. Some snippets below.
Install OpenVINO
The easiest way to do it is using PIP. Alternatively, you can use this tool to find the best way in your case.
pip install openvino-dev[tensorflow2]
Use Model Optimizer to convert SavedModel model
The Model Optimizer is a command-line tool that comes from OpenVINO Development Package. It converts the Tensorflow model to IR, which is a default format for OpenVINO. You can also try the precision of FP16, which should give you better performance without a significant accuracy drop (just change data_type). Run in the command line:
mo --saved_model_dir "model" --input_shape "[1, 3, 224, 224]" --data_type FP32 --output_dir "model_ir"
Run the inference
The converted model can be loaded by the runtime and compiled for a specific device e.g. CPU or GPU (integrated into your CPU like Intel HD Graphics). If you don't know what is the best choice for you, just use AUTO.
# Load the network
ie = Core()
model_ir = ie.read_model(model="model_ir/model.xml")
compiled_model_ir = ie.compile_model(model=model_ir, device_name="CPU")
# Get output layer
output_layer_ir = compiled_model_ir.output(0)
# Run inference on the input image
result = compiled_model_ir([input_image])[output_layer_ir]
Disclaimer: I work on OpenVINO.