I am trying to convert a trained model from a checkpoint file to tflite. I am using tf.lite.TFLiteConverter. The float conversion went fine, with reasonable inference speed. But the inference speed of the INT8 conversion is very slow. I tried to debug by feeding in a very small network, and found that the inference speed of the INT8 model is generally slower than that of the float model.
In the INT8 tflite file, I found some tensors called ReadVariableOp, which don't exist in TensorFlow's official MobileNet tflite model.
I wonder what causes the slowness of INT8 inference.
You possibly ran it on an x86 CPU instead of one with ARM instructions. You can refer to this issue: https://github.com/tensorflow/tensorflow/issues/21698#issuecomment-414764709
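To see why the architecture matters, it helps to look at what an int8 tensor actually stores. TFLite uses an affine scheme, real = scale * (q - zero_point), and every quantized op has to requantize its results; without the ARM-optimized kernels, that extra arithmetic runs through slow reference paths. A pure-Python sketch of the scheme (the scale and zero-point values here are made up for illustration):

```python
def quantize(x, scale, zero_point):
    """Map a real value to int8 storage: q = round(x / scale) + zero_point."""
    q = round(x / scale) + zero_point
    return max(-128, min(127, q))  # clamp to the int8 range

def dequantize(q, scale, zero_point):
    """Recover the (approximate) real value: x = scale * (q - zero_point)."""
    return scale * (q - zero_point)

# Round-tripping a weight of 0.5 with a made-up scale of 1/64:
q = quantize(0.5, 1 / 64, 0)    # -> 32
x = dequantize(q, 1 / 64, 0)    # -> 0.5
```

Values outside the representable range saturate at the clamp, which is also where quantization error comes from.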
Related
Context:
I would like to run inference of a DL model on an Arduino and, since I don't have much memory available, I need to apply post-training int8 quantization to my model.
But the quantization of my model doesn't seem to work, and the problem seems to be linked to the Elu activation functions in the model.
Indeed, I get no error during either the conversion and quantization of the model in Python or the inference on the Arduino, but the memory the model needs on the Arduino remains the same as without quantization.
What I tried:
I retrained a model in which I replaced the Elu activation functions with Relu. Then quantization works: thanks to tflInterpreter->arena_used_bytes() on the Arduino, I can see that quantization reduced the memory the model needs by a factor of 3.
I analysed the model (quantized, with Elu) in the Netron app and realised that there are de-quantization and re-quantization steps before and after each call of the Elu function: model de-quantize and re-quantize. I don't understand why this happens, when it doesn't happen with Relu.
Finally, I found this commit in the TensorFlow Git repository, which made me believe that int8 quantization for Elu is implemented: Commit for Elu int8 quantization TF. Nevertheless, it mentions a LUT approach, which I don't understand and which might (?) be linked to the trouble I am facing.
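The LUT (look-up table) approach mentioned in that commit is simpler than it sounds: an int8 input has only 256 possible values, so the kernel can precompute Elu for all of them once and reduce the op to a table lookup, with no dequantize/requantize pair in the graph. A pure-Python sketch of the idea, with made-up quantization parameters (the real kernel takes them from the tensors):

```python
import math

def elu(x):
    """Float Elu: identity for x >= 0, exp(x) - 1 otherwise."""
    return x if x >= 0.0 else math.exp(x) - 1.0

# Made-up quantization parameters for the input and output tensors.
in_scale, in_zp = 0.05, 0
out_scale, out_zp = 0.05, 0

# Precompute the output for every possible int8 input value.
lut = []
for q in range(-128, 128):
    x = in_scale * (q - in_zp)             # dequantize the int8 code
    y = elu(x)                             # apply the float activation
    q_out = round(y / out_scale) + out_zp  # requantize the result
    lut.append(max(-128, min(127, q_out)))

def elu_int8(q):
    # At inference time the op is just one table lookup per element.
    return lut[q + 128]
```

If your TFLite version predates that commit, the converter has no int8 kernel for Elu and wraps it in Dequantize/Quantize ops instead, which matches what Netron shows.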
Environment:
TF 2.5.
Training, conversion and quantization on Colab
Arduino with TFLite version 2.5
Does anyone face the same kind of trouble when quantizing a model containing Elu? Do you have any idea how to solve this problem?
Thank you very much!
I am using the TensorFlow centernet_resnet50_v2_512x512_kpts_coco17_tpu-8 object detection model on an Nvidia Tesla P100 to extract bounding boxes and keypoints for detecting people in a video. Using the pre-trained model from tensorflow.org, I am able to process about 16 frames per second. Is there any way I can improve the evaluation speed for this model? Here are some ideas I have been looking into:
Pruning the model graph since I am only detecting 1 type of object (people)
Have not been successful in doing this. Changing the label_map when building the model does not seem to improve performance.
Hard coding the input size
Have not found a good way to do this.
Compiling the model to an optimized form using something like TensorRT
Initial attempts to convert to TensorRT did not have any performance improvements.
Batching predictions
It looks like the pre-trained model has the batch size hard coded to 1, and so far when I try to change this using the model_builder I see a drop in performance.
My GPU utilization is about ~75% so I don't know if there is much to gain here.
TensorRT should in most cases give a large increase in frames per second compared to Tensorflow.
centernet_resnet50_v2_512x512_kpts_coco17_tpu-8 can be found in the TensorFlow Model Zoo.
Nvidia has released a blog post describing how to optimize models from the TensorFlow Model Zoo using Deepstream and TensorRT:
https://developer.nvidia.com/blog/deploying-models-from-tensorflow-model-zoo-using-deepstream-and-triton-inference-server/
Now regarding your suggestions:
Pruning the model graph: pruning the model graph can be done by converting your TensorFlow model to a TF-TRT model.
Hardcoding the input size: use the static mode in TF-TRT. This is the default mode, enabled by is_dynamic_op=False.
Compiling the model: my advice would be to convert your model to TF-TRT, or first to ONNX and then to TensorRT.
Batching: Specifying the batch size is also covered in the NVIDIA blog post.
Lastly, for my model a big increase in performance came from using FP16 (mixed precision) in my inference engine. You could even try INT8, although then you first have to calibrate.
Is it possible to re-quantize already quantized models?
I have some models that I have trained with Quantization Aware Training (QAT) and Full Integer Quantization. However, I am failing to do GPU delegation with those models. Is there a way to convert the models I already have to Float16 quantization, so that I can run them with the GPU delegate?
Are you looking for a way to convert an integer quantized model to a float16 quantized model?
Which version of TFLite are you using? TFLite 2.3 supports running quantized models with the GPU delegate. However, as the GPU only supports float operations, it internally dequantizes the integer weights into float weights.
Please see the doc for how to enable the (experimental) quantized model support: https://www.tensorflow.org/lite/performance/gpu_advanced#running_quantized_models_experimental
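If you still have the original float model (checkpoint, SavedModel or Keras model), producing a float16-quantized tflite is a re-conversion with one extra flag rather than a transformation of the int8 file. A minimal sketch with a toy stand-in model (substitute your own trained model for the Sequential here):

```python
import tensorflow as tf

# Toy stand-in model; in practice, load your trained float model instead.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(8,)),
    tf.keras.layers.Dense(4),
])

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
# Store weights as float16; ops still run in float, which suits the GPU delegate.
converter.target_spec.supported_types = [tf.float16]
tflite_model = converter.convert()

with open("model_fp16.tflite", "wb") as f:
    f.write(tflite_model)
```

Note that this only works from the float model; going from an already int8-quantized tflite back to float16 would first lose precision to the int8 rounding.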
I'm trying to compile a tflite model with the edgetpu compiler to make it compatible with Google's Coral USB key, but when I run edgetpu_compiler the_model.tflite I get a Model not quantized error.
I then wanted to quantize the tflite model to an 8-bit integer format, but I don't have the model's original .h5 file.
Is it possible to quantize a tflite-converted model to an 8-bit format?
#garys unfortunately, TensorFlow doesn't have an API to quantize a float tflite model. For post-training quantization, the only APIs they have take full TensorFlow models (.pb, hdf5, h5, saved_model, ...) -> tflite. The quantization happens during tflite conversion, so to my knowledge there isn't a way to do this.
I downloaded a TensorFlow model from Custom Vision and want to run it on a Coral TPU. I therefore converted it to TensorFlow Lite and applied hybrid post-training quantization (as far as I know, that's the only option, because I do not have access to the training data).
You can see the code here: https://colab.research.google.com/drive/1uc2-Yb9Ths6lEPw6ngRpfdLAgBHMxICk
When I then try to compile it for the edge tpu, I get the following:
Edge TPU Compiler version 2.0.258810407
INFO: Initialized TensorFlow Lite runtime.
Invalid model: model.tflite
Model not quantized
Any idea what my problem might be?
tflite models are not fully quantized using converter.optimizations = [tf.lite.Optimize.OPTIMIZE_FOR_SIZE]. Have a look at post-training full integer quantization using a representative dataset: https://www.tensorflow.org/lite/performance/post_training_quantization#full_integer_quantization_of_weights_and_activations Simply adapt your generator function to yield representative samples (e.g. images similar to what your image classification network should predict). Very few images are enough for the converter to identify min and max values and quantize your model. However, accuracy is typically lower than with quantization-aware training.
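A minimal sketch of that full-integer conversion, using a toy stand-in model and random calibration data (substitute your own model and a generator that yields real, representative inputs):

```python
import numpy as np
import tensorflow as tf

# Toy stand-in model; in practice, load the float model you want to deploy.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(8,)),
    tf.keras.layers.Dense(4),
])

def representative_dataset():
    # Yield a handful of samples shaped like real inputs; the converter
    # uses them to calibrate min/max ranges for each activation tensor.
    for _ in range(100):
        yield [np.random.rand(1, 8).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Force full-integer quantization, as the Edge TPU compiler requires.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
tflite_model = converter.convert()
```

With supported_ops pinned to TFLITE_BUILTINS_INT8, conversion fails loudly if any op has no integer implementation, which is preferable to silently producing a hybrid model the Edge TPU compiler will reject.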
I can't find the source, but I believe the Edge TPU currently only supports 8-bit quantized models, and no hybrid operators.
EDIT: Coral's FAQ mentions that the model needs to be fully quantized:
You need to convert your model to TensorFlow Lite and it must be
quantized using either quantization-aware training (recommended) or
full integer post-training quantization.