Is Elu int8 quantisation working on TensorFlow Lite?

Context :
I would like to run inference with a DL model on an Arduino and, since I don't have much memory available, I need to apply post-training int8 quantization to my model.
But the quantization of my model doesn't seem to be working, and the problem seems to be linked to the Elu activation functions in the model.
Indeed, I get no errors during the conversion and quantization of the model in Python, nor during inference on the Arduino, but the memory the model needs on the Arduino remains the same as without quantization.
What I tried :
I retrained a model in which I replaced the Elu activation functions with Relu. Then quantization works: using the line tflInterpreter->arena_used_bytes() on the Arduino, I can see that quantization reduced the memory needed for the model by a factor of 3.
I analysed the quantized model (with Elu) in the Netron app and realised that there are de-quantization and re-quantization steps before and after each call to the Elu function: model de-quantize and re-quantize. I don't understand why this happens, when it doesn't happen with the Relu functions.
Finally, I found this commit on the TensorFlow GitHub, which made me believe that int8 quantization for Elu is implemented: Commit for Elu int8 quantization TF. Nevertheless, it mentions a LUT (lookup table) approach, which I don't understand and which might (?) be linked to the trouble I am facing.
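For reference, a minimal sketch of the kind of full-integer post-training quantization setup described above (the tiny Dense/Elu model and the random representative data are placeholders for your actual model and inputs):

import numpy as np
import tensorflow as tf

# A tiny stand-in model with an Elu activation; replace with your trained model.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="elu", input_shape=(64,)),
    tf.keras.layers.Dense(4),
])

# Representative samples let the converter estimate activation ranges.
# Random data is only a placeholder; real samples give better ranges.
def representative_dataset():
    for _ in range(100):
        yield [np.random.rand(1, 64).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Restricting to int8-only builtins makes the converter fail loudly on any op
# that has no int8 kernel in your TF version, instead of silently keeping it
# in float and wrapping it in Dequantize/Quantize nodes.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
tflite_quant_model = converter.convert()

If this strict conversion errors out on the Elu op in TF 2.5, that at least confirms the op is falling back to float in your version; if it succeeds, Netron should show a fully int8 graph.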
Environment :
TF 2.5.
Training, conversion and quantisation on Colab
Arduino with TFLite version 2.5
Has anyone faced the same kind of trouble when quantizing a model containing Elu? Do you have any idea how to solve this problem?
Thank you very much!

Related

Tensorflow's quantization for RNN and LSTM

In the guide for Quantization Aware Training, I noticed that RNN and LSTM were listed in the roadmap for "future support". Does anyone know if they are supported now?
Is using Post-Training Quantization also possible for quantizing RNN and LSTM? I don't see much information or discussion about it, so I wonder if it is possible now or if it is still in development.
Thank you.
I am currently trying to implement a speech enhancement model in 8-bit integer based on DTLN (https://github.com/breizhn/DTLN). However, when I run inference with the quantized model on no audio / an empty array, it adds a weird waveform on top of the result: a constant signal every 125 Hz. I have checked the other parts of the code and there is no problem; it boils down to the quantization process with the RNN/LSTM.

Optimize Tensorflow Object Detection Model V2 Centernet Model for Evaluation

I am using the TensorFlow centernet_resnet50_v2_512x512_kpts_coco17_tpu-8 object detection model on an Nvidia Tesla P100 to extract bounding boxes and keypoints for detecting people in a video. Using the pre-trained model from tensorflow.org, I am able to process about 16 frames per second. Is there any way I can improve the evaluation speed for this model? Here are some ideas I have been looking into:
Pruning the model graph since I am only detecting 1 type of object (people)
Have not been successful in doing this. Changing the label_map when building the model does not seem to improve performance.
Hard coding the input size
Have not found a good way to do this.
Compiling the model to an optimized form using something like TensorRT
Initial attempts to convert to TensorRT did not have any performance improvements.
Batching predictions
It looks like the pre-trained model has the batch size hard coded to 1, and so far when I try to change this using the model_builder I see a drop in performance.
My GPU utilization is about 75%, so I don't know if there is much to gain here.
TensorRT should in most cases give a large increase in frames per second compared to TensorFlow.
centernet_resnet50_v2_512x512_kpts_coco17_tpu-8 can be found in the TensorFlow Model Zoo.
Nvidia has released a blog post describing how to optimize models from the TensorFlow Model Zoo using Deepstream and TensorRT:
https://developer.nvidia.com/blog/deploying-models-from-tensorflow-model-zoo-using-deepstream-and-triton-inference-server/
Now regarding your suggestions:
Pruning the model graph: this can be done by converting your TensorFlow model to a TF-TRT model.
Hardcoding the input size: use the static mode in TF-TRT. This is the default mode and is enabled by is_dynamic_op=False.
Compiling the model: my advice would be to convert your model to TF-TRT, or first to ONNX and then to TensorRT.
Batching: Specifying the batch size is also covered in the NVIDIA blog post.
Lastly, for my model a big increase in performance came from using FP16 (mixed precision) in my inference engine. You could even try INT8, although then you first have to calibrate.
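For completeness, a minimal TF2-style TF-TRT conversion sketch (assuming TensorFlow 2.x built with TensorRT support; the SavedModel paths are placeholders, and exact parameter names vary a bit between TF versions):

import tensorflow as tf

SAVED_MODEL_DIR = "centernet_saved_model"  # placeholder: your exported SavedModel
TRT_MODEL_DIR = "centernet_trt_fp16"       # placeholder: output directory

# Convert the SavedModel to a TF-TRT SavedModel with FP16 precision.
params = tf.experimental.tensorrt.ConversionParams(precision_mode="FP16")
converter = tf.experimental.tensorrt.Converter(
    input_saved_model_dir=SAVED_MODEL_DIR,
    conversion_params=params)
converter.convert()
converter.save(TRT_MODEL_DIR)

The converted model is loaded with tf.saved_model.load(TRT_MODEL_DIR) like any other SavedModel; for INT8 you would additionally pass a calibration input function to converter.convert().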

Quantized models in Object Detection API for TF2

I want to migrate my code for fine-tuning an object detection model for inference on Coral devices to TensorFlow 2, but I don't see quantized models in the TF2 model zoo.
Is it possible to fine-tune a model in TF2 for this purpose and use a technique like quantization-aware training or post-training quantization? I haven't seen any related tutorials or issues. I've also seen some reports of issues with quantization with the TFLite converter in TF2, so I'm not even sure if it's possible.
Even for the TF1.x models, the conversions are not that straightforward, and I am struggling with them at the moment. I am pretty disappointed, as I originally thought Coral would have better support for TensorFlow than the Intel NCS USB stick, since Coral and TensorFlow come from the same family. However, it seems that I was wrong.
Quantization-aware training is possible by adding a graph_rewriter block at the end of the config file before fine-tuning the pretrained model:
graph_rewriter {
  quantization {
    delay: 48000
    weight_bits: 8
    activation_bits: 8
  }
}
Source: https://neuralet.com/article/quantization-of-tensorflow-object-detection-api-models/

"Model not quantized" even after post-training quantization

I downloaded a TensorFlow model from Custom Vision and want to run it on a Coral TPU. I therefore converted it to TensorFlow Lite and applied hybrid post-training quantization (as far as I know, that's the only option because I do not have access to the training data).
You can see the code here: https://colab.research.google.com/drive/1uc2-Yb9Ths6lEPw6ngRpfdLAgBHMxICk
When I then try to compile it for the edge tpu, I get the following:
Edge TPU Compiler version 2.0.258810407
INFO: Initialized TensorFlow Lite runtime.
Invalid model: model.tflite
Model not quantized
Any idea what my problem might be?
tflite models are not fully quantized using converter.optimizations = [tf.lite.Optimize.OPTIMIZE_FOR_SIZE]. You might have a look at post-training full integer quantization using a representative dataset: https://www.tensorflow.org/lite/performance/post_training_quantization#full_integer_quantization_of_weights_and_activations Simply adapt your generator function to yield representative samples (e.g. images similar to what your image classification network should predict). Very few images are enough for the converter to identify min and max values and quantize your model. However, accuracy is typically a bit lower than with quantization-aware training.
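As a rough sketch of that full-integer flow for the Edge TPU (the SavedModel path, input shape and random samples below are placeholders; feed images similar to your real data instead):

import numpy as np
import tensorflow as tf

SAVED_MODEL_DIR = "custom_vision_saved_model"  # placeholder: adjust to your exported model

# A few dozen representative samples are enough to estimate activation ranges.
def representative_dataset():
    for _ in range(100):
        # Placeholder random data; use images close to what the network will see.
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model(SAVED_MODEL_DIR)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Force full integer quantization so the Edge TPU compiler accepts the model.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8
tflite_model = converter.convert()

with open("model_int8.tflite", "wb") as f:
    f.write(tflite_model)

The resulting model_int8.tflite can then be passed to the edgetpu_compiler.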
I can't find the source, but I believe the Edge TPU currently only supports 8-bit quantized models, and no hybrid operators.
EDIT: In Coral's FAQ they mention that the model needs to be fully quantized.
You need to convert your model to TensorFlow Lite and it must be quantized using either quantization-aware training (recommended) or full integer post-training quantization.

Can I add Tensorflow Fake Quantization in a Keras sequential model?

I have searched for this for a while, but it seems Keras only offers quantization after the model is trained. I wish to add TensorFlow fake quantization to my Keras sequential model. According to TensorFlow's docs, I need these two functions to do fake quantization: tf.contrib.quantize.create_training_graph() and tf.contrib.quantize.create_eval_graph().
My question is: has anyone managed to add these two functions to a Keras model? If yes, where should they be added? For example, before model.compile, after model.fit, or somewhere else? Thanks in advance.
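For what it's worth, the workaround usually suggested for TF 1.x looks roughly like this (a sketch only, assuming tf.contrib is still available, i.e. a TF 1.x install; the layers are purely illustrative):

import tensorflow as tf
from tensorflow import keras

# Build the Keras model first so its ops exist in the default graph.
model = keras.Sequential([
    keras.layers.Conv2D(32, 3, activation="relu", input_shape=(32, 32, 3)),
    keras.layers.Flatten(),
    keras.layers.Dense(10, activation="softmax"),
])

# Insert fake-quantization nodes into the training graph BEFORE compile/fit.
sess = keras.backend.get_session()
tf.contrib.quantize.create_training_graph(input_graph=sess.graph, quant_delay=0)

model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(x_train, y_train, epochs=5)

# For export: rebuild the model in a fresh graph, call
# tf.contrib.quantize.create_eval_graph() on that graph, restore the trained
# weights, then freeze the graph and convert it with the TFLite converter.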
I worked around it with post-training quantization. Since my final goal is to train a model for a mobile device, instead of fake quantization during training I exported the Keras .h5 file and converted it directly to a TensorFlow Lite .tflite file (with the post_training_quantize flag set to true; a sketch of this conversion is shown further below). I tested this on a simple CIFAR-10 model. The original Keras model and the quantized tflite model have very close accuracy (the quantized one is a bit lower).
Post-training quantization: https://www.tensorflow.org/performance/post_training_quantization
Convert Keras model to tensorflow lite: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/lite/toco/g3doc/python_api.md
Used the tf-nightly tensorflow here: https://pypi.org/project/tf-nightly/
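The direct .h5-to-.tflite conversion described above looks roughly like this with the TF 1.x-era API (file names are placeholders; newer TF versions replace the post_training_quantize flag with converter.optimizations = [tf.lite.Optimize.DEFAULT]):

import tensorflow as tf

# Convert a trained Keras model to TFLite with (weight-only) post-training quantization.
converter = tf.lite.TFLiteConverter.from_keras_model_file("model.h5")  # placeholder path
converter.post_training_quantize = True
tflite_model = converter.convert()

with open("model_quant.tflite", "wb") as f:
    f.write(tflite_model)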
If you still want to do fake quantization (because for some models post-training quantization may give poor accuracy, according to Google), the original webpage went down last week, but you can find it on GitHub: https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/quantize
Update: it turns out post-training quantization does not really quantize the model. During inference, it still uses float32 kernels for the calculations. Thus, I've switched to quantization-aware training. The accuracy is pretty good for my CIFAR-10 model.