Regarding quantization-aware training in TF, is it necessary to use "tf.contrib.quantize.create_eval_graph()" if we are not exporting the model?

I am wondering: in TensorFlow, if we do quantization-aware training (QAT) by introducing fake-quant nodes (using the tf.contrib.quantize.create_training_graph() method), can we, after finishing the training process, run inference on the resulting graph without using the tf.contrib.quantize.create_eval_graph() method?
In other words, after introducing fake quantization nodes and training, is it necessary to call tf.contrib.quantize.create_eval_graph() before evaluating the trained computational graph? Can we query the TensorFlow graph (which contains fake quantization nodes) in a TensorFlow session without using tf.contrib.quantize.create_eval_graph()?
In short, what is the function of tf.contrib.quantize.create_eval_graph()?
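For context, here is a minimal sketch of how the two calls are usually paired in the TF 1.x contrib workflow; build_model and the checkpoint handling are placeholders of mine, not part of the original question:

    import tensorflow as tf

    # Training graph: create_training_graph() inserts fake-quant nodes that
    # simulate 8-bit quantization while everything still runs in float.
    train_graph = tf.Graph()
    with train_graph.as_default():
        logits = build_model(is_training=True)          # hypothetical model fn
        tf.contrib.quantize.create_training_graph(input_graph=train_graph,
                                                  quant_delay=2000)
        # ...define loss/optimizer, run training, save a checkpoint...

    # Eval graph: create_eval_graph() rewrites the graph into the form the
    # TFLite converter expects, so the learned min/max ranges can be frozen.
    eval_graph = tf.Graph()
    with eval_graph.as_default():
        logits = build_model(is_training=False)
        tf.contrib.quantize.create_eval_graph(input_graph=eval_graph)
        # ...restore the checkpoint, freeze the graph, run tflite_convert...

As I understand it, running session.run() directly on the training graph is still ordinary float inference through the fake-quant nodes; create_eval_graph() matters when you want to export a truly quantized model.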

Related

How to optimize a pre-trained TF2.0 model for inference

My goal is to optimize a pre-trained model from TFHub for inference. Therefore I would like to use an object detection model with multiple outputs:
https://tfhub.dev/tensorflow/ssd_mobilenet_v2/fpnlite_640x640/1
where the archive contains a SavedModel file
https://tfhub.dev/tensorflow/ssd_mobilenet_v2/fpnlite_640x640/1?tf-hub-format=compressed
I came across the methods optimize_for_inference and freeze_graph, but read on the following thread that this is no longer supported in TF2:
https://stackoverflow.com/a/56384808/11687201
So how is optimization for inference done with TF2?
The plan is to use one of these pre-trained networks for transfer learning and later run it on a hardware accelerator; the converter for this hardware requires a frozen graph as input.
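One common TF2 replacement for freeze_graph is to fold the variables of a concrete function into constants. A rough sketch, assuming the SavedModel has been extracted to a local directory (paths are placeholders, and whether the hardware converter accepts the result depends on the ops in the detection model):

    import tensorflow as tf
    from tensorflow.python.framework.convert_to_constants import convert_variables_to_constants_v2

    # Load the SavedModel downloaded from TF Hub (local path is a placeholder).
    model = tf.saved_model.load("ssd_mobilenet_v2_fpnlite_640x640")
    concrete_func = model.signatures["serving_default"]

    # Fold the model variables into constants: the TF2 analogue of freeze_graph.
    frozen_func = convert_variables_to_constants_v2(concrete_func)

    # Serialize the frozen GraphDef for tools that still expect a .pb file.
    tf.io.write_graph(frozen_func.graph.as_graph_def(),
                      logdir=".", name="frozen_graph.pb", as_text=False)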

Does TensorFlow's quantization-aware training lead to an actual speedup during training?

We are looking into using quantization-aware training for a research project, to determine the impact of quantization during training on convergence rates and runtimes. We are, however, not yet fully convinced that this is the right tool. Could you please clarify the following points:
1) If a layer is quantized during quantization-aware training, this means inputs and weights are quantized, all operations including the activation function are performed quantized, and then, before being passed on, the outputs are de-quantized to a precision compatible with the next layer. Is this understanding correct?
2) Is quantization-aware training compatible with the TensorBoard profiler?
3) Does quantization-aware training, in principle, lead to a speedup during training in your experience, or is this impossible because it is solely a simulation?
4) Can you point us to a resource on how to add custom quantizers and datatypes to TensorFlow such that they are GPU-compatible?
Thank you very much for your help!
After doing some research: QAT does not speed up training; it only prepares the model so that it can be quantized after training (see the sketch below). MuPPET, however, is an algorithm that actually speeds up training via quantization.
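A tiny sketch (my own illustration, not from the question) of why there is no speedup: the fake-quant op only snaps values onto an 8-bit grid while keeping float32 tensors, so all surrounding ops still run as ordinary float kernels:

    import tensorflow as tf

    x = tf.constant([-1.234, 0.567, 2.891], dtype=tf.float32)

    # Fake quantization rounds values onto an 8-bit grid between min and max,
    # but the result is still a float32 tensor, not an integer one.
    y = tf.quantization.fake_quant_with_min_max_args(x, min=-3.0, max=3.0, num_bits=8)

    print(y.dtype)   # float32 -- simulation only, hence no training speedup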

Convert Training Graph to Inference Graph? (Remove Batch Normalization in TF)

I have been trying to find tools that I can use to remove the batch normalization nodes during the inference phase.
I have used the Graph Transform tool that TF provides, but it doesn't actually remove the batch-norm layers; it just fuses them into the preceding conv layers.
Does anyone have any tools that they use for this purpose?
I would be fine keeping the batch-norm layers, but TFLite does not support them yet.
MobileNet seems to require batch normalization, so how is TFLite able to run MobileNet?
When you convert your TensorFlow graph to a TFLite graph, the tflite_convert utility (or toco) automatically fuses batch norm during conversion. This is part of the general graph transformations applied during conversion.
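A minimal sketch of the conversion the answer refers to, using the TF 1.x Python API; file names and tensor names are placeholders, and the batch-norm folding described above happens inside convert():

    import tensorflow as tf

    # TF 1.x-style conversion from a frozen graph (paths/names are placeholders).
    converter = tf.lite.TFLiteConverter.from_frozen_graph(
        graph_def_file="frozen_graph.pb",
        input_arrays=["input"],
        output_arrays=["output"])
    tflite_model = converter.convert()   # batch-norm folding is applied here

    with open("mobilenet.tflite", "wb") as f:
        f.write(tflite_model)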

Can I add Tensorflow Fake Quantization in a Keras sequential model?

I have searched this for a while, but it seems Keras only offers quantization after the model is trained. I wish to add TensorFlow fake quantization to my Keras sequential model. According to TensorFlow's docs, I need these two functions to do fake quantization: tf.contrib.quantize.create_training_graph() and tf.contrib.quantize.create_eval_graph().
My question is: has anyone managed to add these two functions to a Keras model? If yes, where should these two functions be added? For example, before model.compile, after model.fit, or somewhere else? Thanks in advance.
I worked around this with post-training quantization. Since my final goal is to train a model for mobile devices, instead of doing fake quantization during training, I exported the Keras .h5 file and converted it to a TensorFlow Lite .tflite file directly (with the post_training_quantize flag set to true; see the conversion sketch after the links below). I tested this on a simple CIFAR-10 model. The original Keras model and the quantized tflite model have very close accuracy (the quantized one is a bit lower).
Post-training quantization: https://www.tensorflow.org/performance/post_training_quantization
Convert Keras model to tensorflow lite: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/lite/toco/g3doc/python_api.md
I used the tf-nightly TensorFlow build here: https://pypi.org/project/tf-nightly/
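A minimal sketch of that conversion, assuming a TF 1.x / tf-nightly build where TFLiteConverter still exposes the post_training_quantize flag (newer releases use converter.optimizations instead); the .h5 path is a placeholder:

    import tensorflow as tf

    # Convert a trained Keras model to TFLite with post-training weight quantization.
    converter = tf.lite.TFLiteConverter.from_keras_model_file("cifar10_model.h5")
    converter.post_training_quantize = True   # newer API: converter.optimizations = [tf.lite.Optimize.DEFAULT]
    tflite_model = converter.convert()

    with open("cifar10_quant.tflite", "wb") as f:
        f.write(tflite_model)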
If you still want to do fake quantization (because for some models post-training quantization may give poor accuracy, according to Google), the original webpage went down last week, but you can find it on GitHub: https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/quantize
Update: it turns out post-training quantization does not really quantize the model; during inference it still uses float32 kernels for the calculations. Thus, I've switched to quantization-aware training. The accuracy is pretty good for my CIFAR-10 model.
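For the quantization-aware route, one ordering that is commonly suggested (a sketch under the assumption of TF 1.x graph mode with tf.contrib available, not an official recipe) is to rewrite the graph backing the Keras session after the model is built but before compile()/fit():

    import numpy as np
    import tensorflow as tf
    from tensorflow.keras import backend as K

    # Build the Keras model first (tiny stand-in model).
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(8, 3, activation="relu", input_shape=(32, 32, 3)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])

    # Insert fake-quant nodes into the graph backing the Keras session,
    # then initialize the extra min/max variables they create.
    sess = K.get_session()
    tf.contrib.quantize.create_training_graph(input_graph=sess.graph, quant_delay=0)
    sess.run(tf.global_variables_initializer())

    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    model.fit(np.random.rand(16, 32, 32, 3).astype("float32"),
              np.random.randint(10, size=16), epochs=1)

For export, the same idea applies with create_eval_graph() in a fresh graph/session before freezing and converting.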

Tensorflow SSD-Mobilenet model accuracy drop after quantization using transform_graph

I am working on the recently released "SSD-Mobilenet" model by Google for object detection.
Model downloaded from following location: https://github.com/tensorflow/models/blob/master/object_detection/g3doc/detection_model_zoo.md
The frozen graph file downloaded from the site is working as expected, however after quantization the accuracy drops significantly (mostly random predictions).
I built tensorflow r1.2 from source, and used following method to quantize:
bazel-bin/tensorflow/tools/graph_transforms/transform_graph \
  --in_graph=frozen_inference_graph.pb \
  --out_graph=optimized_graph.pb \
  --inputs='image_tensor' \
  --outputs='detection_boxes','detection_scores','detection_classes','num_detections' \
  --transforms='add_default_attributes strip_unused_nodes(type=float, shape="1,224,224,3") fold_constants(ignore_errors=true) fold_batch_norms fold_old_batch_norms quantize_weights strip_unused_nodes sort_by_execution_order'
I tried various combinations in the "transforms" part, and the transforms listed above sometimes gave correct predictions, but nowhere close to the original model.
Is there any other way to improve performance of the quantized model?
In this case SSD uses MobileNet as its feature extractor in order to increase speed. If you read the MobileNet paper, it is a lightweight convolutional neural network that uses separable convolutions specifically to reduce the number of parameters.
As I understand it, separable convolution can lose information because of the channel-wise convolution.
So when quantizing a graph, the TF implementation makes ops 16-bit and weights 8-bit. If you read the TF tutorial on quantization, they clearly mention that this operation is more like adding some noise to an already-trained net, hoping the model has generalized well.
So this works really well, and is almost lossless in terms of accuracy, for a heavy model like Inception, ResNet, etc. But with the lightness and simplicity of SSD with MobileNet, it really can cause an accuracy loss.
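To make the "adding noise" intuition concrete, here is a small standalone sketch (my own illustration, not from the answer) of 8-bit weight quantization and the rounding error it introduces:

    import numpy as np

    w = np.random.randn(1000).astype(np.float32)      # example float weights
    lo, hi = w.min(), w.max()
    step = (hi - lo) / 255.0                          # 256 representable levels

    w_q = np.round((w - lo) / step).astype(np.uint8)  # quantize to 8 bits
    w_dq = w_q.astype(np.float32) * step + lo         # de-quantize back to float

    # The reconstruction error is bounded by ~step/2: effectively a small
    # noise added to weights the network was never trained to tolerate.
    print("max abs error:", np.abs(w - w_dq).max())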
MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications
How to Quantize Neural Networks with TensorFlow