SSD-shufflenet-V2-FPN is Slower than Mobilenet V2 - tensorflow

I have write some custom code on tensorflow-
models/object_detection to implement the SSD-shufflenet-v2-FPN (based on shufflenet v2 1.0) and SSD-mobilenet-v2-FPN (based on mobilenet v2 1.0). I trained them on my own data set.
Their precision is similar, but the performance speed varies greatly:
SSD-shufflenet-v2-fpn takes three times as long as SSD-mobilenet-v2-fpn when using the same input.(With 1080*1920 input,4 * ARM Cortex-A72 Cores and Android 8.0,SSD-shufflenet-v2-fpn cost 1200ms per image,SSD-mobilenet-v2-fpn just 400ms)
I tried to replace my code with a third-party basic network structure - Nothing changed.
In the shufflenet v2's paper, shufflenet v2 1.0 is much faster than mobilenet v2 1.0, either on GPU or ARM.Has anyone tried these two networks?
ps: Sorry, I have no condition to test the performance and coco classification performance of the basic network on the imagenet classification. I only have one GTX1080TI that often overheats and is thus too slow to complete these.
The way to implement a modified version of SSD is very simple. After completing the code of the shufflenet v2, modify ssd_mobilenet_v1_fpn_feature_extractor.py

In the original paper of MobileNetV2: (http://openaccess.thecvf.com/content_cvpr_2018/papers/Sandler_MobileNetV2_Inverted_Residuals_CVPR_2018_paper.pdf), there is a comparison between the MobileNetV2 and ShuffleNet architectures, in section 6.2.
One may observe that the CPU time on ShuffleNet is not measured, because, as the authors mention that, there wasn't an efficient implementation of group convolution and shuffling operations implemented in Tensorflow mobile framework when experiments took place. That may be the reason in your case too.

Related

Can I use EfficientNetB7 as my baseline model for image recognition?

I have seen many articles that used EfficientNetB0 as their baseline model, but I never saw anyone used EfficientNetB7 yet. From the EfficientNet Github page (https://github.com/qubvel/efficientnet) I saw that EfficientNetB7 achieved a very high accuracy result. Why doesn't everyone just use EfficientNetB7? Is it because of the memory limit or is there any other consideration to use EfficientNetB0?
A baseline is the result of a very basic model or approach to a problem. It is used to compare performance of more complex methods such as larger models, feature engineering or data augmentation.
EfficientNetB0 is used as it is a reliable model for somewhat good accuracy and because it is fast to train due to a low number of parameters.
Using EfficentNetB7 could serve as a baseline model, however when testing non-architecture related changes, such as data augmentation as mentioned earlier, retraining the large network will take longer slowing down your iteration speed.

Is there a standard way to optimize models to run well on different mobile devices?

I’m working on a few side projects that involve deploying ML models to the edge. One of them is a photo-editing app that includes CNN’s for facial recognition, object detection, classification, and style transfer. The other is a NLP app that assists in the writing process by suggesting words and sentence completions..
Once I have a trained model that’s accurate, it ends up being really slow on one or more mobile devices that I'm testing on (usually the lower end Android). I’ve read that there are optimizations one can do to speed models up, but I don’t know how. Is there a standard, go-to tool for optimizing models for mobile/edge?
I will be talking about TensorFlow Lite specifically it is a platform for running TensorFlow ops on Android and iOS. There are several optimisation techniques mentioned on their website but I will discuss the ones which feel important to me.
Constructing relevant models for platforms:
The first step in model optimization is its construction from scratch meaning TensorFlow. We need to create a model which can be used exported to a memory constrained device.
We definitely need to train different models for different machines. A model constructed to work on a high-end TPU will never run efficiently on a Mobile processor.
Create a model which has minimum layers and ops.
Do this without compromising the model's accuracy.
For this, you will need expertise in ML and also which ops are the best to preprocess data.
Also, extra preprocessing of input data brings down the model complexity to a great extent.
Model quantization:
We convert the high precision floats or decimals to lower precision floats. It affects the model's performance slightly but greatly reduces the model size and then holds less of the memory.
Post-training quantization is a general technique to reduce model size while also providing up to 3x lower latency with little degradation in model accuracy. Post-training quantization quantizes weights from floating point to 8-bits of precision - from TF docs.
You can see the TensorFlow Lite TFLiteConverter example:
import tensorflow as tf
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.OPTIMIZE_FOR_SIZE]
tflite_quant_model = converter.convert()
Also you should try using the post_training_quantize= flag which reduces the model size considerably.
Hope it helps.

fast.ai equivalent in tensorflow

Is there any equivalent/alternate library to fastai in tensorfow for easier training and debugging deep learning models including analysis on results of trained model in Tensorflow.
Fastai is built on top of pytorch looking for similar one in tensorflow.
The obvious choice would be to use tf.keras.
It is bundled with tensorflow and is becoming its official "high-level" API -- to the point where in TF 2 you would probably need to go out of your way not using it at all.
It is clearly the source of inspiration for fastai to easy the use of pytorch as Keras does for tensorflow, as mentionned by the authors time and again:
Unfortunately, Pytorch was a long way from being a good option for part one of the course, which is designed to be accessible to people with no machine learning background. It did not have anything like the clear simple API of Keras for training models. Every project required dozens of lines of code just to implement the basics of training a neural network. Unlike Keras, where the defaults are thoughtfully chosen to be as useful as possible, Pytorch required everything to be specified in detail. However, we also realised that Keras could be even better. We noticed that we kept on making the same mistakes in Keras, such as failing to shuffle our data when we needed to, or vice versa. Also, many recent best practices were not being incorporated into Keras, particularly in the rapidly developing field of natural language processing. We wondered if we could build something that could be even better than Keras for rapidly training world-class deep learning models.

Why the MobileNetV2 is faster than MobileNetV1 only at mobile device?

I am studying about Google's brandnew MobileNetV2 architecture.
During studying, I've read this string at Tensorflow model zoo Github
'For example Mobilenet V2 is faster on mobile devices than Mobilenet V1, but is slightly slower on desktop GPU.'
So, my question is,
How that could be possible? I really want to know why.
From https://arxiv.org/abs/1903.08469v1 :
"However, MobileNet V2 uses depthwise separable convolutions which are not directly supported in GPU firmware (the cuDNN library). Therefore, MobileNet V2 tends to be slower than ResNet18 in most experimental setups. Note that the same issue disqualifies usage of the DenseNet architecture [12], since it requires efficient convolution over a non-contiguous tensor, which is still not supported in cuDNN."
From their published paper at MobileNetV2: Inverted Residuals and Linear Bottlenecks,
under subtopic number 5: Implementation Notes, 5.1. Memory efficient inference;
The inverted residual bottleneck layers allow a particularly
memory efficient implementation which is very
important for mobile applications. (and more in paper)
According to TensorFlow team, it's optimized smaller in size can also be used as TF Lite. As far as we know TF Lite is indeed for mobile use. It's much slower on desktop GPU probably V2 has more conv layers compared to V1 which make sense if the training tooks more times to finish. For now, we didn't do the training and inferencing of data on mobile because of computational speed hunger which lead to power hunger as well.
Hope I answer the question.

Using scikit learn for Neural Networks vs Tensorflow in training

I was implementing some sample Neural networks and in most tutorials saw this statement.
Neural networks tend to work better on GPUs than on CPU.
The scikit-learn framework isn’t built for GPU optimization.
So does this statement (work better) refers solely regarding the train phase of a neural network or it includes the prediction part also. Would greatly appreciate some explanation on this.
That statement refers to the training phase. The only issue here is that you can explore the search space of feasible models in a more efficient way using a GPU so you will probably find better models in less time. However, this is only related to computational costs and not to model predictive performance.