Why is MobileNetV2 faster than MobileNetV1 only on mobile devices? - tensorflow

I am studying Google's brand-new MobileNetV2 architecture.
While studying, I read this statement on the TensorFlow model zoo GitHub page:
'For example Mobilenet V2 is faster on mobile devices than Mobilenet V1, but is slightly slower on desktop GPU.'
So, my question is: how is that possible? I really want to know why.

From https://arxiv.org/abs/1903.08469v1 :
"However, MobileNet V2 uses depthwise separable convolutions which are not directly supported in GPU firmware (the cuDNN library). Therefore, MobileNet V2 tends to be slower than ResNet18 in most experimental setups. Note that the same issue disqualifies usage of the DenseNet architecture [12], since it requires efficient convolution over a non-contiguous tensor, which is still not supported in cuDNN."

From their published paper at MobileNetV2: Inverted Residuals and Linear Bottlenecks,
under subtopic number 5: Implementation Notes, 5.1. Memory efficient inference;
The inverted residual bottleneck layers allow a particularly
memory efficient implementation which is very
important for mobile applications. (and more in paper)
According to the TensorFlow team, V2 is optimized to be smaller in size so that it can also be used with TF Lite, and TF Lite is, as far as we know, indeed aimed at mobile use. The slightly lower speed on a desktop GPU is probably because V2 has more convolutional layers than V1, which would also explain why training takes longer to finish. For now, training and heavy inference are generally not done on mobile devices because they are compute-hungry, which also makes them power-hungry.
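To make the depthwise-separable point quoted above concrete, here is a small Keras sketch (hypothetical 56x56x128 feature map) that compares the parameter count of a standard convolution with that of a depthwise separable one; the ratio of multiply-adds is roughly the same, which is what matters on a mobile CPU:
import tensorflow as tf

inp = tf.keras.Input(shape=(56, 56, 128))  # hypothetical feature map

# Standard 3x3 convolution: 3*3*128*128 weights + 128 biases = 147,584 params
std = tf.keras.layers.Conv2D(128, 3, padding="same")(inp)

# Depthwise separable 3x3 convolution (the MobileNet building block):
# 3*3*128 depthwise + 1*1*128*128 pointwise + 128 biases = 17,664 params
sep = tf.keras.layers.SeparableConv2D(128, 3, padding="same")(inp)

print(tf.keras.Model(inp, std).count_params())  # 147584
print(tf.keras.Model(inp, sep).count_params())  # 17664
The separable block does roughly 8x less work per layer, which is why MobileNets can afford to stack more of these cheaper layers; on a desktop GPU, however, depthwise kernels are less efficiently supported by cuDNN, so that advantage shrinks or even reverses.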
Hope this answers the question.

Related

Assign Keras/TF/PyTorch layer to hardware type

Suppose we have the following architecture:
Multiple CNN layers
RNN layer
(Time-distributed) Dense classification layer
We want to train this architecture. Our fancy GPU is very fast at the CNN layers: although it runs at a lower clock rate, it can perform many convolutions in parallel, hence the speed. Our fancy CPU, however, is faster for the (very long) resulting time series, because the time steps cannot be parallelized and the processing profits from the higher CPU clock rate. So the (supposedly) smart execution plan would look like this:
Multiple CNN layers (run on GPU)
RNN layer (run on CPU)
(Time-distributed) Dense classification layer (run on GPU/CPU)
This leads me to two important questions:
Is it possible, with any of the frameworks mentioned in the title, to distribute certain layers to certain hardware, and how?
If it is possible, would the overhead of the additional memory operations, e.g. transferring between GPU and CPU RAM, render the whole idea useless?
Basically, in PyTorch you can control the device on which variables/parameters reside. AFAIK, it is your responsibility to make sure that for each operation all the arguments reside on the same device: i.e., you cannot conv(x, y) where x is on the GPU and y is on the CPU.
This is done via PyTorch's .to() method, which moves a module/variable with .to('cpu') or .to('cuda:0').
As Shai mentioned, you can control this yourself in PyTorch, so in theory you can have parts of your model on different devices. You then have to move the data between devices in your forward pass.
I think the overhead you mentioned would make the performance worse. The cuDNN RNN implementation benefits greatly from running on a GPU anyway :)
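To illustrate, here is a minimal sketch of what such a split model could look like in PyTorch (hypothetical layer sizes, and assuming a CUDA device is available):
import torch
import torch.nn as nn

class SplitModel(nn.Module):
    def __init__(self):
        super().__init__()
        # CNN part lives on the GPU, RNN and classifier on the CPU (hypothetical sizes)
        self.cnn = nn.Sequential(nn.Conv1d(1, 8, 3), nn.ReLU()).to('cuda:0')
        self.rnn = nn.GRU(input_size=8, hidden_size=16, batch_first=True).to('cpu')
        self.fc = nn.Linear(16, 10).to('cpu')

    def forward(self, x):                         # x: (batch, 1, time)
        feats = self.cnn(x.to('cuda:0'))          # run convolutions on the GPU
        feats = feats.permute(0, 2, 1).to('cpu')  # move activations to CPU RAM
        out, _ = self.rnn(feats)                  # run the RNN on the CPU
        return self.fc(out)                       # time-distributed classification

model = SplitModel()
logits = model(torch.randn(4, 1, 100))            # shape (4, 98, 10)
Whether the copy back to CPU RAM in the middle pays off depends on the activation size and your hardware, so it is worth benchmarking both placements.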

Is there a standard way to optimize models to run well on different mobile devices?

I’m working on a few side projects that involve deploying ML models to the edge. One of them is a photo-editing app that includes CNNs for facial recognition, object detection, classification, and style transfer. The other is an NLP app that assists in the writing process by suggesting words and sentence completions.
Once I have a trained model that’s accurate, it ends up being really slow on one or more of the mobile devices I'm testing on (usually the lower-end Android ones). I’ve read that there are optimizations one can do to speed models up, but I don’t know how. Is there a standard, go-to tool for optimizing models for mobile/edge?
I will be talking specifically about TensorFlow Lite, which is a platform for running TensorFlow ops on Android and iOS. Several optimization techniques are mentioned on their website, but I will discuss the ones that feel most important to me.
Constructing relevant models for platforms:
The first step in model optimization is its construction from scratch, i.e. in TensorFlow itself. We need to create a model that can be exported to a memory-constrained device (see the small sketch after these notes).
We definitely need to train different models for different machines: a model built for a high-end TPU will never run efficiently on a mobile processor.
Create a model which has minimum layers and ops.
Do this without compromising the model's accuracy.
For this, you will need ML expertise as well as a sense of which ops are best for preprocessing the data.
Extra preprocessing of the input data can also bring down model complexity considerably.
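For instance, a deliberately small classifier built only from ops that TF Lite converts cleanly (Conv2D, DepthwiseConv2D, pooling, Dense) might look like this; the input size and layer widths are hypothetical:
import tensorflow as tf

# A small, hypothetical classifier: few layers, only widely supported ops
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, strides=2, activation="relu",
                           input_shape=(96, 96, 3)),
    tf.keras.layers.DepthwiseConv2D(3, strides=2, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.summary()  # inspect the parameter count before exporting to TF Lite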
Model quantization:
We convert high-precision floating-point weights to a lower-precision representation. This may affect the model's accuracy slightly, but it greatly reduces the model size, so it takes up much less memory.
Post-training quantization is a general technique to reduce model size while also providing up to 3x lower latency with little degradation in model accuracy. Post-training quantization quantizes weights from floating point to 8-bits of precision - from TF docs.
Here is a TensorFlow Lite TFLiteConverter example:
import tensorflow as tf

# saved_model_dir points to your exported SavedModel
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
# Enable size-oriented optimization (weight quantization)
converter.optimizations = [tf.lite.Optimize.OPTIMIZE_FOR_SIZE]
tflite_quant_model = converter.convert()
You should also try the post_training_quantize flag, which reduces the model size considerably.
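For reference, on the older TF 1.x-style converter this was a simple boolean attribute (newer versions supersede it with the optimizations list shown above); a minimal sketch, assuming that older API:
# Older TF 1.x-style converter API; superseded by converter.optimizations
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.post_training_quantize = True
tflite_quant_model = converter.convert()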
Hope it helps.

SSD-shufflenet-V2-FPN is Slower than Mobilenet V2

I have written some custom code in tensorflow-models/object_detection to implement SSD-shufflenet-v2-FPN (based on shufflenet v2 1.0) and SSD-mobilenet-v2-FPN (based on mobilenet v2 1.0), and trained them on my own data set.
Their precision is similar, but their speed differs greatly:
SSD-shufflenet-v2-fpn takes three times as long as SSD-mobilenet-v2-fpn on the same input. (With 1080*1920 input, 4 ARM Cortex-A72 cores and Android 8.0, SSD-shufflenet-v2-fpn takes 1200 ms per image, while SSD-mobilenet-v2-fpn takes just 400 ms.)
I tried replacing my code with a third-party implementation of the base network; nothing changed.
In the ShuffleNet v2 paper, shufflenet v2 1.0 is reported to be much faster than mobilenet v2 1.0, both on GPU and on ARM. Has anyone tried these two networks?
PS: Sorry, I don't have the resources to benchmark the base networks on ImageNet classification or COCO. I only have one GTX 1080 Ti, which often overheats and is too slow to complete these experiments.
Implementing a modified version of SSD is very simple: after completing the ShuffleNet v2 code, modify ssd_mobilenet_v1_fpn_feature_extractor.py.
In the original paper of MobileNetV2: (http://openaccess.thecvf.com/content_cvpr_2018/papers/Sandler_MobileNetV2_Inverted_Residuals_CVPR_2018_paper.pdf), there is a comparison between the MobileNetV2 and ShuffleNet architectures, in section 6.2.
One may observe that the CPU time for ShuffleNet is not measured because, as the authors mention, there was no efficient implementation of group convolution and channel shuffling in the TensorFlow mobile framework when the experiments took place. That may be the reason in your case too.
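For what it's worth, the "shuffling operation" referred to there is just a reshape/transpose over the channel dimension; a minimal TensorFlow sketch (NHWC layout, static channel count assumed) looks like this, even though expressing it this way does not make it fast on every backend:
import tensorflow as tf

def channel_shuffle(x, groups):
    # ShuffleNet-style channel shuffle: split channels into groups and interleave them
    n, h, w, c = x.shape
    x = tf.reshape(x, [-1, h, w, groups, c // groups])
    x = tf.transpose(x, [0, 1, 2, 4, 3])
    return tf.reshape(x, [-1, h, w, c])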

Using scikit learn for Neural Networks vs Tensorflow in training

I was implementing some sample neural networks and saw this statement in most tutorials:
Neural networks tend to work better on GPUs than on CPU.
The scikit-learn framework isn’t built for GPU optimization.
Does this statement ("work better") refer solely to the training phase of a neural network, or does it include the prediction part as well? I would greatly appreciate some explanation of this.
That statement refers to the training phase. The point is that with a GPU you can explore the search space of feasible models more efficiently, so you will probably find better models in less time. However, this is only about computational cost, not about the model's predictive performance.

Is everything in Tensorflow implemented as a NN?

For example, Kmeans clustering - is it implemented as a neural network algorithm?
No, why should it be? To better understand TensorFlow, take a look at the original paper; the abstract states:
TensorFlow [1] is an interface for expressing machine learning
algorithms, and an implementation for executing such algorithms. A
computation expressed using TensorFlow can be executed with little or
no change on a wide variety of heterogeneous systems, ranging from
mobile devices such as phones and tablets up to large-scale
distributed systems of hundreds of machines and thousands of
computational devices such as GPU cards.
Hence TensorFlow is a tool for expressing algorithms and scheduling them on hardware such as CPUs, GPUs, TPUs and friends. The fact that it is best known for running neural networks doesn't mean that even the simplest algorithms have to be implemented as one; k-means, for instance, can be expressed directly with plain tensor operations (sketched below).
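As a small illustration that TensorFlow is not tied to neural networks, here is one k-means assignment/update step written with plain tensor ops (toy random data, k = 3):
import tensorflow as tf

points = tf.random.uniform((100, 2))   # 100 toy points in 2-D
centroids = tf.random.uniform((3, 2))  # 3 initial centroids

# Assignment step: index of the nearest centroid for every point
dists = tf.norm(points[:, None, :] - centroids[None, :, :], axis=-1)  # (100, 3)
assignments = tf.argmin(dists, axis=1)

# Update step: each new centroid is the mean of the points assigned to it
new_centroids = tf.stack([
    tf.reduce_mean(tf.boolean_mask(points, tf.equal(assignments, k)), axis=0)
    for k in range(3)
])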