Is there a standard way to optimize models to run well on different mobile devices? - tensorflow

I’m working on a few side projects that involve deploying ML models to the edge. One of them is a photo-editing app that includes CNN’s for facial recognition, object detection, classification, and style transfer. The other is a NLP app that assists in the writing process by suggesting words and sentence completions..
Once I have a trained model that’s accurate, it ends up being really slow on one or more mobile devices that I'm testing on (usually the lower end Android). I’ve read that there are optimizations one can do to speed models up, but I don’t know how. Is there a standard, go-to tool for optimizing models for mobile/edge?

I will be talking about TensorFlow Lite specifically it is a platform for running TensorFlow ops on Android and iOS. There are several optimisation techniques mentioned on their website but I will discuss the ones which feel important to me.
Constructing relevant models for platforms:
The first step in model optimization is its construction from scratch meaning TensorFlow. We need to create a model which can be used exported to a memory constrained device.
We definitely need to train different models for different machines. A model constructed to work on a high-end TPU will never run efficiently on a Mobile processor.
Create a model which has minimum layers and ops.
Do this without compromising the model's accuracy.
For this, you will need expertise in ML and also which ops are the best to preprocess data.
Also, extra preprocessing of input data brings down the model complexity to a great extent.
Model quantization:
We convert the high precision floats or decimals to lower precision floats. It affects the model's performance slightly but greatly reduces the model size and then holds less of the memory.
Post-training quantization is a general technique to reduce model size while also providing up to 3x lower latency with little degradation in model accuracy. Post-training quantization quantizes weights from floating point to 8-bits of precision - from TF docs.
You can see the TensorFlow Lite TFLiteConverter example:
import tensorflow as tf
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.OPTIMIZE_FOR_SIZE]
tflite_quant_model = converter.convert()
Also you should try using the post_training_quantize= flag which reduces the model size considerably.
Hope it helps.

Related

In object detection with SSD Mobilenetv2 + FPN(TF2 Object detection API), what are the possible reasons for hard example mining to not work well?

I am using the Tensorflow Object Detection API to build a detection model. Hard example mining seemed to work really well with SSD+Mobilenetv2 model(used with the TF1 version of the API). However with similar settings in the TF2 version with FPN SSD+Mobilenetv2+FPN model, I achieve similar metrics for mAP on relevant category but see a lot more false positives in evaluation even after adding hard example mining. What could be the possible reasons for that, any other ways to reduce false positives?
Solution is to not use negative hard mining, but instead use focal loss. This is in line with the Focal Loss / RetinaNet paper : https://arxiv.org/pdf/1708.02002v2.pdf

YoloV3 deployment on JETSON TX2

I faced problem regarding Yolo object detection deployment on TX2.
I use pre-trained Yolo3 (trained on Coco dataset) to detect some limited objects (I mostly concern on five classes, not all the classes), The speed is low for real-time detection, and the accuracy is not perfect (but acceptable) on my laptop. I’m thinking to make it faster by multithreading or multiprocessing on my laptop, is it possible for yolo?
But my main problem is that algorithm is not running on raspberry pi and nvidia TX2.
Here are my questions:
In general, is it possible to run yolov3 on TX2 without any modification like accelerators and model compression techniques?
I cannot run the model on TX2. Firstly I got error regarding camera, so I decided to run the model on a video, this time I got the 'cannot allocate memory in static TLS block' error, what is the reason of getting this error? the model is too big. It uses 16 GB GPU memory on my laptop.The GPU memory of raspberry and TX2 are less than 8GB. As far as I know there are two solutions, using a smaller model or using tensor RT or pruning. Do you have any idea if there is any other way?
if I use tiny-yolo I will get lower accuracy and this is not what I want. Is there any way to run any object detection model with high performance for real-time in terms of both accuracy and speed (FPS) on raspberry pi or NVIDIA TX2?
If I clean the coco data for just the objects I concern and then train the same model, I would get higher accuracy and speed but the size would not change, Am I correct?
In general, what is the best model in terms of accuracy for real-time detection and what is the best in terms of speed?
How is Mobilenet? Is it better than YOLOs in terms of both accuracy and speed?
1- Yes it is possible. I already run Yolov3 on Jetson Nano.
2- It depends on model and input resolution of data. You can decrease input resolution. Input images are transferred to GPU VRAM to use on model. Big input sizes can allocate much memory. As far as I remember I have run normal Yolov3 on Jetson Nano(which is worse than tx2) 2 years ago. Also, you can use Yolov3-tiny and Tensorrt as you mention. There are many sources on the web like this & this.
3- I suggest you to have a look at here. In this repo, you can make transfer learning with your dataset & optimize the model with TensorRT & run it on Jetson.
4- Size not dependent to dataset. It depend the model architecture(because it contains weights). Speed probably does not change. Accuracy depends on your dataset. It can be better or worser. If any class on COCO is similiar to your dataset's any class, I suggest you to transfer learning.
5- You have to find right model with small size, enough accuracy, gracefully speed. There is not best model. There is best model for your case which depend on also your dataset. You can compare some of the model's accuracy and fps here.
6- Most people uses mobilenet as feature extractor. Read this paper. You will see Yolov3 have better accuracy, SSD with MobileNet backbone have better FPS. I suggest you to use jetson-inference repo.
By using jetson-inference repo, I get enough accuracy on SSD model & get 30 FPS. Also, I suggest you to use MIPI-CSI camera on Jetson. It is faster than USB cameras.
I fixed the problem 1 and 2 only by replacing import order of the opencv and tensorflow inside the script.Now I can run Yolov3 without any modification on tx2. I got average FPS of 3.

Can I use EfficientNetB7 as my baseline model for image recognition?

I have seen many articles that used EfficientNetB0 as their baseline model, but I never saw anyone used EfficientNetB7 yet. From the EfficientNet Github page (https://github.com/qubvel/efficientnet) I saw that EfficientNetB7 achieved a very high accuracy result. Why doesn't everyone just use EfficientNetB7? Is it because of the memory limit or is there any other consideration to use EfficientNetB0?
A baseline is the result of a very basic model or approach to a problem. It is used to compare performance of more complex methods such as larger models, feature engineering or data augmentation.
EfficientNetB0 is used as it is a reliable model for somewhat good accuracy and because it is fast to train due to a low number of parameters.
Using EfficentNetB7 could serve as a baseline model, however when testing non-architecture related changes, such as data augmentation as mentioned earlier, retraining the large network will take longer slowing down your iteration speed.

Strategies for pre-training models for use in tfjs

This is a more general version of a question I've already asked: Significant difference between outputs of deep tensorflow keras model in Python and tensorflowjs conversion
As far as I can tell, the layers of a tfjs model when run in the browser (so far only tested in Chrome and Firefox) will have small numerical differences in the output values when compared to the same model run in Python or Node. The cumulative effect of these small differences across all the layers of the model can cause fairly significant differences in the output. See here for an example of this.
This means a model trained in Python or Node will not perform as well in terms of accuracy when run in the browser. And the deeper your model, the worse it will get.
Therefore my question is, what is the best way to train a model to use with tfjs in the browser? Is there a way to ensure the output will be identical? Or do you just have to accept that there will be small numerical differences and, if so, are there any methods that can be used to train a model to be more resilient to this?
This answer is based on my personal observations. As such, it is debatable and not backed by much evidence. Some things that I follow to get accuracy of 16-bit models close to 32 bit models are:
Avoid using activations that have small upper and lower bounds, such as sigmoid or tanh, for hidden layers. These activations cause the weights of the next layer to become very sensitive to small values, and hence, small changes. I prefer using ReLU for such models. Since it is now the standard activation for hidden layers in most models, you should be using it in any case.
Avoid weight decay and L1/L2 regularizations on weights while training (the kernel_regularizer parameter in keras), since these increase sensitivity of weights. Use Dropout instead, I didn't observe a major drop in performance on TFLite when using it instead of numerical regularizers.

Best case to use tensorflow

I followed all the steps mentioned in the article:
https://stackabuse.com/tensorflow-2-0-solving-classification-and-regression-problems/
Then I compared the results with Linear Regression and found that the error is less (68) than the tensorflow model (84).
from sklearn.linear_model import LinearRegression
logreg_clf = LinearRegression()
logreg_clf.fit(X_train, y_train)
pred = logreg_clf.predict(X_test)
print(np.sqrt(mean_squared_error(y_test, pred)))
Does this mean that if I have large dataset, I will get better results than linear regression?
What is the best situation - when I should be using tensorflow?
Answering your first question, Neural Networks are notoriously known for overfitting on smaller datasets, and here you are comparing the performance of a simple linear regression model with a neural network with two hidden layers on the testing data set, so it's not very surprising to see that the MLP model falling behind (assuming that you are working with relatively a smaller dataset) the linear regression model. Larger datasets will definitely help neural networks in learning more accurate parameters and generalize the phenomena well.
Now coming to your second question, Tensorflow is basically a library for building deep learning models, so whenever you are working on a deep learning problem like image recognition, Natural Language Processing, etc. you need massive computational power and will be processing a ton of data to train your models, and this is where TensorFlow becomes handy, it offers you GPU support which will significantly boost your training process which otherwise becomes practically impossible. Moreover, if you are building a product that has to be deployed in a production environment for it to be consumed, you can make use of TensorFlow Serving which helps you to take your models much closer to the customers.