Can I use EfficientNetB7 as my baseline model for image recognition? - tensorflow

I have seen many articles that use EfficientNetB0 as their baseline model, but I have never seen anyone use EfficientNetB7. On the EfficientNet GitHub page (https://github.com/qubvel/efficientnet) I saw that EfficientNetB7 achieves very high accuracy. Why doesn't everyone just use EfficientNetB7? Is it because of memory limits, or is there some other reason to prefer EfficientNetB0?

A baseline is the result of a very basic model or approach to a problem. It is used to compare performance of more complex methods such as larger models, feature engineering or data augmentation.
EfficientNetB0 is commonly used because it gives reasonably good accuracy and, having few parameters, is fast to train.
EfficientNetB7 could also serve as a baseline model. However, when you test changes unrelated to the architecture, such as the data augmentation mentioned earlier, retraining the larger network takes longer and slows down your iteration speed.
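To get a feel for how much slower iteration with the larger variants can be, here is a rough back-of-the-envelope sketch. It assumes the width, depth and input-resolution settings commonly listed for the EfficientNet family (check your implementation, they are not taken from the question) and the usual approximation that compute grows with depth × width² × resolution². Treat the output as an order-of-magnitude estimate, not a benchmark.

```python
# Width multiplier, depth multiplier and input resolution commonly listed
# for some EfficientNet variants (assumed here; verify against your library).
EFFICIENTNET_SCALING = {
    "B0": (1.0, 1.0, 224),
    "B3": (1.2, 1.4, 300),
    "B5": (1.6, 2.2, 456),
    "B7": (2.0, 3.1, 600),
}


def relative_cost(variant, reference="B0"):
    """Rough compute ratio: cost grows ~ depth * width^2 * resolution^2."""
    width, depth, res = EFFICIENTNET_SCALING[variant]
    ref_width, ref_depth, ref_res = EFFICIENTNET_SCALING[reference]
    return (depth / ref_depth) * (width / ref_width) ** 2 * (res / ref_res) ** 2


for name in EFFICIENTNET_SCALING:
    print(f"{name}: ~{relative_cost(name):.0f}x the compute of B0")
```

Under these assumptions B7 comes out at nearly two orders of magnitude more compute per image than B0, which is why every augmentation or preprocessing experiment becomes much more expensive to repeat.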

Related

When should I stop the object detection model training while mAP are not stable?

I am re-training SSD MobileNet on 900 images from the Berkeley Deep Drive dataset and evaluating on 100 images from the same dataset.
The problem is that after about 24 hours of training, the TotalLoss seems unable to go below 2.0:
And the corresponding mAP score is quite unstable:
In fact, I have tried training for about 48 hours, and the TotalLoss still cannot go below 2.0; it stays somewhere in the range 2.5~3.0. During that time, the mAP is even lower.
So here is my question: given my situation (I really don't need a "high-precision" model; as you can see, I picked 900 images for training and would simply like to train a PoC model, run prediction, and that's it), when should I stop training to obtain a reasonably performing model?
Indeed, for detection you need to fine-tune the network. Since you are using SSD, there are already some resources out there:
https://gluon-cv.mxnet.io/build/examples_detection/finetune_detection.html (this one is specifically for an SSD model; it uses MXNet, but you can apply the same approach with TF)
You can watch a very nice finetuning intro here
This repo has a nice fine tuning option enabled as long as you write your dataloader, check it out here
In general your error can be attributed to many factors: the learning rate you are using, or the characteristics of the images themselves (are they normalized?). If the SSD network you are using was trained on normalized data and you don't normalize when retraining, learning will get stuck. Also, what learning rate are they using?
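As a small illustration of the normalization point, here is a sketch that maps 8-bit pixel values to the range [-1, 1] and checks whether a batch already looks normalized. The target range is an assumption; which range a given pretrained SSD expects depends on its own preprocessing, so check the model's config before copying this.

```python
def normalize_pixels(pixels):
    """Map 8-bit pixel values in [0, 255] to floats in [-1.0, 1.0]."""
    return [p / 127.5 - 1.0 for p in pixels]


def looks_normalized(pixels, low=-1.0, high=1.0):
    """Cheap sanity check to run on a batch before feeding the network."""
    return all(low <= p <= high for p in pixels)


raw = [0, 64, 128, 255]          # hypothetical pixel values
scaled = normalize_pixels(raw)
print(looks_normalized(raw), looks_normalized(scaled))
```

Running a check like this on one batch from the training pipeline is a quick way to rule out a mismatch between your inputs and what the pretrained weights were trained on.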
From the model zoo I can see that for SSD there are models trained on COCO
And models trained on Open Images:
If, for example, you are using ssd_inception_v2_coco, there is a truncated_normal_initializer in the input layers, so take that into consideration. Also make sure the input sizes you provide match the ones the model expects.
You can get very good detections even with little data if you also include many augmentations and take into account the other things I mentioned. More details on your code would help to see where the problem lies.
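On the original question of when to stop: one common heuristic, sketched below, is to smooth the noisy mAP readings and stop once the smoothed value has not improved for a few evaluations, then keep the checkpoint from the best one. The window, patience and threshold values are illustrative assumptions, and the mAP history is made up for the example, not taken from the question.

```python
def smooth(values, window=3):
    """Trailing moving average, to damp noisy per-evaluation mAP scores."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def stopping_point(map_scores, window=3, patience=3, min_delta=0.005):
    """Return the index of the best smoothed evaluation, stopping once
    `patience` consecutive evaluations fail to beat it by `min_delta`."""
    best_idx, best, waited = 0, float("-inf"), 0
    for i, score in enumerate(smooth(map_scores, window)):
        if score > best + min_delta:
            best_idx, best, waited = i, score, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_idx


# Hypothetical mAP readings, one per evaluation run.
history = [0.10, 0.18, 0.15, 0.22, 0.19, 0.21, 0.18, 0.20, 0.17, 0.19]
print("keep the checkpoint from evaluation", stopping_point(history))
```

For a PoC this is usually enough: it avoids picking a checkpoint based on a single lucky evaluation while still stopping long before 48 hours if the curve has flattened.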

Is there a standard way to optimize models to run well on different mobile devices?

I’m working on a few side projects that involve deploying ML models to the edge. One of them is a photo-editing app that includes CNNs for facial recognition, object detection, classification, and style transfer. The other is an NLP app that assists in the writing process by suggesting words and sentence completions.
Once I have a trained model that’s accurate, it ends up being really slow on one or more mobile devices that I'm testing on (usually the lower end Android). I’ve read that there are optimizations one can do to speed models up, but I don’t know how. Is there a standard, go-to tool for optimizing models for mobile/edge?
I will talk about TensorFlow Lite specifically; it is a platform for running TensorFlow ops on Android and iOS. Several optimisation techniques are mentioned on its website, but I will discuss the ones that seem most important to me.
Constructing relevant models for platforms:
The first step in model optimization is how the model is constructed in the first place, in TensorFlow. We need to create a model which can be exported to a memory-constrained device.
We definitely need to train different models for different machines. A model built to run on a high-end TPU is unlikely to run efficiently on a mobile processor.
Create a model which has as few layers and ops as possible.
Do this without compromising the model's accuracy.
For this, you will need expertise in ML and knowledge of which ops are best for preprocessing data.
Also, extra preprocessing of the input data can reduce the model complexity to a great extent.
Model quantization:
We convert high-precision floating-point values to a lower-precision representation. This slightly affects the model's accuracy but greatly reduces the model size, so the model takes up less memory.
"Post-training quantization is a general technique to reduce model size while also providing up to 3x lower latency with little degradation in model accuracy. Post-training quantization quantizes weights from floating point to 8-bits of precision" (from the TF docs).
You can see the TensorFlow Lite TFLiteConverter example:
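import tensorflow as tf

saved_model_dir = "path/to/saved_model"  # placeholder: directory of your exported SavedModel
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
# Optimize.DEFAULT enables post-training quantization; OPTIMIZE_FOR_SIZE is its deprecated alias
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_quant_model = converter.convert()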
Also try the post_training_quantize flag (used by older converter versions), which reduces the model size considerably.
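To make the idea of weight quantization more concrete, here is a minimal pure-Python sketch of 8-bit affine quantization: floats are mapped to integers in 0..255 using a scale and a zero point, then mapped back. This is only an illustration of the principle, not TensorFlow Lite's actual implementation, and the weight values are made up. Storing 8-bit integers instead of 32-bit floats is where the roughly 4x size reduction comes from.

```python
def quantize(weights, num_bits=8):
    """Affine-quantize floats to unsigned ints; returns (ints, scale, zero_point)."""
    qmax = 2 ** num_bits - 1
    # Make sure 0.0 is inside the representable range.
    w_min, w_max = min(min(weights), 0.0), max(max(weights), 0.0)
    scale = (w_max - w_min) / qmax or 1.0  # fall back to 1.0 for all-zero weights
    zero_point = round(-w_min / scale)
    q = [min(qmax, max(0, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Recover approximate floats from the stored integers."""
    return [scale * (v - zero_point) for v in q]


weights = [-0.82, -0.1, 0.0, 0.33, 1.27]   # hypothetical layer weights
q, scale, zero_point = quantize(weights)
restored = dequantize(q, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, f"max error {max_error:.4f} (scale {scale:.4f})")
```

The rounding error per weight is on the order of half the scale, which is why accuracy usually drops only slightly while the stored model shrinks a lot.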
Hope it helps.

tf-slim batch norm: different behaviour between training/inference mode

I'm attempting to train a tensorflow model based on the popular slim implementation of mobilenet_v2 and am observing behaviour I cannot explain related (I think) to batch normalization.
Problem Summary
Model performance in inference mode improves initially but starts producing trivial inferences (all near-zeros) after a long period. Good performance continues when run in training mode, even on the evaluation dataset. Evaluation performance is impacted by batch normalization decay/momentum rate... somehow.
More extensive implementation details below, but I'll probably lose most of you with the wall of text, so here are some pictures to get you interested.
The curves below are from a model whose bn_decay parameter I changed during training.
0-370k: bn_decay=0.997 (default)
370k-670k: bn_decay=0.9
670k+: bn_decay=0.5
Loss for (orange) training (in training mode) and (blue) evaluation (in inference mode). Low is good.
Evaluation metric of model on evaluation dataset in inference mode. High is good.
I have attempted to produce a minimal example which demonstrates the issue - classification on MNIST - but have failed (i.e. classification works well and the problem I experience is not exhibited). My apologies for not being able to reduce things further.
Implementation Details
My problem is 2D pose estimation, targeting Gaussians centered at the joint locations. It is essentially the same as semantic segmentation, except rather than using a softmax_cross_entropy_with_logits(labels, logits) I use tf.losses.l2_loss(sigmoid(logits) - gaussian(label_2d_points)) (I use the term "logits" to describe unactivated output of my learned model, though this probably isn't the best term).
Inference Model
After preprocessing my inputs, my logits function is a scoped call to the base mobilenet_v2 followed by a single unactivated convolutional layer to make the number of filters appropriate.
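import tensorflow as tf
from slim.nets.mobilenet import mobilenet_v2

def get_logits(image):
    # is_training, bn_decay and n_joints are defined elsewhere in the model code
    with mobilenet_v2.training_scope(
            is_training=is_training, bn_decay=bn_decay):
        base, _ = mobilenet_v2.mobilenet(image, base_only=True)
    # single unactivated 1x1 convolution producing one channel per joint
    logits = tf.layers.conv2d(base, n_joints, 1, 1)
    return logits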
Training Op
I have experimented with tf.contrib.slim.learning.create_train_op as well as a custom training op:
def get_train_op(optimizer, loss):
    global_step = tf.train.get_or_create_global_step()
    opt_op = optimizer.minimize(loss, global_step)
    update_ops = set(tf.get_collection(tf.GraphKeys.UPDATE_OPS))
    update_ops.add(opt_op)
    return tf.group(*update_ops)
I'm using tf.train.AdamOptimizer with learning_rate=1e-3.
Training Loop
I'm using the tf.estimator.Estimator API for training/evaluation.
Behaviour
Training initially goes well, with an expected sharp increase in performance. This is consistent with my expectations, as the final layer is rapidly trained to interpret the high-level features output by the pretrained base model.
However, after a long period (60k steps with batch_size 8, ~8 hours on a GTX-1070) my model begins to output near-zero values (~1e-11) when run in inference mode, i.e. is_training=False. The exact same model continues to improve when run in training mode, i.e. is_training=True, even on the evaluation set. I have verified this visually.
After some experimentation I changed bn_decay (the batch normalization decay/momentum rate) from the default 0.997 to 0.9 at ~370k steps (I also tried 0.99, but that didn't make much of a difference) and observed an immediate improvement in accuracy. Visual inspection of the output in inference mode showed clear peaks of order ~1e-1 in the expected places, consistent with the location of the peaks from training mode (though with much lower values). This is why the accuracy increases significantly, but the loss, while more volatile, does not improve much.
These effects dropped off after more training and reverted to all zero inference.
I further dropped the bn_decay to 0.5 at step ~670k. This resulted in improvements to both loss and accuracy. I'll likely have to wait until tomorrow to see the long-term effect.
Loss and evaluation metric plots are given below. Note the evaluation metric is based on the argmax of the logits and high is good. Loss is based on the actual values, and low is good. Orange uses is_training=True on the training set, while blue uses is_training=False on the evaluation set. The loss of around 8 is consistent with all zero outputs.
Other notes
I have also experimented with turning off dropout (i.e. always running the dropout layers with is_training=False), and observed no difference.
I have experimented with all versions of tensorflow from 1.7 to 1.10. No difference.
I have trained models from the pretrained checkpoint using bn_decay=0.99 from the start. Same behaviour as using default bn_decay.
Other experiments with a batch size of 16 result in qualitatively identical behaviour (though I can't evaluate and train simultaneously due to memory constraints, hence quantitatively analysing on batch size of 8).
I have trained different models using the same loss and using tf.layers API and trained from scratch. They have worked fine.
Training from scratch (rather than using pretrained checkpoints) results in similar behaviour, though takes longer.
Summary/my thoughts:
I am confident this is not an overfitting/dataset problem. The model makes sensible inferences on the evaluation set when run with is_training=True, both in terms of location of peaks and magnitude.
I am confident this is not a problem with not running update ops. I haven't used slim before, but apart from the use of arg_scope it doesn't look too much different to the tf.layers API which I've used extensively. I can also inspect the moving average values and observe that they are changing as training progresses.
Changing bn_decay values significantly affected the results temporarily. I accept that a value of 0.5 is absurdly low, but I'm running out of ideas.
I have tried swapping out slim.layers.conv2d layers for tf.layers.conv2d with momentum=0.997 (i.e. momentum consistent with default decay value) and behaviour was the same.
Minimal example using pretrained weights and Estimator framework worked for classification of MNIST without modification to bn_decay parameter.
I've looked through issues on both the tensorflow and models github repositories but haven't found much apart from this. I'm currently experimenting with a lower learning rate and a simpler optimizer (MomentumOptimizer), but that's more because I'm running out of ideas rather than because I think that's where the problem lies.
Possible Explanations
The best explanation I have is that my model parameters are rapidly cycling in a manner such that the moving statistics are unable to keep up with the batch statistics. I've never heard of such behaviour, and it doesn't explain why the model reverts to poor behaviour after more time, but it's the best explanation I have.
There may be a bug in the moving average code, but it has worked perfectly for me in every other case, including a simple classification task. I don't want to file an issue until I can produce a simpler example.
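To put rough numbers on the "moving statistics cannot keep up" explanation, here is a small sketch of the standard exponential moving average update, moving = decay * moving + (1 - decay) * batch_value. After n updates the remaining gap to a shifted batch statistic shrinks as decay**n, so the number of steps needed to absorb a shift can be computed directly. This only illustrates the update rule; it is not a diagnosis of the behaviour described above.

```python
import math


def steps_to_track(decay, fraction=0.99):
    """Updates needed for a moving average to close `fraction` of the gap
    to a new batch statistic, since the remaining gap shrinks as decay**n."""
    return math.ceil(math.log(1.0 - fraction) / math.log(decay))


def moving_average_after(decay, steps, start=0.0, target=1.0):
    """Apply moving = decay * moving + (1 - decay) * batch_value repeatedly."""
    moving = start
    for _ in range(steps):
        moving = decay * moving + (1.0 - decay) * target
    return moving


for decay in (0.997, 0.99, 0.9, 0.5):
    n = steps_to_track(decay)
    print(f"bn_decay={decay}: {n} steps to absorb 99% of a shift")
```

With the default 0.997 it takes on the order of 1,500 updates to absorb a shift, versus a handful at 0.5, so if the batch statistics move faster than that, inference mode would be normalizing with stale values.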
Anyway, I'm running out of ideas, the debug cycle is long, and I've already spent too much time on this. Happy to provide more details or run experiments on demand. Also happy to post more code, though I'm worried that'll scare more people off.
Thanks in advance.
Both lowering the learning rate to 1e-4 with Adam and using Momentum optimizer (with learning_rate=1e-3 and momentum=0.9) resolved this issue. I also found this post which suggests the problem spans multiple frameworks and is an undocumented pathology of some networks due to the interaction between optimizer and batch-normalization. I do not believe it is a simple case of the optimizer failing to find a suitable minimum due to the learning rate being too high (otherwise performance in training mode would be poor).
I hope that helps others experiencing the same issue, but I'm a long way from satisfied. I'm definitely happy to hear other explanations.

YOLO vs Inception on unique images

I have images of unique products that are used at my workplace. I can't imagine that the inception database already has similar items that it has been trained on.
I tried to train a model using YOLO. It was taking a very, very long time: maybe 7 minutes between epochs, and I wanted to do 1000 epochs due to the small data size.
I used tiny-yolov2-voc cfg/weight on 1.0 GPU. I had a video of the item, but I broke it up into frames so I could annotate. I then attempted to train on the images (not the video). The products are healthcare related: basically anything that a hospital would use.
I've also used the inception method on images I got from Google. I noticed that the inception method was very fast and resulted in accurate predictions. However, I'm worried that my images are too unique for inception to work.
Which method is best to use?
If you recommend YOLO, can you please provide suggestions on how to speed up the training phase?
If you recommend inception, can you please provide an explanation why it would work on unique images? I guess I'm having trouble understanding how inception knows which item I'm trying to train on without me providing annotations.
Thanks in advance
Just my impression (no recommendation or even related experience)
Looking at the hardware recommendations for darknet, my assumption is that you may need to upgrade your own hardware to get faster results.
I read about the three current versions of YOLO and expect that a lot of GFLOPS of training are already included in the recommended downloadable files, but if those models do not fit your products they may never be very helpful to you.
I must admit I have not worked with either YOLO or TensorFlow, so my impression might not be helpful at all.
In some YOLO videos you can see that a camel is sometimes labeled as a horse, so the accuracy looks poor, but this depends on the threshold applied to the images. The videos look amazing because recognition is so fast, but with higher accuracy the process would slow down, also depending on the trained subjects.
They do not hide this, though: on an image where a dog is labeled as a cow and a horse as a sheep (version 2), they explain that in combination with darknet it gets much faster but also less accurate, so the use of darknet is an important aspect too.
The YOLO websites seem quite thin on details and present it more like a pop star; in comparison, the TensorFlow website looks more academic and explains the mathematics behind the framework.
I don't know the hardware recommendations for TensorFlow, but since you wrote that your results are useful, they are probably somewhat or even much lower.
My impression is that YOLO is primarily intended for real-time detection in (live) video and needs a lot of training for high accuracy. Depending on your use case it might be the right choice, but you would probably have to invest in hardware for professional use.
This is not an opinion against TensorFlow; I would simply have to verify more, and it seems to take more time to form an impression. At the moment I can't even say whether TensorFlow can be used for real-time detection, how accurate it is in that case, or whether the results are still better than those of YOLO.
My assumption is that for both solutions it is a matter of the elements involved (such as the decision whether to include darknet for speed), configuration, training and adjustments. There is probably always something to improve in speed and accuracy, so investing in a recognition system is not a static process with a fixed end date but an ongoing one.
This is just a short overview of my impressions. I have no experience with any recognition software and would hardly recommend that you base any decision on my words.
Only if you want to use recognition software professionally, especially for real-time recognition, will you probably have to invest in hardware.
As I understand your problem, you need inception to be capable of identifying your unique images. In this circumstance you can use transfer learning on the inception model. With transfer learning you can train inception on your own pictures while retaining its previous knowledge.
More on transfer-learning

CNN : Fine tuning small network vs feature extracting from a big network

To elaborate: under what circumstances would fine-tuning all layers of a small network (say SqueezeNet) perform better than feature extraction from, or fine-tuning only the last 1 or 2 convolution layers of, a big network (e.g. InceptionV4)?
My understanding is that the computing resources required for both are somewhat comparable. And I remember reading in a paper that the extreme options, i.e. fine-tuning 90% or 10% of the network, are far better than more moderate ones like 50%. So, what should be the default choice when experimenting extensively is not an option?
Any past experiments with an intuitive description of their results, research papers or blog posts would be especially helpful. Thanks.
I don't have much experience in training models like SqueezeNet, but I think it is much easier to finetune only the last 1 or 2 layers of a big network: you don't have to extensively search for many optimal hyperparameters. Transfer learning works amazingly well out of the box with the LR finder and the cyclical learning rate from fast.ai.
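For readers who have not seen a cyclical learning rate, here is a minimal sketch of the simple triangular policy, where the rate ramps linearly from a base value to a maximum and back. The bounds and cycle length are made-up values, and fast.ai's own scheduler differs in its details, so take this as an illustration of the idea only.

```python
import math


def triangular_lr(step, base_lr=1e-4, max_lr=1e-3, half_cycle=1000):
    """Learning rate that ramps linearly base -> max -> base
    every 2 * half_cycle steps."""
    cycle = math.floor(1 + step / (2 * half_cycle))
    x = abs(step / half_cycle - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)


for step in (0, 500, 1000, 1500, 2000):
    print(step, f"{triangular_lr(step):.2e}")
```

The appeal for fine-tuning is that you only need a plausible lower and upper bound (which an LR range test can suggest) rather than a carefully tuned single value.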
If you want fast inference after the training, then it is preferable to train SqueezeNet. It might also be the case if the new task is very different from ImageNet.
Some intuition from http://cs231n.github.io/transfer-learning/
New dataset is small and similar to original dataset. Since the data is small, it is not a good idea to fine-tune the ConvNet due to overfitting concerns. Since the data is similar to the original data, we expect higher-level features in the ConvNet to be relevant to this dataset as well. Hence, the best idea might be to train a linear classifier on the CNN codes.
New dataset is large and similar to the original dataset. Since we have more data, we can have more confidence that we won’t overfit if we were to try to fine-tune through the full network.
New dataset is small but very different from the original dataset. Since the data is small, it is likely best to only train a linear classifier. Since the dataset is very different, it might not be best to train the classifier from the top of the network, which contains more dataset-specific features. Instead, it might work better to train the SVM classifier from activations somewhere earlier in the network.
New dataset is large and very different from the original dataset. Since the dataset is very large, we may expect that we can afford to train a ConvNet from scratch. However, in practice it is very often still beneficial to initialize with weights from a pretrained model. In this case, we would have enough data and confidence to fine-tune through the entire network.
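The "train a linear classifier on the CNN codes" option from the list above can be sketched without any deep learning library. In the toy example below, frozen_features is a hypothetical stand-in for a pretrained network's penultimate activations, and only a small logistic regression on top of those fixed features is trained. The data, feature function and hyperparameters are all made up for illustration.

```python
import math
import random


def frozen_features(x):
    """Hypothetical stand-in for a pretrained network's "CNN codes".
    A real pipeline would run the frozen ConvNet here instead."""
    return [x[0] + x[1], x[0] - x[1]]


def train_linear_classifier(codes, labels, lr=0.5, epochs=200):
    """Logistic regression by gradient descent; only w and b are trained."""
    w, b = [0.0] * len(codes[0]), 0.0
    for _ in range(epochs):
        for f, y in zip(codes, labels):
            z = sum(wi * fi for wi, fi in zip(w, f)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            w = [wi - lr * (p - y) * fi for wi, fi in zip(w, f)]
            b -= lr * (p - y)
    return w, b


def predict(w, b, f):
    """Class 1 if the linear score is positive, else class 0."""
    return 1 if sum(wi * fi for wi, fi in zip(w, f)) + b > 0 else 0


random.seed(0)
# Toy data: the label is 1 when the two inputs sum to more than 1.
inputs = [[random.random(), random.random()] for _ in range(40)]
labels = [1 if u + v > 1.0 else 0 for u, v in inputs]
codes = [frozen_features(x) for x in inputs]  # computed once, never updated
w, b = train_linear_classifier(codes, labels)
accuracy = sum(predict(w, b, f) == y for f, y in zip(codes, labels)) / len(labels)
print(f"training accuracy: {accuracy:.2f}")
```

Because the features are computed once and never updated, this is cheap even on a small machine, which is the main practical argument for feature extraction when the dataset is small.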