If we look at the list of available models in Keras, as shown here, we see that almost all of them are instantiated with weights='imagenet'. For instance:
model = VGG16(weights='imagenet', include_top=False)
Why always imagenet? Is it because it is the baseline? If not, what other options are available?
Thank you
ImageNet is the de facto standard for image classification. A yearly contest is run with millions of training images in 1000 categories, and the models entered in the ImageNet classification competition are measured against each other for performance. It therefore provides a "standard" measure of how good a model is at image classification, which is why so many commonly used transfer-learning models ship with ImageNet weights. If you are using transfer learning, your model can be customized for your application by adding additional layers on top. You do not have to use the ImageNet weights, but doing so is generally beneficial because it helps the model converge in fewer epochs. I use them, but I also set all layers to be trainable, which helps adapt the weights of the model to your application.
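For reference, a minimal sketch of the values the weights argument accepts in the Keras applications ('imagenet', None for random initialization, or a path to a weights file), plus the all-layers-trainable setup mentioned above; the file path is just a hypothetical example:

from tensorflow.keras.applications import VGG16

# Start from the ImageNet weights (the usual transfer-learning choice)
base = VGG16(weights='imagenet', include_top=False)

# Alternatives: random initialization, or a path to your own saved weights
scratch = VGG16(weights=None, include_top=False)
# custom = VGG16(weights='/path/to/my_weights.h5', include_top=False)  # hypothetical path

# Keep all layers trainable so the ImageNet weights adapt to your application
for layer in base.layers:
    layer.trainable = True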
I am using Google's Dopamine framework to train a specific reinforcement learning use-case. I am using an autoencoder to pre-train the convolutional layers of the Deep Q Network and then transfer those pre-trained weights to the final network.
To that end, I have created a separate model (in this case an autoencoder) which I train, and I save the resulting model and weights.
The DQN model is created using Keras's model sub-classing method, and the model used to save the trained convolutional layers' weights was built using the Sequential API. My issue arises when trying to load the pre-trained weights into my final DQN model. Depending on whether I use the load_model() or load_weights() functionality from TensorFlow's API, I get two different overall behaviors of my network, and I would like to understand why. Specifically, I have the following two scenarios:
1. Loading the weights into the final model with the load_weights() method. The weights are those of the encoder plus one additional layer (added just before saving the weights) to fit the architecture of the final network implemented in Dopamine, where they are loaded.
2. First loading the saved model with load_model() and then, when defining the new model in the __init__() method, extracting the relevant layers from the loaded model and using them in the final model.
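For illustration, a rough sketch of the two loading strategies (the model constructor and file names are assumptions, not the actual Dopamine code):

import tensorflow as tf

# Approach 1: build the final model, then load the saved weights directly into it.
# The saved file must match the final architecture layer by layer.
final_model = build_dqn()                                 # hypothetical constructor
final_model.load_weights('pretrained_encoder.h5')

# Approach 2: load the full saved model, then reuse its layers when defining the new model.
pretrained = tf.keras.models.load_model('pretrained_encoder.h5')
conv_layers = pretrained.layers[:-1]                      # e.g. keep only the encoder layers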
Overall, I would expect the two approaches to yield similar results with regard to the average reward achieved per episode when using the same pre-trained weights. However, the two approaches differ (1. yields a higher average reward than 2., although both use the same pre-trained weights) and I don't understand why.
Furthermore, to validate this behavior I tried loading random weights with the two aforementioned approaches to see whether the behavior changes. In both cases, whichever loading method I use, I end up with behavior very similar to the respective case when loading the trained weights. It seems as though the pre-trained weights have no effect on the resulting training behavior in either case. This might be irrelevant to the issue I am trying to investigate here, as it may simply be that the pre-trained weights don't offer any benefit overall, which is also possible.
Any thoughts and ideas on this would be much appreciated.
SCENARIO
What if my intention is to train on a dataset of medical images and I have chosen a COCO pre-trained model?
My Doubts
1. Since I have chosen medical images, there is no point in training it on the COCO dataset, right? If so, what is a possible alternative?
2. Will adding more layers to a pre-trained model ruin the entire model? Say with around 10+ classes and tens of thousands of training samples?
3. Without training from scratch, what are the possible approaches, such as fine-tuning the model?
PS - let's assume this scenario is based on deploying the model for business purposes.
Thanks-
Yes, it is a good idea to reuse Pre-Trained Models (Transfer Learning) in real-world projects, as it saves computation time and the architectures are proven.
If your use case is to classify the Medical Images, that is, Image Classification, then
Since I have chosen medical images there is no point of train it on
COCO dataset, right? if so what is a possible solution to do the same?
Yes, the COCO dataset is not a good fit for Image Classification, as it is intended for Object Detection. You can reuse VGGNet, ResNet, Inception Net or EfficientNet instead. For more information, refer to the TF Hub modules.
Adding more layers to a pre-trained model will screw the entire model?
with classes of around 10 plus and 10000's of training datasets?
No. We can remove the top layer of the pre-trained model and add our own custom layers without affecting the performance of the pre-trained base.
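As a rough sketch (the base network, input size, and layer sizes below are placeholder assumptions):

import tensorflow as tf

base = tf.keras.applications.ResNet50(weights='imagenet', include_top=False,
                                      input_shape=(224, 224, 3))
base.trainable = False                                   # keep the pre-trained features intact

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(10, activation='softmax'),     # custom head for your ~10 classes
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])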
Without train from scratch what are the possible solutions , like
fine-tuning the model?
In addition to using the pre-trained models, you can tune the hyper-parameters of the model (the custom layers added by you) using the HParams plugin of TensorBoard.
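A minimal sketch of logging such a tuning run with the TensorBoard HParams plugin (the hyper-parameter names, ranges, and the train_and_evaluate helper are made up for illustration):

import tensorflow as tf
from tensorboard.plugins.hparams import api as hp

HP_UNITS = hp.HParam('dense_units', hp.Discrete([64, 128, 256]))
HP_DROPOUT = hp.HParam('dropout', hp.RealInterval(0.1, 0.5))

def run_trial(run_dir, hparams):
    with tf.summary.create_file_writer(run_dir).as_default():
        hp.hparams(hparams)                               # record this run's hyper-parameters
        accuracy = train_and_evaluate(hparams)            # hypothetical training routine
        tf.summary.scalar('accuracy', accuracy, step=1)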
I have implemented a form of the LeNet model via TensorFlow and Python for a car number-plate recognition system. My model was trained solely on my train data and tested on the test data. My dataset contains segmented images wherein every image has only one character in it. This is what my data looks like. My model does not perform very well, so I'm now looking for models which I can use via transfer learning. Since most models are already trained on a humongous dataset, I looked over a few like AlexNet, ResNet, GoogLeNet and Inception v2. Most of these models have not been trained on the type of data that I want, which would be letters and digits.
Question: Should I still go forward with one of these models and train it on my dataset, or are there better models which would help? For such models, would Keras be a better option since it is more high-level than TensorFlow?
Question: I'd prefer to work with the LeNet model itself, since training the other models would definitely take a long time due to the insufficient specs of my laptop. So is there any implementation of the model trained on machine-printed character images which I could use, and then train only the final layers of the model on my data?
To get good results, you should use a model explicitly designed for text recognition.
First, (roughly) crop the input image to the region around the text.
Then, feed the image of the text into a neural network (NN) to detect the text.
A typical NN for text recognition extracts relevant features (with convolutional NN), propagates those features through the image (with recurrent NN) and finally predicts a character score for each position in the image.
Usually, those networks are trained with the CTC loss.
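To make that pipeline concrete, here is a rough tf.keras sketch of the conv → recurrent → per-position character scores structure (this is not the referenced CRNN code; the input size, filter counts, and character-set size are assumptions):

import tensorflow as tf
from tensorflow.keras import layers

num_chars = 37                                              # e.g. 26 letters + 10 digits + CTC blank (assumption)

inputs = layers.Input(shape=(32, 128, 1))                   # grayscale text-line image
x = layers.Conv2D(64, 3, padding='same', activation='relu')(inputs)
x = layers.MaxPooling2D((2, 2))(x)
x = layers.Conv2D(128, 3, padding='same', activation='relu')(x)
x = layers.MaxPooling2D((2, 1))(x)                          # keep horizontal resolution for the sequence
x = layers.Permute((2, 1, 3))(x)                            # (batch, width, height, channels)
x = layers.Reshape((64, 8 * 128))(x)                        # one feature vector per horizontal position
x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
outputs = layers.Dense(num_chars, activation='softmax')(x)  # character scores per position
model = tf.keras.Model(inputs, outputs)
# Train with a CTC loss, e.g. via tf.keras.backend.ctc_batch_cost in a custom loss or layer.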
As a starting point I would suggest looking at the CRNN implementation (they also provide a pre-trained model) [1] and the corresponding paper [2]. There is, as far as I remember, also a TensorFlow implementation on GitHub.
You can use any framework (e.g. TensorFlow or CNTK or ...) you like, as long as it features convolutional and recurrent NNs and the CTC loss.
I once attended a presentation about CNTK where they claimed that they have a very fast implementation of recurrent NN - so maybe CNTK would be a good choice for your slow computer?
[1] CRNN implementation: https://github.com/bgshih/crnn
[2] Shi - An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition
I have a dataset consisting of around 5000 categories of images, but the number of images per category varies from 20 to 2000, which is quite unbalanced. Also, the number of images is far from enough to train a model from scratch. I decided to do fine-tuning on pre-trained models, like the Inception models.
But I am not sure about how to deal with unbalanced data. There are several possible approaches:
Oversampling: Oversample the minority categories. But even with aggressive image augmentation techniques, we may not be able to avoid overfitting.
Also, how can I generate balanced batches from an unbalanced dataset over so many categories? Do you have any ideas about such a pipeline mechanism in TensorFlow?
SMOTE: I think it is not so effective for high-dimensional signals like images.
Weight the cross-entropy loss in every batch. This might be useful for a single batch, but it cannot deal with the overall imbalance.
Any ideas about this? Any feedback will be appreciated.
Use tf.losses.softmax_cross_entropy and set the weight for each class inversely proportional to its training frequency to "balance" the optimization.
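As a rough sketch of that idea (TF 1.x style; class_frequencies, onehot_labels, and logits are assumed to already exist in your graph):

import tensorflow as tf

# One weight per class, inversely proportional to its training frequency
class_weights = tf.constant([1.0 / f for f in class_frequencies])
# Turn the per-class weights into a per-example weight via the one-hot labels
example_weights = tf.reduce_sum(onehot_labels * class_weights, axis=1)
loss = tf.losses.softmax_cross_entropy(onehot_labels, logits, weights=example_weights)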
Start with the pre-trained ImageNet layers, then add your own final layers (with appropriate convolution, dropout and flatten layers as required). Freeze all but the last few of the ImageNet layers, then train on your dataset.
For unbalanced data (and in general small datasets), use data augmentation to create more training images. Keras has this functionality built-in: Building powerful image classification models using very little data
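A minimal sketch of that recipe (the base network, number of frozen layers, image size, and augmentation settings are assumptions):

from tensorflow.keras import layers, models
from tensorflow.keras.applications import InceptionV3
from tensorflow.keras.preprocessing.image import ImageDataGenerator

base = InceptionV3(weights='imagenet', include_top=False, input_shape=(299, 299, 3))
for layer in base.layers[:-20]:                           # freeze all but the last few pre-trained layers
    layer.trainable = False

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dropout(0.5),
    layers.Dense(5000, activation='softmax'),             # ~5000 categories from the question
])

# Data augmentation generates extra training images, which helps the small classes
augmenter = ImageDataGenerator(rotation_range=20, width_shift_range=0.1,
                               height_shift_range=0.1, horizontal_flip=True)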
I read a lot of papers on ConvNets, but there is one thing I don't understand: how are the filters in a convolutional layer initialized?
Because, for example, in the first layer, filters should detect edges etc.
But if they are randomly initialized, wouldn't they be inaccurate? The same goes for the next layers and their higher-level features.
And another question: what is the range of the values in those filters?
Many thanks to you!
You can either initialize the filters randomly or pretrain them on some other data set.
Some references:
http://deeplearning.net/tutorial/lenet.html:
Notice that a randomly initialized filter acts very much like an edge
detector!
Note that we use the same weight initialization formula as with the
MLP. Weights are sampled randomly from a uniform distribution in the
range [-1/fan-in, 1/fan-in], where fan-in is the number of inputs to a
hidden unit. For MLPs, this was the number of units in the layer
below. For CNNs however, we have to take into account the number of
input feature maps and the size of the receptive fields.
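As a small sketch of that quoted formula (the filter sizes below are arbitrary examples, not values from the tutorial):

import numpy as np

# fan_in = number of input feature maps * receptive field height * width
n_in_maps, kh, kw, n_out_maps = 3, 5, 5, 32
fan_in = n_in_maps * kh * kw
bound = 1.0 / fan_in
filters = np.random.uniform(-bound, bound, size=(kh, kw, n_in_maps, n_out_maps))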
http://cs231n.github.io/transfer-learning/ :
Transfer Learning
In practice, very few people train an entire Convolutional Network
from scratch (with random initialization), because it is relatively
rare to have a dataset of sufficient size. Instead, it is common to
pretrain a ConvNet on a very large dataset (e.g. ImageNet, which
contains 1.2 million images with 1000 categories), and then use the
ConvNet either as an initialization or a fixed feature extractor for
the task of interest. The three major Transfer Learning scenarios look
as follows:
ConvNet as fixed feature extractor. Take a ConvNet pretrained on ImageNet, remove the last fully-connected layer (this layer's outputs
are the 1000 class scores for a different task like ImageNet), then
treat the rest of the ConvNet as a fixed feature extractor for the new
dataset. In an AlexNet, this would compute a 4096-D vector for every
image that contains the activations of the hidden layer immediately
before the classifier. We call these features CNN codes. It is
important for performance that these codes are ReLUd (i.e. thresholded
at zero) if they were also thresholded during the training of the
ConvNet on ImageNet (as is usually the case). Once you extract the
4096-D codes for all images, train a linear classifier (e.g. Linear
SVM or Softmax classifier) for the new dataset.
Fine-tuning the ConvNet. The second strategy is to not only replace and retrain the classifier on top of the ConvNet on the new
dataset, but to also fine-tune the weights of the pretrained network
by continuing the backpropagation. It is possible to fine-tune all the
layers of the ConvNet, or it's possible to keep some of the earlier
layers fixed (due to overfitting concerns) and only fine-tune some
higher-level portion of the network. This is motivated by the
observation that the earlier features of a ConvNet contain more
generic features (e.g. edge detectors or color blob detectors) that
should be useful to many tasks, but later layers of the ConvNet
becomes progressively more specific to the details of the classes
contained in the original dataset. In case of ImageNet for example,
which contains many dog breeds, a significant portion of the
representational power of the ConvNet may be devoted to features that
are specific to differentiating between dog breeds.
Pretrained models. Since modern ConvNets take 2-3 weeks to train across multiple GPUs on ImageNet, it is common to see people release
their final ConvNet checkpoints for the benefit of others who can use
the networks for fine-tuning. For example, the Caffe library has a
Model Zoo where people
share their network weights.
When and how to fine-tune? How do you decide what type of transfer learning you should perform on a new dataset? This is a function of
several factors, but the two most important ones are the size of the
new dataset (small or big), and its similarity to the original dataset
(e.g. ImageNet-like in terms of the content of images and the classes,
or very different, such as microscope images). Keeping in mind that
ConvNet features are more generic in early layers and more
original-dataset-specific in later layers, here are some common rules
of thumb for navigating the 4 major scenarios:
New dataset is small and similar to original dataset. Since the data is small, it is not a good idea to fine-tune the ConvNet due to
overfitting concerns. Since the data is similar to the original data,
we expect higher-level features in the ConvNet to be relevant to this
dataset as well. Hence, the best idea might be to train a linear
classifier on the CNN codes.
New dataset is large and similar to the original dataset. Since we have more data, we can have more confidence that we won't overfit
if we were to try to fine-tune through the full network.
New dataset is small but very different from the original dataset. Since the data is small, it is likely best to only train a
linear classifier. Since the dataset is very different, it might not
be best to train the classifier from the top of the network, which
contains more dataset-specific features. Instead, it might work better
to train the SVM classifier from activations somewhere earlier in the
network.
New dataset is large and very different from the original dataset. Since the dataset is very large, we may expect that we can
afford to train a ConvNet from scratch. However, in practice it is
very often still beneficial to initialize with weights from a
pretrained model. In this case, we would have enough data and
confidence to fine-tune through the entire network.
Practical advice. There are a few additional things to keep in mind when performing Transfer Learning:
Constraints from pretrained models. Note that if you wish to use a pretrained network, you may be slightly constrained in terms of the
architecture you can use for your new dataset. For example, you can't
arbitrarily take out Conv layers from the pretrained network. However,
some changes are straight-forward: Due to parameter sharing, you can
easily run a pretrained network on images of different spatial size.
This is clearly evident in the case of Conv/Pool layers because their
forward function is independent of the input volume spatial size (as
long as the strides "fit"). In case of FC layers, this still holds
true because FC layers can be converted to a Convolutional Layer: For
example, in an AlexNet, the final pooling volume before the first FC
layer is of size [6x6x512]. Therefore, the FC layer looking at this
volume is equivalent to having a Convolutional Layer that has
receptive field size 6x6, and is applied with padding of 0.
Learning rates. It's common to use a smaller learning rate for ConvNet weights that are being fine-tuned, in comparison to the
(randomly-initialized) weights for the new linear classifier that
computes the class scores of your new dataset. This is because we
expect that the ConvNet weights are relatively good, so we don't wish
to distort them too quickly and too much (especially while the new
Linear Classifier above them is being trained from random
initialization).
Additional References
CNN Features off-the-shelf: an Astounding Baseline for Recognition trains SVMs on features
from ImageNet-pretrained ConvNet and reports several state of the art
results.
DeCAF reported similar findings in 2013. The framework in this paper (DeCAF) was a Python-based precursor to the C++ Caffe library.
How transferable are features in deep neural networks? studies the transfer
learning performance in detail, including some unintuitive findings
about layer co-adaptations.