I read a lot of papers on convnets, but there is one thing I don't understand, how the filters in convolutional layer are initialized ?
Because, for examples, in first layer, filters should detect edges etc..
But if it randomly init, it could not be accurate ? Same for next layer and high-level features.
And an other question, what are the range of the value in those filters ?
Many thanks to you!
You can either initialize the filters randomly or pretrain them on some other data set.
Some references:
http://deeplearning.net/tutorial/lenet.html:
Notice that a randomly initialized filter acts very much like an edge
detector!
Note that we use the same weight initialization formula as with the
MLP. Weights are sampled randomly from a uniform distribution in the
range [-1/fan-in, 1/fan-in], where fan-in is the number of inputs to a
hidden unit. For MLPs, this was the number of units in the layer
below. For CNNs however, we have to take into account the number of
input feature maps and the size of the receptive fields.
http://cs231n.github.io/transfer-learning/ :
Transfer Learning
In practice, very few people train an entire Convolutional Network
from scratch (with random initialization), because it is relatively
rare to have a dataset of sufficient size. Instead, it is common to
pretrain a ConvNet on a very large dataset (e.g. ImageNet, which
contains 1.2 million images with 1000 categories), and then use the
ConvNet either as an initialization or a fixed feature extractor for
the task of interest. The three major Transfer Learning scenarios look
as follows:
ConvNet as fixed feature extractor. Take a ConvNet pretrained on ImageNet, remove the last fully-connected layer (this layer's outputs
are the 1000 class scores for a different task like ImageNet), then
treat the rest of the ConvNet as a fixed feature extractor for the new
dataset. In an AlexNet, this would compute a 4096-D vector for every
image that contains the activations of the hidden layer immediately
before the classifier. We call these features CNN codes. It is
important for performance that these codes are ReLUd (i.e. thresholded
at zero) if they were also thresholded during the training of the
ConvNet on ImageNet (as is usually the case). Once you extract the
4096-D codes for all images, train a linear classifier (e.g. Linear
SVM or Softmax classifier) for the new dataset.
Fine-tuning the ConvNet. The second strategy is to not only replace and retrain the classifier on top of the ConvNet on the new
dataset, but to also fine-tune the weights of the pretrained network
by continuing the backpropagation. It is possible to fine-tune all the
layers of the ConvNet, or it's possible to keep some of the earlier
layers fixed (due to overfitting concerns) and only fine-tune some
higher-level portion of the network. This is motivated by the
observation that the earlier features of a ConvNet contain more
generic features (e.g. edge detectors or color blob detectors) that
should be useful to many tasks, but later layers of the ConvNet
becomes progressively more specific to the details of the classes
contained in the original dataset. In case of ImageNet for example,
which contains many dog breeds, a significant portion of the
representational power of the ConvNet may be devoted to features that
are specific to differentiating between dog breeds.
Pretrained models. Since modern ConvNets take 2-3 weeks to train across multiple GPUs on ImageNet, it is common to see people release
their final ConvNet checkpoints for the benefit of others who can use
the networks for fine-tuning. For example, the Caffe library has a
Model Zoo where people
share their network weights.
When and how to fine-tune? How do you decide what type of transfer learning you should perform on a new dataset? This is a function of
several factors, but the two most important ones are the size of the
new dataset (small or big), and its similarity to the original dataset
(e.g. ImageNet-like in terms of the content of images and the classes,
or very different, such as microscope images). Keeping in mind that
ConvNet features are more generic in early layers and more
original-dataset-specific in later layers, here are some common rules
of thumb for navigating the 4 major scenarios:
New dataset is small and similar to original dataset. Since the data is small, it is not a good idea to fine-tune the ConvNet due to
overfitting concerns. Since the data is similar to the original data,
we expect higher-level features in the ConvNet to be relevant to this
dataset as well. Hence, the best idea might be to train a linear
classifier on the CNN codes.
New dataset is large and similar to the original dataset. Since we have more data, we can have more confidence that we won't overfit
if we were to try to fine-tune through the full network.
New dataset is small but very different from the original dataset. Since the data is small, it is likely best to only train a
linear classifier. Since the dataset is very different, it might not
be best to train the classifier form the top of the network, which
contains more dataset-specific features. Instead, it might work better
to train the SVM classifier from activations somewhere earlier in the
network.
New dataset is large and very different from the original dataset. Since the dataset is very large, we may expect that we can
afford to train a ConvNet from scratch. However, in practice it is
very often still beneficial to initialize with weights from a
pretrained model. In this case, we would have enough data and
confidence to fine-tune through the entire network.
Practical advice. There are a few additional things to keep in mind when performing Transfer Learning:
Constraints from pretrained models. Note that if you wish to use a pretrained network, you may be slightly constrained in terms of the
architecture you can use for your new dataset. For example, you can't
arbitrarily take out Conv layers from the pretrained network. However,
some changes are straight-forward: Due to parameter sharing, you can
easily run a pretrained network on images of different spatial size.
This is clearly evident in the case of Conv/Pool layers because their
forward function is independent of the input volume spatial size (as
long as the strides "fit"). In case of FC layers, this still holds
true because FC layers can be converted to a Convolutional Layer: For
example, in an AlexNet, the final pooling volume before the first FC
layer is of size [6x6x512]. Therefore, the FC layer looking at this
volume is equivalent to having a Convolutional Layer that has
receptive field size 6x6, and is applied with padding of 0.
Learning rates. It's common to use a smaller learning rate for ConvNet weights that are being fine-tuned, in comparison to the
(randomly-initialized) weights for the new linear classifier that
computes the class scores of your new dataset. This is because we
expect that the ConvNet weights are relatively good, so we don't wish
to distort them too quickly and too much (especially while the new
Linear Classifier above them is being trained from random
initialization).
Additional References
CNN Features off-the-shelf: an Astounding Baseline for Recognition trains SVMs on features
from ImageNet-pretrained ConvNet and reports several state of the art
results.
DeCAF reported similar findings in 2013. The framework in this paper (DeCAF) was a Python-based precursor to the C++ Caffe library.
How transferable are features in deep neural networks? studies the transfer
learning performance in detail, including some unintuitive findings
about layer co-adaptations.
Related
We were given an assignment in which we were supposed to implement our own neural network, and two other already developed Neural Networks. I have done that and however, this isn't the requirement of the assignment but I still would want to know that what are the steps/procedure I can follow to improve the accuracy of my Models?
I am fairly new to Deep Learning and Machine Learning as a whole so do not have much idea.
The given dataset contains a total of 15 classes (airplane, chair etc.) and we are provided with about 15 images of each class in training dataset. The testing dataset has 10 images of each class.
Complete github repository of my code can be found here (Jupyter Notebook file): https://github.com/hassanashas/Deep-Learning-Models
I tried it out with own CNN first (made one using Youtube tutorials).
Code is as follows,
X_train = X_train/255.0
model = Sequential()
model.add(Conv2D(64, (3, 3), input_shape = X_train.shape[1:]))
model.add(Activation("relu"))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(128, (3, 3)))
model.add(Activation("relu"))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(64))
model.add(Dense(16)) # added 16 because it model.fit gave error on 15
model.add(Activation('softmax'))
For the compiling of Model,
from tensorflow.keras.optimizers import SGD
model.compile(loss='sparse_categorical_crossentropy',
optimizer=SGD(learning_rate=0.01),
metrics=['accuracy'])
I used sparse categorical crossentropy because my "y" label was intenger values, ranging from 1 to 15.
I ran this model with following way,
model_fit = model.fit(X_train, y_train, batch_size=32, epochs=30, validation_split=0.1)
It gave me an accuracy of 0.2030 on training dataset and only 0.0733 on the testing dataset (both the datasets are present in the github repository)
Then, I tried out the AlexNet CNN (followed a Youtube tutorial for its code)
I ran the AlexNet on the same dataset for 15 epochs. It improved the accuracy on training dataset to 0.3317, however accuracy on testing dataset was even worse than my own CNN, at only 0.06
Afterwards, I tried out the VGG16 CNN, again following a Youtube Tutorial.
I ran the code on Google Colab for 10 Epochs. It managed to improve to 100% accuracy on training dataset in the 8th epoch. But this model gave the worst accuracy of all three on testing dataset with only 0.0533
I am unable to understand this contrasting behavior of all these models. I have tried out different epoch values, loss functions etc. but the current ones gave the best result relatively. My own CNN was able to get to 100% accuracy when I ran it on 100 epochs (however, it gave very poor results on the testing dataset)
What can I do to improve the performance of these Models? And specifically, what are the few crucial things that one should always try to follow in order to improve efficiency of a Deep Learning Model? I have looked up multiple similar questions on Stackoverflow but almost all of them were working on datasets provided by the tensorflow like mnist dataset and etc. and I didn't find much help from those.
Disclaimer: it's been a few years since I've played with CNNs myself, so I can only pass on some general advice and suggestions.
First of all, I would like to talk about the results you've gotten so far. The first two networks you've trained seem to at least learn something from the training data because they perform better than just randomly guessing.
However: the performance on the test data indicates that the network has not learned anything meaningful because those numbers suggest the network is as good as (or only marginally better than) a random guess.
As for the third network: high accuracy for training data combined with low accuracy for testing data means that your network has overfitted. This means that the network has memorized the training data but has not learned any meaningful patterns.
There's no point in continuing to train a network that has started overfitting. So once the training accuracy increases and testing accuracy decreases for a few epochs consecutively, you can stop training.
Increase the dataset size
Neural networks rely on loads of good training data to learn patterns from. Your dataset contains 15 classes with 15 images each, that is very little training data.
Of course, it would be great if you could get hold of additional high-quality training data to expand your dataset, but that is not always feasible. So a different approach is to artificially expand your dataset. You can easily do this by applying a bunch of transformations to the original training data. Think about: mirroring, rotating, zooming, and cropping.
Remember to not just apply these transformations willy-nilly, they must make sense! For example, if you want a network to recognize a chair, do you also want it to recognize chairs that are upside down? Or for detecting road signs: mirroring them makes no sense because the text, numbers, and graphics will never appear mirrored in real life.
From the brief description of the classes you have (planes and chairs and whatnot...), I think mirroring horizontally could be the best transformation to apply initially. That will already double your training dataset size.
Also, keep in mind that an artificially inflated dataset is never as good as one of the same size that contains all authentic, real images. A mirrored image contains much of the same information as its original, we merely hope it will delay the network from overfitting and hope that it will learn the important patterns instead.
Lower the learning rate
This is a bit of side note, but try lowering the learning rate. Your network seems to overfit in only a few epochs which is very fast. Obviously, lowering the learning rate will not combat overfitting but it will happen more slowly. This means that you can hopefully find an epoch with better overall performance before overfitting takes place.
Note that a lower learning rate will never magically make a bad-performing network good. It's just one way to locate a set of parameters that performs a tad bit better.
Randomize the training data order
During training, the training data is presented in batches to the network. This often happens in a fixed order over all iterations. This may lead to certain biases in the network.
First of all, make sure that the training data is shuffled at least once. You do not want to present the classes one by one, for example first all plane images, then all chairs, etc... This could lead to the network unlearning much of the first class by the end of each epoch.
Also, reshuffle the training data between epochs. This will again avoid potential minor biases because of training data order.
Improve the network design
You've designed a convolutional neural network with only two convolution layers and two fully connected layers. Maybe this model is too shallow to learn to differentiate between the different classes.
Know that the convolution layers tend to first pick up small visual features and then tend to combine these in higher level patterns. So maybe adding a third convolution layer may help the network identify more meaningful patterns.
Obviously, network design is something you'll have to experiment with and making networks overly deep or complex is also a pitfall to watch out for!
A Neural Algorithm of Artistic Style uses the Gramian Matrix of the intermediate feature vectors of the VGG16 classification network trained on ImageNet. Back then, that was probably a good choice because VGG16 was one of the best-performing classification. Nowadays, there are much more efficient classification networks that surpass VGG in classification performance while requiring fewer parameters and FLOPS, for example EfficientNet and MobileNetv2.
But when I tried this out in practice, the Gramian Matrix for VGG16 features appears representative of the image style in that its L2 distance for stylistically similar images is smaller than the L2 distance to stylistically unrelated images. For the Gramian Matrix calculated from EfficientNet and MobileNetv2 features, that does not appear to be the case. The L2 distance between very similar images and between very dissimilar images only varies by about 5%.
From the network structure, VGG, EfficientNet, and MobileNet all have convolutions with batch normalization and ReLU in between, so the building blocks are the same. Then which design decision is unique to VGG so that its Gramian Matrix captures the style, while EfficientNet's and MobileNet's do not?
By now, I figured it out: The Gramian Matrix needs partially correlated features to work correctly. Newer networks are trained with a Dropout regularizer, which will reduce the inter-feature correlation.
I plot all my weights of my neural network on tensorboard, I found that some
weights of some layer is normally distributed:
but, some are not.
what does this imply? should I increase or decrease the capacity of this layer?
Update:
My network is a LSTM-based netowrk. the non-normal distributed weights is the weights multiply with input feature, the normal distributed weights is the weights multiply with states.
one explanation base on convolutional networks might be this(I don't know if this is true for any other kind of artificial neural models or not), hence the first layer tries to find distinct small features weights are distributed very widely and network tries to find any useful feature it can, then in the next layers combination of these distinct features are used, which make sense to put a normal distribution of weights hence every one of the previous features are going to be part of a single bigger or more representative feature in next layers.
but this was only my intuition I am not sure if this is the reason with proof now.
I have a dataset consist of around 5000 categories of images, but the number of images of every category varies from 20 to 2000, which is quite unbalanced. Also, the number of images are far from enough to train a model from scratch. I decided to do finetuning on pretrained models, like Inception models.
But I am not sure about how to deal with unbalanced data. There are several possible approaches:
Oversampling: Oversample the minority category. But even with aggressive image augmentation technique, we may not be able to deal with overfit.
Also, how to generate balanced batches from unbalanced dataset over so many categories? Do you have some ideas about this pipeline mechanism with TensorFlow?
SMOTE: I think it is not so effective for high dimensional signals like images.
Put weight on cross entropy loss in every batch. This might be useful for single batch, but cannot deal with the overall unbalance.
Any ideas about this? Any feedback will be appreciated.
Use tf.losses.softmax_cross_entropy and set weights for each class inversely proportional to their training frequency to "balance" the optimization.
Start with the pre-trained ImageNet layers, add your own final layers (with appropriate convolution, drop out and flatten layers as required). Freeze all but last few of the ImageNet layers, then train on your dataset.
For unbalanced data (and in general small datasets), use data augmentation to create more training images. Keras has this functionality built-in: Building powerful image classification models using very little data
I'm using TensorFlow to train a Convolutional Neural Network (CNN) for a sign language application. The CNN has to classify 27 different labels, so unsurprisingly, a major problem has been addressing overfitting. I've taken several steps to accomplish this:
I've collected a large amount of high-quality training data (over 5000 samples per label).
I've built a reasonably sophisticated pre-processing stage to help maximize invariance to things like lighting conditions.
I'm using dropout on the fully-connected layers.
I'm applying L2 regularization to the fully-connected parameters.
I've done extensive hyper-parameter optimization (to the extent possible given HW and time limitations) to identify the simplest model that can achieve close to 0% loss on training data.
Unfortunately, even after all these steps, I'm finding that I can't achieve much better that about 3% test error. (It's not terrible, but for the application to be viable, I'll need to improve that substantially.)
I suspect that the source of the overfitting lies in the convolutional layers since I'm not taking any explicit steps there to regularize (besides keeping the layers as small as possible). But based on examples provided with TensorFlow, it doesn't appear that regularization or dropout is typically applied to convolutional layers.
The only approach I've found online that explicitly deals with prevention of overfitting in convolutional layers is a fairly new approach called Stochastic Pooling. Unfortunately, it appears that there is no implementation for this in TensorFlow, at least not yet.
So in short, is there a recommended approach to prevent overfitting in convolutional layers that can be achieved in TensorFlow? Or will it be necessary to create a custom pooling operator to support the Stochastic Pooling approach?
Thanks for any guidance!
How can I fight overfitting?
Get more data (or data augmentation)
Dropout (see paper, explanation, dropout for cnns)
DropConnect
Regularization (see my masters thesis, page 85 for examples)
Feature scale clipping
Global average pooling
Make network smaller
Early stopping
How can I improve my CNN?
Thoma, Martin. "Analysis and Optimization of Convolutional Neural Network Architectures." arXiv preprint arXiv:1707.09725 (2017).
See chapter 2.5 for analysis techniques. As written in the beginning of that chapter, you can usually do the following:
(I1) Change the problem definition (e.g., the classes which are to be distinguished)
(I2) Get more training data
(I3) Clean the training data
(I4) Change the preprocessing (see Appendix B.1)
(I5) Augment the training data set (see Appendix B.2)
(I6) Change the training setup (see Appendices B.3 to B.5)
(I7) Change the model (see Appendices B.6 and B.7)
Misc
The CNN has to classify 27 different labels, so unsurprisingly, a major problem has been addressing overfitting.
I don't understand how this is connected. You can have hundreds of labels without a problem of overfitting.