Lately I've been reading up on the memory consumed by convolutional neural networks (ConvNets). During training, each convolutional layer keeps a number of intermediate activations around because they are required for back-propagation of the gradients. These lecture notes suggest that these activations can in principle be discarded at test time. A quote from the linked notes:
Usually, most of the activations are on the earlier layers of a ConvNet (i.e. first Conv Layers). These are kept around because they are needed for backpropagation, but a clever implementation that runs a ConvNet only at test time could in principle reduce this by a huge amount, by only storing the current activations at any layer and discarding the previous activations on layers below.
Is there any way (using TensorFlow) to make use of this "clever implementation" for inference of large batches? Is there some flag that specifies whether or not the model is in its training phase? Or is this already handled automatically based on whether the optimiser function is called?
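For what it's worth, in current TensorFlow (tf.keras) the phase is controlled per call rather than by whether the optimiser runs: layers take a training argument, and intermediate activations are only retained while a tf.GradientTape is recording. A minimal sketch (the model and shapes here are made up for illustration):

```python
import tensorflow as tf

# Made-up model for illustration; BatchNormalization and Dropout are the
# kinds of layers whose behaviour depends on the phase.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(10),
])

images = tf.random.normal([64, 28, 28, 1])  # placeholder batch

# Inference: training=False selects the test-time code path, and since no
# tf.GradientTape is recording, intermediate activations are not kept
# around for backpropagation.
logits = model(images, training=False)
```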
Related
I plotted all the weights of my neural network on TensorBoard and found that the weights of some layers are normally distributed, but some are not.
What does this imply? Should I increase or decrease the capacity of those layers?
Update:
My network is LSTM-based. The non-normally distributed weights are the ones multiplied with the input features; the normally distributed weights are the ones multiplied with the states.
One explanation, based on convolutional networks, might be this (I don't know whether it holds for other kinds of neural models): the first layers try to find distinct small features, so their weights are distributed very widely as the network hunts for any useful feature it can. In the later layers, combinations of these distinct features are used, which makes a normal distribution of weights plausible, since each of the previous features contributes to a single bigger, more representative feature in the next layers.
But this is only my intuition; I am not sure this is the reason, and I have no proof.
This tutorial has a TensorFlow implementation of the batch normalization layer for the training and testing phases.
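For intuition, here is a simplified NumPy sketch of the logic such an implementation uses (not the tutorial's actual code): mini-batch statistics during training, fixed running estimates at test time.

```python
import numpy as np

def batch_norm(x, gamma, beta, running_mean, running_var,
               training, momentum=0.9, eps=1e-5):
    """Batch normalization of a 2-D input, one feature per column."""
    if training:
        # Training phase: normalize with the current mini-batch statistics...
        mean, var = x.mean(axis=0), x.var(axis=0)
        # ...and update the running estimates used later at test time.
        running_mean = momentum * running_mean + (1 - momentum) * mean
        running_var = momentum * running_var + (1 - momentum) * var
    else:
        # Test phase: use the fixed running estimates instead.
        mean, var = running_mean, running_var
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta, running_mean, running_var
```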
When we use transfer learning, is it OK to use a batch normalization layer, especially when the data distributions are different?
In the inference phase, the BN layer just uses a fixed mean and variance (calculated from the training distribution).
So if our model sees data with a different distribution, can it give wrong results?
With transfer learning, you transfer the learned parameters from one domain to another.
Usually this means keeping the learned values of the convolutional layers fixed while adding new fully connected layers that learn to classify the features extracted by the CNN.
When you add batch normalization to a layer, you inject values sampled from the input distribution into the layer in order to force its output to be normally distributed.
To do that, you compute the exponential moving average of the layer's output during training, and in the testing phase you subtract this value from the layer's output.
Although data dependent, these mean values (one per convolutional layer) are computed on the output of the layer, and thus on the learned transformation.
Thus, in my opinion, the various averages that the BN layer subtracts from its convolutional layer's output are general enough to be transferred: they are computed on the transformed data, not on the original data.
Moreover, the convolutional layers learn to extract local patterns, so they're more robust and harder to throw off.
Thus, in short and in my opinion:
you can apply transfer learning to convolutional layers with batch norm applied. But for fully connected layers the influence of the computed statistics (which see the whole input, not only local patches) can be too data dependent, so I'd avoid it there.
However, as a rule of thumb: if you're unsure about something, just try it and see if it works!
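To make the "keep the convolutional part fixed" setup concrete, here is a minimal tf.keras sketch; ResNet50 and the 5-class head are stand-ins for whatever you actually use. Note that in tf.keras, freezing the base also makes its BN layers use their stored statistics:

```python
import tensorflow as tf

# Pretrained convolutional base (ResNet50 is a stand-in); freezing it also
# makes its BN layers run with their stored mean/variance in tf.keras.
base = tf.keras.applications.ResNet50(include_top=False, pooling="avg",
                                      weights="imagenet")
base.trainable = False

# New fully connected head for a hypothetical 5-class target task.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(5, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
```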
According to the Keras documentation, dropout layers show different behavior in the training and test phases:
Note that if your model has a different behavior in training and
testing phase (e.g. if it uses Dropout, BatchNormalization, etc.), you
will need to pass the learning phase flag to your function:
Unfortunately, nobody talks about the actual differences. Why should dropout behave differently in the test phase? I expect the layer to set a certain fraction of neurons to 0. Why should this behavior depend on the training/test phase?
Dropout is used in the training phase to reduce the chance of overfitting. As you mention, this layer deactivates certain neurons, which makes the model less sensitive to the weights of any individual node. Basically, with the dropout layer the trained model becomes the average of many thinned models. Check a more detailed explanation here.
However, when you apply your trained model you want to use its full power. You want to use all neurons of the trained (average) network to get the highest accuracy.
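Concretely, most frameworks (Keras included) implement "inverted dropout", which rescales the surviving activations at training time so that the test phase is a plain identity. A minimal NumPy sketch:

```python
import numpy as np

def dropout(x, rate, training):
    """Inverted dropout: rescale at train time so test time is a no-op."""
    if not training:
        # Test phase: the full (average) network is used unchanged.
        return x
    # Training phase: zero out a random subset of activations and scale
    # the survivors up so the expected activation stays the same.
    keep = 1.0 - rate
    mask = np.random.binomial(1, keep, size=x.shape)
    return x * mask / keep
```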
I've read a lot of papers on ConvNets, but there is one thing I don't understand: how are the filters in a convolutional layer initialized?
For example, in the first layer, filters should detect edges etc.
But if they are initialized randomly, wouldn't they be inaccurate? The same goes for the next layers and higher-level features.
And another question: what is the range of the values in those filters?
Many thanks!
You can either initialize the filters randomly or pretrain them on some other data set.
Some references:
http://deeplearning.net/tutorial/lenet.html:
Notice that a randomly initialized filter acts very much like an edge
detector!
Note that we use the same weight initialization formula as with the
MLP. Weights are sampled randomly from a uniform distribution in the
range [-1/fan-in, 1/fan-in], where fan-in is the number of inputs to a
hidden unit. For MLPs, this was the number of units in the layer
below. For CNNs however, we have to take into account the number of
input feature maps and the size of the receptive fields.
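Spelled out in code, the quoted fan-in formula for a convolutional layer might look like this (the shapes are made-up examples):

```python
import numpy as np

# Example shapes: 3 input feature maps, 5x5 receptive field, 32 filters.
n_in_maps, filter_h, filter_w, n_filters = 3, 5, 5, 32
fan_in = n_in_maps * filter_h * filter_w  # inputs to one hidden unit

bound = 1.0 / fan_in
filters = np.random.uniform(-bound, bound,
                            size=(n_filters, n_in_maps, filter_h, filter_w))
```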
http://cs231n.github.io/transfer-learning/ :
Transfer Learning
In practice, very few people train an entire Convolutional Network
from scratch (with random initialization), because it is relatively
rare to have a dataset of sufficient size. Instead, it is common to
pretrain a ConvNet on a very large dataset (e.g. ImageNet, which
contains 1.2 million images with 1000 categories), and then use the
ConvNet either as an initialization or a fixed feature extractor for
the task of interest. The three major Transfer Learning scenarios look
as follows:
ConvNet as fixed feature extractor. Take a ConvNet pretrained on ImageNet, remove the last fully-connected layer (this layer's outputs
are the 1000 class scores for a different task like ImageNet), then
treat the rest of the ConvNet as a fixed feature extractor for the new
dataset. In an AlexNet, this would compute a 4096-D vector for every
image that contains the activations of the hidden layer immediately
before the classifier. We call these features CNN codes. It is
important for performance that these codes are ReLUd (i.e. thresholded
at zero) if they were also thresholded during the training of the
ConvNet on ImageNet (as is usually the case). Once you extract the
4096-D codes for all images, train a linear classifier (e.g. Linear
SVM or Softmax classifier) for the new dataset.
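As a concrete illustration of this first scenario, here is a rough sketch using tf.keras and scikit-learn; VGG16 and the placeholder data stand in for the AlexNet and dataset in the quote:

```python
import tensorflow as tf
from sklearn.svm import LinearSVC

# Pretrained ConvNet as a fixed feature extractor; pooling="avg" yields
# one feature vector ("CNN code") per image.
extractor = tf.keras.applications.VGG16(include_top=False, pooling="avg",
                                        weights="imagenet")

# Placeholder data; in practice: your new dataset, resized and preprocessed.
new_images = tf.random.normal([100, 224, 224, 3])
labels = [i % 5 for i in range(100)]

codes = extractor.predict(new_images)  # the CNN codes
clf = LinearSVC().fit(codes, labels)   # linear classifier on top
```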
Fine-tuning the ConvNet. The second strategy is to not only replace and retrain the classifier on top of the ConvNet on the new
dataset, but to also fine-tune the weights of the pretrained network
by continuing the backpropagation. It is possible to fine-tune all the
layers of the ConvNet, or it's possible to keep some of the earlier
layers fixed (due to overfitting concerns) and only fine-tune some
higher-level portion of the network. This is motivated by the
observation that the earlier features of a ConvNet contain more
generic features (e.g. edge detectors or color blob detectors) that
should be useful to many tasks, but later layers of the ConvNet
become progressively more specific to the details of the classes
contained in the original dataset. In case of ImageNet for example,
which contains many dog breeds, a significant portion of the
representational power of the ConvNet may be devoted to features that
are specific to differentiating between dog breeds.
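A minimal sketch of that kind of partial fine-tuning in tf.keras (VGG16 and the cutoff of four layers are arbitrary stand-ins):

```python
import tensorflow as tf

# Stand-in pretrained network; freeze the earlier, more generic layers
# and fine-tune only the higher-level portion (the cutoff is arbitrary).
model = tf.keras.applications.VGG16(weights="imagenet", include_top=False)
for layer in model.layers[:-4]:
    layer.trainable = False  # earlier layers stay fixed during training
```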
Pretrained models. Since modern ConvNets take 2-3 weeks to train across multiple GPUs on ImageNet, it is common to see people release
their final ConvNet checkpoints for the benefit of others who can use
the networks for fine-tuning. For example, the Caffe library has a
Model Zoo where people
share their network weights.
When and how to fine-tune? How do you decide what type of transfer learning you should perform on a new dataset? This is a function of
several factors, but the two most important ones are the size of the
new dataset (small or big), and its similarity to the original dataset
(e.g. ImageNet-like in terms of the content of images and the classes,
or very different, such as microscope images). Keeping in mind that
ConvNet features are more generic in early layers and more
original-dataset-specific in later layers, here are some common rules
of thumb for navigating the 4 major scenarios:
New dataset is small and similar to original dataset. Since the data is small, it is not a good idea to fine-tune the ConvNet due to
overfitting concerns. Since the data is similar to the original data,
we expect higher-level features in the ConvNet to be relevant to this
dataset as well. Hence, the best idea might be to train a linear
classifier on the CNN codes.
New dataset is large and similar to the original dataset. Since we have more data, we can have more confidence that we won't overfit
if we were to try to fine-tune through the full network.
New dataset is small but very different from the original dataset. Since the data is small, it is likely best to only train a
linear classifier. Since the dataset is very different, it might not
be best to train the classifier from the top of the network, which
contains more dataset-specific features. Instead, it might work better
to train the SVM classifier from activations somewhere earlier in the
network.
New dataset is large and very different from the original dataset. Since the dataset is very large, we may expect that we can
afford to train a ConvNet from scratch. However, in practice it is
very often still beneficial to initialize with weights from a
pretrained model. In this case, we would have enough data and
confidence to fine-tune through the entire network.
Practical advice. There are a few additional things to keep in mind when performing Transfer Learning:
Constraints from pretrained models. Note that if you wish to use a pretrained network, you may be slightly constrained in terms of the
architecture you can use for your new dataset. For example, you can't
arbitrarily take out Conv layers from the pretrained network. However,
some changes are straight-forward: Due to parameter sharing, you can
easily run a pretrained network on images of different spatial size.
This is clearly evident in the case of Conv/Pool layers because their
forward function is independent of the input volume spatial size (as
long as the strides "fit"). In case of FC layers, this still holds
true because FC layers can be converted to a Convolutional Layer: For
example, in an AlexNet, the final pooling volume before the first FC
layer is of size [6x6x512]. Therefore, the FC layer looking at this
volume is equivalent to having a Convolutional Layer that has
receptive field size 6x6, and is applied with padding of 0.
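This FC-to-conv equivalence is easy to check numerically; a small tf.keras sketch with untrained weights, just to verify the shapes:

```python
import tensorflow as tf

# An FC layer with 4096 outputs over a [6x6x512] volume is the same as a
# 6x6 convolution with 4096 filters and no padding.
fc_as_conv = tf.keras.layers.Conv2D(4096, kernel_size=6, padding="valid")

x = tf.random.normal([1, 6, 6, 512])
y = fc_as_conv(x)
print(y.shape)  # (1, 1, 1, 4096): one "FC" output per spatial position
```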
Learning rates. It's common to use a smaller learning rate for ConvNet weights that are being fine-tuned, in comparison to the
(randomly-initialized) weights for the new linear classifier that
computes the class scores of your new dataset. This is because we
expect that the ConvNet weights are relatively good, so we don't wish
to distort them too quickly and too much (especially while the new
Linear Classifier above them is being trained from random
initialization).
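One way to realize those two learning rates in tf.keras is to use two optimizers in a custom training step; a rough sketch (the backbone/head split and the rates are illustrative):

```python
import tensorflow as tf

# Hypothetical split: `backbone` holds the pretrained conv weights and
# `head` the randomly initialized classifier; the rates are illustrative.
slow = tf.keras.optimizers.Adam(1e-5)  # fine-tuned pretrained weights
fast = tf.keras.optimizers.Adam(1e-3)  # fresh linear classifier

def train_step(backbone, head, loss_fn, x, y):
    with tf.GradientTape() as tape:
        loss = loss_fn(y, head(backbone(x, training=True), training=True))
    b_vars = backbone.trainable_variables
    h_vars = head.trainable_variables
    b_grads, h_grads = tape.gradient(loss, [b_vars, h_vars])
    slow.apply_gradients(zip(b_grads, b_vars))  # small steps for the base
    fast.apply_gradients(zip(h_grads, h_vars))  # larger steps for the head
    return loss
```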
Additional References
CNN Features off-the-shelf: an Astounding Baseline for Recognition trains SVMs on features from an ImageNet-pretrained ConvNet and reports several state-of-the-art results.
DeCAF reported similar findings in 2013. The framework in this paper (DeCAF) was a Python-based precursor to the C++ Caffe library.
How transferable are features in deep neural networks? studies the transfer
learning performance in detail, including some unintuitive findings
about layer co-adaptations.
I'm using TensorFlow to train a Convolutional Neural Network (CNN) for a sign language application. The CNN has to classify 27 different labels, so unsurprisingly, a major problem has been addressing overfitting. I've taken several steps to accomplish this:
I've collected a large amount of high-quality training data (over 5000 samples per label).
I've built a reasonably sophisticated pre-processing stage to help maximize invariance to things like lighting conditions.
I'm using dropout on the fully-connected layers.
I'm applying L2 regularization to the fully-connected parameters.
I've done extensive hyper-parameter optimization (to the extent possible given hardware and time limitations) to identify the simplest model that can achieve close to 0% loss on the training data.
Unfortunately, even after all these steps, I'm finding that I can't achieve much better than about 3% test error. (That's not terrible, but for the application to be viable, I'll need to improve it substantially.)
I suspect that the source of the overfitting lies in the convolutional layers since I'm not taking any explicit steps there to regularize (besides keeping the layers as small as possible). But based on examples provided with TensorFlow, it doesn't appear that regularization or dropout is typically applied to convolutional layers.
The only approach I've found online that explicitly deals with prevention of overfitting in convolutional layers is a fairly new approach called Stochastic Pooling. Unfortunately, it appears that there is no implementation for this in TensorFlow, at least not yet.
So in short, is there a recommended approach to prevent overfitting in convolutional layers that can be achieved in TensorFlow? Or will it be necessary to create a custom pooling operator to support the Stochastic Pooling approach?
Thanks for any guidance!
How can I fight overfitting?
Get more data (or data augmentation)
Dropout (see paper, explanation, dropout for cnns)
DropConnect
Regularization (see my master's thesis, page 85, for examples; the sketch after this list combines a few of these items)
Feature scale clipping
Global average pooling
Make network smaller
Early stopping
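To make a few of these concrete in Keras, here is a rough sketch that combines L2 regularization on conv kernels, spatial dropout between conv blocks, and global average pooling (the shapes and rates are placeholders):

```python
import tensorflow as tf

reg = tf.keras.regularizers.l2(1e-4)  # L2 on conv kernels too, not just FC

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu",
                           kernel_regularizer=reg, input_shape=(64, 64, 3)),
    tf.keras.layers.MaxPooling2D(),
    # Spatial dropout drops whole feature maps, a common variant for convs.
    tf.keras.layers.SpatialDropout2D(0.2),
    tf.keras.layers.Conv2D(64, 3, activation="relu", kernel_regularizer=reg),
    tf.keras.layers.GlobalAveragePooling2D(),  # global average pooling
    tf.keras.layers.Dense(27),                 # the 27 sign-language labels
])
```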
How can I improve my CNN?
Thoma, Martin. "Analysis and Optimization of Convolutional Neural Network Architectures." arXiv preprint arXiv:1707.09725 (2017).
See chapter 2.5 for analysis techniques. As written in the beginning of that chapter, you can usually do the following:
(I1) Change the problem definition (e.g., the classes which are to be distinguished)
(I2) Get more training data
(I3) Clean the training data
(I4) Change the preprocessing (see Appendix B.1)
(I5) Augment the training data set (see Appendix B.2)
(I6) Change the training setup (see Appendices B.3 to B.5)
(I7) Change the model (see Appendices B.6 and B.7)
Misc
The CNN has to classify 27 different labels, so unsurprisingly, a major problem has been addressing overfitting.
I don't see how this is connected: you can have hundreds of labels without overfitting being a problem.