I read batch normalization and dropout are two different ways to avoid overfitting in neural networks. Is it relevant to use both in the same estimator as following ?
```
model1 = tf.estimator.DNNClassifier(feature_columns=feature_columns_complex_standardized,
hidden_units=[512,512,512],
optimizer=tf.train.AdamOptimizer(learning_rate=0.001, beta1= 0.9,beta2=0.99, epsilon = 1e-08,use_locking=False),
weight_column=weights,
dropout=0.5,
activation_fn=tf.nn.softmax,
n_classes=10,
label_vocabulary=Action_vocab,
model_dir='./Models9/Action/',
loss_reduction=tf.losses.Reduction.SUM_OVER_BATCH_SIZE,
config=tf.estimator.RunConfig().replace(save_summary_steps=10),
batch_norm=True)
There is a small problem in your understanding. Batch Normalization original intent is not to reduce overfitting but to speed up the training. Just like how you normalize the inputs while you are passing it to the first layer of your network, batch normalization achieves this action in inner (or hidden) layers. Batch normalization removes the effect of covariate shift while it is training.
But since this is applied at every batches separately, it results in a side effect of regularizing your weight parameters. This regularizing effect is quite similar to that of how you would have done had you intended to solve over-fitting.
You can apply both batch_norm and dropout together but it is advisable to reduce the dropout. Currently, your dropout rate at 0.5 is very high. I believe dropout of 0.1 to 0.2 should be enough when you are applying it together with batch_norm. Also, the value of dropout is a hyper-parameter, so there is no fixed answer to it and you may have to tune it as per your data input and network.
Both batch normalization and dropout gives the regularization effect in some way or another.
As you apply the batch normalization for normalization steps it sees all the training example in mini-batch together to reduce the internal covariate shift which helps in speeding up the training and not setting the learning rate low and also gives the regularization effect.
If batch normalization is used along the network, then the dropout regularization can be reduced or dropped in strength
Context:
Suppose we have a simple 3-layer feed-forward network. The hidden size of the first linear layer is 100000 -- W1[input_size, 100000] in which input_size is a number much smaller than 100000. Some of neurons won't be learning any thing. I want to select and shutdown these neurons using pruning.
Expected outcomes
After pruning the selected neurons, we will have a smaller network with the less neurons in the first layer, say reduced to 500. And this smaller network turns out to have the same predicting capacity to the large one.
My implementation:
According to some criterion (some metrics applied to check weight similarities after each backpropagation update), I have cheery picked the indices of neurons I want to shut down, e.g., [1,7,8 ...].
Zero out the weights represented by the indices in W1, W1[:, 1, 7, 8 ...] = 0. So that no information will be passed forward via these neurons to the next layer.
Will that be enough? Should I be manually intervening the backpropagation as well? Zero out neurons stop only computations passing forward, but for learning/updating on weights, backpropgation matters more. Since I am using pytorch, it will be great if illustrations are provided in pytorch, other frameworks like tensorflow, Keras are also fine.
My question is in the end.
An example CNN trained with mini-batch GD and used the dropout in the last fully-connected layer (line 60) as
fc1 = tf.layers.dropout(fc1, rate=dropout, training=is_training)
At first I thought the tf.layers.dropout or tf.nn.dropout randomly sets neurons to zero in columns. But I recently found it's not the case. The below piece of code prints what the dropout does. I used the fc0 as a 4 sample x 10 feature matrix, and the fc as the dropped out version.
import tensorflow as tf
import numpy as np
fc0 = tf.random_normal([4, 10])
fc = tf.nn.dropout(fc0, 0.5)
sess = tf.Session()
sess.run(tf.global_variables_initializer())
a, b = sess.run([fc0, fc])
np.savetxt("oo.txt", np.vstack((a, b)), fmt="%.2f", delimiter=",")
And in the output oo.txt (original matrix: line 1-4, dropped out matrix: line 5-8):
0.10,1.69,0.36,-0.53,0.89,0.71,-0.84,0.24,-0.72,-0.44
0.88,0.32,0.58,-0.18,1.57,0.04,0.58,-0.56,-0.66,0.59
-1.65,-1.68,-0.26,-0.09,-1.35,-0.21,1.78,-1.69,-0.47,1.26
-1.52,0.52,-0.99,0.35,0.90,1.17,-0.92,-0.68,-0.27,0.68
0.20,0.00,0.71,-0.00,0.00,0.00,-0.00,0.47,-0.00,-0.87
0.00,0.00,0.00,-0.00,3.15,0.07,1.16,-0.00,-1.32,0.00
-0.00,-3.36,-0.00,-0.17,-0.00,-0.42,3.57,-3.37,-0.00,2.53
-0.00,1.05,-1.99,0.00,1.80,0.00,-0.00,-0.00,-0.55,1.35
My understanding of the proper? dropout is, knocking out p% same units for each sample in a mini-batch or batch gradient descent phase, and the back-propagation updates the weights and biases of the "thinned network". However, in the implementation of the example, the neurons of each sample in one batch were randomly dropped out, as illustrated in the oo.txt line 5 to 8, and for each sample, the "thinned network" is different.
As a comparison, in a stochastic gradient descent case, samples are fed into the neural network one-by-one, and in each iteration, weights of each tf.layers.dropout introduced "thinned network" are updated.
My question is, in the mini-batch or batch training, shouldn't it be implemented to knock out same neurons for all samples in one batch? Maybe by applying one mask to all input batch samples at each iteration?
Something like:
# ones: a 1xN all 1s tensor
# mask: a 1xN 0-1 tensor, multiply fc1 by mask with broadcasting along the axis of samples
mask = tf.layers.dropout(ones, rate=dropout, training=is_training)
fc1 = tf.multiply(fc1, mask)
Now I'm thinking the dropout strategy in the example may be a weighted way of updating weights of a certain neuron, that if a neuron is kept in 1 out of 10 samples in a mini-batch, its weights will be updated by alpha * 1/10 * (y_k_hat-y_k) * x_k, compared with alpha * 1/10 * sum[(y_k_hat-y_k) * x_k] for weights of another neuron kept in all 10 samples?
the screenshot from here
Dropouts are commonly used to prevent overfitting. In this case it would be a huge weight applied to one of the neurons. By randomly making it 0 from time to time, you force the network to use more neurons in determining the outcome. For this to work well you should drop different neurons for each example so that the gradient you compute is more similar to the one you would get without the dropout.
If you were to drop the same neurons for each example in the batch, my guess is that you will have a less stable gradient (might not matter for your application).
In addition dropout up-scales the rest of the values to keep the average activation at about the same level. Without it the network would learn wrong biases or would over-saturate when you turn dropout off.
If you still want the same neurons to be dropped in the batch then apply dropout to a all 1 tensor of shape (1, num_neurons) and then multiply it with the activations.
When using dropout, you are effectively trying to estimate the average performance of the network for a randomly chosen dropout mask, using Monte-Carlo sampling (by differentiation under the integral sign, the average gradient is equal to the gradient of the average). By fixing a dropout mask for each mini-batch, you are just introducing correlation between successive gradient estimates, which increases the variance and leads to slower training.
Imagine using a different dropout-mask for each image in the mini-batch, but forming the mini-batch from k copies of the same image; it's obvious that this would be a complete waste of effort!
Id like to use dropout to improve my model's final prediction accuracy.
To my understanding :
1) Using drop out we randomly assign neurons to zero.
2) Thereby in the next layer the neuron's values are dependent only on a subset of the previous layers neurons.
3) Therefore during training the neurons's weights are trained to be stronger than they should be ( inversely proportional to drop out rate ) to compensate for the 'droped' neurons in the previous layer.
4) After training when we want to use all of the neurons without dropout we need to compensate by multiplying the neurons in each layer by (1-dropout rate).
Am I correct? And if so does tf.nn.dropout() take care of this or do I need to do it manually?
Thanks in advance :)
When you do perform dropout, random neurons are chosen in that layer and their weights are made 0. When the model is trained, you get two outputs, the architecture of your model and a h5 file that consists of weights. After the training is done, you save these files in your storage and can load anytime for prediction. You do not have to multiply add weight to any neuron manually. TF takes care of it.
In the FCN paper, the authors discuss the patch wise training and fully convolutional training. What is the difference between these two?
Please refer to section 4.4 attached in the following.
It seems to me that the training mechanism is as follows,
Assume the original image is M*M, then iterate the M*M pixels to extract N*N patch (where N<M). The iteration stride can some number like N/3 to generate overlapping patches. Moreover, assume each single image corresponds to 20 patches, then we can put these 20 patches or 60 patches(if we want to have 3 images) into a single mini-batch for training. Is this understanding right? It seems to me that this so-called fully convolutional training is the same as patch-wise training.
The term "Fully Convolutional Training" just means replacing fully-connected layer with convolutional layers so that the whole network contains just convolutional layers (and pooling layers).
The term "Patchwise training" is intended to avoid the redundancies of full image training.
In semantic segmentation, given that you are classifying each pixel in the image, by using the whole image, you are adding a lot of redundancy in the input. A standard approach to avoid this during training segmentation networks is to feed the network with batches of random patches (small image regions surrounding the objects of interest) from the training set instead of full images. This "patchwise sampling" ensures that the input has enough variance and is a valid representation of the training dataset (the mini-batch should have the same distribution as the training set). This technique also helps to converge faster and to balance the classes. In this paper, they claim that is it not necessary to use patch-wise training and if you want to balance the classes you can weight or sample the loss.
In a different perspective, the problem with full image training in per-pixel segmentation is that the input image has a lot of spatial correlation. To fix this, you can either sample patches from the training set (patchwise training) or sample the loss from the whole image. That is why the subsection is called "Patchwise training is loss sampling".
So by "restricting the loss to a randomly sampled subset of its spatial terms excludes patches from the gradient computation." They tried this "loss sampling" by randomly ignoring cells from the last layer so the loss is not calculated over the whole image.