My question is at the end.
An example CNN is trained with mini-batch GD and uses dropout in the last fully-connected layer (line 60) as
fc1 = tf.layers.dropout(fc1, rate=dropout, training=is_training)
At first I thought tf.layers.dropout or tf.nn.dropout randomly sets whole columns of neurons to zero, but I recently found that is not the case. The piece of code below shows what the dropout does: fc0 is a 4-sample x 10-feature matrix, and fc is its dropped-out version.
import tensorflow as tf
import numpy as np

fc0 = tf.random_normal([4, 10])   # 4 samples x 10 features
fc = tf.nn.dropout(fc0, 0.5)      # keep_prob = 0.5; kept values are scaled by 1/keep_prob

sess = tf.Session()
sess.run(tf.global_variables_initializer())
a, b = sess.run([fc0, fc])
np.savetxt("oo.txt", np.vstack((a, b)), fmt="%.2f", delimiter=",")
And in the output oo.txt (original matrix: lines 1-4, dropped-out matrix: lines 5-8):
0.10,1.69,0.36,-0.53,0.89,0.71,-0.84,0.24,-0.72,-0.44
0.88,0.32,0.58,-0.18,1.57,0.04,0.58,-0.56,-0.66,0.59
-1.65,-1.68,-0.26,-0.09,-1.35,-0.21,1.78,-1.69,-0.47,1.26
-1.52,0.52,-0.99,0.35,0.90,1.17,-0.92,-0.68,-0.27,0.68
0.20,0.00,0.71,-0.00,0.00,0.00,-0.00,0.47,-0.00,-0.87
0.00,0.00,0.00,-0.00,3.15,0.07,1.16,-0.00,-1.32,0.00
-0.00,-3.36,-0.00,-0.17,-0.00,-0.42,3.57,-3.37,-0.00,2.53
-0.00,1.05,-1.99,0.00,1.80,0.00,-0.00,-0.00,-0.55,1.35
My understanding of "proper" dropout is that it knocks out the same p% of units for every sample in a mini-batch (or batch) gradient descent step, and back-propagation then updates the weights and biases of that "thinned network". However, in the example's implementation, the neurons of each sample in one batch are dropped out independently, as illustrated in lines 5 to 8 of oo.txt, so the "thinned network" is different for each sample.
As a comparison, in the stochastic gradient descent case, samples are fed into the neural network one by one, so in each iteration the weights of the "thinned network" produced by tf.layers.dropout are updated.
My question is: in mini-batch or batch training, shouldn't the implementation knock out the same neurons for all samples in one batch? For example, by applying one mask to all samples of the input batch at each iteration?
Something like:
# ones: a 1xN all-ones tensor
# mask: a 1xN tensor whose entries are 0 (dropped) or 1/(1-rate) (kept); multiply fc1 by mask, broadcasting along the sample axis
mask = tf.layers.dropout(ones, rate=dropout, training=is_training)
fc1 = tf.multiply(fc1, mask)
Now I'm thinking the dropout strategy in the example may amount to a weighted way of updating the weights of a certain neuron: if a neuron is kept in only 1 out of 10 samples in a mini-batch, its weights are updated by alpha * 1/10 * (y_k_hat - y_k) * x_k, compared with alpha * 1/10 * sum[(y_k_hat - y_k) * x_k] for the weights of another neuron kept in all 10 samples?
Dropout is commonly used to prevent overfitting. In this case overfitting would show up as a huge weight applied to one of the neurons. By randomly making that neuron 0 from time to time, you force the network to use more neurons in determining the outcome. For this to work well you should drop different neurons for each example, so that the gradient you compute is more similar to the one you would get without dropout.
If you were to drop the same neurons for each example in the batch, my guess is that you would get a less stable gradient (this might not matter for your application).
In addition, dropout up-scales the remaining values to keep the average activation at about the same level. Without this, the network would learn wrong biases or would over-saturate when you turn dropout off.
If you still want the same neurons to be dropped across the whole batch, then apply dropout to an all-ones tensor of shape (1, num_neurons) and multiply it with the activations.
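A minimal TF1-style sketch of that suggestion (an illustrative snippet, not code from the question); note that the kept mask entries are 1/(1 - rate), so the up-scaling described above is preserved:
import tensorflow as tf

num_neurons = 10
ones = tf.ones([1, num_neurons])                          # shape (1, num_neurons)
mask = tf.layers.dropout(ones, rate=0.5, training=True)   # entries are 0.0 or 2.0 (= 1/(1 - rate))

fc1 = tf.random_normal([4, num_neurons])                  # a fake batch of activations
fc1_shared_drop = fc1 * mask                              # broadcast: same neurons zeroed for every sample

with tf.Session() as sess:
    m, d = sess.run([mask, fc1_shared_drop])
    print(m)   # one mask row
    print(d)   # the zeroed columns are identical across the whole batch
Both tf.nn.dropout and tf.layers.dropout also accept a noise_shape argument; passing noise_shape=[1, num_neurons] should give the same shared-mask behaviour without the explicit multiplication.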
When using dropout, you are effectively trying to estimate the average performance of the network for a randomly chosen dropout mask, using Monte-Carlo sampling (by differentiation under the integral sign, the average gradient is equal to the gradient of the average). By fixing a dropout mask for each mini-batch, you are just introducing correlation between successive gradient estimates, which increases the variance and leads to slower training.
Imagine using a different dropout-mask for each image in the mini-batch, but forming the mini-batch from k copies of the same image; it's obvious that this would be a complete waste of effort!
Related
While training a neural network on the Fashion MNIST dataset, I decided to have more nodes in my output layer than there are classes in the dataset.
The dataset has 10 classes, while I trained my network to have 15 nodes in the output layer. I also used a softmax.
Now surprisingly, this gave me an accuracy of 97% which is quite good.
This leads me to the question, what do those extra 5 nodes even mean, and what do they do here?
Why is my softmax able to work properly when the label range (0-9) isn't equal to the number of nodes (15)?
And finally, in general, what does it mean to have more nodes in your output layer than the number of classes, in a classification task?
I understand the effects of having fewer nodes than the number of classes, and also that the rule of thumb is to use number of nodes = number of classes. Yet I've never seen anyone use a greater number of nodes, and I'd like to understand why/why not.
I'm attaching some code so that the results can be reproduced. This was done using TensorFlow 2.3.
import tensorflow as tf
print(tf.__version__)

mnist = tf.keras.datasets.mnist
(training_images, training_labels), (test_images, test_labels) = mnist.load_data()
training_images = training_images / 255.0
test_images = test_images / 255.0

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(256, activation=tf.nn.relu),
    tf.keras.layers.Dense(15, activation=tf.nn.softmax)
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(training_images, training_labels, epochs=5)
model.evaluate(test_images, test_labels)
The only reason you are able to use such a configuration is that you have specified your loss function as sparse_categorical_crossentropy.
Let's first understand the effect of the extra output nodes on forward propagation.
Consider a neural network with 2 layers.
1st layer - 6 neurons (Hidden layer)
2nd layer - 4 neurons (output layer)
You have a dataset X whose shape is (100, 12), i.e. 12 features and 100 rows.
You have labels y of shape (100,), containing two unique values, 0 and 1.
Therefore this is essentially a binary classification problem, but we will use 4 neurons in our output layer.
Consider each neuron as a logistic regression unit. Each of your neurons will therefore have 12 weights (w1, w2, ..., w12).
Why? Because you have 12 features.
Each neuron will output a single term given by a. I will give the computation of a in two steps.
z = w1*x1 + w2*x2 + ... + w12*x12 + w0   # w0 is the bias
a = activation(z)
Therefore, your 1st layer will output 6 values for each row in our dataset.
So now you have a feature matrix of shape (100, 6).
This is passed to the 2nd layer and the same process repeats.
So in essence you are able to complete the forward propagation step even when you have more neurons than the actual classes.
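To make the shape bookkeeping concrete, here is a small NumPy sketch of that forward pass (illustrative only, with made-up random weights): (100, 12) -> (100, 6) -> (100, 4).
import numpy as np

rng = np.random.default_rng(0)

X = rng.normal(size=(100, 12))                    # 100 rows, 12 features
W1, b1 = rng.normal(size=(12, 6)), np.zeros(6)    # hidden layer: 6 neurons
W2, b2 = rng.normal(size=(6, 4)), np.zeros(4)     # output layer: 4 neurons

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

A1 = np.maximum(0, X @ W1 + b1)                   # (100, 6)
A2 = softmax(A1 @ W2 + b2)                        # (100, 4): the forward pass goes through fine
print(A1.shape, A2.shape)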
Now let's see backpropagation.
For backpropagation to exist you must be able to calculate the loss_value.
We will take a small example:
y_true contains two labels, as in our problem, and each row of y_pred has 4 probability values since we have 4 units in our final layer.
y_true = [0, 1]
y_pred = [[0.03, 0.90, 0.02, 0.05], [0.15, 0.02, 0.8, 0.03]]
# Using 'auto'/'sum_over_batch_size' reduction type.
scce = tf.keras.losses.SparseCategoricalCrossentropy()
scce(y_true, y_pred).numpy() # 3.7092905
How is it calculated?
-( log(0.03) + log(0.02) ) / 2 ≈ 3.709
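A quick NumPy check of that number:
import numpy as np

# negative mean of the log-probabilities assigned to the true classes
loss = -(np.log(0.03) + np.log(0.02)) / 2
print(loss)   # 3.7092905...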
So essentially we can compute the loss, and hence we can also compute its gradients.
Therefore there is no problem in using backpropagation either.
Therefore our model can train perfectly well and achieve 90% accuracy.
So, the final question: what are these extra neurons (i.e. neuron 2 and neuron 3) representing?
Ans - They represent the probability of the example belonging to class 2 and class 3 respectively. But since the labels contain no examples of class 2 or class 3, these neurons never supply the true-class term of the loss; they only enter through the softmax normalisation, so the network simply learns to keep their probabilities near zero.
Note - If you encode your labels as one-hot vectors and use categorical_crossentropy as your loss, you will encounter an error.
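For example (an illustrative snippet, not from the original post), the shape mismatch should show up as soon as the loss is evaluated:
import tensorflow as tf

y_true_onehot = tf.one_hot([0, 1], depth=10)   # labels one-hot encoded over 10 classes
y_pred = tf.constant([[1.0 / 15] * 15] * 2)    # predictions from a model with 15 output units

cce = tf.keras.losses.CategoricalCrossentropy()
# Expected to fail: the label depth (10) does not match the prediction depth (15),
# whereas SparseCategoricalCrossentropy only needs integer class indices.
loss = cce(y_true_onehot, y_pred)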
I have a very simple dense-layer model: it takes 10 input values, has 20 units in the hidden layer and 1 unit in the output layer, uses "relu" as the activation function, and the Adam optimizer with learning rate 0.01.
densemodel=keras_model_sequential();
layer_dense(densemodel, input_shape=ncol(trainingX), units=20, activation="relu")
layer_dropout(densemodel, rate=0.1)
layer_dense(densemodel, units=1, activation="relu")
optimizer=optimizer_adam(lr=0.01,clipnorm=1);
compile(densemodel, optimizer=optimizer, loss="logcosh", metrics = list("mean_squared_error"))
I trained the model with n = 2e4 training records and ran into a serious gradient explosion, which was finally confirmed to be caused by a few outliers (n < 10) in the training records.
Without removing the outlier records, none of the following strategies, alone or in combination, fixed the gradient explosion problem:
kernel_regularizer, bias_regularizer, activity_regularizer, clipnorm=1, clipvalue=0.5 or 0.1, setting the learning rate to 1e-5, adding a dropout layer, increasing the batch size.
Basically none of them worked.
I expected at least clipnorm or clipvalue to work, since according to their definitions:
clipnorm: Gradients will be clipped when their L2 norm exceeds this value.
clipvalue: Gradients will be clipped when their absolute value exceeds this value.
So why did they fail?
I've been running into an issue lately trying to train a simple MLP.
I'm basically trying to get a network to map the XYZ position and RPY orientation of the end-effector of a robot arm (6-dimensional input) to the angle of every joint of the robot arm to reach that position (6-dimensional output), so this is a regression problem.
I've generated a dataset using the angles to compute the current position, and generated datasets with 5k, 500k and 500M sets of values.
My issue is the MLP I'm using doesn't learn anything at all. Using Tensorboard (I'm using Keras), I've realized that the output of my very first layer is always zero (see image 1), no matter what I try.
Basically, my input is a shape (6,) vector and the output is also a shape (6,) vector.
Here is what I've tried so far, without success:
I've tried MLPs with 2 layers of size 12, 24; 2 layers of size 48, 48; 4 layers of size 12, 24, 24, 48.
Adam, SGD, RMSprop optimizers
Learning rates ranging from 0.15 to 0.001, with and without decay
Both Mean Squared Error (MSE) and Mean Absolute Error (MAE) as the loss function
Normalizing the input data, and not normalizing it (the first 3 values are between -3 and +3, the last 3 are between -pi and pi)
Batch sizes of 1, 10, 32
Tested the MLP on all 3 datasets of 5k values, 500k values and 5M values.
Tested with the number of epochs ranging from 10 to 1000
Tested multiple initializers for the bias and kernel.
Tested both the Sequential model and the Keras functional API (to make sure the issue wasn't how I called the model)
All 3 of sigmoid, relu and tanh activation functions for the hidden layers (the last layer uses a linear activation because it's a regression)
Additionally, I've tried the very same MLP architecture on the basic Boston housing price regression dataset from Keras, and the net was definitely learning something, which leads me to believe that there may be some kind of issue with my data. However, I'm at a complete loss as to what it may be, as the system in its current state does not learn anything at all; the loss function just stalls starting from the 1st epoch.
Any help or lead would be appreciated, and I will gladly provide code or data if needed!
Thank you
EDIT:
Here's a link to 5k samples of the data I'm using. Columns B-G are the output (angles used to generate the position/orientation) and columns H-M are the input (XYZ position and RPY orientation). https://drive.google.com/file/d/18tQJBQg95ISpxF9T3v156JAWRBJYzeiG/view
Also, here's a snippet of the code I'm using:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import SGD

df = pd.read_csv('kinova_jaco_data_5k.csv',
                 names=['state0', 'state1', 'state2', 'state3', 'state4', 'state5',
                        'pose0', 'pose1', 'pose2', 'pose3', 'pose4', 'pose5'])

states = np.asarray(
    [df.state0.to_numpy(), df.state1.to_numpy(), df.state2.to_numpy(),
     df.state3.to_numpy(), df.state4.to_numpy(), df.state5.to_numpy()]).transpose()
poses = np.asarray(
    [df.pose0.to_numpy(), df.pose1.to_numpy(), df.pose2.to_numpy(),
     df.pose3.to_numpy(), df.pose4.to_numpy(), df.pose5.to_numpy()]).transpose()
x_train_temp, x_test, y_train_temp, y_test = train_test_split(poses, states, test_size=0.2)
x_train, x_val, y_train, y_val = train_test_split(x_train_temp, y_train_temp, test_size=0.2)
mean = x_train.mean(axis=0)
x_train -= mean
std = x_train.std(axis=0)
x_train /= std
x_test -= mean
x_test /= std
x_val -= mean
x_val /= std
n_epochs = 100
n_hidden_layers=2
n_units=[48, 48]
inputs = Input(shape=(6,), dtype='float32', name='input')
x = Dense(units=n_units[0], activation='relu', name='dense1')(inputs)
for i in range(1, n_hidden_layers):
    # 'relu' stands in for the hidden-layer activation; the post also tried sigmoid and tanh here
    x = Dense(units=n_units[i], activation='relu', name='dense' + str(i + 1))(x)
out = Dense(units=6, activation='linear', name='output_layer')(x)

model = Model(inputs=inputs, outputs=out)
optimizer = SGD(lr=0.1, momentum=0.4)
model.compile(optimizer=optimizer, loss='mse', metrics=['mse', 'mae'])

history = model.fit(x_train, y_train,
                    epochs=n_epochs,
                    verbose=1,
                    validation_data=(x_test, y_test),
                    batch_size=32)
Edit 2
I've tested the architecture with a random dataset where the input was a (6,) vector with input[i] a random number and the output was a (6,) vector with output[i] = input[i]², and the network didn't learn anything. I've also tested a random dataset where the input was a random number and the output was a linear function of the input, and the loss converged to 0 pretty quickly. In short, it seems this simple architecture is unable to map a non-linear function.
the output of my very first layer is always zero.
This typically means that the network does not "see" any pattern in the input at all, which causes it to always predict the mean of the target over the entire training set, regardless of the input. Your output is in the range of -π to π, probably with an expected value of 0, so it checks out.
My guess is that the model is too small to represent the data efficiently. I would suggest that you increase the number of parameters in the model by a factor of 10 or 100 and see if it starts seeing something. Limiting the number of parameters has a regularizing effect on the network, and strong regularization usually leads to the aforementioned derping to the mean.
I'm by no means a robotics expert, but I guess that there are a lot of situations where a small nudge in the target pose causes a large change in the joint configuration. Let's say I'm trying to scratch my back with my left hand - the farther my hand goes to the left, the harder the task becomes, so at some point I might want to switch hands, which is a discontinuous configuration change. A bad analogy, sure, but I hope it demonstrates my hunch that there are certain places in the configuration space where small target changes cause large configuration changes.
Such large changes will cause a very large, very noisy gradient around those points. I'm not sure how well the network will work around these noisy gradients, but I would suggest as an experiment that you try to limit the training dataset to a set of outputs that are connected smoothly to one another in the configuration space of the arm, if that makes sense. Going further, you should remove any points from the dataset that are close to such configuration boundaries. To make up for that at inference time, you might instead want to sample several close-by points and choose the most common prediction as the final result. Hopefully some of those points will land in a smooth configuration area.
Also, adding batch normalization before each dense layer will help smooth the gradient and provide for more reliable training.
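A minimal Keras sketch of those two suggestions combined (more parameters plus batch normalization before each dense layer); the layer sizes here are illustrative guesses, not values taken from the question:
from tensorflow.keras import Model
from tensorflow.keras.layers import BatchNormalization, Dense, Input

inputs = Input(shape=(6,), name='input')
x = inputs
for i, units in enumerate([512, 512, 256]):             # roughly 10-100x more parameters than 48/48
    x = BatchNormalization(name='bn' + str(i + 1))(x)   # BN before each dense layer
    x = Dense(units, activation='relu', name='dense' + str(i + 1))(x)
out = Dense(6, activation='linear', name='output_layer')(x)

model = Model(inputs=inputs, outputs=out)
model.compile(optimizer='adam', loss='mse', metrics=['mae'])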
As for the rest of your hyperparameters:
A batch size of 32 is good; a very small batch size will make the gradient too noisy.
The loss function is not critical; both MSE and MAE should work.
The activation functions aren't critical; ReLU is a good default choice.
The default initializers are good enough.
Normalizing the input is important for Dense layers, so keep it.
Train for as many epochs as you need as long as both the training and validation loss are dropping. If the validation loss hasn't dropped for 5-10 epochs you might as well stop early.
Adam is a good default choice. Start with a small learning rate and increase the learning rate at the beginning of training only if the training loss is dropping consistently over several epochs.
Further reading: 37 Reasons why your Neural Network is not working
I ended up replacing the first dense layer with a Conv1D layer and the network now seems to be learning decently. It's overfitting to my data, but that's territory I'm okay with.
I'm closing the thread for now, I'll spend some time playing with the architecture.
Suppose my data consists of images of bubbles, and the labels are histograms describing the distribution of sizes, for example:
0-10mm 10%
10-20mm 30%
20-30mm 40%
30-40mm 20%
It is important to note that -
All size percentages sum to 100% (or 1.0 to be more precise).
I don't have annotated data, so I can't train an object detector and then just calculate the distribution by counting detected objects. However, I do have a feature extractor trained on my data.
I implemented a simple CNN that consists of -
Resnet50 backbone.
Global max pooling.
1x1 convolution of 6 filters (6 distribution bins in labels).
After some experiments I came to the conclusion that softmax with a cross-entropy loss function does not suit my problem and needs.
I thought that maybe a cosine similarity loss, with a light modification, could be a good alternative (normalization will be part of post-processing). This is the implementation:
def cosine_similarity_loss(logits, probs, weights=1.0, label_smoothing=0):
    x1_val = tf.sqrt(tf.reduce_sum(tf.matmul(logits, tf.transpose(logits)), axis=1))
    x2_val = tf.sqrt(tf.reduce_sum(tf.matmul(probs, tf.transpose(probs)), axis=1))
    denom = tf.multiply(x1_val, x2_val)
    num = tf.reduce_sum(tf.multiply(logits, probs), axis=1)
    cosine_sim = tf.math.divide(num, denom)
    cosine_dist = tf.math.reduce_mean(1 - tf.square(cosine_sim))  # cosine distance; reduce_mean for shape compatibility
    return cosine_dist
The loss is a sum of the cosine distance and L2 regularization on the weights. After the first forward pass I got loss: 3.1267, and after the second forward pass I got loss: 96003645440.0000, meaning the weights exploded (logits: [[-785595.812 -553858.625 -545579.625 -148547.875 -12845.8633 19871.1055]] while probs: [[0.466 0.297 0.19 0.047 0 0]]).
What could be the reason for such rapid and extreme increase?
My guess is that cosine distance does an internal normalisation of the logits, removing the magnitude, so there is no gradient to propagate that opposes the values increasing. By the way, the weights argument is not used in your implementation.
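A quick NumPy check of that intuition (illustrative, made-up numbers): scaling the logits by any constant leaves the cosine distance unchanged, so nothing in the loss penalises their magnitude.
import numpy as np

def cosine_distance(u, v):
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

probs  = np.array([0.466, 0.297, 0.19, 0.047, 0.0, 0.0])
logits = np.array([0.5, 0.3, 0.2, 0.05, -0.01, 0.02])

print(cosine_distance(logits, probs))         # some value d
print(cosine_distance(1e6 * logits, probs))   # exactly the same d: the magnitude is free to explode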
What about just a plain Euclidean distance, using sigmoid instead of softmax in the last layer? Also, I would try adding another one or two dense layers (say of size 512) between the ResNet50 backbone and the output layer.
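A rough tf.keras sketch of that suggestion (hypothetical layer sizes and input shape, not the asker's actual pipeline):
from tensorflow.keras import Model
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.layers import Dense, GlobalMaxPooling2D, Input

inputs = Input(shape=(224, 224, 3))                 # input size is an assumption
features = ResNet50(include_top=False, weights=None)(inputs)
x = GlobalMaxPooling2D()(features)
x = Dense(512, activation='relu')(x)                # extra dense layer(s), as suggested
x = Dense(512, activation='relu')(x)
out = Dense(6, activation='sigmoid')(x)             # sigmoid per bin instead of softmax

model = Model(inputs, out)
model.compile(optimizer='adam', loss='mse')         # plain Euclidean-style distance; renormalise bins in post-processing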
Training fully convolutional networks (FCNs) for pixel-wise semantic segmentation is very memory intensive, so we often use batch_size = 1 for training FCNs. However, when we fine-tune pretrained networks that contain BatchNorm (BN) layers, batch_size = 1 doesn't make sense for the BN layers. So, how should the BN layers be handled?
Some options:
delete the BN layers (merge the BN layers with the preceding layers for the pretrained model)
Freeze the parameters and statistics of the BN layers
....
Which is better, and is there any demo implementation in PyTorch/TF/Caffe?
Having only one element in the batch will make the batch-normalized output zero if epsilon is non-zero (the variance is zero and the mean equals the input).
It's better to delete the BN layers from the network and try the SELU activation function (scaled exponential linear units). This is from the paper 'Self-Normalizing Neural Networks' (SNNs).
Quote from the paper:
While batch normalization requires explicit normalization, neuron activations of SNNs automatically converge towards zero mean and unit variance. The activation function of SNNs are "scaled exponential linear units" (SELUs), which induce self-normalizing properties.
The SELU is defined as:
def selu(x, name="selu"):
    alpha = 1.6732632423543772848170429916717
    scale = 1.0507009873554804934193349852946
    return scale * tf.where(x >= 0.0, x, alpha * tf.nn.elu(x))
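For example, the custom selu above can be dropped in as the activation of an ordinary layer (a TF1-style sketch, to match the snippet; recent TensorFlow versions also ship tf.nn.selu built in):
import tensorflow as tf

x = tf.random_normal([8, 64])                       # a fake batch of features
h = tf.layers.dense(x, units=128, activation=selu)  # use the custom selu as the layer activation
Note that the SNN paper pairs SELU with the lecun_normal initializer and AlphaDropout rather than regular dropout.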
Batch Normalization was introduced to reduce the internal covariate shift of the input feature maps. Because the parameters of each layer change after every optimization step, the input distribution of a layer also changes, and this slows down model convergence. Batch Normalization is meant to normalize that input distribution irrespective of the batch_size (whether batch_size = 1 or larger).
BN normalizes the input distribution.
For a convolutional network, the input to an intermediate layer is a 4D tensor [batch_size, width, height, num_filters], and the normalization affects all the feature maps.
delete the BN layers (merge the BN layers with the preceding layers for the pretrained model)
This may further slow down the training step, and convergence may not be achieved.
Freeze the parameters and statistics of the BN layers
Sometimes the input data distribution for retraining/fine-tuning may differ significantly from the original data used to train the pretrained model used for initialization, in which case your model may end up in a non-optimal solution.
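For the freezing option, a minimal tf.keras sketch (the model here is an arbitrary stand-in; substitute your own pretrained FCN):
import tensorflow as tf

model = tf.keras.applications.ResNet50(weights=None)   # stand-in for a pretrained backbone

# Freeze every BatchNorm layer: its parameters stop updating and, in TF 2.x,
# setting trainable=False also switches the layer to inference mode, so it uses
# the stored moving statistics instead of the statistics of a 1-image batch.
for layer in model.layers:
    if isinstance(layer, tf.keras.layers.BatchNormalization):
        layer.trainable = False
In PyTorch the rough analogue would be calling .eval() on each BatchNorm module before every training step and setting requires_grad = False on its parameters.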
According to my experiments in PyTorch, if the convolutional layer before the BN outputs more than one value (i.e. 1 x feat_nb x height x width, where height > 1 or width > 1), then the BN still works fine even when the batch size equals one. However, I suspect that in this case the variance estimate might be very biased, since all samples used for the variance calculation come from the same image. Therefore, in my case I still decided to use a small batch.
The effective batch size over a convolutional layer
I think the CNN-related section (Section 3.2) of the original BN paper could help. From the authors' point of view, it should be OK to use batch size = 1 for convolutional layers: the "effective batch size" for a convolutional layer is actually batch_size * image_height * image_width. For example, with batch_size = 1 on a 32 x 32 feature map, the BN statistics of each channel are still estimated from 1 * 32 * 32 = 1024 activations.
I do not have an exact answer, but here are my thoughts:
networks with BatchNorm (BN) layers, batchsize=1 doesn't make sense
for the BN layers
The main motivation of BN is to fix the distribution (mean/variance) of the layer's input within the batch. In my opinion, having only one element does not make sense: judging from the paper, you would need to calculate the mean and the variance from a single element, which does not make sense.
You can always just remove BN, but are you sure you can't afford at least 16 elements in the batch?
My observation is contrary to Stephan's: using PyTorch on a similar input (batch x feat_nb x height x width, where height > 1 or width > 1), I found that adding BatchNorm after the last conv and before the last non-linearity (sigmoid) actually hurts the accuracy by a big margin. I'm still trying to make sense of it.
(batch size = 8)