neural networks - variance of gradients in minibatch - tensorflow

I'm using tensorflow to try to investigate the local reparameterization trick [Kingma et al., 2015] and the effect it has on the variance of the gradients. I'm getting strange results, however, and I'm concerned I may be misunderstanding what's going on.
My understanding is as follows: given a loss function (or lower bound, etc.), it's possible to compute, for each data point, a matrix of derivatives of this loss with respect to each weight in the weight matrix for the output layer (if these are the gradients we want to examine). For any one of these weights, the variance is then taken across the data points' derivatives with respect to that weight. So if we start with a (1000, 1000) weight matrix, we finish with a (1000, 1000) matrix whose entry (i, j) is the variance, across the data points in the mini-batch, of the derivative of the loss with respect to weight (i, j).
At this point we can take the mean of all variances in the matrix to give us the final average variance. Is this what's being referred to when people talk about the variance of the gradients?
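For concreteness, here is a minimal sketch (TF 2.x eager API; model, loss_fn and weight_var are hypothetical names, not from the question) of the quantity described above: per-example gradients of the loss with respect to one weight matrix, the variance of each entry across the mini-batch, and finally the mean of those variances.
import tensorflow as tf

def mean_gradient_variance(model, loss_fn, x_batch, y_batch, weight_var):
    per_example_grads = []
    for x, y in zip(x_batch, y_batch):
        with tf.GradientTape() as tape:
            pred = model(tf.expand_dims(x, 0), training=True)
            loss = loss_fn(tf.expand_dims(y, 0), pred)
        # gradient of this single example's loss w.r.t. the chosen weight matrix
        per_example_grads.append(tape.gradient(loss, weight_var))
    grads = tf.stack(per_example_grads)                      # (batch, *weight_var.shape)
    var_per_weight = tf.math.reduce_variance(grads, axis=0)  # variance across the batch
    return tf.reduce_mean(var_per_weight)                    # scalar average variance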

Related

Is the loss of the mini-batch the mean of the losses for each sample?

I know this has probably been addressed many a time, but I'm constantly hearing conflicting points and I'm uncertain as to how I should go about computing the loss function, and moreover, how to compute the gradient over a mini-batch.
Let's say I have a simple linear regression ANN model with one input, one output and no activation function. The weight matrix (W) and bias matrix (B) then just have the shape (1, 1). If I batch my data into mini-batches of size 32, then the input matrix (X) will have dimensions (1, 32). Forward prop is then performed without a hitch: it's just W.X + B, which works fine because the shapes are compatible. It then produces the predictions, which can be denoted by a matrix Yi with shape (1, 32). The cost is then computed as the mean squared error of the model outputs and the truth values. In this case, there's only one output, so the cost over one training example is just (truth − predicted)².
So I'm confused about a couple of aspects at this point. Do you a) compute the average cost over a mini-batch, then compute the derivative of the averaged cost w.r.t. the weight and bias; or b) calculate individual costs for each example in the mini-batch, compute the derivatives of those costs w.r.t. the weight and bias, and then finally sum the gradients and average them?
Since the gradient is a linear operator,
grad((cost(x1) + ... + cost(xn))/n) = (grad(cost(x1)) + ... + grad(cost(xn)))/n,
so options (a) and (b) give exactly the same result.
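A quick numerical check of this (NumPy, with made-up data for the 1-input, 1-output linear model described above), comparing (a) the derivative of the averaged cost with (b) the average of the per-example derivatives for the weight W:
import numpy as np

rng = np.random.default_rng(0)
W, B = 2.0, -1.0
X = rng.normal(size=(1, 32))      # mini-batch of 32 inputs
Y = 3.0 * X + 0.5                 # arbitrary "truth" values
pred = W * X + B

# (a) derivative of the averaged squared-error cost w.r.t. W
grad_a = np.mean(2 * (pred - Y) * X)

# (b) per-example derivatives, then averaged
grad_b = np.mean([2 * (pred[0, i] - Y[0, i]) * X[0, i] for i in range(32)])

print(np.isclose(grad_a, grad_b))  # True: the two approaches coincide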

Cosine similarity loss causes weight values to explode

Suppose my data consists of images of bubbles, and the labels are histograms describing the distribution of sizes, for example:
0-10mm 10%
10-20mm 30%
20-30mm 40%
30-40mm 20%
It is important to note that -
All size percentages sum to 100% (or 1.0 to be more precise).
I don't have annotated data, so I can't train an object detector and then just calculate the distribution by counting the detected objects. However, I do have a feature extractor trained on my data.
I implemented a simple CNN that consists of -
Resnet50 backbone.
Global max pooling.
1x1 convolution of 6 filters (6 distribution bins in labels).
After some experiments I came to the conclusion that softmax with a cross-entropy loss does not suit my problem and needs.
I thought that maybe a cosine similarity loss, with a light modification, might be a good alternative (normalization will be part of post-processing). This is the implementation:
def cosine_similarity_loss(logits, probs, weights=1.0, label_smoothing=0):
    x1_val = tf.sqrt(tf.reduce_sum(tf.matmul(logits, tf.transpose(logits)), axis=1))
    x2_val = tf.sqrt(tf.reduce_sum(tf.matmul(probs, tf.transpose(probs)), axis=1))
    denom = tf.multiply(x1_val, x2_val)
    num = tf.reduce_sum(tf.multiply(logits, probs), axis=1)
    cosine_sim = tf.math.divide(num, denom)
    # Cosine distance; reduce_mean for shape compatibility.
    cosine_dist = tf.math.reduce_mean(1 - tf.square(cosine_sim))
    return cosine_dist
The loss is the sum of the cosine distance and L2 regularization on the weights. After the first feed forward I got loss: 3.1267, and after the second feed forward I got loss: 96003645440.0000, meaning the weights exploded (logits: [[-785595.812 -553858.625 -545579.625 -148547.875 -12845.8633 19871.1055]] while probs: [[0.466 0.297 0.19 0.047 0 0]]).
What could be the reason for such a rapid and extreme increase?
My guess is that the cosine distance does an internal normalisation of the logits, removing the magnitude, and thus there is no gradient to propagate that opposes the values increasing. BTW, the weights argument is not used in your implementation.
What about just plain Euclidean distance, using sigmoid instead of softmax in the last layer? Also, I would try adding another one or two dense layers (say, of size 512) between the ResNet50 backbone and the output layer.
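Not the asker's code, but a minimal sketch of that suggestion: sigmoid on the logits followed by the squared Euclidean distance to the target histogram (TF 1.x-style ops as in the question; the function name is made up):
import tensorflow as tf

def euclidean_distance_loss(logits, probs):
    preds = tf.sigmoid(logits)  # bounded outputs, so the magnitude is constrained
    # mean over the batch of the per-example squared Euclidean distance
    return tf.reduce_mean(tf.reduce_sum(tf.square(preds - probs), axis=1))
Unlike the cosine distance, this penalizes the magnitude of the outputs directly, so the logits have no incentive to grow without bound.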

An Efficient way to Calculate loss function batchwise?

I am using autoencoders to do anomaly detection. So, I have finished training my model and now I want to calculate the reconstruction loss for each entry in the dataset, so that I can assign anomalies to data points with high reconstruction loss.
This is my current code to calculate the reconstruction loss
But this is really slow. By my estimation, it should take 5 hours to go through the dataset whereas training one epoch occurs in approx 55 mins.
I feel that the convert-to-tensor operation is bottlenecking the code, but I can't find a better way to do it.
I've tried changing the batch sizes but it does not make much of a difference. I have to use the convert to tensor part because K.eval is throwing an error if I do it normally.
for i in range(0, encoded_dataset.shape[0], batch_size):
    y_true = tf.convert_to_tensor(encoded_dataset[i:i+batch_size].values, np.float32)
    y_pred = tf.convert_to_tensor(ae1.predict(encoded_dataset[i:i+batch_size].values), np.float32)
    # Append the batch losses (numpy array) to the list
    reconstruction_loss_transaction.append(K.eval(loss_function(y_true, y_pred)))
I was able to train in 55 mins per epoch. So I feel prediction should not take 5 hours per epoch. encoded_dataset is a variable that has the entire dataset in main memory as a data frame.
I am using an Azure VM instance.
K.eval(loss_function(y_true, y_pred)) is to find the loss for each row of the batch. So y_true will be of size (batch_size, 2000) and so will y_pred. K.eval(loss_function(y_true, y_pred)) will give me an output of (batch_size, 1), evaluating binary cross-entropy on each row of y_true and y_pred.
Moved from comments:
My suspicion is that ae1.predict and K.eval(loss_function) are behaving in unexpected ways. ae1.predict should normally be used to output the loss function value as well as y_pred. When you create the model, specify that the loss value is another output (you can have a list of multiple outputs), then just call predict here once to get both y_pred and the loss value in one call.
But I want the loss for each row. Won't the loss returned by the predict method be the mean loss for the entire batch?
The answer depends on how the loss function is implemented. Both ways produce perfectly valid and identical results in TF under the hood. You could average the loss over the batch before taking the gradient, or take the gradient of a vector of losses. The gradient operation in TF will perform the averaging of the losses for you if you use the latter approach (see SO articles on taking the per-sample gradient, it's actually hard to do).
If Keras implements the loss with reduce_mean built into it, you can just define your own loss. If you're using squared loss, replace 'mean_squared_error' with lambda y_true, y_pred: tf.square(y_pred - y_true). That would produce the squared error instead of the MSE (no difference to the gradient), but look here for the variant including the mean.
In any case this produces a per sample loss so long as you don't use tf.reduce_mean, which is purely optional in the loss. Another option is to simply compute the loss separately from what you optimize for and make that an output of the model, also perfectly valid.
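To make that concrete, here is a hedged sketch of the vectorised route: one predict() call over the whole dataset, then a per-row binary cross-entropy computed in NumPy afterwards (ae1 and encoded_dataset are the question's variables; the batch size and clipping constant are assumptions):
import numpy as np

X = encoded_dataset.values.astype(np.float32)
X_pred = ae1.predict(X, batch_size=256)   # single batched forward pass over the dataset

eps = 1e-7
X_pred = np.clip(X_pred, eps, 1 - eps)
# per-row binary cross-entropy, shape (n_samples,)
reconstruction_loss = -np.mean(X * np.log(X_pred) + (1 - X) * np.log(1 - X_pred), axis=1)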

Backpropagation with python/numpy - calculating derivative of weight and bias matrices in neural network

I'm developing a neural network model in python, using various resources to put together all the parts. Everything is working, but I have questions about some of the math. The model has a variable number of hidden layers and uses relu activation for all hidden layers except for the last one, which uses sigmoid.
The cost function is:
def calc_cost(AL, Y):
    m = Y.shape[1]
    # binary cross-entropy averaged over the m examples
    cost = (-1/m) * np.sum((Y * np.log(AL)) + ((1 - Y) * np.log(1 - AL)))
    return cost
where AL is the probability prediction after the last sigmoid activation is applied.
In part of my implementation of backpropagation, I use the following
def linear_backward_step(dZ, A_prev, W, b):
    m = A_prev.shape[1]
    dW = (1/m) * np.dot(dZ, A_prev.T)
    db = (1/m) * np.sum(dZ, axis=1, keepdims=True)
    dA_prev = np.dot(W.T, dZ)
    return dA_prev, dW, db
where, given dZ (the derivative of the cost with respect to the linear step of forward propagation at a given layer), the derivatives of the cost with respect to the layer's weight matrix W (dW), bias vector b (db), and the previous layer's activation (dA_prev) are each calculated.
The forward part that complements this step is this equation: Z = np.dot(W, A_prev) + b
My question is: in calculating dW and db, why is it necessary to multiply by 1/m? I've tried differentiating this using calculus rules but I'm unsure how this term fits in.
Any help is appreciated!
Your gradient calculation seems wrong: you should not multiply it by 1/m. Your calculation of m also seems wrong. It should be
# note it's not A_prev.shape[1]
m = A_prev.shape[0]
The same goes for the definition in your calc_cost function:
# should not be Y.shape[1]
m = Y.shape[0]
You can refer to the following example for more information: Neural Network Case Study
This actually depends on your loss function and on whether you update your weights after each sample or batch-wise. Take a look at the following old-fashioned general-purpose cost function:
C = (1/n) * Σ_i (ŷ_i − y_i)²
(Source: MSE Cost Function for Training Neural Network)
Here, ŷ_i is your network's output and y_i is your target value.
If you differentiate this with respect to ŷ_i you'll never get rid of the 1/n or the sum, because the derivative of a sum is the sum of the derivatives, and since 1/n is a factor of that sum, you won't be able to get rid of it either. Now think about what standard gradient descent is actually doing: it updates your weights after calculating the average over all n samples. Stochastic gradient descent can be used to update after each sample, so you won't have to average. Mini-batch updates calculate the average over each batch, which I guess in your case is the 1/m, where m is the batch size.
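As a quick sanity check of this (NumPy, with toy shapes I've made up), the analytic gradient that includes the 1/m factor matches a finite-difference gradient of the cost averaged over the m examples:
import numpy as np

rng = np.random.default_rng(1)
m = 8
W = rng.normal(size=(1, 3))
b = np.zeros((1, 1))
A_prev = rng.normal(size=(3, m))
Y = rng.integers(0, 2, size=(1, m))

def cost(W):
    Z = np.dot(W, A_prev) + b
    AL = 1 / (1 + np.exp(-Z))              # sigmoid output layer
    return (-1/m) * np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL))

# analytic gradient: dZ = AL - Y for sigmoid + cross-entropy, dW = (1/m) * dZ . A_prev^T
Z = np.dot(W, A_prev) + b
AL = 1 / (1 + np.exp(-Z))
dW = (1/m) * np.dot(AL - Y, A_prev.T)

# finite-difference check on one entry of W
eps = 1e-6
W_plus = W.copy();  W_plus[0, 0] += eps
W_minus = W.copy(); W_minus[0, 0] -= eps
print(np.isclose(dW[0, 0], (cost(W_plus) - cost(W_minus)) / (2 * eps)))  # True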

dropout with relu activations

I am trying to implement a neural network with dropout in tensorflow.
tf.layers.dropout(inputs, rate, training)
From the documentation: "Dropout consists in randomly setting a fraction rate of input units to 0 at each update during training time, which helps prevent overfitting. The units that are kept are scaled by 1 / (1 - rate), so that their sum is unchanged at training time and inference time."
Now I understand this behavior if dropout is applied on top of sigmoid activations that are strictly above zero. If half of the input units are zeroed, the sum of all the outputs will also be halved, so it makes sense to scale them by a factor of 2 in order to regain some kind of consistency before the next layer.
Now what if one uses the tanh activation, which is centered around zero? The reasoning above no longer holds true, so is it still valid to scale the output of dropout by the mentioned factor? Is there a way to prevent tensorflow dropout from scaling the outputs?
Thanks in advance
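For the last part of the question, one way to avoid the 1 / (1 - rate) rescaling is to apply a Bernoulli keep-mask yourself instead of calling tf.layers.dropout; a minimal sketch (the function name is made up, and it assumes tf.random.uniform and tf.cond are available in your TF version):
import tensorflow as tf

def unscaled_dropout(inputs, rate, training):
    def dropped():
        # keep each unit with probability (1 - rate), but do NOT rescale the survivors
        keep_mask = tf.cast(tf.random.uniform(tf.shape(inputs)) >= rate, inputs.dtype)
        return inputs * keep_mask
    return tf.cond(tf.cast(training, tf.bool), dropped, lambda: inputs)
Note that without the rescaling you are back to the original (non-inverted) dropout formulation, where you would instead multiply the activations by (1 - rate) at inference time so that the expected input to the next layer matches training.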
If you have a set of inputs to a node and a set of weights, their weighted sum is a value, S. You can define another random variable by selecting a random fraction f of the original inputs; the weighted sum of that subset, using the same weights, is S * f in expectation. From this, you can see that the argument for rescaling is exact if the objective is that the mean of the sum remains the same with and without dropping. This would hold exactly when the activation function is linear over the range of the weighted sums of subsets, and approximately when the activation function is approximately linear over that range.
After passing the linear combination through any non-linear activation function, it is no longer true that rescaling exactly preserves the expected mean. However, if the contribution to a node is not dominated by a small number of nodes, the variance of the sum over a randomly selected subset of a chosen, fairly large size will be relatively small, and if the activation function is approximately linear near the output value, rescaling will work well to produce an output with approximately the same mean. E.g. the logistic and tanh functions are approximately linear over any small region. Note that the range of the function is irrelevant; only the differences between its values matter.
With relu activation, if the original weighted sum is close enough to zero for the weighted sums of subsets to fall on both sides of zero (a non-differentiable point of the activation function), rescaling won't work so well, but this is a relatively rare situation, limited to outputs that are small, so it may not be a big problem.
The main observations here are that rescaling works best with large numbers of nodes making significant contributions, and relies on local approximate linearity of activation functions.
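A quick simulation (NumPy, with arbitrary toy numbers) of this argument: with the 1 / (1 - rate) rescaling, the mean of the pre-activation weighted sum over random dropout masks stays close to the full weighted sum S.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
w = rng.normal(size=1000)
S = np.dot(w, x)                      # full weighted sum
rate = 0.5

rescaled_sums = []
for _ in range(10000):
    keep = rng.random(1000) >= rate   # Bernoulli keep-mask
    rescaled_sums.append(np.dot(w, x * keep) / (1 - rate))

print(S, np.mean(rescaled_sums))      # the two agree closely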
The point of setting the node's output to zero is so that the neuron has no effect on the neurons it feeds into. This creates sparsity and hence attempts to reduce overfitting. When using sigmoid or tanh, the value is still set to zero.
I think your line of reasoning here is incorrect: think of contribution rather than sum.