Cosine similarity loss cause weight values to explode - tensorflow

Suppose my data consists of images of bubbles, and the labels are histograms describing the distribution of sizes, for example:
0-10mm 10%
10-20mm 30%
20-30mm 40%
30-40mm 20%
It is important to note that -
All size percentages sum to 100% (or 1.0 to be more precise).
I don't have annotated data, so i can't train an object detector and then just calculate the distribution by counting objects detected. However, i do have a feature extractor train on my data.
I implemented a simple CNN that consists of -
Resnet50 backbone.
Global max pooling.
1x1 convolution of 6 filters (6 distribution bins in labels).
After some experiments i came to the conclusion that softmax and cross entropy as loss function does not suit my problem and needs.
I thought that maybe a cosine similarity loss, with a light modification, may be a good alternative (normalization will be part of post process). This is the implementation:
def cosine_similarity_loss(logits, probs, weights=1.0, label_smoothing=0):
x1_val = tf.sqrt(tf.reduce_sum(tf.matmul(logits, tf.transpose(logits)), axis=1))
x2_val = tf.sqrt(tf.reduce_sum(tf.matmul(probs, tf.transpose(probs)), axis=1))
denom = tf.multiply(x1_val, x2_val)
num = tf.reduce_sum(tf.multiply(logits, probs), axis=1)
cosine_sim = tf.math.divide(num, denom)
cosine_dist = tf.math.reduce_mean(1 - tf.square(cosine_sim)) # Cosine Distance. Reduce mean for shape compatibility.
return cosine_dist
Loss is a summation of cosine distance and l2 regularization on weights. After first feed forward i got loss: 3.1267 and after second feed forward i got loss: 96003645440.0000 - meaning weights exploded (logits: [[-785595.812 -553858.625 -545579.625 -148547.875 -12845.8633 19871.1055]] while probs: [[0.466 0.297 0.19 0.047 0 0]]).
What could be the reason for such rapid and extreme increase?

My guess is cosine distance does an internal normalisation of the logits, removing the magnitude, and thus there is no gradient to propogate that opposes the values increasing. BTW weights is not used in your implementation.
What about just plain Euclidian distance using sigmoid instead of softmax in the last layer. Also, I would try adding another one or two dense layers (say size 512) between resnet50 and output dense layer.

Related

Is the loss of the mini-batch the mean of the losses for each sample?

I know this has probably been addressed many a time, but I'm constantly hearing conflicting points and I'm uncertain as to how I should go about computing the loss function, and moreover, how to compute the gradient over a mini-batch.
Let's say I have a simple linear regression ANN model with one input, one output and no activation function. The weight matrix (W) and bias matrix (B) then just have the shape (1, 1). If I batch my data into mini-batches of size 32, then the input matrix (X) will have dimensions (1, 32). Forward prop is then performed without a hitch: it's just W.X + B which works fine because the shapes are compatible. It then produces the predictions which can be denoted by matrix Yi with shape (1, 32). The cost is then computed as the mean squared error of the model outputs and the truth values. In this case, there's only one output, so the cost over one training example is just (truth - predicted)2.
So I'm confused about a couple of aspects at this point. Do you a) compute the average cost over a mini-batch, then compute the derivative of the averaged cost w.r.t to the weight and bias; or b) calculate individual costs for each example in the mini-batch and then compute the derivatives of the costs w.r.t the weight and bias, and then finally sum the gradients and average them?
Since gradient is a linear operator,
grad((cost(x1)+...+cost(xn))/n)=(grad(cost(x1))+...grad(cost(xn)))/n.

Keras dense model gradient explosion

I have a very simple dense layer model takes 10 input values, 20 units in hidden layer, 1 unit in output layer, and "relu" as activation function, adam optimizer with learning rate 0.01
densemodel=keras_model_sequential();
layer_dense(densemodel, input_shape=ncol(trainingX), units=20, activation="relu")
layer_dropout(densemodel, rate=0.1)
layer_dense(densemodel, units=1, activation="relu")
optimizer=optimizer_adam(lr=0.01,clipnorm=1);
compile(densemodel, optimizer=optimizer, loss="logcosh", metrics = list("mean_squared_error"))
I trained the model with n = 2e4 training data and ran into serious gradient explosion, which was finally confirmed caused by some outliers (n < 10) in the training records.
Without removing the the outlier records, any one or combination of the following strategies failed to address the gradient explosion problem.
kernel_regularizer, bias_regularizer, activity_regularizer, clipnorm=1, clipvalue=0.5 or 0.1, set learning rate to 1e-5, add drop out layer, increase batch size.
basically none of them work.
I expect at least clipnorm or clipvalue should work since according to definition
clipnorm: Gradients will be clipped when their L2 norm exceeds this
value.
clipvalue: Gradients will be clipped when their absolute value exceeds
this value.
but why they failed?

How to make a selective back-propagation in a mini-batch in Tensorflow?

Recently, I'm working on a project "predicting future trajectories of objects from their past trajectories by using LSTMs in Tensorflow."
(Here, a trajectory means a sequence of 2D positions.)
Input to the LSTM is, of course, 'past trajectories' and output is 'future trajectories'.
The size of mini-batch is fixed when training. However, the number of past trajectories in a mini-batch can be different. For example, let the mini-batch size be 10. If I have only 4 past trajectories for the current training iteration, 6 out of 10 in the mini-batch is padded with zero value.
When calculating the loss for the back-propagation, I let the loss from the 6 be zero so that the only 4 contribute to the back-propagation.
The problem that I concern is..it seems that Tensorflow still calculates gradients for the 6 even if their loss is zero. As a result, the training speed becomes slower as I increase the mini-batch size even if I used the same training data.
I also used tf.where function when calculating the loss. However, the training time does not decrease.
How can I reduce the training time?
Here I attached my pseudo code for training.
# For each frame in a sequence
for f in range(pred_length):
# For each element in a batch
for b in range(batch_size):
with tf.variable_scope("rnnlm") as scope:
if (f > 0 or b > 0):
scope.reuse_variables()
# for each pedestrian in an element
for p in range(MNP):
# ground-truth position
cur_gt_pose = ...
# loss mask
loss_mask_ped = ... # '1' or '0'
# go through RNN decoder
output_states_dec_list[b][p], zero_states_dec_list[b][p] = cell_dec(cur_embed_frm_dec,
zero_states_dec_list[b][p])
# fully connected layer for output
cur_pred_pose_dec = tf.nn.xw_plus_b(output_states_dec_list[b][p], output_wd, output_bd)
# go through embedding function for the next input
prev_embed_frms_dec_list[b][p] = tf.reshape(tf.nn.relu(tf.nn.xw_plus_b(cur_pred_pose_dec, embedding_wd, embedding_bd)), shape=(1, rnn_size))
# calculate MSE loss
mse_loss = tf.reduce_sum(tf.pow(tf.subtract(cur_pred_pose_dec, cur_gt_pose_dec), 2.0))
# only valid ped's traj contributes to the loss
self.loss += tf.multiply(mse_loss, loss_mask_ped)
I think you're looking for the function tf.stop_gradient. Using this, you could do something like tf.where(loss_mask, tensor, tf.stop_gradient(tensor)) to achieve the desired result, assuming that the dimensions are correct.
However, it looks like this is probably not your issue. It seems as though for each item in your dataset, you are defining new graph nodes. This is not how TensorFlow is supposed to function, you should only have one graph, built beforehand that performs some fixed function, regardless of the batch size. You should definitely not be defining new nodes for every element in the batch, since that cannot efficiently take advantage of parallelism.

Confused usage of dropout in mini-batch gradient descent

My question is in the end.
An example CNN trained with mini-batch GD and used the dropout in the last fully-connected layer (line 60) as
fc1 = tf.layers.dropout(fc1, rate=dropout, training=is_training)
At first I thought the tf.layers.dropout or tf.nn.dropout randomly sets neurons to zero in columns. But I recently found it's not the case. The below piece of code prints what the dropout does. I used the fc0 as a 4 sample x 10 feature matrix, and the fc as the dropped out version.
import tensorflow as tf
import numpy as np
fc0 = tf.random_normal([4, 10])
fc = tf.nn.dropout(fc0, 0.5)
sess = tf.Session()
sess.run(tf.global_variables_initializer())
a, b = sess.run([fc0, fc])
np.savetxt("oo.txt", np.vstack((a, b)), fmt="%.2f", delimiter=",")
And in the output oo.txt (original matrix: line 1-4, dropped out matrix: line 5-8):
0.10,1.69,0.36,-0.53,0.89,0.71,-0.84,0.24,-0.72,-0.44
0.88,0.32,0.58,-0.18,1.57,0.04,0.58,-0.56,-0.66,0.59
-1.65,-1.68,-0.26,-0.09,-1.35,-0.21,1.78,-1.69,-0.47,1.26
-1.52,0.52,-0.99,0.35,0.90,1.17,-0.92,-0.68,-0.27,0.68
0.20,0.00,0.71,-0.00,0.00,0.00,-0.00,0.47,-0.00,-0.87
0.00,0.00,0.00,-0.00,3.15,0.07,1.16,-0.00,-1.32,0.00
-0.00,-3.36,-0.00,-0.17,-0.00,-0.42,3.57,-3.37,-0.00,2.53
-0.00,1.05,-1.99,0.00,1.80,0.00,-0.00,-0.00,-0.55,1.35
My understanding of the proper? dropout is, knocking out p% same units for each sample in a mini-batch or batch gradient descent phase, and the back-propagation updates the weights and biases of the "thinned network". However, in the implementation of the example, the neurons of each sample in one batch were randomly dropped out, as illustrated in the oo.txt line 5 to 8, and for each sample, the "thinned network" is different.
As a comparison, in a stochastic gradient descent case, samples are fed into the neural network one-by-one, and in each iteration, weights of each tf.layers.dropout introduced "thinned network" are updated.
My question is, in the mini-batch or batch training, shouldn't it be implemented to knock out same neurons for all samples in one batch? Maybe by applying one mask to all input batch samples at each iteration?
Something like:
# ones: a 1xN all 1s tensor
# mask: a 1xN 0-1 tensor, multiply fc1 by mask with broadcasting along the axis of samples
mask = tf.layers.dropout(ones, rate=dropout, training=is_training)
fc1 = tf.multiply(fc1, mask)
Now I'm thinking the dropout strategy in the example may be a weighted way of updating weights of a certain neuron, that if a neuron is kept in 1 out of 10 samples in a mini-batch, its weights will be updated by alpha * 1/10 * (y_k_hat-y_k) * x_k, compared with alpha * 1/10 * sum[(y_k_hat-y_k) * x_k] for weights of another neuron kept in all 10 samples?
the screenshot from here
Dropouts are commonly used to prevent overfitting. In this case it would be a huge weight applied to one of the neurons. By randomly making it 0 from time to time, you force the network to use more neurons in determining the outcome. For this to work well you should drop different neurons for each example so that the gradient you compute is more similar to the one you would get without the dropout.
If you were to drop the same neurons for each example in the batch, my guess is that you will have a less stable gradient (might not matter for your application).
In addition dropout up-scales the rest of the values to keep the average activation at about the same level. Without it the network would learn wrong biases or would over-saturate when you turn dropout off.
If you still want the same neurons to be dropped in the batch then apply dropout to a all 1 tensor of shape (1, num_neurons) and then multiply it with the activations.
When using dropout, you are effectively trying to estimate the average performance of the network for a randomly chosen dropout mask, using Monte-Carlo sampling (by differentiation under the integral sign, the average gradient is equal to the gradient of the average). By fixing a dropout mask for each mini-batch, you are just introducing correlation between successive gradient estimates, which increases the variance and leads to slower training.
Imagine using a different dropout-mask for each image in the mini-batch, but forming the mini-batch from k copies of the same image; it's obvious that this would be a complete waste of effort!

Inception network BatchNorm layer returns None gradient

Hi I am trying to fine tune inception network with a customized loss function. It is a triplet loss funcion.
This function is from facenet.py
def triplet_loss(value, alpha):
"""Calculate the triplet loss according to the FaceNet paper
Args:
value: the embeddings for the anchor, positive, negative images.
Returns:
the triplet loss according to the FaceNet paper as a float tensor.
"""
# The following function ensuer, it is evenly divided
anchor, positive, negative = tf.split(value, num_or_size_splits=3, axis=0)
with tf.variable_scope('triplet_loss'):
pos_dist = tf.reduce_sum(tf.square(tf.subtract(anchor, positive)), 1)
neg_dist = tf.reduce_sum(tf.square(tf.subtract(anchor, negative)), 1)
basic_loss = tf.add(tf.subtract(pos_dist, neg_dist), alpha)
loss = tf.reduce_mean(tf.maximum(basic_loss, 0.0), 0)
# TODO: added by me
tf.add_to_collection('losses', loss)
return loss
Note: the value param is the output of logits layer before the softmax.
When I calculate the gradient, I find out BatchNorm/moving_variance and BatchNorm/moving_variance have None gradient. Why it returns None gradient value?
And with visualization, I found there is no data flow from loss to BatchNorm scope, Why weights has dataflow from the loss node but Batchnorm doesn't ?
These None gradients only belong to batchNorm layers, therefore I do some studies on batchNorm. After reading the blog post http://ruishu.io/2016/12/27/batchnorm/ I found that
batch normalization has distinct behaviors during training versus test time.
Training
Normalize layer activations according to mini-batch statistics.
During the training step, update population statistics approximation via moving average of mini-batch statistics.
Testing
Normalize layer activations according to estimated population statistics. Do not update population statistics according to mini-batch statistics from test data.
After I set the phase key as training in the inference function, the problem is solved.