How to implement loss on a fully convolutional network in TensorFlow? - tensorflow

So I've started to implement the paper "Synthetic Data for Text Localisation in Natural Images" by Gupta et al. and I've encountered a serious problem.
The network architecture is a fully convolutional network. The final layer is basically an NxNx7 tensor (imagine a matrix where each cell holds 7 values). Each cell holds P and C, where P consists of 6 bounding-box parameters to be regressed and C is the confidence.
Now I want to implement a squared loss on this layer. As the paper states, every cell of the final layer is a predictor: if that predictor's location should contain a bounding box, then the loss should be applied to all of the parameters of that predictor (or cell); if it shouldn't contain a bounding box, then regressing only the confidence C should be enough.
So I need to define separate losses dynamically in TensorFlow. How can I do that?

You can use tf.cond, and write something like
loss = tf.cond(is_there_sthg_label, lambda: tf.add(loss1, loss2), lambda: loss2)
EDIT:
Sorry, I didn't understand your problem correctly. You can make a mask of size NxN whose value (computed at runtime) is True at [i, j] if there is a bounding box there, and False otherwise. Then you compute both of your losses for each cell, obtaining tensors loss1 and loss2 of shape NxN, and then:
# loss1 is the loss on the confidence only, loss2 is the loss on P
loss_tensor = loss1 + tf.multiply(loss2, tf.cast(mask, loss2.dtype))
total_loss = tf.reduce_sum(loss_tensor)
(this still works if you have batches of course)
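For concreteness, here is a minimal sketch of that masked squared loss, using hypothetical names: preds for the NxNx7 network output (6 box parameters P plus one confidence C per cell), gt_params and gt_conf for the matching ground truth, and mask for the NxN boolean box-presence mask.
import tensorflow as tf

pred_params = preds[..., :6]    # P: 6 box parameters per cell
pred_conf = preds[..., 6]       # C: confidence per cell

conf_loss = tf.square(pred_conf - gt_conf)                             # NxN
box_loss = tf.reduce_sum(tf.square(pred_params - gt_params), axis=-1)  # NxN

# regress the box parameters only where a box should be present
loss_tensor = conf_loss + box_loss * tf.cast(mask, box_loss.dtype)
total_loss = tf.reduce_sum(loss_tensor)
The same code works unchanged with a leading batch dimension on all tensors.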

Related

Binary classification of pairs with opposite labels

I have a dataset without labels, but I do have a way to get pairs of examples with opposite labels; that is, given a pair x, z, I know that their true labels are either 0,1 or 1,0.
So, I am building a model that accepts pairs of samples as input and learns to classify them with opposite labels. Assuming I have an arbitrary model for predicting a single sample, y_hat = f(x), I am building a model with Keras that accepts pairs of samples (x, z) and outputs pairs of predictions, f(x), f(z). I then use a custom loss function that drives the model in the correct direction: given that a regular binary classifier is trained using Binary Cross Entropy (BCE) to make the predicted and desired outputs "close", I use the negative BCE. Also, since BCE is not symmetric, I symmetrize it. So the loss function I give to the model.compile method is:
from tensorflow import keras

bce = keras.losses.BinaryCrossentropy()

def neg_sym_bce(y1, y2):
    return -0.5 * (bce(y1, y2) + bce(y2, y1))
My problem is, this model fails to learn to classify even a single pair of my data (I get f(x)~=f(z)~=0.5), and if I try to train it with synthetic "easy" data, it takes hundreds of epochs to converge (also on a single pair).
This made me suspect that it has to do with a "vanishing gradient" problem. Indeed, when I plot the loss for a single pair, which is a function of 2 variables (the 2 outputs), it is evident that there is a wide plateau around the (0.5, 0.5) point. It is also evident that the global minima are, as expected, around the points (0, 1) and (1, 0).
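As a quick numeric illustration of that plateau (a sketch, assuming TF 2.x eager execution and the neg_sym_bce defined above):
import numpy as np

# evaluate the symmetrized negative BCE at a few output pairs (f(x), f(z))
for p, q in [(0.5, 0.5), (0.45, 0.55), (0.1, 0.9), (0.01, 0.99)]:
    y1 = np.array([[p]], dtype=np.float32)
    y2 = np.array([[q]], dtype=np.float32)
    print(p, q, float(neg_sym_bce(y1, y2)))
# the loss barely moves around (0.5, 0.5) but falls off steeply near (0, 1) and (1, 0)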
So, is there a way to deal with the vanishing gradient here? I read about the problem but the references I found deal with vanishing gradient in the network, not in the loss itself.
Or, is there another loss that can drive the model to predict opposite labels?
I think if your labels are always either 0,1 or 1,0, you can just use categorical_crossentropy for the loss.

Can someone give me an explanation for Multibox loss function?

I have found the following expression for the SSD Multibox loss function:
multibox_loss = confidence_loss + alpha * location_loss
Can someone explain what those terms mean?
SSD Multibox (short for Single Shot Multibox Detector) is a neural network that can detect and locate objects in an image in a single forward pass. The network is trained in a supervised manner on a dataset of images where a bounding box and a class label are given for each object of interest. The loss term
multibox_loss = confidence_loss + alpha * location_loss
is made up of two parts:
Confidence loss is a categorical cross-entropy loss for classifying the detected objects. The purpose of this term is to make sure that the correct label is assigned to each detected object.
Location loss is a regression loss (either the smooth L1 or the L2 loss) on the parameters (width, height and corner offsets) of the detected bounding box. The purpose of this term is to make sure that the correct region of the image is identified for the detected objects. The alpha term is a hyperparameter used to scale the location loss.
The precise formulation of the loss is given in Equation 1 of the SSD: Single Shot MultiBox Detector paper.
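A hedged sketch of how the two terms might be combined in TensorFlow (not the paper's exact formulation, which also normalizes by the number of matched boxes and uses hard negative mining); class_logits, class_targets, box_preds, box_targets and positive_mask are assumed tensors describing the matched default boxes:
import tensorflow as tf

def smooth_l1(x):
    # smooth L1 (Huber with delta = 1): quadratic near 0, linear further out
    abs_x = tf.abs(x)
    return tf.where(abs_x < 1.0, 0.5 * tf.square(x), abs_x - 0.5)

alpha = 1.0  # weighting hyperparameter

confidence_loss = tf.reduce_sum(
    tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=class_targets, logits=class_logits))

# regression loss only on positive (matched) default boxes
per_box = tf.reduce_sum(smooth_l1(box_preds - box_targets), axis=-1)
location_loss = tf.reduce_sum(tf.cast(positive_mask, tf.float32) * per_box)

multibox_loss = confidence_loss + alpha * location_loss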

Tensorflow: Accumulating gradients of a Tensor

TL;DR: you can just skip to the question highlighted below.
Suppose I have an encoder-decoder neural network, with weights W_1 and W_2 for the encoder and decoder respectively. Let's denote the output of the encoder by Z. The network is trained with batch size n, and all the gradients are calculated with respect to the batch loss L_hat, which aggregates the per-sample losses L over the batch.
What I'm trying to achieve is, in the backward pass, to manipulate the gradients of Z before passing them further down to the encoder's weights W_1, i.e. to apply some modified-gradients operator to them.
The scheme described above, in the case of a synchronous pass (first calculate the modified gradients of Z, then propagate them down to W_1), is very easy to implement (the Jacobian multiplication is done using the grad_ys argument of tf.gradients):
def modify_grad(grad_z):
    # do some modifications to grad_z here
    return grad_z

grad_z = tf.gradients(L_hat, Z)[0]
mod_grad_z = modify_grad(grad_z)
mod_grad_w1 = tf.gradients(Z, W_1, grad_ys=mod_grad_z)
The problem is, I need to accumulate the gradients grad_z of the tensor Z over several batches. Since its shape is dynamic (with None in the batch dimension), I cannot define a tf.Variable to store it. Furthermore, the batch size n may change during training. How can I store the average of grad_z over several batches?
PS: I just wanted to combine pareto-optimal training of ArXiv:1810.04650, the asynchronous network training of ArXiv:1609.02132, and batch size scheduling of ArXiv:1711.00489.
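One possible workaround, sketched under the assumption that it is acceptable to average grad_z over the batch axis first: the reduced gradient then has the fixed shape [d] and fits in an ordinary tf.Variable no matter what the batch size n is.
import tensorflow as tf

grad_z = tf.gradients(L_hat, Z)[0]              # shape [None, d]
grad_z_mean = tf.reduce_mean(grad_z, axis=0)    # shape [d], fixed

acc = tf.Variable(tf.zeros(Z.shape[1:]), trainable=False)
n_batches = tf.Variable(0.0, trainable=False)
accumulate_op = tf.group(tf.assign_add(acc, grad_z_mean),
                         tf.assign_add(n_batches, 1.0))
avg_grad_z = acc / n_batches                    # running average over batches

# run accumulate_op once per batch; to apply the stored gradient, broadcast
# avg_grad_z back over the batch dimension and pass it as the grad_ys
# argument of tf.gradients(Z, W_1, ...)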

Neural network: stddev of weights as function of layer size. Why?

Quick question about neural networks. I understand why weights are initialized with small random values: it breaks the tie between weights so that they have non-zero loss gradients. I was under the impression that it didn't matter much what the small random value was, as long as the tie is broken. Then I read this:
weights = tf.Variable(
    tf.truncated_normal([hidden1_units, hidden2_units],
                        stddev=1.0 / math.sqrt(float(hidden1_units))),
    name='weights')
Rather than assigning some small constant stddev like 0.1, the designer takes the effort to set it to 1/sqrt of the number of nodes in the lower layer:
stddev=1.0 / math.sqrt(float(hidden1_units))
Why would they do that?
Is this more stable? Does it avoid some unwanted behavior? Does it train faster? Should I implement this practice in my own NNs?
First of all, always remember that the aim of this initialization (and of training) is to make sure the neurons, and hence the network, learn something meaningful.
Now assume you are using a sigmoid activation function.
For a sigmoid, the biggest change in the output for a given change in the input happens near the center; at the extremes the change is very small, and so is the gradient during back-propagation.
So wouldn't it be great if we could somehow ensure that the input to the activation lands in the good region of the sigmoid?
So the aim for the input of a neuron (with sigmoid activation) is:
Mean: zero
Variance: small (and independent of the number of input dimensions)
Assuming the input layer dimension is n:
input-to-activation = sum_{i=1}^{n} w_i * x_i
out-of-neuron = sigmoid(input-to-activation)
Now assume the w_i and x_i are mutually independent, and that we have normalized the inputs x_i to N(0, 1).
So, as of now:
X : mean 0 and std 1
W : Uniform(-1/sqrt(n), 1/sqrt(n)), assumed
mean(W) = 0 and var(W) = (1/12) * (2/sqrt(n))^2 = (1/12) * (4/n) = 1/(3n) [the variance of a uniform distribution]
Assuming X and Y are independent and have zero mean:
Var(X + Y) = Var(X) + Var(Y), used for the summation over all the w_i x_i terms,
and
Var(X * Y) = Var(X) * Var(Y), used for each individual product w_i x_i.
Now check:
mean of input-to-activation: 0
variance of input-to-activation: n * (1/(3n)) = 1/3
So now we are in the good zone for a sigmoid activation, meaning not at the extreme ends. Note that the variance is independent of the number of inputs, n.
Beautiful, isn't it?
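A quick numeric sanity check of that derivation, as a sketch in plain NumPy with hypothetical sample counts:
import numpy as np

# x ~ N(0, 1), w ~ Uniform(-1/sqrt(n), 1/sqrt(n)); the variance of
# sum_i w_i * x_i should come out close to 1/3 for every n
for n in (10, 100, 1000):
    x = np.random.randn(100000, n)
    w = np.random.uniform(-1/np.sqrt(n), 1/np.sqrt(n), size=(100000, n))
    print(n, (w * x).sum(axis=1).var())   # ~0.33 each time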
But this is not the only way to initialize: Bengio has also proposed an initialization (Glorot/Xavier initialization) that takes both the input and output dimensions of the weight matrix into account. Read this for further details on both.
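For reference, a minimal TensorFlow 1.x sketch of that alternative, reusing the hidden1_units and hidden2_units names from the snippet above:
# Glorot/Xavier uniform initializer: limit = sqrt(6 / (fan_in + fan_out)),
# i.e. it scales by both fan-in and fan-out
weights = tf.get_variable(
    'weights',
    shape=[hidden1_units, hidden2_units],
    initializer=tf.glorot_uniform_initializer())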

How to constrain a layer to be a probability matrix?

I recently read this paper which deals with noisy labels in convolutional neural networks.
They model label noise by a probability transition matrix which forms a simple
constrained linear layer after the softmax output.
So as an example we may have a 3-by-3 probability transition matrix (3 classes):
(Figure: example probability transition matrix. The sum of each column has to be 1.)
This matrix Q is basically trained in the same way as the rest of the network via backpropagation. But it needs to be constrained to be a probability matrix. Quote from the paper:
After taking a gradient step with the Q and the model
weights, we project Q back to the subspace of probability matrices because it represents conditional probabilities.
Now I am wondering what is the best way to implement such a layer in tensorflow.
I have some ideas, but I'm not sure what would work or what the best procedure is.
1) Hard code the constraint in the model before any training is done, something like:
# ... build conv model without Q
[...]
# shape of y_conv (output CNN) assumed to be a [3,1] vector
y_conv = tf.nn.softmax(y_conv, 0)
# add linear layer representing Q, no bias
W_Q = weight_variable([3, 3])
# add constraint: columns are valid probability distribution
W_Q = tf.nn.softmax(W_Q, 0)
# output of model:
Q_out = tf.matmul(W_Q, y_conv)
# now compute loss, gradients and start training
2) Compute and apply gradients to the whole model (Q included), then apply constraint
train_op = ...
# here W_Q is the raw tf.Variable (without the softmax re-parameterization of option 1)
constraint_op = tf.assign(W_Q, tf.nn.softmax(W_Q, 0))
sess = tf.Session()
# compute and apply gradients in form of a train_op
sess.run(train_op)
# then project W_Q back onto column-wise probability distributions
sess.run(constraint_op)
I think the second approach is closer to the paper quote, but I am not sure to what extent such external assignments interfere with training.
Or maybe my ideas are bananas. I hope you can give me some advice!
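For what it's worth, here is a minimal sketch of how the projection step in option 2 could look as a clip-and-renormalize over the columns (this is an assumption on my part; the paper may use a different projection onto the set of probability matrices). W_Q is the raw 3x3 variable:
# clip to non-negative values, then renormalize each column to sum to 1
W_clipped = tf.maximum(W_Q, 0.0)
col_sums = tf.reduce_sum(W_clipped, axis=0, keepdims=True)
project_op = tf.assign(W_Q, W_clipped / tf.maximum(col_sums, 1e-8))

# training loop: one gradient step, then one projection step
sess.run(train_op)
sess.run(project_op)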