How to constrain a layer to be a probability matrix? - tensorflow

I recently read this paper which deals with noisy labels in convolutional neural networks.
They model label noise by a probability transition matrix which forms a simple
constrained linear layer after the softmax output.
So as an example we may have a 3-by-3 probability transition matrix (3 classes), where the sum of each column has to be 1.
This matrix Q is basically trained in the same way as the rest of the network via backpropagation. But it needs to be constrained to be a probability matrix. Quote from the paper:
After taking a gradient step with the Q and the model
weights, we project Q back to the subspace of probability matrices because it represents conditional probabilities.
Now I am wondering what the best way is to implement such a layer in TensorFlow.
I have some ideas, but I'm not sure what would work or what the best procedure is.
1) Hard code the constraint in the model before any training is done, something like:
# ... build conv model without Q
[...]
# shape of y_conv (output CNN) assumed to be a [3,1] vector
y_conv = tf.nn.softmax(y_conv, 0)
# add linear layer representing Q, no bias
W_Q = weight_variable([3, 3])
# add constraint: columns are valid probability distribution
W_Q = tf.nn.softmax(W_Q, 0)
# output of model:
Q_out = tf.matmul(W_Q, y_conv)
# now compute loss, gradients and start training
2) Compute and apply gradients to the whole model (Q included), then apply constraint
train_op = ...
constraint_op = tf.assign(W_Q, tf.nn.softmax(W_Q,0))
sess = tf.Session()
# compute and apply gradients in form of a train_op
sess.run(train_op)
sess.run(constraint_op)
I think the second approach is more closely related to the paper quote, but I am not sure to what extent external assignments confuse training.
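For concreteness, a minimal sketch of how approach 2 might be wired up. A simple column-wise projection (clip negatives, then renormalize each column) stands in for the paper's projection step, and the loss here is only a placeholder to make the sketch runnable:
# W_Q is the 3x3 transition matrix; the loss below is a dummy stand-in
import tensorflow as tf

W_Q = tf.Variable(tf.eye(3))
loss = tf.reduce_sum(tf.square(W_Q - tf.ones([3, 3]) / 3.0))

train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

# crude projection back to column-stochastic matrices:
# clip negative entries to zero, then renormalize every column to sum to 1
W_Q_clipped = tf.maximum(W_Q, 0.0)
project_op = tf.assign(
    W_Q, W_Q_clipped / (tf.reduce_sum(W_Q_clipped, axis=0, keepdims=True) + 1e-8))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(100):
        sess.run(train_op)      # gradient step on Q (and, in the real model, the weights)
        sess.run(project_op)    # project Q back after the step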
Or maybe my ideas are bananas. I hope you can give me some advice!

Related

Tensorflow: Accumulating gradients of a Tensor

TL;DR: you can just skip to the question at the end.
Suppose I have an Encoder-Decoder neural network, with weights W_1 and W_2 for the encoder and decoder respectively. Let's denote Z as the output of the encoder. The network is trained with batch size n, and all the gradients are calculated with respect to the mean loss over the batch, L_hat, which is computed from the per-sample losses L (illustrated by a figure in the original post).
What I'm trying to achieve is, in the backward pass, to manipulate the gradients of Z before passing them further to the encoder's weights W_1. Suppose we have some modified-gradient operator for which certain properties hold (given as an equation in the original post).
The approach described above is, in the case of a synchronous pass (first calculate the modified gradients of Z, then propagate them down to W_1), very easy to implement (the Jacobian multiplication is done via the grad_ys argument of tf.gradients):
def modify_grad(grad_z):
    # do some modifications to the gradient
    return grad_z

grad_z = tf.gradients(L_hat, Z)
mod_grad_z = modify_grad(grad_z)
mod_grad_w1 = tf.gradients(Z, W_1, grad_ys=mod_grad_z)
The problem is, I need to accumulate the gradients grad_z of the tensor Z over several batches. As its shape is dynamic (with None in one of the dimensions, as in the illustration in the original post), I cannot define a tf.Variable to store it. Furthermore, the batch size n may change during training. How can I store the average of grad_z over several batches?
PS: I just wanted to combine the Pareto-optimal training of arXiv:1810.04650, the asynchronous network training of arXiv:1609.02132, and the batch size scheduling of arXiv:1711.00489.
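One possible workaround (not from the original thread, just a sketch under toy assumptions): since the batch axis is the only dynamic one, reduce over it when accumulating, so every batch contributes a fixed-shape array that can be averaged on the Python side:
import numpy as np
import tensorflow as tf

# toy stand-ins for the encoder pieces (hypothetical shapes and names)
x = tf.placeholder(tf.float32, [None, 8])      # dynamic batch dimension
W_1 = tf.Variable(tf.random_normal([8, 4]))    # "encoder" weights
Z = tf.matmul(x, W_1)                          # encoder output, shape (None, 4)
L_hat = tf.reduce_mean(tf.square(Z))           # stand-in mean loss

grad_z = tf.gradients(L_hat, Z)[0]             # shape (batch, 4), batch is dynamic

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    acc, total = np.zeros(4, dtype=np.float32), 0
    for batch_size in [3, 5, 2]:               # batch size may change between steps
        batch = np.random.rand(batch_size, 8).astype(np.float32)
        g = sess.run(grad_z, feed_dict={x: batch})
        acc += g.sum(axis=0)                   # reduce over the dynamic axis
        total += batch_size
    avg_grad_z = acc / total                   # fixed-shape average over all samples
The fixed-shape average can then be fed back through a placeholder and, for example, tiled to the current batch size before being used as the grad_ys argument of tf.gradients(Z, W_1, ...).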

Confused usage of dropout in mini-batch gradient descent

My question is at the end.
An example CNN is trained with mini-batch GD and uses dropout in the last fully-connected layer (line 60 of that example) as
fc1 = tf.layers.dropout(fc1, rate=dropout, training=is_training)
At first I thought tf.layers.dropout or tf.nn.dropout randomly sets neurons to zero column-wise, but I recently found that's not the case. The piece of code below prints what dropout does. I used fc0 as a 4-sample by 10-feature matrix, and fc as its dropped-out version.
import tensorflow as tf
import numpy as np
fc0 = tf.random_normal([4, 10])
fc = tf.nn.dropout(fc0, 0.5)
sess = tf.Session()
sess.run(tf.global_variables_initializer())
a, b = sess.run([fc0, fc])
np.savetxt("oo.txt", np.vstack((a, b)), fmt="%.2f", delimiter=",")
And in the output oo.txt (original matrix: lines 1-4, dropped-out matrix: lines 5-8):
0.10,1.69,0.36,-0.53,0.89,0.71,-0.84,0.24,-0.72,-0.44
0.88,0.32,0.58,-0.18,1.57,0.04,0.58,-0.56,-0.66,0.59
-1.65,-1.68,-0.26,-0.09,-1.35,-0.21,1.78,-1.69,-0.47,1.26
-1.52,0.52,-0.99,0.35,0.90,1.17,-0.92,-0.68,-0.27,0.68
0.20,0.00,0.71,-0.00,0.00,0.00,-0.00,0.47,-0.00,-0.87
0.00,0.00,0.00,-0.00,3.15,0.07,1.16,-0.00,-1.32,0.00
-0.00,-3.36,-0.00,-0.17,-0.00,-0.42,3.57,-3.37,-0.00,2.53
-0.00,1.05,-1.99,0.00,1.80,0.00,-0.00,-0.00,-0.55,1.35
My understanding of "proper" dropout is that it knocks out the same p% of units for every sample in a mini-batch or batch gradient descent phase, and back-propagation updates the weights and biases of that "thinned network". However, in the example's implementation, the neurons of each sample in one batch were dropped out independently at random, as illustrated in lines 5 to 8 of oo.txt, and the "thinned network" is different for each sample.
As a comparison, in the stochastic gradient descent case, samples are fed into the network one by one, and in each iteration the weights of the "thinned network" introduced by tf.layers.dropout are updated.
My question is: in mini-batch or batch training, shouldn't dropout knock out the same neurons for all samples in one batch, perhaps by applying one mask to all samples of the input batch at each iteration?
Something like:
# ones: a 1xN all 1s tensor
# mask: a 1xN 0-1 tensor, multiply fc1 by mask with broadcasting along the axis of samples
mask = tf.layers.dropout(ones, rate=dropout, training=is_training)
fc1 = tf.multiply(fc1, mask)
Now I'm thinking the dropout strategy in the example may effectively be a weighted way of updating the weights of a given neuron: if a neuron is kept in only 1 out of 10 samples in a mini-batch, its weights are updated by alpha * 1/10 * (y_k_hat - y_k) * x_k, compared with alpha * 1/10 * sum[(y_k_hat - y_k) * x_k] for the weights of another neuron kept in all 10 samples?
Dropouts are commonly used to prevent overfitting. In this case it would be a huge weight applied to one of the neurons. By randomly making it 0 from time to time, you force the network to use more neurons in determining the outcome. For this to work well you should drop different neurons for each example so that the gradient you compute is more similar to the one you would get without the dropout.
If you were to drop the same neurons for each example in the batch, my guess is that you will have a less stable gradient (might not matter for your application).
In addition dropout up-scales the rest of the values to keep the average activation at about the same level. Without it the network would learn wrong biases or would over-saturate when you turn dropout off.
If you still want the same neurons to be dropped across the whole batch, then apply dropout to an all-ones tensor of shape (1, num_neurons) and then multiply it with the activations.
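A minimal sketch of that shared-mask trick (TF1-style, matching the question's example; note that tf.nn.dropout already contains the 1/keep_prob up-scaling, so the mask carries the scaling with it):
import tensorflow as tf

num_neurons = 10
fc1 = tf.random_normal([4, num_neurons])     # activations: (batch, num_neurons)
ones = tf.ones([1, num_neurons])             # a single row of ones
mask = tf.nn.dropout(ones, keep_prob=0.5)    # one mask, entries are 0 or 1/keep_prob
fc1_shared = fc1 * mask                      # broadcast: same neurons dropped for the whole batch

with tf.Session() as sess:
    print(sess.run(fc1_shared))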
When using dropout, you are effectively trying to estimate the average performance of the network for a randomly chosen dropout mask, using Monte-Carlo sampling (by differentiation under the integral sign, the average gradient is equal to the gradient of the average). By fixing a dropout mask for each mini-batch, you are just introducing correlation between successive gradient estimates, which increases the variance and leads to slower training.
Imagine using a different dropout-mask for each image in the mini-batch, but forming the mini-batch from k copies of the same image; it's obvious that this would be a complete waste of effort!

LSTM with Condition

I'm studying LSTMs combined with CNNs in TensorFlow.
I want to feed a scalar label into the LSTM network as a condition.
Does anybody know which kind of LSTM does what I mean?
If one is available, please let me know how to use it.
Thank you.
This thread might interest you: Adding Features To Time Series Model LSTM.
You have basically 3 possible ways:
Let's take an example with weather data from two different cities: Paris and San Francisco. You want to predict the next temperature based on historical data. But at the same time, you expect the weather to change based on the city. You can either:
Combine the auxiliary features with the time series data, at the beginning or at the end (ugly!).
Concatenate the auxiliary features with the output of the RNN layer. It's some kind of post-RNN adjustment since the RNN layer won't see this auxiliary info.
Or just initialize the RNN states with a learned representation of the condition (e.g. Paris or San Francisco).
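A minimal sketch of the second option (concatenating the condition with the RNN output), using TF1-style placeholders and hypothetical shapes:
import tensorflow as tf

series = tf.placeholder(tf.float32, [None, 24, 1])   # e.g. 24 hourly temperatures
condition = tf.placeholder(tf.float32, [None, 2])    # one-hot city: Paris / San Francisco

cell = tf.nn.rnn_cell.LSTMCell(64)
_, state = tf.nn.dynamic_rnn(cell, series, dtype=tf.float32)

# post-RNN adjustment: the RNN itself never sees the condition
combined = tf.concat([state.h, condition], axis=1)
prediction = tf.layers.dense(combined, 1)            # next temperature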
I wrote a library to condition on auxiliary inputs. It abstracts all the complexity and has been designed to be as user-friendly as possible:
https://github.com/philipperemy/cond_rnn/
The implementation is in tensorflow (>=1.13.1) and Keras.
Hope it helps!
Here's an example of applying a CNN and an LSTM over the output probabilities of a sequence, like you asked:
def build_model(inputs):
    BATCH_SIZE = 4
    NUM_CLASSES = 2
    NUM_UNITS = 128
    H = 224
    W = 224
    C = 3
    TIME_STEPS = 4
    # inputs is assumed to be of shape (BATCH_SIZE, TIME_STEPS, H, W, C)
    # reshape your input such that you can apply the CNN to all images at once
    input_cnn_reshaped = tf.reshape(inputs, (-1, H, W, C))
    # define a CNN, for instance VGG-16
    cnn_logits_output, _ = vgg_16(input_cnn_reshaped, num_classes=NUM_CLASSES)
    cnn_probabilities_output = tf.nn.softmax(cnn_logits_output)
    # reshape back to the time-series convention
    cnn_probabilities_output = tf.reshape(cnn_probabilities_output, (BATCH_SIZE, TIME_STEPS, NUM_CLASSES))
    # perform the LSTM over the probabilities per image
    cell = tf.contrib.rnn.LSTMCell(NUM_UNITS)
    _, state = tf.nn.dynamic_rnn(cell, cnn_probabilities_output, dtype=tf.float32)
    # employ an FC layer over the last hidden state
    logits = tf.layers.dense(state.h, NUM_CLASSES)
    # logits is of shape (BATCH_SIZE, NUM_CLASSES)
    return logits
By the way, a better approach would be to employ the LSTM over the last hidden layer, i.e. to use the CNN as a feature extractor and make the prediction over sequences of features.

Using median frequency balancing with sparse_softmax_cross_entropy_with_logits

I'm currently working on using TensorFlow to address a multi-class segmentation problem with a SegNet architecture.
My classes are heavily unbalanced, so I need to integrate median frequency balancing (weighting the classes in the loss calculation). I use the following tip (based on this post) to apply the softmax. I need help extending it to add the weights; I'm not sure how to do it. Current implementation:
reshaped_logits = tf.reshape(logits, [-1, nClass])
reshaped_labels = tf.reshape(labels, [-1])
loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=reshaped_logits, labels=reshaped_labels)
My idea would be:
To split the logits tensor in nClass tensor,
Apply the softmax on each independently,
Weight them with median frequency balancing
Finally, summing the weighted losses.
Would that be the right approach?
Thanks
You can find the code to do that here:
def _compute_cross_entropy_mean(class_weights, labels, softmax):
    cross_entropy = -tf.reduce_sum(
        tf.multiply(labels * tf.log(softmax), class_weights),
        reduction_indices=[1])
    cross_entropy_mean = tf.reduce_mean(cross_entropy, name='xentropy_mean')
    return cross_entropy_mean
where class_weights is your class weighting matrix (note that this variant expects one-hot labels and the softmax output rather than logits).
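If you want to keep the sparse (integer-label) loss from the question instead of switching to one-hot labels, one common alternative is to look up a per-pixel weight with tf.gather and multiply it into the unweighted loss. A sketch with made-up shapes and weights:
import tensorflow as tf

nClass = 4
logits = tf.random_normal([2, 8, 8, nClass])                      # (batch, H, W, classes)
labels = tf.random_uniform([2, 8, 8], maxval=nClass, dtype=tf.int32)
class_weights = tf.constant([0.5, 2.0, 1.3, 0.8])                 # median-frequency weights (made up)

reshaped_logits = tf.reshape(logits, [-1, nClass])
reshaped_labels = tf.reshape(labels, [-1])

per_pixel_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
    logits=reshaped_logits, labels=reshaped_labels)
per_pixel_weight = tf.gather(class_weights, reshaped_labels)      # weight of each pixel's class
loss = tf.reduce_mean(per_pixel_weight * per_pixel_loss)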

How can I implement max norm constraints in an MLP in tensorflow?

How can I implement max norm constraints on the weights in an MLP in tensorflow? The kind that Hinton and Dean describe in their work on dark knowledge. That is, does tf.nn.dropout implement the weight constraints by default, or do we need to do it explicitly, as in
https://arxiv.org/pdf/1207.0580.pdf
"If these networks share the same weights for the hidden units that are present.
We use the standard, stochastic gradient descent procedure for training the dropout neural
networks on mini-batches of training cases, but we modify the penalty term that is normally
used to prevent the weights from growing too large. Instead of penalizing the squared length
(L2 norm) of the whole weight vector, we set an upper bound on the L2 norm of the incoming
weight vector for each individual hidden unit. If a weight-update violates this constraint, we
renormalize the weights of the hidden unit by division."
Keras appears to have it
http://keras.io/constraints/
tf.nn.dropout does not impose any norm constraint. I believe what you're looking for is to "process the gradients before applying them" using tf.clip_by_norm.
For example, instead of simply:
# Create an optimizer + implicitly call compute_gradients() and apply_gradients()
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)
You could:
# Create an optimizer.
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
# Compute the gradients for a list of variables.
grads_and_vars = optimizer.compute_gradients(loss, [weights1, weights2, ...])
# grads_and_vars is a list of tuples (gradient, variable).
# Do whatever you need to the 'gradient' part, for example cap them, etc.
capped_grads_and_vars = [(tf.clip_by_norm(gv[0], clip_norm=123.0, axes=0), gv[1])
for gv in grads_and_vars]
# Ask the optimizer to apply the capped gradients
train_op = optimizer.apply_gradients(capped_grads_and_vars)
I hope this helps. Final notes about tf.clip_by_norm's axes parameter:
If you're calculating tf.nn.xw_plus_b(x, weights, biases), or equivalently matmul(x, weights) + biases, when the dimensions of x and weights are (batch, in_units) and (in_units, out_units) respectively, then you probably want to set axes == [0] (because in this usage each column details all incoming weights to a specific unit).
Pay attention to the shape/dimensions of your variables above and whether/how exactly you want to clip_by_norm each of them! E.g. if some of [weights1, weights2, ...] are matrices and some aren't, and you call clip_by_norm() on the grads_and_vars with the same axes value like in the List Comprehension above, this doesn't mean the same thing for all the variables! In fact, if you're lucky, this will result in a weird error like ValueError: Invalid reduction dimension 1 for input with 1 dimensions, but otherwise it's a very sneaky bug.
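If you want to match the quoted paper more literally (renormalize the incoming weight vector of each hidden unit after the update, rather than clipping the gradients), one sketch is to run a weight-clipping assign op after every training step. The shapes and max-norm value below are made up:
import tensorflow as tf

max_norm = 3.0
x = tf.placeholder(tf.float32, [None, 784])
y = tf.placeholder(tf.float32, [None, 10])

weights1 = tf.Variable(tf.random_normal([784, 10]))   # (in_units, out_units)
biases1 = tf.Variable(tf.zeros([10]))
logits = tf.nn.xw_plus_b(x, weights1, biases1)
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(labels=y, logits=logits))

train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)
# renormalize each column (the incoming weights of one unit) so its L2 norm stays <= max_norm
clip_weights_op = tf.assign(weights1, tf.clip_by_norm(weights1, max_norm, axes=0))
# training loop: sess.run(train_op, feed_dict=...) followed by sess.run(clip_weights_op)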
You can use tf.clip_by_value:
https://www.tensorflow.org/versions/r0.10/api_docs/python/train/gradient_clipping
Gradient clipping is also used to prevent weight explosion in recurrent neural networks.