variable-length rnn padding and mask out padding gradients - tensorflow

I'm building an rnn and using the sequene_length parameter to supply a list of lengths for sequences in a batch, and all of sequences in a batch are padded to the same length.
However, when doing backprop, is it possible to mask out the gradients corresponding to the padded steps, so these steps would have 0 contribution to the weight updates? I'm already masking out their corresponding costs like this (where batch_weights is a vector of 0's and 1's, where the elements corresponding to the padding steps are 0's):
loss = tf.mul(tf.nn.sparse_softmax_cross_entropy_with_logits(logits, tf.reshape(self._targets, [-1])), batch_weights)
self._cost = cost = tf.reduce_sum(loss) / tf.to_float(tf.reduce_sum(batch_weights))
the problem is I'm not sure by doing the above whether the gradients from the padding steps are zeroed out or not?

For all framewise / feed-forward (non-recurrent) operations, masking the loss/cost is enough.
For all sequence / recurrent operations (e.g. dynamic_rnn), there is always a sequence_length parameter which you need to set to the corresponding sequence lengths. Then there wont be a gradient for the zero-padded steps, or in other terms, it will have 0 contribution.

Related

What is the equivalence of Masking() Keras function in tensorflow? And does batch norm, conv, and relu support Masking?

I am training a GRU layer where inputs doesn't have the same length. Therefore, I have padded the inputs' features with 0.0 to make all sequences of same length. On the other hand, I don't want to compute any loss at any time step, for any sample as long as the input feature vector is all zeros. Example, at time step 1000, I have a batch size of 34, but samples number 33 and 34 of this batch lack data or feature values at time step 1000.
I have found that we can use the method Masking()(inputs) in Keras as long as all subsequent layers or operations support masking. But I have implemented my model in tensorflow. So what is the equivalence of Masking() in tensorflow?
Second, how can I know whether: batch normalization, conv layer and any non linear activation function has support for the masking() function in Keras?
Your help is much appreciated!!
So I found the detailed solution in danijar blog https://danijar.com/variable-sequence-lengths-in-tensorflow/.
The masking in keras is used when having incomplete sequences. So usually, you need to pad your sequences with 0.0 in the third dimension (The feature's dimension; when the input dimension has shape = [batch_size, sequence_length, num_features]).Afterwards, the masking in keras will take a number, will output 0 for their activations.
In summary: He showed how to compute the sequence length for each sample in the batch using length() he implemented. The output vector is then fed into the dynamic_rnn which will output zero vectors for incomplete sequences (for states and outputs), which is somehow similar to what happens in Keras Masking() function. Second, we should use a mask when computing the loss function.
All the details are discussed in this blog post.
But regarding the support thingy for masking in batch_norm, conv and non linear activation function; usually, if the output of the LSTM is zeros; then in case with sigmoid activation function at the output; the derivative of the output with respect to the input of the sigmoid function is output(1 - output). Hence, when the output is 0, this derivative is zero as well. And since back propagation applies the chain rule, then the gradients of the current sample with respect to any weight parameter in the network is going to be 0 as well. Hence, there is no need to worry about the support thingy... But the problem arises when the activation is relu for example, this is when the gradients should be explicitely multiplied by zeros before doing the back propagation (I guess). Maybe doing something like this will help:
final_output = output * mask
Then derivative of the final_output with respect to output will be the mask => 0 or 1 (the any time step; for any sample). Then, back propagate this gradient from the output of the activation function to its inputs...followed by chain rule => weights wont be affected in this case.

how to deal with padded 0 during feedforward process

Assume I have a lists of inputs of different sizes, for example, some are of the shape[10,9,5] some are [7,6,5], I have to pad 0s to feed them into tensor flow with the same size, say [10,9,5], I need to do matrix multiplication and add the biases during the forward process which will introduce numbers in the padded 0 positions. So I have to create a mask matrix by myself to mask them? Or is there an easier way from tensorflow? Thanks!
BTW, I'm not feeding sequences nor using rnn. so I cannot use dynamic rnn
I think you may use attention mechanism to convert the variable-length inputs to some fixed length tensor before you feed them into a feed forward network.

LSTM Followed by Mean Pooling (TensorFlow)

I am aware that there is a similar topic at LSTM Followed by Mean Pooling, but that is about Keras and I work in pure TensorFlow.
I have an LSTM network where the recurrence is handled by:
outputs, final_state = tf.nn.dynamic_rnn(cell,
embed,
sequence_length=seq_lengths,
initial_state=initial_state)
where I pass the correct sequence lengths for each sample (padding by zeros). In any case, outputs contains irrelevant outputs since some samples produce longer outputs than others, based on sequence lengths.
Right now I'm extracting the last relevant output by means of the following method:
def extract_axis_1(data, ind):
"""
Get specified elements along the first axis of tensor.
:param data: Tensorflow tensor that will be subsetted.
:param ind: Indices to take (one for each element along axis 0 of data).
:return: Subsetted tensor.
"""
batch_range = tf.range(tf.shape(data)[0])
indices = tf.stack([batch_range, ind], axis=1)
res = tf.reduce_mean(tf.gather_nd(data, indices), axis=0)
where I pass sequence_length - 1 as indices. In reference to the last topic, I would like to select all relevant outputs followed by average pooling, instead of just the last one.
Now, I tried passing nested lists as indeces to extract_axis_1 but tf.stack does not accept this.
Any solution directions for this?
You can exploit the weight parameter of the tf.contrib.seq2seq.sequence_loss function.
From the documentation:
weights: A Tensor of shape [batch_size, sequence_length] and dtype float. weights constitutes the weighting of each prediction in the sequence. When using weights as masking, set all valid timesteps to 1 and all padded timesteps to 0, e.g. a mask returned by tf.sequence_mask.
You need to compute a binary mask that distinguish between your valid outputs and invalid ones. Then you can just provide this mask to the weights parameter of the loss function (probably, you will want to use a loss like this one); the function will not consider the outputs with a 0 weight in the computation of the loss.
If you can't/don't need to use a sequence loss you can do exactly the same thing manually. You compute a binarymask and then multiply your outputs by this mask and provide these as inputs to your fully connected layer.

Bi-directional LSTM for variable-length sequence in Tensorflow

I want to train a bi-directional LSTM in tensorflow to perform a sequence classification problem (sentiment classification).
Because sequences are of variable lengths, batches are normally padded with vectors of zero. Normally, I use the sequence_length parameter in the uni-directional RNN to avoid training on the padding vectors.
How can this be managed with bi-directional LSTM. Does the "sequence_length" parameter work automatically starts from an advanced position in the sequence for the backward direction?
Thank you
bidirectional_dynamic_rnn also has a sequence_length parameter that takes care of sequences of variable lengths.
https://www.tensorflow.org/api_docs/python/tf/nn/bidirectional_dynamic_rnn (mirror):
sequence_length: An int32/int64 vector, size [batch_size], containing the actual lengths for each of the sequences.
You can see an example here: https://github.com/Franck-Dernoncourt/NeuroNER/blob/master/src/entity_lstm.py
In forward pass, rnn cell will stop at sequence_length which is the no-padding length of the input and is a parameter in tf.nn.bidirectional_dynamic_rnn. In backward pass, it firstly use function tf.reverse_sequence to reverse the first sequence_length elements and then traverse like that in the forward pass.
https://tensorflow.google.cn/api_docs/python/tf/reverse_sequence
This op first slices input along the dimension batch_axis, and for each slice i, reverses the first seq_lengths[i] elements along the dimension seq_axis.

Tensorflow unrolled LSTM longer than input sequence

I want to create an LSTM in tensorflow to predict time-series data. My training data is a set of input/output sequences of different lengths. Can I include multiple sequences of different lengths in the same training batch? Or do I need to pad them to equal lengths? If so, how?
Also: What will tensorflow do if the unrolled RNN is longer than the input sequence? The rnn() method contains an optional sequence_length argument which appears designed to handle this eventuality, but I'm not clear what it does.
Do you want to build the model from scratch? Otherwise you might want to look into the translate.py-model. Here your issue is taken care of by:
- padding the input (and output) sequences with a PAD-symbol (basically a neutral "no info"-symbol)
- buckets: For different groups of lengths you can create different buckets (makes sense only if your sequence-lengths are very different shortest to longest
You DONT have to batch inputs/output sequence of same length into a batch. TF has a way to specify the input size. The parameter "sequence_length", controls the number of time steps a cell is unrolled. So the TF will unroll your cell only up to sequence_length but not to the step size.
So while feeding the inputs and outputs also feed a sequence_length array which contain the length of each input
tf.nn.bidirectional_rnn(fwd_stacked_lstm_cells, bwd_stacked_lstm_cells,
reshaped_inputs,
sequence_length=sequence_length)
.....
feed_dict={
model.inputs: x,
model.targets: y,
model.sequence_length: lengths})
where
len(lengths) == batch_size and
for all i, lengths[i] == length of input x[i] (same as length of outpu y[i])