What is the attention penalty in the Speech-Transformer paper? (updated)

github: https://github.com/sephiroce/tfsr/tree/exprimental
I'm trying to reproduce recognition accuracies described in the speech transformer paper [1].
The attention penalty is a technique I could not fully understand.
This is the description of the attention penalty in the paper.
"In addition, we encouraged the model attending to closer positions by adding
bigger penalty on the attention weights of more distant position-pairs."
I understood this to mean adding negative values to the scaled attention logits (before masking) that grow in magnitude the farther a position pair is from the diagonal, except for the first multi-head attention in the decoders.
This is a code snippet for computing attention weights.
# Q * trans(K): (..., seq_len_q, seq_len_k)
matmul_qk = tf.matmul(query, key, transpose_b=True)

# scaled matmul_qk: ( Q * trans(K) ) / sqrt(d_k)
dimension_of_key = tf.cast(tf.shape(key)[-1], tf.float32)
scaled_attention_logits = matmul_qk / tf.math.sqrt(dimension_of_key)

# add the mask to the scaled tensor
if mask is not None:
    scaled_attention_logits += (mask * -1e9)

# softmax is normalized on the last axis (seq_len_k) so that the scores add up to 1
attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)

# add the penalty to the attention weights and linearly re-normalize them
if attention_penalty is not None and att_penalty_scale > 0:
    attention_weights += (attention_penalty * att_penalty_scale)
    attention_weights += tf.math.abs(tf.math.reduce_min(attention_weights))
    inv_sum = 1 / tf.math.reduce_sum(attention_weights, axis=-1)
    attention_weights = tf.einsum('ijlm,ijl->ijlm', attention_weights, inv_sum)
The source code snippet below is for creating an attention penalty matrix.
I could not find an efficient way to create an attention penalty matrix for the second multi-head attention weights in the decoders, since those attention maps are not diagonal. Thus, as a first step, I am applying the attention penalty only to the encoders.
The code assigns linearly larger penalties to elements farther from the diagonal.
There are two hyper-parameters: attention_penalty_scale (similar to the penalty_values Jindřich suggested) and the width of the diagonal stripe.
I might be able to add an option such as stripe_step_size. Currently, stripe_step_size can be interpreted as 1.
def create_attention_penalty(inp_len, tar_len, num_heads, attention_penalty_width):
    max_inp_len = tf.cast(tf.math.reduce_max(inp_len), tf.int32)
    n_batch = tf.shape(inp_len)[0]

    enc_att_penalty = tf.ones([n_batch, num_heads, max_inp_len, max_inp_len])
    accum = tf.zeros([n_batch, num_heads, max_inp_len, max_inp_len])
    for i in range(attention_penalty_width - 1, max_inp_len - 1):
        accum += tf.linalg.band_part(enc_att_penalty, i, i) - 1
    enc_att_penalty = accum

    return enc_att_penalty, None
Even though I implemented it as I understood it, I could not gain any accuracy improvement. There is another downside of this implementation: training became slower.
Q) How to efficiently apply this attention penalty method for square and non-square attention weights?
Reference
[1] Linhao Dong, Shuang Xu, Bo Xu, Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition, ICASSP 2018, https://ieeexplore.ieee.org/document/8462506

I think you understand it well. They probably did a stripe around the diagonal, something like:
attention_penalty = (1 - tf.linalg.band_part(tf.ones_like(scaled_attention_logits), stripe_size, stripe_size)) * penalty
However, you probably need to experiment more with what the stripe_size and penalty values should be, because the paper does not say much. Or you can try writing to the authors.
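Building on that idea, here is a minimal sketch of one way to construct such a penalty in a single vectorized step for both square (encoder self-attention) and non-square (decoder cross-attention) shapes. The helper name, the linear growth outside the stripe, and the rescaling of key positions onto the query axis are my assumptions, not something the paper or the snippet above specifies.
import tensorflow as tf

def linear_distance_penalty(q_len, k_len, width, dtype=tf.float32):
    """Hypothetical helper: 0 inside a stripe of the given width around the
    (rescaled) diagonal, growing linearly with distance outside it."""
    q_pos = tf.cast(tf.range(q_len), dtype)    # (q_len,)
    k_pos = tf.cast(tf.range(k_len), dtype)    # (k_len,)
    # For non-square maps, rescale key positions onto the query axis so the
    # "diagonal" runs corner to corner (an assumption, not from the paper).
    k_pos = k_pos * tf.cast(q_len - 1, dtype) / tf.cast(tf.maximum(k_len - 1, 1), dtype)
    dist = tf.abs(q_pos[:, None] - k_pos[None, :])           # (q_len, k_len)
    return tf.maximum(dist - tf.cast(width, dtype), 0.0)     # broadcastable over batch and heads

# Example use: subtract the scaled penalty from the logits before the softmax,
# which is one way to read "bigger penalty on ... more distant position-pairs".
# scaled_attention_logits -= att_penalty_scale * linear_distance_penalty(
#     tf.shape(query)[-2], tf.shape(key)[-2], attention_penalty_width)
Subtracting it once from the logits also avoids the Python loop over max_inp_len and the post-softmax renormalization, which may be where the training slowdown came from.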

Related

GPflow multi-class: How to squeeze many gp.predict_f_samples to obtain their probabilities?

I classify MNIST digits and I want to sample the probabilities (not the latent function) for each class many times. However, gp.predict_y gives the probabilities just for one case.
Thus I take f_samples = gp.predict_f_samples, which returns numerous samples from the underlying latent function.
Now, how to 'squeeze' the f_samples through the robust_max likelihood?
Code for my gp:
kernel = gpflow.kernels.Matern52(input_dim=128, ARD=ARD, active_dims=np.arange(128)) \
         + gpflow.kernels.White(input_dim=128, active_dims=np.arange(128))

# Robustmax Multiclass Likelihood
invlink = gpflow.likelihoods.RobustMax(10)                       # Robustmax inverse link function
likelihood = gpflow.likelihoods.MultiClass(10, invlink=invlink)  # Multiclass likelihood

Z = x_train[::5].copy()  # inducing inputs

gp = gpflow.models.SVGP(x_train, y_train, num_latent=10,
                        kern=kernel, Z=Z, likelihood=likelihood,
                        whiten=True, q_diag=True)
GPflow version: 1.5.1
Once you've sampled you're no longer working with a probability distribution - you have actual values for each of your 10 latent functions. To convert a sample to probabilities over the classes you can just apply the RobustMax function (probability 1-epsilon for the largest latent function, epsilon/9 for all the others) to the 10 values you get. E.g.
eps = 0.001
f_samples = gp.predict_f_samples(x_test, num_samples)  # (num_samples, N, 10)
largests = np.argmax(f_samples, axis=2)                 # index of the largest latent function
prob_samples = np.eye(10)[largests] * (1 - eps - eps / 9) + eps / 9
Note that the probabilities you get will all be 0.999 on one class and eps/9 ≈ 0.0001 on all the others - that's what RobustMax is. If you're intending to average over your samples, you probably just want to call gp.predict_y(), which actually integrates the RobustMax over the probability distribution and can give you smoother class probabilities if the latent means are close.
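If you do want a Monte Carlo estimate from the samples rather than the analytic integral, a minimal sketch (using the prob_samples array from the snippet above) is to average over the sample axis and compare with gp.predict_y:
import numpy as np

# Monte Carlo estimate of the class probabilities, averaged over the samples
mc_probs = prob_samples.mean(axis=0)      # shape (N, 10)

# Analytic alternative: integrates RobustMax over the latent distribution
y_mean, y_var = gp.predict_y(x_test)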

Custom loss in Keras with softmax to one-hot

I have a model that outputs a Softmax, and I would like to develop a custom loss function. The desired behaviour would be:
1) Softmax to one-hot (normally I do numpy.argmax(softmax_vector) and set that index to 1 in a null vector, but this is not allowed in a loss function).
2) Multiply the resulting one-hot vector by my embedding matrix to get an embedding vector (in my context: the word-vector that is associated to a given word, where words have been tokenized and assigned to indices, or classes for the Softmax output).
3) Compare this vector with the target (this could be a normal Keras loss function).
I know how to write a custom loss function in general, but not how to do this. I found this closely related question (unanswered), but my case is a bit different, since I would like to preserve my softmax output.
It is possible to mix TensorFlow and Keras in your custom loss function. Once you have access to all the TensorFlow functions, things become very easy. Here is an example of how this function could be implemented.
import tensorflow as tf

def custom_loss(target, softmax):
    max_indices = tf.argmax(softmax, -1)
    # Get the embedding vectors. In TensorFlow, this can be done directly
    # with tf.nn.embedding_lookup
    embedding_vectors = tf.nn.embedding_lookup(your_embedding_matrix, max_indices)
    # Compare with the target using any normal Keras loss function
    loss = some_keras_loss_function(target, embedding_vectors)
    loss = tf.reduce_mean(loss)
    return loss
Fan Luo's answer points in the right direction, but ultimately will not work because it involves non-differentiable operations. Note that such operations are acceptable on the target value (a loss function takes a target value and a predicted value; non-differentiable operations are only fine on the target value).
To be fair, that was what I was asking in the first place. It is not possible to do exactly what I wanted, but we can get a similar, differentiable behaviour:
1) Element-wise power of the softmax values. This makes smaller values much smaller. For example, with a power of 4, [0.5, 0.2, 0.7] becomes [0.0625, 0.0016, 0.2401]. Note that 0.2 is comparable to 0.7, but 0.0016 is negligible with respect to 0.24. The higher my_power is, the more similar to a one-hot vector the final result will be.
soft_extreme = Lambda(lambda x: x ** my_power)(softmax)
2) Importantly, both softmax and one-hot vectors are normalized, but our "soft_extreme" is not. First, find the sum of the array:
norm = tf.reduce_sum(soft_extreme, 1)
3) Normalize soft_extreme:
almost_one_hot = Lambda(lambda x: x / norm)(soft_extreme)
Note: Setting my_power too high in 1) will result in NaNs. If you need a better softmax to one-hot conversion, then you may do steps 1 to 3 two or more times in a row.
4) Finally, we want the vector from the dictionary. A lookup is forbidden, but we can take a weighted average vector using matrix multiplication. Because our almost_one_hot is close to a one-hot encoding, this average will be close to the vector associated with the highest argument (the originally intended behaviour). The higher my_power is in (1), the truer this will be:
target_vectors = tf.tensordot(almost_one_hot, embedding_matrix, axes=[[1], [0]])
Note: This will not work directly with batches! In my case, I reshaped my "one hot" from [batch, dictionary_length] to [batch, 1, dictionary_length] using tf.reshape, then tiled my embedding_matrix batch times and finally used:
predicted_vectors = tf.matmul(reshaped_one_hot, tiled_embedding)
There may be more elegant solutions (or less memory-hungry, if tiling the embedding matrix is not an option), so feel free to explore more.
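Putting the steps above together, here is a minimal self-contained sketch of the differentiable approximation; the names embedding_matrix, target_vectors and my_power, and the mean-squared-error comparison at the end, are illustrative assumptions rather than part of the original answer:
import tensorflow as tf

def soft_one_hot_embedding_loss(target_vectors, softmax, embedding_matrix, my_power=4):
    """Differentiable stand-in for argmax -> one-hot -> embedding lookup."""
    # 1) Sharpen the softmax: an element-wise power pushes small values towards 0.
    soft_extreme = softmax ** my_power                             # (batch, dict_size)
    # 2)-3) Re-normalize so each row sums to 1 again.
    almost_one_hot = soft_extreme / tf.reduce_sum(soft_extreme, axis=1, keepdims=True)
    # 4) Weighted average of the embedding vectors; close to a plain lookup
    #    when the distribution is nearly one-hot.
    predicted_vectors = tf.tensordot(almost_one_hot, embedding_matrix,
                                     axes=[[1], [0]])              # (batch, embedding_dim)
    # Compare with the target embedding vectors (mean squared error as an example).
    return tf.reduce_mean(tf.square(target_vectors - predicted_vectors))
Depending on your shapes, tf.tensordot with axes=[[1], [0]] may already handle the batch dimension here, which could make the reshape-and-tile workaround above unnecessary.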

How can I improve my LSTM accuracy in Tensorflow

I'm trying to figure out how to decrease the error in my LSTM. It's an odd use case because rather than classifying, we take in short lists (up to 32 elements long) and output a series of real numbers ranging from -1 to 1, representing angles. Essentially, we want to reconstruct short protein loops from amino acid inputs.
In the past we had redundant data in our datasets, so the accuracy reported was incorrect. Since removing the redundant data our validation accuracy has gotten much worse, which suggests our network had learned to memorise the most frequent examples.
Our dataset is 10,000 items, split 70/20/10 between train, validation and test. We use a bi-directional LSTM as follows:
x = tf.cast(tf_train_dataset, dtype=tf.float32)
output_size = FLAGS.max_cdr_length * 4
dmask = tf.placeholder(tf.float32, [None, output_size], name="dmask")
keep_prob = tf.placeholder(tf.float32, name="keepprob")
sizes = [FLAGS.lstm_size, int(math.floor(FLAGS.lstm_size / 2)), int(math.floor(FLAGS.lstm_size / 4))]
single_rnn_cell_fw = tf.contrib.rnn.MultiRNNCell( [lstm_cell(sizes[i], keep_prob, "cell_fw" + str(i)) for i in range(len(sizes))])
single_rnn_cell_bw = tf.contrib.rnn.MultiRNNCell( [lstm_cell(sizes[i], keep_prob, "cell_bw" + str(i)) for i in range(len(sizes))])
length = create_length(x)
initial_state_fw = single_rnn_cell_fw.zero_state(FLAGS.batch_size, dtype=tf.float32)
initial_state_bw = single_rnn_cell_bw.zero_state(FLAGS.batch_size, dtype=tf.float32)
outputs, states = tf.nn.bidirectional_dynamic_rnn(cell_fw=single_rnn_cell_fw, cell_bw=single_rnn_cell_bw, inputs=x, dtype=tf.float32, sequence_length = length)
output_fw, output_bw = outputs
states_fw, states_bw = states
output_fw = last_relevant(FLAGS, output_fw, length, "last_fw")
output_bw = last_relevant(FLAGS, output_bw, length, "last_bw")
output = tf.concat((output_fw, output_bw), axis=1, name='bidirectional_concat_outputs')
test = tf.placeholder(tf.float32, [None, output_size], name="train_test")
W_o = weight_variable([sizes[-1]*2, output_size], "weight_output")
b_o = bias_variable([output_size],"bias_output")
y_conv = tf.tanh(tf.matmul(output, W_o) * dmask, name="output")
Essentially, we use 3 layers of LSTM with 256, 128 and 64 units each. We take the last step of both the forward and backward passes and concatenate them together. These feed into a final, fully connected layer that presents the data in the way we need it. We use a mask to zero out the steps we don't need.
Our cost function uses a mask again, and takes the mean of the squared difference. We build the mask from the test data. Values to ignore are set to -3.0.
def cost(goutput, gtest, gweights, FLAGS):
    mask = tf.sign(tf.add(gtest, 3.0))
    basic_error = tf.square(gtest - goutput) * mask
    basic_error = tf.reduce_sum(basic_error)
    basic_error /= tf.reduce_sum(mask)
    return basic_error
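For a concrete toy example of how that mask behaves (the numbers here are made up):
import numpy as np

gtest = np.array([0.5, -0.2, -3.0, -3.0])    # -3.0 marks the positions to ignore
goutput = np.array([0.4, -0.1, 0.7, 0.9])     # predictions at ignored positions are arbitrary

mask = np.sign(gtest + 3.0)                   # [1., 1., 0., 0.]
error = np.sum((gtest - goutput) ** 2 * mask) / np.sum(mask)
print(error)                                  # ≈ 0.01: only the first two positions contribute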
To train the net I've used a variety of optimizers. The lowest scores have been obtained with the AdamOptimizer. The others, such as Adagrad, Adadelta and RMSProp, tend to flatline around 0.3/0.4 error, which is not particularly great.
Our learning rate is 0.004 with a batch size of 200. We use a dropout layer with probability 0.5.
I've tried adding more layers, changing learning rates, batch sizes, even the representation of the data. I've attempted batch regularisation, L1 and L2 weight regularisation (though perhaps incorrectly) and I've even considered switching to a convnet approach instead.
Nothing seems to make any difference. What has seemed to work is changing the optimizer. Adam seems noisier as it improves, but it does get closer than the other optimizers.
We need to get down to a value much closer to 0.05 or 0.01. Sometimes the training error touches 0.09 but the validation doesn't follow. I've run this network for about 500 epochs so far (about 8 hours) and it tends to settle around 0.2 validation error.
I'm not quite sure what to attempt next. A decayed learning rate might help, but I suspect there is something more fundamental I need to do. It could be something as simple as a bug in the code - I need to double-check the masking.
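For reference, here is a minimal sketch of the decayed learning rate mentioned above, assuming the TF 1.x API used elsewhere in the question; the schedule values are placeholders, and the wiring of cost() arguments is assumed from the snippets above:
import tensorflow as tf

global_step = tf.Variable(0, trainable=False, name="global_step")

# Decay the 0.004 starting rate by 4% every 1000 steps (placeholder schedule).
learning_rate = tf.train.exponential_decay(0.004, global_step,
                                           decay_steps=1000, decay_rate=0.96,
                                           staircase=True)

loss = cost(goutput, gtest, gweights, FLAGS)
train_step = tf.train.AdamOptimizer(learning_rate).minimize(loss, global_step=global_step)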

How to constrain a layer to be a probability matrix?

I recently read this paper which deals with noisy labels in convolutional neural networks.
They model label noise by a probability transition matrix, which forms a simple constrained linear layer after the softmax output.
So as an example we may have a 3-by-3 probability transition matrix (3 classes):
(Image: an example 3-by-3 probability transition matrix; the sum of each column has to be 1.)
This matrix Q is basically trained in the same way as the rest of the network via backpropagation, but it needs to be constrained to be a probability matrix. Quote from the paper:
After taking a gradient step with the Q and the model weights, we project Q back to the subspace of probability matrices because it represents conditional probabilities.
Now I am wondering what the best way is to implement such a layer in TensorFlow.
I have some ideas, but I'm not sure which would work or what the best procedure is.
1) Hard code the constraint in the model before any training is done, something like:
# ... build conv model without Q
[...]
# shape of y_conv (output CNN) assumed to be a [3,1] vector
y_conv = tf.nn.softmax(y_conv, 0)
# add linear layer representing Q, no bias
W_Q = weight_variable([3, 3])
# add constraint: columns are valid probability distribution
W_Q = tf.nn.softmax(W_Q, 0)
# output of model:
Q_out = tf.matmul(W_Q, y_conv)
# now compute loss, gradients and start training
2) Compute and apply gradients to the whole model (Q included), then apply constraint
train_op = ...
constraint_op = tf.assign(W_Q, tf.nn.softmax(W_Q, 0))

sess = tf.Session()
# compute and apply gradients in form of a train_op
sess.run(train_op)
sess.run(constraint_op)
I think the second approach is closer to the paper quote, but I am not sure to what extent external assignments confuse training.
Or maybe my ideas are bananas. I hope you can give me some advice!

Tensorflow: What does tf.nn.separable_conv2d do?

I'm not quite sure what tf.nn.separable_conv2d does exactly. It seems that the pointwise_filter is the scaling factor for the different features when generating one pixel of the next layer. But I'm not sure whether my interpretation is correct. Is there any reference for this method, and what's the benefit?
tf.nn.separable_conv2d generates the same output shape as tf.nn.conv2d, so I assumed I could replace tf.nn.conv2d with tf.nn.separable_conv2d. But the result when using tf.nn.separable_conv2d seems to be very bad. The network stopped learning very early. On the MNIST dataset, the accuracy is just a random guess, ~10%.
I thought that when I set the pointwise_filter values to all 1.0 and made it not trainable, I would get the same thing as tf.nn.conv2d. But not really... still ~10% accuracy.
But when tf.nn.conv2d is used with the same hyper-parameters, the accuracy can be 99%. Why?
Also, it requires channel_multiplier * in_channels < out_channels. Why? What is the role of channel_multiplier here?
Thanks.
Edit:
I previously used channel_multiplier = 1.0. Maybe that is a bad choice. After I changed it to 2.0, the accuracy became much better. But what is the role of channel_multiplier? Why is 1.0 not a good value?
tf.nn.separable_conv2d() implements the so-called 'separable convolution' described on slide 26 and onwards of this talk.
The idea is that instead of convolving jointly across all channels of an image, you run a separate 2D convolution on each channel with a depth of channel_multiplier. The in_channels * channel_multiplier intermediate channels are concatenated together and mapped to out_channels using a 1x1 convolution.
It's often an effective way to reduce the parametric complexity of early convolutions in a convnet, and can materially speed up training. channel_multiplier controls that complexity and would typically be 4 to 8 for an RGB input. For a grayscale input, using it makes little sense.
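To illustrate that description, here is a minimal sketch of the same operation built from its two pieces, a depthwise convolution followed by a 1x1 pointwise convolution, using TF 1.x; the shapes and variable names are my own example values, and with identical filters this should match tf.nn.separable_conv2d:
import tensorflow as tf

in_channels, channel_multiplier, out_channels = 3, 2, 8

images = tf.placeholder(tf.float32, [None, 32, 32, in_channels])
depthwise_filter = tf.get_variable(
    "dw_filter", [3, 3, in_channels, channel_multiplier])               # one 3x3 filter stack per input channel
pointwise_filter = tf.get_variable(
    "pw_filter", [1, 1, in_channels * channel_multiplier, out_channels])  # 1x1 mixing across channels

# Step 1: convolve each input channel separately (depthwise),
# producing in_channels * channel_multiplier intermediate channels.
depthwise_out = tf.nn.depthwise_conv2d(images, depthwise_filter,
                                       strides=[1, 1, 1, 1], padding='SAME')

# Step 2: mix those intermediate channels into out_channels with a 1x1 convolution.
separable_out = tf.nn.conv2d(depthwise_out, pointwise_filter,
                             strides=[1, 1, 1, 1], padding='SAME')
# separable_out should equal
# tf.nn.separable_conv2d(images, depthwise_filter, pointwise_filter,
#                        strides=[1, 1, 1, 1], padding='SAME')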
In the regular 2D convolution performed over multiple input channels, the filter is as deep as the input and lets us freely mix channels to generate each element of the output. Depthwise convolutions don't do that - each channel is kept separate, hence the name depthwise. See the diagram in [1] for an illustration of how this works.
If you look at the official documentation [2] you will find:
output[b, i, j, k] = sum_{di, dj, q, r}
    input[b, strides[1] * i + di, strides[2] * j + dj, q] *
    depthwise_filter[di, dj, q, r] *
    pointwise_filter[0, 0, q * channel_multiplier + r, k]
And some sample TensorFlow code to test it:
import tensorflow as tf
import numpy as np

width = 8
height = 8
batch_size = 100
filter_height = 3
filter_width = 3
in_channels = 3
channel_multiplier = 1
out_channels = 3

input_tensor = tf.get_variable(shape=(batch_size, height, width, in_channels), name="input")
depthwise_filter = tf.get_variable(shape=(filter_height, filter_width, in_channels, channel_multiplier), name="depthwise_filter")
pointwise_filter = tf.get_variable(shape=[1, 1, channel_multiplier * in_channels, out_channels], name="pointwise_filter")

output = tf.nn.separable_conv2d(
    input_tensor,
    depthwise_filter,
    pointwise_filter,
    strides=[1, 1, 1, 1],
    padding='SAME',
)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    output_value = sess.run(output, feed_dict={
        input_tensor: np.random.rand(batch_size, height, width, in_channels),
        depthwise_filter: np.random.rand(filter_height, filter_width, in_channels, channel_multiplier),
        pointwise_filter: np.random.rand(1, 1, channel_multiplier * in_channels, out_channels)})
    print(np.shape(output_value))
credit:
[1] https://eli.thegreenplace.net/2018/depthwise-separable-convolutions-for-machine-learning/
[2] https://www.tensorflow.org/api_docs/python/tf/nn/separable_conv2d
To answer the last part of the question:
Also, it requires channel_multiplier * in_channels < out_channels. Why?
I don't know why this constraint was originally put in, but it has been removed in the current master branch of TF and should make it into version 1.3. The thinking was probably something along the lines of "If you are reducing the number of channels in the pointwise step, you might as well have picked a smaller channel multiplier and saved on computation." I guess this reasoning is flawed because the pointwise step can combine values from different depthwise filters, or maybe because one might want to reduce the dimension a bit, not by a full factor.