AttentionQKV from Trax - attention-model

The AttentionQKV layer in Trax is implemented as follows:
def AttentionQKV(d_feature, n_heads=1, dropout=0.0, mode='train'):
  """Returns a layer that maps (q, k, v, mask) to (activations, mask).
  See `Attention` above for further context/details.
  Args:
    d_feature: Depth/dimensionality of feature embedding.
    n_heads: Number of attention heads.
    dropout: Probabilistic rate for internal dropout applied to attention
        activations (based on query-key pairs) before dotting them with values.
    mode: One of `'train'`, `'eval'`, or `'predict'`.
  """
  return cb.Serial(
      cb.Parallel(
          core.Dense(d_feature),
          core.Dense(d_feature),
          core.Dense(d_feature),
      ),
      PureAttention(  # pylint: disable=no-value-for-parameter
          n_heads=n_heads, dropout=dropout, mode=mode),
      core.Dense(d_feature),
  )
In particular, what is the purpose of the three parallel dense layers? The input to this layer is q, k, v, and mask. Why are q, k, and v each put through a dense layer?

This code snippet implements the equation at the top of page 5 of the "Attention Is All You Need" paper, which introduced the Transformer models in 2017. The computation is illustrated in Figure 2 of the paper.
The hidden states are projected into h attention heads, which perform scaled dot-product attention in parallel. Each projection can be interpreted as extracting the information that is relevant for its head. Each head then performs a probabilistic retrieval based on different (learned) criteria.
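To make the role of the three parallel dense layers concrete, here is a minimal NumPy sketch of the same data flow. It is not Trax's implementation: the names `attention_qkv`, `split_heads`, and the weight matrices are made up for illustration, the weights are random, and bias, mask, and dropout are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def split_heads(x, n_heads):
    # (batch, length, d_feature) -> (batch, n_heads, length, d_head)
    batch, length, d_feature = x.shape
    d_head = d_feature // n_heads
    return x.reshape(batch, length, n_heads, d_head).transpose(0, 2, 1, 3)

def attention_qkv(q, k, v, w_q, w_k, w_v, w_o, n_heads):
    # The three "parallel dense layers": one learned projection each for
    # q, k, and v (bias omitted in this sketch).
    q, k, v = q @ w_q, k @ w_k, v @ w_v
    qh, kh, vh = (split_heads(t, n_heads) for t in (q, k, v))
    d_head = qh.shape[-1]
    # Scaled dot-product attention, computed for all heads at once.
    scores = qh @ kh.transpose(0, 1, 3, 2) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)
    out = weights @ vh  # (batch, n_heads, len_q, d_head)
    batch, _, len_q, _ = out.shape
    merged = out.transpose(0, 2, 1, 3).reshape(batch, len_q, n_heads * d_head)
    # The final dense layer mixes the concatenated heads.
    return merged @ w_o, weights

rng = np.random.default_rng(0)
d_feature, n_heads = 8, 2
x = rng.normal(size=(1, 5, d_feature))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_feature, d_feature)) for _ in range(4))
out, weights = attention_qkv(x, x, x, w_q, w_k, w_v, w_o, n_heads)
```

Without the projections, every head would compare the same raw vectors, so the heads could not specialize. With them, each head gets its own learned view of the queries, keys, and values.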

Related

Custom loss function in Keras that penalizes output from intermediate layer

Imagine I have a convolutional neural network to classify MNIST digits, such as this Keras example. This is purely for experimentation so I don't have a clear reason or justification as to why I'm doing this, but let's say I would like to regularize or penalize the output of an intermediate layer. I realize that the visualization below does not correspond to the MNIST CNN example and instead just has several fully connected layers. However, to help visualize what I mean let's say I want to impose a penalty on the node values in layer 4 (either pre or post activation is fine with me).
In addition to having a categorical cross entropy loss term which is typical for multi-class classification, I would like to add another term to the loss function that minimizes the squared sum of the output at a given layer. This is somewhat similar in concept to l2 regularization, except that l2 regularization is penalizing the squared sum of all weights in the network. Instead, I am purely interested in the values of a given layer (e.g. layer 4) and not all the weights in the network.
I realize that this requires writing a custom loss function using keras backend to combine categorical crossentropy and the penalty term, but I am not sure how to use an intermediate layer for the penalty term in the loss function. I would greatly appreciate help on how to do this. Thanks!
Actually, what you are interested in is regularization, and Keras offers two kinds of built-in regularization for most layers (e.g. Dense, Conv1D, Conv2D, etc.):
Weight regularization, which penalizes the weights of a layer. Usually, you can use kernel_regularizer and bias_regularizer arguments when constructing a layer to enable it. For example:
l1_l2 = tf.keras.regularizers.l1_l2(l1=1.0, l2=0.01)
x = tf.keras.layers.Dense(..., kernel_regularizer=l1_l2, bias_regularizer=l1_l2)
Activity regularization, which penalizes the output (i.e. activation) of a layer. To enable this, you can use activity_regularizer argument when constructing a layer:
l1_l2 = tf.keras.regularizers.l1_l2(l1=1.0, l2=0.01)
x = tf.keras.layers.Dense(..., activity_regularizer=l1_l2)
Note that you can set activity regularization through activity_regularizer argument for all the layers, even custom layers.
In both cases, the penalties are summed into the model's loss function, and the result would be the final loss value which would be optimized by the optimizer during training.
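As a rough illustration of what "summed into the model's loss" means, here is the arithmetic for a single example in plain NumPy. The function names and numbers are made up, and any batch averaging that Keras applies is ignored.

```python
import numpy as np

def cross_entropy(y_true, y_prob):
    # Categorical cross-entropy for one example.
    return float(-np.sum(y_true * np.log(y_prob)))

def total_loss(y_true, y_prob, activations, l2=0.01):
    # L2 activity penalty: l2 * sum of squared outputs of the chosen layer.
    penalty = l2 * float(np.sum(np.square(activations)))
    return cross_entropy(y_true, y_prob) + penalty

y_true = np.array([0.0, 1.0, 0.0])   # one-hot label
y_prob = np.array([0.2, 0.7, 0.1])   # softmax output of the model
layer4 = np.array([1.0, -2.0, 0.5])  # hypothetical outputs of "layer 4"
loss = total_loss(y_true, y_prob, layer4, l2=0.01)
```

Here the penalty term is 0.01 * (1 + 4 + 0.25) = 0.0525. It depends only on the activations of that one layer, not on the weights of the network.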
Further, besides the built-in regularization methods (i.e. L1 and L2), you can define your own custom regularizer method (see Developing new regularizers). As always, the documentation provides additional information which might be helpful as well.
Just specify the hidden layer as an additional output. As tf.keras.Models can have multiple outputs, this is totally allowed. Then define your custom loss using both values.
Extending your example:
input = tf.keras.Input(...)
x1 = tf.keras.layers.Dense(10)(input)
x2 = tf.keras.layers.Dense(10)(x1)
x3 = tf.keras.layers.Dense(10)(x2)
model = tf.keras.Model(inputs=[input], outputs=[x3, x2])
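For the custom loss function, I think it is something like this:
def custom_loss(y_true, y_pred):
    # Unpack in the same order as `outputs=[x3, x2]` above.
    x3, x2 = y_pred
    label = y_true  # you might need to provide a dummy target for x2
    # Note: a loss passed to model.compile() is called once per output,
    # so this combined form fits a custom training loop.
    return f1(x2) + f2(label, x3)  # whatever you want to do with f1, f2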
Another way to add loss based on input or calculations at a given layer is to use the add_loss() API. If you are already creating a custom layer, the custom loss can be added directly to the layer. Or a custom layer can be created that simply takes the input, calculates and adds the loss, and then passes the unchanged input along to the next layer.
Here is the code taken directly from the documentation (in case the link is ever broken):
import tensorflow as tf
from tensorflow.keras.layers import Layer

class MyActivityRegularizer(Layer):
    """Layer that creates an activity sparsity regularization loss."""

    def __init__(self, rate=1e-2):
        super(MyActivityRegularizer, self).__init__()
        self.rate = rate

    def call(self, inputs):
        # We use `add_loss` to create a regularization loss
        # that depends on the inputs.
        self.add_loss(self.rate * tf.reduce_sum(tf.square(inputs)))
        return inputs

What is attention penalty in speech transformer paper? (updated)

github: https://github.com/sephiroce/tfsr/tree/exprimental
I'm trying to reproduce recognition accuracies described in the speech transformer paper [1].
The attention penalty is a technique I could not fully understand.
This is the description of the attention penalty in the paper.
"In addition, we encouraged the model attending to closer positions by adding
bigger penalty on the attention weights of more distant position-pairs."
My understanding is that this means adding increasingly negative values to the scaled attention logits (before masking) the farther a position pair is from the diagonal, except in the first multi-head attention in the decoders.
This is a code snippet for computing attention weights.
# Q * trans(K): (..., seq_len_q, seq_len_k)
matmul_qk = tf.matmul(query, key, transpose_b=True)
# scaled matmul_qk: ( Q * trans(K) ) / sqrt(d_k)
dimension_of_key = tf.cast(tf.shape(key)[-1], tf.float32)
scaled_attention_logits = matmul_qk / tf.math.sqrt(dimension_of_key)
# add the mask to the scaled tensor
if mask is not None:
    scaled_attention_logits += (mask * -1e9)
# softmax is normalized on the last axis (seq_len_k) so that the scores
# add up to 1.
attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)
# Adding penalty to attention weights and linearly re-normalize it.
if attention_penalty is not None and att_penalty_scale > 0:
    attention_weights += (attention_penalty * att_penalty_scale)
    attention_weights += tf.math.abs(tf.math.reduce_min(attention_weights))
    inv_sum = 1 / tf.math.reduce_sum(attention_weights, axis=-1)
    attention_weights = tf.einsum('ijlm,ijl->ijlm', attention_weights, inv_sum)
The source code snippet below is for creating an attention penalty matrix.
I could not find any efficient way to create an attention penalty matrix for the second multi-head attention weights in decoders since the attention maps are not diagonal. Thus first I am trying to apply the attention penalty to encoders.
The source code assigns penalties that grow linearly with the distance from the diagonal.
There are two hyperparameters: attention_penalty_scale (similar to the penalty_values that Jindřich suggested) and the width of the diagonal band.
I might add an option such as stripe_step_size. Currently stripe_step_size is effectively 1.
def create_attention_penalty(inp_len, tar_len, num_heads, attention_penalty_width):
    max_inp_len = tf.cast(tf.math.reduce_max(inp_len), tf.int32)
    n_batch = tf.shape(inp_len)[0]
    enc_att_penalty = tf.ones([n_batch, num_heads, max_inp_len, max_inp_len])
    accum = tf.zeros(([n_batch, num_heads, max_inp_len, max_inp_len]))
    for i in range(attention_penalty_width - 1, max_inp_len - 1):
        accum += tf.linalg.band_part(enc_att_penalty, i, i, name=None) - 1
    enc_att_penalty = accum
    return enc_att_penalty, None
Even though I implemented it as I understand it, I did not see any accuracy improvement. This implementation has another downside: training became slower.
Q) How can this attention penalty be applied efficiently to both square and non-square attention weights?
Reference
[1] Linhao Dong, Shuang Xu, Bo Xu, Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition, ICASSP 2018, https://ieeexplore.ieee.org/document/8462506
I think you understand it well. They probably used a stripe around the diagonal, something like:
band = tf.linalg.band_part(tf.ones_like(scaled_attention_logits), stripe_size, stripe_size)
attention_penalty = (1 - band) * penalty_values
However, you probably need to experiment more with what stripe_size and penalty_values should be, because the paper does not say much. Or you can try writing to the authors.
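One possible way to avoid the per-width loop and to cover non-square attention maps is to build the penalty from index distances with broadcasting. The sketch below is only one interpretation, since the paper does not give the formula. Three things in it are assumptions: the function `distance_penalty`, the linear mapping of query positions onto the key axis (which defines the "diagonal" when the lengths differ), and adding the penalty to the logits before the softmax.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def distance_penalty(len_q, len_k, width, scale):
    # Assumption: stretch query positions linearly onto the key axis so that
    # a "diagonal" exists even when len_q != len_k.
    q_pos = np.arange(len_q)[:, None] * (len_k - 1) / max(len_q - 1, 1)
    k_pos = np.arange(len_k)[None, :]
    distance = np.abs(q_pos - k_pos)
    # Zero inside the band of the given width, linearly more negative outside.
    return -scale * np.maximum(distance - width, 0.0)

logits = np.zeros((4, 6))  # stand-in for scaled attention logits
penalty = distance_penalty(4, 6, width=1, scale=0.5)
# Adding the penalty before the softmax keeps each row normalized,
# so no separate re-normalization step is needed.
weights = softmax(logits + penalty, axis=-1)
```

The penalty has shape (len_q, len_k), so it broadcasts over leading batch and head dimensions. It can be computed once per pair of sequence lengths instead of accumulating one band per width.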

Custom loss in Keras with softmax to one-hot

I have a model that outputs a Softmax, and I would like to develop a custom loss function. The desired behaviour would be:
1) Softmax to one-hot (normally I do numpy.argmax(softmax_vector) and set that index to 1 in a null vector, but this is not allowed in a loss function).
2) Multiply the resulting one-hot vector by my embedding matrix to get an embedding vector (in my context: the word-vector that is associated to a given word, where words have been tokenized and assigned to indices, or classes for the Softmax output).
3) Compare this vector with the target (this could be a normal Keras loss function).
I know how to write a custom loss function in general, but not to do this. I found this closely related question (unanswered), but my case is a bit different, since I would like to preserve my softmax output.
It is possible to mix TensorFlow and Keras in your custom loss function. Once you have access to all the TensorFlow functions, things become very easy. Here is an example of how this function could be implemented.
import tensorflow as tf

def custom_loss(target, softmax):
    max_indices = tf.argmax(softmax, -1)
    # Get the embedding vectors. In TensorFlow, this can be done directly
    # with tf.nn.embedding_lookup
    embedding_vectors = tf.nn.embedding_lookup(your_embedding_matrix, max_indices)
    # Do anything you want with a normal Keras loss function
    loss = some_keras_loss_function(target, embedding_vectors)
    loss = tf.reduce_mean(loss)
    return loss
Fan Luo's answer points in the right direction, but ultimately it will not work because it involves non-differentiable operations. Note that such operations are acceptable on the true value: a loss function takes a true value and a predicted value, and non-differentiable operations are fine only on the former.
To be fair, that was what I was asking for in the first place. It is not possible to do what I wanted, but we can get a similar, differentiable behaviour:
1) Element-wise power of the softmax values. This makes smaller values much smaller. For example, with a power of 4, [0.5, 0.2, 0.7] becomes [0.0625, 0.0016, 0.2401]. Note that 0.2 is comparable to 0.7, but 0.0016 is negligible with respect to 0.24. The higher my_power is, the more similar to a one-hot the final result will be.
soft_extreme = Lambda(lambda x: x ** my_power)(softmax)
2) Importantly, both softmax and one-hot vectors are normalized, but our soft_extreme is not. First, find the sum of each row (keepdims=True keeps the result broadcastable against soft_extreme):
norm = tf.reduce_sum(soft_extreme, 1, keepdims=True)
3) Normalize soft_extreme:
almost_one_hot = Lambda(lambda x: x / norm)(soft_extreme)
Note: Setting my_power too high in 1) will result in NaNs. If you need a better softmax to one-hot conversion, then you may do steps 1 to 3 two or more times in a row.
4) Finally, we want the vector from the dictionary. Lookup is forbidden, but we can take the average vector using matrix multiplication. Because almost_one_hot is similar to a one-hot encoding, this average will be similar to the vector associated with the highest argument (the originally intended behaviour). The higher my_power is in (1), the truer this will be:
predicted_vectors = tf.tensordot(almost_one_hot, embedding_matrix, axes=[[1], [0]])
Note: This will not work directly using batches! In my case, I reshaped my "one hot" (from [batch, dictionary_length] to [batch, 1, dictionary_length]) using tf.reshape. Then I tiled my embedding_matrix batch times and finally used:
predicted_vectors = tf.matmul(reshaped_one_hot, tiled_embedding)
There may be more elegant solutions (or less memory-hungry, if tiling the embedding matrix is not an option), so feel free to explore more.
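For reference, steps 1 to 4 can be sketched in NumPy as follows. The function name `almost_one_hot` mirrors the variable above, and the probabilities and the 3-word embedding matrix are made-up toy values. In NumPy, at least, a plain 2-D matrix product already handles the batch dimension. Whether the same shortcut removes the need for tiling in a given TensorFlow version is worth checking.

```python
import numpy as np

def almost_one_hot(softmax_probs, power):
    # Steps 1-3: raise to a power, then re-normalize each row.
    sharpened = softmax_probs ** power
    return sharpened / sharpened.sum(axis=-1, keepdims=True)

# Toy batch of two softmax outputs over a 3-word dictionary.
probs = np.array([[0.1, 0.2, 0.7],
                  [0.6, 0.3, 0.1]])
# Toy embedding matrix: one 2-d vector per word.
embedding_matrix = np.array([[1.0, 0.0],
                             [0.0, 1.0],
                             [1.0, 1.0]])

soft = almost_one_hot(probs, power=8)
# Step 4: weighted average of embedding vectors,
# [batch, dictionary_length] @ [dictionary_length, embedding_dim].
predicted_vectors = soft @ embedding_matrix
```

Each row of predicted_vectors is close to the embedding of that row's arg-max word, and every operation involved is differentiable.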

How to constrain a layer to be a probability matrix?

I recently read this paper which deals with noisy labels in convolutional neural networks.
They model label noise by a probability transition matrix which forms a simple
constrained linear layer after the softmax output.
So as an example we may have a 3-by-3 probability transition matrix (3 classes):
Example probability transition matrix. The sum of each column has to be 1.
This matrix Q is basically trained in the same way as the rest of the network via backpropagation. But it needs to be constrained to be a probability matrix. Quote from the paper:
After taking a gradient step with the Q and the model
weights, we project Q back to the subspace of probability matrices because it represents conditional probabilities.
Now I am wondering what the best way is to implement such a layer in TensorFlow.
I have some ideas, but I'm not sure which of them could work or which is the best procedure.
1) Hard code the constraint in the model before any training is done, something like:
# ... build conv model without Q
[...]
# shape of y_conv (output CNN) assumed to be a [3,1] vector
y_conv = tf.nn.softmax(y_conv, 0)
# add linear layer representing Q, no bias
W_Q = weight_variable([3, 3])
# add constraint: columns are valid probability distribution
W_Q = tf.nn.softmax(W_Q, 0)
# output of model:
Q_out = tf.matmul(W_Q, y_conv)
# now compute loss, gradients and start training
2) Compute and apply gradients to the whole model (Q included), then apply constraint
train_op = ...
constraint_op = tf.assign(W_Q, tf.nn.softmax(W_Q,0))
sess = tf.Session()
# compute and apply gradients in form of a train_op
sess.run(train_op)
sess.run(constraint_op)
I think the second approach is closer to the paper quote, but I am not sure to what extent external assignments confuse training.
Or maybe my ideas are bananas. I hope you can give me some advice!
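For what it is worth, one common way to read "project Q back" is a Euclidean projection of each column onto the probability simplex, using the sorting-based algorithm. Whether the paper uses exactly this projection is an assumption here, and the names `project_to_simplex` and `project_columns` are illustrative. A minimal NumPy sketch:

```python
import numpy as np

def project_to_simplex(v):
    # Euclidean projection of a vector onto {x : x >= 0, sum(x) = 1}.
    u = np.sort(v)[::-1]               # sort in descending order
    css = np.cumsum(u)
    ks = np.arange(1, len(v) + 1)
    rho = ks[u - (css - 1.0) / ks > 0][-1]
    theta = (css[rho - 1] - 1.0) / rho
    return np.maximum(v - theta, 0.0)

def project_columns(q):
    # Make every column of q a valid probability distribution.
    return np.stack([project_to_simplex(q[:, j]) for j in range(q.shape[1])],
                    axis=1)

# Hypothetical Q after a gradient step: columns no longer sum to 1.
q = np.array([[0.9, 0.2, -0.1],
              [0.3, 0.7, 0.4],
              [0.1, 0.3, 0.9]])
q_proj = project_columns(q)
```

Unlike applying a softmax to Q after every step, this projection leaves a column unchanged when it is already a valid distribution, so repeated projections do not drift toward the uniform distribution.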

Explaining Variational Autoencoder gaussian parameterization

In the original Auto-Encoding Variational Bayes paper, the authors describe the "reparameterization trick" in section 2.4. The trick is to break up your latent state z into a learnable mean and sigma (learned by the encoder) and add Gaussian noise. You then sample a datapoint from z (basically you generate an encoded image) and let the decoder map the encoded datapoint back to the original image.
I have a hard time getting over how strange this is. Could someone explain the latent variable model a bit more, specifically:
Why are we assuming the latent state is Gaussian?
How is it possible that a Gaussian can generate an image?
And how does backprop coerce the encoder to learn a Gaussian function as opposed to an unknown non-linear function?
Here is an example implementation of the latent model from here in TensorFlow.
...neural net code maps input to hidden layers z_mean and z_log_sigma
self.z_mean, self.z_log_sigma_sq = \
    self._recognition_network(network_weights["weights_recog"],
                              network_weights["biases_recog"])
# Draw one sample z from Gaussian distribution
n_z = self.network_architecture["n_z"]
eps = tf.random_normal((self.batch_size, n_z), 0, 1,
                       dtype=tf.float32)
# z = mu + sigma*epsilon
# (tf.mul was renamed to tf.multiply in TensorFlow 1.0)
self.z = tf.add(self.z_mean,
                tf.multiply(tf.sqrt(tf.exp(self.z_log_sigma_sq)), eps))
...neural net code maps z to output
They are not assuming that the activations of the encoder follow a Gaussian distribution. They are constraining training so that, among the possible solutions, it picks one that resembles a Gaussian.
The image is generated by decoding an activation/feature, and the activations are distributed in a way that resembles a Gaussian.
They achieve this by minimizing the KL divergence between the distribution of the activations and a Gaussian.
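The two pieces can be sketched in NumPy, assuming a diagonal Gaussian encoder and a standard normal prior as in the original paper. The function names are illustrative. The KL term is the closed form for a diagonal Gaussian against N(0, I).

```python
import numpy as np

def sample_z(z_mean, z_log_sigma_sq, rng):
    # Reparameterization: z = mu + sigma * eps, with eps ~ N(0, I).
    # The randomness lives in eps, so z stays differentiable
    # with respect to z_mean and z_log_sigma_sq.
    eps = rng.standard_normal(z_mean.shape)
    return z_mean + np.sqrt(np.exp(z_log_sigma_sq)) * eps

def kl_to_standard_normal(z_mean, z_log_sigma_sq):
    # KL( N(mu, sigma^2) || N(0, I) ), summed over latent dimensions.
    return -0.5 * np.sum(
        1.0 + z_log_sigma_sq - z_mean ** 2 - np.exp(z_log_sigma_sq), axis=-1)

rng = np.random.default_rng(0)
z_mean = np.zeros((4, 2))          # batch of 4, latent size 2
z_log_sigma_sq = np.zeros((4, 2))  # log(sigma^2) = 0, i.e. sigma = 1
z = sample_z(z_mean, z_log_sigma_sq, rng)
kl_at_prior = kl_to_standard_normal(z_mean, z_log_sigma_sq)
kl_shifted = kl_to_standard_normal(z_mean + 1.0, z_log_sigma_sq)
```

The KL term is zero when the encoder outputs exactly the prior, and it grows as the mean or variance moves away from it. That is the pressure which pushes the encoder toward a Gaussian-looking code, while the reconstruction loss keeps the code informative.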