Tensorflow cross entropy loss for sequences with different lengths

I'm building a seq2seq model with LSTMs using TensorFlow. The loss function I'm using is the softmax cross entropy loss. The problem is that my input sequences have different lengths, so I padded them. The output of the model has the shape [max_length, batch_size, vocab_size]. How can I calculate the loss so that the 0-padded values don't affect it? tf.nn.softmax_cross_entropy_with_logits provides an axis parameter, so we can compute the loss on 3-dimensional input, but it doesn't accept weights. tf.losses.softmax_cross_entropy provides a weights parameter, but it expects input of shape [batch_size, nclass(vocab_size)]. Please help!

I think you'd have to write your own loss function. Check out https://danijar.com/variable-sequence-lengths-in-tensorflow/.
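The core idea from that article, as a minimal sketch (assuming batch-major, one-hot targets of shape [batch, time, vocab] and a vector of true sequence lengths):
import tensorflow as tf

def masked_loss(logits, targets, lengths):
    # per-timestep cross entropy, shape [batch, time]
    xent = tf.nn.softmax_cross_entropy_with_logits_v2(labels=targets, logits=logits)
    # 1.0 for real timesteps, 0.0 for padding
    mask = tf.sequence_mask(lengths, maxlen=tf.shape(logits)[1], dtype=tf.float32)
    xent *= mask
    # average over each sequence's valid timesteps, then over the batch
    return tf.reduce_mean(tf.reduce_sum(xent, axis=1) / tf.cast(lengths, tf.float32))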

In this case you need to pad the logits and labels so that they have the same length. So suppose you have a tensor logits of shape (batch_size, length, vocab_size) and a tensor labels of shape (batch_size, length), where length is the sequence length. First, pad them to the same length:
def _pad_tensors_to_same_length(logits, labels):
    """Pad logits and labels so that the results have the same length (second dimension)."""
    with tf.name_scope("pad_to_same_length"):
        logits_length = tf.shape(logits)[1]
        labels_length = tf.shape(labels)[1]
        max_length = tf.maximum(logits_length, labels_length)
        logits = tf.pad(logits, [[0, 0], [0, max_length - logits_length], [0, 0]])
        labels = tf.pad(labels, [[0, 0], [0, max_length - labels_length]])
        return logits, labels
Then you can do the padded cross entropy:
def padded_cross_entropy_loss(logits, labels, vocab_size):
    """Calculate cross entropy loss while ignoring padding.

    Args:
      logits: Tensor of size [batch_size, length_logits, vocab_size]
      labels: Tensor of size [batch_size, length_labels]
      vocab_size: int size of the vocabulary

    Returns:
      The cross entropy losses, weighted so that padding positions contribute zero.
    """
    with tf.name_scope("loss", values=[logits, labels]):
        logits, labels = _pad_tensors_to_same_length(logits, labels)
        # Calculate cross entropy; labels are int ids, so convert them to one-hot
        with tf.name_scope("cross_entropy", values=[logits, labels]):
            targets = tf.one_hot(tf.cast(labels, tf.int32), depth=vocab_size)
            xentropy = tf.nn.softmax_cross_entropy_with_logits_v2(
                logits=logits, labels=targets)
        # zero weight for padding positions (label id 0)
        weights = tf.to_float(tf.not_equal(labels, 0))
        return xentropy * weights
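To get a single scalar from these per-token losses, one option (a sketch, assuming padding id 0 as above) is to normalize by the number of non-padding tokens:
per_token = padded_cross_entropy_loss(logits, labels, vocab_size)
# divide the masked sum by the count of real (non-padding) tokens
loss = tf.reduce_sum(per_token) / tf.reduce_sum(tf.to_float(tf.not_equal(labels, 0)))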

The function below takes two tensors with shapes (batch_size, time_steps, vocab_len). It computes a mask that zeroes out the timesteps corresponding to padding; the mask removes the padding's contribution from the categorical cross entropy.
import numpy as np
import tensorflow as tf
from tensorflow.keras import backend as K

def mask_loss(y_true, y_pred):
    # the padding label is the one-hot vector that has 1 as the first element
    mask_value = np.zeros((vocab_len,))
    mask_value[0] = 1
    # find out which timesteps in `y_true` are not the padding character
    mask = K.equal(y_true, mask_value)
    mask = 1 - K.cast(mask, K.floatx())
    # a non-padding one-hot row differs from mask_value in exactly 2 positions
    mask = K.sum(mask, axis=2) / 2
    # multiplying the loss by the mask; the loss for padding becomes zero
    loss = K.categorical_crossentropy(y_true, y_pred) * mask
    return K.sum(loss) / K.sum(mask)
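A function like this can then be passed directly to Keras, e.g. (a sketch; `model` is assumed to be any Keras model whose outputs have shape (batch_size, time_steps, vocab_len)):
model.compile(optimizer='adam', loss=mask_loss)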

Related

Custom TensorFlow loss function with batch size > 1?

I have a neural network with the following code snippets; note that batch_size == 1 and input_dim == output_dim:
net_in = tf.Variable(tf.zeros(shape=[batch_size, input_dim]), dtype=tf.float32)
input_placeholder = tf.compat.v1.placeholder(shape=[batch_size, input_dim], dtype=tf.float32)
assign_input = net_in.assign(input_placeholder)

# Some matmuls, activations, dropouts, normalizations...
net_out = tf.tanh(output_before_activation)

def loss_fn(output, input):
    # input.shape = output.shape = (batch_size, input_dim)
    output = tf.reshape(output, [input_dim,])  # shape them into 1d vectors
    input = tf.reshape(input, [input_dim,])
    return my_fn_that_only_takes_in_vectors(output, input)

# Create session, preprocess data ...
for epoch in range(epoch_num):
    for batch in range(total_example_num // batch_size):
        sess.run(assign_input, feed_dict={input_placeholder: some_appropriate_numpy_array})
        sess.run(optimizer.minimize(loss_fn(net_out, net_in)))
Currently the neural network above works fine, but it is very slow because it updates the gradient after every sample (batch size = 1). I would like to set batch size > 1, but my_fn_that_only_takes_in_vectors cannot accommodate matrices whose first dimension is not 1. Due to the nature of my custom loss, flattening the batch input into a vector of length (batch_size * input_dim) does not seem to work.
How would I write my new custom loss_fn now that the input and output are N x input_dim where N > 1? In Keras this would not have been an issue because Keras somehow takes the average of the gradients of each example in the batch. For my TensorFlow function, should I take each row as a vector individually, pass them to my_fn_that_only_takes_in_vectors, then take the average of the results?
You can use a function that computes the loss on the whole batch, and works independently on the batch size. Basically the operations are applied to the whole first dimension of the input (the first dimension represents the element number in the batch). Here is an example, I hope this helps to see how the operations are carried out:
def my_loss(y_true, y_pred):
    dx2 = tf.math.squared_difference(y_true[:, 0], y_true[:, 2])  # shape: (BatchSize,)
    dy2 = tf.math.squared_difference(y_true[:, 1], y_true[:, 3])  # shape: (BatchSize,)
    denominator = dx2 + dy2  # shape: (BatchSize,)
    dst_vec = tf.math.squared_difference(y_true, y_pred)  # shape: (BatchSize, n_labels)
    numerator = tf.reduce_sum(dst_vec, axis=-1)  # shape: (BatchSize,)
    # loss_vector contains the loss of each element of the batch
    loss_vector = tf.cast(numerator / denominator, dtype="float32")  # shape: (BatchSize,)
    loss = tf.reduce_sum(loss_vector)  # if you want to sum the losses
    return loss
I am not sure whether you need to return the sum or the average of the losses for the batch.
If you sum, make sure to use a validation dataset with the same batch size, otherwise the loss is not comparable.
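If rewriting the loss in batched form is not possible, another option is to map the per-vector function over the batch and average, which mirrors what Keras does. A sketch reusing the question's my_fn_that_only_takes_in_vectors (assumed to return a scalar per pair):
def batched_loss(output, input_):
    # apply the per-vector loss to each (output row, input row) pair
    per_example = tf.map_fn(
        lambda pair: my_fn_that_only_takes_in_vectors(pair[0], pair[1]),
        (output, input_), dtype=tf.float32)
    # average over the batch
    return tf.reduce_mean(per_example)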

What's the effect of projection layer after n-grams convolution banks?

I'm studying the CBHG module for extracting representations from sequences in Tacotron.
CBHG consists of a 1-D convolution bank, a highway network, and a bidirectional GRU.
The inputs of the 1-D convolutions are produced by a 'lookup_table' that embeds the characters a~z.
After the 1-D convolutions are calculated, their results are concatenated with 'tf.concat', and a projection is applied to the concatenated result.
I think it's associated with word2vec-style embeddings like 'CBOW', but it's very hard for me to see.
What is the effect of the projection layer after the concatenated result of the n-gram convolution bank? Is it extracting the meaningful 'n' in the n-gram conv bank?
Please help me.
with tf.variable_scope('conv_bank'):
    # Convolution bank: concatenate on the last axis
    # to stack channels from all convolutions
    conv_fn = lambda k: \
        conv1d(inputs, k, bank_channel_size,
               tf.nn.relu, is_training, 'conv1d_%d' % k)
    conv_outputs = tf.concat(
        [conv_fn(k) for k in range(1, bank_size + 1)], axis=-1,
    )

    # Maxpooling:
    maxpool_output = tf.layers.max_pooling1d(
        conv_outputs,
        pool_size=maxpool_width,
        strides=1,
        padding='same')

    # Two projection layers:
    proj_out = maxpool_output
    for idx, proj_size in enumerate(proj_sizes):
        activation_fn = None if idx == len(proj_sizes) - 1 else tf.nn.relu
        proj_out = conv1d(
            proj_out, proj_width, proj_size, activation_fn,
            is_training, 'proj_{}'.format(idx + 1))

Understanding Tensor Inputs & Transformations for use in an LSTM (dynamic RNN)

I am building an LSTM-style neural network in TensorFlow and am having some difficulty understanding exactly what input is needed and the subsequent transformations made by tf.nn.dynamic_rnn before it is passed to the sparse_softmax_cross_entropy_with_logits layer.
https://www.tensorflow.org/api_docs/python/tf/nn/dynamic_rnn
Understanding the input
The input function is sending a feature tensor in the form
[batch_size, max_time]
However the manual states that input tensors must be in the form
[batch_size, max_time, ...]
I have therefore expanded the input with a 1d tensor to take the form
[batch_size, max_time, 1]
At this point the input does not break upon running, but I don't understand exactly what we have done here and suspect it may be causing the problems when calculating loss (see below).
Understanding the Transformations
This expanded tensor is then the 'features' tensor used in the code below
LSTM_SIZE = 3
lstm_cell = rnn.BasicLSTMCell(LSTM_SIZE, forget_bias=1.0)
outputs, _ = tf.nn.dynamic_rnn(lstm_cell, features, dtype=tf.float64)

# slice to keep only the last cell of the RNN
outputs = outputs[-1]

# softmax layer
with tf.variable_scope('softmax'):
    W = tf.get_variable('W', [LSTM_SIZE, n_classes], dtype=tf.float64)
    b = tf.get_variable('b', [n_classes], initializer=tf.constant_initializer(0.0), dtype=tf.float64)
    logits = tf.matmul(outputs, W) + b

loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=labels))
This throws a ValueError at loss:
dimensions must be equal, but are [max_time, num_classes] and [batch_size]
from https://www.tensorflow.org/versions/r0.12/api_docs/python/nn/classification -
A common use case is to have logits of shape [batch_size, num_classes] and labels of shape [batch_size]. But higher dimensions are supported.
At some point in the process max_time and batch_size have been mixed up, and I'm uncertain whether it happens at input or during the LSTM. I'm grateful for any advice!
That is because of the shape of the output of the tf.nn.dynamic_rnn. From its documentation https://www.tensorflow.org/api_docs/python/tf/nn/dynamic_rnn:
outputs: The RNN output Tensor.
If time_major == False (default), this will be a Tensor shaped: [batch_size, max_time, cell.output_size].
If time_major == True, this will be a Tensor shaped: [max_time, batch_size, cell.output_size].
You are in the default case, so your outputs has shape [batch_size, max_time, output_size], and when performing outputs[-1] you obtain a tensor with shape [max_time, output_size]. Slicing with outputs[:, -1] should fix it.
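Applied to the question's code, a sketch of the fix:
# keep only the last time step of each sequence in the batch
outputs = outputs[:, -1]            # shape: [batch_size, LSTM_SIZE]
logits = tf.matmul(outputs, W) + b  # shape: [batch_size, n_classes]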

Parameters in tf.contrib.seq2seq.sequence_loss

I'm trying to use the tf.contrib.seq2seq.sequence_loss function in a RNN model to calculate the loss.
According to the API document, this function requires at least three parameters: logits, targets and weights
sequence_loss(
    logits,
    targets,
    weights,
    average_across_timesteps=True,
    average_across_batch=True,
    softmax_loss_function=None,
    name=None
)
logits: A Tensor of shape [batch_size, sequence_length, num_decoder_symbols] and dtype float. The logits correspond to the prediction across all classes at each timestep.
targets: A Tensor of shape [batch_size, sequence_length] and dtype int. The target represents the true class at each timestep.
weights: A Tensor of shape [batch_size, sequence_length] and dtype float. weights constitutes the weighting of each prediction in the sequence. When using weights as masking, set all valid timesteps to 1 and all padded timesteps to 0, e.g. a mask returned by tf.sequence_mask.
average_across_timesteps: If set, sum the cost across the sequence dimension and divide the cost by the total label weight across timesteps.
average_across_batch: If set, sum the cost across the batch dimension and divide the returned cost by the batch size.
softmax_loss_function: Function (labels, logits) -> loss-batch to be used instead of the standard softmax (the default if this is None). Note that to avoid confusion, it is required for the function to accept named arguments.
name: Optional name for this operation, defaults to "sequence_loss".
My understanding is that logits are my predictions after applying Xw+b, so their shape should be [batch_size, sequence_length, output_size]. Then targets should be my labels, but the required shape is [batch_size, sequence_length]. I supposed my labels should have the same shape as the logits.
So how do I convert the 3d labels to 2d? Thanks in advance.
Your targets (labels) don't need to have the same shape as the logits.
If we ignore batch_size (which is not relevant to your question) for a moment, this API simply calculates the loss between two sequences as a weighted sum of the losses of each word. Suppose vocab_size is 5 and we get a target word 3; the logits provide a prediction for this target with a vector like [0.2, 0.1, 0.15, 0.4, 0.15].
To calculate the loss between the target and the prediction, the target does not need to be expanded to the same shape as the prediction, i.e. the one-hot vector [0, 0, 0, 1, 0]; TensorFlow does this internally.
You may refer to the distinction between the two APIs: softmax_cross_entropy_with_logits and sparse_softmax_cross_entropy_with_logits.
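A quick illustration of that distinction (a minimal sketch; both calls compute the same loss value):
import tensorflow as tf

logits = tf.constant([[2.0, 1.0, 0.1]])
# the sparse version takes the class id directly...
sparse = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=tf.constant([0]), logits=logits)
# ...while the dense version takes the equivalent one-hot vector
dense = tf.nn.softmax_cross_entropy_with_logits_v2(
    labels=tf.constant([[1.0, 0.0, 0.0]]), logits=logits)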
Your labels should be a 2d matrix of shape [batch_size, sequence_length], and your logits should be a 3d tensor of shape [batch_size, sequence_length, output_size]. Therefore you don't need to extend your labels' dimensions if they are already of shape [batch_size, sequence_length].
In case you do want to extend the dimensions, you can do it like this: expanded_variable = tf.expand_dims(the_variable_you_wanna_expand, axis=-1).
tf.contrib.seq2seq.sequence_loss is deprecated; use this instead:
import tensorflow as tf
import tensorflow_addons as tfa

tfa.seq2seq.sequence_loss(
    logits: tfa.types.TensorLike,
    targets: tfa.types.TensorLike,
    weights: tfa.types.TensorLike,
    average_across_timesteps: bool = True,
    average_across_batch: bool = True,
    sum_over_timesteps: bool = False,
    sum_over_batch: bool = False,
    softmax_loss_function: Optional[Callable] = None,
    name: Optional[str] = None
) -> tf.Tensor
https://www.tensorflow.org/addons/api_docs/python/tfa/seq2seq/sequence_loss
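For example, a sketch with made-up shapes (a batch of 2 sequences of max length 5 over a vocabulary of 7, true lengths 3 and 5):
import tensorflow as tf
import tensorflow_addons as tfa

logits = tf.random.normal([2, 5, 7])
targets = tf.random.uniform([2, 5], maxval=7, dtype=tf.int32)
# weights act as the mask: 1.0 for valid timesteps, 0.0 for padding
weights = tf.sequence_mask([3, 5], maxlen=5, dtype=tf.float32)
loss = tfa.seq2seq.sequence_loss(logits, targets, weights)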

TensorFlow: How to embed float sequences to fixed size vectors?

I am looking for methods to embed variable length sequences of float values into fixed size vectors. The input is formatted as follows:
[f1,f2,f3,f4]->[f1,f2,f3,f4]->[f1,f2,f3,f4]-> ... -> [f1,f2,f3,f4]
[f1,f2,f3,f4]->[f1,f2,f3,f4]->[f1,f2,f3,f4]->[f1,f2,f3,f4]-> ... -> [f1,f2,f3,f4]
...
[f1,f2,f3,f4]-> ... -> ->[f1,f2,f3,f4]
Each line is a variable length sequence, with max length 60. Each unit in a sequence is a tuple of 4 float values. I have already padded with zeros to fill all sequences to the same length.
The following architecture seems to solve my problem if I use the output as the same as the input: I need the thought vector in the center as the embedding for the sequences.
In TensorFlow, I have found two candidate methods, tf.contrib.legacy_seq2seq.basic_rnn_seq2seq and tf.contrib.legacy_seq2seq.embedding_rnn_seq2seq.
However, these two methods seem to be intended for NLP problems, where the input must be discrete values for words.
So, are there other functions to solve my problem?
All you need is an RNN, not the full seq2seq model, since seq2seq comes with an additional decoder which is unnecessary in your case.
Example code:
import numpy as np
import tensorflow as tf
from tensorflow.contrib import rnn

input_size = 4
max_length = 60
hidden_size = 64
output_size = 4

x = tf.placeholder(tf.float32, shape=[None, max_length, input_size], name='x')
seqlen = tf.placeholder(tf.int64, shape=[None], name='seqlen')

lstm_cell = rnn.BasicLSTMCell(hidden_size, forget_bias=1.0)
outputs, states = tf.nn.dynamic_rnn(cell=lstm_cell, inputs=x, sequence_length=seqlen, dtype=tf.float32)
encoded_states = states[-1]

W = tf.get_variable(
    name='W',
    shape=[hidden_size, output_size],
    dtype=tf.float32,
    initializer=tf.random_normal_initializer())
b = tf.get_variable(
    name='b',
    shape=[output_size],
    dtype=tf.float32,
    initializer=tf.random_normal_initializer())
z = tf.matmul(encoded_states, W) + b
results = tf.sigmoid(z)

###########################
## cost computing and training components go here
# e.g.
# targets = tf.placeholder(tf.float32, shape=[None, input_size], name='targets')
# cost = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(labels=targets, logits=z))
# optimizer = tf.train.AdamOptimizer(learning_rate=0.1).minimize(cost)
###############################

init = tf.global_variables_initializer()

batch_size = 4
data_in = np.zeros((batch_size, max_length, input_size), dtype='float32')
data_in[0, :4, :] = np.random.rand(4, input_size)
data_in[1, :6, :] = np.random.rand(6, input_size)
data_in[2, :20, :] = np.random.rand(20, input_size)
data_in[3, :, :] = np.random.rand(60, input_size)
data_len = np.asarray([4, 6, 20, 60], dtype='int64')

with tf.Session() as sess:
    sess.run(init)
    #########################
    # training process goes here
    #########################
    res = sess.run(results,
                   feed_dict={
                       x: data_in,
                       seqlen: data_len})
    print(res)
To encode a sequence into a fixed length vector you typically use recurrent neural networks (RNNs) or convolutional neural networks (CNNs).
If you use a recurrent neural network you can use the output at the last time step (the last element in your sequence). This corresponds to the thought vector in your question. Have a look at tf.nn.dynamic_rnn. dynamic_rnn requires you to specify the type of RNN cell you want to use. tf.contrib.rnn.LSTMCell and tf.contrib.rnn.GRUCell are most common.
If you want to use CNNs you need to use 1-dimensional convolutions. To build CNNs you need tf.layers.conv1d and tf.layers.max_pooling1d; a sketch of this variant follows.
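A minimal sketch of the CNN variant (assuming the same padded input of shape [batch, max_length=60, features=4] as in the question):
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 60, 4])  # [batch, max_length, features]
conv = tf.layers.conv1d(x, filters=64, kernel_size=3, activation=tf.nn.relu)
# global max pooling over the time axis gives one fixed size vector per sequence
embedding = tf.reduce_max(conv, axis=1)  # shape: [batch, 64]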
I have found a solution to my problem, using the following architecture.
The LSTM layers below encode the series x1, x2, ..., xn. The last output of the encoder is duplicated as many times as the input length and fed to the decoding LSTM layers above. The TensorFlow code is as follows:
series_input = tf.placeholder(tf.float32, [None, conf.max_series, conf.series_feature_num])
print("Encode input Shape", series_input.get_shape())

# encoding layer
encode_cell = tf.contrib.rnn.MultiRNNCell(
    [tf.contrib.rnn.BasicLSTMCell(conf.rnn_hidden_num, reuse=False) for _ in range(conf.rnn_layer_num)]
)
encode_output, _ = tf.nn.dynamic_rnn(encode_cell, series_input, dtype=tf.float32, scope='encode')
print("Encode output Shape", encode_output.get_shape())

# last output
encode_output = tf.transpose(encode_output, [1, 0, 2])
last = tf.gather(encode_output, int(encode_output.get_shape()[0]) - 1)

# duplicate the last output of the encoding layer
decoder_input = tf.stack([last for _ in range(conf.max_series)], axis=1)
print("Decoder input shape", decoder_input.get_shape())

# decoding layer
decode_cell = tf.contrib.rnn.MultiRNNCell(
    [tf.contrib.rnn.BasicLSTMCell(conf.series_feature_num, reuse=False) for _ in range(conf.rnn_layer_num)]
)
decode_output, _ = tf.nn.dynamic_rnn(decode_cell, decoder_input, dtype=tf.float32, scope='decode')
print("Decode output", decode_output.get_shape())

# Loss Function
loss = tf.losses.mean_squared_error(labels=series_input, predictions=decode_output)
print("Loss", loss)