How to convert static_rnn inputs to dynamic_rnn inputs in tensorflow? - tensorflow

I'm having trouble understanding the input parameter for tensorflow's dynamic_rnn. It'd help a lot if I could understand how to convert static_rnn inputs to dynamic_rnn inputs.
For a static_rnn, the input is supposed to be a length T list of tensors whose shapes are [batch_size, input_size], where T is the sequence length. This makes sense to me.
For a dynamic_rnn, the input is supposed to be a Tensor of shape [batch_size, max_time, ...]. I don't understand how to incorporate input_size here. More generally, I don't know what else you could put in the ellipsis.
Say, for example, that my data consists of 50-character-long sentences, so the input_size is the number of letters in the alphabet. For a static_rnn, I'd make a length 50 list of tensors whose shapes are [batch_size, input_size]. How do I convert this list of tensors to a single tensor, so that I can feed it to a dynamic_rnn?

Your dynamic_rnn input should be of shape [batch_size, sequence_length, input_size].
Basically the tensor represents batch_size examples of length sequence_length, and whatever is left in the ellipsis is the shape of a single sequence element.
The thing is, with dynamic_rnn you don't need to know sequence_length beforehand, so your input placeholder could look like
x = tf.placeholder(tf.float32, shape=(batch_size, None, input_size))
This comes in pretty handy. Furthermore, examples in a single batch can have different true lengths (they must still be padded to a common length), and you should pass the sequence_length parameter to dynamic_rnn so it knows where to stop computing for each example.
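To answer the conversion part of the question directly: stacking the length-T list along a new axis 1 gives the [batch_size, max_time, input_size] layout. The sketch below uses NumPy so that it runs without a TensorFlow session; in TF 1.x graph code the corresponding calls would be tf.stack(inputs, axis=1) and tf.unstack(x, axis=1). The sizes are made up for illustration.

```python
import numpy as np

batch_size, seq_len, input_size = 4, 50, 26  # illustrative sizes only

# static_rnn-style input: a length-T list of [batch_size, input_size] arrays
static_inputs = [np.random.rand(batch_size, input_size).astype(np.float32)
                 for _ in range(seq_len)]

# Stack along a new axis 1 so that time becomes the second dimension
dynamic_inputs = np.stack(static_inputs, axis=1)
print(dynamic_inputs.shape)  # (4, 50, 26)

# Going back: slice along axis 1 to recover the per-step arrays
recovered = [dynamic_inputs[:, t, :] for t in range(seq_len)]
```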

Related

In tf.keras.layers.Embedding, why is it important to know the size of the dictionary?

Same as the title: in tf.keras.layers.Embedding, why is it important to know the size of the dictionary as the input dimension?
Because internally, the embedding layer is nothing but a matrix of size vocab_size x embedding_size. It is a simple lookup table: row n of that matrix stores the vector for word n.
So, if you have e.g. 1000 distinct words, your embedding layer needs to know this number in order to store 1000 vectors (as a matrix).
Don't confuse the internal storage of a layer with its input or output shape.
The input shape is (batch_size, sequence_length) where each entry is an integer in the range [0, vocab_size[. For each of these integers the layer will return the corresponding row (which is a vector of size embedding_size) of the internal matrix, so that the output shape becomes (batch_size, sequence_length, embedding_size).
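The lookup described above can be mimicked with plain NumPy indexing. This is only a sketch of the idea, not how Keras implements the layer; the sizes and the random matrix are assumptions for illustration.

```python
import numpy as np

vocab_size, embedding_size = 1000, 8   # illustrative sizes only
batch_size, sequence_length = 2, 5

rng = np.random.default_rng(0)
# Internal storage: one row (vector) per word id
embedding_matrix = rng.normal(size=(vocab_size, embedding_size))

# Input: integer word ids in the range 0 .. vocab_size - 1
word_ids = rng.integers(0, vocab_size, size=(batch_size, sequence_length))

# Lookup: each id is replaced by the corresponding row of the matrix
embedded = embedding_matrix[word_ids]
print(embedded.shape)  # (2, 5, 8)
```

An id equal to or larger than vocab_size would index past the last row, which is why the layer has to know the dictionary size up front.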
In such a setting, the dimensions/shapes of the tensors are the following:
The input tensor has size [batch_size, max_time_steps], and each element of that tensor can take a value in the range 0 to vocab_size-1.
Then, each of the values from the input tensor passes through an embedding layer, which has shape [vocab_size, embedding_size]. The output of the embedding layer is of shape [batch_size, max_time_steps, embedding_size].
Then, in a typical seq2seq scenario, this 3D tensor is the input of a recurrent neural network.
...
Here's how this is implemented in Tensorflow so you can get a better idea:
inputs = tf.placeholder(shape=(batch_size, max_time_steps), ...)
embeddings = tf.Variable(shape=(vocab_size, embedding_size), ...)
inputs_embedded = tf.nn.embedding_lookup(embeddings, inputs)
Now, the output of the embedding lookup table has the [batch_size, max_time_steps, embedding_size] shape.

How to use Keras LSTM with word embeddings to predict word id's

I have problems understanding how to get the correct output when using word embeddings in Keras. My settings are as follows:
My inputs are batches of shape (batch_size, sequence_length). Each row in a batch represents one sentence, and the words are represented by word ids. The sentences are padded with zeros so that all have the same length.
For example, a (3,6) input batch might look like: np.array([[1,3,5,6,0,0],[1,7,4,5,8,0],[1,3,8,2,7,2]])
My targets are given by the input batch shifted by one step, so for each input word I want to predict the next word: np.array([[3,5,6,0,0,0],[7,4,5,8,0,0],[3,8,2,7,2,0]])
I feed such an input batch into the Keras embedding layer. My embedding size is 100, so the output will be a 3D tensor of shape (batch_size, sequence_length, embedding_size). In the little example, that is (3,6,100).
This 3D batch is fed into an LSTM layer.
The output of the LSTM layer is fed into a Dense layer with (sequence_length) output neurons and a softmax activation function, so the output has the same shape as the input, namely (batch_size, sequence_length).
As a loss I am using the categorical crossentropy between the output and target batch.
My question:
The output batch will contain probabilities because of the
softmax activation function. But what I want is the network to predict
integers such that the output fits the target batch of integers.
How can I "decode" the output such that I know which word the network is predicting? Or do I have to construct the network differently?
Edit 1:
I have changed the output and target batches from 2D arrays to 3D tensors. So instead of using a target batch of size (batch_size, sequence_length) with integer id's I am now using a one-hot encoded 3D target tensor (batch_size, sequence_length, vocab_size). To get the same format as an output of the network, I have changed the network to output sequences (by setting return_sequences=True in the LSTM layer). Further, the number of output neurons was changed to vocab_size such that the output layer now produces a batch of size (batch_size, sequence_length, vocab_size).
With this 3D encoding I can get the predicted word id using tf.argmax(outputs, 2). This approach seems to work for the moment but I would still be interested whether it's possible to keep the 2D targets/outputs
One solution, perhaps not the best, is to output one-hot vectors the size of your dictionary (including dummy words).
Your last layer must output (sequence_length, dictionary_size+1).
Your dense layer will already output the sequence_length if you don't add any Flatten() or Reshape() before it, so it should be a Dense(dictionary_size+1).
You can use the function keras.utils.to_categorical() to transform an integer into a one-hot vector and keras.backend.argmax() to transform a one-hot vector into an integer.
Unfortunately, this is sort of unpacking your embedding. It would be nice if it were possible to have a reverse embedding or something like that.
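To make the encode/decode round trip concrete, here is a rough NumPy sketch of what one-hot encoding the targets and taking the argmax of the outputs amounts to. The "probs" array is a made-up stand-in for the softmax output of the network, and vocab_size = 9 is an assumption chosen to cover the ids in the example.

```python
import numpy as np

vocab_size = 9  # assumed: ids 0..8, with 0 used for padding
targets = np.array([[3, 5, 6, 0, 0, 0],
                    [7, 4, 5, 8, 0, 0]])

# Encode: integer ids -> one-hot vectors, shape (batch, seq_len, vocab_size)
one_hot = np.eye(vocab_size, dtype=np.float32)[targets]

# Stand-in for a softmax output: most of the mass on the target word
probs = 0.9 * one_hot + 0.1 / vocab_size

# Decode: the index of the largest probability is the predicted word id
predicted_ids = probs.argmax(axis=-1)
print(predicted_ids)
```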

LSTM Followed by Mean Pooling (TensorFlow)

I am aware that there is a similar topic at LSTM Followed by Mean Pooling, but that is about Keras and I work in pure TensorFlow.
I have an LSTM network where the recurrence is handled by:
outputs, final_state = tf.nn.dynamic_rnn(cell,
                                         embed,
                                         sequence_length=seq_lengths,
                                         initial_state=initial_state)
where I pass the correct sequence lengths for each sample (padding by zeros). In any case, outputs contains irrelevant outputs since some samples produce longer outputs than others, based on sequence lengths.
Right now I'm extracting the last relevant output by means of the following method:
def extract_axis_1(data, ind):
    """
    Get specified elements along the first axis of tensor.
    :param data: Tensorflow tensor that will be subsetted.
    :param ind: Indices to take (one for each element along axis 0 of data).
    :return: Subsetted tensor.
    """
    batch_range = tf.range(tf.shape(data)[0])
    indices = tf.stack([batch_range, ind], axis=1)
    res = tf.reduce_mean(tf.gather_nd(data, indices), axis=0)
    return res
where I pass sequence_length - 1 as indices. In reference to the last topic, I would like to select all relevant outputs followed by average pooling, instead of just the last one.
Now, I tried passing nested lists as indices to extract_axis_1, but tf.stack does not accept this.
Any solution directions for this?
You can exploit the weights parameter of the tf.contrib.seq2seq.sequence_loss function.
From the documentation:
weights: A Tensor of shape [batch_size, sequence_length] and dtype float. weights constitutes the weighting of each prediction in the sequence. When using weights as masking, set all valid timesteps to 1 and all padded timesteps to 0, e.g. a mask returned by tf.sequence_mask.
You need to compute a binary mask that distinguishes your valid outputs from the invalid ones. Then you can just provide this mask to the weights parameter of the loss function (probably, you will want to use a loss like this one); the function will not consider the outputs with a 0 weight in the computation of the loss.
If you can't/don't need to use a sequence loss, you can do exactly the same thing manually. You compute a binary mask, multiply your outputs by this mask, and provide the result as input to your fully connected layer.
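For the mean pooling asked about in the question, the manual masking route could look roughly like the sketch below. It is written in NumPy so it runs standalone; in TF 1.x the mask could come from tf.sequence_mask(seq_lengths, max_time, dtype=tf.float32) and the sum from tf.reduce_sum. The array contents and sizes are made up.

```python
import numpy as np

batch_size, max_time, hidden_size = 3, 4, 2  # illustrative sizes only
outputs = np.arange(batch_size * max_time * hidden_size,
                    dtype=np.float32).reshape(batch_size, max_time, hidden_size)
seq_lengths = np.array([4, 2, 3])

# Binary mask of shape [batch, max_time]: 1 for valid steps, 0 for padding
mask = (np.arange(max_time)[None, :] < seq_lengths[:, None]).astype(np.float32)

# Zero out padded steps, sum over time, divide by the true length
summed = (outputs * mask[:, :, None]).sum(axis=1)
mean_pooled = summed / seq_lengths[:, None]
print(mean_pooled.shape)  # (3, 2)
```

Dividing by the true length rather than by max_time is what keeps the padded steps from dragging the average toward zero.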

understanding tensorflow sequence_loss parameters

The source code of sequence_loss has three required parameters, listed as outputs, targets, and weights.
Outputs and targets are self-explanatory, but what I'd like to understand better is: what is the weights parameter?
The other thing I find confusing is that it states that the targets should be the same length as the outputs. What exactly do they mean by the length of a tensor, especially if it's a 3-dimensional tensor?
Think of the weights as a mask applied to the input tensor. In NLP applications, sentences often have different lengths. To batch multiple sentences into a minibatch to feed into a neural net, people use a mask matrix to denote which elements in the input tensor are actually valid input. For instance, the weights can be np.ones([batch, max_length]), which means all of the input elements are valid.
We can also use a matrix of the same shape as the labels, such as np.asarray([[1,1,1,0],[1,1,0,0],[1,1,1,1]]) (assuming the labels have shape 3x4); then the cross entropy of the first row, last column will be masked out as 0 (as will the last two columns of the second row).
You can also use the weights to compute a weighted accumulation of cross entropy.
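As a rough illustration of the masking idea, here is a NumPy sketch of a per-timestep cross entropy multiplied by a weights matrix. The function name masked_sequence_loss is hypothetical, and dividing by the sum of the weights is one common normalization that I am assuming here, not necessarily what sequence_loss does with its default arguments.

```python
import numpy as np

def masked_sequence_loss(logits, targets, weights):
    """logits: [batch, time, vocab]; targets: [batch, time] ints; weights: [batch, time]."""
    # Numerically stable log-softmax over the vocabulary axis
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick the log-probability of the target class at every position
    b, t = np.indices(targets.shape)
    crossent = -log_probs[b, t, targets]
    # Positions with weight 0 contribute nothing to the loss
    return (crossent * weights).sum() / weights.sum()

rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 4, 5))
targets = rng.integers(0, 5, size=(3, 4))
weights = np.asarray([[1, 1, 1, 0], [1, 1, 0, 0], [1, 1, 1, 1]], dtype=np.float64)
loss = masked_sequence_loss(logits, targets, weights)
```

Changing the logits at a position whose weight is 0 leaves the loss untouched, which is the whole point of the mask.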
We used this in a class and our professor said we could just pass it ones of the right shape (the comment says "list of 1D batch-sized float-Tensors of the same length as logits"). That doesn't help with what they mean, but maybe it will help you get your code to run. Worked for me.
This code should do the trick: [tf.ones(batch_size, tf.float32) for _ in logits].
Edit: from TF code:
for logit, target, weight in zip(logits, targets, weights):
    if softmax_loss_function is None:
        # TODO(irving,ebrevdo): This reshape is needed because
        # sequence_loss_by_example is called with scalars sometimes, which
        # violates our general scalar strictness policy.
        target = array_ops.reshape(target, [-1])
        crossent = nn_ops.sparse_softmax_cross_entropy_with_logits(
            logit, target)
    else:
        crossent = softmax_loss_function(logit, target)
    log_perp_list.append(crossent * weight)
The weights that are passed are multiplied by the loss for that particular logit. So I guess if you want to take a particular prediction extra-seriously you can increase the weight above 1.

How to create a recurrent neural network in tensor flow for variable sequence length?

I am trying to create a recurrent neural network in TensorFlow. The input to the network is a sequence of vectors. The sequence length is different for all the inputs. I want to do this with a batch of inputs.
Can anyone help me on how exactly to do this? I have gone through the tutorials on the tensorflow site, but it is still not clear to me.
You can use the rnn function defined here
One of the arguments it takes is sequence_length
sequence_length: Specifies the length of each sequence in inputs.
An int32 or int64 vector (tensor) size [batch_size]. Values in [0, T).
Here is how to implement the complete loop:
# x, state, sequence_lengths are placeholders
outputs, final_state = tf.nn.rnn(lstm_cell, x, state, sequence_length=sequence_lengths)
# add softmax layer, define loss, training method, etc
...
# code for one epoch
iterations = total_data_length // batch_size
max_sequence_length = max(all_possible_sequence_lengths)
cur_state = initial_state
for i in range(iterations):
    # x is of dimension [max_sequence_length, batch_size, input_size]
    # sequence_lengths is of dimension [batch_size]
    x_data, sequence_data, y_data = mini_batch(batch_size)
    # x is a Python list of placeholders, so feed each one separately
    feed_dict = {k: v for k, v in zip(x, x_data)}
    feed_dict.update({sequence_lengths: sequence_data})  # ... and the remaining placeholders
    outs, cur_state, _ = session.run([outputs, final_state, train], feed_dict)
This method was a bit confusing to me for a couple of reasons:
The input shape is [sequence_length, batch_size, input_size] and not [batch_size, sequence_length, input_size]. However, this makes sense if you go through the code and see how rnn() is implemented. It also means you need to reshape your outputs (which have the same layout as the inputs) in order to pass them to matmul and then softmax, for example.
The inputs parameter of rnn() is a Python list, and you cannot pass it in feed_dict as {x: x_data}; you will get an error saying "Not able to hash type: 'list'". Instead, see how I used a comprehension in the code above.
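The two shape points above can be sketched in NumPy: first turning a batch-major array into the time-major list that the list-based rnn() expects, then flattening a list of per-step outputs so that a single matmul can be applied. The random "outputs" and the weight matrix W are stand-ins I made up; no actual RNN is run here.

```python
import numpy as np

max_time, batch_size, input_size = 5, 3, 4   # illustrative sizes only
hidden_size, num_classes = 6, 2

batch_major = np.random.rand(batch_size, max_time, input_size)

# Time-major list: T arrays of shape [batch_size, input_size]
time_major_list = list(np.transpose(batch_major, (1, 0, 2)))

# Stand-in for the rnn outputs: T arrays of shape [batch_size, hidden_size]
outputs = [np.random.rand(batch_size, hidden_size) for _ in range(max_time)]

# Flatten to 2-D so one matmul covers every time step, then restore the layout
flat = np.concatenate(outputs, axis=0)            # [T * batch, hidden]
W = np.random.rand(hidden_size, num_classes)      # hypothetical projection
logits = (flat @ W).reshape(max_time, batch_size, num_classes)
print(logits.shape)  # (5, 3, 2)
```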
It depends on your data set, but you can do:
Use the max length in your dataset, or
Use a reasonable size (e.g., 256) and split any input longer than that into chunks of that size.
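Either option boils down to padding every sequence to a fixed length and remembering the true lengths for sequence_length. A possible sketch in NumPy follows; pad_batch and split_long are hypothetical helper names, not TensorFlow functions.

```python
import numpy as np

def pad_batch(sequences, max_len, input_size):
    """Zero-pad (or truncate) each [length, input_size] array to max_len steps."""
    batch = np.zeros((len(sequences), max_len, input_size), dtype=np.float32)
    lengths = np.zeros(len(sequences), dtype=np.int32)
    for i, seq in enumerate(sequences):
        n = min(len(seq), max_len)
        batch[i, :n] = seq[:n]
        lengths[i] = n
    return batch, lengths

def split_long(seq, max_len):
    """Cut one long sequence into consecutive chunks of at most max_len steps."""
    return [seq[i:i + max_len] for i in range(0, len(seq), max_len)]

# Option 1: pad to the longest sequence in the data set
sequences = [np.ones((2, 4)), np.ones((5, 4)), np.ones((3, 4))]
batch, lengths = pad_batch(sequences, max_len=5, input_size=4)

# Option 2: pick a fixed size and split anything longer
chunks = split_long(np.ones((7, 4)), max_len=3)
```

The lengths array is what you would feed as sequence_length, so the padded steps are ignored by the RNN.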