RNN and LSTM implementation in tensorflow - tensorflow

I have been trying to learn how to code up an RNN and an LSTM in TensorFlow. I found an example in this blog post:
http://r2rt.com/recurrent-neural-networks-in-tensorflow-ii.html
Below are the snippets I am having trouble understanding, for an LSTM network to be used eventually for char-rnn generation:
x = tf.placeholder(tf.int32, [batch_size, num_steps], name='input_placeholder')
y = tf.placeholder(tf.int32, [batch_size, num_steps], name='labels_placeholder')
embeddings = tf.get_variable('embedding_matrix', [num_classes, state_size])
rnn_inputs = [tf.squeeze(i) for i in
              tf.split(1, num_steps, tf.nn.embedding_lookup(embeddings, x))]
A different section of the code, where the weights are defined:
with tf.variable_scope('softmax'):
    W = tf.get_variable('W', [state_size, num_classes])
    b = tf.get_variable('b', [num_classes], initializer=tf.constant_initializer(0.0))
logits = [tf.matmul(rnn_output, W) + b for rnn_output in rnn_outputs]
y_as_list = [tf.squeeze(i, squeeze_dims=[1]) for i in tf.split(1, num_steps, y)]
x is the data to be fed, and y is the set of labels. In the LSTM equations we have a series of gates: x(t) gets multiplied by a set of weights, prev_hidden_state gets multiplied by another set of weights, biases are added, and non-linearities are applied.
Here are the doubts I have:
In this case only one weight matrix is defined. Does that mean it works for both x(t) and prev_hidden_state as well?
For the embeddings matrix I know it has to be multiplied by the weight matrix, but why is its first dimension num_classes?
For the rnn_inputs we are using squeeze, which removes dimensions of 1, but why would I want to do that in a one-hot encoding?
Also, from the splits I understand that we are unrolling the x of dimension (batch_size x num_steps) into discrete (batch_size x 1) vectors and then passing these values through the network. Is this right?

Let me try to help with each of these.
In this case only one weight matrix is defined. Does that mean it works for both x(t) and prev_hidden_state as well?
There are more weights, which come from calling tf.nn.rnn_cell.LSTMCell. They are the internal weights of the RNN cell, which TensorFlow creates implicitly when you call the cell.
The weight matrix you explicitly defined is the transform from the hidden state to the vocabulary space.
You can view the implicit weights as accounting for the recurrent part: they take the previous hidden state and the current input and produce the new hidden state. The weight matrix you defined transforms the hidden state (e.g. state_size = 200) to the larger vocabulary space (e.g. vocab_size = 2000).
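If it helps, here is a minimal sketch (TF 1.x API, with hypothetical sizes) that builds a single LSTM cell and prints the variables TensorFlow creates implicitly; the kernel shape shows that the current input and the previous hidden state are handled by one concatenated weight matrix:
import tensorflow as tf

state_size, batch_size = 200, 32
cell = tf.nn.rnn_cell.LSTMCell(state_size)
inputs = tf.zeros([batch_size, state_size])      # one embedded time step
state = cell.zero_state(batch_size, tf.float32)
output, state = cell(inputs, state)              # calling the cell creates its variables

for v in tf.trainable_variables():
    print(v.name, v.shape)
# Expect something along the lines of:
#   .../lstm_cell/kernel:0  (400, 800)  -> [input_size + state_size, 4 * state_size]
#   .../lstm_cell/bias:0    (800,)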
For further information, maybe you can view this tutorial : http://colah.github.io/posts/2015-08-Understanding-LSTMs/
For the embeddings matrix I know it has to be multiplied by the weight matrix, but why is its first dimension num_classes?
num_classes is the vocab_size; the embedding matrix transforms the vocabulary to the required embedding size (which in this example is equal to state_size).
For the rnn_inputs we are using squeeze, which removes dimensions of 1, but why would I want to do that in a one-hot encoding?
You need to get rid of the extra dimension because tf.nn.rnn takes inputs as (batch_size, input_size) instead of (batch_size, 1, input_size).
Also, from the splits I understand that we are unrolling the x of dimension (batch_size x num_steps) into discrete (batch_size x 1) vectors and then passing these values through the network. Is this right?
To be more precise: after embedding, the (batch_size, num_steps, state_size) tensor turns into a list of num_steps elements, each of size (batch_size, 1, state_size).
The flow goes like this:
1. The embedding matrix embeds each word as a state_size-dimensional vector (a row of the matrix), so its size is (vocab_size, state_size).
2. Retrieve the rows at the indices specified by the x placeholder to get the RNN input, which has size (batch_size, num_steps, state_size).
3. tf.split splits the input into num_steps tensors of shape (batch_size, 1, state_size).
4. tf.squeeze squeezes each of them to (batch_size, state_size), forming the desired input format for tf.nn.rnn.
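To make those four steps concrete, here is a minimal sketch with hypothetical sizes; note that it uses the post-1.0 tf.split argument order (value, num, axis), whereas the question's snippet uses the older (axis, num, value) order:
import tensorflow as tf

batch_size, num_steps, num_classes, state_size = 32, 10, 2000, 200
x = tf.placeholder(tf.int32, [batch_size, num_steps])
embeddings = tf.get_variable('embedding_matrix', [num_classes, state_size])

embedded = tf.nn.embedding_lookup(embeddings, x)        # (32, 10, 200)
chunks = tf.split(embedded, num_steps, axis=1)          # 10 tensors of (32, 1, 200)
rnn_inputs = [tf.squeeze(c, axis=[1]) for c in chunks]  # 10 tensors of (32, 200)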
If any of the TensorFlow methods are unclear, you can look them up in the TensorFlow API documentation for a more detailed introduction.

Related

Bahdanau's attention in Neural machine translation with attention

I am trying to understand Bahdanau's attention using the following tutorial:
https://www.tensorflow.org/tutorials/text/nmt_with_attention
The calculation is the following:
self.attention_units = attention_units
self.W1 = Dense(self.attention_units)
self.W2 = Dense(self.attention_units)
self.V = Dense(1)
score = self.V(tf.nn.tanh(self.W1(last_inp_dec) + self.W2(input_enc)))
I have two problems:
I cannot understand why the shape of tf.nn.tanh(self.W1(last_inp_dec) + self.W2(input_enc)) is (batch_size, max_len, attention_units).
Using the rules of matrix multiplication I got the following results:
a) Shape of self.W1(last_inp_dec) -> (1, hidden_units_dec) * (hidden_units_dec, attention_units) = (1, attention_units)
b) Shape of self.W2(input_enc) -> (max_len, hidden_units_enc) * (hidden_units_enc, attention_units) = (max_len, attention_units)
Then we add up the a) and b) quantities. How do we end up with dimensionality (max_len, attention_units) or (batch_size, max_len, attention_units)? How can we do addition when one shape has 1 where the other has max_len?
Why do we multiply tf.nn.tanh(self.W1(last_inp_dec) + self.W2(input_enc)) by self.V? Because we want the alphas as scalars?
I cannot understand why the shape of tf.nn.tanh(self.W1(last_inp_dec) + self.W2(input_enc)) is (batch_size, max_len, attention_units).
From the comments section of the code in class BahdanauAttention
query_with_time_axis shape = (batch_size, 1, hidden size)
Note that the dimension 1 was added using tf.expand_dims to make the shape compatible with values for the addition. The added dimension of 1 gets broadcast during the addition operation. Otherwise the incoming shape would have been (batch_size, hidden size), which is not compatible.
values shape = (batch_size, max_len, hidden size)
Addition of the query_with_time_axis shape and values shape gives us a shape of (batch_size, max_len, hidden size)
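A minimal sketch of that broadcast, with hypothetical sizes:
import tensorflow as tf

batch_size, max_len, units = 4, 19, 10
query_with_time_axis = tf.random.normal([batch_size, 1, units])  # after tf.expand_dims
values_proj = tf.random.normal([batch_size, max_len, units])     # after self.W2

# The size-1 time axis broadcasts against max_len during the addition
score_input = tf.nn.tanh(query_with_time_axis + values_proj)
print(score_input.shape)  # (4, 19, 10)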
Why do we multiply tf.nn.tanh(self.W1(last_inp_dec) + self.W2(input_enc)) by self.V? Because we want the alphas as scalars?
self.V is the final layer, the output of which gives us the score. The random weight initialization of the self.V layer is handled by Keras behind the scenes in the line self.V = tf.keras.layers.Dense(1).
We are not multiplying tf.nn.tanh(self.W1(last_inp_dec) + self.W2(input_enc)) by self.V.
The construct self.V(tf.nn.tanh(self.W1(last_inp_dec) + self.W2(input_enc))) means that the tanh activations resulting from the operation tf.nn.tanh(self.W1(last_inp_dec) + self.W2(input_enc)) form the input matrix to the single-output Dense layer represented by self.V.
The shapes are slightly different from the ones you have given. It is perhaps best understood with a direct example.
Assuming 10 units in the alignment layer and 128 embedding dimensions on the decoder and 256 dimensions on the encoder and 19 timesteps, then:
last_inp_dec and input_enc shapes would be (?, 128) and (?, 19, 256). We now need to expand last_inp_dec over the time axis to make it (?, 1, 128) so that addition is possible.
The layer kernels for W1, W2, and V will be (128, 10), (256, 10), and (10, 1) respectively (Dense weights carry no batch dimension). Notice how self.W1(last_inp_dec) works out to (?, 1, 10). This is added to each of the self.W2(input_enc) rows to give a shape of (?, 19, 10). The result is fed to self.V, and the output is (?, 19, 1), which is the shape we want: a set of 19 weights. Softmaxing this gives the attention weights.
Multiplying this attention weight with each encoder hidden state and summing up returns the context.
To your question as to why 'V' is needed: it is needed because Bahdanau provides the option of using 'n' units in the alignment layer (to determine W1 and W2), and we need one more layer on top to massage the tensor back to the shape we want, a set of attention weights, one for each time step.
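A sketch tracing those shapes end to end (the batch size of 4 is hypothetical; the layer sizes follow the example above):
import tensorflow as tf

W1 = tf.keras.layers.Dense(10)
W2 = tf.keras.layers.Dense(10)
V = tf.keras.layers.Dense(1)

last_inp_dec = tf.random.normal([4, 128])   # (batch, decoder embedding dim)
input_enc = tf.random.normal([4, 19, 256])  # (batch, max_len, encoder dim)

query = tf.expand_dims(last_inp_dec, 1)                         # (4, 1, 128)
score = V(tf.nn.tanh(W1(query) + W2(input_enc)))                # (4, 19, 1)
attention_weights = tf.nn.softmax(score, axis=1)                # (4, 19, 1)
context = tf.reduce_sum(attention_weights * input_enc, axis=1)  # (4, 256)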
I just posted an answer at Understanding Bahdanau's Attention Linear Algebra with all the shapes of the tensors and weights involved.

In tf.keras.layers.Embedding, why is it important to know the size of the dictionary?

Same as the title: in tf.keras.layers.Embedding, why is it important to know the size of the dictionary as the input dimension?
Because internally, the embedding layer is nothing but a matrix of size vocab_size x embedding_size. It is a simple lookup table: row n of that matrix stores the vector for word n.
So, if you have e.g. 1000 distinct words, your embedding layer needs to know this number in order to store 1000 vectors (as a matrix).
Don't confuse the internal storage of a layer with its input or output shape.
The input shape is (batch_size, sequence_length) where each entry is an integer in the range [0, vocab_size). For each of these integers the layer will return the corresponding row (which is a vector of size embedding_size) of the internal matrix, so that the output shape becomes (batch_size, sequence_length, embedding_size).
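A minimal sketch of those shapes (the sizes are hypothetical):
import tensorflow as tf

layer = tf.keras.layers.Embedding(input_dim=1000, output_dim=64)  # vocab_size = 1000
ids = tf.constant([[3, 17, 998],
                   [0, 1, 2]])  # (batch_size=2, sequence_length=3), values in [0, 1000)
out = layer(ids)
print(out.shape)  # (2, 3, 64)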
In such a setting, the dimensions/shapes of the tensors are the following:
The input tensor has size [batch_size, max_time_steps] such that each element of that tensor can have a value in the range 0 to vocab_size-1.
Then, each of the values from the input tensor passes through an embedding layer that has shape [vocab_size, embedding_size]. The output of the embedding layer is of shape [batch_size, max_time_steps, embedding_size].
Then, in a typical seq2seq scenario, this 3D tensor is the input of a recurrent neural network.
...
Here's how this is implemented in Tensorflow so you can get a better idea:
inputs = tf.placeholder(shape=(batch_size, max_time_steps), ...)
embeddings = tf.Variable(shape=(vocab_size, embedding_size), ...)
inputs_embedded = tf.nn.embedding_lookup(embeddings, inputs)
Now, the output of the embedding lookup table has the [batch_size, max_time_steps, embedding_size] shape.

States in the tensorflow static rnn

I'm trying to work with an RNN using TensorFlow. I use the following function from this repo:
def RNN(x, weights, biases):
    # Prepare data shape to match `rnn` function requirements
    # Current data input shape: (batch_size, timesteps, n_input)
    # Required shape: 'timesteps' tensors list of shape (batch_size, n_input)
    # Unstack to get a list of 'timesteps' tensors of shape (batch_size, n_input)
    x = tf.unstack(x, timesteps, 1)
    # Define a lstm cell with tensorflow
    lstm_cell = rnn.BasicLSTMCell(num_hidden, forget_bias=1.0)
    # Get lstm cell output
    outputs, states = rnn.static_rnn(lstm_cell, x, dtype=tf.float32)
    # Linear activation, using rnn inner loop last output
    return tf.matmul(outputs[-1], weights['out']) + biases['out']
I understand that outputs is a list containing the intermediate outputs of the unrolled network. I can verify that len(outputs) equals timesteps. However, I wonder why len(states) equals 2. I think it should contain only the final state of the network.
Could you please help explain?
Thanks.
To confirm the discussion in the comments: when constructing a static RNN using BasicLSTMCell, state is a two-tuple of (c, h), where c is the final cell state and h is the final hidden state. The final hidden state h is in fact equal to the final output in outputs. You can corroborate this by reading the source code (see BasicLSTMCell's call method).
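A minimal sketch (TF 1.x contrib API, as in the question, with hypothetical sizes) that checks both claims: len(states) is 2 because it is an LSTMStateTuple (c, h), and h matches outputs[-1]:
import numpy as np
import tensorflow as tf
from tensorflow.contrib import rnn

batch_size, timesteps, n_input, num_hidden = 8, 5, 3, 4
x = tf.unstack(tf.random_normal([batch_size, timesteps, n_input]), timesteps, 1)
lstm_cell = rnn.BasicLSTMCell(num_hidden, forget_bias=1.0)
outputs, states = rnn.static_rnn(lstm_cell, x, dtype=tf.float32)

print(len(states))  # 2: the (c, h) pair
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    h, last_out = sess.run([states.h, outputs[-1]])
    print(np.allclose(h, last_out))  # True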

Tensorflow reduce dimensions of rank 3 tensor

I am trying to build the CLDNN described in the paper here.
After the convolutional layers, the features go through a dimensionality-reduction layer. At the point where the features leave the conv layers, the dimensions are [?, N, M]. N represents the number of windows, and I think the network requires a reduction in the dimension M, so the dimensions of the features after the dim-reduction layer are [?, N, Q], where Q < M.
I have two questions.
How do I do this in TensorFlow? I tried using a weight with
W = tf.Variable( tf.truncated_normal([M,Q],stddev=0.1) )
I thought tf.matmul(x, W) would yield [?, N, Q], but [?, N, M] and [M, Q] are not valid shapes for matrix multiplication. I would like to keep N constant and reduce the dimension of M.
What kind of non-linearity should I apply to the outcome of tf.matmul(x,W)? I was thinking about using a ReLU but I couldn't even get #1 done.
According to the linked paper (T. N. Sainath et al.: "Convolutional, Long Short-Term Memory, Fully Connected Deep Neural Networks"),
[...] reducing the dimensionality, such that we have 256 outputs from the linear layer, was appropriate.
That means, whatever the input size is, i.e. [?, N, M] or any other dimensionality (always assuming that the first dimension is the number of samples in a mini-batch, denoted by ?), the output will be [?, Q], where typically Q=256.
As we are doing dimensionality reduction by multiplying the input with a weight matrix, no spatial information will be preserved. This means that it doesn't matter whether each input is a matrix or a vector, so we can reshape the input to the linear layer x to have the dimensions [?, N*M]. Then, we can use a simple matrix multiplication tf.matmul(x, W), where W is a matrix with dimensions [N*M, Q].
W = tf.Variable(tf.truncated_normal([N*M, Q], stddev=0.1))
x_vec = tf.reshape(x, shape=(-1, N*M))
y = tf.matmul(x_vec, W)
Finally, regarding question 2: in the paper, the dimensionality reduction layer is a linear layer, i.e. you do not apply a non-linearity to the output.
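As an aside, if you did want to keep N and only reduce M to Q as in the original question (which is not what the paper does), one way is tf.tensordot, which applies the same [M, Q] weight matrix to each of the N rows; a minimal sketch with hypothetical sizes:
import tensorflow as tf

N, M, Q = 11, 40, 16  # hypothetical sizes
x = tf.placeholder(tf.float32, [None, N, M])
W = tf.Variable(tf.truncated_normal([M, Q], stddev=0.1))

y = tf.tensordot(x, W, axes=[[2], [0]])  # contracts the M axis, giving [?, N, Q]
y.set_shape([None, N, Q])                # tensordot can lose static shape info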

Using Tensorflow's Connectionist Temporal Classification (CTC) implementation

I'm trying to use TensorFlow's CTC implementation under the contrib package (tf.contrib.ctc.ctc_loss), without success.
First of all, anyone know where can I read a good step-by-step tutorial? Tensorflow's documentation is very poor on this topic.
Do I have to provide the labels to ctc_loss with the blank label interleaved or not?
I have not been able to overfit my network, even using a training dataset of length 1 over 200 epochs. :(
How can I calculate the label error rate using tf.edit_distance?
Here is my code:
with graph.as_default():
    max_length = X_train.shape[1]
    frame_size = X_train.shape[2]
    max_target_length = y_train.shape[1]

    # Batch size x time steps x data width
    data = tf.placeholder(tf.float32, [None, max_length, frame_size])
    data_length = tf.placeholder(tf.int32, [None])

    # Batch size x max_target_length
    target_dense = tf.placeholder(tf.int32, [None, max_target_length])
    target_length = tf.placeholder(tf.int32, [None])

    # Generating sparse tensor representation of target
    target = ctc_label_dense_to_sparse(target_dense, target_length)

    # Applying LSTM, returning output for each timestep (y_rnn1,
    # [batch_size, max_time, cell.output_size]) and the final state of shape
    # [batch_size, cell.state_size]
    y_rnn1, h_rnn1 = tf.nn.dynamic_rnn(
        tf.nn.rnn_cell.LSTMCell(num_hidden, state_is_tuple=True, num_proj=num_classes),
        data,
        dtype=tf.float32,
        sequence_length=data_length,
    )

    # For sequence labelling, we want a prediction for each timestep.
    # However, we share the weights for the softmax layer across all timesteps.
    # How do we do that? By flattening the first two dimensions of the output tensor.
    # This way time steps look the same as examples in the batch to the weight matrix.
    # Afterwards, we reshape back to the desired shape.

    # Reshaping
    logits = tf.transpose(y_rnn1, perm=(1, 0, 2))

    # Get the loss by calculating ctc_loss. Also calculates the gradient.
    # This class performs the softmax operation for you, so inputs should be
    # e.g. linear projections of outputs by an LSTM.
    loss = tf.reduce_mean(tf.contrib.ctc.ctc_loss(logits, target, data_length))

    # Define our optimizer with learning rate
    optimizer = tf.train.RMSPropOptimizer(learning_rate).minimize(loss)

    # Decoding using beam search
    decoded, log_probabilities = tf.contrib.ctc.ctc_beam_search_decoder(
        logits, data_length, beam_width=10, top_paths=1)
Thanks!
Update (06/29/2016)
Thank you, @jihyeon-seo! So, at the input of the RNN we have something like [num_batch, max_time_step, num_features]. We use dynamic_rnn to perform the recurrent calculations given the input, outputting a tensor of shape [num_batch, max_time_step, num_hidden]. After that, we need to do an affine projection in each timestep with weight sharing, so we have to reshape to [num_batch*max_time_step, num_hidden], multiply by a weight matrix of shape [num_hidden, num_classes], add a bias, undo the reshape, and transpose (so we will have [max_time_steps, num_batch, num_classes] for the ctc loss input), and this result will be the input of the ctc_loss function. Did I do everything correctly?
This is the code:
cell = tf.nn.rnn_cell.MultiRNNCell([cell] * num_layers, state_is_tuple=True)
h_rnn1, self.last_state = tf.nn.dynamic_rnn(cell, self.input_data, self.sequence_length, dtype=tf.float32)
# Reshaping to share weights across timesteps
x_fc1 = tf.reshape(h_rnn1, [-1, num_hidden])
self._logits = tf.matmul(x_fc1, self._W_fc1) + self._b_fc1
# Reshaping
self._logits = tf.reshape(self._logits, [max_length, -1, num_classes])
# Calculating loss
loss = tf.contrib.ctc.ctc_loss(self._logits, self._targets, self.sequence_length)
self.cost = tf.reduce_mean(loss)
Update (07/11/2016)
Thank you, @Xiv. Here is the code after the bug fix:
cell = tf.nn.rnn_cell.MultiRNNCell([cell] * num_layers, state_is_tuple=True)
h_rnn1, self.last_state = tf.nn.dynamic_rnn(cell, self.input_data, self.sequence_length, dtype=tf.float32)
# Reshaping to share weights across timesteps
x_fc1 = tf.reshape(h_rnn1, [-1, num_hidden])
self._logits = tf.matmul(x_fc1, self._W_fc1) + self._b_fc1
# Reshaping
self._logits = tf.reshape(self._logits, [-1, max_length, num_classes])
self._logits = tf.transpose(self._logits, (1,0,2))
# Calculating loss
loss = tf.contrib.ctc.ctc_loss(self._logits, self._targets, self.sequence_length)
self.cost = tf.reduce_mean(loss)
Update (07/25/2016)
I published part of my code on GitHub, working with one utterance. Feel free to use it! :)
I'm trying to do the same thing.
Here's what I found that may interest you.
It was really hard to find a tutorial for CTC, but this example was helpful.
As for the blank label, the CTC layer assumes that the blank index is num_classes - 1, so you need to provide an additional class for the blank label.
Also, the CTC loss performs the softmax internally. In your code, the RNN layer is connected directly to the CTC loss layer. The output of the RNN layer is internally activated, so you need to add one more hidden layer (it could be the output layer) without an activation function, then add the CTC loss layer; see the sketch below.
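A minimal sketch of that projection (the sizes are hypothetical; the blank is the extra last class):
import tensorflow as tf

num_hidden, num_labels = 100, 61   # hypothetical sizes
num_classes = num_labels + 1       # one extra class: blank = num_classes - 1

rnn_out = tf.placeholder(tf.float32, [None, None, num_hidden])  # (batch, time, hidden)
W = tf.Variable(tf.truncated_normal([num_hidden, num_classes], stddev=0.1))
b = tf.Variable(tf.zeros([num_classes]))

flat = tf.reshape(rnn_out, [-1, num_hidden])  # merge batch and time dimensions
logits = tf.matmul(flat, W) + b               # plain linear projection: no activation
# ...then reshape back and transpose to (time, batch, num_classes) before ctc_loss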
See here for an example with bidirectional LSTM, CTC, and edit distance implementations, training a phoneme recognition model on the TIMIT corpus. If you train on that corpus's training set, you should be able to get phoneme error rates down to 20-25% after 120 epochs or so.