Difference between bidirectional_dynamic_rnn and stack_bidirectional_dynamic_rnn in TensorFlow

I am building a dynamic RNN by stacking multiple LSTMs. I see there are two options:
# cells_fw and cells_bw are lists of cells, e.g. LSTM cells
stacked_cell_fw = tf.contrib.rnn.MultiRNNCell(cells_fw)
stacked_cell_bw = tf.contrib.rnn.MultiRNNCell(cells_bw)
output = tf.nn.bidirectional_dynamic_rnn(
    stacked_cell_fw, stacked_cell_bw, INPUT,
    sequence_length=LENGTHS, dtype=tf.float32)
vs.
output = tf.contrib.rnn.stack_bidirectional_dynamic_rnn(
    cells_fw, cells_bw, INPUT,
    sequence_length=LENGTHS, dtype=tf.float32)
What is the difference between the two approaches, and is one better than the other?

If you want multiple layers that pass information backward or forward in time, there are two ways to design this. Assume the forward direction consists of two layers F1, F2 and the backward direction consists of two layers B1, B2.
If you use tf.nn.bidirectional_dynamic_rnn the model will look like this (time flows from left to right):
If you use tf.contrib.rnn.stack_bidirectional_dynamic_rnn the model will look like this:
Here the black dot between the first and second layer represents a concatenation, i.e., the outputs of the forward and backward cells are concatenated and fed to both the forward and the backward cells of the next layer up. This means both F2 and B2 receive exactly the same input, and there is an explicit connection between the backward and forward layers. In "Speech Recognition with Deep Recurrent Neural Networks", Graves et al. summarize this as follows:
... every hidden layer receives input from both the
forward and backward layers at the level below.
This connection happens only implicitly in the unstacked BiRNN (first image), namely when mapping back to the output. The stacked BiRNN usually performed better for my purposes, but I guess that depends on your problem setting. It is certainly worthwhile to try it out!
EDIT
In response to your comment: I base my answer on the documentation of the function tf.contrib.rnn.stack_bidirectional_dynamic_rnn which says:
Stacks several bidirectional rnn layers. The combined forward and
backward layer outputs are used as input of the next layer.
tf.bidirectional_rnn does not allow to share forward and backward
information between layers.
Also, I looked at the implementation available under this link.
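The wiring difference described in this answer can be sketched without TensorFlow. The following is only a toy illustration in plain Python: run_rnn is a made-up recurrent layer with fixed weights (not an LSTM, and not the library implementation), and the function names unstacked_birnn and stacked_birnn are hypothetical. The point is solely which outputs feed which layer.

```python
import math

def run_rnn(inputs, hidden_size, reverse=False):
    """Toy recurrent layer with fixed, made-up weights (not an LSTM).
    Returns one hidden vector per time step, in chronological order."""
    steps = list(reversed(inputs)) if reverse else list(inputs)
    h = [0.0] * hidden_size
    outputs = []
    for x in steps:
        drive = sum(x) / len(x)  # every unit sees the mean of the input vector
        h = [math.tanh(0.5 * drive + 0.5 * h_j + 0.1 * j) for j, h_j in enumerate(h)]
        outputs.append(h)
    return list(reversed(outputs)) if reverse else outputs

def concat(a, b):
    """Concatenate two sequences of vectors step by step."""
    return [x + y for x, y in zip(a, b)]

def unstacked_birnn(inputs, hidden_size, num_layers):
    """Two independent stacks (F1 -> F2, B1 -> B2), concatenated only at the end."""
    fw, bw = inputs, inputs
    for _ in range(num_layers):
        fw = run_rnn(fw, hidden_size)
        bw = run_rnn(bw, hidden_size, reverse=True)
    return concat(fw, bw)

def stacked_birnn(inputs, hidden_size, num_layers):
    """Forward and backward outputs are concatenated after every layer,
    so both directions of the next layer receive the same input."""
    layer_input = inputs
    for _ in range(num_layers):
        fw = run_rnn(layer_input, hidden_size)
        bw = run_rnn(layer_input, hidden_size, reverse=True)
        layer_input = concat(fw, bw)
    return layer_input

inputs = [[1.0, 2.0], [0.5, -1.0], [0.0, 3.0]]  # 3 time steps, 2 features
unstacked = unstacked_birnn(inputs, hidden_size=2, num_layers=2)
stacked = stacked_birnn(inputs, hidden_size=2, num_layers=2)
print(len(unstacked), len(unstacked[0]))
print(len(stacked), len(stacked[0]))
```

Both variants produce an output of the same shape (time steps x 2 * hidden_size); they differ only in what the second layer sees as input.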

Related

Best way to add features after last convolutional layer, before fully-connected layers?

I am working on a regression problem related to chess. The output will depend on about 68 values that are given by Stockfish's static evaluation function (example output shown here), as well as the state of the board. However, the static eval features should not be passed through the CNN, only through the final fully-connected layers. Therefore I want to have some convolutional layers take the (one-hot encoded) board state down to a flat vector, then extend it with the other features before passing the full vector to a fully-connected layer.
How can I use Tensorflow to combine these two feature vectors (the result from the CNN and the other game-related features) within a single Layer type that can be added to a Sequential? I couldn't find anything that would handle this in the docs. Would subclassing Layer be the only way to go?

Fully Connected Layer dimensions

I have a few uncertainties regarding the fully connected layer of a convolutional neural network. Let's say the input is the output of a convolutional layer. I understand the previous layer is flattened. But can it have multiple channels? (For example, can the input to the fully connected layer be 16x16x3, i.e. 3 channels, flattened into a vector of 768 elements?)
Next, I understand the equation for outputs is,
outputs = activation(inputs * weights' + bias)
Is there 1 weight per input? (for example, in the example above, would there be 768 weights?)
Next, how many biases are there? 1 per channel (so 3)? 1 no matter what? Something else?
Lastly, how do filters work in the fully connected layer? Can there be more than 1?
You might have a misunderstanding of how the fully connected neural network works. To get a better understanding of it, you could always check some good tutorials such as online courses from Stanford HERE
To answer your first question: yes, whatever dimensions you have, you need to flatten it before sending to fully connected layers.
To answer your second question, you have to understand that a fully connected layer is really a matrix multiplication followed by a vector addition:
input^T * weights + bias = output
where the input has dimension 1xIN, the weights have size INxOUT, and the output has size 1xOUT, so that 1xIN * INxOUT = 1xOUT. Altogether you have INxOUT weights: OUT weights for each input element or, equivalently, IN weights for each output unit. You also need OUT biases, one per output unit, so the full equation is 1xIN * INxOUT + 1xOUT (bias term).
There are no filters, since you are not doing convolution.
Note that a fully connected layer is also equivalent to a 1x1 convolution layer, and many implementations use the latter for fully connected layers, which can be confusing for beginners. For details, please refer to HERE
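To make the counting concrete, here is a minimal plain-Python sketch using the 16x16x3 input from the question. The output size OUT = 10 and all weight and bias values are arbitrary choices for illustration, and the activation is left out.

```python
def flatten(volume):
    """Flatten a height x width x channels nested list into one vector."""
    return [value for row in volume for pixel in row for value in pixel]

def dense(inputs, weights, biases):
    """output[j] = sum_i inputs[i] * weights[i][j] + biases[j] (no activation)."""
    return [
        sum(x * w_row[j] for x, w_row in zip(inputs, weights)) + biases[j]
        for j in range(len(biases))
    ]

OUT = 10  # hypothetical number of output units

# A 16x16x3 activation volume filled with ones, flattened to 768 values.
volume = [[[1.0] * 3 for _ in range(16)] for _ in range(16)]
inputs = flatten(volume)

weights = [[0.01] * OUT for _ in inputs]  # IN x OUT matrix
biases = [0.5] * OUT                      # one bias per output unit

outputs = dense(inputs, weights, biases)
num_weights = sum(len(row) for row in weights)
print(len(inputs), num_weights, len(biases), len(outputs))
```

The channels disappear in the flattening step: the layer only sees 768 numbers, so it has 768 x OUT weights and OUT biases regardless of how many channels the volume had.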

Feed input to intermediate layer and then do back propagation in keras

I have looked around everywhere but could not find the way to do this.
Basically I want to feed input to some intermediate layer in a Keras model and do the backpropagation for the full graph (i.e. including the layers before the intermediate layer). To understand this, I refer you to the figure in the paper "Multi-view Convolutional Neural Networks for 3D Shape Recognition".
From the figure you can see that the features are max-pooled in the view pooling layer and the resulting vector is then passed to the rest of the network.
According to the paper, they then did the backpropagation using the view-pooled features.
To achieve this I am trying a simple approach. There will not be any view pooling layer in my model. I will do this pooling offline by taking the features for multiple views and then taking their max. Finally, the aggregated feature will be passed to the rest of the network. However, I am not able to figure out how to do the backpropagation through the full network when passing input directly to an intermediate layer.
Any help would be appreciated. Thanks
If you have the code of the tensorflow model, then this will be quite simple. The model would probably look like
def model(cnns):
    viewpool_output = f(cnns)
    cnn2_output = cnn2(viewpool_output)
    ...
You would just need to change the model to
def model(viewpool_output):
    cnn2_output = cnn2(viewpool_output)
    ...
and instead of passing a "real" view pool output, you just pass whatever image you want. But you haven't given any code, so we can only guess at what it looks like.
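For reference, the view pooling the question describes is an element-wise max over the per-view feature vectors. The sketch below is plain Python and not the paper's code; view_pool and route_gradient are hypothetical helper names. It also records which view supplied each maximum, because the gradient of a max flows only to the winning element, which is what pooling done outside the model would have to reproduce by hand.

```python
def view_pool(view_features):
    """Element-wise max over views.
    Returns the pooled vector and, per feature, the index of the winning view."""
    pooled, winners = [], []
    for values in zip(*view_features):
        best = max(range(len(values)), key=lambda v: values[v])
        pooled.append(values[best])
        winners.append(best)
    return pooled, winners

def route_gradient(grad_pooled, winners, num_views):
    """Send each pooled-feature gradient back to the view that produced the max."""
    grads = [[0.0] * len(grad_pooled) for _ in range(num_views)]
    for k, (g, v) in enumerate(zip(grad_pooled, winners)):
        grads[v][k] = g
    return grads

# Three views, three features each (made-up numbers).
views = [
    [0.2, 0.9, 0.1],
    [0.7, 0.3, 0.4],
    [0.5, 0.8, 0.6],
]
pooled, winners = view_pool(views)
view_grads = route_gradient([1.0, 2.0, 3.0], winners, num_views=len(views))
print(pooled, winners)
print(view_grads)
```

If the pooling lives inside the model graph instead, the framework does this routing automatically during backpropagation.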

Avoiding weight sharing among certain layers in BucketingModule in mxnet?

I am using BucketingModule to train multiple small models/bots together. Here, the bucket key is bot_id. However, each bot has a separate set of target labels/classes (and hence a softmax layer of a different size).
Is there any way to train such a model in mxnet, where I want to share the weights for all the layers but one (softmax) among all the bots?
How would I initialize such a model using sym_gen method?
If in the sym_gen method, for the Softmax layer I specify the num_hidden=size_dict[bot] i.e.,
pred = mx.sym.FullyConnected(data=pred, num_hidden=len(size_dict[bot]), name='pred')
pred = mx.sym.SoftmaxOutput(data=pred, label=label, name='softmax')
I get the error:
Inferred shape does not match shared_exec.arg_array's shape
which makes sense as each bot has different number of target classes.
This issue was posted and resolved here: https://github.com/apache/incubator-mxnet/issues/9042
You can make sym_gen(default_bucket_key) return a "master network" that contains all these FC layers of different shapes, and sym_gen(other_keys) return a subset of the master network with one particular FC layer. Note that for the master network, you probably need to use mx.sym.Group to group all outputs together so that only one symbol is returned.

Understanding Seq2Seq model

Here is my understanding of a basic Sequence to Sequence LSTMs. Suppose we are tackling a question-answer setting.
You have two sets of LSTMs (green and blue below). The cells within each set share weights (i.e. the 4 green cells all have the same weights, and likewise the blue cells). The first is a many-to-one LSTM, which summarises the question in its last hidden state/cell memory.
The second set (blue) is a many-to-many LSTM with different weights from the first set. Its input is simply the answer sentence, while its output is the same sentence shifted by one.
The question is two fold:
1. Are we passing only the last hidden state to the blue LSTMs as the initial hidden state, or both the last hidden state and the cell memory?
2. Is there a way to set the initial hidden state and cell memory in Keras or TensorFlow? If so, is there a reference?
(image taken from suriyadeepan.github.io)
Are we passing only the last hidden state to the blue LSTMs as the initial hidden state, or both the last hidden state and the cell memory?
Both hidden state h and cell memory c are passed to the decoder.
TensorFlow
In seq2seq source code, you can find the following code in basic_rnn_seq2seq():
_, enc_state = rnn.static_rnn(enc_cell, encoder_inputs, dtype=dtype)
return rnn_decoder(decoder_inputs, enc_state, cell)
If you use an LSTMCell, the returned enc_state from the encoder will be a tuple (c, h). As you can see, the tuple is passed directly to the decoder.
Keras
In Keras, the "state" defined for an LSTMCell is also a tuple (h, c) (note that the order is different from TF). In LSTMCell.call(), you can find:
h_tm1 = states[0]
c_tm1 = states[1]
To get the states returned from an LSTM layer, you can specify return_state=True. The returned value is a tuple (o, h, c). The tensor o is the output of this layer, which will be equal to h unless you specify return_sequences=True.
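Independent of either library, the state hand-over can be sketched in plain Python. This is a toy single-unit LSTM with made-up weights, not the TensorFlow or Keras implementation; lstm_step, run_lstm, and the weight dictionaries are hypothetical names. It only shows that the encoder's final (h, c) pair becomes the decoder's initial state.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h, c, w):
    """One step of a single-unit LSTM.
    w maps each gate name to (input weight, recurrent weight, bias)."""
    def pre(name):
        w_x, w_h, b = w[name]
        return w_x * x + w_h * h + b
    i = sigmoid(pre("i"))    # input gate
    f = sigmoid(pre("f"))    # forget gate
    o = sigmoid(pre("o"))    # output gate
    g = math.tanh(pre("g"))  # candidate cell value
    c_new = f * c + i * g
    h_new = o * math.tanh(c_new)
    return h_new, c_new

def run_lstm(xs, w, initial_state=(0.0, 0.0)):
    """Run the cell over a sequence; return per-step outputs and the final (h, c)."""
    h, c = initial_state
    outputs = []
    for x in xs:
        h, c = lstm_step(x, h, c, w)
        outputs.append(h)
    return outputs, (h, c)

# Made-up weights; encoder and decoder do not share them.
enc_w = {"i": (0.5, 0.1, 0.0), "f": (0.3, 0.2, 1.0), "o": (0.4, 0.1, 0.0), "g": (0.9, 0.3, 0.0)}
dec_w = {"i": (0.2, 0.4, 0.0), "f": (0.1, 0.3, 1.0), "o": (0.6, 0.2, 0.0), "g": (0.7, 0.5, 0.0)}

question = [1.0, -0.5, 2.0]
answer = [0.3, 0.7]

_, enc_state = run_lstm(question, enc_w)                # keep only the final (h, c)
dec_outputs, dec_state = run_lstm(answer, dec_w, initial_state=enc_state)
cold_outputs, _ = run_lstm(answer, dec_w)               # same decoder, zero initial state
print(dec_outputs)
print(cold_outputs)
```

The two printed sequences differ, which is the only way the decoder in this setup can know anything about the question.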
Is there a way to set the initial hidden state and cell memory in Keras or TensorFlow? If so, is there a reference?
TensorFlow
Just provide the initial state to an LSTMCell when calling it. For example, in the official RNN tutorial:
lstm = tf.contrib.rnn.BasicLSTMCell(lstm_size)
...
output, state = lstm(current_batch_of_words, state)
There's also an initial_state argument for functions such as tf.nn.static_rnn. If you use the seq2seq module, provide the states to rnn_decoder as shown in the code for question 1.
Keras
Use the keyword argument initial_state in the LSTM function call.
out = LSTM(32)(input_tensor, initial_state=(h, c))
You can actually find this usage on the official documentation:
Note on specifying the initial state of RNNs
You can specify the initial state of RNN layers symbolically by
calling them with the keyword argument initial_state. The value of
initial_state should be a tensor or list of tensors representing the
initial state of the RNN layer.
EDIT:
There's now an example script in Keras (lstm_seq2seq.py) showing how to implement basic seq2seq in Keras. How to make prediction after training a seq2seq model is also covered in this script.
(Edit: this answer is incomplete and does not consider the actual possibilities for transferring state. See the accepted answer.)
From a Keras point of view, that picture has only two layers.
The green group is one LSTM layer.
The blue group is another LSTM layer.
There isn't any communication between green and blue other than passing the outputs. So, the answer for 1 is:
Only the thought vector (which is the actual output of the layer) is passed to the other layer.
Memory and state (not sure if these are two different entities) are totally contained inside a single layer and are not initially intended to be seen or shared with any other layer.
Each individual block in that image is totally invisible in keras. They are considered "time steps", something that only appears in the shape of the input data. It's rarely important to worry about them (unless for very advanced usages).
In keras, it's like this:
Easily, you have access only to the external arrows (including "thought vector").
But having access to each step (each individual green block in your picture) is not an exposed thing. So...
Passing the states from one layer to the other is also not expected in Keras. You will probably have to hack things. (See this: https://github.com/fchollet/keras/issues/2995)
But considering a thought vector big enough, you could say it will learn a way to carry what is important in itself.
The only notion you have from the steps is:
You have to input things shaped like (sentences, length, wordIdFeatures)
The steps will be performed considering that each slice in the length dimension is an input to each green block.
You may choose to have a single output (sentences, cells), for which you completely lose track of steps. Or...
Outputs like (sentences, length, cells), from which you know the output of each block through the length dimension.
One to many or many to many?
Now, the first layer is many to one (but nothing prevents it from being many to many too if you want).
But the second... that's complicated.
If the thought vector was made by a many to one, you will have to find a way of creating a one to many. (That's not trivial in Keras, but you could think of repeating the thought vector for the expected length, making it the input to all steps. Or maybe fill an entire sequence with zeros or ones, keeping only the first element as the thought vector.)
If the thought vector was made by a many to many, you can take advantage of this and keep an easy many to many, if you're willing to accept that the output has exactly the same number of steps as the input.
Keras doesn't have a ready solution for 1 to many cases. (From a single input predict a whole sequence).