There is no "name" variable in the constructor of BasicLSTMCell - tensorflow

In order to differentiate LSTMs, I wish to give a name to the BasicLSTMCell variable in my code. But it reported the following error:
num_units=self.config.num_lstm_units, state_is_tuple=True, name="some_basic_lstm")
TypeError: __init__() got an unexpected keyword argument 'name'
I then found the following in my TensorFlow installation, in the file rnn_cell_impl.py:
class BasicLSTMCell(RNNCell):
  """Basic LSTM recurrent network cell.
  The implementation is based on: http://arxiv.org/abs/1409.2329.
  We add forget_bias (default: 1) to the biases of the forget gate in order to
  reduce the scale of forgetting in the beginning of the training.
  It does not allow cell clipping, a projection layer, and does not
  use peep-hole connections: it is the basic baseline.
  For advanced models, please use the full @{tf.nn.rnn_cell.LSTMCell}
  that follows.
  """
  def __init__(self, num_units, forget_bias=1.0,
               state_is_tuple=True, activation=None, reuse=None):
    """Initialize the basic LSTM cell.
    Args:
      num_units: int, The number of units in the LSTM cell.
      forget_bias: float, The bias added to forget gates (see above).
        Must set to `0.0` manually when restoring from CudnnLSTM-trained
        checkpoints.
      state_is_tuple: If True, accepted and returned states are 2-tuples of
        the `c_state` and `m_state`. If False, they are concatenated
        along the column axis. The latter behavior will soon be deprecated.
      activation: Activation function of the inner states. Default: `tanh`.
      reuse: (optional) Python boolean describing whether to reuse variables
        in an existing scope. If not `True`, and the existing scope already has
        the given variables, an error is raised.
Is it a bug in my version of tensorflow? How can I give it a "name"?

I think @aswinids provided the best answer here in the comments, but let me explain why this should not be considered a bug. An LSTM cell comprises at least 4 variables (there are a few others used for control flow and such). There are 4 sub-network operations that occur in an LSTM. The diagram from Colah's blog illustrates the internals of an LSTM cell (http://colah.github.io/posts/2015-08-Understanding-LSTMs/):
Each of the yellow boxes has a set of weights assigned to it and is effectively a single layer neural network operation (piped together in an interesting way, defined by the LSTM architecture).
A good approach to naming these would then be tf.variable_scope('some_name') such that all 4 of the variables defined in the LSTM have a common base naming structure such as:
lstm_cell/f_t
lstm_cell/i_t
lstm_cell/C_t
lstm_cell/o_t
I suspect that previously they just did this and hard-coded lstm_cell, or whatever name they used, as the prefix for all the variables under the LSTM cell. In later versions, as @ashwinids points out, there is a name argument, and I suspect it simply replaces the lstm_cell prefix I used in the example here.
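The prefixing idea can be sketched without TensorFlow. The snippet below is a toy model only: `variable_scope`, `full_name` and `lstm_variable_names` are hypothetical stand-ins written for this illustration, not TensorFlow functions, and the four variable names are the ones used in the list above.

```python
from contextlib import contextmanager

# Hypothetical stand-in for a scope stack (illustration only, not TensorFlow).
_scope_stack = []


@contextmanager
def variable_scope(name):
    """Push a scope name for the duration of the with-block."""
    _scope_stack.append(name)
    try:
        yield
    finally:
        _scope_stack.pop()


def full_name(var_name):
    """Join the active scope names and the variable name with '/'."""
    return "/".join(_scope_stack + [var_name])


def lstm_variable_names():
    """Names of the four gate variables under whatever scope is active."""
    return [full_name(n) for n in ("f_t", "i_t", "C_t", "o_t")]


with variable_scope("some_basic_lstm"):
    names = lstm_variable_names()

print(names)
```

The code that creates the variables never mentions the prefix; whoever opens the enclosing scope chooses it, which is why two LSTMs can be told apart without a name argument on the cell itself.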


What is _uses_learning_phase in Keras?

I'm trying to write my own recurrent layer in Keras and noticed this line in the Keras source:
# Properly set learning phase on output tensor.
if 0 < self.dropout + self.recurrent_dropout:
    if training is None:
        output._uses_learning_phase = True
Checking the backend code for in_train_phase:
if training is None:
    training = learning_phase()
    uses_learning_phase = True
else:
    uses_learning_phase = False
This is rather confusing. Isn't "training" the "learning phase"?! I guess more importantly, do I need to set _uses_learning_phase on output in my custom recurrent layer?
Intro
A "Training Flag" is meant to enable a Model (or Layer) to behave differently during training than when it predicts results or is being tested.
Depending on the backend used, Keras may need to implement its own boolean "training flag" (on CNTK, as of Keras 2.2.4) or can use a native backend tensor (as with TensorFlow). Code that handles both cases was therefore integrated.
As a consequence, the Layer class has a property described as follows:
uses_learning_phase: Whether any operation
of the layer uses `K.in_training_phase()`
or `K.in_test_phase()`.
and output tensors may be given an attribute _uses_learning_phase which is read by the property. If any output tensor has the attribute (and it is true), the layer's property returns true.
Usage in Keras's Recurrent layer
Your code snippet comes from keras/layers/recurrent.py and when calling the private _generate_dropout_mask method, the backend's operation creator "in_train_phase()" is being called. Therefore the output tensor's flag "_uses_learning_phase" is being set.
Explanation of quoted backend code
in_train_phase() and in_test_phase() handle the flag in the same way. "training" is an optional argument that references the Training Flag. If the argument is not given, the global Training Flag is looked up automatically at
training = learning_phase()
However, the output tensor's attribute _uses_learning_phase is only set (and set to True) if the Training Flag is a backend tensor AND the optional training argument was not given. (This may also explain why a layer needs to set _uses_learning_phase itself, but I see no use case for creating an operation via in_test_phase without flagging the output tensor. For now, assume there is one.)
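The branch logic described above can be mimicked in plain Python. This is a simplified model, not the Keras backend: `FakeTensor` is a hypothetical stand-in for a symbolic tensor, and the deferred "switch" is only recorded as a label instead of being built as a real graph operation.

```python
class FakeTensor:
    """Hypothetical stand-in for a symbolic backend tensor."""

    def __init__(self, label):
        self.label = label


# Stand-in for the global Training Flag returned by learning_phase().
_LEARNING_PHASE = FakeTensor("learning_phase")


def learning_phase():
    return _LEARNING_PHASE


def in_train_phase(x, alt, training=None):
    """Select x in training and alt otherwise, flagging the result if needed."""
    if training is None:
        training = learning_phase()
        uses_learning_phase = True
    else:
        uses_learning_phase = False

    # A concrete Python boolean can be resolved immediately.
    if training is True:
        return x
    if training is False:
        return alt

    # Otherwise the choice is deferred to run time; here we only record
    # which flag the result depends on.
    out = FakeTensor("switch(%s, %s, %s)" % (training.label, x.label, alt.label))
    if uses_learning_phase:
        out._uses_learning_phase = True
    return out


dropped, kept = FakeTensor("dropped"), FakeTensor("kept")
implicit = in_train_phase(dropped, kept)                  # global flag is used
explicit = in_train_phase(dropped, kept, training=True)   # caller decided
custom = in_train_phase(dropped, kept, training=FakeTensor("my_flag"))
```

Only `implicit` ends up carrying `_uses_learning_phase`: `explicit` is resolved on the spot, and `custom` depends on a flag the caller supplied rather than on the global one.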

What is the difference between the trainable_weights and trainable_variables in the tensorflow basic lstm_cell?

While trying to copy the weights of an LSTM cell in TensorFlow using the BasicLSTMCell as documented here, I came across both the trainable_weights and trainable_variables properties.
Source code has not really been informative for a noob like me sadly. A little bit of experimenting did yield the following information though:
Both have the exact same layout, being a list of length two, where the first entry is a tf.Variable of shape: (2*num_units, 4*num_units), the second entry of the list is of shape (4*num_units,), where num_units is the num_units from initializing the BasicLSTMCell.
The intuitive guess for me is now, that the first list item is a concatenation of the weights of the four internal layers of the lstm, the second item being a concatenation of the respective biases, fitting the expected sizes of these obviously.
Now the question is, whether there is actually any difference between these? I assume they might just be a result of inheriting these from the rnn_cell class?
From the source code of the Layer class that RNNCell inherits from:
@property
def trainable_variables(self):
    return self.trainable_weights
See here. The RNN classes don't seem to override this definition. I would assume it's there for special layer types that have trainable variables that don't quite qualify as "weights". Batch normalization comes to mind, but unfortunately I can't find any mention of trainable_variables in that one's source code (except for GraphKeys.TRAINABLE_VARIABLES, which is different).
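A minimal sketch of what such an alias means in practice, together with the shapes reported in the question. `ToyLayer` and `ToyBasicLSTMCell` are hypothetical classes, not the TensorFlow ones, and weights are represented only by their name and shape. The kernel shape assumes the cell multiplies the concatenation of the inputs and the hidden state by one matrix covering all four gates.

```python
class ToyLayer:
    """Hypothetical minimal layer; not the TensorFlow Layer class."""

    def __init__(self):
        self._trainable_weights = []

    @property
    def trainable_weights(self):
        return self._trainable_weights

    @property
    def trainable_variables(self):
        # Pure alias, mirroring the property quoted above.
        return self.trainable_weights


class ToyBasicLSTMCell(ToyLayer):
    def __init__(self, num_units):
        super().__init__()
        self.num_units = num_units

    def build(self, input_depth):
        # One kernel for all four gates, applied to concat([inputs, h]).
        kernel_shape = (input_depth + self.num_units, 4 * self.num_units)
        bias_shape = (4 * self.num_units,)
        self._trainable_weights = [("kernel", kernel_shape), ("bias", bias_shape)]


cell = ToyBasicLSTMCell(num_units=8)
cell.build(input_depth=8)
print(cell.trainable_variables)  # [('kernel', (16, 32)), ('bias', (32,))]
```

Under that assumption, the first dimension is (2*num_units) only when the input depth happens to equal num_units, as in this example; with a different input depth it would be input_depth + num_units.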

Understanding Seq2Seq model

Here is my understanding of a basic Sequence to Sequence LSTMs. Suppose we are tackling a question-answer setting.
You have two sets of LSTMs (green and blue below). Each set shares weights internally (i.e. each of the 4 green cells has the same weights, and similarly for the blue cells). The first is a many-to-one LSTM, which summarises the question in the last hidden state / cell memory.
The second set (blue) is a Many to Many LSTM which has different weights to the first set of LSTMs. The input is simply the answer sentence while the output is the same sentence shifted by one.
The question is twofold:
1. Are we passing only the last hidden state to the blue LSTMs as the initial hidden state, or both the last hidden state and the cell memory?
2. Is there a way to set the initial hidden state and cell memory in Keras or TensorFlow? If so, is there a reference?
(image taken from suriyadeepan.github.io)
Are we passing only the last hidden state to the blue LSTMs as the initial hidden state, or both the last hidden state and the cell memory?
Both hidden state h and cell memory c are passed to the decoder.
TensorFlow
In seq2seq source code, you can find the following code in basic_rnn_seq2seq():
_, enc_state = rnn.static_rnn(enc_cell, encoder_inputs, dtype=dtype)
return rnn_decoder(decoder_inputs, enc_state, cell)
If you use an LSTMCell, the returned enc_state from the encoder will be a tuple (c, h). As you can see, the tuple is passed directly to the decoder.
Keras
In Keras, the "state" defined for an LSTMCell is also a tuple (h, c) (note that the order is different from TF). In LSTMCell.call(), you can find:
h_tm1 = states[0]
c_tm1 = states[1]
To get the states returned from an LSTM layer, you can specify return_state=True. The returned value is a tuple (o, h, c). The tensor o is the output of this layer, which will be equal to h unless you specify return_sequences=True.
Is there a way to set the initial hidden state and cell memory in Keras or TensorFlow? If so, is there a reference?
TensorFlow
Just provide the initial state to an LSTMCell when calling it. For example, in the official RNN tutorial:
lstm = tf.contrib.rnn.BasicLSTMCell(lstm_size)
...
output, state = lstm(current_batch_of_words, state)
There's also an initial_state argument for functions such as tf.nn.static_rnn. If you use the seq2seq module, provide the states to rnn_decoder as shown in the code for question 1.
Keras
Use the keyword argument initial_state in the LSTM function call.
out = LSTM(32)(input_tensor, initial_state=(h, c))
You can actually find this usage on the official documentation:
Note on specifying the initial state of RNNs
You can specify the initial state of RNN layers symbolically by
calling them with the keyword argument initial_state. The value of
initial_state should be a tensor or list of tensors representing the
initial state of the RNN layer.
EDIT:
There's now an example script in Keras (lstm_seq2seq.py) showing how to implement basic seq2seq in Keras. How to make prediction after training a seq2seq model is also covered in this script.
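To make the state handoff concrete without depending on either library, here is a sketch of a single-unit LSTM in plain Python. Everything in it is an assumption made for illustration: the weights are arbitrary constants, `lstm_step`, `run_lstm` and `make_weights` are hypothetical helpers, and the state is ordered (h, c) as in Keras.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, state, w):
    """One step of a single-unit LSTM; state is the pair (h, c)."""
    h, c = state
    i = sigmoid(w["wi"] * x + w["ui"] * h + w["bi"])    # input gate
    f = sigmoid(w["wf"] * x + w["uf"] * h + w["bf"])    # forget gate
    o = sigmoid(w["wo"] * x + w["uo"] * h + w["bo"])    # output gate
    g = math.tanh(w["wg"] * x + w["ug"] * h + w["bg"])  # candidate memory
    c = f * c + i * g
    h = o * math.tanh(c)
    return h, (h, c)


def run_lstm(inputs, w, initial_state=(0.0, 0.0)):
    """Run over a sequence; return per-step outputs and the final (h, c)."""
    state = initial_state
    outputs = []
    for x in inputs:
        out, state = lstm_step(x, state, w)
        outputs.append(out)
    return outputs, state


def make_weights(value):
    """Arbitrary constant weights, for illustration only."""
    keys = ["wi", "ui", "bi", "wf", "uf", "bf", "wo", "uo", "bo", "wg", "ug", "bg"]
    return {k: value for k in keys}


encoder_w = make_weights(0.5)   # "green" weights
decoder_w = make_weights(-0.3)  # "blue" weights, distinct from the encoder

# Encoder: many to one, only the final state is kept.
_, (enc_h, enc_c) = run_lstm([1.0, 2.0, 3.0], encoder_w)

# Decoder: many to many, started from the encoder's h AND c.
dec_outputs, _ = run_lstm([0.1, 0.2], decoder_w, initial_state=(enc_h, enc_c))

# For comparison: the same decoder started from zeros.
dec_from_zero, _ = run_lstm([0.1, 0.2], decoder_w)
```

The decoder outputs differ between the two runs only because of the initial state, which is the sole channel through which the encoded question reaches the decoder in this setup.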
(Edit: this answer is incomplete and hasn't considered actual possibilities of state transferring. See the accepted answer).
From a Keras point of view, that picture has only two layers.
The green group is one LSTM layer.
The blue group is another LSTM layer.
There isn't any communication between green and blue other than passing the outputs. So, the answer for 1 is:
Only the thought vector (which is the actual output of the layer) is passed to the other layer.
Memory and state (not sure if these are two different entities) are totally contained inside a single layer and are not initially intended to be seen or shared with any other layer.
Each individual block in that image is totally invisible in keras. They are considered "time steps", something that only appears in the shape of the input data. It's rarely important to worry about them (unless for very advanced usages).
In keras, it's like this:
Easily, you have access only to the external arrows (including "thought vector").
But having access to each step (each individual green block in your picture) is not an exposed thing. So...
Passing the states from one layer to the other is also not expected in Keras. You will probably have to hack things. (See this: https://github.com/fchollet/keras/issues/2995)
But considering a thought vector big enough, you could say it will learn a way to carry what is important in itself.
The only notion you have from the steps is:
You have to input things shaped like (sentences, length, wordIdFeatures)
The steps will be performed considering that each slice in the length dimension is an input to each green block.
You may choose to have a single output (sentences, cells), for which you completely lose track of steps. Or...
Outputs like (sentences, length, cells), from which you know the output of each block through the length dimension.
One to many or many to many?
Now, the first layer is many to one (but nothing prevents it from being many to many too if you want).
But the second... that's complicated.
If the thought vector was made by a many to one, you will have to manage a way of creating a one to many. (That's not trivial in keras, but you could think of repeating the thought vector for the expected length, making it the input to all steps. Or maybe fill an entire sequence with zeros or ones, keeping only the first element as the thought vector.)
If the thought vector was made by a many to many, you can take advantage of this and keep an easy many to many, if you're willing to accept that the output has exactly the same number of steps as the input.
Keras doesn't have a ready solution for 1 to many cases. (From a single input predict a whole sequence).

Tensorflow RNN weight matrices initialization

I'm using bidirectional_rnn with GRUCell but this is a general question regarding the RNN in Tensorflow.
I couldn't find how to initialize the weight matrices (input to hidden, hidden to hidden). Are they initialized randomly? To zeros? Are they initialized differently for each LSTM I create?
EDIT: Another motivation for this question is pre-training some LSTMs and using their weights in a subsequent model. I don't currently know how to do that without saving all the states and restoring the entire model.
Thanks.
How to initialize weight matrices for RNN?
I believe people commonly use random normal initialization for RNN weight matrices. Check out the example in the TensorFlow GitHub repo. The notebook is a bit long, but it contains a simple LSTM model that uses tf.truncated_normal to initialize weights and tf.zeros to initialize biases (I have also tried tf.ones for the biases, which seems to work too). The standard deviation is a hyperparameter you can tune yourself. Weight initialization can matter for gradient flow. As far as I know, though, the LSTM itself is designed to handle the vanishing gradient problem (and gradient clipping helps with exploding gradients), so perhaps you don't need to be super careful with the setup of std_dev in an LSTM. I've read papers recommending Xavier initialization (TF API doc for Xavier initializer) in the context of convolutional neural networks. I don't know whether people use it in RNNs, but you could try it there to see if it helps.
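For readers unfamiliar with the two schemes mentioned here, the following pure-Python sketch shows what they compute. It is not TensorFlow code: `truncated_normal`, `glorot_uniform_limit` and `glorot_uniform` are hypothetical helpers. The truncation rule (redraw samples more than two standard deviations from the mean) and the Xavier/Glorot uniform bound sqrt(6 / (fan_in + fan_out)) are the usual definitions; the mean, standard deviation and matrix sizes in the usage lines are arbitrary.

```python
import math
import random


def truncated_normal(n, mean=0.0, stddev=1.0, rng=random):
    """Draw n normal samples, redrawing any that fall more than
    two standard deviations from the mean."""
    samples = []
    while len(samples) < n:
        x = rng.gauss(mean, stddev)
        if abs(x - mean) <= 2.0 * stddev:
            samples.append(x)
    return samples


def glorot_uniform_limit(fan_in, fan_out):
    """Bound of the uniform range used by Xavier/Glorot uniform init."""
    return math.sqrt(6.0 / (fan_in + fan_out))


def glorot_uniform(fan_in, fan_out, rng=random):
    """A fan_in x fan_out matrix (list of rows) drawn from [-limit, limit]."""
    limit = glorot_uniform_limit(fan_in, fan_out)
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]


rng = random.Random(0)  # seeded so repeated runs draw the same values
w = truncated_normal(1000, mean=-0.1, stddev=0.1, rng=rng)
k = glorot_uniform(64, 128, rng=rng)
```

The practical difference is who picks the scale: with a truncated normal the standard deviation is a hyperparameter, while the Xavier scheme derives the range from the matrix dimensions.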
Now to follow up on @Allen's answer and your follow-up question left in the comments.
How to control initialization with variable scope?
Take the simple LSTM model in the TensorFlow GitHub Python notebook that I linked to as an example.
Specifically, if I wanted to refactor the LSTM part of that code using variable scope control, I might write something like the following:
import tensorflow as tf

def initialize_LSTMcell(vocabulary_size, num_nodes, initializer):
    '''initialize LSTMcell weights and biases, set variables to reuse mode'''
    gates = ['input_gate', 'forget_gate', 'memory_cell', 'output_gate']
    with tf.variable_scope('LSTMcell') as scope:
        for gate in gates:
            with tf.variable_scope(gate) as gate_scope:
                # pass the initializer by keyword: the third positional
                # argument of tf.get_variable is dtype, not initializer
                wx = tf.get_variable("wx", [vocabulary_size, num_nodes], initializer=initializer)
                wt = tf.get_variable("wt", [num_nodes, num_nodes], initializer=initializer)
                bi = tf.get_variable("bi", [1, num_nodes], initializer=tf.constant_initializer(0.0))
                # this line can probably be omitted, because setting the
                # 'LSTMcell' scope to reuse mode below also turns on reuse
                # for all of its child scopes
                gate_scope.reuse_variables()
        scope.reuse_variables()

def get_scope_variables(scope_name, variable_names):
    '''a helper function to fetch variables based on scope_name and variable_names'''
    variables = {}
    with tf.variable_scope(scope_name, reuse=True):
        for var_name in variable_names:
            variables[var_name] = tf.get_variable(var_name)
    return variables

def LSTMcell(i, o, state):
    '''a function for performing LSTMcell computation'''
    gates = ['input_gate', 'forget_gate', 'memory_cell', 'output_gate']
    var_names = ['wx', 'wt', 'bi']
    gate_comp = {}
    with tf.variable_scope('LSTMcell', reuse=True):
        for gate in gates:
            variables = get_scope_variables(gate, var_names)
            gate_comp[gate] = tf.matmul(i, variables['wx']) + tf.matmul(o, variables['wt']) + variables['bi']
    state = tf.sigmoid(gate_comp['forget_gate']) * state + tf.sigmoid(gate_comp['input_gate']) * tf.tanh(gate_comp['memory_cell'])
    output = tf.sigmoid(gate_comp['output_gate']) * tf.tanh(state)
    return output, state
The refactored code would be used something like this:
initialize_LSTMcell(vocabulary_size, num_nodes, tf.truncated_normal_initializer(mean=-0.1, stddev=.01))
#...Doing some computation...
LSTMcell(input_tensor, output_tensor, state)
The refactored code may look less straightforward, but using variable scopes ensures scope encapsulation and allows flexible control over variables (in my opinion at least).
On pre-training some LSTMs and using their weights in a subsequent model: how to do that without saving all the states and restoring the entire model?
Assuming you have a pre-trained model frozen and loaded in, if you want to use its frozen 'wx', 'wt' and 'bi', you can simply find their parent scope names and variable names, then fetch the variables using a structure similar to the get_scope_variables function.
with tf.variable_scope(scope_name, reuse=True):
    var = tf.get_variable(var_name)
Here is a link to understanding variable scope and sharing variables. I hope this is helpful.
The RNN models will create their variables with get_variable, and you can control the initialization by wrapping the code which creates those variables with a variable_scope and passing a default initializer to it. Unless the RNN specifies one explicitly (looking at the code, it doesn't), uniform_unit_scaling_initializer is used.
You should also be able to share model weights by declaring the second model and passing reuse=True to its variable_scope. As long as the namespaces match up, the new model will get the same variables as the first model.
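The two mechanisms described in this answer (a scope-level default initializer, and sharing through reuse=True) can be modelled in a few lines of plain Python. `ToyVariableStore` and `constant` are hypothetical and far simpler than TensorFlow's real variable store; in particular, the zero-filled fallback is only a placeholder for whatever built-in default the library applies.

```python
from contextlib import contextmanager


class ToyVariableStore:
    """Hypothetical, simplified model of get_variable / variable_scope."""

    def __init__(self):
        self.variables = {}  # full name -> value
        self._scopes = []    # stack of (name, reuse, initializer)

    @contextmanager
    def variable_scope(self, name, reuse=False, initializer=None):
        self._scopes.append((name, reuse, initializer))
        try:
            yield
        finally:
            self._scopes.pop()

    def get_variable(self, name, shape, initializer=None):
        full_name = "/".join([s[0] for s in self._scopes] + [name])
        reuse = any(s[1] for s in self._scopes)  # reuse is inherited by children
        if reuse:
            if full_name not in self.variables:
                raise ValueError("%s does not exist" % full_name)
            return self.variables[full_name]
        if full_name in self.variables:
            raise ValueError("%s already exists" % full_name)
        if initializer is None:
            # Fall back to the innermost enclosing scope that sets a default.
            for _, _, scope_init in reversed(self._scopes):
                if scope_init is not None:
                    initializer = scope_init
                    break
        if initializer is None:
            initializer = constant(0.0)  # placeholder for a built-in default
        self.variables[full_name] = initializer(shape)
        return self.variables[full_name]


def constant(value):
    """Initializer factory: a flat list of the given length."""
    return lambda shape: [value] * shape


store = ToyVariableStore()


def build_model():
    # The model code never mentions an initializer.
    return store.get_variable("kernel", shape=3)


with store.variable_scope("rnn", initializer=constant(0.1)):
    first = build_model()

with store.variable_scope("rnn", reuse=True):
    second = build_model()  # same object as `first`, nothing new is created
```

Because the second model is built under the same scope name with reuse enabled, it receives the existing variable, which is the namespace matching the answer refers to.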
A simple way to initialize all kernel weights with a certain initializer is to set the initializer in tf.variable_scope(). For example:
with tf.variable_scope('rnn', initializer=tf.variance_scaling_initializer()):
    basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
    outputs, state = tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32)

Tensorflow RNN input size

I am trying to use tensorflow to create a recurrent neural network. My code is something like this:
import tensorflow as tf
rnn_cell = tf.nn.rnn_cell.GRUCell(3)
inputs = [tf.constant([[0, 1]], dtype=tf.float32), tf.constant([[2, 3]], dtype=tf.float32)]
outputs, end = tf.nn.rnn(rnn_cell, inputs, dtype=tf.float32)
Now, everything runs just fine. However, I am rather confused by what is actually going on. The output dimensions are always the batch size x the size of the rnn cell's hidden state - how can they be completely independent of the input size?
If my understanding is correct, the inputs are concatenated to the rnn's hidden state at each step, and then multiplied by a weight matrix (among other operations). This means that the dimensions of the weight matrix need to depend on the input size, which is impossible, because the rnn_cell is created before the inputs are even declared!
After seeing the answer to a question about tensorflow's GRU implementation, I've realized what's going on. Counter to my intuition, the GRUCell constructor doesn't create any weight or bias variables at all. Instead, it creates its own variable scope, and then instantiates the variables on demand when actually called. Tensorflow's variable scoping mechanism ensures that the variables are only created once, and shared across subsequent calls to the GRU.
I'm not sure why they decided to go with this rather confusing implementation, which as far as I can tell is undocumented. To me it seems more appropriate to use Python's object-level variable scoping to encapsulate the TensorFlow variables within the GRUCell itself, rather than relying on an additional implicit scoping mechanism.
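The on-demand behaviour described above can be illustrated with a toy cell in plain Python. `LazyCell` is hypothetical and is neither a GRU nor TensorFlow code: it applies a single tanh layer to the concatenated input and state, works on one example at a time with no batch dimension, and stores its weights on the object, which is the kind of object-level encapsulation suggested above.

```python
import math
import random


class LazyCell:
    """Hypothetical recurrent cell that creates its weights on first call."""

    def __init__(self, num_units):
        self.num_units = num_units
        self.kernel = None  # the constructor allocates no weights

    def __call__(self, inputs, state):
        if self.kernel is None:
            # The input size is read from the data, so it is only known here.
            rows = len(inputs) + self.num_units
            self.kernel = [
                [random.uniform(-0.1, 0.1) for _ in range(self.num_units)]
                for _ in range(rows)
            ]
        concat = list(inputs) + list(state)
        new_state = [
            math.tanh(sum(v * self.kernel[r][u] for r, v in enumerate(concat)))
            for u in range(self.num_units)
        ]
        return new_state, new_state


cell = LazyCell(3)
state = [0.0, 0.0, 0.0]
kernels_seen = []
for step_input in ([0.0, 1.0], [2.0, 3.0]):
    output, state = cell(step_input, state)
    kernels_seen.append(cell.kernel)
```

The output always has num_units entries regardless of the input size, while the kernel's row count (input size plus num_units) is fixed on the first call and the same kernel is used on every later step.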