TensorFlow RNN weight matrices initialization

I'm using bidirectional_rnn with GRUCell but this is a general question regarding the RNN in Tensorflow.
I couldn't find how to initialize the weight matrices (input to hidden, hidden to hidden). Are they initialized randomly, or to zeros? Are they initialized differently for each LSTM I create?
EDIT: Another motivation for this question is pre-training some LSTMs and using their weights in a subsequent model. I don't currently know how to do that without saving all the states and restoring the entire model.
Thanks.

How to initialize weight matrices for RNN?
I believe people typically use random normal initialization for RNN weight matrices. Check out the example in the TensorFlow GitHub repo. The notebook is a bit long, but it contains a simple LSTM model that uses tf.truncated_normal to initialize the weights and tf.zeros to initialize the biases (I have also tried tf.ones for the biases before, and it seemed to work too). The standard deviation is a hyperparameter you can tune yourself. Weight initialization sometimes matters for gradient flow. That said, as far as I know the LSTM itself is designed to mitigate the vanishing gradient problem (and gradient clipping helps with the exploding gradient problem), so perhaps you don't need to be super careful about std_dev in an LSTM. I've read papers recommending Xavier initialization (TF API doc for Xavier initializer) in the context of convolutional neural networks. I don't know whether people use it for RNNs, but you could try it there too and see if it helps.
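To make the Xavier idea concrete, here is a minimal sketch in plain Python (no TensorFlow), assuming the common uniform variant where values are drawn from U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out)). The function name xavier_uniform and the sizes 27 and 64 are made up for illustration; this is not the code of the TF initializer.

```python
import math
import random


def xavier_uniform(fan_in, fan_out, rng=None):
    """Return a fan_in x fan_out matrix (list of rows) drawn from U(-limit, limit).

    limit = sqrt(6 / (fan_in + fan_out)) is the Glorot/Xavier uniform bound.
    """
    rng = rng or random.Random()
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]


# Hypothetical sizes: 27 input features, 64 hidden units.
w_input_to_hidden = xavier_uniform(27, 64, random.Random(0))
w_hidden_to_hidden = xavier_uniform(64, 64, random.Random(1))
```

The bound shrinks as the layer gets wider, which is the point of the scheme: the spread of the weights is tied to the matrix shape instead of being a hand-tuned std_dev.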
Now to follow up on @Allen's answer and the follow-up question you left in the comments.
How to control initialization with variable scope?
I'll use the simple LSTM model in the TensorFlow GitHub Python notebook that I linked to as an example.
Specifically, if I wanted to refactor the LSTM part of the code in the above picture using variable scope control, I might write something like the following...
import tensorflow as tf

def initialize_LSTMcell(vocabulary_size, num_nodes, initializer):
    '''initialize LSTMcell weights and biases, set variables to reuse mode'''
    gates = ['input_gate', 'forget_gate', 'memory_cell', 'output_gate']
    with tf.variable_scope('LSTMcell') as scope:
        for gate in gates:
            with tf.variable_scope(gate) as gate_scope:
                # pass the initializer by keyword: the third positional argument of get_variable is dtype
                wx = tf.get_variable("wx", [vocabulary_size, num_nodes], initializer=initializer)
                wt = tf.get_variable("wt", [num_nodes, num_nodes], initializer=initializer)
                bi = tf.get_variable("bi", [1, num_nodes], initializer=tf.constant_initializer(0.0))
                # this line can probably be omitted, because setting the 'LSTMcell' scope to reuse
                # (see below) turns on reuse mode for all of its child scopes
                gate_scope.reuse_variables()
        scope.reuse_variables()

def get_scope_variables(scope_name, variable_names):
    '''a helper function to fetch variables based on scope_name and variable_names'''
    scope_vars = {}
    with tf.variable_scope(scope_name, reuse=True):
        for var_name in variable_names:
            scope_vars[var_name] = tf.get_variable(var_name)
    return scope_vars

def LSTMcell(i, o, state):
    '''a function for performing LSTMcell computation'''
    gates = ['input_gate', 'forget_gate', 'memory_cell', 'output_gate']
    var_names = ['wx', 'wt', 'bi']
    gate_comp = {}
    with tf.variable_scope('LSTMcell', reuse=True):
        for gate in gates:
            gate_vars = get_scope_variables(gate, var_names)
            gate_comp[gate] = tf.matmul(i, gate_vars['wx']) + tf.matmul(o, gate_vars['wt']) + gate_vars['bi']
    state = tf.sigmoid(gate_comp['forget_gate']) * state + tf.sigmoid(gate_comp['input_gate']) * tf.tanh(gate_comp['memory_cell'])
    output = tf.sigmoid(gate_comp['output_gate']) * tf.tanh(state)
    return output, state
The usage of the refactored code would be something like the following...
initialize_LSTMcell(vocabulary_size, num_nodes, tf.truncated_normal_initializer(mean=-0.1, stddev=.01))
#...Doing some computation...
LSTMcell(input_tensor, output_tensor, state)
The refactored code may look less straightforward, but using variable scope control ensures scope encapsulation and allows flexible control over the variables (in my opinion at least).
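For reference, the computation in the LSTMcell function can be written out as equations. The notation is my own shorthand, not something from the notebook: x_t stands for the argument i, h_{t-1} for o, c_{t-1} for state, and W_x, W_t, b for the per-gate variables wx, wt, bi.

```latex
\begin{aligned}
a_t^{(g)} &= x_t W_x^{(g)} + h_{t-1} W_t^{(g)} + b^{(g)},
  \qquad g \in \{\text{input},\ \text{forget},\ \text{memory},\ \text{output}\} \\
c_t &= \sigma\!\left(a_t^{(\text{forget})}\right) \odot c_{t-1}
     + \sigma\!\left(a_t^{(\text{input})}\right) \odot \tanh\!\left(a_t^{(\text{memory})}\right) \\
h_t &= \sigma\!\left(a_t^{(\text{output})}\right) \odot \tanh\!\left(c_t\right)
\end{aligned}
```

Here \sigma is the logistic sigmoid and \odot is the element-wise product, matching tf.sigmoid and * in the code.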
How to pre-train some LSTMs and use their weights in a subsequent model without saving all the states and restoring the entire model?
Assuming you have a pre-trained model frozen and loaded in, if you want to use its frozen 'wx', 'wt' and 'bi', you can simply find their parent scope names and variable names, then fetch the variables using a structure similar to the get_scope_variables function.
with tf.variable_scope(scope_name, reuse=True):
    var = tf.get_variable(var_name)
Here is a link to understanding variable scope and sharing variables. I hope this is helpful.

The RNN models will create their variables with get_variable, and you can control the initialization by wrapping the code which creates those variables with a variable_scope and passing a default initializer to it. Unless the RNN specifies one explicitly (looking at the code, it doesn't), uniform_unit_scaling_initializer is used.
You should also be able to share model weights by declaring the second model and passing reuse=True to its variable_scope. As long as the namespaces match up, the new model will get the same variables as the first model.
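The name-based sharing described above can be illustrated with a toy variable store in plain Python. This is a sketch of the idea only, under the assumption that a variable is identified by "scope/name"; ToyVariableStore and its method are hypothetical and are not TensorFlow code.

```python
class ToyVariableStore:
    """Toy stand-in for name-based variable sharing (not TensorFlow code)."""

    def __init__(self):
        self._vars = {}

    def get_variable(self, scope, name, initializer=None, reuse=False):
        full_name = scope + "/" + name
        if reuse:
            # In reuse mode the variable must already exist under this name.
            if full_name not in self._vars:
                raise ValueError("Variable %s does not exist" % full_name)
            return self._vars[full_name]
        # Outside reuse mode, creating the same name twice is an error.
        if full_name in self._vars:
            raise ValueError("Variable %s already exists" % full_name)
        self._vars[full_name] = initializer()
        return self._vars[full_name]


store = ToyVariableStore()
# The first model creates the weights under the 'rnn' namespace.
w_first = store.get_variable("rnn", "weights", initializer=lambda: [0.1, 0.2])
# The second model uses the same namespace with reuse=True and gets the same object.
w_second = store.get_variable("rnn", "weights", reuse=True)
```

If the second model used a different namespace, the lookup with reuse=True would fail in this toy, which is why the namespaces have to match up.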

A simple way to initialize all kernel weights with a certain initializer is to pass the initializer to tf.variable_scope(). For example:
with tf.variable_scope('rnn', initializer=tf.variance_scaling_initializer()):
    basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
    outputs, state = tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32)

Related

Getting Gradients of Each Layer in Keras 2

I've been struggling for days just to view layers' gradients in the debug mode of Keras 2. Needless to say, I have already tried code such as:
import keras.backend as K
gradients = K.gradients(model.output, model.input)
sess = tf.compat.v1.keras.backend.get_session()
evaluated_gradients = sess.run(gradients, feed_dict={model.input:images})
or
evaluated_gradients = sess.run(gradients, feed_dict={model.input.experimental_ref():images})
or
with tf.compat.v1.Session(graph=tf.compat.v1.keras.backend.get_default_graph()):
or similar approaches using
tf.compat.v1
which all lead to the following error:
RuntimeError: The Session graph is empty. Add operations to the graph before calling run().
I assume this is the most basic tool any deep learning package should provide, so it is strange that there seems to be no easy way to do it in Keras 2. Any ideas?
You can try to do this on TF 2 with eager mode on.
Please note that you need to use tf.keras for everything, including your model, layers, etc. For this to work you can never use keras alone; it must be tf.keras. This means, for instance, using tf.keras.layers.Dense, tf.keras.models.Sequential, etc.
input_images_tensor = tf.constant(input_images_numpy)
with tf.GradientTape() as g:
    g.watch(input_images_tensor)
    output_tensor = model(input_images_tensor)
gradients = g.gradient(output_tensor, input_images_tensor)
If you are going to calculate the gradients more than once with the same tape, you need to create the tape with persistent=True and delete it manually after you get the gradients. (See details at the link below.)
You can get the gradients with respect to any "trainable" weight without calling watch. If you want gradients with respect to non-trainable tensors (such as the input images), then you must call g.watch as above for each of these tensors.
More details on GradientTape: https://www.tensorflow.org/api_docs/python/tf/GradientTape

Eager execution get trainable variables

In all the tutorials I have seen about tfe (including the official TF docs), the example uses the gradient tape and manually adds all the variables to the list passed to the gradient computation, e.g.
variables = [w1, b1, w2, b2]  # <--- manually store all the variables
optimizer = tf.train.AdamOptimizer()
with tf.GradientTape() as tape:
    y_pred = model.predict(x, variables)
    loss = model.compute_loss(y_pred, y)
grads = tape.gradient(loss, variables)  # <--- send them to tape.gradient
optimizer.apply_gradients(zip(grads, variables))
But is it the only way? Even for huge models, do we need to accumulate all the parameters ourselves, or can we somehow access the default graph's variable list?
Trying to access tf.get_default_graph().get_collection(tf.GraphKeys.GLOBAL_VARIABLES)
or trainable_variables inside a tfe session gave the empty list.
To the best of my understanding, eager mode in TensorFlow stores information about the model in objects, for example in tf.keras.Model or tf.estimator.Estimator. In the absence of a graph, you can get the list of variables only from there, for example using tf.keras.Model.trainable_variables.
Eager mode can, however, work with a graph object created explicitly. In that case, I think the graph will store the list of variables. Without it, the Keras model object is the only explicit storage for the variables.

Using tf.train.Saver() on convolutional layers in tensorflow

I'm attempting to use tf.train.Saver() to apply transfer learning between two convolutional neural network graphs in tensorflow and I'd like to validate that my methods are working as expected. Is there a way to inspect the trainable features in a tf.layers.conv2d() layer?
My methods
1. initialize layer
conv1 = tf.layers.conv2d(inputs=X_reshaped, filters=conv1_fmaps, kernel_size=conv1_ksize,
strides=conv1_stride, padding=conv1_pad,
activation=tf.nn.relu,
kernel_initializer=tf.contrib.layers.xavier_initializer(),
bias_initializer=tf.zeros_initializer(), trainable=True,
name="conv1")
2. {Train the network}
3. Save current graph
tf.train.Saver().save(sess, "./my_model_final.ckpt")
4. Build new graph that includes the same layer, load specified weights with Saver()
reuse_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES,
scope="conv[1]")
reuse_vars_dict = dict([(var.op.name, var) for var in reuse_vars])
restore_saver = tf.train.Saver(reuse_vars_dict)
...
restore_saver.restore(sess, "./my_model_final.ckpt")
5. {Train and evaluate the new graph}
My Question:
1) My code works 'as expected' and without error, but I'm not 100% confident it's working like I think it is. Is there a way to print the trainable features from a layer to ensure that I'm loading and saving weights correctly? Is there a "better" way to save/load parameters with the tf.layers API? I noticed a request on GitHub related to this. Ideally, I'd like to check these values on the first graph a) after initialization b) after training and on the new graph i) after loading the weights ii) after training/evaluation.
Is there a way to print the trainable features from a layer to ensure that I'm loading and saving weights correctly?
Yes, you first need to get a handle on the layer's variables. There are several ways to do that, but arguably the simplest is using the get_collection() function:
conv1_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES,
scope="conv1")
Note that the scope here is treated as a regular expression, so you can write things like conv[123] if you want all variables from scopes conv1, conv2 and conv3.
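As a small illustration of what regular-expression matching on the scope implies, here is a sketch in plain Python. It assumes the filter behaves like Python's re.match, which anchors only at the start of the name; the variable names and the helper filter_by_scope are made up for the example.

```python
import re

# Hypothetical variable names, as they might appear in a graph.
names = ["conv1/kernel:0", "conv1/bias:0", "conv2/kernel:0",
         "conv10/kernel:0", "dense/kernel:0"]


def filter_by_scope(names, scope):
    """Keep the names for which re.match(scope, name) succeeds (match at the start)."""
    pattern = re.compile(scope)
    return [n for n in names if pattern.match(n)]


matched_pattern = filter_by_scope(names, "conv[12]")
matched_exact = filter_by_scope(names, "conv1/")
```

Under that assumption, a pattern such as conv[12] also picks up conv10/kernel:0, because the match only needs to succeed at the beginning of the name. Adding the trailing slash, as in conv1/, restricts the match to that one scope.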
If you just want trainable variables, you can replace GLOBAL_VARIABLES with TRAINABLE_VARIABLES.
If you just want to check a single variable, such as the layer's kernel, then you can use get_tensor_by_name() like this:
graph = tf.get_default_graph()
kernel_var = graph.get_tensor_by_name("conv1/kernel:0")
Yet another option is to just iterate on all variables and filter based on their names:
conv1_vars = [var for var in tf.global_variables()
if var.op.name.startswith("conv1/")]
Once you have a handle on these variables, you can just evaluate them at different points, e.g. just after initialization, just after restoring the graph, just after training, and so on, and compare the values. For example, this is how you would get the values just after initialization:
with tf.Session() as sess:
    init.run()
    conv1_var_values_after_init = sess.run(conv1_vars)
Then once you have captured the variable values at the various points that you are interested in, you can check whether or not they are equal (or close enough, taking into account tiny floating point imprecisions) like so:
same = np.allclose(conv1_var_values_after_training,
conv1_var_values_after_restore)
Is there a "better" way to save/load parameters with the tf.layers API?
Not that I'm aware of. The feature request you point to is not really about saving/loading the parameters to disk, but rather to be able to easily get a handle on a layer's variables, and to easily create an assignment node to set their values.
For example, it will be possible (in TF 1.4) to get a handle on a layer's kernel and get its value very simply, like this:
conv1_kernel_value = conv1.kernel.eval()
Of course, you can use this to get/set a variable's value and load/save it to disk, like this:
conv1 = tf.layers.conv2d(...)
new_kernel = tf.placeholder(...)
assign_kernel = conv1.kernel.assign(new_kernel)
init = tf.global_variables_initializer()
with tf.Session() as sess:
    init.run()
    loaded_kernel = my_function_to_load_kernel_value_from_disk(...)
    assign_kernel.run(feed_dict={new_kernel: loaded_kernel})
    ...
It's not pretty. It might be useful if you want to load/save to a database (instead of a flat file), but in general I would recommend using a Saver.
I hope this helps.

Reusing part of a tensorflow trained graph

So, I trained a tensorflow model with a few layers, more or less like this:
with tf.variable_scope('model1') as scope:
    inputs = tf.placeholder(tf.int32, [None, num_time_steps])
    embeddings = tf.get_variable('embeddings', (vocab_size, embedding_size))
    lstm = tf.nn.rnn_cell.LSTMCell(lstm_units)
    embedded = tf.nn.embedding_lookup(embeddings, inputs)
    _, state = tf.nn.dynamic_rnn(lstm, embedded, dtype=tf.float32, scope=scope)
    # more stuff on the state
Now, I wanted to reuse the embedding matrix and the lstm weights in another model, which is very different from this one except for these two components.
As far as I know, if I load them with a tf.train.Saver object, it will look for variables with the exact same names, but I'm using different variable_scopes in the two graphs.
In this answer, it is suggested to create the graph where the LSTM is trained as a superset of the other one, but I don't think it is possible in my case, given the differences in the two models. Anyway, I don't think it is a good idea to make one graph dependent on the other, if they do independent things.
I thought about changing the variable scope of the LSTM weights and embeddings in the serialized graph. I mean, where it originally read model1/Weights:0 or something, it would be another_scope/Weights:0. Is it possible and feasible?
Of course, if there is a better solution, it is also welcome.
I found out that the Saver can be initialized with a dictionary mapping variable names (without the trailing :0) in the serialized file to the variable objects I want to restore in the graph. For example:
varmap = {'model1/some_scope/weights': variable_in_model2,
'model1/another_scope/weights': another_variable_in_model2}
saver = tf.train.Saver(varmap)
saver.restore(sess, path_to_saved_file)
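Writing such a dictionary by hand gets tedious when many variables move from one scope to another. Below is a sketch in plain Python of how the name translation could be automated; the helper build_restore_map and the checkpoint names are hypothetical. It only maps names to names, so for a real Saver you would still need to replace each value with the matching variable object from the new graph.

```python
def build_restore_map(checkpoint_names, old_scope, new_scope):
    """Map each checkpoint name under old_scope to its name under new_scope."""
    prefix = old_scope + "/"
    mapping = {}
    for ckpt_name in checkpoint_names:
        if ckpt_name.startswith(prefix):
            # Swap the leading scope and keep the rest of the name unchanged.
            mapping[ckpt_name] = new_scope + "/" + ckpt_name[len(prefix):]
    return mapping


# Hypothetical names stored in the checkpoint (without the trailing ':0').
ckpt_names = ["model1/embeddings", "model1/rnn/kernel", "global_step"]
name_map = build_restore_map(ckpt_names, "model1", "another_scope")
```

Names outside the old scope, such as global_step here, are left out of the map and would therefore not be restored.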

Tensorflow RNN input size

I am trying to use tensorflow to create a recurrent neural network. My code is something like this:
import tensorflow as tf
rnn_cell = tf.nn.rnn_cell.GRUCell(3)
inputs = [tf.constant([[0, 1]], dtype=tf.float32), tf.constant([[2, 3]], dtype=tf.float32)]
outputs, end = tf.nn.rnn(rnn_cell, inputs, dtype=tf.float32)
Now, everything runs just fine. However, I am rather confused by what is actually going on. The output dimensions are always the batch size x the size of the rnn cell's hidden state - how can they be completely independent of the input size?
If my understanding is correct, the inputs are concatenated to the rnn's hidden state at each step, and then multiplied by a weight matrix (among other operations). This means that the dimensions of the weight matrix need to depend on the input size, which is impossible, because the rnn_cell is created before the inputs are even declared!
After seeing the answer to a question about tensorflow's GRU implementation, I've realized what's going on. Counter to my intuition, the GRUCell constructor doesn't create any weight or bias variables at all. Instead, it creates its own variable scope, and then instantiates the variables on demand when actually called. Tensorflow's variable scoping mechanism ensures that the variables are only created once, and shared across subsequent calls to the GRU.
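The create-on-first-call behaviour can be mimicked with a toy class in plain Python. This is a sketch of the idea only: LazyCell is hypothetical, it applies a plain linear map instead of a real GRU update, and it stores the weights on the object where TensorFlow uses variable scopes.

```python
import random


class LazyCell:
    """Toy cell that allocates its weights on first call (not TensorFlow's GRUCell)."""

    def __init__(self, num_units):
        self.num_units = num_units
        self.weights = None  # nothing is allocated at construction time

    def __call__(self, inputs, state):
        if self.weights is None:
            # The input size is only known now, so the matrix is built here:
            # (input_size + num_units) rows by num_units columns.
            rows = len(inputs) + self.num_units
            self.weights = [[random.uniform(-0.1, 0.1) for _ in range(self.num_units)]
                            for _ in range(rows)]
        # Concatenate input and state, then multiply by the weight matrix.
        concat = list(inputs) + list(state)
        return [sum(x * self.weights[r][c] for r, x in enumerate(concat))
                for c in range(self.num_units)]


cell = LazyCell(3)
first = cell([0.0, 1.0], [0.0, 0.0, 0.0])  # weights are created during this call
second = cell([2.0, 3.0], first)           # the same weights are used again
```

The output length depends only on num_units, while the number of rows in the weight matrix is derived from the first input the cell sees, which is why the constructor does not need to know the input size.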
I'm not sure why they decided to go with this rather confusing implementation, which as far as I can tell is undocumented. To me it seems more appropriate to use Python's object-level variable scoping to encapsulate the TensorFlow variables within the GRUCell itself, rather than relying on an additional implicit scoping mechanism.