How to get summary information on tensorflow RNN - tensorflow

I implemented a simple RNN using tensorflow, shown below:
cell = tf.contrib.rnn.BasicRNNCell(state_size)
cell = tf.contrib.rnn.DropoutWrapper(cell, output_keep_prob=keep_prob)
rnn_outputs, final_state = tf.nn.dynamic_rnn(cell, batch_size, dypte=tf.float32)
This works fine. But I'd like to log the weight variables to summary writer. Is there any way to do this?
By the way, do we use tf.nn.rnn_cell.BasicRNNCell or tf.contrib.rnn.BasicRNNCell? Or are they identical?

But I'd like to log the weight variables to summary writer. Is there any way to do this?
You can get a variable via tf.get_variable() function. tf.summary.histogram accepts the tensor instance, so it'd be easier to use Graph.get_tensor_by_name():
n_steps = 2
n_inputs = 3
n_neurons = 5
X = tf.placeholder(dtype=tf.float32, shape=[None, n_steps, n_inputs])
basic_cell = tf.nn.rnn_cell.BasicRNNCell(num_units=n_neurons)
outputs, states = tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32)
with tf.variable_scope('rnn', reuse=True):
print(tf.get_variable('basic_rnn_cell/kernel'))
kernel = tf.get_default_graph().get_tensor_by_name('rnn/basic_rnn_cell/kernel:0')
tf.summary.histogram('kernel', kernel)
By the way, do we use tf.nn.rnn_cell.BasicRNNCell or tf.contrib.rnn.BasicRNNCell? Or are they identical?
Yes, they are synonyms, but I prefer to use tf.nn.rnn_cell package, because everything in tf.contrib is sort of experimental and can be changed in 1.x versions.

Maxim's answer is great.
I found another approach useful for me where you don't have to provide names of weight variables. This approach uses an optimizer object and compute_gradients method.
Say, you calculate the "loss" after having (outputs, states) from dynamic_rnn call. Now get an optimizer of your choice. Say Adam,
optzr = tf.train.AdamOptimizer(learning_rate)
grads_and_vars = optzr.compute_gradients(loss)
"grads_and_vars" is A list of (gradient, variable) pairs. Now by iterating "grads_and_vars" you can have all the weights/biases and corresponding gradients if any. Like,
for grad, vars in grads_and_vars:
print (vars, vars.name)
tf.summary.histogram(vars.name, vars)
Ref:
https://www.tensorflow.org/api_docs/python/tf/train/AdamOptimizer#compute_gradients

Related

I cant understand LSTM implementation in tensorflow 1

I have been looking at an implementation of LSTM layers in a neural network architecture. An LSTM layer has been defined in it as given below. I am having trouble understanding this code. I have listed my doubts after the code snippet.
code source:https://gist.github.com/awjuliani/66e8f477fc1ad000b1314809d8523455#file-a3c-py
lstm_cell = tf.nn.rnn_cell.BasicLSTMCell(RNN_SIZE,state_is_tuple=True)
c_init = np.zeros((1, lstm_cell.state_size.c), np.float32)
h_init = np.zeros((1, lstm_cell.state_size.h), np.float32)
state_init = [c_init, h_init]
c_in = tf.placeholder(tf.float32, [1, lstm_cell.state_size.c])
h_in = tf.placeholder(tf.float32, [1, lstm_cell.state_size.h])
state_in = (c_in, h_in)
rnn_in = tf.expand_dims(self.h3, [0])
step_size = tf.shape(inputs)[:1]
state_in = tf.nn.rnn_cell.LSTMStateTuple(c_in, h_in)
lstm_outputs, lstm_state = tf.nn.dynamic_rnn(
lstm_cell, rnn_in, initial_state=state_in, sequence_length=step_size,
time_major=False)
lstm_c, lstm_h = lstm_state
state_out = (lstm_c[:1, :], lstm_h[:1, :])
self.rnn_out = tf.reshape(lstm_outputs, [-1, RNN_SIZE])
Here are my doubts:
I understand we need to initialize a random Context and hidden
vectors to pass to our first LSTM cell. But why do initialize both c_init, h_init and then c_in, h_in. What purpose do they serve?
How are they different from each other? (same for state_in and state_init?)
Why do we use LSTMStateTuple?
def work(self, max_episode_length, gamma, sess, coord, saver, dep):
........
rnn_state = self.local_AC.state_init
def train(self, rollout, sess, gamma, bootstrap_value):
......
rnn_state = self.local_AC.state_init
feed_dict = {self.local_AC.target_v: discounted_rewards,
self.local_AC.inputs: np.vstack(observations),
self.local_AC.actions: actions,
self.local_AC.advantages: advantages,
self.local_AC.state_in[0]: rnn_state[0],
self.local_AC.state_in[1]: rnn_state[1]}
At the beginning of work, and then
before training a new batch, the network state is filled with zeros
I understand we need to initialize a random Context and hidden vectors to pass to our first LSTM cell. But why do initialize both c_init, h_init, and then c_in, h_in. What purpose do they serve? How are they different from each other? (same for state_in and state_init?)
To start using LSTM, one should initialise its cell and state state - named c and h respectively. For every input, these states are considered 'empty' and should be initialised with zeros. So that, we have here
c_in = tf.placeholder(tf.float32, [1, lstm_cell.state_size.c])
h_in = tf.placeholder(tf.float32, [1, lstm_cell.state_size.h])
state_in = (c_in, h_in)
state_in = tf.nn.rnn_cell.LSTMStateTuple(c_in, h_in)
Why are there are two variables, state_in and state_init? The first is just placeholders that will be initialised with the second at the evaluation state (i.e., session.run). Because state_in doesn't contain any actual values, in other words, numpy arrays are used during the training phase and tf.placeholders during the phase when one defines an architecture of the network.
TL;DR
Why so? Well, tf1.x (was?) is quite a low-level system. It has the following entities:
tf.Session aka computational session - thing that contain a computational graph(s) and allows user to provide inputs to the graph(s) via session.run.
tf.Graph, that is a representation of a computational graph. Usually engineer defines graph using tf.placeholders and tf.Variabless. One could connect them 'just like' math operations:
with tf.Session() as sess:
a = tf.placeholder(tf.float32, (1,))
b = tf.Variable(1.0, dtype=tf.float32)
tf.global_variables_initializer()
c = a * b
# ...and so on
tf. placeholder's are placeholers, but not actual values, intended to be filled with actual values at the session.run stage. And tf.Variables, well, for the actual weights of the neural network to be optimized. Why not plain NumPy arrays, but something else? It's because TensorFlow automatically adds each tensor and placeholder as an edge to the default computational graph (it's impossible to do the same with NumPy arrays); also, it allows to define an architecture and then initialize/train it with different inputs, which is good.
So, to do a computation (forward/backward propagation, etc.), one has to set placeholders and variables to some values. To do so, in a simple example, we could do the following:
import tensorflow as tf
with tf.compat.v1.Session() as sess:
a = tf.compat.v1.placeholder(tf.float32, shape=())
b = tf.compat.v1.Variable(1.0, dtype=tf.float32)
init = tf.compat.v1.global_variables_initializer()
c = a + b
sess.run(init)
a_value = 2.0
result = sess.run([c], feed_dict={a: a_value})
print("value of [c]:", result)
(I use tf.compat.v1 instead of just tf here because I work in tf2 environment; you could omit it)
Note two things: first, I create init operation. Because in tf1.x it is not enough to initialize a variable like tf.Variable(1.0), but the user has to kinda 'notify' the framework about creating and running init operation.
Then I do a computation: I initialize an a_value variable and map it to the placeholder a' in the sess.runmethod.Session.run` requires a list of tensors to be calculated as a first argument and a mapping from placeholders necessary to compute target tensors to their actual values.
Back to your example: state_in is a placeholder and state_init contains values to be fed into this placeholder somewhere in the code.
It would look like this: less.run(..., feed_dict={state_in: state_init, ...}).
Why do we use LSTMStateTuple?
Addressing the second part of the question: it looks like TensorFlow developers implemented it for some performance optimization. From the source code:
logging.warning(
"%s: Using a concatenated state is slower and will soon be"
"deprecated. Use state_is_tuple=True.", self)
and if state_is_tuple=True, state should be a StateTuple. But I'm not 100% sure about it - I don't remember how I used it. After all, StateTuple is just a collections.namedtuple with two named attributes, c and h.

How to get weights in tf.layers.dense?

I wanna draw the weights of tf.layers.dense in tensorboard histogram, but it not show in the parameter, how could I do that?
The weights are added as a variable named kernel, so you could use
x = tf.dense(...)
weights = tf.get_default_graph().get_tensor_by_name(
os.path.split(x.name)[0] + '/kernel:0')
You can obviously replace tf.get_default_graph() by any other graph you are working in.
I came across this problem and just solved it. tf.layers.dense 's name is not necessary to be the same with the kernel's name's prefix. My tensor is "dense_2/xxx" but it's kernel is "dense_1/kernel:0". To ensure that tf.get_variable works, you'd better set the name=xxx in the tf.layers.dense function to make two names owning same prefix. It works as the demo below:
l=tf.layers.dense(input_tf_xxx,300,name='ip1')
with tf.variable_scope('ip1', reuse=True):
w = tf.get_variable('kernel')
By the way, my tf version is 1.3.
The latest tensorflow layers api creates all the variables using the tf.get_variable call. This ensures that if you wish to use the variable again, you can just use the tf.get_variable function and provide the name of the variable that you wish to obtain.
In the case of a tf.layers.dense, the variable is created as: layer_name/kernel. So, you can obtain the variable by saying:
with tf.variable_scope("layer_name", reuse=True):
weights = tf.get_variable("kernel") # do not specify
# the shape here or it will confuse tensorflow into creating a new one.
[Edit]: The new version of Tensorflow now has both Functional and Object-Oriented interfaces to the layers api. If you need the layers only for computational purposes, then using the functional api is a good choice. The function names start with small letters for instance -> tf.layers.dense(...). The Layer Objects can be created using capital first letters e.g. -> tf.layers.Dense(...). Once you have a handle to this layer object, you can use all of its functionality. For obtaining the weights, just use obj.trainable_weights this returns a list of all the trainable variables found in that layer's scope.
I am going crazy with tensorflow.
I run this:
sess.run(x.kernel)
after training, and I get the weights.
Comes from the properties described here.
I am saying that I am going crazy because it seems that there are a million slightly different ways to do something in tf, and that fragments the tutorials around.
Is there anything wrong with
model.get_weights()
After I create a model, compile it and run fit, this function returns a numpy array of the weights for me.
In TF 2 if you're inside a #tf.function (graph mode):
weights = optimizer.weights
If you're in eager mode (default in TF2 except in #tf.function decorated functions):
weights = optimizer.get_weights()
in TF2 weights will output a list in length 2
weights_out[0] = kernel weight
weights_out[1] = bias weight
the second layer weight (layer[0] is the input layer with no weights) in a model in size: 50 with input size: 784
inputs = keras.Input(shape=(784,), name="digits")
x = layers.Dense(50, activation="relu", name="dense_1")(inputs)
x = layers.Dense(50, activation="relu", name="dense_2")(x)
outputs = layers.Dense(10, activation="softmax", name="predictions")(x)
model = keras.Model(inputs=inputs, outputs=outputs)
model.compile(...)
model.fit(...)
kernel_weight = model.layers[1].weights[0]
bias_weight = model.layers[1].weights[1]
all_weight = model.layers[1].weights
print(len(all_weight)) # 2
print(kernel_weight.shape) # (784,50)
print(bias_weight.shape) # (50,)
Try to make a loop for getting the weight of each layer in your sequential network by printing the name of the layer first which you can get from:
model.summary()
Then u can get the weight of each layer running this code:
for layer in model.layers:
print(layer.name)
print(layer.get_weights())

Running the same RNN over two tensors in tensorflow

I'd like to run the same RNN over two tensors in tensorflow. My current solution looks like this:
cell = tf.nn.rnn_cell.GRUCell(cell_size)
with tf.variable_scope("encoder", reuse=None):
out1 = tf.nn.dynamic_rnn(cell, tensor1, dtype=tf.float32)
with tf.variable_scope("encoder", reuse=True):
out2 = tf.nn.dynamic_rnn(cell, tensor2, dtype=tf.float32)
Is this is the best way to ensure that the weights between the two RNN ops are shared?
Yeah that is basically how I would do it. For a really simple model like this it does not matter much but for a more complicated model I would define a function to build the graph.
def makeEncoder(input_tensor):
cell = tf.nn.rnn_cell.GRUCell(cell_size)
return tf.nn.dynamic_rnn(cell, tensor1, dtype=tf.float32)
with tf.variable_scope('encoder') as scope:
out1 = makeEncoder(tensor1)
scope.reuse_variables()
out2 = makeEncoder(tensor2)
The other way to do it would be to use tf.cond(...) as a switch to change between the inputs based on a boolean placeholder. They would then go to just one output. I have found that this can get a bit messy. Also you would need to provide both inputs even if you really only need one. I think my first solution is the best.

Tensorflow: optimize over input with gradient descent

I have a TensorFlow model (a convolutional neural network) which I successfully trained using gradient descent (GD) on some input data.
Now, in a second step, I would like to provide an input image as initialization then and optimize over this input image with fixed network parameters using GD. The loss function will be a different one, but this a detail.
So, my main question is how to tell the gradient descent algorithm to
stop optimizing the network parameters
to optimize over the input image
The first can probably done with this
Holding variables constant during optimizer
Do you guys have ideas about the second point?
I guess I can recode the gradient descent algorithm myself using the TF gradient function, but my gut feeling tells me that there should be an easier way, which also allows me to benefit from more complex GD variants (Adam etc.).
No need for your SDG own implementation. TensorFlow provides all functions:
import tensorflow as tf
import numpy as np
# some input
data_pldhr = tf.placeholder(tf.float32)
img_op = tf.get_variable('input_image', [1, 4, 4, 1], dtype=tf.float32, trainable=True)
img_assign = img_op.assign(data_pldhr)
# your starting image
start_value = (np.ones((4, 4), dtype=np.float32) + np.eye(4))[None, :, :, None]
# override variable_getter
def nontrainable_getter(getter, *args, **kwargs):
kwargs['trainable'] = False
return getter(*args, **kwargs)
# all variables in this scope are not trainable
with tf.variable_scope('myscope', custom_getter=nontrainable_getter):
x = tf.layers.dense(img_op, 10)
y = tf.layers.dense(x, 10)
# the usual stuff
cost_op = tf.losses.mean_squared_error(x, y)
train_op = tf.train.AdamOptimizer(0.1).minimize(cost_op)
# fire up the training process
with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
sess.run(img_assign, {data_pldhr: start_value})
print(sess.run(img_op))
for i in range(10):
_, c = sess.run([train_op, cost_op])
print(c)
print(sess.run(img_op))
represent an image as tf.Variable with trainable=True
initialise this variable with the starting image (initial guess)
recreate the NN graph using TF variables with trainable=False and copy the weights from the trained NN graph using tf.assign
calculate the loss function
plug the loss into any TF optimiser algorithm you want
Another alternative is to use ScipyOptimizerInterface, which allows to use scipy's minimizer. This supports constrained minimization.
I'm looking for a solution to the same problem, but my model is not an easy one as I have an LSTM network with cells created with MultiRNNCell, I don't think it is possible to get the weight and clone the network. Is there any workaround so that I can compute the gradient wrt the input?

Weights of Seq2Seq Models

I went through the code and I'm afraid I don't grasp an important point.
I can't seem to find the weights matrix of the model for the encoder and decoder, neither where they are updated. I found the target_weights but it seems to be reinitialized at every get_batch() call so I don't really understand what they stand for either.
My actual goal is to concatenate two hidden states of two source encoders for one decoder by applying a linear transformation with a weight matrix that I'll have to train along with the model (I'm building a manytoone model), but I have no idea where to start because of my problem mentionned above.
This might help you start. There are a couple of models implemented in tensorflow.python.ops.seq2seq.py (with/without buckets, attention, etc.) but take a look at the definition for embedding_attention_seq2seq (which is the one called in their example model seq2seq_model.py that you seem to be referencing):
def embedding_attention_seq2seq(encoder_inputs, decoder_inputs, cell,
num_encoder_symbols, num_decoder_symbols,
num_heads=1, output_projection=None,
feed_previous=False, dtype=dtypes.float32,
scope=None, initial_state_attention=False):
with variable_scope.variable_scope(scope or "embedding_attention_seq2seq"):
# Encoder.
encoder_cell = rnn_cell.EmbeddingWrapper(cell, num_encoder_symbols)
encoder_outputs, encoder_state = rnn.rnn(
encoder_cell, encoder_inputs, dtype=dtype)
# First calculate a concatenation of encoder outputs to put attention on.
top_states = [array_ops.reshape(e, [-1, 1, cell.output_size])
for e in encoder_outputs]
attention_states = array_ops.concat(1, top_states)
....
You can see where it picks out the top layer of encoder outputs as top_states before handing them off to the decoder.
So you could implement a similar function with two encoders and concatenate those states before handing off to the decoder.
The value created in the get_batch function is only used for the first iteration. Even though the weights are passed every time into the function, their value gets updated as a global variable in the Seq2Seq model class in the init function.
with tf.name_scope('Optimizer'):
# Gradients and SGD update operation for training the model.
params = tf.trainable_variables()
if not forward_only:
self.gradient_norms = []
self.updates = []
opt = tf.train.GradientDescentOptimizer(self.learning_rate)
for b in range(len(buckets)):
gradients = tf.gradients(self.losses[b], params)
clipped_gradients, norm = tf.clip_by_global_norm(gradients,
max_gradient_norm)
self.gradient_norms.append(norm)
self.updates.append(opt.apply_gradients(
zip(clipped_gradients, params), global_step=self.global_step))
self.saver = tf.train.Saver(tf.global_variables())
The weights are fed seperately as a place-holder because they are normalized in the get_batch function to create zero weights for the PAD inputs.
# Batch decoder inputs are re-indexed decoder_inputs, we create weights.
for length_idx in range(decoder_size):
batch_decoder_inputs.append(
np.array([decoder_inputs[batch_idx][length_idx]
for batch_idx in range(self.batch_size)], dtype=np.int32))
# Create target_weights to be 0 for targets that are padding.
batch_weight = np.ones(self.batch_size, dtype=np.float32)
for batch_idx in range(self.batch_size):
# We set weight to 0 if the corresponding target is a PAD symbol.
# The corresponding target is decoder_input shifted by 1 forward.
if length_idx < decoder_size - 1:
target = decoder_inputs[batch_idx][length_idx + 1]
if length_idx == decoder_size - 1 or target == data_utils.PAD_ID:
batch_weight[batch_idx] = 0.0
batch_weights.append(batch_weight)