Tensorflow RNN-LSTM - reset hidden state

Tensorflow RNN-LSTM - reset hidden state - tensorflow

I'm building a statefull LSTM used for language recognition.
Being statefull I can train the network with smaller files and a new batch will be like a next sentence in a discussion.
However for the network to be properly trained I need to reset the hidden state of the LSTM between some batches.
I'm using a variable to store the hidden_state of the LSTM for performance :
with tf.variable_scope('Hidden_state'):
hidden_state = tf.get_variable("hidden_state", [self.num_layers, 2, self.batch_size, self.hidden_size],
tf.float32, initializer=tf.constant_initializer(0.0), trainable=False)
# Arrange it to a tuple of LSTMStateTuple as needed
l = tf.unstack(hidden_state, axis=0)
rnn_tuple_state = tuple([tf.contrib.rnn.LSTMStateTuple(l[idx][0], l[idx][1])
for idx in range(self.num_layers)])
# Build the RNN
with tf.name_scope('LSTM'):
rnn_output, _ = tf.nn.dynamic_rnn(cell, rnn_inputs, sequence_length=input_seq_lengths,
initial_state=rnn_tuple_state, time_major=True)
Now I'm confused on how to reset the hidden state. I've tried two solutions but it's not working :
First solution
Reset the "hidden_state" variable with :
rnn_state_zero_op = hidden_state.assign(tf.zeros_like(hidden_state))
It does work and I think it's because the unstack and tuple construction are not "re-played" into the graph after running the rnn_state_zero_op operation.
Second solution
Following LSTMStateTuple vs cell.zero_state() for RNN in Tensorflow I tried to reset the cell state with :
rnn_state_zero_op = cell.zero_state(self.batch_size, tf.float32)
It doesn't seem to work either.
Question
I've another solution in mind but it's guessing at best : I'm not keeping the state returned by tf.nn.dynamic_rnn, I've thought of it but I get a tuple and I can't find a way to build an op to reset the tuple.
At this point I've to admit that I don't quite understand the internal working of tensorflow and if it's even possible to do what I'm trying to do.
Is there a proper way to do it ?
Thanks !

Thanks to this answer to another question I was able to find a way to have complete control on whether or not (and when) the internal state of the RNN should be reset to 0.
First you need to define some variables to store the state of the RNN, this way you will have control over it :
with tf.variable_scope('Hidden_state'):
state_variables = []
for state_c, state_h in cell.zero_state(self.batch_size, tf.float32):
state_variables.append(tf.nn.rnn_cell.LSTMStateTuple(
tf.Variable(state_c, trainable=False),
tf.Variable(state_h, trainable=False)))
# Return as a tuple, so that it can be fed to dynamic_rnn as an initial state
rnn_tuple_state = tuple(state_variables)
Note that this version define directly the variables used by the LSTM, this is much better than the version in my question because you don't have to unstack and build the tuple, which add some ops to the graph that you cannot run explicitly.
Secondly build the RNN and retrieve the final state :
# Build the RNN
with tf.name_scope('LSTM'):
rnn_output, new_states = tf.nn.dynamic_rnn(cell, rnn_inputs,
sequence_length=input_seq_lengths,
initial_state=rnn_tuple_state,
time_major=True)
So now you have the new internal state of the RNN. You can define two ops to manage it.
The first one will update the variables for the next batch. So in the next batch the "initial_state" of the RNN will be fed with the final state of the previous batch :
# Define an op to keep the hidden state between batches
update_ops = []
for state_variable, new_state in zip(rnn_tuple_state, new_states):
# Assign the new state to the state variables on this layer
update_ops.extend([state_variable[0].assign(new_state[0]),
state_variable[1].assign(new_state[1])])
# Return a tuple in order to combine all update_ops into a single operation.
# The tuple's actual value should not be used.
rnn_keep_state_op = tf.tuple(update_ops)
You should add this op to your session anytime you want to run a batch and keep the internal state.
Beware : if you run batch 1 with this op called then batch 2 will start with the batch 1 final state, but if you don't call it again when running batch 2 then batch 3 will start with batch 1 final state also. My advice is to add this op every time you run the RNN.
The second op will be used to reset the internal state of the RNN to zeros:
# Define an op to reset the hidden state to zeros
update_ops = []
for state_variable in rnn_tuple_state:
# Assign the new state to the state variables on this layer
update_ops.extend([state_variable[0].assign(tf.zeros_like(state_variable[0])),
state_variable[1].assign(tf.zeros_like(state_variable[1]))])
# Return a tuple in order to combine all update_ops into a single operation.
# The tuple's actual value should not be used.
rnn_state_zero_op = tf.tuple(update_ops)
You can call this op whenever you want to reset the internal state.

Simplified version of AMairesse post for one LSTM layer:
zero_state = tf.zeros(shape=[1, units[-1]])
self.c_state = tf.Variable(zero_state, trainable=False)
self.h_state = tf.Variable(zero_state, trainable=False)
self.init_encoder = tf.nn.rnn_cell.LSTMStateTuple(self.c_state, self.h_state)
self.output_encoder, self.state_encoder = tf.nn.dynamic_rnn(cell_encoder, layer, initial_state=self.init_encoder)
# save or reset states
self.update_ops += [self.c_state.assign(self.state_encoder.c, use_locking=True)]
self.update_ops += [self.h_state.assign(self.state_encoder.h, use_locking=True)]
or you can use replacement for init_encoder to reset states at step == 0 (you need to pass self.step_tf into session.run() as placeholder):
self.step_tf = tf.placeholder_with_default(tf.constant(-1, dtype=tf.int64), shape=[], name="step")
self.init_encoder = tf.cond(tf.equal(self.step_tf, 0),
true_fn=lambda: tf.nn.rnn_cell.LSTMStateTuple(zero_state, zero_state),
false_fn=lambda: tf.nn.rnn_cell.LSTMStateTuple(self.c_state, self.h_state))

Related

tf.zeros vs tf.placeholder as RNN initial state

Tensorflow newbie here! I understand that Variables will be trained over time, placeholders are used input data that doesn't change as your model trains (like input images, and class labels for those images).
I'm trying to implement the forward propagation of RNN using Tensorflow, and wondering on what type I should save the output of the RNN cell. In numpy RNN implementation, it uses
hiddenStates = np.zeros((T, self.hidden_dim)) #T is the length of the sequence
Then it iteratively saves the output in the np.zeros array.
In case of TF, which one should I use, tf.zeros or tf.placeholder?
What is the best practice in this case? I think it should be fine to use tf.zeros but wanted to double check.

First of all, it is important to you to understand that everything inside Tensorflow is a Tensor. So when you are performing some kind of computation (e.g. an rnn implementation like outputs = rnn(...)) the output of this computation is returned as a Tensor. So you don't need to store it inside any kind of structure. You can retrieve it by running the correspondent node (i.e. output) like session.run(output, feed_dict).
Told this, I think you need to take the final state of an RNN and provide it as initial state of a subsequent computation. Two ways:
A) If you are using RNNCell implementations During the construction of your model you can construct the zero state like this:
cell = (some RNNCell implementation)
initial_state = cell.zero_state(batch_size, tf.float32)
B) If you are uimplementing your own staff Define the state as a zero Tensor:
initial_state = tf.zeros([batch_size, hidden_size])
Then, in both cases you will have something like:
output, final_state = rnn(input, initial_state)
In your execution loop you can initialize your state first and then provide the final_state as initial_stateinside your feed_dict:
state = session.run(initial_state)
for step in range(epochs):
feed_dict = {initial_state: state}
_, state = session.run((train_op,final_state), feed_dict)
How you actually construct your feed_dict depends on the implementation of the RNN.
For an BasicLSTMCell, for example, a state is an LSTMState object and you need to provide both c and h:
feed_dict = {initial_state.c=state.c, initial_state.h: state.h}

How to initialize a keras tensor employed in an API model

I am trying to implemente a Memory-augmented neural network, in which the memory and the read/write/usage weight vectors are updated according to a combination of their previous values. These weigths are different from the classic weight matrices between layers that are automatically updated with the fit() function! My problem is the following: how can I correctly initialize these weights as keras tensors and use them in the model? I explain it better with the following simplified example.
My API model is something like:
input = Input(shape=(5,6))
controller = LSTM(20, activation='tanh',stateful=False, return_sequences=True)(input)
write_key = Dense(4,activation='tanh')(controller)
read_key = Dense(4,activation='tanh')(controller)
w_w = Add()([w_u, w_r]) #<---- UPDATE OF WRITE WEIGHTS
to_write = Dot()([w_w, write_key])
M = Add()([M,to_write])
cos_sim = Dot()([M,read_key])
w_r = Lambda(lambda x: softmax(x,axis=1))(cos_sim) #<---- UPDATE OF READ WEIGHTS
w_u = Add()([w_u,w_r,w_w]) #<---- UPDATE OF USAGE WEIGHTS
retrieved_memory = Dot()([w_r,M])
controller_output = concatenate([controller,retrieved_memory])
final_output = Dense(6,activation='sigmoid')(controller_output)`
You can see that, in order to compute w_w^t, I have to have first defined w_r^{t-1} and w_u^{t-1}. So, at the beginning I have to provide a valid initialization for these vectors. What is the best way to do it? The initializations I would like to have are:
M = K.variable(numpy.zeros((10,4))) # MEMORY
w_r = K.variable(numpy.zeros((1,10))) # READ WEIGHTS
w_u = K.variable(numpy.zeros((1,10))) # USAGE WEIGHTS`
But, analogously to what said in #2486(entron), these commands do not return a keras tensor with all the needed meta-data and so this returns the following error:
AttributeError: 'NoneType' object has no attribute 'inbound_nodes'
I also thought to use the old M, w_r and w_u as further inputs at each iteration and analogously get in output the same variables to complete the loop. But this means that I have to use the fit() function to train online the model having just the target as final output (Model 1), and employ the predict() function on the model with all the secondary outputs (Model 2) to get the variables to use at the next iteration. I have also to pass the weigth matrices from Model 1 to Model 2 using get_weights() and set_weights(). As you can see, it becomes a little bit messy and too slow.
Do you have any suggestions for this problem?
P.S. Please, do not focus too much on the API model above because it is a simplified (almost meaningless) version of the complete one where I skipped several key steps.

Confused on how to run tensorflow LSTM

I have seen two different ways of calling lstm on tensorflow and I am confused on what is the difference of one method with the other. And in which situation to use one or the other
The first one is to create an lstm and then call it immediatly like the code below
lstm = rnn_cell.BasicLSTMCell(lstm_size)
# Initial state of the LSTM memory.
initial_state = tf.zeros([batch_size, lstm.state_size])
for i in range(num_steps):
# The value of state is updated after processing each batch of words.
output, state = lstm(words[:, i], state)
And the second one is call lstm cell through rnn.rnn() like below.
# Define a lstm cell with tensorflow
lstm = rnn_cell.BasicLSTMCell(n_hidden, forget_bias=1.0)
# Split data because rnn cell needs a list of inputs for the RNN inner loop
inputToLstmSplited = tf.split(0, n_steps, inputToLstm) # n_steps * (batch_size, n_hidden)
inputToLstmSplitedFiltered = tf.matmul(inputToLstmSplited, weights['hidden']) + biases['hidden']
# Get lstm cell out
outputs, states = rnn.rnn(lstm, inputToLstmSplited, initial_state=istate)

The second effectively does the same as the loop in the first, returning a list of all the outputs collected in the loop and the final state. It does it a bit more efficiently though and with a number of safety checks. It also supports useful features like variable sequence lengths. The first option is presented in Tensorflow tutorials to give you an idea of how an RNN is unravelled, but the second option is preferred for "production" code.

Weights of Seq2Seq Models

I went through the code and I'm afraid I don't grasp an important point.
I can't seem to find the weights matrix of the model for the encoder and decoder, neither where they are updated. I found the target_weights but it seems to be reinitialized at every get_batch() call so I don't really understand what they stand for either.
My actual goal is to concatenate two hidden states of two source encoders for one decoder by applying a linear transformation with a weight matrix that I'll have to train along with the model (I'm building a manytoone model), but I have no idea where to start because of my problem mentionned above.

This might help you start. There are a couple of models implemented in tensorflow.python.ops.seq2seq.py (with/without buckets, attention, etc.) but take a look at the definition for embedding_attention_seq2seq (which is the one called in their example model seq2seq_model.py that you seem to be referencing):
def embedding_attention_seq2seq(encoder_inputs, decoder_inputs, cell,
num_encoder_symbols, num_decoder_symbols,
num_heads=1, output_projection=None,
feed_previous=False, dtype=dtypes.float32,
scope=None, initial_state_attention=False):
with variable_scope.variable_scope(scope or "embedding_attention_seq2seq"):
# Encoder.
encoder_cell = rnn_cell.EmbeddingWrapper(cell, num_encoder_symbols)
encoder_outputs, encoder_state = rnn.rnn(
encoder_cell, encoder_inputs, dtype=dtype)
# First calculate a concatenation of encoder outputs to put attention on.
top_states = [array_ops.reshape(e, [-1, 1, cell.output_size])
for e in encoder_outputs]
attention_states = array_ops.concat(1, top_states)
....
You can see where it picks out the top layer of encoder outputs as top_states before handing them off to the decoder.
So you could implement a similar function with two encoders and concatenate those states before handing off to the decoder.

The value created in the get_batch function is only used for the first iteration. Even though the weights are passed every time into the function, their value gets updated as a global variable in the Seq2Seq model class in the init function.
with tf.name_scope('Optimizer'):
# Gradients and SGD update operation for training the model.
params = tf.trainable_variables()
if not forward_only:
self.gradient_norms = []
self.updates = []
opt = tf.train.GradientDescentOptimizer(self.learning_rate)
for b in range(len(buckets)):
gradients = tf.gradients(self.losses[b], params)
clipped_gradients, norm = tf.clip_by_global_norm(gradients,
max_gradient_norm)
self.gradient_norms.append(norm)
self.updates.append(opt.apply_gradients(
zip(clipped_gradients, params), global_step=self.global_step))
self.saver = tf.train.Saver(tf.global_variables())
The weights are fed seperately as a place-holder because they are normalized in the get_batch function to create zero weights for the PAD inputs.
# Batch decoder inputs are re-indexed decoder_inputs, we create weights.
for length_idx in range(decoder_size):
batch_decoder_inputs.append(
np.array([decoder_inputs[batch_idx][length_idx]
for batch_idx in range(self.batch_size)], dtype=np.int32))
# Create target_weights to be 0 for targets that are padding.
batch_weight = np.ones(self.batch_size, dtype=np.float32)
for batch_idx in range(self.batch_size):
# We set weight to 0 if the corresponding target is a PAD symbol.
# The corresponding target is decoder_input shifted by 1 forward.
if length_idx < decoder_size - 1:
target = decoder_inputs[batch_idx][length_idx + 1]
if length_idx == decoder_size - 1 or target == data_utils.PAD_ID:
batch_weight[batch_idx] = 0.0
batch_weights.append(batch_weight)

Update only part of the word embedding matrix in Tensorflow

Assuming that I want to update a pre-trained word-embedding matrix during training, is there a way to update only a subset of the word embedding matrix?
I have looked into the Tensorflow API page and found this:
# Create an optimizer.
opt = GradientDescentOptimizer(learning_rate=0.1)
# Compute the gradients for a list of variables.
grads_and_vars = opt.compute_gradients(loss, <list of variables>)
# grads_and_vars is a list of tuples (gradient, variable). Do whatever you
# need to the 'gradient' part, for example cap them, etc.
capped_grads_and_vars = [(MyCapper(gv[0]), gv[1])) for gv in grads_and_vars]
# Ask the optimizer to apply the capped gradients.
opt.apply_gradients(capped_grads_and_vars)
However how do I apply that to the word-embedding matrix. Suppose I do:
word_emb = tf.Variable(0.2 * tf.random_uniform([syn0.shape[0],s['es']], minval=-1.0, maxval=1.0, dtype=tf.float32),name='word_emb',trainable=False)
gather_emb = tf.gather(word_emb,indices) #assuming that I pass some indices as placeholder through feed_dict
opt = tf.train.AdamOptimizer(1e-4)
grad = opt.compute_gradients(loss,gather_emb)
How do I then use opt.apply_gradients and tf.scatter_update to update the original embeddign matrix? (Also, tensorflow throws an error if the second argument of compute_gradient is not a tf.Variable)

TL;DR: The default implementation of opt.minimize(loss), TensorFlow will generate a sparse update for word_emb that modifies only the rows of word_emb that participated in the forward pass.
The gradient of the tf.gather(word_emb, indices) op with respect to word_emb is a tf.IndexedSlices object (see the implementation for more details). This object represents a sparse tensor that is zero everywhere, except for the rows selected by indices. A call to opt.minimize(loss) calls AdamOptimizer._apply_sparse(word_emb_grad, word_emb), which makes a call to tf.scatter_sub(word_emb, ...)* that updates only the rows of word_emb that were selected by indices.
If on the other hand you want to modify the tf.IndexedSlices that is returned by opt.compute_gradients(loss, word_emb), you can perform arbitrary TensorFlow operations on its indices and values properties, and create a new tf.IndexedSlices that can be passed to opt.apply_gradients([(word_emb, ...)]). For example, you could cap the gradients using MyCapper() (as in the example) using the following calls:
grad, = opt.compute_gradients(loss, word_emb)
train_op = opt.apply_gradients(
[tf.IndexedSlices(MyCapper(grad.values), grad.indices)])
Similarly, you could change the set of indices that will be modified by creating a new tf.IndexedSlices with a different indices.
* In general, if you want to update only part of a variable in TensorFlow, you can use the tf.scatter_update(), tf.scatter_add(), or tf.scatter_sub() operators, which respectively set, add to (+=) or subtract from (-=) the value previously stored in a variable.

Since you just want to select the elements to be updated (and not to change the gradients), you can do as follows.
Let indices_to_update be a boolean tensor that indicates the indices you wish to update, and entry_stop_gradients is defined in the link, Then:
gather_emb = entry_stop_gradients(gather_emb, indices_to_update)
(Source)

Actually, I was also struggling with such a problem. In my case, I needed to train a model with w2v embeddings, but not all of the tokens existed in embedding matrix. Thus for those tokens which were not in matrix, I made random initialization. Of course tokens for which embeddings were already trained, shouldn't be updated, thus I've came up with such a solution:
class PartialEmbeddingsUpdate(tf.keras.layers.Layer):
def __init__(self, len_vocab,
weights,
indices_to_update):
super(PartialEmbeddingsUpdate, self).__init__()
self.embeddings = tf.Variable(weights, name='embedding', dtype=tf.float32)
self.bool_mask = tf.equal(tf.expand_dims(tf.range(0,len_vocab),1), tf.expand_dims(indices_to_update,0))
self.bool_mask = tf.reduce_any(self.bool_mask,1)
self.bool_mask_not = tf.logical_not(self.bool_mask)
self.bool_mask_not = tf.expand_dims(tf.cast(self.bool_mask_not, dtype=self.embeddings.dtype),1)
self.bool_mask = tf.expand_dims(tf.cast(self.bool_mask, dtype=self.embeddings.dtype),1)
def call(self, input):
input = tf.cast(input, dtype=tf.int32)
embeddings = tf.stop_gradient(self.bool_mask_not * self.embeddings) + self.bool_mask * self.embeddings
return tf.gather(embeddings,input)
Where len_vocab - is your vocabulary length, weights - matrix of weights (some of which shouldn't be updated) and indices_to_update - indices of those tokens which should be updated. After that I applied this layer instead of tf.keras.layers.Embeddings. Hope it helps everyone, who encountered the same problem.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Tensorflow RNN-LSTM - reset hidden state - tensorflow

Related

tf.zeros vs tf.placeholder as RNN initial state

How to initialize a keras tensor employed in an API model

Confused on how to run tensorflow LSTM

Weights of Seq2Seq Models

Update only part of the word embedding matrix in Tensorflow

Categories

Resources