How to modify the Tensorflow Sequence2Sequence model to implement Bidirectional LSTM rather than Unidirectional one? - tensorflow

Refer to this post to know the background of the problem:
Does the TensorFlow embedding_attention_seq2seq method implement a bidirectional RNN Encoder by default?
I am working on the same model, and want to replace the unidirectional LSTM layer with a Bidirectional layer. I realize I have to use static_bidirectional_rnn instead of static_rnn, but I am getting an error due to some mismatch in the tensor shape.
I replaced the following line:
encoder_outputs, encoder_state = core_rnn.static_rnn(encoder_cell, encoder_inputs, dtype=dtype)
with the line below:
encoder_outputs, encoder_state_fw, encoder_state_bw = core_rnn.static_bidirectional_rnn(encoder_cell, encoder_cell, encoder_inputs, dtype=dtype)
That gives me the following error:
InvalidArgumentError (see above for traceback): Incompatible shapes:
[32,5,1,256] vs. [16,1,1,256]
[[Node: gradients/model_with_buckets/embedding_attention_seq2seq/embedding_attention_decoder/attention_decoder/Attention_0/add_grad/BroadcastGradientArgs
= BroadcastGradientArgs[T=DT_INT32, _device="/job:localhost/replica:0/task:0/cpu:0"](gradients/model_with_buckets/embedding_attention_seq2seq/embedding_attention_decoder/attention_decoder/Attention_0/add_grad/Shape,
gradients/model_with_buckets/embedding_attention_seq2seq/embedding_attention_decoder/attention_decoder/Attention_0/add_grad/Shape_1)]]
I understand that the outputs of both the methods are different, but I do not know how to modify attention code to incorporate that. How do I send both the forward and backward states to the attention module- do I concatenate both the hidden states?

I find from the error message that the batch size of two tensors somewhere don't match, one is 32 and the other is 16. I suppose it is because the output list of the bidirectional rnn is double sized of that of the unidirectional one. And you just don't adjust to that in the following code accordingly.
How do I send both the forward and backward states to the attention
module- do I concatenate both the hidden states?
You can reference this code:
def _reduce_states(self, fw_st, bw_st):
"""Add to the graph a linear layer to reduce the encoder's final FW and BW state into a single initial state for the decoder. This is needed because the encoder is bidirectional but the decoder is not.
Args:
fw_st: LSTMStateTuple with hidden_dim units.
bw_st: LSTMStateTuple with hidden_dim units.
Returns:
state: LSTMStateTuple with hidden_dim units.
"""
hidden_dim = self._hps.hidden_dim
with tf.variable_scope('reduce_final_st'):
# Define weights and biases to reduce the cell and reduce the state
w_reduce_c = tf.get_variable('w_reduce_c', [hidden_dim * 2, hidden_dim], dtype=tf.float32, initializer=self.trunc_norm_init)
w_reduce_h = tf.get_variable('w_reduce_h', [hidden_dim * 2, hidden_dim], dtype=tf.float32, initializer=self.trunc_norm_init)
bias_reduce_c = tf.get_variable('bias_reduce_c', [hidden_dim], dtype=tf.float32, initializer=self.trunc_norm_init)
bias_reduce_h = tf.get_variable('bias_reduce_h', [hidden_dim], dtype=tf.float32, initializer=self.trunc_norm_init)
# Apply linear layer
old_c = tf.concat(axis=1, values=[fw_st.c, bw_st.c]) # Concatenation of fw and bw cell
old_h = tf.concat(axis=1, values=[fw_st.h, bw_st.h]) # Concatenation of fw and bw state
new_c = tf.nn.relu(tf.matmul(old_c, w_reduce_c) + bias_reduce_c) # Get new cell from old cell
new_h = tf.nn.relu(tf.matmul(old_h, w_reduce_h) + bias_reduce_h) # Get new state from old state
return tf.contrib.rnn.LSTMStateTuple(new_c, new_h) # Return new cell and state

Related

Keras Bidirectional LSTM seq2seq inference model expects 3 inputs but only receives 1, even though I am passing in 3 inputs

I am creating a language model with a bidirecitonal LSTM, seq2seq model.
I have created the model and trained it successfully:
lstm_units = 100
# Set up embedding layer using pretrained weights
embedding_layer = Embedding(total_words+1, emb_dimension, input_length=max_input_len, weights=[embedding_matrix], name="Embedding")
# Encoder
encoder_input_x = Input(shape=(None,), name="Enc_x_Input")
encoder_embedding_x = embedding_layer(encoder_input_x)
encoder_lstm_x, enc_state_h_fwd, enc_state_c_fwd, enc_state_h_bwd, enc_state_c_bwd = Bidirectional(LSTM(lstm_units, dropout=0.5, return_state=True, name="Enc_LSTM1"), name="Enc_Bi1")(encoder_embedding_x) # pass hidden activation and memory cell states forward
encoder_state_h = Concatenate()([enc_state_h_fwd, enc_state_h_bwd])
encoder_state_c = Concatenate()([enc_state_c_fwd, enc_state_c_bwd])
encoder_states = [encoder_state_h, encoder_state_c] # package states to pass to decoder
# Decoder
decoder_input_x = Input(shape=(None,), name="Dec_x_Input")
decoder_embedding_x = embedding_layer(decoder_input_x)
decoder_lstm_layer = LSTM(lstm_units*2, return_state=True, return_sequences=True, dropout=0.5, name="Dec_LSTM1") # We define an LSTM layer without passing anything in here, as we will need to use this LSTM later.
decoder_lstm_x, _, _ = decoder_lstm_layer(decoder_embedding_x, initial_state=encoder_states) # we pass in encoder states
decoder_dense_layer = TimeDistributed(Dense(total_words+1, activation="softmax", name="Dec_Softmax")) # we set this dense to a variable so we can use it later, as above with the LSTM
decoder_output_x = decoder_dense_layer(decoder_lstm_x)
model = Model(inputs=[encoder_input_x, decoder_input_x], outputs=decoder_output_x)
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
I then set up the inference model:
# Inference Encoder
inf_encoder_model = Model(encoder_input_x, encoder_states) # Here we are creating a model using layers from the model we built earlier.
# The encoder model outputs the encoder_states, ie. the concatenated h and c values from the BLSTM
# Inference Decoder
# Create new inputs for decoder state
inf_dec_state_h_input = Input(shape=(2*lstm_units,), name="Dec_h_state_input") # The must be sized to fit both FWD and BWD h values from the BLSTM
inf_dec_state_c_input = Input(shape=(2*lstm_units,), name="Dec_c_state_input")
inf_dec_state_input = [inf_dec_state_h_input, inf_dec_state_c_input] # package states to pass to decoder
# Decoder LSTM + Dense
inf_decoder_lstm_x, inf_dec_state_h, inf_dec_state_c = decoder_lstm_layer(decoder_embedding_x, initial_state=inf_dec_state_input) # reuse embedding layer from training. We pass in encoder states
inf_decoder_states = [inf_dec_state_h, inf_dec_state_c] # I think we we loop inference, we'll pass these states back in to the input instead of the encoder states
inf_decoder_output = decoder_dense_layer(inf_decoder_lstm_x)
decoder_model = Model([decoder_input_x] + inf_dec_state_input, [inf_decoder_output] + inf_decoder_states) # we reuse the decoder_input_x from the training model
The decoder model for inference is set up to take the decoder inputs + the c and h states which are output from the encoder.
When running the inference loop using this code:
states = inf_encoder_model.predict(x_inputs[700])
# Generate empty target sequence of length 1.
target_seq = np.zeros((max_output_len, 1), dtype=int)
# Populate the first character of target sequence with the start character.
target_seq[0, 0] = 4 # 4 is the start of sequence token used during training
# Get prediction
prediction, h, c = decoder_model.predict([target_seq] + states)
it gives me a long error that ends with:
ValueError: Layer Dec_LSTM1 expects 3 input(s), but it received 1 input tensors. Inputs received: [<tf.Tensor 'model_16/Embedding/embedding_lookup/Identity_1:0' shape=(None, 1, 100) dtype=float32>]
The encoder states seem to be fine; a list containing 2 arrays, the h and c values, each with shape (60, 200). The target_seq is an array of shape (1, 60). x_inputs[700] is training data, also of shape (1, 60).
Why is the model.predict line suggesting I am giving it 1 input tensor when I am giving it a list containing 3 arrays?

How to use tensorflow seq2seq without embeddings?

I have been working on LSTM for timeseries forecasting by using tensorflow. Now, i want to try sequence to sequence (seq2seq). In the official site there is a tutorial which shows NMT with embeddings . So, how can I use this new seq2seq module without embeddings? (directly using time series "sequences").
# 1. Encoder
encoder_cell = tf.contrib.rnn.BasicLSTMCell(LSTM_SIZE)
encoder_outputs, encoder_state = tf.nn.static_rnn(
encoder_cell,
x,
dtype=tf.float32)
# Decoder
decoder_cell = tf.nn.rnn_cell.BasicLSTMCell(LSTM_SIZE)
helper = tf.contrib.seq2seq.TrainingHelper(
decoder_emb_inp, decoder_lengths, time_major=True)
decoder = tf.contrib.seq2seq.BasicDecoder(
decoder_cell, helper, encoder_state)
# Dynamic decoding
outputs, _ = tf.contrib.seq2seq.dynamic_decode(decoder)
outputs = outputs[-1]
# output is result of linear activation of last layer of RNN
weight = tf.Variable(tf.random_normal([LSTM_SIZE, N_OUTPUTS]))
bias = tf.Variable(tf.random_normal([N_OUTPUTS]))
predictions = tf.matmul(outputs, weight) + bias
What should be the args for TrainingHelper() if I use input_seq=x and output_seq=label?
decoder_emb_inp ???
decoder_lengths ???
Where input_seq are the first 8 point of the sequence, and output_seq are the last 2 point of the sequence.
Thanks on advance!
I got it to work for no embedding using a very rudimentary InferenceHelper:
inference_helper = tf.contrib.seq2seq.InferenceHelper(
sample_fn=lambda outputs: outputs,
sample_shape=[dim],
sample_dtype=dtypes.float32,
start_inputs=start_tokens,
end_fn=lambda sample_ids: False)
My inputs are floats with the shape [batch_size, time, dim]. For the example below dim would be 1, but this can easily be extended to more dimensions. Here's the relevant part of the code:
projection_layer = tf.layers.Dense(
units=1, # = dim
kernel_initializer=tf.truncated_normal_initializer(
mean=0.0, stddev=0.1))
# Training Decoder
training_decoder_output = None
with tf.variable_scope("decode"):
# output_data doesn't exist during prediction phase.
if output_data is not None:
# Prepend the "go" token
go_tokens = tf.constant(go_token, shape=[batch_size, 1, 1])
dec_input = tf.concat([go_tokens, target_data], axis=1)
# Helper for the training process.
training_helper = tf.contrib.seq2seq.TrainingHelper(
inputs=dec_input,
sequence_length=[output_size] * batch_size)
# Basic decoder
training_decoder = tf.contrib.seq2seq.BasicDecoder(
dec_cell, training_helper, enc_state, projection_layer)
# Perform dynamic decoding using the decoder
training_decoder_output = tf.contrib.seq2seq.dynamic_decode(
training_decoder, impute_finished=True,
maximum_iterations=output_size)[0]
# Inference Decoder
# Reuses the same parameters trained by the training process.
with tf.variable_scope("decode", reuse=tf.AUTO_REUSE):
start_tokens = tf.constant(
go_token, shape=[batch_size, 1])
# The sample_ids are the actual output in this case (not dealing with any logits here).
# My end_fn is always False because I'm working with a generator that will stop giving
# more data. You may extend the end_fn as you wish. E.g. you can append end_tokens
# and make end_fn be true when the sample_id is the end token.
inference_helper = tf.contrib.seq2seq.InferenceHelper(
sample_fn=lambda outputs: outputs,
sample_shape=[1], # again because dim=1
sample_dtype=dtypes.float32,
start_inputs=start_tokens,
end_fn=lambda sample_ids: False)
# Basic decoder
inference_decoder = tf.contrib.seq2seq.BasicDecoder(dec_cell,
inference_helper,
enc_state,
projection_layer)
# Perform dynamic decoding using the decoder
inference_decoder_output = tf.contrib.seq2seq.dynamic_decode(
inference_decoder, impute_finished=True,
maximum_iterations=output_size)[0]
Have a look at this question. Also I found this tutorial to be very useful to understand seq2seq models, although it does use embeddings. So replace their GreedyEmbeddingHelper by an InferenceHelper like the one I posted above.
P.s. I posted the full code at https://github.com/Andreea-G/tensorflow_examples

Extracting hidden state vectors of an LSTM model at each time step

I'm using keras with the tensorflow backend.
I have a trained LSTM model whose hidden state vector I would like to extract at each timestep.
What is the best way to do this in keras?
The function handling whether all hidden state vectors are returned is Recurrent.call() (it has been renamed to RNN.call() in the latest version). It checks the parameter return_sequences to make the decision.
When the backend function K.rnn() is called in this function:
last_output, outputs, states = K.rnn(self.step,
preprocessed_input,
initial_state,
go_backwards=self.go_backwards,
mask=mask,
constants=constants,
unroll=self.unroll,
input_length=input_shape[1])
...
if self.return_sequences:
output = outputs
else:
output = last_output
The tensor outputs is what you want. You can get this tensor by calling Recurrent.call() again, but with return_sequences=True. This should do no harm to your trained LSTM model (at least in current Keras).
Here's a toy Bi-LSTM model demonstrating this method:
input_tensor = Input(shape=(None,), dtype='int32')
embedding = Embedding(10, 100, mask_zero=True)(input_tensor)
hidden = Bidirectional(LSTM(10, return_sequences=True))(embedding)
hidden = Bidirectional(LSTM(10, return_sequences=True))(hidden)
hidden = Bidirectional(LSTM(2))(hidden)
out = Dense(1, activation='sigmoid')(hidden)
model = Model(input_tensor, out)
First, set return_sequences to True for the last LSTM layer (since you're using a Bidirectional wrapper, you have to set forward_layer and backward_layer also):
target_layer = model.layers[-2]
target_layer.return_sequences = True
target_layer.forward_layer.return_sequences = True
target_layer.backward_layer.return_sequences = True
Now by calling this layer again, the tensor containing hidden vectors at all time steps will be returned (there will be a side-effect of creating an additional inbound node, but it shouldn't affect the prediction).
outputs = target_layer(target_layer.input)
m = Model(model.input, outputs)
You can get the hidden vectors by, e.g., calling m.predict(X_test).
X_test = np.array([[1, 3, 2, 0, 0]])
print(m.predict(X_test))
[[[ 0.00113332 -0.0006666 0.00428438 -0.00125567]
[ 0.00106074 -0.00041183 0.00383953 -0.00027285]
[ 0.00080892 0.00027685 0.00238486 0.00036328]
[ 0.00080892 0.00027685 0. 0. ]
[ 0.00080892 0.00027685 0. 0. ]]]
As you can see, hidden vectors of all 5 time steps are returned, and the last 2 time steps are properly masked.

Gradients are always zero

I have written an algorithm using tensorflow framework and faced with the problem, that tf.train.Optimizer.compute_gradients(loss) returns zero for all weights. Another problem is if I put batch size larger than about 5, tf.histogram_summary for weights throws an error that some of values are NaN.
I cannot provide here a reproducible example, because my code is quite bulky and I am not so good in TF for make it shorter. I will try to paste here some fragments.
Main loop:
images_ph = tf.placeholder(tf.float32, shape=some_shape)
labels_ph = tf.placeholder(tf.float32, shape=some_shape)
output = inference(BATCH_SIZE, images_ph)
loss = loss(labels_ph, output)
train_op = train(loss, global_step)
session = tf.Session()
session.run(tf.initialize_all_variables())
for i in xrange(MAX_STEPS):
images, labels = train_dataset.get_batch(BATCH_SIZE, yolo.INPUT_SIZE, yolo.OUTPUT_SIZE)
session.run([loss, train_op], feed_dict={images_ph : images, labels_ph : labels})
Train_op (here is the problem occures):
def train(total_loss)
opt = tf.train.AdamOptimizer()
grads = opt.compute_gradients(total_loss)
# Here gradients are zeros
for grad, var in grads:
if grad is not None:
tf.histogram_summary("gradients/" + var.op.name, grad)
return opt.apply_gradients(grads, global_step=global_step)
Loss (the loss is calculated correctly, since it changes from sample to sample):
def loss(labels, output)
return tf.reduce_mean(tf.squared_difference(labels, output))
Inference: a set of convolution layers with ReLU followed by 3 fully connected layers with sigmoid activation in the last layer. All weights initialized by truncated normal rv's. All labels are vectors of fixed length with real numbers in range [0,1].
Thanks in advance for any help! If you have some hypothesis for my problem, please share I will try them. Also I can share the whole code if you like.

adding one hot encoding throws error in previously working code in Tensorflow

with tf.variable_scope("rnn_seq2seq"):
w = tf.get_variable("proj_w", [num_units, seq_width])
w_t = tf.transpose(w)
b = tf.get_variable("proj_b", [seq_width])
output_projection=(w,b)
output,state = rnn_seq2seq(enc_inputs,dec_inputs,cell,output_projection=output_projection,feed_previous=False)
weights=[tf.ones([batch_size * dec_steps])]
loss=[]
for i in xrange(dec_steps -1):
logits = tf.nn.xw_plus_b(output[i],output_projection[0],output_projection[1])
If I introduce one hot encoding on the logits here, the program gives error later although both returns the same dimensions. If I comment out this line, the program does not give any error.
prev = logits
logits = tf.to_float(tf.equal(prev,tf.reduce_max(prev,reduction_indices=[1],keep_dims=True)))
print prev
print logits
Tensor("rnn_seq2seq/xw_plus_b:0", shape=TensorShape([Dimension(800), Dimension(14)]), dtype=float32)
Tensor("rnn_seq2seq/ToFloat:0", shape=TensorShape([Dimension(800), Dimension(14)]), dtype=float32)
Rest of code:
crossent =tf.nn.softmax_cross_entropy_with_logits(logits,dec_inputs[i+1],name="SequenceLoss/CrossEntropy{0}".format(i))
loss.append(crossent)
cost = tf.reduce_sum(tf.add_n(loss))
final_state = state[-1]
tvars = tf.trainable_variables()
grads,norm = tf.clip_by_global_norm(tf.gradients(cost,tvars),5)
lr = tf.Variable(0.0,name="learningRate")
optimizer = tf.train.GradientDescentOptimizer(lr)
train_op = optimizer.apply_gradients(zip(grads,tvars))
---> 23 grads,norm = tf.clip_by_global_norm(tf.gradients(cost,tvars),5)
ValueError: List argument 'values' to 'Pack' Op with length 0 shorter
than minimum length 1.
Neural networks can only be trained if all the operations they perform are differentiable. The "one-hot" step you apply is not differentiable, and hence such a neural network cannot be trained using any gradient descent-based optimizer (= any optimizer that tensor flow implements).
The general approach is to use softmax (which is differentiable) during training to approximate one-hot encoding (and your model already has softmax following computing logits, so commenting out the "one-hot" is actually all you need to do).