Output of Keras LSTMCell - tensorflow

I am implementing an LSTM cell using tensorflow.keras.layers.LSTMCell in order to optimize some parameters.
Over the different time steps of the recurrent network I want my input vector x[t+1] to be the output of the previous iteration at time step t. However, looking at the equations of the LSTM cell (https://en.wikipedia.org/wiki/Long_short-term_memory) and also at the source code of the call() method of LSTMCell in Keras (https://github.com/keras-team/keras/blob/master/keras/layers/rnn/lstm.py line 298), the output is h, [h, c], where h is the hidden state and c the cell state.
My question is: how can I implement the LSTMCell in Keras so that at each iteration it outputs something that can be fed as the input to the next step? One way to do this could be to introduce another set of weights Wp and biases bp (apart from the ones that appear in the equations of the LSTM cell) that are trained so that x[t+1] = Wp * ht + bp. I have seen that this approach may be the one used by the Model() object in Keras, but I am quite new to this and I don't know how it works.
Besides that, as a second question, I would like to know why the output of the call() method for the LSTM cell is h, [h, c]. What is the point of having h repeated twice there?
Thanks a lot everybody for your help.
To sum up, I need to find the predictions yt that appear in the linked equations.
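To make my idea a bit more concrete, here is a rough sketch of what I imagine (ProjectedLSTMCell and all the sizes are just made-up placeholders, not something I already have working):
import tensorflow as tf

class ProjectedLSTMCell(tf.keras.layers.Layer):
    """Wraps an LSTMCell and adds an extra trainable projection Wp, bp
    so that the cell also emits the next input x[t+1] = Wp * h_t + bp."""
    def __init__(self, units, x_dim):
        super().__init__()
        self.cell = tf.keras.layers.LSTMCell(units)
        self.proj = tf.keras.layers.Dense(x_dim)   # holds Wp and bp

    def call(self, x, states):
        h, new_states = self.cell(x, states)        # h is the usual LSTM output
        next_x = self.proj(h)                       # x[t+1] = Wp * h_t + bp
        return next_x, new_states

# manual unrolling over a few time steps
cell = ProjectedLSTMCell(units=32, x_dim=10)
x = tf.zeros((8, 10))                               # (batch, x_dim) initial input
states = [tf.zeros((8, 32)), tf.zeros((8, 32))]     # initial [h, c]
for t in range(5):
    x, states = cell(x, states)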

Related

LSTM with Keras to optimize a black box function

I'm trying to implement the recurrent neural network architecture proposed in this paper (https://arxiv.org/abs/1611.03824), where the authors use an LSTM to minimize a black-box function (which is, however, assumed to be differentiable); the paper contains a diagram of the proposed architecture. Briefly, the idea is to use the LSTM as an optimizer: it has to learn a good heuristic for proposing new parameters for the unknown function y = f(parameters), so that it moves towards a minimum. Here's how the proposed procedure works:
Select an initial value for the parameters p0, and for the function y0 = f(p0)
Call the LSTM cell with input=[p0,y0]; its output is a new value for the parameters, output=p1
Evaluate y1 = f(p1)
Call the LSTM cell with input=[p1,y1], and obtain output=p2
Evaluate y2 = f(p2)
Repeat for a few iterations, for example stopping at the fifth one: y5 = f(p5).
I'm trying to implement a similar model in TensorFlow/Keras but I'm having some trouble. In particular, this case is different from "standard" ones because we don't have a predefined time sequence to analyze; instead, it is generated online, after each iteration of the LSTM cell. Thus, in this case, our input consists of just the starting guess [p0, y0=f(p0)] at time t=0. If I understood it correctly, this model is similar to a one-to-many LSTM, but with the difference that the input to the next time step does not come just from the previous cell, but also from the output of an additional function (in our case f).
I managed to create a custom tf.keras.layers.Layer which performs the calculation for a single time step (that is, it runs the LSTM cell and then uses its output as input to the function f):
class my_layer(tf.keras.layers.Layer):
    def __init__(self, units=4):
        super(my_layer, self).__init__()
        self.cell = tf.keras.layers.LSTMCell(units)

    def call(self, inputs):
        prev_cost = inputs[0]
        prev_params = inputs[1]
        prev_h = inputs[2]
        prev_c = inputs[3]
        # Concatenate the previous parameters and previous cost to create the new input
        new_input = tf.keras.layers.concatenate([prev_cost, prev_params])
        # New parameters obtained by the LSTM cell, along with the new internal states h and c
        new_params, [new_h, new_c] = self.cell(new_input, states=[prev_h, prev_c])
        # Function evaluation
        new_cost = f(new_params)
        return [new_cost, new_params, new_h, new_c]
but I do not know how to build the recurrent part. I tried to do it manually, that is, doing something like:
my_cell = my_layer(units = 4)
outputs = my_cell(inputs)
outputs1 = my_cell(outputs)
outputs2 = my_cell(outputs1)
Is that correct? Is there some other way to do it more appropriately?
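For example, would wrapping the loop in a small tf.keras.Model subclass be a reasonable direction? (Just a sketch; n_steps and the way I combine the costs are placeholders:)
class LSTMOptimizer(tf.keras.Model):
    def __init__(self, units=4, n_steps=5):
        super().__init__()
        self.step = my_layer(units=units)   # the single-step layer defined above
        self.n_steps = n_steps

    def call(self, inputs):
        # inputs = [cost0, params0, h0, c0], i.e. the starting guess and zero states
        state = inputs
        costs = []
        for _ in range(self.n_steps):
            state = self.step(state)
            costs.append(state[0])
        # e.g. use the sum of all intermediate costs as the quantity to minimize
        return tf.add_n(costs)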
Bonus question: I would like to train the LSTM to be able to optimize not only a single function f, but rather a class of different functions [f1, f2, ...] which share some common structure that makes them similar enough to be optimized using the same LSTM. How could I implement a training loop which takes as input a list of these functions [f1, f2, ...] and tries to minimize them all? My first thought was to do it the "brute force" way: use a for loop over the functions and a tf.GradientTape which evaluates and applies the gradients for each function.
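Concretely, the brute-force version I had in mind looks something like this (just a sketch; unroll_for, trainable_vars and the function list are placeholders, and each f_i is assumed to be a differentiable TF function):
opt = tf.keras.optimizers.Adam(1e-3)

for epoch in range(100):
    for f in [f1, f2, f3]:                       # placeholder list of target functions
        with tf.GradientTape() as tape:
            # unroll_for(f) is a placeholder for running the unrolled LSTM optimizer on f
            # (e.g. the loop from the Model subclass sketched above, with f plugged in)
            loss = unroll_for(f)
        grads = tape.gradient(loss, trainable_vars)          # trainable_vars: the LSTM's weights
        opt.apply_gradients(zip(grads, trainable_vars))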
Any help is much appreciated!
Thank you very much in advance! :)

Keras Looping LSTM layers

I am trying to build a model which is basically a sequence-to-sequence model, but I have a special encoder, namely a "Secondary Encoder".
Timesteps in Secondary Encoder = 300
This encoder has a special property: in essence it is a GRU, but at each timestep the hidden state produced by the GRUCell needs to be altered. It has to be added to another variable, and then this combination (the new hidden state) is passed on to the next GRUCell, which uses it as initial_state. This is repeated 300 times.
As 300 GRUCells are required (one for each time step), it is not feasible to hard-code each of the 300 layers to create the model.
So I need help figuring out how to write a loop to implement this in Keras, or maybe how to create a custom Layer (if that is a better choice).
What I thought (pseudocode), where alpha is the variable I mentioned that I want to add:
x = Input(shape=...)
encoder_cell = GRU(10, return_state=True)
init_state = xxxx  # some value to give as the initialiser to the first GRU cell
for t in range(300):
    _, hstate = encoder_cell(x[t], initial_state=init_state)
    init_state = hstate + alpha
model = Model(inputs=x, outputs=init_state)
Will this work? Will the model be able to figure out that it needs to loop 300 times for each training example?
The model is quite big; it has skip connections and lots of other things, which is why I need your help to figure out this subset of my problem before I implement the rest. Please ignore the syntax, this is just pseudocode.
Also, I need to call this model again and again, so I think the iterative way will slow down the process quite a lot, right?
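To make the idea a bit more concrete, this is roughly the kind of custom layer I am imagining (still a sketch; the sizes and the way alpha is created are placeholders, and I am not sure this is idiomatic):
import tensorflow as tf

class LoopingGRU(tf.keras.layers.Layer):
    def __init__(self, units=10):
        super().__init__()
        self.units = units
        self.cell = tf.keras.layers.GRUCell(units)   # one cell, weights shared over all 300 steps
        # alpha: the extra variable added to the hidden state at every step
        self.alpha = self.add_weight(name="alpha", shape=(units,), initializer="zeros")

    def call(self, x):                               # x: (batch, 300, features)
        state = tf.zeros([tf.shape(x)[0], self.units])
        for t in range(x.shape[1]):                  # loop over the 300 time steps
            _, [state] = self.cell(x[:, t, :], states=[state])
            state = state + self.alpha               # alter the hidden state before the next step
        return state

inputs = tf.keras.Input(shape=(300, 16))             # 16 is a placeholder feature size
model = tf.keras.Model(inputs, LoopingGRU(10)(inputs))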

Understanding Seq2Seq model

Here is my understanding of a basic sequence-to-sequence LSTM. Suppose we are tackling a question-answer setting.
You have two sets of LSTMs (green and blue below). Each set shares weights internally (i.e. each of the 4 green cells has the same weights, and similarly for the blue cells). The first is a many-to-one LSTM, which summarises the question in the last hidden layer/cell memory.
The second set (blue) is a many-to-many LSTM which has different weights from the first set of LSTMs. The input is simply the answer sentence, while the output is the same sentence shifted by one.
The question is two-fold:
1. Are we passing only the last hidden state to the blue LSTMs as the initial hidden state, or the last hidden state and the cell memory?
2. Is there a way to set the initial hidden state and cell memory in Keras or Tensorflow? If so, is there a reference?
(image taken from suriyadeepan.github.io)
Are we passing only the last hidden state to the blue LSTMs as the initial hidden state, or the last hidden state and the cell memory?
Both hidden state h and cell memory c are passed to the decoder.
TensorFlow
In the seq2seq source code, you can find the following code in basic_rnn_seq2seq():
_, enc_state = rnn.static_rnn(enc_cell, encoder_inputs, dtype=dtype)
return rnn_decoder(decoder_inputs, enc_state, cell)
If you use an LSTMCell, the returned enc_state from the encoder will be a tuple (c, h). As you can see, the tuple is passed directly to the decoder.
Keras
In Keras, the "state" defined for an LSTMCell is also a tuple (h, c) (note that the order is different from TF). In LSTMCell.call(), you can find:
h_tm1 = states[0]
c_tm1 = states[1]
To get the states returned from an LSTM layer, you can specify return_state=True. The returned value is a tuple (o, h, c). The tensor o is the output of this layer, which will be equal to h unless you specify return_sequences=True.
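A minimal snippet to illustrate (placeholder shapes):
from tensorflow.keras.layers import Input, LSTM

input_tensor = Input(shape=(10, 8))                 # (timesteps, features), placeholder sizes
o, h, c = LSTM(32, return_state=True)(input_tensor)
# o is equal to h here, because return_sequences is False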
Is there a way to set the initial hidden state and cell memory in Keras or Tensorflow? If so, is there a reference?
TensorFlow
Just provide the initial state to an LSTMCell when calling it. For example, in the official RNN tutorial:
lstm = tf.contrib.rnn.BasicLSTMCell(lstm_size)
...
output, state = lstm(current_batch_of_words, state)
There's also an initial_state argument for functions such as tf.nn.static_rnn. If you use the seq2seq module, provide the states to rnn_decoder as shown in the code for question 1.
Keras
Use the keyword argument initial_state in the LSTM function call.
out = LSTM(32)(input_tensor, initial_state=(h, c))
You can actually find this usage on the official documentation:
Note on specifying the initial state of RNNs
You can specify the initial state of RNN layers symbolically by
calling them with the keyword argument initial_state. The value of
initial_state should be a tensor or list of tensors representing the
initial state of the RNN layer.
EDIT:
There's now an example script in Keras (lstm_seq2seq.py) showing how to implement basic seq2seq in Keras. How to make predictions after training a seq2seq model is also covered in this script.
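A condensed sketch of that encoder-decoder wiring (placeholder dimensions, roughly following the structure of that script):
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model

latent_dim, num_enc_tokens, num_dec_tokens = 256, 70, 90   # placeholder sizes

encoder_inputs = Input(shape=(None, num_enc_tokens))
_, state_h, state_c = LSTM(latent_dim, return_state=True)(encoder_inputs)

decoder_inputs = Input(shape=(None, num_dec_tokens))
decoder_outputs = LSTM(latent_dim, return_sequences=True)(
    decoder_inputs, initial_state=[state_h, state_c])      # state transfer happens here
decoder_outputs = Dense(num_dec_tokens, activation='softmax')(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)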
(Edit: this answer is incomplete and hasn't considered the actual possibilities of state transfer. See the accepted answer.)
From a Keras point of view, that picture has only two layers.
The green group is one LSTM layer.
The blue group is another LSTM layer.
There isn't any communication between green and blue other than passing the outputs. So, the answer for 1 is:
Only the thought vector (which is the actual output of the layer) is passed to the other layer.
Memory and state (not sure if these are two different entities) are totally contained inside a single layer and are not initially intended to be seen or shared with any other layer.
Each individual block in that image is totally invisible in keras. They are considered "time steps", something that only appears in the shape of the input data. It's rarely important to worry about them (except for very advanced usages).
In keras, it's like this: you easily have access only to the external arrows (including the "thought vector").
But having access to each step (each individual green block in your picture) is not something that is exposed. So...
Passing the states from one layer to the other is also not expected in Keras; you will probably have to hack things. (See this: https://github.com/fchollet/keras/issues/2995)
But considering a thought vector big enough, you could say it will learn a way to carry what is important in itself.
The only notion you have from the steps is:
You have to input things shaped like (sentences, length, wordIdFeatures)
The steps will be performed considering that each slice in the length dimension is an input to each green block.
You may choose to have a single output (sentences, cells), for which you completely lose track of steps. Or...
Outputs like (sentences, length, cells), from which you know the output of each block through the length dimension.
One to many or many to many?
Now, the first layer is many to one (but nothing prevents it from being many to many too if you want).
But the second... that's complicated.
If the thought vector was made by a many to one layer, you will have to manage a way of creating a one to many. (That's not trivial in keras, but you could think of repeating the thought vector for the expected length, making it the input to all steps. Or maybe filling an entire sequence with zeros or ones, keeping only the first element as the thought vector.)
If the thought vector was made by a many to many layer, you can take advantage of this and keep an easy many to many, if you're willing to accept that the output has exactly the same number of steps as the input.
Keras doesn't have a ready solution for 1 to many cases (predicting a whole sequence from a single input).
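If you want to try the "repeat the thought vector" workaround mentioned above, a rough sketch looks like this (placeholder sizes):
from tensorflow.keras.layers import Input, LSTM, RepeatVector, TimeDistributed, Dense
from tensorflow.keras.models import Model

length, out_length, features, cells = 4, 4, 50, 64   # placeholder sizes

inp = Input(shape=(length, features))
thought = LSTM(cells)(inp)                            # many to one: (sentences, cells)
repeated = RepeatVector(out_length)(thought)          # same vector fed to every output step
seq = LSTM(cells, return_sequences=True)(repeated)
out = TimeDistributed(Dense(features, activation='softmax'))(seq)

model = Model(inp, out)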

Understanding the functioning of a recurrent neural network with LSTM cells

Context:
I have a recurrent neural network with LSTM cells
The input to the network is a batch of size (batch_size, number_of_timesteps, one_hot_encoded_class), in my case (128, 300, 38)
The different rows of the batch (1-128) are not necessarily related to each other
The target for one time step is given by the value of the next time step
My questions:
When I train the network using an input batch of (128, 300, 38) and a target batch of the same size,
does the network always consider only the last time step t to predict the value of the next time step t+1?
Or does it consider all time steps from the beginning of the sequence up to time step t?
Or does the LSTM cell internally remember all previous states?
I am confused about how this works, because the network is trained on multiple time steps simultaneously, so I am not sure how the LSTM cell can still have knowledge of the previous states.
I hope somebody can help. Thanks in advance!
Code for discussion:
cells = []
for i in range(self.n_layers):
    cell = tf.contrib.rnn.LSTMCell(self.n_hidden)
    cells.append(cell)
cell = tf.contrib.rnn.MultiRNNCell(cells)
init_state = cell.zero_state(self.batch_size, tf.float32)

outputs, final_state = tf.nn.dynamic_rnn(
    cell, inputs=self.inputs, initial_state=init_state)

self.logits = tf.contrib.layers.linear(outputs, self.num_classes)

softmax_ce = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=labels, logits=self.logits)
self.loss = tf.reduce_mean(softmax_ce)
self.train_step = tf.train.AdamOptimizer(self.lr).minimize(self.loss)
Consider a simple RNN unrolled to the neuron level with 3 time steps.
As you can see, the output at time step t depends on all time steps from the beginning. The network is trained using backpropagation through time, where the weights are updated by the contribution of all error gradients across time. The weights are shared across time, so there is nothing like a simultaneous update on all time steps.
Knowledge of the previous states is transferred through the state variable s_t, since it is a function of the previous inputs. So at any time step, the prediction is made based on the current input as well as (a function of) the previous inputs, captured by the state variable.
NOTE: A basic RNN is used instead of an LSTM for simplicity.
Here's what would be helpful to keep in mind for your case specifically:
Given the input shape of [128, 300, 38]
One call to dynamic_rnn will propagate through all 300 steps, and if you are using something like an LSTM, the state will also be carried through those 300 steps.
However, each SUBSEQUENT call to dynamic_rnn will not automatically remember the state from the previous call. By the second call, the weights etc. will have been updated thanks to the first call, but you will still need to pass the state that resulted from the first call into the second call. That's why dynamic_rnn has a parameter initial_state and why one of its outputs is final_state (i.e. the state after processing all 300 steps in ONE call). So you are meant to take the final state from call N and pass it back as the initial state of call N+1 to dynamic_rnn. This all relates specifically to the LSTM, since that is what you asked about.
You are right to note that elements in one batch don't necessarily relate to each other within the same batch. This is something you need to consider carefully, because with successive calls to dynamic_rnn, batch elements in your input sequences have to relate to their respective counterparts in the previous/following sequence, but not to each other. I.e. element 3 in the first call may have nothing to do with the other 127 elements within the same batch, but element 3 in the NEXT call has to be the temporal/logical continuation of element 3 in the PREVIOUS call, and so forth. This way, the state that you keep passing forward makes sense continuously.
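A rough sketch of that hand-off, using a single LSTM layer instead of the MultiRNNCell for brevity (names like `batches` are placeholders):
import numpy as np
import tensorflow as tf

batch_size, n_steps, n_features, n_hidden = 128, 300, 38, 64

inputs = tf.placeholder(tf.float32, [batch_size, n_steps, n_features])
c_in = tf.placeholder(tf.float32, [batch_size, n_hidden])   # carried-over cell state
h_in = tf.placeholder(tf.float32, [batch_size, n_hidden])   # carried-over hidden state
init_state = tf.contrib.rnn.LSTMStateTuple(c_in, h_in)

cell = tf.contrib.rnn.LSTMCell(n_hidden)
outputs, final_state = tf.nn.dynamic_rnn(cell, inputs, initial_state=init_state)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    c_np = np.zeros((batch_size, n_hidden), np.float32)     # zero state for the very first call
    h_np = np.zeros((batch_size, n_hidden), np.float32)
    # `batches`: your stream of (128, 300, 38) arrays; each one must continue the previous
    # one element-wise for the carried state to make sense
    for batch in batches:
        out_np, (c_np, h_np) = sess.run(
            [outputs, final_state],
            feed_dict={inputs: batch, c_in: c_np, h_in: h_np})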

Tensorflow RNN sequence training

I'm taking my first steps learning TF and have some trouble training RNNs.
My toy problem goes like this: a two-layer LSTM + dense layer network is fed with raw audio data and should test whether a certain frequency is present in the sound.
So the network should map float (audio data sequence) to float (pre-chosen frequency volume), one to one.
I've got this to work in Keras and seen a similar TFLearn solution, but I would like to implement it in bare TensorFlow in a relatively efficient way.
What I've done:
lstm = rnn_cell.BasicLSTMCell(LSTM_SIZE, state_is_tuple=True, forget_bias=1.0)
lstm = rnn_cell.DropoutWrapper(lstm)
stacked_lstm = rnn_cell.MultiRNNCell([lstm] * 2, state_is_tuple=True)
outputs, states = rnn.dynamic_rnn(stacked_lstm, inputs, dtype=tf.float32)
outputs = tf.transpose(outputs, [1, 0, 2])
last = tf.gather(outputs, int(outputs.get_shape()[0]) - 1)
network = tf.matmul(last, W) + b
# cost function, optimizer etc...
During training I fed this with (BATCH_SIZE, SEQUENCE_LEN, 1) batches, and the loss seemed to converge correctly, but I can't figure out how to predict with the trained network.
My (awful lot of) questions:
How do I make this network return a sequence straight from TensorFlow, without going back to Python for each sample (feed a sequence and predict a sequence of the same size)?
If I do want to predict one sample at a time and iterate in Python, what is the correct way to do it?
During testing, is dynamic_rnn needed, or is it just used for unrolling for BPTT during training? Why does dynamic_rnn return the tensors for all the back-propagation steps? These are the outputs of each layer of the unrolled network, right?
After some research:
How do I make this network return a sequence straight from TensorFlow, without going back to Python for each sample (feed a sequence and predict a sequence of the same size)?
You can use state_saving_rnn:
class Saver():
    def __init__(self):
        self.d = {}
    def state(self, name):
        if not name in self.d:
            return tf.zeros([1, LSTM_SIZE], tf.float32)
        return self.d[name]
    def save_state(self, name, val):
        self.d[name] = val
        return tf.identity('save_state_name')  # <- important for control_dependencies

outputs, states = rnn.state_saving_rnn(
    stacked_lstm, inx, Saver(),
    ('lstmstate', 'lstmstate2', 'lstmstate3', 'lstmstate4'),
    sequence_length=[EVAL_SEQ_LEN])
# 4 states are needed for the two LSTM layers; each has hidden and CEC variables to restore
network = [tf.matmul(outputs[-1], W) for i in xrange(EVAL_SEQ_LEN)]
One problem is that state_saving_rnn uses rnn() and not dynamic_rnn(), and therefore unrolls EVAL_SEQ_LEN steps at graph-construction time. You might want to re-implement state_saving_rnn on top of dynamic_rnn if you want to feed long sequences.
If I do want to predict one sample at a time and iterate in Python, what is the correct way to do it?
You can use dynamic_rnn and supply initial_state. This is probably just as efficient as state_saving_rnn; look at the state_saving_rnn implementation for reference.
During testing, is dynamic_rnn needed, or is it just used for unrolling for BPTT during training? Why does dynamic_rnn return the tensors for all the back-propagation steps? These are the outputs of each layer of the unrolled network, right?
dynamic_rnn does the unrolling at runtime, similarly to how rnn() does it at graph-construction time. I guess it returns all the steps so that you can branch the graph off at other places, after fewer time steps. In a network that uses [one time-step input * current state -> one output, new state], like the one described above, this isn't needed at test time, but it could be used for training with truncated backpropagation through time.