TensorFlow - LSTM state reuse within batch

I am working on a Tensorflow NN which uses an LSTM to track a parameter (time series data regression problem). A batch of training data contains a batch_size of consecutive observations. I would like to use the LSTM state as input to the next sample. So, if I have a batch of data observations, I would like to feed the state of the first observation as input to the second observation and so on. Below I define the lstm state as a tensor of size = batch_size. I would like to reuse the state within a batch:
cell = tf.nn.rnn_cell.BasicLSTMCell(100)
# Non-trainable variable that holds the LSTM state between runs
state = tf.Variable(cell.zero_state(batch_size, tf.float32), trainable=False)
output, curr_state = tf.nn.rnn(cell, data, initial_state=state)
In the API there is a tf.nn.state_saving_rnn, but the documentation is rather vague. My question: how do I reuse curr_state within a training batch?

You are basically there, just need to update state with curr_state:
state_update = tf.assign(state, curr_state)
Then, make sure you either call run on state_update itself or an operation that has state_update as a dependency, or the assignment will not actually happen. For example:
with tf.control_dependencies([state_update]):
    model_output = ...
As suggested in the comments, the typical case for RNNs is that you have a batch where the first dimension (0) is the number of sequences and the second dimension (1) is the maximum length of each sequence (if you pass time_major=True when you build the RNN these two are swapped). Ideally, in order to get good performance, you stack multiple sequences into one batch, and then split that batch time-wise. But that's all a different topic really.
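To see why carrying the state forward matters, here is a minimal framework-free sketch. It uses a toy scalar recurrent cell with made-up weights (rnn_step, run_chunk and the weight values are assumptions for illustration, not TensorFlow API). It shows that processing a sequence in chunks gives the same outputs as processing it in one pass, provided the final state of each chunk is used as the initial state of the next, which is the role tf.assign(state, curr_state) plays above.

```python
import math

def rnn_step(x, h, w_x=0.5, w_h=0.8):
    """One step of a toy scalar RNN cell (hypothetical weights)."""
    return math.tanh(w_x * x + w_h * h)

def run_chunk(xs, h0):
    """Run the cell over a chunk; return per-step outputs and the final state."""
    h = h0
    outputs = []
    for x in xs:
        h = rnn_step(x, h)
        outputs.append(h)
    return outputs, h

data = [0.1, -0.3, 0.7, 0.2, -0.5, 0.9]

# Whole sequence in one pass, starting from a zero state.
full_outputs, _ = run_chunk(data, h0=0.0)

# Same data in chunks of 2, carrying the state forward between calls.
state = 0.0
chunked_outputs = []
for start in range(0, len(data), 2):
    outs, state = run_chunk(data[start:start + 2], state)
    chunked_outputs.extend(outs)

print(chunked_outputs == full_outputs)
```

If the state were reset to zero at the start of every chunk instead, the two output lists would diverge from the third element onward.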

Related

Keras variable input

I'm working through a Keras example at https://www.tensorflow.org/tutorials/text/text_generation
The model is built here:
def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embedding_dim,
                                  batch_input_shape=[batch_size, None]),
        tf.keras.layers.GRU(rnn_units,
                            return_sequences=True,
                            stateful=True,
                            recurrent_initializer='glorot_uniform'),
        tf.keras.layers.Dense(vocab_size)
    ])
    return model
During training, they always pass in a length 100 array of ints.
But during prediction, they are able to pass in any length of input and the output is the same length as the input. I was always under the impression that the lengths of the time steps had to be the same. Is that not the case and the # of time steps of the RNN somehow can change?
RNNs are sequence models, i.e. they take in a sequence of inputs and give out a sequence of outputs. The sequence length, also called the number of time steps, is the number of times the RNN cell is unrolled; at each unrolling step an input is passed in and the cell, using its gates, produces an output. So in theory a sequence can be as long as you want. Now assume you have inputs of different lengths. Since a single batch cannot contain inputs of different lengths, you have to group inputs of the same length into a batch if you want to train with batches. You can also use a batch size of 1 and not worry about any of this, but training becomes painfully slow.
In practical situations, we split the input into equal-length sequences during training so that training is fast. There are situations, like language translation models, where this is not feasible.
So in theory RNNs do not have any limitation on the sequence length; however, with long sequences the network starts to lose the context from the beginning as the sequence length increases.
At prediction time you can use any sequence length you want.
In your case the output length is the same as the input length because of return_sequences=True. You can also get a single output by using return_sequences=False, in which case Keras returns only the output of the last unrolling step.
The length of the training sequences does not have to equal the length used for prediction.
An RNN deals with two vectors: the new word and the hidden state (accumulated from the previous words). It does not store the sequence length.
But to get good predictions on long sequences you have to train the RNN on long sequences, because it has to learn a long context.
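As a rough illustration of why the number of time steps is not baked into the model, here is a toy scalar RNN in plain Python (run_rnn and its weights are hypothetical, not Keras API). The same two weights are applied at every step, so the loop length comes from the input alone. The return_sequences flag mimics the Keras option of the same name.

```python
import math

def run_rnn(xs, return_sequences=True, w_x=0.5, w_h=0.8):
    """Toy scalar RNN: the same weights are reused at every time step."""
    h = 0.0
    outputs = []
    for x in xs:  # the number of steps is set by the input, not by the weights
        h = math.tanh(w_x * x + w_h * h)
        outputs.append(h)
    # Either one output per input step, or only the last step's output.
    return outputs if return_sequences else outputs[-1]

short = run_rnn([0.1, 0.2, 0.3])
long = run_rnn([0.1] * 100)
last_only = run_rnn([0.1, 0.2, 0.3], return_sequences=False)

print(len(short), len(long), last_only == short[-1])
```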

Keras LSTM: Is batchsize equal to t from xt?

I know there have been many questions about this already, but I can't find a clear answer.
Is this correct? Taken from Understanding Keras LSTMs here. Does the batch size correspond to 5 (0-4) in this picture? Taken from http://colah.github.io/posts/2015-08-Understanding-LSTMs/ here. With a Keras line like this:
model.add(LSTM(units, batch_input_shape=(batch_size, n_time_steps, n_features), stateful=False))
Note the stateful=False.
So one input vector (one blue bubble) would be the size n_time_steps*n_features, right?
To make this clear: batch_input_shape = (batch_size, time_steps, n_features). For the first image you mentioned, that is batch_input_shape = (batch_size, 4, 3). For the second image it is batch_input_shape = (batch_size, 5, 1).
Batch size is not represented in either picture, so don't let them confuse you about batch size.
A better understanding of these dimensions can be observed below.
For stateful=True, the model expects the input to be in sequence, i.e. not shuffled and non-overlapping.
In this scenario, you need to fix the batch_size first.
If the data set is small, you can set batch_size to 1 (which is the case most of the time).
If the data set is large, you can choose any batch_size and split the data into that many equal parts, so that each batch continues where the previous one left off when the next iteration starts.
At each iteration, instead of starting from a hidden state full of zeros, the model takes the previous batch's final state as the initial state for the current batch.
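One possible way to arrange the data for the stateful case is sketched below, assuming a single long series with one feature. make_stateful_batches is a hypothetical helper, not a Keras function. The point is that row i of batch k+1 must continue row i of batch k, because the state kept for row i is carried from one batch to the next.

```python
def make_stateful_batches(series, batch_size, time_steps):
    """Split a series so that row i of batch k+1 continues row i of batch k."""
    n_batches = len(series) // (batch_size * time_steps)
    row_len = n_batches * time_steps
    # Cut the series into batch_size contiguous rows (any leftover tail is dropped).
    rows = [series[i * row_len:(i + 1) * row_len] for i in range(batch_size)]
    batches = []
    for k in range(n_batches):
        # Batch k takes the k-th window of time_steps values from every row.
        batches.append([row[k * time_steps:(k + 1) * time_steps] for row in rows])
    return batches

series = list(range(24))
batches = make_stateful_batches(series, batch_size=2, time_steps=3)
for b in batches:
    print(b)
```

With this layout the first row runs 0, 1, 2 and then 3, 4, 5 in the next batch, while the second row runs 12, 13, 14 and then 15, 16, 17, so neither shuffling nor overlap occurs.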

Defining assignment function as variable in TensorFlow?

I am training a neural network by SGD (batch size = 1). The inputs are randomly generated, and the labels are calculated based on the input. In other words, the data does not have to be realistic, but the relationships between inputs and labels are specific. I will train my NN for only 1 epoch, but with many batches.
I have the following code:
training_input = tf.Variable(tf.zeros(...))
assign_training_input_with_random_values = training_input.assign(tf.random_normal(...))
# Create a session, initialize a bunch of variables, construct a neural network...
for batch in range(batch_number):
    sess.run(assign_training_input_with_random_values)
    # Train my neural network...
However I noticed that if I write the above code differently the speed goes down by a lot:
# Run the assignment operation directly without defining it as a variable
for batch in range(batch_number):
    sess.run(training_input.assign(tf.random_normal(...)))
    # Train my neural network...
The first snippet being significantly faster makes me worry that tensorflow is only randomizing when I define the assign_training_input_with_random_values variable, and the same training examples are fed to the NN over every batch afterwards. In this case, the NN will probably not generalize well. Meanwhile, the second snippet is slow because it is randomizing every batch. Is this actually the case or is there another reason for this?
First, the explanation for your observations
Computational difference between 1st and 2nd solutions
It makes sense that your first solution is faster than the second. In the first, you define the assign operation once and then execute it on every iteration. In the second, you create a new op on every iteration, which grows the computational graph over time and slows your program down.
Observation about the 1st solution
(After @Y.Z.'s finding) Apparently the first solution does evaluate to a different random array every time you run it. Therefore, the first solution is also valid.
Another way to implement this
Another way to implement this is to use a tf.placeholder and feed in new values on every iteration, as follows.
import tensorflow as tf
import numpy as np
training_input = tf.Variable(tf.zeros(shape=[3, 2]))
tf_random = tf.placeholder(shape=[3, 2], dtype=tf.float32)
assign_training_input_with_random_values = training_input.assign(tf_random)
# Create a session, initialize a bunch of variables, construct a neural network...
epoch = 0
with tf.Session() as sess:
    while epoch < 10:
        epoch += 1
        sess.run(assign_training_input_with_random_values, feed_dict={tf_random: np.random.normal(size=(3, 2))})
Comparing Solution 1 vs My solution
So it turns out that neither your first solution nor my solution grows the graph. If you run the line
print([n.name for n in tf.get_default_graph().as_graph_def().node])
for your first solution and for my solution (be careful to run tf.reset_default_graph() at the beginning), you'll see that the number of nodes remains constant regardless of the number of iterations. In both cases the ops are created once, outside the loop, and sess.run only re-executes them; the random op produces fresh values on each run without adding nodes to the graph.
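The difference between the two patterns can be mimicked without TensorFlow. The sketch below uses ToyGraph, a hypothetical stand-in for a computation graph (a plain list of deferred ops), purely to illustrate the bookkeeping: defining an op once and running it repeatedly keeps the graph at a fixed size while still producing new random values, whereas creating the op inside the loop adds a node per iteration.

```python
import random

class ToyGraph:
    """Hypothetical stand-in for a computation graph: a list of deferred ops."""

    def __init__(self):
        self.nodes = []

    def random_normal(self):
        # Registers a new node; the node is re-evaluated every time it is run.
        def op():
            return random.gauss(0.0, 1.0)
        self.nodes.append(op)
        return op

def run(op):
    """Evaluate a previously defined op (the analogue of sess.run)."""
    return op()

random.seed(0)

# Pattern 1: define the op once, run it many times.
g1 = ToyGraph()
assign_op = g1.random_normal()
values = [run(assign_op) for _ in range(5)]

# Pattern 2: create a new op on every iteration.
g2 = ToyGraph()
for _ in range(5):
    run(g2.random_normal())

print(len(g1.nodes), len(g2.nodes))
print(len(set(values)) > 1)
```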

Understanding the functioning of a recurrent neural network with LSTM cells

Context:
I have a recurrent neural network with LSTM cells
The input to the network is a batch of size (batch_size, number_of_timesteps, one_hot_encoded_class) in my case (128, 300, 38)
The different rows of the batch (1-128) are not necessarily related to each other
The target for one time step is given by the value of the next time step.
My questions:
When I train the network using an input batch of (128,300,38) and a target batch of the same size,
does the network always consider only the last time-step t to predict the value of the next timestep t+1?
or does it consider all time steps from the beginning of the sequence up to time step t?
or does the LSTM cell internally remember all previous states?
I am confused about the functioning because the network is trained on multiple time steps simultaneously, so I am not sure how the LSTM cell can still have knowledge of the previous states.
I hope somebody can help. Thanks in advance!
Code for discussion:
cells = []
for i in range(self.n_layers):
    cell = tf.contrib.rnn.LSTMCell(self.n_hidden)
    cells.append(cell)
cell = tf.contrib.rnn.MultiRNNCell(cells)
init_state = cell.zero_state(self.batch_size, tf.float32)
outputs, final_state = tf.nn.dynamic_rnn(
    cell, inputs=self.inputs, initial_state=init_state)
self.logits = tf.contrib.layers.linear(outputs, self.num_classes)
softmax_ce = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=labels, logits=self.logits)
self.loss = tf.reduce_mean(softmax_ce)
self.train_step = tf.train.AdamOptimizer(self.lr).minimize(self.loss)
The above is a simple RNN unrolled to the neuron level with 3 time steps.
As you can see, the output at time step t depends on all time steps from the beginning. The network is trained using back-propagation through time, where the weights are updated by the contribution of all error gradients across time. The weights are shared across time, so there is no such thing as a simultaneous update on all time steps.
Knowledge of the previous states is transferred through the state variable s_t, since it is a function of the previous inputs. So at any time step, the prediction is made based on the current input as well as (a function of) the previous inputs captured by the state variable.
NOTE: A basic RNN was used instead of an LSTM for simplicity.
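The dependence on earlier time steps can be checked numerically with a toy scalar RNN (outputs_for and its weights are assumptions for illustration, not the network from the question). Two input sequences that differ only in their first element still produce different outputs at the last step, because the state s_t carries the earlier input forward.

```python
import math

def outputs_for(xs, w_x=0.5, w_h=0.8):
    """Toy scalar RNN: s_t = tanh(w_x * x_t + w_h * s_{t-1})."""
    s = 0.0
    outs = []
    for x in xs:
        s = math.tanh(w_x * x + w_h * s)
        outs.append(s)
    return outs

base = outputs_for([1.0, 0.2, 0.2, 0.2])
changed = outputs_for([-1.0, 0.2, 0.2, 0.2])  # only the first input differs

# The inputs at the last step are identical, yet the outputs are not.
diff_at_last_step = abs(base[-1] - changed[-1])
print(diff_at_last_step > 0)
```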
Here's what would be helpful to keep in mind for your case specifically:
Given the input shape of [128, 300, 38]:
One call to dynamic_rnn will propagate through all 300 steps, and if you are using something like an LSTM, the state will also be carried through those 300 steps.
However, each SUBSEQUENT call to dynamic_rnn will not automatically remember the state from the previous call. By the second call, the weights etc. will have been updated thanks to the first call, but you will still need to pass the state that resulted from the first call into the second call. That's why dynamic_rnn has a parameter initial_state and why one of its outputs is final_state (i.e. the state after processing all 300 steps in ONE call). So you are meant to take the final state from call N and pass it back as the initial state for call N+1 to dynamic_rnn. This all relates specifically to LSTM, since this is what you asked about.
You are right to note that elements in one batch don't necessarily relate to each other within the same batch. This is something you need to consider carefully, because with successive calls to dynamic_rnn, batch elements in your input sequences have to relate to their respective counterparts in the previous/following sequence, but not to each other. I.e. element 3 in the first call may have nothing to do with the other 127 elements within the same batch, but element 3 in the NEXT call has to be the temporal/logical continuation of element 3 in the PREVIOUS call, and so forth. This way, the state that you keep passing forward makes sense continuously.

Tensorflow dynamic_rnn parameters meaning

I'm struggling to understand the cryptic RNN docs. Any help with the following will be greatly appreciated.
tf.nn.dynamic_rnn(cell, inputs, sequence_length=None, initial_state=None, dtype=None, parallel_iterations=None, swap_memory=False, time_major=False, scope=None)
I'm struggling to understand how these parameters relate to the mathematical LSTM equations and RNN definition. Where is the cell unroll size? Is it defined by the 'max_time' dimension of the inputs? Is the batch_size only a convenience for splitting long data, or is it related to minibatch SGD? Is the output state passed across batches?
tf.nn.dynamic_rnn takes in a batch (with the minibatch meaning) of unrelated sequences.
cell is the actual cell that you want to use (LSTM, GRU,...)
inputs has a shape of batch_size x max_time x input_size in which max_time is the number of steps in the longest sequence (but all sequences could be of the same length)
sequence_length is a vector of size batch_size in which each element gives the length of each sequence in the batch (leave it as default if all your sequences are of the same size). This parameter is the one that defines the cell unroll size.
Hidden state handling
The usual way of handling hidden state is to define an initial state tensor before the dynamic_rnn, like this for instance:
hidden_state_in = cell.zero_state(batch_size, tf.float32)
output, hidden_state_out = tf.nn.dynamic_rnn(cell,
                                             inputs,
                                             initial_state=hidden_state_in,
                                             ...)
In the above snippet, both hidden_state_in and hidden_state_out have the same shape [batch_size, ...] (the actual shape depends on the type of cell you use but the important thing is that the first dimension is the batch size).
This way, dynamic_rnn has an initial hidden state for each sequence. It will pass on the hidden state from time step to time step for each sequence in the inputs parameter on its own, and hidden_state_out will contain the final output state for each sequence in the batch. No hidden state is passed between sequences of the same batch, but only between time steps of the same sequence.
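The per-sequence state handling described above can be sketched in plain Python. toy_dynamic_rnn is a hypothetical function with a scalar cell and made-up weights, not the TensorFlow implementation. It keeps one independent state per sequence, never mixes states between rows of the batch, and, following the documented behaviour of sequence_length, copies the state through and leaves the output at zero once a sequence has ended.

```python
import math

def toy_dynamic_rnn(batch, sequence_length, w_x=0.5, w_h=0.8):
    """batch: list of equal-length (padded) sequences of scalars."""
    max_time = len(batch[0])
    states = [0.0] * len(batch)  # one independent state per sequence
    outputs = [[0.0] * max_time for _ in batch]
    for t in range(max_time):
        for i, seq in enumerate(batch):
            if t < sequence_length[i]:
                # Update only this sequence's state from its own previous state.
                states[i] = math.tanh(w_x * seq[t] + w_h * states[i])
                outputs[i][t] = states[i]
            # Past the end of sequence i: state is kept, output stays zero.
    return outputs, states

batch = [[0.1, 0.2, 0.3, 0.4],
         [0.5, 0.6, 0.0, 0.0]]  # second sequence is padded to max_time
outputs, final_states = toy_dynamic_rnn(batch, sequence_length=[4, 2])

print(outputs[1])
print(final_states)
```

Here final_states has one entry per sequence, matching the [batch_size, ...] shape mentioned above, and the final state of the padded sequence equals its output at its last real step.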
When do I need to feed back the hidden state manually?
Usually, when you're training, every batch is unrelated so you don't have to feed back the hidden state when doing a session.run(output).
However, if you're testing and you need the output at each time step (i.e. you have to do a session.run() at every time step), you'll want to evaluate and feed back the output hidden state using something like this:
# Fetch into a separate name so the output tensor is not overwritten
output_val, hidden_state = sess.run([output, hidden_state_out],
                                    feed_dict={hidden_state_in: hidden_state})
Otherwise TensorFlow will just use the default cell.zero_state(batch_size, tf.float32) at each time step, which equates to reinitialising the hidden state at each time step.
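The effect of reinitialising the state on every call can be seen with a toy scalar cell (step and its weights are hypothetical, for illustration only). When the state is fed back, repeated identical inputs give an evolving output; when the state is reset to zero on each call, every call returns the same value and all context is lost.

```python
import math

def step(x, h, w_x=0.5, w_h=0.8):
    """One step of a toy scalar RNN cell (hypothetical weights)."""
    return math.tanh(w_x * x + w_h * h)

inputs = [0.3, 0.3, 0.3]

# Feeding the state back: the output changes as context accumulates.
h = 0.0
fed_back = []
for x in inputs:
    h = step(x, h)
    fed_back.append(h)

# Resetting to zero on every call: identical inputs give identical outputs.
reset = [step(x, 0.0) for x in inputs]

print(fed_back)
print(reset)
```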