LSTM: inverting by entering output value to get to input value - tensorflow

How about this thought experiment. You have an LSTM network to encode a sequence of integers (inputs are x_t). For each cell, the weights, biases and outputs of the gates, as well as the previous states (C_t-1, h_t-1) are calculated and stored for the forward pass. Each cell has, thus, a certain hard-coded pattern. We'll use this image for a visualization aid (from Wikipedia).
I now want to take a cell with all its calculated values, and switch the input with the output. The new input is a new value that gets entered into h_t, let's call it h'_t. The computation is inverted to arrive at a new value for x_t that we'll call x'_t. Does the cell keep its properties so that there is a (non-linear) relationship between the output of a cell and the input, regardless of what the output value I enter into it is? This calculation would then be performed for the whole sequence of LSTM cells, but each cell would get a different h'_t value as an input.
Furthermore, would it be enough to only calculate it backwards for the output gate, since one can arrive at x'_t either through the output gate, the input gate, or the forget gate?


Is it advisable to save the final state from training of an RNN to initialize it during testing?

After training an RNN, does it make sense to save the final state so that it becomes the initial state for testing?
I am using:
from tensorflow.contrib import rnn  # TF 1.x import path
stacked_lstm = rnn.MultiRNNCell([rnn.BasicLSTMCell(n_hidden, state_is_tuple=True) for _ in range(number_of_layers)], state_is_tuple=True)
The state has a very specific meaning and purpose. This isn't a question of "advisable" or not, there's a right and wrong answer here, and it depends on your data.
Consider each timestep in your sequence of data. At the first time step your state should be initialized to all zeros. This value has a specific meaning, it tells the network that this is the beginning of your sequence.
At each time step the RNN is computing a new state. The MultiRNNCell implementation in tensorflow is hiding this from you, but internally in that function a new hidden state is computed at each time step and passed forward.
The value of state at the 2nd time step is the output of the state at the 1st time step, and so on and so forth.
So the answer to your question is yes only if the next batch is continuing in time from the previous batch. Let me explain this with a couple of examples where you do, and don't perform this operation respectively.
Example 1: let's say you are training a character RNN, a common tutorial example where your input is each character in the works of Shakespeare. There are millions of characters in this sequence. You can't train on a sequence that long. So you break your sequence into segments of 100 (unless you have a specific reason to do otherwise, limit your sequences to roughly 100 time steps). In this example, each training step is a sequence of 100 characters, and is a continuation of the previous 100 characters. So you must carry the state forward to the next training step.
Example 2: a case where this isn't used is training an RNN to recognize MNIST handwritten digits. Here you split each image into 28 rows of 28 pixels, so each training sequence has only 28 time steps, one per row in the image. Each training iteration starts at the beginning of the sequence for that image and trains fully until the end of the sequence for that image. You would not carry the hidden state forward in this case; your hidden state must start at zeros to tell the system that this is the beginning of a new image sequence, not the continuation of the last image you trained on.
I hope those two examples illustrate the important difference there. Know that if you have sequence lengths that are very long (say over ~100 timesteps) you need to break them up and think through the process of carrying forward the state appropriately. You can't effectively train on infinitely long sequence lengths. If your sequence lengths are under this rough threshold then you don't need to worry about this detail and can always initialize your state to zero.
Also know that even though you only train on, say, 100 timesteps at a time, the RNN can still be expected to learn patterns that operate over longer sequences. Karpathy's fabulous blog post "The Unreasonable Effectiveness of Recurrent Neural Networks" demonstrates this beautifully. Those character level RNNs can keep track of important details like whether a quote is open or not over many hundreds of characters, far more than were ever trained on in one batch, specifically because the hidden state was carried forward in the appropriate manner.
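To make the difference between the two examples concrete, here is a minimal NumPy sketch. It is not the TensorFlow implementation: it uses a plain tanh RNN with random, hypothetical weights (`W_xh`, `W_hh`, `b`) and only looks at the forward pass. It shows that carrying the state across consecutive segments reproduces what a single pass over the full sequence would compute, while resetting to zeros at every segment does not.

```python
import numpy as np

def rnn_forward(xs, h0, W_xh, W_hh, b):
    """Run a plain tanh RNN over xs (time, features); return all states and the last one."""
    h = h0
    states = []
    for x in xs:
        h = np.tanh(x @ W_xh + h @ W_hh + b)
        states.append(h)
    return np.stack(states), h

rng = np.random.default_rng(0)
n_in, n_hidden, seg_len = 3, 4, 100
W_xh = rng.normal(scale=0.5, size=(n_in, n_hidden))      # hypothetical weights
W_hh = rng.normal(scale=0.5, size=(n_hidden, n_hidden))
b = np.zeros(n_hidden)

long_seq = rng.normal(size=(300, n_in))   # stands in for one long text
zero_state = np.zeros(n_hidden)

# Reference: process the whole sequence in one go.
full_states, _ = rnn_forward(long_seq, zero_state, W_xh, W_hh, b)

# Example 1 style: consecutive segments, state carried forward.
state = zero_state
carried = []
for start in range(0, len(long_seq), seg_len):
    seg_states, state = rnn_forward(long_seq[start:start + seg_len], state, W_xh, W_hh, b)
    carried.append(seg_states)
carried = np.concatenate(carried)

# Example 2 style: every segment is treated as an independent sequence.
reset = np.concatenate([
    rnn_forward(long_seq[s:s + seg_len], zero_state, W_xh, W_hh, b)[0]
    for s in range(0, len(long_seq), seg_len)
])

print("carried matches full pass:", np.allclose(carried, full_states))
print("reset matches full pass:  ", np.allclose(reset, full_states))
```

In actual training the gradient is usually still cut at segment boundaries; only the state values are passed along, which is what this sketch mimics.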

Tensorflow RNN Input shape

Updated question: This is a good resource:
See the section on "LSTM State Within A Batch".
If I interpret this correctly, the author did not need to reshape the data as x,y,z (as he did in the preceding example); he just increased the batch size. So an LSTM cell's hidden state (the one that gets passed from one time step to the next) starts at row 0 and keeps getting updated until all rows in the batch have finished? Is that right?
If that is correct then why does one ever need to have a time step greater than 1? Could I not just stack all my time-series rows in order, and feed them as a single batch?
Original question:
I'm getting myself into an absolute muddle trying to understand the correct way to shape my data for tensorflow, particularly around time_steps. Reading around has only confused me further, so I thought I'd cave in and ask.
I'm trying to model time series data in which the data at time t is 5 columns wide (5 features, 1 label).
So then t-1 will also have another 5 features, and 1 label
Here is an example with 2 rows.
x=[1,1,1,1,1] y=[5]
x=[2,2,2,2,2] y=[15]
I've got an RNN model to work by feeding a 1x1x5 matrix into my x variable, which implies my 'time step' has a dimension of 1. However, as in the example above, the second line I feed in is correlated with the first (15 = 5 + (2+2+2+2+2), in case you haven't spotted it).
So is the way I'm currently entering it correct? How does the time step dimension work?
Or should I be thinking of it as batch size, rows, cols in my head?
Either way, can someone tell me what dimensions I should be reshaping my input data to? For the sake of argument, assume I've split the data into batches of 1000. So within those 1000 rows I want a prediction for every row, but the RNN should look to the row above it in my batch to figure out the answer.
x1=[1,1,1,1,1] y=[5]
x2=[2,2,2,2,2] y=[15]
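One common way to give the network access to "the row above" is to lay the data out as (batch_size, time_steps, features) using overlapping windows. The sketch below is only an illustration under that assumption: `make_windows` is a hypothetical helper, and the toy data simply extends the pattern from the question (each label is the previous label plus the sum of the current features).

```python
import numpy as np

def make_windows(features, labels, time_steps):
    """Build overlapping windows so each sample holds `time_steps` consecutive rows."""
    n_rows = features.shape[0]
    n_samples = n_rows - time_steps + 1
    x = np.stack([features[i:i + time_steps] for i in range(n_samples)])
    y = labels[time_steps - 1:]          # label of the last row in each window
    return x, y

# Toy data following the pattern in the question: y_t = y_{t-1} + sum(x_t).
features = np.array([[k] * 5 for k in range(1, 7)], dtype=np.float32)  # 6 rows, 5 features
labels = np.cumsum(features.sum(axis=1))                              # 5, 15, 30, ...

x, y = make_windows(features, labels, time_steps=2)
print(x.shape)   # (5, 2, 5): 5 samples, 2 time steps, 5 features
print(y.shape)   # (5,)
```

With `time_steps=1` this collapses to the 1x1x5 input described above, where the network sees each row in isolation unless the state is carried between calls.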

Clarification of TensorFlow AttentionWrapper's Layer Size

In tensorflow.contrib.seq2seq's AttentionWrapper, what does "depth" refer to as stated in the attention_layer_size documentation? When the documentation says to "use the context as attention" if the value is None, what is meant by "the context"?
In Neural Machine Translation by Jointly Learning to Align and Translate they give a description of the (Bahdanau) attention mechanism; essentially what happens is that you compute scalar "alignment scores" a_1, a_2, ..., a_n that indicate how important each element of your encoded input sequence is at a given moment in time (i.e. which part of the input sentence you should pay attention to right now in the current timestep).
Assuming your (encoded) input sequence that you want to "pay attention"/"attend over" is a sequence of vectors denoted as e_1, e_2, ..., e_n, the context vector at a given timestep is the weighted sum over all of these as determined by your alignment scores:
context = c := (a_1*e_1) + (a_2*e_2) + ... + (a_n*e_n)
(Remember that the a_k's are scalars; you can think of this as an "averaged-out" letter/word in your sentence --- so ideally if your model is trained well, the context looks most similar to the e_i you want to pay attention to the most, but bears a little bit of resemblance to e_{i-1}, e_{i+1}, etc. Intuitively, think of a "smeared-out" input element, if that makes any sense...)
Anyway, if attention_layer_size is not None, then it specifies the number of hidden units in a feedforward layer within your decoder that is used to mix this context vector with the output of the decoder's internal RNN cell to get the attention value. If attention_layer_size == None, it just uses the context vector above as the attention value, and no mixing of the internal RNN cell's output is done. (When I say "mixing", I mean that the context vector and the RNN cell's output are concatenated and then projected to the dimensionality that you specify by setting attention_layer_size.)
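As a rough illustration of the two cases, here is a NumPy sketch. It is not the AttentionWrapper code: the sizes, the random alignment scores and the weight matrix `W_attn` are all hypothetical, and the mixing layer is assumed to be a plain linear projection.

```python
import numpy as np

rng = np.random.default_rng(0)
n, enc_dim, cell_dim, attention_layer_size = 6, 8, 5, 4

encoder_outputs = rng.normal(size=(n, enc_dim))   # e_1 ... e_n
scores = rng.normal(size=n)                       # unnormalized scores (hypothetical values)
alignments = np.exp(scores - scores.max())
alignments /= alignments.sum()                    # a_1 ... a_n, softmax so they sum to 1

# context = a_1*e_1 + ... + a_n*e_n
context = alignments @ encoder_outputs            # shape (enc_dim,)

cell_output = rng.normal(size=cell_dim)           # decoder RNN cell output at this step

# attention_layer_size is None: the context itself is the attention value.
attention_none = context

# attention_layer_size is set: concatenate, then project to that size.
W_attn = rng.normal(size=(cell_dim + enc_dim, attention_layer_size))  # hypothetical weights
attention_mixed = np.concatenate([cell_output, context]) @ W_attn

print(attention_none.shape, attention_mixed.shape)   # (8,) (4,)
```

So "depth" here is just the length of the attention vector: the encoder output size when no layer is used, or `attention_layer_size` when one is.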
The relevant part of the implementation is at this line and has a description of how it's computed.
Hope that helps!

Does TensorFlow's RNN fully implement an Elman network?

Q: Is TensorFlow's RNN implemented to output the Elman network's hidden state?
cells = tf.contrib.rnn.BasicRNNCell(4)
outputs, state = tf.nn.dynamic_rnn(cell=cells, etc...)
I'm quite new to TF's RNN and curious about the meaning of outputs and state.
I'm following Stanford's TensorFlow tutorial, but there seems to be no detailed explanation, so I'm asking here.
After testing, I think state is the hidden state after the whole sequence has been processed, and outputs is the array of hidden states after each time step.
So I want to make this clear: outputs and state are just hidden state vectors, so to fully implement an Elman network I have to create the V matrix in the picture and do a matrix multiplication again. Am I correct?
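The relationship described above can be sketched in NumPy. This is not TensorFlow code: `U`, `W`, `V` and the biases are hypothetical random weights, and the loop stands in for what a basic tanh RNN cell computes at each step. The names `outputs` and `state` are chosen to mirror the two return values in the snippet above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out, T = 3, 4, 2, 5

U = rng.normal(size=(n_in, n_hidden))      # input -> hidden
W = rng.normal(size=(n_hidden, n_hidden))  # hidden -> hidden
V = rng.normal(size=(n_hidden, n_out))     # hidden -> output (the extra matrix)
b_h = np.zeros(n_hidden)
b_y = np.zeros(n_out)

xs = rng.normal(size=(T, n_in))
h = np.zeros(n_hidden)
outputs = []
for x in xs:
    h = np.tanh(x @ U + h @ W + b_h)       # hidden state update
    outputs.append(h)
outputs = np.stack(outputs)   # plays the role of `outputs`: one hidden state per step
state = h                     # plays the role of `state`: the last hidden state

# Elman output layer applied to every time step (an output activation could follow).
y = outputs @ V + b_y
print(outputs.shape, state.shape, y.shape)   # (5, 4) (4,) (5, 2)
```

For a basic RNN cell, the last row of `outputs` and `state` are the same vector; the projection with `V` is the part that has to be added on top.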
I believe you are asking what the intermediate state and the output are.
From what I understand, the state is the intermediate output after a convolution / sequence calculation and is hidden, so your understanding is in the right direction.
The output may vary depending on how you decide to implement your network model, but in general it is an array to which some operation (convolution, sequence calculation, etc.) has been applied, followed by activation and downsampling/pooling, to concentrate on identifiable features across that layer.
From Colah's blog:
Finally, we need to decide what we’re going to output. This output will be based on our cell state, but will be a filtered version. First, we run a sigmoid layer which decides what parts of the cell state we’re going to output. Then, we put the cell state through tanh (to push the values to be between −1 and 1) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to.
For the language model example, since it just saw a subject, it might want to output information relevant to a verb, in case that’s what is coming next. For example, it might output whether the subject is singular or plural, so that we know what form a verb should be conjugated into if that’s what follows next.
Hope this helps.
Thank you

Tensorflow: getting outputs from bidirectional_rnn with variable sequence length

I'm using tf.nn.bidirectional_rnn with the sequence_length parameter for variable input size, and I can't figure out how to get the final output for each sample in the minibatch:
output, _, _ = tf.nn.bidirectional_rnn(forward1, backward1, input, dtype=tf.float32, sequence_length=input_lengths)
Now, if I had constant sequence lengths, I would simply use output[-1] and get the final output. In my case I have variable sequences (their lengths are known).
Also, is this output the output of both forward and backward LSTMs?
This question can be answered by looking at the source code.
For sequences with dynamic length, the source code says:
If the sequence_length vector is provided, dynamic calculation is
performed. This method of calculation does not compute the RNN steps
past the maximum sequence length of the minibatch (thus saving
computational time), and properly propagates the state at an
example's sequence length to the final state output.
Therefore, in order to get the actual last output, you should slice the resulting output.
For bidirectional_rnn, the source code says:
A tuple (outputs, output_state_fw, output_state_bw) where:
outputs is a length T list of outputs (one for each input), which
are depth-concatenated forward and backward outputs.
output_state_fw is the final state of the forward rnn.
output_state_bw is the final state of the backward rnn.
Therefore, the output is a tuple rather than a tensor.
You can concatenate this tuple into a vector if you wish.
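The slicing step can be sketched in NumPy as follows. This is an illustration, not TensorFlow code: random arrays stand in for the length-T list of outputs, and it assumes the forward half comes first in the depth-concatenated vector and that the backward outputs are aligned to the original time order, so the backward RNN has seen the whole sequence at step 0.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, T, units = 3, 5, 4
lengths = np.array([5, 2, 3])                 # true length of each sample

# Stand-in for the length-T list returned by the RNN, each entry (batch, 2 * units).
output_list = [rng.normal(size=(batch, 2 * units)) for _ in range(T)]
outputs = np.stack(output_list, axis=1)       # (batch, T, 2 * units)

# Split the depth-concatenated outputs back into the two directions.
fw, bw = outputs[..., :units], outputs[..., units:]

# Forward direction: the last valid step of sample i is lengths[i] - 1.
fw_last = fw[np.arange(batch), lengths - 1]   # (batch, units)

# Backward direction: it has read the whole sequence once it reaches step 0.
bw_last = bw[:, 0]                            # (batch, units)

final = np.concatenate([fw_last, bw_last], axis=1)
print(final.shape)   # (3, 8)
```

Since the quoted docstring says the state at each example's sequence length is propagated to the final state, output_state_fw and output_state_bw may be a simpler route when the final state is what you need rather than a slice of the outputs.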