Tensorflow RNN Input shape - tensorflow

Updated question: This is a good resource: http://machinelearningmastery.com/understanding-stateful-lstm-recurrent-neural-networks-python-keras/
See the section on "LSTM State Within A Batch".
If I interpret this correctly, the author did not need to reshape the data as x,y,z (as he did in the preceding example); he just increased the batch size. So an LSTM cell's hidden state (the one that gets passed from one time step to the next) starts at row 0 and keeps getting updated until all rows in the batch have finished? Is that right?
If that is correct then why does one ever need to have a time step greater than 1? Could I not just stack all my time-series rows in order, and feed them as a single batch?
Original question:
I'm getting myself into an absolute muddle trying to understand the correct way to shape my data for tensorflow, particularly around time_steps. Reading around has only confused me further, so I thought I'd cave in and ask.
I'm trying to model time series data in which the row at time t has 5 features and 1 label.
The row at t-1 likewise has 5 features and 1 label.
Here is an example with 2 rows.
x=[1,1,1,1,1] y=[5]
x=[2,2,2,2,2] y=[15]
I've got an RNN model to work by feeding a 1x1x5 matrix into my x variable, which implies my 'time step' dimension is 1. However, as in the above example, the second line I feed in is correlated with the first (15 = 5 + (2+2+2+2+2), in case you haven't spotted it).
So is the way I'm currently entering it correct? How does the time step dimension work?
Or should I be thinking of it as batch size, rows, cols in my head?
Either way, can someone tell me what dimensions I should be reshaping my input data to? For the sake of argument, assume I've split the data into batches of 1000. Within those 1000 rows I want a prediction for every row, but the RNN should look at the row above it in my batch to figure out the answer.
x1=[1,1,1,1,1] y=[5]
x2=[2,2,2,2,2] y=[15]
...
etc.
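As a rough illustration of the shapes being asked about (a sketch, not a definitive answer), the snippet below builds toy data that follows the pattern in the question and reshapes it into the (batch, time_steps, features) layout. The row count of 12, the window length of 4, and all variable names are assumptions made up for the example.

```python
import numpy as np

n_rows, n_features, time_steps = 12, 5, 4  # assumed toy sizes

# Row t is [t, t, t, t, t], as in the question's example rows.
x = np.repeat(np.arange(1, n_rows + 1), n_features)
x = x.reshape(n_rows, n_features).astype(np.float32)

# Each label adds the current row's sum to the previous label: 5, 15, 30, ...
y = np.cumsum(x.sum(axis=1))

# One row per sample: shape (12, 1, 5), i.e. a time step dimension of 1.
one_step = x.reshape(-1, 1, n_features)

# Consecutive rows grouped into sequences: shape (3, 4, 5).
x_seq = x.reshape(-1, time_steps, n_features)
# One label per time step, to match: shape (3, 4, 1).
y_seq = y.reshape(-1, time_steps, 1)

print(one_step.shape, x_seq.shape, y_seq.shape)
```

With a time step dimension of 1, every row is its own sequence, so nothing inside a sample links row 2 back to row 1. With 4 time steps, the recurrent state is passed along the four rows of each sample.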

Related

should I shift a dataset to use it for regression with LSTM?

Maybe this is a silly question, but I didn't find much about it when I googled it.
I have a dataset that I use for regression, but a normal regression with an FFNN didn't work, so I thought I would try an LSTM. I think my data is time dependent, because it was taken from a vehicle while driving, so the data is monotonic, and maybe I can use an LSTM in this case to do a regression that predicts a continuous value (if this doesn't make sense, please tell me).
Now the first step is to prepare my data for the LSTM. Since I'll predict the future, I think my target (ground truth or labels) should be shifted up, am I right?
So if I have a pandas dataframe where each row holds the features and the target (at the end of the row), I assume that the features should stay where they are and the target should be shifted one step up, so that the features in the first row correspond to the target of the second row (or am I wrong?).
This way the LSTM will be able to predict the future value from those features.
I didn't find much about this on the internet, so could you show me how to do this with some code?
I also know that I can use pandas.DataFrame.shift to shift a dataset, but I think the last value will then hold a NaN! How do I deal with this? It would be great if you could show me some examples or code.
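The shift described above could look roughly like the following sketch. The column names and values are made up for illustration; the only assumption about the real data is that each row holds the features followed by the target. Dropping the last row is one simple way to handle the NaN, since that row has no future value to predict.

```python
import pandas as pd

# Hypothetical toy frame: two feature columns and a target, one row per time step.
df = pd.DataFrame({
    "f1": [1.0, 2.0, 3.0, 4.0],
    "f2": [10.0, 20.0, 30.0, 40.0],
    "target": [5.0, 15.0, 30.0, 50.0],
})

# shift(-1) moves the target up by one row, so the features at row t sit next
# to the target observed at row t+1. The last row receives NaN.
df["next_target"] = df["target"].shift(-1)

# The final row has no future target, so it is dropped.
supervised = df.dropna(subset=["next_target"]).reset_index(drop=True)

print(supervised)
```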
We might need a bit more information about the data you are using. Also, I would suggest starting with a simpler recurrent neural network before you go for LSTMs. These networks work by being fed the first bit of information, then the next bit, then the next, and so on. Let's say the first bit of information is fed in at time t; then the second bit is fed in at time t+1, and so on up until time t+n.
You can have the neural network output a value at each time step (so a value is output at time t, t+1, ..., t+n after each respective input has been fed in). This is a many-to-many network. Or you can have the neural network output a value after all inputs have been provided (i.e. the value is output at time t+n). This is called a many-to-one network. Which one you need depends on your use case.
For example, say you were recording vehicle behaviour every 100ms and after 10 seconds (i.e. the 100th time step), you wanted to predict the likelihood that the driver was under the influence of alcohol. In this case, you would use a many-to-one network where you put in subsequent vehicle behaviour recordings at subsequent time steps (the first recording at time t, then the next recording at time t+1 etc.) and then the final timestep has the probability value outputted.
If you want a value outputted after every time step, you use a many-to-many design. It's also possible to output a value every k timesteps.
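To make the difference between the designs concrete, here is a minimal sketch of a plain tanh RNN in NumPy, with untrained random weights. The sizes (100 time steps, 3 features, 4 hidden units) and all names are assumptions chosen to mirror the vehicle example; a real model would also put an output layer on top of the hidden states.

```python
import numpy as np

def rnn_forward(inputs, w_x, w_h, b):
    """Run a plain tanh RNN over inputs of shape (time_steps, n_features)
    and return the hidden state after every time step."""
    h = np.zeros(w_h.shape[0])
    states = []
    for x_t in inputs:
        h = np.tanh(x_t @ w_x + h @ w_h + b)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
time_steps, n_features, n_hidden = 100, 3, 4  # assumed toy sizes
inputs = rng.normal(size=(time_steps, n_features))
w_x = rng.normal(size=(n_features, n_hidden)) * 0.1
w_h = rng.normal(size=(n_hidden, n_hidden)) * 0.1
b = np.zeros(n_hidden)

states = rnn_forward(inputs, w_x, w_h, b)

many_to_many = states      # one output per time step: shape (100, 4)
many_to_one = states[-1]   # only the final time step: shape (4,)
every_k = states[9::10]    # an output every 10 steps: shape (10, 4)
```

The recurrence is identical in all three cases; the designs differ only in which time steps' outputs are kept and compared against a target.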

Is it advisable to save the final state from training of an RNN to initialize it during testing?

After training an RNN, does it make sense to save the final state so that it becomes the initial state for testing?
I am using:
stacked_lstm = rnn.MultiRNNCell([rnn.BasicLSTMCell(n_hidden,state_is_tuple=True) for _ in range(number_of_layers)], state_is_tuple=True)
The state has a very specific meaning and purpose. This isn't a question of "advisable" or not, there's a right and wrong answer here, and it depends on your data.
Consider each timestep in your sequence of data. At the first time step your state should be initialized to all zeros. This value has a specific meaning, it tells the network that this is the beginning of your sequence.
At each time step the RNN is computing a new state. The MultiRNNCell implementation in tensorflow is hiding this from you, but internally in that function a new hidden state is computed at each time step and passed forward.
The value of state at the 2nd time step is the output of the state at the 1st time step, and so on and so forth.
So the answer to your question is yes only if the next batch is continuing in time from the previous batch. Let me explain this with a couple of examples where you do, and don't perform this operation respectively.
Example 1: let's say you are training a character RNN, a common tutorial example where your input is each character in the works of Shakespeare. There are millions of characters in this sequence. You can't train on a sequence that long, so you break your sequence into segments of 100 (if you have no specific reason to do otherwise, limit your sequences to roughly 100 time steps). In this example, each training step is a sequence of 100 characters and is a continuation of the previous 100 characters, so you must carry the state forward to the next training step.
Example 2: a case where this isn't used would be training an RNN to recognize MNIST handwritten digits. In this case you split your image into 28 rows of 28 pixels, and each training example has only 28 time steps, one per row in the image. Each training iteration starts at the beginning of the sequence for that image and trains fully until the end of the sequence for that image. You would not carry the hidden state forward in this case; your hidden state must start with zeros to tell the system that this is the beginning of a new image sequence, not the continuation of the last image you trained on.
I hope those two examples illustrate the important difference there. Know that if you have sequence lengths that are very long (say over ~100 timesteps) you need to break them up and think through the process of carrying forward the state appropriately. You can't effectively train on infinitely long sequence lengths. If your sequence lengths are under this rough threshold, you don't need to worry about this detail and can always initialize your state to zero.
Also know that even though you only train on, say, 100 timesteps at a time, the RNN can still be expected to learn patterns that operate over longer sequences. Karpathy's fabulous blog post "The Unreasonable Effectiveness of Recurrent Neural Networks" demonstrates this beautifully. Those character-level RNNs can keep track of important details, like whether a quote is open or not, over many hundreds of characters, far more than were ever trained on in one batch, specifically because the hidden state was carried forward in the appropriate manner.
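The effect of carrying the state forward can be checked with a small NumPy sketch. It uses a plain tanh RNN rather than the LSTM from the question, hand-picked toy weights, and segments of 10 steps instead of 100 to keep it short; all names and numbers are assumptions for illustration only.

```python
import numpy as np

def run_segment(segment, h, w_x, w_h):
    """Advance a plain tanh RNN over one segment and return the final state."""
    for x_t in segment:
        h = np.tanh(x_t @ w_x + h @ w_h)
    return h

n_features, n_hidden, segment_len = 2, 3, 10
# Hand-picked toy weights and inputs (assumptions, not trained values).
w_x = np.full((n_features, n_hidden), 0.1)
w_h = 0.9 * np.eye(n_hidden)
sequence = np.linspace(0.1, 0.2, 30 * n_features).reshape(30, n_features)
zero_state = np.zeros(n_hidden)

# Reference: the whole 30-step sequence processed in one pass.
h_full = run_segment(sequence, zero_state, w_x, w_h)

# Example 1 style: split into segments, carrying the state across them.
h_carried = zero_state
for start in range(0, len(sequence), segment_len):
    segment = sequence[start:start + segment_len]
    h_carried = run_segment(segment, h_carried, w_x, w_h)

# Example 2 style: the last segment treated as a fresh sequence (zero state).
h_reset = run_segment(sequence[-segment_len:], zero_state, w_x, w_h)

print(np.allclose(h_full, h_carried))        # carrying reproduces the full pass
print(np.abs(h_full - h_reset).max())        # resetting loses earlier context
```

Carrying the state makes the segmented run equivalent to one long forward pass, which is what you want when segments continue one another. Resetting to zeros is the right choice only when each segment really is a new sequence.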

Setting up the input on an RNN in Keras

So I had a specific question with setting up the input in Keras.
I understand that the sequence length refers to the window length of the longest sequence that you are looking to model with the rest being padded by 0's.
However, how do I set up something that is already in a time series array?
For example, right now I have an array that is 550k x 28. So there are 550k rows each with 28 columns (27 features and 1 target). Do I have to manually split the array into (550k- sequence length) different arrays and feed all of those to the network?
Assuming that I want the first layer to be equivalent to the number of features per row, and to look at the past 50 rows, how do I size the input layer?
Is that simply input_size = (50,27), and again do I have to manually split the dataset up or would Keras automatically do that for me?
RNN inputs are like: (NumberOfSequences, TimeSteps, ElementsPerStep)
Each sequence is a row in your input array. This is also called "batch size", number of examples, samples, etc.
Time steps are the number of steps in each sequence.
Elements per step is how much info you have in each step of a sequence.
I'm assuming the 27 features are inputs and relate to ElementsPerStep, while the 1 target is the expected output having 1 output per step.
So I'm also assuming that your output is a sequence with also 550k steps.
Shaping the array:
Since you have only one sequence in the array, and this sequence has 550k steps, then you must reshape your array like this:
(1, 550000, 28)
#1 sequence
#550000 steps per sequence
#28 data elements per step
#PS: this sequence is too long; if it creates memory problems for you, it may be a good idea to use a `stateful=True` RNN, but I'm explaining the non-stateful method first.
Now you must split this array for inputs and targets:
X_train = thisArray[:, :, :27] #inputs
Y_train = thisArray[:, :, 27:] #targets; the 27: slice keeps the last axis, giving shape (1, 550000, 1)
Shaping the keras layers:
Keras layers will ignore the batch size (number of sequences) when you define them, so you will use input_shape=(550000,27).
Since your desired result is a sequence with same length, we will use return_sequences=True. (Else, you'd get only one result).
LSTM(numberOfCells, input_shape=(550000,27), return_sequences=True)
This will output a shape of (BatchSize, 550000, numberOfCells)
You may use a single layer with 1 cell to achieve your output, or you could stack more layers, considering that the last one should have 1 cell to match the shape of your output. (If you're using only recurrent layers, of course)
stateful = True:
When you have sequences so long that your memory can't handle them well, you must define the layer with stateful=True.
In that case, you will have to divide X_train in smaller length sequences*. The system will understand that every new batch is a sequel of the previous batches.
Then you will need to define batch_input_shape=(BatchSize,ReducedTimeSteps,Elements). In this case, the batch size should not be ignored like in the other case.
* Unfortunately I have no experience with stateful=True. I'm not sure about whether you must manually divide your array (less likely, I guess), or if the system automatically divides it internally (more likely).
The sliding window case:
In this case, what I often see is people dividing the input data like this:
From the 550k steps, get smaller arrays with 50 steps:
X = []
for i in range(550000-49):
    X.append(originalX[i:i+50]) #one 50-step window; then take care of the 28th element (the target)
Y = originalY[49:] #originalY is an assumed name for the target column; it seems you just exclude the first 49 targets
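A self-contained version of the same sliding-window idea, on a small made-up array, might look like the sketch below. The helper name make_windows, the 20 x 4 toy array and the window of 5 are all assumptions; it pairs each window with the target of its last row, which is one reading of "exclude the first 49".

```python
import numpy as np

def make_windows(data, window, n_features):
    """Split a (rows, n_features + 1) array into overlapping windows.

    Returns x with shape (rows - window + 1, window, n_features) and y with
    one target per window, taken from the window's last row.
    """
    n_windows = data.shape[0] - window + 1
    x = np.stack([data[i:i + window, :n_features] for i in range(n_windows)])
    y = data[window - 1:, n_features]
    return x, y

# Toy stand-in for the 550k x 28 array: 20 rows, 3 features + 1 target.
data = np.arange(20 * 4, dtype=np.float32).reshape(20, 4)
x, y = make_windows(data, window=5, n_features=3)

print(x.shape, y.shape)
```

For the real array this would correspond to window=50 and n_features=27, giving an input of shape (549951, 50, 27), so memory use grows by roughly a factor of the window length.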

Should my seq2seq RNN idea work?

I want to predict stock price.
Normally, people would feed the input as a sequence of stock prices.
Then they would feed the output as the same sequence but shifted to the left.
When testing, they would feed the output of the prediction into the next input timestep.
I have another idea, which is to fix the sequence length, for example 50 timesteps.
The input and output are exactly the same sequence.
When training, I replace the last 3 elements of the input with zeros to let the model know that I have no input for those timesteps.
When testing, I would feed the model a sequence of 50 elements, the last 3 of which are zeros. The predictions I care about are the last 3 elements of the output.
Would this work or is there a flaw in this idea?
The main flaw of this idea is that it does not add anything to the model's learning, and it reduces its capacity, because you force your model to learn an identity mapping for the first 47 steps (50-3). Note that providing 0 as input is equivalent to providing no input to an RNN: a zero input is still zero after being multiplied by a weight matrix, so the only sources of information are the bias and the output from the previous timestep, and both are already there in the original formulation. As for the second addition, the output for the first 47 steps: there is nothing to be gained by learning the identity mapping, yet the network will have to "pay the price" for it, since it will need to use weights to encode this mapping in order not to be penalised.
So in short: yes, your idea will work, but it is nearly impossible to get better results this way than with the original approach. You do not provide any new information and do not really modify the learning dynamics, yet you limit capacity by requiring an identity mapping to be learned per step. Moreover, the identity is an extremely easy thing to learn, so gradient descent will discover this relation first, before even trying to "model the future".
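The point about zero inputs can be verified with a few lines of NumPy. This is a sketch of a single step of a plain tanh RNN with random, untrained weights; the sizes and names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
w_x = rng.normal(size=(1, 4))   # input-to-hidden weights (1 input feature)
w_h = rng.normal(size=(4, 4))   # hidden-to-hidden weights
b = rng.normal(size=4)          # bias
h_prev = rng.normal(size=4)     # state from the previous timestep

def step(x_t, h):
    """One step of a plain tanh RNN."""
    return np.tanh(x_t @ w_x + h @ w_h + b)

# A zero input contributes nothing through w_x ...
with_zero_input = step(np.zeros(1), h_prev)
# ... so the result matches a step that has no input term at all.
without_input_term = np.tanh(h_prev @ w_h + b)

print(np.allclose(with_zero_input, without_input_term))
```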

Influence of number of steps back when using truncated backpropagation through time

I'm currently developing a model for time series prediction using LSTM cells with tensorflow. My model is similar to the ptb_word_lm example. It works, but I'm not sure how to understand the number-of-steps-back parameter when using truncated backpropagation through time (the parameter is called num_steps in the example).
As far as I understand, the model parameters are updated after every num_steps steps. But does that also mean that the model does not recognize dependencies that are farther away than num_steps? I think it should recognize them, because the internal state should capture them. But then what effect does a large or small num_steps value have?
In the ptb_word_lm example, num_steps is the sequence length, i.e. the number of words processed when predicting the next word.
For example if you have a sentence.
"Widows and orphans occur when the first line of a paragraph is the last line in a column or page, or when the last line of a paragraph is the first line of a new column or page."
If you set num_steps = 5, then you mean:
input = "Widows and orphans occur when"
output = "and orphans occur when the"
i.e. given the words ("Widows","and","orphans","occur","when"), you are trying to predict the occurrence of the word ("the").
So num_steps plays an important role in remembering the larger context (i.e. the given words) when predicting the probability of the next word.
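The input/output pairing above can be sketched in plain Python. The helper make_pairs is a hypothetical name, not the actual ptb_word_lm reader; it only shows how each target window is the input window shifted by one word.

```python
sentence = ("Widows and orphans occur when the first line of a paragraph "
            "is the last line in a column or page")
words = sentence.split()
num_steps = 5

def make_pairs(words, num_steps):
    """Cut the word list into consecutive windows of num_steps words,
    pairing each window with the same window shifted one word ahead."""
    pairs = []
    for start in range(0, len(words) - num_steps, num_steps):
        inputs = words[start:start + num_steps]
        targets = words[start + 1:start + num_steps + 1]
        pairs.append((inputs, targets))
    return pairs

pairs = make_pairs(words, num_steps)
for inputs, targets in pairs:
    print(" ".join(inputs), "->", " ".join(targets))
```

Gradients are only propagated back through the num_steps words of one window, while the hidden state can still be carried from one window to the next.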
Hope this is helpful.