Setting up the input on an RNN in Keras - input

So I had a specific question with setting up the input in Keras.
I understand that the sequence length refers to the window length of the longest sequence that you are looking to model with the rest being padded by 0's.
However, how do I set up something that is already in a time series array?
For example, right now I have an array that is 550k x 28. So there are 550k rows each with 28 columns (27 features and 1 target). Do I have to manually split the array into (550k- sequence length) different arrays and feed all of those to the network?
Assuming that I want to the first layer to be equivalent to the number of features per row, and looking at the past 50 rows, how do I size the input layer?
Is that simply input_size = (50,27), and again do I have to manually split the dataset up or would Keras automatically do that for me?

RNN inputs are like: (NumberOfSequences, TimeSteps, ElementsPerStep)
Each sequence is a row in your input array. This is also called "batch size", number of examples, samples, etc.
Time steps are the amount of steps for each sequence
Elements per step is how much info you have in each step of a sequence
I'm assuming the 27 features are inputs and relate to ElementsPerStep, while the 1 target is the expected output having 1 output per step.
So I'm also assuming that your output is a sequence with also 550k steps.
Shaping the array:
Since you have only one sequence in the array, and this sequence has 550k steps, then you must reshape your array like this:
(1, 550000, 28)
#1 sequence
#550000 steps per sequence
#28 data elements per step
#PS: this sequence is too long, if it creates memory problems to you, maybe it will be a good idea to use a `stateful=True` RNN, but I'm explaining the non stateful method first.
Now you must split this array for inputs and targets:
X_train = thisArray[:, :, :27] #inputs
Y_train = thisArray[:, :, 27] #targets
Shaping the keras layers:
Keras layers will ignore the batch size (number of sequences) when you define them, so you will use input_shape=(550000,27).
Since your desired result is a sequence with same length, we will use return_sequences=True. (Else, you'd get only one result).
LSTM(numberOfCells, input_shape=(550000,27), return_sequences=True)
This will output a shape of (BatchSize, 550000, numberOfCells)
You may use a single layer with 1 cell to achieve your output, or you could stack more layers, considering that the last one should have 1 cell to match the shape of your output. (If you're using only recurrent layers, of course)
stateful = True:
When you have sequences so long that your memory can't handle them well, you must define the layer with stateful=True.
In that case, you will have to divide X_train in smaller length sequences*. The system will understand that every new batch is a sequel of the previous batches.
Then you will need to define batch_input_shape=(BatchSize,ReducedTimeSteps,Elements). In this case, the batch size should not be ignored like in the other case.
* Unfortunately I have no experience with stateful=True. I'm not sure about whether you must manually divide your array (less likely, I guess), or if the system automatically divides it internally (more likely).
The sliding window case:
In this case, what I often see is people dividing the input data like this:
From the 550k steps, get smaller arrays with 50 steps:
X = []
for i in range(550000-49):
X.append(originalX[i:i+50]) #then take care of the 28th element
Y = #it seems you just exclude the first 49 ones from the original

Related

Is it advisable to save the final state from training of an RNN to initialize it during testing?

After training a RNN does it makes sense to save the final state so that it is then the initial state for testing?
I am using:
stacked_lstm = rnn.MultiRNNCell([rnn.BasicLSTMCell(n_hidden,state_is_tuple=True) for _ in range(number_of_layers)], state_is_tuple=True)
The state has a very specific meaning and purpose. This isn't a question of "advisable" or not, there's a right and wrong answer here, and it depends on your data.
Consider each timestep in your sequence of data. At the first time step your state should be initialized to all zeros. This value has a specific meaning, it tells the network that this is the beginning of your sequence.
At each time step the RNN is computing a new state. The MultiRNNCell implementation in tensorflow is hiding this from you, but internally in that function a new hidden state is computed at each time step and passed forward.
The value of state at the 2nd time step is the output of the state at the 1st time step, and so on and so forth.
So the answer to your question is yes only if the next batch is continuing in time from the previous batch. Let me explain this with a couple of examples where you do, and don't perform this operation respectively.
Example 1: let's say you are training a character RNN, a common tutorial example where your input is each character in the works of Shakespear. There are millions of characters in this sequence. You can't train on a sequence that long. So you break your sequence into segments of 100 (if you don't know why to do otherwise limit your sequences to roughly 100 time steps). In this example, each training step is a sequence of 100 characters, and is a continuation of the last 100 characters. So you must carry the state forward to the next training step.
Example 2: where this isn't use would be in training an RNN to recognize MNIST handwritten digits. In this case you split your image into 28 rows of 28 pixels and each training has only 28 time steps, one per row in the image. In this case each training iteration starts at the beginning of the sequence for that image and trains fully until the end of the sequence for that image. You would not carry the hidden state forward in this case, your hidden state must start with zero's to tell the system that this is the beginning of a new image sequence, not the continuation of the last image you trained on.
I hope those two examples illustrate the important difference there. Know that if you have sequence lengths that are very long (say over ~100 timesteps) you need to break them up and think through the process of carrying forward the state appropriately. You can't effectively train on infinitely long sequence lengths. If your sequence lengths are under this rough threshold then you won't worry about this detail and always initialize your state to zero.
Also know that even though you only train on say 100 timesteps at a time the RNN can still be expected to learn patterns that operate over longer sequences, Karpathy's fabulous paper/blog on "The unreasonable effectiveness of RNNs" demonstrates this beautifully. Those character level RNNs can keep track of important details like whether a quote is open or not over many hundreds of characters, far more than were ever trained on in one batch, specifically because the hidden state was carried forward in the appropriate manner.

Variable size multi-label candidate sampling in tensorflow?

nce_loss() asks for a static int value for num_true. That works well for problems where we have the same amount of labels per training example and we know it in advance.
When labels have a variable shape [None], and being batched and/or bucketed by bucket size with .padded_batch() + .group_by_window() it is necessary to provide a variable size num_true in order to accustom for all training examples. This is currently unsupported to my knowledge (correct me if I'm wrong).
In other words suppose we have either a dataset of images with an arbitrary amount of labels per each image (dog, cat, duck, etc.) or a text dataset with numerous multiple classes per sentence (class_1, class_2, ..., class_n). Classes are NOT mutually exclusive, and can vary in size between examples.
But as the amount of possible labels can be huge 10k-100k is there a way to do a sampling loss to improve performance (in comparison with a sigmoid_cross_entropy)?
Is there a proper way to do this or any other workarounds?
nce_loss = tf.nn.nce_loss(
weights=nce_weights,
biases=nce_biases,
labels=labels,
inputs=inputs,
num_sampled=num_sampled,
# Something like this:
# `num_true=(tf.shape(labels)[-1])` instead of `num_true=const_int`
# , would be preferable here
num_classes=self.num_classes)
I see two issues:
1) Work with NCE with different numbers of true values;
2) Classes that are NOT mutually exclusive.
To the first issue, as #michal said, there is an expectative of including this functionality in the future. I have tried almost the same thing: to use labels with shape=(None, None), i.e., true_values dimension None. The sampled_values parameter has the same problem: true_values number must be a fixed integer number. The recomended work around is to use a class (0 is the best one) representing <PAD> and complete the number of true_values. In my case, 0 is an special token that represents <PAD>. Part of code is here:
assert len(labels) <= (window_size * 2)
zeros = ((window_size * 2) - len(labels)) * [0]
labels = labels + zeros
labels.sort()
I sorted the label because considering another recommendation:
Note: By default this uses a log-uniform (Zipfian) distribution for
sampling, so your labels must be sorted in order of decreasing
frequency to achieve good results.
In my case, the special tokens and more frequent words have lower indexes, otherwise, less frequent words have higher indexes. I included all label classes associated to the input at same time and completed with zero till the true_values number. Of course, you must ignore the 0 class at the end.

Tensorflow RNN Input shape

Updated question: This is a good resource: http://machinelearningmastery.com/understanding-stateful-lstm-recurrent-neural-networks-python-keras/
See the section on "LSTM State Within A Batch".
If I interpret this correctly, the author did not need to reshape the data as x,y,z (as he did in the preceding example); he just increased the batch size. So an LSTM cells hidden state (the one that gets passed from one time step to the next) started at row 0, and keeps getting updated until all rows in the batch have finished? is that right?
If that is correct then why does one ever need to have a time step greater than 1? Could I not just stack all my time-series rows in order, and feed them as a single batch?
Original question:
I'm getting myself into an absolute muddle trying to understand the correct way to shape my data for tensorflow, particularly around time_steps. Reading around has only confused me further, so I thought I'd cave in and ask.
I'm trying to model time series data in which the data at time t is a 5 columns in width (5 features , 1 label).
So then t-1 will also have another 5 features, and 1 label
Here is an example with 2 rows.
x=[1,1,1,1,1] y=[5]
x=[2,2,2,2,2] y=[15]
I've got an RNN model to work by feeding in a 1x1x5 matrix into my x variable. Which implies my 'time step' has a dimension of 1. However as with the above example, the second line I feed in is correlated to the first (15 = 5 +(2+2+2+2+20 in case you haven't spotted it)
So is the way I'm currently entering it correct? How does the time stamp dimension work?
Or should I be thinking of it as batch size, rows, cols in my head?
Either way can someone tell me what are the dimensions are I should be reshaping my input data to? For sake of argument assume I've split the data into batches of 1000. So within those 1000 rows I want a prediction for every row, but the RNN should be look to the row above it in my batch to figure out the answer.
x1=[1,1,1,1,1] y=[5]
x2=[2,2,2,2,2] y=[15]
...
etc.

Tensorflow: getting outputs form bidirectional_rnn with variable sequence length

I'm using tf.nn.bidirectional_rnn with the sequence_length parameter for variable input size, and I can't figure out how to get the final output for each sample in the minibatch:
output, _, _ = tf.nn.bidirectional_rnn(forward1,backward1,input,dtype=tf.float32,sequence_length=input_lengths)
Now, if I had constant sequence lengths, I would simply use output[-1] and get the final output. In my case I have variable sequences (their lengths are known).
Also, is this output the output of both forward and backward LSTMs?
Thanks.
This question can be answered by looking at the source code rnn.py.
For sequences with dynamic length, the source code says:
If the sequence_length vector is provided, dynamic calculation is
performed. This method of calculation does not compute the RNN steps
past the maximum sequence length of the minibatch (thus saving
computational time), and properly propagates the state at an
example's sequence length to the final state output.
Therefore, in order to get the actual last output, you should slice the resulting output.
For bidirectional_rnn, the source code says:
A tuple (outputs, output_state_fw, output_state_bw) where:
outputs is a length T list of outputs (one for each input), which
are depth-concatenated forward and backward outputs.
output_state_fw is the final state of the forward rnn.
output_state_bw is the final state of the backward rnn.
Therefore, the output is a tuple rather than a tensor.
You can concatenate this tuple into a vector if you wish.

Variable length dimension in tensor

I'm trying to implement the paper "End-to-End memory networks" (http://arxiv.org/abs/1503.08895)
Each training example consists of a number of phrases, a question and then the answer. The number of sentences is variable, as is the number of words in each sentence and the question. Each word is encoded as an integer. So my input would have the form [batch size, # of sentences, # words in sentence].
Now my problem is that the second and third dimension are unknown for each mini-batch. Can I still somehow represent this input as a single tensor or do I have to use lists of tensors, so that I have a list of length batch_size, and then a sublist of length number of sentences and then for each sentence a tensor, whose size is also not known in advance, corresponding to the words encoded as integers.
Can I use this second approach or will tensorflow then not be able to backpropagate, e.g. I have an operation where I have to calculate the following sum: \sum_i tf.scalar_mul(p_i, c_i), where p_i is a scalar and c_i is an embedding vector that was previously calculated. The tensors for the p and c values are then stored in a list, so I would have to sum over the elements in the two lists in a loop. I'm assuming that tensorflow would not be able to incoorporate this loop in the computation graph, correct? I'm sceptical since theano has a special scan function that allows one to loop over input, so I'm assuming that a regular loop would cause problems in the computation graph. How does tensorflow handle this?
Moving Yaroslav's comment to an answer:
TensorFlow has tf.scan. Dimensions may also be dynamic as in Theano.