LSTM how batch size and sequence length affect memory - tensorflow

I have a question regarding batch size and sequence length. Let’s suppose that I have 10 different independent time series, each of length 100.
5 are of a person doing one activity, and the other 5 are of a person doing another activity.
I want to create an LSTM that will be able to remember the sequences all the way from the first sample in each sequence and classify test samples that I input into one activity or the other.
Now, for a first try, let's say that I can input test samples of length 100. How would I do this? Would I create an LSTM and then feed in data of the shape [10, 100, 1] in one go? Or would I feed in data of the shape [1, 100, 1] 10 times? The question here is whether batching affects how the LSTM memorizes past inputs. I do not want the LSTM to remember anything between independent sequences, but I do want it to remember all the way from the beginning of each sequence.
Secondly, let's say that I now want to chunk up the sequences I use to train the LSTM. The goal remains the same as before. So now I window the sequences into chunks of 10. Do I feed it in as [10, 10, 1] for each sequence? If I do this, will the LSTM memorize the temporal dynamics of the sequence all the way back to the beginning? Will training the LSTM this way be analogous to not chunking up the sequences and feeding them in at full length?

I can answer the part of your question that has to do with batching. There are two reasons to batch.
First, it is more efficient for the computer to do the matrix multiplications in batches. On a CPU, part of the efficiency comes from being able to cache the weight matrix instead of reloading it from memory for every sequence. During evaluation, the sequences in a batch do not interfere with each other; the result is the same as if each one were computed individually.
Second, during training, having multiple sequences in a batch reduces noise in the gradient. The weight update is computed by averaging the gradients of all the sequences in the batch, and averaging over more sequences gives a more reliable estimate of which direction to move the parameters in order to improve the loss function.
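A quick way to convince yourself that samples in a batch do not share state (a minimal sketch, assuming TF 2.x / Keras; the shapes match the question):

```python
import numpy as np
import tensorflow as tf

# 10 independent series, each 100 steps long with 1 feature
x = np.random.randn(10, 100, 1).astype("float32")
lstm = tf.keras.layers.LSTM(32)

# Feeding all 10 sequences as one batch of shape [10, 100, 1]...
batched = lstm(x).numpy()
# ...gives the same per-sequence result as feeding [1, 100, 1] ten times:
# every sample in a batch gets its own fresh hidden state, so nothing
# leaks between independent sequences.
one_at_a_time = np.concatenate([lstm(x[i:i + 1]).numpy() for i in range(10)])
print(np.allclose(batched, one_at_a_time, atol=1e-5))  # True
```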

Related

How to train an LSTM on multiple independent time-series of sensor-data

I have sensor measurements for 10 different people performing the same experiment in which they need to complete a specific task. For each timestep in the measurements I have the corresponding label, and my goal is to train a sequential classifier which predicts the action a person is performing given the sensor observations. So, basically, for each person I have a separate dataset containing timesteps, several sensor measurements and the corresponding action (activity) for each timestep. I want to perform leave-one-out cross-validation, which means that I will take the sequences of measurements and action labels for 9 people for the training part and 1 sequence for the test part. However, I don't know how to train my model on the 9 different independent measurement sequences (they also have different lengths).
My idea is to first apply masking/padding to make the sequences of equal length L, then concatenate the padded sequences and, for training, use a batch size of n, where L is divisible by n without remainder. I am not sure, though, if this is the right way to go. Maybe Keras already supports training sequential models on independent sequences?
I would be happy to hear your recommendations. Thank you!
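A minimal Keras sketch of that padding/masking idea (all data here is randomly generated stand-in data; names like n_sensors and n_actions are placeholders):

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences

n_sensors, n_actions = 6, 4
# Stand-in for the 9 training people: variable-length sequences of sensor
# readings plus a per-timestep action label
sequences = [np.random.randn(np.random.randint(50, 120), n_sensors)
             for _ in range(9)]
labels = [np.random.randint(0, n_actions, size=len(s)) for s in sequences]

# Pad everything to the longest length L with zeros...
x = pad_sequences(sequences, padding="post", dtype="float32", value=0.0)
y = pad_sequences(labels, padding="post", value=0)

model = tf.keras.Sequential([
    tf.keras.layers.Masking(mask_value=0.0),          # ...and mask the padding
    tf.keras.layers.LSTM(64, return_sequences=True),  # one output per timestep
    tf.keras.layers.Dense(n_actions, activation="softmax"),
])
model.compile("adam", "sparse_categorical_crossentropy")
model.fit(x, y, batch_size=3, epochs=2)
```

With Masking in place, the padded timesteps are ignored by the LSTM and by the loss, so there is no need for L to be divisible by the batch size.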

Keras: Shuffling dataset while using LSTM

Correct me if I am wrong, but according to the official Keras documentation, the fit function has the argument shuffle=True by default, hence it shuffles the whole training dataset on each epoch.
However, the point of using recurrent neural networks such as LSTMs or GRUs is to use the precise order of the data, so that the state from the previous data influences the current one.
If we shuffle all the data, all the logical sequences are broken. Thus I don't understand why there are so many examples of LSTMs where the argument is not set to False. What is the point of using an RNN without sequences?
Also, when I set the shuffle option to False, my LSTM model is less performant even though there are dependencies between the data: I use the KDD99 dataset, where the connections are linked.
"If we shuffle all the data, all the logical sequences are broken."
No, the shuffling happens on the samples axis, not on the time axis.
Usually, your data for an RNN has a shape like this: (batch_size, timesteps, features)
Usually, you give your network not only one sequence to learn from, but many sequences. Only the order in which these sequences are presented during training gets shuffled; the sequences themselves stay intact.
Shuffling is almost always a good idea, because your network should learn the training examples themselves, not the order in which they happen to appear.
That being said, there are cases where you have indeed only one huge sequence to learn from. In that case you still have the option of dividing your sequence into several batches. If you do, you are absolutely right in your concern that shuffling would have a huge negative impact, so don't do it in this case!
Note: in Keras, RNNs have a stateful parameter that you can set to True. In that case, the last state of the previous batch is passed on to the following one, which effectively makes your RNN see all batches as one huge sequence. So absolutely do this if you have one huge sequence spanning multiple batches.
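A minimal Keras sketch of both cases (all shapes and data here are made up):

```python
import numpy as np
import tensorflow as tf

# Case 1: many independent sequences. The data has shape
# (samples, timesteps, features), and shuffle=True only permutes the
# samples axis -- the time axis inside each sequence stays intact.
x = np.random.randn(100, 20, 3).astype("float32")
y = np.random.randint(0, 2, size=100)
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(16),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile("adam", "binary_crossentropy")
model.fit(x, y, epochs=1, shuffle=True)  # safe: sequences stay whole

# Case 2: one huge sequence chopped into consecutive windows. Keep the
# order and carry the state across batches with stateful=True.
stateful = tf.keras.Sequential([
    tf.keras.layers.LSTM(16, stateful=True, batch_input_shape=(1, 20, 3)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
stateful.compile("adam", "binary_crossentropy")
stateful.fit(x[:10], y[:10], batch_size=1, epochs=1, shuffle=False)
stateful.reset_states()  # reset once the full sequence has been consumed
```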

Time series classification using LSTM - How to approach?

I am working on an experiment with LSTM for time series classification and I have been going through several HOWTOs, but still, I am struggling with some very basic questions:
Is the main idea, when training the LSTM, to take the same sample index from every time series?
E.g. if I have time series A (with samples a1, a2, a3, a4), B (b1, b2, b3, b4) and C (c1, c2, c3, c4), will I feed the LSTM batches of (a1, b1, c1), then (a2, b2, c2), etc.? Does that mean all time series need to have the same size/number of samples?
If so, could anyone more experienced be so kind as to describe, very simply, how to approach the whole process of training the LSTM and building the classifier?
My intention is to use TensorFlow, but I am still new to this.
If your goal is classification, then each data item should be a time series paired with a label. During training, you feed each series into the LSTM, look only at the last output, and backpropagate as necessary.
Judging from your question, you are probably confused about batching -- you can train multiple items at once, but each item in the batch gets its own hidden state, and only the parameters of the layers are shared and updated.
The time series in a single batch should be of the same length. You should terminate each sequence with an END token and pad items that are too short with a special PAD token -- the LSTM should learn that PADs after an END are meaningless.
There is no need for different batches to have the same number of items, nor to have items of the same length.
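A minimal sketch of that setup (assuming an integer-token vocabulary; the PAD/END ids, shapes and data below are made up). Keras's mask_zero option makes ignoring PAD explicit, instead of leaving it to the LSTM to learn:

```python
import numpy as np
import tensorflow as tf

PAD, END = 0, 1  # assumed ids for the special tokens
vocab_size, n_classes = 50, 2

# Two sequences of different lengths in one batch, END-terminated and
# PAD-filled up to a common length
x = np.array([[5, 9, 3, END, PAD, PAD],
              [7, 2, 8, 4,   6,  END]])
y = np.array([0, 1])

model = tf.keras.Sequential([
    # mask_zero=True tells downstream layers to skip PAD (= 0) timesteps
    tf.keras.layers.Embedding(vocab_size, 16, mask_zero=True),
    # with the mask, this returns the output at the last *real* timestep
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(n_classes, activation="softmax"),
])
model.compile("adam", "sparse_categorical_crossentropy")
model.fit(x, y, epochs=1)
```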

When training seq2seq model with bucketing method do we keep separate RNNs for each bucket?

Let's say we have 3 buckets of different lengths. So do we train 3 different nets?
Can't we use a dynamic RNN instead, one that unrolls for as many steps as the input sequence in the encoder actually has? The encoder would then pass its last hidden state to the decoder. Would that work?
I went through this. Bucketing helps to speed up the training process. We first divide the examples into buckets; that way we reduce the number of timesteps we have to pad.
In the training iterations we select one bucket at a time and train the whole network. The buckets share one set of parameters, so we are not training three different nets.
In validation we check the perplexity of the test examples in each bucket.
TensorFlow supports this bucketing.
A dynamic RNN is different: there is no bucketing mechanism. We input the data as a tensor with the shape [batch_size, max_seq_length, input_size].
Here, sequences shorter than the maximum length should be padded with zeros.
The RNN is then unrolled only up to the actual length of each input (excluding the padded zeros). This is implemented with a while loop in TensorFlow.
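For reference, a minimal sketch of a dynamic RNN with explicit sequence lengths, written against the TF 1.x API (via tf.compat.v1; the cell size and shapes are made up):

```python
import tensorflow.compat.v1 as tf  # dynamic_rnn is a TF 1.x API
tf.disable_eager_execution()

batch_size, max_seq_length, input_size = 4, 10, 8
inputs = tf.placeholder(tf.float32, [batch_size, max_seq_length, input_size])
lengths = tf.placeholder(tf.int32, [batch_size])  # true length of each sequence

cell = tf.nn.rnn_cell.GRUCell(32)
# dynamic_rnn unrolls with an internal while loop; sequence_length stops the
# state from being updated past each sequence's real end, so the zero
# padding never influences the final state.
outputs, final_state = tf.nn.dynamic_rnn(
    cell, inputs, sequence_length=lengths, dtype=tf.float32)
```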

LSTM one-step-ahead prediction with Tensorflow

I am using TensorFlow's combination of GRUCell + MultiRNNCell + dynamic_rnn to build a multi-layer recurrent network that predicts a sequence of elements.
In the few examples I have seen, like character-level language models, once the Training stage is done, the Generation seems to be done by feeding only ONE 'character' (or whatever element) at a time to get the next prediction, and then getting the following 'character' based on the first prediction, etc.
My question is: since TensorFlow's dynamic_rnn unrolls the RNN graph into an arbitrary number of steps matching whatever sequence length is fed into it, what is the benefit of feeding only one element at a time as the prediction is gradually built out? Doesn't it make more sense to collect a progressively longer sequence with each predictive step and re-feed it into the graph? I.e. after generating the first prediction, feed back a sequence of 2 elements, then 3, etc.?
I am currently trying out the prediction stage by initially feeding in a sequence of 15 elements (actual historic data), getting the last element of the prediction, and then replacing one element in the original input with that predicted value, and so on in a loop of N predictive steps.
What is the disadvantage of this approach versus feeding just one element at-a-time?
I'm not sure your approach is actually doing what you want it to do.
Let's say we have an LSTM network trained to generate the alphabet. Now, in order to have the network generate a sequence, we start with a clean state h0 and feed in the first character, a. The network outputs a new state, h1, and its prediction, b, which we append to our output.
Next, we want the network to predict the next character based on the current output, ab. If we fed the network ab with the state being h1 at this step, its perceived sequence would be aab, because h1 was calculated after the first a, and now we put in another a and a b. Alternatively, we could feed ab with a clean state h0 into the network, which would produce the correct output (based on ab), but we would perform unnecessary calculations for the whole sequence except the final b, because we have already calculated the state h1 that corresponds to the network having read a. So, in order to get the next prediction and state, we only have to feed in the next character, b.
So to answer your question, feeding the network one character at a time makes sense because the network needs to see each character only once, and feeding the same character multiple times would just be unnecessary calculations.
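To make this concrete, here is a minimal (untrained, purely illustrative) Keras sketch of the one-character-at-a-time loop; stateful=True plays the role of carrying h1, h2, ... forward between calls:

```python
import numpy as np
import tensorflow as tf

vocab_size = 27  # assumed: a toy character vocabulary
# stateful=True keeps the hidden state between predict() calls, so each
# character is processed exactly once
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 16, batch_input_shape=(1, 1)),
    tf.keras.layers.LSTM(64, stateful=True),
    tf.keras.layers.Dense(vocab_size, activation="softmax"),
])

model.reset_states()          # clean state h0
token = np.array([[0]])       # start character, shape (batch=1, time=1)
generated = [0]
for _ in range(10):
    probs = model.predict(token, verbose=0)      # also advances the state
    token = np.array([[int(np.argmax(probs))]])  # greedy pick of next char
    generated.append(int(token[0, 0]))
```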
This is a great question; I asked something very similar here.
The idea is that instead of sharing weights across time (one element at a time, as you describe it), each time step gets its own set of weights.
I believe there are several reasons for training one step at a time, mainly computational complexity and training difficulty. The number of weights you need to train grows linearly with the number of time steps, so you would need some pretty sporty hardware to train long sequences. For long sequences you would also need a very large data set to train all those weights. But IMHO, I am still optimistic that for the right problem, with sufficient resources, it would show an improvement.