I am working on an experiment with LSTMs for time series classification, and I have been going through several HOWTOs, but I am still struggling with some very basic questions:
Is the main idea when training the LSTM to take the same sample index from every time series?
E.g. if I have time series A (with samples a1,a2,a3,a4), B (b1,b2,b3,b4) and C (c1,c2,c3,c4), will I feed the LSTM with batches of (a1,b1,c1), then (a2,b2,c2), etc.? Does that mean all time series need to have the same size/number of samples?
If so, could anyone more experienced be so kind as to describe very simply how to approach the whole process of training the LSTM and creating the classifier?
My intention is to use TensorFlow, but I am still new to this.
If your goal is classification, then your data should consist of a time series and a label. During training, you feed each sequence into the LSTM, look only at the last output, and backpropagate as necessary.
Judging from your question, you are probably confused about batching -- you can train on multiple items at once. However, each item in the batch gets its own hidden state; only the layer parameters are shared and updated.
The time series in a single batch should be of the same length. You should terminate each sequence with an END token and pad sequences that are too short with a special PAD token -- the LSTM should learn that PADs after an END are meaningless.
There is no need for different batches to have the same number of items, nor to have items of the same length.
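As a minimal sketch of the END/PAD idea (pure NumPy; the token values END and PAD are made up for illustration), variable-length series in one batch can be brought to a common length like this:

```python
import numpy as np

END, PAD = -1.0, -2.0  # hypothetical special token values


def pad_batch(sequences):
    """Append END to each sequence, then pad with PAD up to the batch maximum length."""
    max_len = max(len(s) for s in sequences) + 1  # +1 for the END token
    batch = np.full((len(sequences), max_len), PAD)
    for i, seq in enumerate(sequences):
        batch[i, :len(seq)] = seq
        batch[i, len(seq)] = END
    return batch


batch = pad_batch([[1.0, 2.0, 3.0, 4.0], [5.0, 6.0]])
# batch has shape (2, 5); the shorter row ends with END followed by PADs
```

The resulting array has shape (batch_size, max_len) and can then be reshaped to (batch_size, timesteps, features) for the LSTM.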
Correct me if I am wrong, but according to the official Keras documentation the fit function has the argument shuffle=True by default, so it shuffles the whole training dataset on each epoch.
However, the point of using recurrent neural networks such as LSTMs or GRUs is to exploit the precise order of the data, so that the state from previous inputs influences the current one.
If we shuffle all the data, all the logical sequences are broken. Thus I don't understand why there are so many examples of LSTMs where the argument is not set to False. What is the point of using an RNN without sequences?
Also, when I set the shuffle option to False, my LSTM model performs worse, even though there are dependencies between the data: I use the KDD99 dataset, where the connections are linked.
If we shuffle all the data, all the logical sequences are broken.
No, the shuffling happens on the batches axis, not on the time axis.
Usually, your data for an RNN has a shape like this: (batch_size, timesteps, features)
Usually, you give your network not only one sequence to learn from, but many sequences. Only the order in which these many sequences are being trained on gets shuffled. The sequences themselves stay intact.
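To illustrate that shuffling permutes whole sequences rather than the timesteps inside them, here is a NumPy sketch with a dummy array of shape (batch_size, timesteps, features):

```python
import numpy as np

rng = np.random.default_rng(0)

# 4 sequences, 5 timesteps, 1 feature; each sequence counts upward in time
data = np.arange(20, dtype=float).reshape(4, 5, 1)

# what Keras-style shuffling does: permute along the batch axis only
perm = rng.permutation(len(data))
shuffled = data[perm]

# every individual sequence is still intact and in time order
for seq in shuffled:
    assert np.all(np.diff(seq[:, 0]) == 1.0)
```

Only the order of the four sequences changes; the time axis inside each one is untouched.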
Shuffling is almost always a good idea, because your network should learn only the training examples themselves, not their order.
This being said, there are cases where you have indeed only one huge sequence to learn from. In that case you have the option to still divide your sequence into several batches. If this is the case, you are absolutely right with your concern that shuffling would have a huge negative impact, so don't do that in this case!
Note: RNNs have a stateful parameter that you can set to True. In that case the last state of the previous batch will be passed to the following one which effectively makes your RNN see all batches as one huge sequence. So, absolutely do this, if you have a huge sequence over multiple batches.
I am working on a project for price movement forecasting and I am stuck with poor quality predictions.
At every time-step I am using an LSTM to predict the next 10 time-steps. The input is the sequence of the last 45-60 observations. I tested several different ideas, but they all seem to give similar results. The model is trained to minimize MSE.
For each idea I tried a model predicting 1 step at a time, where each prediction is fed back as input for the next prediction, and a model directly predicting the next 10 steps (multiple outputs). For each idea I also tried using just the moving average of the previous prices as input, and extending the input to include the order book at those time-steps.
Each time-step corresponds to a second.
These are the results so far:
1) The first attempt used the moving average of the last N steps as input to predict the moving average of the next 10.
At time t, I use the ground-truth price and have the model predict t+1, ..., t+10.
This is the result
Predicting moving average
On closer inspection we can see what's going wrong:
The prediction seems to be a flat line; it does not depend much on the input data.
2) The second attempt was to predict differences instead of the price itself. This time the input, instead of simply being X[t] (where X is my input matrix), would be X[t] - X[t-1].
This did not really help.
The plot this time looks like this:
Predicting differences
But on closer inspection, when plotting the differences, the predictions are basically always 0.
Plot of differences
At this point, I am stuck and running out of ideas to try. I was hoping someone with more experience with this type of data could point me in the right direction.
Am I using the right objective to train the model? Are there any details when dealing with this type of data that I am missing?
Are there any "tricks" to prevent the model from always predicting values similar to what it last saw? (Such predictions do achieve low error, but they become meaningless at that point.)
At least just a hint on where to dig for further info would be highly appreciated.
Thanks!
Am I using the right objective to train the model?
Yes, but LSTMs are always very tricky for forecasting time series, and they are very prone to overfitting compared to other time series models.
Are there any details when dealing with this type of data that I am missing?
Are there any "tricks" to prevent your model from always predicting similar values to what it last saw?
I haven't seen your code or the details of the LSTM you are using. Make sure you are using a very small network and that you are avoiding overfitting. Also make sure that after you difference the data, you reintegrate it before evaluating the final forecast.
One trick to try is building a model that forecasts 10 steps ahead directly, instead of building a one-step-ahead model and then forecasting recursively.
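To make the distinction concrete, here is a sketch of recursive one-step forecasting with a stand-in predictor (predict_one_step is made up; a real trained model would replace it). A direct model would instead output all 10 values in a single call, avoiding the feedback loop where errors compound:

```python
import numpy as np


def predict_one_step(window):
    # stand-in for model.predict(window): here just the mean of the window
    return float(np.mean(window))


def forecast_recursive(history, horizon=10, window_size=45):
    """Predict one step at a time, feeding each prediction back as input.

    Errors compound: the prediction for t+2 is built on the
    (possibly wrong) prediction for t+1, and so on.
    """
    window = list(history)
    preds = []
    for _ in range(horizon):
        p = predict_one_step(window[-window_size:])
        preds.append(p)
        window.append(p)  # feed the prediction back as input
    return preds


preds = forecast_recursive(np.ones(60).tolist())
```

With a direct model, `model.predict(window)` would return a length-10 vector in one shot, so each of the 10 horizons is trained against ground truth rather than against the model's own previous outputs.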
I am trying Forex prediction with Keras and TensorFlow using an LSTM network.
I of course want it to train on many days of trading, but to do that I would have to give it sequential data with big jumps and phases without movement, when the market is closed. This isn't ideal, as the model gets "confused" by these jumps and flat phases. Alternatively, I can use one day of minute-by-minute data, but that way I have very limited training data and the model won't be very good.
Do you have Ideas on how to fix this?
here is my current code:
CODE
Thanks
If you plan on fitting multiple datasets as data slices, sequentially, something like this would work:
for data_slice, label_slice in slices:
    # somehow cut the data into slices and fit them one by one
    model.fit(data_slice, label_slice, ...)
Successive calls to fit will train the single model incrementally.
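One way to produce those slices, assuming you have per-minute timestamps, is to split the series wherever the timestamp gap exceeds one minute, so that no slice contains a market-closed jump (a NumPy sketch; the timestamps and the one-minute threshold are made up for illustration):

```python
import numpy as np

# minute timestamps for three trading sessions, with overnight gaps between them
ts = np.array([0, 1, 2, 3, 1440, 1441, 1442, 2880, 2881])
prices = np.arange(len(ts), dtype=float)

# indices where a new session starts (gap larger than one minute)
session_starts = np.where(np.diff(ts) > 1)[0] + 1
slices = np.split(prices, session_starts)
# each slice is one contiguous session with no market-closed jump inside it
```

Each slice can then be fed to a separate `model.fit` call as in the loop above, so the network never sees a discontinuity inside a single training sequence.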
To be more specific, I'm dealing with an NLP problem and training an LSTM for word prediction given an initial sequence of words. My dataset is 200k reddit comments.
Does it matter if I randomly feed the examples one at a time (allowing repeated inputs) or if I feed them in a sequence (not allowing repetitions)?
Since your data is actually a set of comments, there is no need to process them in sequence. In fact, it is typically better to process data in random order to make sure that the network does not learn something order-dependent. Repeats do not matter at all, as long as you sample uniformly.
I am using Tensorflow's combination of GRUCell + MultiRNNCell + dynamic_rnn to generate a multi-layer LSTM to predict a sequence of elements.
In the few examples I have seen, like character-level language models, once the Training stage is done, the Generation seems to be done by feeding only ONE 'character' (or whatever element) at a time to get the next prediction, and then getting the following 'character' based on the first prediction, etc.
My question is, since Tensorflow's dynamic_rnn unrolls the RNN graph into an arbitrary number of steps of whatever sequence length is fed into it, what is the benefit of feeding only one element at a time, once a prediction is gradually being built out? Doesn't it make more sense to be gradually collecting a longer sequence with each predictive step and re-feeding it into the graph? I.e. after generating the first prediction, feed back a sequence of 2 elements, and then 3, etc.?
I am currently trying out the prediction stage by initially feeding in a sequence of 15 elements (actual historic data), getting the last element of the prediction, and then replacing one element in the original input with that predicted value, and so on in a loop of N predictive steps.
What is the disadvantage of this approach versus feeding just one element at-a-time?
I'm not sure your approach is actually doing what you want it to do.
Let's say we have an LSTM network trained to generate the alphabet. In order to have the network generate a sequence, we start with a clean state h0 and feed in the first character, a. The network outputs a new state, h1, and its prediction, b, which we append to our output. Next, we want the network to predict the next character based on the current output, ab. If we fed the network ab with the state being h1 at this step, its perceived sequence would be aab, because h1 was calculated after the first a, and now we put in another a and a b. Alternatively, we could feed ab and a clean state h0 into the network, which would produce the correct output (based on ab), but we would perform unnecessary calculations for the whole sequence except b, because we have already calculated the state h1, which corresponds to the network having read the sequence a. So in order to get the next prediction and state, we only have to feed in the next character, b.
So to answer your question, feeding the network one character at a time makes sense because the network needs to see each character only once, and feeding the same character multiple times would just be unnecessary calculations.
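The generation loop described above can be sketched like this, with toy step and output functions standing in for the trained LSTM cell (the tanh update and the sum-based "prediction" are made up for illustration; the point is that the state is carried forward, so only the newest element is ever fed in):

```python
import numpy as np


def rnn_step(state, x):
    # stand-in for one trained LSTM step: new state from old state and one input
    return np.tanh(state + x)


def predict_from(state):
    # stand-in for the output layer's prediction
    return float(state.sum())


def generate(first_input, steps, state_size=4):
    state = np.zeros(state_size)      # h0: a clean state
    x = first_input
    outputs = []
    for _ in range(steps):
        state = rnn_step(state, x)    # state already summarizes all past inputs,
        x = predict_from(state)       # so we feed back only the newest prediction
        outputs.append(x)
    return outputs


out = generate(1.0, steps=5)
```

Refeeding the whole growing sequence with a clean state each step would produce the same outputs, but at quadratic rather than linear cost in the sequence length.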
This is a great question; I asked something very similar here.
The idea is that instead of sharing weights across time (one element at a time, as you describe it), each time step gets its own set of weights.
I believe there are several reasons for training one-step at a time, mainly computational complexity and training difficulty. The number of weights you'll need to train grows linearly for each time step. You'd need some pretty sporty hardware to train long sequences. Also for long sequences you'll need a very large data set to train all those weights. But imho, I am still optimistic that for the right problem, with sufficient resources, it would show improvement.