How to deal with multi step time series forecasting in multivariate LSTM in keras - tensorflow

I am trying to do multi-step time series forecasting using multivariate LSTM in Keras. Specifically, I have two variables (var1 and var2) for each time step originally. Having followed the online tutorial here, I decided to use data at time (t-2) and (t-1) to predict the value of var2 at time step t. As the sample data table shows, I am using the first 4 columns as input and the last column (Y) as output. The code I have developed can be seen here, but I have got three questions.
     var1(t-2)  var2(t-2)  var1(t-1)  var2(t-1)  var2(t)
2       1.5       -0.8        0.9       -0.5      -0.2
3       0.9       -0.5       -0.1       -0.2       0.2
4      -0.1       -0.2       -0.3        0.2       0.4
5      -0.3        0.2       -0.7        0.4       0.6
6      -0.7        0.4        0.2        0.6       0.7
Q1: I have trained an LSTM model with the data above. This model does well in predicting the value of var2 at time step t. However, what if I want to predict var2 at time step t+1? I feel it is hard because the model cannot tell me the value of var1 at time step t. If I want to do it, how should I modify the code to build the model?
Q2: I have seen this question asked a lot, but I am still confused. In my example, what should be the correct time step in [samples, time steps, features]: 1 or 2?
Q3: I just started studying LSTMs. I have read here that one of the biggest advantages of LSTM is that it learns the temporal dependence/sliding window size by itself. Then why must we always convert time series data into a format like the table above?
Update: LSTM result (blue line is the training seq, orange line is the ground truth, green is the prediction)

Question 1:
From your table, I see you have a sliding window over a single sequence, making many smaller sequences with 2 steps.
For predicting t, you take the first line of your table as input.
For predicting t+1, you take the second line as input.
If you're not using the table: see question 3
Question 2:
Assuming you're using that table as input, where it's clearly a sliding window case taking two time steps as input, your timeSteps is 2.
You should probably work as if var1 and var2 were features in the same sequence:
input_shape = (2,2) - Two time steps and two features/vars.
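As a minimal sketch of that reshaping, the table from the question can be turned into the [samples, time steps, features] layout with NumPy. The variable names (table, X, y) are only illustrative and not taken from the asker's code.

```python
import numpy as np

# Rows 2-6 of the table in the question. The first four columns are the
# inputs; the last column, var2(t), is the target.
table = np.array([
    [ 1.5, -0.8,  0.9, -0.5, -0.2],
    [ 0.9, -0.5, -0.1, -0.2,  0.2],
    [-0.1, -0.2, -0.3,  0.2,  0.4],
    [-0.3,  0.2, -0.7,  0.4,  0.6],
    [-0.7,  0.4,  0.2,  0.6,  0.7],
])

# Each row holds [var1(t-2), var2(t-2), var1(t-1), var2(t-1)], so a row-major
# reshape gives 2 time steps with 2 features each.
X = table[:, :4].reshape(-1, 2, 2)   # (samples, time steps, features)
y = table[:, 4]                      # var2(t)

print(X.shape, y.shape)
```

With this layout, X[0] is [[1.5, -0.8], [0.9, -0.5]]: the first inner row is step t-2 and the second is step t-1.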
Question 3:
We do not need to make tables like that or build a sliding window case. That is one possible approach.
Your model is actually capable of learning things and deciding the size of this window itself.
If on one hand your model is capable of learning long time dependencies, allowing you not to use windows, on the other hand, it may learn to identify different behaviors at the beginning and at the middle of a sequence. In this case, if you want to predict using sequences that start from the middle (not including the beginning), your model may work as if it were the beginning and predict a different behavior. Using windows eliminates this very long influence. Which is better may depend on testing, I guess.
Not using windows:
If your data has 800 steps, feed all the 800 steps at once for training.
Here, we will need to separate two models, one for training, another for predicting. In training, we will take advantage of the parameter return_sequences=True. This means that for each input step, we will get an output step.
For predicting later, we will want only one output, so we will use return_sequences=False. And in case we are going to use the predicted outputs as inputs for following steps, we are going to use a stateful=True layer.
Training:
Have your input data shaped as (1, 799, 2), 1 sequence, taking the steps from 1 to 799. Both vars in the same sequence (2 features).
Have your target data (Y) shaped also as (1, 799, 2), taking the same steps shifted, from 2 to 800.
Build a model with return_sequences=True. You may use timeSteps=799, but you may also use None (allowing variable amount of steps).
model.add(LSTM(units, input_shape=(None,2), return_sequences=True))
model.add(LSTM(2, return_sequences=True)) #it could be a Dense 2 too....
....
model.fit(X, Y, ....)
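The shifted input and target arrays described under Training could be built roughly like this. It is a sketch with random stand-in data; the name data and its values are assumptions, not the asker's series.

```python
import numpy as np

# Stand-in for the real series: 800 steps, 2 variables (hypothetical values).
data = np.random.default_rng(0).normal(size=(800, 2))

# One long sequence: inputs are steps 1..799, targets are steps 2..800.
X = data[:-1].reshape(1, 799, 2)
Y = data[1:].reshape(1, 799, 2)

print(X.shape, Y.shape)
```

Each target step is simply the input step that follows it, so Y[0, i] equals X[0, i + 1].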
Predicting:
For predicting, create a similar model, now with return_sequences=False.
Copy the weights:
newModel.set_weights(model.get_weights())
You can make an input with length 800, for instance (shape: (1,800,2)) and predict just the next step:
step801 = newModel.predict(X)
If you want to predict more, we are going to use the stateful=True layers. Use the same model again, now with return_sequences=False (only in the last LSTM; the others keep True) and stateful=True (all of them). Replace input_shape with batch_input_shape=(1,None,2).
#with stateful=True, your model will never think that the sequence ended
#each new batch will be seen as new steps instead of new sequences
#because of this, we need to call this when we want a sequence starting from zero:
statefulModel.reset_states()
#predicting
X = steps1to800 #input
step801 = statefulModel.predict(X).reshape(1,1,2)
step802 = statefulModel.predict(step801).reshape(1,1,2)
step803 = statefulModel.predict(step802).reshape(1,1,2)
#the reshape is because return_sequences=False eliminates the step dimension
Actually, you could do everything with a single stateful=True and return_sequences=True model, taking care of two things:
When training, reset_states() for every epoch. (Train with a manual loop and epochs=1)
When predicting from more than one step, take only the last step of the output as the desired result.

Actually you can't just feed in the raw time series data, as the network won't fit to it naturally. The current state of RNNs still requires you to input multiple 'features' (manually or automatically derived) for it to properly learn something useful.
Usually the prior steps needed are:
Detrend
Deseasonalize
Scale (normalize)
A great source of information is this post from a Microsoft researcher who won a time series forecasting competition by means of an LSTM network.
Also this post: CNTK - Time series Prediction
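As a rough illustration of the three preparation steps listed above, here is a NumPy sketch that uses differencing for detrending and deseasonalizing. Differencing is only one possible choice, and the function name preprocess, the period of 12, and the synthetic series are assumptions made for the example.

```python
import numpy as np

def preprocess(series, period):
    """Detrend, deseasonalize and scale a 1-D series (illustrative only)."""
    series = np.asarray(series, dtype=float)
    # Detrend: first differences remove a linear trend.
    detrended = np.diff(series)
    # Deseasonalize: subtract the value one seasonal period earlier.
    deseasonalized = detrended[period:] - detrended[:-period]
    # Scale: standardize to zero mean and unit variance.
    mean, std = deseasonalized.mean(), deseasonalized.std()
    return (deseasonalized - mean) / std, mean, std

# Synthetic series: linear trend + yearly-style cycle + noise (hypothetical).
rng = np.random.default_rng(0)
t = np.arange(200)
series = 0.05 * t + np.sin(2 * np.pi * t / 12) + rng.normal(scale=0.1, size=200)

scaled, mean, std = preprocess(series, period=12)
print(scaled.shape)
```

The returned mean and std are kept so that predictions can be mapped back to the differenced scale. In practice they would be computed on the training portion only.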

Related

Keras class weight for multi-label binary classification on temporal data

I'm training a network with temporal data, and determine which of ~60 outputs are "active" at any given timestep (classified as 1 or 0 in the label data) - so I have an output of 60x1 floats that should represent a probability.
My input data is shaped as (X, 1, frames, dataPoints) - where X is the number of recorded sequences I have (I'm new to ML, I think this is 'batches'), frames is how long the longest sequence is (the rest are -1 padded and masked), and dataPoints is the actual input data for any given frame.
This is mostly an LSTM layer with return_sequences, but my input data is unbalanced.
For any given timestep, odds are ~85% that AN output is activated - but for any given output it's likely active at most 5% of the time.
When I attempted to apply a class weight of {0: 0.01, 1:0.99} (pending tuning), I get an error stating "class_weight not supported for 3+ dimensional targets". I've done some googling and people are suggesting compiling with sample_weight_mode of temporal and modifying sample weight, but (A) that doesn't seem right for my data (no individual sample is more important, but each 1 classification within all the samples is important), and (B) I don't understand the dimensionality of what that's doing.
How can I apply the class weighting to help balance each 1 classification with this data structure?
Side note: I'm rescaling the output of the LSTM to 0->1 since it uses tanh activation (and must use tanh activation for CUDA acceleration), and from_logits=False in my binary cross entropy loss.
Extra points if I can just use built-in tf/keras stuff and not have to write a custom loss function.
EDIT to include some code:
I have a data generator that outputs x and y in the shape of:
x.shape == (1, frameCount, inputFeatureLength) where frameCount is the number of frames in the temporal sequence, and inputFeatureLength is the size of the input data (around 100).
y.shape == (1, frameCount, outputSize) where outputSize is about 60 features.
I can successfully compile the model, but when I try to model.fit with class_weight={0:0.01, 1:0.99} as an argument, I get the error ValueError: class_weight not supported for 3+ dimensional targets.
I've looked into sample weights, but as far as I can tell even using sample_weight_mode="temporal" on model.fit it'll let me give sample weights per frame of output, but not per each of the ~60 outputs per frame.
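To make the dimensionality concrete, here is a small NumPy sketch of what weighting each individual 1 would mean for targets of shape (1, frameCount, outputSize). It only illustrates the arithmetic with the weights from the question (0.01 for class 0, 0.99 for class 1). It is not a built-in Keras option, and the function name weighted_bce and the toy shapes are assumptions.

```python
import numpy as np

def weighted_bce(y_true, y_pred, w0=0.01, w1=0.99, eps=1e-7):
    """Binary cross-entropy with one weight per target element."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    # Pick the weight for every (frame, output) entry from its label.
    weights = np.where(y_true == 1, w1, w0)
    bce = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return weights * bce          # same shape as y_true

# Toy batch: 1 sequence, 4 frames, 60 outputs, a single active output.
y_true = np.zeros((1, 4, 60))
y_true[0, 2, 5] = 1.0
y_pred = np.full((1, 4, 60), 0.5)

per_element = weighted_bce(y_true, y_pred)
print(per_element.shape, per_element.mean())
```

The weight array here has the full target shape, one value per output per frame, which is the granularity the question says per-frame temporal sample weights do not provide.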

Keras predict() doesn't work as expected for a future timestep

I'm trying to do some LSTM time-series prediction for one timestep ahead using Keras. But when looking at examples on the web or implementing it myself, the model doesn't predict the next timestep; it just predicts the current timestep, which is no prediction at all. Shouldn't the prediction be one timestep ahead of the test data? See here what I mean:
I'm using:
self.model.predict(data)
Or is this intended, and do you have to manually shift your prediction array by one index, which makes the prediction really bad?
I was thinking about it wrong. The problem is that the test data gets split into samples and labels. If there is, for example, a window of 10, we have 9 samples and 1 label. Therefore the last value is missing when predicting a real-world future timestep on the last window. I have to create a third samples subset (next to samples and labels) which is shifted by 1 index and will be used to predict values, so it's a real prediction.
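The split described above can be sketched with NumPy as follows. The toy series and the names samples, labels and future_input are illustrative assumptions, not the asker's code.

```python
import numpy as np

series = np.arange(20.0)   # toy series: 0, 1, ..., 19
window = 10

# Every window of 10 values becomes 9 inputs plus 1 label.
windows = np.stack([series[i:i + window]
                    for i in range(len(series) - window + 1)])
samples = windows[:, :-1]   # shape (11, 9)
labels = windows[:, -1]     # shape (11,), the value right after each sample

# For a real forecast, shift by one index: the last 9 known values,
# including the final one, which otherwise only ever appears as a label.
future_input = series[-(window - 1):].reshape(1, -1)

print(samples.shape, labels.shape, future_input.shape)
```

Here the last training sample ends at value 18 with label 19, while future_input ends at 19 and therefore targets the still unknown step after the series.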

Binary classification of every time series step based on past and future values

I'm currently facing a Machine Learning problem and I've reached a point where I need some help to proceed.
I have various time series of positional (x, y, z) data tracked by sensors. I've developed some more features. For example, I rasterized the whole 3D space and calculated a cell_x, cell_y and cell_z for every time step. The time series itself have variable lengths.
My goal is to build a model which classifies every time step with the labels 0 or 1 (binary classification based on past and future values). Therefore I have a lot of training time series where the labels are already set.
One thing which could be very problematic is that there are very few 1's labels in the data (for example only 3 of 800 samples are labeled with 1).
It would be great if someone can help me in the right direction because there are too many possible problems:
Wrong hyperparameters
Incorrect model
Too few 1's labels, but I think that's not a big problem because I only need the model to suggest the right time steps. So I would only use the peaks of the output.
Bad or too little training data
Bad features
I appreciate any help and tips.
Your model seems very strange. Why use only 2 units in the LSTM layer? Also, your problem is a binary classification. In this case you should use only one neuron in your output layer (try inserting one additional dense layer between the LSTM layer and the output layer, and try dropout layers between them).
Binary crossentropy does not make much sense with 2 output neurons unless you have a multi-label problem. But if you switch to one output neuron, it is the right loss. You then also need sigmoid as the activation function.
As last advice: Try class weights.
http://scikit-learn.org/stable/modules/generated/sklearn.utils.class_weight.compute_class_weight.html
This can make a huge difference if your labels are unbalanced.
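For a feel of the numbers, the sketch below computes "balanced" class weights by hand for the 3-in-800 imbalance mentioned in the question, using the formula n_samples / (n_classes * np.bincount(y)) that the linked compute_class_weight documentation gives for the balanced mode. The positions of the positive labels are made up.

```python
import numpy as np

# 800 time steps, only 3 labelled 1 (positions are hypothetical).
y = np.zeros(800, dtype=int)
y[[100, 400, 700]] = 1

counts = np.bincount(y)                    # samples per class
weights = len(y) / (len(counts) * counts)  # "balanced" heuristic

# Dict mapping class index to weight, the form Keras' class_weight takes.
class_weight = {cls: float(w) for cls, w in enumerate(weights)}
print(class_weight)
```

The rare class ends up weighted a few hundred times more heavily than the common one, which is the effect the advice above is after.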
You can create the model using TensorFlow's BasicLSTMCell; the shape of your data fits it. You can find the documentation for BasicLSTMCell here, and this documentation contains code that will help to build a BasicLSTMCell model. Hope this will help you. Cheers.

LSTM Sequence Prediction in Keras just outputs last step in the input

I am currently working with Keras using Tensorflow as the backend. I have a LSTM Sequence Prediction model shown below that I am using to predict one step ahead in a data series (input 30 steps [each with 4 features], output predicted step 31).
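from keras.models import Sequential
from keras.layers import LSTM, Dropout, Dense, Activation

def build_model():
    # Keras 1.x argument names (input_dim, output_dim) are kept as posted.
    model = Sequential()
    model.add(LSTM(
        input_dim=4,
        output_dim=75,
        return_sequences=True))
    model.add(Dropout(0.2))
    model.add(LSTM(
        150,
        return_sequences=False))
    model.add(Dropout(0.2))
    model.add(Dense(
        output_dim=4))
    model.add(Activation("linear"))
    model.compile(loss="mse", optimizer="rmsprop")
    return model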
The issue I'm having is that after training the model and testing it - even with the same data it trained on - what it outputs is essentially the 30th step in the input. My first thought is the patterns of my data must be too complex to accurately predict, at least with this relatively simple model, so the best answer it can return is essentially the last element of the input. To limit the possibility of over-fitting I've tried turning training epochs down to 1 but the same behavior appears. I've never observed this behavior before though and I have worked with this type of data before with successful results (for context, I'm using vibration data taken from 4 points on a complex physical system that has active stabilizers; the prediction is used in a pid loop for stabilization hence why, at least for now, I'm using a simpler model to keep things fast).
Does that sound like the most likely cause, or does anyone have another idea? Has anyone seen this behavior before? In case it helps with visualization here is what the prediction looks like for one vibration point compared to the desired output (note, these screenshots are zoomed in smaller selections of a very large dataset - as @MarcinMożejko noticed I did not zoom quite the same both times so any offset between the images is due to that, the intent is to show the horizontal offset between the prediction and true data within each image):
...and compared to the 30th step of the input:
Note: Each data point seen by the Keras model is an average over many actual measurements with the window of the average processed along in time. This is done because the vibration data is extremely chaotic at the smallest resolution I can measure so instead I use this moving average technique to predict the larger movements (which are the more important ones to counteract anyway). That is why the offset in the first image appears as many points off instead of just one, it is 'one average' or 100 individual points of offset.
-----Edit 1, code used to get from the input datasets 'X_test, y_test' to the plots shown above-----
model_1 = lstm.build_model()  # The function above, pulled from another file 'lstm'
model_1.fit(
    X_test,
    Y_test,
    nb_epoch=1)
prediction = model_1.predict(X_test)
temp_predicted_sensor_b = (prediction[:, 0] + 1) * X_b_orig[:, 0]
sensor_b_y = (Y_test[:, 0] + 1) * X_b_orig[:, 0]
plot_results(temp_predicted_sensor_b, sensor_b_y)
plot_results(temp_predicted_sensor_b, X_b_orig[:, 29])
For context:
X_test.shape = (41541, 30, 4)
Y_test.shape = (41541, 4)
X_b_orig is the raw (averaged as described above) data from the b sensor. This is multiplied by the prediction and input data when plotting to undo normalization I do to improve the prediction. It has shape (41541, 30).
----Edit 2----
Here is a link to a complete project setup to demonstrate this behavior:
https://github.com/ebirck/lstm_sequence_prediction
That is because for your data (stock data?), the best prediction for the 31st value is the 30th value itself. The model is correct and fits the data.
I also have similar experience predicting the stock data.
I feel I should post a follow-up, since it seems this post has been getting more attention than my other questions.
Ferret Zhang's answer is correct (and has been accepted), and I find this discovery is actually quite funny when you understand it in relation to stock / cryptocurrency data which some have commented about. What sequence prediction is ultimately doing is assigning statistical weights to different moves, to pick the highest probability move and 'predict' it will happen. In the case of stock data, in a vacuum it is (at least at this scale) completely random, there is equal probability of moving up or down, and hence the model predicts that it will stay the exact same.
The model, in a sense, learned that the best way to play is to not play at all :)
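The effect is easy to reproduce on a synthetic random walk, used here as a stand-in for data whose next move is equally likely to be up or down. This is a sketch under that assumption, not an analysis of the asker's vibration data. It compares two naive forecasts over 30-step windows: repeating the last input step, and predicting the window mean.

```python
import numpy as np

rng = np.random.default_rng(0)
walk = np.cumsum(rng.normal(size=5000))   # random walk, unit-variance steps

window = 30
inputs = np.stack([walk[i:i + window] for i in range(len(walk) - window)])
targets = walk[window:]                    # step 31 for every window

# Forecast A: repeat the 30th step. Forecast B: mean of the 30 inputs.
persistence_mse = np.mean((targets - inputs[:, -1]) ** 2)
window_mean_mse = np.mean((targets - inputs.mean(axis=1)) ** 2)

print(persistence_mse, window_mean_mse)
```

Repeating the last value gives the much lower error here, so a model trained with an MSE loss on such data has every incentive to converge to it. Comparing a trained model against this persistence baseline is a quick way to check whether it has learned anything beyond copying its input.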

Multilabel/ Multitask/ Multiclass Regression in machine learning

My challenge is to train a neural network to recognize certain actions and events for different classes of task, or whatever you want to call it, given the input.
I see that most of the input/output when training neural networks is either 0 or 1 or [0,1]. But in my scenario I want my input to be in the form of integers which are arbitrarily big and the same form is expected for the output.
Let me give you an example:
Input
X = [ 23, 4, 0, 1233423, 1, 0, 0] ->
Y = [ 2, 1, 1]
Now each element in X[i] represent different properties of the same entity.
Let's say it want to describe a human being:
23 -> maps to a place he/she was born
4 -> maps to a school they graduated
etc.
Each entry in Y[i], on the other hand, means what is more likely the human to do in 3 different categories ( as len(Y) is 3 in this case ):
Y[0] = 2 -> maps to eating icecream ( from a variety of other choices )
Y[1] = 1 -> maps to a time of day moment ( morning, noon, afternoon, evening, etc...)
Y[2] = 1 -> maps to a day of the week for example
Now of course if the task was just a multi-label problem I would apply a sigmoid on the output layer and use binary_crossentropy as the loss function, but that of course does not work here because my output is obviously not in [0,1].
Also I am not really sure what loss function to apply since I want all classes/subclasses in Y to be correctly predicted. What I am basically saying is that each Y[i] is a class of its own.
It would be more accurate if my output was in the shape of (3, labels_per_class) and the loss function calculated a loss for each of the 3 different classes, trying to optimize the result in such a way that each of the 3 classes would have the correct labels.
I am not sure if that is possible or how at least.
I am really still in the beginnings with my neural network knowledge and learning so clearly I am struggling with this problem.
But really to put it more simply I have a better idea how to describe it. It is more or less like an auto-encoder but the inputs and outputs are integers. The difference is that in my case the output has a different size from the input where in the auto-encoder they are the same.
My solution was to apply a relu at the output layer (and of course relu-like activations on all other layers as well) and binary_crossentropy as the loss function, but the accuracy of the network is very low, around 15%.
For a standard classification you would probably do a dense layer with a number of nodes equal to the number of classes then apply softmax. The loss would be tf.losses.softmax_cross_entropy. You would do a sigmoid if you want to allow multiple classes, not just one.
Now you have multiple classification tasks. One way to do it is to take the last hidden layer (the one before the one where you do softmax). For each task do a dense layer with a number of nodes equal to the number of classes for that task and apply softmax. To compute the loss just add the losses together.
If the tasks are too different you may want to have more than one layer for each prediction.
You can also put some weights on the different losses if, say, eating ice-cream is a lot more important than getting the time of day right.
Only use relu if the prediction space is continuous. Say, time of day is continuous, but the choice between eating ice-cream, going to work, and watching TV is not. If you use relu, use a loss like L1 (tf.losses.absolute_difference) or L2 (tf.losses.mean_squared_error).
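The per-task softmax heads with a weighted sum of losses described above can be sketched in plain NumPy. The hidden size, the number of classes per task, the task weights and the random head matrices are all hypothetical; only the labels [2, 1, 1] come from the question.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, label):
    """Softmax cross-entropy for one example with an integer class label."""
    return -np.log(softmax(logits)[label])

rng = np.random.default_rng(0)
hidden = rng.normal(size=16)   # stand-in for the last shared hidden layer

# One dense head per task, sized to that task's number of classes.
n_classes = {"activity": 5, "time_of_day": 4, "weekday": 7}
labels = {"activity": 2, "time_of_day": 1, "weekday": 1}   # Y = [2, 1, 1]
task_weights = {"activity": 2.0, "time_of_day": 1.0, "weekday": 1.0}
heads = {task: rng.normal(size=(16, k)) * 0.1 for task, k in n_classes.items()}

# Separate loss per head, then a weighted sum as the training objective.
losses = {task: cross_entropy(hidden @ W, labels[task])
          for task, W in heads.items()}
total_loss = sum(task_weights[task] * losses[task] for task in losses)

print(losses, total_loss)
```

Each head yields its own probability distribution, so every entry of Y is treated as an independent classification problem that shares the same hidden representation.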