Keras predict() doesn't work as expected for a future timestep - tensorflow

I'm trying to do LSTM time-series prediction one timestep ahead using Keras. But whether I look at examples on the web or implement it myself, the model doesn't predict the next timestep; it just reproduces the current timestep, which is no prediction at all. Shouldn't the prediction be one timestep ahead of the test data? Here is what I mean:
I'm using:
self.model.predict(data)
Or is this intended, and do you have to manually shift your prediction array by one index, which makes the prediction really bad?

I was thinking about it wrong. The problem is that the test data gets split into samples and labels. With a window of 10, for example, we have 9 samples and 1 label, so the last value is missing when predicting a real-world future timestep from the last window. I have to create a third sample subset (alongside samples and labels) that is shifted by 1 index and is used to generate predictions, so that it is a real forecast.
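A minimal sketch of that idea (the helper make_windows and its names are mine, not from the original post): alongside the usual samples and labels, build a third window set shifted by one index, whose "label" position lies beyond the known data, so predicting on it is a genuine forecast.

import numpy as np

def make_windows(series, window=10):
    # Training windows: the first window-1 values are the sample,
    # the final value is the label (e.g. 9 samples + 1 label for window=10).
    X, y, X_shifted = [], [], []
    for i in range(len(series) - window + 1):
        X.append(series[i:i + window - 1])
        y.append(series[i + window - 1])
        # Shifted by 1 index: for the last window the "label" position
        # is beyond the known data, so predicting on it is a real forecast.
        X_shifted.append(series[i + 1:i + window])
    return np.array(X), np.array(y), np.array(X_shifted)

Feeding np.array(X_shifted)[..., None] (adding the feature axis an LSTM expects) to model.predict then yields true one-step-ahead predictions.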

Related

Keras Masking layer for LSTM input to mask features instead of timesteps

I gather that Masking layers in Keras are commonly used for handling data inputs with varying timesteps. Based on the documentation, I understand that if all of the features for a given timestep equal the mask value, then that timestep will be skipped in downstream layers.
For my problem, I am instead interested in using masking for features, where the data input shape to the network is (batch_size, num_timesteps, num_features). Essentially, I want to be able to predict a timeseries one step into the future with num_features features, but assuming that I won't always have all the features from the previous timestep to base my prediction on.
For example, one could predict RGB values one timestep into the future for a pixel in a video stream based on partial data from a previous timestep. At every timestep the output should be the full RGB, but at some timesteps you may get only RG, or only RB, or only BG as input; you never know which partial data you'll have at each timestep to base your prediction on. This is why I want some way to indicate a feature as masked during training to accommodate this kind of prediction.
It may be that Masking in Keras is not the correct mechanism to achieve this. What is the correct type of network layer that would give me this behavior?
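Keras' Masking layer won't do per-feature masking, but one common workaround (my suggestion, not from this thread) is to zero out the missing features and concatenate a binary availability mask as extra input channels, so the network can learn what "missing" means:

import numpy as np

# Hypothetical data: RGB features, some missing at random per timestep.
X = np.random.rand(32, 20, 3)                    # (batch, timesteps, features)
available = np.random.rand(32, 20, 3) > 0.3      # True where a feature was observed
X_masked = np.where(available, X, 0.0)           # zero out missing features
X_in = np.concatenate([X_masked, available.astype(np.float32)], axis=-1)
# X_in has shape (32, 20, 6): 3 possibly-zeroed features + 3 mask flags.
# A plain LSTM can consume X_in directly; no Masking layer is involved.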

Binary classification of every time series step based on past and future values

I'm currently facing a Machine Learning problem and I've reached a point where I need some help to proceed.
I have various time series of positional (x, y, z) data tracked by sensors. I've also derived some additional features; for example, I rasterized the whole 3D space and calculated cell_x, cell_y and cell_z for every time step. The time series themselves have variable lengths.
My goal is to build a model which classifies every time step with the labels 0 or 1 (binary classification based on past and future values). Therefore I have a lot of training time series where the labels are already set.
One thing that could be very problematic is that there are very few 1 labels in the data (for example, only 3 of 800 samples are labeled 1).
It would be great if someone could point me in the right direction, because there are many possible problems:
Wrong hyperparameters
Incorrect model
Too few 1 labels, though I think that's not a big problem because I only need the model to suggest the right time steps, so I would only use the peaks of the output.
Bad or too little training data
Bad features
I appreciate any help and tips.
Your model seems very strange. Why use only 2 units in the LSTM layer? Also, your problem is binary classification, so you should use only one neuron in your output layer (try inserting an additional Dense layer between the LSTM layer and the output, and try dropout layers between them).
Binary cross-entropy does not make much sense with 2 output neurons unless you have a multi-label problem, but if you switch to one output neuron it is the right loss. You then also need sigmoid as the activation function.
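A minimal sketch of that advice (layer sizes and num_features are placeholders; I also use return_sequences=True with TimeDistributed, which the answer doesn't spell out, since the goal is to label every time step):

from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout, TimeDistributed

num_features = 6   # placeholder: x, y, z plus cell_x, cell_y, cell_z

model = Sequential()
model.add(LSTM(32, return_sequences=True, input_shape=(None, num_features)))
model.add(Dropout(0.2))
model.add(TimeDistributed(Dense(16, activation='relu')))    # extra Dense layer
model.add(Dropout(0.2))
model.add(TimeDistributed(Dense(1, activation='sigmoid')))  # one sigmoid neuron per step
model.compile(loss='binary_crossentropy', optimizer='adam')

Since the labels depend on past and future values, wrapping the LSTM in keras.layers.Bidirectional would be a natural extension.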
As a last piece of advice: try class weights.
http://scikit-learn.org/stable/modules/generated/sklearn.utils.class_weight.compute_class_weight.html
This can make a huge difference if your labels are unbalanced.
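A sketch of the class-weight suggestion, assuming per-sample 0/1 labels in y_train (X_train, y_train and model are placeholders from your own pipeline):

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

classes = np.unique(y_train)     # e.g. array([0, 1])
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_train)
class_weight = dict(zip(classes, weights))   # rare 1s get a much larger weight

# Keras scales each sample's loss contribution by its class weight:
model.fit(X_train, y_train, epochs=20, class_weight=class_weight)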
You can create the model using TensorFlow's BasicLSTMCell; the shape of your data fits BasicLSTMCell. You can find the documentation for BasicLSTMCell here, and that documentation contains code that will help you build a BasicLSTMCell model. Hope this helps, cheers.

What are the effects of padding a tensor?

I'm working on a problem using Keras that has been presenting me with issues:
My X data is all of shape (num_samples, 8192, 8), but my Y data is of shape (num_samples, 4), where 4 is a one-hot encoded vector.
Both X and Y data will be run through LSTM layers, but the layers are rejecting the Y data because it doesn't match the shape of the X data.
Is padding the Y data with 0s so that it matches the dimensions of the X data unreasonable? What kind of effects would that have? Is there a better solution?
Edited for clarification:
As requested, here is more information:
My Y data represents the expected output of passing the X data through my model. This is my first time working with LSTMs, so I don't have an architecture in mind, but I'd like to use an architecture that works well with classifying long (8192-length) sequences of words into one of several categories. Additionally, the dataset that I have is of an immense size when fed through an LSTM, so I'm currently using batch-training.
Technologies being used:
Keras (Tensorflow Backend)
TL;DR Is padding one tensor with zeroes in all dimensions to match another tensor's shape a bad idea? What could be a better approach?
First of all, let's make sure your representation is actually what you think it is; the input to an LSTM (or any recurrent layer, for that matter) must have per-sample dimensionality (timesteps, features), i.e. if you have 1000 training samples, each consisting of 100 timesteps with 10 values per timestep, your input shape will be (100, 10). Therefore I assume from your question that each input sample in your X set has 8192 steps and 8 values per step. Great; a single LSTM layer can iterate over these and produce a 4-dimensional representation with absolutely no problem, just like so:
from keras.layers import Input, LSTM

myLongInput = Input(shape=(8192,8,))
myRecurrentFunction = LSTM(4)
myShortOutput = myRecurrentFunction(myLongInput)
myShortOutput.shape
TensorShape([Dimension(None), Dimension(4)])
I assume your problem stems from trying to apply yet another LSTM on top of the first one; the next LSTM expects a tensor that has a time dimension, but your output has none. If that is the case, you'll need to let your first LSTM also output the intermediate representations at each time step, like so:
myNewRecurrentFunction=LSTM(4, return_sequences=True)
myLongOutput = myNewRecurrentFunction(myLongInput)
myLongOutput.shape
TensorShape([Dimension(None), Dimension(None), Dimension(4)])
As you can see the new output is now a 3rd order tensor, with the second dimension now being the (yet unassigned) timesteps. You can repeat this process until your final output, where you usually don't need the intermediate representations but rather only the last one. (Sidenote: make sure to set the activation of your last layer to a softmax if your output is in one-hot format)
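Putting the pieces together, a minimal end-to-end sketch under the question's shapes ((8192, 8) inputs, 4-class one-hot targets); the layer sizes are placeholders:

from keras.models import Model
from keras.layers import Input, LSTM, Dense

inp = Input(shape=(8192, 8))
x = LSTM(32, return_sequences=True)(inp)   # keeps the time dimension for stacking
x = LSTM(16)(x)                            # final LSTM drops the time dimension
out = Dense(4, activation='softmax')(x)    # softmax to match the one-hot targets
model = Model(inp, out)
model.compile(loss='categorical_crossentropy', optimizer='adam')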
On to your original question, zero-padding has very little negative impact on your network. The network will strain itself a bit in the beginning trying to figure out the concept of the additional values you have just thrown at it, but will very soon be able to learn they're meaningless. This comes at a cost of a larger parameter space (therefore more time and memory complexity), but doesn't really affect predictive power most of the time.
I hope that was helpful.

LSTM Sequence Prediction in Keras just outputs last step in the input

I am currently working with Keras using TensorFlow as the backend. I have an LSTM sequence-prediction model, shown below, that I am using to predict one step ahead in a data series (input: 30 steps, each with 4 features; output: the predicted step 31).
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout, Activation

def build_model():
    model = Sequential()
    model.add(LSTM(
        input_dim=4,        # 4 features per timestep (old Keras 1 API; input_shape=(None, 4) in Keras 2)
        output_dim=75,      # 75 units (units=75 in Keras 2)
        return_sequences=True))
    model.add(Dropout(0.2))
    model.add(LSTM(
        150,
        return_sequences=False))
    model.add(Dropout(0.2))
    model.add(Dense(
        output_dim=4))
    model.add(Activation("linear"))
    model.compile(loss="mse", optimizer="rmsprop")
    return model
The issue I'm having is that after training the model and testing it (even with the same data it trained on), what it outputs is essentially the 30th step in the input. My first thought was that the patterns of my data must be too complex to accurately predict, at least with this relatively simple model, so the best answer it can return is essentially the last element of the input. To limit the possibility of over-fitting I've tried turning the training epochs down to 1, but the same behavior appears. I've never observed this behavior before, though, and I have worked with this type of data with successful results (for context, I'm using vibration data taken from 4 points on a complex physical system that has active stabilizers; the prediction is used in a PID loop for stabilization, hence why, at least for now, I'm using a simpler model to keep things fast).
Does that sound like the most likely cause, or does anyone have another idea? Has anyone seen this behavior before? In case it helps with visualization, here is what the prediction looks like for one vibration point compared to the desired output (note: these screenshots are zoomed-in selections of a very large dataset; as @MarcinMożejko noticed, I did not zoom quite the same both times, so any offset between the images is due to that. The intent is to show the horizontal offset between the prediction and the true data within each image):
...and compared to the 30th step of the input:
Note: Each data point seen by the Keras model is an average over many actual measurements, with the averaging window moved along in time. This is done because the vibration data is extremely chaotic at the smallest resolution I can measure, so instead I use this moving-average technique to predict the larger movements (which are the more important ones to counteract anyway). That is why the offset in the first image appears to be many points instead of just one: it is 'one average', or 100 individual points, of offset.
-----Edit 1, code used to get from the input datasets 'X_test, y_test' to the plots shown above-----
model_1 = lstm.build_model()  # The function above, pulled from another file 'lstm'
model_1.fit(
    X_test,
    Y_test,
    nb_epoch=1)
prediction = model_1.predict(X_test)
temp_predicted_sensor_b = (prediction[:, 0] + 1) * X_b_orig[:, 0]
sensor_b_y = (Y_test[:, 0] + 1) * X_b_orig[:, 0]
plot_results(temp_predicted_sensor_b, sensor_b_y)
plot_results(temp_predicted_sensor_b, X_b_orig[:, 29])
For context:
X_test.shape = (41541, 30, 4)
Y_test.shape = (41541, 4)
X_b_orig is the raw (averaged as described above) data from the b sensor. It is multiplied with the prediction and input data when plotting, to undo the normalization I apply to improve the prediction. It has shape (41541, 30).
----Edit 2----
Here is a link to a complete project setup to demonstrate this behavior:
https://github.com/ebirck/lstm_sequence_prediction
That is because, for your data (stock data?), the best prediction for the 31st value is the 30th value itself. The model is correct and fits the data.
I also have similar experience predicting the stock data.
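One quick way to confirm this diagnosis (my suggestion, using the shapes from the question): compare the model's error against the naive persistence baseline that simply repeats the last input step. If the two are nearly equal, the model has effectively learned persistence.

import numpy as np

pred = model_1.predict(X_test)      # model forecasts, shape (41541, 4)
persistence = X_test[:, -1, :]      # the 30th (last) input step, shape (41541, 4)

model_mse = np.mean((pred - Y_test) ** 2)
naive_mse = np.mean((persistence - Y_test) ** 2)
print(model_mse, naive_mse)         # near-equal values => the model learned persistence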
I feel I should post a follow-up, since it seems this post has been getting more attention than my other questions.
Ferret Zhang's answer is correct (and has been accepted), and I find this discovery actually quite funny when you understand it in relation to the stock / cryptocurrency data some have commented about. What sequence prediction ultimately does is assign statistical weights to the different possible moves and pick the highest-probability move to 'predict'. In the case of stock data, in a vacuum the series is (at least at this scale) essentially random: there is equal probability of moving up or down, and hence the model predicts that it will stay exactly the same.
The model, in a sense, learned that the best way to play is to not play at all :)

How to deal with multi step time series forecasting in multivariate LSTM in keras

I am trying to do multi-step time-series forecasting using a multivariate LSTM in Keras. Specifically, I originally have two variables (var1 and var2) for each time step. Having followed the online tutorial here, I decided to use data at times (t-2) and (t-1) to predict the value of var2 at time step t. As the sample data table shows, I am using the first 4 columns as input and Y as output. The code I have developed can be seen here, but I have three questions.
   var1(t-2)  var2(t-2)  var1(t-1)  var2(t-1)  var2(t)
2        1.5       -0.8        0.9       -0.5     -0.2
3        0.9       -0.5       -0.1       -0.2      0.2
4       -0.1       -0.2       -0.3        0.2      0.4
5       -0.3        0.2       -0.7        0.4      0.6
6       -0.7        0.4        0.2        0.6      0.7
Q1: I have trained an LSTM model with the data above. This model does well in predicting the value of var2 at time step t. However, what if I want to predict var2 at time step t+1? I feel it is hard because the model cannot tell me the value of var1 at time step t. If I want to do it, how should I modify the code to build the model?
Q2: I have seen this question asked a lot, but I am still confused. In my example, what should be the correct time step in [samples, time steps, features]: 1 or 2?
Q3: I just started studying LSTMs. I have read here that one of the biggest advantages of LSTM is that it learns the temporal dependence/sliding window size by itself, so why must we always convert time series data into the format of the table above?
Update: LSTM result (blue line is the training sequence, orange line is the ground truth, green is the prediction)
Question 1:
From your table, I see you have a sliding window over a single sequence, making many smaller sequences with 2 steps.
For predicting t, you take the first line of your table as input.
For predicting t+1, you take the second line as input.
If you're not using the table: see question 3
Question 2:
Assuming you're using that table as input, where it's clearly a sliding window case taking two time steps as input, your timeSteps is 2.
You should probably work as if var1 and var2 were features in the same sequence:
input_shape = (2,2) - Two time steps and two features/vars.
Question 3:
We do not need to make tables like that or build a sliding window case. That is one possible approach.
Your model is actually capable of learning things and deciding the size of this window itself.
If, on one hand, your model is capable of learning long time dependencies, allowing you not to use windows, then on the other hand it may learn to identify different behaviors at the beginning and in the middle of a sequence. In that case, if you want to predict using sequences that start from the middle (not including the beginning), your model may act as if it were the beginning and predict a different behavior. Using windows eliminates this very long influence. Which is better may depend on testing, I guess.
Not using windows:
If your data has 800 steps, feed all the 800 steps at once for training.
Here, we will need to separate two models, one for training, another for predicting. In training, we will take advantage of the parameter return_sequences=True. This means that for each input step, we will get an output step.
For predicting later, we will want only one output, so we will use return_sequences=False. And in case we are going to use the predicted outputs as inputs for the following steps, we are going to use a stateful=True layer.
Training:
Have your input data shaped as (1, 799, 2), 1 sequence, taking the steps from 1 to 799. Both vars in the same sequence (2 features).
Have your target data (Y) shaped also as (1, 799, 2), taking the same steps shifted, from 2 to 800.
Build a model with return_sequences=True. You may use timeSteps=799, but you may also use None (allowing a variable number of steps).
model.add(LSTM(units, input_shape=(None,2), return_sequences=True))
model.add(LSTM(2, return_sequences=True)) #it could be a Dense 2 too....
....
model.fit(X, Y, ....)
Predicting:
For predicting, create a similar model, now with return_sequences=False.
Copy the weights:
newModel.set_weights(model.get_weights())
You can make an input with length 800, for instance (shape: (1,800,2)) and predict just the next step:
step801 = newModel.predict(X)
If you want to predict more, we are going to use stateful=True layers. Use the same model again, now with return_sequences=False (only in the last LSTM; the others keep True) and stateful=True (all of them). Change input_shape to batch_input_shape=(1, None, 2).
#with stateful=True, your model will never think that the sequence ended
#each new batch will be seen as new steps instead of new sequences
#because of this, we need to call this when we want a sequence starting from zero:
statefulModel.reset_states()
#predicting
X = steps1to800 #input
step801 = statefulModel.predict(X).reshape(1,1,2)
step802 = statefulModel.predict(step801).reshape(1,1,2)
step803 = statefulModel.predict(step802).reshape(1,1,2)
#the reshape is needed because return_sequences=False eliminates the step dimension
Actually, you could do everything with a single stateful=True and return_sequences=True model, taking care of two things (a sketch follows this list):
When training, reset_states() for every epoch. (Train with a manual loop and epochs=1)
When predicting from more than one step, take only the last step of the output as the desired result.
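A sketch of that single-model approach (units, n_epochs, X and Y are placeholders; epochs vs. nb_epoch depends on your Keras version):

from keras.models import Sequential
from keras.layers import LSTM

model = Sequential()
model.add(LSTM(units, batch_input_shape=(1, None, 2), return_sequences=True, stateful=True))
model.add(LSTM(2, return_sequences=True, stateful=True))
model.compile(loss='mse', optimizer='adam')

# Training: manual loop with epochs=1, resetting states so each epoch
# sees the sequence as starting from zero.
for epoch in range(n_epochs):
    model.reset_states()
    model.fit(X, Y, epochs=1, batch_size=1, shuffle=False)

# Predicting several steps ahead: keep only the last step of each output.
model.reset_states()
out = model.predict(X)              # warm up on the 800 known steps, shape (1, 800, 2)
step801 = out[:, -1:, :]            # take only the last step
step802 = model.predict(step801)[:, -1:, :]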
Actually, you can't just feed in the raw time-series data, as the network won't fit it naturally. The current state of RNNs still requires you to input multiple 'features' (manually or automatically derived) for the network to learn something useful.
Usually the prior steps needed are (a sketch of the scaling step follows the list):
Detrend
Deseasonalize
Scale (normalize)
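For the scaling step, a minimal sketch using sklearn's MinMaxScaler (detrending and deseasonalizing would happen before this; the data here is hypothetical):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

series = np.random.rand(800, 2)            # placeholder (timesteps, features) data
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(series)      # in practice, fit on training data only

# After predicting in scaled space, map back to the original units:
# original = scaler.inverse_transform(predicted_scaled)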
A great source of information is this post from a Microsoft researcher who won a time-series forecasting competition by means of an LSTM network.
Also this post: CNTK - Time series Prediction