test set for time series forecasting - tensorflow

The example in the link below has a training and validation set built from time series data. There is no mention of a test set. Why isn't there one, and what would it entail to have one for a dataset whose time series data is being generated on the fly, in real time?
I have 3 hrs of data collected at 1 s intervals. I would like to predict the next 30 min before it becomes available. What should the train/validation/test split look like? Can the test set be skipped?
https://www.tensorflow.org/tutorials/structured_data/time_series

It is never recommended to skip the test set. In the TensorFlow example, the purpose was to demonstrate how you can work with time series; you evaluate on the test set just as you do with your validation set, with the constraint that the test set is completely unseen during training. Here we come to your second question.
With regard to the test set, in your use case, like you said, the test set is the data generated on the fly.
You can, of course, split your initial dataset into train/val/test. But the second test set, which evidently coincides with your model's live deployment, would be the on-the-fly-generated dataset: you would feed the data to your model in real time.
The train-val-test split depends on how you want to create your model: how many time steps you want to use (how many seconds to take into account when predicting the next step), how many variables you are trying to predict, and how many time steps ahead you want to predict (in your case, 30 minutes would be 30*60 = 1800 steps, since your dataset's sampling frequency is one second). It's a very broad question and relates more to how to create a dataset for multi-step time series prediction.

Related

What’s the advantage of using LSTM for time series predict as opposed to Regression?

In general, which of the two models should yield better, more accurate output for time series?
As you rightly mentioned, we can use linear regression with time series data as long as:
The inclusion of lagged terms as regressors does not create a collinearity problem.
Both the regressors and the explained variable are stationary.
Your errors are not correlated with each other.
The other linear regression assumptions apply.
No autocorrelation is the single most important assumption in linear regression. If autocorrelation is present, the consequences are the following:
Bias: Your “best fit line” will likely be way off because it will be pulled away from the “true line” by the effect of the lagged errors.
Inconsistency: Given the above, your sample estimators are unlikely to converge to the population parameters.
Inefficiency: While it is theoretically possible, your residuals are unlikely to be homoskedastic if they are autocorrelated. Thus, your confidence intervals and your hypothesis tests will be unreliable.
The Long Short-Term Memory network, by contrast, is a type of recurrent neural network (RNN). RNNs use previous time events to inform later ones. For example, to classify what kind of event is happening in a movie, the model needs to use information about previous events. RNNs work well if the problem requires only recent information to perform the present task; if the problem requires long-term dependencies, a plain RNN will struggle to model it. The LSTM was designed to learn long-term dependencies: it remembers information for long periods.
To focus on the 1st sequence: the model takes the feature of the time bar at index 0 and tries to predict the target of the time bar at index 1. Then it takes the feature of the time bar at index 1 and tries to predict the target at index 2, and so on. The feature of the 2nd sequence is shifted by one time bar from the feature of the 1st sequence, the 3rd from the 2nd, etc. With this procedure we get many shorter sequences, each shifted by a single time bar.
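The shifting procedure described above can be sketched as a small windowing helper (`make_sequences` is a hypothetical name, not a library function):

```python
import numpy as np

def make_sequences(series, window):
    """Slice a 1-D series into overlapping (features, target) pairs,
    each window shifted by one time bar from the previous one."""
    X, y = [], []
    for i in range(len(series) - window):
        X.append(series[i:i + window])   # window of past values
        y.append(series[i + window])     # value at the next time bar
    return np.array(X), np.array(y)

series = np.arange(10, dtype=float)
X, y = make_sequences(series, window=3)
print(X[0], y[0])          # [0. 1. 2.] 3.0
print(X.shape, y.shape)    # (7, 3) (7,)
```

Each row of `X` is one short sequence, shifted by a single time bar from the row before it, which is exactly the construction the paragraph describes.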

Should I shift a dataset to use it for regression with LSTM?

Maybe this is a silly question, but I didn't find much about it when I googled it.
I have a dataset that I use for regression, but a normal regression with an FFNN didn't work, so I thought: why not try an LSTM, since my data is time-dependent (it was taken from a vehicle while driving, so the data is monotonic), and maybe I can use an LSTM in this case to predict a continuous value (if this doesn't make sense, please tell me).
Now the first step is to prepare my data for the LSTM. Since I'll predict the future, I think my target (ground truth, or labels) should be shifted up, am I right?
So if I have a pandas dataframe where each row holds the features and the target (at the end of the row), I assume the features should stay where they are and the target should be shifted one step up, so that the features in the first row correspond to the target of the second row (am I wrong?).
This way the LSTM will be able to predict the future value from those features.
I didn't find much about this on the internet, so can you please show me how to do this with some code?
I also know that I can use pandas.DataFrame.shift to shift a dataset, but the last value will then hold a NaN, I think! How do I deal with this? It would be great if you could show me some examples or code.
We might need a bit more information regarding the data you are using. Also, I would suggest starting with a more simple recurrent neural network before you start going for LSTMs. The way these networks work is by you feeding the first bit of information, then the next bit of information, then the next bit etc. Let's say that when you feed the first bit of information in, it occurs at time t, then the second bit of information is fed at time t+1 ... etc. up until time t+n.
You can have the neural network output a value at each time step (so a value is output at time t, t+1, ... t+n after each respective input has been fed in); this is a many-to-many network. Or you can have the network output a single value after all inputs have been provided (i.e. the value is output at time t+n); this is called a many-to-one network. What you need is dependent on your use case.
For example, say you were recording vehicle behaviour every 100ms and after 10 seconds (i.e. the 100th time step), you wanted to predict the likelihood that the driver was under the influence of alcohol. In this case, you would use a many-to-one network where you put in subsequent vehicle behaviour recordings at subsequent time steps (the first recording at time t, then the next recording at time t+1 etc.) and then the final timestep has the probability value outputted.
If you want a value outputted after every time step, you use a many-to-many design. It's also possible to output a value every k timesteps.
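On the original question about shifting the target, a minimal pandas sketch would look like this (the column names `speed`, `rpm`, and `target` are hypothetical stand-ins for your vehicle data):

```python
import pandas as pd

# Hypothetical dataframe: two feature columns plus a target column.
df = pd.DataFrame({
    "speed":  [10.0, 12.0, 11.0, 13.0],
    "rpm":    [1500, 1600, 1550, 1700],
    "target": [0.1, 0.2, 0.3, 0.4],
})

# Shift the target one step up, so row t's features align with row t+1's target.
df["target"] = df["target"].shift(-1)

# The last row now holds NaN in the target column -- simply drop it.
df = df.dropna(subset=["target"])

print(df)
```

Dropping the final NaN row is the usual way to handle the gap `shift` creates; you lose one sample at the end of the series, which is normally negligible.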

How can I combine two time-series datasets with different time-steps?

I want to train a multivariate LSTM model using data from two datasets, MIMIC-1.0 and MIMIC-III. The problem is that the vital signs in the first dataset are recorded minute by minute, while in MIMIC-III the data is recorded hourly, so there is an interval difference between the two datasets.
I want to predict diagnoses from the vital signs by giving streams/sequences of vital signs to my model every 5 minutes. How can I merge both datasets for my model?
You need to find a common field on which you can merge, e.g. patient_ids or the like. You can do the same with ICU episode identifiers. It's been a while since I worked on the MIMIC dataset, so I can't recall exactly what those fields were.
Dataset   | Granularity | Subsampling for 5-minutely
MIMIC-I   | Minutely    | Subsample every 5th reading
MIMIC-III | Hourly      | Interpolate the 11 5-minutely readings between each pair of consecutive hourly readings
The interpolation method you choose to get the between hour readings could be as simple as forward-filling the last value. If the readings are more volatile, a more complex method may be appropriate.
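The table above can be sketched with pandas resampling (the series values and timestamps here are made up; only the resampling calls are the point):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the two sources: one minutely, one hourly.
idx_min = pd.date_range("2020-01-01", periods=120, freq="min")
mimic1 = pd.Series(np.arange(120.0), index=idx_min, name="hr")

idx_hr = pd.date_range("2020-01-01", periods=3, freq="h")
mimic3 = pd.Series([60.0, 70.0, 80.0], index=idx_hr, name="hr")

# MIMIC-I: keep every 5th minutely reading.
mimic1_5min = mimic1.resample("5min").first()

# MIMIC-III: upsample to a 5-minute grid, then fill the gaps between
# consecutive hourly readings (linear interpolation shown; forward-fill
# via .ffill() would be the simpler alternative).
mimic3_5min = mimic3.resample("5min").interpolate(method="linear")

print(mimic1_5min.head(3))
print(mimic3_5min.head(3))
```

Once both series sit on the same 5-minute grid, they can be concatenated or merged on the timestamp index plus the common patient/episode identifier.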

I used the train and test sets in SARIMA to predict the current value, but how am I going to predict values beyond my timestamps?

I already successfully built a SARIMA model based on my dataset. My question is how I am going to predict future values that lie beyond my dataset's timestamps. Any suggestion?
Here is the SARIMA model I was able to build, with a graph: the blue line is the current data, the red is the forecast.
Here is what my training and test sets look like. [images of the training set and test set omitted]
You can use:
forecast = model.forecast(steps=number_of_steps_into_the_future)
This returns the forecasts along with the standard error and the confidence interval. From that tuple, you can access the point forecasts as forecast[0].
Reference: http://www.statsmodels.org/stable/generated/statsmodels.tsa.arima_model.ARIMAResults.forecast.html

Does Google's AutoML Table shuffle my data samples before training/evaluation?

I searched through the documentation but still have no clue whether or not the service shuffles data before training/evaluation. I need to know this because my data is time series, so it would only be realistic to evaluate a model that was trained on samples from an earlier period of time.
Can someone please let me know the answer or guide me on how to figure this out?
I know that I can export the evaluation results and inspect them, but BigQuery doesn't seem to respect the order of the original data, and there is no absolute time feature in the data.
It doesn't shuffle the data; it splits it.
Take a look here: About controlling data split. It says:
By default, AutoML Tables randomly selects 80% of your data rows for training, 10% for validation, and 10% for testing.
If your data is time-sensitive, you should use the Time column.
By using it, AutoML Tables will use the earliest 80% of the rows for training, the next 10% of rows for validation, and the latest 10% of rows for testing.