Time Series data Augmentation in pytorch forecasting - data-wrangling

I have an hourly count time series of demand data, plus covariates such as weather information. With DeepAR in pytorch-forecasting I use 168 hours (7 days) for the encoder and 24 hours (the next day) for the decoder, e.g. MTWTFSS in the encoder to predict M (Monday).
After much testing I find that the 24 hours to be predicted are less correlated with the previous 7 days than with the same weekday in previous weeks. So I would need to use
MMMMMMM (Mondays of previous weeks) to predict M (next Monday).
Is it possible to tell the TimeSeriesDataSet object to train on this type of input?
I cannot simply build the series manually like this:
MMMMMMM TTTTTTTT WWWWWWWW…
because it would take any subsequence of this time series, such as MMMMTTTT, and use it for the encoder (MMMMTTT) and decoder (T). I do not want this. Is there a way to tell the TimeSeriesDataSet object to sample the time series only sequentially from the beginning, without any overlaps, during training? Then I could just feed the input time series as
MMMMMMM TTTTTTTT WWWWWWWW…
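One possible workaround, sketched here with pandas only, is to treat each weekday as its own series instead of concatenating the weekdays. The column names `timestamp`, `weekday` and `time_idx` and the helper `add_weekday_groups` are assumptions made for this illustration. The idea is that the resulting `weekday` column could be passed as `group_ids` and `time_idx` as the time index of a `TimeSeriesDataSet`, so that no window mixes Mondays with Tuesdays. Windows inside one group can still start in the middle of a day; restricting them to start at hour 0 would need a separate filtering step that is not shown here.

```python
import numpy as np
import pandas as pd


def add_weekday_groups(df, time_col="timestamp"):
    """Return a copy with a per-weekday group id and a gap-free hourly index per group.

    Assumes complete hourly data; column names are hypothetical.
    """
    out = df.sort_values(time_col).copy()
    ts = out[time_col]
    # "0" = Monday ... "6" = Sunday; stored as strings so it can act as a group id
    out["weekday"] = ts.dt.dayofweek.astype(str)
    # Monday (00:00) of the first week present in the data
    first = ts.min()
    first_monday = first.normalize() - pd.Timedelta(days=first.dayofweek)
    # Number of whole weeks elapsed since that Monday
    week = (ts.dt.normalize() - first_monday).dt.days // 7
    # Within one weekday group, consecutive weeks follow each other without gaps
    out["time_idx"] = week * 24 + ts.dt.hour
    return out


# Three weeks of synthetic hourly demand, starting on Monday 2024-01-01
hourly = pd.DataFrame(
    {"timestamp": pd.date_range("2024-01-01", periods=21 * 24, freq=pd.Timedelta(hours=1))}
)
hourly["demand"] = np.arange(len(hourly))
demo = add_weekday_groups(hourly)
```

With an encoder length of 168 and a prediction length of 24 on these groups, a window that starts at hour 0 covers seven earlier Mondays in the encoder and the following Monday in the decoder.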

Related

What is the best way to encode time features for a time series forecasting model?

Reference: https://github.com/zhengchuanpan/GMAN/blob/master/PeMS/utils.py
I am trying to do a time series forecasting model with time features encoded like the way they did in the file above. (Lines 70-76)
If time features are encoded this way, doesn't that introduce unnecessary ordinality for the model?
For example, if Sunday is encoded as 1 and Thursday is encoded as 5, will that make Thursday more important than Sunday, even though these features are nominal and not ordinal?
Is this understanding correct? If yes, could you help me understand why that decision was taken during model design, and how to avoid this ordinality?
I have tried one-hot encoding (day of the week: 7, minute of the hour: 12 (5-minute intervals), hour of the day: 24), resulting in 43 features. This pushes the dataset with time series sequences to 75 GB, and my computer is not liking that. Can someone point me to the right understanding/design here?
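One commonly used alternative to both integer codes and one-hot vectors is a cyclical sine/cosine encoding. It is not what the referenced repository does. The sketch below is only an illustration under that assumption; the function name `cyclical_time_features` and the column names are made up for it. It maps the same three time features to 6 columns instead of 43, and places the last and first value of each cycle (e.g. hour 23 and hour 0) next to each other.

```python
import numpy as np
import pandas as pd


def cyclical_time_features(timestamps):
    """Encode day of week, hour of day and 5-minute slot as sin/cos pairs."""
    ts = pd.DatetimeIndex(timestamps)
    parts = [
        ("dow", ts.dayofweek, 7),      # 0 = Monday ... 6 = Sunday
        ("hour", ts.hour, 24),
        ("slot", ts.minute // 5, 12),  # 5-minute slot within the hour
    ]
    feats = {}
    for name, value, period in parts:
        angle = 2.0 * np.pi * np.asarray(value) / period
        feats[name + "_sin"] = np.sin(angle)
        feats[name + "_cos"] = np.cos(angle)
    return pd.DataFrame(feats, index=ts)


stamps = pd.date_range("2024-01-01", periods=2 * 24 * 12, freq=pd.Timedelta(minutes=5))
time_feats = cyclical_time_features(stamps)
```

Note that this still imposes a circular ordering. That is natural for hour of day; for day of week, a small learned embedding is another option if no ordering at all is wanted.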

How can I combine two time-series datasets with different time-steps?

I want to train a multivariate LSTM model using data from two datasets, MIMIC-I and MIMIC-III. The problem is that the vital signs in the first dataset are recorded minute by minute, while in MIMIC-III they are recorded hourly, so the recording intervals of the two datasets differ.
I want to predict diagnoses from the vital signs by giving streams/sequences of vital signs to my model every 5 minutes. How can I merge both datasets for my model?
You need to find a common field on which you can merge, e.g. patient_ids or the like. You can do the same with ICU episode identifiers. It's been a while since I worked on the MIMIC dataset, so I can't recall exactly what those fields were.
Dataset    | Granularity | Subsampling for 5-minutely
MIMIC-I    | Minutely    | Subsample every 5th reading
MIMIC-III  | Hourly      | Interpolate the 11 5-minutely readings between each pair of consecutive hourly readings
The interpolation method you choose to get the between-hour readings could be as simple as forward-filling the last value. If the readings are more volatile, a more complex method may be appropriate.
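The two resampling steps could look roughly like this in pandas. This is a minimal sketch on synthetic data; the helper names and the example series are assumptions for illustration, not fields of the MIMIC datasets.

```python
import numpy as np
import pandas as pd


def subsample_to_5min(minutely):
    """Keep the first reading of every 5-minute bin (every 5th minutely reading)."""
    return minutely.resample("5min").first()


def upsample_to_5min(hourly, method="ffill"):
    """Put hourly readings on a 5-minute grid and fill the gaps in between."""
    grid = hourly.resample("5min").asfreq()  # new rows are NaN
    if method == "ffill":
        return grid.ffill()                  # repeat the last hourly value
    return grid.interpolate(method="time")   # linear in time between readings


# Synthetic vital sign: one reading per minute for two hours
minutely = pd.Series(
    np.arange(121, dtype=float),
    index=pd.date_range("2020-01-01", periods=121, freq=pd.Timedelta(minutes=1)),
)
# Synthetic vital sign: one reading per hour
hourly = pd.Series(
    [0.0, 12.0, 24.0],
    index=pd.date_range("2020-01-01", periods=3, freq=pd.Timedelta(hours=1)),
)

mimic1_5min = subsample_to_5min(minutely)
mimic3_ffill = upsample_to_5min(hourly, method="ffill")
mimic3_linear = upsample_to_5min(hourly, method="linear")
```

After this step both sources share a 5-minute grid and can be stacked or merged on the common patient or episode identifier.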

How to train an LSTM for multiple time series data, in both the univariate and the multivariate scenario?

I have data for hundreds of devices (pardon me, I am not giving much detail about the devices or the data recorded for them). For each device, data is recorded hourly.
The recorded data has 25 dimensions.
I have a few prediction tasks:
time series forecasting
for which I am using an LSTM. Since I have hundreds of devices, and each device is a (multivariate) time series, in total my data is a set of multiple time series with multivariate data.
To deal with multiple time series, my first approach is to concatenate the data one after another, treat it as one time series (either univariate or multivariate), and train my LSTM model on it.
But with this approach (concatenating the time series), I am actually losing the time property of my data, so I need a better approach.
Please suggest some ideas or blog posts.
Kindly don't confuse multiple time series with multivariate time series data.
You may consider a one-fits-all model or Seq2Seq, as e.g. this Google paper suggests. The approach works as follows:
Let us assume that you want to make a 1-day-ahead forecast (24 values) and you are using the last 7 days (7 * 24 = 168 values) as input.
In time series analysis the data is time dependent, so you need a validation strategy that respects this dependence, e.g. a rolling forecast approach. Keep separate hold-out data for testing your final trained model.
In the first step you generate slices of length 168 + 24 from each of your many time series (see the Google paper for an image). The x input has length 168 and the y target has length 24. Use all of the generated slices to train the LSTM/GRU network, and finally run predictions on your hold-out set.
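The slicing step can be sketched with NumPy as follows. The function name `make_windows` and the `step` parameter are assumptions made for this illustration. Each series is sliced on its own, so no window spans two devices, which avoids the problem of concatenating the series into one.

```python
import numpy as np


def make_windows(series_list, n_in=168, n_out=24, step=1):
    """Cut every series into (input, target) slices of length n_in + n_out.

    Each series is an array of shape (T,) or (T, n_features). Series shorter
    than n_in + n_out contribute no slices. Assumes at least one series is
    long enough, otherwise np.stack raises on the empty list.
    """
    xs, ys = [], []
    for series in series_list:
        s = np.asarray(series, dtype=np.float32)
        last_start = len(s) - n_in - n_out
        for start in range(0, last_start + 1, step):
            xs.append(s[start:start + n_in])
            ys.append(s[start + n_in:start + n_in + n_out])
    return np.stack(xs), np.stack(ys)


# Two univariate devices with different history lengths
device_a = np.arange(200)
device_b = np.arange(1000, 1192)
x_uni, y_uni = make_windows([device_a, device_b])

# One multivariate device with 25 recorded dimensions
device_c = np.zeros((200, 25))
x_multi, y_multi = make_windows([device_c])
```

For the hold-out set mentioned above, one option is to cut off the last part of every series before calling the function, so that no training slice reaches into the test period.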
Good papers on this issue:
Foundations of Sequence-to-Sequence Modeling for Time Series
Deep and Confident Prediction for Time Series at Uber
More resources:
Kaggle Winning Solution
Kaggle Web Traffic Time Series Forecasting
List is not comprehensive, but you can use it as a starting point.

Preparing time series data for building an RNN

I am preparing time series data to build an RNN model (LSTM). The data is collected from sensors installed in a mechanical plant. For example, I have data for the input and output temperature of a compressor along with the time stamps.
In the same way, data for around 20 parameters is recorded along with time stamps. The problem is that the time stamps at which the data is collected differ between parameters.
So how do I ideally match the time stamps to create a single dataframe with all the parameters and a single time stamp?
Since an RNN doesn't know anything about time deltas, only about time steps, you will need to quantize / interpolate your data onto a common time grid.
Find the smallest time delta Δt in all of your series
Resample all of your 20 series to Δt/2* or smaller (Nyquist theorem)
* Strictly speaking you'd need to do a Fourier transform and then use twice the cutoff frequency as the sampling rate. Δt/2 might IMHO be a good approximation.
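The two steps above can be sketched in pandas like this. The helper name `align_series` and the example sensors are assumptions for illustration, and the sketch uses the simple Δt/2 approximation rather than a Fourier analysis. It assumes every series has a sorted-able DatetimeIndex without duplicate time stamps.

```python
import pandas as pd


def align_series(series_by_name):
    """Resample several irregular series onto one grid with step = min gap / 2."""
    # Step 1: smallest gap between consecutive time stamps in any series
    dt_min = min(
        s.index.to_series().sort_values().diff().dropna().min()
        for s in series_by_name.values()
    )
    step = dt_min / 2
    # Common grid covering all series
    start = min(s.index.min() for s in series_by_name.values())
    end = max(s.index.max() for s in series_by_name.values())
    grid = pd.date_range(start, end, freq=step)
    # Step 2: interpolate each series in time, then keep only the grid points
    aligned = {}
    for name, s in series_by_name.items():
        s = s.sort_index()
        filled = s.reindex(s.index.union(grid)).interpolate(method="time")
        aligned[name] = filled.reindex(grid)
    return pd.DataFrame(aligned)


# Two synthetic sensors sampled at different, irregular times
temp_in = pd.Series(
    [0.0, 10.0, 20.0],
    index=pd.to_datetime(["2021-01-01 00:00", "2021-01-01 00:10", "2021-01-01 00:20"]),
)
temp_out = pd.Series(
    [0.0, 4.0, 20.0],
    index=pd.to_datetime(["2021-01-01 00:00", "2021-01-01 00:04", "2021-01-01 00:20"]),
)
plant = align_series({"temp_in": temp_in, "temp_out": temp_out})
```

Grid points before a sensor's first reading have nothing to interpolate from and stay NaN, so they would need to be dropped or filled separately before training.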

Using Torch for Time Series prediction using LSTMs

My main problem is how to pre-process my dataset, which is basically a sequence of numeric inputs arriving every minute, where 60 of them result in one hourly output. Each minute's input vector produces some output, but unfortunately this output cannot be observed until one hour has passed.
I thought about concatenating the 60 inputs into one big input vector that corresponds to one hourly output and feeding it to a normal ML classifier, hence having one sample at a time. But I don't think it would be a time series anymore.
How can I represent this so that it works with an LSTM?
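One way to keep the time structure is a many-to-one layout: each hour becomes one sample of shape (60 time steps, n_features), paired with its single hourly output, instead of one flat vector. The sketch below only shows this reshaping with NumPy; the function name `to_hourly_sequences` is made up for the illustration. It assumes the minute-level rows are complete and aligned so that rows 0 to 59 belong to the first hourly output.

```python
import numpy as np


def to_hourly_sequences(minute_inputs, hourly_targets, steps_per_hour=60):
    """Reshape minute-level inputs to (n_hours, steps_per_hour, n_features)."""
    x = np.asarray(minute_inputs, dtype=np.float32)
    if x.ndim == 1:
        x = x[:, None]  # a single feature per minute
    y = np.asarray(hourly_targets, dtype=np.float32)
    # Use only complete hours that also have an observed output
    n_hours = min(len(x) // steps_per_hour, len(y))
    x = x[: n_hours * steps_per_hour]
    x = x.reshape(n_hours, steps_per_hour, x.shape[1])
    return x, y[:n_hours]


# Three hours of synthetic minute-level data with 3 features per minute
minute_inputs = np.arange(180 * 3).reshape(180, 3)
hourly_targets = np.array([0.0, 1.0, 0.0])
x_seq, y_seq = to_hourly_sequences(minute_inputs, hourly_targets)
```

An LSTM configured for batch-first input can read such a batch step by step, with the output at the last time step of each sequence used to predict the hourly value.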