What is the best way to encode time features for a time series forecasting model? - numpy

Reference: https://github.com/zhengchuanpan/GMAN/blob/master/PeMS/utils.py
I am trying to build a time series forecasting model with time features encoded the way they are in the file above (lines 70-76).
If time features are encoded this way, doesn't that introduce unnecessary ordinality for the model?
For example, if Sunday is encoded as 1 and Thursday is encoded as 5, will that make Thursday seem more important than Sunday, even though these features are nominal rather than ordinal?
Is this understanding correct? If yes, could you help me understand why that decision was taken during model design, and how to avoid this ordinality?
I have tried one-hot encoding (day of the week - 7, minute of the hour - 12 (5-minute intervals), hour of the day - 24), resulting in 43 features, which pushes the dataset of time series sequences to 75 GB, and my computer is not liking that. Can someone point me to the right understanding/design here?
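One common alternative that avoids both the false ordinality of integer codes and the dimensionality blow-up of one-hot encoding is a cyclical (sin/cos) encoding: each periodic variable is mapped onto the unit circle, costing only 2 columns instead of one per category. A minimal numpy sketch (not from the linked repo, just an illustration):

```python
import numpy as np

def cyclical_encode(values, period):
    """Map a periodic feature onto the unit circle so the encoding has
    no artificial ordering and wraps around (hour 23 ends up close to
    hour 0, Sunday close to Monday)."""
    angle = 2 * np.pi * values / period
    return np.sin(angle), np.cos(angle)

# Day of week (0-6), hour of day (0-23), 5-minute slot of the hour (0-11)
dow_sin, dow_cos = cyclical_encode(np.arange(7), 7)
hour_sin, hour_cos = cyclical_encode(np.arange(24), 24)
slot_sin, slot_cos = cyclical_encode(np.arange(12), 12)

# 2 features per periodic variable instead of 7 + 24 + 12 one-hot
# columns: 6 features total rather than 43.
```

With this encoding Thursday is not "bigger" than Sunday, they are just different points on a circle, and the adjacency of Saturday/Sunday/Monday is preserved, which the one-hot encoding throws away.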

Related

Time Series data Augmentation in pytorch forecasting

I have a count time series of demand data and some covariates, like weather information, every hour. I have used 168 hours (7 days) for the encoder and 24 hours (the next day) for the decoder in DeepAR (pytorch-forecasting), e.g. using MTWTFSS for the encoder to predict M (Monday).
After much testing I find that the 24-hour prediction window is most correlated not with the previous 7 days but with the same weekday in past weeks. So I would need to use
MMMMMMM (Mondays of previous weeks) to predict M (next Monday).
Is it possible to tell the TimeSeriesDataset object to train using this type of input?
I cannot manually create this like below:
MMMMMMM TTTTTTTT WWWWWWWW…
because it will take any subsequence inside this time series, like MMMMTTTT, and use it for the encoder (MMMMTTT) and decoder (T), which I do not want. So is there a way to tell the TimeSeriesDataset object to only sample the time series sequentially from the beginning, without any overlaps during training, so that I can just feed the input time series as
MMMMMMM TTTTTTTT WWWWWWWW…
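One library-independent workaround for the sampling problem above is to restructure the hourly data into seven separate day-of-week series before building the dataset, so each series contains only Mondays, only Tuesdays, and so on, and any encoder window stays within one weekday. A rough numpy sketch (toy data, assuming the hourly series starts at Monday 00:00 and spans whole weeks):

```python
import numpy as np

n_weeks = 8
hourly = np.arange(n_weeks * 7 * 24, dtype=float)  # toy hourly demand

# Reshape to (week, weekday, hour), then pull out one series per weekday:
# weekday_series[d] holds all hours of weekday d, week after week.
by_day = hourly.reshape(n_weeks, 7, 24)
weekday_series = [by_day[:, d, :].reshape(-1) for d in range(7)]

# weekday_series[0] is "MMMMMMMM...": Monday hours of week 0, week 1, ...
# Giving each of these its own group id in the dataset keeps a 7*24-hour
# encoder window entirely within a single weekday.
```

Each weekday series can then be fed to the model as an independent group, which achieves the MMMMMMM → M behaviour without relying on any special sampling option.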

Using each individual time step of the input as feature in a RNN

Suppose I want to create an RNN model that will learn to predict 24 hours into the future, given 24 hours of the past. Traditionally, to create such a multistep model, I would have an input like a 24-hour time series with one feature, e.g. temperature. What if I regard each time step of the 24 hours as an individual feature, so that I have 24 features with one input to predict 24 features with one output? Would such a model be superior to a traditional model?
If you always have a fixed number of inputs (like 24 hours of temperatures), it might be a good idea not to use an RNN but instead to go with a traditional feed-forward structure. The main reason to use RNNs is that they can handle time series of variable length, so for applications that need this property they are pretty much the only choice. In your case a feed-forward architecture will probably work fine, but it should not be that difficult to just try both.
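Either way, the series has to be arranged into fixed 24-in/24-out pairs, where each past hour becomes one input feature. A sliding-window sketch in numpy (using `sliding_window_view`, available since NumPy 1.20; the temperature data is made up):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def make_pairs(series, n_in=24, n_out=24):
    """Turn a 1-D series into (X, y) pairs: each row of X is 24 past
    values treated as 24 features, each row of y the next 24 values."""
    windows = sliding_window_view(series, n_in + n_out)
    return windows[:, :n_in], windows[:, n_in:]

temps = np.sin(np.linspace(0, 20, 500))  # toy temperature series
X, y = make_pairs(temps)
# X and y each have 24 columns: ready for a feed-forward net
# mapping 24 inputs to 24 outputs.
```

The same arrays also work for the RNN variant (after reshaping X to (samples, 24, 1)), so comparing the two architectures on identical data is straightforward.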

HOW to train LSTM for Multiple time series data - both for Univariate and Multivariate scenario?

I have data for hundreds of devices (pardon me, I am not specifying much detail about the devices and the data recorded for them). For each device, data is recorded on an hourly basis.
The recorded data has 25 dimensions.
I have a few prediction tasks, including time series forecasting, for which I am using an LSTM. Because I have hundreds of devices, and each device is a (multivariate) time series, my data as a whole is multiple time series with multivariate data.
To deal with multiple time series, my first approach is to concatenate the data one after another, treat it as one time series (either univariate or multivariate), and train my LSTM model on that.
But with this approach (concatenating the time series), I actually lose the time property of my data, so I need a better approach.
Please suggest some ideas, or blog posts.
Please don't confuse multiple time series with multivariate time series data.
You may consider a One-fits-all model or Seq2Seq as e.g. this Google paper suggests. The approach works as follows:
Let us assume that you want to make a 1-day-ahead forecast (24 values) and you are using the last 7 days (7 * 24 = 168 values) as input.
In time series analysis the data is time dependent, so you need a validation strategy that respects this time dependence, e.g. a rolling forecast approach. Set aside separate hold-out data for testing your final trained model.
In the first step you generate, out of your many time series, slices of length 168 + 24 (see the Google paper for an illustration). The x input will have length 168 and the y target length 24. Use all of your generated slices for training the LSTM/GRU network, and finally predict on your hold-out set.
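The slicing step described above can be sketched in numpy as follows (toy data; in practice each slice would also carry the device id and covariates, and the slices should be split by time for the rolling validation):

```python
import numpy as np

def make_slices(series, enc_len=168, dec_len=24, stride=24):
    """Cut one series into (encoder, decoder) slices of 168 + 24 values,
    advancing one day at a time."""
    xs, ys = [], []
    total = enc_len + dec_len
    for start in range(0, len(series) - total + 1, stride):
        xs.append(series[start:start + enc_len])
        ys.append(series[start + enc_len:start + total])
    return np.array(xs), np.array(ys)

# Pool slices from all devices into one training set ("one-fits-all"):
devices = [np.random.rand(24 * 30) for _ in range(5)]  # 5 devices, 30 days
X = np.concatenate([make_slices(s)[0] for s in devices])
Y = np.concatenate([make_slices(s)[1] for s in devices])
```

Pooling the slices from all devices is what makes this a one-fits-all model: a single network sees windows from every device instead of one model per device.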
Good papers on this issue:
Foundations of Sequence-to-Sequence Modeling for Time Series
Deep and Confident Prediction for Time Series at Uber
Kaggle Winning Solution
Kaggle Web Traffic Time Series Forecasting
List is not comprehensive, but you can use it as a starting point.

Accord.Net Implementation of weather scenario

I am trying to implement a weather prediction application based on hidden Markov models, using the Accord framework. I am having some trouble mapping the concepts onto HMM structures and would like some insights. I am starting off with their sample application, which can be found here: https://github.com/accord-net/framework/tree/master/Samples/Statistics/Gestures%20(HMMs)
Imagine the following scenario:
I am told every 6 hours what the weather is like: Cloudy, Sunny or Rainy. These would be my states in the framework, correct?
Besides that, I have access to the results of two different instruments: an air humidity meter and a wind speed meter. For simplicity, let's assume that both instruments provide a measurement from 0 to 100, split into 4 ranges. I would have values 0, 1, 2 and 3 for the humidity observations (0-25, 26-50, 51-75, 76-100), and the same ranges for wind would have values 4, 5, 6 and 7. These would be my observable values for the sequences.
For a couple of days, I store the observations from those two instruments and save the data for future use, for learning purposes.
One question I have is about timing. Since I plan to know the state every 6 hours, does it make sense, or is it even possible, to store instrument observations at a different rate? For example, if I stored instrument observations every hour, I would end up with a 12-element sequence and the corresponding state, something like this for the first 12 hours:
0-4-0-5-0-4-1-7-1-6-0-4 - Cloudy
0-4-0-5-0-4-0-4-0-5-0-4 - Sunny
The 12 element sequence would be:
First hour observation of humidity - observation of wind speed (0-4)
Second hour observation of humidity - observation of wind speed (0-5)
and so on...
Should I, besides observation sequences and states, use labels for each of the instruments? Something like this:
0-0-0-1-1-0 - Humidity - Cloudy
4-5-4-4-5-4 - Wind Sp - Sunny
0-0-0-1-1-0 - Humidity - Cloudy
4-5-4-4-5-4 - Wind Sp - Sunny
The labels would be the instruments being measured, the sequences would be the observed values, and the states would be the state at the end of each 6-hour window.
With this, I would like to feed the information back into the model and predict the next state. Am I approaching this problem correctly? Would I be able to do what I need like this?
Thank you.
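Rather than interleaving humidity and wind symbols within one sequence, an alternative worth considering is to combine the two simultaneous readings into a single symbol per hour (16 symbols instead of 8, but 6 observations per 6-hour state instead of 12, and no label per instrument needed). A hypothetical sketch of the encoding in Python, with made-up readings:

```python
def bucket(value, n_bins=4, lo=0, hi=100):
    """Discretise a 0-100 reading into bins 0..3."""
    width = (hi - lo) / n_bins
    return min(int((value - lo) // width), n_bins - 1)

def encode(humidity, wind, n_bins=4):
    """Combine two simultaneous readings into one symbol in 0..15,
    so each hour yields a single observation."""
    return bucket(humidity) * n_bins + bucket(wind)

# (humidity %, wind speed) per hour -> one discrete symbol per hour
hours = [(10, 55), (20, 80), (60, 30)]
sequence = [encode(h, w) for h, w in hours]
```

The resulting integer sequences can then be fed to a discrete HMM learner (in Accord.NET or elsewhere) with one observation per hour and one state label per 6-hour window.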

Correct way to feed timestamps into RNN?

I built an RNN that predicts query execution time for unseen queries. I want to add a timestamp as a feature, as it probably helps estimate whether the server is busy. How can I combine a date/time variable with my query vector and feed it into my RNN model?
Yes, I could calculate the time delta by hand and feed it as a float, but that feels like cheating.
Regardless of the model you are using, your goal is to translate date-time stamps into numerical features that can give some insight into when the server is busy.
If you have periodic server usage, you might want to create periodic numerical features, e.g. hour (0-23), minute, or even weekday (0-6). If there is a linear trend over time (say, server usage is slowly going up on average), you might also want to translate the timestamps into a correctly scaled "time since ..." feature, e.g. number of days since the first observation, number of weeks, etc.
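For illustration, a minimal sketch of deriving both kinds of features from raw timestamps with Python's standard library (the origin date is made up):

```python
from datetime import datetime

def time_features(ts, origin):
    """Extract periodic features (hour, weekday) plus a trend feature
    (days since the first observation) from a timestamp."""
    return {
        "hour": ts.hour,          # periodic, 0-23
        "weekday": ts.weekday(),  # periodic, 0-6 (Monday = 0)
        "days_since_start": (ts - origin).total_seconds() / 86400,  # trend
    }

origin = datetime(2023, 1, 2)  # assumed first observation (a Monday)
feats = time_features(datetime(2023, 1, 9, 15, 30), origin)
# -> {'hour': 15, 'weekday': 0, 'days_since_start': 7.6458...}
```

These numeric values can simply be appended to the query vector at each step; the periodic ones are also good candidates for the sin/cos treatment discussed in the first question above.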
I hope that helps.