LIBSVM Data Preparation: Excel data to LIBSVM format - libsvm

I want to study how to perform LIBSVM for regression and I'm currently stuck in preparing my data. Currently I have this form of data in .csv and .xlsx format and I want to convert it into libsvm data format.
So far, I understand that the data should be in this format so that it can be used in LIBSVM:
Based on what I read, for regression, "label" is the target value which can be any real number.
I am doing a electric load prediction study. Can anyone tell me what it is? And finally, how should I organized my columns and rows?

The LIBSVM data format is given by:
<label> <index1>:<value1> <index2>:<value2> .........
As you can see, this forms a matrix [(IndexCount + 1) columns, LineCount rows]. More precisely a sparse matrix. If you specify a value for each index, you have a dense matrix, but if you only specify a few indices like <label> <5:value> <8:value>, only the indices 5 and 8 and of course label will have a custom value, all other values are set to 0. This is just for notational simplicity or to save space, since datasets can be huge.
For the meanig of the tags, I cite the ReadMe file:
<label> is the target value of the training data. For classification,
it should be an integer which identifies a class (multi-class
classification is supported). For regression, it's any real
number. For one-class SVM, it's not used so can be any number.
is an integer starting from 1, <value> is a real number. The indices
must be in an ascending order.
As you can see, the label is the data you want to predict. The index marks a feature of your data and its value. A feature is simply an indicator to associate or correlate your target value with, so a better prediction can be made.
Totally Fictional story time: Gabriel Luna (a totally fictional character) wants to predict his energy consumption for the next few days. He found out, that the outside temperature from the day before is a good indicator for that, so he selects Temperature with index 1 as feature. Important: Indices always start at one, zero can sometimes cause strange LIBSVM behaviour. Then, he surprisingly notices, that the day of the week (Monday to Sunday or 0 to 6) also affects his load, so he selects it as a second feature with index 2. A matrix row for LIBSVM now has the following format:
<myLoad_Value> <1:outsideTemperatureFromYesterday_Value> <2:dayOfTheWeek_Value>
Gabriel Luna (he is Batman at night) now captures these data over a few weeks, which could look something like this (load in kWh, temperature in °C, day as mentioned above):
0.72 1:25 2:0
0.65 1:21 2:1
0.68 2:29 2:2
...
Notice, that we could leave out 2:0, because of the sparse matrix format. This would be your training data to train a LIBSVM model. Then, we predict the load of tomorrow as follows. You know the temperature of today, let us say 23°C and today is Tuesday, which is 1, so tomorrow is 2. So, this is the line or vector to use with the model:
0 1:23 2:2
Here, you can set the <label> value arbitrarily. It will be overwritten with the predicted value. I hope this helps.

Related

should I shift a dataset to use it for regression with LSTM?

Maybe this is a silly question but I didn't find much about it when I google it.
I have a dataset and I use it for regression but a normal regression with FFNN didn't worked so I thought why not try an LSTM since my data is time dependent I think because it was token from a vehicle while driving so the data is monotonic and maybe I can use LSTM in this Case to do a regression to predict a continuous value (if this doesn't make sense please tell me).
Now the first step is to prepare my data for using LSTM, since I ll predict the future I think my target(Ground truth or labels) should be shifted to the up, am I right?
So if I have a pandas dataframe where each row hold the features and the target(at the end of the row), I assume that the features should stay where they are and the target would be shifted it one step up so that the features in the first row will correspond to the target of the second row (am I wrong).
This way the LSTM will be able to predict the future value from those features.
I didn't find much about this in the internet so please can you provide me how can I do this with some Code?
I also know what I can use pandas.DataFrame.shift to shift a dataset but the last value will hold a NaN I think! how to deal with this? it would be great if you show me some examples or code.
We might need a bit more information regarding the data you are using. Also, I would suggest starting with a more simple recurrent neural network before you start going for LSTMs. The way these networks work is by you feeding the first bit of information, then the next bit of information, then the next bit etc. Let's say that when you feed the first bit of information in, it occurs at time t, then the second bit of information is fed at time t+1 ... etc. up until time t+n.
You can have the neural network output a value at each time step (so a value is outputted at time t, t+1... t+n after each respective input has been fed in). This is a many-to-many network. Or you can have the neural network output a value after all inputs have been provided (i.e. the value is outputted at time t+n). This is called a many-to-one network. What you need is dependednt on your use-case.
For example, say you were recording vehicle behaviour every 100ms and after 10 seconds (i.e. the 100th time step), you wanted to predict the likelihood that the driver was under the influence of alcohol. In this case, you would use a many-to-one network where you put in subsequent vehicle behaviour recordings at subsequent time steps (the first recording at time t, then the next recording at time t+1 etc.) and then the final timestep has the probability value outputted.
If you want a value outputted after every time step, you use a many-to-many design. It's also possible to output a value every k timesteps.

HOW to train LSTM for Multiple time series data - both for Univariate and Multivariate scenario?

I have data for hundreds of devices(pardon me, I am not specifying much detail about device and data recorded for devices). For each device, data is recorded per hour basis.
Data recorded are of 25 dimensions.
I have few prediction tasks
time series forecasting
where I am using LSTM. As because I have hundreds of devices, and each device is a time series(multivariate data), so all total my data is a Multiple time series with multivariate data.
To deal with multiple time series - my first approach is to concatenate data one after another and treat them as one time series (it can be both uni variate or multi variate) and apply LSTM and train my LSTM model.
But by this above approach(by concatenating time series data), actually I am loosing my time property of my data, so I need a better approach.
Please suggest some ideas, or blog posts.
Kindly don't confuse with Multiple time series with Multi variate time series data.
You may consider a One-fits-all model or Seq2Seq as e.g. this Google paper suggests. The approach works as follows:
Let us assume that you wanna make a 1-day ahead forecast (24 values) and you are using last 7 days (7 * 24 = 168 values) as input.
In time series analysis data is time dependent, such that you need a validation strategy that considers this time dependence, e.g. by rolling forecast approach. Separate hold-out data for testing your final trained model.
In the first step you will generate out of your many time series 168 + 24 slices (see the Google paper for an image). The x input will have length 168 and the y input 24. Use all of your generated slices for training the LSTM/GRU network and finally do prediction on your hold-out set.
Good papers on this issue:
Foundations of Sequence-to-Sequence Modeling for Time Series
Deep and Confident Prediction for Time Series at Uber
more
Kaggle Winning Solution
Kaggle Web Traffic Time Series Forecasting
List is not comprehensive, but you can use it as a starting point.

Tensorflow-Deeplearning - Correlation between input and output

I'm experimenting with tensorflow for speech recognition.
I have inputs as waveforms and words as output.
The waveform would look like this
[0,0,0,-2,3,-4,-1,7,0,0,0...0,0,0,20,-11,4,0,0,1,...]
The words would be an array of numbers while each number represents a word:
[12,4,2,3]
After training I also want to find out the correlation between input and output for each output label.
For example I want to know which input neurons | samples are responsible for the first label (here 12).
[0,0.01,0.10,0.99,0.77,0.89,0.99,0.79,0.22,0.11,0...0,0,0,0,0,0,0,0,0,...]
The original values of the input would be replaced with the correlation while 0 means no correlation and 1 means total correlation.
The goal is to get the position when a word starts.
Is there a function in tensorflow to get this correlation?
Question
I have a sequence of data (X) that I want to translate into another sequence of data (Y) as well as report what part of (X) contributed to (Y).
Answer
This is a well known problem and Tensorflow.org actually has a fantastic example neural machine translation with attention
The example code show how to translate X (Spanish) into Y (English) and report what part of X contributes to the decision of each part of Y (attention)
The exact same principle and code can be used to translate X (wave data) into Y (words) and report what part of the wave data contributes to each word via the attention readout.
The attention layer in the example is called attention_layer.

Tensorflow RNN Input shape

Updated question: This is a good resource: http://machinelearningmastery.com/understanding-stateful-lstm-recurrent-neural-networks-python-keras/
See the section on "LSTM State Within A Batch".
If I interpret this correctly, the author did not need to reshape the data as x,y,z (as he did in the preceding example); he just increased the batch size. So an LSTM cells hidden state (the one that gets passed from one time step to the next) started at row 0, and keeps getting updated until all rows in the batch have finished? is that right?
If that is correct then why does one ever need to have a time step greater than 1? Could I not just stack all my time-series rows in order, and feed them as a single batch?
Original question:
I'm getting myself into an absolute muddle trying to understand the correct way to shape my data for tensorflow, particularly around time_steps. Reading around has only confused me further, so I thought I'd cave in and ask.
I'm trying to model time series data in which the data at time t is a 5 columns in width (5 features , 1 label).
So then t-1 will also have another 5 features, and 1 label
Here is an example with 2 rows.
x=[1,1,1,1,1] y=[5]
x=[2,2,2,2,2] y=[15]
I've got an RNN model to work by feeding in a 1x1x5 matrix into my x variable. Which implies my 'time step' has a dimension of 1. However as with the above example, the second line I feed in is correlated to the first (15 = 5 +(2+2+2+2+20 in case you haven't spotted it)
So is the way I'm currently entering it correct? How does the time stamp dimension work?
Or should I be thinking of it as batch size, rows, cols in my head?
Either way can someone tell me what are the dimensions are I should be reshaping my input data to? For sake of argument assume I've split the data into batches of 1000. So within those 1000 rows I want a prediction for every row, but the RNN should be look to the row above it in my batch to figure out the answer.
x1=[1,1,1,1,1] y=[5]
x2=[2,2,2,2,2] y=[15]
...
etc.

Using Torch for Time Series prediction using LSTMs

My main problem is how should I pre-process my dataset that is basically a 60 minutely sequenced numbers inputs that will result in a 1 hourly output. Knowing that each input vector every minute is producing some output, but unfortunately this output can't be observed until 1 hour is passed.
I thought about considering putting 60 inputs as one big input vector which corresponds to 1 hourly output on a normal ML classfier, hence having 1 sample at a time. But I don't think it would be time series anymore.
How can I represent that to be doable in an LSTM environment?