Does deeper LSTM need more units? - tensorflow

I'm applying an LSTM to time series forecasting with 20 lags. Suppose we have two cases: the first uses just five lags and the second (like my case) uses 20 lags. Is it correct that the second case needs more units than the first? If so, how can we support this idea? I have 2000 samples for training the model, which is the main limitation on increasing the number of units here.

It is very difficult to give an exact answer, as the relationship between timesteps and the number of hidden units is not an exact science. For example, the following factors can affect the number of units required.
Short-term memory problems vs long-term memory problems
If your problem can be solved with relatively little memory (i.e. it only needs to remember a few time steps), you won't get much benefit from adding more neurons as you increase the number of steps.
The amount of data
If you don't have enough data for the model to learn from (which I feel you will run into with 2000 data points - but I could be wrong), then increasing the number of timesteps won't help you much.
The type of model you use
Depending on the type of model you use (e.g. LSTM / GRU) you might get different results (this is not always true but can happen for certain problems).
I'm sure there are other factors out there, but these are a few that came to mind.
Proving that more units give better results with more time steps (if true)
That should be relatively easy, as you can try a few different options:
5 lags with 10 / 20 / 50 hidden units
20 lags with 10 / 20 / 50 hidden units
If you get better performance (e.g. lower MSE) on the 20-lag problem than on the 5-lag problem (when you use 50 units), then you have gotten your point across. You can reinforce your claims by showing results with different types of models (e.g. LSTMs vs GRUs), as in the sketch below.
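To make that comparison concrete, here is a minimal sketch of such a grid, assuming a univariate series of about 2000 points; the synthetic data, the make_windows helper and all hyperparameters are placeholders you would swap for your own setup.

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

def make_windows(series, n_lags):
    # Turn a 1-D series into (samples, n_lags, 1) inputs and next-step targets
    X, y = [], []
    for i in range(len(series) - n_lags):
        X.append(series[i:i + n_lags])
        y.append(series[i + n_lags])
    return np.array(X)[..., None], np.array(y)

series = np.sin(np.linspace(0, 100, 2000))  # placeholder for your 2000 samples

for n_lags in (5, 20):
    X, y = make_windows(series, n_lags)
    split = int(0.8 * len(X))
    for units in (10, 20, 50):
        model = Sequential([LSTM(units, input_shape=(n_lags, 1)), Dense(1)])
        model.compile(loss="mse", optimizer="adam")
        model.fit(X[:split], y[:split], epochs=20, verbose=0)
        mse = model.evaluate(X[split:], y[split:], verbose=0)
        print(f"lags={n_lags:2d} units={units:2d} test MSE={mse:.4f}")

Comparing the printed MSEs across the grid is exactly the experiment described above.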


Tensorflow / Keras: Normalize train / test / realtime Data or how to handle reality?

I started developing some LSTM models and now have some questions about normalization.
Let's say I have some time series data that roughly ranges between +500 and -500. Is it more realistic to scale the data from -1 to 1, or is 0 to 1 a better way? I tested it, and 0 to 1 seemed to train faster. Is there a wrong way to do it, or would it just be slower to learn?
Second question: when do I normalize the data? I split the data into training and test data - do I have to scale / normalize them separately? Maybe the training data only ranges between +300 and -200 while the test data ranges from +600 to -100. That's not very good, I guess.
But on the other hand, if I scale / normalize the entire dataframe and split it afterwards, the data is fine for training and test, but how do I handle genuinely new incoming data? The model is trained on scaled data, so I have to scale the new data as well, right? But what if the new data is 1000? Normalization would turn this into something greater than 1, because it's a bigger number than anything seen before.
To make a long story short: when do I normalize data, and what happens to completely new data?
I hope I could make it clear what my problem is :D
Thank you very much!
Would like to know how to handle reality as well tbh...
On a serious note though:
1. How to normalize data
Usually, neural networks benefit from data drawn from a standard Gaussian distribution (mean 0 and variance 1).
Techniques like Batch Normalization (simplifying a bit) help the network keep this trait throughout the whole network, so it's usually beneficial.
There are other approaches, like the ones you mentioned; to tell reliably what helps for which problem and architecture, you simply have to check and measure.
2. What about test data?
The mean to subtract and the variance to divide each instance by (or any other statistics you gather by whatever normalization scheme you use) should be computed from your training dataset. If you take them from the test set, you perform data leakage (information about the test distribution is incorporated into training) and you may get a false impression that your algorithm performs better than it does in reality.
So just compute the statistics over the training dataset and apply them to incoming / validation / test data as well.
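A minimal sketch of that rule with scikit-learn's StandardScaler, using placeholder train/test arrays; the point is only that fit() sees training data and transform() is reused everywhere else.

import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.random.uniform(-200, 300, size=(800, 1))  # placeholder training series
test = np.random.uniform(-100, 600, size=(200, 1))   # placeholder held-out series

scaler = StandardScaler()
scaler.fit(train)                     # statistics come from training data only

train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)  # same statistics, no leakage

new_point = np.array([[1000.0]])      # unseen "real" data outside the old range
print(scaler.transform(new_point))    # may fall outside the training range; that's expected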

I want train_test_split to train mainly on one specific number range

I am running some regression models in jupyter/python to predict the cycle time of certain projects. I used train_test_split from sklearn to randomly divide my data set.
The models tend to work pretty well for projects with high cycle times (between 150 - 300 days), but I care more about the lower cycle times between 0 and 50 days.
I believe the model is more accurate for the higher range because most of the projects (about 60-70%) have cycle times over 100 days. I want my model to mainly get the lower cycle times right, because for the purposes of what I'm doing, a project with a cycle time of 120 days is the same as a project with a 300-day cycle time.
In my mind, I need to train more on the projects with shorter cycle times - I feel like this might help.
Is there a way to split the data less randomly, i.e. train on a higher ratio of shorter cycle time projects?
Is there a better or different approach I should consider?
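One hedged way to express the idea in the question - emphasising short cycle times without changing the split itself - is to pass sample weights to the regressor; the DataFrame, column names and the 50-day threshold below are all hypothetical.

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical project data: one feature and a cycle time target
df = pd.DataFrame({
    "feature_a": np.random.rand(500),
    "cycle_time": np.random.uniform(0, 300, 500),
})
X = df[["feature_a"]]
y = df["cycle_time"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Give projects under 50 days, say, five times the weight of the rest
weights = np.where(y_train <= 50, 5.0, 1.0)

model = LinearRegression()
model.fit(X_train, y_train, sample_weight=weights)

Another option along the same lines would be to oversample the short-cycle rows in the training set.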

Building a multivariate, multi-task LSTM with Keras

Preamble
I am currently working on a Machine Learning problem where we are tasked with using past data on product sales in order to predict sales volumes going forward (so that shops can better plan their stocks). We essentially have time series data, where for each and every product we know how many units were sold on which days. We also have information like what the weather was like, whether there was a public holiday, whether any of the products were on sale, etc.
We've been able to model this with some success using an MLP with dense layers, and just using a sliding window approach to include sales volumes from the surrounding days. However, we believe we'll be able to get much better results with a time-series approach such as an LSTM.
Data
The data we have essentially is as follows:
(EDIT: for clarity the "Time" column in the picture above is not correct. We have inputs once per day, not once per month. But otherwise the structure is the same!)
So the X data is of shape:
(numProducts, numTimesteps, numFeatures) = (50 products, 1096 days, 90 features)
And the Y data is of shape:
(numProducts, numTimesteps, numTargets) = (50 products, 1096 days, 3 binary targets)
So we have data for three years (2014, 2015, 2016) and want to train on this in order to make predictions for 2017. (That's of course not 100% true, since we actually have data up to Oct 2017, but let's just ignore that for now)
Problem
I would like to build an LSTM in Keras that allows me to make these predictions. There are a few places where I am getting stuck though. So I have six concrete questions (I know one is supposed to try to limit a Stackoverflow post to one question, but these are all intertwined).
Firstly, how would I slice up my data for the batches? Since I have three full years, does it make sense to simply push through three batches, each of size one year? Or does it make more sense to make smaller batches (say 30 days) and also use sliding windows? I.e. instead of 36 batches of 30 days each, I use 36 * 6 batches of 30 days each, each time sliding by 5 days? Or is this not really the way LSTMs should be used? (Note that there is quite a bit of seasonality in the data, so I need to catch that kind of long-term trend as well.)
Secondly, does it make sense to use return_sequences=True here? In other words, I keep my Y data as is (50, 1096, 3) so that (as far as I've understood it) there is a prediction at every time step for which a loss can be calculated against the target data? Or would I be better off with return_sequences=False, so that only the final value of each batch is used to evaluate the loss (i.e. if using yearly batches, then in 2016 for product 1, we evaluate against the Dec 2016 value of (1,1,1)).
Thirdly how should I deal with the 50 different products? They are different, but still strongly correlated and we've seen with other approaches (for example an MLP with simple time-windows) that the results are better when all products are considered in the same model. Some ideas that are currently on the table are:
change the target variable to be not just 3 variables, but 3 * 50 = 150; i.e. for each product there are three targets, all of which are trained simultaneously.
split up the results after the LSTM layer into 50 dense networks, which take as input the outputs from the LSTM, plus some features that are specific to each product - i.e. we get a multi-task network with 50 loss functions, which we then optimise together. Would that be crazy?
consider a product as a single observation, and include product-specific features already at the LSTM layer. Use just this one layer followed by an output layer of size 3 (for the three targets). Push through each product in a separate batch.
Fourthly, how do I deal with validation data? Normally I would just keep out a randomly selected sample to validate against, but here we need to keep the time ordering in place. So I guess the best is to just keep a few months aside?
Fifthly, and this is the part that is probably the most unclear to me - how can I use the actual results to perform predictions? Let's say I used return_sequences=False and I trained on all three years in three batches (each time up to Nov) with the goal of training the model to predict the next value (Dec 2014, Dec 2015, Dec 2016). If I want to use these results in 2017, how does this actually work? If I understood it correctly, the only thing I can do in this instance is to then feed the model all the data points for Jan to Nov 2017 and it will give me back a prediction for Dec 2017. Is that correct? However, if I were to use return_sequences=True, then trained on all data up to Dec 2016, would I then be able to get a prediction for Jan 2017 just by giving the model the features observed at Jan 2017? Or do I need to also give it the 12 months before Jan 2017? What about Feb 2017, do I in addition need to give the value for Jan 2017, plus a further 11 months before that? (If it sounds like I'm confused, it's because I am!)
Lastly, depending on what structure I should use, how do I do this in Keras? What I have in mind at the moment is something along the following lines: (though this would be for only one product, so doesn't solve having all products in the same model):
Keras code
from keras.models import Sequential
from keras.layers import LSTM, Dense

trainX = trainingDataReshaped   # Data for Product 1, Jan 2014 to Dec 2016
trainY = trainingTargetReshaped
validX = validDataReshaped      # Data for Product 1, for ??? Maybe for a few months?
validY = validTargetReshaped

numSequences = trainX.shape[0]
numTimeSteps = trainX.shape[1]
numFeatures = trainX.shape[2]
numTargets = trainY.shape[2]

model = Sequential()
model.add(LSTM(100, input_shape=(None, numFeatures), return_sequences=True))
model.add(Dense(numTargets, activation="softmax"))

model.compile(loss=stackEntry.params["loss"],  # loss configured elsewhere
              optimizer="adam",
              metrics=['accuracy'])

history = model.fit(trainX, trainY,
                    batch_size=30,
                    epochs=20,
                    verbose=1,
                    validation_data=(validX, validY))

predictX = predictionDataReshaped  # Data for Product 1, Jan 2017 to Dec 2017
prediction = model.predict(predictX)
So:
Firstly, how would I slice up my data for the batches? Since I have three full years, does it make sense to simply push through three batches, each of size one year? Or does it make more sense to make smaller batches (say 30 days) and also use sliding windows? I.e. instead of 36 batches of 30 days each, I use 36 * 6 batches of 30 days each, each time sliding by 5 days? Or is this not really the way LSTMs should be used? (Note that there is quite a bit of seasonality in the data, so I need to catch that kind of long-term trend as well.)
Honestly, modeling such data is really hard. First of all, I wouldn't advise you to use LSTMs, as they are designed to capture a somewhat different kind of data (e.g. NLP or speech, where it's really important to model long-term dependencies, not seasonality), and they need a lot of data to train. I would rather advise you to use either a GRU or a SimpleRNN, which are much easier to train and should be better for your task.
When it comes to batching, I would definitely advise you to use a fixed-window technique, as it produces many more data points than feeding in a whole year or a whole month at once. Set the number of days as a meta-parameter, optimize it by trying different values during training, and choose the most suitable one.
When it comes to seasonality - of course it is present, but:
You might have too few data points and years collected to provide a good estimate of seasonal trends,
Using any kind of recurrent neural network to capture such seasonalities is a really bad idea.
What I advise you to do instead is (see the sketch after this list):
try adding seasonal features (e.g. a month variable, a day variable, a variable which is true if there is a certain holiday that day, or how many days there are until the next important holiday - this is an area where you could be really creative),
use aggregated data from the previous year as features - you could, for example, feed in last year's results, or aggregations of them such as a running average of last year's results, maximum, minimum, etc.
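A minimal sketch of those two ideas with pandas, assuming a daily-indexed DataFrame; the column names and the choice of aggregations are placeholders.

import pandas as pd

idx = pd.date_range("2014-01-01", "2016-12-31", freq="D")
df = pd.DataFrame({"units_sold": 0.0}, index=idx)  # placeholder sales column

# Seasonal features
df["month"] = df.index.month
df["day_of_week"] = df.index.dayofweek
df["is_december"] = (df.index.month == 12).astype(int)  # stand-in for a holiday flag

# Aggregated last-year features: the value one year earlier and its
# 30-day running average, shifted so no future information leaks in
df["last_year"] = df["units_sold"].shift(365)
df["last_year_avg_30d"] = df["units_sold"].shift(365).rolling(30).mean()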
Secondly, does it make sense to use return_sequences=True here? In other words, I keep my Y data as is (50, 1096, 3) so that (as far as I've understood it) there is a prediction at every time step for which a loss can be calculated against the target data? Or would I be better off with return_sequences=False, so that only the final value of each batch is used to evaluate the loss (i.e. if using yearly batches, then in 2016 for product 1, we evaluate against the Dec 2016 value of (1,1,1)).
Using return_sequences=True might be useful, but only in the following cases:
When a given LSTM (or another recurrent layer) will be followed by yet another recurrent layer.
When you feed a shifted version of the original series as the output, so that you are simultaneously learning the model over different time windows, etc.
The approach described in the second point might be interesting, but keep in mind that it might be a little hard to implement, as you will need to rewrite your model in order to obtain a production result. What also might be harder is that you'll need to test your model against many types of time instabilities - and such an approach might make this totally infeasible.
Thirdly how should I deal with the 50 different products? They are different, but still strongly correlated and we've seen with other approaches (for example an MLP with simple time-windows) that the results are better when all products are considered in the same model. Some ideas that are currently on the table are:
change the target variable to be not just 3 variables, but 3 * 50 = 150; i.e. for each product there are three targets, all of which are trained simultaneously.
split up the results after the LSTM layer into 50 dense networks, which take as input the outputs from the LSTM, plus some features that are specific to each product - i.e. we get a multi-task network with 50 loss functions, which we then optimise together. Would that be crazy?
consider a product as a single observation, and include product-specific features already at the LSTM layer. Use just this one layer followed by an output layer of size 3 (for the three targets). Push through each product in a separate batch.
I would definitely go for the first choice, but before providing a detailed explanation I will discuss the disadvantages of the 2nd and 3rd ones:
In the second approach: it wouldn't be mad, but you will lose a lot of the correlations between product targets,
In the third approach: you'll lose a lot of interesting patterns occurring in the dependencies between the different time series.
Before getting to my choice, let's discuss yet another issue - redundancies in your dataset. I guess that you have two kinds of features:
product-specific ones (let's say there are 'm' of them),
general features - let's say there are 'n' of them.
Now you have a table of shape (timesteps, m + n, products). I would transform it into a table of shape (timesteps, products * m + n), as the general features are the same for all products. This will save you a lot of memory and also make it feasible to feed to a recurrent network (keep in mind that recurrent layers in Keras have only one feature dimension, whereas you had two - a product one and a feature one).
So why is the first approach the best in my opinion? Because it takes advantage of many interesting dependencies in the data. Of course, this might harm the training process, but there is an easy trick to overcome this: dimensionality reduction. You could e.g. train PCA on your 150-dimensional target vector and reduce its size to a much smaller one - that way your dependencies are modeled by PCA and your output has a much more feasible size.
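A minimal sketch of the reshaping and the PCA trick described above; the shapes follow the question (50 products, 1096 days, 3 targets per product), but the split into m product-specific and n general features is an assumption.

import numpy as np
from sklearn.decomposition import PCA

products, timesteps, m, n = 50, 1096, 4, 10           # m product-specific, n general
X_specific = np.random.rand(products, timesteps, m)   # placeholder data
X_general = np.random.rand(timesteps, n)              # shared across all products

# (timesteps, products * m + n): product-specific features side by side,
# general features appended once since they are identical for every product
X_flat = np.concatenate(
    [X_specific.transpose(1, 0, 2).reshape(timesteps, products * m), X_general],
    axis=1,
)

# PCA on the 150-dimensional target (50 products * 3 targets per product)
Y = np.random.rand(timesteps, products * 3)
pca = PCA(n_components=20)        # the reduced target size is a free choice
Y_reduced = pca.fit_transform(Y)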
Fourthly, how do I deal with validation data? Normally I would just keep out a randomly selected sample to validate against, but here we need to keep the time ordering in place. So I guess the best is to just keep a few months aside?
This is a really important question. From my experience, you need to test your solution against many types of instabilities in order to be sure that it works fine. So a few rules which you should keep in mind:
There should be no overlap between your training sequences and test sequences. If there is, you will have valid values from the test set fed to the model while training,
You need to test model time stability against many kinds of time dependencies.
The last point might be a little bit vague, so to give you some examples (a small sketch follows the list):
year stability - validate your model by training it on each possible combination of two years and testing it on the held-out one (e.g. 2015 and 2016 against 2017, 2015 and 2017 against 2016, etc.) - this will show you how year-to-year changes affect your model,
future prediction stability - train your model on a subset of weeks/months/years and test it on the following week/month/year (e.g. train it on January 2015, January 2016 and January 2017 and test it on February 2015, February 2016 and February 2017 data, etc.),
month stability - train the model while keeping a certain month in the test set.
Of course, you could try yet other hold-outs.
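As a small sketch of the year-stability hold-out, assuming a DataFrame with a 'year' column; the dummy data and the model-fitting step are placeholders.

import pandas as pd

years = [2014, 2015, 2016]
df = pd.DataFrame({"year": sorted(years * 3), "value": range(9)})  # dummy data

for test_year in years:
    train_df = df[df["year"] != test_year]
    test_df = df[df["year"] == test_year]
    # fit your model on train_df, evaluate on test_df, then compare across folds
    print(test_year, len(train_df), len(test_df))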
Fifthly, and this is the part that is probably the most unclear to me - how can I use the actual results to perform predictions? Let's say I used return_sequences=False and I trained on all three years in three batches (each time up to Nov) with the goal of training the model to predict the next value (Dec 2014, Dec 2015, Dec 2016). If I want to use these results in 2017, how does this actually work? If I understood it correctly, the only thing I can do in this instance is to then feed the model all the data points for Jan to Nov 2017 and it will give me back a prediction for Dec 2017. Is that correct? However, if I were to use return_sequences=True, then trained on all data up to Dec 2016, would I then be able to get a prediction for Jan 2017 just by giving the model the features observed at Jan 2017? Or do I need to also give it the 12 months before Jan 2017? What about Feb 2017, do I in addition need to give the value for Jan 2017, plus a further 11 months before that? (If it sounds like I'm confused, it's because I am!)
This depends on how you've built your model:
if you used return_sequences=True, you need to rewrite it to use return_sequences=False, or just take the output and consider only the last step of the result,
if you used a fixed window, then you just need to feed the window preceding the prediction to the model,
if you used a varying length, you can feed any number of timesteps preceding your prediction period (but I advise you to feed at least the 7 preceding days).
Lastly, depending on what structure I should use, how do I do this in Keras? What I have in mind at the moment is something along the following lines: (though this would be for only one product, so doesn't solve having all products in the same model)
Here, more info on what kind of model you've chosen is needed.
Question 1
There are several approaches for this problem. The one that you propose seems to be a sliding window.
But in fact you don't need to slice the time dimension; you can input all 3 years at once. You may slice the products dimension in case your batch gets too big for memory or too slow.
You can work with a single array with shape (products, time, features)
Question 2
Yes, it makes sense to use return_sequences=True.
If I understood your question correctly, you have y predictions for every day, right?
Question 3
That is really an open question. All approaches have their advantages.
But if you're considering putting all the product features together, and these features are of a different nature, you should probably expand all possible features as if there were a big one-hot vector covering all features of all products.
If each product has independent features that apply only to itself, the idea of creating individual models for each product doesn't seem insane to me.
You might also think of making the product id a one-hot vector input, and using a single model; a minimal sketch of that idea follows.
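A sketch of that one-hot idea with the Keras functional API, assuming the one-hot product id is repeated at every timestep; all sizes and the choice of sigmoid outputs for the 3 binary targets are assumptions.

from keras.models import Model
from keras.layers import Input, LSTM, Dense, concatenate

n_products, timesteps, n_features, n_targets = 50, 1096, 90, 3

series_in = Input(shape=(timesteps, n_features))
product_in = Input(shape=(timesteps, n_products))  # one-hot product id per step

x = concatenate([series_in, product_in])
x = LSTM(100, return_sequences=True)(x)
out = Dense(n_targets, activation="sigmoid")(x)    # 3 binary targets per day

model = Model([series_in, product_in], out)
model.compile(loss="binary_crossentropy", optimizer="adam")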
Question 4
Depending on which approach you choose, you may:
Split some products as validation data
Leave the final portion of time steps as validation data
Try a cross-validation method leaving different lengths for training and test (the longer the test data, the bigger the error, though; you might want to crop the test data to a fixed length)
Question 5
There may also be many approaches here.
There are approaches where you use sliding windows, training your model on fixed time lengths.
And there are approaches where you train the LSTM layers with the entire length. In this case you'd first predict the entire known part, and then start predicting the unknown part.
My question: is the X data known for the period where you have to predict Y? Or is X also unknown in this period, so that you also have to predict X?
Question 6
I recommend taking a look at this question and its answer: How to deal with multi-step time series forecasting in multivariate LSTM in keras
See also this notebook that manages to demonstrate the idea: https://github.com/danmoller/TestRepo/blob/master/TestBookLSTM.ipynb
In this notebook, though, I used an approach that puts X and Y as inputs. And we predict future X and Y.
You can try creating a model (if that's the case) only to predict X. Then a second model to predict Y from X.
In another case (if you already have all X data, no need to predict X), you can create a model that only predicts Y from X. (You'd still follow part of the method in the notebook, where you first predict the already known Y just to make your model get adjusted to where in the sequence it is, then you predict the unknown Y) -- This can be done in one single full-length X input (which contains the training X at the beginning and the test X at the end).
Bonus answer
Knowing which approach and which kind of model to choose is probably the exact answer to win the competition... so, there isn't a best answer for this question, every competitor is trying to find out this answer.

Is multiple regression the best approach for optimization?

I am being asked to take a look at a scenario where a company has many projects that they wish to complete, but, as with any company, budget comes into play. There is a Y value - a predefined score - with multiple X inputs. There are also 3 main constraints: capital cost, expense cost, and time to completion in months.
The ask is whether an algorithmic approach could be used to optimize which projects should be done for the year given the 3 constraints. The approach should also give different results if the constraint values change. The suggested method is multiple regression, though I have looked into different approaches in detail. I would like to ask the wider community: has anyone dealt with a similar problem, and what approaches have you used?
The first thing we should understand is that a conclusion about something is not based on a single argument.
This comes from communication theory: every human builds a frame of knowledge (an understanding, a conclusion), where the frame is constructed from many pieces of knowledge / information.
The consequence is that we cannot use single linear regression to create an ML / DL system.
At the very least we should use two different variables to make a sub-conclusion. If we insist on using a single variable with linear regression (y = mx + c), it is like forcing the computer to predict something with low accuracy; whatever optimization method you pick, it stays low accuracy. Why? Because linear regression, used on real-life data, amounts to predicting a 'habit' from the data rather than calculating the real condition.
That means we should use multiple linear regression (y = m1*x1 + m2*x2 + ... + c) so that the computer can form a conclusion / build a regression model. But it is not that simple, because the computer tries to draw a conclusion from data with multiple characteristics / variances: you must classify the data and the conclusions.
As an example, try to make the computer understand Pythagoras.
We know the Pythagorean formula is c = ((a^2) + (b^2))^(1/2), and we want the computer to predict the hypotenuse c from two input values (a and b). To do that, we build a model - a multiple linear regression formula - for the Pythagorean relation.
Step 1: of course, we should build a data set of Pythagorean examples with multiple characteristics. For example:
a   b   c
3   4   5
8   6   10
3   14  etc. (try to put in 10 to 20 rows)
Try to fit a multiple regression formula to predict c from a and b.
You will find that some rows have high accuracy (above 98%) while others are not as accurate (below 90%); for example, a=3 with b=14 or b=15 gives a low-accuracy result (below 90%).
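A minimal sketch of that experiment with scikit-learn, fitting a straight multiple regression c ~ a + b on generated Pythagorean data and checking where the linear fit is accurate; the ranges and the error threshold are arbitrary.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
a = rng.uniform(1, 20, 200)
b = rng.uniform(1, 20, 200)
c = np.sqrt(a**2 + b**2)          # the true Pythagorean relation

X = np.column_stack([a, b])
reg = LinearRegression().fit(X, c)

pred = reg.predict(X)
rel_error = np.abs(pred - c) / c
print("R^2:", reg.score(X, c))
print("rows with >10% relative error:", int((rel_error > 0.10).sum()))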
So you must optimize - but how do you do it?
I know many methods to optimize, but I found a manual way: if I exclude the data that gives low-accuracy results, put it in a different group, and then recalculate the regression on the excluded group, I get a significantly better result. Repeat this until you reach the accuracy target you want.
Each group of data, with its new regression, is a new class.
That means I end up with several multiple regressions based on the data I input (one regression per group of data / class), and the accuracy is really high, 99% - 99.99%.
With these several classes, each regression functions as a 'label' of its class; this is what happens in the background of the automated computation. With many modules the user feels they are putting in a 'string' object as the label, but in truth that string object is bound to a regression that is constructed as the label.
With some conditional parameters you can get a good ML model with a minimal amount of training data.
Try it in Excel / LibreOffice before going any further.
Try to follow the tutorial from this video and implement it on simple data that is easy to construct in Excel, like Pythagoras.
So the answer is yes - multiple regression is the best approach for optimization.

Reason why setting tensorflow's variable with small stddev

I have a question about the reason for initializing TensorFlow variables with a small stddev.
I guess many people have tried the MNIST test code from the TensorFlow beginner's guide.
Following it, the first layer's weights are initialized using truncated_normal with stddev 0.1.
I guessed that setting it to a much bigger value would give the same result, i.e. the same accuracy.
But even when increasing the epoch count, it doesn't work.
Does anybody know the reason?
original :
W_layer = tf.Variable(tf.truncated_normal([inp.get_shape()[1].value, size],stddev=0.1), name='w_'+name)
#result : (990, 0.93000001, 0.89719999)
modified :
W_layer = tf.Variable(tf.truncated_normal([inp.get_shape()[1].value, size],stddev=200), name='w_'+name)
#result : (99990, 0.1, 0.098000005)
The reason is that you want to keep all the layers' variances (or standard deviations) approximately the same, and sane. It has to do with the error backpropagation step of the learning process and the activation functions used.
In order to learn the network's weights, the backpropagation step requires knowledge of the network's gradient, a measure of how strongly each weight influences the final output; a layer's weight variance directly influences the propagation of gradients.
Say, for example, that the activation function is sigmoidal (e.g. tf.nn.sigmoid or tf.nn.tanh); this implies that all input values are squashed into a fixed output value range. For the sigmoid, it is the range 0..1, where essentially all values z greater or smaller than +/- 4 are very close to one (for z > 4) or zero (for z < -4) and only values within that range tend to have some meaningful "change".
Now the difference between the values sigmoid(5) and sigmoid(1000) is barely noticeable. Because of that, all very large or very small values will optimize very slowly, since their influence on the result y = sigmoid(W*x+b) is extremely small. Now the pre-activation value z = W*x+b (where x is the input) depends on the actual input x and the current weights W. If either of them is large, e.g. by initializing the weights with a high variance (i.e. standard deviation), the result will necessarily be (relatively) large, leading to said problem. This is also the reason why truncated_normal is used rather than a correct normal distribution: The latter only guarantees that most of the values are very close to the mean, with some less than 5% chance that this is not the case, while truncated_normal simply clips away every value that is too big or too small, guaranteeing that all weights are in the same range, while still being normally distributed.
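A small numeric illustration of this saturation argument, assuming an MNIST-sized input of 784 values; the exact numbers are not important, only the contrast between the two stddev settings.

import numpy as np
from scipy.special import expit   # numerically stable sigmoid

rng = np.random.default_rng(0)
x = rng.normal(size=784)          # e.g. one MNIST-sized input

for stddev in (0.1, 200.0):
    W = rng.normal(0.0, stddev, size=(784, 10))
    z = x @ W                      # pre-activation z = W*x
    y = expit(z)
    grad = y * (1 - y)             # sigmoid'(z) = y * (1 - y)
    print(f"stddev={stddev:6.1f}  mean |z| = {np.abs(z).mean():9.1f}  "
          f"mean sigmoid'(z) = {grad.mean():.2e}")

With stddev 200 the pre-activations are in the thousands and the sigmoid gradient is essentially zero, which is consistent with the modified run above getting stuck near chance accuracy.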
To make matters worse, in a typical neural network - especially in deep learning - each network layer is followed by one or many others. If in each layer the output value range is big, the gradients will get bigger and bigger as well; this is known as the exploding gradients problem (a variation of the vanishing gradients problem, where gradients get smaller and smaller instead).
The reason that this is a problem is because learning starts at the very last layer and each weight is adjusted depending on how much it contributed to the error. If the gradients are indeed getting very big towards the end, the very last layer is the first one to pay a high toll for this: Its weights get adjusted very strongly - likely overcorrecting the actual problem - and then only the "remaining" error gets propagated further back, or up, the network. Here, since the last layer was already "fixed a lot" regarding the measured error, only smaller adjustments will be made. This may lead to the problem that the first layers are corrected only by a tiny bit or not at all, effectively preventing all learning there. The same basically happens if the learning rate is too big.
Finding the best weight initialization is a topic by itself, and there are somewhat more sophisticated methods such as Xavier initialization or layer-sequential unit variance (LSUV); however, small normally distributed values are usually simply a good guess.
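As a hedged sketch of the Xavier/Glorot idea mentioned above, using the TF2 API rather than the TF1-style code in the question; the layer sizes are placeholders.

import tensorflow as tf

fan_in, fan_out = 784, 256   # example layer sizes

# Fixed small stddev, as in the original MNIST example
w_fixed = tf.Variable(tf.random.truncated_normal([fan_in, fan_out], stddev=0.1))

# Glorot/Xavier uniform scales the spread by the layer sizes:
# limit = sqrt(6 / (fan_in + fan_out))
w_glorot = tf.Variable(tf.keras.initializers.GlorotUniform()((fan_in, fan_out)))

print(float(tf.math.reduce_std(w_fixed)), float(tf.math.reduce_std(w_glorot)))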