TensorFlow / Keras: Normalize train / test / realtime data, or how to handle reality?

I started developing some LSTM models and now have some questions about normalization.
Let's pretend I have time series data that ranges roughly between -500 and +500. Would it be better to scale the data to the range -1 to 1, or is 0 to 1 the better choice? I tested it, and 0 to 1 seemed to train faster. Is there a wrong way to do it, or would the other choice just be slower to learn?
Second question: when do I normalize the data? I split the data into training and test data; do I have to scale / normalize the two sets separately? Maybe the training data only ranges from -200 to +300 while the test data ranges from -100 to +600. That's not very good, I guess.
But on the other hand... if I scale / normalize the entire dataframe and split it afterwards, the data is fine for training and testing, but how do I handle really new incoming data? The model is trained on scaled data, so I have to scale the new data as well, right? But what if a new value is 1000? The normalization would turn this into something greater than 1, because it's a bigger number than anything seen before.
To make a long story short: when do I normalize the data, and what happens to completely new data?
I hope I could make clear what my problem is :D
Thank you very much!

Would like to know how to handle reality as well tbh...
On a serious note though:
1. How to normalize data
Usually, neural networks benefit from standardized inputs, i.e. inputs with mean 0 and variance 1 (like a standard normal distribution).
Techniques like Batch Normalization (simplifying a bit) help the network keep this property throughout its layers, so they are usually beneficial.
The other approaches you mentioned (scaling to [0, 1] or to [-1, 1]) can work as well; to tell reliably which one helps for a given problem and architecture, you just have to check and measure.
2. What about test data?
The mean to subtract and the standard deviation to divide each instance by (or any other statistic required by the normalization scheme you choose) should be computed from your training dataset only. If you also take them from the test set, you cause data leakage (information about the test distribution is incorporated into training) and you may get the false impression that your algorithm performs better than it does in reality.
So just compute the statistics over the training dataset and apply them to validation, test, and incoming data as well.
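To make the "statistics from training data only" advice concrete, here is a minimal sketch in plain Python. The helper names (`fit_standardizer`, `standardize`) and the sample values are made up for illustration; in practice a library scaler fitted on the training split would play the same role.

```python
from statistics import mean, pstdev

def fit_standardizer(train_values):
    """Compute mean and standard deviation from the training data only."""
    mu = mean(train_values)
    sigma = pstdev(train_values)
    if sigma == 0:
        sigma = 1.0  # avoid division by zero for a constant series
    return mu, sigma

def standardize(values, mu, sigma):
    """Apply previously fitted statistics; nothing is re-estimated here."""
    return [(v - mu) / sigma for v in values]

# Hypothetical values in the spirit of the question.
train = [-200.0, -50.0, 0.0, 100.0, 300.0]
test = [-100.0, 250.0, 600.0]
new_point = [1000.0]

mu, sigma = fit_standardizer(train)          # fitted once, on train only
train_scaled = standardize(train, mu, sigma)
test_scaled = standardize(test, mu, sigma)   # reuses the training statistics
new_scaled = standardize(new_point, mu, sigma)
```

A new value such as 1000 simply maps to a number larger than anything seen during training. The scaler itself does not break; whether the model extrapolates sensibly to that value is a separate question.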

Related

Can you forecast with multiple trajectories?

I am new to time-series machine learning and have a perhaps trivial question.
I would like to forecast the temperature for a particular region, A. I could train a model using the hourly data points from the first 6 days of the week and then evaluate its performance on the final day. The training set would therefore have 144 data points (6*24) and the test set 24 data points (24*1). Likewise, I could train a new model for each of the regions B-Z and evaluate them individually. My question is: can you train a SINGLE model that makes predictions across multiple regions? The region label should of course be an input, since it will affect the temperature evolution.
Can you train a single model that forecasts multiple trajectories rather than just one? Also, what might be a good metric for evaluating its performance? I was going to use mean absolute error, but maybe a correlation is better?
Yes, you can train on multiple series of data from different regions. What you are asking for is an ultimate goal of deep learning: creating one model that does everything, predicts every region correctly, and so on. However, if you want your model to generalize that much, you normally need a really huge model (I'm talking about 100M+ parameters), and to train it you need tons of data, maybe a couple of TB or even PB, plus a very powerful machine, something like a Google data center. As for your next question, the metric: a simple RMS error or mean absolute error will work fine.
What you need to focus on is the training data. There is no super model that takes garbage and turns it into gold; it's garbage in, garbage out. You need a good dataset that represents the whole environment of the problem you are trying to solve. For example, say you want a model that predicts whether a piece of glass breaks when you hit it with a hammer. You have maybe 10 samples for each type of glass, and all of them break. You train the model and it just predicts "break" every single time. Then you try it on bulletproof glass, which does not break, so your model is wrong. You need data covering all the different types of glass before your model has a chance of predicting correctly. Compare this to your 144 data points: I'm pretty sure that won't be enough in your case.
So I would say yes, you can build that one-model-fits-all, but there is a huge price to pay.
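For the metric part, a small sketch of mean absolute error and RMS error, computed per region, may help. The region names, temperatures, and forecasts below are invented purely for illustration.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error; penalizes large misses more than MAE."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )

# Hypothetical held-out temperatures and forecasts: region -> (truth, forecast)
results = {
    "A": ([20.0, 21.0, 23.0], [19.0, 21.5, 22.0]),
    "B": ([5.0, 4.0, 6.0], [5.5, 4.0, 8.0]),
}

# Report both metrics per region instead of one pooled number.
per_region = {
    region: (mae(truth, forecast), rmse(truth, forecast))
    for region, (truth, forecast) in results.items()
}
```

In this made-up example both regions have the same MAE, but region B has the higher RMSE because one of its forecasts is off by 2 degrees. Reporting the metric per region also shows whether a single shared model is serving some regions worse than others.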

Isn't it dangerous to apply Min Max Scaling to the test set?

Here's the situation I am worrying about.
Let's say I have a model trained on min-max scaled data. I want to test my model, so I scaled the test dataset with the same scaler that was used in the training stage. However, my new test data turned out to contain a new minimum, so the scaler returned a negative value.
As far as I know, the minimum and maximum are not very stable statistics, especially in a volatile dataset such as cryptocurrency data. In this case, should I update my scaler? Or should I retrain my model?
I happen to disagree with @Sharan_Sundar. The point of scaling is to bring all of your features onto a single scale, not to rigorously ensure that they lie in the interval [0,1]. This can be very important, especially when considering regularization techniques that penalize large coefficients (whether they are linear regression coefficients or neural network weights). The combination of feature scaling and regularization helps to ensure that your model generalizes to unobserved data.
Scaling based on your "test" data is not a great idea because in practice, as you pointed out, you can easily observe new data points that don't lie within the bounds of your original observations. Your model needs to be robust to this.
In general, I would recommend considering different scaling routines. scikit-learn's MinMaxScaler is one, as is StandardScaler (subtract the mean and divide by the standard deviation). In a case where your target variable, the cryptocurrency price, can vary over multiple orders of magnitude, it might be worth applying a logarithm to some of your variables. This is where data science becomes an art: there's not necessarily a 'right' answer here.
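A small sketch of what happens when a min-max scaler fitted on training data meets values outside the training range, alongside the logarithm alternative. The prices and helper names are hypothetical, and plain Python is used rather than scikit-learn.

```python
import math

def fit_min_max(train_values):
    """Remember the training minimum and maximum."""
    return min(train_values), max(train_values)

def min_max_scale(values, lo, hi):
    """Scale with the *training* bounds; outputs may leave [0, 1]."""
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical prices; the new data contains a lower low and a higher high.
train_prices = [100.0, 150.0, 400.0, 1000.0]
new_prices = [80.0, 2500.0]

lo, hi = fit_min_max(train_prices)
scaled_train = min_max_scale(train_prices, lo, hi)  # spans exactly [0, 1]
scaled_new = min_max_scale(new_prices, lo, hi)      # one value < 0, one > 1

# Alternative for positive values spanning orders of magnitude:
# a log transform compresses the range before any further scaling.
log_train = [math.log(p) for p in train_prices]
log_new = [math.log(p) for p in new_prices]
```

Values slightly below 0 or above 1 are not an error in themselves; they only show that the new data lies outside the training range. Whether that calls for retraining depends on how far the distribution has drifted.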
(EDIT) - Also see: Do you apply min max scaling separately on training and test data?
Ideally you should scale first and only then split into train and test sets. But it is not advisable to use a min-max scaler on data whose minimum and maximum can vary significantly in a real-time scenario.

Train / Test split % for Object Detection - what's the current recommendation?

Using the Tensorflow Object Detection API, what's the current recommendation / best practice around the train / test split percentage for labeled examples? I've seen a lot of conflicting info, anywhere from 70/30 to 95/5. Any recent real world experience is appreciated.
Traditional advice is ~70-75% training and the rest test data. More recent articles indeed suggest a different split. I read 95/2.5/2.5 (train / test / dev for hyperparameter tuning) a lot these days.
I guess your optimal split depends on the amount of available data and on the bias/variance characteristics. Poor performance on the training data may be caused by underfitting and may call for more training data. If your model is fitting well or even overfitting, you should be able to move some of the training data over to the test set.
If you're stuck in the middle, you may also consider cross validation as a computationally expensive but data friendly option.
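As a rough illustration of cross validation, here is a sketch that produces k-fold index splits in plain Python. The function name `k_fold_indices` is made up for this example; for data that is not time-ordered, one would normally shuffle the indices first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    # Distribute the remainder so fold sizes differ by at most one.
    fold_sizes = [
        n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)
    ]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

# Every sample is used for validation exactly once across the folds.
folds = list(k_fold_indices(10, 3))
```

The model is trained k times, once per fold, which is where the extra computational cost comes from.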
It depends on the size of the dataset, as Andrew Ng suggests (train / dev or validation / test):
If the size of the dataset is 100 to 10K: ~60/20/20
If the size of the dataset is 1M or more: 98/1/1 or 99.5/0.25/0.25
Note that these are not fixed rules, just suggestions.
The goal of the test set mentioned here is to give you an unbiased measurement of your model's performance. In some work it is OK to have only two sets (people then call them train/test, although the test set is actually working as a dev set); the ratio can then be 70/30.
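The ratios above can be applied with a small helper like the following sketch. The name `split_dataset` and the fixed seed are assumptions for illustration, and a random shuffle like this suits independent examples (such as labeled images), not time-ordered data.

```python
import random

def split_dataset(items, train_frac, dev_frac, seed=0):
    """Shuffle once, then cut into train / dev / test by the given fractions."""
    items = list(items)
    random.Random(seed).shuffle(items)  # seeded for a reproducible split
    n_train = round(len(items) * train_frac)
    n_dev = round(len(items) * dev_frac)
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]      # whatever remains
    return train, dev, test

# Small dataset: roughly 60/20/20, per the suggestion above.
examples = list(range(1000))
train, dev, test = split_dataset(examples, 0.6, 0.2)
```

For a very large dataset the same function could be called with, for example, `train_frac=0.98` and `dev_frac=0.01`.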

Setting up training in Tensorflow with overlapping data?

I am trying to train a neural network to forecast using time series data, specifically to predict the temperature 10 minutes into the future. Let's say I have a temperature data point every 5 minutes, I want to give the network 15 minutes' worth of data to use in the prediction, and the data I have is this:
[1,2,3,4,5,6,7,8,9,10,11,12]
So if I were to train on this data, one potential training sample is [1,2,3] as x and [5] as y (as it is 10 minutes into the future, i.e. two 5-minute steps).
I want a way to train on all possible inputs, which are as follows:
[1,2,3][5]
[2,3,4][6]
[3,4,5][7]
[4,5,6][8]
[5,6,7][9]
[6,7,8][10]
[7,8,9][11]
[8,9,10][12]
But I don't want to train by first saving each possible example to disk and then training from that. This takes up more space than necessary, as the data is duplicated. I would like to do this in some kind of preprocessing step instead.
All the instructions and examples I have found for the TensorFlow input pipeline, such as https://www.tensorflow.org/guide/datasets, use "non-overlapping" data; I can't find anything that deals with my scenario.
The problem is that I really have no idea how to set up this overlapping-data scenario in TensorFlow without saving massive amounts of duplicated data to disk. If anyone has links or guides on the best way to do this, I'd very much appreciate it. Thank you.
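You are probably looking for this transformation (applied to a dataset via dataset.apply(...)): https://www.tensorflow.org/api_docs/python/tf/contrib/data/sliding_window_batch
tf.contrib.data.sliding_window_batch(window_size=3, stride=1)
Note that tf.contrib only exists in TensorFlow 1.x; it was removed in TensorFlow 2.x, where tf.data.Dataset.window offers similar sliding-window behaviour.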
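Independent of the TensorFlow version, the overlapping samples from the question can be produced lazily with a plain Python generator, so nothing is duplicated on disk. This is only a sketch; the function name `sliding_windows` and its parameters are made up for illustration. Such a generator could then be fed into an input pipeline, for example via `tf.data.Dataset.from_generator`.

```python
def sliding_windows(series, input_len=3, horizon=2):
    """Yield (x, y) pairs lazily.

    x is `input_len` consecutive values; y is the value `horizon` steps
    after the last element of x.
    """
    last_start = len(series) - input_len - horizon
    for start in range(last_start + 1):
        x = series[start:start + input_len]
        y = series[start + input_len + horizon - 1]
        yield x, [y]

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]

# Materialized here only to inspect the result; during training one would
# iterate over the generator instead of building the full list.
pairs = list(sliding_windows(data))
```

With the 12 values from the question this yields the same eight samples listed there, from `[1,2,3][5]` to `[8,9,10][12]`.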

Splitting Training Data to train optimal number of n models

Let's assume we have a huge database providing us with training data D and a dedicated, smaller test dataset T for a machine learning problem.
The data covers many aspects of a real-world problem and is thus very diverse in its structure.
When we train some machine learning algorithm (neural network, SVM, random forest, ...) on D and then test the resulting model against T, we obtain a certain performance measure P (confusion matrix, MSE, ...).
The question: if I could achieve better performance by dividing the problem into smaller sub-problems, e.g. by clustering D into several distinct training sets D1, D2, D3, ..., how could I find the optimal clusters (number of clusters, centroids, ...)?
In a brute-force fashion, I am thinking about using k-means clustering with a random number of clusters C, which leads to the training sets D1, D2, ..., Dc.
I would then train C different models and finally test them against the test sets T1, T2, ..., Tc, where the same clustering has been used to split T into the C test sets.
The combination that gives me the best overall performance mean(P1, P2, ..., Pc) would be the one I would choose.
I was just wondering whether you know a more sophisticated way than brute-forcing this?
Many thanks in advance.
Clustering is hard.
Much harder than classification, because you don't have labels to tell you whether you are doing okay or not well at all. It can't do magic; it requires you to choose parameters carefully and to evaluate the result.
You cannot just dump your data into k-means and expect anything useful to come out. You'd first need to clean and preprocess your data really carefully, and then you might simply find out that it actually is just one single large clump...
Furthermore, even if clustering worked well and you trained classifiers on each cluster independently, every classifier would miss crucial data. The result would likely perform really badly!
If you want to only train on parts of the data, use a random forest.
But it sounds like you are more interested in a hierarchical classification approach. That may work, if you have good hierarchy information. You'd first train a classifier on the category, then another within the category only to get the final class.
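The two-stage idea can be sketched as follows. Everything here is hypothetical: the toy `NearestCentroid` class stands in for whatever classifier is actually used, and the categories, classes, and feature values are invented.

```python
from collections import defaultdict

def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

class NearestCentroid:
    """Toy classifier: predicts the label whose centroid is closest."""

    def fit(self, X, labels):
        groups = defaultdict(list)
        for x, label in zip(X, labels):
            groups[label].append(x)
        self.centroids = {lab: centroid(pts) for lab, pts in groups.items()}
        return self

    def predict(self, x):
        return min(self.centroids, key=lambda lab: sq_dist(x, self.centroids[lab]))

# Hypothetical samples: (features, category, final class)
samples = [
    ((0.0, 0.0), "animal", "cat"),
    ((0.0, 1.0), "animal", "dog"),
    ((10.0, 0.0), "vehicle", "car"),
    ((10.0, 1.0), "vehicle", "bus"),
]

# Stage 1: one classifier over all data predicts the category.
top = NearestCentroid().fit([s[0] for s in samples], [s[1] for s in samples])

# Stage 2: one classifier per category, trained only within that category.
per_category = {}
for category in {s[1] for s in samples}:
    subset = [s for s in samples if s[1] == category]
    per_category[category] = NearestCentroid().fit(
        [s[0] for s in subset], [s[2] for s in subset]
    )

def predict_hierarchical(x):
    category = top.predict(x)
    return category, per_category[category].predict(x)

prediction = predict_hierarchical((9.0, 0.9))
```

Unlike the cluster-then-train approach from the question, the split here follows known hierarchy labels rather than clusters discovered without supervision, which is why good hierarchy information is a precondition.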