Suppose I have a saved model that is nearly at the minimum, but with some room for improvement. For example, the loss (as reported by tf.keras.Models.model.evaluate() ) might be 11.390, and I know that the model can go down to 11.300.
The problem is that attempts to refine this model (using tf.keras.Models.model.fit()) consistently result in the weights receiving an initial 'jolt' during the first epoch, which sends the loss way upwards. After that, it starts to decrease, but it does not always converge on the correct minimum (and may not even get back to where it started.)
It looks like this:
tf.train.RMSPropOptimizer(0.0002):
0 11.982
1 11.864
2 11.836
3 11.822
4 11.809
5 11.791
(...)
15 11.732
tf.train.AdamOptimizer(0.001):
0 14.667
1 11.483
2 11.400
3 11.380
4 11.371
5 11.365
tf.keras.optimizers.SGD(0.00001):
0 12.288
1 11.760
2 11.699
3 11.650
4 11.666
5 11.601
Dataset with 30M observations, batch size 500K in all cases.
I can mitigate this by turning the learning rate way down, but then it takes forever to converge.
Is there any way to prevent training from going "wild" at the beginning, without impacting the long-term convergence rate?
As you tried decreasing the learning rate is the way to go.
E.g. learning rate = 0.00001
tf.train.AdamOptimizer(0.00001)
Especially with Adam that should be promising, since the learning rate is at the same time an upper bound for the step size.
On top of that you could try learning rate scheduling, where you set the learning rate according to your predefined schedule.
Also I feel that from what you show when you decreased the learning rate, this does not seem to be too bad, in terms of convergence rate.
Maybe another hyperparameter you could tune in your case would be to reduce the batch size, to decrease computation cost per update.
Note:
I find the term "not the right minimum" rather misleading. To further understand nonconvex optimization for artificial neural networks, I would like to Point to the deep learning book of Ian Goodfellow et al
Related
One part of my MCMC algorithm is using MH algorithm to update (n\times 1) vector of parameters $\boldsymbol{\delta}$. I think it is less computational intensive to propose a new sample from a $n\times 1$ multivariate proposal distribution (n is large). However, it seems impossible to tune the acceptance rate towards some ideal interval, such as 0.2 to 0.5.
I have tried random walk update based on multivariate normal and multivariate uniform distribution. No matter how I adjust the step size, the pattern of acceptance rate looks similar to the following figure.
enter image description here
Is there anyone have such experience? Any good suggest is welcome!
I'm using Optuna 2.5 to optimize a couple of hyperparameters on a tf.keras CNN model. I want to use pruning so that the optimization skips the less promising corners of the hyperparameters space. I'm using something like this:
study0 = optuna.create_study(study_name=study_name,
storage=storage_name,
direction='minimize',
sampler=TPESampler(n_startup_trials=25, multivariate=True, seed=123),
pruner=optuna.pruners.SuccessiveHalvingPruner(min_resource='auto',
reduction_factor=4, min_early_stopping_rate=0),
load_if_exists=True)
Sometimes the model stops after 2 epochs, some other times it stops after 12 epochs, 48 and so forth. What I want is to ensure that the model always trains at least 30 epochs before being pruned. I guess that the parameter min_early_stopping_rate might have some control on this but I've tried to change it from 0 to 30 and then the models never get pruned. Can someone explain me a bit better than the Optuna documentation, what these parameters in the SuccessiveHalvingPruner() really do (specially min_early_stopping_rate)?
Thanks
min_resource's explanation on the documentation says
A trial is never pruned until it executes min_resource * reduction_factor ** min_early_stopping_rate steps.
So, I suppose that we need to replace the value of min_resource with a specific number depending on reduction_factor and min_early_stopping_rate.
I'm applying LSTM on time series forecasting with 20 lags. Suppose that we have two cases. The first one just using five lags and the second one (like my case) is using 20 lags. Is it correct that for the second case we need more units compared to the former one? If yes, how can we support this idea? I have 2000 samples for training the model, so this is the main limitation for increasing number of units here.
It is very difficult to give an exact answer as the relationship between timesteps and number of hidden units is not an exact science. For example, following factors can affect the number of units required.
Short term memory problem vs long-term memory problem
If your problem can be solved with relatively less memory (i.e. requires to remember only a few time steps) you wouldn't get much benefit from adding more neurons while increasing the number of steps.
The amount of data
If you don't have enough data for the model to learn from (which I feel like you will run into with 2000 data points - but I could be wrong), then increasing the number of timesteps won't help you much.
The type of model you use
Depending on the type of model you use (e.g. LSTM / GRU ) you might get different results (this is not always true but can happen for certain problems)
I'm sure there are other factors out there, but these are few that came to my mind.
Proving more units give better results while having more time steps (if true)
That should be relatively easy as you can try few different options,
5 lags with 10 / 20 / 50 hidden units
20 lags with 10 / 20 / 50 hidden units
And if you get better performance (e.g. lower MSE) with 20 lags problem than 5 lags problem (when you use 50 units), then you have gotten your point across. And you can reinforce your claims by showing results with different types of models (e.g. LSTMs vs GRUs).
I started developing some LSTM-models and now have some questions about normalization.
Lets pretend I have some time series data that is roughly ranging between +500 and -500. Would it be more realistic to scale the Data from -1 to 1, or is 0 to 1 a better way, I tested it and 0 to 1 seemed to be faster. Is there a wrong way to do it? Or would it just be slower to learn?
Second question: When do I normalize the data? I split the data into training and testdata, do I have to scale / normalize this data seperately? maybe the trainingdata is only ranging between +300 to -200 and the testdata ranges from +600 to -100. Thats not very good I guess.
But on the other hand... If I scale / normalize the entire dataframe and split it after that, the data is fine for training and test, but how do I handle real new incomming data? The model is trained to scaled data, so I have to scale the new data as well, right? But what if the new Data is 1000? the normalization would turn this into something more then 1, because its a bigger number then everything else before.
To make a long story short, when do I normalize data and what happens to completely new data?
I hope I could make it clear what my problem is :D
Thank you very much!
Would like to know how to handle reality as well tbh...
On a serious note though:
1. How to normalize data
Usually, neural networks benefit from data coming from Gaussian Standard distribution (mean 0 and variance 1).
Techniques like Batch Normalization (simplifying), help neural net to have this trait throughout the whole network, so it's usually beneficial.
There are other approaches that you mentioned, to tell reliably what helps for which problem and specified architecture you just have to check and measure.
2. What about test data?
Mean to subtract and variance to divide each instance by (or any other statistic you gather by any normalization scheme mentioned previously) should be gathered from your training dataset. If you take them from test, you perform data leakage (info about test distribution is incorporated into training) and you may get false impression your algorithm performs better than in reality.
So just compute statistics over training dataset and use them on incoming/validation/test data as well.
Using scikit-learn on balanced training data of around 50 millions samples (50% one class, 50% the other, 8 continuous features in interval (0,1)), all classifiers that I have been able to try so far (Linear/LogisticRegression, LinearSVC, RandomForestClassifier, ...) show a strange behavior:
When testing on the training data, the percentage of false-positives is much lower than the percentage of false-negatives (fnr). When correcting the intercept manually in order to increase false-positive rate (fpr), the accuracy actually improves considerably.
Why do the classification algorithms not find a close-to-optimal intercept (that I guess would more or less be at fpr=fnr)?
I guess the idea is that there's no single definition of "optimal"; for some applications, you'll tolerate false positives much more than false negatives (i.e. detecting fraud or disease where you don't want to miss a positive) whereas for other applications false positives are much worse (predicting equipment failures, crimes, or something else where the cost of taking action is expensive). By default, predict just chooses 0.5 as the threshold, this is usually not what you want, you need think about your application and then look at the ROC curve and the gains/lift charts to decide where you want to set the prediction threshold.