How to set a minimum number of epochs in Optuna SuccessiveHalvingPruner()? - tensorflow

I'm using Optuna 2.5 to optimize a couple of hyperparameters on a tf.keras CNN model. I want to use pruning so that the optimization skips the less promising corners of the hyperparameter space. I'm using something like this:
study0 = optuna.create_study(
    study_name=study_name,
    storage=storage_name,
    direction='minimize',
    sampler=TPESampler(n_startup_trials=25, multivariate=True, seed=123),
    pruner=optuna.pruners.SuccessiveHalvingPruner(
        min_resource='auto', reduction_factor=4, min_early_stopping_rate=0),
    load_if_exists=True)
Sometimes the model stops after 2 epochs, other times after 12 epochs, 48, and so forth. What I want is to ensure that the model always trains for at least 30 epochs before being pruned. I guess that the parameter min_early_stopping_rate might have some control over this, but when I changed it from 0 to 30 the models never got pruned. Can someone explain, a bit better than the Optuna documentation does, what the parameters of SuccessiveHalvingPruner() really do (especially min_early_stopping_rate)?
Thanks

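The explanation of min_resource in the documentation says:
A trial is never pruned until it executes min_resource * reduction_factor ** min_early_stopping_rate steps.
So min_resource should be set to a specific number, chosen together with reduction_factor and min_early_stopping_rate, rather than left at 'auto'. With reduction_factor=4 and min_early_stopping_rate=0 the formula reduces to min_resource, so min_resource=30 keeps every trial running for at least 30 steps (epochs, if you report once per epoch). The formula also explains why nothing was pruned with min_early_stopping_rate=30: the threshold becomes min_resource * 4 ** 30 steps, which no trial ever reaches.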
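As a quick check of that formula, here is a minimal sketch in plain Python. The helper names first_prune_step and rung_steps are made up for illustration and are not part of Optuna. The spacing of the later checkpoints, min_resource * reduction_factor ** (min_early_stopping_rate + rung), is my reading of how the pruner places its rungs, so treat it as an assumption.

```python
def first_prune_step(min_resource, reduction_factor, min_early_stopping_rate):
    """Earliest step at which a trial can be pruned, per the documented formula."""
    return min_resource * reduction_factor ** min_early_stopping_rate


def rung_steps(min_resource, reduction_factor, min_early_stopping_rate, n_rungs=3):
    """Assumed checkpoint steps for the first n_rungs rungs (hypothetical helper)."""
    return [
        min_resource * reduction_factor ** (min_early_stopping_rate + rung)
        for rung in range(n_rungs)
    ]


# Settings from the question, but with min_resource fixed at 30 instead of 'auto'.
print(first_prune_step(30, 4, 0))   # 30
print(rung_steps(30, 4, 0))         # [30, 120, 480]

# min_early_stopping_rate=30 pushes the first checkpoint far out of reach
# (min_resource=1 is used here only to show the scale).
print(first_prune_step(1, 4, 30))   # 1152921504606846976
```

Under these assumptions, passing min_resource=30 together with min_early_stopping_rate=0 to SuccessiveHalvingPruner should give the 30-epoch floor asked for, provided the objective reports one step per epoch.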

Related

Is there a way to use fewer decimals in xgb.cv loss calculation to allow 'early_stopping_rounds' to trigger sooner?

I am using xgb.cv to determine a correct number of estimators for my problem and I am using 'multi:softprob' and 'mlogloss'. Originally in my code I set:
num_boost_round = 999
early_stopping_rounds = 10
The problem is that the loss is returned with many decimals, and even though the last decimals keep changing, this has no practical effect on model quality for me. This is an example of the losses from around boosting round 170 of my run:
0.012855
0.012855
0.012855
0.012854666666666667
0.012854666666666667
0.012853999999999999
0.012853999999999999
0.012853666666666666
0.012853666666666666
0.012853666666666666
0.012852999999999998
You can see that there is little or no point in continuing. My cv already got down to these figures after 15-20 boosting rounds.
Is there a way to use fewer decimals for the loss comparisons (or reporting), and that way make 'early_stopping_rounds' trigger sooner and stop the cv?
Any ideas would be appreciated.
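One way to frame this is early stopping with a tolerance: a round only counts as an improvement if it beats the best loss so far by more than some threshold. The sketch below is plain Python, not XGBoost code; rounds_until_stop, patience and min_delta are names chosen for illustration. Recent XGBoost versions appear to offer a similar min_delta option on their EarlyStopping callback, but check the documentation for your version before relying on it.

```python
def rounds_until_stop(losses, patience, min_delta=0.0):
    """Return the round at which training would stop, or None if it never stops.

    A round improves only if it lowers the best loss by more than min_delta.
    """
    best = float("inf")
    best_round = -1
    for i, loss in enumerate(losses):
        if best - loss > min_delta:
            best, best_round = loss, i
        elif i - best_round >= patience:
            return i
    return None


# The losses quoted in the question (around boosting round 170).
losses = [
    0.012855, 0.012855, 0.012855,
    0.012854666666666667, 0.012854666666666667,
    0.012853999999999999, 0.012853999999999999,
    0.012853666666666666, 0.012853666666666666, 0.012853666666666666,
    0.012852999999999998,
]

print(rounds_until_stop(losses, patience=10))                  # None: tiny gains keep resetting the counter
print(rounds_until_stop(losses, patience=10, min_delta=1e-5))  # 10: gains below 1e-5 are ignored
```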

XGBoost: help interpreting the booster behaviour. Why does the 0th iteration always come out as the best?

I am training an XGBoost model and having trouble interpreting the model behaviour.
early_stopping_rounds = 10
num_boost_round = 100
Dataset is unbalanced with 458644 1s and 7975373 0s
evaluation metric is AUCPR
param = {'max_depth':6, 'eta':0.03, 'silent':1, 'colsample_bytree': 0.3,'objective':'binary:logistic', 'nthread':6, 'subsample':1, 'eval_metric':['aucpr']}
From my understanding of "early_stopping_rounds" the training is supposed to stop after no improvement is observed in the test/evaluation dataset's eval metric(aucpr) for 10 consecutive rounds. However, in my case, even when there is a clear improvement in the AUCPR of the evaluation dataset, the training still stops after the 10th boosting stage. Please see the training log below. Additionally, the best iteration comes out to be the 0th one when clearly the 10th iteration has an AUCPR much higher than the 0th iteration.
Is this right? If not what could be going wrong? If yes then please correct my understanding about early stopping rounds and best iteration.
Very interesting!!
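So it turns out that early stopping works both with metrics to minimize (RMSE, log loss, etc.) and with metrics to maximize (MAP, NDCG, AUC) - https://xgboost.readthedocs.io/en/latest/python/python_intro.html
When you use aucpr, it appears to be trying to minimize it, perhaps because that is the default behavior. That would explain your log: if AUCPR rises every round, the lowest value is the one at iteration 0, so no later round counts as an improvement and training stops after 10 rounds.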
Try to set maximize=True when calling xgboost.train() - https://github.com/dmlc/xgboost/issues/3712
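To see why the direction matters, here is a small plain-Python simulation of the early stopping rule. It is a sketch of the idea, not XGBoost's implementation. The function name run_early_stopping is made up, and the AUCPR values are invented for illustration; they are not taken from the asker's log.

```python
def run_early_stopping(scores, patience, maximize):
    """Return (best_iteration, last_iteration_run) for a list of per-round scores."""
    best_i = 0
    last_i = 0
    for i in range(1, len(scores)):
        last_i = i
        if maximize:
            improved = scores[i] > scores[best_i]
        else:
            improved = scores[i] < scores[best_i]
        if improved:
            best_i = i
        elif i - best_i >= patience:
            break  # no improvement for `patience` rounds
    return best_i, last_i


# Hypothetical, steadily rising AUCPR values (illustration only).
aucpr = [0.20, 0.22, 0.24, 0.26, 0.28, 0.30, 0.32, 0.34, 0.36, 0.38, 0.40, 0.42]

print(run_early_stopping(aucpr, patience=10, maximize=False))  # (0, 10)
print(run_early_stopping(aucpr, patience=10, maximize=True))   # (11, 11)
```

With the metric treated as something to minimize, the simulation reproduces the symptom in the question: iteration 0 is reported as best and training stops at round 10.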

Does deeper LSTM need more units?

I'm applying an LSTM to time series forecasting with 20 lags. Suppose that we have two cases: the first uses only five lags and the second (like my case) uses 20 lags. Is it correct that the second case needs more units than the first? If yes, how can we support this idea? I have 2000 samples for training the model, so this is the main limitation on increasing the number of units here.
It is very difficult to give an exact answer, as the relationship between the number of timesteps and the number of hidden units is not an exact science. For example, the following factors can affect the number of units required.
Short term memory problem vs long-term memory problem
If your problem can be solved with relatively little memory (i.e. it only requires remembering a few time steps), you wouldn't get much benefit from adding more neurons as you increase the number of steps.
The amount of data
If you don't have enough data for the model to learn from (which I suspect you will run into with 2000 data points, but I could be wrong), then increasing the number of timesteps won't help you much.
The type of model you use
Depending on the type of model you use (e.g. LSTM / GRU), you might get different results (this is not always true, but it can happen for certain problems).
I'm sure there are other factors out there, but these are a few that came to mind.
Proving that more units give better results with more time steps (if true)
That should be relatively easy, as you can try a few different options:
5 lags with 10 / 20 / 50 hidden units
20 lags with 10 / 20 / 50 hidden units
If you get better performance (e.g. lower MSE) on the 20-lag problem than on the 5-lag problem when you use 50 units, then you have made your point. You can reinforce your claim by showing results with different types of models (e.g. LSTMs vs GRUs).
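The comparison above could be organised as a small grid, sketched here in plain Python. The names make_windows and run_grid are made up for illustration, and train_and_score is a hypothetical callback that you would implement yourself (for example by fitting an LSTM with the given number of units and returning its validation MSE).

```python
from itertools import product


def make_windows(series, n_lags):
    """Split a 1-D series into samples of n_lags past values and the next value as target."""
    n_samples = len(series) - n_lags
    inputs = [series[i:i + n_lags] for i in range(n_samples)]
    targets = [series[i + n_lags] for i in range(n_samples)]
    return inputs, targets


def run_grid(series, lag_options, unit_options, train_and_score):
    """Score every (n_lags, units) combination with a user-supplied function."""
    results = {}
    for n_lags, units in product(lag_options, unit_options):
        inputs, targets = make_windows(series, n_lags)
        results[(n_lags, units)] = train_and_score(inputs, targets, units)
    return results


# Usage with a placeholder scorer that only reports the number of samples;
# replace it with real model training to get comparable MSE values.
series = list(range(100))
results = run_grid(series, [5, 20], [10, 20, 50], lambda x, y, units: len(x))
for config in sorted(results):
    print(config, results[config])
```

Keeping the train/validation split and random seeds fixed across all six configurations makes the resulting scores easier to compare.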

Initial jump in loss with TensorFlow

Suppose I have a saved model that is nearly at the minimum, but with some room for improvement. For example, the loss (as reported by tf.keras.Model.evaluate()) might be 11.390, and I know that the model can go down to 11.300.
The problem is that attempts to refine this model (using tf.keras.Model.fit()) consistently result in the weights receiving an initial 'jolt' during the first epoch, which sends the loss way upwards. After that, it starts to decrease, but it does not always converge on the correct minimum (and may not even get back to where it started).
It looks like this:
tf.train.RMSPropOptimizer(0.0002):
0 11.982
1 11.864
2 11.836
3 11.822
4 11.809
5 11.791
(...)
15 11.732
tf.train.AdamOptimizer(0.001):
0 14.667
1 11.483
2 11.400
3 11.380
4 11.371
5 11.365
tf.keras.optimizers.SGD(0.00001):
0 12.288
1 11.760
2 11.699
3 11.650
4 11.666
5 11.601
Dataset with 30M observations, batch size 500K in all cases.
I can mitigate this by turning the learning rate way down, but then it takes forever to converge.
Is there any way to prevent training from going "wild" at the beginning, without impacting the long-term convergence rate?
As you found, decreasing the learning rate is the way to go.
E.g. learning rate = 0.00001:
tf.train.AdamOptimizer(0.00001)
With Adam in particular that should be promising, since the learning rate also acts as an approximate upper bound on the step size.
On top of that, you could try learning rate scheduling, where the learning rate follows a predefined schedule.
Also, from what you show, convergence with the decreased learning rate does not seem too bad.
Another hyperparameter you could tune in your case is the batch size: reducing it decreases the computation cost per update.
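As one possible predefined schedule, the sketch below uses a linear warmup: training starts at a small fraction of the target learning rate and ramps up over the first few epochs. Warmup is my choice of example, not something the answer prescribes, and warmup_lr with its default values is hypothetical. In tf.keras, a function like this can be passed to the tf.keras.callbacks.LearningRateScheduler callback.

```python
def warmup_lr(epoch, base_lr=0.001, warmup_epochs=5, start_factor=0.01):
    """Linearly ramp the learning rate from start_factor * base_lr up to base_lr."""
    if epoch >= warmup_epochs:
        return base_lr
    progress = epoch / warmup_epochs
    return base_lr * (start_factor + (1.0 - start_factor) * progress)


# Print the schedule for the first few epochs.
for epoch in range(7):
    print(epoch, warmup_lr(epoch))
```

The first epoch then runs at roughly the small rate that avoids the jolt, while later epochs use the larger rate that converges faster.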
Note:
I find the term "the correct minimum" rather misleading. To further understand nonconvex optimization for artificial neural networks, I would like to point to the Deep Learning book by Ian Goodfellow et al.

In a dynamic_rnn do I need to bucket samples in order to balance batches?

I am implementing a bidirectional dynamic RNN. Now I face the question of whether I need to bucket my training samples.
My thought (and fear) is that if I don't bucket, I might face the following situation: in a batch of 32 samples where all but one are below 500 characters long and one is, say, 10,000 characters long, backprop will behave basically as if I had a batch size of 1, and it might quickly produce NaNs or throw off the learned weights pretty badly every time that situation occurs.
Any experiences to share before I write code and spend days on training and debugging? Thx
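For reference, bucketing by length can be sketched in plain Python as below. This only illustrates the grouping step and makes no claim about whether it is required for dynamic_rnn. The function name bucket_batches and the boundary values are hypothetical.

```python
import bisect


def bucket_batches(samples, batch_size, boundaries):
    """Group samples by length bucket, then cut each bucket into batches.

    boundaries must be sorted; a sample of length n goes into the first bucket
    whose boundary is >= n, or into a final overflow bucket.
    """
    buckets = {}
    for sample in samples:
        index = bisect.bisect_left(boundaries, len(sample))
        buckets.setdefault(index, []).append(sample)

    batches = []
    for index in sorted(buckets):
        group = buckets[index]
        for start in range(0, len(group), batch_size):
            batches.append(group[start:start + batch_size])
    return batches


# 31 short samples and one very long one, as in the scenario above.
samples = ["a" * 100] * 31 + ["b" * 10000]
batches = bucket_batches(samples, batch_size=32, boundaries=[500, 5000])
print([len(batch) for batch in batches])  # [31, 1]
```

The long sample ends up in its own batch, so the short samples are not padded to 10,000 steps.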