XGBoost: Help interpreting the booster behaviour. Why is the 0th iteration always coming out to be best?

I am training an XGBoost model and having trouble interpreting the model behaviour.
early_stopping_rounds = 10
num_boost_round = 100
The dataset is unbalanced, with 458644 1s and 7975373 0s.
The evaluation metric is AUCPR.
param = {'max_depth': 6, 'eta': 0.03, 'silent': 1, 'colsample_bytree': 0.3, 'objective': 'binary:logistic', 'nthread': 6, 'subsample': 1, 'eval_metric': ['aucpr']}
From my understanding of early_stopping_rounds, training is supposed to stop once no improvement in the evaluation dataset's eval metric (aucpr) has been observed for 10 consecutive rounds. However, in my case, even when there is a clear improvement in the AUCPR on the evaluation dataset, training still stops after the 10th boosting stage. Please see the training log below. Additionally, the best iteration comes out to be the 0th one, even though the 10th iteration clearly has a much higher AUCPR than the 0th.
Is this right? If not what could be going wrong? If yes then please correct my understanding about early stopping rounds and best iteration.

Very interesting!
It turns out that for early stopping, XGBoost either minimizes the metric (RMSE, log loss, etc.) or maximizes it (MAP, NDCG, AUC) - https://xgboost.readthedocs.io/en/latest/python/python_intro.html
When you use aucpr, it is actually trying to minimize it - that appears to be the default behaviour for metrics not on the maximize list.
Try setting maximize=True when calling xgboost.train() - https://github.com/dmlc/xgboost/issues/3712
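A minimal sketch of that call, with synthetic data standing in for the original dataset (only maximize=True is the substantive change; the parameters otherwise mirror the question). Note that newer xgboost releases may already treat aucpr as a maximizing metric, so check your version's behaviour:

import numpy as np
import xgboost as xgb

# Synthetic imbalanced data, only so the example runs end to end.
rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 20))
y = (rng.random(10000) < 0.05).astype(int)

dtrain = xgb.DMatrix(X[:8000], label=y[:8000])
dvalid = xgb.DMatrix(X[8000:], label=y[8000:])

param = {'max_depth': 6, 'eta': 0.03, 'colsample_bytree': 0.3,
         'objective': 'binary:logistic', 'nthread': 6, 'subsample': 1,
         'eval_metric': ['aucpr']}

booster = xgb.train(
    param,
    dtrain,
    num_boost_round=100,
    evals=[(dtrain, 'train'), (dvalid, 'eval')],
    early_stopping_rounds=10,
    maximize=True,  # tell early stopping that a higher aucpr is better
)
print(booster.best_iteration, booster.best_score)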

Related

CNN: Unstable model score vs iteration

My model score vs. iteration graph is unstable. How can I improve it?
[The score-vs-iteration plot and the model code from the original post are not reproduced here.]
Your network looks fairly stock/copy and pasted. I'm pretty sure I've seen this code before.
Without knowing much about your input data I'm not sure whether you're solving a classification problem, but first try switching the output to softmax with a negative log-likelihood loss (see the sketch below).
Your current output activation and loss function are mainly suited to binary classification.
You can also get rid of the ReNormalizeL2PerLayer; depending on your data, it might hinder the network from learning.
It's also hard to help without knowing much about your input data, but sometimes zero-mean, unit-variance normalization may not be suitable for your data set. Consider switching to a 0-to-1 scaling instead.
Lastly, for quick iteration times consider overfitting on a small amount of data first when testing. That will help you see if there's any signal in your data and if your network can learn.
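As a rough illustration of the softmax suggestion only (the original framework, input shape, layer sizes, and number of classes are unknown, so everything below is a placeholder Keras sketch, not the poster's network):

import tensorflow as tf

num_classes = 10  # placeholder; use your actual number of classes

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(32, 32, 3)),  # placeholder input shape
    tf.keras.layers.Conv2D(16, 3, activation='relu'),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(num_classes, activation='softmax'),  # softmax head instead of a sigmoid output
])

# Sparse categorical cross-entropy is the negative log-likelihood of the softmax output.
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])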

Tensorflow: how to stop small values leaking through pruning?

The documentation for PolynomialDecay suggests that by default frequency=100, so that pruning is only applied every 100 steps. This presumably means that the parameters which are pruned to 0 will drift away from 0 during the other 99 of every 100 steps. So at the end of the pruning process, unless you are careful to have an exact multiple of 100 steps, you may well end up with a model that is not perfectly pruned but has a large number of near-zero values.
How does one stop this happening? Do you have to tweak frequency to be a divisor of the number of steps? I can't find any code samples that do that...
As per the example in the docs, the tfmot.sparsity.keras.UpdatePruningStep() callback must be registered while training:
callbacks = [
    tfmot.sparsity.keras.UpdatePruningStep(),
    …
]
model_for_pruning.fit(…, callbacks=callbacks)
This ensures that the mask is applied (and so the pruned weights are set back to zero) when training ends.
https://github.com/tensorflow/model-optimization/blob/master/tensorflow_model_optimization/python/core/sparsity/keras/pruning_callbacks.py#L64
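A slightly fuller, hedged sketch of how those pieces fit together (the model, data, and schedule numbers below are placeholders chosen for illustration, not taken from the question):

import numpy as np
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Toy data and model, only to make the example self-contained.
x = np.random.rand(256, 20).astype('float32')
y = (np.random.rand(256) > 0.5).astype('float32')

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(20,)),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])

# frequency=100 means the mask is only recomputed every 100 steps.
pruning_schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.0, final_sparsity=0.8,
    begin_step=0, end_step=1000, frequency=100)

model_for_pruning = tfmot.sparsity.keras.prune_low_magnitude(
    model, pruning_schedule=pruning_schedule)
model_for_pruning.compile(optimizer='adam', loss='binary_crossentropy')

# UpdatePruningStep keeps the pruning machinery in sync with the optimizer steps
# and re-applies the mask, so pruned weights end up exactly zero.
callbacks = [tfmot.sparsity.keras.UpdatePruningStep()]
model_for_pruning.fit(x, y, epochs=5, callbacks=callbacks)

# strip_pruning removes the wrappers, leaving the final (masked) weights.
final_model = tfmot.sparsity.keras.strip_pruning(model_for_pruning)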

How to set a minimum number of epoch in Optuna SuccessiveHalvingPruner()?

I'm using Optuna 2.5 to optimize a couple of hyperparameters of a tf.keras CNN model. I want to use pruning so that the optimization skips the less promising corners of the hyperparameter space. I'm using something like this:
study0 = optuna.create_study(
    study_name=study_name,
    storage=storage_name,
    direction='minimize',
    sampler=TPESampler(n_startup_trials=25, multivariate=True, seed=123),
    pruner=optuna.pruners.SuccessiveHalvingPruner(
        min_resource='auto', reduction_factor=4, min_early_stopping_rate=0),
    load_if_exists=True)
Sometimes the model stops after 2 epochs, other times after 12 epochs, 48, and so forth. What I want is to ensure that the model always trains for at least 30 epochs before being pruned. I guessed that the min_early_stopping_rate parameter might control this, but when I changed it from 0 to 30 the models never got pruned. Can someone explain, a bit better than the Optuna documentation does, what these SuccessiveHalvingPruner() parameters really do (especially min_early_stopping_rate)?
Thanks
The documentation's explanation of min_resource says:
A trial is never pruned until it executes min_resource * reduction_factor ** min_early_stopping_rate steps.
So we need to replace min_resource='auto' with a specific number chosen from reduction_factor and min_early_stopping_rate. With reduction_factor=4 and min_early_stopping_rate=0, setting min_resource=30 guarantees 30 * 4**0 = 30 epochs before any pruning, as in the sketch below.
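A hedged sketch of that change (only the pruner arguments differ from the snippet in the question; the study name is a placeholder):

import optuna
from optuna.samplers import TPESampler

# min_resource * reduction_factor ** min_early_stopping_rate = 30 * 4**0 = 30,
# so no trial is pruned before it has reported 30 epochs.
study0 = optuna.create_study(
    study_name='example-study',  # placeholder
    direction='minimize',
    sampler=TPESampler(n_startup_trials=25, multivariate=True, seed=123),
    pruner=optuna.pruners.SuccessiveHalvingPruner(
        min_resource=30, reduction_factor=4, min_early_stopping_rate=0))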

Tensorflow Tensorboard - should I follow the "smooth" value or the "Value"?

I am using TF TensorBoard to monitor the training progress of a model. I am getting a bit confused because the smoothed and raw values of the validation loss are moving in different directions:
Time = 13:30, Smoothed = 18.33, Value = 15.41
Time = 13:45, Smoothed = 17.76, Value = 16.92
In this case, is the validation loss increasing or decreasing? thanks!
As I cannot put figures in the comments, have a look at this graph.
If you look at the falling slope between x = 50 and x = 100, you will see that locally, the real values increase at some points (usually after downward spikes). So you could conclude that your function values are increasing. But at a larger scope you will see that the function values are decreasing. The smoothing helps make the interpretation easier, but does not return exact values.
Coming back to the local example, it would give you the insight that the overall trend is a decreasing function, but it does not provide accurate loss values.
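For intuition, the "Smoothed" curve is essentially an exponential moving average of the raw scalar values; a minimal sketch of the idea (an approximation, not TensorBoard's exact debiased implementation):

def smooth(values, weight=0.6):
    # Exponential moving average, similar in spirit to TensorBoard's smoothing slider.
    smoothed = []
    last = values[0]
    for v in values:
        last = last * weight + (1 - weight) * v
        smoothed.append(last)
    return smoothed

raw = [18.0, 15.41, 16.92, 14.8, 15.3]  # made-up loss values
print(smooth(raw))  # the trend becomes clearer, but the values are no longer exact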

Reason for initializing TensorFlow variables with a small stddev

I have a question about the reason for initializing TensorFlow variables with a small stddev.
I guess many people have tried the MNIST test code from the TensorFlow beginner's guide.
Following it, the first layer's weights are initialized using truncated_normal with stddev 0.1.
I assumed that setting it to a much bigger value would give the same result, i.e. the same accuracy.
But even after increasing the epoch count, it doesn't work.
Does anybody know the reason?
original:
W_layer = tf.Variable(tf.truncated_normal([inp.get_shape()[1].value, size], stddev=0.1), name='w_'+name)
# result: (990, 0.93000001, 0.89719999)
modified:
W_layer = tf.Variable(tf.truncated_normal([inp.get_shape()[1].value, size], stddev=200), name='w_'+name)
# result: (99990, 0.1, 0.098000005)
The reason is because you want to keep all the layer's variances (or standard deviations) approximately the same, and sane. It has to do with the error backpropagation step of the learning process and the activation functions used.
In order to learn the network's weights, the backpropagation step requires knowledge of the network's gradient, a measure of how strong each weight influences the input to reach the final output; layer's weight variance directly influences the propagation of gradients.
Say, for example, that the activation function is sigmoidal (e.g. tf.nn.sigmoid or tf.nn.tanh); this implies that all input values are squashed into a fixed output value range. For the sigmoid, that range is 0..1, where essentially all values z outside roughly [-4, 4] are squashed very close to one (for z > 4) or zero (for z < -4), and only values within that range produce meaningfully different outputs.
Now the difference between sigmoid(5) and sigmoid(1000) is barely noticeable. Because of that, all very large or very small pre-activations will optimize very slowly, since their influence on the result y = sigmoid(W*x + b) is extremely small. The pre-activation value z = W*x + b (where x is the input) depends on the actual input x and the current weights W. If either of them is large, e.g. because the weights were initialized with a high standard deviation, the result will necessarily be (relatively) large, leading to said problem. This is also why truncated_normal is used rather than a plain normal distribution: the latter only guarantees that most values are close to the mean, with a small (under 5%) chance that they are not, while truncated_normal re-draws every value that falls more than two standard deviations from the mean, guaranteeing that all weights stay in the same range while still being approximately normally distributed.
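A tiny numeric check of the saturation argument (plain NumPy, not the original MNIST code):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for z in [1.0, 5.0, 1000.0]:
    y = sigmoid(z)
    grad = y * (1.0 - y)  # derivative of the sigmoid at z
    print(z, y, grad)
# z = 1    -> output ~0.73,  gradient ~0.20
# z = 5    -> output ~0.993, gradient ~0.0066
# z = 1000 -> output 1.0,    gradient 0.0 (no learning signal)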
To make matters worse, in a typical neural network - especially in deep learning - each network layer is followed by one or many others. If in each layer the output value range is big, the gradients will get bigger and bigger as well; this is known as the exploding gradients problem (a variation of the vanishing gradients, where gradients are getting smaller).
The reason that this is a problem is because learning starts at the very last layer and each weight is adjusted depending on how much it contributed to the error. If the gradients are indeed getting very big towards the end, the very last layer is the first one to pay a high toll for this: Its weights get adjusted very strongly - likely overcorrecting the actual problem - and then only the "remaining" error gets propagated further back, or up, the network. Here, since the last layer was already "fixed a lot" regarding the measured error, only smaller adjustments will be made. This may lead to the problem that the first layers are corrected only by a tiny bit or not at all, effectively preventing all learning there. The same basically happens if the learning rate is too big.
Finding the best weight initialization is a topic in itself, and there are more sophisticated methods such as Xavier initialization or layer-sequential unit variance (LSUV); however, small normally distributed values are usually simply a good guess.
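For completeness, a hedged sketch of the two initializations mentioned above in TF 2.x syntax (the layer sizes are placeholders; tf.truncated_normal from the question lives at tf.random.truncated_normal in TF 2.x):

import tensorflow as tf

fan_in, fan_out = 784, 256  # placeholder layer sizes (e.g. MNIST input -> hidden layer)

# Small truncated normal, as in the beginner's guide: values beyond two
# standard deviations from the mean are re-drawn.
w_small = tf.Variable(tf.random.truncated_normal([fan_in, fan_out], stddev=0.1))

# Xavier/Glorot initialization chooses the scale from the layer sizes automatically.
w_xavier = tf.Variable(tf.keras.initializers.GlorotNormal()(shape=[fan_in, fan_out]))

print(tf.math.reduce_std(w_small).numpy(), tf.math.reduce_std(w_xavier).numpy())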