xgboost using the auc metric correctly - xgboost

I have a slightly imbalanced dataset for a binary classification problem, with a positive to negative ratio of 0.6.
I recently learned about the auc metric from this answer: https://stats.stackexchange.com/a/132832/128229, and decided to use it.
But I came across another link http://fastml.com/what-you-wanted-to-know-about-auc/ which claims that, the AUC-ROC is insensitive to class imbalance, and we should use AUC for a precision-recall curve.
The xgboost docs are not clear on which AUC they use, do they use AUC-ROC?
Also the link mentions that AUC should only be used if you do not care about the probability and only care about the ranking.
However since i am using a binary:logistic objective i think i should care about probabilities since i have to set a threshold for my predictions.
The xgboost parameter tuning guide https://github.com/dmlc/xgboost/blob/master/doc/how_to/param_tuning.md
also suggests an alternate method to handle class imbalance, by not balancing positive and negative samples and using max_delta_step = 1.
So can someone explain, when is the AUC preffered over the other method for xgboost to handle class imbalance. And if i am using AUC , what is the threshold i need to set for prediction or more generally how exactly should i use AUC for handling imbalanced binary classification problem in xgboost?
EDIT:
I also need to eliminate false positives more than false negatives, how can i achieve that, apart from simply varying the threshold, with binary:logistic objective?

According the xgboost parameters section in here there is auc and aucprwhere prstands for precision recall.
I would say you could build some intuition by running both approaches and see how the metrics behave. You can include multiple metric and even optimize with respect to whichever you prefer.
You can also monitor the false positive (rate) in each boosting round by creating custom metric.

XGboost chose to write AUC (Area under the ROC Curve), but some prefer to be more explicit and say AUC-ROC / ROC-AUC.
https://xgboost.readthedocs.io/en/latest/parameter.html

Related

What is the added value of calculating AUC of a PLS-DA model built with only training data

I'm having trouble understanding the added value of calculating AUC of training sets in general but for this question i'm using an example with PLS-DA.
Let's say you've built a PLS-DA model to try and see whether this model can distinguish between patients with diabetes and patients without. After this, the plot and visualisation of the model shows that there is some kind of discriminatory power. Mind you, this PLS-DA model is built on ONLY trainingdata/ trainig set.
In this situation, what is the added value of using ROC curve to calculate the AUC?
And let's say you plot ROC curve and calculate an AUC of 0,9. What does this explicitly mean? I'm tempted that this would mean that this model is able to/ has the potential to distinguish between, people with diabetes and people without diabetes with an accuracy of 90%. But something tells me this isn't right because after all; the performance of my model can ONLY be assessed after plotting ROC curve and calculating AUC of a validation set and test set right? Or am I looking at this in the wrong way?

Is there a way to retrieve the weights from a GPflow GPR model?

Is there a way to retrieve the weights from a GPflow GPR model?
I do not necessarily need the explicit weights. However, I have two issues that may be solved using the weights:
I would like to compile and send a trained model to a third party. I
would like to do this without sending the training data and without
the third party having access to the training data.
I would like to be able to predict new mean values without
calculating new variances. Currently predict_f calculates both the
mean and the variance, but I only use the mean. I believe I could
speed up my prediction significantly if I didn't calculate the
variance.
I could resolve both of these issues if I could retrieve the weights from the GPR model after training. However, if it is possible to resolve these tasks without ever dealing with explicit weights, that would be even better.
It's not entirely clear what you mean by "explicit weights", but if you mean alpha = Kxx^{-1} y where Kxx is the evaluation of k(x,x') and y is the vector of observation targets, then you can get that by using the Posterior object (see https://github.com/GPflow/GPflow/blob/develop/gpflow/posteriors.py), which you get by calling posterior = model.posterior(). You can then access posterior.alpha.
Re 1.: However, for predictions you still need to be able to compute Kzx the covariance between new test points and the training points, so you will also need to provide the training locations and kernel hyperparameters.
This also means that you cannot rely on this to keep your training data secret, as the third party could simply compute Kxx instead of Kzx and then get back y = Kxx # alpha. You can avoid sharing exact (x,y) training set pairs by using a sparse approximation (this would remove "individual identifiability" at least). But I still wouldn't rely on it for privacy.
Re 2.: The Posterior object already provides much faster predictions; if you only ask for full_cov=False (marginal variances, the default), then you're at worst about a factor ~3 or so slower than predicting just the mean (in practice, I would guesstimate less than 1.5x as slow). As of GPflow 2.3.0, there is no implementation within GPflow of predicting the mean only.

What loss function to use in Keras when metric is SparseTopKCategoricalAccuracy/TopKCategoricalAccuracy?

For multiclass classification problems, Keras and tf.keras have metrics like SparseTopKCategoricalAccuracy and TopKCategoricalAccuracy. However, if one uses loss functions like SparseCategoricalCrossentropy or CategoricalCrossentropy, they cannot achieve the max values for these two metrics.
What is a good loss function to use when one wants to maximize SparseTopKCategoricalAccuracy or TopKCategoricalAccuracy?
I understand that SparseTopKCategoricalAccuracy is not differentiable, just like Accuracy. I am trying to find a function that can approximate the smooth loss function and yield a higher number for SparseTopKCategoricalAccuracy.
CrossEntropy is not the best loss function when you deal with Top-k accuracy because cross-entropy may be prone to overfitting on small datasets or noisy labels.
As you have already pointed out, "smooth loss" functions are developed for top-k classification with SVM. To my knowledge, there is no a "off-the-shelf" loss function in Keras/TF that is best suited for top-k. However, I suggest you to try Smooth Surrogate Loss (SSL) presented in the article and implemented in Pytorch to use with deep neural networks (see Github). It derives from multi-class SVMs as SSL creates a margin between the correct top-k predictions and the incorrect ones. The training time of SSL is comparatevely the same as in the case of cross-entropy thanking to a divide-and-conquer approach and the use of polynomials (see implementation).

What does stateful mean in tensorflow metrics in my case?

I don't really understand the explanation of a stateful metric here: Keras metrics with TF backend vs tensorflow metrics
Now, if I split my evaluation data in batches and for each batch I use tf.metrics.precision for the precision, does it mean that the previous variables (counter false positives etc. ) are used for the calculation in the next batch? That would be really bad, since I want the single evaluations for each batch (that is why I do the split!)
If this is the case how can I reset the variables for each batch.
I need the single values from each batch for a mean afterwards.
The reason why tf.metrics.Precision and the like (Recall, etc) store true/false positive is because we do not want to estimate them batch-wise (unlike Accuracy or Loss, etc). The original implementation of Precision in keras (noted, not tf.keras) did exactly what you described (single evaluations for each batch and then aggregate afterward) but was later removed in version 2.0.0 because this way of computing global metric is "more misleading than helpful" (https://github.com/keras-team/keras/issues/5794).
But you may still do what you want to do, you can subclass tf.metrics.Metric and implement the logic of Precision in update_state method. The Metric API doc on Tensorflow has an example of custom Metrics. https://www.tensorflow.org/api_docs/python/tf/keras/metrics/Metric
I hope this is helpful!

what is the difference between sampled_softmax_loss and nce_loss in tensorflow?

i notice there are two functions about negative Sampling in tensorflow to compute the loss (sampled_softmax_loss and nce_loss). the paramaters of these two function are similar, but i really want to know what is the difference between the two?
Sample softmax is all about selecting a sample of the given number and try to get the softmax loss. Here the main objective is to make the result of the sampled softmax equal to our true softmax. So algorithm basically concentrate lot on selecting the those samples from the given distribution.
On other hand NCE loss is more of selecting noise samples and try to mimic the true softmax. It will take only one true class and a K noise classes.
Sampled softmax tries to normalise over all samples in your output. Having a non-normal distribution (logarithmic over your labels) this is not an optimal loss function. Note that although they have the same parameters, they way you use the function is different. Take a look at the documentation here: https://github.com/calebchoo/Tensorflow/blob/master/tensorflow/g3doc/api_docs/python/functions_and_classes/shard4/tf.nn.nce_loss.md and read this line:
By default this uses a log-uniform (Zipfian) distribution for sampling, so your labels must be sorted in order of decreasing frequency to achieve good results. For more details, see log_uniform_candidate_sampler.
Take a look at this paper where they explain why they use it for word embeddings: http://papers.nips.cc/paper/5165-learning-word-embeddings-efficiently-with-noise-contrastive-estimation.pdf
Hope this helps!
Check out this documentation from TensorFlow https://www.tensorflow.org/extras/candidate_sampling.pdf
They seem pretty similar, but sampled softmax is only applicable for a single label while NCE extends to the case where your labels are a multiset. NCE can then model the expected counts rather than presence/absence of a label. I'm not clear on an exact example of when to use the sampled_softmax.