What is the added value of calculating AUC of a PLS-DA model built with only training data

I'm having trouble understanding the added value of calculating the AUC of training sets in general, but for this question I'm using an example with PLS-DA.
Let's say you've built a PLS-DA model to see whether it can distinguish between patients with diabetes and patients without. After this, the plot and visualisation of the model show that there is some kind of discriminatory power. Mind you, this PLS-DA model is built on ONLY training data / a training set.
In this situation, what is the added value of using a ROC curve to calculate the AUC?
And let's say you plot the ROC curve and calculate an AUC of 0.9. What does this explicitly mean? I'm tempted to say that this means the model is able to, or has the potential to, distinguish between people with diabetes and people without diabetes with an accuracy of 90%. But something tells me this isn't right, because after all, the performance of my model can ONLY be assessed by plotting the ROC curve and calculating the AUC on a validation set and test set, right? Or am I looking at this in the wrong way?

Related

Can SigmoidFocalCrossEntropy in Tensorflow (tf-addons) be used in Multiclass Classification? (What is the right way?)

The Focal Loss provided in Tensorflow is used for class imbalance. For binary classification there is a lot of example code available, but for multiclass classification there is very little help. I ran the code with one-hot encoded target variables for 250 classes and it gave me results without any error.
import pandas as pd
import tensorflow_addons as tfa

y = pd.get_dummies(df['target'])  # one-hot encoded target classes
model.compile(
    optimizer="adam", loss=tfa.losses.SigmoidFocalCrossEntropy(), metrics=metric
)
I just want to know from whoever wrote this code, or someone with enough knowledge of it: can it be used for multiclass classification? If not, then how come it did not give me errors, and in fact better results than CrossEntropy? Also, in other implementations like this one, a value of alpha has to be given for every class, but only a single value in Tensorflow's implementation.
What is the correct way to use this?
Some basics first.
Categorical Crossentropy is designed to incentivize a model to predict 100% for the correct label. It was designed for models that predict single-label multi-class classification - like CIFAR10 or Imagenet. Usually these models finish in a Dense layer with more than one output.
Binary Crossentropy is designed to incentivize a model to predict 100% if the label is one, or 0% if the label is zero. Usually these models finish in a Dense layer with exactly one output.
When you apply Binary Crossentropy to a single-label multi-class classification problem, you are doing something that is mathematically valid but defines a slightly different task: you are incentivizing a single-label classification model to not only get the true label correct, but also minimize the false labels.
For example, if your target is dog, and your model predicts 60% dog, CCE doesn't care if your model predicts 20% cat and 20% French horn, or 40% cat and 0% French horn. So this is aligned with a top-1 accuracy concept.
But if you take that same model and apply BCE, and your model predicts 60% dog, BCE DOES care whether your model predicts 20%/20% cat/French horn vs 40%/0% cat/French horn. To put it in precise terminology, the former is more "calibrated" and so it has some additional measure of goodness. However, this has little correlation to top-1 accuracy.
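To make that concrete, here is a small numeric sketch of the same toy example in plain numpy (the 60/20/20 vs 60/40/0 splits are just the numbers from above):

import numpy as np

target = np.array([1.0, 0.0, 0.0])   # ground truth: dog
pred_a = np.array([0.6, 0.2, 0.2])   # 60% dog, 20% cat, 20% French horn
pred_b = np.array([0.6, 0.4, 0.0])   # 60% dog, 40% cat, 0% French horn

def cce(y, p):
    # categorical crossentropy only looks at the probability of the true class
    return -np.sum(y * np.log(p + 1e-9))

def bce(y, p):
    # binary crossentropy also penalizes probability mass placed on the false classes
    return -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))

print(cce(target, pred_a), cce(target, pred_b))  # identical: ~0.511 for both
print(bce(target, pred_a), bce(target, pred_b))  # different: ~0.319 vs ~0.341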
When you use BCE, presumably you are spending some of the model's energy on calibration at the expense of top-1 accuracy. But as you might have seen, it doesn't always work out that way. Sometimes BCE gives you superior results. I don't know that there's a clear explanation of that, but I'd assume that the additional signals (in the case of Imagenet, you'll literally get 1000 times more signals) somehow create a smoother loss value that perhaps helps smooth the gradients you receive.
The focusing term of focal loss (the gamma exponent) additionally penalizes very wrong predictions and lessens the penalty if your model predicts something close to the right answer - like predicting 90% cat if the ground truth is cat - while the alpha value weights the classes. This is a shift from the original definition of CCE, based on the theory of Maximum Likelihood Estimation... which focuses on calibration... vs the normal metric most ML practitioners care about: top-1 accuracy.
Focal loss was originally designed for binary classification so the original formulation only has a single alpha value. The repo you pointed to extends the concept of Focal Loss to single-label classification and therefore there are multiple alpha values: one per class. However, by my read, it loses the additional possible smoothing effect of BCE.
Net net, for the best results, you'll want to benchmark CCE, BCE, Binary Focal Loss (out of TFA and per the original paper), and the single-label multi-class Focal Loss that you found in that repo. In general, the discovery of those alpha values is done via guess & check, or grid search.
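For what it's worth, a rough benchmarking sketch along those lines (build_model, train_ds and val_ds are placeholders for your own model constructor and data pipelines; alpha=0.25 / gamma=2.0 are just the defaults from the focal loss paper, and the repo's multi-class focal loss would be dropped in the same way):

import tensorflow as tf
import tensorflow_addons as tfa

candidate_losses = {
    "cce": tf.keras.losses.CategoricalCrossentropy(),
    "bce": tf.keras.losses.BinaryCrossentropy(),
    "binary_focal": tfa.losses.SigmoidFocalCrossEntropy(alpha=0.25, gamma=2.0),
}
for name, loss in candidate_losses.items():
    model = build_model()  # placeholder: rebuild the same architecture for each run
    model.compile(optimizer="adam", loss=loss, metrics=["accuracy"])
    history = model.fit(train_ds, validation_data=val_ds, epochs=10)
    print(name, max(history.history["val_accuracy"]))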
There's a lot of manual guessing and checking in ML unfortunately.

Is there a way to retrieve the weights from a GPflow GPR model?

I do not necessarily need the explicit weights. However, I have two issues that may be solved using the weights:
1. I would like to compile and send a trained model to a third party. I would like to do this without sending the training data and without the third party having access to the training data.
2. I would like to be able to predict new mean values without calculating new variances. Currently predict_f calculates both the mean and the variance, but I only use the mean. I believe I could speed up my prediction significantly if I didn't calculate the variance.
I could resolve both of these issues if I could retrieve the weights from the GPR model after training. However, if it is possible to resolve these tasks without ever dealing with explicit weights, that would be even better.
It's not entirely clear what you mean by "explicit weights", but if you mean alpha = Kxx^{-1} y where Kxx is the evaluation of k(x,x') and y is the vector of observation targets, then you can get that by using the Posterior object (see https://github.com/GPflow/GPflow/blob/develop/gpflow/posteriors.py), which you get by calling posterior = model.posterior(). You can then access posterior.alpha.
Re 1.: However, for predictions you still need to be able to compute Kzx, the covariance between new test points and the training points, so you will also need to provide the training locations and kernel hyperparameters.
This also means that you cannot rely on this to keep your training data secret, as the third party could simply compute Kxx instead of Kzx and then get back y = Kxx @ alpha. You can avoid sharing exact (x,y) training set pairs by using a sparse approximation (this would remove "individual identifiability" at least). But I still wouldn't rely on it for privacy.
Re 2.: The Posterior object already provides much faster predictions; if you only ask for full_cov=False (marginal variances, the default), then you're at worst about a factor ~3 or so slower than predicting just the mean (in practice, I would guesstimate less than 1.5x as slow). As of GPflow 2.3.0, there is no implementation within GPflow of predicting the mean only.
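A small usage sketch of both points, assuming model is a trained gpflow.models.GPR and Xnew is an array of test inputs (the last two lines are not an official GPflow API, just the Kzx @ alpha composition described above, ignoring the mean function):

posterior = model.posterior()  # precomputes and caches alpha and related quantities
mean, var = posterior.predict_f(Xnew, full_cov=False)  # fast, marginal variances only

# hand-rolled mean-only prediction from the cached alpha (an assumption, not an official API)
X_train, _ = model.data  # training inputs are still needed, as noted under Re 1.
mean_only = model.kernel(Xnew, X_train) @ posterior.alpha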

Tensorflow bounded regression vs classification

As part of my master's thesis I have been tasked with predicting a label integer (0-255), which is a binned representation of an angle. The feature columns are also integers, in the range (0-255).
So far I have used the custom Tensorflow layers estimator, implementing a 256-output classifier, which performs well. However, my issue with the classification approach I am using is the following:
My classification model thinks that predicting a 3 instead of a 28 is as good/bad as predicting a 27 instead of a 28.
The numerical interval/ordinal nature of my data (not sure which) leads me to believe that if I used regression I would achieve results with fewer drastically incorrect predictions or outliers.
My goal:
to reduce the number of drastically incorrect predicted outliers
My questions:
1. Is regression the better approach, or can I improve my classification to include an ordinal/interval relationship between my labels?
2. If I choose regression, is there a way to bound my predicted output between 0-255? (I know I will have to round predicted float values.)
Thanks in advance. Any other comments, suggestions or ideas to help me to best tackle the problem are also very helpful.
If I made any incorrect assumptions or mistake in my interpretation of the problem feel free to correct me.
Question 1: Regression is the simpler approach; however, you can also use classification and manipulate the loss function to have a lower loss for misclassifications that are "close" to the true class, as sketched below.
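One hedged sketch of that idea: keep the 256-way classifier but replace the hard one-hot targets with soft targets that spread probability mass over neighbouring bins (the function name and the Gaussian width sigma are just illustrative choices, not an established API):

import tensorflow as tf

def soft_bin_targets(labels, num_bins=256, sigma=2.0):
    # turn integer bin labels into soft targets, so that predicting 27 for a
    # true 28 is penalized far less than predicting 3
    bins = tf.range(num_bins, dtype=tf.float32)
    labels = tf.cast(labels[:, None], tf.float32)
    weights = tf.exp(-0.5 * ((bins - labels) / sigma) ** 2)
    return weights / tf.reduce_sum(weights, axis=1, keepdims=True)

Training with categorical crossentropy against these soft targets then encodes the ordinal relationship between the bins.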
Question 2: The tensorflow command for bounding your prediction is tf.clip_by_value. Are you mapping all 360 degrees to [0,255]? In that case you will want to consider the boundary cases, i.e. your estimator yields -4 and the true value is 251, but they are actually representing (almost) the same value, so the loss should be (close to) 0.
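A small sketch of that boundary handling (a hypothetical helper, assuming the 0-255 bins wrap around the full circle):

import tensorflow as tf

def wrapped_bin_error(y_true, y_pred, num_bins=256):
    y_true = tf.cast(y_true, tf.float32)
    y_pred = tf.clip_by_value(y_pred, 0.0, float(num_bins - 1))  # bound the regression output
    diff = tf.abs(y_true - y_pred)
    # take the shorter way around the circle, so a prediction near 0 and a
    # truth near 255 count as close rather than ~255 apart
    return tf.reduce_mean(tf.minimum(diff, num_bins - diff))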

Understanding and tracking of metrics in object detection

I have some questions about metrics when I do training or evaluation on my own dataset. I am still new to this topic and have just experimented with tensorflow, Google's object detection API and tensorboard...
So I did all this stuff to get things up and running with the object detection API, trained on some images and did some evaluation on other images.
I decided to use the weighted PASCAL metrics set for evaluation.
In tensorboard I get some IoU for every class and also mAP, and that's fine to see, and now come the questions.
The IoU tells me how well the ground-truth and predicted boxes overlap and measures the accuracy of my object detector.
First Question: Is there an influence on the IoU if an object with a ground-truth box is not detected?
Second Question: Is there an influence on the IoU if a ground-truth object is predicted as a false negative?
Third Question: What about false positives where there are no ground-truth objects?
Coding Questions:
Fourth Question: Has anyone modified the evaluation workflow of the object detection API to bring in more metrics like accuracy or TP/FP/TN/FN? If so, can you provide me some code with an explanation or a tutorial you used - that would be awesome!
Fifth Question: If I want to monitor overfitting and take 30% of my 70% training data and do some evaluation on it, which parameter shows me that there is overfitting on my dataset?
Maybe these are newbie questions or I just have to read and understand more - I don't know - so your help in understanding more is appreciated!!
Thanks
Let's start by defining precision with respect to a particular object class: it's the proportion of good predictions to all predictions of that class, i.e., TP / (TP + FP). E.g., if you have a dog, cat and bird detector, the dog-precision would be the number of correctly marked dogs over all predictions marked as dog (i.e., including false detections).
To calculate the precision, you need to decide whether each detected box is TP or FP. To do this you may use the IoU measure, i.e., if there is significant (e.g., 50% *) overlap of the detected box with some ground-truth box, it's TP if both boxes are of the same class, otherwise it's FP (if the detection is not matched to any ground-truth box, it's also FP).
* that's where the @0.5IOU shortcut comes from; you may have spotted it in Tensorboard in the titles of the graphs with the PASCAL metrics.
If the estimator outputs some quality measure (or even a probability), you may decide to drop all detections with quality below some threshold. Usually, estimators are trained to output a value between 0 and 1. By changing the threshold you can tune the recall of your estimator (the proportion of correctly discovered objects): lowering the threshold increases the recall (but decreases precision) and vice versa. The average precision (AP) is the average of the precision calculated over different thresholds; in the PASCAL metrics the recall levels are taken from the range [0, 0.1, ..., 1], i.e., it's the average of the precision values at those recall levels. It's an attempt to capture the characteristics of the detector in a single number.
The mean average precision (mAP) is the mean of the average precisions over all classes. E.g., for our dog, cat, bird detector it would be (dog_AP + cat_AP + bird_AP)/3.
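For intuition, a small numpy sketch of the 11-point interpolated AP (the recalls/precisions numpy arrays are assumed to come from your own per-class evaluation, computed at successive score thresholds):

import numpy as np

def eleven_point_ap(recalls, precisions):
    # PASCAL-style AP: for each recall level r in {0, 0.1, ..., 1}, take the
    # best precision achieved at recall >= r, then average the 11 values
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):
        above = precisions[recalls >= r]
        ap += above.max() if above.size else 0.0
    return ap / 11.0

# mAP is then just the mean of the per-class APs,
# e.g. for the dog/cat/bird detector: (dog_ap + cat_ap + bird_ap) / 3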
More rigorous definitions could be found in the PASCAL challenge paper, section 4.2.
Regarding your question about overfitting, there could be several indicators of it; one could be that the AP/mAP metrics calculated on the independent test/validation set begin to drop while the training loss still decreases.

xgboost using the auc metric correctly

I have a slightly imbalanced dataset for a binary classification problem, with a positive to negative ratio of 0.6.
I recently learned about the auc metric from this answer: https://stats.stackexchange.com/a/132832/128229, and decided to use it.
But I came across another link http://fastml.com/what-you-wanted-to-know-about-auc/ which claims that the AUC-ROC is insensitive to class imbalance, and that we should use the AUC of a precision-recall curve instead.
The xgboost docs are not clear on which AUC they use - do they use AUC-ROC?
Also the link mentions that AUC should only be used if you do not care about the probability and only care about the ranking.
However, since I am using a binary:logistic objective, I think I should care about probabilities, since I have to set a threshold for my predictions.
The xgboost parameter tuning guide https://github.com/dmlc/xgboost/blob/master/doc/how_to/param_tuning.md
also suggests an alternate method to handle class imbalance, by not balancing positive and negative samples and using max_delta_step = 1.
So can someone explain when AUC is preferred over the other method for handling class imbalance in xgboost? And if I am using AUC, what threshold do I need to set for prediction, or more generally, how exactly should I use AUC for handling an imbalanced binary classification problem in xgboost?
EDIT:
I also need to eliminate false positives more than false negatives. How can I achieve that, apart from simply varying the threshold, with the binary:logistic objective?
According to the xgboost parameters section here, there are auc and aucpr, where pr stands for precision-recall.
I would say you could build some intuition by running both approaches and seeing how the metrics behave. You can include multiple metrics and even optimize with respect to whichever you prefer.
You can also monitor the false positive rate in each boosting round by creating a custom metric, as sketched below.
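A hedged sketch of both suggestions with the native xgboost API (dtrain/dvalid are assumed to be xgb.DMatrix objects you have already built; the 0.5 cut-off inside the custom metric is just an illustrative choice):

import numpy as np
import xgboost as xgb

def false_positive_rate(preds, dmatrix):
    # custom eval metric: fraction of true negatives predicted as positive;
    # with a built-in objective, recent xgboost passes probabilities to custom_metric
    labels = dmatrix.get_label()
    pred_pos = preds > 0.5
    fp = np.sum(pred_pos & (labels == 0))
    tn = np.sum(~pred_pos & (labels == 0))
    return "fpr", fp / max(fp + tn, 1)

params = {"objective": "binary:logistic", "eval_metric": ["auc", "aucpr"]}
booster = xgb.train(
    params,
    dtrain,
    num_boost_round=100,
    evals=[(dvalid, "valid")],
    custom_metric=false_positive_rate,  # reported alongside auc/aucpr every round
)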
XGboost chose to write AUC (Area under the ROC Curve), but some prefer to be more explicit and say AUC-ROC / ROC-AUC.
https://xgboost.readthedocs.io/en/latest/parameter.html