Trading in precision for better recall in Keras classification neural net - tensorflow

There's always a tradeoff between precision and recall. I'm dealing with a multi-class problem, where for some classes I have perfect precision but really low recall.
Since for my problem false positives are less of an issue than missing true positives, I want to reduce precision in favor of increasing recall for some specific classes, while keeping other things as stable as possible. What are some ways to trade in precision for better recall?

You can use a threshold on the confidence score of your classifier output layer and plot the precision and recall at different values of the threshold. You can use different thresholds for different classes.
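For example, here is a minimal sketch of sweeping thresholds for one class (one-vs-rest) with scikit-learn; the labels, scores, class index, and precision floor are all placeholders for your own validation data:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical stand-ins: y_true would be your validation labels and
# probs the softmax scores from model.predict(X_val).
y_true = np.array([0, 2, 1, 2, 0, 1, 2, 2])
probs = np.random.default_rng(0).dirichlet(np.ones(3), size=8)

target_class = 2  # the class whose recall you want to boost
precision, recall, thresholds = precision_recall_curve(
    (y_true == target_class).astype(int), probs[:, target_class])

# Choose the lowest threshold that still meets an (assumed) precision floor;
# lowering the threshold raises recall at the cost of precision.
min_precision = 0.6
ok = precision[:-1] >= min_precision
chosen = thresholds[ok][0] if ok.any() else 0.5
print("per-class threshold for class", target_class, "=", chosen)
```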
You can also take a look at TensorFlow's weighted cross entropy as a loss function. As its documentation states, it uses a weight to trade off recall and precision by up- or down-weighting the cost of a positive error relative to a negative error.
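A minimal sketch of how that could be wired into a Keras model, assuming the model outputs raw logits and the problem is treated one-vs-rest; the per-class pos_weight values below are placeholders:

```python
import tensorflow as tf

# Sketch only: assumes the model outputs raw logits and labels are
# one-hot / multi-hot encoded.
def make_weighted_bce(pos_weight):
    def loss(y_true, logits):
        return tf.reduce_mean(
            tf.nn.weighted_cross_entropy_with_logits(
                labels=tf.cast(y_true, logits.dtype),
                logits=logits,
                pos_weight=pos_weight))
    return loss

# Placeholder per-class weights: class 1 gets 3x the cost for missed positives,
# which trades precision for recall on that class.
pos_weight = tf.constant([1.0, 3.0, 1.0])
# model.compile(optimizer="adam", loss=make_weighted_bce(pos_weight))
```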

Related

Understanding tensorboard plots for PPO in RLLIB

I am a beginner in deep RL and would like to train my own gym environment in RLlib with the PPO algorithm. However, I am having some difficulty seeing whether my hyperparameter settings are successful. Apart from the obvious episode_reward_mean metric, which should rise, we have many other plots.
I am especially interested in how entropy should evolve during a successful training. In my case it looks like this:
[entropy plot: entropy.jpg]
It usually drops below 0 and then converges. I understand that entropy, as part of the loss function, encourages exploration and can therefore speed up learning. But why is it getting negative? Shouldn't it always be greater than or equal to 0?
What are other characteristics of a successful training (vf_explained_var, vf_loss, kl,...)?
If your action space is continuous, entropy can be negative, because differential entropy can be negative.
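For example, the differential entropy of a one-dimensional Gaussian policy is 0.5 * log(2 * pi * e * sigma^2), which goes negative once the standard deviation is small enough:

```python
import numpy as np

# Differential entropy of a 1-D Gaussian: h = 0.5 * log(2 * pi * e * sigma^2).
# As the policy's standard deviation shrinks, the entropy becomes negative,
# so a below-zero entropy curve is normal for continuous action spaces.
for sigma in (1.0, 0.5, 0.1):
    h = 0.5 * np.log(2 * np.pi * np.e * sigma ** 2)
    print(f"sigma={sigma}: differential entropy = {h:.3f}")
```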
Ideally, you want the entropy to be decreasing slowly and smoothly over the course of training, as the agent trades exploration in favor of exploitation.
With regards to the vf_* metrics, it's helpful to know what they mean.
In policy gradient methods, it can be helpful to reduce the variance of rollout estimates by using a value function--parameterized by a neural network--to estimate rewards that are farther in the future (check the PPO paper for some math on page 5).
vf_explained_var is the explained variance of those future rewards through the use of the value function. You want this to be as high as possible, and it tops out at 1; however, if there is randomness in your environment, it's unlikely to actually hit 1.
vf_loss is the error that your value function is incurring; ideally this would decrease to 0, though this isn't always possible (due to randomness).
kl is the difference between your old policy and your new policy at each step: you want this to decrease smoothly as you train, indicating convergence.
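As a rough sketch of what vf_explained_var measures, explained variance is typically computed as 1 - Var(returns - values) / Var(returns); the arrays below are placeholders for rollout returns and the value network's predictions:

```python
import numpy as np

# Rough sketch of the usual explained-variance formula:
#     1 - Var(returns - value_predictions) / Var(returns)
rng = np.random.default_rng(0)
returns = rng.normal(size=256)
values = returns + 0.3 * rng.normal(size=256)  # imperfect value estimates

vf_explained_var = 1.0 - np.var(returns - values) / np.var(returns)
print(vf_explained_var)  # closer to 1 as the value function improves
```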

What is the impact from choosing auc/error/logloss as eval_metric for XGBoost binary classification problems?

How does choosing auc, error, or logloss as the eval_metric for XGBoost impact its performance? Assume the data are unbalanced. How does it impact accuracy, recall, and precision?
Choosing between different evaluation metrics doesn't directly impact performance. Evaluation metrics are there for the user to evaluate the model; accuracy is one such metric, and so are precision and recall. The objective function, on the other hand, is what impacts all of those evaluation metrics.
For example, if one classifier yields a probability of 0.7 for label 1 and 0.3 for label 0, and a different classifier yields a probability of 0.9 for label 1 and 0.1 for label 0, you will get a different logloss between them, even though both classify the label correctly.
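A quick check of that difference, using the probabilities from the example above:

```python
import numpy as np

# Logloss for a single example with true label 1 is -log(p).
# Both classifiers predict label 1 correctly at a 0.5 threshold,
# yet their logloss values differ.
for p in (0.7, 0.9):
    print(f"p={p}: logloss = {-np.log(p):.4f}")
```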
Personally, most of the time I use ROC AUC to evaluate a binary classification, and if I want to look deeper, I look at a confusion matrix.
When dealing with unbalanced data, one needs to know how unbalanced it is: is it a 30%-70% ratio or a 0.1%-99.9% ratio? I've read an article arguing that precision-recall is a better evaluation for highly unbalanced data.
Here is some more reading material:
Handling highly imbalance classes and why Receiver Operating Characteristics Curve (ROC Curve) should not be used, and Precision/Recall curve should be preferred in highly imbalanced situations
ROC and precision-recall with imbalanced datasets
The only way the evaluation metric can impact your model's accuracy (or other evaluation metrics) is when using early_stopping. early_stopping decides when to stop training additional boosters according to your evaluation metric.
early_stopping was designed to prevent over-fitting.
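Here is a minimal sketch of that interaction; the data is a random placeholder, and only the eval_metric / early_stopping_rounds wiring matters:

```python
import numpy as np
import xgboost as xgb

# Random placeholder data standing in for a real train/validation split.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 10)), rng.integers(0, 2, 1000)
dtrain = xgb.DMatrix(X[:800], label=y[:800])
dvalid = xgb.DMatrix(X[800:], label=y[800:])

params = {"objective": "binary:logistic", "eval_metric": "auc"}

# The chosen eval_metric only changes the model through this stopping rule:
# training halts when "auc" on the last eval set stops improving.
booster = xgb.train(
    params,
    dtrain,
    num_boost_round=500,
    evals=[(dtrain, "train"), (dvalid, "valid")],
    early_stopping_rounds=20,
)
print("best iteration:", booster.best_iteration)
```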

Cost function convergence in Tensorflow using softmax_cross_entropy_with_logits and "soft" labels/targets

I've found what is probably a rare case in Tensorflow, but I'm trying to train a classifier (linear or nonlinear) using KL divergence (cross entropy) as the cost function in Tensorflow, with soft targets/labels (labels that form a valid probability distribution but are not "hard" 1 or 0).
However, it is clear (tell-tale signs) that something is definitely wrong. I've tried both linear and nonlinear (dense neural network) forms, but no matter what I always get the same final value for my loss function regardless of network architecture (even if I train only a bias). Also, the cost function converges extremely quickly (within 20-30 iterations) using L-BFGS (a very reliable optimizer!). Another sign something is amiss is that I can't overfit the data, and the validation set appears to have exactly the same loss value as the training set. However, strangely, I do see some improvement when I increase the network size and/or change the regularization loss. The accuracy improves with this as well (although not to the point that I'm happy with it or that I'd expect).
It DOES work as expected when I use the exact same code but send in one-hot encoded labels (not soft targets). An example of the cost function from training taken from Tensorboard is shown below. Can someone pitch me some ideas?
Ahh my friend, your problem is that with soft targets, especially ones that aren't close to 1 or 0, the cross entropy loss doesn't change significantly as the algorithm improves. One thing that will help you understand this is to take an example from your training data and compute its entropy; then you will know the lowest value your cost function can reach. This may shed some light on your problem. So for one of your examples, let's say the targets are [0.39019628, 0.44301641, 0.16678731]. Well, using the formula for cross entropy
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))
but then plugging the targets "y_" in place of the predicted probabilities "y", we arrive at the true entropy value of 1.0266190072458234. If your predictions are just slightly off target, let's say [0.39511779, 0.44509024, 0.15979198], then the cross entropy is 1.026805558049737.
Now, as with most difficult problems, it's not just one thing but a combination of things. The loss function is being implemented correctly, but you made the "mistake" of doing what you should do in 99.9% of cases when training deep learning algorithms: you used 32-bit floats. In this particular case, though, you will run out of significant digits that a 32-bit float can represent well before your training algorithm converges to a nice result. If I use your exact same data and code but only change the data types to 64-bit floats, the results are much better: the algorithm continues to train well past 2000 iterations, and you will see it reflected in your accuracy as well. In fact, you can see from the slope that if 128-bit floating point were supported, you could continue training and probably see further gains. You probably wouldn't need that precision in your final weights and biases, just during training to keep the optimization of the cost function going.
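Here is a small NumPy sketch of the numbers quoted above, showing the entropy floor and how little room is left above it:

```python
import numpy as np

# The targets' own entropy is the lowest value the cross entropy can reach.
# With the slightly-off predictions quoted above, the loss is already within
# ~2e-4 of that floor, so later improvements are tiny relative to the loss value.
targets = np.array([0.39019628, 0.44301641, 0.16678731])
preds   = np.array([0.39511779, 0.44509024, 0.15979198])

for dtype in (np.float32, np.float64):
    t, p = targets.astype(dtype), preds.astype(dtype)
    entropy = -np.sum(t * np.log(t))   # ~1.0266190 (the floor)
    xent    = -np.sum(t * np.log(p))   # ~1.0268056
    print(dtype.__name__, entropy, xent, xent - entropy)
```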

Is there no precise implementation of batch normalization in tensorflow and why?

What batch normalization is supposed to do at the inference phase is normalize each layer with the population mean and an estimated population variance.
But it seems every tensorflow implementation (including this one and the official tensorflow implementation) uses (exponential) moving average and variance.
Please forgive me, but I don't understand why. Is it because using a moving average is simply better for performance, or purely for computational speed?
Reference: the original paper
The exact update rule for the sample mean is just exponential averaging with a step equal to the inverse of the number of samples seen so far (equivalently, a decay of 1 - 1/n, where n is the sample size). However, the decay factor usually does not matter much as long as it is chosen very close to one, since exponential averaging with such a decay rate still provides a very close approximation of the mean and variance, especially on large datasets.
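A small sketch of that claim; the data and the 0.99 decay are placeholder choices:

```python
import numpy as np

# The exact sample mean uses a step of 1/n (a decay of 1 - 1/n), while a
# fixed decay close to 1 gives a close estimate on a large sample.
rng = np.random.default_rng(0)
xs = rng.normal(loc=3.0, scale=2.0, size=10_000)

exact_mean, ema_mean = 0.0, 0.0
decay = 0.99
for n, x in enumerate(xs, start=1):
    exact_mean += (x - exact_mean) / n              # exact running sample mean
    ema_mean = decay * ema_mean + (1 - decay) * x   # fixed-decay moving average

print(exact_mean, ema_mean, xs.mean())
```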

xgboost using the auc metric correctly

I have a slightly imbalanced dataset for a binary classification problem, with a positive to negative ratio of 0.6.
I recently learned about the auc metric from this answer: https://stats.stackexchange.com/a/132832/128229, and decided to use it.
But I came across another link, http://fastml.com/what-you-wanted-to-know-about-auc/, which claims that AUC-ROC is insensitive to class imbalance and that we should use the AUC of a precision-recall curve instead.
The xgboost docs are not clear on which AUC they use, do they use AUC-ROC?
Also the link mentions that AUC should only be used if you do not care about the probability and only care about the ranking.
However, since I am using a binary:logistic objective, I think I should care about probabilities, since I have to set a threshold for my predictions.
The xgboost parameter tuning guide https://github.com/dmlc/xgboost/blob/master/doc/how_to/param_tuning.md
also suggests an alternate method to handle class imbalance, by not balancing positive and negative samples and using max_delta_step = 1.
So can someone explain when AUC is preferred over the other method for handling class imbalance in xgboost? And if I am using AUC, what threshold do I need to set for prediction, or more generally, how exactly should I use AUC for handling an imbalanced binary classification problem in xgboost?
EDIT:
I also need to eliminate false positives more than false negatives. How can I achieve that, apart from simply varying the threshold, with the binary:logistic objective?
According to the xgboost parameters section here, there are auc and aucpr, where pr stands for precision-recall.
I would say you could build some intuition by running both approaches and seeing how the metrics behave. You can include multiple metrics and even optimize with respect to whichever you prefer.
You can also monitor the false positive rate in each boosting round by creating a custom metric, as in the sketch below.
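A minimal sketch of such a custom metric on placeholder data, assuming a recent xgboost where xgb.train accepts a custom_metric argument (older versions use feval instead):

```python
import numpy as np
import xgboost as xgb

# Random placeholder data; only the custom-metric wiring matters here.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 5)), rng.integers(0, 2, 1000)
dtrain = xgb.DMatrix(X[:800], label=y[:800])
dvalid = xgb.DMatrix(X[800:], label=y[800:])

def false_positive_rate(preds, dmatrix):
    # With binary:logistic, custom_metric receives probabilities, so an
    # (assumed) 0.5 threshold converts them to hard labels.
    labels = dmatrix.get_label()
    pred_pos = preds > 0.5
    fp = np.sum(pred_pos & (labels == 0))
    tn = np.sum(~pred_pos & (labels == 0))
    return "fpr", float(fp) / max(fp + tn, 1)

booster = xgb.train(
    {"objective": "binary:logistic", "eval_metric": ["auc", "aucpr"]},
    dtrain,
    num_boost_round=50,
    evals=[(dvalid, "valid")],
    custom_metric=false_positive_rate,
)
```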
XGBoost chose to write auc (Area Under the ROC Curve), but some prefer to be more explicit and say AUC-ROC / ROC-AUC.
https://xgboost.readthedocs.io/en/latest/parameter.html