How to handle log(0) when using cross entropy - numpy

To keep the case simple and intuitive, I will use binary (0 and 1) classification for illustration.
Loss function
loss = np.multiply(np.log(predY), Y) + np.multiply((1 - Y), np.log(1 - predY)) #cross entropy
cost = -np.sum(loss)/m #num of examples in batch is m
Probability of Y
predY is computed using sigmoid, and logits can be thought of as the output of a neural network before it reaches the classification step
predY = sigmoid(logits) #binary case
def sigmoid(X):
    return 1/(1 + np.exp(-X))
Problem
Suppose we are running a feed-forward net.
Inputs: [3, 5]: 3 is number of examples and 5 is feature size (fabricated data)
Num of hidden units: 100 (only 1 hidden layer)
Iterations: 10000
This arrangement is set up to overfit. Once it overfits, the network predicts the training examples with complete confidence; in other words, sigmoid outputs exactly 1 or exactly 0, because the exponential saturates in floating point. In that case np.log(0) is undefined (NumPy returns -inf, and multiplying that by 0 gives nan). How do you usually handle this issue?

If you don't mind the dependency on scipy, you can use scipy.special.xlogy. You would replace the expression
np.multiply(np.log(predY), Y) + np.multiply((1 - Y), np.log(1 - predY))
with
xlogy(Y, predY) + xlogy(1 - Y, 1 - predY)
If you expect predY to contain very small values, you might get better numerical results using scipy.special.xlog1py in the second term:
xlogy(Y, predY) + xlog1py(1 - Y, -predY)
Alternatively, knowing that the values in Y are either 0 or 1, you can compute the cost in an entirely different way:
Yis1 = Y == 1
cost = -(np.log(predY[Yis1]).sum() + np.log(1 - predY[~Yis1]).sum())/m
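A further option, not mentioned in the answer above, is to skip predY entirely and compute the cost from the logits, assuming they are still available at that point. The sketch below relies on the identity -[y*log(s) + (1-y)*log(1-s)] = softplus(x) - x*y for s = sigmoid(x); the function name bce_from_logits is made up for illustration.

```python
import numpy as np

def bce_from_logits(logits, Y):
    """Mean binary cross entropy computed directly from logits (hypothetical helper)."""
    logits = np.asarray(logits, dtype=float)
    Y = np.asarray(Y, dtype=float)
    # softplus(x) written as max(x, 0) + log1p(exp(-|x|)) so exp never overflows
    softplus = np.maximum(logits, 0) + np.log1p(np.exp(-np.abs(logits)))
    per_example = softplus - logits * Y
    return per_example.mean()

# Extreme logits that would make sigmoid return exactly 1.0 and 0.0
cost = bce_from_logits([1000.0, -1000.0], [1.0, 0.0])
```

Because log is never applied to a saturated probability, the cost stays finite even for very confident predictions.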

How do you usually handle this issue?
Add a small number (something like 1e-15) to predY. This barely changes the predictions, and it solves the log(0) issue. Note that the log(1 - predY) term needs the same protection when predY is exactly 1.
BTW, if your algorithm outputs zeros and ones, it might be useful to check the histogram of returned probabilities. When the algorithm is that sure about its predictions, it can be a sign of overfitting.

One common way to deal with log(x) and y / x where x is always non-negative but can become 0 is to add a small constant (as written by Jakub).
You can also clip the value (e.g. tf.clip_by_value or np.clip).
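The clipping approach can be sketched with np.clip as follows. The function name clipped_cross_entropy and the default eps=1e-15 are assumptions chosen for illustration, and m is taken to be the number of elements in Y.

```python
import numpy as np

def clipped_cross_entropy(predY, Y, eps=1e-15):
    """Binary cross entropy with predictions clipped away from 0 and 1 (illustrative sketch)."""
    Y = np.asarray(Y, dtype=float)
    # Keep predictions inside [eps, 1 - eps] so both log terms stay finite
    predY = np.clip(np.asarray(predY, dtype=float), eps, 1 - eps)
    loss = Y * np.log(predY) + (1 - Y) * np.log(1 - predY)
    m = Y.size  # number of examples in the batch
    return -np.sum(loss) / m

# Saturated but correct predictions: the cost is tiny instead of nan
cost_ok = clipped_cross_entropy([1.0, 0.0], [1, 0])
# Saturated and wrong prediction: the cost is large but finite
cost_bad = clipped_cross_entropy([0.0], [1])
```

Clipping bounds the per-example loss at roughly -log(eps), so a confidently wrong prediction is penalized heavily without producing inf or nan.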

Related

How does y_pred look like when making a custom loss function in keras?

I am training a UNet shaped CNN and have to deal with data imbalances. I want to minimise false negatives, so I want to implement a custom loss function that does so. I created the following loss function:
from tensorflow.keras import backend as K

def fbeta_loss(y_true, y_pred, beta=2., epsilon=K.epsilon()):
    y_true_f = K.flatten(y_true)
    y_pred_f = K.flatten(y_pred)
    tp = K.sum(y_true_f * y_pred_f)
    predicted_positive = K.sum(y_pred_f)
    actual_positive = K.sum(y_true_f)
    precision = tp/(predicted_positive+epsilon) # calculating precision
    recall = tp/(actual_positive+epsilon) # calculating recall
    # calculating fbeta
    beta_squared = K.square(beta)
    fb = (1+beta_squared)*precision*recall / (beta_squared*precision + recall + epsilon)
    return 1-fb
However, I am not sure whether y_pred is binary or a float between 0 and 1. In my final layer I use a sigmoid activation. Does that mean that in a custom loss function y_pred is a float between 0 and 1, and that I should add a step that maps every value higher than a threshold (0.5) to 1 and every lower value to 0? Or is that step already included in the Keras model? I ask because similar custom loss implementations often do not include that step.
Hopefully this is sort of clear, I am relatively new to stackoverflow. Let me know if anything is missing! Thanks in advance.
The output of sigmoid activation function is always between 0 and 1.
In the limit of x tending towards infinity, S(x) converges to 1, and in the limit of x tending towards negative infinity, S(x) converges to 0. Here, "converges" does not mean that S(x) ever reaches 0 or 1, only that it gets arbitrarily close to them.
And so the output of S(x) is always a float between 0 and 1.
Range of S(x):
0 < S(x) < 1
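To make this concrete, here is a minimal pure-Python sketch with made-up logits. It shows that the sigmoid outputs are floats strictly inside (0, 1) for moderate inputs, and that mapping them to 0 or 1 is a separate, explicit step. Such a thresholding step is usually left out of a loss function because it has zero gradient almost everywhere.

```python
import math

def sigmoid(x):
    """Logistic function S(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

logits = [-5.0, 0.0, 5.0]              # illustrative values, not from the question
y_pred = [sigmoid(x) for x in logits]  # what a sigmoid output layer would produce

# Hard 0/1 labels only appear if you threshold explicitly
hard_labels = [1 if p > 0.5 else 0 for p in y_pred]
```

In floating point, very large positive logits can still round to exactly 1.0, which is why the loss above adds epsilon to its denominators.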

Custom Weighted Cross Entropy loss in Keras

Ok, so I have a neural network that classifies fire size into 3 groups, 0-1, 1 - 100, and over 100 acres. I need a loss function that weights the loss as double when the classifier guesses a class that is off by 2 (Actual = 0, predicted = 3)
I need a loss function that weights the loss as double when the classifier guesses a class that is off by 2 (Actual = 0, predicted = 3)
Double of what?
A) Is it double the loss value when the classifier guesses correctly,
B) or double the loss value when the classifier is off by 1?
C) Can we relax this 'double' constraint and assume that any suitable higher power would suffice?
Let us assume A).
Let f(x) denote the probability that your input variable belongs to a particular class. Note that in f(x), x is the absolute value of the difference in categorical value.
Then we see that f(0)=0.5 is a solution for assumption A. This means that f(1)=0.25 and f(2)=0.25. Btw, the fact that f(1)==f(2) doesn't look natural.
Assume that your classifier calculates a function f(x), and uses it as follows.
def classifier_output(firesize):
    # f is the distance-to-probability function described above
    if 0 <= firesize < 1.0:
        return [f(0), f(1), f(2)]
    elif 1.0 <= firesize < 100.0:
        return [f(1), f(0), f(1)]
    else:
        assert firesize >= 100.0
        return [f(2), f(1), f(0)]
The constraints are:
C1)
f(x) >= 0
C2)
The components of your output vector should always sum to 1.0,
i.e. the sum of all three components of the return value should always be 1.
C3)
When the true class and predicted class differ by 2, the one-hot encoding loss
will be -log(f(2)). According to assumption A, this should equal -2*log(f(0)),
i.e.:
log(f(2)) = 2*log(f(0))
This translates to
f(2) = f(0)*f(0)
Let us put z = f(0). Now f(2) = z*z. We don't know f(1), so let us assume f(1) = y.
From constraint C2, applied to the first and second return vectors, we have the following equations:
z + z*z + y = 1
z + 2*y = 1
A solution to the above is z = 0.5, y = 0.25.
If you assume B), you won't be able to find such a function.
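The solution for assumption A can be checked numerically. The sketch below subtracts the two C2 equations to get y = z*z, which reduces the system to the quadratic 2*z*z + z - 1 = 0; the function name solve_assumption_a is made up for illustration.

```python
import math

def solve_assumption_a():
    """Solve z + z*z + y = 1 and z + 2*y = 1 for a valid probability z = f(0)."""
    # Subtracting the equations gives y = z*z, hence 2*z*z + z - 1 = 0
    a, b, c = 2.0, 1.0, -1.0
    disc = math.sqrt(b * b - 4 * a * c)
    roots = [(-b + disc) / (2 * a), (-b - disc) / (2 * a)]
    # Keep only the root that is a valid probability
    z = next(r for r in roots if 0.0 < r < 1.0)
    y = z * z
    return z, y

z, y = solve_assumption_a()
f2 = z * z  # f(2) from constraint C3
```

The other root of the quadratic is z = -1, which violates C1, so z = 0.5 is the only admissible value.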

log(1+exp(X)) in Tensorflow (avoiding under and over flows)

I was debugging my program and realized that my loss outputs NaN. These NaN values come from the fact that I'm computing tf.log(1 + tf.exp(X))
where X is a 2d tensor. Indeed, when a value of X is large enough, tf.exp() returns +Inf and so tf.log(1 + tf.exp(X)) returns +Inf. I was wondering if there exists a neat trick to avoid underflows and overflows in this case.
I have tried:
def log1exp(x):
    maxi = tf.reduce_max(x)
    return maxi + tf.log(tf.exp(x - maxi) + tf.exp(-maxi))
but it doesn't handle underflows in this case...
I've also glanced at tf.reduce_logsumexp, but it necessarily reduces the tensor along an axis, while I want to keep the same shape!
Finally I know that tf.log(1 + exp(X)) is almost equal to X for large values of X but I think that designing a function that will output X when X > threshold and log(1+exp(X)) otherwise is not very neat.
Thank you
This function is already implemented in tensorflow under the name tf.math.softplus, and takes care of overflows and underflows.
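For intuition, one common way to write a stable softplus by hand is max(x, 0) + log1p(exp(-|x|)). The pure-Python sketch below illustrates that identity on scalars; it is not a description of how TensorFlow implements tf.math.softplus internally.

```python
import math

def softplus(x):
    """Stable log(1 + exp(x)): exp is only ever called on a non-positive argument."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

large = softplus(1000.0)   # naive math.exp(1000.0) would raise OverflowError
small = softplus(-1000.0)  # exp(-1000) underflows to 0.0, so the result is 0.0
middle = softplus(0.0)     # log(2)
```

For large positive x the result is x itself, which matches the observation in the question without needing an explicit threshold.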

How leave's scores are calculated in this XGBoost trees?

I am looking at the below image.
Can someone explain how they are calculated?
I thought it was -1 for a no and +1 for a yes, but then I can't figure out how the little girl has 0.1. And that doesn't work for tree 2 either.
I agree with @user1808924. I think it's still worth explaining how XGBoost works under the hood, though.
What is the meaning of leaves' scores ?
First, the scores you see in the leaves are not probabilities. They are regression values.
In gradient boosted trees, there are only regression trees. To predict whether a person likes computer games or not, the model (XGBoost) treats it as a regression problem. The labels here become 1.0 for Yes and 0.0 for No. Then XGBoost fits regression trees during training. The trees return values such as +2, +0.1, -1, which are what we see at the leaves.
We sum up all the "raw scores" and then convert them to probabilities by applying the sigmoid function.
How to calculate the score in leaves ?
The leaf score (w) is calculated by this formula:
w = - (sum(gi) / (sum(hi) + lambda))
where g and h are the first derivative (gradient) and the second derivative (hessian).
For the sake of demonstration, let's pick the leaf of the first tree that has the value -1. Suppose our objective function is mean squared error (mse) and we choose lambda = 0.
With mse, we have g = (y_pred - y_true) and h = 1. I dropped the constant factor 2 here; with lambda = 0 you can keep it and the result stays the same. Another note: at the t-th iteration, y_pred is the prediction we have after the (t-1)-th iteration (the best we've got up to that point).
Some assumptions:
The girl, grandpa, and grandma do NOT like computer games (y_true = 0 for each person).
The initial prediction is 1 for all the 3 people (i.e., we guess all people love games. Note that, I choose 1 on purpose to get the same result with the first tree. In fact, the initial prediction can be the mean (default for mean squared error), median (default for mean absolute error),... of all the observations' labels in the leaf).
We calculate g and h for each individual:
g_girl = y_pred - y_true = 1 - 0 = 1. Similarly, we have g_grandpa = g_grandma = 1.
h_girl = h_grandpa = h_grandma = 1
Putting the g, h values into the formula above, we have:
w = -( (g_girl + g_grandpa + g_grandma) / (h_girl + h_grandpa + h_grandma) ) = -1
Last note: in practice, the leaf score we see when plotting the tree is a bit different. It is multiplied by the learning rate, i.e., w * learning_rate.
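The worked example above can be reproduced in a few lines of Python. The function name leaf_weight and the learning rate of 0.3 are illustrative assumptions; the gradients and hessians follow the mse setup described above.

```python
def leaf_weight(gradients, hessians, reg_lambda=0.0):
    """Leaf score w = -sum(g) / (sum(h) + lambda)."""
    return -sum(gradients) / (sum(hessians) + reg_lambda)

# Girl, grandpa, grandma: true label 0, initial prediction 1
y_true = [0.0, 0.0, 0.0]
y_pred = [1.0, 1.0, 1.0]

g = [p - t for p, t in zip(y_pred, y_true)]  # mse gradient with the factor 2 dropped
h = [1.0] * len(y_true)                      # mse hessian with the factor 2 dropped

w = leaf_weight(g, h)        # lambda = 0, as in the example
learning_rate = 0.3          # assumed value, for illustration only
plotted_score = w * learning_rate
```

A positive reg_lambda shrinks the score towards zero, for example leaf_weight(g, h, reg_lambda=1.0) gives -0.75.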
The values of leaf elements (aka "scores") - +2, +0.1, -1, +0.9 and -0.9 - were devised by the XGBoost algorithm during training. In this case, the XGBoost model was trained using a dataset where little boys (+2) appear somehow "greater" than little girls (+0.1). If you knew what the response variable was, then you could probably interpret/rationalize those contributions further. Otherwise, just accept those values as they are.
As for scoring samples, the first addend is produced by tree1 and the second addend is produced by tree2. For little boys (age < 15, is male == Y, and use computer daily == Y), tree1 yields 2 and tree2 yields 0.9.
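Combining this with the earlier point about summing raw scores, the little boy's prediction can be sketched as follows. Applying the sigmoid assumes a logistic objective, which the question does not state explicitly.

```python
import math

def sigmoid(x):
    """Logistic function used to map a raw score to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

tree_scores = [2.0, 0.9]            # leaf values from tree1 and tree2 for the little boy
raw_score = sum(tree_scores)        # the ensemble output is the sum of leaf values
probability = sigmoid(raw_score)    # only meaningful under a logistic objective
```

The individual leaf values are therefore contributions to a raw score, not probabilities in their own right.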
Read this
https://towardsdatascience.com/xgboost-mathematics-explained-58262530904a
and then this
https://medium.com/@gabrieltseng/gradient-boosting-and-xgboost-c306c1bcfaf5
and the appendix
https://gabrieltseng.github.io/appendix/2018-02-25-XGB.html

How to understand bias parameter in LIBLINEAR?

I don't understand the meaning of bias parameter in the API of LIBLINEAR. Why is it specified by user during the training? Shouldn't it be just a distance from the separating hyperplane to origin which is a parameter of the learned model?
This is from the README:
struct problem
{
int l, n;
int *y;
struct feature_node **x;
double bias;
};
If bias >= 0, we assume that one additional feature is added to the end of each data instance.
What is this additional feature?
Let's look at the equation for the separating hyperplane:
w_1 * x_1 + w_2 * x_2 + w_3 * x_3 + ... + w_bias * x_bias = 0
Where x are the feature values and w are the trained "weights". The additional feature x_bias is a constant, whose value is equal to the bias. If bias = 0, you will get a separating hyperplane going through the origin (0,0,0,...). You can imagine many cases, where such a hyperplane is not the optimal separator.
The value of the bias affects the margin through the scaling of w_bias. The bias is therefore a tuning parameter, which is usually determined through cross-validation, like other parameters.
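The effect of the extra feature can be sketched in plain Python. This is an illustration of the idea only, not LIBLINEAR's API; the helper names add_bias_feature and decision_value and the weights are made up.

```python
def add_bias_feature(x, bias):
    """Append the constant bias feature when bias >= 0, mirroring the README rule."""
    return list(x) + [bias] if bias >= 0 else list(x)

def decision_value(w, x, bias):
    """Evaluate w_1*x_1 + ... + w_bias*x_bias for one instance."""
    xb = add_bias_feature(x, bias)
    return sum(wi * xi for wi, xi in zip(w, xb))

w = [0.5, -0.5, 2.0]   # hypothetical trained weights; the last one is w_bias
origin = [0.0, 0.0]

with_bias = decision_value(w, origin, bias=1.0)  # the origin is off the hyperplane
zero_bias = decision_value(w, origin, bias=0.0)  # the hyperplane passes through the origin
```

With bias = 1.0 the learned w_bias acts as the usual intercept, whereas with bias = 0 the extra term vanishes regardless of w_bias.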