Lightgbm: High AUC but low predicted score on Label = 1 - xgboost

I just fit a LightGBM model. The ROC AUC (0.81) looks good, but when I append the predicted scores to the test dataset, I notice that the predicted scores for the true label = 1 group and the true label = 0 group overlap a lot. What my stakeholder and I expected was that the more separated the two groups are, the better. My stakeholder also argues that, even though we can choose an arbitrary threshold to turn the scores into 0/1 labels, a predicted probability of 0.9 would mean we are highly confident that an observation is a 1; in our case, most observations in the true label = 1 group only get a predicted probability of around 0.3, which suggests we are only 0.3 confident that they are 1. Can someone help me understand this and suggest how I can improve it? Thank you very much.
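For reference, here is a minimal sketch of the kind of comparison I am describing (toy data and a toy LightGBM model stand in for my real ones):

import matplotlib.pyplot as plt
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# toy stand-in data; in the real case these are my own features and labels
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LGBMClassifier().fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]  # predicted probability of label 1

# compare the predicted-score distributions of the two true-label groups
plt.hist(scores[y_test == 0], bins=50, alpha=0.5, density=True, label="true label = 0")
plt.hist(scores[y_test == 1], bins=50, alpha=0.5, density=True, label="true label = 1")
plt.xlabel("predicted probability")
plt.ylabel("density")
plt.legend()
plt.show()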

Related

Plotting an exponential function given one parameter

I'm fairly new to Python, so bear with me. I have generated some data with a very large number of points, stored in the variable vals. I have plotted a histogram of these values, limited so that only values between 104 and 155 are taken into account. This has been done as follows:
bin_heights, bin_edges = np.histogram(vals, range=[104, 155], bins=30)
bin_centres = (bin_edges[:-1] + bin_edges[1:])/2.
plt.errorbar(bin_centres, bin_heights, np.sqrt(bin_heights), fmt=',', capsize=2)
plt.xlabel("$m_{\gamma\gamma} (GeV)$")
plt.ylabel("Number of entries")
plt.show()
Giving the above plot:
My next step is to take into account values from vals which are less than 120. I have done this as follows:
background_data=[j for j in vals if j <= 120] #to avoid taking the signal bump, upper limit of 120 GeV set
I need to plot a curve on the same plot as the histogram, which follows the form B(x) = Ae^(-x/λ)
I then estimated a value of λ using the maximum likelihood estimator formula:
background_data=[j for j in vals if j <= 120] #to avoid taking the signal bump, upper limit of 120 GeV set
#print(background_data)
N_background=len(background_data)
print(N_background)
sigma_background_data=sum(background_data)
print(sigma_background_data)
lamb = (sigma_background_data)/(N_background) #maximum likelihood estimator for lambda
print('lambda estimate is', lamb)
where lamb = λ. I got a value of roughly lamb = 27.75, which I know is correct. I now need to get an estimate for A.
I have been advised to do this as follows:
Given a value of λ, find A by scaling the PDF to the data such that the area beneath
the scaled PDF has equal area to the data
I'm not quite sure what this means, or how I'd go about trying to do this. PDF means probability density function. I assume an integration will have to take place, so to get the area under the data (vals), I have done this:
from scipy import integrate
data_area = integrate.cumtrapz(background_data, x=None, dx=1.0)
print(data_area)
plt.plot(background_data, data_area)
However, this gives me an error
ValueError: x and y must have same first dimension, but have shapes (981555,) and (981554,)
I'm not sure how to fix it. The end result should be something like:
See the cumtrapz docs:
Returns: ... If initial is None, the shape is such that the axis of integration has one less value than y. If initial is given, the shape is equal to that of y.
So you should either pass an initial value, like
data_area = integrate.cumtrapz(background_data, x=None, dx=1.0, initial = 0.0)
or discard the first value of the background_data:
plt.plot(background_data[1:], data_area)
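To make the shape difference concrete, here is a small sketch with random stand-in data in place of the asker's background_data:

import numpy as np
from scipy import integrate  # newer SciPy names this function cumulative_trapezoid

# stand-in for background_data; the real values come from vals as in the question
background_data = np.random.default_rng(0).exponential(scale=27.75, size=1000)

# default: the cumulative integral has one fewer element than the input
area_default = integrate.cumtrapz(background_data, x=None, dx=1.0)
print(area_default.shape)        # (999,)

# with an initial value the shapes match, so plotting against background_data works
area_with_initial = integrate.cumtrapz(background_data, x=None, dx=1.0, initial=0.0)
print(area_with_initial.shape)   # (1000,)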

How to add sequential (time series) constraint to optimization problem using python PuLP?

A simple optimization problem: find the optimal control sequence for a refrigerator based on the cost of energy. The only constraint is to stay below a temperature threshold, and the objective function tries to minimize the cost of the energy used. The problem is simplified so that the control is simply a binary array, i.e. [0, 1, 0, 1, 0], where 1 means using electricity to cool the fridge and 0 means turning off the cooling mechanism (no cost for that period, but the temperature will increase). We can assume each period is a fixed length of time and has a constant temperature change based on its on/off status.
Here are the example values:
Cost of energy (for our example 5 periods): [466, 426, 423, 442, 494]
Minimum cooling periods (just as a test): 3
Starting temperature: 0
Temperature threshold(must be less than or equal): 1
Temperature change per period of cooling: -1
Temperature change per period of warming (when control input is 0): 2
And here is the code in PuLP
from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpStatus, value
from itertools import accumulate
l = list(range(5))
costy = [466, 426, 423, 442, 494]
cost = dict(zip(l, costy))
min_cooling_periods = 3
prob = LpProblem("Fridge", LpMinimize)
si = LpVariable.dicts("time_step", l, lowBound=0, upBound=1, cat='Integer')
prob += lpSum([cost[i]*si[i] for i in l]) # cost function to minimize
prob += lpSum([si[i] for i in l]) >= min_cooling_periods # how many values must be positive
prob.solve()
The optimization seems to work before I try to account for the temperature threshold. With just the cost function, it returns an array of 0s, which does indeed minimize the cost (duh). With the first constraint (how many values must be positive) it picks the cheapest 3 cooling periods, and calculates the total cost correctly.
obj = value(prob.objective)
print(f'Solution is {LpStatus[prob.status]}\nThe total cost of this regime is: {obj}\n')
for v in prob.variables():
    print(f'{v.name} = {v.varValue}')
output:
Solution is Optimal
The total cost of this regime is: 1291.0
time_step_0 = 0.0
time_step_1 = 1.0
time_step_2 = 1.0
time_step_3 = 1.0
time_step_4 = 0.0
So, if our control sequence is [0, 1, 1, 1, 0], the temperature at the end of each cooling/warming period will look like this: [2, 1, 0, -1, 1]. The temperature goes up 2 whenever the control input is 0, and down 1 whenever the control input is 1. This example sequence is a valid answer, but it will have to change if we add a max temperature threshold of 1, which would mean the first value must be a 1, or else the fridge will warm to a temperature of 2.
However, I get incorrect results when trying to specify the sequential constraint of staying within the temperature thresholds with the condition:
up_temp_thresh = 1
down = -1
up = 2
# here is where I try to ensure that the control sequence would never cause the temperature to
# surpass the threshold. In practice I would like a lower and upper threshold but for now
# let us focus only on the upper threshold.
prob += lpSum([e <= up_temp_thresh for e in accumulate([down if si[i] == 1. else up for i in l])]) >= len(l)
In this case the answer comes out the same as before; I am clearly not formulating it correctly, as the sequence [0, 1, 1, 1, 0] would surpass the threshold.
I am trying to encode "the temperature at the end of each control period must be less than or equal to the threshold". I do this by turning the control sequence into an array of temperature changes, so the control sequence [0, 1, 1, 1, 0] gives us temperature changes [2, -1, -1, -1, 2]. Then, using the accumulate function, I compute a cumulative sum, equal to the fridge temperature after each step, which is [2, 1, 0, -1, 1]. I would like to just check that the max of this array is at or below the threshold, but using lpSum I check that the number of values in the array at or below the threshold equals the length of the array, which should be the same thing.
However, I'm clearly formulating this step incorrectly. As written, this last constraint has no effect on the output, and small changes give other wrong answers. It seems the answer should be [1, 1, 1, 0, 0], which gives an acceptable temperature series of [-1, -2, -3, -1, 1]. How can I specify the sequential nature of the control input using PuLP, or another free Python optimization library?
The easiest and least error-prone approach would be to create a new set of auxiliary variables in your problem which track the temperature of the fridge in each interval. These are not 'primary decision variables', because you cannot directly choose them - rather, their values are constrained by the on/off decision variables for the fridge.
You would then add constraints on these temperature state variables to represent the dynamics. So in untested code:
l_plus_1 = list(range(6))
fridge_temp = LpVariable.dicts("fridge_temp", l_plus_1, cat='Continuous')
fridge_temp[0] = init_temp  # initial temperature of fridge - a known value
for i in l:
    prob += fridge_temp[i+1] == fridge_temp[i] + 2 - 3*si[i]
You can then set the min/max temperature constraints on these new fridge_temp variables.
Note that in the above I've assumed that the fridge temperature variables are defined at one more interval than the on/off decisions for the fridge. The fridge temperature variables represent the temperature at the start of an interval - and having one extra means we can ensure the final temperature of the fridge is acceptable.
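Putting the pieces together, here is an untested end-to-end sketch of that formulation, reusing the costs, minimum cooling periods, and temperature dynamics from the question:

from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpStatus, value

costy = [466, 426, 423, 442, 494]
l = list(range(len(costy)))
cost = dict(zip(l, costy))
min_cooling_periods = 3
init_temp = 0            # starting temperature
up_temp_thresh = 1       # temperature must stay at or below this value

prob = LpProblem("Fridge", LpMinimize)
si = LpVariable.dicts("time_step", l, lowBound=0, upBound=1, cat='Integer')
fridge_temp = LpVariable.dicts("fridge_temp", list(range(len(l) + 1)), cat='Continuous')

prob += lpSum([cost[i] * si[i] for i in l])                # cost function to minimize
prob += lpSum([si[i] for i in l]) >= min_cooling_periods   # minimum number of cooling periods

prob += fridge_temp[0] == init_temp                        # known initial temperature
for i in l:
    # cooling (si = 1) changes the temperature by -1, idling (si = 0) by +2
    prob += fridge_temp[i + 1] == fridge_temp[i] + 2 - 3 * si[i]
    prob += fridge_temp[i + 1] <= up_temp_thresh           # never exceed the threshold

prob.solve()
print(f'Solution is {LpStatus[prob.status]}, total cost: {value(prob.objective)}')
for i in l:
    print(f'time_step_{i} = {si[i].varValue}, temperature after = {fridge_temp[i + 1].varValue}')

Here the initial temperature is pinned with an equality constraint instead of overwriting fridge_temp[0] with a constant; either approach works.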

Spline fitting to data how to predict for particular value

After fitting a spline model
fit <- lm(wage ~ bs(age, knots = c(30, 50, 60)), data = Wage)
how do I predict for a particular age?
Try this:
predict(fit, newdata = list(age = 30))
Now you will ask how I know age should be 30.
One word for you - 'Magic'

How are leaf scores calculated in these XGBoost trees?

I am looking at the below image.
Can someone explain how they are calculated?
I thought it was -1 for a No and +1 for a Yes, but then I can't figure out how the little girl has 0.1. And that doesn't work for tree 2 either.
I agree with @user1808924. I think it's still worth explaining how XGBoost works under the hood, though.
What is the meaning of the leaves' scores?
First, the scores you see in the leaves are not probabilities; they are regression values.
In gradient boosted trees, every tree is a regression tree. To predict whether a person likes computer games or not, the model (XGBoost) treats it as a regression problem: the labels become 1.0 for Yes and 0.0 for No, and XGBoost fits regression trees to them. The trees then return values like +2, +0.1, -1, which we read off at the leaves.
We sum up all the "raw scores" and then convert the sum to a probability by applying the sigmoid function.
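For example, using the little boy's leaf values quoted in the other answer below (+2 from tree1 and +0.9 from tree2), the conversion is:

import math

raw_score = 2.0 + 0.9                             # sum of the raw leaf scores from the two trees
probability = 1.0 / (1.0 + math.exp(-raw_score))  # sigmoid of the summed raw score
print(probability)                                # about 0.948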
How is the score in a leaf calculated?
The leaf score (w) is calculated by this formula:
w = - (sum(g_i) / (sum(h_i) + lambda))
where g_i and h_i are the first derivative (gradient) and the second derivative (Hessian) of the loss for each observation in the leaf.
For the sake of demonstration, let's pick the leaf with value -1 in the first tree. Suppose our objective function is mean squared error (MSE) and we choose lambda = 0.
With MSE, we have g = (y_pred - y_true) and h = 1. I just dropped the constant 2; in fact, you can keep it and the result will stay the same. Another note: at the t-th iteration, y_pred is the prediction we have after the (t-1)-th iteration (the best we've got up to that point).
Some assumptions:
The girl, grandpa, and grandma do NOT like computer games (y_true = 0 for each person).
The initial prediction is 1 for all 3 people (i.e., we guess that everyone loves games). Note that I chose 1 on purpose to get the same result as the first tree. In practice, the initial prediction can be the mean (default for mean squared error), the median (default for mean absolute error), etc. of all the observations' labels in the leaf.
We calculate g and h for each individual:
g_girl = y_pred - y_true = 1 - 0 = 1. Similarly, we have g_grandpa = g_grandma = 1.
h_girl = h_grandpa = h_grandma = 1
Putting the g, h values into the formula above, we have:
w = -( (g_girl + g_grandpa + g_grandma) / (h_girl + h_grandpa + h_grandma) ) = -1
Last note: in practice, the leaf score we see when plotting the tree is a bit different. It is multiplied by the learning rate, i.e., w * learning_rate.
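The same arithmetic as a short sketch, under the assumptions above (squared error with the constant 2 dropped, lambda = 0, y_true = 0 and an initial prediction of 1 for all three people):

y_true = [0.0, 0.0, 0.0]   # the girl, grandpa and grandma do not like computer games
y_pred = [1.0, 1.0, 1.0]   # initial prediction for all three

g = [p - t for p, t in zip(y_pred, y_true)]   # gradients under squared error: [1, 1, 1]
h = [1.0 for _ in y_true]                     # hessians under squared error: all 1
lam = 0.0

w = -sum(g) / (sum(h) + lam)
print(w)   # -1.0, matching the leaf value in the first tree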
The values of leaf elements (aka "scores") - +2, +0.1, -1, +0.9 and -0.9 - were devised by the XGBoost algorithm during training. In this case, the XGBoost model was trained using a dataset where little boys (+2) appear somehow "greater" than little girls (+0.1). If you knew what the response variable was, then you could probably interpret/rationalize those contributions further. Otherwise, just accept those values as they are.
As for scoring samples, the first addend is produced by tree1, and the second addend by tree2. For little boys (age < 15, is male == Y, and use computer daily == Y), tree1 yields 2 and tree2 yields 0.9.
Read this
https://towardsdatascience.com/xgboost-mathematics-explained-58262530904a
and then this
https://medium.com/@gabrieltseng/gradient-boosting-and-xgboost-c306c1bcfaf5
and the appendix
https://gabrieltseng.github.io/appendix/2018-02-25-XGB.html

Bayesian estimation of log-normal using JAGS

I am trying to find the 95% credible interval for each of 50 sample means. Sample sizes range from 2 to 600, and the values in each sample are bounded between 1 and 5.
ex:
sample 1 = (1,3.5,2.8,5,4.6)
sample 2 = (1,5)
sample 3 = (4.1,1.1,5,3.5,2,2.4,...)
Samples with a size of 10 or more have a log-normal distribution, so I used JAGS for Bayesian estimation of the log-normal parameters, adapted from John K. Kruschke, with the model specification below:
modelstring = "
model {
for( i in 1 : N ) {
y[i] ~ dlnorm( muOfLogY , 1/sigmaOfLogY^2 )
}
sigmaOfLogY ~ dunif( 0.001*sdOfLogY , 1000*sdOfLogY )
muOfLogY ~ dunif( 0.001*meanOfLogY , 1000*meanOfLogY )
muOfY <- exp(muOfLogY+sigmaOfLogY^2/2)
modeOfY <- exp(muOfLogY-sigmaOfLogY^2)
sigmaOfY <- sqrt(exp(2*muOfLogY+sigmaOfLogY^2)*(exp(sigmaOfLogY^2)-1))
}
"
The model works fine with sample size > 10. However, with 3 <= sample size < 10, I got extreme values for the upper limit (e.g., 3000), which exceed the maximum possible value of the mean (5).
In the case of sample size = 2, I got the error below:
Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
NA/NaN/Inf in 'y'
I am new to JAGS and can't figure out how to solve these issues. I think that for samples < 10 the distribution is no longer log-normal!
Any ideas?
Thank you
First a semantic note. You are not using JAGS to find sample means. You are using JAGS to find the means of the populations from which the samples arose. If you wanted to find the sample (log)means, you could just take the mean of the (logarithms of the) sample values.
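For instance, with the first example sample from the question, the sample mean and the sample log-mean can be computed directly, no model needed (a quick sketch, shown here in Python):

import numpy as np

sample_1 = np.array([1, 3.5, 2.8, 5, 4.6])   # sample 1 from the question

print(sample_1.mean())           # sample mean
print(np.log(sample_1).mean())   # sample mean of the logarithms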
Now, if the values in each sample are bounded between 1 and 5 (due to some external constraint), then the sample is NEVER drawn from a log-normal distribution, which inherently puts probability mass over values greater than five.
Let's imagine, for the sake of argument, that the samples do arise from log-normal sampling (and therefore aren't inherently bounded between 1 and 5). Then JAGS is simply telling you that there is not enough information in the sample to get a good estimate of the mean of the population from which it is drawn. I wouldn't worry about understanding the error when the sample size is two, because there is no way to get good inference about the population mean from two observations. This is true even if you know that the population is indeed log-normally distributed. And since your populations are not actually log-normally distributed (they are bounded between 1 and 5), the entire inferential procedure is invalid anyway.