Internal node predictions of xgboost model

Is it possible to calculate the internal node predictions of an xgboost model? The R package gbm provides a prediction for the internal nodes of each tree.
The xgboost output, however, only shows predictions for the final leaves of the model.
xgboost output:
Notice that the Quality column has the final prediction for the leaf node in row 6. I would like that value for each of the internal nodes as well.
Tree Node ID Feature Split Yes No Missing Quality Cover
1: 0 0 0-0 Sex=female 0.50000 0-1 0-2 0-1 246.6042790 222.75
2: 0 1 0-1 Age 13.00000 0-3 0-4 0-4 22.3424225 144.25
3: 0 2 0-2 Pclass=3 0.50000 0-5 0-6 0-5 60.1275253 78.50
4: 0 3 0-3 SibSp 2.50000 0-7 0-8 0-7 23.6302433 9.25
5: 0 4 0-4 Fare 26.26875 0-9 0-10 0-9 21.4425507 135.00
6: 0 5 0-5 Leaf NA <NA> <NA> <NA> 0.1747126 42.50
R gbm output:
In the R gbm package output, the Prediction column contains values for both the leaf nodes (SplitVar == -1) and the internal nodes. I would like access to these values from the xgboost model.
SplitVar SplitCodePred LeftNode RightNode MissingNode ErrorReduction Weight Prediction
0 1 0.000000000 1 8 15 32.564591 445 0.001132514
1 2 9.500000000 2 3 7 3.844470 282 -0.085827382
2 -1 0.119585850 -1 -1 -1 0.000000 15 0.119585850
3 0 1.000000000 4 5 6 3.047926 207 -0.092846157
4 -1 -0.118731665 -1 -1 -1 0.000000 165 -0.118731665
5 -1 0.008846912 -1 -1 -1 0.000000 42 0.008846912
6 -1 -0.092846157 -1 -1 -1 0.000000 207 -0.092846157
Question:
How do I access or calculate predictions for the internal nodes of an xgboost model? I would like to use them for a greedy, poor man's version of SHAP scores.

The solution to this problem is to dump the xgboost model to JSON with with_stats=True. That adds the cover statistic to the output, which can be used to distribute the leaf values up through the internal nodes:
def _calculate_contribution(node: AnyNode) -> float32:
    # AnyNode is presumably anytree's node type; Leaf is a custom leaf-node
    # type carrying the dumped leaf value (contrib) and cover statistics.
    if isinstance(node, Leaf):
        return node.contrib
    # Internal node: cover-weighted average of the two child contributions.
    return (
        node.left.cover * _calculate_contribution(node.left)
        + node.right.cover * _calculate_contribution(node.right)
    ) / node.cover
The internal node's contribution is the cover-weighted average of its children's contributions. Using this method, the generated results exactly match those returned when calling the predict method with pred_contribs=True and approx_contribs=True.
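The same weighted-average recursion can be sketched end to end on a toy tree shaped like xgboost's JSON dump (what json.loads applied to an element of booster.get_dump(dump_format="json", with_stats=True) returns); the dict and its numbers below are hand-made for illustration:

```python
def node_value(node):
    """Leaf: return its value. Internal: cover-weighted average of children."""
    if "leaf" in node:
        return node["leaf"]
    left, right = node["children"]
    return (left["cover"] * node_value(left)
            + right["cover"] * node_value(right)) / node["cover"]

# Minimal one-split tree: a root with two leaves.
tree = {
    "cover": 10.0,
    "children": [
        {"leaf": 0.2, "cover": 6.0},
        {"leaf": -0.1, "cover": 4.0},
    ],
}
print(node_value(tree))  # (6*0.2 + 4*(-0.1)) / 10 = 0.08
```

The recursion bottoms out at the leaves, so every internal node, not just the root, gets a value by calling node_value on it.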

Related

Sklearn only predicts one class while dataset is fairly balanced (±80/20 split)

I am trying to come up with a way to check which factors are most influential in a person not paying back a loan (defaulting). I have worked with the sklearn library quite intensively, but I feel like I am missing something quite trivial...
The dataframe looks like this:
0 7590-VHVEG Female Widowed Electronic check Outstanding loan 52000 20550 108 0.099 288.205374 31126.180361 0 No Employed No Dutch No 0
1 5575-GNVDE Male Married Bank transfer Other 42000 22370 48 0.083 549.272708 26365.089987 0 Yes Employed No Dutch No 0
2 3668-QPYBK Male Registered partnership Bank transfer Study 44000 24320 25 0.087 1067.134272 26678.356802 0 No Self-Employed No Dutch No 0
The distribution of the "DefaultInd" column (target variable) is this:
0 0.835408
1 0.164592
Name: DefaultInd, dtype: float64
I have label encoded the data to make it look like this:
CustomerID Gender MaritalStatus PaymentMethod SpendingTarget EstimatedIncome CreditAmount TermLoanMonths YearlyInterestRate MonthlyCharges TotalAmountPayments CurrentLoans SustainabilityIndicator EmploymentStatus ExistingCustomer Nationality BKR_Registration DefaultInd
0 7590-VHVEG 0 4 2 2 52000 20550 108 0.099 288.205374 31126.180361 0 0 0 0 5 0 0
1 5575-GNVDE 1 1 0 1 42000 22370 48 0.083 549.272708 26365.089987 0 1 0 0 5 0 0
2 3668-QPYBK 1 2 0 4 44000 24320 25 0.087 1067.134272 26678.356802 0 0 2 0 5 0
After that I removed NaNs and cleaned the data up some more (removing capitalization, punctuation, etc.).
After that, I try to run this cell:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
y = df['DefaultInd']
X = df.drop(['CustomerID','DefaultInd'],axis=1)
X = X.astype(float)
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.20,random_state=42)
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
print(classification_report(y_test, y_pred))
Which results in this:
precision recall f1-score support
0 0.83 1.00 0.91 1073
1 0.00 0.00 0.00 213
accuracy 0.83 1286
macro avg 0.42 0.50 0.45 1286
weighted avg 0.70 0.83 0.76 1286
As you can see, the "1" class is never predicted, and I am wondering whether this behaviour is to be expected (I think it is not). I tried class_weight='balanced', but that resulted in a weighted average f1-score of 0.59 (instead of 0.76).
I feel like I am missing something. Or is this kind of behaviour expected, and should I rebalance the dataset before fitting? The split does not feel that skewed (±80/20); there should not be this big a problem.
Any help would be more than appreciated :)
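For reference, a minimal, self-contained sketch of the two usual fixes, feature scaling plus class_weight='balanced', on synthetic data (not the asker's loan dataset; all numbers here are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same ~80/20 class imbalance.
X, y = make_classification(n_samples=2000, weights=[0.8, 0.2], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)

# Scaling matters: raw features like EstimatedIncome (tens of thousands)
# dwarf the 0/1 encoded ones and can slow or stall the solver.
model = make_pipeline(StandardScaler(),
                      LogisticRegression(class_weight="balanced"))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
```

Whether the balanced weights help the weighted f1 depends on the error costs; for loan default, recall on the "1" class is usually the number that matters, even if weighted f1 drops.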

change some values of a tibble to a function of another tibbles' values, but only on some sparse elements

So, based on some previous questions here, I was able to figure out how to mutate elements of a tibble while preserving the tibble structure. However, one of my functions involves mutating only a sparse subset of the elements (e.g. 10% of the total values of the tibble, based on whether they are non-zero) in terms of another tibble. This involves epidemiological data: each country's value on day date in total is the number selected for serological testing, but only for some countries and on some dates. A fraction of each of the non-zero values in total have tested as positive cases. The values in cases are non-zero iff the corresponding values in the same row and column of total are also non-zero. Here is a sample of the data:
> cases
# A tibble: 5 x 3
      1     2     3
  <int> <int> <int>
1     0     0     0
2     0     5     3
3     0     2    23
4    11     9    16
5     0     0     9
> total
# A tibble: 5 x 3
      1     2     3
  <int> <int> <int>
1     0     0     0
2     0    19    31
3     0    15    40
4    20    21    29
5     0     0    15
I would like to be able to construct a positivity table which represents the fraction of positive cases for each country on each date. However, I cannot simply divide cases/total, for two reasons:
(1) Arithmetic operations do not work on tibbles like they do on data.frames.
(2) Division on the days of 0 serological tests submitted will result in NA values.
Is there a way to systematize a mutate function from the tidyverse which involves the same tibble, something like:
positivity = total
positivity %>% mutate_all(~ replace(., . > 0, variant/total))
This produces an error; what I would like to get is:
> positivity
# A tibble: 5 x 3
      1     2     3
  <dbl> <dbl> <dbl>
1     0     0     0
2     0     x     x
3     0     x     x
4     x     x     x
5     0     0     x
where the value x in row i and column j of positivity corresponds to cases[i,j]/total[i,j].

Computing JaroWinkler Similarity for unordered and different sized dataframes

I have two dataframes, extracted from the two attached files.
I want to compute the JaroWinkler similarity for the tokens inside the files. I am using the code below:
from similarity.jarowinkler import JaroWinkler
jarowinkler = JaroWinkler()
df_gt['jarowinkler_sim'] = [jarowinkler.similarity(x.lower(), y.lower()) for x, y in zip(df_ex['abstract_ex'], df_gt['abstract_gt'])]
I am facing two problems:
1. The order of the tokens is not handled.
When the positions of the tokens 'can' and 'interesting' are swapped, the similarity index is computed incorrectly:
Unnamed: 0 abstract_gt jarowinkler_sim
0 0 Bipartite 1.000000
1 1 fluctuations 0.914141
2 2 can 0.474747 <--|
3 3 provide 1.000000 |-- Position swapped in one file
4 4 interesting 0.474747 <--|
5 5 information 1.000000
6 6 about 1.000000
7 7 entanglement 1.000000
8 8 properties 1.000000
9 9 and 1.000000
10 10 correlations 1.000000
2. The dataframes might not always be the same size.
When one of the dataframes contains fewer elements, my solution raises an error:
ValueError: Length of values (10) does not match length of index (11)
How can I solve these two problems and compute the similarity accurately?
Thanks !!
TSV FILES
1. df_ex
abstract_ex
0 Bipartite
1 fluctuations
2 interesting
3 provide
4 can
5 information
6 about
7 entanglement
8 properties
9 and
10 correlations
2. df_gt
abstract_gt
0 Bipartite
1 fluctuations
2 interesting
3 provide
4 can
5 information
6 about
7 entanglement
8 properties
9 and
10 correlations
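One way to handle both problems at once is to stop zipping the columns row by row and instead score each ground-truth token against its best match among the extracted tokens. A sketch below; difflib's SequenceMatcher ratio is only a stand-in so the snippet runs without the similarity package installed, so swap sim for JaroWinkler().similarity in real use:

```python
from difflib import SequenceMatcher

def sim(a, b):
    # Stand-in string similarity; replace with JaroWinkler().similarity.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def best_match_scores(gt_tokens, ex_tokens):
    # For each ground-truth token, take its best score over ALL extracted
    # tokens, so ordering and length differences no longer matter.
    return [max(sim(g, e) for e in ex_tokens) for g in gt_tokens]

gt = ["Bipartite", "fluctuations", "can", "provide", "interesting"]
ex = ["Bipartite", "fluctuations", "interesting", "provide"]  # reordered, shorter
print(best_match_scores(gt, ex))
```

Tokens present in both lists score 1.0 regardless of position, and when the lists differ in length a token simply gets its best partial match instead of raising the length-mismatch ValueError.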

To One-Hot encode or not to One-Hot encode

My data set has the day of the week as a number (Mon = 1, Tue = 2, Wed = 3, ...).
My data look like this:
WeekDay Col1 Col2 Target
1 2.2 8 126
6 3.5 4 354
1 8.0 2 322
3 7.2 4 465
7 3.2 5 404
6 3.8 3 134
1 3.6 5 455
1 5.5 8 345
6 7.0 6 442
Shall I one-hot encode WeekDay so it will look like this?
WeekDay Col1 Col2 Target Mo Tu We Th Fr Sa Su
1 2.2 8 126 1 0 0 0 0 0 0
6 3.5 4 354 0 0 0 0 0 1 0
1 8.0 2 322 1 0 0 0 0 0 0
3 7.2 4 465 0 0 1 0 0 0 0
7 3.2 5 404 0 0 0 0 0 0 1
6 3.8 3 134 0 0 0 0 0 1 0
1 3.6 5 455 1 0 0 0 0 0 0
1 5.5 8 345 1 0 0 0 0 0 0
6 7.0 6 442 0 0 0 0 0 1 0
I am going to use Random Forest
You should not use one-hot encoding, since you are using a random forest model. An RF model can find the patterns from label encoding as well, and RF models generally perform worse with one-hot encoding, as a tree may end up dropping a few days when splitting. One-hot encoding also introduces the curse of dimensionality into your data, which is never good.
One-hot encoding is better for methods like linear regression or logistic regression, where 1 (i.e. Monday) might get more importance than 6 (i.e. Saturday), as these models multiply the feature values on the backend.
Generally, it's preferable to use one-hot encoding before using a Random Forest. If this is the only categorical variable in your dataset, go for one-hot encoding. If you use R's random forest, then as far as I know the library deals with it itself. For scikit-learn that's not the case, and you have to one-hot encode yourself. There is a trade-off: one-hot encoding introduces sparsity, which is undesirable for tree-based models if the cardinality of the categorical variable is big, in other words, if there are many unique values in the categorical variable. Python's catboost, however, deals with categorical variables natively.
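For reference, the one-hot encoding itself is one line in pandas; the toy frame below mirrors a few rows of the table above, and the Day_ prefix is an arbitrary choice:

```python
import pandas as pd

df = pd.DataFrame({"WeekDay": [1, 6, 1, 3, 7],
                   "Col1":    [2.2, 3.5, 8.0, 7.2, 3.2],
                   "Target":  [126, 354, 322, 465, 404]})

# One dummy column per observed WeekDay value.
encoded = pd.get_dummies(df, columns=["WeekDay"], prefix="Day")
print(encoded.columns.tolist())
# ['Col1', 'Target', 'Day_1', 'Day_3', 'Day_6', 'Day_7']
```

Note that only the days present in the data get a column; if you need all seven days every time, cast WeekDay to a Categorical with categories 1 through 7 before calling get_dummies.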

Apply noise on non zero elements of data frame

I am struggling a bit with this one.
I have a dataframe, and I want to apply Gaussian noise only to the non-zero elements of the dataframe. A naive way to do this is:
import numpy as np

mu, sigma = 0, 0.1
for i in range(df.shape[0]):
    for j in range(df.shape[1]):
        if df.iat[i, j] != 0:
            df.iat[i, j] += np.random.normal(mu, sigma)
The noise must be different for each element; we do not add the same value each time.
And I would be happy if only this worked. Actually, for some reason, it does not. Instead, I got this:
[image: before noise]
[image: after noise]
As you can see in the images, it works well for columns A and C, but not for the others. What is weird is that there is still a change (+/- 1, far from what one would expect of Gaussian noise...).
I checked whether this was a decimals problem with df.round(), but nothing came up.
So I am mostly looking for another way to apply my noise rather than for a fix to this weird problem. Thanks in advance.
I believe you can generate an array of the same size as the original DataFrame and then add the values conditionally with where:
np.random.seed(234)
df = pd.DataFrame(np.random.randint(5, size=(5,5)))
print (df)
0 1 2 3 4
0 0 4 1 1 3
1 3 0 3 3 2
2 0 2 4 1 3
3 4 0 3 0 2
4 3 1 3 3 1
mu, sigma = 0, 0.1
a = np.random.normal(mu,sigma, size=df.shape)
print (a)
[[ 0.10452115 -0.01051424 -0.13329652 -0.06376671 0.07245456]
[-0.21753186 0.05700441 0.03595196 -0.08154859 0.0076684 ]
[ 0.08368405 0.10390984 0.04692948 0.09711873 -0.06820933]
[-0.07229613 0.03954906 -0.06136678 -0.02328597 -0.22123564]
[-0.04316055 0.05945377 0.13736261 0.07895045 0.03714287]]
df = df.where(df == 0, df.add(a))
print (df)
0 1 2 3 4
0 0.000000 3.989486 0.866703 0.936233 3.072455
1 2.782468 0.000000 3.035952 2.918451 2.007668
2 0.000000 2.103910 4.046929 1.097119 2.931791
3 3.927704 0.000000 2.938633 0.000000 1.778764
4 2.956839 1.059454 3.137363 3.078950 1.037143
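A self-contained version of the same where trick. The frame is deliberately float-typed: with integer columns, the question's element-wise df.iat[i, j] += ... assignment may truncate the added noise, which could be related to the odd +/- 1 behaviour (an assumption, not something the question confirms):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame([[0.0, 4.0], [3.0, 0.0]], columns=["A", "B"])

# One independent draw per cell; `where` keeps cells where df == 0 as-is
# and takes the noisy value everywhere else.
noise = np.random.normal(0, 0.1, size=df.shape)
out = df.where(df == 0, df.add(noise))
print(out)
```

Zeros stay exactly zero, and every non-zero cell gets its own Gaussian perturbation, with no Python-level loops.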