Which is the correct tree in XGBClassifier? - pandas

I have trained the following XGBClassifier on a pandas DataFrame:
from xgboost import XGBClassifier

model = XGBClassifier(
    objective='binary:logistic',
    base_score=0.5,
    booster='gbtree',
    colsample_bylevel=1,
    colsample_bynode=1,
    colsample_bytree=1,
    enable_categorical=False,
    gamma=2,
    gpu_id=-1,
    importance_type=None,
    interaction_constraints='',
    learning_rate=0.1,
    max_delta_step=0,
    max_depth=3,
    min_child_weight=7,
    monotone_constraints='(1,1,1,1,1)',
    n_estimators=3,
    n_jobs=1,
    nthread=1,
    num_parallel_tree=1,
    predictor='auto',
    random_state=0,
    reg_alpha=0,
    reg_lambda=1,
    scale_pos_weight=1,
    silent=True,
    subsample=0.8,
    tree_method='exact',
    validate_parameters=1,
    pred_contribs=True,
    verbosity=None)
model.fit(X, Y)
The resulting tree looks like this:
As you can see there are 4 leaves:
Leaf3 -> Log odds = -0.13381
Leaf4 -> Log odds = -0.05526
Leaf5 -> Log odds = -0.04303
Leaf6 -> Log odds = 0.00275
My assumption is that the numbers we see in the picture are log odds.
Then I use the apply method to check the predicted leaf for every tree for each sample:
model.fit(X, Y)
x = pd.DataFrame((model.apply(X)))
x.to_csv('x.csv')
print(x)
The printout looks like this:
0 1 2
0 6.0 6.0 6.0
1 3.0 3.0 6.0
2 3.0 4.0 3.0
3 6.0 6.0 6.0
4 5.0 5.0 4.0
.. ... ... ...
457 4.0 4.0 6.0
458 6.0 6.0 6.0
459 5.0 5.0 4.0
460 6.0 6.0 5.0
461 3.0 4.0 5.0
The total number of trees is 3 (indexed 0 to 2) because I set n_estimators=3. Is my understanding correct?
Question: which one of these three trees above corresponds to the tree (plot) displayed in the picture above?
I have dumped the tree:
df = model.get_booster().trees_to_dataframe()
print(df)
Which looks like this:
I have highlighted in orange the leaves and gains that correspond to the plot shown above.
Therefore I assume that Tree 0 is the one the algorithm has chosen to segment the dataset.
Now, when I merge the dataframe I used to train the XGBClassifier with the dataframe containing the apply results (keeping only tree 0), I obtain a dataframe that contains the probabilities predicted with the model.predict_proba(X) method and a column called Leaf containing leaves 3, 4, 5 and 6, as expected. The problem is that the probabilities column contains only TWO distinct values: I was expecting FOUR (one for each leaf).
Why is that? I expected one and only one probability to be assigned to each leaf.
How can I figure out what segmentation the algorithm has chosen to assign a leaf to each record? Where can I find the actual segmentation? And how can I create a column in the training dataframe that contains the correct leaf?
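For what it's worth, here is a minimal sketch (not from the question; it assumes the model, X and pd objects defined above, and the column names tree0/tree1/tree2 are purely illustrative) of how one could attach tree 0's leaf value to every row by joining the apply() output with trees_to_dataframe(). Keep in mind that the per-leaf numbers are per-tree log-odds contributions: predict_proba sums the leaf values of all three trees (plus the base score) before applying the logistic function, which is probably why the distinct probabilities do not line up one-to-one with tree 0's four leaves.
# map each sample to the leaf value (the Gain column) of tree 0
leaves = pd.DataFrame(model.apply(X), columns=['tree0', 'tree1', 'tree2']).astype(int)
tree_df = model.get_booster().trees_to_dataframe()
tree0_leaves = tree_df[(tree_df['Tree'] == 0) & (tree_df['Feature'] == 'Leaf')]
tree0_leaves = tree0_leaves[['Node', 'Gain']].rename(columns={'Gain': 'leaf_value_tree0'})
mapped = leaves.merge(tree0_leaves, left_on='tree0', right_on='Node', how='left')
print(mapped.head())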

Related

How to compare a value against a column value containing csv in Postgres?

I have a table called device_info that looks like below (only a sample provided)
device_ip      cpu        memory
100.33.1.0     10.0       29.33
110.35.58.2    3.0, 2.0   20.47
220.17.58.3    4.0, 3.0   23.17
30.13.18.8     -1         26.47
70.65.18.10    -1         20.47
10.25.98.11    5.0, 7.0   19.88
12.15.38.10    7.0        22.45
Now I need to compare a number, say 3, against the cpu column values and get the rows where any cpu value is greater than that. Since the cpu column values are stored as CSV, I am not sure how to do the comparison.
I found that there is a function called string_to_array in Postgres which converts a CSV string to an array, and accordingly tried the query below, which didn't work:
select device_ip, cpu, memory
from device_info
where 3 > any(string_to_array(cpu, ',')::float[]);
What am I doing wrong?
Expected output
device_ip      cpu        memory
100.33.1.0     10.0       29.33
220.17.58.3    4.0, 3.0   23.17
10.25.98.11    5.0, 7.0   19.88
12.15.38.10    7.0        22.45
The statement as-is says "3 is greater than at least one of the array values". What I think you want is "3 is less than at least one of the array values", i.e. at least one cpu value exceeds 3.
Switch > to <.
select device_ip, cpu, memory
from device_info
where 3 < any(string_to_array(cpu, ',')::float[]);

How to use bob.measure.load.split()

I'm a student studying with a focus on machine learning, and I'm interested in authentication.
I am interested in your library because I want to calculate the EER.
Sorry for the basic question, but please tell me about bob.measure.load.split().
Am I correct in thinking that the file format it requires has the true label in the first column and the model's predicted score in the second column?
Like this:
# file.txt
|label|prob |
| -1 | 0.3 |
| 1 | 0.5 |
| -1 | 0.8 |
...
In addition, to actually calculate the EER, should I follow the procedure below?
neg, pos = bob.measure.load.split('file.txt')
eer = bob.measure.eer(neg, pos)
Sincerely.
You have two options for calculating the EER with bob.measure:
Use the Python API to calculate EER using numpy arrays.
Use the command line application to generate error rates (including EER) and plots
Using Python API
First, you need to load the scores into memory and split them into positive and negative scores.
For example:
import numpy as np
import bob.measure
positives = np.array([0.5, 0.5, 0.6, 0.7, 0.2])
negatives = np.array([0.0, 0.0, 0.6, 0.2, 0.2])
eer = bob.measure.eer(negatives, positives)
print(eer)
This will print 0.2. The only thing you need to take care of is that your positive comparison scores are higher than the negative ones; that is, your model should score positive samples higher.
Using command line
bob.measure also comes with a suite of command-line commands that can help you get the error rates. To use the command line, you need to save the scores in a text file made of two columns separated by whitespace. For example, the score file for the same data would be:
$ cat scores.txt
1 0.5
1 0.5
1 0.6
1 0.7
1 0.2
-1 0.0
-1 0.0
-1 0.6
-1 0.2
-1 0.2
and then you would call
$ bob measure metrics scores.txt
[Min. criterion: EER ] Threshold on Development set `scores.txt`: 3.500000e-01
================================ =============
.. Development
================================ =============
False Positive Rate 20.0% (1/5)
False Negative Rate 20.0% (1/5)
Precision 0.8
Recall 0.8
F1-score 0.8
Area Under ROC Curve 0.8
Area Under ROC Curve (log scale) 0.7
================================ =============
OK, it didn't print the EER exactly, but EER = (FPR + FNR) / 2, which here gives (20% + 20%) / 2 = 20%.
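As a cross-check, the same numbers can be reproduced with the Python API used earlier. This is only a sketch: eer_threshold and farfrr are part of bob.measure, but verify the exact signatures for your installed version.
import numpy as np
import bob.measure

positives = np.array([0.5, 0.5, 0.6, 0.7, 0.2])
negatives = np.array([0.0, 0.0, 0.6, 0.2, 0.2])

threshold = bob.measure.eer_threshold(negatives, positives)  # about 0.35, as reported above
far, frr = bob.measure.farfrr(negatives, positives, threshold)
print((far + frr) / 2)  # 0.2, i.e. the EER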
Using bob.bio.base command line
If your scores are the results of a biometrics experiment,
then you want to save your scores in the 4 or 5 column formats of bob.bio.base.
See an example in https://gitlab.idiap.ch/bob/bob.bio.base/-/blob/3efccd3b637ee73ec68ed0ac5fde2667a943bd6e/bob/bio/base/test/data/dev-4col.txt and documentation in https://www.idiap.ch/software/bob/docs/bob/bob.bio.base/stable/experiments.html#evaluating-experiments
Then, you would call bob bio metrics scores-4-col.txt to get biometrics related metrics.

Finding significant values from a series

I have a Series with an index, and the counts can range from 0 to 1000.
I can select all the entries where the value is greater than 3.
But after looking at the data, I decided to select all the entries where the value is more than 10, because some values are significantly higher than others!
s[s > 3].dropna()
-PB-[variable][variable] 8.0
-[variable] 15.0
-[variable][variable] 6.0
A-[variable][variable] 5.0
B 5.0
B-[variable][variable] 5.0
Book 4.0
Bus 8.0
Date 5.0
Dear 1609.0
MR 4.0
Man[variable] 4.0
Number[variable] 5.0
PM[variable] 4.0
Pickup 12.0
Pump[variable] 5.0
RJ 9.0
RJ-[variable]-PB-[variable][variable] 6.0
Time[variable] 6.0
[variable] 103.0
[variable][variable] 15.0
I have refined my query to something like this...
s[s > 10].dropna()
-[variable] 15.0
Dear 1609.0
Pickup 12.0
[variable] 103.0
[variable][variable] 15.0
Is there any function in pandas that returns the significant entries? I can sort in descending order and select the first 5 or 10, but there is no guarantee that those entries will be much higher than the average; in that case I would prefer to select all such entries.
In other words, I decided on the threshold of 10 in this case after looking at the data. Is there any method to select that value programmatically?
Selecting a threshold value with the quantile method might be a better solution, but it is still not an exact answer.
You can use the .head function to select the top rows (5 by default) and .sort_values to sort the Series first; if you want the top 10, pass 10 to head.
Simply call:
s[s > 10].sort_values(ascending=False).head(10)
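If you want to avoid hand-picking the cut-off entirely, here is a small sketch along the lines of the quantile idea mentioned in the question (the 90th percentile and the 2-standard-deviation rule are illustrative choices, not pandas defaults):
import pandas as pd

# counts similar to the ones shown above
s = pd.Series([8, 15, 6, 5, 5, 5, 4, 8, 5, 1609, 4, 4, 5, 4, 12, 5, 9, 6, 6, 103, 15], dtype=float)

# option 1: keep everything above a high quantile, e.g. the 90th percentile
significant_q = s[s > s.quantile(0.90)]

# option 2: keep values far above the typical level, e.g. mean + 2 standard deviations
significant_z = s[s > s.mean() + 2 * s.std()]

print(significant_q)
print(significant_z)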

Display user specified contour levels in GrADS

I would like to know how to display specific contour levels on the colorbar. For example, the schematic above, taken from pivotalweather, shows a colorbar for precipitation values that are not equally spaced. I would like to know how to achieve a similar result with GrADS.
PS: I sometimes use the cbarn.gs and xcbar.gs scripts.
You need to work with GrADS's own color settings for this.
Three steps:
1). Define the colors using 'set rgb # R G B'. You need the RGB values of the colors in your color bar. Since colors 0-15 are predefined in GrADS, you should start the # at 16.
Check this link for details of the colors:
http://cola.gmu.edu/grads/gadoc/colorcontrol.html
2). You need to set the contour levels as follows:
set clevs 0.01 0.05 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 1.2 1.4 1.6 1.8 2 2.5 3 3.5 4 5 6 8 15
3). You need to specify the colors based on your defined RGBs (one more color than the number of levels), separated by spaces:
set ccols 16 17 18 ... etc.

SSRS - column calculations

In SQL Server Reporting Services within Visual Studio, I created a report which has detail lines and a total line. I try to subtract the value in the total line from the value in the detail line and I get a result of zero, which is incorrect. See the example below:
            Col A   Col B
Detail      4.7     4.7 – 4.0
lines       3.7     3.7 – 4.0
            3.5     3.5 – 4.0
Total/AVG   4.0
In column B, I take the figure from the detail line in column A and subtract the total line from it, and I get zero instead of 0.7, etc.
You need to include the scope when calculating the average within your detail row. If you are doing this at the group level, aggregate over the table's group:
=Fields!MyField.Value - AVG(Fields!MyField.Value, "table1_Group1")
If it is at the dataset level, you can do the same with the dataset:
=Fields!MyField.Value - AVG(Fields!MyField.Value, "MyDataset")