I am training an ELECTRA model with TensorFlow on a multi-label task. The ROC performance of each individual label is:
AUROC per tag
morality_binary: 0.8840802907943726
emotion_binary: 0.8690611124038696
positive_binary: 0.9115268588066101
negative_binary: 0.9200447201728821
care_binary: 0.9266915321350098
fairness_binary: 0.8638730645179749
authority_binary: 0.8471786379814148
sanctity_binary: 0.9040042757987976
harm_binary: 0.9046630859375
injustice_binary: 0.8968375325202942
betrayal_binary: 0.846387505531311
subversion_binary: 0.7741811871528625
degradation_binary: 0.9601025581359863
But when I run the sklearn classification report:
import numpy as np
from sklearn.metrics import classification_report

THRESHOLD = 0.5
y_pred = predictions.numpy()
y_true = labels.numpy()
upper, lower = 1, 0
y_pred = np.where(y_pred > THRESHOLD, upper, lower)
print(classification_report(
    y_true,
    y_pred,
    target_names=LABEL_COLUMNS,
    zero_division=0
))
... five of the labels turn out with an F-score of 0:
precision recall f1-score support
morality_binary 0.72 0.73 0.73 347
emotion_binary 0.66 0.73 0.69 303
positive_binary 0.71 0.76 0.73 242
negative_binary 0.70 0.62 0.65 141
care_binary 0.67 0.60 0.63 141
fairness_binary 0.55 0.53 0.54 166
authority_binary 0.00 0.00 0.00 49
sanctity_binary 0.00 0.00 0.00 23
harm_binary 0.48 0.32 0.39 50
injustice_binary 0.62 0.56 0.59 97
betrayal_binary 0.00 0.00 0.00 30
subversion_binary 0.00 0.00 0.00 8
degradation_binary 0.00 0.00 0.00 10
Can someone explain to me how this is possible? I can understand a low f-score, but 0?
I assume 0 is negative and 1 is positive.
AUROC measures the area under the ROC curve as a summary of how well a classifier separates positives from negatives (a score of 0.5 is a random, coin-flip model). To draw the ROC curve, you calculate two values at many different threshold values for distinguishing positive from negative examples:
y-axis: True positive rate (TPR) - how many of the positive examples the model predicted as positive.
x-axis: False positive rate (FPR) - how many of the negative examples the model predicted as positive.
TPR is also called recall. We calculate this using the following formula:
TPR = True positives / (True positives + False Negatives)
= True positives / All positives
So the only way TPR can be 0 is if TP is 0. When TP is 0, precision is reported as 0 as well, since we calculate precision using the following formula:
Precision = True positives / (True positives + False positives)
which is 0 whenever TP is 0 (and, because you passed zero_division=0, it is also reported as 0 when the model makes no positive predictions at all).
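You can verify that this is what is happening with the variables from your own snippet: if the predicted probability for a label never exceeds 0.5, every prediction for it is 0, so TP = 0. A minimal check (assuming predictions holds sigmoid outputs of shape (n_samples, n_labels), as your code suggests):

import numpy as np

probs = predictions.numpy()  # sigmoid outputs, shape (n_samples, n_labels)
for i, name in enumerate(LABEL_COLUMNS):
    # if the max probability stays below 0.5, the 0.5 threshold yields no positives
    print(f"{name}: max predicted probability = {probs[:, i].max():.3f}")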
Now consider the ROC curve itself. By picking a single threshold value (0.5) in your code to predict 0 or 1, you are evaluating the model at one point on that curve, not over the whole curve. For the five failing labels that point evidently lies at the (FPR = 0, TPR = 0) corner: the model never crosses 0.5, so it predicts no positives at all. A single point is not a representation of the ROC curve or the AUROC measure, so a high AUROC and an F1 of 0 at threshold 0.5 are entirely compatible.
I suggest you take a look at the ROC curves and try different values for your classification threshold. The AUROC values show that your model performs better than a random one in general, so you should be able to find a good threshold for each label.
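For example, a common heuristic is to pick, per label, the threshold that maximises Youden's J statistic (TPR - FPR). A minimal sketch using the variables from your snippet (a starting point, not the one right way to tune thresholds):

import numpy as np
from sklearn.metrics import roc_curve

probs = predictions.numpy()
y_true = labels.numpy()
for i, name in enumerate(LABEL_COLUMNS):
    fpr, tpr, thresholds = roc_curve(y_true[:, i], probs[:, i])
    # Youden's J = TPR - FPR; its maximum marks the threshold that best
    # separates the classes along the ROC curve
    best = thresholds[np.argmax(tpr - fpr)]
    print(f"{name}: suggested threshold = {best:.3f}")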
Related
I want to plot a line graph with shaded error bands around the line. My data is in two different dataframes. The first dataframe, shown below, has the mean and stdev; the positive error is mean + stdev and the negative error is mean - stdev. A different dataframe has the x-axis column against which I want to plot the mean and error bands.
a       b       mean    stdev
0.05    0.06    0.055   0.007
0.1     0.13    0.115   0.02
-0.2    -0.5    -0.35   0.21
I found a few discussions on how to do this with lists and arrays, but nothing with dataframes.
How to plot shaded error bands with seaborn?
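A minimal sketch of one way to do this with matplotlib's fill_between (seaborn styling sits on top of the same call); the x values and their column are assumptions based on the table above:

import matplotlib.pyplot as plt
import pandas as pd

stats = pd.DataFrame({'mean': [0.055, 0.115, -0.35],
                      'stdev': [0.007, 0.02, 0.21]})
x = pd.Series([1, 2, 3], name='x')  # hypothetical x-axis column from the second dataframe

plt.plot(x, stats['mean'], color='b')
# shade the band between mean - stdev and mean + stdev
plt.fill_between(x, stats['mean'] - stats['stdev'],
                 stats['mean'] + stats['stdev'], color='b', alpha=0.2)
plt.show()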
I'm trying to plot a graph of precision vs recall. This is my classification report; I don't know how to plot a graph displaying these values.
This is my code for the classification report:
from sklearn.metrics import classification_report, confusion_matrix

print("")
print("Confusion Matrix")
print(confusion_matrix(Y_test, predictions))
print("")
print("Classification Report XGBOOST")
# note: the arguments are swapped here; the signature is
# classification_report(y_true, y_pred), so Y_test should come first
print(classification_report(predictions, Y_test))
output:
Confusion Matrix
[[1163 55]
[ 46 665]]
Classification Report xgboost
precision recall f1-score support
0 0.95 0.96 0.96 1209
1 0.94 0.92 0.93 720
accuracy 0.95 1929
macro avg 0.95 0.94 0.94 1929
weighted avg 0.95 0.95 0.95 1929
I'm trying to do something like this: visualise my precision and recall using a graph.
Try this:
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(Y_test, predictions)
plt.step(recall, precision, color='b', alpha=0.2, where='post')
plt.fill_between(recall, precision, alpha=0.2, color='b')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])
plt.show()
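One caveat: precision_recall_curve expects continuous scores, not hard 0/1 predictions, so with predictions the "curve" collapses to a couple of points. Assuming your XGBoost model exposes predict_proba (the sklearn-style wrapper does), passing probabilities gives a proper curve; a sketch with hypothetical model and X_test names:

probs = model.predict_proba(X_test)[:, 1]  # probability of the positive class
precision, recall, thresholds = precision_recall_curve(Y_test, probs)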
I have a question on how to do this task. I want to group a series of numbers in my dataframe. The numbers are in the column 'PD', which ranges from .001 to 1. What I want is to map those with .9 < 'PD' < .91 to .91 (i.e. return a value of .91), .91 <= 'PD' < .92 to .92, ..., .99 <= 'PD' <= 1 to 1, in a column named 'Grouping'. What I have been doing is writing each if statement manually and then merging the result with the base dataframe. Can anyone help me with a more efficient way of doing this? I'm still in the early stages of using Python. Sorry if the question seems easy. Thank you for answering and for your time.
Let your data look like this
>>> df = pd.DataFrame({'PD': np.arange(0.001, 1, 0.001), 'data': np.random.randint(10, size=999)})
>>> df.head()
PD data
0 0.001 6
1 0.002 3
2 0.003 5
3 0.004 9
4 0.005 7
Then cut off the last decimal of the PD column. This is a bit tricky, since you run into rounding issues when doing it without a string conversion. E.g.
>>> df['PD'] = df['PD'].apply(lambda x: float('{:.3f}'.format(x)[:-1]))
>>> df.tail()
PD data
994 0.99 1
995 0.99 3
996 0.99 2
997 0.99 1
998 0.99 0
Now you can use a pandas groupby. Do whatever you want with the data, e.g.
>>> df.groupby('PD').agg(lambda x: ','.join(map(str, x)))
data
PD
0.00 6,3,5,9,7,3,6,8,4
0.01 3,5,7,0,4,9,7,1,7,1
0.02 0,0,9,1,5,4,1,6,7,3
0.03 4,4,6,4,6,5,4,4,2,1
0.04 8,3,1,4,6,5,0,6,0,5
[...]
Note that the first group is one item shorter because 0.000 is missing from my sample.
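If you literally want the upper bin edge the question asks for (e.g. 0.905 maps to 0.91), you can round up instead of truncating. A small sketch, assuming 'PD' has at most three meaningful decimals and that exact multiples of 0.01 should map to themselves:

import numpy as np

# .round(6) first strips float noise (0.91 * 100 == 91.00000000000001),
# then the ceiling lands each value on the upper edge of its 0.01-wide bin
df['Grouping'] = np.ceil((df['PD'] * 100).round(6)) / 100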
My goal is to be able to identify price growth in a table of records.
I know this is probably far off from what is possible with data tools, so I appreciate any help or suggestions for improvement.
The immediate trouble I'm having is that scipy.stats.linregress fails when some of the data in the pandas rows is absent. I think some kind of masking or filling will be necessary to get the slope measure back for rows that contain nulls. (An exception is raised, but the rest still works.)
Also, am I using the best solution to find the growth?
I've observed that if I filter for the records that have a positive slope, a higher rvalue (correlation), and a lower stderr (standard error), the trendline for these rows is upward and consistent.
The reason I tried quantifying the price growth with the slope and other numeric values is because if I plot the lines from all the data in an excel chart, it's overwhelming to select the lines that show consistent upward movement because there is so much noise. Can it be done in a better way?
Here is the working sample:
# credit jezrael
import pandas as pd
import numpy as np
import scipy
from scipy import stats
def calc_slope(row):
a = scipy.stats.linregress(row, y=axisvalues)
return pd.Series(a._asdict())
table=pd.DataFrame({'Category':['A','A','A','B','C','C','C','B','B','A','A','A','B','B','D','A','B','B'],
'Quarter':['2016-Q1','2017-Q2','2017-Q3','2017-Q4','2017-Q2','2016-Q2','2017-Q2','2016-Q3','2016-Q4','2016-Q2','2016-Q3','2017-Q4','2016-Q1','2016-Q2','2016-Q4','2016-Q4','2017-Q2','2017-Q3'],
'Value':[100,200,500,800,700,900,300,400,600,200,300,400,200,300,100,300,500,600]})
db=(table.groupby(['Category','Quarter']).filter(lambda group: len(group) >= 1)).groupby(['Category','Quarter'])["Value"].mean()
db=db.unstack()
axisvalues=list(range(1,len(db.columns)+1)) #used in calc_slope function
db = db.join(db.apply(calc_slope,axis=1))
You can use:
# np.arange instead of range, so that boolean masking works
axisvalues = np.arange(1, len(db.columns) + 1)

def calc_slope(row):
    # mask NaNs out
    mask = row.notnull()
    a = scipy.stats.linregress(row[mask.values], y=axisvalues[mask.values])
    return pd.Series(a._asdict())

db = db.join(db.apply(calc_slope, axis=1))
print(db)
2016-Q1 2016-Q2 2016-Q3 2016-Q4 2017-Q2 2017-Q3 2017-Q4 \
Category
A 100.0 200.0 300.0 300.0 200.0 500.0 400.0
B 200.0 300.0 400.0 600.0 500.0 600.0 800.0
C NaN 900.0 NaN NaN 500.0 NaN NaN
D NaN NaN NaN 100.0 NaN NaN NaN
slope intercept rvalue pvalue stderr
Category
A 0.012895 0.315789 0.802955 0.029677 0.004281
B 0.010057 -0.885057 0.947623 0.001172 0.001516
C -0.007500 8.750000 -1.000000 0.000000 0.000000
D NaN NaN 0.000000 NaN NaN
But the last row produces RuntimeWarnings, because D has only one value (2016-Q4).
To suppress the warnings you can use filterwarnings (thanks Kdog):
import warnings
warnings.filterwarnings("ignore")
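The warnings come from groups with fewer than two points, where a slope is undefined. Instead of silencing all warnings globally, one option is to guard the helper itself; a sketch along these lines:

def calc_slope(row):
    mask = row.notnull()
    if mask.sum() < 2:  # slope is undefined with fewer than two points
        return pd.Series(dict.fromkeys(
            ['slope', 'intercept', 'rvalue', 'pvalue', 'stderr'], np.nan))
    a = scipy.stats.linregress(row[mask.values], y=axisvalues[mask.values])
    return pd.Series(a._asdict())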
I have a file of data consisting of dates in column one and a series of measurements in columns 2 through n. I like that Pandas understands dates, but I can't figure out how to do a simple best-fit line. A sample of my attempt follows.
from datetime import datetime
from StringIO import StringIO
import pandas as pd
zdata = '2013-01-01, 5.00, 100.0 \n 2013-01-02, 7.05, 98.2 \n 2013-01-03, 8.90, 128.0 \n 2013-01-04, 11.11, 127.2 \n 2013-01-05, 13.08, 140.0'
unames = ['date', 'm1', 'm2']
df = pd.read_table(StringIO(zdata), sep="[ ,]*", header=None, names=unames, \
parse_dates=True, index_col=0)
Y = pd.Series(df['m1'])
model = pd.ols(y=Y, x=df, intercept=True)
In [232]: model.beta['m1']
Out[232]: 0.99999999999999822
In [233]: model.beta['intercept']
Out[233]: -7.1054273576010019e-15
How do I interpret those numbers? If I use 1, 2, ..., 5 instead of dates, np.polyfit gives [ 2.024, 2.958], which are the slope and intercept I expect.
I looked for simple examples but didn't find any.
I believe you're doing multiple linear regression with the code you provided:
-------------------------Summary of Regression Analysis-------------------------
Formula: Y ~ <m1> + <m2> + <intercept>
Number of Observations: 5
Number of Degrees of Freedom: 3
R-squared: 1.0000
Adj R-squared: 1.0000
Rmse: 0.0000
F-stat (2, 2): inf, p-value: 0.0000
Degrees of Freedom: model 2, resid 2
-----------------------Summary of Estimated Coefficients------------------------
Variable Coef Std Err t-stat p-value CI 2.5% CI 97.5%
--------------------------------------------------------------------------------
m1 1.0000 0.0000 271549416425785.53 0.0000 1.0000 1.0000
m2 -0.0000 0.0000 -0.09 0.9382 -0.0000 0.0000
intercept -0.0000 0.0000 -0.02 0.9865 -0.0000 0.0000
---------------------------------End of Summary---------------------------------
Note the formula for regression: Y ~ <m1> + <m2> + <intercept>. If you want a simple linear regression for m1 and m2 separately, then you should create Xs:
X = pd.Series(range(1, len(df) + 1), index=df.index)
And make the regression:
model = pd.ols(y=Y, x=X, intercept=True)
Result:
-------------------------Summary of Regression Analysis-------------------------
Formula: Y ~ <x> + <intercept>
Number of Observations: 5
Number of Degrees of Freedom: 2
R-squared: 0.9995
Adj R-squared: 0.9993
Rmse: 0.0861
F-stat (1, 3): 5515.0414, p-value: 0.0000
Degrees of Freedom: model 1, resid 3
-----------------------Summary of Estimated Coefficients------------------------
Variable Coef Std Err t-stat p-value CI 2.5% CI 97.5%
--------------------------------------------------------------------------------
x 2.0220 0.0272 74.26 0.0000 1.9686 2.0754
intercept 2.9620 0.0903 32.80 0.0001 2.7850 3.1390
---------------------------------End of Summary---------------------------------
It's a bit weird that you got slightly different numbers when using np.polyfit. Here's my output:
[ 2.022 2.962]
Which is the same as pandas' ols output. I checked this with scipy's linregress and got the same result.
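A note for readers on current pandas: pd.ols was deprecated and later removed (around the 0.20 release). A rough equivalent of the simple regression above using statsmodels, as a sketch rather than a drop-in replacement:

import numpy as np
import statsmodels.api as sm

X = sm.add_constant(np.arange(1, len(df) + 1))  # column of ones plus the 1..n time index
model = sm.OLS(df['m1'].values, X).fit()
print(model.params)  # [intercept, slope] -- roughly [2.962, 2.022] for this data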