Regression with categorical variable - pandas

I want to achieve regression with a categorical variable. I have my dataset like this:
item_id rating gender
1 4 F
2 3 M
3 2 M
model = ols("rating ~ C(gender) + genre", data = data).fit()
Output:
========================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------
Intercept 3.5175 0.012 295.935 0.000 3.494 3.541
C(gender)[T.M] -0.0021 0.008 -0.257 0.797 -0.018 0.014
genre[T.Adventure] -0.0275 0.017 -1.622 0.105 -0.061 0.006
genre[T.Animation] 0.0064 0.027 0.240 0.810 -0.046 0.058
genre[T.Childrens] 0.0134 0.020 0.657 0.511 -0.027 0.054
genre[T.Comedy] 0.0293 0.014 2.130 0.033 0.002 0.056
Although this output is correct, it reports only a single coefficient for gender as a whole (C(gender)[T.M]). I would like a separate estimate for each gender, so that I can see the effect of the female gender and of the male gender.
I have tried to encode the gender as you would do with a categorical variable:
item_id rating gender
1 4 0
2 3 1
3 2 1
but it still does not give the desired output.

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
from sklearn.metrics import confusion_matrix
item_id =[1,2,3]
rating=[4,3,2]
gender=[0,1,1]
df=pd.DataFrame({'item_id':item_id, 'rating':rating,'gender':gender})
X=df[['rating']]
y=np.array(df['gender'])
logreg=LogisticRegression(C=100)
logreg.fit(X,y)
y_predictions=logreg.predict_proba(X)[:,1]
auc=roc_auc_score(y, y_predictions)
print("Area under the curve: ", auc)
print(y_predictions)
y_pred2 = logreg.predict(X)
cm = confusion_matrix(y,y_pred2)
print(cm)
predicted probabilities:
[0.05623715 0.94397376 0.99979014]
confusion matrix
[[1 0]
[0 2]]
data frame
item_id rating gender
0 1 4 0
1 2 3 1
2 3 2 1
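One way to get a separate estimate for each gender, rather than a single C(gender)[T.M] contrast against the baseline, is to give every level its own dummy column and drop the intercept. The sketch below does this with pd.get_dummies and np.linalg.lstsq instead of the statsmodels formula interface used in the question, and the six-row data frame is made up for illustration. With gender as the only regressor and no intercept, each coefficient is simply that gender's mean rating.

```python
import numpy as np
import pandas as pd

# Hypothetical data in the same shape as the question's table
df = pd.DataFrame({
    "item_id": [1, 2, 3, 4, 5, 6],
    "rating": [4, 3, 2, 5, 4, 1],
    "gender": ["F", "M", "M", "F", "F", "M"],
})

# One 0/1 column per gender level (no level is dropped as a baseline)
dummies = pd.get_dummies(df["gender"], prefix="gender").astype(float)

# Least squares without an intercept column: one coefficient per level
coef, *_ = np.linalg.lstsq(
    dummies.to_numpy(), df["rating"].to_numpy(dtype=float), rcond=None
)
per_gender = pd.Series(coef, index=dummies.columns)
print(per_gender)
```

In the formula interface the analogous step would presumably be removing the intercept with "- 1" in the formula string, but that variant is not shown or tested here.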

Related

Sklearn only predicts one class while dataset is fairly balanced (±80/20 split)

I am trying to come up with a way to check what are the most influential factors of a person not paying back a loan (defaulting). I have worked with the sklearn library quite intensively, but I feel like I am missing something quite trivial...
The dataframe looks like this:
0 7590-VHVEG Female Widowed Electronic check Outstanding loan 52000 20550 108 0.099 288.205374 31126.180361 0 No Employed No Dutch No 0
1 5575-GNVDE Male Married Bank transfer Other 42000 22370 48 0.083 549.272708 26365.089987 0 Yes Employed No Dutch No 0
2 3668-QPYBK Male Registered partnership Bank transfer Study 44000 24320 25 0.087 1067.134272 26678.356802 0 No Self-Employed No Dutch No 0
The distribution of the "DefaultInd" column (target variable) is this:
0 0.835408
1 0.164592
Name: DefaultInd, dtype: float64
I have label encoded the data to make it look like this:
CustomerID Gender MaritalStatus PaymentMethod SpendingTarget EstimatedIncome CreditAmount TermLoanMonths YearlyInterestRate MonthlyCharges TotalAmountPayments CurrentLoans SustainabilityIndicator EmploymentStatus ExistingCustomer Nationality BKR_Registration DefaultInd
0 7590-VHVEG 0 4 2 2 52000 20550 108 0.099 288.205374 31126.180361 0 0 0 0 5 0 0
1 5575-GNVDE 1 1 0 1 42000 22370 48 0.083 549.272708 26365.089987 0 1 0 0 5 0 0
2 3668-QPYBK 1 2 0 4 44000 24320 25 0.087 1067.134272 26678.356802 0 0 2 0 5 0
After that I removed NaNs and cleaned it up some more (removing capitalization, punctuation, etc.).
After that, I try to run this cell:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
y = df['DefaultInd']
X = df.drop(['CustomerID','DefaultInd'],axis=1)
X = X.astype(float)
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.20,random_state=42)
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
print(classification_report(y_test, y_pred))
Which results in this:
precision recall f1-score support
0 0.83 1.00 0.91 1073
1 0.00 0.00 0.00 213
accuracy 0.83 1286
macro avg 0.42 0.50 0.45 1286
weighted avg 0.70 0.83 0.76 1286
As you can see, the "1" class is never predicted. I am wondering whether this behaviour is to be expected (I think it is not). I tried class_weight='balanced', but that resulted in an average f1 score of 0.59 (instead of 0.76).
I feel like I am missing something. Or is this kind of behaviour expected, and should I rebalance the dataset before fitting? The split does not seem that skewed to me (±80/20), so I would not expect such a big problem.
Any help would be more than appreciated :)
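With an 83/17 prior and only a weak signal, a logistic regression can legitimately keep every predicted probability below the default 0.5 cut-off, in which case accuracy matches the majority share and the minority class is never predicted. Whether that is what happens here is an assumption, since the full data is not shown. The sketch below uses synthetic data (the column scales and the strength of the signal are invented), scales the features with StandardScaler, and compares the default LogisticRegression with class_weight='balanced', measured by minority-class recall rather than accuracy.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 2000

# Synthetic target with roughly 17% positives (hypothetical, mimics DefaultInd)
y = (rng.random(n) < 0.17).astype(int)

# Two invented features on very different scales; only "income" carries signal
income = 40_000 + 10_000 * rng.normal(loc=y, scale=1.0)
rate = 0.09 + 0.01 * rng.normal(size=n)
X = np.column_stack([income, rate])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

# Both models scale the features first; the only difference is class_weight
plain = make_pipeline(StandardScaler(), LogisticRegression())
balanced = make_pipeline(StandardScaler(), LogisticRegression(class_weight="balanced"))
plain.fit(X_train, y_train)
balanced.fit(X_train, y_train)

recall_plain = recall_score(y_test, plain.predict(X_test))
recall_balanced = recall_score(y_test, balanced.predict(X_test))
print("minority recall, default weights: ", recall_plain)
print("minority recall, balanced weights:", recall_balanced)
```

A lower weighted f1 with class_weight='balanced' is not necessarily a worse model: it trades some majority-class accuracy for minority-class recall, which may be what matters when the goal is to find defaults.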

Concatenate labels to an existing dataframe

I want to use a list of names, "headers", to create a new column in my dataframe. In the initial table, the name of each division is positioned above the results for each team in that division. I want to add that header to each row entry for each division to make the data more identifiable like this. I have the headers stored in the "headers" object in my code. How can I repeat each division header by the number of rows that appear in the division and append it to the dataset?
Edit: here is another snippet of what I want to get from the end product.
df3 = df.iloc[0:6]
df3.insert(0, 'Divisions', ['na','L5 Junior', 'L5 Junior', 'na',
'L5 Senior - Medium', 'L5 Senior - Medium'])
df3
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests
# Import HTML
scr = 'https://tv.varsity.com/results/7361971-2022-spirit-unlimited-battle-at-the-boardwalk-atlantic-city-grand-ntls/31220'
scr1 = requests.get(scr)
soup = BeautifulSoup(scr1.text, "html.parser")
# List of names to append
table_MN = pd.read_html(scr)
sp3 = soup.find(class_="full-content").find_all("h2")
headers = [elt.text for elt in sp3]
# Extract text and header from division info
div = pd.DataFrame(headers)
div.columns = ["division"]
df = pd.concat(table_MN, ignore_index=True)
df.columns = df.iloc[0]
df
It is still not clear what output you are looking for. However, may I suggest the following, which promotes the first row of each table in table_MN to its column header and then concatenates the results. If this is going in the right direction, please let me know and indicate what else you want to extract from the resulting table:
tmn_1 = [tbl.T.set_index(0).T for tbl in table_MN]
pd.concat(tmn_1, axis=0, ignore_index = True)
output:
Rank Program Name Team Name Raw Score Deductions Performance Score Event Score
-- ------ --------------------------- ----------------- ----------- ------------ ------------------- -------------
0 1 Rockstar Cheer New Jersey City Girls 47.8667 0 95.7333 95.6833
1 2 Cheer Factor Xtraordinary 46.6667 0.15 93.1833 92.8541
2 1 Rockstar Cheer New Jersey City Girls 47.7667 0 95.5333 23.8833
3 2 Cheer Factor Xtraordinary 46.0333 0.2 91.8667 22.9667
4 1 Star Athletics Roar 47.5333 0.9 94.1667 93.9959
5 1 Prime Time All Stars Lady Onyx 43.9 1.35 86.45 86.6958
6 1 Prime Time All Stars Lady Onyx 44.1667 0.9 87.4333 21.8583
7 1 Just Cheer All Stars Jag 5 46.4333 0.15 92.7167 92.2875
8 1 Just Cheer All Stars Jag 5 45.8 0.6 91 22.75
9 1 Quest Athletics Black Ops 47.4333 0.45 94.4167 93.725
10 1 Quest Athletics Black Ops 46.5 1.35 91.65 22.9125
11 1 The Stingray Allstars X-Rays 45.3 0.95 89.65 88.4375
12 1 Vortex Allstars Lady Rays 45.7 0.5 90.9 91.1083
13 1 Vortex Allstars Lady Rays 45.8667 0 91.7333 22.9333
14 1 Upper Merion All Stars Citrus 46.4333 0 92.8667 92.7
15 2 Cheer Factor JUNIOR X 45.9 1.1 90.7 90.6542
16 3 NJ Premier All Stars Prodigy 44.6333 0.05 89.2167 89.8292
17 1 Upper Merion All Stars Citrus 46.1 0 92.2 23.05
18 2 NJ Premier All Stars Prodigy 45.8333 0 91.6667 22.9167
19 3 Cheer Factor JUNIOR X 45.7333 0.95 90.5167 22.6292
20 1 Virginia Royalty Athletics Dynasty 46.5 0 93 92.9
21 1 Virginia Royalty Athletics Dynasty 46.3 0 92.6 23.15
22 1 South Jersey Storm Lady Reign 47.2333 0 94.4667 93.4875
...
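If each entry in headers corresponds to exactly one table in table_MN (an assumption the page may not satisfy, since some divisions appear to have more than one table), the division name can be attached to every row of its table before concatenating. The sketch below uses two small made-up tables in place of the scraped ones; the team names are placeholders.

```python
import pandas as pd

# Stand-ins for the scraped objects: one header per table (assumed one-to-one)
headers = ["L5 Junior", "L5 Senior - Medium"]
tables = [
    pd.DataFrame({"Rank": [1, 2], "Team Name": ["Team A", "Team B"]}),
    pd.DataFrame({"Rank": [1], "Team Name": ["Team C"]}),
]

# assign() broadcasts the scalar name to every row of that table
labelled = [tbl.assign(Divisions=name) for name, tbl in zip(headers, tables)]
result = pd.concat(labelled, ignore_index=True)

# Move the new column to the front, as in the desired output
result = result[["Divisions", "Rank", "Team Name"]]
print(result)
```

If the counts do not line up, zip silently stops at the shorter list, so comparing len(headers) with len(table_MN) first would be a sensible check.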

Get value of variable quantile per group

I have data that is categorized in groups, with a given quantile percentage per group. I want to create a threshold for each group that separates all values within the group based on the quantile percentage. So if one group has q=0.8, I want the lowest 80% of values given 1, and the upper 20% given 0.
So, given the data like this:
I want objects 1, 2 and 5 to get result 1 and the other 3 to get result 0. In total my data consists of 7,000,000 rows in 14,000 groups. I tried doing this with groupby.quantile, but that requires a constant quantile, whereas my data has a different one for each group.
Setup:
import numpy as np
import pandas as pd
num = 7_000_000
grp_num = 14_000
qua = np.around(np.random.uniform(size=grp_num), 2)
df = pd.DataFrame({
    "Group": np.random.randint(low=0, high=grp_num, size=num),
    "Quantile": 0.0,
    "Value": np.random.randint(low=100, high=300, size=num)
}).sort_values("Group").reset_index(0, drop=True)
# Look up each row's quantile from its group id (vectorized, same result as a per-group apply)
df["Quantile"] = qua[df["Group"].to_numpy()]
Answer: (This is basically a for loop, so for performance you can try to apply numba to this)
def func2(grp):
    # True where the value lies below the group's own quantile threshold
    return grp.Value < grp.Value.quantile(grp.Quantile.iloc[0])
df["result"] = df.groupby("Group").apply(func2).reset_index(0, drop=True).astype(int)
print(df)
Outputs:
Group Quantile Value result
0 0 0.33 156 1
1 0 0.33 259 0
2 0 0.33 166 1
3 0 0.33 183 0
4 0 0.33 111 1
... ... ... ... ...
6999995 13999 0.83 194 1
6999996 13999 0.83 227 1
6999997 13999 0.83 215 1
6999998 13999 0.83 103 1
6999999 13999 0.83 115 1
[7000000 rows x 4 columns]
CPU times: user 14.2 s, sys: 362 ms, total: 14.6 s
Wall time: 14.7 s
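A fully vectorized alternative, offered as a sketch rather than a drop-in replacement, is to compare each value's percentile rank within its group against that row's quantile. This is a different technique from the interpolated quantile threshold above: it marks the lowest q fraction of rows by rank, and ties share an averaged rank, so results can differ slightly at the boundary. The small frame below is made up for illustration.

```python
import pandas as pd

# Hypothetical data: two groups, each with its own quantile
df = pd.DataFrame({
    "Group": ["A"] * 5 + ["B"] * 4,
    "Quantile": [0.5] * 5 + [0.75] * 4,
    "Value": [30, 10, 50, 20, 40, 4, 1, 3, 2],
})

# Percentile rank of each value within its group, in (0, 1]
pct_rank = df.groupby("Group")["Value"].rank(pct=True)

# 1 for rows whose rank falls within the group's quantile, else 0
df["result"] = (pct_rank <= df["Quantile"]).astype(int)
print(df)
```

Because no Python function is called per group, this should scale better to 14,000 groups, although I have not timed it on the 7,000,000-row setup.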

Python : How to do conditional rounding in dataframe column values?

data = {'Name' : ['tom','bul','zack','doll','viru'],'price':[.2012,.05785,2.03,5.89,.029876]}
df = pd.DataFrame(data)
I want to round to 0 decimal points if the 'price' value is more than 1 and round to 4 decimal points if the value is less than 1. Please suggest.
If there are many conditions, I prefer using numpy.select, as in the following:
import numpy as np
df['price2'] = np.select(
    [df.price >= 1, df.price < 1],
    [df.price.round(0), df.price.round(2)],
)
# df
# Name price price2
# 0 tom 0.201200 0.20
# 1 bul 0.057850 0.06
# 2 zack 2.030000 2.00
# 3 doll 5.890000 6.00
# 4 viru 0.029876 0.03
With more conditions, we could do something like this:
df['price3'] = np.select(
    [df.price >= 3, df.price >= 1, df.price < 1],
    [df.price.round(0), df.price.round(2), df.price.round(3)],
)
# df
# Name price price2 price3
# 0 tom 0.201200 0.20 0.201
# 1 bul 0.057850 0.06 0.058
# 2 zack 2.030000 2.00 2.030
# 3 doll 5.890000 6.00 6.000
# 4 viru 0.029876 0.03 0.030
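For the exact rule in the question (0 decimals above 1, 4 decimals below 1) there are only two cases, so np.where is enough. This is a minimal sketch; the column name price_rounded is made up, and treating a price of exactly 1 as "above" is an assumption because the question does not say.

```python
import numpy as np
import pandas as pd

data = {'Name': ['tom', 'bul', 'zack', 'doll', 'viru'],
        'price': [.2012, .05785, 2.03, 5.89, .029876]}
df = pd.DataFrame(data)

# Round to 0 decimals where price >= 1, otherwise to 4 decimals
df['price_rounded'] = np.where(
    df['price'] >= 1,
    df['price'].round(0),
    df['price'].round(4),
)
print(df)
```

Both rounded columns are computed for every row and np.where only picks between them, which is fine for a cheap operation like round.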

Convert ordered levels to numeric in pandas

I was wondering whether there is any function in pandas that allows me to do this.
I have a column with levels [low, medium, high].
I would like to translate them to [1,2,3] to perform linear regression. However, what I am currently doing is df[df['interest_level'] == 'low'] = 1. Is there a better way of doing this?
Thanks.
Use the pd.factorize() method:
df['interest_level'] = pd.factorize(df['interest_level'])[0]
You can also categorize your new numerical values (this might save a lot of memory):
Sample DataFrame:
In [34]: df = pd.DataFrame({'interest_level':np.random.choice(['medium','high','low'], 10)})
In [35]: df
Out[35]:
interest_level
0 high
1 low
2 medium
3 high
4 low
5 high
6 high
7 low
8 low
9 medium
Solution:
In [36]: df['interest_level'], cats = pd.factorize(df['interest_level'])
In [37]: df['interest_level'] = pd.Categorical(df['interest_level'], categories=np.arange(len(cats)))
In [38]: df
Out[38]:
interest_level
0 0
1 1
2 2
3 0
4 1
5 0
6 0
7 1
8 1
9 2
In [39]: cats # this can be used for the backtracing ...
Out[39]: Index(['high', 'low', 'medium'], dtype='object')
In [40]: df.memory_usage()
Out[40]:
Index 80
interest_level 34 # <---- NOTE: only 34 bytes used for 10 integers
dtype: int64
In [41]: df.dtypes
Out[41]:
interest_level category
dtype: object
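One caveat worth checking before feeding these codes into a regression: pd.factorize numbers the labels in order of first appearance, so the codes do not necessarily follow low < medium < high. The sketch below, on a made-up series, contrasts that with an explicitly ordered pd.Categorical, whose codes follow the order of the categories list.

```python
import pandas as pd

s = pd.Series(["medium", "high", "low", "low", "medium"])

# factorize: codes follow first appearance (medium=0, high=1, low=2)
codes, uniques = pd.factorize(s)
print(codes, list(uniques))

# Ordered categorical: codes follow the given order (low=0, medium=1, high=2)
ordered = pd.Categorical(s, categories=["low", "medium", "high"], ordered=True)
ordered_codes = ordered.codes + 1  # shift to 1, 2, 3 as asked in the question
print(ordered_codes)
```

Passing sort=True to pd.factorize would not help here either, since it sorts the labels alphabetically (high, low, medium).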
You can use map:
d = {'low':1,'medium':2,'high':3}
df['interest_level'] = df['interest_level'].map(d)
Sample:
df = pd.DataFrame({'interest_level':['medium','high','low', 'low', 'medium']})
print (df)
interest_level
0 medium
1 high
2 low
3 low
4 medium
d = {'low':1,'medium':2,'high':3}
df['interest_level'] = df['interest_level'].map(d)
print (df)
interest_level
0 2
1 3
2 1
3 1
4 2
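Another solution is to cast to an ordered Categorical and then use cat.codes:
categories = ['low','medium','high']
# astype('category', categories=...) is no longer accepted by recent pandas; pass a CategoricalDtype instead
df['interest_level'] = df['interest_level'].astype(
    pd.CategoricalDtype(categories=categories, ordered=True)
).cat.codes + 1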
print (df)
interest_level
0 2
1 3
2 1
3 1
4 2