Python: How to do conditional rounding on DataFrame column values? - pandas

import pandas as pd

data = {'Name': ['tom', 'bul', 'zack', 'doll', 'viru'],
        'price': [.2012, .05785, 2.03, 5.89, .029876]}
df = pd.DataFrame(data)
I want to round to 0 decimal places if the 'price' value is more than 1, and to 4 decimal places if the value is less than 1. Please suggest.

If there are multiple conditions, I prefer using numpy.select, as in the following:
import numpy as np

df['price2'] = np.select(
    [df.price >= 1, df.price < 1],
    [df.price.round(0), df.price.round(2)],  # use .round(4) here to match the question's 4 decimal places
)
# df
#    Name     price  price2
# 0   tom  0.201200    0.20
# 1   bul  0.057850    0.06
# 2  zack  2.030000    2.00
# 3  doll  5.890000    6.00
# 4  viru  0.029876    0.03
With more conditions, we could do something like this:
df['price3'] = np.select(
    [df.price >= 3, df.price >= 1, df.price < 1],
    [df.price.round(0), df.price.round(2), df.price.round(3)],
)
# df
#    Name     price  price2  price3
# 0   tom  0.201200    0.20   0.201
# 1   bul  0.057850    0.06   0.058
# 2  zack  2.030000    2.00   2.030
# 3  doll  5.890000    6.00   6.000
# 4  viru  0.029876    0.03   0.030
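For just a two-way split, a simpler alternative (my sketch, not part of the original answer) is numpy.where:
import numpy as np

# Two-way conditional rounding: 0 decimals at or above 1, 4 decimals below.
df['price2'] = np.where(df.price >= 1, df.price.round(0), df.price.round(4))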

Related

Group by and join columns in Pandas Dataframe

For every column in the list cat_column, I need to loop over the list numerical_cols and compute the mean and standard deviation. I have the code below that does this, but at the end of the second loop I need a final table with the respective cat_column values and the mean and standard deviation of all numerical columns, like below.
Code1 Code2 Mean_Code1_CarAge Std_Code1_CarAge Mean_Code1_CarPrice Std_Code1_CarPrice Mean_Code2_CarAge Std_Code2_CarAge Mean_Code2_CarPrice Std_Code2_CarPrice
Code:
cat_column = ["Code1", "Code2"]
numerical_cols = ['CarAge', 'CarPrice']
for base_col in cat_column:
    for col in numerical_cols:
        df = df.groupby(base_col)[col].agg([np.mean, np.std]).reset_index().rename(
            columns={'mean': 'mean_' + base_col + "_" + col,
                     'std': 'std_' + base_col + "_" + col})
Input:
Code1 Code2  CarAge  CarPrice
  AAA   AA1      12      5000
  BBB   BB1      30     10000
  CCC   CC1      64     22000
  AAA   AA1      19      4000
  BBB   BB1      49     10000
Output:
Code1 Code2  Mean_Code1_CarAge  Std_Code1_CarAge  Mean_Code1_CarPrice  Std_Code1_CarPrice  Mean_Code2_CarAge  Std_Code2_CarAge  Mean_Code2_CarPrice  Std_Code2_CarPrice
  AAA   AA1               15.5              4.95                 4500              707.10               15.5              4.95                 4500              707.10
  BBB   BB1               39.5             13.43                10000                0.00               39.5             13.43                10000                0.00
  CCC   CC1               64.0               NaN                22000                 NaN               64.0               NaN                22000                 NaN
Not sure how to do that dynamically in the above code. Any leads/suggestions would be appreciated.
Try groupby aggregate using a dictionary made from the values in numerical_cols, then reduce the MultiIndex columns using map, and lastly concat on axis=1:
import pandas as pd

df = pd.DataFrame({
    'Code1': {0: 'AAA', 1: 'BBB', 2: 'CCC', 3: 'AAA', 4: 'BBB'},
    'Code2': {0: 'AA1', 1: 'BB1', 2: 'CC1', 3: 'AA1', 4: 'BB1'},
    'CarAge': {0: 12, 1: 30, 2: 64, 3: 19, 4: 49},
    'CarPrice': {0: 5000, 1: 10000, 2: 22000, 3: 4000, 4: 10000}
})

cat_columns = ["Code1", "Code2"]
numerical_cols = ['CarAge', 'CarPrice']

# Create a dictionary to map keys to aggregation types
agg_d = {k: ['mean', 'std'] for k in numerical_cols}

dfs = []
for cat_column in cat_columns:
    # Groupby agg to get the aggregations for each key in agg_d per group
    g = df.groupby(cat_column).aggregate(agg_d)
    # Reduce the MultiIndex columns to flat names like mean_Code1_CarAge
    g.columns = g.columns.map(lambda x: f'{x[1]}_{cat_column}_{x[0]}')
    # Reset the index so the group key becomes a column
    g = g.reset_index()
    dfs.append(g)

# Concat on axis 1
new_df = pd.concat(dfs, axis=1)
# Reorder columns so the key columns come first
new_df = new_df[[*cat_columns, *new_df.columns.difference(cat_columns)]]
print(new_df.to_string())
new_df:
  Code1 Code2  mean_Code1_CarAge  mean_Code1_CarPrice  mean_Code2_CarAge  mean_Code2_CarPrice  std_Code1_CarAge  std_Code1_CarPrice  std_Code2_CarAge  std_Code2_CarPrice
0   AAA   AA1               15.5                 4500               15.5                 4500          4.949747          707.106781          4.949747          707.106781
1   BBB   BB1               39.5                10000               39.5                10000         13.435029            0.000000         13.435029            0.000000
2   CCC   CC1               64.0                22000               64.0                22000               NaN                 NaN               NaN                 NaN
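One caveat (my note, not part of the original answer): pd.concat(axis=1) lines the frames up positionally, which only works here because both key columns produce the same number of groups in the same order. When that assumption may not hold, merging each aggregate back on its key column is safer:
# Hypothetical variant: join each per-key aggregate onto the unique key pairs
# by merging on the grouping column instead of concatenating positionally.
new_df = df[cat_columns].drop_duplicates().reset_index(drop=True)
for g, cat_column in zip(dfs, cat_columns):
    new_df = new_df.merge(g, on=cat_column, how='left')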

Get value of variable quantile per group

I have data that is categorized in groups, with a given quantile percentage per group. I want to create a threshold for each group that separates all values within the group based on that quantile percentage. So if one group has q=0.8, I want the lowest 80% of values to be given 1, and the upper 20% to be given 0.
So, given data like this:
I want objects 1, 2 and 5 to get result 1 and the other 3 result 0. In total my data consists of 7,000,000 rows with 14,000 groups. I tried doing this with groupby.quantile, but for that I need a constant quantile measure, whereas my data has a different one for each group.
Setup:
import numpy as np
import pandas as pd

num = 7_000_000
grp_num = 14_000
qua = np.around(np.random.uniform(size=grp_num), 2)

df = pd.DataFrame({
    "Group": np.random.randint(low=0, high=grp_num, size=num),
    "Quantile": 0.0,
    "Value": np.random.randint(low=100, high=300, size=num)
}).sort_values("Group").reset_index(drop=True)

def func(grp):
    grp["Quantile"] = qua[grp.Group]
    return grp

df = df.groupby("Group").apply(func)
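As an aside (my note, assuming Group values index directly into qua as above), the Quantile column can be filled without groupby.apply by plain NumPy indexing:
# Shortcut: index the qua array directly with the Group codes,
# which is much faster than looping over groups.
df["Quantile"] = qua[df["Group"].to_numpy()]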
Answer (this is basically a for loop, so for performance you could try applying numba to it):
def func2(grp):
    return grp.Value < grp.Value.quantile(grp.Quantile.iloc[0])

df["result"] = df.groupby("Group").apply(func2).reset_index(0, drop=True)
print(df)
Outputs:
         Group  Quantile  Value  result
0            0      0.33    156       1
1            0      0.33    259       0
2            0      0.33    166       1
3            0      0.33    183       0
4            0      0.33    111       1
...        ...       ...    ...     ...
6999995  13999      0.83    194       1
6999996  13999      0.83    227       1
6999997  13999      0.83    215       1
6999998  13999      0.83    103       1
6999999  13999      0.83    115       1

[7000000 rows x 4 columns]
CPU times: user 14.2 s, sys: 362 ms, total: 14.6 s
Wall time: 14.7 s
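A fully vectorized alternative (my sketch, not from the original answer) compares each value's within-group percentile rank to the group's quantile. This is only approximately equivalent to comparing against the interpolated quantile and can differ at ties and boundaries:
# Approximate vectorized version: a value counts as below the group's
# quantile when its within-group percentile rank does not exceed Quantile.
pct_rank = df.groupby("Group")["Value"].rank(pct=True, method="max")
df["result_fast"] = (pct_rank <= df["Quantile"]).astype(int)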

Regression with categorical variable

I want to achieve regression with a categorical variable. I have my dataset like this:
item_id  rating  gender
      1       4       F
      2       3       M
      3       2       M
from statsmodels.formula.api import ols

model = ols("rating ~ C(gender) + genre", data=data).fit()
Output:
========================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
Intercept            3.5175      0.012    295.935      0.000       3.494       3.541
C(gender)[T.M]      -0.0021      0.008     -0.257      0.797      -0.018       0.014
genre[T.Adventure]  -0.0275      0.017     -1.622      0.105      -0.061       0.006
genre[T.Animation]   0.0064      0.027      0.240      0.810      -0.046       0.058
genre[T.Childrens]   0.0134      0.020      0.657      0.511      -0.027       0.054
genre[T.Comedy]      0.0293      0.014      2.130      0.033       0.002       0.056
Although this gives correct output, it only shows the overall effect of gender (a single coefficient for M relative to F), and I would like to get it for each gender separately, i.e. to see the effect for the female and the male gender.
I have tried to encode the gender as you would do with a categorical variable:
item_id  rating  gender
      1       4       0
      2       3       1
      3       2       1
but it still does not give the desired output.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.metrics import confusion_matrix

item_id = [1, 2, 3]
rating = [4, 3, 2]
gender = [0, 1, 1]
df = pd.DataFrame({'item_id': item_id, 'rating': rating, 'gender': gender})

X = df[['rating']]
y = np.array(df['gender'])

logreg = LogisticRegression(C=100)
logreg.fit(X, y)
y_predictions = logreg.predict_proba(X)[:, 1]
auc = roc_auc_score(y, y_predictions)
print("Area under the curve: ", auc)
print(y_predictions)

y_pred2 = logreg.predict(X)
cm = confusion_matrix(y, y_pred2)
print(cm)
Predicted probabilities:
[0.05623715 0.94397376 0.99979014]
Confusion matrix:
[[1 0]
 [0 2]]
Data frame:
   item_id  rating  gender
0        1       4       0
1        2       3       1
2        3       2       1
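To get a coefficient for each gender level instead of a single M-vs-F contrast, one option (my sketch, assuming the same data and statsmodels setup as the question) is to drop the intercept in the formula, which makes patsy dummy-code both levels explicitly:
from statsmodels.formula.api import ols

# With the intercept removed ("- 1"), the summary shows one coefficient per
# gender level (F and M) rather than only the M level relative to F.
model = ols("rating ~ C(gender) + genre - 1", data=data).fit()
print(model.summary())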

Normalizing and denormalizing rows in a dataframe

I have a dataframe with 20k rows and 100 columns. I am trying to normalize my data across rows. Scikit-learn's MinMaxScaler doesn't allow me to do this by rows. There is minmax_scale, which allows row normalization, but I cannot denormalize it later; at least, I don't see how to do it. How would you guys do it?
Instead of sklearn.preprocessing.minmax_scale, you can do the scaling manually and store the min and max vectors, so that the transform can be inverted later:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 5],
                   'B': [88, 300, 200]})

# Find and store min and max vectors
min_values = df.min()
max_values = df.max()

normalized_df = (df - min_values) / (max_values - min_values)
denormalized_df = normalized_df * (max_values - min_values) + min_values
df:
   A    B
0  1   88
1  2  300
2  5  200
normalized_df:
      A         B
0  0.00  0.000000
1  0.25  1.000000
2  1.00  0.528302
denormalized_df:
     A      B
0  1.0   88.0
1  2.0  300.0
2  5.0  200.0
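Since the question asks for row-wise normalization, here is the same idea along axis=1 (my sketch; the example above scales column-wise):
# Row-wise min-max scaling: keep the per-row min and max vectors so the
# transform can be undone later.
row_min = df.min(axis=1)
row_max = df.max(axis=1)
normalized_rows = df.sub(row_min, axis=0).div(row_max - row_min, axis=0)
denormalized_rows = normalized_rows.mul(row_max - row_min, axis=0).add(row_min, axis=0)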

Pandas Dataframe how to iterate over rows and perform calculations on their values

I've started to work with Pandas DataFrames and am trying to figure out how to deal with the task below.
I have an Excel spreadsheet that needs to be imported into a Pandas DataFrame, and the calculations below need to be done to populate the PercentageOnSale, Bonus and EmployeesIncome columns.
If the sum of all SalesValues for the EmployeeID is less than 5000, the PercentageOnSale should be 5% of SalesValue.
If the sum of all SalesValues for the EmployeeID is equal to or more than 5000, the PercentageOnSale should be 7% of SalesValue.
If the sum of all SalesValues for the EmployeeID is more than 10,000, the PercentageOnSale should be 7% of SalesValue and additionally a Bonus of 3% should be calculated.
EmployeesIncome is the sum of the PercentageOnSale and Bonus columns.
(sample Excel view omitted)
You could try groupby-apply as follows:
import pandas as pd

# Data
df = pd.DataFrame({"EmployeeID": [1, 1, 2, 3, 1, 3, 5, 1],
                   "ProductSold": ["P1", "P2", "P3", "P1", "P2", "P3", "P1", "P2"],
                   "SalesValue": [3000, 3500, 4000, 3000, 5000, 3000, 3000, 4000]})

# Calculations
def calculate(x):
    # Calculate Bonus: 3% on each sale if the employee's total exceeds 10,000
    x['Bonus'] = 0
    if x['SalesValue'].sum() > 10000:
        x['Bonus'] = 0.03 * x['SalesValue']
    # Calculate PercentageOnSale: 5% if the employee's total is below 5000, else 7%
    if x['SalesValue'].sum() < 5000:
        x['PercentageOnSale'] = 0.05 * x['SalesValue']
    else:
        x['PercentageOnSale'] = 0.07 * x['SalesValue']
    # Total income per sale
    x['EmployeesIncome'] = x['PercentageOnSale'] + x['Bonus']
    return x

df_final = df.groupby('EmployeeID').apply(calculate)
The output is as follows:
   EmployeeID ProductSold  SalesValue  Bonus  PercentageOnSale  EmployeesIncome
0           1          P1        3000   90.0             210.0            300.0
1           1          P2        3500  105.0             245.0            350.0
2           2          P3        4000    0.0             200.0            200.0
3           3          P1        3000    0.0             210.0            210.0
4           1          P2        5000  150.0             350.0            500.0
5           3          P3        3000    0.0             210.0            210.0
6           5          P1        3000    0.0             150.0            150.0
7           1          P2        4000  120.0             280.0            400.0
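A vectorized alternative (my sketch, not from the original answer) avoids groupby.apply by broadcasting each employee's total back onto the rows with transform:
import numpy as np

# Per-employee total sales, repeated on every row of that employee
totals = df.groupby('EmployeeID')['SalesValue'].transform('sum')
# 5% below a 5000 total, otherwise 7%; 3% bonus when the total exceeds 10,000
df['PercentageOnSale'] = np.where(totals < 5000, 0.05, 0.07) * df['SalesValue']
df['Bonus'] = np.where(totals > 10000, 0.03, 0.0) * df['SalesValue']
df['EmployeesIncome'] = df['PercentageOnSale'] + df['Bonus']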