Group by and join columns in Pandas Dataframe - pandas

For every column in the list cat_column, I need to loop over the list numerical_cols and compute the mean and standard deviation. I have the code below that does it, but at the end of the second loop I need a final table with the respective cat_column values and the mean and standard deviation of all the numerical columns, like below.
Code1 Code2 Mean_Code1_CarAge Std_Code1_CarAge Mean_Code1_CarPrice Std_Code1_CarPrice Mean_Code2_CarAge Std_Code2_CarAge Mean_Code2_CarPrice Std_Code2_CarPrice
Code:
import numpy as np

cat_column = ["Code1", "Code2"]
numerical_cols = ['CarAge', 'CarPrice']
for base_col in cat_column:
    for col in numerical_cols:
        df = df.groupby(base_col)[col].agg([np.mean, np.std]).reset_index().rename(
            columns={'mean': 'mean_' + base_col + "_" + col, 'std': 'std_' + base_col + "_" + col})
Input:
Code1 Code2 CarAge CarPrice
AAA AA1 12 5000
BBB BB1 30 10000
CCC CC1 64 22000
AAA AA1 19 4000
BBB BB1 49 10000
Output:
Code1 Code2 Mean_Code1_CarAge Std_Code1_CarAge Mean_Code1_CarPrice Std_Code1_CarPrice Mean_Code2_CarAge Std_Code2_CarAge Mean_Code2_CarPrice Std_Code2_CarPrice
AAA AA1 15.5 4.95 4500 707.10 15.5 4.95 4500 707.10
BBB BB1 39.5 13.43 10000 0.00 39.5 13.43 10000 0.00
CCC CC1 64.0 NaN 22000 NaN 64.0 NaN 22000 NaN
Not sure how to do that dynamically in the above code. Any leads/suggestions would be appreciated.

Try groupby + aggregate using a dictionary made from the values in numerical_cols, then reduce the MultiIndex using map, and lastly concat on axis=1:
import pandas as pd

df = pd.DataFrame({
    'Code1': {0: 'AAA', 1: 'BBB', 2: 'CCC', 3: 'AAA', 4: 'BBB'},
    'Code2': {0: 'AA1', 1: 'BB1', 2: 'CC1', 3: 'AA1', 4: 'BB1'},
    'CarAge': {0: 12, 1: 30, 2: 64, 3: 19, 4: 49},
    'CarPrice': {0: 5000, 1: 10000, 2: 22000, 3: 4000, 4: 10000}
})

cat_columns = ["Code1", "Code2"]
numerical_cols = ['CarAge', 'CarPrice']
# Create a dictionary to map keys to aggregation types
agg_d = {k: ['mean', 'std'] for k in numerical_cols}

dfs = []
for cat_column in cat_columns:
    # Groupby + agg to get the aggregations for each key in agg_d per group
    g = df.groupby(cat_column).aggregate(agg_d)
    # Reduce the MultiIndex columns to flat names
    g.columns = g.columns.map(lambda x: f'{x[1]}_{cat_column}_{x[0]}')
    # Reset index
    g = g.reset_index()
    dfs.append(g)

# Concat on axis 1
new_df = pd.concat(dfs, axis=1)
# Re-order columns
new_df = new_df[[*cat_columns, *new_df.columns.difference(cat_columns)]]
print(new_df.to_string())
new_df:
Code1 Code2 mean_Code1_CarAge mean_Code1_CarPrice mean_Code2_CarAge mean_Code2_CarPrice std_Code1_CarAge std_Code1_CarPrice std_Code2_CarAge std_Code2_CarPrice
0 AAA AA1 15.5 4500 15.5 4500 4.949747 707.106781 4.949747 707.106781
1 BBB BB1 39.5 10000 39.5 10000 13.435029 0.000000 13.435029 0.000000
2 CCC CC1 64.0 22000 64.0 22000 NaN NaN NaN NaN
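
The concat above relies on the two per-key tables lining up row-for-row, which holds here because Code1 and Code2 are in one-to-one correspondence. If that is not guaranteed, a merge-based sketch, reusing the df, cat_columns and agg_d defined above, attaches each per-key table explicitly by its key instead of by position:

keys = df[cat_columns].drop_duplicates().reset_index(drop=True)
for cat_column in cat_columns:
    # Same per-key aggregation and column flattening as above
    g = df.groupby(cat_column).aggregate(agg_d)
    g.columns = g.columns.map(lambda x: f'{x[1]}_{cat_column}_{x[0]}')
    # Join the stats onto the unique key combinations by key rather than by position
    keys = keys.merge(g.reset_index(), on=cat_column, how='left')
print(keys.to_string())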


I am trying to unwrap? explode? a data frame with several columns into a new data frame with more rows

I apologize for not knowing the correct terminology, but I am looking for a way in Pandas to transform a data frame with several similar columns into a data frame whose rows explode? into more rows. Basically, for every column that starts with Line.{x}, I want to create a new row that has all of the Line.{x} columns. The same goes for every value of {x}, e.g. 1, 2, 3.
Here is an example of a data frame I'd like to convert from:
Column1 Column2 Column3 Column4 Line.0.a Line.0.b Line.0.c Line.1.a Line.1.b Line.1.c Line.2.a Line.2.b Line.2.c Line.3.a Line.3.b Line.3.c
0 the quick brown dog 100 200 300 400 500 600 700 800 900 1000 1100 1200
1 you see spot run 101 201 301 401 501 601 NaN NaN NaN NaN NaN NaN
2 four score and seven 102 202 302 NaN NaN NaN NaN NaN NaN NaN NaN NaN
I would like to convert it to this:
Column1 Column2 Column3 Column4 Line.a Line.b Line.c
0 the quick brown dog 100 200 300
1 the quick brown dog 400 500 600
2 the quick brown dog 700 800 900
3 the quick brown dog 1000 1100 1200
4 you see spot run 101 201 301
5 you see spot run 401 501 601
6 four score and seven 102 202 302
Thank you in advance!
One option is with pivot_longer from pyjanitor; for this particular use case, you pass the .value placeholder to names_to to keep track of the parts of the column names you want to retain as headers, and then pass a regular expression with matching groups to names_pattern:
# pip install pyjanitor
import pandas as pd
import janitor
df.pivot_longer(index='Col*',
                names_to = (".value", ".value"),
                names_pattern = r"(.+)\.\d+(.+)")
Column1 Column2 Column3 Column4 Line.a Line.b Line.c
0 the quick brown dog 100.0 200.0 300.0
1 you see spot run 101.0 201.0 301.0
2 four score and seven 102.0 202.0 302.0
3 the quick brown dog 400.0 500.0 600.0
4 you see spot run 401.0 501.0 601.0
5 four score and seven NaN NaN NaN
6 the quick brown dog 700.0 800.0 900.0
7 you see spot run NaN NaN NaN
8 four score and seven NaN NaN NaN
9 the quick brown dog 1000.0 1100.0 1200.0
10 you see spot run NaN NaN NaN
11 four score and seven NaN NaN NaN
You can get rid of the nulls with dropna:
(df
 .pivot_longer(
     index='Col*',
     names_to = (".value", ".value"),
     names_pattern = r"(.+)\.\d+(.+)")
 .dropna()
)
Column1 Column2 Column3 Column4 Line.a Line.b Line.c
0 the quick brown dog 100.0 200.0 300.0
1 you see spot run 101.0 201.0 301.0
2 four score and seven 102.0 202.0 302.0
3 the quick brown dog 400.0 500.0 600.0
4 you see spot run 401.0 501.0 601.0
6 the quick brown dog 700.0 800.0 900.0
9 the quick brown dog 1000.0 1100.0 1200.0
Another option, as pointed out by @Mozway, is to convert the columns into a MultiIndex and stack:
temp = df.set_index(['Column1', 'Column2', 'Column3','Column4'])
# this is where the MultiIndex is created
cols = temp.columns.str.split('.', expand=True)
temp.columns = cols
# now we stack
# nulls are dropped by default
temp = temp.stack(level=1).droplevel(-1)
temp.columns = temp.columns.map('.'.join)
temp.reset_index()
Column1 Column2 Column3 Column4 Line.a Line.b Line.c
0 the quick brown dog 100.0 200.0 300.0
1 the quick brown dog 400.0 500.0 600.0
2 the quick brown dog 700.0 800.0 900.0
3 the quick brown dog 1000.0 1100.0 1200.0
4 you see spot run 101.0 201.0 301.0
5 you see spot run 401.0 501.0 601.0
6 four score and seven 102.0 202.0 302.0
Here is an approach that works. It uses melt and then joins.
new_df contains what you need, though the order of items might differ. The function takes three parameters: your data frame, the keys that remain static, and a conversion dict that tells what goes where.
import pandas as pd

def vars_to_cases(df: pd.DataFrame, keys: list, convertion_dict: dict):
    vals = list(convertion_dict.values())
    l = len(vals[0])
    if not all(len(item) == l for item in vals):
        raise Exception("Dictionary values don't have the same length")
    tempkeys = keys.copy()
    tempkeys.append("variable")
    df_data = pd.DataFrame()
    for short_name, my_list in convertion_dict.items():
        my_replace_dict = {}
        for count, item in enumerate(my_list):
            my_replace_dict[item] = count
        mydf = pd.melt(df, id_vars=tempkeys[:-1], value_vars=my_list)
        mydf["variable"].replace(my_replace_dict, inplace=True)
        mydf.rename(columns={"value": short_name}, inplace=True)
        mydf = mydf.set_index(tempkeys)
        if df_data.empty:
            df_data = mydf.copy()
        else:
            df_data = df_data.join(mydf)
    return df_data
#here is the data
df=pd.DataFrame({'Column1': {0: 'the', 1: 'you', 2: 'four'},
'Column2': {0: 'quick', 1: 'see', 2: 'score'},
'Column3': {0: 'brown', 1: 'spot', 2: 'and'},
'Column4': {0: 'dog', 1: 'run', 2: 'seven'},
'Line.0.a': {0: 100, 1: 101, 2: 102},
'Line.0.b': {0: 200, 1: 201, 2: 202},
'Line.0.c': {0: 300, 1: 301, 2: 302},
'Line.1.a': {0: 400.0, 1: 401.0, 2: None},
'Line.1.b': {0: 500.0, 1: 501.0, 2: None},
'Line.1.c': {0: 600.0, 1: 601.0, 2: None},
'Line.2.a': {0: 700.0, 1: None, 2: None},
'Line.2.b': {0: 800.0, 1: None, 2: None},
'Line.2.c': {0: 900.0, 1: None, 2: None},
'Line.3.a': {0: 1000.0, 1: None, 2: None},
'Line.3.b': {0: 1100.0, 1: None, 2: None},
'Line.3.c': {0: 1200.0, 1: None, 2: None}})
convertion_dict={"Line.a":["Line.0.a","Line.1.a","Line.2.a","Line.3.a"],
"Line.b":["Line.0.b","Line.1.b","Line.2.b","Line.3.b"],
"Line.c":["Line.0.c","Line.1.c","Line.2.c","Line.3.c"]}
keys=["Column1","Column2","Column3","Column4"]
new_df=vars_to_cases(df,keys,convertion_dict)
new_df=new_df.reset_index()
new_df=new_df.dropna()
new_df=new_df.drop(columns="variable")

Create a dataframe from a series with a TimeSeriesIndex multiplied by another series

Let's say I have a series, ser1, with a TimeSeriesIndex of length x. I also have another series, ser2, of length y. How do I multiply these so that I get a dataframe of shape (x, y) where the index is from ser1 and the columns are the indices from ser2? I want every element of ser2 to be multiplied by the value of each element in ser1.
import pandas as pd
ser1 = pd.Series([100, 105, 110, 114, 89],index=pd.date_range(start='2021-01-01', end='2021-01-05', freq='D'), name='test')
test_ser2 = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
Perhaps this is more elegantly done with numpy.
Try this using np.outer with pandas DataFrame constructor:
import numpy as np

pd.DataFrame(np.outer(ser1, test_ser2), index=ser1.index, columns=test_ser2.index)
Output:
a b c d e
2021-01-01 100 200 300 400 500
2021-01-02 105 210 315 420 525
2021-01-03 110 220 330 440 550
2021-01-04 114 228 342 456 570
2021-01-05 89 178 267 356 445
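
For comparison, a small sketch of the same thing with plain NumPy broadcasting instead of np.outer, reusing ser1 and test_ser2 as defined above:

# Reshape ser1 into an (x, 1) column so it broadcasts against the (y,) values of test_ser2
pd.DataFrame(ser1.to_numpy()[:, None] * test_ser2.to_numpy(),
             index=ser1.index, columns=test_ser2.index)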

Rounding the numeric output of pandas pivot tables and SciPy's stats.mode

Rather than the mean score displaying as 91.144105, how can I display 91.1 instead?
Rather than the mode score displaying as ([90.0], [77]), how can I display 90 instead?
Code snippet and output:
from scipy import stats
import numpy as np
import pandas as pd

pd.pivot_table(df_inspections_violations, index=['ACTIVITY YEAR', 'FACILITY ZIP'], values="SCORE",
               aggfunc=['mean', 'median', stats.mode])
You can use style.format (documentation).
But you'd better split the mode SCORE column into its value and count parts, so that you can use a dictionary to control each single column, for example:
df = pd.DataFrame({
    'a': np.linspace(0, 1, 7),
    'b': np.linspace(31, 90, 7),
    'c': np.arange(10, 17)
})
df.style.format({
    'a': "{:.2f}",
    'b': "{:.1f}",
    'c': int,
})
Output
a b c
0 0.00 31.0 10
1 0.17 40.8 11
2 0.33 50.7 12
3 0.50 60.5 13
4 0.67 70.3 14
5 0.83 80.2 15
6 1.00 90.0 16
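
If you would rather bake the rounding into the aggregated values instead of only formatting the display, here is a minimal sketch, reusing df_inspections_violations and the column names from the question, and swapping scipy.stats.mode for pandas' own Series.mode so the aggregation returns a plain scalar:

pivot = pd.pivot_table(
    df_inspections_violations,
    index=['ACTIVITY YEAR', 'FACILITY ZIP'],
    values='SCORE',
    # Series.mode returns all tied modes; take the first one as a scalar
    aggfunc=['mean', 'median', lambda s: s.mode().iat[0]],
)
# Round the mean column in the data itself rather than via a Styler
pivot[('mean', 'SCORE')] = pivot[('mean', 'SCORE')].round(1)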

Pandas transform rows with specific character

I am working on feature transformation and ran into this issue. Let me know what you think. Thanks!
I have a table like this
And I want to create an output column like this
Some info:
All the outputs will be based on numbers that end with a ':'
I have 100M+ rows in this table. Need to consider performance issue.
Let me know if you have some good ideas. Thanks!
Here is some copy and paste-able sample data:
df = pd.DataFrame({'Number': {0: '1000',1: '1000021', 2: '15:00', 3: '23424234',
4: '23423', 5: '3', 6 : '9:00', 7: '3423', 8: '32', 9: '7:00'}})
Solution #1:
You can use .str.contains(':') with np.where() to pick out the values that contain a ':' and return np.nan otherwise. Then use ffill() to fill the NaN values down:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Number': {0: '1000',1: '1000021', 2: '15:00', 3: '23424234',
4: '23423', 5: '3', 6 : '9:00', 7: '3423', 8: '32', 9: '7:00'}})
df['Output'] = np.where(df['Number'].str.contains(':'),df['Number'].str.split(':').str[0],np.nan)
df['Output'] = df['Output'].ffill()
df
Solution #2: Even easier, and with potentially better performance, you can use a regex with str.extract() and then again ffill():
df['Output'] = df['Number'].str.extract(r'^(\d+):').ffill()
df
Out[1]:
Number Output
0 1000 NaN
1 1000021 NaN
2 15:00 15
3 23424234 15
4 23423 15
5 3 15
6 9:00 9
7 3423 9
8 32 9
9 7:00 7
I think this is what you are looking for:
import pandas as pd
c = ['Number']
d = ['1:00',100,1001,1321,3254,'15:00',20,60,80,90,'4:00',26,45,90,89]
df = pd.DataFrame(data=d,columns=c)
temp= df['Number'].str.split(":", n = 1, expand = True)
df['New_Val'] = temp[0].ffill()
print(df)
The output of this will be as follows:
Number New_Val
0 1:00 1
1 100 1
2 1001 1
3 1321 1
4 3254 1
5 15:00 15
6 20 15
7 60 15
8 80 15
9 90 15
10 4:00 4
11 26 4
12 45 4
13 90 4
14 89 4
It looks like your DataFrame has string values; the example above treated them as a mix of numbers and strings. Here's the solution if df['Number'] contains only strings.
df1 = pd.DataFrame({'Number': {0: '1000',1: '1000021', 2: '15:00', 3: '23424234',
4: '23423', 5: '3', 6 : '9:00', 7: '3423', 8: '32', 9: '7:00'}})
temp= df1['Number'].str.split(":", n = 1, expand = True)
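# temp[1] is None for rows without a ':', so only rows that contained a ':' get New_val set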
temp.loc[temp[1].astype(bool) != False, 'New_val'] = temp[0]
df1['New_val'] = temp['New_val'].ffill()
print (df1)
The output of df1 will be:
Number New_val
0 1000 NaN
1 1000021 NaN
2 15:00 15
3 23424234 15
4 23423 15
5 3 15
6 9:00 9
7 3423 9
8 32 9
9 7:00 7

Python : How to do conditional rounding in dataframe column values?

import pandas as pd

data = {'Name': ['tom', 'bul', 'zack', 'doll', 'viru'], 'price': [.2012, .05785, 2.03, 5.89, .029876]}
df = pd.DataFrame(data)
I want to round to 0 decimal points if the 'price' value is more than 1 and round to 4 decimal points if the value is less than 1. Please suggest.
If there are many conditions, I prefer using numpy.select, as in the following:
import numpy as np

df['price2'] = np.select(
    [df.price >= 1, df.price < 1],
    [df.price.round(0), df.price.round(2)],
)
# df
# Name price price2
# 0 tom 0.201200 0.20
# 1 bul 0.057850 0.06
# 2 zack 2.030000 2.00
# 3 doll 5.890000 6.00
# 4 viru 0.029876 0.03
With more conditions, we could do something like this:
df['price3'] = np.select(
[df.price >= 3, df.price >= 1, df.price < 1],
[df.price.round(0), df.price.round(2), df.price.round(3)],
)
# df
# Name price price2 price3
# 0 tom 0.201200 0.20 0.201
# 1 bul 0.057850 0.06 0.058
# 2 zack 2.030000 2.00 2.030
# 3 doll 5.890000 6.00 6.000
# 4 viru 0.029876 0.03 0.030
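
For the exact two cases asked about (0 decimals above 1, 4 decimals below), a minimal np.where sketch also works; the column name price_rounded is just illustrative:

# Hypothetical output column, rounding per the question's 0/4-decimal rule
df['price_rounded'] = np.where(df['price'] > 1, df['price'].round(0), df['price'].round(4))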