Conditional mean while using iloc pandas

Assume I have a dataframe with the columns shown below (the actual data contains more columns).
Customer Group1 jan_revenue feb_revenue mar_revenue
Sam Bank A 40 50 0
Wilson Bank A 60 70 30
Jay Bank B 10 40 40
Jim Bank A 0 40 70
Yan Bank C 0 40 90
Tim Bank C 10 0 50
I want to calculate the mean for each customer, but only over the non-zero values.
For example, customer Sam has mean (40+50)/2 = 45 and Wilson (60+70+30)/3 = 53.3333
Since I have a large number of columns, I chose to use iloc, but my approach includes all the zeros:
df['avg_revenue21'] = df.iloc[:,27:39].mean(axis=1)
Is there a way to compute a conditional mean while using iloc?
Thank you

You can use select_dtypes to get numeric columns, replace the zeros with NA, then get the mean as usual:
df.select_dtypes('number').replace(0, pd.NA).mean(axis=1)
output:
Sam 45.000000
Wilson 53.333333
Jay 30.000000
Jim 55.000000
Yan 65.000000
Tim 30.000000
dtype: float64
As a new column:
df['avg_revenue21'] = df.select_dtypes('number').replace(0, pd.NA).mean(axis=1)
Customer Group1 jan_revenue feb_revenue mar_revenue avg_revenue21
Sam Bank A 40 50 0 45.000000
Wilson Bank A 60 70 30 53.333333
Jay Bank B 10 40 40 30.000000
Jim Bank A 0 40 70 55.000000
Yan Bank C 0 40 90 65.000000
Tim Bank C 10 0 50 30.000000
Variants:
If the inputs are strings:
df['avg_revenue21'] = df.apply(pd.to_numeric, errors='coerce').replace(0, pd.NA).mean(axis=1)
If you only want to consider a subset:
df['avg_revenue21'] = df.filter(regex='(feb|mar)_').replace(0, pd.NA).mean(axis=1)
or:
df['avg_revenue21'] = df[['feb_revenue', 'mar_revenue']].replace(0, pd.NA).mean(axis=1)
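If you specifically want to keep using iloc to address a positional block of columns, you can divide the row sum by the count of non-zero entries instead; a small sketch, reusing the 27:39 slice from the question (adjust to your layout). Note that a row whose slice is all zeros divides by zero and comes out as inf rather than NaN:
block = df.iloc[:, 27:39]
df['avg_revenue21'] = block.sum(axis=1) / block.ne(0).sum(axis=1)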

Use DataFrame.replace with np.nan and mean (note that with string columns present, recent pandas also needs numeric_only=True, as in the second variant below):
import numpy as np

df['new'] = df.replace(0, np.nan).mean(axis=1)
print (df)
Customer Group1 jan_revenue feb_revenue mar_revenue new
0 Sam Bank A 40 50 0 45.000000
1 Wilson Bank A 60 70 30 53.333333
2 Jay Bank B 10 40 40 30.000000
3 Jim Bank A 0 40 70 55.000000
4 Yan Bank C 0 40 90 65.000000
5 Tim Bank C 10 0 50 30.000000
Or:
df['new'] = df.replace(0, np.nan).mean(numeric_only=True, axis=1)
print (df)
Customer Group1 jan_revenue feb_revenue mar_revenue new
0 Sam Bank A 40 50 0 45.000000
1 Wilson Bank A 60 70 30 53.333333
2 Jay Bank B 10 40 40 30.000000
3 Jim Bank A 0 40 70 55.000000
4 Yan Bank C 0 40 90 65.000000
5 Tim Bank C 10 0 50 30.000000
EDIT: If some columns may not be numeric, use to_numeric with errors='coerce' to turn non-numbers into missing values:
df['new'] = df.apply(pd.to_numeric, errors='coerce').replace(0, np.nan).mean(axis=1)
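An equivalent that avoids replace is to mask the zeros with DataFrame.where, which sets them to NaN so mean skips them; a minimal sketch assuming the same df:
df['new'] = df.select_dtypes('number').where(lambda d: d != 0).mean(axis=1)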

Related

aggregate data between two dates with two dataframes

Given I have the following DF. Assume this table has all the sales reps and all the quarter-end dates for the last 20 years.
Q End date   Rep    Var1
03/31/2010   Bob    11
03/31/2010   Alice  12
03/31/2010   Jack   13
06/30/2010   Bob    14
06/30/2010   Alice  15
06/30/2010   Jack   16
I also have a table of transaction events:
Sell Date    Rep
04/01/2009   Bob
03/01/2010   Bob
02/01/2010   Jack
02/01/2010   Jack
I am trying to modify the first DF to add a column that counts the number of transactions that happened in the 12 months prior to the quarter end date, per quarter end per rep.
The result should look like this:
Q End date   Rep    Var1  Trailing 12M transactions
03/31/2010   Bob    11    2
03/31/2010   Alice  12    0
03/31/2010   Jack   13    2
06/30/2010   Bob    14    1
06/30/2010   Alice  15    0
06/30/2010   Jack   16    2
My table has 2,000-3,000 sales reps per quarter for ~20 years, and the number of transactions per trailing 12 months can range between 0 and ~7k.
Any help here would be appreciated. Thanks!
Try:
df1["Q End date"] = pd.to_datetime(df1["Q End date"])
df2["Sell Date"] = pd.to_datetime(df2["Sell Date"])

# sort by date and index by it so .loc can slice a date range
df2 = df2.sort_values(by="Sell Date").set_index("Sell Date")

# for each row, count this rep's sells in the year ending at the quarter end
df1["Trailing 12M transactions"] = df1.apply(
    lambda x: df2.loc[
        x["Q End date"] - pd.DateOffset(years=1) : x["Q End date"], "Rep"
    ]
    .eq(x["Rep"])
    .sum(),
    axis=1,
)
print(df1)
Prints:
Q End date Rep Var1 Trailing 12M transactions
0 2010-03-31 Bob 11 2
1 2010-03-31 Alice 12 0
2 2010-03-31 Jack 13 2
3 2010-06-30 Bob 14 1
4 2010-06-30 Alice 15 0
5 2010-06-30 Jack 16 2
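Given the sizes in the question (~80 quarters x 2,000-3,000 reps), the row-wise apply can be slow. A merge-based sketch that vectorizes the count, assuming df1 and df2 with datetime columns as in the question (before df2 is indexed by Sell Date) and df1 with a default RangeIndex:
# one row per (Rep, Sell Date) with its transaction count
counts = df2.groupby(["Rep", "Sell Date"]).size().reset_index(name="n")

# pair every quarter-end row with every sell date of the same rep
m = df1.reset_index().merge(counts, on="Rep", how="left")

# keep only sells inside the trailing 12-month window, then sum per original row
in_window = m["Sell Date"].between(
    m["Q End date"] - pd.DateOffset(years=1), m["Q End date"]
)
df1["Trailing 12M transactions"] = (
    m["n"].where(in_window, 0).groupby(m["index"]).sum().astype(int)
)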

concatenate two data frames and add a tag column to differentiate in pandas

I have two data frames as shown below
df1:
Name Age goals
Messi 31 500
Suarez 35 300
Xavi 39 100
df2:
Name Age goals
Benzema 33 400
Kroos 30 100
I would like to concatenate these two into a new frame with an extra column club, as shown below.
df_concate:
Name Age goals club
Messi 31 500 barcelona
Suarez 35 300 barcelona
Xavi 39 100 barcelona
Benzema 33 400 realmadrid
Kroos 30 100 realmadrid
I tried below code:
pieces = {'barcelona': df1, 'realmadrid': df2}
df_concate = pd.concat(pieces)
You were close...
pieces = {'barcelona': df1, 'realmadrid': df2}
df_concate = pd.concat(pieces, names=['club'])
df_concate = df_concate.reset_index(level=0)
df_concate
Output:
club Name Age goals
0 barcelona Messi 31 500
1 barcelona Suarez 35 300
2 barcelona Xavi 39 100
0 realmadrid Benzema 33 400
1 realmadrid Kroos 30 100
Assign the new column before concat:
pd.concat([df1.assign(club='barcelona'), df2.assign(club='realmadrid')])
Or read the frames, tag each one with its club, then concatenate:
a = pd.read_csv('a.csv')
b = pd.read_csv('b.csv')
frames = [a, b]
frames[0]['club'] = 'Barcelona'
frames[1]['club'] = 'RealMadrid'
df = pd.concat(frames)
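Note that every variant above keeps each source frame's original row index (hence the repeated 0 and 1 in the outputs); if you want a fresh 0..n-1 index, pass ignore_index:
pd.concat([df1.assign(club='barcelona'), df2.assign(club='realmadrid')], ignore_index=True)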

groupby sum month wise on date time data

I have transaction data as shown below, covering three months.
Card_Number Card_type Category Amount Date
0 1 PLATINUM GROCERY 100 10-Jan-18
1 1 PLATINUM HOTEL 2000 14-Jan-18
2 1 PLATINUM GROCERY 500 17-Jan-18
3 1 PLATINUM GROCERY 300 20-Jan-18
4 1 PLATINUM RESTRAUNT 400 22-Jan-18
5 1 PLATINUM HOTEL 500 5-Feb-18
6 1 PLATINUM GROCERY 400 11-Feb-18
7 1 PLATINUM RESTRAUNT 600 21-Feb-18
8 1 PLATINUM GROCERY 800 17-Mar-18
9 1 PLATINUM GROCERY 200 21-Mar-18
10 2 GOLD GROCERY 1000 12-Jan-18
11 2 GOLD HOTEL 3000 14-Jan-18
12 2 GOLD RESTRAUNT 500 19-Jan-18
13 2 GOLD GROCERY 300 20-Jan-18
14 2 GOLD GROCERY 400 25-Jan-18
15 2 GOLD HOTEL 1500 5-Feb-18
16 2 GOLD GROCERY 400 11-Feb-18
17 2 GOLD RESTRAUNT 600 21-Mar-18
18 2 GOLD GROCERY 200 21-Mar-18
19 2 GOLD HOTEL 700 25-Mar-18
20 3 SILVER RESTRAUNT 1000 13-Jan-18
21 3 SILVER HOTEL 1000 16-Jan-18
22 3 SILVER GROCERY 500 18-Jan-18
23 3 SILVER GROCERY 300 23-Jan-18
24 3 SILVER GROCERY 400 28-Jan-18
25 3 SILVER HOTEL 500 5-Feb-18
26 3 SILVER GROCERY 400 11-Feb-18
27 3 SILVER HOTEL 600 25-Mar-18
28 3 SILVER GROCERY 200 29-Mar-18
29 3 SILVER RESTRAUNT 700 30-Mar-18
I am struggling to get the dataframe below.
Card_No  Card_Type  D   Jan_Sp  Jan_N  Feb_Sp  Feb_N  Mar_Sp  GR_T  RES_T
1        PLATINUM   70  3300    5      1500    3      1000    2300  100
2        GOLD       72  5200    5      1900    2      1500    2300  1100
3        SILVER     76  2900    5      900     2      1500    1800  1700
D = Duration in days from first transaction to last transaction.
Jan_Sp = Total spending in January.
Feb_Sp = Total spending in February.
Mar_Sp = Total spending in March.
Jan_N = Number of transactions in January.
Feb_N = Number of transactions in February.
GR_T = Total spending on GROCERY.
RES_T = Total spending on RESTRAUNT.
I tried the following code. I am very new to pandas.
q9['Date'] = pd.to_datetime(q9['Date'])
q9 = q9.sort_values(['Card_Number', 'Date'])
q9['D'] = q9.groupby('Card_Number')['Date'].diff().dt.days
My approach has three steps:
get the date range
get the Monthly spending
get the category spending
Step 1: Date
date_df = df.groupby('Card_type').Date.apply(lambda x: (x.max()-x.min()).days)
Step 2: Month
month_df = (df.groupby(['Card_type', df.Date.dt.month_name().str[:3]])
              .Amount
              .agg(['sum', 'count'])
              .rename({'sum': '_Sp', 'count': '_N'}, axis=1)
              .unstack('Date')
           )
# flatten the (agg, month) column MultiIndex into e.g. Jan_Sp
month_df.columns = [b + a for a, b in month_df.columns]
Step 3: Category
cat_df = df.pivot_table(index='Card_type',
                        columns='Category',
                        values='Amount',
                        aggfunc='sum')
# first two letters of each category, e.g. GR_T
cat_df.columns = [a[:2] + "_T" for a in cat_df.columns]
And finally concat:
pd.concat( (date_df, month_df, cat_df), axis=1)
gives:
Date Feb_Sp Jan_Sp Mar_Sp Feb_N Jan_N Mar_N GR_T HO_T RE_T
Card_type
GOLD 72 1900 5200 1500 2 5 3 2300 5200 1100
PLATINUM 70 1500 3300 1000 3 5 2 2300 2500 1000
SILVER 76 900 3200 1500 2 5 3 1800 2100 1700
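These steps assume Date has already been parsed as datetime; a minimal setup sketch, where the format string is an assumption based on the 10-Jan-18 sample and the rename matches the asker's D column (the duration series is named Date):
df['Date'] = pd.to_datetime(df['Date'], format='%d-%b-%y')   # parse e.g. 10-Jan-18

result = (pd.concat((date_df, month_df, cat_df), axis=1)
            .rename(columns={'Date': 'D'})
            .reset_index())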
If your data have several years, and you want to separate them by year, then you can add df.Date.dt.year in each groupby above:
date_df = df.groupby([df.Date.dt.year, 'Card_type']).Date.apply(lambda x: (x.max() - x.min()).days)
month_df = (df.groupby([df.Date.dt.year, 'Card_type', df.Date.dt.month_name().str[:3]])
              .Amount
              .agg(['sum', 'count'])
              .rename({'sum': '_Sp', 'count': '_N'}, axis=1)
              .unstack(level=-1)
           )
# flatten the (agg, month) column MultiIndex into e.g. Jan_Sp
month_df.columns = [b + a for a, b in month_df.columns]
cat_df = (df.groupby([df.Date.dt.year, 'Card_type', 'Category'])
            .Amount
            .sum()
            .unstack(level=-1)
         )
# first two letters of each category, e.g. GR_T
cat_df.columns = [a[:2] + "_T" for a in cat_df.columns]
pd.concat((date_df, month_df, cat_df), axis=1)
gives:
Date Feb_Sp Jan_Sp Mar_Sp Feb_N Jan_N Mar_N GR_T HO_T
Date Card_type
2017 GOLD 72 1900 5200 1500 2 5 3 2300 5200
PLATINUM 70 1500 3300 1000 3 5 2 2300 2500
SILVER 76 900 3200 1500 2 5 3 1800 2100
2018 GOLD 72 1900 5200 1500 2 5 3 2300 5200
PLATINUM 70 1500 3300 1000 3 5 2 2300 2500
SILVER 76 900 3200 1500 2 5 3 1800 2100
I would recommend keeping the dataframe this way, so you can access the annual data, e.g. result_df.loc[2017] gives you the 2017 data. If you really want the years spread out as a column level, you can do result_df.unstack(level=0).
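For example, with the (year, Card_type) MultiIndex kept (names as constructed above):
result_df = pd.concat((date_df, month_df, cat_df), axis=1)
result_df.loc[2017]             # all card types, 2017 only
result_df.loc[(2018, 'GOLD')]   # a single (year, card type) row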

pandas groupby sum, count by date time, where consider only year

I have transaction data as shown below: three months of data spread across three different years.
Card_Number Card_type Category Amount Date
0 1 PLATINUM GROCERY 100 10-Jan-18
1 1 PLATINUM HOTEL 2000 14-Jan-18
2 1 PLATINUM GROCERY 500 17-Jan-18
3 1 PLATINUM GROCERY 300 20-Jan-18
4 1 PLATINUM RESTRAUNT 400 22-Jan-18
5 1 PLATINUM HOTEL 500 5-Feb-19
6 1 PLATINUM GROCERY 400 11-Feb-19
7 1 PLATINUM RESTRAUNT 600 21-Feb-19
8 1 PLATINUM GROCERY 800 17-Mar-17
9 1 PLATINUM GROCERY 200 21-Mar-17
10 2 GOLD GROCERY 1000 12-Jan-18
11 2 GOLD HOTEL 3000 14-Jan-18
12 2 GOLD RESTRAUNT 500 19-Jan-18
13 2 GOLD GROCERY 300 20-Jan-18
14 2 GOLD GROCERY 400 25-Jan-18
15 2 GOLD HOTEL 1500 5-Feb-19
16 2 GOLD GROCERY 400 11-Feb-19
17 2 GOLD RESTRAUNT 600 21-Mar-17
18 2 GOLD GROCERY 200 21-Mar-17
19 2 GOLD HOTEL 700 25-Mar-17
20 3 SILVER RESTRAUNT 1000 13-Jan-18
21 3 SILVER HOTEL 1000 16-Jan-18
22 3 SILVER GROCERY 500 18-Jan-18
23 3 SILVER GROCERY 300 23-Jan-18
24 3 SILVER GROCERY 400 28-Jan-18
25 3 SILVER HOTEL 500 5-Feb-19
26 3 SILVER GROCERY 400 11-Feb-19
27 3 SILVER HOTEL 600 25-Mar-17
28 3 SILVER GROCERY 200 29-Mar-17
29 3 SILVER RESTRAUNT 700 30-Mar-17
I am struggling to get the dataframe below.
Card_No  Card_Type  D   2018_Sp  2018_N  2019_Sp  2019_N  2017_Sp
1        PLATINUM   70  3300     5       1500     3       1000
2        GOLD       72  5200     5       1900     2       1500
3        SILVER     76  2900     5       900      2       1500
D = Duration in days from first transaction to last transaction.
2018_Sp = Total spending in 2018.
2019_Sp = Total spending in 2019.
2017_Sp = Total spending in 2017.
2018_N = Number of transactions in 2018.
2019_N = Number of transactions in 2019.
Use:
#convert to datetimes
df['Date'] = pd.to_datetime(df['Date'])
#sorting if necessary
df = df.sort_values(['Card_Number','Card_type', 'Date'])
#aggregate sum and count per year (Sp = spending sum, N = transaction count)
df1 = (df.groupby(['Card_Number','Card_type', df['Date'].dt.year])['Amount']
       .agg([('Sp','sum'),('N','size')])
       .unstack()
       .sort_index(axis=1, level=1))
#MultiIndex to columns
df1.columns = [f'{b}_{a}' for a, b in df1.columns]
#duration in days from first to last transaction (large values because the data spans several years)
s = df.groupby('Card_type').Date.apply(lambda x: (x.max()-x.min()).days).rename('D')
#join together
df1 = df1.join(s).reset_index()
print (df1)
   Card_Number Card_type  2017_N  2017_Sp  2018_N  2018_Sp  2019_N  2019_Sp  \
0            1  PLATINUM       2     1000       5     3300       3     1500
1            2      GOLD       3     1500       5     5200       2     1900
2            3    SILVER       3     1500       5     3200       2      900

     D
0  706
1  692
2  688
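As a variation, the same per-year sum/count pairs can come from a single pivot_table call; a sketch assuming df['Date'] is already datetime (join the D series as above):
out = (df.assign(Year=df['Date'].dt.year)
         .pivot_table(index=['Card_Number', 'Card_type'],
                      columns='Year', values='Amount',
                      aggfunc=['sum', 'count']))
# flatten the (aggfunc, year) columns into e.g. 2018_Sp / 2018_N
out.columns = [f'{year}_{"Sp" if func == "sum" else "N"}' for func, year in out.columns]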

Splitting Column Headers and Duplicating Row Values in Pandas Dataframe

In the example df below, I'm trying to find a way to split the column headers ('1;2', '4', '5;6') on the ';' and duplicate the row values into the resulting columns. (My actual df comes from an imported CSV file, so I generally have around 50-80 column headers that need splitting.)
Below is my code below with output
import pandas as pd
import numpy as np
data = np.array([['Market','Product Code','1;2','4','5;6'],
['Total Customers',123,1,500,400],
['Total Customers',123,2,400,320],
['Major Customer 1',123,1,100,220],
['Major Customer 1',123,2,230,230],
['Major Customer 2',123,1,130,30],
['Major Customer 2',123,2,20,10],
['Total Customers',456,1,500,400],
['Total Customers',456,2,400,320],
['Major Customer 1',456,1,100,220],
['Major Customer 1',456,2,230,230],
['Major Customer 2',456,1,130,30],
['Major Customer 2',456,2,20,10]])
df = pd.DataFrame(data)
df.columns = df.iloc[0]
df = df.reindex(df.index.drop(0))
print (df)
0 Market Product Code 1;2 4 5;6
1 Total Customers 123 1 500 400
2 Total Customers 123 2 400 320
3 Major Customer 1 123 1 100 220
4 Major Customer 1 123 2 230 230
5 Major Customer 2 123 1 130 30
6 Major Customer 2 123 2 20 10
7 Total Customers 456 1 500 400
8 Total Customers 456 2 400 320
9 Major Customer 1 456 1 100 220
10 Major Customer 1 456 2 230 230
11 Major Customer 2 456 1 130 30
12 Major Customer 2 456 2 20 10
Below is my desired output
0 Market Product Code 1 2 4 5 6
1 Total Customers 123 1 1 500 400 400
2 Total Customers 123 2 2 400 320 320
3 Major Customer 1 123 1 1 100 220 220
4 Major Customer 1 123 2 2 230 230 230
5 Major Customer 2 123 1 1 130 30 30
6 Major Customer 2 123 2 2 20 10 10
7 Total Customers 456 1 1 500 400 400
8 Total Customers 456 2 2 400 320 320
9 Major Customer 1 456 1 1 100 220 220
10 Major Customer 1 456 2 2 230 230 230
11 Major Customer 2 456 1 1 130 30 30
12 Major Customer 2 456 2 2 20 10 10
Ideally I would like to perform such a task at the 'read_csv' level. Any thoughts?
Try reindex with repeat:
s = df.columns.str.split(';')                            # e.g. [['Market'], ..., ['1', '2'], ['4'], ['5', '6']]
df = df.reindex(columns=df.columns.repeat(s.str.len()))  # duplicate each column once per split part
df.columns = sum(s.tolist(), [])                         # flatten the parts into the new header
df
Out[247]:
Market Product Code 1 2 4 5 6
1 Total Customers 123 1 1 500 400 400
2 Total Customers 123 2 2 400 320 320
3 Major Customer 1 123 1 1 100 220 220
4 Major Customer 1 123 2 2 230 230 230
5 Major Customer 2 123 1 1 130 30 30
6 Major Customer 2 123 2 2 20 10 10
7 Total Customers 456 1 1 500 400 400
8 Total Customers 456 2 2 400 320 320
9 Major Customer 1 456 1 1 100 220 220
10 Major Customer 1 456 2 2 230 230 230
11 Major Customer 2 456 1 1 130 30 30
12 Major Customer 2 456 2 2 20 10 10
You can split the columns with ';' and then reconstruct a df:
pd.DataFrame({c:df[t] for t in df.columns for c in t.split(';')})
Out[157]:
1 2 4 5 6 Market Product Code
1 1 1 500 400 400 Total Customers 123
2 2 2 400 320 320 Total Customers 123
3 1 1 100 220 220 Major Customer 1 123
4 2 2 230 230 230 Major Customer 1 123
5 1 1 130 30 30 Major Customer 2 123
6 2 2 20 10 10 Major Customer 2 123
7 1 1 500 400 400 Total Customers 456
8 2 2 400 320 320 Total Customers 456
9 1 1 100 220 220 Major Customer 1 456
10 2 2 230 230 230 Major Customer 1 456
11 1 1 130 30 30 Major Customer 2 456
12 2 2 20 10 10 Major Customer 2 456
Or if you would like to preserve the column order:
pd.concat([df[t].to_frame(c) for t in df.columns for c in t.split(';')], axis=1)
Out[167]:
Market Product Code 1 2 4 5 6
1 Total Customers 123 1 1 500 400 400
2 Total Customers 123 2 2 400 320 320
3 Major Customer 1 123 1 1 100 220 220
4 Major Customer 1 123 2 2 230 230 230
5 Major Customer 2 123 1 1 130 30 30
6 Major Customer 2 123 2 2 20 10 10
7 Total Customers 456 1 1 500 400 400
8 Total Customers 456 2 2 400 320 320
9 Major Customer 1 456 1 1 100 220 220
10 Major Customer 1 456 2 2 230 230 230
11 Major Customer 2 456 1 1 130 30 30
12 Major Customer 2 456 2 2 20 10 10
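On doing this "at the read_csv level": read_csv itself has no option to split headers, but the reindex/repeat trick above wraps into a small helper applied right after reading; a sketch, with the path purely illustrative:
import pandas as pd

def read_and_split(path):
    df = pd.read_csv(path)
    parts = df.columns.str.split(';')                            # e.g. [['1', '2'], ['4'], ['5', '6']]
    df = df.reindex(columns=df.columns.repeat(parts.str.len()))  # duplicate data columns per part
    df.columns = [c for group in parts for c in group]           # flat new header
    return df

df = read_and_split('data.csv')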