Conditional mean while using iloc pandas

Assume I have a dataframe with the columns shown below (the actual data contains more columns).
Customer Group1 jan_revenue feb_revenue mar_revenue
Sam Bank A 40 50 0
Wilson Bank A 60 70 30
Jay Bank B 10 40 40
Jim Bank A 0 40 70
Yan Bank C 0 40 90
Tim Bank C 10 0 50
I want to calculate the mean for each customer, but only over the non-zero values.
For example, customer Sam has mean (40+50)/2 = 45 and Wilson (60+70+30)/3 = 53.3333
Since I have a large number of columns, I chose to use iloc, but my approach includes all the zeros:
df['avg_revenue21'] = df.iloc[:,27:39].mean(axis=1)
Is there a way to compute a conditional mean while using iloc?
Thank you

You can use select_dtypes to get numeric columns, replace the zeros with NA, then get the mean as usual:
df.select_dtypes('number').replace(0, pd.NA).mean(axis=1)
output:
Sam 45.000000
Wilson 53.333333
Jay 30.000000
Jim 55.000000
Yan 65.000000
Tim 30.000000
dtype: float64
As a new column:
df['avg_revenue21'] = df.select_dtypes('number').replace(0, pd.NA).mean(axis=1)
Customer Group1 jan_revenue feb_revenue mar_revenue avg_revenue21
Sam Bank A 40 50 0 45.000000
Wilson Bank A 60 70 30 53.333333
Jay Bank B 10 40 40 30.000000
Jim Bank A 0 40 70 55.000000
Yan Bank C 0 40 90 65.000000
Tim Bank C 10 0 50 30.000000
Variants:
If the inputs are strings:
df['avg_revenue21'] = df.apply(pd.to_numeric, errors='coerce').replace(0, pd.NA).mean(axis=1)
If you only want to consider a subset:
df['avg_revenue21'] = df.filter(regex='(feb|mar)_').replace(0, pd.NA).mean(axis=1)
or:
df['avg_revenue21'] = df[['feb_revenue', 'mar_revenue']].replace(0, pd.NA).mean(axis=1)
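If you specifically want to keep using iloc to address a positional block of columns, you can divide the row sum by the count of non-zero entries instead; a small sketch, reusing the 27:39 slice from the question (adjust to your layout). Note that a row whose slice is all zeros divides by zero and comes out as inf rather than NaN:
block = df.iloc[:, 27:39]
df['avg_revenue21'] = block.sum(axis=1) / block.ne(0).sum(axis=1)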

Use DataFrame.replace with np.nan and mean (note that with string columns present, recent pandas also needs numeric_only=True, as in the second variant below):
import numpy as np

df['new'] = df.replace(0, np.nan).mean(axis=1)
print (df)
Customer Group1 jan_revenue feb_revenue mar_revenue new
0 Sam Bank A 40 50 0 45.000000
1 Wilson Bank A 60 70 30 53.333333
2 Jay Bank B 10 40 40 30.000000
3 Jim Bank A 0 40 70 55.000000
4 Yan Bank C 0 40 90 65.000000
5 Tim Bank C 10 0 50 30.000000
Or:
df['new'] = df.replace(0, np.nan).mean(numeric_only=True, axis=1)
print (df)
Customer Group1 jan_revenue feb_revenue mar_revenue new
0 Sam Bank A 40 50 0 45.000000
1 Wilson Bank A 60 70 30 53.333333
2 Jay Bank B 10 40 40 30.000000
3 Jim Bank A 0 40 70 55.000000
4 Yan Bank C 0 40 90 65.000000
5 Tim Bank C 10 0 50 30.000000
EDIT: If some columns may not be numeric, use to_numeric with errors='coerce' to turn non-numbers into missing values:
df['new'] = df.apply(pd.to_numeric, errors='coerce').replace(0, np.nan).mean(axis=1)
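An equivalent that avoids replace is to mask the zeros with DataFrame.where, which sets them to NaN so mean skips them; a minimal sketch assuming the same df:
df['new'] = df.select_dtypes('number').where(lambda d: d != 0).mean(axis=1)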

Related

aggregate data between two dates with two dataframes

Given I have the following DF. Assume this table has all the sales reps and all the quarter-end dates for the last 20 years.
Q End date   Rep    Var1
03/31/2010   Bob    11
03/31/2010   Alice  12
03/31/2010   Jack   13
06/30/2010   Bob    14
06/30/2010   Alice  15
06/30/2010   Jack   16
I also have a table of transaction events:
Sell Date    Rep
04/01/2009   Bob
03/01/2010   Bob
02/01/2010   Jack
02/01/2010   Jack
I am trying to modify the first DF to add a column that counts the number of transactions that happened in the 12 months prior to the quarter end date, per quarter end per rep.
The result should look like this:
Q End date   Rep    Var1  Trailing 12M transactions
03/31/2010   Bob    11    2
03/31/2010   Alice  12    0
03/31/2010   Jack   13    2
06/30/2010   Bob    14    1
06/30/2010   Alice  15    0
06/30/2010   Jack   16    2
My table has 2,000-3,000 sales reps per quarter for ~20 years, and the number of transactions per trailing 12 months can range between 0 and ~7k.
Any help here would be appreciated. Thanks!
Try:
df1["Q End date"] = pd.to_datetime(df1["Q End date"])
df2["Sell Date"] = pd.to_datetime(df2["Sell Date"])

# sort by date and index by it so .loc can slice a date range
df2 = df2.sort_values(by="Sell Date").set_index("Sell Date")

# for each row, count this rep's sells in the year ending at the quarter end
df1["Trailing 12M transactions"] = df1.apply(
    lambda x: df2.loc[
        x["Q End date"] - pd.DateOffset(years=1) : x["Q End date"], "Rep"
    ]
    .eq(x["Rep"])
    .sum(),
    axis=1,
)
print(df1)
Prints:
Q End date Rep Var1 Trailing 12M transactions
0 2010-03-31 Bob 11 2
1 2010-03-31 Alice 12 0
2 2010-03-31 Jack 13 2
3 2010-06-30 Bob 14 1
4 2010-06-30 Alice 15 0
5 2010-06-30 Jack 16 2
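Given the sizes in the question (~80 quarters x 2,000-3,000 reps), the row-wise apply can be slow. A merge-based sketch that vectorizes the count, assuming df1 and df2 with datetime columns as in the question (before df2 is indexed by Sell Date) and df1 with a default RangeIndex:
# one row per (Rep, Sell Date) with its transaction count
counts = df2.groupby(["Rep", "Sell Date"]).size().reset_index(name="n")

# pair every quarter-end row with every sell date of the same rep
m = df1.reset_index().merge(counts, on="Rep", how="left")

# keep only sells inside the trailing 12-month window, then sum per original row
in_window = m["Sell Date"].between(
    m["Q End date"] - pd.DateOffset(years=1), m["Q End date"]
)
df1["Trailing 12M transactions"] = (
    m["n"].where(in_window, 0).groupby(m["index"]).sum().astype(int)
)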

concatenate two data frames and add a tag column to differentiate in pandas

I have two data frames as shown below
df1:
Name Age goals
Messi 31 500
Suarez 35 300
Xavi 39 100
df2:
Name Age goals
Benzema 33 400
Kroos 30 100
I would like to concatenate these two into a new frame with an extra column club, as shown below.
df_concate:
Name Age goals club
Messi 31 500 barcelona
Suarez 35 300 barcelona
Xavi 39 100 barcelona
Benzema 33 400 realmadrid
Kroos 30 100 realmadrid
I tried below code:
pieces = {'barcelona': df1, 'realmadrid': df2}
df_concate = pd.concat(pieces)
You were close...
pieces = {'barcelona': df1, 'realmadrid': df2}
df_concate = pd.concat(pieces, names=['club'])
df_concate = df_concate.reset_index(level=0)
df_concate
Output:
club Name Age goals
0 barcelona Messi 31 500
1 barcelona Suarez 35 300
2 barcelona Xavi 39 100
0 realmadrid Benzema 33 400
1 realmadrid Kroos 30 100
Assign the new column before concat:
pd.concat([df1.assign(club='barcelona'), df2.assign(club='realmadrid')])
Or read the frames, tag each one with its club, then concatenate:
a = pd.read_csv('a.csv')
b = pd.read_csv('b.csv')
frames = [a, b]
frames[0]['club'] = 'Barcelona'
frames[1]['club'] = 'RealMadrid'
df = pd.concat(frames)
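Note that every variant above keeps each source frame's original row index (hence the repeated 0 and 1 in the outputs); if you want a fresh 0..n-1 index, pass ignore_index:
pd.concat([df1.assign(club='barcelona'), df2.assign(club='realmadrid')], ignore_index=True)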

groupby sum month wise on date time data

I have transaction data as shown below, covering three months.
Card_Number Card_type Category Amount Date
0 1 PLATINUM GROCERY 100 10-Jan-18
1 1 PLATINUM HOTEL 2000 14-Jan-18
2 1 PLATINUM GROCERY 500 17-Jan-18
3 1 PLATINUM GROCERY 300 20-Jan-18
4 1 PLATINUM RESTRAUNT 400 22-Jan-18
5 1 PLATINUM HOTEL 500 5-Feb-18
6 1 PLATINUM GROCERY 400 11-Feb-18
7 1 PLATINUM RESTRAUNT 600 21-Feb-18
8 1 PLATINUM GROCERY 800 17-Mar-18
9 1 PLATINUM GROCERY 200 21-Mar-18
10 2 GOLD GROCERY 1000 12-Jan-18
11 2 GOLD HOTEL 3000 14-Jan-18
12 2 GOLD RESTRAUNT 500 19-Jan-18
13 2 GOLD GROCERY 300 20-Jan-18
14 2 GOLD GROCERY 400 25-Jan-18
15 2 GOLD HOTEL 1500 5-Feb-18
16 2 GOLD GROCERY 400 11-Feb-18
17 2 GOLD RESTRAUNT 600 21-Mar-18
18 2 GOLD GROCERY 200 21-Mar-18
19 2 GOLD HOTEL 700 25-Mar-18
20 3 SILVER RESTRAUNT 1000 13-Jan-18
21 3 SILVER HOTEL 1000 16-Jan-18
22 3 SILVER GROCERY 500 18-Jan-18
23 3 SILVER GROCERY 300 23-Jan-18
24 3 SILVER GROCERY 400 28-Jan-18
25 3 SILVER HOTEL 500 5-Feb-18
26 3 SILVER GROCERY 400 11-Feb-18
27 3 SILVER HOTEL 600 25-Mar-18
28 3 SILVER GROCERY 200 29-Mar-18
29 3 SILVER RESTRAUNT 700 30-Mar-18
I am struggling to get the dataframe below.
Card_No  Card_Type  D   Jan_Sp  Jan_N  Feb_Sp  Feb_N  Mar_Sp  GR_T  RES_T
1        PLATINUM   70  3300    5      1500    3      1000    2300  100
2        GOLD       72  5200    5      1900    2      1500    2300  1100
3        SILVER     76  2900    5      900     2      1500    1800  1700
D = Duration in days from first transaction to last transaction.
Jan_Sp = Total spending in January.
Feb_Sp = Total spending in February.
Mar_Sp = Total spending in March.
Jan_N = Number of transactions in January.
Feb_N = Number of transactions in February.
GR_T = Total spending on GROCERY.
RES_T = Total spending on RESTRAUNT.
I tried the following code. I am very new to pandas.
q9['Date'] = pd.to_datetime(q9['Date'])
q9 = q9.sort_values(['Card_Number', 'Date'])
q9['D'] = q9.groupby('Card_Number')['Date'].diff().dt.days
My approach has three steps:
get the date range
get the Monthly spending
get the category spending
Step 1: Date
date_df = df.groupby('Card_type').Date.apply(lambda x: (x.max()-x.min()).days)
Step 2: Month
month_df = (df.groupby(['Card_type', df.Date.dt.month_name().str[:3]])
              .Amount
              .agg(['sum', 'count'])
              .rename({'sum': '_Sp', 'count': '_N'}, axis=1)
              .unstack('Date')
           )
# flatten the (agg, month) column MultiIndex into e.g. Jan_Sp
month_df.columns = [b + a for a, b in month_df.columns]
Step 3: Category
cat_df = df.pivot_table(index='Card_type',
                        columns='Category',
                        values='Amount',
                        aggfunc='sum')
# first two letters of each category, e.g. GR_T
cat_df.columns = [a[:2] + "_T" for a in cat_df.columns]
And finally concat:
pd.concat( (date_df, month_df, cat_df), axis=1)
gives:
Date Feb_Sp Jan_Sp Mar_Sp Feb_N Jan_N Mar_N GR_T HO_T RE_T
Card_type
GOLD 72 1900 5200 1500 2 5 3 2300 5200 1100
PLATINUM 70 1500 3300 1000 3 5 2 2300 2500 1000
SILVER 76 900 3200 1500 2 5 3 1800 2100 1700
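These steps assume Date has already been parsed as datetime; a minimal setup sketch, where the format string is an assumption based on the 10-Jan-18 sample and the rename matches the asker's D column (the duration series is named Date):
df['Date'] = pd.to_datetime(df['Date'], format='%d-%b-%y')   # parse e.g. 10-Jan-18

result = (pd.concat((date_df, month_df, cat_df), axis=1)
            .rename(columns={'Date': 'D'})
            .reset_index())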
If your data have several years, and you want to separate them by year, then you can add df.Date.dt.year in each groupby above:
date_df = df.groupby([df.Date.dt.year, 'Card_type']).Date.apply(lambda x: (x.max() - x.min()).days)
month_df = (df.groupby([df.Date.dt.year, 'Card_type', df.Date.dt.month_name().str[:3]])
              .Amount
              .agg(['sum', 'count'])
              .rename({'sum': '_Sp', 'count': '_N'}, axis=1)
              .unstack(level=-1)
           )
# flatten the (agg, month) column MultiIndex into e.g. Jan_Sp
month_df.columns = [b + a for a, b in month_df.columns]
cat_df = (df.groupby([df.Date.dt.year, 'Card_type', 'Category'])
            .Amount
            .sum()
            .unstack(level=-1)
         )
# first two letters of each category, e.g. GR_T
cat_df.columns = [a[:2] + "_T" for a in cat_df.columns]
pd.concat((date_df, month_df, cat_df), axis=1)
gives:
Date Feb_Sp Jan_Sp Mar_Sp Feb_N Jan_N Mar_N GR_T HO_T
Date Card_type
2017 GOLD 72 1900 5200 1500 2 5 3 2300 5200
PLATINUM 70 1500 3300 1000 3 5 2 2300 2500
SILVER 76 900 3200 1500 2 5 3 1800 2100
2018 GOLD 72 1900 5200 1500 2 5 3 2300 5200
PLATINUM 70 1500 3300 1000 3 5 2 2300 2500
SILVER 76 900 3200 1500 2 5 3 1800 2100
I would recommend keeping the dataframe this way, so you can access the annual data, e.g. result_df.loc[2017] gives you the 2017 data. If you really want the years spread out as a column level, you can do result_df.unstack(level=0).
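For example, with the (year, Card_type) MultiIndex kept (names as constructed above):
result_df = pd.concat((date_df, month_df, cat_df), axis=1)
result_df.loc[2017]             # all card types, 2017 only
result_df.loc[(2018, 'GOLD')]   # a single (year, card type) row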

pandas groupby sum, count by date time, where consider only year

I have transaction data as shown below: three months of data spread across three different years.
Card_Number Card_type Category Amount Date
0 1 PLATINUM GROCERY 100 10-Jan-18
1 1 PLATINUM HOTEL 2000 14-Jan-18
2 1 PLATINUM GROCERY 500 17-Jan-18
3 1 PLATINUM GROCERY 300 20-Jan-18
4 1 PLATINUM RESTRAUNT 400 22-Jan-18
5 1 PLATINUM HOTEL 500 5-Feb-19
6 1 PLATINUM GROCERY 400 11-Feb-19
7 1 PLATINUM RESTRAUNT 600 21-Feb-19
8 1 PLATINUM GROCERY 800 17-Mar-17
9 1 PLATINUM GROCERY 200 21-Mar-17
10 2 GOLD GROCERY 1000 12-Jan-18
11 2 GOLD HOTEL 3000 14-Jan-18
12 2 GOLD RESTRAUNT 500 19-Jan-18
13 2 GOLD GROCERY 300 20-Jan-18
14 2 GOLD GROCERY 400 25-Jan-18
15 2 GOLD HOTEL 1500 5-Feb-19
16 2 GOLD GROCERY 400 11-Feb-19
17 2 GOLD RESTRAUNT 600 21-Mar-17
18 2 GOLD GROCERY 200 21-Mar-17
19 2 GOLD HOTEL 700 25-Mar-17
20 3 SILVER RESTRAUNT 1000 13-Jan-18
21 3 SILVER HOTEL 1000 16-Jan-18
22 3 SILVER GROCERY 500 18-Jan-18
23 3 SILVER GROCERY 300 23-Jan-18
24 3 SILVER GROCERY 400 28-Jan-18
25 3 SILVER HOTEL 500 5-Feb-19
26 3 SILVER GROCERY 400 11-Feb-19
27 3 SILVER HOTEL 600 25-Mar-17
28 3 SILVER GROCERY 200 29-Mar-17
29 3 SILVER RESTRAUNT 700 30-Mar-17
I am struggling to get the dataframe below.
Card_No  Card_Type  D   2018_Sp  2018_N  2019_Sp  2019_N  2017_Sp
1        PLATINUM   70  3300     5       1500     3       1000
2        GOLD       72  5200     5       1900     2       1500
3        SILVER     76  2900     5       900      2       1500
D = Duration in days from first transaction to last transaction.
2018_Sp = Total spending in 2018.
2019_Sp = Total spending in 2019.
2017_Sp = Total spending in 2017.
2018_N = Number of transactions in 2018.
2019_N = Number of transactions in 2019.
Use:
#convert to datetimes
df['Date'] = pd.to_datetime(df['Date'])
#sorting if necessary
df = df.sort_values(['Card_Number','Card_type', 'Date'])
#aggregate sum and count per year (Sp = spending sum, N = transaction count)
df1 = (df.groupby(['Card_Number','Card_type', df['Date'].dt.year])['Amount']
       .agg([('Sp','sum'),('N','size')])
       .unstack()
       .sort_index(axis=1, level=1))
#MultiIndex to columns
df1.columns = [f'{b}_{a}' for a, b in df1.columns]
#duration in days from first to last transaction (large values because the data spans several years)
s = df.groupby('Card_type').Date.apply(lambda x: (x.max()-x.min()).days).rename('D')
#join together
df1 = df1.join(s).reset_index()
print (df1)
   Card_Number Card_type  2017_N  2017_Sp  2018_N  2018_Sp  2019_N  2019_Sp  \
0            1  PLATINUM       2     1000       5     3300       3     1500
1            2      GOLD       3     1500       5     5200       2     1900
2            3    SILVER       3     1500       5     3200       2      900

     D
0  706
1  692
2  688
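As a variation, the same per-year sum/count pairs can come from a single pivot_table call; a sketch assuming df['Date'] is already datetime (join the D series as above):
out = (df.assign(Year=df['Date'].dt.year)
         .pivot_table(index=['Card_Number', 'Card_type'],
                      columns='Year', values='Amount',
                      aggfunc=['sum', 'count']))
# flatten the (aggfunc, year) columns into e.g. 2018_Sp / 2018_N
out.columns = [f'{year}_{"Sp" if func == "sum" else "N"}' for func, year in out.columns]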

Splitting Column Headers and Duplicating Row Values in Pandas Dataframe

In the example df below, I'm trying to find a way to split the column headers ('1;2', '4', '5;6') on the ';' and duplicate the row values into the resulting columns. (My actual df comes from an imported CSV file, so I generally have around 50-80 column headers that need splitting.)
Below is my code below with output
import pandas as pd
import numpy as np
data = np.array([['Market','Product Code','1;2','4','5;6'],
['Total Customers',123,1,500,400],
['Total Customers',123,2,400,320],
['Major Customer 1',123,1,100,220],
['Major Customer 1',123,2,230,230],
['Major Customer 2',123,1,130,30],
['Major Customer 2',123,2,20,10],
['Total Customers',456,1,500,400],
['Total Customers',456,2,400,320],
['Major Customer 1',456,1,100,220],
['Major Customer 1',456,2,230,230],
['Major Customer 2',456,1,130,30],
['Major Customer 2',456,2,20,10]])
df = pd.DataFrame(data)
df.columns = df.iloc[0]
df = df.reindex(df.index.drop(0))
print (df)
0 Market Product Code 1;2 4 5;6
1 Total Customers 123 1 500 400
2 Total Customers 123 2 400 320
3 Major Customer 1 123 1 100 220
4 Major Customer 1 123 2 230 230
5 Major Customer 2 123 1 130 30
6 Major Customer 2 123 2 20 10
7 Total Customers 456 1 500 400
8 Total Customers 456 2 400 320
9 Major Customer 1 456 1 100 220
10 Major Customer 1 456 2 230 230
11 Major Customer 2 456 1 130 30
12 Major Customer 2 456 2 20 10
Below is my desired output
0 Market Product Code 1 2 4 5 6
1 Total Customers 123 1 1 500 400 400
2 Total Customers 123 2 2 400 320 320
3 Major Customer 1 123 1 1 100 220 220
4 Major Customer 1 123 2 2 230 230 230
5 Major Customer 2 123 1 1 130 30 30
6 Major Customer 2 123 2 2 20 10 10
7 Total Customers 456 1 1 500 400 400
8 Total Customers 456 2 2 400 320 320
9 Major Customer 1 456 1 1 100 220 220
10 Major Customer 1 456 2 2 230 230 230
11 Major Customer 2 456 1 1 130 30 30
12 Major Customer 2 456 2 2 20 10 10
Ideally I would like to perform such a task at the 'read_csv' level. Any thoughts?
Try reindex with repeat:
s = df.columns.str.split(';')                            # e.g. [['Market'], ..., ['1', '2'], ['4'], ['5', '6']]
df = df.reindex(columns=df.columns.repeat(s.str.len()))  # duplicate each column once per split part
df.columns = sum(s.tolist(), [])                         # flatten the parts into the new header
df
Out[247]:
Market Product Code 1 2 4 5 6
1 Total Customers 123 1 1 500 400 400
2 Total Customers 123 2 2 400 320 320
3 Major Customer 1 123 1 1 100 220 220
4 Major Customer 1 123 2 2 230 230 230
5 Major Customer 2 123 1 1 130 30 30
6 Major Customer 2 123 2 2 20 10 10
7 Total Customers 456 1 1 500 400 400
8 Total Customers 456 2 2 400 320 320
9 Major Customer 1 456 1 1 100 220 220
10 Major Customer 1 456 2 2 230 230 230
11 Major Customer 2 456 1 1 130 30 30
12 Major Customer 2 456 2 2 20 10 10
You can split the columns with ';' and then reconstruct a df:
pd.DataFrame({c:df[t] for t in df.columns for c in t.split(';')})
Out[157]:
1 2 4 5 6 Market Product Code
1 1 1 500 400 400 Total Customers 123
2 2 2 400 320 320 Total Customers 123
3 1 1 100 220 220 Major Customer 1 123
4 2 2 230 230 230 Major Customer 1 123
5 1 1 130 30 30 Major Customer 2 123
6 2 2 20 10 10 Major Customer 2 123
7 1 1 500 400 400 Total Customers 456
8 2 2 400 320 320 Total Customers 456
9 1 1 100 220 220 Major Customer 1 456
10 2 2 230 230 230 Major Customer 1 456
11 1 1 130 30 30 Major Customer 2 456
12 2 2 20 10 10 Major Customer 2 456
Or if you would like to preserve the column order:
pd.concat([df[t].to_frame(c) for t in df.columns for c in t.split(';')], axis=1)
Out[167]:
Market Product Code 1 2 4 5 6
1 Total Customers 123 1 1 500 400 400
2 Total Customers 123 2 2 400 320 320
3 Major Customer 1 123 1 1 100 220 220
4 Major Customer 1 123 2 2 230 230 230
5 Major Customer 2 123 1 1 130 30 30
6 Major Customer 2 123 2 2 20 10 10
7 Total Customers 456 1 1 500 400 400
8 Total Customers 456 2 2 400 320 320
9 Major Customer 1 456 1 1 100 220 220
10 Major Customer 1 456 2 2 230 230 230
11 Major Customer 2 456 1 1 130 30 30
12 Major Customer 2 456 2 2 20 10 10
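On doing this "at the read_csv level": read_csv itself has no option to split headers, but the reindex/repeat trick above wraps into a small helper applied right after reading; a sketch, with the path purely illustrative:
import pandas as pd

def read_and_split(path):
    df = pd.read_csv(path)
    parts = df.columns.str.split(';')                            # e.g. [['1', '2'], ['4'], ['5', '6']]
    df = df.reindex(columns=df.columns.repeat(parts.str.len()))  # duplicate data columns per part
    df.columns = [c for group in parts for c in group]           # flat new header
    return df

df = read_and_split('data.csv')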