I have a dataframe which looks like this:
cuid day transactions
0 a mon 1
1 a tue 2
2 b tue 2
3 b wed 1
For each 'cuid' (customer ID) I want to find the day that maximizes the number of transactions. For example, for the above df, the output should be a df which looks like
cuid day transactions
1 a tue 2
2 b tue 2
I've tried code which looks like:
import pandas as pd
dd = {'cuid':['a','a','b','b'],'day':['mon','tue','tue','wed'],'transactions':[1,2,2,1]}
df = pd.DataFrame(dd)
dfg = df.groupby(by=['cuid']).agg(transactions=('transactions','max')).reset_index()
But I'm not able to figure out how to join dfg and df.
Approach 1
idxmax gives you the index label at which a given column (transactions here) reaches its maximum.
First we set the index,
step1 = df.set_index(['day', 'cuid'])
Then we find the idxmax,
indices = step1.groupby('cuid')['transactions'].idxmax()
and we get the result
step1.loc[indices].reset_index()
Approach 2
df.groupby('cuid')\
.apply(lambda df: df[df['transactions']==df['transactions'].max()])\
.reset_index(drop=True)
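For comparison, here is a minimal sketch of both ideas without changing the index first. It assumes the default integer index from the question's df; the names best_rows and all_max_rows are only illustrative. The idxmax variant keeps one row per cuid (the first one on ties), while the transform variant keeps every row that ties for the maximum.

```python
import pandas as pd

df = pd.DataFrame({
    "cuid": ["a", "a", "b", "b"],
    "day": ["mon", "tue", "tue", "wed"],
    "transactions": [1, 2, 2, 1],
})

# One row per cuid: idxmax returns the index label of each group's maximum
best_rows = df.loc[df.groupby("cuid")["transactions"].idxmax()]

# All rows that reach the group maximum (keeps ties)
group_max = df.groupby("cuid")["transactions"].transform("max")
all_max_rows = df[df["transactions"] == group_max]

print(best_rows)
```

On the sample data both variants select the same two rows, since no cuid has a tie.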
Related
I have a pandas df as follows:
YEAR MONTH USERID TRX_COUNT
2020 1 1 1
2020 2 1 2
2020 3 1 1
2020 12 1 1
2021 1 1 3
2021 2 1 3
2021 3 1 4
I want to sum the TRX_COUNT such that, each TRX_COUNT is the sum of TRX_COUNTS of the next 12 months.
So my end result would look like
YEAR MONTH USERID TRX_COUNT TRX_COUNT_SUM
2020 1 1 1 5
2020 2 1 2 7
2020 3 1 1 8
2020 12 1 1 11
2021 1 1 3 10
2021 2 1 3 7
2021 3 1 4 4
For example, TRX_COUNT_SUM for 2020/1 is 1+2+1+1=5, the total over the 12 months starting at 2020/1.
There are two areas where I am unsure how to proceed:
I tried several variations of cumsum and grouping by USERID, YEAR, MONTH, but I run into errors handling the time window, because there may be months in which a user has no transactions, and these have to be accounted for. For example, counting from 2020/1 the user has no transactions in months 4-11, so the full-year transaction count is 5.
Towards the end there will be partial years, which can be summed up and left as is (like 2021/3 which is left as 4).
Any thoughts on how to handle this?
Thanks!
I was able to accomplish this using a combination of numpy arrays, pandas, and indexing
import pandas as pd
import numpy as np
#df = your dataframe
df_dates = pd.DataFrame(np.arange(np.datetime64('2020-01-01'), np.datetime64('2021-04-01'), np.timedelta64(1, 'M'), dtype='datetime64[M]').astype('datetime64[D]'), columns = ['DATE'])
df_dates['YEAR'] = df_dates['DATE'].dt.year
df_dates['MONTH'] = df_dates['DATE'].dt.month
df_merge = df_dates.merge(df, how = 'left')
df_merge.replace(np.nan, 0, inplace=True)
df_merge.reset_index(inplace = True)
for i in range(0, len(df_merge)):
    max_index = df_merge['index'].max()
    if(i + 11 < max_index):
        df_merge.at[i, 'TRX_COUNT_SUM'] = df_merge.iloc[i:i + 12]['TRX_COUNT'].sum()
    elif(i != max_index):
        df_merge.at[i, 'TRX_COUNT_SUM'] = df_merge.iloc[i:max_index + 1]['TRX_COUNT'].sum()
    else:
        df_merge.at[i, 'TRX_COUNT_SUM'] = df_merge.iloc[i]['TRX_COUNT']
final_df = pd.merge(df_merge, df)
Try this:
# Set the Dataframe index to a time series constructed from YEAR and MONTH
ts = pd.to_datetime(df.assign(DAY=1)[["YEAR", "MONTH", "DAY"]])
df.set_index(ts, inplace=True)
df["TRX_COUNT_SUM"] = (
    # Reindex the dataframe with every missing month in-between
    # Also reverse the index so that rolling(12) means 12 months
    # forward instead of backward
    df.reindex(pd.date_range(ts.min(), ts.max(), freq="MS")[::-1])
    # Roll and sum
    .rolling(12, min_periods=1)
    ["TRX_COUNT"].sum()
)
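To check the idea against the numbers in the question, here is a self-contained sketch of the same reversed-rolling technique on the sample data. It assumes a single USERID, as in the example; the intermediate names monthly, full and forward are illustrative, and missing months are filled with 0 rather than left as NaN.

```python
import pandas as pd

df = pd.DataFrame({
    "YEAR": [2020, 2020, 2020, 2020, 2021, 2021, 2021],
    "MONTH": [1, 2, 3, 12, 1, 2, 3],
    "USERID": [1] * 7,
    "TRX_COUNT": [1, 2, 1, 1, 3, 3, 4],
})

# Month-start timestamps built from YEAR and MONTH
ts = pd.DatetimeIndex(pd.to_datetime(df.assign(DAY=1)[["YEAR", "MONTH", "DAY"]]))
monthly = pd.Series(df["TRX_COUNT"].to_numpy(), index=ts)

# Insert the months with no transactions as zeros
full = monthly.reindex(pd.date_range(ts.min(), ts.max(), freq="MS"), fill_value=0)

# Reverse, roll over 12 rows, reverse back: each month sums itself plus the next 11
forward = full[::-1].rolling(12, min_periods=1).sum()[::-1]

# Keep only the months that exist in the original frame
df["TRX_COUNT_SUM"] = forward.reindex(ts).to_numpy().astype(int)
print(df)
```

With several users, the same steps would have to be applied per USERID, for example inside a groupby; that part is not shown here.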
I have two dataframes.
df_1:
Year ID Flag
2021 1 1
2020 1 0
2021 2 1
df_2:
Year ID
2021 1
2020 2
I'm looking to add the flag from df_1 to df_2 based on ID and year. I think I need to use an np.where statement, but I'm having a hard time figuring it out. Any ideas?
You can use pandas.merge() with how="left" to bring Flag from df1 into df2 (df_1 and df_2 in the question). A left merge keeps every row of df2 and matches on both Year and ID; rows with no match in df1 get NaN.
df2 = df2.merge(df1, on=["Year", "ID"], how="left")
print(df2)
Year ID Flag
0 2021 1 1.0
1 2020 2 NaN
I have a dataframe of id number and dates:
import pandas as pd
df = pd.DataFrame([['1','01/01/2000'], ['1','01/07/2002'],['1', '04/05/2003'],
['2','01/05/2010'], ['2','08/08/2009'],
['3','12/11/2008']], columns=['id','start_date'])
df
id start_date
0 1 01/01/2000
1 1 01/07/2002
2 1 04/05/2003
3 2 01/05/2010
4 2 08/08/2009
5 3 12/11/2008
I am looking for a way to keep, for each id, only the first TWO dates (i.e. the two earliest dates).
For the example above the output would be:
id start_date
0 1 01/01/2000
1 1 01/07/2002
2 2 08/08/2009
3 2 01/05/2010
4 3 12/11/2008
Thanks!
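Ensure the column is a timestamp:
df['start_date']=pd.to_datetime(df['start_date'])
Sort the values:
df=df.sort_values(by=['id','start_date'])
Group and select the first 2 rows only (note the list inside the brackets: selecting several columns with a bare tuple raises an error in recent pandas versions):
df_=df.groupby('id')[['id','start_date']].head(2)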
Just group by id and then you can call head. Be sure to sort your values first.
df = df.sort_values(['id', 'start_date'])
df.groupby('id').head(2)
full code:
df = pd.DataFrame([['1','01/01/2000'], ['1','01/07/2002'],['1', '04/05/2003'],
['2','01/05/2010'], ['2','08/08/2009'],
['3','12/11/2008']], columns=['id','start_date'])
# 1. convert 'start_date' column to datetime
df['start_date'] = pd.to_datetime(df['start_date'])
# 2. sort the dataframe ascending by 'start_date'
df.sort_values(by='start_date', ascending=True, inplace=True)
# 3. select only the first two occurrences of each id
df.groupby('id').head(2)
output:
id start_date
0 1 2000-01-01
1 1 2002-01-07
5 3 2008-12-11
4 2 2009-08-08
3 2 2010-01-05
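The answers above can be combined into one short sketch that reproduces the row order asked for in the question (grouped by id, earliest dates first). The explicit format="%m/%d/%Y" is an assumption: the question does not say whether the strings are month-first or day-first, so adjust it if the data is day-first. The name earliest_two is illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [["1", "01/01/2000"], ["1", "01/07/2002"], ["1", "04/05/2003"],
     ["2", "01/05/2010"], ["2", "08/08/2009"], ["3", "12/11/2008"]],
    columns=["id", "start_date"],
)

# Assumption: dates are month/day/year
df["start_date"] = pd.to_datetime(df["start_date"], format="%m/%d/%Y")

# Sort within each id, then keep the first two rows of every group
earliest_two = (
    df.sort_values(["id", "start_date"])
      .groupby("id")
      .head(2)
      .reset_index(drop=True)
)
print(earliest_two)
```

Sorting by both id and start_date keeps each id's rows together, whereas sorting by start_date alone interleaves the ids as in the output above.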
I have the following data frame :
Loc1 Loc2 Month Trips
a b 1 200
a b 4 500
a b 7 600
c d 6 400
c d 4 300
I need to find out, for every route (Loc1 to Loc2), which month has the most trips, along with the corresponding number of trips.
I ran some code, but the output I get is as follows. How do I get the Trips column to appear as well?
Loc1 Loc2 Month
a b 7
c d 6
The code I used as below :
df = pd.read_csv('data.csv')
df = df[['Loc1','Loc2','Month','Trips']]
df = df.pivot_table(index = ['Loc1', 'Loc2'],
                    columns = 'Month',
                    values = 'Trips',)
df = df.idxmax(axis = 1)
df = df.reset_index()
print(f"Each route's busiest month : \n {df.to_string()}")
Try to sort by Trips in descending order and get the first row per group
df.sort_values(by='Trips', ascending=False).groupby(['Loc1', 'Loc2'], as_index=False).first()
Or:
df.sort_values(by='Trips').groupby(['Loc1', 'Loc2'], as_index=False).last()
NB. I couldn't run the code to test, but you get the general idea.
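Since the answer was not run, here is a small self-contained sketch of the first variant on the sample data from the question. The frame is built inline instead of read from data.csv, and the name busiest is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Loc1": ["a", "a", "a", "c", "c"],
    "Loc2": ["b", "b", "b", "d", "d"],
    "Month": [1, 4, 7, 6, 4],
    "Trips": [200, 500, 600, 400, 300],
})

# Sort so the busiest month comes first within each route, then take that row
busiest = (
    df.sort_values(by="Trips", ascending=False)
      .groupby(["Loc1", "Loc2"], as_index=False)
      .first()
)
print(busiest)
```

Because the whole row is kept, Month and Trips stay together, which is what the pivot_table plus idxmax attempt loses.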
I have a pandas dataframe like this:
date id flow type
2020-04-26 1 3 A
2020-04-27 2 4 A
2020-04-28 1 2 A
2020-04-26 1 -3 B
2020-04-27 1 4 B
2020-04-28 2 3 B
2020-04-26 3 0 C
2020-04-27 2 5 C
I also have a dictionary keyed by 'trailing date' labels:
{'T-1': Timestamp('2020-04-27'),
'T-2': Timestamp('2020-04-26')}
I would like to sum the flows per type (as in the table below), with one column per key in my dictionary, where each sum runs from the trailing date onward, inclusive, but leaves out the flows of the most recent date. In other words, I would like to have this:
type T-1 T-2
A 4 7
B 4 1
Why did I get 4 for T-1 at A? Because if today is the 28th, then T-1 is the 27th, hence the answer is 4. Likewise for T-2, it is 3+4 = 7, etc.
I tried:
df2 = df.groupby(["type","date"])['flow'].sum().unstack("type")
I'm somewhat stuck on what to do after this. Thanks.
Tough problem. There might be a more elegant way to do this, but here is what I came up with.
import pandas as pd
dates1 = pd.Series(range(3), index=pd.date_range('2020-04-26', freq='D', periods=3)).index
dates2 = dates1.copy()
dates3 = dates1.copy()[0:-1]
dates = dates1.append([dates2, dates3])
types = ['A']*3 + ['B']*3 + ['C']*2
df = pd.DataFrame({'date': dates, 'id':[1,2,1,1,1,2,3,2],
'flow': [3,4,2,-3,4,3,0,5], 'type': types})
dates_dict = {'T-1': pd.Timestamp('2020-04-27'), 'T-2': pd.Timestamp('2020-04-26')}
grouped_df = df.groupby(["type","date"])['flow'].sum()
new_dict = {}
for key in dates_dict:
    sums_list = []
    # loops through the unique levels of the grouped_df: 'A', 'B', 'C'
    types = grouped_df.index.get_level_values(0).unique()
    new_dict.update({'type': types})
    for letter in types:
        # sums up the flows by dates
        # on or after the timestamp corresponding to the key
        # but leaves out the most recent date available for that type
        sums_list.append(grouped_df[letter][grouped_df[letter].index >= dates_dict[key]].iloc[:-1].sum())
    new_dict.update({key: sums_list})
final_df = pd.DataFrame(new_dict)
Output:
>>> final_df
type T-1 T-2
0 A 4 7
1 B 4 1
2 C 0 0
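A loop-free variant is sketched below. It rests on one assumption that differs from the code above: "most recent date" is taken to be the latest date in the whole frame (2020-04-28), not the latest date per type. Under that reading A and B match the expected output, while C comes out as 5 instead of 0, because its 04-27 row is no longer treated as the most recent one. The names latest, history and summary are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime([
        "2020-04-26", "2020-04-27", "2020-04-28",
        "2020-04-26", "2020-04-27", "2020-04-28",
        "2020-04-26", "2020-04-27",
    ]),
    "id": [1, 2, 1, 1, 1, 2, 3, 2],
    "flow": [3, 4, 2, -3, 4, 3, 0, 5],
    "type": ["A"] * 3 + ["B"] * 3 + ["C"] * 2,
})
dates_dict = {"T-1": pd.Timestamp("2020-04-27"), "T-2": pd.Timestamp("2020-04-26")}

# Assumption: drop the overall latest date, not the latest date per type
latest = df["date"].max()
history = df[df["date"] < latest]

# One column per key: sum flows on or after that key's trailing date, per type
summary = pd.DataFrame({
    key: history[history["date"] >= start].groupby("type")["flow"].sum()
    for key, start in dates_dict.items()
})
print(summary)
```

Which reading is right depends on whether every type is expected to have a row for the current day; the question does not say.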