np.where multi-conditional based on another column - pandas

I have two dataframes.
df_1:
Year  ID  Flag
2021   1     1
2020   1     0
2021   2     1
df_2:
Year  ID
2021   1
2020   2
I'm looking to add the flag from df_1 to df_2 based on ID and year. I think I need to use an np.where statement, but I'm having a hard time figuring it out. Any ideas?

You can use pandas.merge() to combine df1 and df2 with an outer join, then update the empty Flag column from the merge result:
df2["Flag"] = np.nan
df2["Flag"].update(df2.merge(df1, on=["Year", "ID"], how="outer")["Flag_y"])
print(df2)
   Year  ID  Flag
0  2021   1   1.0
1  2020   2   NaN
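Alternatively, a plain left join keeps exactly the rows of df2 and pulls Flag across in one step, with no need for update — a minimal sketch using the sample frames from the question:

```python
import pandas as pd

df1 = pd.DataFrame({"Year": [2021, 2020, 2021],
                    "ID": [1, 1, 2],
                    "Flag": [1, 0, 1]})
df2 = pd.DataFrame({"Year": [2021, 2020],
                    "ID": [1, 2]})

# Left join: every row of df2 is kept; Flag is NaN where df1 has no match.
df2 = df2.merge(df1, on=["Year", "ID"], how="left")
print(df2)
#    Year  ID  Flag
# 0  2021   1   1.0
# 1  2020   2   NaN
```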


Pandas rolling window cumsum

I have a pandas df as follows:
YEAR MONTH USERID TRX_COUNT
2020 1 1 1
2020 2 1 2
2020 3 1 1
2020 12 1 1
2021 1 1 3
2021 2 1 3
2021 3 1 4
I want to sum the TRX_COUNT such that, each TRX_COUNT is the sum of TRX_COUNTS of the next 12 months.
So my end result would look like
YEAR MONTH USERID TRX_COUNT TRX_COUNT_SUM
2020 1 1 1 5
2020 2 1 2 7
2020 3 1 1 8
2020 12 1 1 11
2021 1 1 3 10
2021 2 1 3 7
2021 3 1 4 4
For example TRX_COUNT_SUM for 2020/1 is 1+2+1+1=5 the count of the first 12 months.
Two areas where I'm unsure how to proceed:
I tried various combinations of cumsum and grouping by USERID, YEAR, and MONTH, but I run into errors handling the time window, because there may be months where a user has no transactions and these have to be accounted for. For example, in 2020/1 the user has no transactions for months 4-11, hence a full year of transaction count is 5.
Towards the end there will be partial years, which can be summed up and left as is (like 2021/3, which is left as 4).
Any thoughts on how to handle this?
Thanks!
I was able to accomplish this using a combination of numpy arrays, pandas, and indexing:
import pandas as pd
import numpy as np

# df = your dataframe
df_dates = pd.DataFrame(
    np.arange(np.datetime64('2020-01-01'), np.datetime64('2021-04-01'),
              np.timedelta64(1, 'M'), dtype='datetime64[M]').astype('datetime64[D]'),
    columns=['DATE'])
df_dates['YEAR'] = df_dates['DATE'].apply(lambda x: int(str(x).split('-')[0]))
df_dates['MONTH'] = df_dates['DATE'].apply(lambda x: int(str(x).split('-')[1]))

df_merge = df_dates.merge(df, how='left')
df_merge.replace(np.nan, 0, inplace=True)
df_merge.reset_index(inplace=True)

max_index = df_merge['index'].max()
for i in range(len(df_merge)):
    if i + 11 < max_index:
        df_merge.at[i, 'TRX_COUNT_SUM'] = df_merge.iloc[i:i + 12]['TRX_COUNT'].sum()
    elif i != max_index:
        df_merge.at[i, 'TRX_COUNT_SUM'] = df_merge.iloc[i:max_index + 1]['TRX_COUNT'].sum()
    else:
        df_merge.at[i, 'TRX_COUNT_SUM'] = df_merge.iloc[i]['TRX_COUNT']

final_df = pd.merge(df_merge, df)
Try this:
# Set the Dataframe index to a time series constructed from YEAR and MONTH
ts = pd.to_datetime(df.assign(DAY=1)[["YEAR", "MONTH", "DAY"]])
df.set_index(ts, inplace=True)
df["TRX_COUNT_SUM"] = (
    # Reindex the dataframe with every missing month in-between.
    # Also reverse the index so that rolling(12) means 12 months
    # forward instead of backward.
    df.reindex(pd.date_range(ts.min(), ts.max(), freq="MS")[::-1])
    # Roll and sum
    .rolling(12, min_periods=1)["TRX_COUNT"]
    .sum()
)
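Run end-to-end on the sample frame from the question (a single user; with several users the same logic would sit inside a groupby("USERID")), the reversed-reindex-and-roll idea gives exactly the expected sums:

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "YEAR":      [2020, 2020, 2020, 2020, 2021, 2021, 2021],
    "MONTH":     [1, 2, 3, 12, 1, 2, 3],
    "USERID":    [1, 1, 1, 1, 1, 1, 1],
    "TRX_COUNT": [1, 2, 1, 1, 3, 3, 4],
})

# Month-start timestamps assembled from YEAR and MONTH, used as the index.
ts = pd.to_datetime(df.assign(DAY=1)[["YEAR", "MONTH", "DAY"]])
df = df.set_index(ts)

# Every month in range, reversed, so a backward window of 12 covers
# the *next* 12 calendar months; missing months reindex to NaN and
# contribute nothing to the sum.
full_range = pd.date_range(ts.min(), ts.max(), freq="MS")[::-1]
df["TRX_COUNT_SUM"] = (
    df["TRX_COUNT"].reindex(full_range).rolling(12, min_periods=1).sum()
)
print(df["TRX_COUNT_SUM"].tolist())  # [5.0, 7.0, 8.0, 11.0, 10.0, 7.0, 4.0]
```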

Rolling Rows in pandas.DataFrame

I have a dataframe that looks like this:
year  month  valueCounts
2019      1    73.411285
2019      2    53.589128
2019      3    71.103842
2019      4    79.528084
I want valueCounts column's values to be rolled like:
year  month  valueCounts
2019      1    53.589128
2019      2    71.103842
2019      3    79.528084
2019      4          NaN
I can do this by dropping the first index of the dataframe and assigning NaN to the last index, but it doesn't look efficient. Is there a simpler method?
Thanks.
Assuming your dataframe is already sorted, use shift:
df['valueCounts'] = df['valueCounts'].shift(-1)
print(df)
# Output
year month valueCounts
0 2019 1 53.589128
1 2019 2 71.103842
2 2019 3 79.528084
3 2019 4 NaN
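If the data ever spans more than one year, a grouped variant of the same shift (an assumption beyond the question's single-year sample) keeps values from leaking across year boundaries:

```python
import pandas as pd

df = pd.DataFrame({
    "year":  [2019, 2019, 2019, 2019],
    "month": [1, 2, 3, 4],
    "valueCounts": [73.411285, 53.589128, 71.103842, 79.528084],
})

# Per-group shift: the last month of each year becomes NaN instead of
# borrowing a value from the following year.
df["valueCounts"] = df.groupby("year")["valueCounts"].shift(-1)
print(df)
```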

finding combination of columns which maximizes a quantity

I have a dataframe which looks like this:
cuid day transactions
0 a mon 1
1 a tue 2
2 b tue 2
3 b wed 1
For each 'cuid' (customer ID) I want to find the day that maximizes the number of transactions. For e.g. for the above df, the output should be a df which looks like
cuid day transactions
1 a tue 2
2 b tue 2
I've tried code which looks like:
dd = {'cuid':['a','a','b','b'],'day':['mon','tue','tue','wed'],'transactions':[1,2,2,1]}
df = pd.DataFrame(dd)
dfg = df.groupby(by=['cuid']).agg(transactions=('transactions',max)).reset_index()
But I'm not able to figure out how to join dfg and df.
Approach 1
idxmax gives you the index of the row where a column value (here, transactions) is at its maximum.
First we set the index:
step1 = df.set_index(['day', 'cuid'])
Then we find the idxmax,
indices = step1.groupby('cuid')['transactions'].idxmax()
and we get the result
step1.loc[indices].reset_index()
Approach 2
df.groupby('cuid')\
    .apply(lambda df: df[df['transactions'] == df['transactions'].max()])\
    .reset_index(drop=True)
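A compact, runnable version of both approaches on the sample data. Note the difference: idxmax returns one row per group (the first maximum), while the explicit filter keeps every row tied for the maximum:

```python
import pandas as pd

dd = {'cuid': ['a', 'a', 'b', 'b'],
      'day': ['mon', 'tue', 'tue', 'wed'],
      'transactions': [1, 2, 2, 1]}
df = pd.DataFrame(dd)

# Approach 1: one row per cuid via idxmax (works directly on the
# default integer index, so no set_index is needed here).
out1 = df.loc[df.groupby('cuid')['transactions'].idxmax()].reset_index(drop=True)

# Approach 2: all rows matching each group's maximum.
out2 = (df.groupby('cuid', group_keys=False)
          .apply(lambda g: g[g['transactions'] == g['transactions'].max()])
          .reset_index(drop=True))
print(out1)
```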

Pandas Shift Column & Remove Row

I have a dataframe 'df1' that has 2 columns, and I need to shift the 2nd column down a row and then remove the entire top row of df1.
My data looks like this:
year ER12
0 2017 -2.05
1 2018 1.05
2 2019 -0.04
3 2020 -0.60
4 2021 -99.99
And, I need it to look like this:
year ER12
0 2018 -2.05
1 2019 1.05
2 2020 -0.04
3 2021 -0.60
We can try this:
df = df.assign(ER12=df.ER12.shift()).dropna().reset_index(drop=True)
print(df)
year ER12
0 2018 -2.05
1 2019 1.05
2 2020 -0.04
3 2021 -0.60
This works on your example (note that shift introduces a NaN, which turns the year column into floats, so we cast it back to int at the end):
import pandas as pd

df = pd.DataFrame({'year': [2017, 2018, 2019, 2020, 2021],
                   'ER12': [-2.05, 1.05, -0.04, -0.6, -99.99]})
df['year'] = df['year'].shift(-1)
df = df.dropna()
df['year'] = df['year'].astype(int)
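For a quick check, the first answer run end-to-end on the sample frame (here shifting ER12 down rather than year up, so year stays an integer column):

```python
import pandas as pd

df = pd.DataFrame({'year': [2017, 2018, 2019, 2020, 2021],
                   'ER12': [-2.05, 1.05, -0.04, -0.60, -99.99]})

# Shift ER12 down one row, then drop the now-incomplete first row.
out = df.assign(ER12=df.ER12.shift()).dropna().reset_index(drop=True)
print(out)
#    year  ER12
# 0  2018 -2.05
# 1  2019  1.05
# 2  2020 -0.04
# 3  2021 -0.60
```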

Name-Specific Variability Calculations Pandas

I'm trying to calculate variability statistics from two df's - one with current data and one df with average data for the month. Suppose I have a df "DF1" that looks like this:
Name year month output
0 A 1991 1 10864.8
1 A 1997 2 11168.5
2 B 1994 1 6769.2
3 B 1998 2 3137.91
4 B 2002 3 4965.21
and a df called "DF2" that contains monthly averages from multiple years such as:
Name month output_average
0 A 1 11785.199
1 A 2 8973.991
2 B 1 8874.113
3 B 2 6132.176667
4 B 3 3018.768
and I need a new df, "DF3", that looks like this, with the calculation done per Name and per month:
Name year month Variability
0 A 1991 1 -0.078097875
1 A 1997 2 0.24454103
2 B 1994 1 -0.237197002
3 B 1998 2 -0.488287737
4 B 2002 3 0.644782
I have tried options like the one below, but I get errors about duplicating the axis or key errors:
DF3['variability'] =
((DF1.output/DF2.set_index('month'['output_average'].reindex(DF1['name']).values)-1)
Thank you for your help in learning Python row calculations, coming from MATLAB!
For matching on two columns, merge is a better fit than set_index:
df3 = df1.merge(df2, on=['Name','month'], how='left')
df3['variability'] = df3['output']/df3['output_average'] - 1
Output:
Name year month output output_average variability
0 A 1991 1 10864.80 11785.199000 -0.078098
1 A 1997 2 11168.50 8973.991000 0.244541
2 B 1994 1 6769.20 8874.113000 -0.237197
3 B 1998 2 3137.91 6132.176667 -0.488288
4 B 2002 3 4965.21 3018.768000 0.644780
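A self-contained version of the merge solution, for quick verification against the expected DF3:

```python
import pandas as pd

df1 = pd.DataFrame({
    "Name": ["A", "A", "B", "B", "B"],
    "year": [1991, 1997, 1994, 1998, 2002],
    "month": [1, 2, 1, 2, 3],
    "output": [10864.8, 11168.5, 6769.2, 3137.91, 4965.21],
})
df2 = pd.DataFrame({
    "Name": ["A", "A", "B", "B", "B"],
    "month": [1, 2, 1, 2, 3],
    "output_average": [11785.199, 8973.991, 8874.113, 6132.176667, 3018.768],
})

# Align each row of df1 with its monthly average by Name and month,
# then compute the relative deviation from that average.
df3 = df1.merge(df2, on=["Name", "month"], how="left")
df3["variability"] = df3["output"] / df3["output_average"] - 1
print(df3[["Name", "year", "month", "variability"]])
```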