Pandas rolling window cumsum

I have a pandas df as follows:
YEAR MONTH USERID TRX_COUNT
2020 1 1 1
2020 2 1 2
2020 3 1 1
2020 12 1 1
2021 1 1 3
2021 2 1 3
2021 3 1 4
I want to sum TRX_COUNT such that each row gets the sum of the TRX_COUNTs over the next 12 months (the current month plus the following 11).
So my end result would look like:
YEAR MONTH USERID TRX_COUNT TRX_COUNT_SUM
2020 1 1 1 5
2020 2 1 2 7
2020 3 1 1 8
2020 12 1 1 11
2021 1 1 3 10
2021 2 1 3 7
2021 3 1 4 4
For example, TRX_COUNT_SUM for 2020/1 is 1+2+1+1=5, the total over the first 12 months.
Two areas where I am unsure how to proceed:
I tried various variations of cumsum and grouping by USERID, YEAR, and MONTH, but I am running into errors handling the time window, since there may be months where a user has no transactions and these have to be accounted for. For example, in 2020/1 the user has no transactions for months 4-11, hence a full year of transaction count is 5.
Towards the end there will be partial years, which can be summed up and left as is (like 2021/3, which stays 4).
Any thoughts on how to handle this?
Thanks!

I was able to accomplish this using a combination of numpy arrays, pandas, and indexing:
import pandas as pd
import numpy as np

# df = your dataframe

# Build one row per month over the full date range
df_dates = pd.DataFrame(
    np.arange(np.datetime64('2020-01-01'), np.datetime64('2021-04-01'),
              np.timedelta64(1, 'M'), dtype='datetime64[M]').astype('datetime64[D]'),
    columns=['DATE'])
df_dates['YEAR'] = df_dates['DATE'].apply(lambda x: int(str(x).split('-')[0]))
df_dates['MONTH'] = df_dates['DATE'].apply(lambda x: int(str(x).split('-')[1]))

# Left-join so months without transactions appear as NaN, then zero-fill them
df_merge = df_dates.merge(df, how='left')
df_merge.replace(np.nan, 0, inplace=True)
df_merge.reset_index(inplace=True)

max_index = df_merge['index'].max()
for i in range(0, len(df_merge)):
    if i + 11 < max_index:
        # A full 12-month window is available
        df_merge.at[i, 'TRX_COUNT_SUM'] = df_merge.iloc[i:i + 12]['TRX_COUNT'].sum()
    elif i != max_index:
        # Partial window at the tail: sum whatever months remain
        df_merge.at[i, 'TRX_COUNT_SUM'] = df_merge.iloc[i:max_index + 1]['TRX_COUNT'].sum()
    else:
        # Last row: only its own count
        df_merge.at[i, 'TRX_COUNT_SUM'] = df_merge.iloc[i]['TRX_COUNT']

# Merge back with the original rows to drop the filler months
final_df = pd.merge(df_merge, df)

Try this:
# Set the DataFrame index to a time series constructed from YEAR and MONTH
ts = pd.to_datetime(df.assign(DAY=1)[["YEAR", "MONTH", "DAY"]])
df.set_index(ts, inplace=True)
df["TRX_COUNT_SUM"] = (
    # Reindex the dataframe with every missing month in between.
    # Also reverse the index so that rolling(12) means 12 months
    # forward instead of backward
    df.reindex(pd.date_range(ts.min(), ts.max(), freq="MS")[::-1])
    # Roll and sum
    .rolling(12, min_periods=1)["TRX_COUNT"]
    .sum()
)
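A quick end-to-end check on the sample data from the question (a minimal sketch; the expected sums are the ones listed above):
import pandas as pd

df = pd.DataFrame({
    'YEAR': [2020]*4 + [2021]*3,
    'MONTH': [1, 2, 3, 12, 1, 2, 3],
    'USERID': [1]*7,
    'TRX_COUNT': [1, 2, 1, 1, 3, 3, 4],
})

ts = pd.to_datetime(df.assign(DAY=1)[["YEAR", "MONTH", "DAY"]])
df.set_index(ts, inplace=True)
df["TRX_COUNT_SUM"] = (
    df.reindex(pd.date_range(ts.min(), ts.max(), freq="MS")[::-1])
    .rolling(12, min_periods=1)["TRX_COUNT"]
    .sum()
)
print(df["TRX_COUNT_SUM"].tolist())  # [5.0, 7.0, 8.0, 11.0, 10.0, 7.0, 4.0]
The assignment back to df works because the rolled series keeps a DatetimeIndex, so pandas aligns it with df's index regardless of the reversed order.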

Related

extract week columns from date in pandas

I have a dataframe that has columns like these:
Date earnings workingday length_week first_wday_week last_wdayweek
01.01.2000 10000 1 1
02.01.2000 0 0 1
03.01.2000 0 0 2
04.01.2000 0 0 2
05.01.2000 0 0 2
06.01.2000 23000 1 2
07.01.2000 1000 1 2
08.01.2000 0 0 2
09.01.2000 0 0 2
..
..
..
30.01.2000 0 0 0
31.01.2000 0 1 3
01.02.2000 0 1 3
02.02.2000 2500 1 3
workingday indicates whether earnings are present on that particular day. I am trying to generate the last three columns from the date:
length_week: the number of working days in that week
first_working_day_of_week: 1 if it is the first working day of a week
last_working_day_of_week: 1 if it is the last working day of a week
Can anyone help me with this?
I first changed the format of your date column, as pd.to_datetime couldn't infer the right date format:
df.Date = df.Date.str.replace('.', '-', regex=False)  # literal dots, so regex=False
df.Date = pd.to_datetime(df.Date, format='%d-%m-%Y')
Then use isocalendar so that we can work with weeks and days more easily:
df[['year', 'week', 'weekday']] = df.Date.dt.isocalendar()
Now length_week is just the sum of workingday for each separate week:
df['length_week'] = df.groupby(['year', 'week']).workingday.transform('sum')
and we can get frst_worday_week with idxmax:
min_indexes = df.groupby(['year', 'week']).workingday.transform('idxmax')
df['frst_worday_week'] = np.where(df.index == min_indexes, 1, 0)
Lastly, last_workdayweek is similar but a bit trickier. We need the last occurrence of the maximum, so we reverse each week inside the groupby:
max_indexes = df.groupby(['year', 'week']).\
    workingday.transform(lambda x: x[::-1].idxmax())
df['last_workdayweek'] = np.where(df.index == max_indexes, 1, 0)
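Putting the pieces together on the first rows of the sample data, a minimal runnable sketch (it assumes a recent pandas where transform('idxmax') broadcasts and .dt.isocalendar() returns a frame):
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Date': ['01.01.2000', '02.01.2000', '03.01.2000', '04.01.2000', '05.01.2000',
             '06.01.2000', '07.01.2000', '08.01.2000', '09.01.2000'],
    'earnings': [10000, 0, 0, 0, 0, 23000, 1000, 0, 0],
    'workingday': [1, 0, 0, 0, 0, 1, 1, 0, 0],
})

df['Date'] = pd.to_datetime(df['Date'].str.replace('.', '-', regex=False), format='%d-%m-%Y')
df[['year', 'week', 'weekday']] = df['Date'].dt.isocalendar()
df['length_week'] = df.groupby(['year', 'week']).workingday.transform('sum')
# first/last working day per ISO week: first and last index of the max
min_indexes = df.groupby(['year', 'week']).workingday.transform('idxmax')
df['frst_worday_week'] = np.where(df.index == min_indexes, 1, 0)
max_indexes = df.groupby(['year', 'week']).workingday.transform(lambda x: x[::-1].idxmax())
df['last_workdayweek'] = np.where(df.index == max_indexes, 1, 0)
print(df)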

np.where multi-conditional based on another column

I have two dataframes.
df_1:
Year  ID  Flag
2021  1   1
2020  1   0
2021  2   1
df_2:
Year  ID
2021  1
2020  2
I'm looking to add the flag from df_1 to df_2 based on ID and year. I think I need to use an np.where statement, but I'm having a hard time figuring it out. Any ideas?
You can use pandas.merge() to combine df1 and df2 with an outer join:
df2["Flag"] = pd.NaT
df2["Flag"].update(df2.merge(df1, on=["Year", "ID"], how="outer")["Flag_y"])
print(df2)
Year ID Flag
0 2020 2 NaT
1 2021 1 1.0
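A plain left merge gives the same result more directly; a minimal sketch using the column names from the question:
import pandas as pd

df_1 = pd.DataFrame({"Year": [2021, 2020, 2021], "ID": [1, 1, 2], "Flag": [1, 0, 1]})
df_2 = pd.DataFrame({"Year": [2021, 2020], "ID": [1, 2]})

# Left merge keeps every row of df_2 and pulls Flag in where Year/ID match;
# unmatched rows get NaN
df_2 = df_2.merge(df_1, on=["Year", "ID"], how="left")
print(df_2)
#    Year  ID  Flag
# 0  2021   1   1.0
# 1  2020   2   NaN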

how to create monthly and season 24 hours average table using pandas

I have a dataframe with 2 columns, Date and LMP, and 8,760 rows in total. This is a dummy dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Date': pd.date_range('2023-01-01 00:00', '2023-12-31 23:00', freq='1H'), 'LMP': np.random.randint(10, 20, 8760)})
I extracted the month from the date and then created a season column for the corresponding dates, like this:
df['month'] = pd.DatetimeIndex(df['Date']).month
season = []
for i in df['month']:
    if i <= 2 or i == 12:
        season.append('Winter')
    elif 2 < i <= 5:
        season.append('Spring')
    elif 5 < i <= 8:
        season.append('Summer')
    else:
        season.append('Autumn')
df['Season'] = season
df2 = df.groupby(['month']).mean()
df3 = df.groupby(['Season']).mean()
print(df2['LMP'])
print(df3['LMP'])
Output:
month
1 20.655113
2 20.885532
3 19.416946
4 22.025248
5 26.040606
6 19.323863
7 51.117965
8 51.434093
9 21.404680
10 14.701989
11 20.009590
12 38.706160
Season
Autumn 18.661426
Spring 22.499365
Summer 40.856845
Winter 26.944382
But I want the output to be 24-hour averages, for both the monthly and the seasonal groupings.
Desired output: a seasonal 24-hour average table and a monthly 24-hour average table (both shown as screenshots in the original post).
Note: in the monthly 24-hour average table the columns are the months (1-12) and the rows are the hours (starting from 0).
Can anyone help?
Try this:
df['hour'] = pd.DatetimeIndex(df['Date']).hour
# Average LMP for each (hour, Season) pair
dft = df[['Season', 'hour', 'LMP']]
dftg = dft.groupby(['hour', 'Season'])['LMP'].mean()
# Pivot seasons into columns; rows are the hours 0-23
dftg.reset_index().pivot(index='hour', columns='Season')
The result (shown as a screenshot in the original post) is a table of hours 0-23 with one column per season.
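The monthly table the question also asks for follows the same pattern; a minimal sketch, assuming df already has the hour and month columns created above:
# Average LMP over each (hour, month) pair, then pivot months into
# columns (rows are hours 0-23, one column per month 1-12)
monthly = (df.groupby(['hour', 'month'])['LMP'].mean()
             .reset_index()
             .pivot(index='hour', columns='month', values='LMP'))
print(monthly)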

group values in pandas and sum after all dates

I have a pandas dataframe like this:
date id flow type
2020-04-26 1 3 A
2020-04-27 2 4 A
2020-04-28 1 2 A
2020-04-26 1 -3 B
2020-04-27 1 4 B
2020-04-28 2 3 B
2020-04-26 3 0 C
2020-04-27 2 5 C
I also have a dictionary of 'trailing_date' keys like this:
{'T-1': Timestamp('2020-04-27'),
 'T-2': Timestamp('2020-04-26')}
I would like to sum the flows for each type, grouped by the keys in my dictionary, where
the sum of flows runs from the trailing date onward but excludes the most recent date. In other words, I would like to have this:
type T-1 T-2
A 4 7
B 4 1
Why did I get 4 for T-1 at A? Because if today is the 28th, then T-1 is the 27th, hence the answer is 4. Likewise at T-2, it's 3+4 = 7, etc.
I tried:
df2 = df.groupby(["type","date"])['flow'].sum().unstack("type")
I'm somewhat stuck on what to do after this. Thanks!
Tough problem. There might be a more elegant way to do this, but here is what I came up with.
import pandas as pd

# Rebuild the example data
dates1 = pd.Series(range(3), index=pd.date_range('2020-04-26', freq='D', periods=3)).index
dates2 = dates1.copy()
dates3 = dates1.copy()[0:-1]
dates = dates1.append([dates2, dates3])
types = ['A']*3 + ['B']*3 + ['C']*2
df = pd.DataFrame({'date': dates, 'id': [1, 2, 1, 1, 1, 2, 3, 2],
                   'flow': [3, 4, 2, -3, 4, 3, 0, 5], 'type': types})
dates_dict = {'T-1': pd.Timestamp('2020-04-27'), 'T-2': pd.Timestamp('2020-04-26')}

grouped_df = df.groupby(["type", "date"])['flow'].sum()
new_dict = {}
for key in dates_dict:
    sums_list = []
    # loop through the unique first-level values of grouped_df: 'A', 'B', 'C'
    types = grouped_df.index.get_level_values(0).unique()
    new_dict.update({'type': types})
    for letter in types:
        # sum up the flows on dates at or after the timestamp corresponding
        # to the key, but leave out the most recent date
        sums_list.append(grouped_df[letter][grouped_df[letter].index >= dates_dict[key]].iloc[:-1].sum())
    new_dict.update({key: sums_list})
final_df = pd.DataFrame(new_dict)
Output:
>>> final_df
type T-1 T-2
0 A 4 7
1 B 4 1
2 C 0 0
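For what it's worth, the same numbers fall out of the unstack attempt from the question; a sketch, assuming df and dates_dict as defined above:
import numpy as np

# One row per date, one column per type (the asker's df2)
wide = df.groupby(["type", "date"])['flow'].sum().unstack("type")
# Blank out each type's own most recent date, mirroring iloc[:-1] above
for col in wide.columns:
    wide.loc[wide[col].last_valid_index(), col] = np.nan
# For each key, sum the rows from its trailing date onward (NaNs are skipped)
final_df = pd.DataFrame({key: wide[wide.index >= ts].sum()
                         for key, ts in dates_dict.items()})
print(final_df)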

week number from given date in pandas

I have a data frame with two columns, Date and value.
I want to add a new column named week_number that is basically how many weeks back each date is from a given date.
import pandas as pd
df = pd.DataFrame(columns=['Date','value'])
df['Date'] = [ '04-02-2019','03-02-2019','28-01-2019','20-01-2019']
df['value'] = [10,20,30,40]
df
Date value
0 04-02-2019 10
1 03-02-2019 20
2 28-01-2019 30
3 20-01-2019 40
Suppose the given date is 05-02-2019.
Then I need to add a week_number column that says how many weeks back each date in the Date column is from the given date.
The output should be:
Date value week_number
0 04-02-2019 10 1
1 03-02-2019 20 1
2 28-01-2019 30 2
3 20-01-2019 40 3
How can I do this in pandas?
First convert the column to datetimes with to_datetime and dayfirst=True, then subtract it from the reference date with rsub, convert the timedeltas to days, floor-divide by 7, and add 1:
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df['week_number'] = df['Date'].rsub(pd.Timestamp('2019-02-05')).dt.days // 7 + 1
#alternative
#df['week_number'] = (pd.Timestamp('2019-02-05') - df['Date']).dt.days // 7 + 1
print (df)
Date value week_number
0 2019-02-04 10 1
1 2019-02-03 20 1
2 2019-01-28 30 2
3 2019-01-20 40 3