Pandas Group/Merge DataFrame by Non-Periodic Series

How do I group one DataFrame by another possibly-non-periodic Series? Mock-up below:
This is the DataFrame to be split:
i = pd.date_range(end="today", periods=20, freq="d").normalize()
v = np.random.randint(0,100,size=len(i))
d = pd.DataFrame({"value": v}, index=i)
>>> d
value
2021-02-06 48
2021-02-07 1
2021-02-08 86
2021-02-09 82
2021-02-10 40
2021-02-11 22
2021-02-12 63
2021-02-13 37
2021-02-14 41
2021-02-15 57
2021-02-16 30
2021-02-17 69
2021-02-18 63
2021-02-19 27
2021-02-20 23
2021-02-21 46
2021-02-22 66
2021-02-23 10
2021-02-24 91
2021-02-25 43
This is the splitting criterion, grouping by the Series' dates. A group consists of the DataFrame rows whose index falls in [s[k], s[k+1]) - but, as with resampling, it would be nice to control the inclusion parameters.
s = pd.date_range(start="2019-10-14", freq="2W", periods=52).to_series()
s = s.drop(np.random.choice(s.index, 10, replace=False))
s = s.reset_index(drop=True)
>>> s[25:29]
25 2021-01-24
26 2021-02-07
27 2021-02-21
28 2021-03-07
dtype: datetime64[ns]
And this is the example output... or something like it. The index is taken from the Series rather than the DataFrame.
>>> ???.sum()
value
...
2021-01-24 47
2021-02-07 768
2021-02-21 334
...
Internally the groups would have this structure:
...
2021-01-10
sum: 0
2021-01-24
2021-02-06 47
sum: 47
2021-02-07
2021-02-07 52
2021-02-08 56
2021-02-09 21
2021-02-10 39
2021-02-11 86
2021-02-12 30
2021-02-13 20
2021-02-14 76
2021-02-15 91
2021-02-16 70
2021-02-17 34
2021-02-18 73
2021-02-19 41
2021-02-20 79
sum: 768
2021-02-21
2021-02-21 90
2021-02-22 75
2021-02-23 12
2021-02-24 70
2021-02-25 87
sum: 334
2021-03-07
sum: 0
...

Looks like you can do:
bucket = pd.cut(d.index, bins=s, labels=s[:-1], right=False)
d.groupby(bucket).sum()
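
A minimal runnable sketch of that approach (names from the question; right=False makes each bin left-closed, i.e. [s[k], s[k+1]), and labels=s[:-1] stamps each bin with its left edge, so the right flag and the label slice are the inclusion controls):

import numpy as np
import pandas as pd

i = pd.date_range(end="today", periods=20, freq="D").normalize()
d = pd.DataFrame({"value": np.random.randint(0, 100, size=len(i))}, index=i)

s = pd.date_range(start="2019-10-14", freq="2W", periods=52).to_series()
s = s.drop(np.random.choice(s.index, 10, replace=False)).reset_index(drop=True)

# each bucket is the interval [s[k], s[k+1]); rows outside all bins become NaN
bucket = pd.cut(d.index, bins=s, labels=s[:-1], right=False)
# observed=False keeps empty bins, matching the "sum: 0" groups above
print(d.groupby(bucket, observed=False).sum())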

Related

Pandas: Find weekly max from timeseries (calendar week, not 7 days)

I want my dataframe to be grouped by calendar week, Monday to Sunday.
timestamp value
# before time
...
# this is a Friday
2021-10-01 13:00:00 2204.0
2021-10-01 13:30:00 3262.0
...
# this is next Monday
2021-10-04 16:00:00 254.0
2021-10-04 16:30:00 990.0
2021-10-04 17:00:00 1044.0
2021-10-04 17:30:00 26.0
...
# time continues
The result I'm expecting (hope this is clear enough):
timestamp value weekly_max
# this is a Friday
2021-10-01 13:00:00 2204.0 3262.0 # assume 3262.0 is the maximum value during 2021-09-27 to 2021-10-03
2021-10-01 13:30:00 3262.0 3262.0
...
# this is next Monday
2021-10-04 16:00:00 254.0 1044.0
2021-10-04 16:30:00 990.0 1044.0
2021-10-04 17:00:00 1044.0 1044.0
2021-10-04 17:30:00 26.0 1044.0
...
Get the week number:
df['week'] = df.datetime.dt.isocalendar().week
Get the max for each week:
df_weeklymax = df.groupby('week').agg(max=('value', 'max')).reset_index()
Merge the two tables:
df = df.merge(df_weeklymax, on='week', how='left')
example output:
              datetime  value  week  max
0  2021-01-01 00:00:00     20    53   69
1  2021-01-01 13:36:00     69    53   69
2  2021-01-02 03:12:00     69    53   69
3  2021-01-02 16:48:00     57    53   69
4  2021-01-03 06:24:00     39    53   69
5  2021-01-03 20:00:00     56    53   69
6  2021-01-04 09:36:00     73     1   92
7  2021-01-04 23:12:00     76     1   92
8  2021-01-05 12:48:00     92     1   92
9  2021-01-06 02:24:00      4     1   92
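
As a side note, the merge can be avoided by broadcasting each week's max back onto the original rows with transform (a sketch, assuming the timestamp column is named datetime as in the answer code):

df['week'] = df['datetime'].dt.isocalendar().week
df['weekly_max'] = df.groupby('week')['value'].transform('max')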

Weekly cohorts of subscribers retention

My analysis subjects resemble Netflix subscribers. Users subscribe on a certain date (e.g. 2021-04-25) and unsubscribe on another date (e.g. 2022-01-15), or null if the user is still subscribed:
user_id subscription_start subscription_end
1231 2021-03-24 2021-04-07
1232 2021-05-06 2021-05-26
1234 2021-05-28 null
1235 2021-05-30 2021-06-19
1236 2021-06-01 2021-07-07
1237 2021-06-24 2021-07-09
1238 2021-07-06 null
1239 2021-08-14 null
1240 2021-09-12 null
How could I, using SQL, extract weekly cohort data on user retention? E.g. 2021-03-22 (Monday) - 2021-03-28 (Sunday) is the first cohort, which had a single subscriber on 2021-03-24. This user stayed with the service until 2021-04-07, that is, for 3 weekly cohorts, and should be displayed as active in weeks 1, 2 and 3.
The end result should look like (dummy data):
Subscribed   Week 1  Week 2  Week 3  Week 4  Week 5  Week 6  Week 7
2021-03-22 100 98 97 82 72 53 21
2021-03-29 100 97 88 88 76 44 22
2021-04-05 100 87 86 86 86 83 81
2021-04-12 100 100 100 99 98 97 96
2021-04-19 100 100 99 89 79 79 79
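
No answer is recorded here, but the cohort logic is easy to sketch in pandas (the language used elsewhere on this page); the week arithmetic maps directly onto SQL date functions. Assumptions in the sketch: open subscriptions run up to an asof date, cohorts are labeled by the Monday of the signup week, and week 0 is the signup week (the cohort size), so dividing by it gives the percentage view of the example:

import pandas as pd

subs = pd.DataFrame({
    "user_id": [1231, 1232, 1234, 1235],
    "subscription_start": pd.to_datetime(["2021-03-24", "2021-05-06", "2021-05-28", "2021-05-30"]),
    "subscription_end": pd.to_datetime(["2021-04-07", "2021-05-26", None, "2021-06-19"]),
})

asof = pd.Timestamp("2021-09-20")  # stand-in for "today" for still-active users
end = subs["subscription_end"].fillna(asof)

# Monday of the signup week = the cohort label ("W-SUN" periods start on Monday)
cohort = subs["subscription_start"].dt.to_period("W-SUN").dt.start_time
# index of the last weekly cohort the user was still active in (0 = signup week)
last_week = (end.dt.to_period("W-SUN").dt.start_time - cohort).dt.days // 7

# one row per (cohort, active week), then count users per cell
long = pd.DataFrame(
    [(c, w) for c, lw in zip(cohort, last_week) for w in range(lw + 1)],
    columns=["Subscribed", "week"],
)
retention = pd.crosstab(long["Subscribed"], long["week"])
pct = retention.div(retention[0], axis=0).mul(100).round()  # week 0 = 100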

Find Maximum Value in Column Pandas

I have a data frame like this (machine vibration data):
                  datetime tagid  value  quality
0  2021-03-01 13:43:41.440   B42    345      192
1  2021-03-01 13:43:41.440   B43    958      192
2  2021-03-01 13:43:41.440   B44    993      192
3  2021-03-01 13:43:41.440   B45   1224      192
4  2021-03-01 13:43:43.527  B188   6665      192
5  2021-03-01 13:43:43.527  B189   7162      192
6  2021-03-01 13:43:43.527  B190   7193      192
7  2021-03-01 13:43:43.747   C29   2975      192
8  2021-03-01 13:43:43.747   C30   4445      192
9  2021-03-01 13:43:43.747   C31   4015      192
I want to convert this to hourly maximum value for each tag id.
Sample output:
datetime          tagid  value  quality
01-03-2021 13:00    C91   3982      192
01-03-2021 14:00    C91   3972      192
01-03-2021 13:00    C92   9000      192
01-03-2021 14:00    C92   9972      192
01-03-2021 13:00    B42    396      192
01-03-2021 14:00    B42    370      192
01-03-2021 15:00    B42    370      192
I tried with Grouper, but couldn't get the output.
Use Grouper with aggregate max:
df = df.groupby([pd.Grouper(freq='H', key='datetime'), 'tagid']).max().reset_index()
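
The same aggregation can also be written with resample after grouping by tag (a sketch; it keeps only value, whereas the Grouper version above carries the other columns through max as well, and hours with no readings for a tag show up as NaN rather than being dropped):

out = (df.set_index('datetime')
         .groupby('tagid')['value']
         .resample('H')
         .max()
         .reset_index())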

Skip Week Number 53 to week number 1 in pandas

The output of (analysis_data['Date'] + pd.DateOffset(1)).dt.week is
Date Week
2020-12-26 52
2020-12-27 53
2020-12-28 53
2020-12-29 53
2020-12-30 53
2020-12-31 53
2021-01-01 53
2021-01-02 53
2021-01-03 1
But I want my dataframe to treat week 53 as week 1 as well, shifting the following weeks accordingly:
Date Week
2020-12-26 52
2020-12-27 1
2020-12-28 1
2020-12-29 1
2020-12-30 1
2020-12-31 1
2021-01-01 1
2021-01-02 1
2021-01-03 2
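
No answer is recorded here, but one way to get exactly that numbering is the sketch below. It hard-codes the 2020/2021 boundary visible in the example (ISO week 53 becomes week 1, and the real weeks of 2021 shift up by one), and it uses isocalendar().week, which gives the same ISO numbering as the deprecated .dt.week:

import numpy as np

w = (analysis_data['Date'] + pd.DateOffset(1)).dt.isocalendar().week
analysis_data['Week'] = np.where(
    w == 53, 1,  # week 53 -> week 1
    np.where(analysis_data['Date'].dt.year > 2020, w + 1, w),  # 2021 weeks shift up
)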

Capping values after a trigger level in a different variable, after GroupBy

There was an elegant answer to a question almost like this provided by EdChum. The difference between that question and this one is that now the capping needs to be applied to data that has had GroupBy performed.
Original Data:
Symbol DTE Spot Strike Vol
AAPL 30.00 100.00 80.00 14.58
AAPL 30.00 100.00 85.00 16.20
AAPL 30.00 100.00 90.00 18.00
AAPL 30.00 100.00 95.00 20.00
AAPL 30.00 100.00 100.00 22.00
AAPL 30.00 100.00 105.00 25.30
AAPL 30.00 100.00 110.00 29.10
AAPL 30.00 100.00 115.00 33.46
AAPL 30.00 100.00 120.00 38.48
AAPL 50.00 102.00 80.00 13.08
AAPL 50.00 102.00 85.00 14.70
AAPL 50.00 102.00 90.00 16.50
AAPL 50.00 102.00 95.00 18.50
AAPL 50.00 102.00 100.00 20.50
AAPL 50.00 102.00 105.00 23.80
AAPL 50.00 102.00 110.00 27.60
AAPL 50.00 102.00 115.00 31.96
AAPL 50.00 102.00 120.00 36.98
IBM 30.00 170.00 150.00 7.29
IBM 30.00 170.00 155.00 8.10
IBM 30.00 170.00 160.00 9.00
IBM 30.00 170.00 165.00 10.00
IBM 30.00 170.00 170.00 11.00
IBM 30.00 170.00 175.00 12.65
IBM 30.00 170.00 180.00 14.55
IBM 30.00 170.00 185.00 16.73
IBM 30.00 170.00 190.00 19.24
IBM 60.00 171.00 150.00 5.79
IBM 60.00 171.00 155.00 6.60
IBM 60.00 171.00 160.00 7.50
IBM 60.00 171.00 165.00 8.50
IBM 60.00 171.00 170.00 9.50
IBM 60.00 171.00 175.00 11.15
IBM 60.00 171.00 180.00 13.05
IBM 60.00 171.00 185.00 15.23
IBM 60.00 171.00 190.00 17.74
I then create a few new variables:
df['ATM_dist'] = abs(df['Spot'] - df['Strike'])
imin = df.groupby(['DTE', 'Symbol'])['ATM_dist'].transform('idxmin')
df['ATMvol'] = df.loc[imin, 'Vol'].values
df['NormStrike'] = np.log(df['Strike']/df['Spot'])/(((df['DTE']/365)**.5)*df['ATMvol']/100)
The results are below:
Symbol DTE Spot Strike Vol ATM_dist ATMvol NormStrike
0 AAPL 30 100 80 14.58 20 22.0 -3.537916
1 AAPL 30 100 85 16.20 15 22.0 -2.576719
2 AAPL 30 100 90 18.00 10 22.0 -1.670479
3 AAPL 30 100 95 20.00 5 22.0 -0.813249
4 AAPL 30 100 100 22.00 0 22.0 0.000000
5 AAPL 30 100 105 25.30 5 22.0 0.773562
6 AAPL 30 100 110 29.10 10 22.0 1.511132
7 AAPL 30 100 115 33.46 15 22.0 2.215910
8 AAPL 30 100 120 38.48 20 22.0 2.890688
9 AAPL 50 102 80 13.08 22 20.5 -3.201973
10 AAPL 50 102 85 14.70 17 20.5 -2.402955
11 AAPL 50 102 90 16.50 12 20.5 -1.649620
12 AAPL 50 102 95 18.50 7 20.5 -0.937027
13 AAPL 50 102 100 20.50 2 20.5 -0.260994
14 AAPL 50 102 105 23.80 3 20.5 0.382049
15 AAPL 50 102 110 27.60 8 20.5 0.995172
16 AAPL 50 102 115 31.96 13 20.5 1.581035
17 AAPL 50 102 120 36.98 18 20.5 2.141961
18 IBM 30 170 150 7.29 20 11.0 -3.968895
19 IBM 30 170 155 8.10 15 11.0 -2.929137
20 IBM 30 170 160 9.00 10 11.0 -1.922393
21 IBM 30 170 165 10.00 5 11.0 -0.946631
22 IBM 30 170 170 11.00 0 11.0 0.000000
23 IBM 30 170 175 12.65 5 11.0 0.919188
24 IBM 30 170 180 14.55 10 11.0 1.812480
25 IBM 30 170 185 16.73 15 11.0 2.681295
26 IBM 30 170 190 19.24 20 11.0 3.526940
27 IBM 60 171 150 5.79 21 9.5 -3.401827
28 IBM 60 171 155 6.60 16 9.5 -2.550520
29 IBM 60 171 160 7.50 11 9.5 -1.726243
30 IBM 60 171 165 8.50 6 9.5 -0.927332
31 IBM 60 171 170 9.50 1 9.5 -0.152273
32 IBM 60 171 175 11.15 4 9.5 0.600317
33 IBM 60 171 180 13.05 9 9.5 1.331704
34 IBM 60 171 185 15.23 14 9.5 2.043051
35 IBM 60 171 190 17.74 19 9.5 2.735427
I wish to cap the values of 'Vol' at the level where another column, 'NormStrike', hits a trigger (in this case abs(NormStrike) >= 2), writing the result to a new column 'Desired_Level' while leaving the 'Vol' column unchanged. The first cap should set Desired_Level at index location 0 to 16.2, because the trigger fired at index location 1 when NormStrike hit -2.576719.
Added clarification:
I am looking for a generic solution that works outward from the lowest abs(NormStrike) level in both directions, to hit both the -2 and the +2 trigger. If a trigger is not hit (which can happen), then the desired level is just the original level.
An additional note: abs(NormStrike) always grows as you move away from the min(abs(NormStrike)) level, since it is a function of the distance from spot to strike.
The code that EdChum provided (prior to me bringing GroupBy into the mix) is below:
clip = 4
lower = df.loc[df['NS'] <= -clip, 'Vol'].idxmax()
upper = df.loc[df['NS'] >= clip, 'Vol'].idxmin()
df['Original_level'] = df['Original_level'].clip(df.loc[lower,'Original_level'], df.loc[upper, 'Original_level'])
There are 2 issues: first, it did not work after groupby, and second, if a particular group of data does not have an NS value that exceeds the "clip" value, it generates an error. The ideal outcome in that case would be that nothing is done to the Vol level for the Symbol/DTE group in question.
Ed suggested implementing a reset_index() but I am not sure how to use that to solve the issue.
I hope this was not too convoluted a question. Thank you for any assistance.
You can try this to see whether it works out. I assume that where the clip is triggered, NaN is put in and then filled; you can replace that with your own choice.
import pandas as pd
import numpy as np

# np.where(criterion, x, y) is a vectorized "if criterion then x else y"
def func(group):
    trigger = (group['NormStrike'] >= 2) | (group['NormStrike'] <= -4)
    group['Triggered'] = np.where(trigger, 'Yes', 'No')
    group['Desired_Level'] = np.where(trigger, np.nan, group['Vol'])
    # fill the NaNs from the nearest untriggered row on each side
    group = group.ffill().bfill()
    return group

df = df.groupby(['Symbol', 'DTE']).apply(func)
Out[410]:
Symbol DTE Spot Strike Vol ATM_dist ATMvol NormStrike Triggered Desired_Level
0 AAPL 30 100 80 14.58 20 22 -3.5379 No 14.58
1 AAPL 30 100 85 16.20 15 22 -2.5767 No 16.20
2 AAPL 30 100 90 18.00 10 22 -1.6705 No 18.00
3 AAPL 30 100 95 20.00 5 22 -0.8132 No 20.00
4 AAPL 30 100 100 22.00 0 22 0.0000 No 22.00
5 AAPL 30 100 105 25.30 5 22 0.7736 No 25.30
6 AAPL 30 100 110 29.10 10 22 1.5111 No 29.10
7 AAPL 30 100 115 33.46 15 22 2.2159 Yes 29.10
8 AAPL 30 100 120 38.48 20 22 2.8907 Yes 29.10
9 AAPL 50 102 80 14.58 22 22 -3.5379 No 14.58
10 AAPL 50 102 85 16.20 17 22 -2.5767 No 16.20
11 AAPL 50 102 90 18.00 12 22 -1.6705 No 18.00
12 AAPL 50 102 95 20.00 7 22 -0.8132 No 20.00
13 AAPL 50 102 100 22.00 2 22 0.0000 No 22.00
14 AAPL 50 102 105 25.30 3 22 0.7736 No 25.30
15 AAPL 50 102 110 29.10 8 22 1.5111 No 29.10
16 AAPL 50 102 115 33.46 13 22 2.2159 Yes 29.10
17 AAPL 50 102 120 38.48 18 22 2.8907 Yes 29.10
18 AAPL 30 170 150 14.58 20 22 -3.5379 No 14.58
19 AAPL 30 170 155 16.20 15 22 -2.5767 No 16.20
20 AAPL 30 170 160 18.00 10 22 -1.6705 No 18.00
21 AAPL 30 170 165 20.00 5 22 -0.8132 No 20.00
22 AAPL 30 170 170 22.00 0 22 0.0000 No 22.00
23 AAPL 30 170 175 25.30 5 22 0.7736 No 25.30
24 AAPL 30 170 180 29.10 10 22 1.5111 No 29.10
25 AAPL 30 170 185 33.46 15 22 2.2159 Yes 29.10
26 AAPL 30 170 190 38.48 20 22 2.8907 Yes 29.10
27 AAPL 60 171 150 14.58 21 22 -3.5379 No 14.58
28 AAPL 60 171 155 16.20 16 22 -2.5767 No 16.20
29 AAPL 60 171 160 18.00 11 22 -1.6705 No 18.00
30 AAPL 60 171 165 20.00 6 22 -0.8132 No 20.00
31 AAPL 60 171 170 22.00 1 22 0.0000 No 22.00
32 AAPL 60 171 175 25.30 4 22 0.7736 No 25.30
33 AAPL 60 171 180 29.10 9 22 1.5111 No 29.10
34 AAPL 60 171 185 33.46 14 22 2.2159 Yes 29.10
35 AAPL 60 171 190 38.48 19 22 2.8907 Yes 29.10
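
A variant sketch that also covers the asker's two follow-ups, under the column names from the question and the assumption (stated above) that NormStrike is monotonic within each group: it uses the symmetric ±2 trigger, keeps the trigger row's own Vol and caps rows further out to it (matching the "index 0 becomes 16.2" example), and leaves any group untouched where the trigger never fires:

def cap(group, clip=2.0):
    ns = group['NormStrike']
    vol = group['Vol'].copy()
    low = ns[ns <= -clip]
    if not low.empty:
        t = low.idxmax()              # innermost row hitting the low trigger
        vol.loc[ns < ns[t]] = vol[t]  # cap everything further out on that side
    high = ns[ns >= clip]
    if not high.empty:
        t = high.idxmin()             # innermost row hitting the high trigger
        vol.loc[ns > ns[t]] = vol[t]
    group['Desired_Level'] = vol
    return group

df = df.groupby(['Symbol', 'DTE'], group_keys=False).apply(cap)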