The output of the expression (analysis_data['Date'] + pd.DateOffset(1)).dt.week is:
Date Week
2020-12-26 52
2020-12-27 53
2020-12-28 53
2020-12-29 53
2020-12-30 53
2020-12-31 53
2021-01-01 53
2021-01-02 53
2021-01-03 1
But I want my DataFrame to treat week 53 as week 1 as well:
Date Week
2020-12-26 52
2020-12-27 1
2020-12-28 1
2020-12-29 1
2020-12-30 1
2020-12-31 1
2021-01-01 1
2021-01-02 1
2021-01-03 2
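One way this could be done (a sketch, not from the original post): compute the Sunday that starts each week and number weeks from the week that contains January 1, which reproduces the table above, assuming analysis_data['Date'] is a datetime column:
import pandas as pd

dates = analysis_data['Date']

# Sunday that starts the week each date falls in
week_start = dates - pd.to_timedelta((dates.dt.dayofweek + 1) % 7, unit='D')
week_end = week_start + pd.Timedelta(days=6)

# take the "week year" from the last day of the week, so the week that
# straddles the year boundary counts as week 1 of the new year
week_year = week_end.dt.year

# Sunday that starts the week containing January 1 of that year
jan1 = pd.to_datetime(week_year.astype(str) + '-01-01')
first_week_start = jan1 - pd.to_timedelta((jan1.dt.dayofweek + 1) % 7, unit='D')

analysis_data['Week'] = (week_start - first_week_start).dt.days // 7 + 1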
I want my DataFrame to be grouped by calendar week, i.e. Monday to Sunday.
timestamp value
# before time
...
# this is a Friday
2021-10-01 13:00:00 2204.0
2021-10-01 13:30:00 3262.0
...
# this is next Monday
2021-10-04 16:00:00 254.0
2021-10-04 16:30:00 990.0
2021-10-04 17:00:00 1044.0
2021-10-04 17:30:00 26.0
...
# time continues
The result I'm expecting (I hope this is clear enough):
timestamp value weekly_max
# this is a Friday
2021-10-01 13:00:00 2204.0 3262.0 # assume 3262.0 is the maximum value during 2021-09-27 to 2021-10-03
2021-10-01 13:30:00 3262.0 3262.0
...
# this is next Monday
2021-10-04 16:00:00 254.0 1044.0
2021-10-04 16:30:00 990.0 1044.0
2021-10-04 17:00:00 1044.0 1044.0
2021-10-04 17:30:00 26.0 1044.0
...
Get the week number:
df['week'] = df.datetime.dt.isocalendar().week
Get the max for each week:
df_weeklymax = df.groupby('week').agg(max=('value', 'max')).reset_index()
Merge the two tables:
df.merge(df_weeklymax, on='week', how='left')
Example output:
   datetime             value  week  max
0  2021-01-01 00:00:00     20    53   69
1  2021-01-01 13:36:00     69    53   69
2  2021-01-02 03:12:00     69    53   69
3  2021-01-02 16:48:00     57    53   69
4  2021-01-03 06:24:00     39    53   69
5  2021-01-03 20:00:00     56    53   69
6  2021-01-04 09:36:00     73     1   92
7  2021-01-04 23:12:00     76     1   92
8  2021-01-05 12:48:00     92     1   92
9  2021-01-06 02:24:00      4     1   92
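The same weekly_max column could also be attached in a single step with transform, avoiding the separate table and merge. A sketch, assuming the question's original 'timestamp' and 'value' column names, and grouping on the ISO year as well so identical week numbers from different years are not mixed:
iso = df['timestamp'].dt.isocalendar()
# max per (ISO year, ISO week), broadcast back to every row of that week
df['weekly_max'] = df.groupby([iso.year, iso.week])['value'].transform('max')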
I have a df which looks like this:
user_id | date1 | date2 | purchase
1 | 2020-01-01 | 2021-01-01 | 100
1 | 2021-02-01 | 2021-05-01 | 29
2 | 2019-01-01 | 2021-01-01 | 11..
I want a DataFrame which returns, for every user, the sum of purchase amounts between date1 and date2. Those dates are likely always different for each user. How could I achieve this most efficiently?
df.groupby('user_id').purchase.sum()  # But how do I say that only between date1 and date2?
IIUC, first repeat each row once per month between date1 and date2, then aggregate:
# convert both dates to monthly periods
df['date1'] = pd.to_datetime(df['date1']).dt.to_period('m')
df['date2'] = pd.to_datetime(df['date2']).dt.to_period('m')
# number of months covered by each row, inclusive
diff = df['date2'].astype('int').sub(df['date1'].astype('int')) + 1
# repeat each row once per covered month
df = df.loc[df.index.repeat(diff)]
# per original row, count up from date1 to rebuild the month for each repeat
df['date'] = df.groupby(level=0).cumcount().add(df['date1']).dt.to_timestamp()
print(df)
user_id date1 date2 purchase date
0 1 2020-01 2021-01 100 2020-01-01
0 1 2020-01 2021-01 100 2020-02-01
0 1 2020-01 2021-01 100 2020-03-01
0 1 2020-01 2021-01 100 2020-04-01
0 1 2020-01 2021-01 100 2020-05-01
0 1 2020-01 2021-01 100 2020-06-01
0 1 2020-01 2021-01 100 2020-07-01
0 1 2020-01 2021-01 100 2020-08-01
0 1 2020-01 2021-01 100 2020-09-01
0 1 2020-01 2021-01 100 2020-10-01
0 1 2020-01 2021-01 100 2020-11-01
0 1 2020-01 2021-01 100 2020-12-01
0 1 2020-01 2021-01 100 2021-01-01
1 1 2021-02 2021-05 29 2021-02-01
1 1 2021-02 2021-05 29 2021-03-01
1 1 2021-02 2021-05 29 2021-04-01
1 1 2021-02 2021-05 29 2021-05-01
2 2 2019-01 2021-01 11 2019-01-01
2 2 2019-01 2021-01 11 2019-02-01
2 2 2019-01 2021-01 11 2019-03-01
2 2 2019-01 2021-01 11 2019-04-01
2 2 2019-01 2021-01 11 2019-05-01
2 2 2019-01 2021-01 11 2019-06-01
2 2 2019-01 2021-01 11 2019-07-01
2 2 2019-01 2021-01 11 2019-08-01
2 2 2019-01 2021-01 11 2019-09-01
2 2 2019-01 2021-01 11 2019-10-01
2 2 2019-01 2021-01 11 2019-11-01
2 2 2019-01 2021-01 11 2019-12-01
2 2 2019-01 2021-01 11 2020-01-01
2 2 2019-01 2021-01 11 2020-02-01
2 2 2019-01 2021-01 11 2020-03-01
2 2 2019-01 2021-01 11 2020-04-01
2 2 2019-01 2021-01 11 2020-05-01
2 2 2019-01 2021-01 11 2020-06-01
2 2 2019-01 2021-01 11 2020-07-01
2 2 2019-01 2021-01 11 2020-08-01
2 2 2019-01 2021-01 11 2020-09-01
2 2 2019-01 2021-01 11 2020-10-01
2 2 2019-01 2021-01 11 2020-11-01
2 2 2019-01 2021-01 11 2020-12-01
2 2 2019-01 2021-01 11 2021-01-01
df = df.groupby(['user_id','date'], as_index=False).purchase.sum()
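As an aside (not part of the original answer), the same expansion could also be spelled with pd.period_range plus explode, starting again from the original frame once date1 and date2 have been converted to monthly periods as above:
# build the list of covered months per row, then explode it into separate rows
df['date'] = [pd.period_range(a, b, freq='M') for a, b in zip(df['date1'], df['date2'])]
df = df.explode('date')
df['date'] = pd.PeriodIndex(df['date']).to_timestamp()
# then aggregate per user and month as above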
My analysis subjects resemble Netflix subscribers. Users subscribe on a certain date (e.g. 2021-04-25) and unsubscribe on another date (e.g. 2022-01-15), or null if the user is still subscribed:
user_id subscription_start subscription_end
1231 2021-03-24 2021-04-07
1232 2021-05-06 2021-05-26
1234 2021-05-28 null
1235 2021-05-30 2021-06-19
1236 2021-06-01 2021-07-07
1237 2021-06-24 2021-07-09
1238 2021-07-06 null
1239 2021-08-14 null
1240 2021-09-12 null
How could I, using SQL, extract weekly cohort data on user retention? E.g. 2021-03-22 (Monday) - 2021-03-28 (Sunday) is the first cohort, which had a single subscriber on 2021-03-24. This user stayed with the service until 2021-04-07, that is, for 3 weekly cohorts, and should be displayed as active in weeks 1, 2 and 3.
The end result should look like (dummy data):
Subscribed  Week 1  Week 2  Week 3  Week 4  Week 5  Week 6
2021-03-22 100 98 97 82 72 53 21
2021-03-29 100 97 88 88 76 44 22
2021-04-05 100 87 86 86 86 83 81
2021-04-12 100 100 100 99 98 97 96
2021-04-19 100 100 99 89 79 79 79
How do I group one DataFrame by another possibly-non-periodic Series? Mock-up below:
This is the DataFrame to be split:
i = pd.date_range(end="today", periods=20, freq="d").normalize()
v = np.random.randint(0,100,size=len(i))
d = pd.DataFrame({"value": v}, index=i)
>>> d
value
2021-02-06 48
2021-02-07 1
2021-02-08 86
2021-02-09 82
2021-02-10 40
2021-02-11 22
2021-02-12 63
2021-02-13 37
2021-02-14 41
2021-02-15 57
2021-02-16 30
2021-02-17 69
2021-02-18 63
2021-02-19 27
2021-02-20 23
2021-02-21 46
2021-02-22 66
2021-02-23 10
2021-02-24 91
2021-02-25 43
This is the splitting criterion, grouping by the Series dates: a group consists of every DataFrame row whose index falls in the half-open interval [s[i], s[i+1]) between consecutive Series values - though, as with resampling, it would be nice to be able to control the inclusion side.
s = pd.date_range(start="2019-10-14", freq="2W", periods=52).to_series()
s = s.drop(np.random.choice(s.index, 10, replace=False))
s = s.reset_index(drop=True)
>>> s[25:29]
25 2021-01-24
26 2021-02-07
27 2021-02-21
28 2021-03-07
dtype: datetime64[ns]
And this is the example output... or something like it. Index is taken from the series rather than the dataframe.
>>> ???.sum()
value
...
2021-01-24 47
2021-02-07 768
2021-02-21 334
...
Internally the groups would have this structure:
...
2021-01-10
sum: 0
2021-01-24
2021-02-06 47
sum: 47
2021-02-07
2021-02-07 52
2021-02-08 56
2021-02-09 21
2021-02-10 39
2021-02-11 86
2021-02-12 30
2021-02-13 20
2021-02-14 76
2021-02-15 91
2021-02-16 70
2021-02-17 34
2021-02-18 73
2021-02-19 41
2021-02-20 79
sum: 768
2021-02-21
2021-02-21 90
2021-02-22 75
2021-02-23 12
2021-02-24 70
2021-02-25 87
sum: 334
2021-03-07
sum: 0
...
Looks like you can do:
bucket = pd.cut(d.index, bins=s, labels=s[:-1], right=False)
d.groupby(bucket).sum()
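One caveat (my addition, not from the original answer): whether the empty buckets such as 2021-01-10 and 2021-03-07 show up as zero rows depends on how unobserved categories are handled in your pandas version; passing observed=False to groupby should keep them:
# keep empty intervals as rows with sum 0
d.groupby(bucket, observed=False).sum()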
I need some help calculating a 7-day mean for every hour.
The time series has an hourly resolution and I need the 7-day mean for each hour, e.g. for 13:00:
date, x
2020-07-01 13:00 , 4
2020-07-01 14:00 , 3
.
.
.
2020-07-02 13:00 , 3
2020-07-02 14:00 , 7
.
.
.
I tried it with pandas and a rolling mean, but a plain rolling window includes all values from the last 7 days, not just the ones for the same hour.
Thanks for any hints!
Add a new hour column, group by that column, and then take a rolling mean over 7 values, so the average for each hour is calculated over 7 days. This matches the intent of the question.
df['hour'] = df.index.hour
df = df.groupby(df.hour)['x'].rolling(7).mean().reset_index()
df.head(35)
hour level_1 x
0 0 2020-07-01 00:00:00 NaN
1 0 2020-07-02 00:00:00 NaN
2 0 2020-07-03 00:00:00 NaN
3 0 2020-07-04 00:00:00 NaN
4 0 2020-07-05 00:00:00 NaN
5 0 2020-07-06 00:00:00 NaN
6 0 2020-07-07 00:00:00 48.142857
7 0 2020-07-08 00:00:00 50.285714
8 0 2020-07-09 00:00:00 60.000000
9 0 2020-07-10 00:00:00 63.142857
10 1 2020-07-01 01:00:00 NaN
11 1 2020-07-02 01:00:00 NaN
12 1 2020-07-03 01:00:00 NaN
13 1 2020-07-04 01:00:00 NaN
14 1 2020-07-05 01:00:00 NaN
15 1 2020-07-06 01:00:00 NaN
16 1 2020-07-07 01:00:00 52.571429
17 1 2020-07-08 01:00:00 48.428571
18 1 2020-07-09 01:00:00 38.000000
19 2 2020-07-01 02:00:00 NaN
20 2 2020-07-02 02:00:00 NaN
21 2 2020-07-03 02:00:00 NaN
22 2 2020-07-04 02:00:00 NaN
23 2 2020-07-05 02:00:00 NaN
24 2 2020-07-06 02:00:00 NaN
25 2 2020-07-07 02:00:00 46.571429
26 2 2020-07-08 02:00:00 47.714286
27 2 2020-07-09 02:00:00 42.714286
28 3 2020-07-01 03:00:00 NaN
29 3 2020-07-02 03:00:00 NaN
30 3 2020-07-03 03:00:00 NaN
31 3 2020-07-04 03:00:00 NaN
32 3 2020-07-05 03:00:00 NaN
33 3 2020-07-06 03:00:00 NaN
34 3 2020-07-07 03:00:00 72.571429
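If the rolling mean should stay aligned with the original DatetimeIndex instead of being reshaped by the groupby, a transform-based variant might look like this (a sketch, assuming df has a DatetimeIndex and a column 'x'):
# same 7-day-per-hour mean, written back as a column on the original frame
df['x_7d_mean'] = (
    df.groupby(df.index.hour)['x']
      .transform(lambda s: s.rolling(7).mean())
)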