Calculate the rolling average every two weeks for the same day and hour in a DataFrame - pandas

I have a DataFrame like the following:
df = pd.DataFrame()
df['datetime'] = pd.date_range(start='2023-1-2', end='2023-1-29', freq='15min')
df['week'] = df['datetime'].apply(lambda x: int(x.isocalendar()[1]))
df['day_of_week'] = df['datetime'].dt.weekday
df['hour'] = df['datetime'].dt.hour
df['minutes'] = pd.DatetimeIndex(df['datetime']).minute
df['value'] = range(len(df))
df.set_index('datetime',inplace=True)
df =
week day_of_week hour minutes value
datetime
2023-01-02 00:00:00 1 0 0 0 0
2023-01-02 00:15:00 1 0 0 15 1
2023-01-02 00:30:00 1 0 0 30 2
2023-01-02 00:45:00 1 0 0 45 3
2023-01-02 01:00:00 1 0 1 0 4
... ... ... ... ... ...
2023-01-08 23:00:00 1 6 23 0 668
2023-01-08 23:15:00 1 6 23 15 669
2023-01-08 23:30:00 1 6 23 30 670
2023-01-08 23:45:00 1 6 23 45 671
2023-01-09 00:00:00 2 0 0 0 672
And I want to calculate the average of the column "value" for the same hour/minute/day, every two consecutive weeks.
What I would like to get is the following:
df=
value
day_of_week hour minutes datetime
0 0 0 2023-01-02 00:00:00 NaN
2023-01-09 00:00:00 NaN
2023-01-16 00:00:00 336
2023-01-23 00:00:00 1008
15 2023-01-02 00:15:00 NaN
2023-01-09 00:15:00 NaN
2023-01-16 00:15:00 337
2023-01-23 00:15:00 1009
So the first two weeks should have NaN values, week-3 should be the average of week-1 and week-2, week-4 the average of week-2 and week-3, and so on.
I tried the following code but it does not seem to do what I expect:
df = pd.DataFrame(df.groupby(['day_of_week','hour','minutes'])['value'].rolling(window='14D', min_periods=1).mean())
But what I am getting is:
value
day_of_week hour minutes datetime
0 0 0 2023-01-02 00:00:00 0
2023-01-09 00:00:00 336
2023-01-16 00:00:00 1008
2023-01-23 00:00:00 1680
15 2023-01-02 00:15:00 1
2023-01-09 00:15:00 337
2023-01-16 00:15:00 1009
2023-01-23 00:15:00 1681

I think you want to shift within each group. Then you need another groupby:
(df.groupby(['day_of_week','hour','minutes'])['value']
   .rolling(window='14D', min_periods=2).mean()        # `min_periods` is different
   .groupby(['day_of_week','hour','minutes']).shift()  # shift within each group
   .to_frame()
)
Output:
value
day_of_week hour minutes datetime
0 0 0 2023-01-02 00:00:00 NaN
2023-01-09 00:00:00 NaN
2023-01-16 00:00:00 336.0
2023-01-23 00:00:00 1008.0
15 2023-01-02 00:15:00 NaN
... ...
6 23 30 2023-01-15 23:30:00 NaN
2023-01-22 23:30:00 1006.0
45 2023-01-08 23:45:00 NaN
2023-01-15 23:45:00 NaN
2023-01-22 23:45:00 1007.0
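If you prefer to keep the result aligned with the original frame, an equivalent sketch (assuming exactly one row per (day_of_week, hour, minutes) slot per week, as in this data) using transform:
# average of the two previous weekly observations for each slot;
# shift(1) drops the current week, rolling(2).mean() averages the prior two
df['avg_prev_2w'] = (df.groupby(['day_of_week', 'hour', 'minutes'])['value']
                       .transform(lambda s: s.shift(1).rolling(2).mean()))
For the sample data this reproduces the NaN/336/1008 pattern above, just as a column instead of a MultiIndexed frame.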

Related

Overlap in seconds between datetime range and a time range

I have a dataframe like this:
df11 = pd.DataFrame(
    {
        "Start_date": ["2018-01-31 12:00:00", "2018-02-28 16:00:00", "2018-02-27 22:00:00"],
        "End_date": ["2019-01-31 21:45:00", "2019-03-24 22:00:00", "2018-02-28 01:00:00"],
    }
)
Start_date End_date
0 2018-01-31 12:00:00 2019-01-31 21:45:00
1 2018-02-28 16:00:00 2019-03-24 22:00:00
2 2018-02-27 22:00:00 2018-02-28 01:00:00
I need to check the overlap time duration in specific periods in seconds. My expected results are like this:
Start_date End_date 12h-16h 16h-22h 22h-00h 00h-02h30
0 2018-01-31 12:00:00 2019-01-31 21:45:00 14400 20700 0 0
1 2018-02-28 16:00:00 2019-03-24 22:00:00 0 21600 0 0
2 2018-02-27 22:00:00 2018-02-28 01:00:00 0 0 7200 3600
I know it's completely wrong and I've tried other solutions. This is one of my attempts:
df11['12h-16h'] = np.where(
    (df11['Start_date'] < timedelta(hours=16, minutes=0, seconds=0))
    & (df11['End_date'] > timedelta(hours=12, minutes=0, seconds=0)),
    np.minimum(df11['End_date'], timedelta(hours=16, minutes=0, seconds=0))
    - np.maximum(df11['Start_date'], timedelta(hours=12, minutes=0, seconds=0)))
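For what it's worth, one reading of the expected table is that only the time-of-day of End_date matters (rows 0 and 1 span many months, yet only one day's worth of overlap is counted). Under that assumption, a minimal sketch:
import numpy as np
import pandas as pd

df11["Start_date"] = pd.to_datetime(df11["Start_date"])
df11["End_date"] = pd.to_datetime(df11["End_date"])

day = df11["Start_date"].dt.normalize()  # midnight of the start day
start = df11["Start_date"]
# collapse End_date onto the start day, rolling past midnight when needed
end = day + (df11["End_date"] - df11["End_date"].dt.normalize())
end = end.where(end >= start, end + pd.Timedelta("1D"))

# windows as offsets from midnight of the start day; 00h-02h30 falls on the next day
windows = {
    "12h-16h": (pd.Timedelta(hours=12), pd.Timedelta(hours=16)),
    "16h-22h": (pd.Timedelta(hours=16), pd.Timedelta(hours=22)),
    "22h-00h": (pd.Timedelta(hours=22), pd.Timedelta(hours=24)),
    "00h-02h30": (pd.Timedelta(hours=24), pd.Timedelta(hours=26, minutes=30)),
}
for label, (lo, hi) in windows.items():
    # clip the interval to the window, then floor negative (no-overlap) spans at 0
    overlap = (np.minimum(end, day + hi) - np.maximum(start, day + lo)).dt.total_seconds()
    df11[label] = np.maximum(overlap, 0).astype(int)
This reproduces the expected table for the three sample rows; if multi-day spans should instead accumulate overlap for every day they cover, the windows would have to be iterated per calendar day.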

Pandas resample is jumbling date order

I'm trying to resample some tick data I have into 1-minute blocks. The code appears to work fine, but when I look into the resulting dataframe the dates are in the wrong order. Below is what it looks like pre-resample:
Var2 Var3 Var4 Var5 Var6 Var7 Var8 Var9 Var10
2020-06-30 17:00:00 41.68 2 tptBid tctRegular NaN 255 NaN 0 msNormal
2020-06-30 17:00:00 41.71 3 tptAsk tctRegular NaN 255 NaN 0 msNormal
2020-06-30 17:00:00 41.68 1 tptTrade tctRegular NaN 255 NaN 0 msNormal
2020-06-30 17:00:00 41.71 5 tptAsk tctRegular NaN 255 NaN 0 msNormal
2020-06-30 17:00:00 41.71 8 tptAsk tctRegular NaN 255 NaN 0 msNormal
... ... ... ... ... ... ... ... ... ...
2020-01-07 17:00:21 41.94 5 tptBid tctRegular NaN 255 NaN 0 msNormal
2020-01-07 17:00:27 41.94 4 tptBid tctRegular NaN 255 NaN 0 msNormal
2020-01-07 17:00:40 41.94 3 tptBid tctRegular NaN 255 NaN 0 msNormal
2020-01-07 17:00:46 41.94 4 tptBid tctRegular NaN 255 NaN 0 msNormal
2020-01-07 17:00:50 41.94 3 tptBid tctRegular NaN 255 NaN 0 msNormal
As you can see the date starts at 5pm on the 30th of June. Then I use this code:
one_minute_dataframe = pd.DataFrame()
one_minute_dataframe['Price'] = df.Var2.resample('1min').last()
one_minute_dataframe['Volume'] = df.Var3.resample('1min').sum()
one_minute_dataframe.index = pd.to_datetime(one_minute_dataframe.index)
one_minute_dataframe.sort_index(inplace=True)
And I get the following:
Price Volume
2020-01-07 00:00:00 41.73 416
2020-01-07 00:01:00 41.74 198
2020-01-07 00:02:00 41.76 40
2020-01-07 00:03:00 41.74 166
2020-01-07 00:04:00 41.77 143
... ... ...
2020-06-30 23:55:00 41.75 127
2020-06-30 23:56:00 41.74 234
2020-06-30 23:57:00 41.76 344
2020-06-30 23:58:00 41.72 354
2020-06-30 23:59:00 41.74 451
It seems to want to start from midnight on the 1st of July. But I've tried sorting the index and it still is not changing.
Also, the datetime index seems to add lots more dates outside the ones that were originally in the dataframe and plonks them in the middle of the resampled one.
Any help would be great. Apologies if I've set this out poorly
I see what's happened. Somewhere in the data download the month and day have been switched around. That's why it's putting July at the top: because it thinks it's January.
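A sketch of the repair, assuming the raw file stores day-first strings such as "07/01/2020 17:00:00" meaning 7 January (the file name here is hypothetical): parse with dayfirst=True before resampling.
# re-read the raw data and parse the index as day/month/year
df = pd.read_csv('ticks.csv', index_col=0)
df.index = pd.to_datetime(df.index, dayfirst=True)
df = df.sort_index()
Note that dayfirst has to be applied while the values are still strings; once they have been mis-parsed into datetimes, the damage is already done.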

Time difference between two columns in Pandas

How can I subtract the times in two columns and convert the difference to minutes?
Date Time Ordered Time Delivered
0 1/11/19 9:25:00 am 10:58:00 am
1 1/11/19 10:16:00 am 11:13:00 am
2 1/11/19 10:25:00 am 10:45:00 am
3 1/11/19 10:45:00 am 11:12:00 am
4 1/11/19 11:11:00 am 11:47:00 am
I want to compute Time_delivered - Time_ordered to get the number of minutes the delivery took.
df.time_ordered = pd.to_datetime(df.time_ordered)
This doesn't output the correct time; instead it adds today's date to the time.
Convert both time columns to datetimes, get the difference, convert it to seconds with Series.dt.total_seconds, and then to minutes by dividing by 60 (note the order: delivered minus ordered, so the result is positive):
df['diff'] = (pd.to_datetime(df.time_delivered, format='%I:%M:%S %p')
                .sub(pd.to_datetime(df.time_ordered, format='%I:%M:%S %p'))
                .dt.total_seconds()
                .div(60))
Try to_datetime()
df = pd.DataFrame([['9:25:00 AM', '10:58:00 AM']],
                  columns=['time1', 'time2'])
print(pd.to_datetime(df.time2) - pd.to_datetime(df.time1))
Output:
0   01:33:00
dtype: timedelta64[ns]
Another way is using np.timedelta64:
print(df)
Date Time Ordered Time Delivered
0 1/11/19 9:25:00 am 10:58:00 am
1 1/11/19 10:16:00 am 11:13:00 am
2 1/11/19 10:25:00 am 10:45:00 am
3 1/11/19 10:45:00 am 11:12:00 am
4 1/11/19 11:11:00 am 11:47:00 am
import numpy as np

df['mins'] = (
    pd.to_datetime(df["Date"] + " " + df["Time Delivered"])
    - pd.to_datetime(df["Date"] + " " + df["Time Ordered"])
) / np.timedelta64(1, "m")
output:
print(df)
Date Time Ordered Time Delivered mins
0 1/11/19 9:25:00 am 10:58:00 am 93.0
1 1/11/19 10:16:00 am 11:13:00 am 57.0
2 1/11/19 10:25:00 am 10:45:00 am 20.0
3 1/11/19 10:45:00 am 11:12:00 am 27.0
4 1/11/19 11:11:00 am 11:47:00 am 36.0
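One caveat with all of the above, assuming deliveries can cross midnight (the sample data has none): the raw difference goes negative for an order at 11:50 pm delivered at 12:20 am. A sketch that rolls such differences forward a day:
mins = (pd.to_datetime(df['Time Delivered'], format='%I:%M:%S %p')
        - pd.to_datetime(df['Time Ordered'], format='%I:%M:%S %p')).dt.total_seconds().div(60)
df['mins'] = mins.mod(24 * 60)  # e.g. a -1410 min difference (23:50 -> 00:20) becomes 30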

Transform data by time & Class

I have a DataFrame nf as follows:
DateTime Class Count
0 2017-10-01 00:00:00 1 0
1 2017-10-01 00:00:00 2 240
2 2017-10-01 00:00:00 3 17
3 2017-10-01 00:00:00 4 0
4 2017-10-01 00:00:00 5 1
5 2017-10-01 00:00:00 6 0
6 2017-10-01 00:00:00 7 0
7 2017-10-01 00:00:00 8 0
8 2017-10-01 00:00:00 9 0
9 2017-10-01 00:00:00 10 0
10 2017-10-01 00:00:00 11 0
11 2017-10-01 00:00:00 12 0
12 2017-10-01 00:00:00 13 0
13 2017-10-01 00:00:00 14 0
14 2017-10-01 00:00:00 15 0
..............................
30 2017-10-01 01:00:00 1 0
31 2017-10-01 01:00:00 2 209
32 2017-10-01 01:00:00 3 14
33 2017-10-01 01:00:00 4 0
34 2017-10-01 01:00:00 5 4
35 2017-10-01 01:00:00 6 0
36 2017-10-01 01:00:00 7 0
37 2017-10-01 01:00:00 8 0
38 2017-10-01 01:00:00 9 0
39 2017-10-01 01:00:00 10 0
40 2017-10-01 01:00:00 11 0
41 2017-10-01 01:00:00 12 0
42 2017-10-01 01:00:00 13 0
43 2017-10-01 01:00:00 14 0
44 2017-10-01 01:00:00 15 0
....... and so on
There are 15 classes in total, with one count per class per hour.
I want to pivot the data so that each class becomes its own column, with one row per hour. Required output:
DateTime Class1 Class2 Class3 Class4.........Class15
2017-10-01 00:00:00 0 240 17 0 ......... 0
2017-10-01 01:00:00 0 209 14 0 ......... 0
....
and so on
You can use pandas to read the data into a pd.DataFrame(), select the counts for each class by slicing the DataFrame with conditions, and then concatenate the slices using the datetime as index:
import pandas as pd

# create dataframe from file
df = pd.read_csv('fname')
# or from a numpy array
df = pd.DataFrame(data=np_array, columns=['DateTime', 'Class', 'Count'])

# select the counts for each class
df_c1 = df[df.Class == 1]
df_c2 = df[df.Class == 2]
df_c3 = df[df.Class == 3]
df_c4 = df[df.Class == 4]

# use .values so the slices line up row by row instead of by their original index
df_new = pd.DataFrame()
df_new['DateTime'] = df_c1['DateTime'].values
df_new['Class1'] = df_c1['Count'].values
df_new['Class2'] = df_c2['Count'].values
df_new['Class3'] = df_c3['Count'].values
df_new['Class4'] = df_c4['Count'].values
The code example is really dirty and I'm probably missing a lot, but maybe it gives you some inspiration. I would also recommend you check the pandas documentation for concat() and DataFrame().
I'm going to review and refactor my example code tomorrow, in case the problem is not solved already. Meanwhile, you could fix the layout of the data in your question; it's not readable.
Try pivot_table:
(df.pivot_table(index='DateTime', columns='Class',
                values='Count', aggfunc='sum')
   .add_prefix('Class_'))
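For completeness, a sketch of the same reshape with set_index/unstack, assuming each (DateTime, Class) pair appears at most once:
out = (df.set_index(['DateTime', 'Class'])['Count']
         .unstack(fill_value=0)  # one column per class, one row per DateTime
         .add_prefix('Class'))
out.columns.name = None  # drop the leftover 'Class' axis label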

Pandas - Group into 24-hour blocks, but not midnight-to-midnight

I have a time series. I'd like to group it into 24-hour blocks, from 8am to 7:59am the next day. I know how to group by date, but I've tried and failed to handle this 8-hour offset using TimeGroupers and DateOffsets.
I think you can use Grouper with the base parameter:
print df
date name
0 2015-06-13 00:21:25 1
1 2015-06-14 01:00:25 2
2 2015-06-14 02:54:48 3
3 2015-06-15 14:38:15 2
4 2015-06-15 15:29:28 1
print df.groupby(pd.Grouper(key='date', freq='24h', base=8)).sum()
name
date
2015-06-12 08:00:00 1.0
2015-06-13 08:00:00 5.0
2015-06-14 08:00:00 NaN
2015-06-15 08:00:00 3.0
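Note: in newer pandas (1.1+) the base parameter is deprecated, and it was removed in 2.0 in favour of offset, so the equivalent call there would be:
df.groupby(pd.Grouper(key='date', freq='24h', offset='8h')).sum()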
Alternatively to @jezrael's method, you can use a custom grouper function:
start_ts = '2016-01-01 07:59:59'
df = pd.DataFrame({'Date': pd.date_range(start_ts, freq='10min', periods=1000)})

def my_grouper(df, idx):
    # map each timestamp to its block's date: times before 8am belong to the previous day
    ts = df.loc[idx, 'Date']  # .loc replaces the long-removed .ix
    return ts.date() if ts.hour >= 8 else ts.date() - pd.Timedelta('1day')

df.groupby(lambda x: my_grouper(df, x)).size()
Test:
In [468]: df.head()
Out[468]:
Date
0 2016-01-01 07:59:59
1 2016-01-01 08:09:59
2 2016-01-01 08:19:59
3 2016-01-01 08:29:59
4 2016-01-01 08:39:59
In [469]: df.tail()
Out[469]:
Date
995 2016-01-08 05:49:59
996 2016-01-08 05:59:59
997 2016-01-08 06:09:59
998 2016-01-08 06:19:59
999 2016-01-08 06:29:59
In [470]: df.groupby(lambda x: my_grouper(df, x)).size()
Out[470]:
2015-12-31 1
2016-01-01 144
2016-01-02 144
2016-01-03 144
2016-01-04 144
2016-01-05 144
2016-01-06 144
2016-01-07 135
dtype: int64
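A simpler vectorized sketch of the same idea: shift every timestamp back 8 hours so each 8am-to-8am block lands on a single calendar date, then group by that date.
# 07:59:59 shifts to the previous day; 08:00:00 stays on its own day
df.groupby((df['Date'] - pd.Timedelta(hours=8)).dt.date).size()
This produces the same counts as the custom grouper above.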