Grouping columns to form time series data (Python) - pandas

I have a dataframe df which looks like this
restaurant  opentime  closetime  group
ABX         10:00:00  21:00:00   Gold
BWZ         13:00:00  14:00:00   Silver
GTW         10:00:00  11:00:00   Gold
I want to create a time series dataframe df2 based on the start and end date of my choice which shows the restaurants open by group and indexed by all the hours. In this case, I have taken a start date of May 17th 2021 and an end date of May 18th 2021. The final dataframe should look like this
Date                 Gold  Silver
2021-05-17 9:00:00   0     0
2021-05-17 10:00:00  2     0
2021-05-17 11:00:00  1     0
2021-05-17 12:00:00  1     0
2021-05-17 13:00:00  1     1
2021-05-17 14:00:00  1     1
2021-05-17 15:00:00  1     0
...                  ...   ...
...                  ...   ...
2021-05-18 23:00:00  0     0
If the Date part is too difficult to recreate, a version indexed by time of day alone would also help, looking like this
Time      Gold  Silver
9:00:00   0     0
10:00:00  2     0
11:00:00  1     0
12:00:00  1     0
13:00:00  1     1
14:00:00  1     1
15:00:00  1     0
...       ...   ...
...       ...   ...
23:00:00  0     0
Any help will be appreciated.

First part: Time
For each restaurant, create a list of all hours between opentime and closetime (both inclusive). Explode that list into rows, then count the rows for each (time, group) pair and pivot the groups into columns.
Second part: Date
Create a datetime index that contains every hour between start_date and end_date. Convert it to a Series whose index is the time of day.
Last part: Merge Date and Time
Merge dfd (date dataframe) and dft (time dataframe) on the time of day to get the group counts for each datetime.
import pandas as pd

start_date = "2021-05-17"
end_date = "2021-05-18"
# compute hours between opentime and closetime (both ends included)
df["time"] = df.apply(lambda x: pd.date_range(x["opentime"],
                                              x["closetime"],
                                              freq="h").time, axis="columns")
# value count by time and group
dft = df.explode("time").value_counts(["time", "group"]).unstack("group")
# create datetime index between start_date and end_date
# (inclusive= needs pandas >= 1.4; older versions use closed="left")
dti = pd.date_range(start=pd.to_datetime(start_date),
                    end=pd.to_datetime(end_date) + pd.DateOffset(days=1),
                    inclusive="left", freq="h", name="datetime")
dfd = dti.to_series(index=dti.time)
# merge date and time dataframes
out = pd.merge(dfd, dft, left_index=True, right_index=True, how="left") \
        .set_index("datetime").sort_index().fillna(0)
>>> out
Gold Silver
datetime
2021-05-17 00:00:00 0.0 0.0
2021-05-17 01:00:00 0.0 0.0
2021-05-17 02:00:00 0.0 0.0
2021-05-17 03:00:00 0.0 0.0
2021-05-17 04:00:00 0.0 0.0
2021-05-17 05:00:00 0.0 0.0
2021-05-17 06:00:00 0.0 0.0
2021-05-17 07:00:00 0.0 0.0
2021-05-17 08:00:00 0.0 0.0
2021-05-17 09:00:00 0.0 0.0
2021-05-17 10:00:00 2.0 0.0
2021-05-17 11:00:00 2.0 0.0
2021-05-17 12:00:00 1.0 0.0
2021-05-17 13:00:00 1.0 1.0
2021-05-17 14:00:00 1.0 1.0
2021-05-17 15:00:00 1.0 0.0
2021-05-17 16:00:00 1.0 0.0
2021-05-17 17:00:00 1.0 0.0
2021-05-17 18:00:00 1.0 0.0
2021-05-17 19:00:00 1.0 0.0
2021-05-17 20:00:00 1.0 0.0
2021-05-17 21:00:00 1.0 0.0
2021-05-17 22:00:00 0.0 0.0
2021-05-17 23:00:00 0.0 0.0
2021-05-18 00:00:00 0.0 0.0
2021-05-18 01:00:00 0.0 0.0
2021-05-18 02:00:00 0.0 0.0
2021-05-18 03:00:00 0.0 0.0
2021-05-18 04:00:00 0.0 0.0
2021-05-18 05:00:00 0.0 0.0
2021-05-18 06:00:00 0.0 0.0
2021-05-18 07:00:00 0.0 0.0
2021-05-18 08:00:00 0.0 0.0
2021-05-18 09:00:00 0.0 0.0
2021-05-18 10:00:00 2.0 0.0
2021-05-18 11:00:00 2.0 0.0
2021-05-18 12:00:00 1.0 0.0
2021-05-18 13:00:00 1.0 1.0
2021-05-18 14:00:00 1.0 1.0
2021-05-18 15:00:00 1.0 0.0
2021-05-18 16:00:00 1.0 0.0
2021-05-18 17:00:00 1.0 0.0
2021-05-18 18:00:00 1.0 0.0
2021-05-18 19:00:00 1.0 0.0
2021-05-18 20:00:00 1.0 0.0
2021-05-18 21:00:00 1.0 0.0
2021-05-18 22:00:00 0.0 0.0
>>> dfd.index
Index([00:00:00, 01:00:00, 02:00:00, 03:00:00, 04:00:00, 05:00:00, 06:00:00,
07:00:00, 08:00:00, 09:00:00, 10:00:00, 11:00:00, 12:00:00, 13:00:00,
14:00:00, 15:00:00, 16:00:00, 17:00:00, 18:00:00, 19:00:00, 20:00:00,
21:00:00, 22:00:00, 23:00:00, 00:00:00, 01:00:00, 02:00:00, 03:00:00,
04:00:00, 05:00:00, 06:00:00, 07:00:00, 08:00:00, 09:00:00, 10:00:00,
11:00:00, 12:00:00, 13:00:00, 14:00:00, 15:00:00, 16:00:00, 17:00:00,
18:00:00, 19:00:00, 20:00:00, 21:00:00, 22:00:00, 23:00:00],
dtype='object')
>>> dft.index
Index([10:00:00, 11:00:00, 12:00:00, 13:00:00, 14:00:00, 15:00:00, 16:00:00,
17:00:00, 18:00:00, 19:00:00, 20:00:00, 21:00:00],
dtype='object', name='time')
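For reference, the same three steps can be sketched end to end with the sample data from the question. This is only a minimal sketch and not the answer's exact code: it swaps the merge for a reindex on the time of day so that the counts stay integers, and the names one_hour, opens, closes, counts and hours are made up for illustration. Like the answer above, it counts the closing hour as open.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "restaurant": ["ABX", "BWZ", "GTW"],
    "opentime": ["10:00:00", "13:00:00", "10:00:00"],
    "closetime": ["21:00:00", "14:00:00", "11:00:00"],
    "group": ["Gold", "Silver", "Gold"],
})
one_hour = pd.Timedelta(hours=1)

# Parse the bare times; pandas attaches the placeholder date 1900-01-01.
opens = pd.to_datetime(df["opentime"], format="%H:%M:%S")
closes = pd.to_datetime(df["closetime"], format="%H:%M:%S")

# One list of open hours per restaurant (closing hour included).
df["time"] = [
    list(pd.date_range(o, c, freq=one_hour).time)
    for o, c in zip(opens, closes)
]

# Count open restaurants per (time, group), one column per group.
counts = (
    df.explode("time")
      .groupby(["time", "group"])
      .size()
      .unstack("group", fill_value=0)
)

# Every hour of 2021-05-17 and 2021-05-18, looked up by time of day.
hours = pd.date_range("2021-05-17", periods=48, freq=one_hour, name="datetime")
out = counts.reindex(hours.time, fill_value=0)
out.index = hours
```

Hours at which no restaurant is open are filled with 0 by the reindex, so no fillna step is needed.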

Related

Calculate the rolling average every two weeks for the same day and hour in a DataFrame

I have a Dataframe like the following:
df = pd.DataFrame()
df['datetime'] = pd.date_range(start='2023-1-2', end='2023-1-29', freq='15min')
df['week'] = df['datetime'].apply(lambda x: int(x.isocalendar()[1]))
df['day_of_week'] = df['datetime'].dt.weekday
df['hour'] = df['datetime'].dt.hour
df['minutes'] = pd.DatetimeIndex(df['datetime']).minute
df['value'] = range(len(df))
df.set_index('datetime',inplace=True)
df = week day_of_week hour minutes value
datetime
2023-01-02 00:00:00 1 0 0 0 0
2023-01-02 00:15:00 1 0 0 15 1
2023-01-02 00:30:00 1 0 0 30 2
2023-01-02 00:45:00 1 0 0 45 3
2023-01-02 01:00:00 1 0 1 0 4
... ... ... ... ... ...
2023-01-08 23:00:00 1 6 23 0 668
2023-01-08 23:15:00 1 6 23 15 669
2023-01-08 23:30:00 1 6 23 30 670
2023-01-08 23:45:00 1 6 23 45 671
2023-01-09 00:00:00 2 0 0 0 672
And I want to calculate the average of the column "value" for the same hour/minute/day, every two consecutive weeks.
What I would like to get is the following:
df=
value
day_of_week hour minutes datetime
0 0 0 2023-01-02 00:00:00 NaN
2023-01-09 00:00:00 NaN
2023-01-16 00:00:00 336
2023-01-23 00:00:00 1008
15 2023-01-02 00:15:00 NaN
2023-01-09 00:15:00 NaN
2023-01-16 00:15:00 337
2023-01-23 00:15:00 1009
So the first two weeks should have NaN values and week-3 should be the average of week-1 and week-2 and then week-4 the average of week-2 and week-3 and so on.
I tried the following code but it does not seem to do what I expect:
df = pd.DataFrame(df.groupby(['day_of_week','hour','minutes'])['value'].rolling(window='14D', min_periods=1).mean())
As what I am getting is:
value
day_of_week hour minutes datetime
0 0 0 2023-01-02 00:00:00 0
2023-01-09 00:00:00 336
2023-01-16 00:00:00 1008
2023-01-23 00:00:00 1680
15 2023-01-02 00:15:00 1
2023-01-09 00:15:00 337
2023-01-16 00:15:00 1009
2023-01-23 00:15:00 1681
I think you want to shift within each group. Then you need another groupby:
(df.groupby(['day_of_week','hour','minutes'])['value']
   .rolling(window='14D', min_periods=2).mean()         # `min_periods` is different
   .groupby(['day_of_week','hour','minutes']).shift()   # shift within each group
   .to_frame()
)
Output:
value
day_of_week hour minutes datetime
0 0 0 2023-01-02 00:00:00 NaN
2023-01-09 00:00:00 NaN
2023-01-16 00:00:00 336.0
2023-01-23 00:00:00 1008.0
15 2023-01-02 00:15:00 NaN
... ...
6 23 30 2023-01-15 23:30:00 NaN
2023-01-22 23:30:00 1006.0
45 2023-01-08 23:45:00 NaN
2023-01-15 23:45:00 NaN
2023-01-22 23:45:00 1007.0
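If every weekly slot is present, the same numbers can probably also be obtained without a rolling window, by averaging the two previous rows of each group directly. The sketch below is an alternative to the answer above, not a restatement of it; the column name prev_two_weeks_mean and the variable slot are made up for illustration, and it assumes no week is missing, because shift counts rows rather than calendar days.

```python
import pandas as pd

# Same setup as in the question.
df = pd.DataFrame(
    {"datetime": pd.date_range(start="2023-01-02", end="2023-01-29", freq="15min")}
)
df["day_of_week"] = df["datetime"].dt.weekday
df["hour"] = df["datetime"].dt.hour
df["minutes"] = df["datetime"].dt.minute
df["value"] = range(len(df))
df = df.set_index("datetime")

# Within each (weekday, hour, minute) slot, look one and two rows back,
# which is one and two weeks back as long as no week is missing.
slot = df.groupby(["day_of_week", "hour", "minutes"])["value"]
df["prev_two_weeks_mean"] = (slot.shift(1) + slot.shift(2)) / 2
```

The first two weeks come out as NaN because at least one of the two shifted values is missing there.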

resample dataset with one irregular datetime

I have a dataframe like the following. I want to check the values every 15 minutes, but some timestamps, such as 09:05:51, do not fall on a 15-minute boundary. How can I resample the dataframe to 15-minute intervals?
hour_min value
06:30:00 0.0
06:45:00 0.0
07:00:00 0.0
07:15:00 0.0
07:30:00 102.754717
07:45:00 130.599057
08:00:00 154.117925
08:15:00 189.061321
08:30:00 214.924528
08:45:00 221.382075
09:00:00 190.839623
09:05:51 428.0
09:15:00 170.973995
09:30:00 0.0
09:45:00 0.0
10:00:00 174.448113
10:15:00 174.900943
10:30:00 182.976415
10:45:00 195.783019
11:00:00 200.337292
11:14:00 80.0
11:15:00 206.280952
11:30:00 218.87886
11:45:00 238.251781
12:00:00 115.5
12:15:00 85.5
12:30:00 130.0
12:45:00 141.0
13:00:00 267.353774
13:15:00 257.061321
13:21:00 8.0
13:27:19 80.0
13:30:00 258.761905
13:45:00 254.703088
13:53:52 278.0
14:00:00 254.790476
14:15:00 247.165094
14:30:00 250.061321
14:45:00 264.014151
15:00:00 132.0
15:15:00 108.0
15:30:00 158.5
15:45:00 457.0
16:00:00 273.745283
16:15:00 273.962264
16:30:00 279.089623
16:45:00 280.264151
17:00:00 296.061321
17:15:00 296.481132
17:30:00 282.957547
17:45:00 279.816038
I tried this line, but I get a TypeError.
res = s.resample('15T').sum()
I also tried converting the index to dates, but that did not work either.
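One possible way forward, as a sketch only: resample works on a DatetimeIndex, TimedeltaIndex or PeriodIndex, so a plain index of time-of-day values raises a TypeError. The example below assumes that the hour_min index holds strings such as "09:05:51" and that summing within each bin is what is wanted, as in the attempted line; it uses a few rows from the data above.

```python
import pandas as pd

# A few rows from the question; the index holds time-of-day strings (an assumption).
s = pd.Series(
    [190.839623, 428.0, 170.973995, 0.0],
    index=["09:00:00", "09:05:51", "09:15:00", "09:30:00"],
    name="value",
)

# Parsing bare times gives a DatetimeIndex with the placeholder date 1900-01-01.
s.index = pd.to_datetime(s.index, format="%H:%M:%S")

# 09:05:51 falls into the bin that starts at 09:00:00.
res = s.resample("15min").sum()

# Optionally go back to plain time-of-day labels.
res.index = res.index.time
```

Replacing sum with mean, max or first changes how the irregular 09:05:51 reading is combined with the 09:00:00 one.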

Group by Index of Row in Pandas

I want to group and sum every 7 rows together (to get a total for each week). There are currently two columns: one for the date and the other for a float.
1/22/2020 NaN
1/23/2020 0.0
1/24/2020 1.0
1/25/2020 0.0
1/26/2020 3.0
1/27/2020 0.0
1/28/2020 0.0
1/29/2020 0.0
1/30/2020 0.0
1/31/2020 2.0
2/1/2020 1.0
2/2/2020 0.0
2/3/2020 3.0
2/4/2020 0.0
2/5/2020 0.0
2/6/2020 0.0
2/7/2020 0.0
2/8/2020 0.0
2/9/2020 0.0
2/10/2020 0.0
2/11/2020 1.0
2/12/2020 0.0
2/13/2020 1.0
2/14/2020 0.0
2/15/2020 0.0
2/16/2020 0.0
2/17/2020 0.0
2/18/2020 0.0
2/19/2020 0.0
2/20/2020 0.0
... ...
2/28/2020 0.0
2/29/2020 8.0
3/1/2020 6.0
3/2/2020 23.0
3/3/2020 20.0
3/4/2020 31.0
3/5/2020 68.0
3/6/2020 45.0
3/7/2020 119.0
3/8/2020 114.0
3/9/2020 64.0
3/10/2020 194.0
3/11/2020 397.0
3/12/2020 452.0
3/13/2020 590.0
3/14/2020 710.0
3/15/2020 61.0
3/16/2020 1389.0
3/17/2020 1789.0
3/18/2020 906.0
3/19/2020 3068.0
3/20/2020 4009.0
3/21/2020 4017.0
3/23/2020 25568.0
3/24/2020 10074.0
3/25/2020 12043.0
3/26/2020 18058.0
3/27/2020 17822.0
3/28/2020 19825.0
3/29/2020 19408.0
Assuming your date column is called dt and your value column is val:
import numpy as np
import pandas as pd
# convert to datetime in case the column is not already in that format
df["dt"] = pd.to_datetime(df["dt"])
# the data looks sorted, but sort anyway because the grouping below relies on row order
df = df.sort_values("dt")
# rows 0-6 form group 0, rows 7-13 form group 1, and so on
df = df.groupby(np.arange(len(df)) // 7).agg({"dt": ["min", "max"], "val": "sum"})
The aggregation on dt is there only to show which interval each group covers; taking just the min may be enough, or you can omit it altogether.
Set the date column as the index and use resample
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')
df.resample('1W').sum()
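The two answers above do not necessarily produce the same groups: integer division of the row position makes blocks of exactly 7 rows starting at the first row, while resampling with a weekly frequency uses calendar weeks, which end on Sunday by default. A small sketch on made-up data (14 consecutive days starting on Wednesday 2020-01-22, each with the value 1.0) shows the difference; the names by_block and by_week are illustrative.

```python
import numpy as np
import pandas as pd

# Illustrative data: 14 consecutive days, every value is 1.0.
df = pd.DataFrame({
    "dt": pd.date_range("2020-01-22", periods=14, freq="D"),
    "val": 1.0,
})

# Blocks of 7 rows, counted from the first row.
block = np.arange(len(df)) // 7
by_block = df.groupby(block).agg(
    start=("dt", "min"), end=("dt", "max"), val=("val", "sum")
)

# Calendar weeks ending on Sunday (the default anchor for "W").
by_week = df.set_index("dt").resample("W").sum()
```

Here by_block has two rows with 7 days each, whereas by_week has three rows covering 5, 7 and 2 days, so the choice depends on whether "week" means a run of 7 rows or a calendar week.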

how to fill missing datatime row with pandas

index valuve
2017-01-25 01:00:00:00 1
2017-01-25 02:00:00:00 5
2017-01-25 03:00:00:00 7
2017-01-25 07:00:00:00 34
2017-01-25 20:00:00:00 45
2017-01-25 24:00:00:00 45
2017-01-26 1:00:00:00 31
This dataframe holds an hourly record for each day, but some records are missing. How can I insert the missing rows in the right places and fill the corresponding values with NaN?
The complication here is the hour 24 in the timestamps, which cannot be parsed directly, so replace it with 23 and add one hour after parsing. Finally, use DataFrame.asfreq to add the missing rows to the hourly DatetimeIndex:
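# rows whose timestamp uses the non-standard hour 24
mask = df.index.str.contains(' 24:')
# temporarily rewrite hour 24 as hour 23 so the strings can be parsed
idx = df.index.where(~mask, df.index.str.replace(' 24:', ' 23:'))
idx = pd.to_datetime(idx, format='%Y-%m-%d %H:%M:%S:%f')
# add the hour back, which rolls 24:00 over to 00:00 of the next day
df.index = idx.where(~mask, idx + pd.Timedelta(1, unit='h'))
# insert the missing hourly rows; their values become NaN
df = df.asfreq('h')
print(df)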
valuve
index
2017-01-25 01:00:00 1.0
2017-01-25 02:00:00 5.0
2017-01-25 03:00:00 7.0
2017-01-25 04:00:00 NaN
2017-01-25 05:00:00 NaN
2017-01-25 06:00:00 NaN
2017-01-25 07:00:00 34.0
2017-01-25 08:00:00 NaN
2017-01-25 09:00:00 NaN
2017-01-25 10:00:00 NaN
2017-01-25 11:00:00 NaN
2017-01-25 12:00:00 NaN
2017-01-25 13:00:00 NaN
2017-01-25 14:00:00 NaN
2017-01-25 15:00:00 NaN
2017-01-25 16:00:00 NaN
2017-01-25 17:00:00 NaN
2017-01-25 18:00:00 NaN
2017-01-25 19:00:00 NaN
2017-01-25 20:00:00 45.0
2017-01-25 21:00:00 NaN
2017-01-25 22:00:00 NaN
2017-01-25 23:00:00 NaN
2017-01-26 00:00:00 45.0
2017-01-26 01:00:00 31.0

multi index(time series) slicing error in pandas

I have the dataframe below, with date and time as the two levels of a MultiIndex.
When I run this code,
idx = pd.IndexSlice
print(df_per_wday_temp.loc[idx[:, datetime.time(4, 0, 0):datetime.time(7, 0, 0)]])
I get the error 'MultiIndex Slicing requires the index to be fully lexsorted tuple len (2), lexsort depth (1)'. This may be an error in the index slicing, but I don't know why it happens. Can anybody help me solve it?
a b
date time
2018-01-26 19:00:00 25.08 -7.85
19:15:00 24.86 -7.81
19:30:00 24.67 -8.24
19:45:00 NaN -9.32
20:00:00 NaN -8.29
20:15:00 NaN -8.58
20:30:00 NaN -9.48
20:45:00 NaN -8.73
21:00:00 NaN -8.60
21:15:00 NaN -8.70
21:30:00 NaN -8.53
21:45:00 NaN -8.90
22:00:00 NaN -8.55
22:15:00 NaN -8.48
22:30:00 NaN -9.90
22:45:00 NaN -9.70
23:00:00 NaN -8.98
23:15:00 NaN -9.17
23:30:00 NaN -9.07
23:45:00 NaN -9.45
00:00:00 NaN -9.64
00:15:00 NaN -10.08
00:30:00 NaN -8.87
00:45:00 NaN -9.91
01:00:00 NaN -9.91
01:15:00 NaN -9.93
01:30:00 NaN -9.55
01:45:00 NaN -9.51
02:00:00 NaN -9.75
02:15:00 NaN -9.44
... ... ...
03:45:00 NaN -9.28
04:00:00 NaN -9.96
04:15:00 NaN -10.19
04:30:00 NaN -10.20
04:45:00 NaN -9.85
05:00:00 NaN -10.33
05:15:00 NaN -10.18
05:30:00 NaN -10.81
05:45:00 NaN -10.51
06:00:00 NaN -10.41
06:15:00 NaN -10.49
06:30:00 NaN -10.13
06:45:00 NaN -10.36
07:00:00 NaN -10.71
07:15:00 NaN -12.11
07:30:00 NaN -10.76
07:45:00 NaN -10.76
08:00:00 NaN -11.63
08:15:00 NaN -11.18
08:30:00 NaN -10.49
08:45:00 NaN -11.18
09:00:00 NaN -10.67
09:15:00 NaN -10.60
09:30:00 NaN -10.36
09:45:00 NaN -9.39
10:00:00 NaN -9.77
10:15:00 NaN -9.54
10:30:00 NaN -8.99
10:45:00 NaN -9.01
11:00:00 NaN -10.01
thanks in advance
If sorting the index is not possible, create a boolean mask and filter with boolean indexing:
from datetime import time
mask = df1.index.get_level_values(1).to_series().between(time(4, 0, 0), time(7, 0, 0)).values
df = df1[mask]
print (df)
a b
date time
2018-01-26 04:00:00 NaN -9.96
04:15:00 NaN -10.19
04:30:00 NaN -10.20
04:45:00 NaN -9.85
05:00:00 NaN -10.33
05:15:00 NaN -10.18
05:30:00 NaN -10.81
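The two options, masking an unsorted index or sorting it before slicing, can be compared on a small made-up frame. This is a sketch under assumptions: the data, the column b and the variable names are illustrative, and the time level is assumed to hold datetime.time objects as in the question.

```python
import datetime
import pandas as pd

# Illustrative data: one date, times deliberately out of order.
times = [datetime.time(h, m) for h in (19, 23, 0, 4, 5, 7, 8) for m in (0, 30)]
dates = [pd.Timestamp("2018-01-26")] * len(times)
index = pd.MultiIndex.from_arrays([dates, times], names=["date", "time"])
df1 = pd.DataFrame({"b": range(len(times))}, index=index)

start, end = datetime.time(4, 0), datetime.time(7, 0)

# Option 1: boolean mask; works on an unsorted index and keeps the row order.
level = df1.index.get_level_values("time")
mask = level.to_series().between(start, end).values
by_mask = df1[mask]

# Option 2: sort the index first, then slice with IndexSlice.
idx = pd.IndexSlice
by_slice = df1.sort_index().loc[idx[:, start:end], :]
```

Both selections contain the rows from 04:00 to 07:00 inclusive; the difference is that the second one returns them in sorted index order, which changes the row order of data that wraps past midnight like the frame in the question.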