Split DateTimeIndex data based on hour/minute/second - pandas

I have time-series data that I would like to split based on hour, or minute, or second. This is generally user-defined. I would like to know how it can be done.
For example, consider the following:
import numpy as np
import pandas as pd

test = pd.DataFrame({'TIME': pd.date_range(start='2016-09-30',
                                           freq='600s', periods=20)})
test['X'] = np.arange(20)
The output is:
0 2016-09-30 00:00:00 0
1 2016-09-30 00:10:00 1
2 2016-09-30 00:20:00 2
3 2016-09-30 00:30:00 3
4 2016-09-30 00:40:00 4
5 2016-09-30 00:50:00 5
6 2016-09-30 01:00:00 6
7 2016-09-30 01:10:00 7
8 2016-09-30 01:20:00 8
9 2016-09-30 01:30:00 9
10 2016-09-30 01:40:00 10
11 2016-09-30 01:50:00 11
12 2016-09-30 02:00:00 12
13 2016-09-30 02:10:00 13
14 2016-09-30 02:20:00 14
15 2016-09-30 02:30:00 15
16 2016-09-30 02:40:00 16
17 2016-09-30 02:50:00 17
18 2016-09-30 03:00:00 18
19 2016-09-30 03:10:00 19
Suppose I want to split it by hour. I would like the following as one chunk which I can then save to a file.
0 2016-09-30 00:00:00 0
1 2016-09-30 00:10:00 1
2 2016-09-30 00:20:00 2
3 2016-09-30 00:30:00 3
4 2016-09-30 00:40:00 4
5 2016-09-30 00:50:00 5
The second chunk would be:
0 2016-09-30 01:00:00 6
1 2016-09-30 01:10:00 7
2 2016-09-30 01:20:00 8
3 2016-09-30 01:30:00 9
4 2016-09-30 01:40:00 10
5 2016-09-30 01:50:00 11
and so on...
Note I can do it purely based on logical conditions such as,
df[(df['TIME'] >= '2016-09-30 00:00:00') &
(df['TIME'] <= '2016-09-30 00:50:00')]
and repeat....
but what if my sampling changes? Is there a way to create a mask, or something similar, that takes less code and is efficient? I have 10 GB of data.

Option 1
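You can group by Series that are not columns of the object you are grouping, for example the date and hour derived from TIME:
test.groupby([test['TIME'].dt.date, test['TIME'].dt.hour])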
Option 2
Use pd.Grouper with a freq argument (pd.TimeGrouper is no longer available in current pandas releases):
test.set_index('TIME').groupby(pd.Grouper(freq='s'))   # Group by seconds
test.set_index('TIME').groupby(pd.Grouper(freq='min')) # Group by minutes
test.set_index('TIME').groupby(pd.Grouper(freq='h'))   # Group by hours

You need to use groupby for this, and the grouping should be based on date and hour:
test['DATE'] = test['TIME'].dt.date
test['HOUR'] = test['TIME'].dt.hour
grp = test.groupby(['DATE', 'HOUR'])
You can then loop over the groups and do the operation you want.
for key, df in grp:
    print(key, df)
((datetime.date(2016, 9, 30), 0), TIME X DATE HOUR
0 2016-09-30 00:00:00 0 2016-09-30 0
1 2016-09-30 00:10:00 1 2016-09-30 0
2 2016-09-30 00:20:00 2 2016-09-30 0
3 2016-09-30 00:30:00 3 2016-09-30 0
4 2016-09-30 00:40:00 4 2016-09-30 0
5 2016-09-30 00:50:00 5 2016-09-30 0)
((datetime.date(2016, 9, 30), 1), TIME X DATE HOUR
6 2016-09-30 01:00:00 6 2016-09-30 1
7 2016-09-30 01:10:00 7 2016-09-30 1
8 2016-09-30 01:20:00 8 2016-09-30 1
9 2016-09-30 01:30:00 9 2016-09-30 1
10 2016-09-30 01:40:00 10 2016-09-30 1
11 2016-09-30 01:50:00 11 2016-09-30 1)
((datetime.date(2016, 9, 30), 2), TIME X DATE HOUR
12 2016-09-30 02:00:00 12 2016-09-30 2
13 2016-09-30 02:10:00 13 2016-09-30 2
14 2016-09-30 02:20:00 14 2016-09-30 2
15 2016-09-30 02:30:00 15 2016-09-30 2
16 2016-09-30 02:40:00 16 2016-09-30 2
17 2016-09-30 02:50:00 17 2016-09-30 2)
((datetime.date(2016, 9, 30), 3), TIME X DATE HOUR
18 2016-09-30 03:00:00 18 2016-09-30 3
19 2016-09-30 03:10:00 19 2016-09-30 3)
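
Since the question asks for chunks that can be saved to files with a user-defined bin width, here is a minimal sketch of that loop. It swaps the DATE/HOUR helper columns for pd.Grouper so the bin width is a single string. The helper name split_by_frequency, the temporary output directory, and the file-name pattern are assumptions made for illustration, not part of the answers above.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd


def split_by_frequency(df, time_col, freq):
    """Yield (bin_start, chunk) pairs, one per non-empty time bin of width `freq`."""
    for bin_start, chunk in df.groupby(pd.Grouper(key=time_col, freq=freq)):
        # Bins that fall into gaps in the data can come back empty, so skip them.
        if not chunk.empty:
            yield bin_start, chunk.reset_index(drop=True)


# Same sample data as in the question.
test = pd.DataFrame({"TIME": pd.date_range(start="2016-09-30", freq="600s", periods=20)})
test["X"] = np.arange(20)

out_dir = Path(tempfile.mkdtemp())  # hypothetical output location
chunks = list(split_by_frequency(test, "TIME", "60min"))  # "60min" = hourly bins

written = []
for bin_start, chunk in chunks:
    path = out_dir / f"chunk_{bin_start:%Y%m%d_%H%M%S}.csv"
    chunk.to_csv(path, index=False)
    written.append(path.name)
```

Changing "60min" to another offset alias such as "10min" or "30s" changes the bin width without touching the rest of the loop. For data that does not fit in memory, this would still need to be combined with chunked reading, which the sketch does not cover.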


Get value of same hour 1 and 2 days before, 1 week before, 1 month before

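I have time series data with other fields.
Now I want to create more columns holding the value at the same hour 1 day before, 2 days before, 1 week before, and 1 month before.
If no value is present at that hour, the value should be set to zero.
The DataFrame can be loaded from here: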
url = 'https://drive.google.com/file/d/1BXvJqKGLwG4hqWJvh9gPAHqCbCcCKkUT/view?usp=sharing'
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
df = pd.read_csv(path,index_col=0,delimiter=",")
The DataFrame looks like the following:
time StartCity District Id stype EndCity Count
2021-09-15 09:00:00 1 104 2713 21 9 2
2021-05-16 11:00:00 1 107 1044 11 6 1
2021-05-16 12:00:00 1 107 1044 11 6 0
2021-05-16 13:00:00 1 107 1044 11 6 0
2021-05-16 14:00:00 1 107 1044 11 6 0
2021-05-16 15:00:00 1 107 1044 11 6 0
2021-05-16 16:00:00 1 107 1044 11 6 0
2021-05-16 17:00:00 1 107 1044 11 6 0
2021-05-16 18:00:00 1 107 1044 11 6 0
2021-05-16 19:00:00 1 107 1044 11 6 0
2021-05-16 20:00:00 1 107 1044 11 6 0
2021-05-16 21:00:00 1 107 1044 11 6 0
2021-05-16 22:00:00 1 107 1044 11 6 0
2021-05-16 23:00:00 1 107 1044 11 6 0
2021-05-17 00:00:00 1 107 1044 11 6 0
2021-05-17 01:00:00 1 107 1044 11 6 0
2021-05-17 02:00:00 1 107 1044 11 6 0
2021-05-17 03:00:00 1 107 1044 11 6 0
2021-05-17 04:00:00 1 107 1044 11 6 0
2021-05-17 05:00:00 1 107 1044 11 6 0
2021-05-17 06:00:00 1 107 1044 11 6 0
2021-05-17 07:00:00 1 107 1044 11 6 0
2021-05-17 08:00:00 1 107 1044 11 6 0
2021-05-17 09:00:00 1 107 1044 11 6 0
2021-05-17 10:00:00 1 107 1044 11 6 0
2021-05-17 11:00:00 1 107 1044 11 6 0
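
One possible way to build such columns is to look up each row's timestamp minus a fixed offset and fall back to zero when nothing is found. The sketch below assumes the timestamps are unique within the series being processed (for example, one StartCity/District/Id/stype/EndCity combination at a time). The function name add_lag_columns and the new column names are hypothetical, since the question does not spell out the desired names.

```python
import pandas as pd


def add_lag_columns(df, time_col="time", value_col="Count"):
    """Add columns with `value_col` at the same hour 1 day, 2 days, 1 week, 1 month earlier.

    Assumes the timestamps in `time_col` are unique; missing earlier hours become 0.
    """
    out = df.copy()
    out[time_col] = pd.to_datetime(out[time_col])
    lookup = out.set_index(time_col)[value_col]

    # Hypothetical column names mapped to the offset they look back by.
    offsets = {
        "1_day_before": pd.DateOffset(days=1),
        "2_days_before": pd.DateOffset(days=2),
        "1_week_before": pd.DateOffset(weeks=1),
        "1_month_before": pd.DateOffset(months=1),
    }
    for name, offset in offsets.items():
        earlier = pd.DatetimeIndex(out[time_col]) - offset
        # reindex returns the value stored at each earlier timestamp, or 0 if absent.
        out[name] = lookup.reindex(earlier, fill_value=0).to_numpy()
    return out


# Synthetic hourly data (not the file from the question), 40 days long.
times = pd.date_range("2021-05-01", periods=24 * 40, freq="60min")
demo = pd.DataFrame({"time": times, "Count": range(len(times))})
lagged = add_lag_columns(demo)
```

To apply this per route, the function could be called inside a groupby over the identifying columns, as long as each group has unique timestamps.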

Pandas: create a period based on date column

I have a dataframe
ID datetime
11 01-09-2021 10:00:00
11 01-09-2021 10:15:15
11 01-09-2021 15:00:00
12 01-09-2021 15:10:00
11 01-09-2021 18:00:00
I need to add a period column based only on datetime: the period increases by one whenever the gap to the previous row exceeds 2 hours.
ID datetime period
11 01-09-2021 10:00:00 1
11 01-09-2021 10:15:15 1
11 01-09-2021 15:00:00 2
12 01-09-2021 15:10:00 2
11 01-09-2021 18:00:00 3
And the same thing but based on ID and datetime
ID datetime period
11 01-09-2021 10:00:00 1
11 01-09-2021 10:15:15 1
11 01-09-2021 15:00:00 2
12 01-09-2021 15:10:00 1
11 01-09-2021 18:00:00 3
How can I do that?
You can get the difference between consecutive rows with Series.diff, convert it to hours via Series.dt.total_seconds, compare it with 2, and take the cumulative sum:
df['period'] = df['datetime'].diff().dt.total_seconds().div(3600).gt(2).cumsum().add(1)
print (df)
ID datetime period
0 11 2021-01-09 10:00:00 1
1 11 2021-01-09 10:15:15 1
2 11 2021-01-09 15:00:00 2
3 12 2021-01-09 15:10:00 2
4 11 2021-01-09 18:00:00 3
The same idea applied per group:
f = lambda x: x.diff().dt.total_seconds().div(3600).gt(2).cumsum().add(1)
df['period'] = df.groupby('ID')['datetime'].transform(f)
print (df)
ID datetime period
0 11 2021-01-09 10:00:00 1
1 11 2021-01-09 10:15:15 1
2 11 2021-01-09 15:00:00 2
3 12 2021-01-09 15:10:00 1
4 11 2021-01-09 18:00:00 3

Daily calculations in intraday data

Let's say I have a DataFrame with date_time index:
date_time a b
2020-11-23 04:00:00 10 5
2020-11-23 05:00:00 11 5
2020-11-23 06:00:00 12 5
2020-11-24 04:30:00 13 6
2020-11-24 05:30:00 14 6
2020-11-24 06:30:00 15 6
2020-11-25 06:00:00 16 7
2020-11-25 07:00:00 17 7
2020-11-25 08:00:00 18 7
"a" column is intraday data (every row - different value). "b" column - DAILY data - same data during the current day.
I need to make some calculations with "b" (daily) column and create "c" column with the result. For example, sum for two last days.
date_time a b c
2020-11-23 04:00:00 10 5 NaN
2020-11-23 05:00:00 11 5 NaN
2020-11-23 06:00:00 12 5 NaN
2020-11-24 04:30:00 13 6 11
2020-11-24 05:30:00 14 6 11
2020-11-24 06:30:00 15 6 11
2020-11-25 06:00:00 16 7 13
2020-11-25 07:00:00 17 7 13
2020-11-25 08:00:00 18 7 13
I guess I should use something like
df['c'] = df.resample('D').b.rolling(3).sum ...
but I got "NaN" values in "c".
Could you help me? Thanks!
One thing you can do is to drop duplicates on the date and work on that:
# get the dates
df['date'] = df['date_time'].dt.normalize()
df['c'] = (df.drop_duplicates('date')['b']  # keep the first row of each date
             .rolling(2).sum())             # rolling sum over two daily values
df['c'] = df['c'].ffill()  # copy each day's result to the remaining rows of that day
date_time a b date c
0 2020-11-23 04:00:00 10 5 2020-11-23 NaN
1 2020-11-23 05:00:00 11 5 2020-11-23 NaN
2 2020-11-23 06:00:00 12 5 2020-11-23 NaN
3 2020-11-24 04:30:00 13 6 2020-11-24 11.0
4 2020-11-24 05:30:00 14 6 2020-11-24 11.0
5 2020-11-24 06:30:00 15 6 2020-11-24 11.0
6 2020-11-25 06:00:00 16 7 2020-11-25 13.0
7 2020-11-25 07:00:00 17 7 2020-11-25 13.0
8 2020-11-25 08:00:00 18 7 2020-11-25 13.0
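
A variant of the same idea, sketched here as an alternative rather than as the answer's method, is to reduce "b" to one value per day, run the rolling sum on that daily series, and map the result back onto the intraday rows. This avoids the forward fill. The data is retyped from the question, with date_time as a regular column as in the answer's output.

```python
import pandas as pd

df = pd.DataFrame({
    "date_time": pd.to_datetime([
        "2020-11-23 04:00", "2020-11-23 05:00", "2020-11-23 06:00",
        "2020-11-24 04:30", "2020-11-24 05:30", "2020-11-24 06:30",
        "2020-11-25 06:00", "2020-11-25 07:00", "2020-11-25 08:00",
    ]),
    "a": range(10, 19),
    "b": [5, 5, 5, 6, 6, 6, 7, 7, 7],
})

dates = df["date_time"].dt.normalize()      # midnight of each row's day
daily_b = df.groupby(dates)["b"].first()    # one "b" value per day
daily_c = daily_b.rolling(2).sum()          # sum over the last two days present
df["c"] = dates.map(daily_c)                # broadcast back to the intraday rows
```

Note that rolling(2) counts rows of the daily series, so "two days" here means the last two days that appear in the data, not two calendar days.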

7 days hourly mean with pandas

I need some help calculating a 7 days mean for every hour.
The timeseries has a hourly resolution and I need the 7 days mean for each hour e.g. for 13 o'clock
date, x
2020-07-01 13:00 , 4
2020-07-01 14:00 , 3
2020-07-02 13:00 , 3
2020-07-02 14:00 , 7
I tried it with pandas and a rolling mean, but rolling includes last 7 days.
Thanks for any hints!
Add a new hour column, group by it, and then take a rolling mean over 7 rows within each group.
Each group holds one value per day for that hour, so the mean covers 7 days, which matches the intent of the question.
df['hour'] = df.index.hour
df = df.groupby(df.hour)['x'].rolling(7).mean().reset_index()
hour level_1 x
0 0 2020-07-01 00:00:00 NaN
1 0 2020-07-02 00:00:00 NaN
2 0 2020-07-03 00:00:00 NaN
3 0 2020-07-04 00:00:00 NaN
4 0 2020-07-05 00:00:00 NaN
5 0 2020-07-06 00:00:00 NaN
6 0 2020-07-07 00:00:00 48.142857
7 0 2020-07-08 00:00:00 50.285714
8 0 2020-07-09 00:00:00 60.000000
9 0 2020-07-10 00:00:00 63.142857
10 1 2020-07-01 01:00:00 NaN
11 1 2020-07-02 01:00:00 NaN
12 1 2020-07-03 01:00:00 NaN
13 1 2020-07-04 01:00:00 NaN
14 1 2020-07-05 01:00:00 NaN
15 1 2020-07-06 01:00:00 NaN
16 1 2020-07-07 01:00:00 52.571429
17 1 2020-07-08 01:00:00 48.428571
18 1 2020-07-09 01:00:00 38.000000
19 2 2020-07-01 02:00:00 NaN
20 2 2020-07-02 02:00:00 NaN
21 2 2020-07-03 02:00:00 NaN
22 2 2020-07-04 02:00:00 NaN
23 2 2020-07-05 02:00:00 NaN
24 2 2020-07-06 02:00:00 NaN
25 2 2020-07-07 02:00:00 46.571429
26 2 2020-07-08 02:00:00 47.714286
27 2 2020-07-09 02:00:00 42.714286
28 3 2020-07-01 03:00:00 NaN
29 3 2020-07-02 03:00:00 NaN
30 3 2020-07-03 03:00:00 NaN
31 3 2020-07-04 03:00:00 NaN
32 3 2020-07-05 03:00:00 NaN
33 3 2020-07-06 03:00:00 NaN
34 3 2020-07-07 03:00:00 72.571429
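
If the result should stay aligned with the original hourly index instead of being regrouped by hour, the same grouping can be combined with transform. The following is a minimal sketch on synthetic data. The column name x_7d_mean is an assumption, and the window counts rows, so it equals 7 days only when no hours are missing.

```python
import numpy as np
import pandas as pd

# Synthetic hourly series, 10 days long: x equals the number of hours since the start.
idx = pd.date_range("2020-07-01", periods=24 * 10, freq="60min")
df = pd.DataFrame({"x": np.arange(len(idx), dtype=float)}, index=idx)

# For each hour of the day, average the current and the 6 previous days at that hour.
df["x_7d_mean"] = (
    df.groupby(df.index.hour)["x"]
      .transform(lambda s: s.rolling(7).mean())
)

# Rows for 13:00 only, to inspect a single hour across days.
at_13 = df[df.index.hour == 13]
```

The first six days are NaN for every hour because the window is not yet full; passing min_periods to rolling would relax that.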

How to get count incremental by date

I am trying to get a count of rows with incremental dates.
My table looks like this:
ID name status create_date
1 John AC 2016-01-01 00:00:26.513
2 Jane AC 2016-01-02 00:00:26.513
3 Kane AC 2016-01-02 00:00:26.513
4 Carl AC 2016-01-03 00:00:26.513
5 Dave AC 2016-01-04 00:00:26.513
6 Gina AC 2016-01-04 00:00:26.513
Now what I want to return from the SQL is something like this:
Date Count
2016-01-01 1
2016-01-02 3
2016-01-03 4
2016-01-04 6
You can use COUNT() OVER () with ORDER BY and without PARTITION BY. It gives you the cumulative count. Use DISTINCT to filter out the duplicate values.
SELECT DISTINCT CAST(create_date AS DATE) AS [Date],
COUNT(create_date) OVER (ORDER BY CAST(create_date AS DATE)) as [COUNT]
FROM [YourTable]
SELECT create_date, COUNT(create_date) as [COUNT]
FROM (
SELECT CAST(create_date AS DATE) create_date
FROM [YourTable]
) T
GROUP BY create_date
Per your description, you need a continuous list of dates. Does that make sense?
This sample generates only one month of data.
CREATE TABLE #tt(ID INT, name VARCHAR(10), status VARCHAR(10), create_date DATETIME)
INSERT INTO #tt
SELECT 1,'John','AC','2016-01-01 00:00:26.513' UNION
SELECT 2,'Jane','AC','2016-01-02 00:00:26.513' UNION
SELECT 3,'Kane','AC','2016-01-02 00:00:26.513' UNION
SELECT 4,'Carl','AC','2016-01-03 00:00:26.513' UNION
SELECT 5,'Dave','AC','2016-01-04 00:00:26.513' UNION
SELECT 6,'Gina','AC','2016-01-04 00:00:26.513' UNION
SELECT 7,'Tina','AC','2016-01-08 00:00:26.513'
SELECT CONVERT(DATE,DATEADD(d,sv.number,n.FirstDate)) AS [Date],COUNT(n.num) AS [Count]
FROM master.dbo.spt_values AS sv
INNER JOIN (
SELECT MIN(t.create_date)OVER() AS FirstDate,DATEDIFF(d,MIN(t.create_date)OVER(),t.create_date) AS num FROM #tt AS t
) AS n ON n.num<=sv.number
WHERE sv.type='P' AND sv.number>=0 AND MONTH(DATEADD(d,sv.number,n.FirstDate))=MONTH(n.FirstDate)
GROUP BY CONVERT(DATE,DATEADD(d,sv.number,n.FirstDate))
Date Count
---------- -----------
2016-01-01 1
2016-01-02 3
2016-01-03 4
2016-01-04 6
2016-01-05 6
2016-01-06 6
2016-01-07 6
2016-01-08 7
2016-01-09 7
2016-01-10 7
2016-01-11 7
2016-01-12 7
2016-01-13 7
2016-01-14 7
2016-01-15 7
2016-01-16 7
2016-01-17 7
2016-01-18 7
2016-01-19 7
2016-01-20 7
2016-01-21 7
2016-01-22 7
2016-01-23 7
2016-01-24 7
2016-01-25 7
2016-01-26 7
2016-01-27 7
2016-01-28 7
2016-01-29 7
2016-01-30 7
2016-01-31 7
2017-01-01 7
2017-01-02 7
2017-01-03 7
2017-01-04 7
2017-01-05 7
2017-01-06 7
2017-01-07 7
2017-01-08 7
2017-01-09 7
2017-01-10 7
2017-01-11 7
2017-01-12 7
2017-01-13 7
2017-01-14 7
2017-01-15 7
2017-01-16 7
2017-01-17 7
2017-01-18 7
2017-01-19 7
2017-01-20 7
2017-01-21 7
2017-01-22 7
2017-01-23 7
2017-01-24 7
2017-01-25 7
2017-01-26 7
2017-01-27 7
2017-01-28 7
2017-01-29 7
2017-01-30 7
2017-01-31 7
2018-01-01 7
2018-01-02 7
2018-01-03 7
2018-01-04 7
2018-01-05 7
2018-01-06 7
2018-01-07 7
2018-01-08 7
2018-01-09 7
2018-01-10 7
2018-01-11 7
2018-01-12 7
2018-01-13 7
2018-01-14 7
2018-01-15 7
2018-01-16 7
2018-01-17 7
2018-01-18 7
2018-01-19 7
2018-01-20 7
2018-01-21 7
2018-01-22 7
2018-01-23 7
2018-01-24 7
2018-01-25 7
2018-01-26 7
2018-01-27 7
2018-01-28 7
2018-01-29 7
2018-01-30 7
2018-01-31 7
2019-01-01 7
2019-01-02 7
2019-01-03 7
2019-01-04 7
2019-01-05 7
2019-01-06 7
2019-01-07 7
2019-01-08 7
2019-01-09 7
2019-01-10 7
2019-01-11 7
2019-01-12 7
2019-01-13 7
2019-01-14 7
2019-01-15 7
2019-01-16 7
2019-01-17 7
2019-01-18 7
2019-01-19 7
2019-01-20 7
2019-01-21 7
2019-01-22 7
2019-01-23 7
2019-01-24 7
2019-01-25 7
2019-01-26 7
2019-01-27 7
2019-01-28 7
2019-01-29 7
2019-01-30 7
2019-01-31 7
2020-01-01 7
2020-01-02 7
2020-01-03 7
2020-01-04 7
2020-01-05 7
2020-01-06 7
2020-01-07 7
2020-01-08 7
2020-01-09 7
2020-01-10 7
2020-01-11 7
2020-01-12 7
2020-01-13 7
2020-01-14 7
2020-01-15 7
2020-01-16 7
2020-01-17 7
2020-01-18 7
2020-01-19 7
2020-01-20 7
2020-01-21 7
2020-01-22 7
2020-01-23 7
2020-01-24 7
2020-01-25 7
2020-01-26 7
2020-01-27 7
2020-01-28 7
2020-01-29 7
2020-01-30 7
2020-01-31 7
2021-01-01 7
2021-01-02 7
2021-01-03 7
2021-01-04 7
2021-01-05 7
2021-01-06 7
2021-01-07 7
2021-01-08 7
2021-01-09 7
2021-01-10 7
2021-01-11 7
2021-01-12 7
2021-01-13 7
2021-01-14 7
2021-01-15 7
2021-01-16 7
2021-01-17 7
2021-01-18 7
2021-01-19 7
2021-01-20 7
2021-01-21 7
2021-01-22 7
2021-01-23 7
2021-01-24 7
2021-01-25 7
2021-01-26 7
2021-01-27 7
2021-01-28 7
2021-01-29 7
2021-01-30 7
2021-01-31 7
select r.date,count(r.date) count
from (
select id,name,substring(convert(nvarchar(50),create_date,120),1,10) date
from tblName
) r
group by r.date
In the subquery of this code,
I take the first 10 characters of the date after converting it from datetime to nvarchar, which gives a value like '2016-01-01'. (This is not strictly necessary, but I prefer it because it makes the code more readable.)
Then a simple GROUP BY gives each date and its count.