I have a dataframe
ID datetime
11 01-09-2021 10:00:00
11 01-09-2021 10:15:15
11 01-09-2021 15:00:00
12 01-09-2021 15:10:00
11 01-09-2021 18:00:00
I need to add a period column based just on the datetime, starting a new period whenever the gap from the previous row exceeds 2 hours:
ID datetime period
11 01-09-2021 10:00:00 1
11 01-09-2021 10:15:15 1
11 01-09-2021 15:00:00 2
12 01-09-2021 15:10:00 2
11 01-09-2021 18:00:00 3
And the same thing, but with the gap measured within each ID:
ID datetime period
11 01-09-2021 10:00:00 1
11 01-09-2021 10:15:15 1
11 01-09-2021 15:00:00 2
12 01-09-2021 15:10:00 1
11 01-09-2021 18:00:00 3
How can I do that?
You can get the difference with Series.diff, convert it to hours via Series.dt.total_seconds divided by 3600, compare against 2, and add a cumulative sum:
# flag gaps of more than 2 hours, count them cumulatively, start numbering at 1
df['period'] = df['datetime'].diff().dt.total_seconds().div(3600).gt(2).cumsum().add(1)
print(df)
ID datetime period
0 11 2021-01-09 10:00:00 1
1 11 2021-01-09 10:15:15 1
2 11 2021-01-09 15:00:00 2
3 12 2021-01-09 15:10:00 2
4 11 2021-01-09 18:00:00 3
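This assumes the datetime column is already datetime64. If it is still stored as strings (as in the sample), convert it first; whether you need dayfirst=True depends on how the dates are written:
df['datetime'] = pd.to_datetime(df['datetime'])  # add dayfirst=True for day-first strings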
The same idea, applied per group:
# same logic, computed separately within each ID group
f = lambda x: x.diff().dt.total_seconds().div(3600).gt(2).cumsum().add(1)
df['period'] = df.groupby('ID')['datetime'].transform(f)
print(df)
ID datetime period
0 11 2021-01-09 10:00:00 1
1 11 2021-01-09 10:15:15 1
2 11 2021-01-09 15:00:00 2
3 12 2021-01-09 15:10:00 1
4 11 2021-01-09 18:00:00 3
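One caveat: groupby keeps the rows in their original order, so this assumes each ID's rows already appear chronologically. If they might not, sort before computing the diffs:
df = df.sort_values(['ID', 'datetime'])  # order each ID's rows by time first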
I have a dataframe that increases at a 15-minute frequency, but sometimes has a gap, and I want to group the rows together by task.
Here's some sample data:
Input:
import pandas as pd

data = {'Date':['2019-01-05 00:00:00', '2019-01-05 00:15:00',
'2019-01-05 00:30:00', '2019-01-05 00:45:00',
'2019-01-05 01:00:00', '2019-01-05 01:15:00',
'2019-01-05 01:30:00', '2019-01-05 01:45:00',
'2019-01-06 15:00:00', '2019-01-06 15:15:00',
'2020-01-06 15:30:00', '2020-01-06 15:45:00',
'2020-02-10 22:15:00', '2020-02-10 22:30:00',
'2020-02-10 22:45:00', '2020-02-10 23:00:00',
'2020-02-11 23:15:00', '2020-02-11 23:30:00',
'2020-02-11 23:45:00', '2020-02-11 00:00:00'],
'Ratings':[9.0, 8.0, 5.0, 3.0, 5.0,
1.0, 5.2, 4.5, 8.9, 4.5,
4.5, 7.6, 8.3, 5.6, 5.3,
3.4, 5.5, 2.4, 5.3, 5.4]}
df = pd.DataFrame(data, index=range(1, 21))
# print the data
print(df)
Output:
Date Ratings
1 2019-01-05 00:00:00 9.0
2 2019-01-05 00:15:00 8.0
3 2019-01-05 00:30:00 5.0
4 2019-01-05 00:45:00 3.0
5 2019-01-05 01:00:00 5.0
6 2019-01-05 01:15:00 1.0
7 2019-01-05 01:30:00 5.2
8 2019-01-05 01:45:00 4.5
9 2019-01-06 15:00:00 8.9
10 2019-01-06 15:15:00 4.5
11 2020-01-06 15:30:00 4.5
12 2020-01-06 15:45:00 7.6
13 2020-02-10 22:15:00 8.3
14 2020-02-10 22:30:00 5.6
15 2020-02-10 22:45:00 5.3
16 2020-02-10 23:00:00 3.4
17 2020-02-11 23:15:00 5.5
18 2020-02-11 23:30:00 2.4
19 2020-02-11 23:45:00 5.3
20 2020-02-11 00:00:00 5.4
I need to create a new column that increments whenever the gap between consecutive rows exceeds the expected frequency, which in this case is 15 minutes.
Desired:
Date Ratings Task
1 2019-01-05 00:00:00 9.0 1
2 2019-01-05 00:15:00 8.0 1
3 2019-01-05 00:30:00 5.0 1
4 2019-01-05 00:45:00 3.0 1
5 2019-01-05 01:00:00 5.0 1
6 2019-01-05 01:15:00 1.0 1
7 2019-01-05 01:30:00 5.2 1
8 2019-01-05 01:45:00 4.5 1
9 2019-01-06 15:00:00 8.9 2
10 2019-01-06 15:15:00 4.5 2
11 2019-01-06 15:30:00 4.5 2
12 2019-01-06 15:45:00 7.6 2
13 2019-02-10 22:15:00 8.3 3
14 2019-02-10 22:30:00 5.6 3
15 2019-02-10 22:45:00 5.3 3
16 2019-02-10 23:00:00 3.4 3
17 2019-02-11 00:00:00 5.5 4
18 2019-02-11 00:15:00 2.4 4
19 2019-02-11 00:30:00 5.3 4
20 2019-02-11 00:45:00 5.4 4
As you can see, the rows fall into 4 tasks, with a new task starting whenever there is a jump in time of more than 15 minutes.
The Date column is currently in datetime64 format, and I can set it to any format required. Thank you!
Try:
df['Task'] = (df['Date'].sub(df['Date'].shift())       # gap to the previous row
                        .gt(pd.Timedelta(minutes=15))  # True where the gap exceeds 15 minutes
                        .cumsum() + 1)                 # count the jumps, numbering from 1
>>> df
Date Ratings Task
1 2019-01-05 00:00:00 9.0 1
2 2019-01-05 00:15:00 8.0 1
3 2019-01-05 00:30:00 5.0 1
4 2019-01-05 00:45:00 3.0 1
5 2019-01-05 01:00:00 5.0 1
6 2019-01-05 01:15:00 1.0 1
7 2019-01-05 01:30:00 5.2 1
8 2019-01-05 01:45:00 4.5 1
9 2019-01-06 15:00:00 8.9 2
10 2019-01-06 15:15:00 4.5 2
11 2020-01-06 15:30:00 4.5 3
12 2020-01-06 15:45:00 7.6 3
13 2020-02-10 22:15:00 8.3 4
14 2020-02-10 22:30:00 5.6 4
15 2020-02-10 22:45:00 5.3 4
16 2020-02-10 23:00:00 3.4 4
17 2020-02-11 23:15:00 5.5 5
18 2020-02-11 23:30:00 2.4 5
19 2020-02-11 23:45:00 5.3 5
20 2020-02-11 00:00:00 5.4 5
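Since Series.diff is just shorthand for subtracting a shifted copy, an equivalent one-liner (assuming Date is already datetime64, as stated in the question) is:
df['Task'] = df['Date'].diff().gt(pd.Timedelta(minutes=15)).cumsum() + 1  # same logic as sub/shift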
Let's say I have a DataFrame with date_time index:
date_time a b
2020-11-23 04:00:00 10 5
2020-11-23 05:00:00 11 5
2020-11-23 06:00:00 12 5
2020-11-24 04:30:00 13 6
2020-11-24 05:30:00 14 6
2020-11-24 06:30:00 15 6
2020-11-25 06:00:00 16 7
2020-11-25 07:00:00 17 7
2020-11-25 08:00:00 18 7
"a" column is intraday data (every row - different value). "b" column - DAILY data - same data during the current day.
I need to make some calculations with "b" (daily) column and create "c" column with the result. For example, sum for two last days.
Result:
date_time a b c
2020-11-23 04:00:00 10 5 NaN
2020-11-23 05:00:00 11 5 NaN
2020-11-23 06:00:00 12 5 NaN
2020-11-24 04:30:00 13 6 11
2020-11-24 05:30:00 14 6 11
2020-11-24 06:30:00 15 6 11
2020-11-25 06:00:00 16 7 13
2020-11-25 07:00:00 17 7 13
2020-11-25 08:00:00 18 7 13
I guess I should use something like
df['c'] = df.resample('D').b.rolling(3).sum ...
but I get NaN values in "c".
Could you help me? Thanks!
One thing you can do is to drop duplicates on the date and work on that:
# get the dates (timestamps normalized to midnight)
df['date'] = df['date_time'].dt.normalize()
df['c'] = (df.drop_duplicates('date')['b']  # keep one row per date
             .rolling(2).sum()              # sum of the last two days
          )
df['c'] = df['c'].ffill()  # fill each daily value down to its intraday rows
Output:
date_time a b date c
0 2020-11-23 04:00:00 10 5 2020-11-23 NaN
1 2020-11-23 05:00:00 11 5 2020-11-23 NaN
2 2020-11-23 06:00:00 12 5 2020-11-23 NaN
3 2020-11-24 04:30:00 13 6 2020-11-24 11.0
4 2020-11-24 05:30:00 14 6 2020-11-24 11.0
5 2020-11-24 06:30:00 15 6 2020-11-24 11.0
6 2020-11-25 06:00:00 16 7 2020-11-25 13.0
7 2020-11-25 07:00:00 17 7 2020-11-25 13.0
8 2020-11-25 08:00:00 18 7 2020-11-25 13.0
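The ffill works because drop_duplicates keeps the original index labels, so each rolling result lands on the first row of its day and fills forward from there. An alternative that does not depend on row order is to map the daily result back by date; a sketch using the same columns:
daily = df.drop_duplicates('date').set_index('date')['b'].rolling(2).sum()  # one value per day
df['c'] = df['date'].map(daily)  # broadcast back to every intraday row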
I'm trying to understand why the DATEDIFF does not work consistently.
I have a table Projects with below values:
Task_ID Start_Date End_Date
--------------------------------------
1 2015-10-01 2015-10-02
24 2015-10-02 2015-10-03
2 2015-10-03 2015-10-04
23 2015-10-04 2015-10-05
3 2015-10-11 2015-10-12
22 2015-10-12 2015-10-13
4 2015-10-15 2015-10-16
21 2015-10-17 2015-10-18
5 2015-10-19 2015-10-20
20 2015-10-21 2015-10-22
6 2015-10-25 2015-10-26
19 2015-10-26 2015-10-27
7 2015-10-27 2015-10-28
18 2015-10-28 2015-10-29
8 2015-10-29 2015-10-30
17 2015-10-30 2015-10-31
9 2015-11-01 2015-11-02
16 2015-11-04 2015-11-05
10 2015-11-07 2015-11-08
15 2015-11-06 2015-11-07
11 2015-11-05 2015-11-06
14 2015-11-11 2015-11-12
12 2015-11-12 2015-11-13
13 2015-11-17 2015-11-18
When I run the below query on it:
WITH t AS
(
SELECT
Start_Date s,
End_Date e,
ROW_NUMBER() OVER(ORDER BY Start_Date) rn
FROM
Projects
GROUP BY
Start_Date, End_Date
)
SELECT
s, e, rn, DATEDIFF(day, rn, s)
FROM t
I get this output:
2015-10-01 2015-10-02 1 42275
2015-10-02 2015-10-03 2 42275
2015-10-03 2015-10-04 3 42275
2015-10-04 2015-10-05 4 42275
2015-10-11 2015-10-12 5 42281
2015-10-12 2015-10-13 6 42281
2015-10-15 2015-10-16 7 42283
2015-10-17 2015-10-18 8 42284
2015-10-19 2015-10-20 9 42285
2015-10-21 2015-10-22 10 42286
2015-10-25 2015-10-26 11 42289
2015-10-26 2015-10-27 12 42289
2015-10-27 2015-10-28 13 42289
2015-10-28 2015-10-29 14 42289
2015-10-29 2015-10-30 15 42289
2015-10-30 2015-10-31 16 42289
2015-11-01 2015-11-02 17 42290
2015-11-04 2015-11-05 18 42292
2015-11-05 2015-11-06 19 42292
2015-11-06 2015-11-07 20 42292
2015-11-07 2015-11-08 21 42292
2015-11-11 2015-11-12 22 42295
2015-11-12 2015-11-13 23 42295
2015-11-17 2015-11-18 24 42299
But when I individually execute DATEDIFF, I get different results:
select DATEDIFF(day, 1, 2015-10-01)
2003
select DATEDIFF(day, 2, 2015-10-02)
2001
Can someone please explain this to me? Am I doing something wrong with the individual select statement?
Thanks for the help.
This is what the arguments for datediff look like.
DATEDIFF ( datepart , startdate , enddate )
Judging by the parameters you passed, I assume you are trying to subtract 1 or 2 days from a date. For that you should use
DATEADD ( datepart , number , date )
Subtracting then just becomes adding with a minus, like DATEADD(day, -1, '2015-10-02').
If you really wanted to use the DATEDIFF function as intended, make sure you use single quotes around your dates, and read the datepart boundaries section of the documentation, because a nanosecond difference at a boundary can turn into a year difference in your result.
Also, when a number X is used as a date, SQL Server interprets it as 1900-01-01 plus X days. That explains your individual selects: without quotes, 2015-10-01 is evaluated as the integer expression 2015 - 10 - 1 = 2004, so DATEDIFF(day, 1, 2015-10-01) is really DATEDIFF(day, 1, 2004), the number of days between day 1 and day 2004, which is 2003. Likewise 2015-10-02 evaluates to 2003, giving DATEDIFF(day, 2, 2003) = 2001.
I need some help calculating a 7-day mean for every hour of the day.
The timeseries has an hourly resolution, and I need the 7-day mean for each hour, e.g. for 13:00:
date, x
2020-07-01 13:00 , 4
2020-07-01 14:00 , 3
.
.
.
2020-07-02 13:00 , 3
2020-07-02 14:00 , 7
.
.
.
I tried it with pandas and a rolling mean, but a plain rolling window includes everything from the last 7 days rather than just the matching hour.
Thanks for any hints!
Add a new hour column, group by it, and then take a rolling mean over 7 rows within each hour group. That gives the average over the last 7 days for each hour, which matches the intent of the question.
df['hour'] = df.index.hour
# rolling(7) within each hour group = mean over the last 7 days at that hour
df = df.groupby(df.hour)['x'].rolling(7).mean().reset_index()
df.head(35)
hour level_1 x
0 0 2020-07-01 00:00:00 NaN
1 0 2020-07-02 00:00:00 NaN
2 0 2020-07-03 00:00:00 NaN
3 0 2020-07-04 00:00:00 NaN
4 0 2020-07-05 00:00:00 NaN
5 0 2020-07-06 00:00:00 NaN
6 0 2020-07-07 00:00:00 48.142857
7 0 2020-07-08 00:00:00 50.285714
8 0 2020-07-09 00:00:00 60.000000
9 0 2020-07-10 00:00:00 63.142857
10 1 2020-07-01 01:00:00 NaN
11 1 2020-07-02 01:00:00 NaN
12 1 2020-07-03 01:00:00 NaN
13 1 2020-07-04 01:00:00 NaN
14 1 2020-07-05 01:00:00 NaN
15 1 2020-07-06 01:00:00 NaN
16 1 2020-07-07 01:00:00 52.571429
17 1 2020-07-08 01:00:00 48.428571
18 1 2020-07-09 01:00:00 38.000000
19 2 2020-07-01 02:00:00 NaN
20 2 2020-07-02 02:00:00 NaN
21 2 2020-07-03 02:00:00 NaN
22 2 2020-07-04 02:00:00 NaN
23 2 2020-07-05 02:00:00 NaN
24 2 2020-07-06 02:00:00 NaN
25 2 2020-07-07 02:00:00 46.571429
26 2 2020-07-08 02:00:00 47.714286
27 2 2020-07-09 02:00:00 42.714286
28 3 2020-07-01 03:00:00 NaN
29 3 2020-07-02 03:00:00 NaN
30 3 2020-07-03 03:00:00 NaN
31 3 2020-07-04 03:00:00 NaN
32 3 2020-07-05 03:00:00 NaN
33 3 2020-07-06 03:00:00 NaN
34 3 2020-07-07 03:00:00 72.571429
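If you would rather keep the result aligned with the original hourly DatetimeIndex instead of the reset MultiIndex above, a transform does the same per-hour rolling mean in place; a sketch against the original frame (the x_7d_mean column name is just illustrative):
df['x_7d_mean'] = df.groupby(df.index.hour)['x'].transform(lambda s: s.rolling(7).mean())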