How to get an incremental count by date - SQL

I am trying to get a running (cumulative) count of rows by date.
My table looks like this:
ID name status create_date
1 John AC 2016-01-01 00:00:26.513
2 Jane AC 2016-01-02 00:00:26.513
3 Kane AC 2016-01-02 00:00:26.513
4 Carl AC 2016-01-03 00:00:26.513
5 Dave AC 2016-01-04 00:00:26.513
6 Gina AC 2016-01-04 00:00:26.513
Now what I want to return from the SQL is something like this:
Date Count
2016-01-01 1
2016-01-02 3
2016-01-03 4
2016-01-04 6

You can make use of COUNT() OVER () with an ORDER BY and no PARTITION BY; it gives you the running count. Use DISTINCT to filter out the duplicate rows per date.
SELECT DISTINCT CAST(create_date AS DATE) AS [Date],
       COUNT(create_date) OVER (ORDER BY CAST(create_date AS DATE)) AS [COUNT]
FROM [YourTable]

If you only need the number of rows per day rather than the running total, a plain GROUP BY is enough:
SELECT create_date, COUNT(create_date) AS [COUNT]
FROM (
    SELECT CAST(create_date AS DATE) AS create_date
    FROM [YourTable]
) T
GROUP BY create_date
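If you want the GROUP BY form but still need the running total from the question, one option (a sketch, assuming SQL Server 2012 or later and the same [YourTable]) is to wrap the per-day count in a windowed SUM:
SELECT CAST(create_date AS DATE) AS [Date],
       SUM(COUNT(*)) OVER (ORDER BY CAST(create_date AS DATE)) AS [COUNT]
FROM [YourTable]
GROUP BY CAST(create_date AS DATE)
This avoids the DISTINCT of the first query because the GROUP BY already collapses each date to one row.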

Per your description, you need a continuous list of dates. Does that make sense?
This sample only generates one month of data.
CREATE TABLE #tt(ID INT, name VARCHAR(10), status VARCHAR(10), create_date DATETIME)
INSERT INTO #tt
SELECT 1,'John','AC','2016-01-01 00:00:26.513' UNION
SELECT 2,'Jane','AC','2016-01-02 00:00:26.513' UNION
SELECT 3,'Kane','AC','2016-01-02 00:00:26.513' UNION
SELECT 4,'Carl','AC','2016-01-03 00:00:26.513' UNION
SELECT 5,'Dave','AC','2016-01-04 00:00:26.513' UNION
SELECT 6,'Gina','AC','2016-01-04 00:00:26.513' UNION
SELECT 7,'Tina','AC','2016-01-08 00:00:26.513'
SELECT * FROM #tt
SELECT CONVERT(DATE, DATEADD(d, sv.number, n.FirstDate)) AS [Date],
       COUNT(n.num) AS [Count]
FROM master.dbo.spt_values AS sv
LEFT JOIN (
    SELECT MIN(t.create_date) OVER () AS FirstDate,
           DATEDIFF(d, MIN(t.create_date) OVER (), t.create_date) AS num
    FROM #tt AS t
) AS n ON n.num <= sv.number
WHERE sv.type = 'P'
  AND sv.number >= 0
  AND MONTH(DATEADD(d, sv.number, n.FirstDate)) = MONTH(n.FirstDate)
  AND YEAR(DATEADD(d, sv.number, n.FirstDate)) = YEAR(n.FirstDate)
GROUP BY CONVERT(DATE, DATEADD(d, sv.number, n.FirstDate))
Date Count
---------- -----------
2016-01-01 1
2016-01-02 3
2016-01-03 4
2016-01-04 6
2016-01-05 6
2016-01-06 6
2016-01-07 6
2016-01-08 7
2016-01-09 7
2016-01-10 7
2016-01-11 7
2016-01-12 7
2016-01-13 7
2016-01-14 7
2016-01-15 7
2016-01-16 7
2016-01-17 7
2016-01-18 7
2016-01-19 7
2016-01-20 7
2016-01-21 7
2016-01-22 7
2016-01-23 7
2016-01-24 7
2016-01-25 7
2016-01-26 7
2016-01-27 7
2016-01-28 7
2016-01-29 7
2016-01-30 7
2016-01-31 7
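If you would rather not rely on master.dbo.spt_values, a recursive CTE can generate the continuous date list instead. A sketch against the same #tt sample data; note it only runs from the first to the last create_date rather than padding out the whole month, and OPTION (MAXRECURSION 0) is only needed for ranges longer than 100 days:
WITH d AS (
    SELECT CAST(MIN(create_date) AS DATE) AS [Date],
           CAST(MAX(create_date) AS DATE) AS LastDate
    FROM #tt
    UNION ALL
    SELECT DATEADD(DAY, 1, [Date]), LastDate
    FROM d
    WHERE [Date] < LastDate
)
SELECT d.[Date],
       COUNT(t.ID) AS [Count]   -- running count: all rows created on or before d.[Date]
FROM d
LEFT JOIN #tt AS t ON CAST(t.create_date AS DATE) <= d.[Date]
GROUP BY d.[Date]
ORDER BY d.[Date]
OPTION (MAXRECURSION 0)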

select r.date, count(r.date) as [count]
from
(
    select id, name, substring(convert(nvarchar(50), create_date), 1, 10) as date
    from tblName
) r
group by r.date
In the subquery, I take the first 10 characters of create_date after converting it from DATETIME to NVARCHAR, which gives a string like '2016-01-01'. (The conversion is not strictly necessary, but I prefer it because it makes the code more readable.)
Then a simple GROUP BY gives each date and its count.
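As a side note, if the database supports it (SQL Server does), casting to DATE is a more direct way to drop the time portion than converting to NVARCHAR and taking a substring. A sketch against the same tblName:
select cast(create_date as date) [date], count(*) [count]
from tblName
group by cast(create_date as date)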

Related

Pandas: create a period based on date column

I have a dataframe
ID datetime
11 01-09-2021 10:00:00
11 01-09-2021 10:15:15
11 01-09-2021 15:00:00
12 01-09-2021 15:10:00
11 01-09-2021 18:00:00
I need to add a period column based on datetime alone, starting a new period whenever the gap exceeds 2 hours:
ID datetime period
11 01-09-2021 10:00:00 1
11 01-09-2021 10:15:15 1
11 01-09-2021 15:00:00 2
12 01-09-2021 15:10:00 2
11 01-09-2021 18:00:00 3
And the same thing but based on ID and datetime
ID datetime period
11 01-09-2021 10:00:00 1
11 01-09-2021 10:15:15 1
11 01-09-2021 15:00:00 2
12 01-09-2021 15:10:00 1
11 01-09-2021 18:00:00 3
How can I do that?
You can get the difference with Series.diff, convert it to hours via Series.dt.total_seconds, compare against 2 hours, and add a cumulative sum:
df['period'] = df['datetime'].diff().dt.total_seconds().div(3600).gt(2).cumsum().add(1)
print (df)
ID datetime period
0 11 2021-01-09 10:00:00 1
1 11 2021-01-09 10:15:15 1
2 11 2021-01-09 15:00:00 2
3 12 2021-01-09 15:10:00 2
4 11 2021-01-09 18:00:00 3
Similar idea, but per group:
f = lambda x: x.diff().dt.total_seconds().div(3600).gt(2).cumsum().add(1)
df['period'] = df.groupby('ID')['datetime'].transform(f)
print (df)
ID datetime period
0 11 2021-01-09 10:00:00 1
1 11 2021-01-09 10:15:15 1
2 11 2021-01-09 15:00:00 2
3 12 2021-01-09 15:10:00 1
4 11 2021-01-09 18:00:00 3
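For completeness, here is a minimal, self-contained sketch that reproduces the frame above; the explicit month-first format string is an assumption that matches the printed output (use '%d-%m-%Y %H:%M:%S' instead if the dates are day-first):
import pandas as pd

df = pd.DataFrame({
    'ID': [11, 11, 11, 12, 11],
    'datetime': pd.to_datetime([
        '01-09-2021 10:00:00', '01-09-2021 10:15:15',
        '01-09-2021 15:00:00', '01-09-2021 15:10:00',
        '01-09-2021 18:00:00',
    ], format='%m-%d-%Y %H:%M:%S'),
})

# New period whenever the gap to the previous row exceeds 2 hours
df['period'] = df['datetime'].diff().dt.total_seconds().div(3600).gt(2).cumsum().add(1)
print(df)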

Create a new column if time frequency is more than a certain value

I have a dataframe that increases at a 15-minute frequency but sometimes has gaps, and I want to group the contiguous runs together as tasks.
Here's some sample data:
Input:
data = {'Date':['2019-01-05 00:00:00', '2019-01-05 00:15:00',
'2019-01-05 00:30:00', '2019-01-05 00:45:00',
'2019-01-05 01:00:00', '2019-01-05 01:15:00',
'2019-01-05 01:30:00', '2019-01-05 01:45:00',
'2019-01-06 15:00:00', '2019-01-06 15:15:00',
'2020-01-06 15:30:00', '2020-01-06 15:45:00',
'2020-02-10 22:15:00', '2020-02-10 22:30:00',
'2020-02-10 22:45:00', '2020-02-10 23:00:00',
'2020-02-11 23:15:00', '2020-02-11 23:30:00',
'2020-02-11 23:45:00', '2020-02-11 00:00:00'],
'Ratings':[9.0, 8.0, 5.0, 3.0, 5.0,
1.0, 5.2, 4.5, 8.9, 4.5,
4.5, 7.6, 8.3, 5.6, 5.3,
3.4, 5.5, 2.4, 5.3, 5.4]}
df = pd.DataFrame(data, index =[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20])
# print the data
print(df)
Output:
Date Ratings
1 2019-01-05 00:00:00 9.0
2 2019-01-05 00:15:00 8.0
3 2019-01-05 00:30:00 5.0
4 2019-01-05 00:45:00 3.0
5 2019-01-05 01:00:00 5.0
6 2019-01-05 01:15:00 1.0
7 2019-01-05 01:30:00 5.2
8 2019-01-05 01:45:00 4.5
9 2019-01-06 15:00:00 8.9
10 2019-01-06 15:15:00 4.5
11 2020-01-06 15:30:00 4.5
12 2020-01-06 15:45:00 7.6
13 2020-02-10 22:15:00 8.3
14 2020-02-10 22:30:00 5.6
15 2020-02-10 22:45:00 5.3
16 2020-02-10 23:00:00 3.4
17 2020-02-11 23:15:00 5.5
18 2020-02-11 23:30:00 2.4
19 2020-02-11 23:45:00 5.3
20 2020-02-11 00:00:00 5.4
I need a new column that starts a new group whenever the gap between rows exceeds the expected frequency, which in this case is 15 minutes.
Desired:
Date Ratings Task
1 2019-01-05 00:00:00 9.0 1
2 2019-01-05 00:15:00 8.0 1
3 2019-01-05 00:30:00 5.0 1
4 2019-01-05 00:45:00 3.0 1
5 2019-01-05 01:00:00 5.0 1
6 2019-01-05 01:15:00 1.0 1
7 2019-01-05 01:30:00 5.2 1
8 2019-01-05 01:45:00 4.5 1
9 2019-01-06 15:00:00 8.9 2
10 2019-01-06 15:15:00 4.5 2
11 2019-01-06 15:30:00 4.5 2
12 2019-01-06 15:45:00 7.6 2
13 2019-02-10 22:15:00 8.3 3
14 2019-02-10 22:30:00 5.6 3
15 2019-02-10 22:45:00 5.3 3
16 2019-02-10 23:00:00 3.4 3
17 2019-02-11 00:00:00 5.5 4
18 2019-02-11 00:15:00 2.4 4
19 2019-02-11 00:30:00 5.3 4
20 2019-02-11 00:45:00 5.4 4
As you can see, there are 4 tasks, with a new task starting whenever there is a jump in time of more than 15 minutes.
The Date column is currently in datetime64 format, and I can set it to any format required. Thank you!
Try:
df['Task'] = df['Date'].sub(df['Date'].shift()) \
                       .gt(pd.Timedelta(minutes=15)) \
                       .cumsum() + 1
>>> df
Date Ratings Task
1 2019-01-05 00:00:00 9.0 1
2 2019-01-05 00:15:00 8.0 1
3 2019-01-05 00:30:00 5.0 1
4 2019-01-05 00:45:00 3.0 1
5 2019-01-05 01:00:00 5.0 1
6 2019-01-05 01:15:00 1.0 1
7 2019-01-05 01:30:00 5.2 1
8 2019-01-05 01:45:00 4.5 1
9 2019-01-06 15:00:00 8.9 2
10 2019-01-06 15:15:00 4.5 2
11 2020-01-06 15:30:00 4.5 3
12 2020-01-06 15:45:00 7.6 3
13 2020-02-10 22:15:00 8.3 4
14 2020-02-10 22:30:00 5.6 4
15 2020-02-10 22:45:00 5.3 4
16 2020-02-10 23:00:00 3.4 4
17 2020-02-11 23:15:00 5.5 5
18 2020-02-11 23:30:00 2.4 5
19 2020-02-11 23:45:00 5.3 5
20 2020-02-11 00:00:00 5.4 5
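As a small aside, Series.diff is shorthand for sub(shift()), so the Task column can be written slightly more compactly. A sketch, assuming the same df; if Date is still a string column it needs converting first:
df['Date'] = pd.to_datetime(df['Date'])  # only needed if Date is not datetime64 yet
df['Task'] = df['Date'].diff().gt(pd.Timedelta(minutes=15)).cumsum() + 1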

Daily calculations in intraday data

Let's say I have a DataFrame with date_time index:
date_time a b
2020-11-23 04:00:00 10 5
2020-11-23 05:00:00 11 5
2020-11-23 06:00:00 12 5
2020-11-24 04:30:00 13 6
2020-11-24 05:30:00 14 6
2020-11-24 06:30:00 15 6
2020-11-25 06:00:00 16 7
2020-11-25 07:00:00 17 7
2020-11-25 08:00:00 18 7
"a" column is intraday data (every row - different value). "b" column - DAILY data - same data during the current day.
I need to make some calculations with "b" (daily) column and create "c" column with the result. For example, sum for two last days.
Result:
date_time a b c
2020-11-23 04:00:00 10 5 NaN
2020-11-23 05:00:00 11 5 NaN
2020-11-23 06:00:00 12 5 NaN
2020-11-24 04:30:00 13 6 11
2020-11-24 05:30:00 14 6 11
2020-11-24 06:30:00 15 6 11
2020-11-25 06:00:00 16 7 13
2020-11-25 07:00:00 17 7 13
2020-11-25 08:00:00 18 7 13
I guess I should use something like
df['c'] = df.resample('D').b.rolling(3).sum ...
but I get NaN values in "c".
Could you help me? Thanks!
One thing you can do is to drop duplicates on the date and work on that:
# get the dates
df['date'] = df['date_time'].dt.normalize()
df['c'] = (df.drop_duplicates('date')['b']   # drop duplicates on dates
             .rolling(2).sum()               # rolling sum
          )
df['c'] = df['c'].ffill() # fill the missing data
Output:
date_time a b date c
0 2020-11-23 04:00:00 10 5 2020-11-23 NaN
1 2020-11-23 05:00:00 11 5 2020-11-23 NaN
2 2020-11-23 06:00:00 12 5 2020-11-23 NaN
3 2020-11-24 04:30:00 13 6 2020-11-24 11.0
4 2020-11-24 05:30:00 14 6 2020-11-24 11.0
5 2020-11-24 06:30:00 15 6 2020-11-24 11.0
6 2020-11-25 06:00:00 16 7 2020-11-25 13.0
7 2020-11-25 07:00:00 17 7 2020-11-25 13.0
8 2020-11-25 08:00:00 18 7 2020-11-25 13.0
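An equivalent way to express the same idea, without relying on the index alignment of drop_duplicates, is to aggregate b to one value per day, take the rolling sum there, and map it back onto the intraday rows. A sketch, assuming the same frame with date_time as a regular column:
daily = df.groupby(df['date_time'].dt.normalize())['b'].first()  # one b value per calendar day
df['c'] = df['date_time'].dt.normalize().map(daily.rolling(2).sum())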

SQL Server, inconsistency in result of DATEDIFF

I'm trying to understand why DATEDIFF does not behave consistently.
I have a table Projects with below values:
Task_ID Start_Date End_Date
--------------------------------------
1 2015-10-01 2015-10-02
24 2015-10-02 2015-10-03
2 2015-10-03 2015-10-04
23 2015-10-04 2015-10-05
3 2015-10-11 2015-10-12
22 2015-10-12 2015-10-13
4 2015-10-15 2015-10-16
21 2015-10-17 2015-10-18
5 2015-10-19 2015-10-20
20 2015-10-21 2015-10-22
6 2015-10-25 2015-10-26
19 2015-10-26 2015-10-27
7 2015-10-27 2015-10-28
18 2015-10-28 2015-10-29
8 2015-10-29 2015-10-30
17 2015-10-30 2015-10-31
9 2015-11-01 2015-11-02
16 2015-11-04 2015-11-05
10 2015-11-07 2015-11-08
15 2015-11-06 2015-11-07
11 2015-11-05 2015-11-06
14 2015-11-11 2015-11-12
12 2015-11-12 2015-11-13
13 2015-11-17 2015-11-18
When I run the below query on it;
WITH t AS
(
SELECT
Start_Date s,
End_Date e,
ROW_NUMBER() OVER(ORDER BY Start_Date) rn
FROM
Projects
GROUP BY
Start_Date, End_Date
)
SELECT
s, e, rn, DATEDIFF(day, rn, s)
FROM t
I get this output:
2015-10-01 2015-10-02 1 42275
2015-10-02 2015-10-03 2 42275
2015-10-03 2015-10-04 3 42275
2015-10-04 2015-10-05 4 42275
2015-10-11 2015-10-12 5 42281
2015-10-12 2015-10-13 6 42281
2015-10-15 2015-10-16 7 42283
2015-10-17 2015-10-18 8 42284
2015-10-19 2015-10-20 9 42285
2015-10-21 2015-10-22 10 42286
2015-10-25 2015-10-26 11 42289
2015-10-26 2015-10-27 12 42289
2015-10-27 2015-10-28 13 42289
2015-10-28 2015-10-29 14 42289
2015-10-29 2015-10-30 15 42289
2015-10-30 2015-10-31 16 42289
2015-11-01 2015-11-02 17 42290
2015-11-04 2015-11-05 18 42292
2015-11-05 2015-11-06 19 42292
2015-11-06 2015-11-07 20 42292
2015-11-07 2015-11-08 21 42292
2015-11-11 2015-11-12 22 42295
2015-11-12 2015-11-13 23 42295
2015-11-17 2015-11-18 24 42299
But when I individually execute DATEDIFF, I get different results:
select DATEDIFF(day, 1, 2015-10-01)
2003
select DATEDIFF(day, 2, 2015-10-02)
2001
Can someone please explain this to me? Am I doing something wrong with the individual select statement?
Thanks for the help.
This is what the arguments for DATEDIFF look like:
DATEDIFF ( datepart , startdate , enddate )
Judging by the parameters you passed, I assume you are trying to subtract 1 or 2 days from a date. For that you should use
DATEADD ( datepart , number , date )
Subtracting then just becomes adding a negative number, like DATEADD(day, -1, '2015-10-02').
If you really want to use DATEDIFF as intended, make sure you put single quotes around your dates and read the datepart boundaries section in the documentation, because a tiny difference across a boundary can turn into a whole year's difference in your result.
Also, when a number X is used where a date is expected, SQL Server interprets it as (1900-01-01 + X days). That is why your individual statements behave oddly: 2015-10-01 without quotes is integer arithmetic (2015 - 10 - 1 = 2004), so DATEDIFF(day, 1, 2015-10-01) really computes the days between day 1 and day 2004, which is 2003; likewise 2015-10-02 evaluates to 2003, giving 2001.
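To make that concrete, this is what the two calls look like with properly quoted dates (a quick check, assuming SQL Server):
SELECT DATEDIFF(day, '2015-10-01', '2015-10-02') AS diff_in_days   -- 1
SELECT DATEADD(day, -1, '2015-10-02') AS day_before                -- 2015-10-01 00:00:00.000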

7 days hourly mean with pandas

I need some help calculating a 7-day mean for every hour.
The time series has an hourly resolution, and I need the 7-day mean for each hour of the day, e.g. for 13:00:
date, x
2020-07-01 13:00 , 4
2020-07-01 14:00 , 3
.
.
.
2020-07-02 13:00 , 3
2020-07-02 14:00 , 7
.
.
.
I tried pandas with a rolling mean, but a plain rolling window just takes the last 7 days of hourly values rather than the same hour across 7 days.
Thanks for any hints!
Add an hour column, group by it, and then take a rolling mean over 7 rows within each group; since each group holds one row per day for that hour, this is the 7-day average per hour, which matches the intent of the question.
df['hour'] = df.index.hour
df = df.groupby(df.hour)['x'].rolling(7).mean().reset_index()
df.head(35)
hour level_1 x
0 0 2020-07-01 00:00:00 NaN
1 0 2020-07-02 00:00:00 NaN
2 0 2020-07-03 00:00:00 NaN
3 0 2020-07-04 00:00:00 NaN
4 0 2020-07-05 00:00:00 NaN
5 0 2020-07-06 00:00:00 NaN
6 0 2020-07-07 00:00:00 48.142857
7 0 2020-07-08 00:00:00 50.285714
8 0 2020-07-09 00:00:00 60.000000
9 0 2020-07-10 00:00:00 63.142857
10 1 2020-07-01 01:00:00 NaN
11 1 2020-07-02 01:00:00 NaN
12 1 2020-07-03 01:00:00 NaN
13 1 2020-07-04 01:00:00 NaN
14 1 2020-07-05 01:00:00 NaN
15 1 2020-07-06 01:00:00 NaN
16 1 2020-07-07 01:00:00 52.571429
17 1 2020-07-08 01:00:00 48.428571
18 1 2020-07-09 01:00:00 38.000000
19 2 2020-07-01 02:00:00 NaN
20 2 2020-07-02 02:00:00 NaN
21 2 2020-07-03 02:00:00 NaN
22 2 2020-07-04 02:00:00 NaN
23 2 2020-07-05 02:00:00 NaN
24 2 2020-07-06 02:00:00 NaN
25 2 2020-07-07 02:00:00 46.571429
26 2 2020-07-08 02:00:00 47.714286
27 2 2020-07-09 02:00:00 42.714286
28 3 2020-07-01 03:00:00 NaN
29 3 2020-07-02 03:00:00 NaN
30 3 2020-07-03 03:00:00 NaN
31 3 2020-07-04 03:00:00 NaN
32 3 2020-07-05 03:00:00 NaN
33 3 2020-07-06 03:00:00 NaN
34 3 2020-07-07 03:00:00 72.571429
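If you would rather keep the original DatetimeIndex instead of the reset hour/level_1 layout, groupby(...).transform computes the same per-hour-of-day rolling mean in place. A sketch, assuming the original hourly-indexed df with column x:
# 7-day mean for each hour of the day, written back onto the original index
df['x_7d_mean'] = df.groupby(df.index.hour)['x'].transform(lambda s: s.rolling(7).mean())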