Calculating moving average over irregular data - sql

I am trying to calculate a moving average of several fields in a SQL Server database that involves irregularly spaced values over time. I realize that for regularly spaced data I can use something like SELECT grp, AVG(count) OVER (PARTITION BY grp ... ROWS 7 PRECEDING) FROM t to create a moving average of the prior week's data. However, I have data organized as follows:
DATE GRP COUNT
2018-07-05 1 10
2018-07-08 1 4
2018-07-11 1 6
2018-07-12 1 6
2018-07-11 2 5
2018-07-15 2 10
2018-07-17 2 8
2018-07-20 2 10
...
For most groups, some dates have no observations. The output I'm looking for is:
DATE GRP MOVING_AVG
2018-07-05 1 10
2018-07-08 1 7
2018-07-11 1 6.67
2018-07-13 1 5.33
2018-07-11 2 5
2018-07-15 2 7.5
2018-07-16 2 7.67
2018-07-20 2 9.33
Is there a way of specifying dates instead of rows in the PRECEDING clause, or do I have to create some sort of mask to average over?
EDITED FOR CLARIFICATION BASED ON COMMENTS

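In SQL Server, I think this is more simply done with a correlated subquery:
select
    t.date,
    t.grp,
    (
        -- if count is an integer column, use avg(1.0 * count) to avoid integer truncation
        select avg(t1.count)
        from mytable t1
        where
            t1.grp = t.grp
            and t1.date >= dateadd(day, -7, t.date) -- look back one week; adjust as needed
            and t1.date <= t.date
    ) as cnt
from mytable t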

If I understand correctly, you want each average to cover the rows from the 7 (or however many) days before a date, rather than a fixed number of preceding rows.
DATE GRP COUNT
2018-07-11 2 5
2018-07-15 2 10
2018-07-17 2 8
2018-07-20 2 10 <--- the average for this row must cover only the preceding 7 days, so 2018-07-11 is not included
In that case:
select
date,
grp,
(
select avg(count)
from t t1
where
t1.grp = t.grp
and DATEDIFF(day, t1.date, t.date) <= 7 /*7 or whatever day you want*/
and t1.date <= t.date
) as MOVING_AVG
from t
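As a cross-check on where the window boundary falls, here is a small Python sketch (not T-SQL) that applies the same date-based window to the sample rows from the question. The function name `moving_avg` and the `max_days` parameter are made up for illustration. The sketch uses the input dates as given; the expected output in the question lists 2018-07-13 and 2018-07-16 where the input has 2018-07-12 and 2018-07-17. Under that assumption, a look-back of 6 days reproduces every expected value, while `<= 7` also pulls the 2018-07-05 row into the 2018-07-12 average and gives 6.5 instead of 5.33.

```python
from datetime import date

# Sample rows from the question: (date, grp, count)
rows = [
    (date(2018, 7, 5), 1, 10),
    (date(2018, 7, 8), 1, 4),
    (date(2018, 7, 11), 1, 6),
    (date(2018, 7, 12), 1, 6),
    (date(2018, 7, 11), 2, 5),
    (date(2018, 7, 15), 2, 10),
    (date(2018, 7, 17), 2, 8),
    (date(2018, 7, 20), 2, 10),
]

def moving_avg(rows, max_days):
    """Average count over same-group rows dated 0..max_days days before each row."""
    out = []
    for d, g, _ in rows:
        window = [c for d1, g1, c in rows
                  if g1 == g and 0 <= (d - d1).days <= max_days]
        out.append((d, g, round(sum(window) / len(window), 2)))
    return out

six = moving_avg(rows, 6)    # corresponds to DATEDIFF(day, t1.date, t.date) <= 6
seven = moving_avg(rows, 7)  # corresponds to DATEDIFF(day, t1.date, t.date) <= 7

for d, g, avg in six:
    print(d, g, avg)
```

So if the 5.33 in the expected output is the target, the comparison in the query would need to be `< 7` (or `<= 6`) rather than `<= 7`.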

Related

Group items from the first time + certain time period

I want to group orders from the same customer if they happen within 10 minutes of the first order, then find the next first order and group them and so on.
Ex:
Customer group orders
6 1 3
6 2 4,5
6 3 8
7 1 9,10
7 2 11,12
7 3 13
id customer time
3 6 2021-05-12 12:14:22.000000
4 6 2021-05-12 12:24:24.000000
5 6 2021-05-12 12:29:16.000000
8 6 2021-05-12 13:01:40.000000
9 7 2021-05-14 12:13:11.000000
10 7 2021-05-14 12:20:01.000000
11 7 2021-05-14 12:45:00.000000
12 7 2021-05-14 12:48:41.000000
13 7 2021-05-14 12:58:16.000000
18 9 2021-05-18 12:22:13.000000
25 15 2021-05-18 13:44:02.000000
26 16 2021-05-17 09:39:02.000000
27 16 2021-05-18 19:38:43.000000
28 17 2021-05-18 15:40:02.000000
29 18 2021-05-19 15:32:53.000000
30 18 2021-05-19 15:45:56.000000
31 18 2021-05-19 16:29:09.000000
34 15 2021-05-24 15:45:14.000000
35 15 2021-05-24 15:45:14.000000
36 19 2021-05-24 17:14:53.000000
Here is what I have currently. I think the problem is that the recursive step is not restricted to the same customer, so the comparison d.StartTime > dateadd(minute, 10, c.first_time) is made against the StartTime of orders from all customers.
with
data as (select Customer,StartTime,Id, row_number() over(partition by Customer order by StartTime) rn from orders t),
cte as (
select d.*, StartTime as first_time
from data d
where rn = 1
union all
select d.*,
case when d.StartTime > dateadd(minute, 10, c.first_time)
then d.StartTime
else c.first_time
end
from cte c
inner join data d on d.rn = c.rn + 1
)
select c.*, dense_rank() over(partition by Customer order by first_time) grp
from cte c;
I have two databases (MySQL & SQL Server) having similar schema so either would work for me.
Try the following on SQL Server:
SELECT customer,
ROW_NUMBER() OVER (PARTITION BY customer ORDER BY grp) AS group_no,
STRING_AGG(id, ',') AS orders
FROM
(
SELECT id,customer, [time],
(DATEDIFF(SECOND, MIN([time]) OVER (PARTITION BY CUSTOMER), [time])/60)/10 grp
FROM orders
) T
GROUP BY customer, grp
ORDER BY customer
According to your posted requirement, you are trying to divide the period between a customer's first order and last order into time frames, each 10 minutes long.
What the query does: for each order, it takes the difference in seconds between the order time and the customer's first order time, integer-divides it by 60 to get whole minutes, and then integer-divides that by 10 to get the time frame number. For example, a difference of 599s gives 599/60 = 9 minutes, and 9/10 = 0; a difference of 620s gives 620/60 = 10 minutes, and 10/10 = 1.
Once each order has its time frame number, you can use the STRING_AGG function to get the desired output. Note that STRING_AGG is available in SQL Server 2017 (14.x) and later.
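The frame arithmetic above can be checked with a short Python sketch. The helper name `frame_number` is hypothetical; it only mirrors the two integer divisions in the query and is applied here to customer 6 from the question.

```python
from datetime import datetime, timedelta

def frame_number(first, t, minutes=10):
    """Fixed-width frame index: whole seconds -> whole minutes -> frame, all integer division."""
    seconds = int((t - first).total_seconds())
    return seconds // 60 // minutes

# Customer 6 from the question: order id -> order time
orders = {
    3: datetime(2021, 5, 12, 12, 14, 22),
    4: datetime(2021, 5, 12, 12, 24, 24),
    5: datetime(2021, 5, 12, 12, 29, 16),
    8: datetime(2021, 5, 12, 13, 1, 40),
}
first = min(orders.values())
frames = {oid: frame_number(first, t) for oid, t in orders.items()}
print(frames)  # {3: 0, 4: 1, 5: 1, 8: 4}

# Frames are anchored to the very first order only.
base = datetime(2021, 1, 1)
spaced = [frame_number(base, base + timedelta(minutes=m)) for m in (0, 12, 21)]
print(spaced)  # [0, 1, 2]
```

The second example is worth noting: orders at 0, 12 and 21 minutes land in three different frames, although the order at 21 minutes is within 10 minutes of the one at 12. If each group should restart from its own first order, fixed frames like these would not be enough, even though they happen to match the expected output for the sample data.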

Group Records based on predefined date range in SQL (Oracle)

Is it possible to group records based on a predefined date difference (e.g. 30 days) between the start_date of a row and the end_date of the previous row, for non-consecutive dates? I want to take the min(start_date) and max(end_date) of each group. I tried the lead and lag functions with partition by in Oracle but couldn't come up with a proper solution. A related but unanswered post related to my question can be found here.
E.g.
ROW_NUM PROJECT_ID START_DATE END_DATE
1 1 2016-01-14 2016-08-15
2 1 2016-08-16 2016-09-10 --- Date diff Row 1&2 = 1 Day
3 1 2016-11-15 2017-01-10 --- Date diff Row 2&3 = 66 Days
4 1 2017-01-17 2017-04-10 --- Date diff Row 3&4 = 7 Days
5 2 2018-04-28 2018-06-01 --- Other Project
6 2 2019-02-01 2019-04-05 --- Diff > 30 Days
7 2 2019-04-08 2019-07-28 --- Diff 3 Days
Expected Result:
ROW_NUM PROJECT_ID START_DATE END_DATE
1 1 2016-01-14 2016-09-10
3 1 2016-11-15 2017-04-10
5 2 2018-04-28 2018-06-01
6 2 2019-02-01 2019-07-28
Use lag() and a cumulative sum to define where the groups begin. Then aggregate:
select project_id, min(start_date), max(end_date)
from (select t.*,
sum(case when prev_end_date > start_date - interval '30' day then 0 else 1 end) over
(partition by project_id order by start_date) as grp
from (select t.*,
lag(end_date) over (partition by project_id order by start_date) as prev_end_date
from t
) t
) t
group by project_id, grp;
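To see how the lag-plus-cumulative-sum idea plays out on the sample rows, here is a minimal Python sketch of the same logic. The function name `merge_ranges` is made up for illustration, and row 4 is assumed to start on 2017-01-17, which is what the "7 Days" note in the question implies.

```python
from datetime import date, timedelta
from itertools import groupby

# (project_id, start_date, end_date); row 4's start date is an assumption (see above)
rows = [
    (1, date(2016, 1, 14), date(2016, 8, 15)),
    (1, date(2016, 8, 16), date(2016, 9, 10)),
    (1, date(2016, 11, 15), date(2017, 1, 10)),
    (1, date(2017, 1, 17), date(2017, 4, 10)),
    (2, date(2018, 4, 28), date(2018, 6, 1)),
    (2, date(2019, 2, 1), date(2019, 4, 5)),
    (2, date(2019, 4, 8), date(2019, 7, 28)),
]

def merge_ranges(rows, max_gap_days=30):
    """Merge rows of a project while the previous row ended less than max_gap_days before the start."""
    merged = []
    for pid, grp in groupby(sorted(rows), key=lambda r: r[0]):
        prev_end = None  # plays the role of lag(end_date); None for the first row of a project
        for _, start, end in grp:
            # Same test as the CASE expression: stay in the group only if
            # prev_end_date > start_date - 30 days
            if prev_end is not None and prev_end > start - timedelta(days=max_gap_days):
                p, s, e = merged[-1]
                merged[-1] = (p, s, max(e, end))
            else:
                merged.append((pid, start, end))
            prev_end = end
    return merged

result = merge_ranges(rows)
for r in result:
    print(r[0], r[1], r[2])
```

On this data the sketch yields the four ranges listed under "Expected Result".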

Estimation of Cumulative value every 3 months in SQL

I have a table like this:
ID Date Prod
1 1/1/2009 5
1 2/1/2009 5
1 3/1/2009 5
1 4/1/2009 5
1 5/1/2009 5
1 6/1/2009 5
1 7/1/2009 5
1 8/1/2009 5
1 9/1/2009 5
And I need to get the following result:
ID Date Prod CumProd
1 2009/03/01 5 15 ---Each 3 months
1 2009/06/01 5 30 ---Each 3 months
1 2009/09/01 5 45 ---Each 3 months
What could be the best approach to take in SQL?
You can try the following, using a window function:
select * from
(
select *,sum(prod) over(order by DATEPART(qq,dateval)) as cum_sum,
row_number() over(partition by DATEPART(qq,dateval) order by dateval) as rn
from t
)A where rn=1
How about just filtering on the month number?
select t.*
from (select id, date, prod, sum(prod) over (partition by id order by date) as running_prod
from t
) t
where month(date) in (3, 6, 9, 12);
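The running-sum-then-filter idea can be sanity-checked in a few lines of Python. This is only a sketch: it assumes the dates in the question are month/day/year and handles the single ID shown, so it does not reproduce the `partition by id` part.

```python
from datetime import date
from itertools import accumulate

# Rows from the question: (id, date, prod), one row per month from January to September 2009
rows = [(1, date(2009, m, 1), 5) for m in range(1, 10)]

# Running total over all rows in date order (the windowed SUM)
running = list(accumulate(p for _, _, p in rows))

# Keep only the rows that close a quarter (the WHERE month(date) IN (...) filter)
quarter_ends = [
    (i, d, p, r)
    for (i, d, p), r in zip(rows, running)
    if d.month in (3, 6, 9, 12)
]
for row in quarter_ends:
    print(*row)
```

The important detail is the order of operations: the running total is computed over every month first and the filter is applied afterwards, which is why the SQL puts the window function in a subquery.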

Using the earliest date of a partition to determine what other dates belong to that partition

Assume this is my table:
ID DATE
--------------
1 2018-11-12
2 2018-11-13
3 2018-11-14
4 2018-11-15
5 2018-11-16
6 2019-03-05
7 2019-05-07
8 2019-05-08
9 2019-05-08
I need to have partitions be determined by the first date in the partition. Where, any date that is within 2 days of the first date, belongs in the same partition.
The table would end up looking like this if each partition was ranked
PARTITION ID DATE
------------------------
1 1 2018-11-12
1 2 2018-11-13
1 3 2018-11-14
2 4 2018-11-15
2 5 2018-11-16
3 6 2019-03-05
4 7 2019-05-07
4 8 2019-05-08
4 9 2019-05-08
I've tried using datediff with lag to compare to the previous date but that would allow a partition to be inappropriately sized based on spacing, for example all of these dates would be included in the same partition:
ID DATE
--------------
1 2018-11-12
2 2018-11-14
3 2018-11-16
4 2018-11-18
5 2018-11-20
6 2018-11-22
Previous flawed attempt:
Mark when a date is more than 2 days past the previous date:
(case when datediff(day, lag(event_time, 1) over (partition by user_id, stage order by event_time), event_time) > 2 then 1 else 0 end)
You need to use a recursive CTE for this, so the operation is expensive.
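with tt as (
    -- add an incrementing column with no gaps (id breaks ties between equal dates)
    select t.*, row_number() over (order by date, id) as seqnum
    from t
),
cte as (
    -- anchor: the earliest row starts the first partition
    select id, date, date as mindate, seqnum
    from tt
    where seqnum = 1
    union all
    -- keep the partition's first date while within 2 days of it, otherwise start a new one
    select tt.id, tt.date,
           (case when tt.date <= dateadd(day, 2, cte.mindate)
                 then cte.mindate else tt.date
            end) as mindate,
           tt.seqnum
    from cte join
         tt
         on tt.seqnum = cte.seqnum + 1
)
select cte.*, dense_rank() over (order by mindate) as partition_num
from cte
option (maxrecursion 0); -- lift SQL Server's default limit of 100 recursion levels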
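The difference between comparing against the partition's first date and comparing against the previous date is easy to see in a small Python sketch. Both function names are made up for illustration; the data comes from the two tables in the question.

```python
from datetime import date

def anchor_partitions(dates, max_days=2):
    """Number partitions by distance from each partition's first date (the recursive CTE's rule)."""
    result, anchor, num = [], None, 0
    for d in sorted(dates):
        if anchor is None or (d - anchor).days > max_days:
            anchor, num = d, num + 1  # this date starts a new partition
        result.append(num)
    return result

def lag_partitions(dates, max_days=2):
    """Number partitions by distance from the previous date (the flawed lag-based rule)."""
    result, prev, num = [], None, 0
    for d in sorted(dates):
        if prev is None or (d - prev).days > max_days:
            num += 1
        result.append(num)
        prev = d
    return result

table = [
    date(2018, 11, 12), date(2018, 11, 13), date(2018, 11, 14),
    date(2018, 11, 15), date(2018, 11, 16), date(2019, 3, 5),
    date(2019, 5, 7), date(2019, 5, 8), date(2019, 5, 8),
]
chain = [date(2018, 11, day) for day in (12, 14, 16, 18, 20, 22)]

print(anchor_partitions(table))  # [1, 1, 1, 2, 2, 3, 4, 4, 4]
print(anchor_partitions(chain))  # [1, 1, 2, 2, 3, 3]
print(lag_partitions(chain))     # [1, 1, 1, 1, 1, 1]
```

Each row's partition depends on the anchor chosen for the rows before it, which is why a plain window function cannot express this and the SQL has to walk the rows one at a time.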

Subtract subsequent row from previous row based on User

I have the following data and I want to subtract the previous row's date from the current row's date, per UserID. I tried the code below, but it does not give me what I want.
CREATE TABLE #DATETBLE (UserID INT, Dates DATE)
INSERT INTO #DATETBLE VALUES
(1,'2018-01-01'), (1,'2018-01-02'), (1,'2018-01-03'),(1,'2018-01-13'),
(2,'2018-01-15'),(2,'2018-01-16'),(2,'2018-01-17'), (5,'2018-02-04'),
(5,'2018-02-05'),(5,'2018-02-06'),(5,'2018-02-11'), (5,'2018-02-17')
;with cte as (
select UserID,Dates, row_number() over (order by UserID) as seqnum
from #DATETBLE t
)
select t.UserID,t.Dates, datediff(day,tprev.Dates,t.Dates)as diff
from cte t left outer join
cte tprev
on t.seqnum = tprev.seqnum + 1;
Current Output
UserID Dates diff
1 2018-01-01 NULL
1 2018-01-02 1
1 2018-01-03 1
1 2018-01-13 10
2 2018-01-15 2
2 2018-01-16 1
2 2018-01-17 1
5 2018-02-04 18
5 2018-02-05 1
5 2018-02-06 1
5 2018-02-11 5
5 2018-02-17 6
My Expected Output
UserID Dates diff
1 2018-01-01 NULL
1 2018-01-02 1
1 2018-01-03 1
1 2018-01-13 10
2 2018-01-15 NULL
2 2018-01-16 1
2 2018-01-17 1
5 2018-02-04 NULL
5 2018-02-05 1
5 2018-02-06 1
5 2018-02-11 5
5 2018-02-17 6
Your tag (sql-server-2008) suggests using APPLY:
select t.userid, t.dates, datediff(day, t1.dates, t.dates) as diff
from #DATETBLE t outer apply
( select top (1) t1.*
from #DATETBLE t1
where t1.userid = t.userid and
t1.dates < t.dates
order by t1.dates desc
) t1;
If you have SQL Server version 2012 or higher, you could use LAG() with a partition by UserID:
SELECT UserID
, Dates
, DATEDIFF(dd,COALESCE(LAG_DATES, Dates), Dates) as diff
FROM
(
SELECT UserID
, Dates
, LAG(Dates) OVER (PARTITION BY UserID ORDER BY Dates) as LAG_DATES
FROM #DATETBLE
) exp
Because of the COALESCE, this returns 0 rather than NULL for each user's first date. If you want NULL there, as in the expected output, drop the COALESCE and pass LAG_DATES directly to DATEDIFF.
Since you tagged the post with SQL Server 2008, however, you may need to use a method that doesn't rely on this windowed function.
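For reference, the per-user "previous row" logic that both answers implement can be sketched in Python as follows. The function name `diffs_by_user` is hypothetical, and `None` stands in for SQL NULL.

```python
from datetime import date

# Same rows as the INSERT in the question: (UserID, Dates)
rows = [
    (1, date(2018, 1, 1)), (1, date(2018, 1, 2)), (1, date(2018, 1, 3)), (1, date(2018, 1, 13)),
    (2, date(2018, 1, 15)), (2, date(2018, 1, 16)), (2, date(2018, 1, 17)),
    (5, date(2018, 2, 4)), (5, date(2018, 2, 5)), (5, date(2018, 2, 6)),
    (5, date(2018, 2, 11)), (5, date(2018, 2, 17)),
]

def diffs_by_user(rows):
    """Days since the same user's previous date; None for each user's first row."""
    out, prev = [], {}
    for uid, d in sorted(rows):
        last = prev.get(uid)  # previous date seen for this user, if any
        out.append((uid, d, None if last is None else (d - last).days))
        prev[uid] = d
    return out

result = diffs_by_user(rows)
for uid, d, diff in result:
    print(uid, d, diff)
```

The key point is that the "previous" value is tracked separately per user, which is what the correlated APPLY and the PARTITION BY UserID achieve and what the single row_number() over (order by UserID) in the question does not.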