Rolling sum based on date (when dates are missing) - sql

You may be aware of rolling the results of an aggregate over a specific number of preceding rows, e.g.: how many hot dogs did I eat over the last 7 days?
SELECT HotDogCount,
DateKey,
SUM(HotDogCount) OVER (ORDER BY DateKey ROWS 6 PRECEDING) AS HotDogsLast7Days
FROM dbo.HotDogConsumption
Results:
+-------------+------------+------------------+
| HotDogCount | DateKey    | HotDogsLast7Days |
+-------------+------------+------------------+
| 3           | 09/21/2020 | 3                |
| 2           | 09/22/2020 | 5                |
| 1           | 09/23/2020 | 6                |
| 1           | 09/24/2020 | 7                |
| 1           | 09/25/2020 | 8                |
| 4           | 09/26/2020 | 12               |
| 1           | 09/27/2020 | 13               |
| 3           | 09/28/2020 | 13               |
| 2           | 09/29/2020 | 13               |
| 1           | 09/30/2020 | 13               |
+-------------+------------+------------------+
Now, the problem I am having is when there are gaps in the dates. So, basically, one day my intestines and circulatory system are screaming at me: "What the heck are you doing, you're going to kill us all!!!" So, I decide to give my body a break for a day and now there is no record for that day. When I use the "ROWS 6 PRECEDING" method, I will now be reaching back 8 days, rather than 7, because one day was missed.
So, the question is, do any of you know how I could use the OVER clause to truly use a date value (something like "DATEADD(day,-7,DateKey)") to determine how many previous rows should be summed up for a true 7 day rolling sum, regardless of whether I only ate hot dogs on one day or on all 7 days?
Side note, to have a record of 0 for the days I didn't eat any hotdogs is not an option. I understand that I could use an array of dates and left join to it and do a
CASE WHEN Datekey IS NULL THEN 0 END
type of deal, but I would like to find out if there is a different way where the rows preceding value can somehow be determined dynamically based on the date.

Window functions are the right approach in theory. But to look back at the 7 preceding days (not rows), we need a range frame specification - which, unfortunately, SQL Server does not support.
I am going to recommend a subquery, or a lateral join:
select hdc.*, hdc1.*
from dbo.HotDogConsumption hdc
cross apply (
    select coalesce(sum(HotDogCount), 0) HotDogsLast7Days
    from dbo.HotDogConsumption hdc1
    where hdc1.datekey >= dateadd(day, -7, hdc.datekey)
      and hdc1.datekey < hdc.datekey
) hdc1
You might want to adjust the conditions in the where clause of the subquery to the precise frame that you want. The above code computes the sum over the last 7 days, not including today. Something equivalent to your current attempt would look like this:
where hdc1.datekey >= dateadd(day, -6, hdc.datekey)
  and hdc1.datekey <= hdc.datekey
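Spelled out in full, that variant of the query would look like the sketch below (same table and column names as above; COALESCE is kept so a day with no earlier rows still returns 0):
select hdc.*, hdc1.HotDogsLast7Days
from dbo.HotDogConsumption hdc
cross apply (
    select coalesce(sum(HotDogCount), 0) HotDogsLast7Days
    from dbo.HotDogConsumption hdc1
    where hdc1.datekey >= dateadd(day, -6, hdc.datekey)
      and hdc1.datekey <= hdc.datekey
) hdc1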

I'm kind of old school, but this is how I'd go about it:
SELECT
     HDC1.HotDogCount
    ,HDC1.DateKey
    ,( SELECT SUM( HDC2.HotDogCount )
       FROM   HotDogConsumption HDC2
       WHERE  HDC2.DateKey BETWEEN DATEADD( DD, -7, HDC1.DateKey )
                               AND HDC1.DateKey ) AS 'HotDogsLast7Days'
FROM
    HotDogConsumption HDC1
;
Someone younger might use an OUTER APPLY or something.
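For what it's worth, a sketch of what that OUTER APPLY version might look like (same table and columns as above; the BETWEEN frame is kept as written, so adjust the -7 offset if you want exactly 7 calendar days):
SELECT
     HDC1.HotDogCount
    ,HDC1.DateKey
    ,OA.HotDogsLast7Days
FROM
    HotDogConsumption HDC1
    OUTER APPLY ( SELECT SUM( HDC2.HotDogCount ) AS HotDogsLast7Days
                  FROM   HotDogConsumption HDC2
                  WHERE  HDC2.DateKey BETWEEN DATEADD( DD, -7, HDC1.DateKey )
                                          AND HDC1.DateKey ) OA
;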

Related

3 month rolling average with missing months

I've been reading the related questions here, and so far the solutions require that there are no missing months. Would love to get some help on what I can do if there are missing months?
For example, I'd like to calculate the 3 month rolling average of orders per item. If there is a missing month for an item, the calculation assumes that the number of orders for that item for that month is 0. If there are fewer than three months left, the rolling average isn't so important (it can be null or otherwise).
MONTH   | ITEM | ORDERS | ROLLING_AVG
2021-04 | A    | 5      | 3.33
2021-04 | B    | 4      | 3
2021-03 | A    | 3      | 1.66
2021-03 | B    | 5      | null
2021-02 | A    | 2      | null
2021-01 | B    | 2      | null
Big thanks in advance!
Also, is there a way to "add" the missing month rows without using a cross join with a list of items? For example if I have 10 million items, the cross join takes quite a while to execute.
You can use a range window frame -- and some conditional logic:
select t.*,
       (case when min(month) over (partition by item) <= month - interval '2 month'
             then sum(orders) over (partition by item
                                    order by month
                                    range between interval '2 month' preceding and current row
                                   ) / 3.0
        end) as rolling_average
from t;
Here is a db<>fiddle. The results are slightly different from what is in your question, because there is not enough info for A in 2021-03 but there is enough for B in 2021-03.
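For reference, a minimal sketch of the setup the query above assumes (Postgres-flavored syntax; month is stored as a date so the interval arithmetic and the range frame work; the rows mirror the question's table):
create table t (month date, item text, orders int);

insert into t (month, item, orders) values
    ('2021-01-01', 'B', 2),
    ('2021-02-01', 'A', 2),
    ('2021-03-01', 'A', 3),
    ('2021-03-01', 'B', 5),
    ('2021-04-01', 'A', 5),
    ('2021-04-01', 'B', 4);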

How to dynamically call date instead of hardcoding in WHERE clause?

In my code using SQL Server, I am comparing data between two months where I have the exact dates identified. I am trying to find if the value in a certain column changes in a bunch of different scenarios. That part works, but what I'd like to do is make it so that I don't have to always go back to change the date each time I want to get the results I'm looking for. Is this possible?
My thought was to add a WITH clause, but it is giving me an aggregation error. Is there any way I can go about making this date problem simpler? Thanks in advance
EDIT
Ok I'd like to clarify. In my WITH statement, I have:
select distinct
d.Date
from Database d
Which returns:
+------+------------+
|      | Date       |
+------+------------+
| 1    | 01-06-2017 |
| 2    | 01-13-2017 |
| 3    | 01-20-2017 |
| 4    | 01-27-2017 |
| 5    | 02-03-2017 |
| 6    | 02-10-2017 |
| 7    | 02-17-2017 |
| 8    | 02-24-2017 |
| 9    | ........   |
+------+------------+
If I select this statement and execute, it will return just the dates from my table as shown above. What I'd like to do is be able to have sql that will pull from these date values and compare the last date value from one month to the last date value of the next month. In essence, it should compare the values from date 8 to values from date 4, but it should be dynamic enough that it can do the same for any two dates without much tinkering.
If I haven't misunderstood your request, it seems you need a numbers table, also known as a tally table, or in this case a calendar table.
Recommended post: https://dba.stackexchange.com/questions/11506/why-are-numbers-tables-invaluable
Basically, you create a table and populate it with the week numbers of the year along with each week's start and end dates. Then join your main query to this table.
+------+-----------+----------+
| week | startDate | endDate  |
+------+-----------+----------+
| 1    | 20170101  | 20170107 |
| 2    | 20170108  | 20170114 |
+------+-----------+----------+
Select b.week, max(a.Date)
from yourTable a
inner join calendarTable b
    on a.Date between b.startDate and b.endDate
group by b.week
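If you don't already have such a calendar table, here is a rough sketch of one way to populate it in SQL Server (the table and column names match the join above; starting week 1 on 2017-01-01 with 7-day weeks is only an assumption based on the sample rows):
create table calendarTable (week int, startDate date, endDate date);

;with weeks as (
    select 1 as week, cast('20170101' as date) as startDate
    union all
    select week + 1, dateadd(day, 7, startDate)
    from weeks
    where week < 52
)
insert into calendarTable (week, startDate, endDate)
select week, startDate, dateadd(day, 6, startDate)
from weeks;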
Dynamic dates to filter by with BETWEEN:
select dateadd(m,-1,dateadd(day,-(datepart(day,cast(getdate() as date))-1),cast(getdate() as date))) -- 1st date of last month
select dateadd(day,-datepart(day,cast(getdate() as date)),cast(getdate() as date)) -- last date of last month
select dateadd(day,-(datepart(day,cast(getdate() as date))-1),cast(getdate() as date)) -- 1st date of current month
select dateadd(day,-datepart(day,dateadd(m,1,cast(getdate() as date))),dateadd(m,1,cast(getdate() as date))) -- last date of current month
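Put together, a sketch of using two of those expressions to cover last month without hardcoding dates (table and column names are taken from the question's example):
select *
from [Database] d   -- bracketed because DATABASE is a reserved word; name is from the question
where d.Date between
          dateadd(m, -1, dateadd(day, -(datepart(day, cast(getdate() as date)) - 1), cast(getdate() as date)))  -- 1st date of last month
      and dateadd(day, -datepart(day, cast(getdate() as date)), cast(getdate() as date))                        -- last date of last month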

sql index same column two directions for traversing window functions

I'm trying to use windowing functions to group records close to each other (within the same partition) into sequential groups. There's probably a better way to solve the problem, but right now what I would like to try is running too slow to be useful. It involves an order by on the select:
order by person_id, rollup_class, rollup_concept_id, exp_num
and another order by in the window function:
lead(days_from_latest) over (partition by person_id, rollup_class, rollup_concept_id
order by exp_num DESC)
Because I have that last column (exp_num) ordered in opposite directions, the query takes forever. I even have two indexes on the table to handle the two directions:
create index deeIdx on results.drug_exposure_extra (person_id,rollup_class, rollup_concept_id,
exp_num);
create index deeIdx2 on results.drug_exposure_extra (person_id,rollup_class,rollup_concept_id,
exp_num desc);
But that doesn't help. So I'm trying one that orders exp_num in both directions:
create index deeIdx3 on results.drug_exposure_extra (person_id,rollup_class,rollup_concept_id,
exp_num, exp_num desc);
Does that even make sense? When the index finally finishes building, if it solves the problem, I'll answer my own question...
Nope.
Even with all three indexes, if the two order bys (in the select and in the over clause) go the same direction, the query runs super fast; if they go opposite directions, the query runs super slow. So, at this point I guess I should explain my use case better and ask for ideas for a better approach.
I've got drug exposure records (this is for a cool open-source project http://www.ohdsi.org/, btw), and when a person has drug exposures that begin less than N days from the end of any previous exposure, it should be combined with the earlier ones into a single 'era'. Whenever there is a gap of more than N days, a new era begins.
Over the course of composing this question, it turns out I solved it. It raises some interesting issues, though, so I'll post it and answer it below.
Like asking a doctor, "It hurts when I move my arm like this, what should I do?" the answer is obviously, "Don't move your arm like that." So -- don't try to make windowing functions proceed in a different order from the main query (or probably from each other) -- there's probably a better solution.
Early in working on this I had somehow convinced myself that it would be easier to aggregate eras relative to their ending records rather than their starting records, but that was where I went wrong.
So the expression that gives me the era number I want looks like this:
sum(case when exp_num = 1 or days_from_latest > 30 then 1 else 0 end)
over (partition by person_id, rollup_class, rollup_concept_id
order by exp_num)
as era_num
Explanation: if it's the patient's first exposure to the drug (well, the combination of rollup_class and rollup_concept_id in this case), then that's the beginning of a drug era. It's also the beginning of a drug era if the exposure is more than N days from any earlier exposure. (This point is what makes it a little complicated: say exposure 1 starts at day 1 and lasts 60 days, exposure 2 starts at day 20 and lasts 10 days, and exposure 3 starts at day 70. Exposure 3 is 40 days after the end of the most recent exposure, 2, which would put it in a new era, but it's only 10 days after the end of exposure 1, which puts it in the same era as 1 and 2.)
So, for each record that starts an era the case expression gives us a 1, and the rest get 0s. Then we sum that, partitioning over the same partition we used in an earlier query to establish exp_num, and ordering by exp_num. I could have specified the rows to sum explicitly by adding rows between unbounded preceding and current row, but that's the default behavior anyway. So the era number increments only at the beginning of new eras.
Here is a much simplified example in response to gordon-linoff's comment below.
create table junk_numbers (x int);
insert into junk_numbers values (1),(2),(3),(5),(7),(9),(10),(15),(20),(25),(26),(28),(30);
-- break into series with gaps of at least 1
select x, gap, 1+sum(case when gap > 1 then 1 else 0 end) over (order by x) as series_num
from (
select x, x - lag(x) over (order by x) as gap
from junk_numbers
) as x_and_gaps
order by x;
x | gap | series_num
----+-----+------------
1 | | 1
2 | 1 | 1
3 | 1 | 1
5 | 2 | 2
7 | 2 | 3
9 | 2 | 4
10 | 1 | 4
15 | 5 | 5
20 | 5 | 6
25 | 5 | 7
26 | 1 | 7
28 | 2 | 8
30 | 2 | 9
-- same query but bigger gaps:
select x, gap, 1+sum(case when gap > 4 then 1 else 0 end) over (order by x) as series_num
from (
select x, x - lag(x) over (order by x) as gap
from junk_numbers
) as x_and_gaps
order by x;
x | gap | series_num
----+-----+------------
1 | | 1
2 | 1 | 1
3 | 1 | 1
5 | 2 | 1
7 | 2 | 1
9 | 2 | 1
10 | 1 | 1
15 | 5 | 2
20 | 5 | 3
25 | 5 | 4
26 | 1 | 4
28 | 2 | 4
30 | 2 | 4

SQL Query X Days back excluding date ranges (Confusing!)

Ok, I have a tough SQL query, and I'm not sure how to go about writing it.
I am summing the number of "bananas collected" by an employee within the last X days, but what I could really use help on is determining X.
The "last X days" value is defined to be the last 100 days that the employee was NOT out due to Purple Fever, starting from some ChosenDate (we'll say today, 6/24/14). That is to say, if the person was sick with Purple Fever for 3 days, then I want to look back over the last 103 days from ChosenDate rather than the last 100 days. Any other reason the employee may have been out does not affect our calculation.
Table PersonOutIncident
+---------------------+----------+-------------+
| PersonOutIncidentID | PersonID | ReasonOut   |
+---------------------+----------+-------------+
| 1                   | Sarah    | PurpleFever |
| 2                   | Sarah    | PaperCut    |
| 3                   | Jon      | PurpleFever |
| 4                   | Sarah    | PurpleFever |
+---------------------+----------+-------------+
Table PersonOutDetail
+-------------------+---------------------+-----------+-----------+
| PersonOutDetailID | PersonOutIncidentID | BeginDate | EndDate   |
+-------------------+---------------------+-----------+-----------+
| 1                 | 1                   | 1/1/2014  | 1/3/2014  |
| 2                 | 1                   | 1/7/2014  | 1/13/2014 |
| 3                 | 2                   | 2/1/2014  | 2/3/2014  |
| 4                 | 3                   | 1/15/2014 | 1/20/2014 |
| 5                 | 4                   | 5/1/2014  | 5/15/2014 |
+-------------------+---------------------+-----------+-----------+
The tables are established. Many PersonOutDetail records can be associated with one PersonOutIncident record and there may be multiple PersonOutIncident records for a single employee (That is to say, there could be two or three PersonOutIncident records with an identical ReasonOut column, because they represent a particular incident or event and the not-necessarily-continuous days lost due to that particular incident)
The nature of this requirement complicates things, even conceptually to me.
The best I can think of is to check for a BeginDate/EndDate pair within the 100 day base period, then determine the number of days from BeginDate to EndDate and add that to the base 100 days. But then I would have to check again that this new range doesn't overlap or contain additional BeginDate/EndDate pairs and, if so, add those days as well. I can tell already that this isn't the method I want to use, but I can't quite wrap my mind around how exactly I need to start/structure this query. Does anyone have an idea that might steer me in the correct direction? I realize this might not be clear and I apologize if I'm just confusing things.
One way to do this is to work with a table or WITH clause that contains a list of days. Let's say days is a table with one column that contains the last 200 days. (This means the query will break if the employee had more than 100 sick days in the last 200 days.)
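A rough sketch of generating that list of days as a CTE in SQL Server (the name days and the 200-day window follow the answer; this is just one of several ways to build it):
WITH days AS
(
    SELECT CAST(GETDATE() AS date) AS day
    UNION ALL
    SELECT DATEADD(day, -1, day)
    FROM days
    WHERE day > DATEADD(day, -199, CAST(GETDATE() AS date))
)
SELECT day
FROM days
OPTION (MAXRECURSION 200);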
Now you can get a list of all working days of an employee like this (replace ? with the employee id):
WITH t1 AS
(
SELECT day,
ROW_NUMBER() OVER (ORDER BY day DESC) AS 'RowNumber'
FROM days d
WHERE NOT EXISTS (SELECT * FROM PersonOutDetail pd
INNER JOIN PersonOutIncident po ON po.PersonOutIncidentID = pd.PersonOutIncidentID
WHERE d.day BETWEEN pd.BeginDate AND pd.EndDate
AND po.ReasonOut = 'PurpleFever'
AND po.PersonID = ?)
)
SELECT * FROM t1
WHERE RowNumber <= 100;
Alternatively, you can obtain the '100th day' by replacing RowNumber <= 100 with RowNumber = 100.

Best Practice in Scenario

I am currently trying to accomplish the following:
get the last weekstamp for the last 6 months. The following illustrates how the end result might look:
Month   | Weekstamp |
2013-12 | 2013-52   |
2014-01 | 2014-05   |
.... and so on
I have an auxiliary table which has all weeks in it and allows me to connect to a calendar table, which in turn has all months, meaning I am able to get all weekstamps per month.
But how do I get all of the last week numbers for the last 6 months?
My idea was a temporary table of some sort (I've never used one; I am a beginner when it comes to SQL)
which calculates all of the weekstamps that need to be filtered out per month, and then gives out only the values which I could then use to filter a query which contains all the data I need.
Anybody have a better idea?
As I said, I am just a beginner, so I can't really say what the best way would be.
Thanks a lot in advance!
My guess is that your challenge is determining what the last six months are. To do this you can use a tally table (spt_values) and DATEADD to generate the month-end dates for the last six months.
Depending on which DB and version you are using, you can also easily do this without a calendar or weeks table.
This
WITH rnge
     AS (SELECT number
         FROM   master..spt_values
         WHERE  type = 'P'
                AND number > 0
                AND number < 7),
     dates
     AS (SELECT EOMONTH(Dateadd(m, number * -1, Getdate())) dt
         FROM   rnge)
SELECT Year(dt)         year,
       Month(dt)        month,
       Datepart(wk, dt) week
FROM   dates
Produces this output
| YEAR | MONTH | WEEK |
|------|-------|------|
| 2014 | 1     | 5    |
| 2013 | 12    | 53   |
| 2013 | 11    | 48   |
| 2013 | 10    | 44   |
| 2013 | 9     | 40   |
| 2013 | 8     | 35   |
Demo
I'll leave it to you to format the values
This assumes SQL Server 2012 or later since it uses EOMONTH; see Get the last day of the month in SQL for previous versions of SQL Server
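On those earlier versions, a commonly used equivalent for the end-of-month calculation (my sketch, not from the linked post) is to swap the EOMONTH call in the dates CTE for a DATEADD/DATEDIFF expression:
dates
AS (SELECT Dateadd(day, -1,
           Dateadd(month, Datediff(month, 0, Dateadd(m, number * -1, Getdate())) + 1, 0)) dt
    FROM   rnge)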