Comparing Two Sets of Date Ranges in SQL

I have two sets of data with different date ranges.
Tbl 1:
ID, Date_Start, Date_End
1, 2010-01-01, 2010-01-09
1, 2010-01-10, 2010-01-19
1, 2010-01-30, 2010-01-31
Tbl 2:
ID, Date_Start, Date_End
1, 2010-01-01, 2010-01-04
1, 2010-01-08, 2010-01-17
1, 2010-01-30, 2010-01-31
I'd like to find cases where date ranges in Tbl 1 do not entirely overlap date ranges in Tbl 2. So, for instance, in this example, I'd like output that looks something like this --
Output:
ID, Gap_Start, Gap_End
1, 2010-01-05, 2010-01-07
1, 2010-01-18, 2010-01-19
Date ranges will never overlap within a table. To do this, I'm using either DB2 SQL or SAS. Unfortunately, the datasets are big enough (millions of records) that I can't just brute force it.
Thank you!

Following on from Jon of All Trades' approach, this is a more complete solution. The crucial features are:
Use an auxiliary calendar table, which is just a list of all dates.
From the calendar table, JOIN to Tbl1 to get a list of dates which are in range.
Also do an anti-JOIN to Tbl2 to get only the dates which aren't in Tbl2's ranges.
I've enclosed those results in a Common Table Expression (CTE) called OutDates.
Define another CTE based on OutDates to get just the dates which start a gap; call this EarliestDates.
Define another CTE based on OutDates to get just the dates which end a gap; call this LatestDates.
JOIN EarliestDates and LatestDates to put each gap into a single row.
WITH
OutDates(ID, dt) AS
( SELECT Tbl1.ID, Calendar.dt FROM Calendar
INNER JOIN Tbl1 ON Calendar.dt BETWEEN Tbl1.Date_Start AND Tbl1.Date_End
LEFT OUTER JOIN Tbl2 ON Calendar.dt BETWEEN Tbl2.Date_Start AND Tbl2.Date_End
WHERE Tbl2.ID IS NULL
)
,
EarliestDates AS
( SELECT earliest.ID, earliest.dt FROM OutDates earliest
LEFT OUTER JOIN OutDates nonesuch_earlier ON DateAdd(day, -1, earliest.dt) = nonesuch_earlier.dt
WHERE nonesuch_earlier.ID IS NULL
)
,
LatestDates AS
( SELECT latest.ID, latest.dt FROM OutDates latest
LEFT OUTER JOIN OutDates nonesuch_later ON DATEADD(day, 1, latest.dt) = nonesuch_later.dt
WHERE nonesuch_later.ID IS NULL
)
SELECT rangestart.ID, rangestart.dt AS Gap_Start, rangeend.dt AS Gap_End
FROM EarliestDates rangestart JOIN LatestDates rangeend
ON rangestart.dt <= rangeend.dt
LEFT OUTER JOIN EarliestDates nonesuch_inner1
ON nonesuch_inner1.dt <= rangeend.dt AND nonesuch_inner1.dt > rangestart.dt
LEFT OUTER JOIN LatestDates nonesuch_inner2
ON nonesuch_inner2.dt >= rangestart.dt AND nonesuch_inner2.dt < rangeend.dt
WHERE nonesuch_inner1.dt IS NULL AND nonesuch_inner2.dt IS NULL
This is a working implementation using SQL Server syntax for the common table expressions, but it should be easy to convert to DB2 syntax. To be honest, I don't know how well it will scale; I've only tested it with a very small dataset.
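If you don't already have a Calendar table, here is a minimal sketch for populating one (SQL Server syntax; the table and column names match the query above, but the 2010 date range is an assumption you should widen to cover your data):
CREATE TABLE Calendar (dt DATE PRIMARY KEY);
-- one row per day of 2010; OPTION (MAXRECURSION 366) lifts the default 100-level recursion limit
WITH d AS (
SELECT CAST('2010-01-01' AS DATE) AS dt
UNION ALL
SELECT DATEADD(day, 1, dt) FROM d WHERE dt < '2010-12-31'
)
INSERT INTO Calendar (dt)
SELECT dt FROM d
OPTION (MAXRECURSION 366);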

I don't think there is an efficient, general solution for all cases. Under certain circumstances, however, we can come up with efficient ones. For instance, the code below assumes that: (1) datasets one and two have the same set of IDs in the same order; and (2) the set of possible dates is relatively small (assumed here to be only the dates in the year 2010). Notice that one input range may generate two gaps.
/* test data */
data one;
input id1 (start1 finish1) (:anydtdte.);
format start1 finish1 e8601da.;
cards;
1 2010-01-01 2010-01-09
1 2010-01-10 2010-01-19
1 2010-01-30 2010-01-31
2 2010-01-02 2010-01-10
;
run;
data two;
input id2 (start2 finish2) (:anydtdte.);
format start2 finish2 e8601da.;
cards;
1 2010-01-01 2010-01-04
1 2010-01-08 2010-01-17
1 2010-01-30 2010-01-31
2 2010-01-05 2010-01-06
;
run;
/* assumptions:
(1) datasets one and two have the same set of ids in the same
sorted order;
(2) only possible dates are in the year of 2010
*/
%let minDate = %sysevalf('01jan2010'd - 1);
%let maxDate = %sysevalf('31dec2010'd + 1);
data gaps;
array inRange[&minDate:&maxDate] _temporary_;
array covered[&minDate:&maxDate] _temporary_;
do i = &minDate to &maxDate; inRange[i] = 0; covered[i] = 0; end;
do until (last.id1);
set one;
by id1;
do i = start1 to finish1; inRange[i] = 1; end;
end;
do until (last.id2);
set two;
by id2;
do i = start2 to finish2; covered[i] = 1; end;
end;
format startGap finishGap e8601da.;
startGap = .;
finishGap = .;
do i = &minDate+1 to &maxDate;
if inRange[i] and not covered[i] and missing(startGap) then startGap = i;
if (covered[i] or not inRange[i]) and not missing(startGap) and not covered[i-1] then do;
finishGap = i - 1;
output;
call missing(startGap, finishGap);
keep id1 startGap finishGap;
end;
end;
run;
/* check */
proc print data=gaps noobs;
run;
/* on lst
id1 startGap finishGap
1 2010-01-05 2010-01-07
1 2010-01-18 2010-01-19
2 2010-01-02 2010-01-04
2 2010-01-07 2010-01-10
*/

This is not a complete solution, as it returns a list of dates rather than ranges, but maybe it will be of use:
SELECT
R1.ID, D.Date
FROM
#Ranges1 AS R1
INNER JOIN Dates AS D ON D.Date BETWEEN R1.StartDate AND R1.EndDate
EXCEPT
SELECT
R2.ID, D.Date
FROM
#Ranges2 AS R2
INNER JOIN Dates AS D ON D.Date BETWEEN R2.StartDate AND R2.EndDate
Note that this solution requires a dates table: a table with one record per day, for all the dates you're likely to use. It has the advantages of being succinct, and handling overlapping date ranges (not necessary in your case, but maybe for the next guy).
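If you do need contiguous ranges rather than single dates, the usual follow-up is a gaps-and-islands grouping over that date list. A sketch in SQL Server syntax, wrapping the query above in a CTE (the GapDates and grp names are mine):
WITH GapDates (ID, GapDate) AS
(
SELECT R1.ID, D.Date
FROM #Ranges1 AS R1
INNER JOIN Dates AS D ON D.Date BETWEEN R1.StartDate AND R1.EndDate
EXCEPT
SELECT R2.ID, D.Date
FROM #Ranges2 AS R2
INNER JOIN Dates AS D ON D.Date BETWEEN R2.StartDate AND R2.EndDate
)
SELECT ID, MIN(GapDate) AS Gap_Start, MAX(GapDate) AS Gap_End
FROM (
-- consecutive dates share the same "date minus row number" value, so they group together
SELECT ID, GapDate,
DATEADD(day, -CAST(ROW_NUMBER() OVER (PARTITION BY ID ORDER BY GapDate) AS int), GapDate) AS grp
FROM GapDates
) AS g
GROUP BY ID, grp
ORDER BY ID, Gap_Start;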

For what it's worth, this is the method I ended up using. I think you could do it in pure SQL, but it got horrifically ugly and difficult to debug.
Step 1 -- I consolidated the date ranges in both datasets. This means that something like
ID, Start_Date, End_Date
1, 2010-01-01, 2010-01-31
1, 2010-02-01, 2010-02-28
got transformed into this --
ID, Start_Date, End_Date
1, 2010-01-01, 2010-02-28.
The query I used to produce this was --
WITH Cte_recomb (Id, Start_date, End_date, Hopcount) AS
(SELECT Id,
Start_date,
End_date,
1 AS Hopcount
FROM Table1
UNION ALL
SELECT Cte_recomb.Id,
Cte_recomb.Start_date,
Table1.End_date,
(Cte_recomb.Hopcount + 1) AS Hopcount
FROM Cte_recomb, Table1
WHERE (Cte_recomb.Id = Table1.Id) AND
(Cte_recomb.End_date + 1 day = Table1.Start_date)),
Cte_maxenddate AS
(SELECT Id,
Start_date,
Max (End_date) AS End_date
FROM Cte_recomb
GROUP BY Id, Start_date
ORDER BY Id, Start_date)
SELECT Maxend.*
FROM Cte_maxenddate AS Maxend
LEFT JOIN
Cte_recomb AS Nextrec
ON (Nextrec.Id = Maxend.Id) AND
(Nextrec.Start_date < Maxend.Start_date) AND
(Nextrec.End_date >= Maxend.End_date)
WHERE Nextrec.Id IS NULL;
Step 2 --
I produced another dataset that created a record for every overlap between the two datasets. You'll need an additional step to find cases where a given record in Table1 doesn't have a corresponding record in Table2 at all.
SELECT Table1.Id,
Table1.Start_date AS Table1_start_date,
Table1.End_date AS Table1_end_date,
Table2.Start_date AS Table2_start_date,
Table2.End_date AS Table2_end_date
FROM Table1
INNER JOIN
Table2
ON (Table1.Id = Table2.Id) AND
( (Table1.Start_date BETWEEN Table2.Start_date AND Table2.End_date) OR
(Table2.Start_date BETWEEN Table1.Start_date AND Table1.End_date)) AND
( (Table1.Start_date <> Table2.Start_date) OR
(Table1.End_date <> Table2.End_date))
ORDER BY Table1.Id, Table1.Start_date, Table2.Start_date;
Step 3 --
I take the above dataset, and run the following SAS job. I tried to do this in pure SQL with recursive queries, but it got uglier and uglier every time I looked at it.
Data Table1_Gaps;
Set Table1_Compare;
By ID Table1_Start_Date Table2_Start_Date;
format Gap_Start_Date yymmdd10.;
format Gap_End_Date yymmdd10.;
format Old_Start_Date yymmdd10.;
format Old_End_Date yymmdd10.;
Retain Old_Start_Date Old_End_Date;
IF (Table2_End_Date = .) then do;
Gap_Start_Date = Table1_Start_Date;
Gap_End_Date = Table1_End_Date;
output;
end;
else do;
If (Table2_Start_Date > Table1_Start_Date) then do;
if first.Table1_Start_Date then do;
Gap_Start_Date = Table1_Start_Date;
Gap_End_Date = Table2_Start_Date - 1;
output;
end;
else do;
Gap_Start_Date = Old_End_Date + 1;
Gap_End_Date = Table2_Start_Date - 1;
output;
end;
end;
If (Table2_End_Date < Table1_End_Date) then do;
if Last.Table1_Start_Date then do;
Gap_Start_Date = Table2_End_Date + 1;
Gap_End_Date = Table1_End_Date;
output;
end;
end;
end;
Old_Start_Date = Table2_Start_Date;
Old_End_Date = Table2_End_Date;
drop Old_Start_Date Old_End_Date;
run;
I haven't verified it entirely yet, but this approach does seem to have given me the results I wanted. Any thoughts?

Related

Hash join equivalent on PROC SQL between

I usually use PROC SQL when I'm joining on a table that also has a date condition (i.e. target_date falls between start_date and end_date).
I've been able to successfully translate this to a hash join when considering an INNER JOIN:
data hash_join;
if _n_ = 1 then do;
declare hash add1(dataset:'table_2',multidata: 'Y');
add1.defineKey('key_1');
add1.defineData('start_date','end_date','value_1');
add1.defineDone();
end;
format
start_date date9.
end_date date9.
value_1 10.5
;
set table_1 (keep=key_1 target_date);
if add1.find() = 0 then do until (add1.find_next());
if start_date le target_date le end_date then output;
end;
run;
Which is the same thing as:
proc sql;
create table sql_join as select
b.start_date,
b.end_date,
b.value_1,
a.key_1,
a.target_date
from table_1 a
inner join table_2 b
on a.key_1 = b.key_1 and
a.target_date between b.start_date and b.end_date
;quit;
I'm having trouble figuring out what the equivalent of a LEFT JOIN would be, though. For instance, if something doesn't join at all, I'd want to output it, which I think is straightforward:
if add1.find() ne 0 then output;
And if it JOINs and the date is between, that seems straightforward as well:
if add1.find() = 0 then do until (add1.find_next());
if start_date le target_date le end_date then output;
end;
But how do I get the rest of the records from table_1 that might join, but don't have the target_date between the start_date and end_date? For instance, let's say table_2 is a start_date and end_date of a sale, and that sale didn't start until February 1st for a key_1 = 'Clothes'. If my table_1 has 'Clothes' and sales on January 1st, it will JOIN on the key, but I want to output the blank value. Any ideas on how to do this?
Any help would be much appreciated!
You just need to keep track of whether you've found a match or not. Since you're not using the hash find to track the 'between' part of things, you can't use that, so you just have to do it yourself.
See this example. Here I modify SASHELP.CLASS to look like your input tables, then add a bit of logic to see if anything was found.
data table_1;
set sashelp.class;
rename age=target_date name=key_1;
drop height weight;
run;
data table_2;
set sashelp.class;
do _i = 1 to mod(_n_,3);
start_date = age-3+_i;
end_date = age+1-_i;
if start_date le end_date then output;
end;
rename name=key_1 height=value_1;
keep height weight start_date age end_date name;
run;
data hash_join;
if _n_ = 1 then do;
declare hash add1(dataset:'table_2',multidata: 'Y');
add1.defineKey('key_1');
add1.defineData('start_date','end_date','value_1');
add1.defineDone();
end;
format
start_date date9.
end_date date9.
value_1 10.5
;
set table_1 (keep=key_1 target_date);
if add1.find() = 0 then do until (add1.find_next());
if start_date le target_date le end_date then do;
found=1;
output;
end;
end;
call missing(of value_1); *full list of values to clear - all of hash data elements;
if not (found) then output;
run;
I think you just need to track if something has the key, but not in the range:
if add1.find() ^=0 then output;
else do;
found = 0;
do until (add1.find_next());
if start_date le target_date le end_date then do;
output;
found=1;
end;
end;
if ^found then output;
end;
No data to test with, so this is just me coding in SO. Let me know if it doesn't work.

SQL Command Not Properly Ended - Oracle Subquery

I'm using a Crystal Reports 13 Add Command for record selection from an Oracle database connected through the Oracle 11g Client. The error I am receiving is ORA-00933: SQL command not properly ended, but I can't find anything the matter with my code (incomplete):
/* Determine units with billing code effective dates in the previous month */
SELECT "UNITS"."UnitNumber", "BILL"."EFF_DT"
FROM "MFIVE"."BILL_UNIT_ACCT" "BILL"
LEFT OUTER JOIN "MFIVE"."VIEW_ALL_UNITS" "UNITS" ON "BILL"."UNIT_ID" = "UNITS"."UNITID"
WHERE "UNITS"."OwnerDepartment" LIKE '580' AND TO_CHAR("BILL"."EFF_DT", 'MMYYYY') = TO_CHAR(ADD_MONTHS(TRUNC(SYSDATE), -1), 'MMYYYY')
INNER JOIN
/* Loop through previously identified units and determine last billing code change prior to previous month */
(
SELECT "BILL2"."UNIT_ID", MAX("BILL2"."EFF_DT")
FROM "MFIVE"."BILL_UNIT_ACCT" "BILL2"
WHERE TO_CHAR("BILL2"."EFF_DT", 'MMYYYY') < TO_CHAR(ADD_MONTHS(TRUNC(SYSDATE), -1), 'MMYYYY')
GROUP BY "BILL2"."UNIT_ID"
)
ON "BILL"."UNIT_ID" = "BILL2"."UNIT_ID"
ORDER BY "UNITS"."UnitNumber", "BILL"."EFF_DT" DESC
We are a state entity that leases vehicles (units) to other agencies. Each unit has a billing code with an associated effective date. The application is to develop a report of units with billing codes changes in the previous month.
Complicating the matter is that for each unit above, the report must also show the latest billing code and associated effective date prior to the previous month. A brief example:
Given this data and assuming it is now April 2016 (ordered for clarity)...
Unit Billing Code Effective Date Excluded
---- ------------ -------------- --------
1 A 04/15/2016 Present month
1 B 03/29/2016
1 A 03/15/2016
1 C 03/02/2016
1 B 01/01/2015
2 C 03/25/2016
2 A 03/04/2016
2 B 07/24/2014
2 A 01/01/2014 A later effective date prior to previous month exists
3 D 11/28/2014 No billing code change during previous month
The report should return the following...
Unit Billing Code Effective Date
---- ------------ --------------
1 B 03/29/2016
1 A 03/15/2016
1 C 03/02/2016
1 B 01/01/2015
2 C 03/25/2016
2 A 03/04/2016
2 B 07/24/2014
Any assistance resolving the error will be appreciated.
You have a WHERE clause before the INNER JOIN clause. This is invalid syntax - if you swap them it should work:
SELECT "UNITS"."UnitNumber",
"BILL"."EFF_DT"
FROM "MFIVE"."BILL_UNIT_ACCT" "BILL"
LEFT OUTER JOIN
"MFIVE"."VIEW_ALL_UNITS" "UNITS"
ON "BILL"."UNIT_ID" = "UNITS"."UNITID"
INNER JOIN
/* Loop through previously identified units and determine last billing code change prior to previous month */
(
SELECT "UNIT_ID",
MAX("EFF_DT")
FROM "MFIVE"."BILL_UNIT_ACCT"
WHERE TO_CHAR("EFF_DT", 'MMYYYY') < TO_CHAR(ADD_MONTHS(TRUNC(SYSDATE), -1), 'MMYYYY')
GROUP BY "UNIT_ID"
) "BILL2"
ON "BILL"."UNIT_ID" = "BILL2"."UNIT_ID"
WHERE "UNITS"."OwnerDepartment" LIKE '580'
AND TO_CHAR("BILL"."EFF_DT", 'MMYYYY') = TO_CHAR(ADD_MONTHS(TRUNC(SYSDATE), -1), 'MMYYYY')
ORDER BY "UNITS"."UnitNumber", "BILL"."EFF_DT" DESC
Also, the "BILL2" alias needs to go outside the () brackets: you do not need an alias inside the brackets, but you do need one on the subquery itself.
Are you really sure you need the double-quotes ""? Double-quotes enforce case sensitivity in identifiers - the default behaviour is for Oracle to convert all table and column names to upper case to abstract the case sensitivity from the user - since you are using both double-quotes and upper-case names, the quotes seem redundant.
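To illustrate the difference (just a sketch against the same table, not part of the report):
-- unquoted identifiers are folded to upper case, so these two are equivalent:
SELECT UNIT_ID FROM MFIVE.BILL_UNIT_ACCT;
SELECT "UNIT_ID" FROM MFIVE.BILL_UNIT_ACCT;
-- this one only works if the column was actually created with a quoted lower-case name:
SELECT "unit_id" FROM MFIVE.BILL_UNIT_ACCT;
Mixed-case names like "UnitNumber", on the other hand, do need the quotes, because the unquoted form would be folded to UNITNUMBER.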
SOLUTION:
SELECT DISTINCT "BILL2".*
FROM "MFIVE"."BILL_UNIT_ACCT" "BILL"
LEFT JOIN
/* Returns all rows for each unit with effective date >= maximum prior effective date and effective date < current month */
(
SELECT "UNITS"."UnitNumber" AS "UNITNO",
UPPER("UNITS"."SpecNoDescription") AS "TS_DESCR",
"UNITS"."UsingDepartment" AS "USING_DEPT",
"UNITS"."OwnerDepartment" AS "OWNING_DEPT",
"BILL"."BILLING_CODE",
"BILL"."EFF_DT",
"BILL"."UNIT_ID"
FROM "MFIVE"."BILL_UNIT_ACCT" "BILL"
LEFT OUTER JOIN "MFIVE"."VIEW_ALL_UNITS" "UNITS" ON "BILL"."UNIT_ID" = "UNITS"."UNITID"
WHERE TO_NUMBER(TO_CHAR(TRUNC("BILL"."EFF_DT"), 'YYYYMM'), '999999') < TO_NUMBER(TO_CHAR(TRUNC(ADD_MONTHS(SYSDATE, 0)), 'YYYYMM'), '999999') AND
"UNITS"."OwnerDepartment" = '580' AND
"BILL"."EFF_DT" >=
/* Returns maximum effective date prior to previous month for each unit */
(
SELECT MAX("BILL3"."EFF_DT")
FROM "MFIVE"."BILL_UNIT_ACCT" "BILL3"
WHERE "BILL3"."UNIT_ID" = "BILL"."UNIT_ID" AND
TO_NUMBER(TO_CHAR(TRUNC("BILL3"."EFF_DT"), 'YYYYMM'), '999999') < TO_NUMBER(TO_CHAR(TRUNC(ADD_MONTHS(SYSDATE, -1)), 'YYYYMM'), '999999')
GROUP BY "BILL3"."UNIT_ID"
)
) BILL2
ON "BILL"."UNIT_ID" = "BILL2"."UNIT_ID"
WHERE TO_CHAR("BILL"."EFF_DT", 'YYYYMM') = TO_CHAR(ADD_MONTHS(TRUNC(SYSDATE), -1), 'YYYYMM')
ORDER BY "BILL2"."UNITNO", "BILL2"."EFF_DT" DESC
If the where before Join really matters to you, use a CTE. (Employing with clause for temporary table and joining on the same.)
With c as (SELECT "UNITS"."UnitNumber", "BILL"."EFF_DT","BILL"."UNIT_ID" -- Correction: Was " BILL"."UNIT_ID" (spacetanker)
FROM "MFIVE"."BILL_UNIT_ACCT" "BILL" -- Returning unit id column too, to be used in join
LEFT OUTER JOIN "MFIVE"."VIEW_ALL_UNITS" "UNITS" ON "BILL"."UNIT_ID" = "UNITS"."UNITID"
WHERE "UNITS"."OwnerDepartment" LIKE '580' AND TO_CHAR("BILL"."EFF_DT", 'MMYYYY') = TO_CHAR(ADD_MONTHS(TRUNC(SYSDATE), -1), 'MMYYYY'))
select * from c --Filter out your required columns from c and d alias results, e.g c.UNIT_ID
INNER JOIN
--Loop through previously identified units and determine last billing code change prior to previous month
(
SELECT "BILL2"."UNIT_ID", MAX("BILL2"."EFF_DT")
FROM "MFIVE"."BILL_UNIT_ACCT" "BILL2"
WHERE TO_CHAR("BILL2"."EFF_DT", 'MMYYYY') < TO_CHAR(ADD_MONTHS(TRUNC(SYSDATE), -1), 'MMYYYY')
GROUP BY "BILL2"."UNIT_ID"
) d
ON c."UNIT_ID" = d."UNIT_ID"
order by c."UnitNumber", c."EFF_DT" desc -- Correction: Removed semicolon that Crystal Reports didn't like (spacetanker)
It seems this query has a lot of scope for tuning, though. However, whoever has access to the data and the requirement specification is the best judge of that.
EDIT:
You are not able to see data PRIOR to the previous month because you are using BILL.EFF_DT in your original question's select statement, which is filtered to return only dates from the previous month (..AND TO_CHAR("BILL"."EFF_DT", 'MMYYYY') = TO_CHAR(ADD_MONTHS(TRUNC(SYSDATE), -1), 'MMYYYY')).
If you want that data as well, I think you have to use the BILL2 section (the d part in my subquery), giving an alias to MAX(EFF_DT) and using that alias in your select clause.
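In other words, the d part could be written like this (a sketch; "PRIOR_EFF_DT" is just a name I picked for the alias):
SELECT "BILL2"."UNIT_ID", MAX("BILL2"."EFF_DT") AS "PRIOR_EFF_DT"
FROM "MFIVE"."BILL_UNIT_ACCT" "BILL2"
WHERE TO_CHAR("BILL2"."EFF_DT", 'YYYYMM') < TO_CHAR(ADD_MONTHS(TRUNC(SYSDATE), -1), 'YYYYMM')
GROUP BY "BILL2"."UNIT_ID"
Dropped in as the d subquery, the outer SELECT can then reference d."PRIOR_EFF_DT" alongside c."UnitNumber" and c."EFF_DT". (I also switched the format mask to 'YYYYMM', which orders chronologically as a string; 'MMYYYY' does not.)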

Counting records given criterias

I give up! I've been trying to make this work for some time now but I can't get the logic and/or code right.
What I'm trying to do is to count ongoing cases for every date in a defined period. I have a table looking basically like this:
CaseId OpenDate CloseDate
1 01JAN2014 05JAN2014
2 02JAN2014 04JAN2014
3 02JAN2014 .
4 03JAN2014 04JAN2014
5 06JAN2014 08JAN2014
6 07JAN2014 .
I created a data set that iterates over the dates from today-30 to today as comparison dates (CompDate).
Definition of ongoing case is (CompDate <= OpenDate and CompDate < CloseDate) or CloseDate = .
My plan was to join the dates with the table and get something like this
CompDate OngoingCases
01JAN2014 1
02JAN2014 3
03JAN2014 4
04JAN2014 2
05JAN2014 1
06JAN2014 2
07JAN2014 3
So far I've come up with this code, which gives me something else...
proc sql;
create table Ongoing as
select distinct
a.CompDate,
count(distinct case when (a.CompDate <= datepart(b.caseopendate) and (a.CompDate < datepart(b.caseclosedate) or b.caseclosedate = . )) then b.caseid end) as Cases
from List_of_dates as a
left outer join dcms_cases as b
on a.Date
where
.
.
group by a.Date
;
quit;
I think all you need is to join the two tables, applying your date condition in the ON statement. The following SQL will do the trick:
proc sql;
create table Ongoing as
select a.CompDate, count(b.CaseId) as OngoingCases
from List_of_dates a
left join dcms_cases b
on b.OpenDate<=a.CompDate and (b.CloseDate>a.CompDate or b.CloseDate=.)
group by a.CompDate
;
quit;
I made a few assumptions:
1) You meant to say the comp date was greater than the open date but less than the close date, or the close date is missing.
2) You want the final result set to show how many ongoing cases there were based on the open date rather than the comp date, since the comp date would be one static value, based on your description.
I have code below that used your sample and named it 'AA1'. It will give you the result set below.
Data AA1;
set AA1;
Format CompDate date9. Ongoing $3.;
CompDate = today() - 100; /*Change this to whatever your criteria is for the comparison*/
If ((CompDate >= opendate) and (CompDate < closedate)) or (closedate = .) then Ongoing = 'Y'; else Ongoing = 'N';
run;
/*Sort table in order to do a count of the ongoing cases*/
proc sort data = AA1;
by opendate caseID CompDate Ongoing;
run;
/*Count how many ongoing cases exist based on the Open date values */
data AA1(rename= (count=OngoingCases));
set AA1;
count + 1;
by opendate;
if first.opendate then count = 1;
run;
/*Clean up to keep the variables you want in your result set */
data final;
set AA1(keep=opendate OngoingCases);
run;
One way is to iterate over your date range for every record in your dataset, and output records which satisfy your criteria...
%LET START = today()-30 ;
%LET END = today() ;
data want ;
set have ;
do CompDate = &START to &END ;
OngoingCase = (OpenDate <= CompDate < CloseDate)
or (OpenDate <= CompDate and missing(CloseDate)) ;
if OngoingCase then output ;
end ;
format CompDate date9. ;
run ;
proc summary data=want nway ;
class CompDate ;
var OngoingCase ;
output out=case_sum (drop=_:) sum= ;
run ;
I modified the current query slightly and moved the CASE condition to the WHERE clause.
proc sql;
create table Ongoing as
select distinct
a.CompDate,
count(distinct b.caseid ) as Cases
from List_of_dates as a
left outer join dcms_cases as b
on a.Date
where (a.CompDate <= datepart(b.caseopendate) and a.CompDate < datepart(b.caseclosedate))
or b.caseclosedate = .
group by a.Date
;
quit;

SQL moving average

How do you create a moving average in SQL?
Current table:
Date Clicks
2012-05-01 2,230
2012-05-02 3,150
2012-05-03 5,520
2012-05-04 1,330
2012-05-05 2,260
2012-05-06 3,540
2012-05-07 2,330
Desired table or output:
Date Clicks 3 day Moving Average
2012-05-01 2,230
2012-05-02 3,150
2012-05-03 5,520 4,360
2012-05-04 1,330 3,330
2012-05-05 2,260 3,120
2012-05-06 3,540 3,320
2012-05-07 2,330 3,010
This is an evergreen Joe Celko question.
I don't know which DBMS platform is being used, but in any case Joe was able to answer it more than 10 years ago with standard SQL.
Joe Celko SQL Puzzles and Answers citation:
"That last update attempt suggests that we could use the predicate to
construct a query that would give us a moving average:"
SELECT S1.sample_time, AVG(S2.load) AS avg_prev_hour_load
FROM Samples AS S1, Samples AS S2
WHERE S2.sample_time
BETWEEN (S1.sample_time - INTERVAL 1 HOUR)
AND S1.sample_time
GROUP BY S1.sample_time;
Is the extra column or the query approach better? The query is
technically better because the UPDATE approach will denormalize the
database. However, if the historical data being recorded is not going
to change and computing the moving average is expensive, you might
consider using the column approach.
MS SQL Example:
CREATE TABLE #TestDW
( Date1 datetime,
LoadValue Numeric(13,6)
);
INSERT INTO #TestDW VALUES('2012-06-09' , '3.540' );
INSERT INTO #TestDW VALUES('2012-06-08' , '2.260' );
INSERT INTO #TestDW VALUES('2012-06-07' , '1.330' );
INSERT INTO #TestDW VALUES('2012-06-06' , '5.520' );
INSERT INTO #TestDW VALUES('2012-06-05' , '3.150' );
INSERT INTO #TestDW VALUES('2012-06-04' , '2.230' );
SQL Puzzle query:
SELECT S1.date1, AVG(S2.LoadValue) AS avg_prev_3_days
FROM #TestDW AS S1, #TestDW AS S2
WHERE S2.date1
BETWEEN DATEADD(d, -2, S1.date1 )
AND S1.date1
GROUP BY S1.date1
order by 1;
One way to do this is to join on the same table a few times.
select
(Current.Clicks
+ isnull(P1.Clicks, 0)
+ isnull(P2.Clicks, 0)
+ isnull(P3.Clicks, 0)) / 4 as MovingAvg3
from
MyTable as Current
left join MyTable as P1 on P1.Date = DateAdd(day, -1, Current.Date)
left join MyTable as P2 on P2.Date = DateAdd(day, -2, Current.Date)
left join MyTable as P3 on P3.Date = DateAdd(day, -3, Current.Date)
Adjust the DateAdd component of the ON-Clauses to match whether you want your moving average to be strictly from the past-through-now or days-ago through days-ahead.
This works nicely for situations where you need a moving average over only a few data points.
This is not an optimal solution for moving averages with more than a few data points.
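On engines with window functions (SQL Server 2012+, Oracle, PostgreSQL and others), the self-join can be avoided entirely; a sketch against the same hypothetical MyTable:
SELECT [Date],
Clicks,
-- average of the current row and up to two preceding rows; the cast avoids integer division
AVG(CAST(Clicks AS decimal(10,2)))
OVER (ORDER BY [Date] ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS MovingAvg3
FROM MyTable;
The first two rows will average over fewer than three values; a CASE on COUNT(*) OVER the same window can turn those into NULLs if that matters to you.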
select t2.date, round(sum(ct.clicks)/3) as avg_clicks
from
(select date from clickstable) as t2,
(select date, clicks from clickstable) as ct
where datediff(t2.date, ct.date) between 0 and 2
group by t2.date
Obviously you can change the interval to whatever you need. You could also use count() instead of a magic number to make it easier to change, but that will also slow it down.
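For example, the count() variant of the same query (still against the same hypothetical clickstable) would look something like this:
select t2.date, round(sum(ct.clicks) / count(ct.clicks)) as avg_clicks
from
(select date from clickstable) as t2,
(select date, clicks from clickstable) as ct
where datediff(t2.date, ct.date) between 0 and 2
group by t2.date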
General template for rolling averages that scales well for large data sets
WITH moving_avg AS (
SELECT 0 AS [lag] UNION ALL
SELECT 1 AS [lag] UNION ALL
SELECT 2 AS [lag] UNION ALL
SELECT 3 AS [lag] --ETC
)
SELECT
DATEADD(day,[lag],[date]) AS [reference_date],
[otherkey1],[otherkey2],[otherkey3],
AVG([value1]) AS [avg_value1],
AVG([value2]) AS [avg_value2]
FROM [data_table]
CROSS JOIN moving_avg
GROUP BY [otherkey1],[otherkey2],[otherkey3],DATEADD(day,[lag],[date])
ORDER BY [otherkey1],[otherkey2],[otherkey3],[reference_date];
And for weighted rolling averages:
WITH weighted_avg AS (
SELECT 0 AS [lag], 1.0 AS [weight] UNION ALL
SELECT 1 AS [lag], 0.6 AS [weight] UNION ALL
SELECT 2 AS [lag], 0.3 AS [weight] UNION ALL
SELECT 3 AS [lag], 0.1 AS [weight] --ETC
)
SELECT
DATEADD(day,[lag],[date]) AS [reference_date],
[otherkey1],[otherkey2],[otherkey3],
AVG([value1] * [weight]) / AVG([weight]) AS [wavg_value1],
AVG([value2] * [weight]) / AVG([weight]) AS [wavg_value2]
FROM [data_table]
CROSS JOIN weighted_avg
GROUP BY [otherkey1],[otherkey2],[otherkey3],DATEADD(day,[lag],[date])
ORDER BY [otherkey1],[otherkey2],[otherkey3],[reference_date];
select *
, (select avg(c2.clicks) from #clicks_table c2
where c2.date between dateadd(dd, -2, c1.date) and c1.date) mov_avg
from #clicks_table c1
Use a different join predicate:
SELECT current.date
,avg(periods.clicks)
FROM current left outer join current as periods
ON current.date BETWEEN dateadd(d,-2, periods.date) AND periods.date
GROUP BY current.date HAVING COUNT(*) >= 3
The having statement will prevent any dates without at least N values from being returned.
assume x is the value to be averaged and xDate is the date value:
SELECT t1.xDate, (SELECT avg(t2.x) FROM myTable t2 WHERE t2.xDate BETWEEN dateadd(d, -2, t1.xDate) AND t1.xDate) AS moving_avg
FROM myTable t1
In hive, maybe you could try
select date, clicks, avg(clicks) over (order by date rows between 2 preceding and current row) as moving_avg from clicktable;
For this purpose, I'd like to create an auxiliary/dimensional date table like
create table date_dim(date date, date_1 date, date_2 date, date_3 date, ...)
where date is the key; date_1 is for this day, date_2 covers this day and the day before, date_3 covers three days, and so on.
Then you can do an equi-join in Hive.
Using a view like:
select date, date from date_dim
union all
select date, date_add(date, -1) from date_dim
union all
select date, date_add(date, -2) from date_dim
union all
select date, date_add(date, -3) from date_dim
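The equi-join itself would then look something like this (a sketch; date_windows is an assumed name for a view built from the UNION ALL above, with columns anchor_date and member_date, and clicktable is the table from the earlier Hive answer):
-- assuming: create view date_windows (anchor_date, member_date) as <the union all select above>;
select w.anchor_date, avg(c.clicks) as moving_avg
from date_windows w
join clicktable c on c.date = w.member_date
group by w.anchor_date;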
NOTE: THIS IS NOT AN ANSWER but an enhanced code sample of Diego Scaravaggi's answer. I am posting it as an answer because the comment section is insufficient. Note that I have parameterized the period for the moving average.
declare @p int = 3
declare @t table(d int, bal float)
insert into @t values
(1,94),
(2,99),
(3,76),
(4,74),
(5,48),
(6,55),
(7,90),
(8,77),
(9,16),
(10,19),
(11,66),
(12,47)
select a.d, avg(b.bal)
from
@t a
left join @t b on b.d between a.d-(@p-1) and a.d
group by a.d
--@p1 is period of moving average, @o1 is offset
declare @p1 as int
declare @o1 as int
set @p1 = 5;
set @o1 = 3;
with np as(
select *, rank() over(partition by cmdty, tenor order by markdt) as r
from p_prices p1
where
1=1
)
, x1 as (
select s1.*, avg(s2.val) as avgval from np s1
inner join np s2
on s1.cmdty = s2.cmdty and s1.tenor = s2.tenor
and s2.r between s1.r - (@p1 - 1) - (@o1) and s1.r - (@o1)
group by s1.cmdty, s1.tenor, s1.markdt, s1.val, s1.r
)
select * from x1
I'm not sure that your expected result (output) shows a classic "simple moving (rolling) average" for 3 days, because, by definition, the first triple of numbers gives:
ThreeDaysMovingAverage = (2.230 + 3.150 + 5.520) / 3 = 3.6333333
but you expect 4.360 and it's confusing.
Nevertheless, I suggest the following solution, which uses the window function AVG. This approach is much more efficient (clearer and less resource-intensive) than the self-joins introduced in other answers (and I'm surprised that no one has offered a better solution).
-- Oracle-SQL dialect
with
data_table as (
select date '2012-05-01' AS dt, 2.230 AS clicks from dual union all
select date '2012-05-02' AS dt, 3.150 AS clicks from dual union all
select date '2012-05-03' AS dt, 5.520 AS clicks from dual union all
select date '2012-05-04' AS dt, 1.330 AS clicks from dual union all
select date '2012-05-05' AS dt, 2.260 AS clicks from dual union all
select date '2012-05-06' AS dt, 3.540 AS clicks from dual union all
select date '2012-05-07' AS dt, 2.330 AS clicks from dual
),
param as (select 3 days from dual)
select
dt AS "Date",
clicks AS "Clicks",
case when rownum >= p.days then
avg(clicks) over (order by dt
rows between p.days - 1 preceding and current row)
end
AS "3 day Moving Average"
from data_table t, param p;
You can see that AVG is wrapped in case when rownum >= p.days then ... end to force NULLs in the first rows, where a "3 day Moving Average" is meaningless.
We can apply Joe Celko's "dirty" left outer join method (as cited above by Diego Scaravaggi) to answer the question as it was asked.
declare #ClicksTable table ([Date] date, Clicks int)
insert into #ClicksTable
select '2012-05-01', 2230 union all
select '2012-05-02', 3150 union all
select '2012-05-03', 5520 union all
select '2012-05-04', 1330 union all
select '2012-05-05', 2260 union all
select '2012-05-06', 3540 union all
select '2012-05-07', 2330
This query:
SELECT
T1.[Date],
T1.Clicks,
-- AVG ignores NULL values so we have to explicitly NULLify
-- the days when we don't have a full 3-day sample
CASE WHEN count(T2.[Date]) < 3 THEN NULL
ELSE AVG(T2.Clicks)
END AS [3-Day Moving Average]
FROM #ClicksTable T1
LEFT OUTER JOIN #ClicksTable T2
ON T2.[Date] BETWEEN DATEADD(d, -2, T1.[Date]) AND T1.[Date]
GROUP BY T1.[Date]
Generates the requested output:
Date Clicks 3-Day Moving Average
2012-05-01 2,230
2012-05-02 3,150
2012-05-03 5,520 4,360
2012-05-04 1,330 3,330
2012-05-05 2,260 3,120
2012-05-06 3,540 3,320
2012-05-07 2,330 3,010

sql db2 select records from either table

I have an order file, with order id and ship date. Orders can only be shipped monday - friday. This means there are no records selected for Saturday and Sunday.
I use the same order file to get all order dates, with date in the same format (yyyymmdd).
I want to select a count of all the records from the order file based on order date... and (I believe) full outer join (or maybe right join?) the date file... because I would like to see
20120330 293
20120331 0
20120401 0
20120402 920
20120403 430
20120404 827
etc...
However, my SQL statement is still not returning a zero record for the 31st and 1st.
with DatesTable as (
select ohordt "Date" from kivalib.orhdrpf
where ohordt between 20120315 and 20120406
group by ohordt order by ohordt
)
SELECT ohscdt, count(OHTXN#) "Count"
FROM KIVALIB.ORHDRPF full outer join DatesTable dts on dts."Date" = ohordt
--/*order status = filled & order type = 1 & date between (some fill date range)*/
WHERE OHSTAT = 'F' AND OHTYP = 1 and ohscdt between 20120401 and 20120406
GROUP BY ohscdt ORDER BY ohscdt
Any ideas what I'm doing wrong?
Thanks!
It's because there is no data for those days, so they do not show up as rows. You can use a recursive CTE to build a contiguous list of dates between two values that the query can join on:
It will look something like:
WITH dates (val) AS (
SELECT CAST('2012-04-01' AS DATE)
FROM SYSIBM.SYSDUMMY1
UNION ALL
SELECT Val + 1 DAYS
FROM dates
WHERE Val < CAST('2012-04-06' AS DATE)
)
SELECT d.val AS "Date", o.ohscdt, COALESCE(COUNT(o.ohtxn#), 0) AS "Count"
FROM dates AS d
LEFT JOIN KIVALIB.ORHDRPF AS o
ON o.ohordt = TO_CHAR(d.val, 'YYYYMMDD')
WHERE o.ohstat = 'F'
AND o.ohtyp = 1
GROUP BY d.val, o.ohscdt
ORDER BY d.val
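One caveat worth flagging (my own note, not tested against the poster's data): filtering the left-joined table in the WHERE clause effectively turns it back into an inner join, so the zero-count days drop out again. Moving those predicates into the ON clause keeps them:
WITH dates (val) AS (
SELECT CAST('2012-04-01' AS DATE)
FROM SYSIBM.SYSDUMMY1
UNION ALL
SELECT val + 1 DAYS
FROM dates
WHERE val < CAST('2012-04-06' AS DATE)
)
SELECT d.val AS "Date", COALESCE(COUNT(o.ohtxn#), 0) AS "Count"
FROM dates AS d
LEFT JOIN KIVALIB.ORHDRPF AS o
ON o.ohordt = TO_CHAR(d.val, 'YYYYMMDD')
AND o.ohstat = 'F'
AND o.ohtyp = 1
GROUP BY d.val
ORDER BY d.val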