SQL Grouping around gaps

In SQL Server 2005 I have a table with data that looks something like this:
WTN            Date
555-111-1212   2009-01-01
555-111-1212   2009-01-02
555-111-1212   2009-01-03
555-111-1212   2009-01-15
555-111-1212   2009-01-16
212-999-5555   2009-01-01
212-999-5555   2009-01-10
212-999-5555   2009-01-11
From this I would like to extract WTN, MIN(Date), MAX(Date). The twist is that I would also like to break whenever there is a gap in the dates, so from the above data my results should look like:
WTN            MinDate      MaxDate
555-111-1212   2009-01-01   2009-01-03
555-111-1212   2009-01-15   2009-01-16
212-999-5555   2009-01-01   2009-01-01
212-999-5555   2009-01-10   2009-01-11
How can I do this in a SQL Select/ Group By?
Can this be done without a table or list enumerating the values I want to identify gaps in (Dates here)?

Why is everyone so dead set against using a table for this kind of thing? A table of numbers or a calendar table takes up so little space and is probably in memory anyway if it's referenced often enough. You can also derive a numbers table pretty easily on the fly using ROW_NUMBER(). Using a numbers table can also make the query easier to understand. But here is a not-so-straightforward example, a trick I picked up from Plamen Ratchev a while back; hope it helps.
DECLARE @wtns TABLE
(
WTN CHAR(12),
[Date] SMALLDATETIME
);
INSERT @wtns(WTN, [Date])
SELECT '555-111-1212','2009-01-01'
UNION ALL SELECT '555-111-1212','2009-01-02'
UNION ALL SELECT '555-111-1212','2009-01-03'
UNION ALL SELECT '555-111-1212','2009-01-15'
UNION ALL SELECT '555-111-1212','2009-01-16'
UNION ALL SELECT '212-999-5555','2009-01-01'
UNION ALL SELECT '212-999-5555','2009-01-10'
UNION ALL SELECT '212-999-5555','2009-01-11';
WITH x AS
(
SELECT
[Date],
wtn,
part = DATEDIFF(DAY, 0, [Date])
+ DENSE_RANK() OVER
(
PARTITION BY wtn
ORDER BY [Date] DESC
)
FROM @wtns
)
SELECT
WTN,
MinDate = MIN([Date]),
MaxDate = MAX([Date])
FROM
x
GROUP BY
part,
WTN
ORDER BY
WTN DESC,
MaxDate;
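For comparison, here is a hedged sketch of the more common gaps-and-islands formulation, which subtracts a per-WTN ROW_NUMBER() from the day number so that consecutive dates share the same group key. It assumes the same @wtns table variable populated above.
-- Sketch only: classic gaps-and-islands via DATEDIFF minus ROW_NUMBER().
-- Consecutive dates for a WTN yield the same grp value, so grouping by
-- (WTN, grp) splits each number at every gap in its dates.
WITH g AS
(
    SELECT
        WTN,
        [Date],
        grp = DATEDIFF(DAY, 0, [Date])
            - ROW_NUMBER() OVER (PARTITION BY WTN ORDER BY [Date])
    FROM @wtns
)
SELECT
    WTN,
    MinDate = MIN([Date]),
    MaxDate = MAX([Date])
FROM g
GROUP BY WTN, grp
ORDER BY WTN DESC, MaxDate;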

Your problem has to do with INTERVAL TYPES and a thing called the PACKED NORMAL FORM of a relation.
The issues are discussed at length in "Temporal Data and the Relational Model".
Don't expect any SQL system to really help you with such problems.
Some tutorial systems notwithstanding, the only DBMS I know of that offers decent support for such problems is my own. No link, because I don't want to do too much "plugging" here.

You can do this with GROUP BY, by detecting the boundaries:
WITH Boundaries
AS (
SELECT m.WTN
,m.Date
,CASE WHEN p.Date IS NULL THEN 1
ELSE 0
END AS IsStart
,CASE WHEN n.Date IS NULL THEN 1
ELSE 0
END AS IsEnd
FROM so1590166 AS m
LEFT JOIN so1590166 AS p
ON p.WTN = m.WTN
AND p.Date = DATEADD(d, -1, m.Date)
LEFT JOIN so1590166 AS n
ON n.WTN = m.WTN
AND n.Date = DATEADD(d, 1, m.Date)
WHERE p.Date IS NULL
OR n.Date IS NULL
)
SELECT l.WTN
,l.Date AS MinDate
,MIN(r.Date) AS MaxDate
FROM Boundaries l
INNER JOIN Boundaries r
ON r.WTN = l.WTN
AND r.Date >= l.Date
AND l.IsStart = 1
AND r.IsEnd = 1
GROUP BY l.WTN
,l.Date
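To run the query above against the question's sample data, the so1590166 table it references can be created with a minimal setup sketch like this (the column types are assumptions):
-- Sketch only: sample data (SQL Server 2005 friendly) so the Boundaries query
-- can be run as-is against the question's rows.
CREATE TABLE so1590166 (WTN CHAR(12), [Date] SMALLDATETIME);
INSERT INTO so1590166 (WTN, [Date])
          SELECT '555-111-1212', '2009-01-01'
UNION ALL SELECT '555-111-1212', '2009-01-02'
UNION ALL SELECT '555-111-1212', '2009-01-03'
UNION ALL SELECT '555-111-1212', '2009-01-15'
UNION ALL SELECT '555-111-1212', '2009-01-16'
UNION ALL SELECT '212-999-5555', '2009-01-01'
UNION ALL SELECT '212-999-5555', '2009-01-10'
UNION ALL SELECT '212-999-5555', '2009-01-11';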


Query without WHILE Loop

We have an appointment table as shown below. Each appointment needs to be categorized as "New" or "Followup". Any appointment (for a patient) within 30 days of that patient's first appointment is a "Followup". After 30 days, an appointment is again "New", and any appointment within 30 days of that one becomes a "Followup".
I am currently doing this with a WHILE loop.
How can I achieve this without a WHILE loop?
Table
CREATE TABLE #Appt1 (ApptID INT, PatientID INT, ApptDate DATE)
INSERT INTO #Appt1
SELECT 1,101,'2020-01-05' UNION
SELECT 2,505,'2020-01-06' UNION
SELECT 3,505,'2020-01-10' UNION
SELECT 4,505,'2020-01-20' UNION
SELECT 5,101,'2020-01-25' UNION
SELECT 6,101,'2020-02-12' UNION
SELECT 7,101,'2020-02-20' UNION
SELECT 8,101,'2020-03-30' UNION
SELECT 9,303,'2020-01-28' UNION
SELECT 10,303,'2020-02-02'
You need to use a recursive query.
The 30-day period is counted starting from the previous "New" appointment (and no, it is not possible to do this without recursion/a quirky update/a loop). That is why all the existing answers using only ROW_NUMBER fail.
WITH f AS (
SELECT *, rn = ROW_NUMBER() OVER(PARTITION BY PatientId ORDER BY ApptDate)
FROM Appt1
), rec AS (
SELECT Category = CAST('New' AS NVARCHAR(20)), ApptId, PatientId, ApptDate, rn, startDate = ApptDate
FROM f
WHERE rn = 1
UNION ALL
SELECT CAST(CASE WHEN DATEDIFF(DAY, rec.startDate,f.ApptDate) <= 30 THEN N'FollowUp' ELSE N'New' END AS NVARCHAR(20)),
f.ApptId,f.PatientId,f.ApptDate, f.rn,
CASE WHEN DATEDIFF(DAY, rec.startDate, f.ApptDate) <= 30 THEN rec.startDate ELSE f.ApptDate END
FROM rec
JOIN f
ON rec.rn = f.rn - 1
AND rec.PatientId = f.PatientId
)
SELECT ApptId, PatientId, ApptDate, Category
FROM rec
ORDER BY PatientId, ApptDate;
db<>fiddle demo
Output:
+--------+-----------+------------+----------+
| ApptId | PatientId | ApptDate   | Category |
+--------+-----------+------------+----------+
| 1      | 101       | 2020-01-05 | New      |
| 5      | 101       | 2020-01-25 | FollowUp |
| 6      | 101       | 2020-02-12 | New      |
| 7      | 101       | 2020-02-20 | FollowUp |
| 8      | 101       | 2020-03-30 | New      |
| 9      | 303       | 2020-01-28 | New      |
| 10     | 303       | 2020-02-02 | FollowUp |
| 2      | 505       | 2020-01-06 | New      |
| 3      | 505       | 2020-01-10 | FollowUp |
| 4      | 505       | 2020-01-20 | FollowUp |
+--------+-----------+------------+----------+
How it works:
f - gets the starting point (the anchor, per PatientId)
rec - the recursive part: if the difference between the current value and the previous starting point is > 30, change the category and the starting point, in the context of each PatientId
Main query - displays the sorted result set
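One practical caveat not mentioned in the answer: SQL Server limits recursive CTEs to 100 recursion levels by default, so a patient with more than roughly 100 appointments would make this query fail. Raising the limit is a one-line hint on the outer statement, sketched below against the same query.
-- Sketch only: lift the default recursion limit for patients with many appointments.
SELECT ApptId, PatientId, ApptDate, Category
FROM rec
ORDER BY PatientId, ApptDate
OPTION (MAXRECURSION 0);  -- 0 = no limit; or pick an explicit upper bound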
A similar class of problems:
Conditional SUM on Oracle - Capping a windowed function
Session window (Azure Stream Analytics)
Running Total until specific condition is true - Quirky update
Addendum
Do not ever use this code in production!
But another option worth mentioning, besides the CTE, is to use a temp table and update it in "rounds".
It can even be done in a "single" round (a quirky update):
CREATE TABLE Appt_temp (ApptID INT , PatientID INT, ApptDate DATE, Category NVARCHAR(10))
INSERT INTO Appt_temp(ApptId, PatientId, ApptDate)
SELECT ApptId, PatientId, ApptDate
FROM Appt1;
CREATE CLUSTERED INDEX Idx_appt ON Appt_temp(PatientID, ApptDate);
Query:
DECLARE @PatientId INT = 0,
@PrevPatientId INT,
@FirstApptDate DATE = NULL;
UPDATE Appt_temp
SET @PrevPatientId = @PatientId
,@PatientId = PatientID
,@FirstApptDate = CASE WHEN @PrevPatientId <> @PatientId THEN ApptDate
WHEN DATEDIFF(DAY, @FirstApptDate, ApptDate)>30 THEN ApptDate
ELSE @FirstApptDate
END
,Category = CASE WHEN @PrevPatientId <> @PatientId THEN 'New'
WHEN @FirstApptDate = ApptDate THEN 'New'
ELSE 'FollowUp'
END
FROM Appt_temp WITH(INDEX(Idx_appt))
OPTION (MAXDOP 1);
SELECT * FROM Appt_temp ORDER BY PatientId, ApptDate;
db<>fiddle Quirky update
You could do this with a recursive cte. You should first order by apptDate within each patient. That can be accomplished by a run-of-the-mill cte.
Then, in the anchor portion of your recursive cte, select the first ordering for each patient, mark the status as 'new', and also mark the apptDate as the date of the most recent 'new' record.
In the recursive portion of your recursive cte, increment to the next appointment, calculate the difference in days between the present appointment and the most recent 'new' appointment date. If it's greater than 30 days, mark it 'new' and reset the most recent new appointment date. Otherwise mark it as 'follow up' and just pass along the existing days since new appointment date.
Finally, in the base query, just select the columns you want.
with orderings as (
select *,
rn = row_number() over(
partition by patientId
order by apptDate
)
from #appt1 a
),
markings as (
select apptId,
patientId,
apptDate,
rn,
type = convert(varchar(10),'new'),
dateOfNew = apptDate
from orderings
where rn = 1
union all
select o.apptId, o.patientId, o.apptDate, o.rn,
type = convert(varchar(10),iif(ap.daysSinceNew > 30, 'new', 'follow up')),
dateOfNew = iif(ap.daysSinceNew > 30, o.apptDate, m.dateOfNew)
from markings m
join orderings o
on m.patientId = o.patientId
and m.rn + 1 = o.rn
cross apply (select daysSinceNew = datediff(day, m.dateOfNew, o.apptDate)) ap
)
select apptId, patientId, apptDate, type
from markings
order by patientId, rn;
I should mention that I initially deleted this answer because Abhijeet Khandagale's answer seemed to meet your needs with a simpler query (after reworking it a bit). But with your comment to him about your business requirement and your added sample data, I undeleted mine because I believe this one meets your needs.
I'm not sure that it's exactly what you implemented, but another option worth mentioning, besides using a CTE, is to use a temp table and update it in "rounds". We keep updating the temp table until all statuses are set correctly, building the result iteratively, and we can control the number of iterations with a simple local variable.
Each iteration is split into two stages.
First, set the Followup value for all records that are near a New record. That's pretty easy to do with the right filter.
Second, for the remaining records that don't have a status set yet, take the first one in each PatientID group and mark it New, since it was not picked up by the first stage.
So:
CREATE TABLE #Appt2 (ApptID INT, PatientID INT, ApptDate DATE, AppStatus nvarchar(100))
select * from #Appt1
insert into #Appt2 (ApptID, PatientID, ApptDate, AppStatus)
select a1.ApptID, a1.PatientID, a1.ApptDate, null from #Appt1 a1
declare @limit int = 0;
while (exists(select * from #Appt2 where AppStatus IS NULL) and @limit < 1000)
begin
set @limit = @limit+1;
update a2
set
a2.AppStatus = IIF(exists(
select *
from #Appt2 a
where
0 > DATEDIFF(day, a2.ApptDate, a.ApptDate)
and DATEDIFF(day, a2.ApptDate, a.ApptDate) > -30
and a.ApptID != a2.ApptID
and a.PatientID = a2.PatientID
and a.AppStatus = 'New'
), 'Followup', a2.AppStatus)
from #Appt2 a2
--select * from #Appt2
update a2
set a2.AppStatus = 'New'
from #Appt2 a2 join (select a.*, ROW_NUMBER() over (Partition By PatientId order by ApptId) rn from (select * from #Appt2 where AppStatus IS NULL) a) ar
on a2.ApptID = ar.ApptID
and ar.rn = 1
--select * from #Appt2
end
select * from #Appt2 order by PatientID, ApptDate
drop table #Appt1
drop table #Appt2
Update: read the comment provided by Lukasz. It's a far smarter way. I leave my answer here just as an idea.
I believe the recursive common table expression is a great way to optimize queries and avoid loops, but in some cases it can lead to bad performance and should be avoided if possible.
I used the code below to solve the issue and tested it with more values, but I encourage you to test it with your real data, too.
WITH DataSource AS
(
SELECT *
,CEILING(DATEDIFF(DAY, MIN([ApptDate]) OVER (PARTITION BY [PatientID]), [ApptDate]) * 1.0 / 30 + 0.000001) AS [GroupID]
FROM #Appt1
)
SELECT *
,IIF(ROW_NUMBER() OVER (PARTITION BY [PatientID], [GroupID] ORDER BY [ApptDate]) = 1, 'New', 'Followup')
FROM DataSource
ORDER BY [PatientID]
,[ApptDate];
The idea is pretty simple: I want to separate the records into groups (of 30 days), in which the earliest record of each group is New and the others are follow-ups. Check how the statement is built:
SELECT *
,DATEDIFF(DAY, MIN([ApptDate]) OVER (PARTITION BY [PatientID]), [ApptDate])
,DATEDIFF(DAY, MIN([ApptDate]) OVER (PARTITION BY [PatientID]), [ApptDate]) * 1.0 / 30
,CEILING(DATEDIFF(DAY, MIN([ApptDate]) OVER (PARTITION BY [PatientID]), [ApptDate]) * 1.0 / 30 + 0.000001)
FROM #Appt1
ORDER BY [PatientID]
,[ApptDate];
So:
first, we get the first date for each patient and calculate the difference in days between it and the current row
then, because we want groups, * 1.0 / 30 is added
since for 30, 60, 90, etc. days we get a whole number but want to start a new period there, I added + 0.000001; we also use the CEILING function to get the smallest integer greater than or equal to the specified numeric expression
That's it. Having such a group, we simply use ROW_NUMBER to find our start date, mark it as New, and leave the rest as follow-ups.
With due respect to everybody, and IMHO:
There is not much difference between a WHILE loop and a recursive CTE in terms of RBAR.
There is not much performance gain from using a recursive CTE and window/partition functions all in one.
ApptID should be INT IDENTITY(1,1), or it should be an ever-increasing clustered index.
Apart from its other benefits, this also ensures that every successive row's ApptDate for a patient is greater.
This way you can easily work with ApptID in your query, which is more efficient than putting inequality operators like >,< on ApptDate.
Putting inequality operators like >,< on ApptID will aid the SQL optimizer.
Also, there should be two date columns in the table, like
ApptDateTime datetime2(0) not null,
ApptDate date not null
as these are the most important columns in the most important table, so there is not much CAST/CONVERT.
A nonclustered index can then be created on ApptDate:
CREATE NONCLUSTERED INDEX ix_PID_AppDate_App ON APP (PatientID, ApptDate) INCLUDE (any other columns used by the query that are not in the predicate, except ApptID)
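For the question's sample temp table, a concrete version of that index might look like the sketch below (the INCLUDE list is an assumption; it would hold whatever non-key columns the query reads):
-- Sketch only: covering index on the (PatientID, ApptDate) pair used by the predicates.
CREATE NONCLUSTERED INDEX ix_PatientID_ApptDate
    ON #Appt1 (PatientID, ApptDate)
    INCLUDE (ApptID);  -- assumed include column; adjust to the columns your query selects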
Test my script with other sample data and let me know for which sample data it does not work.
Even if it does not work, I am sure it can be fixed within my script's logic.
CREATE TABLE #Appt1 (ApptID INT, PatientID INT, ApptDate DATE)
INSERT INTO #Appt1
SELECT 1,101,'2020-01-05' UNION ALL
SELECT 2,505,'2020-01-06' UNION ALL
SELECT 3,505,'2020-01-10' UNION ALL
SELECT 4,505,'2020-01-20' UNION ALL
SELECT 5,101,'2020-01-25' UNION ALL
SELECT 6,101,'2020-02-12' UNION ALL
SELECT 7,101,'2020-02-20' UNION ALL
SELECT 8,101,'2020-03-30' UNION ALL
SELECT 9,303,'2020-01-28' UNION ALL
SELECT 10,303,'2020-02-02'
;With CTE as
(
select a1.* ,a2.ApptDate as NewApptDate
from #Appt1 a1
outer apply(select top 1 a2.ApptID ,a2.ApptDate
from #Appt1 A2
where a1.PatientID=a2.PatientID and a1.ApptID>a2.ApptID
and DATEDIFF(day,a2.ApptDate, a1.ApptDate)>30
order by a2.ApptID desc )A2
)
,CTE1 as
(
select a1.*, a2.ApptDate as FollowApptDate
from CTE A1
outer apply(select top 1 a2.ApptID ,a2.ApptDate
from #Appt1 A2
where a1.PatientID=a2.PatientID and a1.ApptID>a2.ApptID
and DATEDIFF(day,a2.ApptDate, a1.ApptDate)<=30
order by a2.ApptID desc )A2
)
select *
,case when FollowApptDate is null then 'New'
when NewApptDate is not null and FollowApptDate is not null
and DATEDIFF(day,NewApptDate, FollowApptDate)<=30 then 'New'
else 'Followup' end
as Category
from cte1 a1
order by a1.PatientID
drop table #Appt1
Although it's not clearly addressed in the question, it's easy to figure out that the appointment dates cannot simply be categorized into 30-day groups; it makes no business sense. And you cannot use the appt ID either, since one can make a new appointment today for 2020-09-06.
Here is how I address this issue. First, get the first appointment, then calculate the date difference between each appointment and that first one. If it's 0, set the row to 'New'. If <= 30, 'Followup'. If > 30, set it to 'Undecided' and do the next round of checks until there are no more 'Undecided' rows. For that you really do need a WHILE loop, but it does not loop through each appointment date, only through a few result sets. I checked the execution plan: even though there are only 10 rows, the query cost is significantly lower than with the recursive CTE, though not as low as Lukasz Szozda's addendum method.
IF OBJECT_ID('tempdb..#TEMPTABLE') IS NOT NULL DROP TABLE #TEMPTABLE
SELECT ApptID, PatientID, ApptDate
,CASE WHEN (DATEDIFF(DAY, MIN(ApptDate) OVER (PARTITION BY PatientID), ApptDate) = 0) THEN 'New'
WHEN (DATEDIFF(DAY, MIN(ApptDate) OVER (PARTITION BY PatientID), ApptDate) <= 30) THEN 'Followup'
ELSE 'Undecided' END AS Category
INTO #TEMPTABLE
FROM #Appt1
WHILE EXISTS(SELECT TOP 1 * FROM #TEMPTABLE WHERE Category = 'Undecided') BEGIN
;WITH CTE AS (
SELECT ApptID, PatientID, ApptDate
,CASE WHEN (DATEDIFF(DAY, MIN(ApptDate) OVER (PARTITION BY PatientID), ApptDate) = 0) THEN 'New'
WHEN (DATEDIFF(DAY, MIN(ApptDate) OVER (PARTITION BY PatientID), ApptDate) <= 30) THEN 'Followup'
ELSE 'Undecided' END AS Category
FROM #TEMPTABLE
WHERE Category = 'Undecided'
)
UPDATE #TEMPTABLE
SET Category = CTE.Category
FROM #TEMPTABLE t
LEFT JOIN CTE ON CTE.ApptID = t.ApptID
WHERE t.Category = 'Undecided'
END
SELECT ApptID, PatientID, ApptDate, Category
FROM #TEMPTABLE
I hope this will help you.
WITH CTE AS
(
SELECT #Appt1.*, RowNum = ROW_NUMBER() OVER (PARTITION BY PatientID ORDER BY ApptDate, ApptID) FROM #Appt1
)
SELECT A.ApptID , A.PatientID , A.ApptDate ,
Expected_Category = CASE WHEN (DATEDIFF(MONTH, B.ApptDate, A.ApptDate) > 0) THEN 'New'
WHEN (DATEDIFF(DAY, B.ApptDate, A.ApptDate) <= 30) then 'Followup'
ELSE 'New' END
FROM CTE A
LEFT OUTER JOIN CTE B on A.PatientID = B.PatientID
AND A.rownum = B.rownum + 1
ORDER BY A.PatientID, A.ApptDate
You could use a Case statement.
select
*,
CASE
WHEN DATEDIFF(d,A1.ApptDate,A2.ApptDate)>30 THEN 'New'
ELSE 'FollowUp'
END 'Category'
from
(SELECT PatientId, MIN(ApptId) 'ApptId', MIN(ApptDate) 'ApptDate' FROM #Appt1 GROUP BY PatientID) A1,
#Appt1 A2
where
A1.PatientID=A2.PatientID AND A1.ApptID<A2.ApptID
The question is whether this category should be assigned based on the initial appointment or the one prior. That is, if a patient has had three appointments, should we compare the third appointment to the first, or to the second?
Your problem states the first, which is how I've answered. If that's not the case, you'll want to use LAG.
Also, keep in mind that DATEDIFF makes no exception for weekends. If this should count weekdays only, you'll need to create your own scalar-valued function.
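A minimal sketch of such a function, assuming "weekend" means Saturday and Sunday; the function name and the counting convention (days strictly after @from, up to and including @to) are my own choices, and the weekday-name comparison assumes an English session language:
-- Sketch only: counts weekdays between two dates, excluding Saturdays and Sundays.
CREATE FUNCTION dbo.WeekdayDiff (@from DATE, @to DATE)
RETURNS INT
AS
BEGIN
    DECLARE @days INT = 0, @d DATE = DATEADD(DAY, 1, @from);
    WHILE @d <= @to
    BEGIN
        -- DATENAME returns language-dependent names; 'Saturday'/'Sunday' assumes English.
        IF DATENAME(WEEKDAY, @d) NOT IN ('Saturday', 'Sunday')
            SET @days = @days + 1;
        SET @d = DATEADD(DAY, 1, @d);
    END;
    RETURN @days;
END;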
Using the LAG function:
select apptID, PatientID , Apptdate ,
case when date_diff IS NULL THEN 'NEW'
when date_diff < 30 and (date_diff_2 IS NULL or date_diff_2 < 30) THEN 'Follow Up'
ELSE 'NEW'
END AS STATUS FROM
(
select
apptID, PatientID , Apptdate ,
DATEDIFF (day,lag(Apptdate) over (PARTITION BY PatientID order by ApptID asc),Apptdate) date_diff ,
DATEDIFF(day,lag(Apptdate,2) over (PARTITION BY PatientID order by ApptID asc),Apptdate) date_diff_2
from #Appt1
) SRC
Demo --> https://rextester.com/TNW43808
with cte
as
(
select
tmp.*,
IsNull(Lag(ApptDate) Over (partition by PatientID Order by PatientID,ApptDate),ApptDate) PriorApptDate
from #Appt1 tmp
)
select
PatientID,
ApptDate,
PriorApptDate,
DateDiff(d,PriorApptDate,ApptDate) Elapsed,
Case when DateDiff(d,PriorApptDate,ApptDate)>30
or DateDiff(d,PriorApptDate,ApptDate)=0 then 'New' else 'Followup' end Category from cte
Mine is correct. The other answers were incorrect; see the Elapsed column.

Previous row end date as the next row start date in SQL

Need some help please.
I have a field called hist_lastupdated that contains the date of the last modification of a product's price.
Based on this field, I want to extract the start date and the end date of the modification.
I currently have this:
Product_id   Price   hist_lastupdated
284849       18.95   2015-05-29 00:53:55
284849       15.95   2015-08-14 01:04:46
284849       18.95   2016-06-11 00:50:31
284849       15.95   2016-08-24 00:45:11
And I want to get a result like this:
Product_id   Price   hist_lastupdated      start_date            End_date
284849       18.95   2015-05-29 00:53:55   2014-05-01 00:00:00   2015-05-29 00:53:55
284849       15.95   2015-08-14 01:04:46   2015-05-29 00:53:55   2015-08-14 01:04:46
284849       18.95   2016-06-11 00:50:31   2015-08-14 01:04:46   2016-06-11 00:50:31
284849       15.95   2016-08-24 00:45:11   2016-06-11 00:50:31   2016-08-24 00:45:11
In two words: the start date is the end date of the previous line.
I have many product IDs.
Something like this:
select Product_id,
Price,
hist_lastupdated,
lag(hist_lastupdated) over (partition by product_id order by hist_lastupdated) as start_date,
hist_lastupdated as end_date
from the_table
You didn't explain how the start_date for the first row is calculated. If it is the beginning of the month of hist_lastupdated, you can do something like this:
lag(hist_lastupdated, 1, date_trunc('month', hist_lastupdated)) over (...)
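For reference, a hedged sketch of the complete statement with that default; the date_trunc call is PostgreSQL-style, as in the snippet above, and the table name the_table comes from the earlier query:
-- Sketch only: LAG with a default for the first row of each product
-- (falls back to the start of that row's month when there is no previous row).
select Product_id,
       Price,
       hist_lastupdated,
       lag(hist_lastupdated, 1, date_trunc('month', hist_lastupdated))
           over (partition by product_id order by hist_lastupdated) as start_date,
       hist_lastupdated as end_date
from the_table;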
I'm not sure how you would do this with just SQL, but if you're able to do a bit of scripting you can write a quick program that goes something like this (pseudocode):
lines = execute("SELECT product_id, price, hist_lastupdated FROM ProductTable")
startDate = "2014-05-01 00:00:00"
outputLines = []
for row in lines:
    outLine = []
    outLine.append(row[0])     # product_id
    outLine.append(row[1])     # price
    outLine.append(row[2])     # hist_lastupdated
    outLine.append(startDate)  # start_date = previous row's end date
    outLine.append(row[2])     # end_date
    outputLines.append(outLine)
    startDate = row[2]
# Now do what you want with the output you have in a nice list of lists in the
# format you need: insert into a different table, write to a file, whatever you want.
I would use one of these solutions with MS SQL Server; hopefully one of them will apply to your problem.
A pure SQL statement would look like this:
select
t.product_id, t.price, s.start_date, t.end_date
from
product t
outer apply
(
select top 1
end_date start_date
from
product o
where
o.end_date < t.end_date
order by
o.end_date desc
) s
The OUTER APPLY executed for each record returned can be a performance problem even with good indexing.
If your SQL Server version supports the LAG function:
select
t.product_id, t.price,
LAG(T.end_date) over (order by t.end_date),
t.end_date
from
product t
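Since the question mentions having many product IDs, the LAG would usually be partitioned per product as well; a hedged variant of the statement above:
-- Sketch only: same idea, but restart the window for each product_id.
select
    t.product_id, t.price,
    LAG(t.end_date) over (partition by t.product_id order by t.end_date) as start_date,
    t.end_date
from
    product t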
Or you may find a way to do the same thing with variables in an update statement to "remember" the value in the previously updated record like the T-SQL:
-- Insert the desired output into a table variable that also has a start_date field.
-- Be sure to insert the records ordered by the date value.
declare @output table (product_id int, price numeric(10,2), [start_date] datetime, [end_date] datetime)
insert @output (product_id, price, end_date)
select 1, 10, '1/1/2015'
union all select 2, 11, '2/1/2015'
union all select 3, 15, '3/1/2015'
union all select 4, 20, '4/1/2015'
order by 3
-- Update the start date using the end date from the previous record
declare @start_date datetime, @end_date datetime
update
@output
set
@start_date = @end_date,
@end_date = end_date,
start_date = @start_date
select * from @output
I don't think this technique is recommended by Microsoft, but it has served me well and worked consistently. I only used this technique with table variables; I would be less inclined to trust the update sequence of records in an actual table. Now I would use LAG() instead.
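As a sketch of that preference, the same result can be read straight off the @output table variable populated above with LAG, no quirky update needed:
-- Sketch only: LAG instead of the quirky update, against the @output data above.
select product_id,
       price,
       LAG(end_date) over (order by end_date) as start_date,
       end_date
from @output;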
This is the solution I found. I wanted to use the LAG function, but the result was not what I wanted.
The solution:
WITH
price_table_1 as (
select
-1 + ROW_NUMBER() over (partition by t1.product_id,t1.id ,t1.channel_id) as rownum_w1,
t1.id,
t1.product_id,
t1.channel_id,
t1.member_id,
t1.quantity,
t1.price,
t1.promo_dt_start,
t1.promo_dt_end,
t1.hist_lastupdated
FROM dwh_prod.hist_prices t1
where t1.channel_id='1004' and t1.product_id = '5896' and t1.quantity = '1' and t1.promo_dt_start is null
order by t1.product_id,t1.channel_id,t1.hist_lastupdated
),price_table_2 as (
select
ROW_NUMBER() over (partition by t2.product_id,t2.id ,t2.channel_id) as rownum_w2,
t2.id,
t2.product_id,
t2.channel_id,
t2.member_id,
t2.quantity,
t2.price,
t2.promo_dt_start,
t2.promo_dt_end,
t2.hist_lastupdated
FROM dwh_prod.hist_prices t2
where t2.channel_id='1004' and t2.product_id = '5896' and t2.quantity = '1' and t2.promo_dt_start is null
order by t2.product_id,t2.channel_id,t2.hist_lastupdated
)
select
t1.id,
t1.product_id,
t1.channel_id,
t1.member_id,
t1.quantity,
t1.price,
t1.promo_dt_start,
t1.promo_dt_end,
t2.hist_lastupdated as start_date,
t1.hist_lastupdated as end_date
FROM price_table_1 t1
inner join price_table_2 t2
on t2.product_id = t1.product_id and t2.id = t1.id and t2.channel_id = t1.channel_id
and rownum_w1 = (rownum_w2)
UNION ALL
select
t1.id,
t1.product_id,
t1.channel_id,
t1.member_id,
t1.quantity,
t1.price,
t1.promo_dt_start,
t1.promo_dt_end,
CONVERT(TIMESTAMP,'2014-01-01') as start_date,
t1.hist_lastupdated as end_date
FROM price_table_1 t1
where rownum_w1 = '0';

Filter LEFT JOINed table with dates to display current event, else future, else past?

I have a table that lists vacation information for different users (username, vacation start, and vacation end dates) -- 4 users are listed below:
Username VacationStart DeploymentEnd
rsuarez 2014-03-10 2014-03-26
studd 2014-01-18 2014-01-29
studd 2014-02-11 2014-02-26
studd 2014-03-02 2014-03-04
ssteele 2014-03-11 2014-03-26
ssteele 2014-03-18 2014-03-28
atidball 2014-03-05 2014-03-20
atidball 2014-03-06 2014-03-26
atidball 2014-03-13 2014-03-20
atidball 2014-03-18 2014-03-31
For a new query, I want to display only 4 rows, with each user having only one set of vacation dates displayed: either the current/in-progress vacation, the future/next vacation (if no current one exists), or the most recent one (if the two above don't apply).
The end result should be following (assuming today is 3/9/2014):
Username VacationStart DeploymentEnd
rsuarez 2014-03-10 2014-03-26
studd 2014-03-02 2014-03-04
ssteele 2014-03-11 2014-03-26
atidball 2014-03-05 2014-03-20
The vacation dates actually come from another table (data_vacations), which I LEFT JOIN to data_users. I am trying to perform the case selection inside the LEFT JOIN statement.
Here is what I tried before, but my logic fails, since I ended up mixing different vacation end dates with vacation start dates:
SELECT Username, VacationStart, VacationEnd
FROM data_users
LEFT JOIN
(
SELECT userGUID,
CASE WHEN MIN(CASE WHEN (VacationEnd < getdate()) THEN NULL ELSE VacationStart END) IS NULL THEN MAX(VacationStart)
ELSE MIN(VacationStart) END AS VacationStart,
CASE WHEN MIN(CASE WHEN (VacationEnd < getdate()) THEN NULL ELSE VacationEnd END) IS NULL THEN MAX(VacationEnd)
ELSE MIN(VacationEnd) END AS VacationEnd
FROM data_vacations
GROUP BY userGUID
) b ON(data_empl_master.userGUID= b.userGUID)
What am I doing wrong? How could I fix it?
Also, on a side note: am I performing this filtering in the LEFT JOIN correctly? data_users is much bigger and has distinct user IDs, and I would like to join the available vacation information as in the example above while still displaying all unique user IDs.
Using a common table expression to rank by category (current = 1, future = 2, past = 3) and each category individually by start date/difference from GETDATE(), you can get the result you want by ranking the result using ROW_NUMBER();
DECLARE @DATE DATETIME = GETDATE()
;WITH cte AS (
SELECT *, 1 r, VacationStart s FROM data_users
WHERE @DATE BETWEEN VacationStart and DeploymentEnd
UNION ALL
SELECT *,2 r, VacationStart - @DATE s FROM data_users
WHERE VacationStart > @DATE
UNION ALL
SELECT *,3 r, @DATE - DeploymentEnd s FROM data_users
WHERE DeploymentEnd < @DATE
), cte2 AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY username ORDER BY r,s) rn FROM cte
)
SELECT Username, VacationStart, DeploymentEnd FROM cte2 WHERE rn=1;
An SQLfiddle to test with.
Capturing the date in a variable is necessary to get a consistent GETDATE() value over the whole query; otherwise it may differ between calls.
select u.name,s.startdate,s.enddate
from users u
left join
(
select su.name,
max(su.start) as startdate,
max(su.end) as enddate from users su group by su.name
)s on u.name= s.name
group by u.name
Since you are asking two questions, I will answer the one about getting the vacation dates and let you figure out the join.
I don't think you can get the desired vacation dates in one simple query. First you need to establish whether a given date range is in the past, present, or future. Then you need to order those ranges by start/end dates to get the most recent or next upcoming one: sort past vacations in descending order and upcoming ones in ascending order. Funnily enough, user atidball has two vacations in progress; I sorted those in the same manner as future vacations. Finally, apply your rules, which I did by sorting by state.
declare @currentDate date = '20140309'
;
with cte1 as
(
-- state: the lower number the higher priority
select Username, VacationStart, DeploymentEnd,
case
when VacationStart <= @currentDate and DeploymentEnd >= @currentDate
then 0 -- in progress
when VacationStart > @currentDate
then 1 -- future
when DeploymentEnd < @currentDate
then 2 -- past
else NULL
end as state
from data_vacations
)
, cte2 as
(
select *,
row_number() over(partition by username, state order by VacationStart, DeploymentEnd) as rn
from cte1
where state < 2 -- current or upcoming
union all
select *,
row_number() over(partition by username, state order by DeploymentEnd desc, VacationStart desc) as rn
from cte1
where state = 2 -- past
)
, cte3 as
(
-- apply the rules: find the record with highest priority
select Username, min(state) as minstate
from cte1
group by Username
)
select cte2.Username, cte2.VacationStart, cte2.DeploymentEnd
from cte2
inner join cte3
on cte2.Username = cte3.Username
and cte2.state = cte3.minstate
and cte2.rn = 1 -- most recent or next upcoming
See the SQLFiddle.

Filling in missing dates DB2 SQL

My initial query looks like this:
select process_date, count(*) batchCount
from T1.log_comments
group by process_date
order by process_date asc;
I need to be able to do some quick analysis for weekends that are missing, but wanted to know if there was a quick way to fill in the missing dates not present in process_date.
I've seen the solution here but am curious if there's any magic hidden in db2 that could do this with only a minor modification to my original query.
Note: not tested originally; I framed it based on my exposure to SQL Server/Oracle, but I guess this gives you the idea.
(Now amended and tested on DB2.)
WITH MaxDateQry(MaxDate) AS
(
SELECT MAX(process_date) FROM T1.log_comments
),
MinDateQry(MinDate) AS
(
SELECT MIN(process_date) FROM T1.log_comments
),
DatesData(ProcessDate) AS
(
SELECT MinDate from MinDateQry
UNION ALL
SELECT (ProcessDate + 1 DAY) FROM DatesData WHERE ProcessDate < (SELECT MaxDate FROM MaxDateQry)
)
SELECT a.ProcessDate, b.batchCount
FROM DatesData a LEFT JOIN
(
SELECT process_date, COUNT(*) batchCount
FROM T1.log_comments
GROUP BY process_date
) b
ON a.ProcessDate = b.process_date
ORDER BY a.ProcessDate ASC;

SQL moving average

How do you create a moving average in SQL?
Current table:
Date Clicks
2012-05-01 2,230
2012-05-02 3,150
2012-05-03 5,520
2012-05-04 1,330
2012-05-05 2,260
2012-05-06 3,540
2012-05-07 2,330
Desired table or output:
Date         Clicks   3 day Moving Average
2012-05-01   2,230
2012-05-02   3,150
2012-05-03   5,520    4,360
2012-05-04   1,330    3,330
2012-05-05   2,260    3,120
2012-05-06   3,540    3,320
2012-05-07   2,330    3,010
This is an evergreen Joe Celko question.
I don't know which DBMS platform is being used, but in any case Joe was able to answer it more than 10 years ago with standard SQL.
Joe Celko SQL Puzzles and Answers citation:
"That last update attempt suggests that we could use the predicate to
construct a query that would give us a moving average:"
SELECT S1.sample_time, AVG(S2.load) AS avg_prev_hour_load
FROM Samples AS S1, Samples AS S2
WHERE S2.sample_time
BETWEEN (S1.sample_time - INTERVAL 1 HOUR)
AND S1.sample_time
GROUP BY S1.sample_time;
Is the extra column or the query approach better? The query is
technically better because the UPDATE approach will denormalize the
database. However, if the historical data being recorded is not going
to change and computing the moving average is expensive, you might
consider using the column approach.
MS SQL Example:
CREATE TABLE #TestDW
( Date1 datetime,
LoadValue Numeric(13,6)
);
INSERT INTO #TestDW VALUES('2012-06-09' , '3.540' );
INSERT INTO #TestDW VALUES('2012-06-08' , '2.260' );
INSERT INTO #TestDW VALUES('2012-06-07' , '1.330' );
INSERT INTO #TestDW VALUES('2012-06-06' , '5.520' );
INSERT INTO #TestDW VALUES('2012-06-05' , '3.150' );
INSERT INTO #TestDW VALUES('2012-06-04' , '2.230' );
SQL Puzzle query:
SELECT S1.date1, AVG(S2.LoadValue) AS avg_prev_3_days
FROM #TestDW AS S1, #TestDW AS S2
WHERE S2.date1
BETWEEN DATEADD(d, -2, S1.date1 )
AND S1.date1
GROUP BY S1.date1
order by 1;
One way to do this is to join on the same table a few times.
select
(Current.Clicks
+ isnull(P1.Clicks, 0)
+ isnull(P2.Clicks, 0)
+ isnull(P3.Clicks, 0)) / 4 as MovingAvg3
from
MyTable as Current
left join MyTable as P1 on P1.Date = DateAdd(day, -1, Current.Date)
left join MyTable as P2 on P2.Date = DateAdd(day, -2, Current.Date)
left join MyTable as P3 on P3.Date = DateAdd(day, -3, Current.Date)
Adjust the DateAdd component of the ON-Clauses to match whether you want your moving average to be strictly from the past-through-now or days-ago through days-ahead.
This works nicely for situations where you need a moving average over only a few data points.
This is not an optimal solution for moving averages with more than a few data points.
select t2.date, round(sum(ct.clicks)/3) as avg_clicks
from
(select date from clickstable) as t2,
(select date, clicks from clickstable) as ct
where datediff(t2.date, ct.date) between 0 and 2
group by t2.date
Example here.
Obviously you can change the interval to whatever you need. You could also use count() instead of a magic number to make it easier to change, but that will also slow it down.
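As a sketch of that suggestion, the magic number 3 can be replaced with a COUNT of the rows that actually fell into each window, assuming the same clickstable layout and dialect as the query above:
-- Sketch only: divide by the actual number of rows in each 3-day window
-- instead of a hard-coded 3.
select t2.date, round(sum(ct.clicks) / count(ct.clicks)) as avg_clicks
from
(select date from clickstable) as t2,
(select date, clicks from clickstable) as ct
where datediff(t2.date, ct.date) between 0 and 2
group by t2.date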
General template for rolling averages that scales well for large data sets
WITH moving_avg AS (
SELECT 0 AS [lag] UNION ALL
SELECT 1 AS [lag] UNION ALL
SELECT 2 AS [lag] UNION ALL
SELECT 3 AS [lag] --ETC
)
SELECT
DATEADD(day,[lag],[date]) AS [reference_date],
[otherkey1],[otherkey2],[otherkey3],
AVG([value1]) AS [avg_value1],
AVG([value2]) AS [avg_value2]
FROM [data_table]
CROSS JOIN moving_avg
GROUP BY [otherkey1],[otherkey2],[otherkey3],DATEADD(day,[lag],[date])
ORDER BY [otherkey1],[otherkey2],[otherkey3],[reference_date];
And for weighted rolling averages:
WITH weighted_avg AS (
SELECT 0 AS [lag], 1.0 AS [weight] UNION ALL
SELECT 1 AS [lag], 0.6 AS [weight] UNION ALL
SELECT 2 AS [lag], 0.3 AS [weight] UNION ALL
SELECT 3 AS [lag], 0.1 AS [weight] --ETC
)
SELECT
DATEADD(day,[lag],[date]) AS [reference_date],
[otherkey1],[otherkey2],[otherkey3],
AVG([value1] * [weight]) / AVG([weight]) AS [wavg_value1],
AVG([value2] * [weight]) / AVG([weight]) AS [wavg_value2]
FROM [data_table]
CROSS JOIN weighted_avg
GROUP BY [otherkey1],[otherkey2],[otherkey3],DATEADD(day,[lag],[date])
ORDER BY [otherkey1],[otherkey2],[otherkey3],[reference_date];
select *
, (select avg(c2.clicks) from #clicks_table c2
where c2.date between dateadd(dd, -2, c1.date) and c1.date) mov_avg
from #clicks_table c1
Use a different join predicate:
SELECT current.date
,avg(periods.clicks)
FROM current left outer join current as periods
ON current.date BETWEEN dateadd(d,-2, periods.date) AND periods.date
GROUP BY current.date HAVING COUNT(*) >= 3
The having statement will prevent any dates without at least N values from being returned.
assume x is the value to be averaged and xDate is the date value:
SELECT avg(x) from myTable WHERE xDate BETWEEN dateadd(d, -2, xDate) and xDate
In hive, maybe you could try
select date, clicks, avg(clicks) over (order by date rows between 2 preceding and current row) as moving_avg from clicktable;
For this purpose, I'd like to create an auxiliary/dimensional date table like
create table date_dim(date date, date_1 date, date_2 date, date_3 date, ...)
where date is the key, date_1 is this day, date_2 covers this day and the day before, date_3 and so on.
Then you can do the equi-join in Hive (a sketch of that join follows the view below).
Using a view like:
select date, date from date_dim
union all
select date, date_add(date, -1) from date_dim
union all
select date, date_add(date, -2) from date_dim
union all
select date, date_add(date, -3) from date_dim
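A hedged sketch of that equi-join, assuming the view above is created as date_view(date, covered_date) and the click data is in clicktable(date, clicks) as in the earlier Hive snippet (those names are assumptions):
-- Sketch only: each date joins to itself plus the prior days expanded by the view,
-- so AVG over the group is the moving average for that date.
SELECT v.date, AVG(c.clicks) AS moving_avg
FROM date_view v
JOIN clicktable c
  ON c.date = v.covered_date
GROUP BY v.date;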
NOTE: THIS IS NOT AN ANSWER but an enhanced code sample of Diego Scaravaggi's answer. I am posting it as an answer because the comment section is insufficient. Note that I have parameterized the period for the moving average.
declare @p int = 3
declare @t table(d int, bal float)
insert into @t values
(1,94),
(2,99),
(3,76),
(4,74),
(5,48),
(6,55),
(7,90),
(8,77),
(9,16),
(10,19),
(11,66),
(12,47)
select a.d, avg(b.bal)
from
@t a
left join @t b on b.d between a.d-(@p-1) and a.d
group by a.d
--@p1 is the period of the moving average, @o1 is the offset
declare @p1 as int
declare @o1 as int
set @p1 = 5;
set @o1 = 3;
with np as(
select *, rank() over(partition by cmdty, tenor order by markdt) as r
from p_prices p1
where
1=1
)
, x1 as (
select s1.*, avg(s2.val) as avgval from np s1
inner join np s2
on s1.cmdty = s2.cmdty and s1.tenor = s2.tenor
and s2.r between s1.r - (@p1 - 1) - (@o1) and s1.r - (@o1)
group by s1.cmdty, s1.tenor, s1.markdt, s1.val, s1.r
)
select * from x1; -- return the offset moving averages
I'm not sure that your expected result (output) shows the classic "simple moving (rolling) average" for 3 days, because, for example, the first triple of numbers by definition gives:
ThreeDaysMovingAverage = (2.230 + 3.150 + 5.520) / 3 = 3.6333333
but you expect 4.360, and that is confusing.
Nevertheless, I suggest the following solution, which uses the window function AVG. This approach is much more efficient (clearer and less resource-intensive) than the SELF JOIN introduced in other answers (and I'm surprised that no one has given a better solution).
-- Oracle-SQL dialect
with
data_table as (
select date '2012-05-01' AS dt, 2.230 AS clicks from dual union all
select date '2012-05-02' AS dt, 3.150 AS clicks from dual union all
select date '2012-05-03' AS dt, 5.520 AS clicks from dual union all
select date '2012-05-04' AS dt, 1.330 AS clicks from dual union all
select date '2012-05-05' AS dt, 2.260 AS clicks from dual union all
select date '2012-05-06' AS dt, 3.540 AS clicks from dual union all
select date '2012-05-07' AS dt, 2.330 AS clicks from dual
),
param as (select 3 days from dual)
select
dt AS "Date",
clicks AS "Clicks",
case when rownum >= p.days then
avg(clicks) over (order by dt
rows between p.days - 1 preceding and current row)
end
AS "3 day Moving Average"
from data_table t, param p;
You can see that AVG is wrapped with case when rownum >= p.days then to force NULLs in the first rows, where a "3 day Moving Average" is meaningless.
We can apply Joe Celko's "dirty" left outer join method (as cited above by Diego Scaravaggi) to answer the question as it was asked.
declare @ClicksTable table ([Date] date, Clicks int)
insert into @ClicksTable
select '2012-05-01', 2230 union all
select '2012-05-02', 3150 union all
select '2012-05-03', 5520 union all
select '2012-05-04', 1330 union all
select '2012-05-05', 2260 union all
select '2012-05-06', 3540 union all
select '2012-05-07', 2330
This query:
SELECT
T1.[Date],
T1.Clicks,
-- AVG ignores NULL values so we have to explicitly NULLify
-- the days when we don't have a full 3-day sample
CASE WHEN count(T2.[Date]) < 3 THEN NULL
ELSE AVG(T2.Clicks)
END AS [3-Day Moving Average]
FROM @ClicksTable T1
LEFT OUTER JOIN @ClicksTable T2
ON T2.[Date] BETWEEN DATEADD(d, -2, T1.[Date]) AND T1.[Date]
GROUP BY T1.[Date]
Generates the requested output:
Date         Clicks   3-Day Moving Average
2012-05-01   2,230
2012-05-02   3,150
2012-05-03   5,520    4,360
2012-05-04   1,330    3,330
2012-05-05   2,260    3,120
2012-05-06   3,540    3,320
2012-05-07   2,330    3,010