SQL grouping and running total of open items for a date range - sql

I have a table of items that, for sake of simplicity, contains the ItemID, the StartDate, and the EndDate for a list of items.
ItemID StartDate EndDate
1 1/1/2011 1/15/2011
2 1/2/2011 1/14/2011
3 1/5/2011 1/17/2011
...
My goal is to be able to join this table to a table with a sequential list of dates,
and say both how many items are open on a particular date, and also how many items are cumulatively open.
Date ItemsOpened CumulativeItemsOpen
1/1/2011 1 1
1/2/2011 1 2
...
I can see how this would be done with a WHILE loop,
but that has performance implications. I'm wondering how
this could be done with a set-based approach?

SELECT COUNT(CASE WHEN d.CheckDate = i.StartDate THEN 1 ELSE NULL END)
AS ItemsOpened
, COUNT(i.StartDate)
AS ItemsOpenedCumulative
FROM Dates AS d
LEFT JOIN Items AS i
ON d.CheckDate BETWEEN i.StartDate AND i.EndDate
GROUP BY d.CheckDate

This may give you what you want
SELECT DATE,
SUM(ItemOpened) AS ItemsOpened,
COUNT(StartDate) AS ItemsOpenedCumulative
FROM
(
SELECT d.Date, i.startdate, i.enddate,
CASE WHEN i.StartDate = d.Date THEN 1 ELSE 0 END AS ItemOpened
FROM Dates d
LEFT OUTER JOIN Items i ON d.Date BETWEEN i.StartDate AND i.EndDate
) AS x
GROUP BY DATE
ORDER BY DATE
This assumes that your date values are DATE data type. Or, the dates are DATETIME with no time values.

You may find this useful. The recusive part can be replaced with a table. To demonstrate it works I had to populate some sort of date table. As you can see, the actual sql is short and simple.
DECLARE #i table (itemid INT, startdate DATE, enddate DATE)
INSERT #i VALUES (1,'1/1/2011', '1/15/2011')
INSERT #i VALUES (2,'1/2/2011', '1/14/2011')
INSERT #i VALUES (3,'1/5/2011', '1/17/2011')
DECLARE #from DATE
DECLARE #to DATE
SET #from = '1/1/2011'
SET #to = '1/18/2011'
-- the recusive sql is strictly to make a datelist between #from and #to
;WITH cte(Date)
AS (
SELECT #from DATE
UNION ALL
SELECT DATEADD(day, 1, DATE)
FROM cte ch
WHERE DATE < #to
)
SELECT cte.Date, sum(case when cte.Date=i.startdate then 1 else 0 end) ItemsOpened, count(i.itemid) ItemsOpenedCumulative
FROM cte
left join #i i on cte.Date between i.startdate and i.enddate
GROUP BY cte.Date
OPTION( MAXRECURSION 0)

If you are on SQL Server 2005+, you could use a recursive CTE to obtain running totals, with the additional help of the ranking function ROW_NUMBER(), like this:
WITH grouped AS (
SELECT
d.Date,
ItemsOpened = COUNT(i.ItemID),
rn = ROW_NUMBER() OVER (ORDER BY d.Date)
FROM Dates d
LEFT JOIN Items i ON d.Date BETWEEN i.StartDate AND i.EndDate
GROUP BY d.Date
WHERE d.Date BETWEEN #FilterStartDate AND #FilterEndDate
),
cumulative AS (
SELECT
Date,
ItemsOpened,
ItemsOpenedCumulative = ItemsOpened
FROM grouped
WHERE rn = 1
UNION ALL
SELECT
g.Date,
g.ItemsOpened,
ItemsOpenedCumulative = g.ItemsOpenedCumulative + c.ItemsOpened
FROM grouped g
INNER JOIN cumulative c ON g.Date = DATEADD(day, 1, c.Date)
)
SELECT *
FROM cumulative

Related

SQL with while loop to DAX conversion

Trying to convert the SQL with while loop code into DAX. Trying to build this query without using temp tables as access is an issue on the database and only have views to work with. I believe best option for me is to code it in DAX. Could someone help with it.
DECLARE #sd DATETIME
DECLARE #ed DATETIME
SELECT #sd = CONVERT(DATETIME, '2021-01-31')
SELECT #ed = GETDATE()
DECLARE #date DATETIME = EOMONTH(#sd)
WHILE ( (#date) <= #ed )
BEGIN
SELECT MONTH(#date) as Month, YEAR(#date) as Year, DAY(#date) as Day, A.*
FROM [people] A
WHERE A.effective_date = (SELECT MAX(B.effective_date)
FROM [people] B
WHERE B.employee_id = A.employee_id
AND B.record_id = A.record_id
AND B.effective_date <= #date)
AND A.effective_sequence = (SELECT MAX(C.effective_sequence)
FROM [people] C
WHERE C.employee_id = A.employee_id
AND C.record_id = A.record_id
AND C.effective_date = A.effective_date)
ORDER BY A.employee_id;
SET #date = EOMONTH(DATEADD(MONTH,1,#date))
END
While you could do this as a view, you would either have to hard-code the start and end dates, or filter them afterwards (which is likely to be inefficient). Instead you can do this as an inline Table Valued Function.
We can use a virtual tally-table (generated with a couple cross-joins) to generate a row for each month
We can use row-numbering instead of the two subqueries
CREATE FUNCTION dbo.GetData (#sd DATETIME, #ed DATETIME)
RETURNS TABLE AS RETURN
WITH L0 AS (
SELECT *
FROM (VALUES(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) v(n)
),
L1 AS (
SELECT 1 n FROM L0 a CROSS JOIN L0 b
)
SELECT
MONTH(m.Month) as Month,
YEAR(m.Month) as Year,
DAY(m.Month) as Day,
p.* -- specify columns
FROM (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY m.Month, p.employee_id, p.record_id ORDER BY p.effective_date, p.effective_sequence) AS rn
FROM [people] p
CROSS JOIN (
SELECT TOP (DATEDIFF(month, #sd, #ed) + 1)
DATEADD(month, ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1, EOMONTH(#sd)) AS Month
FROM L1
) m
WHERE p.effective_date <= m.Month
) p
WHERE p.rn = 1
;
Then in PowerBI you can just do for example
SELECT *
FROM dbo.GetData ('2021-01-31', GETDATE()) d
ORDER BY
d.employee_id
Note that you cannot put the ORDER BY within the function, it doesn't work.

Showing list of all 24 hours in sql server if there is no data also

I have a query where I need to show 24 hour calls for each day.
But I am getting the hours which I have calls only.
My requirement is I need to get all the hours split and 0 if there are no calls.
Please suggest
Below is my code.
select #TrendStartDate
,isd.Name
,isd.Call_ID
,isd.callType
,DATEPART(HOUR,isd.ArrivalTime)
from [PHONE_CALLS] ISD WITH (NOLOCK)
WHERE CallType = 'Incoming'
and Name not in ('DefaultQueue')
and CAST(ArrivalTime as DATe) between #TrendStartDate and #TrendEndDate
The basic idea is that you use a table containing numbers from 0 to 23, and left join that to your data table:
WITH CTE AS
(
SELECT TOP 24 ROW_NUMBER() OVER(ORDER BY ##SPID) - 1 As TheHour
FROM sys.objects
)
SELECT #TrendStartDate
,isd.Name
,isd.Call_ID
,isd.callType
,TheHour
FROM CTE
LEFT JOIN [PHONE_CALLS] ISD WITH (NOLOCK)
ON DATEPART(HOUR,isd.ArrivalTime) = TheHour
AND CallType = 'Incoming'
AND Name NOT IN ('DefaultQueue')
AND CAST(ArrivalTime as DATe) BETWEEN #TrendStartDate AND #TrendEndDate
If you have a tally table, you should use that. If not, the cte will provide you with numbers from 0 to 23.
If you have a numbers table you can use a query like the following:
SELECT d.Date,
h.Hour,
Calls = COUNT(pc.Call_ID)
FROM ( SELECT [Hour] = Number
FROM dbo.Numbers
WHERE Number >= 0
AND Number < 24
) AS h
CROSS JOIN
( SELECT Date = DATEADD(DAY, Number, #TrendStartDate)
FROM dbo.Numbers
WHERE Number <= DATEDIFF(DAY, #TrendStartDate, #TrendEndDate)
) AS d
LEFT JOIN [PHONE_CALLS] AS pc
ON pc.CallType = 'Incoming'
AND pc.Name NOT IN ('DefaultQueue')
AND CAST(pc.ArrivalTime AS DATE) = d.Date
AND DATEPART(HOUR, pc.ArrivalTime) = h.Hour
GROUP BY d.Date, h.Hour
ORDER BY d.Date, h.Hour;
The key is to get all the hours you need:
SELECT [Hour] = Number
FROM dbo.Numbers
WHERE Number >= 0
AND Number < 24
And all the days that you need in your range:
SELECT Date = DATEADD(DAY, Number, #TrendStartDate)
FROM dbo.Numbers
WHERE Number < DATEDIFF(DAY, #TrendStartDate, #TrendEndDate)
Then cross join the two, so that you are guaranteed to have all 24 hours for each day you want. Finally, you can left join to your call table to get the count of calls.
Example on DB<>Fiddle
You can use SQL SERVER recursivity with CTE to generate the hours between 0 and 23 and then a left outer join with the call table
You also use any other Method mentioned in this link to generate numbers from 0 to 23
Link to SQLFiddle
set dateformat ymd
declare #calls as table(date date,hour int,calls int)
insert into #calls values('2020-01-02',0,66),('2020-01-02',1,888),
('2020-01-02',2,5),('2020-01-02',3,8),
('2020-01-02',4,9),('2020-01-02',5,55),('2020-01-02',6,44),('2020-01-02',7,87),('2020-01-02',8,90),
('2020-01-02',9,34),('2020-01-02',10,22),('2020-01-02',11,65),('2020-01-02',12,54),('2020-01-02',13,78),
('2020-01-02',23,99);
with cte as (select 0 n,date from #calls union all select 1+n,date from cte where 1+n <24)
select distinct(cte.date),cte.n [Hour],isnull(ca.calls,0) calls from cte left outer join #calls ca on cte.n=ca.hour and cte.date=ca.date

Taking most recent values in sum over date range

I have a table which has the following columns: DeskID *, ProductID *, Date *, Amount (where the columns marked with * make the primary key). The products in use vary over time, as represented in the image below.
Table format on the left, and a (hopefully) intuitive representation of the data on the right for one desk
The objective is to have the sum of the latest amounts of products by desk and date, including products which are no longer in use, over a date range.
e.g. using the data above the desired table is:
So on the 1st Jan, the sum is 1 of Product A
On the 2nd Jan, the sum is 2 of A and 5 of B, so 7
On the 4th Jan, the sum is 1 of A (out of use, so take the value from the 3rd), 5 of B, and 2 of C, so 8 in total
etc.
I have tried using a partition on the desk and product ordered by date to get the most recent value and turned the following code into a function (Function1 below) with #date Date parameter
select #date 'Date', t.DeskID, SUM(t.Amount) 'Sum' from (
select #date 'Date', t.DeskID, t.ProductID, t.Amount
, row_number() over (partition by t.DeskID, t.ProductID order by t.Date desc) as roworder
from Table1 t
where 1 = 1
and t.Date <= #date
) t
where t.roworder = 1
group by t.DeskID
And then using a utility calendar table and cross apply to get the required values over a time range, as below
select * from Calendar c
cross apply Function1(c.CalendarDate)
where c.CalendarDate >= '20190101' and c.CalendarDate <= '20191009'
This has the expected results, but is far too slow. Currently each desk uses around 50 products, and the products roll every month, so after just 5 years each desk has a history of ~3000 products, which causes the whole thing to grind to a halt. (Roughly 30 seconds for a range of a single month)
Is there a better approach?
Change your function to the following should be faster:
select #date 'Date', t.DeskID, SUM(t.Amount) 'Sum'
FROM (SELECT m.DeskID, m.ProductID, MAX(m.[Date) AS MaxDate
FROM Table1 m
where m.[Date] <= #date) d
INNER JOIN Table1 t
ON d.DeskID=t.DeskID
AND d.ProductID=t.ProductID
and t.[Date] = d.MaxDate
group by t.DeskID
The performance of TVF usually suffers. The following removes the TVF completely:
-- DROP TABLE Table1;
CREATE TABLE Table1 (DeskID int not null, ProductID nvarchar(32) not null, [Date] Date not null, Amount int not null, PRIMARY KEY ([Date],DeskID,ProductID));
INSERT Table1(DeskID,ProductID,[Date],Amount)
VALUES (1,'A','2019-01-01',1),(1,'A','2019-01-02',2),(1,'B','2019-01-02',5),(1,'A','2019-01-03',1)
,(1,'B','2019-01-03',4),(1,'C','2019-01-03',3),(1,'B','2019-01-04',5),(1,'C','2019-01-04',2),(1,'C','2019-01-05',2)
GO
DECLARE #StartDate date=N'2019-01-01';
DECLARE #EndDate date=N'2019-01-05';
;WITH cte_p
AS
(
SELECT DISTINCT DeskID,ProductID
FROM Table1
WHERE [Date] <= #EndDate
),
cte_a
AS
(
SELECT #StartDate AS [Date], p.DeskID, p.ProductID, ISNULL(a.Amount,0) AS Amount
FROM (
SELECT t.DeskID, t.ProductID
, MAX(t.Date) AS FirstDate
FROM Table1 t
WHERE t.Date <= #StartDate
GROUP BY t.DeskID, t.ProductID) f
INNER JOIN Table1 a
ON f.DeskID=a.DeskID
AND f.ProductID=a.ProductID
AND f.[FirstDate]=a.[Date]
RIGHT JOIN cte_p p
ON p.DeskID=a.DeskID
AND p.ProductID=a.ProductID
UNION ALL
SELECT DATEADD(DAY,1,a.[Date]) AS [Date], t.DeskID, t.ProductID, t.Amount
FROM Table1 t
INNER JOIN cte_a a
ON t.DeskID=a.DeskID
AND t.ProductID=a.ProductID
AND t.[Date] > a.[Date]
AND t.[Date] <= DATEADD(DAY,1,a.[Date])
WHERE a.[Date]<#EndDate
UNION ALL
SELECT DATEADD(DAY,1,a.[Date]) AS [Date], a.DeskID, a.ProductID, a.Amount
FROM cte_a a
WHERE NOT EXISTS(SELECT 1 FROM Table1 t
WHERE t.DeskID=a.DeskID
AND t.ProductID=a.ProductID
AND t.[Date] > a.[Date]
AND t.[Date] <= DATEADD(DAY,1,a.[Date]))
AND a.[Date]<#EndDate
)
SELECT [Date], DeskID, SUM(Amount)
FROM cte_a
GROUP BY [Date], DeskID;

Aggregate for each day over time series, without using non-equijoin logic

Initial Question
Given the following dataset paired with a dates table:
MembershipId | ValidFromDate | ValidToDate
==========================================
0001 | 1997-01-01 | 2006-05-09
0002 | 1997-01-01 | 2017-05-12
0003 | 2005-06-02 | 2009-02-07
How many Memberships were open on any given day or timeseries of days?
Initial Answer
Following this question being asked here, this answer provided the necessary functionality:
select d.[Date]
,count(m.MembershipID) as MembershipCount
from DIM.[Date] as d
left join Memberships as m
on(d.[Date] between m.ValidFromDateKey and m.ValidToDateKey)
where d.CalendarYear = 2016
group by d.[Date]
order by d.[Date];
though a commenter remarked that There are other approaches when the non-equijoin takes too long.
Followup
As such, what would the equijoin only logic look like to replicate the output of the query above?
Progress So Far
From the answers provided so far I have come up with the below, which outperforms on the hardware I am working with across 3.2 million Membership records:
declare #s date = '20160101';
declare #e date = getdate();
with s as
(
select d.[Date] as d
,count(s.MembershipID) as s
from dbo.Dates as d
join dbo.Memberships as s
on d.[Date] = s.ValidFromDateKey
group by d.[Date]
)
,e as
(
select d.[Date] as d
,count(e.MembershipID) as e
from dbo.Dates as d
join dbo.Memberships as e
on d.[Date] = e.ValidToDateKey
group by d.[Date]
),c as
(
select isnull(s.d,e.d) as d
,sum(isnull(s.s,0) - isnull(e.e,0)) over (order by isnull(s.d,e.d)) as c
from s
full join e
on s.d = e.d
)
select d.[Date]
,c.c
from dbo.Dates as d
left join c
on d.[Date] = c.d
where d.[Date] between #s and #e
order by d.[Date]
;
Following on from that, to split this aggregate into constituent groups per day I have the following, which is also performing well:
declare #s date = '20160101';
declare #e date = getdate();
with s as
(
select d.[Date] as d
,s.MembershipGrouping as g
,count(s.MembershipID) as s
from dbo.Dates as d
join dbo.Memberships as s
on d.[Date] = s.ValidFromDateKey
group by d.[Date]
,s.MembershipGrouping
)
,e as
(
select d.[Date] as d
,e..MembershipGrouping as g
,count(e.MembershipID) as e
from dbo.Dates as d
join dbo.Memberships as e
on d.[Date] = e.ValidToDateKey
group by d.[Date]
,e.MembershipGrouping
),c as
(
select isnull(s.d,e.d) as d
,isnull(s.g,e.g) as g
,sum(isnull(s.s,0) - isnull(e.e,0)) over (partition by isnull(s.g,e.g) order by isnull(s.d,e.d)) as c
from s
full join e
on s.d = e.d
and s.g = e.g
)
select d.[Date]
,c.g
,c.c
from dbo.Dates as d
left join c
on d.[Date] = c.d
where d.[Date] between #s and #e
order by d.[Date]
,c.g
;
Can anyone improve on the above?
If most of your membership validity intervals are longer than few days, have a look at an answer by Martin Smith. That approach is likely to be faster.
When you take calendar table (DIM.[Date]) and left join it with Memberships, you may end up scanning the Memberships table for each date of the range. Even if there is an index on (ValidFromDate, ValidToDate), it may not be super useful.
It is easy to turn it around.
Scan the Memberships table only once and for each membership find those dates that are valid using CROSS APPLY.
Sample data
DECLARE #T TABLE (MembershipId int, ValidFromDate date, ValidToDate date);
INSERT INTO #T VALUES
(1, '1997-01-01', '2006-05-09'),
(2, '1997-01-01', '2017-05-12'),
(3, '2005-06-02', '2009-02-07');
DECLARE #RangeFrom date = '2006-01-01';
DECLARE #RangeTo date = '2006-12-31';
Query 1
SELECT
CA.dt
,COUNT(*) AS MembershipCount
FROM
#T AS Memberships
CROSS APPLY
(
SELECT dbo.Calendar.dt
FROM dbo.Calendar
WHERE
dbo.Calendar.dt >= Memberships.ValidFromDate
AND dbo.Calendar.dt <= Memberships.ValidToDate
AND dbo.Calendar.dt >= #RangeFrom
AND dbo.Calendar.dt <= #RangeTo
) AS CA
GROUP BY
CA.dt
ORDER BY
CA.dt
OPTION(RECOMPILE);
OPTION(RECOMPILE) is not really needed, I include it in all queries when I compare execution plans to be sure that I'm getting the latest plan when I play with the queries.
When I looked at the plan of this query I saw that the seek in the Calendar.dt table was using only ValidFromDate and ValidToDate, the #RangeFrom and #RangeTo were pushed to the residue predicate. It is not ideal. The optimiser is not smart enough to calculate maximum of two dates (ValidFromDate and #RangeFrom) and use that date as a starting point of the seek.
It is easy to help the optimiser:
Query 2
SELECT
CA.dt
,COUNT(*) AS MembershipCount
FROM
#T AS Memberships
CROSS APPLY
(
SELECT dbo.Calendar.dt
FROM dbo.Calendar
WHERE
dbo.Calendar.dt >=
CASE WHEN Memberships.ValidFromDate > #RangeFrom
THEN Memberships.ValidFromDate
ELSE #RangeFrom END
AND dbo.Calendar.dt <=
CASE WHEN Memberships.ValidToDate < #RangeTo
THEN Memberships.ValidToDate
ELSE #RangeTo END
) AS CA
GROUP BY
CA.dt
ORDER BY
CA.dt
OPTION(RECOMPILE)
;
In this query the seek is optimal and doesn't read dates that may be discarded later.
Finally, you may not need to scan the whole Memberships table.
We need only those rows where the given range of dates intersects with the valid range of the membership.
Query 3
SELECT
CA.dt
,COUNT(*) AS MembershipCount
FROM
#T AS Memberships
CROSS APPLY
(
SELECT dbo.Calendar.dt
FROM dbo.Calendar
WHERE
dbo.Calendar.dt >=
CASE WHEN Memberships.ValidFromDate > #RangeFrom
THEN Memberships.ValidFromDate
ELSE #RangeFrom END
AND dbo.Calendar.dt <=
CASE WHEN Memberships.ValidToDate < #RangeTo
THEN Memberships.ValidToDate
ELSE #RangeTo END
) AS CA
WHERE
Memberships.ValidToDate >= #RangeFrom
AND Memberships.ValidFromDate <= #RangeTo
GROUP BY
CA.dt
ORDER BY
CA.dt
OPTION(RECOMPILE)
;
Two intervals [a1;a2] and [b1;b2] intersect when
a2 >= b1 and a1 <= b2
These queries assume that Calendar table has an index on dt.
You should try and see what indexes are better for the Memberships table.
For the last query, if the table is rather large, most likely two separate indexes on ValidFromDate and on ValidToDate would be better than one index on (ValidFromDate, ValidToDate).
You should try different queries and measure their performance on the real hardware with real data. Performance may depend on the data distribution, how many memberships there are, what are their valid dates, how wide or narrow is the given range, etc.
I recommend to use a great tool called SQL Sentry Plan Explorer to analyse and compare execution plans. It is free. It shows a lot of useful stats, such as execution time and number of reads for each query. The screenshots above are from this tool.
On the assumption your date dimension contains all dates contained in all membership periods you can use something like the following.
The join is an equi join so can use hash join or merge join not just nested loops (which will execute the inside sub tree once for each outer row).
Assuming index on (ValidToDate) include(ValidFromDate) or reverse this can use a single seek against Memberships and a single scan of the date dimension. The below has an elapsed time of less than a second for me to return the results for a year against a table with 3.2 million members and general active membership of 1.4 million (script)
DECLARE #StartDate DATE = '2016-01-01',
#EndDate DATE = '2016-12-31';
WITH MD
AS (SELECT Date,
SUM(Adj) AS MemberDelta
FROM Memberships
CROSS APPLY (VALUES ( ValidFromDate, +1),
--Membership count decremented day after the ValidToDate
(DATEADD(DAY, 1, ValidToDate), -1) ) V(Date, Adj)
WHERE
--Members already expired before the time range of interest can be ignored
ValidToDate >= #StartDate
AND
--Members whose membership starts after the time range of interest can be ignored
ValidFromDate <= #EndDate
GROUP BY Date),
MC
AS (SELECT DD.DateKey,
SUM(MemberDelta) OVER (ORDER BY DD.DateKey ROWS UNBOUNDED PRECEDING) AS CountOfNonIgnoredMembers
FROM DIM_DATE DD
LEFT JOIN MD
ON MD.Date = DD.DateKey)
SELECT DateKey,
CountOfNonIgnoredMembers AS MembershipCount
FROM MC
WHERE DateKey BETWEEN #StartDate AND #EndDate
ORDER BY DateKey
Demo (uses extended period as the calendar year of 2016 isn't very interesting with the example data)
One approach is to first use an INNER JOIN to find the set of matches and COUNT() to project MemberCount GROUPed BY DateKey, then UNION ALL with the same set of dates, with a 0 on that projection for the count of members for each date. The last step is to SUM() the MemberCount of this union, and GROUP BY DateKey. As requested, this avoids LEFT JOIN and NOT EXISTS. As another member pointed out, this is not an equi-join, because we need to use a range, but I think it does what you intend.
This will serve up 1 year's worth of data with around 100k logical reads. On an ordinary laptop with a spinning disk, from cold cache, it serves 1 month in under a second (with correct counts).
Here is an example that creates 3.3 million rows of random duration. The query at the bottom returns one month's worth of data.
--Stay quiet for a moment
SET NOCOUNT ON
SET STATISTICS IO OFF
SET STATISTICS TIME OFF
--Clean up if re-running
DROP TABLE IF EXISTS DIM_DATE
DROP TABLE IF EXISTS FACT_MEMBER
--Date dimension
CREATE TABLE DIM_DATE
(
DateKey DATE NOT NULL
)
--Membership fact
CREATE TABLE FACT_MEMBER
(
MembershipId INT NOT NULL
, ValidFromDateKey DATE NOT NULL
, ValidToDateKey DATE NOT NULL
)
--Populate Date dimension from 2001 through end of 2018
DECLARE #startDate DATE = '2001-01-01'
DECLARE #endDate DATE = '2018-12-31'
;WITH CTE_DATE AS
(
SELECT #startDate AS DateKey
UNION ALL
SELECT
DATEADD(DAY, 1, DateKey)
FROM
CTE_DATE AS D
WHERE
D.DateKey < #endDate
)
INSERT INTO
DIM_DATE
(
DateKey
)
SELECT
D.DateKey
FROM
CTE_DATE AS D
OPTION (MAXRECURSION 32767)
--Populate Membership fact with members having a random membership length from 1 to 36 months
;WITH CTE_DATE AS
(
SELECT #startDate AS DateKey
UNION ALL
SELECT
DATEADD(DAY, 1, DateKey)
FROM
CTE_DATE AS D
WHERE
D.DateKey < #endDate
)
,CTE_MEMBER AS
(
SELECT 1 AS MembershipId
UNION ALL
SELECT MembershipId + 1 FROM CTE_MEMBER WHERE MembershipId < 500
)
,
CTE_MEMBERSHIP
AS
(
SELECT
ROW_NUMBER() OVER (ORDER BY NEWID()) AS MembershipId
, D.DateKey AS ValidFromDateKey
FROM
CTE_DATE AS D
CROSS JOIN CTE_MEMBER AS M
)
INSERT INTO
FACT_MEMBER
(
MembershipId
, ValidFromDateKey
, ValidToDateKey
)
SELECT
M.MembershipId
, M.ValidFromDateKey
, DATEADD(MONTH, FLOOR(RAND(CHECKSUM(NEWID())) * (36-1)+1), M.ValidFromDateKey) AS ValidToDateKey
FROM
CTE_MEMBERSHIP AS M
OPTION (MAXRECURSION 32767)
--Add clustered Primary Key to Date dimension
ALTER TABLE DIM_DATE ADD CONSTRAINT PK_DATE PRIMARY KEY CLUSTERED
(
DateKey ASC
)
--Index
--(Optimize in your spare time)
DROP INDEX IF EXISTS SK_FACT_MEMBER ON FACT_MEMBER
CREATE CLUSTERED INDEX SK_FACT_MEMBER ON FACT_MEMBER
(
ValidFromDateKey ASC
, ValidToDateKey ASC
, MembershipId ASC
)
RETURN
--Start test
--Emit stats
SET STATISTICS IO ON
SET STATISTICS TIME ON
--Establish range of dates
DECLARE
#rangeStartDate DATE = '2010-01-01'
, #rangeEndDate DATE = '2010-01-31'
--UNION the count of members for a specific date range with the "zero" set for the same range, and SUM() the counts
;WITH CTE_MEMBER
AS
(
SELECT
D.DateKey
, COUNT(*) AS MembershipCount
FROM
DIM_DATE AS D
INNER JOIN FACT_MEMBER AS M ON
M.ValidFromDateKey <= #rangeEndDate
AND M.ValidToDateKey >= #rangeStartDate
AND D.DateKey BETWEEN M.ValidFromDateKey AND M.ValidToDateKey
WHERE
D.DateKey BETWEEN #rangeStartDate AND #rangeEndDate
GROUP BY
D.DateKey
UNION ALL
SELECT
D.DateKey
, 0 AS MembershipCount
FROM
DIM_DATE AS D
WHERE
D.DateKey BETWEEN #rangeStartDate AND #rangeEndDate
)
SELECT
M.DateKey
, SUM(M.MembershipCount) AS MembershipCount
FROM
CTE_MEMBER AS M
GROUP BY
M.DateKey
ORDER BY
M.DateKey ASC
OPTION (RECOMPILE, MAXDOP 1)
Here's how I'd solve this problem with equijoin:
--data generation
declare #Membership table (MembershipId varchar(10), ValidFromDate date, ValidToDate date)
insert into #Membership values
('0001', '1997-01-01', '2006-05-09'),
('0002', '1997-01-01', '2017-05-12'),
('0003', '2005-06-02', '2009-02-07')
declare #startDate date, #endDate date
select #startDate = MIN(ValidFromDate), #endDate = max(ValidToDate) from #Membership
--in order to use equijoin I need all days between min date and max date from Membership table (both columns)
;with cte as (
select #startDate [date]
union all
select DATEADD(day, 1, [date]) from cte
where [date] < #endDate
)
--in this query, we will assign value to each day:
--one, if project started on that day
--minus one, if project ended on that day
--then, it's enough to (cumulative) sum all this values to get how many projects were ongoing on particular day
select [date],
sum(case when [DATE] = ValidFromDate then 1 else 0 end +
case when [DATE] = ValidToDate then -1 else 0 end)
over (order by [date] rows between unbounded preceding and current row)
from cte [c]
left join #Membership [m]
on [c].[date] = [m].ValidFromDate or [c].[date] = [m].ValidToDate
option (maxrecursion 0)
Here's another solution:
--data generation
declare #Membership table (MembershipId varchar(10), ValidFromDate date, ValidToDate date)
insert into #Membership values
('0001', '1997-01-01', '2006-05-09'),
('0002', '1997-01-01', '2017-05-12'),
('0003', '2005-06-02', '2009-02-07')
;with cte as (
select CAST('2016-01-01' as date) [date]
union all
select DATEADD(day, 1, [date]) from cte
where [date] < '2016-12-31'
)
select [date],
(select COUNT(*) from #Membership where ValidFromDate < [date]) -
(select COUNT(*) from #Membership where ValidToDate < [date]) [ongoing]
from cte
option (maxrecursion 0)
Pay attention, I think #PittsburghDBA is right when it says that current query return wrong result.
The last day of membership is not counted and so final sum is lower than it should be.
I have corrected it in this version.
This should improve a bit your actual progress:
declare #s date = '20160101';
declare #e date = getdate();
with
x as (
select d, sum(c) c
from (
select ValidFromDateKey d, count(MembershipID) c
from Memberships
group by ValidFromDateKey
union all
-- dateadd needed to count last day of membership too!!
select dateadd(dd, 1, ValidToDateKey) d, -count(MembershipID) c
from Memberships
group by ValidToDateKey
)x
group by d
),
c as
(
select d, sum(x.c) over (order by d) as c
from x
)
select d.day, c cnt
from calendar d
left join c on d.day = c.d
where d.day between #s and #e
order by d.day;
First of all, your query yields '1' as MembershipCount even if no active membership exists for the given date.
You should return SUM(CASE WHEN m.MembershipID IS NOT NULL THEN 1 ELSE 0 END) AS MembershipCount.
For optimal performance create an index on Memberships(ValidFromDateKey, ValidToDateKey, MembershipId) and another on DIM.[Date](CalendarYear, DateKey).
With that done, the optimal query shall be:
DECLARE #CalendarYear INT = 2000
SELECT dim.DateKey, SUM(CASE WHEN con.MembershipID IS NOT NULL THEN 1 ELSE 0 END) AS MembershipCount
FROM
DIM.[Date] dim
LEFT OUTER JOIN (
SELECT ValidFromDateKey, ValidToDateKey, MembershipID
FROM Memberships
WHERE
ValidFromDateKey <= CONVERT(DATETIME, CONVERT(VARCHAR, #CalendarYear) + '1231')
AND ValidToDateKey >= CONVERT(DATETIME, CONVERT(VARCHAR, #CalendarYear) + '0101')
) con
ON dim.DateKey BETWEEN con.ValidFromDateKey AND con.ValidToDateKey
WHERE dim.CalendarYear = #CalendarYear
GROUP BY dim.DateKey
ORDER BY dim.DateKey
Now, for your last question, what would be the equijoin equivalent query.
There is NO WAY you can rewrite this as a non-equijoin!
Equijoin doesn't imply using join sintax. Equijoin implies using an equals predicate, whatever the sintax.
Your query yields a range comparison, hence equals doesn't apply: a between or similar is required.

Need to calc start and end date from single effective date

I am trying to write SQL to calculate the start and end date from a single date called effective date for each item. Below is a idea of how my data looks. There are times when the last effective date for an item will be in the past so I want the end date for that to be a year from today. The other two items in the table example have effective dates in the future so no need to create and end date of a year from today.
I have tried a few ways but always run into bad data. Below is an example of my query and the bad results
select distinct tb1.itemid,tb1.EffectiveDate as startdate
, case
when dateadd(d,-1,tb2.EffectiveDate) < getdate()
or tb2.EffectiveDate is null
then getdate() +365
else dateadd(d,-1,tb2.EffectiveDate)
end as enddate
from #test tb1
left join #test as tb2 on (tb2.EffectiveDate > tb1.EffectiveDate
or tb2.effectivedate is null) and tb2.itemid = tb1.itemid
left join #test tb3 on (tb1.EffectiveDate < tb3.EffectiveDate
andtb3.EffectiveDate <tb2.EffectiveDate or tb2.effectivedate is null)
and tb1.itemid = tb3.itemid
left join #test tb4 on tb1.effectivedate = tb4.effectivedate \
and tb1.itemid = tb4.itemid
where tb1.itemID in (62741,62740, 65350)
Results - there is an extra line for 62740
Bad Results
I expect to see below since the first two items have a future end date no need to create an end date of today + 365 but the last one only has one effective date so we have to calculate the end date.
I think I've read your question correctly. If you could provide your expected output it would help a lot.
Test Data
CREATE TABLE #TestData (itemID int, EffectiveDate date)
INSERT INTO #TestData (itemID, EffectiveDate)
VALUES
(62741,'2016-06-25')
,(62741,'2016-06-04')
,(62740,'2016-07-09')
,(62740,'2016-06-25')
,(62740,'2016-06-04')
,(65350,'2016-05-28')
Query
SELECT
a.itemID
,MIN(a.EffectiveDate) StartDate
,MAX(CASE WHEN b.MaxDate > GETDATE() THEN b.MaxDate ELSE CONVERT(date,DATEADD(yy,1,GETDATE())) END) EndDate
FROM #TestData a
JOIN (SELECT itemID, MAX(EffectiveDate) MaxDate FROM #TestData GROUP BY itemID) b
ON a.itemID = b.itemID
GROUP BY a.itemID
Result
itemID StartDate EndDate
62740 2016-06-04 2016-07-09
62741 2016-06-04 2016-06-25
65350 2016-05-28 2017-06-24
This should do it:
SELECT itemid
,effective_date AS "Start"
,(SELECT MIN(effective_date)
FROM effective_date_tbl
WHERE effective_date > edt.effective_date
AND itemid = edt.itemid) AS "End"
FROM effective_date_tbl edt
WHERE effective_date <
(SELECT MAX(effective_date) FROM effective_date_tbl WHERE itemid = edt.itemid)
UNION ALL
SELECT itemid
,effective_date AS "Start"
,(SYSDATE + 365) AS "End"
FROM effective_date_tbl edt
WHERE 1 = ( SELECT COUNT(*) FROM effective_date_table WHERE itemid = edt.itemid )
ORDER BY 1, 2, 3;
I did this exercise for Items that have multiple EffectiveDate in the table
you can create this view
CREATE view [VW_TESTDATA]
AS ( SELECT * FROM
(SELECT ROW_NUMBER() OVER (ORDER BY Item,CONVERT(datetime,EffectiveDate,110)) AS ID, Item, DATA
FROM MyTable ) AS Q
)
so use a select to compare the same Item
select * from [VW_TESTDATA] as A inner join [VW_TESTDATA] as B on A.Item = B.Item and A.id = B.id-1
in this way you always minor and major Date
I did not understand how to handle dates with only one Item , but it seems the simplest thing and can be added to this query with a UNION ALL, because the view not cover individual Item
You also need to figure out how to deal with Item with two equal EffectiveDate
you should use the case when statement..
[wrong query because a misunderstand of the requirements]
SELECT
ItemID AS Item,
StartDate,
CASE WHEN EndDate < Sysdate THEN Sysdate + 365 ELSE EndDate END AS EndDate
FROM
(
SELECT tabStartDate.ItemID, tabStartDate.EffectiveDate AS StartDate, tabEndDate.EffectiveDate AS EndDate
FROM TableItems tabStartDate
JOIN TableItems tabEndDate on tabStartDate.ItemID = tabEndDate.ItemID
) TableDatesPerItem
WHERE StartDate < EndDate
update after clarifications in the OP and some comments
I found a solution quite portable, because it doesn't make use of partioning but endorses on a sort of indexing rule that make to correspond the dates of each item with others with the same id, in order of time's succession.
The portability is obviously related to the "difficult" part of query, while row numbering mechanism and conversion go adapted, but I think that it isn't a problem.
I sended a version for MySql that it can try on SQL Fiddle..
Table
CREATE TABLE ITEMS
(`ItemID` int, `EffectiveDate` Date);
INSERT INTO ITEMS
(`ItemID`, `EffectiveDate`)
VALUES
(62741, DATE(20160625)),
(62741, DATE(20160604)),
(62740, DATE(20160709)),
(62740, DATE(20160625)),
(62740, DATE(20160604)),
(62750, DATE(20160528))
;
Query
SELECT
RESULT.ItemID AS ItemID,
DATE_FORMAT(RESULT.StartDate,'%m/%d/%Y') AS StartDate,
CASE WHEN RESULT.EndDate < CURRENT_DATE
THEN DATE_FORMAT((CURRENT_DATE + INTERVAL 365 DAY),'%m/%d/%Y')
ELSE DATE_FORMAT(RESULT.EndDate,'%m/%d/%Y')
END AS EndDate
FROM
(
SELECT
tabStartDate.ItemID AS ItemID,
tabStartDate.StartDate AS StartDate,
tabEndDate.EndDate
,tabStartDate.IDX,
tabEndDate.IDX AS IDX2
FROM
(
SELECT
tabStartDateIDX.ItemID AS ItemID,
tabStartDateIDX.EffectiveDate AS StartDate,
#rownum:=#rownum+1 AS IDX
FROM ITEMS AS tabStartDateIDX
ORDER BY tabStartDateIDX.ItemID, tabStartDateIDX.EffectiveDate
)AS tabStartDate
JOIN
(
SELECT
tabEndDateIDX.ItemID AS ItemID,
tabEndDateIDX.EffectiveDate AS EndDate,
#rownum:=#rownum+1 AS IDX
FROM ITEMS AS tabEndDateIDX
ORDER BY tabEndDateIDX.ItemID, tabEndDateIDX.EffectiveDate
)AS tabEndDate
ON tabStartDate.ItemID = tabEndDate.ItemID AND (tabEndDate.IDX - tabStartDate.IDX = ((select count(*) from ITEMS)+1) )
,(SELECT #rownum:=0) r
UNION
(
SELECT
tabStartDateSingleItem.ItemID AS ItemID,
tabStartDateSingleItem.EffectiveDate AS StartDate,
tabStartDateSingleItem.EffectiveDate AS EndDate
,0 AS IDX,0 AS IDX2
FROM ITEMS AS tabStartDateSingleItem
Group By tabStartDateSingleItem.ItemID
HAVING Count(tabStartDateSingleItem.ItemID) = 1
)
) AS RESULT
;