Loading Date Range Values to a Daily Grain Fact Table - sql

ETL question here.
For a given table whose entries have a start and end date, what is the optimal method to retrieve counts for each day, including days that are not covered by any entry's start/end range and should therefore show a count of 0?
Given Table Example
Stock
ID StartDate EndDate Category
1 1/1/2013 1/5/2013 Appliances
2 1/1/2013 1/10/2013 Appliances
3 1/2/2013 1/10/2013 Appliances
Output required
Available
Category EventDate Count
Appliances 1/1/2013 2
Appliances 1/2/2013 3
...
...
Appliances 1/10/2013 2
Appliances 1/11/2013 0
...
...
One method I know of, which takes FOREVER, is to create a table variable and run a WHILE block iterating from the start to the end of the range I wish to retrieve, executing a query like this on each pass:
INSERT INTO #TempTable (Category, EventDate, [Count])
SELECT Category, @CurrentLoopDate, COUNT(*)
FROM Stock
WHERE @CurrentLoopDate BETWEEN StartDate AND EndDate
GROUP BY Category
Another method would be to create a table or temp table of the dates in the range I want populated, and join to it with BETWEEN:
INSERT INTO #TempTable (Category, EventDate, [Count])
SELECT Stock.Category, DateTable.[Date], COUNT(*)
FROM DateTable
INNER JOIN Stock ON DateTable.[Date] BETWEEN Stock.StartDate AND Stock.EndDate
GROUP BY Stock.Category, DateTable.[Date]
Other methods are similar but use SSIS; they are essentially the same as the two solutions above.
Do any gurus know of a more efficient method?

Have you tried using a recursive CTE?
WITH Dates_CTE AS (
SELECT [ID]
,[StartDate]
,[EndDate]
,[Category]
FROM [dbo].[Stock]
UNION ALL
SELECT [ID]
,DATEADD(D, 1, [StartDate])
,[EndDate]
,[Category]
FROM Dates_CTE d
WHERE DATEADD(D, 1, [StartDate]) <= EndDate
)
SELECT StartDate AS EventDate
,Category
,COUNT(*)
FROM Dates_CTE
GROUP BY StartDate, Category
OPTION (MAXRECURSION 0)
That should do the trick ;-)
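One caveat: the recursive CTE above only returns days that fall inside at least one Stock row, so a day such as 1/11/2013 would be absent rather than reported with a count of 0. Below is a minimal sketch of one way to get the zero rows as well. It assumes SQL Server 2008 or later (for the date type) and a reporting window held in two hypothetical variables, @RangeStart and @RangeEnd, which are not part of the question. The idea is to generate the calendar first and LEFT JOIN the stock rows to it.

```sql
-- Hypothetical reporting window; adjust to the range you need.
DECLARE @RangeStart date = '20130101',
        @RangeEnd   date = '20130115';

WITH Calendar AS (
    -- One row per day in the window.
    SELECT @RangeStart AS EventDate
    UNION ALL
    SELECT DATEADD(DAY, 1, EventDate)
    FROM Calendar
    WHERE EventDate < @RangeEnd
),
Categories AS (
    -- Every category should appear on every day, even with no stock.
    SELECT DISTINCT Category FROM dbo.Stock
)
SELECT cat.Category,
       cal.EventDate,
       COUNT(s.ID) AS [Count]   -- counts only matched Stock rows, so unmatched days give 0
FROM Calendar AS cal
CROSS JOIN Categories AS cat
LEFT JOIN dbo.Stock AS s
       ON s.Category = cat.Category
      AND cal.EventDate BETWEEN s.StartDate AND s.EndDate
GROUP BY cat.Category, cal.EventDate
ORDER BY cat.Category, cal.EventDate
OPTION (MAXRECURSION 0);
```

With the sample rows from the question, the days after 1/10/2013 come back with a count of 0 instead of being dropped.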

Related

How to implement loops in SQL?

I am trying to calculate a KPI for each patient called "Initial prescription start date" (IPST).
A start date is an IPST if the patient has no history of using that particular medication in the 60 days before it.
For example (see the screenshot below), for the patient with ID=101 I start with 4/15/2019 as the IPST. The difference between 4/15/2019 and 4/1/2019 is 14 days, which is less than 60, so I move the IPST to 4/1/2019.
Continuing this iteration, the IPST for 101 is 3/17/2019 and for 102 is 3/18/2018, as shown in the right-hand table.
I tried to build a UDF, as below, that takes the id of a patient and returns the IPST.
CREATE FUNCTION [Initial_Prescription_Date]
(
@id Uniqueidentifier
)
RETURNS date
AS
BEGIN
{
I am failing to implement this code here
}
I can get a list of Start_dates for a patient from a medication table like this
Select id, start_date from patient_medication
I will have to iterate through this list to get to the IPST for a patient.
I'll answer in order to start a dialog that we can work on.
The issue that I have is that the difference in days for ID = 102 between the last record and the one you've picked as the IPST is 29 days, but the IPST you've picked for 102 is 393 days, is that correct?
You don't need to loop to solve this problem. If you're comparing all of your dates only to your most recent, you can simply use MAX:
DECLARE @PatientRecords TABLE
(
ID INTEGER,
StartDate DATE,
Medicine VARCHAR(100)
)
INSERT INTO @PatientRecords VALUES
(101,'20181201','XYZ'),
(101,'20190115','XYZ'),
(101,'20190317','XYZ'),
(101,'20190401','XYZ'),
(101,'20190415','XYZ'),
(102,'20190401','XYZ'),
(102,'20190415','XYZ'),
(102,'20190315','XYZ'),
(102,'20180318','XYZ');
With maxCTE AS
(
SELECT *, DATEDIFF(DAY, StartDate, MAX(StartDate) OVER (PARTITION BY ID, MEDICINE ORDER BY StartDate ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)) [IPSTDateDifference]
FROM @PatientRecords
)
SELECT m.ID, m.Medicine, MIN(m.StartDate) [IPST]
FROM maxCTE m
WHERE [IPSTDateDifference] < 60
GROUP BY m.ID, m.Medicine
ORDER BY 1,3;
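If the 60-day rule is meant to be applied step by step (each start date compared with the one immediately before it, as in the question's walkthrough) rather than against the most recent date only, a gap-based variant may fit better. The following is only a sketch under that assumption. It uses LAG, so it needs SQL Server 2012 or later, and it treats a gap of 60 days or more (or no earlier record at all) as the start of a new prescription run; that threshold reading is my interpretation, not something the question confirms.

```sql
DECLARE @PatientRecords TABLE
(
    ID INTEGER,
    StartDate DATE,
    Medicine VARCHAR(100)
);
INSERT INTO @PatientRecords VALUES
(101,'20181201','XYZ'),
(101,'20190115','XYZ'),
(101,'20190317','XYZ'),
(101,'20190401','XYZ'),
(101,'20190415','XYZ'),
(102,'20190401','XYZ'),
(102,'20190415','XYZ'),
(102,'20190315','XYZ'),
(102,'20180318','XYZ');

WITH Gaps AS
(
    -- Days between each start date and the previous one for the same patient and medicine.
    SELECT ID, Medicine, StartDate,
           DATEDIFF(DAY,
                    LAG(StartDate) OVER (PARTITION BY ID, Medicine ORDER BY StartDate),
                    StartDate) AS DaysSincePrevious
    FROM @PatientRecords
)
-- A run starts where there is no earlier record or the gap is at least 60 days;
-- the IPST is the start of the most recent run.
SELECT ID, Medicine, MAX(StartDate) AS IPST
FROM Gaps
WHERE DaysSincePrevious IS NULL OR DaysSincePrevious >= 60
GROUP BY ID, Medicine
ORDER BY ID;
```

On this sample data it returns 2019-03-17 for patient 101, which matches the question, and 2019-03-15 for patient 102.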

Optimizing a SQL Query when joining two tables. Naive algorithm gives me millions of rows

I apologize, I am not sure how to word the heading for this question. If someone can rephrase it to better suit what I am asking, that would be greatly appreciated.
I have quite a problem that I have been stuck on for the longest time. I use Tableau in conjunction with SQL Server 2014.
I have a single table that essentially lists all the employees within our company, with their hire date and termination date (NULL if still employed). I am looking to generate a historical headcount. Here is an example of this table:
employeeID HireDate TermDate FavouriteFish FavouriteColor
1 1/1/15 1/1/18 Cod Blue
2 4/12/16 NULL Bass Red
.
.
.
n
As you can see, this list can go on and on. In fact, the table in question currently has over 10000 rows covering all past and current employees.
My goal is to construct a view showing, for each day of the last 5 years, the total headcount of employees we had. Here is the kicker though... I need to retain the rest of the information, such as:
FavouriteFish FavouriteColor... and so on
The only way I can think of doing this, and it doesn't work well because it is extremely slow, is to create a separate calendar table with a row for each day of the past 5 years, like so:
Date CrossJoinKey
1/1/2013 1
1/2/2013 1
1/3/2013 1
.
.
.
4/4/2018 1
From here I add a column called CrossJoinKey to my original Employee table, like so:
employeeID HireDate TermDate FavouriteFish FavouriteColor CrossJoinKey
1 1/1/15 1/1/18 Cod Blue 1
2 4/12/16 NULL Bass Red 1
.
.
.
n
From here I create a LEFT JOIN Calendar ON Employee.CrossJoinKey=Calendar.CrossJoinKey
Hopefully you can immediately see the problem: it creates a relationship with A LOT OF ROWS!! In fact it gives me somewhere around 18 million rows. It gives me the information I am after, but it takes a LONG time to query, and when I import it into Tableau to create an extract, that takes a LONG time as well. Once Tableau eventually creates the extract, it is relatively fast: I can then work out a headcount by day for the past 5 years by checking whether the date field is between HireDate and TermDate. But this entire process needs to be run quite frequently, and I feel the current method is impractical.
I feel this is a naive way to accomplish what I am after, and I feel this problem must have been addressed before. Is there anyone here who could shed some light on how to optimize this?
A word of note: I have considered creating a query that populates a calendar table by looking through the employee table and 'counting' each employee who is still employed, but this method loses resolution, and I am not able to retain any of the other data for the employees.
Something like this, shown below, works and is much faster, but NOT what I am looking for:
Date HeadCount
1/1/2013 1200
1/2/2013 1201
1/3/2013 1200
.
.
.
4/4/2018 5000
Thank you very much for spending some time on this.
UPDATE:
Here is a link to a google sheets data sample
I've edited some of your data, as you can see in the @example table.
I wanted to note: you spelt either =D favourite or Color inconsistently. Please correct it to either FavoriteColor or FavouriteColour.
declare @example as table (
exampleid int identity(1,1) not null primary key clustered
, StartDate date not null
, TermDate date null
);
insert into @example (StartDate, TermDate)
select '1/1/2016', '1/1/2018' union all
select '4/3/2017', '1/10/2018' union all
select '9/3/2016', '2/4/2018' union all
select '5/9/2017', '11/21/2017' union all
select '9/18/2016', '11/15/2017' union all
select '12/12/2015', '2/8/2018' union all
select '6/18/2016', '12/20/2017' union all
select '7/26/2015', '11/4/2017' union all
select '1/7/2015', NULL union all
select '10/2/2013', '10/21/2013' union all
select '10/14/2013', '12/12/2017' union all
select '10/11/2013', '11/3/2017' union all
select '6/30/2015', '1/12/2018' union all
select '2/17/2016', NULL union all
select '8/12/2015', '11/26/2017' union all
select '12/2/2015', '11/15/2017' union all
select '3/30/2016', '11/30/2017' union all
select '6/18/2016', '11/9/2017' union all
select '4/3/2017', '2/12/2018' union all
select '3/26/2017', '1/15/2018' union all
select '1/27/2017', NULL union all
select '7/29/2016', '1/10/2018';
--| This is an adaption of Aaron Bertrand's work (time dim table)
--| this will control the start date
declare @date datetime = '2013-10-01';
;with cte as (
select 1 ID
, @date date_
union all
select ID + 1
, dateadd(day, 1, date_)
from cte
)
, cte2 as (
select top 1000 ID
, cast(date_ as date) date_
, 0 Running
, iif(datepart(weekday, date_) in(1,7), 0,1) isWeekday
, datepart(weekday, date_) DayOfWeek
, datename(weekday, date_) DayOfWeekName
, month(date_) Month
, datename(month, date_) MonthName
, datepart(quarter, date_) Quarter
from cte
--option (maxrecursion 1000)
)
, cte3 as (
select a.id
, Date_
, b.StartDate
, iif(b.StartDate is not null, 1, 0) Add_
, iif(c.TermDate is not null, -1, 0) Remove_
from cte2 a
left join @example b
on a.date_ = b.StartDate
left join @example c
on a.date_ = c.TermDate
-- option (maxrecursion 1000)
)
select date_
--, Add_
--, Remove_
, sum((add_ + remove_)) over (order by date_ rows unbounded preceding) CurrentCount
from cte3
option (maxrecursion 1000)
Result Set:
date_ CurrentCount
2013-10-01 0
2013-10-02 1
2013-10-03 1
2013-10-04 1
2013-10-05 1
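One thing to watch with the query above: cte3 joins the calendar to the employee table twice, so a date with several hires or terminations yields several joined rows, which can repeat dates in the output and may skew the count when hires and terminations fall on the same day. A possible alternative, sketched here under my own assumptions (a cut-down three-row sample, and an employee no longer counted from their TermDate onward), is to turn each hire into +1 and each termination into -1, sum those per day, and only then take the running total. It needs SQL Server 2012 or later for the windowed SUM.

```sql
-- Cut-down, hypothetical sample; the real table would be the employee table.
DECLARE @example TABLE (StartDate date NOT NULL, TermDate date NULL);
INSERT INTO @example (StartDate, TermDate) VALUES
('20131002', '20131021'),
('20131011', '20171103'),
('20131014', NULL);

WITH Events AS (
    -- +1 on the hire date, -1 on the termination date (if any).
    SELECT StartDate AS EventDate, 1 AS Delta FROM @example
    UNION ALL
    SELECT TermDate, -1 FROM @example WHERE TermDate IS NOT NULL
),
Daily AS (
    -- Collapse to one row per date before accumulating.
    SELECT EventDate, SUM(Delta) AS NetChange
    FROM Events
    GROUP BY EventDate
)
SELECT EventDate,
       SUM(NetChange) OVER (ORDER BY EventDate ROWS UNBOUNDED PRECEDING) AS CurrentCount
FROM Daily
ORDER BY EventDate;
```

This only lists dates on which something changed; to get a row for every calendar day, the Daily set could be LEFT JOINed to a calendar such as cte2 above before the running sum is taken.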

Finding Overlapping Between StartDate and EndDate to see if at the same day there are more than 3 occurences

So I have a table in SQL Server 2008 that has a list of items. A person can only order a maximum of 4 items per day, so I want to know whether, between any start date and end date, there is a day on which they have more than 4 items.
Here is an example of my table:
OrderNo Item quantity StartDate EndDate
112 xbox 2 2012-12-05 2012-12-10
123 tv 1 2012-12-06 2012-12-07
125 computer 4 2012-12-10 2012-12-11
165 game 1 2012-12-06 2012-12-10
186 toy 2 2012-12-02 2012-12-04
From this table we can see that they had more than 4 items on some days. Now I need to know on which days they had more than 4 items, and how many.
Basically I want to check the overlapping dates between when the items went out and when they were returned, to see if there were more than 4 items out at the same time on a particular date.
I have no clue how to approach this. I have looked at numerous questions about overlapping dates and ranges in SQL.
You need a calendar table. Ideally this is a permanent table set up in your master database, all properly indexed, but you can create it on the fly like:
WITH Calendar
AS
(
SELECT MIN(StartDate) AS Today
,MAX(EndDate) AS LastDay
FROM table
UNION ALL
SELECT DATEADD(day,1,Today)
,LastDay
FROM Calendar
WHERE Today<LastDay
)
Note: the default recursion limit is 100 levels, so the most you can get from this is 101 days (the anchor row plus 100 recursions) unless you add OPTION (MAXRECURSION n), where n is an int from 0 to 32767 and 0 means no limit.
You now have a table you can join with your original table that covers all the relevant dates, like so
SELECT Today
,SUM(Quantity) AS ItemCount
FROM Calendar c
INNER JOIN
Table t ON c.Today BETWEEN t.StartDate AND t.EndDate
GROUP BY Today
HAVING SUM(Quantity)>4
See this SQL Fiddle
This SQL fiddle gives the solution with a "permanent" calendar table.
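For reference, here is the calendar CTE and the aggregate query put together as one runnable batch. It is a sketch rather than the fiddle's code: @Orders is a hypothetical table variable standing in for the placeholder table name used above, loaded with the sample rows from the question. Note that the OPTION clause goes at the very end of the statement that uses the CTE.

```sql
-- Hypothetical stand-in for the question's table, with its sample rows.
DECLARE @Orders TABLE (OrderNo int, Item varchar(20), Quantity int,
                       StartDate date, EndDate date);
INSERT INTO @Orders VALUES
(112, 'xbox',     2, '20121205', '20121210'),
(123, 'tv',       1, '20121206', '20121207'),
(125, 'computer', 4, '20121210', '20121211'),
(165, 'game',     1, '20121206', '20121210'),
(186, 'toy',      2, '20121202', '20121204');

WITH Calendar AS (
    -- Anchor: first day out; recursion: one more day until the last return date.
    SELECT MIN(StartDate) AS Today, MAX(EndDate) AS LastDay
    FROM @Orders
    UNION ALL
    SELECT DATEADD(DAY, 1, Today), LastDay
    FROM Calendar
    WHERE Today < LastDay
)
SELECT c.Today, SUM(o.Quantity) AS ItemCount
FROM Calendar AS c
INNER JOIN @Orders AS o
        ON c.Today BETWEEN o.StartDate AND o.EndDate
GROUP BY c.Today
HAVING SUM(o.Quantity) > 4
ORDER BY c.Today
OPTION (MAXRECURSION 0);   -- 0 removes the default 100-level limit
```

With the sample rows this returns a single day, 2012-12-10, with an ItemCount of 7 (xbox 2 + computer 4 + game 1).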
This one uses running totals to summarize the items ordered per date. To check against any date (not given in the sample data), simply use this result and do some joins. For more information about sequential numbers and dates, take a look at gaps and islands.
insert into "timestamps"
select
*
from
(
select "OrderNo", "ts", "quantity", case when "timestamp" = 'StartDate' then 1 else -1 end as "factor", 0 as "rolSum" from
( select "OrderNo", "StartDate", "EndDate", "quantity" from "data" ) as d
unpivot ( "ts" for "timestamp" in ( "StartDate", "EndDate" )) as pvt
) as data
order by "ts"
declare @rolSum int = 0
update "timestamps" set @rolSum = "rolSum" = @rolSum + "quantity" * "factor" from "timestamps"
select
"OrderNo"
, "ts"
, "rolSum"
from
"timestamps"
See SQL-Fiddle-Demo ( including table creation and your demo data ).

break down by weeks in SQL Server

Given this query:
DECLARE
    @FROM_DT DATETIME,
    @TO_DT DATETIME
BEGIN
    SET @FROM_DT = '10/01/2009'
    SET @TO_DT = DATEADD(DAY,7,@FROM_DT)
    --WHILE (@FROM_DT <= '10/01/2010')
    WHILE (@TO_DT < '10/01/2010')
    BEGIN
        SELECT
            CONVERT(CHAR(10),@FROM_DT,101) AS FROM_DT,
            CONVERT(CHAR(10),DATEADD(DAY,-1,@TO_DT),101) AS TO_DT,
            COUNTRY AS CITZ,
            COUNT(SUBJECT_KEY) AS PEOPLE
        FROM MYTALE
        WHERE DATE_DT >=@FROM_DT
        AND DATE_DT <@TO_DT
        GROUP BY COUNTRY
        SET @FROM_DT = DATEADD(DAY,7,@FROM_DT)
        SET @TO_DT = DATEADD(DAY, 7,@TO_DT)
    END
END
Here are my results:
FROM_DT TO_DT COUNTRY PEOPLE
10/01/2009 10/07/2009 A 2
10/01/2009 10/07/2009 B 1
FROM_DT TO_DT COUNTRY PEOPLE
10/08/2009 10/14/2009 A 1
10/08/2009 10/14/2009 C 2
---to
FROM_DT TO_DT COUNTRY PEOPLE
09/23/2010 09/29/2010 A 1
09/23/2010 09/29/2010 B 3
FROM_DT TO_DT COUNTRY PEOPLE
09/30/2010 10/06/2010 C 13
09/30/2010 10/06/2010 D 1
Question:
Is there a way in SQL to write the output like below? (I need to consolidate the data. I could copy and paste the result sets, but it's 52 weeks of data, so that is not an efficient way to do it.) Please help. I use SQL Server 2005 and 2008.
FROM_DT TO_DT COUNTRY PEOPLE
10/01/2009 10/07/2009 A 2
10/01/2009 10/07/2009 B 1
10/08/2009 10/14/2009 A 1
10/08/2009 10/14/2009 C 2
09/23/2010 09/29/2010 A 1
09/23/2010 09/29/2010 B 3
09/30/2010 10/06/2010 C 13
----
In the query above, I commented out WHILE (@FROM_DT <= '10/01/2010') and replaced it with WHILE (@TO_DT < '10/01/2010') because I would like to get the data for FY10 only, which runs from 10/1/2009 to 9/30/2010. However, the result only goes up to 9/29/2010; the data from 9/30/2010 is not included. Is something wrong with my query? Please help!
Well, SQL Server has a function called DATEPART which can also give you the WEEK part of a date - something like:
SELECT
DATEPART(WEEK, DATE_DT),
Country AS CITZ,
COUNT(Subject_Key) AS PEOPLE
FROM dbo.MyTable
GROUP BY
Country, DATEPART(WEEK, DATE_DT)
This gives you the numeric week number (but not yet the from_date and to_date).
Or you could leave your basic query alone, but store the results into a temporary table:
CREATE TABLE dbo.tmp_Results(FromDT CHAR(10), ToDT CHAR(10),
Country VARCHAR(100), Count INT)
and then just insert your results into that table for each run:
INSERT INTO dbo.tmp_Results(FromDT, ToDT, Country, Count)
SELECT
CONVERT(CHAR(10),@FROM_DT,101) AS FROM_DT,
CONVERT(CHAR(10),DATEADD(DAY,-1,@TO_DT),101) AS TO_DT,
COUNTRY AS CITZ,
COUNT(SUBJECT_KEY) AS PEOPLE
FROM MYTALE
WHERE DATE_DT >=@FROM_DT
AND DATE_DT <@TO_DT
GROUP BY COUNTRY
and then select from that temp table in the end:
SELECT * FROM dbo.tmp_Results
Recursive CTEs to the rescue! No need for temporary tables anymore since you can generate your set on the fly, and can start the weeks at any date instead of on Monday.
(Danger: Written in notepad. Minor bugs / typos may be present. Right idea, though)
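-- @from_dt and @last_dt are assumed to be DATETIME variables declared earlier in the batch
WITH weeks ([start], [end]) AS (
    select
        @from_dt as [start],
        dateadd(day, 7, @from_dt) as [end]
    UNION ALL   -- a recursive CTE requires UNION ALL
    select
        dateadd(day, 7, [start]),
        dateadd(day, 7, [end])
    from
        weeks
    where
        [start] < @last_dt
)
select
    w.[start],
    w.[end],
    c.country,
    count(c.subject_key)
from
    my_table c
    join weeks w on c.date_dt >= w.[start] and c.date_dt < w.[end]
group by
    w.[start], w.[end], c.country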
You could use a dynamic query with a union clause, but what I would do is create a temporary table and insert your results into it. Then you could select out the data from there and drop the temp table.
Your other option would be to create a table that held on to the from-to dates for your weeks and join on that table instead. This would actually be a preferred way to do it, but you would need to keep that table up to date with all of the dates you need.
An approach from data warehousing would be to create a "Weeks" table with all your possible weeks in it, along with their start and end dates:
Week StartDate EndDate
1 10/01/2009 10/07/2009
2 10/08/2009 10/14/2009
3 10/15/2009 10/21/2009
...
...and then just join to that. You fill the "Weeks" table once, in advance -- you can fill it up to the year 3000 if you want -- and then it's available in your database to do queries like this:
SELECT
StartDate, EndDate, COUNTRY, COUNT(SUBJECT_KEY) AS People
FROM
MYTALE INNER JOIN Weeks ON DATE_DT BETWEEN StartDate AND EndDate
GROUP BY
StartDate, EndDate, Country
This often simplifies complicated queries when you need to do data analysis over a range of dates (and you can pre-build a similar "days" or "months" table.) It can also be faster, assuming you've indexed the tables appropriately. These tables are "time dimensions" in star schema data warehouse parlance.
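Filling such a table is a one-off job. As a rough sketch of one way to do it (the dbo.Weeks name, the column types, the @FirstDay variable and the 53-week span are all assumptions for illustration, not something the answer specifies), a small recursive CTE can number the weeks and DATEADD can derive each week's boundaries. It is written to run on SQL Server 2005 as well as 2008.

```sql
-- Hypothetical time-dimension table matching the layout shown above.
CREATE TABLE dbo.Weeks (
    [Week]    int      NOT NULL PRIMARY KEY,
    StartDate datetime NOT NULL,
    EndDate   datetime NOT NULL
);

DECLARE @FirstDay datetime;
SET @FirstDay = '20091001';   -- first day of week 1 (assumed)

WITH n AS (
    -- Week numbers 1..53; well within the default recursion limit of 100.
    SELECT 1 AS [Week]
    UNION ALL
    SELECT [Week] + 1 FROM n WHERE [Week] < 53
)
INSERT INTO dbo.Weeks ([Week], StartDate, EndDate)
SELECT [Week],
       DATEADD(DAY, 7 * ([Week] - 1),     @FirstDay),   -- first day of the week
       DATEADD(DAY, 7 * ([Week] - 1) + 6, @FirstDay)    -- last day of the week
FROM n;
```

Week 1 comes out as 10/01/2009 to 10/07/2009, in line with the table above. Because EndDate is stored as midnight of the last day, a BETWEEN join will miss rows whose DATE_DT carries a time of day on that last day; if that applies, joining on DATE_DT >= StartDate AND DATE_DT < DATEADD(DAY, 1, EndDate) is a safer choice.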
Try:
DECLARE
    @FROM_DT DATETIME,
    @TO_DT DATETIME
BEGIN
    SET @FROM_DT = '10/01/2009'
    SET @TO_DT = DATEADD(DAY,7*53,@FROM_DT)
    SELECT
        CONVERT(CHAR(10),DATEADD(DAY,7*(WEEKNO),@FROM_DT),101) AS FROM_DT,
        CONVERT(CHAR(10),DATEADD(DAY,7*(WEEKNO)+6,@FROM_DT),101) AS TO_DT,
        COUNTRY AS CITZ,
        COUNT(SUBJECT_KEY) AS PEOPLE
    FROM
        -- DATEDIFF returns an int, so dividing by 7 is already integer division (T-SQL has no TRUNC)
        (SELECT M.*, DATEDIFF(DAY,@FROM_DT,DATE_DT)/7 AS WEEKNO
         FROM MYTALE M
         WHERE DATE_DT >=@FROM_DT
         AND DATE_DT <@TO_DT) SQ
    GROUP BY COUNTRY, WEEKNO
END

Generate missing dates + Sql Server (SET BASED)

I have the following
id eventid startdate enddate
1 1 2009-01-03 2009-01-05
1 2 2009-01-05 2009-01-09
1 3 2009-01-12 2009-01-15
How can I generate the missing dates pertaining to every eventid?
Edit:
The missing gaps are to be found based on the eventids, e.g. for eventid 1 the output should be 1/3/2009, 1/4/2009, 1/5/2009; for eventid 2 it will be 1/5/2009, 1/6/2009, ... to 1/9/2009, etc.
My task is to find the missing dates between two given dates.
Here is everything I have done so far:
declare @tblRegistration table(id int primary key,startdate date,enddate date)
insert into @tblRegistration
select 1,'1/1/2009','1/15/2009'
declare @tblEvent table(id int,eventid int primary key,startdate date,enddate date)
insert into @tblEvent
select 1,1,'1/3/2009','1/5/2009' union all
select 1,2,'1/5/2009','1/9/2009' union all
select 1,3,'1/12/2009','1/15/2009'
;with generateCalender_cte as
(
select cast((select startdate from @tblRegistration where id = 1 )as datetime) DateValue
union all
select DateValue + 1
from generateCalender_cte
where DateValue + 1 <= (select enddate from @tblRegistration where id = 1)
)
select DateValue as missingdates from generateCalender_cte
where DateValue not between '1/3/2009' and '1/5/2009'
and DateValue not between '1/5/2009' and '1/9/2009'
and DateValue not between '1/12/2009' and '1/15/2009'
Actually, what I am trying to do is this: I have generated a calendar table, and from there I am trying to find the missing dates based on the ids.
The ideal output will be
eventid missingdates
1 2009-01-01 00:00:00.000
1 2009-01-02 00:00:00.000
3 2009-01-10 00:00:00.000
3 2009-01-11 00:00:00.000
It also has to be set based, and the start and end dates should not be hardcoded.
Thanks in advance.
The following uses a recursive CTE (SQL Server 2005+):
WITH dates AS (
SELECT CAST('2009-01-01' AS DATETIME) 'date'
UNION ALL
SELECT DATEADD(dd, 1, t.date)
FROM dates t
WHERE DATEADD(dd, 1, t.date) <= '2009-02-01')
SELECT t.eventid, d.date
FROM dates d
JOIN TABLE t ON d.date BETWEEN t.startdate AND t.enddate
It generates dates using the DATEADD function. It can be altered to take a start & end date as parameters. According to KM's comments, it's faster than using the numbers table trick.
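To get the missing dates themselves without hardcoding any ranges, one option is to build the calendar from the registration row and keep only the days that no event covers. This is a sketch under a few assumptions: it reuses the question's @tblRegistration and @tblEvent table variables, it returns the gap dates per id only (it does not try to attribute each gap to an eventid as the ideal output does), and the CTE name Calendar is mine.

```sql
DECLARE @tblRegistration TABLE (id int PRIMARY KEY, startdate date, enddate date);
INSERT INTO @tblRegistration VALUES (1, '20090101', '20090115');

DECLARE @tblEvent TABLE (id int, eventid int PRIMARY KEY, startdate date, enddate date);
INSERT INTO @tblEvent VALUES
(1, 1, '20090103', '20090105'),
(1, 2, '20090105', '20090109'),
(1, 3, '20090112', '20090115');

WITH Calendar AS (
    -- One row per day of each registration period.
    SELECT id, startdate AS DateValue, enddate
    FROM @tblRegistration
    UNION ALL
    SELECT id, DATEADD(DAY, 1, DateValue), enddate
    FROM Calendar
    WHERE DateValue < enddate
)
SELECT c.id, c.DateValue AS missingdate
FROM Calendar AS c
WHERE NOT EXISTS (
    -- Drop every day that falls inside some event for the same id.
    SELECT 1
    FROM @tblEvent AS e
    WHERE e.id = c.id
      AND c.DateValue BETWEEN e.startdate AND e.enddate
)
ORDER BY c.id, c.DateValue
OPTION (MAXRECURSION 0);
```

For the sample data this returns 2009-01-01, 2009-01-02, 2009-01-10 and 2009-01-11, the same dates as in the question's ideal output.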
Like rexem - I made a function that contains a similar CTE to generate any series of datetime intervals you need. Very handy for summarizing data by datetime intervals like you are doing.
A more detailed post and the function source code are here:
Insert Dates in the return from a query where there is none
Once you have the "counts of events by date" ... your missing dates would be the ones with a count of 0.