Conceptualizing SQL Query for data that isn't there - sql

SQL Server 2017.
Having been running simple-to-intermediate SQL queries for many years, I'm having trouble wrapping my head around this one, as it's querying for information that doesn't actually exist.
Given a table called Activity with ProductId (int), PurchaseDate (datetime)
and some rows that look like this:
1 2020-10-31
1 2020-11-01
1 2020-11-02
1 2020-11-03
2 2020-10-31
2 2020-11-01
2 2020-11-03
2 2020-11-04
3 2020-10-31
3 2020-11-01
4 2020-10-31
4 2020-11-01
4 2020-11-03
5 2020-10-20
6 2020-10-31
6 2020-11-01
6 2020-11-02
And then another table called ProductIds with column ProductId (int) and 7 rows, with values 1-7.
I need to return every ProductId that has no entry in the Activity table for some date within a date range, along with the date that has no entry. These would be the results:
2 2020-11-02
3 2020-11-02
3 2020-11-03
4 2020-11-02
5 2020-10-31
5 2020-11-01
5 2020-11-02
5 2020-11-03
6 2020-11-03
7 2020-10-31
7 2020-11-01
7 2020-11-02
7 2020-11-03
So the query would be looking for any ProductId from the ProductIds table that does not have an associated entry in the Activity table for dates between 2020-10-31 and 2020-11-03.
This is what I have so far, but bangin' my head trying to figure it out:
SELECT ProductId, PurchaseDate
FROM dbo.Activity
WHERE ProductId NOT IN (SELECT ProductId FROM dbo.ProductIds);
I know there are at least a couple things wrong with that query and I just can't figure out how to go about this. As you can see, the results set is returning information that doesn't exist in the table, hence my confusion.

Try this:
-- Table variables standing in for the real tables
DECLARE @Activity TABLE
(
    [ProductId] INT
   ,[PurchaseDate] DATE
);
DECLARE @ProductIds TABLE
(
    [ProductId] INT
);
INSERT INTO @Activity ([ProductId], [PurchaseDate])
VALUES (1, '2020-10-31')
      ,(1, '2020-11-01')
      ,(1, '2020-11-02')
      ,(1, '2020-11-03')
      ,(2, '2020-10-31')
      ,(2, '2020-11-01')
      ,(2, '2020-11-03')
      ,(2, '2020-11-04')
      ,(3, '2020-10-31')
      ,(3, '2020-11-01')
      ,(4, '2020-10-31')
      ,(4, '2020-11-01')
      ,(4, '2020-11-03')
      ,(5, '2020-10-20')
      ,(6, '2020-10-31')
      ,(6, '2020-11-01')
      ,(6, '2020-11-02');
INSERT INTO @ProductIds ([ProductId])
VALUES (1), (2), (3), (4), (5), (6), (7);
DECLARE @date_beg DATE = '2020-10-31'
       ,@date_end DATE = '2020-11-03';
-- Build every (product, date) pair in the range, then keep the pairs with no matching activity row
SELECT P.[ProductId]
      ,Dates.[Date]
FROM @ProductIds P
CROSS APPLY
(
    -- One row per day in the range, numbered off sys.columns
    SELECT DATEADD(DAY, nbr - 1, @date_beg)
    FROM
    (
        SELECT ROW_NUMBER() OVER ( ORDER BY c.object_id ) AS Nbr
        FROM sys.columns c
    ) nbrs
    WHERE nbr - 1 <= DATEDIFF(DAY, @date_beg, @date_end)
) Dates ([Date])
LEFT JOIN @Activity A
    ON P.[ProductId] = A.[ProductId]
    AND Dates.[Date] = A.[PurchaseDate]
WHERE A.[ProductId] IS NULL
ORDER BY P.[ProductId]
        ,Dates.[Date];

You can generate the date series with a recursive query, then cross join that with the list of products to generate all possible combinations. Finally, you can use not exists to filter on tuples that do not exist in the activity table.
with dates as (
select convert(date, '20201031') purchasedate
union all select dateadd(day, 1, purchasedate) from dates where purchasedate < '20201103'
)
select p.productid, d.purchasedate
from productids p
cross join dates d
where not exists (
select 1
from activity a
where a.productid = p.productid and a.purchasedate = d.purchasedate
)
If you have a date range that spans more than 100 days, you need to add option (maxrecursion 0) at the very end of the query.

Relational databases have a basis in set theory, so naturally they have set operators built in for the usual set operations. We're just so used to operators that produce intersections that we overlook the other possibilities.
If you don't have a date table, you can generate a range of dates with a recursive query. Then create a set that's the cross product of ProductId and Date, and exclude any members that are in Activity using EXCEPT.
DECLARE @MinDate DATE = '2020-10-31';
DECLARE @MaxDate DATE = '2020-11-03';
WITH Dates AS (
    SELECT @MinDate AS [PurchaseDate]
    UNION ALL
    SELECT DATEADD(DAY, 1, PurchaseDate)
    FROM Dates
    WHERE PurchaseDate < @MaxDate
)
SELECT p.ProductId, d.PurchaseDate FROM ProductIds p CROSS JOIN Dates d
EXCEPT SELECT ProductId, PurchaseDate FROM Activity
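To sanity-check the "all combinations minus what exists" idea outside SQL Server, here is a minimal sketch using Python's built-in sqlite3 module. It is an illustration under assumptions, not the T-SQL above: SQLite stores the dates as ISO text, uses `WITH RECURSIVE`, and `date(x, '+1 day')` stands in for `DATEADD`. The table and column names are taken from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ProductIds (ProductId INTEGER);
CREATE TABLE Activity (ProductId INTEGER, PurchaseDate TEXT);
INSERT INTO ProductIds VALUES (1),(2),(3),(4),(5),(6),(7);
INSERT INTO Activity VALUES
  (1,'2020-10-31'),(1,'2020-11-01'),(1,'2020-11-02'),(1,'2020-11-03'),
  (2,'2020-10-31'),(2,'2020-11-01'),(2,'2020-11-03'),(2,'2020-11-04'),
  (3,'2020-10-31'),(3,'2020-11-01'),
  (4,'2020-10-31'),(4,'2020-11-01'),(4,'2020-11-03'),
  (5,'2020-10-20'),
  (6,'2020-10-31'),(6,'2020-11-01'),(6,'2020-11-02');
""")

# Recursive CTE builds one row per day; the cross join builds every
# (product, date) pair; EXCEPT removes the pairs present in Activity.
query = """
WITH RECURSIVE Dates(PurchaseDate) AS (
    SELECT :min_date
    UNION ALL
    SELECT date(PurchaseDate, '+1 day') FROM Dates WHERE PurchaseDate < :max_date
)
SELECT p.ProductId, d.PurchaseDate
FROM ProductIds p CROSS JOIN Dates d
EXCEPT
SELECT ProductId, PurchaseDate FROM Activity
ORDER BY 1, 2
"""
missing = conn.execute(
    query, {"min_date": "2020-10-31", "max_date": "2020-11-03"}
).fetchall()

for product_id, purchase_date in missing:
    print(product_id, purchase_date)
```

With the sample data this prints the 13 product/date pairs listed in the question, starting with `2 2020-11-02` and ending with `7 2020-11-03`.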

Related

Creating counts based on date ranges with inner join

Here is an illustration for what I'd like to do
Table A:
user_id | industry | startdate | enddate | generation
1 retail 2000-01-01 2001-01-01 Gen X
1 retail 2002-01-01 2003-02-01 Gen X
2 Tech 2001-01-01 2002-01-01 Gen X
2 Business 2002-03-01 2003-01-01 Gen X
2 Tech 2003-02-01 null Gen X
... ... ... ... ...
35642 Medicine 2020-02-01 2022-03-01 Gen Z
Table B
month
1990-01-01
1990-02-01
...
2022-03-01
Desired Result:
industry | generation| count | month
retail Gen X 200 2002-02-01
retail Gen Y 250 2002-02-01
Tech Gen X 130 2002-02-01
Tech Gen Y 166 2002-02-01
...
For now, I've only got tables A and B. I want to create counts by industry, by month, by generation, but I'm not sure how I can do this using the two tables that I have.
My (incorrect) approach would be something like select count(*), industry, month, generation where A.startdate < B.month and A.enddate > B.month, but this query is obviously not running. Is what I want to do possible with just tables A and B?
Apologies if I'm being unclear, I am admittedly new to SQL queries and am not sure how to approach this problem.
Try this approach:
Create a CTE that generates the distinct list of industry/generation by querying table A
Create a 2nd CTE that cartesian joins the first CTE to table B - giving you a list of all months/industry/generation
Left outer join table A to the 2nd CTE and query for the result you want to achieve
Following NickW's steps gives me the query below.
I assume that users whose enddate is null are still in business.
# Create tables
CREATE TABLE table_a (
user_id int,
industry text,
startdate date,
enddate date,
generation text
);
CREATE TABLE table_b (
month date
);
# Insert data
WITH series_months AS (
SELECT date(i)
from generate_series(
date '1999-01-01',
date '2012-09-01',
INTERVAL '1 month'
) i
)
INSERT INTO table_b (month)
SELECT * FROM series_months;
INSERT INTO table_a (user_id, industry, startdate, enddate, generation)
VALUES
(1, 'retail', '2000-01-01', '2001-01-01', 'Gen X'),
(1, 'retail', '2002-01-01', '2003-02-01', 'Gen X'),
(2, 'Tech', '2001-01-01', '2002-01-01', 'Gen X'),
(2, 'Business', '2002-03-01', '2003-01-01', 'Gen X'),
(2, 'Tech', '2003-02-01', NULL, 'Gen X');
# Perform joins
WITH industry_generation as (
SELECT distinct industry, generation from table_a
), months_industry_generation as (
SELECT *
FROM table_b, industry_generation
), combined_table AS (
SELECT mig.month, mig.industry, mig.generation, user_id, startdate, enddate
FROM
(SELECT * FROM months_industry_generation) mig
inner join table_a a
on mig.industry = a.industry AND mig.generation = a.generation
where month >= startdate AND (month <= enddate OR enddate is null)
)
SELECT industry, generation, count(user_id) AS count, month
FROM combined_table
GROUP BY month, industry, generation
ORDER BY 1, 2, 4
;

Return number of rows dependent on number

Have a table with this
Id | StartDate | NoOfMonths
1 | 2021-09-01 | 2
2 | 2021-09-01 | 3
And want a query to return this
Id | Date
1 | 2021-09-01
1 | 2021-10-01
2 | 2021-09-01
2 | 2021-10-01
2 | 2021-11-01
How can I make this happen?
Here is an example without an additional table:
DECLARE @t TABLE(
    ID int
  , StartDate date
  , NoOfMonths int
)
INSERT INTO @t VALUES
 (1, '2021-09-01', 2)
,(2, '2021-09-01', 3);
-- Each recursion step adds one month and decrements the remaining count
WITH cte AS(
    SELECT ID, StartDate, NoOfMonths
    FROM @t
    UNION ALL
    SELECT ID, DATEADD(MONTH, 1, StartDate), NoOfMonths-1
    FROM cte
    WHERE NoOfMonths > 1
)
SELECT ID, StartDate
FROM cte
ORDER BY ID, StartDate
This could also be solved with an additional calendar table, which you would populate and maintain yourself. The table could contain just dates (the first day of each month). You would then join rows from that calendar table to your original table using the DATEADD() function, assuming MS SQL Server. Because BETWEEN is inclusive at both ends, the upper bound is NoOfMonths - 1 months after the start. So something like:
select yt.Id, ct.DateMonth
from CalendarTable ct
inner join YourTable yt
on ct.DateMonth between yt.StartDate and DATEADD(MONTH, yt.NoOfMonths - 1, yt.StartDate)
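As a quick check of the calendar-table join, here is a small sketch run through Python's sqlite3 module. It is an assumption-laden stand-in rather than SQL Server code: dates are ISO text, and `date(StartDate, '+N months')` plays the role of `DATEADD(MONTH, N, StartDate)`. The upper bound uses NoOfMonths - 1 because BETWEEN includes both endpoints. CalendarTable, DateMonth and YourTable are the placeholder names from the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE YourTable (Id INTEGER, StartDate TEXT, NoOfMonths INTEGER);
INSERT INTO YourTable VALUES (1, '2021-09-01', 2), (2, '2021-09-01', 3);

-- Calendar of month starts; a real one would cover many years.
CREATE TABLE CalendarTable (DateMonth TEXT);
INSERT INTO CalendarTable VALUES
  ('2021-08-01'), ('2021-09-01'), ('2021-10-01'), ('2021-11-01'), ('2021-12-01');
""")

rows = conn.execute("""
SELECT yt.Id, ct.DateMonth
FROM CalendarTable ct
JOIN YourTable yt
  ON ct.DateMonth BETWEEN yt.StartDate
     AND date(yt.StartDate, '+' || (yt.NoOfMonths - 1) || ' months')
ORDER BY yt.Id, ct.DateMonth
""").fetchall()

for row in rows:
    print(row)
```

Id 1 yields two months and Id 2 yields three, matching the desired output in the question. Using NoOfMonths without the `- 1` would return one extra month per Id.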

SQL: CTE query Speed

I am using SQL Server 2008 and am trying to increase the speed of my query below. The query assigns points to patients based on readmission dates.
Example: A patient is seen on 1/2, 1/5, 1/7, 1/8, 1/9, 2/4. I want to first group visits within 3 days of each other. 1/2-5 are grouped, 1/7-9 are grouped. 1/5 is NOT grouped with 1/7 because 1/5's actual visit date is 1/2. 1/7 would receive 3 points because it is a readmit from 1/2. 2/4 would also receive 3 points because it is a readmit from 1/7. When the dates are grouped the first date is the actual visit date.
Most articles suggest limiting the data set or adding indexes to increase speed. I have limited the number of rows to about 15,000 and added an index. When running the query with 45 test visit dates / 3 test patients, the query takes 1.5 min to run. With my actual data set it takes > 8 hrs.
How can I get this query to run < 1 hr? Is there a better way to write my query? Does my Index look correct? Any help would be greatly appreciated.
Example expected results below query.
;CREATE TABLE RiskReadmits(MRN INT, VisitDate DATE, Category VARCHAR(15))
;CREATE CLUSTERED INDEX Risk_Readmits_Index ON RiskReadmits(VisitDate)
;INSERT RiskReadmits(MRN,VisitDate,CATEGORY)
VALUES
(1, '1/2/2016','Inpatient'),
(1, '1/5/2016','Inpatient'),
(1, '1/7/2016','Inpatient'),
(1, '1/8/2016','Inpatient'),
(1, '1/9/2016','Inpatient'),
(1, '2/4/2016','Inpatient'),
(1, '6/2/2016','Inpatient'),
(1, '6/3/2016','Inpatient'),
(1, '6/5/2016','Inpatient'),
(1, '6/6/2016','Inpatient'),
(1, '6/8/2016','Inpatient'),
(1, '7/1/2016','Inpatient'),
(1, '8/1/2016','Inpatient'),
(1, '8/4/2016','Inpatient'),
(1, '8/15/2016','Inpatient'),
(1, '8/18/2016','Inpatient'),
(1, '8/28/2016','Inpatient'),
(1, '10/12/2016','Inpatient'),
(1, '10/15/2016','Inpatient'),
(1, '11/17/2016','Inpatient'),
(1, '12/20/2016','Inpatient')
;WITH a AS (
SELECT
z1.VisitDate
, z1.MRN
, (SELECT MIN(VisitDate) FROM RiskReadmits WHERE VisitDate > DATEADD(day, 3, z1.VisitDate)) AS NextDay
FROM
RiskReadmits z1
WHERE
CATEGORY = 'Inpatient'
), a1 AS (
SELECT
MRN
, MIN(VisitDate) AS VisitDate
, MIN(NextDay) AS NextDay
FROM
a
GROUP BY
MRN
), b AS (
SELECT
VisitDate
, MRN
, NextDay
, 1 AS OrderRow
FROM
a1
UNION ALL
SELECT
a.VisitDate
, a.MRN
, a.NextDay
, b.OrderRow +1 AS OrderRow
FROM
a
JOIN b
ON a.VisitDate = b.NextDay
), c AS (
SELECT
MRN,
VisitDate
, (SELECT MAX(VisitDate) FROM b WHERE b1.VisitDate > VisitDate AND b.MRN = b1.MRN) AS PreviousVisitDate
FROM
b b1
)
SELECT distinct
c1.MRN,
c1.VisitDate
, CASE
WHEN DATEDIFF(day,c1.PreviousVisitDate,c1.VisitDate) < 30 THEN PreviousVisitDate
ELSE NULL
END AS ReAdmissionFrom
, CASE
WHEN DATEDIFF(day,c1.PreviousVisitDate,c1.VisitDate) < 30 THEN 3
ELSE 0
END AS Points
FROM
c c1
ORDER BY c1.MRN
Expected Results:
MRN VisitDate ReAdmissionFrom Points
1 2016-01-02 NULL 0
1 2016-01-07 2016-01-02 3
1 2016-02-04 2016-01-07 3
1 2016-06-02 NULL 0
1 2016-06-06 2016-06-02 3
1 2016-07-01 2016-06-06 3
1 2016-08-01 NULL 0
1 2016-08-15 2016-08-01 3
1 2016-08-28 2016-08-15 3
1 2016-10-12 NULL 0
1 2016-11-17 NULL 0
1 2016-12-20 NULL 0
Oops, I changed the names of a few CTEs (and the post mangled the code formatting).
Using the original names, it should be like this:
b AS (
SELECT
VisitDate
, MRN
, NextDay
, 1 AS OrderRow
FROM
a1
UNION ALL
SELECT
a.VisitDate
, a.MRN
, a.NextDay
, b.OrderRow +1 AS OrderRow
FROM
a AS a
JOIN b
ON a.VisitDate = b.NextDay AND a.MRN = b.MRN
)
I'm going to take a wild guess here and say you want to change the b CTE so that the second SELECT has AND a.MRN = b.MRN as a second join condition, like this:
, b AS (
SELECT
VisitDate
, MRN
, NextDay
, 1 AS OrderRow
FROM
firstVisitAndFollowUp
UNION ALL
SELECT
a.VisitDate
, a.MRN
, a.NextDay
, b.OrderRow +1 AS OrderRow
FROM
visitsDistance3daysOrMore AS a
JOIN b
ON a.VisitDate = b.NextDay AND a.MRN = b.MRN
)
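For reference, the grouping rule from the question can also be written as a single ordered pass over one patient's visits, which makes the intended logic easier to check against the expected results. This is a procedural sketch in Python, not a replacement for the T-SQL; the function name `score_visits` and its parameters are made up for illustration, and it assumes a visit is folded into the current group when it falls within 3 days of that group's first date.

```python
from datetime import date, timedelta

def score_visits(visit_dates, group_days=3, readmit_days=30):
    """Return (visit_date, readmission_from, points) rows for one patient."""
    results = []
    previous = None  # first date of the most recent group
    for visit in sorted(visit_dates):
        # Visits within group_days of the group's first date belong to that group.
        if previous is not None and visit <= previous + timedelta(days=group_days):
            continue
        # A new group is a readmission if it starts < readmit_days after the previous one.
        is_readmit = previous is not None and (visit - previous).days < readmit_days
        results.append((visit, previous if is_readmit else None, 3 if is_readmit else 0))
        previous = visit
    return results

# Sample dates for MRN 1 from the question.
sample = [(1, 2), (1, 5), (1, 7), (1, 8), (1, 9), (2, 4), (6, 2), (6, 3), (6, 5),
          (6, 6), (6, 8), (7, 1), (8, 1), (8, 4), (8, 15), (8, 18), (8, 28),
          (10, 12), (10, 15), (11, 17), (12, 20)]
visits = [date(2016, month, day) for month, day in sample]
rows = score_visits(visits)

for visit, readmit_from, points in rows:
    print(visit, readmit_from, points)
```

Run on the sample data, this produces the 12 rows shown under "Expected Results". With several patients you would call it once per MRN, which is the same partitioning the a.MRN = b.MRN condition gives the recursive CTE.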

Split project date range into rows of work weeks for all projects in SQL

I have a projects table with a total_hours column and a startdate, enddate column.
If a project has a date range of 5 weeks, I need a query that returns 5 rows with the incremented work week number in a calculated field for all projects.
Here is my table data with a query showing the range in work week format.
drop table #temp
CREATE TABLE #Temp
(ProjectID int, Total_Hours int, StartDate datetime, EndDate datetime)
;
INSERT INTO #Temp
(ProjectID, Total_Hours, StartDate, EndDate)
VALUES
(645, 555, '2016-01-01 00:00:00', '2016-02-01 00:00:00'),
(700, 234, '2015-01-14 00:00:00', '2016-02-01 00:00:00')
Select datepart(week,startdate),datepart(week,Enddate) from #Temp
I need a query that will return the following values
ProjectID WW
645 1
645 2
645 3
645 4
645 5
645 6
700 3
700 4
700 5
700 6
I feel I should use recursion but don't know how.
You could do it with recursion but a numbers table is generally more efficient:
with n as (
    select row_number() over (order by (select null)) - 1 as n
    from master.dbo.spt_values
)
select t.projectid, dateadd(week, n.n, t.startdate) as ww
from #Temp t join
     n
     on dateadd(week, n.n, t.startdate) <= t.enddate;
If you prefer a recursive query, use
with t as (
select projectid,datepart(week,startdate) sw,datepart(week,enddate) ew from #Temp
union all
select projectid,sw+1,ew from t where sw < ew
)
select projectid, sw
from t
order by 1,2

Table with dates, table with week numbers, join together?

I have two tables. Table 1:
StuAp_Id StuAp_StaffID StuAp_Date StuAp_Attended
16 77000002659366 2011-09-07 Yes
17 77000002659366 2011-09-14 Yes
18 77000002659366 2011-09-14 Yes
19 77000002659366 2011-09-14 No
20 77000001171783 2011-09-19 Yes
Table 2:
Year Week Start
2011 1 2011-09-05 00:00:00.000
2011 2 2011-09-12 00:00:00.000
2011 3 2011-09-19 00:00:00.000
2011 4 2011-09-26 00:00:00.000
2011 5 2011-10-03 00:00:00.000
2011 6 2011-10-10 00:00:00.000
2011 7 2011-10-17 00:00:00.000
2011 8 2011-10-24 00:00:00.000
2011 9 2011-10-31 00:00:00.000
How would I join these two tables to make something like this:
StuAp_Id StuAp_StaffID StuAp_Date StuAp_Attended Week
16 77000002659366 2011-09-07 Yes 1
17 77000002659366 2011-09-14 Yes 2
18 77000002659366 2011-09-14 Yes 2
19 77000002659366 2011-09-14 No 2
20 77000001171783 2011-09-19 Yes 3
Thanks in advance
You can write a simple INNER JOIN with a GROUP BY clause:
SELECT Table1.*
,MAX(WEEK) AS WEEK
FROM Table1
INNER JOIN Table2 ON STUAP_DATE >= START
GROUP BY STUAP_ID,STUAP_STAFFID,STUAP_DATE,STUAP_ATTENDED
I don't know the specifics of SQL Server 2005 (I don't have one around to test), but I would use a subselect, e.g.:
select table_1.*,
[week] = (select isnull(max([week]), 0)
from table_2
where table_1.StuAp_Date >= table_2.start)
from table_1
CTEs to the rescue!
create table StuAp (
StuAp_Id int,
StuAp_StaffID bigint,
StuAp_Date datetime,
StuAp_Attended varchar(3)
)
create table Weeks (
Year int,
Week int,
Start datetime
)
insert into StuAp
values (16, 77000002659366, {d '2011-09-07'}, 'Yes'),
(17, 77000002659366, {d '2011-09-14'}, 'Yes'),
(18, 77000002659366, {d '2011-09-14'}, 'Yes'),
(19, 77000002659366, {d '2011-09-14'}, 'No'),
(20, 77000001171783, {d '2011-09-19'}, 'Yes')
insert into Weeks
values (2011, 1, {d '2011-09-05'}),
(2011, 2, {d '2011-09-12'}),
(2011, 3, {d '2011-09-19'}),
(2011, 4, {d '2011-09-26'}),
(2011, 5, {d '2011-10-03'}),
(2011, 6, {d '2011-10-10'}),
(2011, 7, {d '2011-10-17'}),
(2011, 8, {d '2011-10-24'}),
(2011, 9, {d '2011-10-31'})
;with OrderedWeeks as (
select ROW_NUMBER() OVER (ORDER BY year, week) as row, w.*
from Weeks w
), Ranges as (
select w1.*, w2.Start as Finish
from OrderedWeeks w1 inner join
OrderedWeeks w2 on w1.row = w2.row - 1
)
select s.StuAp_Id, s.StuAp_StaffID, s.StuAp_Date, s.StuAp_Attended, r.Week
from StuAp s inner join
Ranges r on s.StuAp_Date >= r.Start and s.StuAp_Date < r.Finish
This should scale quite well too.
Honestly though, if you find yourself doing queries like this often, you should really consider changing the structure of your Weeks table to include a finish date. You could even make it an indexed view, or (assuming that the data changes rarely), you could keep your original table and use triggers or a SQL Agent job to keep a copy that contains Finish up to date.
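To illustrate what a Finish column buys you, here is a minimal sketch using Python's sqlite3 module. It is an assumption-based stand-in for the T-SQL above: dates are ISO text, only the first four weeks are loaded, and Finish is derived with a correlated subquery (the next week's Start) instead of ROW_NUMBER. Table and column names follow the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE StuAp (StuAp_Id INTEGER, StuAp_Date TEXT, StuAp_Attended TEXT);
CREATE TABLE Weeks (Year INTEGER, Week INTEGER, Start TEXT);
INSERT INTO StuAp VALUES
  (16, '2011-09-07', 'Yes'), (17, '2011-09-14', 'Yes'), (18, '2011-09-14', 'Yes'),
  (19, '2011-09-14', 'No'),  (20, '2011-09-19', 'Yes');
INSERT INTO Weeks VALUES
  (2011, 1, '2011-09-05'), (2011, 2, '2011-09-12'),
  (2011, 3, '2011-09-19'), (2011, 4, '2011-09-26');
""")

# Finish = Start of the following week. The last week gets a NULL Finish and
# therefore matches nothing, just as the Ranges CTE above drops the final week.
rows = conn.execute("""
WITH Ranges AS (
    SELECT w.Week, w.Start,
           (SELECT MIN(n.Start) FROM Weeks n WHERE n.Start > w.Start) AS Finish
    FROM Weeks w
)
SELECT s.StuAp_Id, s.StuAp_Date, r.Week
FROM StuAp s
JOIN Ranges r
  ON s.StuAp_Date >= r.Start AND s.StuAp_Date < r.Finish
ORDER BY s.StuAp_Id
""").fetchall()

for row in rows:
    print(row)
```

The half-open range (>= Start, < Finish) assigns each appointment to exactly one week, so no GROUP BY or MAX is needed. If Finish were stored in the table, the Ranges step would disappear entirely.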
SET ANSI_WARNINGS ON;
GO
DECLARE @Table1 TABLE
(
    StuAp_Id INT PRIMARY KEY
   ,StuAp_StaffID NUMERIC(14,0) NOT NULL
   ,StuAp_Date DATETIME NOT NULL
   ,StuAp_Attended VARCHAR(3) NOT NULL
   ,StuAp_DateOnly AS DATEADD(DAY, DATEDIFF(DAY,0,StuAp_Date), 0) PERSISTED
);
INSERT @Table1
SELECT 16,77000002659366 ,'2011-09-07','Yes'
UNION ALL
SELECT 17,77000002659366 ,'2011-09-14','Yes'
UNION ALL
SELECT 18,77000002659366 ,'2011-09-14','Yes'
UNION ALL
SELECT 19,77000002659366 ,'2011-09-14','No'
UNION ALL
SELECT 20,77000001171783 ,'2011-09-19','Yes';
DECLARE @Table2 TABLE
(
    Year INT NOT NULL
   ,Week INT NOT NULL
   ,Start DATETIME NOT NULL
   ,[End] AS DATEADD(DAY,6,Start) PERSISTED
   ,PRIMARY KEY(Year, Week)
   ,UNIQUE(Start)
);
INSERT @Table2
SELECT 2011,1 ,'2011-09-05 00:00:00.000'
UNION ALL
SELECT 2011,2 ,'2011-09-12 00:00:00.000'
UNION ALL
SELECT 2011,3 ,'2011-09-19 00:00:00.000'
UNION ALL
SELECT 2011,4 ,'2011-09-26 00:00:00.000'
UNION ALL
SELECT 2011,5 ,'2011-10-03 00:00:00.000'
UNION ALL
SELECT 2011,6 ,'2011-10-10 00:00:00.000'
UNION ALL
SELECT 2011,7 ,'2011-10-17 00:00:00.000'
UNION ALL
SELECT 2011,8 ,'2011-10-24 00:00:00.000'
UNION ALL
SELECT 2011,9 ,'2011-10-31 00:00:00.000';
--Solution 1 : if StuAp_Date has only date part
SELECT a.*, b.Week
FROM @Table1 a
INNER JOIN @Table2 b ON a.StuAp_Date BETWEEN b.Start AND b.[End]
--Solution 2 : if StuAp_Date has only date part
SELECT a.*, b.Week
FROM @Table1 a
INNER JOIN @Table2 b ON a.StuAp_Date BETWEEN b.Start AND DATEADD(DAY,6,b.Start)
--Solution 3 : if StuAp_Date has date & time
SELECT a.*, b.Week
FROM @Table1 a
INNER JOIN @Table2 b ON a.StuAp_DateOnly BETWEEN b.Start AND b.[End]
--Solution 4 : if StuAp_Date has date & time
SELECT a.*, b.Week
FROM @Table1 a
INNER JOIN @Table2 b ON DATEADD(DAY, DATEDIFF(DAY,0,a.StuAp_Date), 0) BETWEEN b.Start AND DATEADD(DAY,6,b.Start)