Combining WITH, UNION and OR in JOIN - sql

I have a national database of all hospital records, and another national database of infection events. I am looking to extract all relevant hospital events for the infection, but I am struggling to find a way to optimise the query. OtherData is a proxy for another set of columns.
The infection data looks like this:
UniqueID   PatientNumber  HospitalNumber  Date        OtherData
========   =============  ==============  ==========  =========
14000000   1234           BAC             2022-01-27  DELTA
12007927   5412           HSA             2022-01-20  OMICRON
1          7862           UDO             2020-02-01  ALPHA
The hospital data looks like this:
EpisodeID  PatientNumber  HospitalNumber  StartDate   EndDate     OtherData
=========  =============  ==============  ==========  ==========  =========
4          1234                           2022-01-25  NA          ICU
987213     5412                           2022-01-20  2022-01-27  DIED
3                         BAC             2021-11-20  2022-01-20  DISCHARGED
3                         BAC             2020-01-29  2022-02-10  DISCHARGED
The data can be missing lots of fields, and I have two identifiers (national and local) I can use to link the data, so I query against both and combine the results with UNION. But because these are national registers, and I'm dealing with Covid-19 data, we are talking about linking tens of millions of records (on separate servers). To minimise the amount of hospital data pulled in, I am attempting to link within a date range around the infection.
My query as it stands is below. It took 8 minutes to pull a single UniqueID when I tested it, and I have tens of millions.
I'm not sure whether I should be using AND (X OR Y) with the dates in the INNER JOIN, or whether I should separate them out and use two more UNIONs.
This data is ingested into R for further processing and analytics.
Help appreciated!
DECLARE @days AS INT = 28;

WITH
infections AS (
    SELECT UniqueID
          ,PatientNumber
          ,HospitalNumber
          ,Date
          ,OtherData
    FROM infections
),
link_tbl AS (
    SELECT
         i.UniqueID
        ,h.EpisodeID
        ,h.PatientNumber
        ,h.HospitalNumber
        ,h.StartDate
        ,h.EndDate
    FROM infections i
    INNER JOIN hospital h
        ON i.PatientNumber = h.PatientNumber
        AND (h.EndDate
                 BETWEEN CONVERT(date, DATEADD(DAY, -@days, i.Date))
                     AND CONVERT(date, DATEADD(DAY, @days, i.Date))
             OR h.StartDate
                 BETWEEN CONVERT(date, DATEADD(DAY, -@days, i.Date))
                     AND CONVERT(date, DATEADD(DAY, @days, i.Date))
            )

    UNION

    SELECT
         i.UniqueID
        ,h.EpisodeID
        ,h.PatientNumber
        ,h.HospitalNumber
        ,h.StartDate
        ,h.EndDate
    FROM infections i
    INNER JOIN hospital h
        ON i.HospitalNumber = h.HospitalNumber
        AND (h.EndDate
                 BETWEEN CONVERT(date, DATEADD(DAY, -@days, i.Date))
                     AND CONVERT(date, DATEADD(DAY, @days, i.Date))
             OR h.StartDate
                 BETWEEN CONVERT(date, DATEADD(DAY, -@days, i.Date))
                     AND CONVERT(date, DATEADD(DAY, @days, i.Date))
            )
)
SELECT
     hospital.allothervars* --- (these are named)
    ,infections.*           --- named in query
FROM hospital
INNER JOIN link_tbl ON hospital.EpisodeID = link_tbl.EpisodeID
INNER JOIN infections ON infections.UniqueID = link_tbl.UniqueID

I deal with data on a similar scale with similar use cases. One thing I found is that filtering by an unindexed date is VERY slow. We got around it by finding the ID for a specific date and using that in a subquery.
So, as an example:
SELECT Id,
       Date,
       Value
FROM Table
WHERE Id >= (SELECT MIN(Id) FROM Table WHERE Date = '1970-01-01')
rather than doing
SELECT Id,
       Date,
       Value
FROM Table
WHERE Date >= '1970-01-01'
Indexing datetime data creates a pretty big file size, so most of the time it's not indexed. The SQL execution engine can do a much better job filtering on primary keys than on datetimes.
What works for me is generating a list of the minimum and maximum IDs per day for the date range in question. Then, whenever I need to run an ad hoc request or something that has a bounded StartDate but no EndDate, I can just hard-code the ID and use a greater-than operator for the filter.
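A minimal sketch of that lookup, with assumed table and column names (dbo.[Table] with an Id primary key and a Date column), not actual code from the system described above:
-- Build the per-day ID boundaries once:
SELECT CONVERT(date, [Date]) AS [Day],
       MIN(Id) AS MinId,
       MAX(Id) AS MaxId
INTO dbo.DailyIdBounds
FROM dbo.[Table]
GROUP BY CONVERT(date, [Date]);

-- An ad hoc query bounded by a StartDate but no EndDate then becomes a
-- cheap primary-key filter instead of a scan on the unindexed date column:
SELECT t.Id, t.[Date], t.Value
FROM dbo.[Table] t
WHERE t.Id >= (SELECT MinId FROM dbo.DailyIdBounds WHERE [Day] = '1970-01-01');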

Related

SQL Server sampling large volume of data per hour

I'm using SQL Server 2016, and have a very large table containing millions of rows of data from different sources at irregular intervals over several years. The table cannot be altered. Typical data looks like this -
Reading_ID  Source  Date                 Reading
==========  ======  ===================  =======
1           1       2023/01/01 00:04:00  7
2           1       2023/01/01 00:10:00  3
3           2       2023/01/01 00:15:00  8
4           1       2023/01/01 01:00:00  2
5           2       2023/01/01 01:03:00  15
The table has CONSTRAINT [PK_DATA_READINGS] PRIMARY KEY CLUSTERED ([Source] ASC, [Date] ASC). The Source can be any number; it's not fixed or known in advance. New sources can start at any time.
What I want to do is specify a date range and an interval in hours, then get just one reading from each source every X hours, i.e. in the above data row 2 wouldn't be returned as it's too close to row 1.
I've tried something like the following -
DECLARE @Start_Date DATETIME = '2023/01/01 00:00:00',
        @End_Date DATETIME = '2023/02/01 00:00:00',
        @Interval_Hours INT = 4;

WITH HOURLY_DATA AS (
    SELECT d.Source,
           d.Date,
           d.Reading,
           ROW_NUMBER() OVER (PARTITION BY d.Source, DATEDIFF(HOUR, @Start_Date, d.Date) / @Interval_Hours
                              ORDER BY d.Source, d.Date) AS SOURCE_HOUR_ROW
    FROM data_readings d
    WHERE d.Date BETWEEN @Start_Date AND @End_Date
)
SELECT h.Source,
       h.Date,
       h.Reading
FROM HOURLY_DATA h
WHERE h.SOURCE_HOUR_ROW = 1
But it's still very slow to execute, sometimes taking 5 minutes or more to complete. I would like a faster way to get this data. I've looked at the Explain Plan, but can't see an obvious solution.
Thank you for looking.
You say the Source column has no table that it correlates to. This significantly worsens performance options, as it means you have no way of skipping through your (Source, Date) index by date.
Ideally you would have a table containing a list of possible Source values using a foreign-key relationship. There is no reason why you couldn't update this dynamically.
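A rough sketch of that idea (the object names here are assumptions, and the ALTER only applies if the readings table may in fact be changed):
CREATE TABLE dbo.Sources (
    Source INT NOT NULL CONSTRAINT PK_Sources PRIMARY KEY
);

-- Seed it from the data already present, and re-run this whenever new
-- sources may have appeared ("update it dynamically"):
INSERT INTO dbo.Sources (Source)
SELECT DISTINCT dr.Source
FROM dbo.data_readings dr
WHERE NOT EXISTS (SELECT 1 FROM dbo.Sources s WHERE s.Source = dr.Source);

-- Optional, and only if the base table can take a constraint:
ALTER TABLE dbo.data_readings
    ADD CONSTRAINT FK_data_readings_Sources
    FOREIGN KEY (Source) REFERENCES dbo.Sources (Source);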
However, you can hack it with an indexed view.
CREATE VIEW dbo.vAllSources
WITH SCHEMABINDING
AS
SELECT
    dr.Source,
    COUNT_BIG(*) AS [Count]
FROM dbo.data_readings dr
GROUP BY
    dr.Source;

CREATE UNIQUE CLUSTERED INDEX UX_AllSources ON dbo.vAllSources (Source);
The server will efficiently maintain this index based on the original table.
Then you can do a simple join. Use the NOEXPAND hint to force it to use the index.
DECLARE @Start_Date DATETIME = '20230101 00:00:00',
        @End_Date DATETIME = '20230201 00:00:00',
        @Interval_Hours INT = 4;

WITH HOURLY_DATA AS (
    SELECT
        d.Source,
        d.Date,
        d.Reading,
        ROW_NUMBER() OVER (PARTITION BY d.Source, DATEDIFF(HOUR, @Start_Date, d.Date) / @Interval_Hours
                           ORDER BY d.Date) AS SOURCE_HOUR_ROW
    FROM dbo.vAllSources s WITH (NOEXPAND)
    JOIN data_readings d
        ON s.Source = d.Source
        AND d.Date BETWEEN @Start_Date AND @End_Date
)
SELECT h.Source,
       h.Date,
       h.Reading
FROM HOURLY_DATA h
WHERE h.SOURCE_HOUR_ROW = 1;
Note that BETWEEN on date values is generally not recommended, as it implies >= AND <=. You are far better off using a half-open interval:
AND d.Date >= @Start_Date AND d.Date < @End_Date
You should also use unambiguous date formats.
The slowness is caused by the volume of data in the CTE.
I found this solution, which works faster: How to sample records by time

SQL Server full outer group not getting all values

I am trying to full join two tables by their date, with one table having columns called 'date', 'parties', 'total', etc., and another table just having dates.
Below is the query I have:
SELECT
    rangeDates.[ListOFDates],
    partiesDetails.[Party], partiesDetails.[Amount]
FROM rangeDates
FULL OUTER JOIN partiesDetails
    ON rangeDates.[ListOFDates] = partiesDetails.[Date]
Below is the rangeDates table I have. There's a date for every day over a set period; for example, below it starts at '2017-02-02' and may have a date every day until '2018-03-01'.
Ref ListOFDates
1 2017-02-02
2 2017-02-03
3 2017-02-04
.........
And in the partiesDetails table
Date Party Amount
2017-02-03 Tuf 5000
2017-04-01 Tuf 2000
2017-05-22 Wing 3000
.................
The ideal results I would want is:
ListOfDates Party Amount
2017-02-02 NULL NULL
2017-02-03 Tuf 5000
2017-02-04 NULL NULL
............
I feel that maybe you should be using a calendar table here:
WITH dates AS (
    SELECT CAST('20170202' AS date) AS [date]
    UNION ALL
    SELECT DATEADD(dd, 1, [date])
    FROM dates
    WHERE DATEADD(dd, 1, [date]) <= '2018-03-01'
)
SELECT
    d.date,
    p.[Party],
    p.[Amount]
FROM dates d
LEFT JOIN partiesDetails p
    ON d.date = p.[Date]
ORDER BY
    d.date
OPTION (MAXRECURSION 0);
I make this suggestion because you used the phrase "may have a date every day till", which seems to imply that maybe the rangeDates table does not in fact cover the entire date range you have in mind.

How to return a default value when no rows are returned from the select statement

I have a select statement that returns two columns: a date column and a count(Value) column. When the count(Value) column doesn't have any records, I need it to return 0. Currently, it just skips that date record altogether.
Here are the basics of the query.
select convert(varchar(25), DateTime, 101) as recordDate,
count(Value) as recordCount
from History
where Value < 700
group by convert(varchar(25), DateTime, 101)
Here are some results that I'm getting.
+------------+-------------+
| recordDate | recordCount |
+------------+-------------+
| 02/26/2014 |         143 |
| 02/27/2014 |         541 |
| 03/01/2014 |          21 |
| 03/02/2014 |          60 |
| 03/03/2014 |         113 |
+------------+-------------+
Notice it skips 2/28/2014. This is because the count(value) column doesn't have anything to count. How can I add the record in there that has the date of 2/28/2014, with a recordCount of 0?
To generate rows for missing dates, you can join your data to a date dimension table.
It would look something like this:
select convert(varchar(25), ddt.DateField, 101) as recordDate,
       count(h.Value) as recordCount
from History h
right join dbo.DateDimensionTable ddt
    on ddt.DateField = convert(varchar(25), h.DateTime, 101)
    and h.Value < 700
group by convert(varchar(25), ddt.DateField, 101)
If your table uses the DateTime column to store dates only (meaning the time is always midnight), then you can replace this
right join dbo.DateDimensionTable ddt
on ddt.DateField = convert(varchar(25), h.DateTime, 101)
with this
right join dbo.DateDimensionTable ddt
on ddt.DateField = h.DateTime
You may use COUNT(*); it will return zero if nothing was found for the column. You may also group the result set by the Value column if needed.
select convert(varchar(25), DateTime, 101) as recordDate,
       CASE WHEN count(Value) = 0 THEN 0 ELSE count(Value) END as recordCount
from History
where Value < 700
group by convert(varchar(25), DateTime, 101)
When you use a GROUP BY, it only creates a distinct list of values that exist in your records. Since 2014-02-28 has no records, it will not show up in the group by.
Your best bet is to generate a list of values (dates, in your case) and LEFT JOIN or APPLY that table against your History table.
I can't seem to copy my T-SQL in here so here's a hastebin.
http://hastebin.com/winaqutego.vbs
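In case the paste is unavailable, here is a rough sketch of the approach described above (the date window is an assumption; table and column names follow the question):
DECLARE @Start date = '2014-02-26', @End date = '2014-03-03';

WITH DateList AS (
    SELECT @Start AS d
    UNION ALL
    SELECT DATEADD(DAY, 1, d) FROM DateList WHERE d < @End
)
SELECT dl.d           AS recordDate,
       COUNT(h.Value) AS recordCount   -- COUNT(col) ignores NULLs, so empty days give 0
FROM DateList dl
LEFT JOIN History h
    ON CONVERT(date, h.[DateTime]) = dl.d
    AND h.Value < 700
GROUP BY dl.d
OPTION (MAXRECURSION 0);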
The best practice would be for you to have a datamart where a separate dimensional table for dates is kept, with all the dates you might be interested in, even if they lack amounts. DMason's answer shows the query with such a dimensional table.
To keep with best practices, you would also have a fact table where this historical data is kept already pre-grouped at the granularity you need (daily, in this case), so you wouldn't need a GROUP BY unless you needed a coarser granularity (weekly, monthly, yearly).
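As a rough sketch of what such a pre-grouped daily fact table could look like (all object names here are assumptions, not something from the question):
CREATE TABLE dbo.FactDailyHistory (
    DateKey     date NOT NULL CONSTRAINT PK_FactDailyHistory PRIMARY KEY,
    RecordCount int  NOT NULL
);

-- Loaded by a scheduled job from the operational History table:
INSERT INTO dbo.FactDailyHistory (DateKey, RecordCount)
SELECT CONVERT(date, [DateTime]), COUNT(*)
FROM History
WHERE Value < 700
GROUP BY CONVERT(date, [DateTime]);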
And in both your operational and datamart databases the dates would be stored as dates, not...
But then, since this is real world and you might not be able to change what somebody else made... If you: a) only care about the dates that appear in [History], and b) such dates are never stored with hours/minutes; then the following query might be what you need:
SELECT MyDates.[DateTime], COUNT(H.Value) AS RecordCount
FROM (
    SELECT DISTINCT [DateTime] FROM History
) MyDates
LEFT JOIN History H
    ON MyDates.[DateTime] = H.[DateTime]
    AND H.Value < 700
GROUP BY MyDates.[DateTime]
Do try to add an index over DateTime and to further constrain the query with an earliest/latest date for better performance results.
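For instance, a sketch of both suggestions combined (the index name and the date window are assumptions):
CREATE INDEX IX_History_DateTime ON History ([DateTime]) INCLUDE (Value);

SELECT MyDates.[DateTime], COUNT(H.Value) AS RecordCount
FROM (
    SELECT DISTINCT [DateTime]
    FROM History
    WHERE [DateTime] >= '2014-02-26' AND [DateTime] < '2014-03-04'
) MyDates
LEFT JOIN History H
    ON MyDates.[DateTime] = H.[DateTime]
    AND H.Value < 700
GROUP BY MyDates.[DateTime]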
I agree that a Dates table (AKA time dimension) is the right solution, but there is a simpler option:
SELECT
    CONVERT(VARCHAR(25), DateTime, 101) AS RecordDate,
    SUM(CASE WHEN Value < 700 THEN 1 ELSE 0 END) AS RecordCount
FROM
    History
GROUP BY
    CONVERT(VARCHAR(25), DateTime, 101)
Try this:
DECLARE @Records TABLE (
    [RecordDate] DATETIME,
    [RecordCount] INT
)

DECLARE @Date DATETIME = '02/26/2014'    -- Enter whatever date you want to start with
DECLARE @EndDate DATETIME = '03/31/2014' -- Enter whatever date you want to stop

WHILE (1=1)
BEGIN
    -- Insert the date into the table variable along with the count
    INSERT INTO @Records (RecordDate, RecordCount)
    VALUES (CONVERT(VARCHAR(25), @Date, 101),
            (SELECT COUNT(*) FROM dbo.YourTable WHERE RecordDate = @Date))

    -- Go to the next day
    SET @Date = DATEADD(d, 1, @Date)

    -- If we have surpassed the end date, break out of the loop
    IF (@Date > @EndDate) BREAK;
END

SELECT * FROM @Records
If your dates have time components, you would need to modify this to check for start and end of day in the SELECT COUNT(*)... query.
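A sketch of that change: it replaces the INSERT inside the loop above (reusing its @Records and @Date variables) and swaps the equality check for a half-open range, so any time of day on @Date is counted.
    INSERT INTO @Records (RecordDate, RecordCount)
    VALUES (CONVERT(VARCHAR(25), @Date, 101),
            (SELECT COUNT(*)
             FROM dbo.YourTable
             WHERE RecordDate >= @Date
               AND RecordDate < DATEADD(d, 1, @Date)))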

SQL QUERY showing Between Dates as specific dates + Data belonging to each date!

This is how the table is presented:
SELECT RequestsID, Country, Activity,
       [People needed (each day)], [Start date], [End date]
FROM dbo.Requests
There will be a lot of requests, and I would like to sum up the "People needed" per day (!), not as a range between the Start date and End date.
I would also like to group by country, and have the possibility to set between which dates I want to get data.
Some days might be empty regarding needed people (0), but the date should be presented anyway.
Note that there can be several requests pointing to the same dates and the same country - but the Activity is then different.
The query should be something like this (well, it's not SQL as you can see, I'm just trying to show the logic):
From Requests,
show Country and SUM 'People needed'
where (column not in Requests table-Date) is a Date (will be
a number of dates, I want to set the scope by a Start and End date)
and Requests.Country is #Country
(and the Date(s) above of course is between the Requests Start date and End date...)
And from (a non existing table...?) show Date
Group by Country
I would like to see something like this:
Date        Country  People needed
06/01/2010  Nigeria  34    // this might be from three different Requests, all pointing to Nigeria; People needed might be (30+1+3 = 34)
06/02/2010  Nigeria  10
06/03/2010  Nigeria  0
06/04/2010  Nigeria  1
06/05/2010  Nigeria  134
06/01/2010  China    2
06/02/2010  China    0
06/03/2010  China    14
06/04/2010  China    23
06/05/2010  China    33
06/01/2010  Chile    3
06/02/2010  Chile    4
06/03/2010  Chile    0
06/04/2010  Chile    0
06/05/2010  Chile    19
How would you do it?
NOTE:
I would like to see some kind of example code, to get started :-)
Normally, I'd suggest having a static calendar table which contains a sequential list of dates. However, using Cade Roux's clever approach of generating a calendar table, you would have something like:
;With Calendar As
(
    Select Cast(Floor(Cast(@StartDate As float)) As datetime) As [Date]
    Union All
    Select DateAdd(d, 1, [Date])
    From Calendar
    Where DateAdd(d, 1, [Date]) < @EndDate
)
Select C.[Date], R.Country, Sum(R.PeopleNeeded) As PeopleNeeded
From Calendar As C
Left Join Requests As R
    On C.[Date] Between R.[Start Date] And R.[End Date]
    And ( @Country Is Null Or R.Country = @Country )
Group By C.[Date], R.Country
Option (MAXRECURSION 0);
Now, if it is the case that you want to filter on country such that the only days returned are those for the given country that have data, then you would simply need to change the Left Join to an Inner Join.
ADDITION
From the comments, it was requested to show all countries whether they have a Request or not. To do that, you need to cross join to the Countries table:
With Calendar As
(
    Select Cast(Floor(Cast(@StartDate As float)) As datetime) As [Date]
    Union All
    Select DateAdd(d, 1, [Date])
    From Calendar
    Where DateAdd(d, 1, [Date]) < @EndDate
)
Select C.[Date], C2.Country, Sum(R.PeopleNeeded) As PeopleNeeded
From Calendar As C
Cross Join Countries As C2
Left Join Requests As R
    On C.[Date] Between R.[Start Date] And R.[End Date]
    And R.CountryId = C2.CountryId
Group By C.[Date], C2.Country
Option (MAXRECURSION 0);
Typically I would use a tally or pivot table of all dates and then join based on that date being between the range.
A technique similar to that discussed here.
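For illustration, a rough sketch of that idea, assuming a numbers (tally) table dbo.Tally(n) holding consecutive integers starting at 0, and using the question's column names:
DECLARE @StartDate date = '2010-06-01', @EndDate date = '2010-06-05';

SELECT DATEADD(DAY, t.n, @StartDate) AS [Date],
       r.Country,
       SUM(r.[People needed (each day)]) AS PeopleNeeded
FROM dbo.Tally t
LEFT JOIN dbo.Requests r
    ON DATEADD(DAY, t.n, @StartDate) BETWEEN r.[Start date] AND r.[End date]
WHERE DATEADD(DAY, t.n, @StartDate) <= @EndDate
GROUP BY DATEADD(DAY, t.n, @StartDate), r.Country;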
something like this?
select d.Date, r.Country, sum(r.People) as PeopleNeeded
from Dates d left join Requests r on d.Date between r.Start and r.End
group by d.Date, r.Country
where Dates contains an appropriate range of dates, as in Cade Roux's answer

SQL for counting events by date

I feel like I've seen this question asked before, but neither the SO search nor google is helping me... maybe I just don't know how to phrase the question. I need to count the number of events (in this case, logins) per day over a given time span so that I can make a graph of website usage. The query I have so far is this:
select
    count(userid) as numlogins,
    count(distinct userid) as numusers,
    convert(varchar, entryts, 101) as date
from
    usagelog
group by
    convert(varchar, entryts, 101)
This does most of what I need (I get a row per date as the output containing the total number of logins and the number of unique users on that date). The problem is that if no one logs in on a given date, there will not be a row in the dataset for that date. I want it to add in rows indicating zero logins for those dates. There are two approaches I can think of for solving this, and neither strikes me as very elegant.
1. Add a column to the result set that lists the number of days between the start of the period and the date of the current row. When I'm building my chart output, I'll keep track of this value, and if the next row is not equal to the current row plus one, insert zeros into the chart for each of the missing days.
2. Create a "date" table that has all the dates in the period of interest and outer join against it. Sadly, the system I'm working on already has a table for this purpose that contains a row for every date far into the future... I don't like that, and I'd prefer to avoid using it, especially since that table is intended for another module of the system and would thus introduce a dependency on what I'm developing currently.
Any better solutions or hints at better search terms for google? Thanks.
Frankly, I'd do this programmatically when building the final output. You're essentially trying to read something from the database which is not there (data for days that have no data). SQL isn't really meant for that sort of thing.
If you really want to do that, though, a "date" table seems your best option. To make it a bit nicer, you could generate it on the fly, using e.g. your DB's date functions and a derived table.
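For example, one way to generate such a derived date table on the fly in T-SQL (a sketch; the start date and 30-day window are arbitrary, and sys.all_objects is used purely as a convenient row source):
SELECT DATEADD(DAY, v.n, '2009-01-01') AS [date]
FROM (
    SELECT TOP (30) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1 AS n
    FROM sys.all_objects
) AS v;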
I had to do exactly the same thing recently. This is how I did it in T-SQL (YMMV on speed, but I've found it performant enough over a couple of million rows of event data):
DECLARE @DaysTable TABLE ( [Year] INT, [Day] INT )
DECLARE @StartDate DATETIME
SET @StartDate = whatever

WHILE (@StartDate <= GETDATE())
BEGIN
    INSERT INTO @DaysTable ( [Year], [Day] )
    SELECT DATEPART(YEAR, @StartDate), DATEPART(DAYOFYEAR, @StartDate)

    SELECT @StartDate = DATEADD(DAY, 1, @StartDate)
END
-- This gives me a table of all days since whenever
-- (you could select @StartDate as the minimum date of your usage log)

SELECT days.Year, days.Day, events.NumEvents
FROM @DaysTable AS days
LEFT JOIN (
    SELECT
        COUNT(*) AS NumEvents,
        DATEPART(YEAR, LogDate) AS [Year],
        DATEPART(DAYOFYEAR, LogDate) AS [Day]
    FROM LogData
    GROUP BY
        DATEPART(YEAR, LogDate),
        DATEPART(DAYOFYEAR, LogDate)
) AS events ON days.Year = events.Year AND days.Day = events.Day
Create a memory table (a table variable) where you insert your date ranges, then outer join the logins table against it. Group by your start date, then you can perform your aggregations and calculations.
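A rough sketch of that approach (the one-week window is just an example; table and column names follow the question):
DECLARE @Days TABLE (day_start DATETIME, day_end DATETIME)

DECLARE @d DATETIME = '2009-01-01'
WHILE @d < '2009-01-08'
BEGIN
    INSERT INTO @Days (day_start, day_end) VALUES (@d, DATEADD(DAY, 1, @d))
    SET @d = DATEADD(DAY, 1, @d)
END

SELECT d.day_start              AS [date],
       COUNT(u.userid)          AS numlogins,
       COUNT(DISTINCT u.userid) AS numusers
FROM @Days d
LEFT JOIN usagelog u
    ON u.entryts >= d.day_start AND u.entryts < d.day_end
GROUP BY d.day_start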
The strategy I normally use is to UNION with the opposite of the query, generally a query that retrieves data for rows that don't exist.
If I wanted to get the average mark for a course, but some courses weren't taken by any students, I'd need to UNION with those not taken by anyone to display a row for every class:
SELECT AVG(mark), course FROM `marks`
GROUP BY course
UNION
SELECT NULL, course FROM courses WHERE course NOT IN
    (SELECT course FROM marks)
Your query will be more complex, but the same principle should apply. You may indeed need a table of dates for your second query.
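As a sketch of how that principle could look here (assuming a dates table with a single thedate column, which is not part of the original question):
SELECT count(userid) as numlogins,
       count(distinct userid) as numusers,
       convert(varchar, entryts, 101) as date
FROM usagelog
GROUP BY convert(varchar, entryts, 101)
UNION
SELECT 0, 0, convert(varchar, thedate, 101)
FROM dates
WHERE convert(varchar, thedate, 101) NOT IN
    (SELECT convert(varchar, entryts, 101) FROM usagelog)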
Option 1
You can create a temp table, insert the dates in the range, and do a left outer join with the usagelog table.
Option 2
You can programmatically insert the missing dates while evaluating the result set to produce the final output.
WITH q(n) AS
(
    SELECT 0
    UNION ALL
    SELECT n + 1
    FROM q
    WHERE n < 99
),
qq(n) AS
(
    SELECT 0
    UNION ALL
    SELECT n + 1
    FROM qq
    WHERE n < 99
),
dates AS
(
    SELECT q.n * 100 + qq.n AS ndate
    FROM q, qq
)
SELECT COUNT(userid) as numlogins,
       COUNT(DISTINCT userid) as numusers,
       CAST('2000-01-01' AS DATETIME) + ndate as date
FROM dates
LEFT JOIN usagelog
    ON entryts >= CAST('2000-01-01' AS DATETIME) + ndate
    AND entryts < CAST('2000-01-01' AS DATETIME) + ndate + 1
GROUP BY ndate
This will select up to 10,000 dates constructed on the fly, which should be enough for over 27 years.
SQL Server has a default limit of 100 recursions per CTE; that's why the inner queries return at most 100 rows each.
If you need more than 10,000, just add a third CTE qqq(n) and cross-join with it in dates.