Getting rid of an expensive self join

I have a SQL statement like this:
SELECT
pa.col1,
SUM(ps.col2) col2,
SUM(psl.col2) col2_previous_month
FROM
pa
LEFT JOIN
ps ON pa.Id = ps.Id AND ps.date = @currDate
LEFT JOIN
ps as psl ON psl.Id = ps.Id AND psl.date = dateadd(month, -1, @currDate)
GROUP BY
pa.col1;
This SQL is called often, and since the table ps has 100M rows, the second left join is hurting. Is there a way to rewrite this without the self join?
Regards
Nick

Perhaps this will help:
Select pa.col1
,col2 = isnull(sum(case when ps.date = @currDate then ps.col2 else null end), 0)
,col2_prior = isnull(sum(case when ps.date = dateadd(month, -1, @currDate) then ps.col2 else null end), 0)
From pa
JOIN ps ON pa.Id = ps.Id
and ps.date in (@currDate, dateadd(month, -1, @currDate))
Group By pa.col1
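
To check that conditional aggregation returns the same sums as joining ps twice, here is a minimal sketch using Python's built-in sqlite3 module. It is an illustration under assumptions, not the original environment: the sample rows are made up, SQLite's date(x, '-1 month') stands in for T-SQL's dateadd, both joins in the first query go directly to pa, and a LEFT JOIN is used so that pa rows without matches are kept.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pa (Id INTEGER PRIMARY KEY, col1 TEXT);
    CREATE TABLE ps (Id INTEGER, date TEXT, col2 INTEGER);
    INSERT INTO pa VALUES (1, 'a'), (2, 'a'), (3, 'b');
    -- Hypothetical data: at most one ps row per (Id, date).
    INSERT INTO ps VALUES
        (1, '2024-03-01', 10), (1, '2024-02-01', 7),
        (2, '2024-03-01', 5),
        (3, '2024-02-01', 4), (3, '2024-01-01', 99);
""")

curr_date = "2024-03-01"

# Variant 1: join ps twice, once per month.
two_joins = conn.execute("""
    SELECT pa.col1, SUM(ps.col2), SUM(psl.col2)
    FROM pa
    LEFT JOIN ps        ON ps.Id  = pa.Id AND ps.date  = :d
    LEFT JOIN ps AS psl ON psl.Id = pa.Id AND psl.date = date(:d, '-1 month')
    GROUP BY pa.col1
    ORDER BY pa.col1
""", {"d": curr_date}).fetchall()

# Variant 2: join ps once and pick the month inside SUM(CASE ...).
one_join = conn.execute("""
    SELECT pa.col1,
           SUM(CASE WHEN ps.date = :d THEN ps.col2 END),
           SUM(CASE WHEN ps.date = date(:d, '-1 month') THEN ps.col2 END)
    FROM pa
    LEFT JOIN ps ON ps.Id = pa.Id
                AND ps.date IN (:d, date(:d, '-1 month'))
    GROUP BY pa.col1
    ORDER BY pa.col1
""", {"d": curr_date}).fetchall()

print(two_joins)
print(one_join)
```

Both variants agree here because each (Id, date) pair occurs at most once in ps. If a pair can repeat, the two-join form multiplies rows before summing, while the single-join form does not.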

If John's query doesn't help, you can also try this one:
SELECT
pa.col1
,SUM(ps1.col2) col2
,SUM(ps2.col2) col2_previous_month
FROM pa
LEFT JOIN
(
SELECT Id, col2
FROM ps
WHERE date = @currDate
) ps1 ON pa.Id = ps1.Id
LEFT JOIN
(
SELECT Id, col2
FROM ps
WHERE date = dateadd(month, -1, @currDate)
) ps2 ON pa.Id = ps2.Id
GROUP BY pa.col1;
I thought of it after reading your comments.
It is exactly the same as your initial query, except that I moved the date filters inside nested queries, which might help the optimizer use the index properly.

The query looks fine. In order to have it perform fast, you should have the following indexes:
pa (id)
ps (id, date)
If you want it still faster, use covering indexes:
pa (id, col1)
ps (id, date, col2)

Related

Nesting queries

Using the attached schema, I need to find the locations visited by people who tested positive that were also visited by untested people. (Untested means people who do not appear in the testing table.)
--find the same locations of where the positive people and the untested people went
select checkin.LocID, checkin.PersonID
from checkin join testing on checkin.personid = testing.personid
where results = 'Positive'
and (select CheckIn.PersonID
from checkin join testing on checkin.PersonID = testing.PersonID where CheckIn.PersonID
not in (select testing.PersonID from testing));
As I read it, the query says the following:
Select a location and person by joining the checkin and testing tables where the result is positive, and select a person from the checkin table who is not in the testing table.
The result I get is empty, yet I know from checking manually that there are such people. What am I doing wrong?
I hope this makes sense.

You can get the people tested 'Positive' with this query:
select personid from testing where results = 'Positive'
and the untested people with:
select p.personid
from person p left join testing t
on t.personid = p.personid
where t.testingid is null
You must join a copy of checkin to each of these queries, and then join the two copies together on the location:
select l.*
from (select personid from testing where results = 'Positive') p
inner join checkin cp on cp.personid = p.personid
inner join checkin cu on cu.lid = cp.lid
inner join (
select p.personid
from person p left join testing t
on t.personid = p.personid
where t.testingid is null
) pu on pu.personid = cu.personid
inner join location l on l.locationid = cu.lid

If what you want are the positive people who are at a location where there is also someone who is not tested, you might consider:
select ch.LocID,
group_concat(case when t.results = 'positive' then ch.PersonID end) as positive_persons
from checkin ch left join
testing t
on ch.personid = t.personid
group by ch.LocId
having sum(case when t.results = 'positive' then 1 else 0 end) > 0 and
count(*) <> count(t.personid); -- at least one person not tested
With this structure, you can get the untested people using:
group_concat(case when t.personid is null then ch.personid end)

You have several mistakes (missing exists, independent subquery in exists). I believe that this should do the work
select ch1.LocID, ch1.PersonID
from checkin ch1
join testing t1 on ch1.personid = t1.personid
where results = 'Positive'
and exists (
select 1
from checkin ch2
where ch1.LocID = ch2.LocID and ch2.PersonID not in (
select testing.PersonID
from testing
)
);
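
For anyone who wants to try the correlated EXISTS pattern on a small data set, here is a minimal sketch using Python's built-in sqlite3 module. The rows are made up, and only the two tables the query touches are created. One deliberate change: the inner NOT IN is written as NOT EXISTS, because NOT IN returns no rows at all if the subquery ever yields a NULL PersonID.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE checkin (PersonID INTEGER, LocID INTEGER);
    CREATE TABLE testing (PersonID INTEGER, results TEXT);
    -- Hypothetical data: person 2 is the only untested person.
    INSERT INTO checkin VALUES (1, 10), (2, 10), (3, 20), (4, 20), (5, 30);
    INSERT INTO testing VALUES (1, 'Positive'), (3, 'Positive'),
                               (4, 'Negative'), (5, 'Positive');
""")

# Positive people who checked in at a location where at least one
# untested person also checked in.
rows = conn.execute("""
    SELECT ch1.LocID, ch1.PersonID
    FROM checkin ch1
    JOIN testing t1 ON t1.PersonID = ch1.PersonID
    WHERE t1.results = 'Positive'
      AND EXISTS (
            SELECT 1
            FROM checkin ch2
            WHERE ch2.LocID = ch1.LocID
              AND NOT EXISTS (SELECT 1
                              FROM testing t2
                              WHERE t2.PersonID = ch2.PersonID)
      )
    ORDER BY ch1.LocID, ch1.PersonID
""").fetchall()

print(rows)
```

Only location 10 qualifies in this sample: location 20 has a positive person but everyone there was tested, and location 30 has a single visitor.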

Query is working fine in second execution but taking too much time in first execution

I am writing a query that accepts a comma-separated string and calculates the sum of transactions. The results are correct, but the first execution takes far too long. I understand that it needs tuning, but I could not find the exact cause. Can anyone point out what is wrong with my query?
Declare @IDs nvarchar(max)='1,4,5,6,8,9,43,183'
SELECT isnull(isnull(SUM(FT.PaidAmt),0) - isnull(SUM(CT.PaidAmt),0),0) [Amount], convert(char(10),FT.TranDate,126) [Date]
from FeeTransaction FT
Inner Join (
Select max(P.Id) [Id], P.TranMainId, isnull(SUM(P.AmtToPay),0) [Amt]
From Patient_Account P
Group By P.TranMainId
) PA ON FT.Id = PA.TranMainId
Inner Join Patient_Account XP ON PA.Id = XP.Id
Inner Join Master_Fee MF ON XP.FeeId = MF.Id
INNER Join Master_Patient MP ON FT.PID = MP.Id
Inner Join Master_FeeType TY ON MF.FeeTypeId = TY.Id
Left JOIN FeeTransaction CT on FT.TransactionId = CT.TransactionId AND CT.TranDate between '2019'+'08'+'01' and '2019'+'08'+'31' and CT.[Status] <> 'A' AND isnull(CT.IsCancel,0) = 1
Where convert(nvarchar,FT.TranDate,112) between '2019'+'08'+'01' and '2019'+'08'+'31' AND FT.[Status] = 'A' AND XP.FeeId in (SELECT val FROM dbo.f_split(@IDs, ','))
AND isnull(FT.IsCancel,0) = 0 AND FT.EntryBy = 'rajan'
Group By convert(char(10),FT.TranDate,126)

I would rephrase the query a bit:
select coalesce(SUM(FT.PaidAmt), 0) - coalesce(SUM(CT.PaidAmt), 0)as [Amount],
convert(char(10),FT.TranDate,126) [Date]
from FeeTransaction FT join
(select xp.*,
coalesce(sum(xp.AmtToPay) over (partition by xp.TranMainId), 0) as amt
from Patient_Account xp
) xp
on FT.Id = xp.TranMainId join
Master_Fee MF
on XP.FeeId = MF.Id join
Master_Patient MP
on FT.PID = MP.Id join
Master_FeeType TY
on MF.FeeTypeId = TY.Id left join
FeeTransaction CT
on FT.TransactionId = CT.TransactionId and
CT.TranDate between '20190801' and '20190831' and
CT.[Status] <> 'A' and
CT.IsCancel = 1
where FT.TranDate >= '20190801' and
FT.TranDate < '20190901' and
FT.[Status] = 'A' AND
XP.FeeId in (SELECT val FROM dbo.f_split(@IDs, ',')) and
(FT.IsCancel = 0 or FT.IsCancel IS NULL) and
FT.EntryBy = 'rajan'
Group By convert(char(10), FT.TranDate, 126)
Then, for this version, you specifically want an index on FeeTransaction(EntryBy, Status, TranDate, IsCancel).
Note the following changes:
You do not need to aggregate Patient_Account as a subquery. Window functions are quite convenient.
Your date comparisons preclude the use of indexes. Converting dates to strings is a bad practice in general.
You have over-used isnull().
I assume that the appropriate indexes are in place for the joins.
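
The point about date comparisons can be checked on a toy table. Below is a minimal sketch using Python's built-in sqlite3 module. The fee table, its columns and its rows are hypothetical, and SQLite's strftime stands in for T-SQL's convert. It compares a predicate that wraps the column in a function with a half-open range on the raw column, then asks SQLite for its query plan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fee (id INTEGER PRIMARY KEY, tran_date TEXT, paid INTEGER);
    CREATE INDEX idx_fee_date ON fee (tran_date);
    INSERT INTO fee (tran_date, paid) VALUES
        ('2019-07-31 23:00:00', 1),
        ('2019-08-01 00:00:00', 10),
        ('2019-08-31 18:30:00', 20),
        ('2019-09-01 00:00:00', 100);
""")

# The column is wrapped in a function, so the index on tran_date cannot seek.
wrapped = """
    SELECT SUM(paid) FROM fee
    WHERE strftime('%Y%m%d', tran_date) BETWEEN '20190801' AND '20190831'
"""

# Half-open range on the raw column: >= first day, < first day of next month.
half_open = """
    SELECT SUM(paid) FROM fee
    WHERE tran_date >= '2019-08-01' AND tran_date < '2019-09-01'
"""

def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is a readable description.
    return " | ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

wrapped_total = conn.execute(wrapped).fetchone()[0]
half_open_total = conn.execute(half_open).fetchone()[0]

print(wrapped_total, plan(wrapped))
print(half_open_total, plan(half_open))
```

Both queries return the same total, but SQLite should report a SEARCH using idx_fee_date for the half-open range and a SCAN for the wrapped version. SQL Server's plans look different, though the same reasoning about seekable predicates applies.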

I would use STRING_SPLIT and Common Table Expressions and do away with date conversions:
Declare @IDs nvarchar(max)='1,4,5,6,8,9,43,183'
;WITH CTE_ID AS
(
SELECT value AS ID FROM STRING_SPLIT(@IDs, ',')
),
MaxPatient
AS
(
SELECT MAX(P.Id) [Id], P.TranMainId, isnull(SUM(P.AmtToPay),0) [Amt]
From Patient_Account P
Group By P.TranMainId
)
SELECT isnull(isnull(SUM(FT.PaidAmt),0) - isnull(SUM(CT.PaidAmt),0),0) As [Amount],
CAST(FT.TranDate AS Date) [Date]
FROM FeeTransaction FT
INNER JOIN MaxPatient PA
ON FT.Id = PA.TranMainId
INNER JOIN Patient_Account XP
ON PA.Id = XP.Id
INNER JOIN Master_Fee MF
ON XP.FeeId = MF.Id
INNER Join Master_Patient MP
ON FT.PID = MP.Id
INNER JOIN Master_FeeType TY
ON MF.FeeTypeId = TY.Id
INNER JOIN CTE_ID
ON XP.FeeId = CTE_ID.ID
LEFT JOIN FeeTransaction CT
ON FT.TransactionId = CT.TransactionId AND
CT.TranDate >= '20190801' AND CT.TranDate < '20190901' AND
CT.[Status] <> 'A' AND isnull(CT.IsCancel,0) = 1
WHERE FT.TranDate >= '20190801' and FT.TranDate < '20190901' AND
FT.[Status] = 'A' AND
ISNULL(FT.IsCancel,0) = 0 AND
FT.EntryBy = 'rajan'
GROUP BY CAST(FT.TranDate AS Date)

Not only is your query slow, but it also appears to give incorrect output.
i) You do not use any column of Patient_Account in your result set, so why write this subquery?
Select max(P.Id) [Id], P.TranMainId, isnull(SUM(P.AmtToPay),0) [Amt]
From Patient_Account P
Group By P.TranMainId
ii) Avoid using <>. If Status can only be 'A' or 'I', write CT.[Status] = 'I' instead.
iii) What is the data type of TranDate? Do not apply functions to columns in the WHERE condition.
iv) There is no need for isnull(CT.IsCancel,0) = 1; write CT.IsCancel = 1 instead.
My script is only an outline, but it should be easy to follow.
Declare @IDs nvarchar(max)='1,4,5,6,8,9,43,183'
create table #temp(id int)
insert into #temp(id)
SELECT val FROM dbo.f_split(@IDs, ',')
declare @FromDate Datetime='2019-08-01'
declare @toDate Datetime='2019-09-01' -- exclusive upper bound, so 31 August is fully included
-- mention all column of FeeTransaction that you need in this query along with correct data type
-- Store TranDate in this manner convert(char(10),FT.TranDate,126) in this table
create table #Transaction()
select * from FeeTransaction FT
where FT.TranDate>=@FromDate and FT.TranDate<@toDate
and exists(select 1 from #temp t where t.id=ft.id)
-- mention all column of Patient_Account that you need in this query along with correct data type
create table #Patient_Account()
Select max(P.Id) [Id], P.TranMainId, isnull(SUM(P.AmtToPay),0) [Amt]
From Patient_Account P
where exists(select 1 from #Transaction T where T.id=P.TranMainId)
Group By P.TranMainId
SELECT isnull(isnull(SUM(FT.PaidAmt),0) - isnull(SUM(CT.PaidAmt),0),0) [Amount], TranDate [Date]
from #Transaction FT
Inner Join #Patient_Account XP ON PA.Id = XP.Id
Inner Join Master_Fee MF ON XP.FeeId = MF.Id
INNER Join Master_Patient MP ON FT.PID = MP.Id
Inner Join Master_FeeType TY ON MF.FeeTypeId = TY.Id
Left JOIN #Transaction CT on FT.TransactionId = CT.TransactionId AND CT.[Status] = 'I' AND CT.IsCancel = 1
Where FT.[Status] = 'A' AND XP.FeeId in (SELECT id FROM #temp t)
AND FT.IsCancel = 0 AND FT.EntryBy = 'rajan'
Group By TranDate

Self join on joined table

My query looks like
Select m.cw_sport_match_id as MatchId,
m.season_id as SeasonId,
s.title as SeasonName,
c.title as ContestName
from dbo.cw_sport_match m
inner join dbo.cw_sport_season s
ON m.season_id = s.cw_sport_season_id
inner join dbo.cw_sport_contest c
ON m.contest_id = c.cw_sport_contest_id
Where s.date_start <= GETDATE() AND s.date_end >= GETDATE()
order by s.date_start
Now I need the name of the parent of the sport_contest (if there is one; it can be null). So it is basically a self join, but not on the table the query starts from. None of the examples I can find do the self join on another, already joined table.
Can any SQL pro help me?
So how can I join cw_sport_season to itself using season_parent_id and get the parent's title?

If I'm understanding your question correctly, you want to outer join the cw_sport_season table to itself using the season_parent_id field. Maybe something along these lines:
Select m.cw_sport_match_id as MatchId,
m.season_id as SeasonId,
s.title as SeasonName,
parent.title as ParentSeasonName,
c.title as ContestName
from dbo.cw_sport_match m
inner join dbo.cw_sport_season s
ON m.season_id = s.cw_sport_season_id
inner join dbo.cw_sport_contest c
ON m.contest_id = c.cw_sport_contest_id
left join dbo.cw_sport_season parent
ON s.season_parent_id = parent.cw_sport_season_id
Where s.date_start <= GETDATE() AND s.date_end >= GETDATE()
order by s.date_start
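
The same idea in a form that can be run directly: a minimal sketch using Python's built-in sqlite3 module. The table names, columns and rows are simplified, hypothetical stand-ins for the cw_sport_* tables. The second alias of the season table (parent) is joined with LEFT JOIN so that seasons without a parent are kept and simply get a NULL parent title.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sport_season (
        id INTEGER PRIMARY KEY,
        title TEXT,
        season_parent_id INTEGER  -- NULL when the season has no parent
    );
    CREATE TABLE sport_match (id INTEGER PRIMARY KEY, season_id INTEGER);
    INSERT INTO sport_season VALUES (1, 'League 2024', NULL),
                                    (2, 'Playoffs 2024', 1);
    INSERT INTO sport_match VALUES (100, 1), (101, 2);
""")

rows = conn.execute("""
    SELECT m.id, s.title, parent.title
    FROM sport_match m
    JOIN sport_season s ON s.id = m.season_id
    -- Second alias of the same table: look up the parent row, if any.
    LEFT JOIN sport_season parent ON parent.id = s.season_parent_id
    ORDER BY m.id
""").fetchall()

print(rows)
```

With an INNER JOIN to parent instead, match 100 would disappear from the result because its season has no parent.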

My CASE statement is wrong. Any idea what I am doing wrong?

I am in a logjam.
When I run the following query, it works:
select DISTINCT l.Seating_Capacity - (select count(*)
from tblTrainings t1, tbllocations l
where l.locationId = t1.LocationId) as
availableSeats
from tblTrainings t1, tbllocations l
where l.locationId = t1.LocationId
However, we would like to add a CASE statement that says, when Seating_Capacity - total count as shown above = 0 then show 'FULL' message.
Otherwise, show remaining number.
Here is that query:
select DISTINCT case when l.Seating_Capacity - (select count(*)
from tblTrainings t1, tbllocations l
where l.locationId = t1.LocationId) = 0 then 'full' else STR(Seating_Capacity) end)
availableSeats
from tblTrainings t1, tbllocations l
where l.locationId = t1.LocationId
I am getting 'Incorrect syntax near ')' which is near 'End'
I am also getting an error that the inner Seating_Capacity is invalid column name.
Your assistance is greatly appreciated.
I must have been in a dream land because I thought it was working during testing.
Now, the app is live and it isn't working.
Thanks a lot in advance
select DISTINCT l.LocationId,c.courseId, c.coursename, l.Seating_Capacity - (select count(*)
from tblTrainings t1
where l.locationId = t1.LocationId and c.courseId = t1.courseId) as
availableSeats,d.dateid,d.trainingDates,d.trainingtime,c.CourseDescription,
i.instructorName,l.location,l.seating_capacity
from tblLocations l
Inner Join tblCourses c on l.locationId = c.locationId
left join tblTrainings t on l.locationId = t.LocationId and c.courseId = t.courseId
Inner Join tblTrainingDates d on c.dateid=d.dateid
Inner Join tblCourseInstructor ic on c.courseId = ic.CourseId
Inner Join tblInstructors i on ic.instructorId = i.instructorId
WHERE CONVERT(VARCHAR(10), d.trainingDates, 101) >= CONVERT(VARCHAR(10), GETDATE(), 101)

To avoid repeating the expression, you can use a WITH clause to simplify your query:
WITH source AS (
-- Start with your query that already works
SELECT DISTINCT l.Seating_Capacity - (select count(*)
from tblTrainings t1, tbllocations l
where l.locationId = t1.LocationId) AS availableSeats
FROM tblTrainings t1, tbllocations l
WHERE l.locationId = t1.LocationId
)
SELECT
-- Add a CASE statement on top of it
CASE WHEN availableSeats = 0 THEN 'Full'
ELSE STR(availableSeats)
END AS availableSeats
FROM source

You have an extra ) at the end of your CASE expression; remove it.
0 then 'full' else STR(Seating_Capacity) end)
                                            ^
For Seating_Capacity, try qualifying it with the table alias, like l.Seating_Capacity.

I think you are over complicating the query with your subquery. As I understand it then the following should work as you need:
SELECT AvailableSeats = CASE WHEN l.Seating_Capacity - COUNT(*) = 0 THEN 'Full'
ELSE STR(l.Seating_Capacity - COUNT(*))
END
FROM tblTrainings t1
INNER JOIN tblLocations l
ON l.LocationID = t1.LocationID
GROUP BY l.Seating_Capacity;
I have changed your else to STR(l.Seating_Capacity - COUNT(*)) because I assume you want to know the seats remaining, rather than just the capacity? If I have misinterpreted the requirement, just change it to STR(l.Seating_Capacity).
I have also switched your ANSI 89 implicit joins to ANSI 92 explicit joins. The standard changed over 20 years ago, and there are plenty of good reasons to switch to the newer syntax. But for completeness, the ANSI 89 version of the above query would be:
SELECT AvailableSeats = CASE WHEN l.Seating_Capacity - COUNT(*) = 0 THEN 'Full'
ELSE STR(l.Seating_Capacity - COUNT(*))
END
FROM tblTrainings t1, tblLocations l
WHERE l.LocationID = t1.LocationID
GROUP BY l.Seating_Capacity;
EDIT
To adapt your full query you can simply replace your subquery in the select, with a joined subquery:
SELECT l.LocationId,
c.courseId,
c.coursename,
CASE WHEN l.Seating_Capacity - ISNULL(t.SeatsTaken, 0) = 0 THEN 'Full'
ELSE STR(l.Seating_Capacity - ISNULL(t.SeatsTaken, 0))
END AS availableSeats,
d.dateid,
d.trainingDates,
d.trainingtime,
c.CourseDescription,
i.instructorName,
l.location,
l.seating_capacity
FROM tblLocations l
INNER JOIN tblCourses c
ON l.locationId = c.locationId
LEFT JOIN
( SELECT t.LocationID, t.CourseID, SeatsTaken = COUNT(*)
FROM tblTrainings t
GROUP BY t.LocationID, t.CourseID
) t
ON l.locationId = t.LocationId
AND c.courseId = t.courseId
INNER JOIN tblTrainingDates d
ON c.dateid=d.dateid
INNER JOIN tblCourseInstructor ic
ON c.courseId = ic.CourseId
INNER JOIN tblInstructors i
ON ic.instructorId = i.instructorId
WHERE d.trainingDates >= CAST(GETDATE() AS DATE);
JOINs tend to optimise better than correlated subqueries (although sometimes the optimiser can determine that a JOIN would work instead). It also means that you can reference the result (SeatsTaken) multiple times without re-evaluating the subquery.
In addition, by moving the count to a subquery and removing the join to tblTrainings, I think you eliminate the need for DISTINCT, which should improve performance.
Finally I have changed this line:
WHERE CONVERT(VARCHAR(10), d.trainingDates, 101) >= CONVERT(VARCHAR(10), GETDATE(), 101)
To
WHERE d.trainingDates >= CAST(GETDATE() AS DATE);
I don't know if you do, but if you had an index on d.TrainingDates, converting the column to varchar to compare it with today would prevent the optimiser from using that index. Since you are only asking for >= midnight today, there is no need to perform any conversion on d.TrainingDates; all you need to do is remove the time portion of GETDATE(), which can be done by casting to DATE. More on this is contained in this article (yet another gem from Aaron Bertrand).
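
To see the joined-subquery pattern work end to end, here is a minimal sketch using Python's built-in sqlite3 module. The two tables and their rows are hypothetical and much smaller than the real schema. CAST(... AS TEXT) stands in for T-SQL's STR, and COALESCE covers locations that have no trainings yet, where the LEFT JOIN produces NULL for seats_taken.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE locations (locationId INTEGER PRIMARY KEY, seating_capacity INTEGER);
    CREATE TABLE trainings (locationId INTEGER, attendee TEXT);
    -- Hypothetical data: location 1 is full, location 3 has no sign-ups.
    INSERT INTO locations VALUES (1, 2), (2, 3), (3, 5);
    INSERT INTO trainings VALUES (1, 'ann'), (1, 'bob'), (2, 'cy');
""")

rows = conn.execute("""
    SELECT l.locationId,
           CASE WHEN l.seating_capacity - COALESCE(t.seats_taken, 0) = 0 THEN 'Full'
                ELSE CAST(l.seating_capacity - COALESCE(t.seats_taken, 0) AS TEXT)
           END AS available_seats
    FROM locations l
    -- Count once per location in a derived table, then join the result.
    LEFT JOIN (SELECT locationId, COUNT(*) AS seats_taken
               FROM trainings
               GROUP BY locationId) t ON t.locationId = l.locationId
    ORDER BY l.locationId
""").fetchall()

print(rows)
```

Both CASE branches return text, which keeps the column type consistent; mixing 'Full' with a raw number is what forces the STR or CAST in the first place.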

Inner join that ignore singlets

I have to do a self join on a table. I am trying to return several columns to see how many of each type of drug test were performed on the same day (MM/DD/YYYY), for days on which at least two tests were done and at least one of them had a result code of 'UN'.
I am joining other tables to get the information as below. The problem is that I do not quite understand how to exclude someone who has a single result row with a 'UN' result on a day but no other tests that day.
Query Results (Columns)
County, DrugTestID, ID, Name, CollectionDate, DrugTestType, Results, Count(DrugTestType)
I have several rows for ID 12345, which are correct. But ID 12346 has a single row with a count of 1: they had a result of 'UN' on that day but no other tests that day. I want to exclude this.
I tried the following query
select
c.desc as 'County',
dt.pid as 'PID',
dt.id as 'DrugTestID',
p.id as 'ID',
bio.FullName as 'Participant',
CONVERT(varchar, dt.CollectionDate, 101) as 'CollectionDate',
dtt.desc as 'Drug Test Type',
dt.result as Result,
COUNT(dt.dru_drug_test_type) as 'Count Of Test Type'
from
dbo.Test as dt with (nolock)
join dbo.History as h on dt.pid = h.id
join dbo.Participant as p on h.pid = p.id
join BioData as bio on bio.id = p.id
join County as c with (nolock) on p.CountyCode = c.code
join DrugTestType as dtt with (nolock) on dt.DrugTestType = dtt.code
inner join
(
select distinct
dt2.pid,
CONVERT(varchar, dt2.CollectionDate, 101) as 'CollectionDate'
from
dbo.DrugTest as dt2 with (nolock)
join dbo.History as h2 on dt2.pid = h2.id
join dbo.Participant as p2 on h2.pid = p2.id
where
dt2.result = 'UN'
and dt2.CollectionDate between '11-01-2011' and '10-31-2012'
and p2.DrugCourtType = 'AD'
) as derived
on dt.pid = derived.pid
and convert(varchar, dt.CollectionDate, 101) = convert(varchar, derived.CollectionDate, 101)
group by
c.desc, dt.pid, p.id, dt.id, bio.fullname, dt.CollectionDate, dtt.desc, dt.result
order by
c.desc ASC, Participant ASC, dt.CollectionDate ASC

This is a little complicated because your query has a separate row for each test. You need to use window/analytic functions to get the information you want. These let you calculate aggregate functions but put the values on each row.
The following query starts with your query. It then calculates the number of UN results on each date for each participant and the total number of tests. It applies the appropriate filter to get what you want:
with base as (<your query here>)
select b.*
from (select b.*,
sum(isUN) over (partition by Participant, CollectionDate) as NumUNs,
count(*) over (partition by Participant, CollectionDate) as NumTests
from (select b.*,
(case when result = 'UN' then 1 else 0 end) as IsUN
from base b
) b
) b
where NumUNs <> 1 or NumTests <> 1
Without the with clause or window functions, you can create a particularly ugly query to do the same thing:
select b.*
from (<your query>) b join
(select Participant, CollectionDate, count(*) as NumTests,
sum(case when result = 'UN' then 1 else 0 end) as NumUNs
from (<your query>) b
group by Participant, CollectionDate
) bsum
on b.Participant = bsum.Participant and
b.CollectionDate = bsum.CollectionDate
where NumUNs <> 1 or NumTests <> 1
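
Here is a minimal runnable sketch of the window-function approach, using Python's built-in sqlite3 module (window functions need SQLite 3.25 or newer). The tests table and its rows are made up and stand in for the asker's full query. Because this sample is not pre-filtered to days with a 'UN' result, the filter is written directly as "at least one UN and at least two tests".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tests (participant TEXT, collection_date TEXT,
                        test_type TEXT, result TEXT);
    -- Hypothetical data: 12346 is a singlet, 12347 has no 'UN' result.
    INSERT INTO tests VALUES
        ('12345', '2012-01-05', 'urine',  'UN'),
        ('12345', '2012-01-05', 'breath', 'OK'),
        ('12346', '2012-01-05', 'urine',  'UN'),
        ('12347', '2012-01-06', 'urine',  'OK'),
        ('12347', '2012-01-06', 'breath', 'OK');
""")

rows = conn.execute("""
    SELECT participant, collection_date, test_type, result
    FROM (SELECT t.*,
                 -- Per participant and day: number of 'UN' results ...
                 SUM(CASE WHEN result = 'UN' THEN 1 ELSE 0 END)
                     OVER (PARTITION BY participant, collection_date) AS num_uns,
                 -- ... and total number of tests, repeated on every row.
                 COUNT(*)
                     OVER (PARTITION BY participant, collection_date) AS num_tests
          FROM tests t) AS counted
    WHERE num_uns >= 1 AND num_tests >= 2
    ORDER BY participant, test_type
""").fetchall()

print(rows)
```

Every row of a qualifying day is returned, not only the 'UN' row, which is what makes the window version convenient compared with GROUP BY and HAVING.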

If I understand the problem, the basic pattern for this sort of query is simply to include negating or exclusionary conditions in your join, i.e., self-join where column A matches, but columns B and C do not:
select
[columns]
from
table t1
join table t2 on (
t1.NonPkId = t2.NonPkId
and t1.PkId != t2.PkId
and t1.category != t2.category
)
Put the conditions in the WHERE clause if it benchmarks better:
select
[columns]
from
table t1
join table t2 on (
t1.NonPkId = t2.NonPkId
)
where
t1.PkId != t2.PkId
and t1.category != t2.category
And it's often easiest to start with the self-join, treating it as a "base table" on which to join all related information:
select
[columns]
from
(select
[columns]
from
table t1
join table t2 on (
t1.NonPkId = t2.NonPkId
)
where
t1.PkId != t2.PkId
and t1.category != t2.category
) bt
join [othertable] on (<whatever>)
join [othertable] on (<whatever>)
join [othertable] on (<whatever>)
This can allow you to focus on getting that self-join right, without interference from other tables.
