How the order of joins affects the performance of a query - sql

I'm seeing big differences in the execution time of my query, and it seems that the order in which the joins (inner and left outer) occur in the query makes all the difference.
Are there some "ground rules" for the order joins should be in?
Both queries below are part of a bigger query.
The only difference between them is that the left join is placed last in the faster query.
Slow query: (> 10 minutes)
SELECT [t0].[Ref], [t1].[Key], [t1].[Name],
(CASE
WHEN [t3].[test] IS NULL THEN CONVERT(NVarChar(250),@p0)
ELSE CONVERT(NVarChar(250),[t3].[Key])
END) AS [value],
(CASE
WHEN 0 = 1 THEN CONVERT(NVarChar(250),@p1)
ELSE CONVERT(NVarChar(250),[t4].[Key])
END) AS [value2]
FROM [dbo].[tblA] AS [t0]
INNER JOIN [dbo].[tblB] AS [t1] ON [t0].[RefB] = [t1].[Ref]
LEFT OUTER JOIN (
SELECT 1 AS [test], [t2].[Ref], [t2].[Key]
FROM [dbo].[tblC] AS [t2]
) AS [t3] ON [t0].[RefC] = ([t3].[Ref])
INNER JOIN [dbo].[tblD] AS [t4] ON [t0].[RefD] = ([t4].[Ref])
Faster query: (~ 30 seconds)
SELECT [t0].[Ref], [t1].[Key], [t1].[Name],
(CASE
WHEN [t3].[test] IS NULL THEN CONVERT(NVarChar(250),@p0)
ELSE CONVERT(NVarChar(250),[t3].[Key])
END) AS [value],
(CASE
WHEN 0 = 1 THEN CONVERT(NVarChar(250),@p1)
ELSE CONVERT(NVarChar(250),[t4].[Key])
END) AS [value2]
FROM [dbo].[tblA] AS [t0]
INNER JOIN [dbo].[tblB] AS [t1] ON [t0].[RefB] = [t1].[Ref]
INNER JOIN [dbo].[tblD] AS [t4] ON [t0].[RefD] = ([t4].[Ref])
LEFT OUTER JOIN (
SELECT 1 AS [test], [t2].[Ref], [t2].[Key]
FROM [dbo].[tblC] AS [t2]
) AS [t3] ON [t0].[RefC] = ([t3].[Ref])

Generally, INNER JOIN order won't matter because inner joins are commutative and associative. In both cases you still have t0 inner join t4, so in principle it should make no difference.
Re-phrasing that, SQL is declarative: you say "what you want", not "how". The optimiser works out the "how" and will re-order JOINs as needed, in practice also looking at WHERE clauses and so on.
In complex queries, a cost-based query optimiser won't exhaustively try every permutation, so the written order can occasionally matter.
So, I'd check for these:
You said these are part of a bigger query, so this section matters less because the whole query matters.
Complexity can also be hidden in views, if any of the tables are actually views.
Is this repeatable, no matter what order code runs in?
What are the query plan differences?
See some other SO questions:
how to best organize the Inner Joins in (select) statement
SQL Server 2005 - Order of Inner Joins
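The two checks above (same results regardless of join order, and comparing query plans) can be sketched on a toy schema. This is only an illustration using Python's built-in sqlite3 module rather than SQL Server; the miniature tables and the `Val` column are assumptions made up for the example, and the plan text SQLite prints will differ from what SQL Server shows.

```python
import sqlite3

# Hypothetical miniature versions of tblA..tblD (not the asker's real schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tblA (Ref INTEGER PRIMARY KEY, RefB INT, RefC INT, RefD INT);
    CREATE TABLE tblB (Ref INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE tblC (Ref INTEGER PRIMARY KEY, Val TEXT);
    CREATE TABLE tblD (Ref INTEGER PRIMARY KEY, Val TEXT);
    INSERT INTO tblB VALUES (1, 'b1'), (2, 'b2');
    INSERT INTO tblC VALUES (1, 'c1');
    INSERT INTO tblD VALUES (1, 'd1'), (2, 'd2');
    INSERT INTO tblA VALUES (1, 1, 1, 1), (2, 2, NULL, 2), (3, 1, 9, 2);
""")

# Left join written between the two inner joins.
left_join_in_middle = """
    SELECT a.Ref, b.Name, c.Val, d.Val
    FROM tblA AS a
    INNER JOIN tblB AS b ON a.RefB = b.Ref
    LEFT OUTER JOIN tblC AS c ON a.RefC = c.Ref
    INNER JOIN tblD AS d ON a.RefD = d.Ref
    ORDER BY a.Ref
"""

# Same query with the left join written last.
left_join_last = """
    SELECT a.Ref, b.Name, c.Val, d.Val
    FROM tblA AS a
    INNER JOIN tblB AS b ON a.RefB = b.Ref
    INNER JOIN tblD AS d ON a.RefD = d.Ref
    LEFT OUTER JOIN tblC AS c ON a.RefC = c.Ref
    ORDER BY a.Ref
"""

rows_middle = conn.execute(left_join_in_middle).fetchall()
rows_last = conn.execute(left_join_last).fetchall()

# Print the plan chosen for each variant so they can be compared by eye.
for label, sql in (("middle", left_join_in_middle), ("last", left_join_last)):
    print(label)
    for step in conn.execute("EXPLAIN QUERY PLAN " + sql):
        print("   ", step)
```

Both variants return the same rows here; only the plans, and therefore the timings on real data, can differ.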

If you have more than 2 tables, it is important to order the table joins. It can make a big difference. The first table should get a leading hint. The first table should be the one with the most selective rows. For example: if you have a member table with 1,000,000 people, you only want to select the female members, and it is the first table, then you only join 500,000 records to the next table. If this table is at the end of the join order (maybe table 4, 5 or 6), then each record (worst case 1,000,000) will be joined. This applies to both inner and outer joins.
The rule: start with the most selective table, then join the next logically most selective table.
Converting functions and beautifying should be done last. Sometimes it is better to bundle the whole SQL in brackets and use expressions and functions in the outer select statement.
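As a rough sketch of that last suggestion: the inner query below starts from the filtered (most selective) table and does the join, and the outer select applies the formatting only to the final rows. It uses Python's built-in sqlite3 module; the `member` and `purchase` tables and their columns are hypothetical, and whether this layout is actually faster depends on the engine, since many optimisers push filters down on their own.

```python
import sqlite3

# Hypothetical tables invented for this illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE member (id INTEGER PRIMARY KEY, name TEXT, gender TEXT);
    CREATE TABLE purchase (id INTEGER PRIMARY KEY, member_id INT, amount_cents INT);
    INSERT INTO member VALUES (1, 'ann', 'F'), (2, 'bob', 'M'), (3, 'cy', 'F');
    INSERT INTO purchase VALUES (1, 1, 1250), (2, 2, 900), (3, 3, 400), (4, 1, 100);
""")

sql = """
    SELECT UPPER(name) AS display_name,                        -- formatting last
           printf('%.2f', total_cents / 100.0) AS total
    FROM (
        SELECT m.name, SUM(p.amount_cents) AS total_cents
        FROM (SELECT id, name FROM member WHERE gender = 'F') AS m  -- selective table first
        INNER JOIN purchase AS p ON p.member_id = m.id
        GROUP BY m.id, m.name
    ) AS joined
    ORDER BY display_name
"""

rows = conn.execute(sql).fetchall()
print(rows)
```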

With left joins, the order has a big impact on performance. I was having a problem with a select query that looked like this:
select distinct count(p0_.id) over () as col_0_0_,
p0_.id as col_1_0_,
p0_.lp as col_2_0_,
0
as col_3_0_,
max(coalesce(i6_.cft, i7_.rfo,
'')) as col_4_0_,
p0_.pdv as col_5_0_,
(s8_.qer)
as col_6_0_,
cf1_.ests as col_7_0_
from Produit p0_
left outer join CF cf1_ on p0_.fk_cf = cf1_.id
left outer join CA c2_ on cf1_.fk_ca = c2_.id
left outer join ml mt on c2_.fk_m = mt.id
left outer join sk s8_ on p0_.id = s8_.fk_p
left outer join rf r5_ on
rp4_.fk_r = r5_.id
left outer join
in i6_ on r5_.fk_ireftc = i6_.id
left outer join r_p_r rp4_ on p0_.id = rp4_.fk_p
left outer join
ir i7_ on r5_.fk_if = i7_.id
left outer join re_p_g gc9_ on p0_.id = gc9_.fk_p
left outer join gc g10_ on gc9_.fk_g = g10_.id
where
and (p0_.lC is null or p0_.lS = 'E')
and g10_.id is null or g10_.id
and r5_.fk_i is null
group by col_1_0_, col_2_0_, col_3_0_, col_5_0_, col_6_0_, col_7_0_
order by col_2_0_ asc, p0_.id
limit 10;
This query takes 13 to 15 seconds to execute. When I change the order of the joins, it takes 1 to 2 seconds:
select distinct count(p0_.id) over () as col_0_0_,
p0_.id as col_1_0_,
p0_.lp as col_2_0_,
0
as col_3_0_,
max(coalesce(i6_.cft, i7_.rfo,
'')) as col_4_0_,
p0_.pdv as col_5_0_,
(s8_.qer)
as col_6_0_,
cf1_.ests as col_7_0_
from Produit p0_
left outer join CF cf1_ on p0_.fk_cf = cf1_.id
left outer join sk s8_ on p0_.id = s8_.fk_p
left outer join r_p_r rp4_ on p0_.id = rp4_.fk_p
left outer join re_p_g gc9_ on p0_.id = gc9_.fk_p
left outer join CA c2_ on cf1_.fk_ca = c2_.id
left outer join ml mt on c2_.fk_m = mt.id
left outer join rf r5_ on
rp4_.fk_r = r5_.id
left outer join
in i6_ on r5_.fk_ireftc = i6_.id
left outer join
ir i7_ on r5_.fk_if = i7_.id
left outer join gc g10_ on gc9_.fk_g = g10_.id
where
and (p0_.lC is null or p0_.lS = 'E')
and(g10_.id is null
and r5_.fk_i is null
group by col_1_0_, col_2_0_, col_3_0_, col_5_0_, col_6_0_, col_7_0_
order by col_2_0_ asc, p0_.id
limit 10;
In my case, I changed the order so that once a table is loaded, all the joins that depend on that table follow it directly, rather than being placed in another block further down. For example, all the left joins on my p0_ table are now in the first 4 lines, unlike in the first query.
PS: to test performance in PostgreSQL I use this website: http://tatiyants.com/pev/#/plans/new

At least in SQLite, I found out that it makes a huge difference. Actually it didn't need to be a very complex query for the difference to show itself. My JOIN statements were inside an embedded clause however.
Basically, you should start with the most specific limitations first, as Christian has pointed out.
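For SQLite specifically, one way to experiment with this is the CROSS JOIN operator, which SQLite's query planner does not reorder: the left table stays in the outer loop. The sketch below, using Python's built-in sqlite3 module, compares a normal join with a forced order; the `big` and `small` tables are made up for the illustration, and on such tiny data no timing difference should be expected.

```python
import sqlite3

# Hypothetical tables: 'small' is the selective one, 'big' has 100 rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE big (id INTEGER PRIMARY KEY, small_id INT);
    CREATE TABLE small (id INTEGER PRIMARY KEY, tag TEXT);
    INSERT INTO small VALUES (1, 'keep'), (2, 'drop');
""")
conn.executemany(
    "INSERT INTO big VALUES (?, ?)",
    [(i, 1 + i % 2) for i in range(1, 101)],
)

# The planner is free to pick the loop order here.
planner_choice = """
    SELECT COUNT(*)
    FROM big JOIN small ON small.id = big.small_id
    WHERE small.tag = 'keep'
"""

# CROSS JOIN pins 'small' as the outer loop and 'big' as the inner loop.
forced_order = """
    SELECT COUNT(*)
    FROM small CROSS JOIN big
    WHERE big.small_id = small.id AND small.tag = 'keep'
"""

count_planner = conn.execute(planner_choice).fetchone()[0]
count_forced = conn.execute(forced_order).fetchone()[0]

# Show both plans for comparison.
for label, sql in (("planner", planner_choice), ("forced", forced_order)):
    print(label)
    for step in conn.execute("EXPLAIN QUERY PLAN " + sql):
        print("   ", step)
```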

Related

Sort & Parallelism costing my query too much time

My SQL query is taking a long time to run. I wrote a similar query and pitted the two against each other: this one runs FASTER on a small dataset (10K rows), but about 20-30x slower than the other one on a LARGE dataset (500K+ rows). However, my first query lacks ONE column that I need, and I cannot add it without taking this different approach.
SELECT a.[RFIDTAGID], a.[JOB_NUMBER], d.[PROJECT_NUMBER], a.[PART_NUMBER], a.[QUANTITY], b.[DESIGNATION] as LOCATION,
c.[DESIGNATION] as CONTAINER, a.[LAST_SEEN_TIME], b.[TYPE], b.[BLDG], d.[PBG], d.[PLANNED_MFG_DELIVERY_DATE], d.[EXTENSION_DATE], a.[ORG_ID]
FROM [LTS].[dbo].[LTS_PACKAGE] as a
LEFT OUTER JOIN (
SELECT [DESIGNATION], [CONTAINER_ID], [LOCATION_ID]
FROM [LTS].[dbo].[LTS_CONTAINER]
) c ON a.[CONTAINER_ID] = c.[CONTAINER_ID]
LEFT OUTER JOIN (
SELECT [DESIGNATION], [TYPE], [BLDG], [LOCATION_ID]
FROM [LTS].[dbo].[LTS_LOCATION]
) b ON a.[LAST_SEEN_LOC_ID] = b.[LOCATION_ID] OR b.[LOCATION_ID] = c.[LOCATION_ID]
INNER JOIN (
SELECT [PBG], [PLANNED_MFG_DELIVERY_DATE], [EXTENSION_DATE], [DISCRETE_JOB_NUMBER], [PROJECT_NUMBER]
FROM [LTS].[dbo].[LTS_DISCRETE_JOB_SUMMARY]
)d ON a.[JOB_NUMBER] = d.[DISCRETE_JOB_NUMBER]
WHERE
d.[PLANNED_MFG_DELIVERY_DATE] <= GETDATE()
AND b.[TYPE] NOT IN('MFG', 'Manufacturing')
AND (b.[DESIGNATION] IS NOT NULL OR c.[DESIGNATION] IS NOT NULL)
ORDER BY [JOB_NUMBER], d.[PLANNED_MFG_DELIVERY_DATE] desc, [RFIDTAGID];
You can see below the usage, 100% is roughly 20,000, whereas my other query is about 900:
Is there something I can do to speed up my query, or where did I bog it down?
Remove inner selects and join directly to the tables:
SELECT a.[RFIDTAGID], a.[JOB_NUMBER], d.[PROJECT_NUMBER], a.[PART_NUMBER], a.[QUANTITY], b.[DESIGNATION] as LOCATION,
c.[DESIGNATION] as CONTAINER, a.[LAST_SEEN_TIME], b.[TYPE], b.[BLDG], d.[PBG], d.[PLANNED_MFG_DELIVERY_DATE], d.[EXTENSION_DATE], a.[ORG_ID]
FROM [LTS].[dbo].[LTS_PACKAGE] a
LEFT OUTER JOIN [LTS].[dbo].[LTS_CONTAINER] c
ON a.[CONTAINER_ID] = c.[CONTAINER_ID]
LEFT OUTER JOIN [LTS].[dbo].[LTS_LOCATION] b
ON a.[LAST_SEEN_LOC_ID] = b.[LOCATION_ID] OR b.[LOCATION_ID] = c.[LOCATION_ID]
INNER JOIN [LTS].[dbo].[LTS_DISCRETE_JOB_SUMMARY] d
ON a.[JOB_NUMBER] = d.[DISCRETE_JOB_NUMBER]
WHERE
d.[PLANNED_MFG_DELIVERY_DATE] <= GETDATE()
AND b.[TYPE] NOT IN('MFG', 'Manufacturing')
AND (b.[DESIGNATION] IS NOT NULL OR c.[DESIGNATION] IS NOT NULL)
ORDER BY [JOB_NUMBER], d.[PLANNED_MFG_DELIVERY_DATE] desc, [RFIDTAGID];

Query with combined WHERE clause is slower than two individual WHERE clauses

I'm having a performance problem with a SQL query that is generated by a .NET application.
Basically what the query is doing is:
(query1) left join (query2) right join (queries3 to 30) WHERE (query1.ID IS NULL) OR (query3.ID IS NULL AND query4.ID IS NULL AND… queryN.ID IS NULL)
When the query only does WHERE A (query1.ID), the query is fast.
When the query only does WHERE B (query3 to 30), the query is fast.
When A and B are combined in one WHERE clause with an OR, the query is very slow.
I'm looking for a way to optimize this query without variables or stored procedures.
The query:
SELECT DISTINCT [Table0].[FIELD]
FROM /*8*/ ([Table0] AS [Table0]
INNER JOIN
[XTABLE] AS [XTABLE0]
ON [Table0].ID = [XTABLE0].ID1
AND [XTABLE0].ID3 = 52)
RIGHT OUTER /*10*/ JOIN
[Table1] AS [Table1]
/*21*/ /*11*/ ON [XTABLE0].ID2 = [Table1].ID
AND [XTABLE0].ID3 = 52
LEFT OUTER JOIN
([XTABLE] AS [XTABLE1]
INNER JOIN
[Table2] AS [Table2]
ON [XTABLE1].ID1 = [Table2].ID
AND [XTABLE1].ID3 = 19
/*20a*/ INNER JOIN
[XTABLE] AS [XTABLE2]
ON [Table2].ID = [XTABLE2].ID1
AND [XTABLE2].ID3 = 8
INNER JOIN
[Table3] AS [Table3]
ON [XTABLE2].ID2 = [Table3].ID
AND [XTABLE2].ID3 = 8/*22*/ )
ON [Table1].ID = [XTABLE1].ID2
AND [XTABLE1].ID3 = 19
/*26 */ LEFT OUTER JOIN
([XTABLE] AS [XTABLE3]
... and tens of similar INNER JOIN blocks
WHERE (/*13*/ [XTABLE0].ID IS NULL)
OR (/*25*/ [XTABLE1].ID IS NULL
AND /*27b*/ [XTABLE3].ID IS NULL
AND /*27b*/ [XTABLE5].ID IS NULL
... and tens of similar lines
AND /*27b*/ [XTABLE131].ID IS NULL);
You are OUTER JOINing the queries, so when you put conditions in the WHERE clause on columns from the OUTER JOINed table expressions (derived tables in this case), the join will more than likely be treated as an INNER JOIN. You can see that by checking the query plan.

LEFT JOIN ON COALESCE(a, b, c) - very strange behavior

I have encountered very strange behavior in my query and wasted a lot of time trying to understand what causes it, in vain. So I am asking for your help.
SELECT count(*) FROM main_table
LEFT JOIN front_table ON front_table.pk = main_table.fk_front_table
LEFT JOIN info_table ON info_table.pk = front_table.fk_info_table
LEFT JOIN key_table ON key_table.pk = COALESCE(info_table.fk_key_table, front_table.fk_key_table_1, front_table.fk_key_table_2)
LEFT JOIN side_table ON side_table.fk_front_table = front_table.pk
WHERE side_table.pk = (SELECT MAX(pk) FROM side_table WHERE fk_front_table = front_table.pk)
OR side_table.pk IS NULL
Seems like a simple join query with COALESCE. I've used this technique before (not too many times) and it worked correctly.
In this query I don't ever get nulls for side_table.pk. If I remove coalesce or just don't use key_table, then the query returns rows with many null side_table.pk, but if I add coalesce join, I can't get those nulls.
It seems key_table and side_table don't have anything in common, but the result is so weird.
Also, when I don't use side_table and WHERE clause, the count(*) result with coalesce and without differs, but I can't see any pattern in rows missing, it seems random!
Real query:
SELECT EXCHANGE.EXC_AUTO_KEY, STOCK_RESERVATIONS.STR_AUTO_KEY FROM EXCHANGE
LEFT JOIN WO_BOM ON WO_BOM.WOB_AUTO_KEY = EXCHANGE.WOB_AUTO_KEY
LEFT JOIN VIEW_WO_SUB ON VIEW_WO_SUB.WOO_AUTO_KEY = WO_BOM.WOO_AUTO_KEY
LEFT JOIN STOCK stock3 ON stock3.STM_AUTO_KEY = EXCHANGE.STM_AUTO_KEY
LEFT JOIN STOCK stock2 ON stock2.STM_AUTO_KEY = EXCHANGE.ORIG_STM
LEFT JOIN CONSIGNMENT_CODES con2 ON con2.CNC_AUTO_KEY = stock2.CNC_AUTO_KEY
LEFT JOIN CONSIGNMENT_CODES con3 ON con3.CNC_AUTO_KEY = stock3.CNC_AUTO_KEY
LEFT JOIN CI_UTL ON CI_UTL.CUT_AUTO_KEY = EXCHANGE.CUT_AUTO_KEY
LEFT JOIN PART_CONDITION_CODES pcc2 ON pcc2.PCC_AUTO_KEY = stock2.PCC_AUTO_KEY
LEFT JOIN PART_CONDITION_CODES pcc3 ON pcc3.PCC_AUTO_KEY = stock3.PCC_AUTO_KEY
LEFT JOIN STOCK_RESERVATIONS ON STOCK_RESERVATIONS.STM_AUTO_KEY = stock3.STM_AUTO_KEY
LEFT JOIN WAREHOUSE wh2 ON wh2.WHS_AUTO_KEY = stock2.WHS_ORIGINAL
LEFT JOIN SM_HISTORY ON (SM_HISTORY.STM_AUTO_KEY = EXCHANGE.ORIG_STM AND SM_HISTORY.WOB_REF = EXCHANGE.WOB_AUTO_KEY)
LEFT JOIN RC_DETAIL ON stock3.RCD_AUTO_KEY = RC_DETAIL.RCD_AUTO_KEY
LEFT JOIN RC_HEADER ON RC_HEADER.RCH_AUTO_KEY = RC_DETAIL.RCH_AUTO_KEY
LEFT JOIN WAREHOUSE wh3 ON wh3.WHS_AUTO_KEY = COALESCE(RC_DETAIL.WHS_AUTO_KEY, stock3.WHS_ORIGINAL, stock3.WHS_AUTO_KEY)
WHERE STOCK_RESERVATIONS.STR_AUTO_KEY = (SELECT MAX(STR_AUTO_KEY) FROM STOCK_RESERVATIONS WHERE STM_AUTO_KEY = stock3.STM_AUTO_KEY)
OR STOCK_RESERVATIONS.STR_AUTO_KEY IS NULL
Removing LEFT JOIN WAREHOUSE wh3 gives me about unique EXC_AUTO_KEY values with a lot of NULL STR_AUTO_KEY, while leaving this row removes all NULL STR_AUTO_KEY.
I recreated simple tables with numbers with the same structure and query works without any problems o.0
I have a feeling COALESCE is acting as a REQUIRED flag for the joined table, hence turning the LEFT JOIN into an INNER JOIN.
Try this:
SELECT COUNT(*)
FROM main_table
LEFT JOIN front_table ON front_table.pk = main_table.fk_front_table
LEFT JOIN info_table ON info_table.pk = front_table.fk_info_table
LEFT JOIN key_table ON key_table.pk = NVL(info_table.fk_key_table, NVL(front_table.fk_key_table_1, front_table.fk_key_table_2))
LEFT JOIN (SELECT fk_, MAX(pk) as pk FROM side_table GROUP BY fk_) st ON st.fk_ = front_table.pk
NVL might behave just the same though...
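The derived-table idea in that query can be checked on toy data: join to one pre-aggregated row per parent instead of filtering with a correlated MAX subquery in WHERE. The sketch below uses Python's built-in sqlite3 module and hypothetical two-column versions of front_table and side_table; it only shows that the two formulations agree on this simple data, not that it cures the behaviour described in the question.

```python
import sqlite3

# Hypothetical cut-down tables; the real ones have more columns.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE front_table (pk INTEGER PRIMARY KEY);
    CREATE TABLE side_table (pk INTEGER PRIMARY KEY, fk_front_table INT);
    INSERT INTO front_table VALUES (1), (2), (3);
    INSERT INTO side_table VALUES (10, 1), (11, 1), (12, 2);
""")

# Original shape: left join everything, then keep the latest row (or none).
where_subquery = """
    SELECT f.pk, s.pk
    FROM front_table AS f
    LEFT JOIN side_table AS s ON s.fk_front_table = f.pk
    WHERE s.pk = (SELECT MAX(pk) FROM side_table WHERE fk_front_table = f.pk)
       OR s.pk IS NULL
    ORDER BY f.pk
"""

# Suggested shape: left join to one pre-aggregated row per front_table row.
derived_table = """
    SELECT f.pk, st.pk
    FROM front_table AS f
    LEFT JOIN (
        SELECT fk_front_table, MAX(pk) AS pk
        FROM side_table
        GROUP BY fk_front_table
    ) AS st ON st.fk_front_table = f.pk
    ORDER BY f.pk
"""

rows_where = conn.execute(where_subquery).fetchall()
rows_derived = conn.execute(derived_table).fetchall()
print(rows_where)
print(rows_derived)
```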
I understood what the problem was (though not entirely): there is a LEFT JOIN VIEW_WO_SUB in the original query, on the 3rd line. It causes the query to act in a weird way.
When I replaced the view with another table that contained the information I needed, the query started returning the right results.
Basically, with this view join in place, a join on NVL, COALESCE or CASE with certain combinations of arguments did not work together with the OR clause and the subquery in WHERE; everything else was fine. However, I could get the query to work with this view join by changing the order of the joined tables: I had to place the table that participates in the WHERE subquery at the bottom.

How to improve the performance of a SQL query even after adding indexes?

I am trying to execute the following SQL query, but it takes 22 seconds to execute. The number of returned items is 554192. I need to make this faster and have already put indexes on all the tables involved.
SELECT mc.name AS MediaName,
lcc.name AS Country,
i.overridedate AS Date,
oi.rating,
bl1.firstname + ' ' + bl1.surname AS Byline,
b.id BatchNo,
i.numinbatch ItemNumberInBatch,
bah.changedatutc AS BatchDate,
pri.code AS IssueNo,
pri.name AS Issue,
lm.neptunemessageid AS MessageNo,
lmt.name AS MessageType,
bl2.firstname + ' ' + bl2.surname AS SourceFullName,
lst.name AS SourceTypeDesc
FROM profiles P
INNER JOIN profileresults PR
ON P.id = PR.profileid
INNER JOIN items i
ON PR.itemid = I.id
INNER JOIN batches b
ON b.id = i.batchid
INNER JOIN itemorganisations oi
ON i.id = oi.itemid
INNER JOIN lookup_mediachannels mc
ON i.mediachannelid = mc.id
LEFT OUTER JOIN lookup_cities lc
ON lc.id = mc.cityid
LEFT OUTER JOIN lookup_countries lcc
ON lcc.id = mc.countryid
LEFT OUTER JOIN itembylines ib
ON ib.itemid = i.id
LEFT OUTER JOIN bylines bl1
ON bl1.id = ib.bylineid
LEFT OUTER JOIN batchactionhistory bah
ON b.id = bah.batchid
INNER JOIN itemorganisationissues ioi
ON ioi.itemorganisationid = oi.id
INNER JOIN projectissues pri
ON pri.id = ioi.issueid
LEFT OUTER JOIN itemorganisationmessages iom
ON iom.itemorganisationid = oi.id
LEFT OUTER JOIN lookup_messages lm
ON iom.messageid = lm.id
LEFT OUTER JOIN lookup_messagetypes lmt
ON lmt.id = lm.messagetypeid
LEFT OUTER JOIN itemorganisationsources ios
ON ios.itemorganisationid = oi.id
LEFT OUTER JOIN bylines bl2
ON bl2.id = ios.bylineid
LEFT OUTER JOIN lookup_sourcetypes lst
ON lst.id = ios.sourcetypeid
WHERE p.id = @profileID
AND b.statusid IN ( 6, 7 )
AND bah.batchactionid = 6
AND i.statusid = 2
AND i.isrelevant = 1
When looking at the execution plan, I can see a step that accounts for 42% of the cost. Is there any way I could get this lower, or any way to improve the performance of the whole query?
Remove the profiles table as it is not needed and change the WHERE clause to
WHERE PR.profileid = @profileID
You have a left outer join on the batchactionhistory table, but you also have a condition on it in your WHERE clause, which turns it back into an inner join. Change your code to this:
LEFT OUTER JOIN batchactionhistory bah
ON b.id = bah.batchid
AND bah.batchactionid = 6
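The difference between the two placements of that condition can be seen on a few rows. This is a minimal sketch using Python's built-in sqlite3 module, with hypothetical cut-down versions of batches and batchactionhistory. Note that the two forms are not equivalent: with the condition in ON, batches without a matching history row are kept with NULLs, so which one is right depends on the result you actually want.

```python
import sqlite3

# Hypothetical cut-down tables for the illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE batches (id INTEGER PRIMARY KEY);
    CREATE TABLE batchactionhistory (
        id INTEGER PRIMARY KEY, batchid INT, batchactionid INT);
    INSERT INTO batches VALUES (1), (2), (3);
    INSERT INTO batchactionhistory VALUES (1, 1, 6), (2, 2, 5);
""")

# Condition in WHERE: NULL-extended rows fail the test, so the outer join
# behaves like an inner join.
filter_in_where = """
    SELECT b.id, bah.batchactionid
    FROM batches b
    LEFT OUTER JOIN batchactionhistory bah
        ON b.id = bah.batchid
    WHERE bah.batchactionid = 6
    ORDER BY b.id
"""

# Condition in ON: every batch is kept, with NULL where nothing matches.
filter_in_on = """
    SELECT b.id, bah.batchactionid
    FROM batches b
    LEFT OUTER JOIN batchactionhistory bah
        ON b.id = bah.batchid
        AND bah.batchactionid = 6
    ORDER BY b.id
"""

rows_where = conn.execute(filter_in_where).fetchall()
rows_on = conn.execute(filter_in_on).fetchall()
print(rows_where)
print(rows_on)
```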
You don't need the batches table, as it is only used to join other tables that could be joined directly and to show the id in your SELECT, which is also available in other tables. Make the following changes:
i.batchid AS BatchNo,
LEFT OUTER JOIN batchactionhistory bah
ON i.batchid = bah.batchid
Are any of the fields that are used in joins or the WHERE clause from tables that contain large amounts of data but are not indexed? If so, try adding indexes one at a time, starting with the largest table.
Do you need every field in the result? If you could lose one or two, you might be able to reduce the number of tables further.
First, if this is not a stored procedure, make it one. That's a lot of text for SQL Server to compile.
Next, my experience is that "worst practices" are occasionally a good idea. Specifically, I have been able to improve performance by splitting large queries into a couple or three small ones and assembling the results.
If this query is associated with a .net, coldfusion, java, etc application, you might be able to do the split/re-assemble in your application code. If not, a temporary table might come in handy.

Super Slow Query - sped up, but not perfect... Please help

I posted a query yesterday (see here) that was horrible (took over a minute to run, resulting in 18,215 records):
SELECT DISTINCT
dbo.contacts_link_emails.Email, dbo.contacts.ContactID, dbo.contacts.First AS ContactFirstName, dbo.contacts.Last AS ContactLastName, dbo.contacts.InstitutionID,
dbo.institutionswithzipcodesadditional.CountyID, dbo.institutionswithzipcodesadditional.StateID, dbo.institutionswithzipcodesadditional.DistrictID
FROM
dbo.contacts_def_jobfunctions AS contacts_def_jobfunctions_3
INNER JOIN
dbo.contacts
INNER JOIN
dbo.contacts_link_emails
ON dbo.contacts.ContactID = dbo.contacts_link_emails.ContactID
ON contacts_def_jobfunctions_3.JobID = dbo.contacts.JobTitle
INNER JOIN
dbo.institutionswithzipcodesadditional
ON dbo.contacts.InstitutionID = dbo.institutionswithzipcodesadditional.InstitutionID
LEFT OUTER JOIN
dbo.contacts_def_jobfunctions
INNER JOIN
dbo.contacts_link_jobfunctions
ON dbo.contacts_def_jobfunctions.JobID = dbo.contacts_link_jobfunctions.JobID
ON dbo.contacts.ContactID = dbo.contacts_link_jobfunctions.ContactID
WHERE
(dbo.contacts.JobTitle IN
(SELECT JobID
FROM dbo.contacts_def_jobfunctions AS contacts_def_jobfunctions_1
WHERE (ParentJobID <> '1841')))
AND
(dbo.contacts_link_emails.Email NOT IN
(SELECT EmailAddress
FROM dbo.newsletterremovelist))
OR
(dbo.contacts_link_jobfunctions.JobID IN
(SELECT JobID
FROM dbo.contacts_def_jobfunctions AS contacts_def_jobfunctions_2
WHERE (ParentJobID <> '1841')))
AND
(dbo.contacts_link_emails.Email NOT IN
(SELECT EmailAddress
FROM dbo.newsletterremovelist AS newsletterremovelist))
ORDER BY EMAIL
With a lot of coaching and research, I've tuned it up to the following:
SELECT contacts.ContactID,
contacts.InstitutionID,
contacts.First,
contacts.Last,
institutionswithzipcodesadditional.CountyID,
institutionswithzipcodesadditional.StateID,
institutionswithzipcodesadditional.DistrictID
FROM contacts
INNER JOIN contacts_link_emails ON
contacts.ContactID = contacts_link_emails.ContactID
INNER JOIN institutionswithzipcodesadditional ON
contacts.InstitutionID = institutionswithzipcodesadditional.InstitutionID
WHERE
(contacts.ContactID IN
(SELECT contacts_2.ContactID
FROM contacts AS contacts_2
INNER JOIN contacts_link_emails AS contacts_link_emails_2 ON
contacts_2.ContactID = contacts_link_emails_2.ContactID
LEFT OUTER JOIN contacts_def_jobfunctions ON
contacts_2.JobTitle = contacts_def_jobfunctions.JobID
RIGHT OUTER JOIN newsletterremovelist ON
contacts_link_emails_2.Email = newsletterremovelist.EmailAddress
WHERE (contacts_def_jobfunctions.ParentJobID <> 1841)
GROUP BY contacts_2.ContactID
UNION
SELECT contacts_1.ContactID
FROM contacts_link_jobfunctions
INNER JOIN contacts_def_jobfunctions AS contacts_def_jobfunctions_1 ON
contacts_link_jobfunctions.JobID = contacts_def_jobfunctions_1.JobID
AND contacts_def_jobfunctions_1.ParentJobID <> 1841
INNER JOIN contacts AS contacts_1 ON
contacts_link_jobfunctions.ContactID = contacts_1.ContactID
INNER JOIN contacts_link_emails AS contacts_link_emails_1 ON
contacts_link_emails_1.ContactID = contacts_1.ContactID
LEFT OUTER JOIN newsletterremovelist AS newsletterremovelist_1 ON
contacts_link_emails_1.Email = newsletterremovelist_1.EmailAddress
GROUP BY contacts_1.ContactID))
While this query is now super fast (about 3 seconds), I've blown part of the logic somewhere - it only returns 14,863 rows (instead of the 18,215 rows that I believe is accurate).
The results seem near correct. I'm working to discover what data might be missing in the result set.
Can you please coach me through whatever I've done wrong here?
Thanks,
Russell Schutte
The main problem with your original query was that you had two extra joins just to introduce duplicates and then a DISTINCT to get rid of them.
Use this:
SELECT cle.Email,
c.ContactID,
c.First AS ContactFirstName,
c.Last AS ContactLastName,
c.InstitutionID,
izip.CountyID,
izip.StateID,
izip.DistrictID
FROM dbo.contacts c
INNER JOIN
dbo.institutionswithzipcodesadditional izip
ON izip.InstitutionID = c.InstitutionID
INNER JOIN
dbo.contacts_link_emails cle
ON cle.ContactID = c.ContactID
WHERE cle.Email NOT IN
(
SELECT EmailAddress
FROM dbo.newsletterremovelist
)
AND EXISTS
(
SELECT NULL
FROM dbo.contacts_def_jobfunctions cdj
WHERE cdj.JobId = c.JobTitle
AND cdj.ParentJobId <> '1841'
UNION ALL
SELECT NULL
FROM dbo.contacts_link_jobfunctions clj
JOIN dbo.contacts_def_jobfunctions cdj
ON cdj.JobID = clj.JobID
WHERE clj.ContactID = c.ContactID
AND cdj.ParentJobId <> '1841'
)
ORDER BY
email
Create the following indexes:
newsletterremovelist (EmailAddress)
contacts_link_jobfunctions (ContactID, JobID)
contacts_def_jobfunctions (JobID)
Do you get the same results when you do:
SELECT count(*)
FROM
dbo.contacts_def_jobfunctions AS contacts_def_jobfunctions_3
INNER JOIN
dbo.contacts
INNER JOIN
dbo.contacts_link_emails
ON dbo.contacts.ContactID = dbo.contacts_link_emails.ContactID
ON contacts_def_jobfunctions_3.JobID = dbo.contacts.JobTitle
SELECT COUNT(*)
FROM
contacts
INNER JOIN contacts_link_jobfunctions
ON contacts.ContactID = contacts_link_jobfunctions.ContactID
INNER JOIN contacts_link_emails
ON contacts.ContactID = contacts_link_emails.ContactID
If so, keep adding each join condition until you no longer get the same results, and you will see where your mistake was. If all the joins give the same counts, then look at the WHERE clauses. But I will be surprised if it isn't in the first join, because the syntax you had originally won't even work on SQL Server; it is pretty nonstandard SQL and may have been incorrect all along without anyone knowing.
Alternatively, pick a few of the records that are returned by the original query but not by the revised one. Track them through the tables one at a time to see if you can find why the second query filters them out.
I'm not directly sure what is wrong, but when I run in to this situation, the first thing I do is start removing variables.
So, comment out the where clause. How many rows are returned?
If you get back the 11,604 rows, then you've isolated the problem to the joins. Work through the joins, commenting each one out (remove the associated columns too), and figure out how many rows are eliminated.
As you do this, aim to find what is causing the desired rows to be eliminated. Once isolated, consider the join differences between the first query and the second query.
Looking at the first query, you could probably modify it to eliminate the INs and use EXISTS instead.
Consider your indexes as well. Anything in the WHERE or JOIN clauses should probably be indexed.
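The IN-to-EXISTS rewrite mentioned above might look like the following sketch, run against toy data with Python's built-in sqlite3 module. The tables are hypothetical cut-down versions of contacts and contacts_def_jobfunctions. For a plain IN the two forms return the same rows; be more careful with NOT IN, which behaves differently from NOT EXISTS when the subquery can return NULLs.

```python
import sqlite3

# Hypothetical cut-down tables for the illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE contacts (ContactID INTEGER PRIMARY KEY, JobTitle INT);
    CREATE TABLE contacts_def_jobfunctions (
        JobID INTEGER PRIMARY KEY, ParentJobID INT);
    INSERT INTO contacts_def_jobfunctions VALUES (1, 1841), (2, 7), (3, 7);
    INSERT INTO contacts VALUES (100, 1), (101, 2), (102, 3), (103, NULL);
""")

# Original style: membership test against a subquery.
with_in = """
    SELECT ContactID
    FROM contacts
    WHERE JobTitle IN (
        SELECT JobID
        FROM contacts_def_jobfunctions
        WHERE ParentJobID <> 1841
    )
    ORDER BY ContactID
"""

# Rewritten style: correlated EXISTS.
with_exists = """
    SELECT c.ContactID
    FROM contacts AS c
    WHERE EXISTS (
        SELECT 1
        FROM contacts_def_jobfunctions AS j
        WHERE j.JobID = c.JobTitle
          AND j.ParentJobID <> 1841
    )
    ORDER BY c.ContactID
"""

rows_in = conn.execute(with_in).fetchall()
rows_exists = conn.execute(with_exists).fetchall()
print(rows_in)
print(rows_exists)
```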