How do I improve my code for better/faster performance? - sql

I have this INSERT setup where I want the following to be filtered and inserted in a table. The 'Splitout' has more than 30000000 rows. This statement is taking more than 2 hours for a single project I have 100 projects like this.
My initial plan was to insert everything at once but because it took more than 20 for it to execute I had to split by projects but even then the performance was very low. I was planning on using CROSS-APPLY but wasn't really sure how it would apply in my case. Any suggestions to improve the performance is appreciated.
Below is the code I have now - Thank you !
insert into DimQuestion (ResponderKey, ProjectID, qid, Question, QuestionType,AttributeID,
Attribute, ProductID, ProductCode, ProductName, AnswerCode, AnswerLabel)
SELECT distinct RKey
,a.ProID
,a.qid
,c.QID + ' - ' + c.[Ql] as Question
,c.[Type] as QType
,a.AttributeID
,e.Attribute
,a.ProductID
,d.ProductCode
,d.ProductName
,a.Answers
,'AnswerLabel' = case when a.qid not in ('Q2','QA','QA1','QA2','QA5','QA6','QA7','QA8','QF1','QF2','QF2c','QF2a','QF5',
'QF6','QF7','QF8','QF9','QF10','QX5','QX12')
then datamap.[Answer Label]
when a.qid = 'Q2'
then f.AnswerLabel
when a.qid in ('QA','QA1','QA2','QA5','QA6','QA7','QA8','QF1','QF2','QF2c','QF2a','QF5',
'QF6','QF7','QF8','QF9','QF10','QX5','QX12')
then a.Answers END
FROM [SplitOut] a
INNER join [DimResponder] b on a.responseid = b.ResponseID and a.ProjectID = b.ProjectID
INNER join Question_List c on a.qid = c.qid
left outer JOIN Data_Map datamap ON a.QID = datamap.QID and a.answers = datamap.[answer code]
left outer join DimProduct d on a.ProductID = d.ProductTypeCode and a.ProjectID = d.ProjectID
left outer join DimAttribute e on e.projectid = 0 and a.AttributeID = e.AttributeCode
left outer join Q2AnswerData f on a.QID = f.QID and a.Answers = f.AnswerCode and a.AttributeID = f.VariableID
where a.columnNames not like '%open%' and a.ColumnNames not like '%seg%' and a.columnnames not like '%rot%' and a.Answers not like ''and datamap.Project not in ('Project 0') and a.ProjectID in (1,2,3,4,5,6,7,8,9,10)

Make sure columns in the ON clauses are indexed.
Just looking at the SQL and not having the table counts and explain plan, here's something to try. Use a subquery for the SPLITOUT table. You say that table has 30,000,000 rows. Most of your where clause qualifiers are against the SPLITOUT table, so I'm using a subquery to reduce the number of rows joined to it. The way it's coded in your version, SPLITOUT will possibly be joined to the other tables before the where clause is applied. I agree with the comments that like clauses are bad. They don't use an index so you're most likely table scanning your 30,000,000 row table.
I also made the call to DATAMAP a subquery because it's a left join with a qualifier in the where clause. If there's no row, the qualifier will fail when you may have wanted it to succeed.
Run the subquery on SPLITOUT by itself. Tune it first. Create a composite index on splitout.projectID, answers, and columnNames. If the optimizer uses it for projectID, the likes on columnNames may be index scanned. Once the SPLITOUT subquery is tuned, add the other joins in one at a time.
Try to remove the distinct with cleaner joins. The optimizer has to sort to do a distinct which is costly.
Don't use like and in when you don't need to. Use = and not when possible.
I wouldn't use cross join for a query such as this.
insert into DimQuestion (ResponderKey, ProjectID, qid, Question, QuestionType,AttributeID, Attribute, ProductID, ProductCode, ProductName, AnswerCode, AnswerLabel)
SELECT distinct RKey
,a.ProID
,a.qid
,c.QID + ' - ' + c.[Ql] as Question
,c.[Type] as QType
,a.AttributeID
,e.Attribute
,a.ProductID
,d.ProductCode
,d.ProductName
,a.Answers
,'AnswerLabel' = case when a.qid = 'Q2' then f.AnswerLabel
when a.qid not in ('Q2','QA','QA1','QA2','QA5','QA6','QA7','QA8','QF1','QF2','QF2c','QF2a','QF5',
'QF6','QF7','QF8','QF9','QF10','QX5','QX12')
then datamap.[Answer Label]
when a.qid in ('QA','QA1','QA2','QA5','QA6','QA7','QA8','QF1','QF2','QF2c','QF2a','QF5',
'QF6','QF7','QF8','QF9','QF10','QX5','QX12')
then a.Answers
end
FROM (select a.responseID, a.ProjectID, a.ProID, a.qid, a.AttributeID, a.ProductID, a.Answers
from [SplitOut]
where a.ProjectID in (1,2,3,4,5,6,7,8,9,10)
and a.Answers <> '' -- don't use like when equality or inequality will work, note not and <> do not use the index
and a.columnNames not like '%open%' and a.ColumnNames not like '%seg%' and a.columnnames not like '%rot%' -- very bad, won't use index, consider creating a category or codes column to identify these values.
) a
inner join [DimResponder] b on a.responseid = b.ResponseID and a.ProjectID = b.ProjectID
inner join Question_List c on a.qid = c.qid
left outer join (select quid, [Answer Code] ,[Answer Label] from datamap where datamap.Project <> 'Project 0') datamap ON a.QID = datamap.QID and a.answers = datamap.[answer code]
left outer join DimProduct d on a.ProductID = d.ProductTypeCode and a.ProjectID = d.ProjectID
left outer join DimAttribute e on e.projectid = 0 and a.AttributeID = e.AttributeCode
left outer join Q2AnswerData f on a.QID = f.QID and a.Answers = f.AnswerCode and a.AttributeID = f.VariableID

Inner and outer joins tend to be expensive (used to do a lot of SQL years ago and they were then). You may be able to do it as several insert statements and it may be a lot faster, expecialy if you can reduce/get rid on inner/outer joins.
Also maybe making some temp tables will help, pre processing some of the data.
Also prociding a list of indexed will realy help. Indexes can make the diference wetween minutes and houres,

Related

SQL subquery with join to main query

I have this:
SELECT
SU.FullName as Salesperson,
COUNT(DS.new_dealsheetid) as Units,
SUM(SO.New_profits_sales_totaldealprofit) as TDP,
SUM(SO.New_profits_sales_totaldealprofit) / COUNT(DS.new_dealsheetid) as PPU,
-- opportunities subquery
(SELECT COUNT(*) FROM Opportunity O
LEFT JOIN Account A ON O.AccountId = A.AccountId
WHERE A.OwnerId = SU.SystemUserId AND
YEAR(O.CreatedOn) = 2022)
-- /opportunities subquery
FROM New_dealsheet DS
LEFT JOIN SalesOrder SO ON DS.New_DSheetId = SO.SalesOrderId
LEFT JOIN New_salespeople SP ON DS.New_SalespersonId = SP.New_salespeopleId
LEFT JOIN SystemUser SU ON SP.New_SystemUserId = SU.SystemUserId
WHERE
YEAR(SO.New_purchaseordersenddate) = 2022 AND
SP.New_SalesGroupIdName = 'LO'
GROUP BY SU.FullName
I'm getting an error from the subquery:
Column 'SystemUser.SystemUserId' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
Is it possible to use the SystemUser table join from the main query in this way?
As has been mentioned extensively in the comments, the error is actually telling you the problem; SU.SystemUserId isn't in the GROUP BY nor in an aggregate function, and it appears in the SELECT of the query (albeit in the WHERE of a correlated subquery). Any columns in the SELECT must be either aggregated or in the GROUP BY when using one of the other for a query scope. As the column in question isn't aggregated nor in the GROUP BY, the error occurs.
There are, however, other problems. Like mentioned ikn the comments too, your LEFT JOINs make little sense, as many of the tables you LEFT JOIN to require a column in that table to have a non-NULL value; it is impossible for a column to have a non-NULL value if a row was not found.
You also use syntax like YEAR(<Column Name>) = <int Value> in the WHERE; this is not SARGable, and thus should be avoided. Use explicit date boundaries instead.
I assume here that SU.SystemUserId is a primary key, and so should be in the GROUP BY. This is probably a good thing anyway, as a person's full name isn't something that can be used to determine who a person is on their own (take this from someone who shared their name, date of birth and post code with another person in their youth; it caused many problems on the rudimentary IT systems of the time). This results in a query like this:
SELECT SU.FullName AS Salesperson,
COUNT(DS.new_dealsheetid) AS Units,
SUM(SO.New_profits_sales_totaldealprofit) AS TDP,
SUM(SO.New_profits_sales_totaldealprofit) / COUNT(DS.new_dealsheetid) AS PPU,
(SELECT COUNT(*)
FROM dbo.Opportunity O
JOIN dbo.Account A ON O.AccountId = A.AccountId --A.OwnerID must have a non-NULL value, so why was this a LEFT JOIN?
WHERE A.OwnerId = SU.SystemUserId
AND O.CreatedOn >= '20220101' --Don't use YEAR(<Column Name>) = <int Value> syntax, it isn't SARGable
AND O.CreatedOn < '20230101') AS SomeColumnAlias
FROM dbo.New_dealsheet DS
JOIN dbo.SalesOrder SO ON DS.New_DSheetId = SO.SalesOrderId --SO.New_purchaseordersenddate must have a non-NULL value, so why was this a LEFT JOIN?
JOIN dbo.New_salespeople SP ON DS.New_SalespersonId = SP.New_salespeopleId --SP.New_SalesGroupIdName must have a non-NULL value, so why was this a LEFT JOIN?
LEFT JOIN dbo.SystemUser SU ON SP.New_SystemUserId = SU.SystemUserId --This actually looks like it can be a LEFT JOIN.
WHERE SO.New_purchaseordersenddate >= '20220101' --Don't use YEAR(<Column Name>) = <int Value> syntax, it isn't SARGable
AND SO.New_purchaseordersenddate < '20230101'
AND SP.New_SalesGroupIdName = 'LO'
GROUP BY SU.FullName,
SU.SystemUserId;
Doing such a sub-query is bad on a performance-wise point of view
better do it like this:
SELECT
SU.FullName as Salesperson,
COUNT(DS.new_dealsheetid) as Units,
SUM(SO.New_profits_sales_totaldealprofit) as TDP,
SUM(SO.New_profits_sales_totaldealprofit) / COUNT(DS.new_dealsheetid) as PPU,
SUM(csq.cnt) as Count
FROM New_dealsheet DS
LEFT JOIN SalesOrder SO ON DS.New_DSheetId = SO.SalesOrderId
LEFT JOIN New_salespeople SP ON DS.New_SalespersonId = SP.New_salespeopleId
LEFT JOIN SystemUser SU ON SP.New_SystemUserId = SU.SystemUserId
-- Moved subquery as sub-join
LEFT JOIN (SELECT a.OwnerId, YEAR(o.CreatedOn) as year, COUNT(*) cnt FROM Opportunity O
LEFT JOIN Account A ON O.AccountId = A.AccountId
GROUP BY a.OwnerId, YEAR(o.CreatedOn) as csq ON csq.OwnerId = su.SystemUserId and csqn.Year = 2022
WHERE
YEAR(SO.New_purchaseordersenddate) = 2022 AND
SP.New_SalesGroupIdName = 'LO'
GROUP BY SU.FullName
So you have a nice join and a clean result
The query above is untested

How to do operations between a column and a subquery

I would like to know how I can do operations between a column and a subquery, what I want to do is add to the field Subtotal what was obtained in the subquery Impuestos, the following is the query that I am using for this case.
Select
RC.PURCHID;
LRC.VALUEMST as 'Subtotal',
isnull((
select sum((CONVERT(float, TD1.taxvalue)/100)*LRC1.VALUEMST ) as a
FROM TAXONITEM TOI1
inner join TAXDATA TD1 ON (TD1.TAXCODE = TOI1.TAXCODE and RC.DATAAREAID = TD1.DATAAREAID)
inner join TRANS LRC1 on (LRC1.VEND = RC.RECID)
WHERE TOI1.TAXITEMGROUP = PL.TAXITEMGROUP and RC.DATAAREAID = TOI1.DATAAREAID
), 0) Impuestos
from VEND RC
inner join VENDTABLE VTB on VTB.ACCOUNTNUM = RC.INVOICEACCOUNT
inner join TRANS LRC on (LRC.VEND = RC.RECID)
inner join PURCHLINE PL on (PL.LINENUMBER =LRC.LINENUM and PL.PURCHID =RC.PURCHID)
where year (RC.DELIVERYDATE) =2021 and RC.PURCHASETYPE =3 order by RC.PURCHID;
Hope someone can give me some guidance when doing operations with subqueries.
A few disjointed facts that may help:
When a SELECT statement returns only one row with one column, you can enclose that statement in parenthesis and use it as a plain value. In your case, let's say that select sum(......= TOI1.DATAAREAID returns 500. Then, your outer select's second column is equivalent to isnull(500,0)
You mention in your question "subquery Impuestos". Keep in mind that, although you indeed used a subquery as we mentioned earlier, by the time it was enclosed in parentheses it is not treated as a subquery (more accurately: derived table), but as a value. Thus, the "Impuestos" is only a column alias at this point
I dislike and avoid subqueries before the from, makes things much harder to read. Here is a solution with apply which will keep your code mostly intact:
Select
RC.PURCHID,
LRC.VALUEMST as 'Subtotal',
isnull(subquery1.a, 0) as Impuestos
from VEND RC
inner join VENDTABLE VTB on VTB.ACCOUNTNUM = RC.INVOICEACCOUNT
inner join TRANS LRC on (LRC.VEND = RC.RECID)
inner join PURCHLINE PL on (PL.LINENUMBER =LRC.LINENUM and PL.PURCHID =RC.PURCHID)
outer apply
(
select sum((CONVERT(float, TD1.taxvalue)/100)*LRC1.VALUEMST ) as a
FROM TAXONITEM TOI1
inner join TAXDATA TD1 ON (TD1.TAXCODE = TOI1.TAXCODE and RC.DATAAREAID = TD1.DATAAREAID)
inner join TRANS LRC1 on (LRC1.VEND = RC.RECID)
WHERE TOI1.TAXITEMGROUP = PL.TAXITEMGROUP and RC.DATAAREAID = TOI1.DATAAREAID
) as subquery1
where year (RC.DELIVERYDATE) =2021 and RC.PURCHASETYPE =3 order by RC.PURCHID;

How do I optimize my query in MySQL?

I need to improve my query, specially the execution time.This is my query:
SELECT SQL_CALC_FOUND_ROWS p.*,v.type,v.idName,v.name as etapaName,m.name AS manager,
c.name AS CLIENT,
(SELECT SEC_TO_TIME(SUM(TIME_TO_SEC(duration)))
FROM activities a
WHERE a.projectid = p.projectid) AS worked,
(SELECT SUM(TIME_TO_SEC(duration))
FROM activities a
WHERE a.projectid = p.projectid) AS worked_seconds,
(SELECT SUM(TIME_TO_SEC(remain_time))
FROM tasks t
WHERE t.projectid = p.projectid) AS remain_time
FROM projects p
INNER JOIN users m
ON p.managerid = m.userid
INNER JOIN clients c
ON p.clientid = c.clientid
INNER JOIN `values` v
ON p.etapa = v.id
WHERE 1 = 1
ORDER BY idName
ASC
The execution time of this is aprox. 5 sec. If i remove this part: (SELECT SUM(TIME_TO_SEC(remain_time)) FROM tasks t WHERE t.projectid = p.projectid) AS remain_time
the execution time is reduced to 0.3 sec. Is there a way to get the values of the remain_time in order to reduce the exec.time ?
The SQL is invoked from PHP (if this is relevant to any proposed solution).
It sounds like you need an index on tasks.
Try adding this one:
create index idx_tasks_projectid_remaintime on tasks(projectid, remain_time);
The correlated subquery should just use the index and go much faster.
Optimizing the query as it is written would give significant performance benefits (see below). But the FIRST QUESTION TO ASK when approaching any optimization is whether you really need to see all the data - there is no filtering of the resultset implemented here. This is a HUGE impact on how you optimize a query.
Adding an index on the query above will only help if the optimizer is opening a new cursor on the tasks table for every row returned by the main query. In the absence of any filtering, it will be much faster to do a full table scan of the tasks table.
SELECT ilv.*, remaining.rtime
FROM (
SELECT p.*,v.type, v.idName, v.name as etapaName,
m.name AS manager, c.name AS CLIENT,
SEC_TO_TIME(asbq.worked) AS worked, asbq.worked AS seconds_worked,
FROM projects p
INNER JOIN users m
ON p.managerid = m.userid
INNER JOIN clients c
ON p.clientid = c.clientid
INNER JOIN `values` v
ON p.etapa = v.id
LEFT JOIN (
SELECT a.projectid, SUM(TIME_TO_SEC(duration)) AS worked
FROM activities a
GROUP BY a.projectid
) asbq
ON asbq.projectid=p.projectid
) ilv
LEFT JOIN (
(SELECT t.project_id, SUM(TIME_TO_SEC(remain_time)) as rtime
FROM tasks t
GROUP BY t.projectid) remaining
ON ilv.projectid=remaining.projectid

Strange performance issue with SELECT (SUBQUERY)

I have a stored procedure that has been having some issues lately and I finally narrowed it down to 1 SELECT. The problem is I cannot figure out exactly what is happening to kill the performance of this one query. I re-wrote it, but I am not sure the re-write is the exact same data.
Original Query:
SELECT
#userId, p.job, p.charge_code, p.code
, (SELECT SUM(b.total) FROM dbo.[backorder w/total] b WHERE b.ponumber = p.ponumber AND b.code = p.code)
, ISNULL(jm.markup, 0)
, (SELECT SUM(b.TOTAL_TAX) FROM dbo.[backorder w/total] b WHERE b.ponumber = p.ponumber AND b.code = p.code)
, p.ponumber
, p.billable
, p.[date]
FROM dbo.PO p
INNER JOIN dbo.JobCostFilter jcf
ON p.job = jcf.jobno AND p.charge_code = jcf.chargecode AND jcf.userno = #userId
LEFT JOIN dbo.JobMarkup jm
ON jm.jobno = p.job
AND jm.code = p.code
LEFT JOIN dbo.[Working Codes] wc
ON p.code = wc.code
INNER JOIN dbo.JOBFILE j
ON j.JOB_NO = p.job
WHERE (wc.brcode <> 4 OR #BmtDb = 0)
GROUP BY p.job, p.charge_code, p.code, p.ponumber, p.billable, p.[date], jm.markup, wc.brcode
This query will practically never finish running. It actually times out for some larger jobs we have.
And if I change the 2 subqueries in the select to read like joins instead:
SELECT
#userid, p.job, p.charge_code, p.code
, (SELECT SUM(b.TOTAL))
, ISNULL(jm.markup, 0)
, (SELECT SUM(b.TOTAL_TAX))
, p.ponumber, p.billable, p.[date]
FROM dbo.PO p
INNER JOIN dbo.JobCostFilter jcf
ON p.job = jcf.jobno AND p.charge_code = jcf.chargecode AND jcf.userno = 11190030
INNER JOIN [BACKORDER W/TOTAL] b
ON P.PONUMBER = b.ponumber AND P.code = b.code
LEFT JOIN dbo.JobMarkup jm
ON jm.jobno = p.job
AND jm.code = p.code
LEFT JOIN dbo.[Working Codes] wc
ON p.code = wc.code
INNER JOIN dbo.JOBFILE j
ON j.JOB_NO = p.job
WHERE (wc.brcode <> 4 OR #BmtDb = 0)
GROUP BY p.job, p.charge_code, p.code, p.ponumber, p.billable, p.[date], jm.markup, wc.brcode
The data comes out looking very nearly identical to me (though there are thousands of lines overall so I could be wrong), and it runs very quickly.
Any ideas appreciated..
Performace
In the second query you have less logical reads because the table [BACKORDER W/TOTAL] has been scanned only once. In the first query two separate subqueries are processed indenpendent and the table is scanned twice although both subqueries have the same predicates.
Correctness
If you want to check if two queries return the same resultset you can use the EXCEPT operator:
If both statements:
First SELECT Query...
EXCEPT
Second SELECT Query...
and
Second SELECT Query..
EXCEPT
First SELECT Query...
return an empty set the resultsets are identical.
In terms of correctness, you are inner joining [BACKORDER W/TOTAL] in the second query, so if the first query has Null values in the subqueries, these rows would be missing in the second query.
For performance, the optimizer is a heuristic - it will sometimes use spectacularly bad query plans, and even minimal changes can sometimes lead to a completely different query plan. Your best chance is to compare the query plans and see what causes the difference.

Super Slow Query - sped up, but not perfect... Please help

I posted a query yesterday (see here) that was horrible (took over a minute to run, resulting in 18,215 records):
SELECT DISTINCT
dbo.contacts_link_emails.Email, dbo.contacts.ContactID, dbo.contacts.First AS ContactFirstName, dbo.contacts.Last AS ContactLastName, dbo.contacts.InstitutionID,
dbo.institutionswithzipcodesadditional.CountyID, dbo.institutionswithzipcodesadditional.StateID, dbo.institutionswithzipcodesadditional.DistrictID
FROM
dbo.contacts_def_jobfunctions AS contacts_def_jobfunctions_3
INNER JOIN
dbo.contacts
INNER JOIN
dbo.contacts_link_emails
ON dbo.contacts.ContactID = dbo.contacts_link_emails.ContactID
ON contacts_def_jobfunctions_3.JobID = dbo.contacts.JobTitle
INNER JOIN
dbo.institutionswithzipcodesadditional
ON dbo.contacts.InstitutionID = dbo.institutionswithzipcodesadditional.InstitutionID
LEFT OUTER JOIN
dbo.contacts_def_jobfunctions
INNER JOIN
dbo.contacts_link_jobfunctions
ON dbo.contacts_def_jobfunctions.JobID = dbo.contacts_link_jobfunctions.JobID
ON dbo.contacts.ContactID = dbo.contacts_link_jobfunctions.ContactID
WHERE
(dbo.contacts.JobTitle IN
(SELECT JobID
FROM dbo.contacts_def_jobfunctions AS contacts_def_jobfunctions_1
WHERE (ParentJobID <> '1841')))
AND
(dbo.contacts_link_emails.Email NOT IN
(SELECT EmailAddress
FROM dbo.newsletterremovelist))
OR
(dbo.contacts_link_jobfunctions.JobID IN
(SELECT JobID
FROM dbo.contacts_def_jobfunctions AS contacts_def_jobfunctions_2
WHERE (ParentJobID <> '1841')))
AND
(dbo.contacts_link_emails.Email NOT IN
(SELECT EmailAddress
FROM dbo.newsletterremovelist AS newsletterremovelist))
ORDER BY EMAIL
With a lot of coaching and research, I've tuned it up to the following:
SELECT contacts.ContactID,
contacts.InstitutionID,
contacts.First,
contacts.Last,
institutionswithzipcodesadditional.CountyID,
institutionswithzipcodesadditional.StateID,
institutionswithzipcodesadditional.DistrictID
FROM contacts
INNER JOIN contacts_link_emails ON
contacts.ContactID = contacts_link_emails.ContactID
INNER JOIN institutionswithzipcodesadditional ON
contacts.InstitutionID = institutionswithzipcodesadditional.InstitutionID
WHERE
(contacts.ContactID IN
(SELECT contacts_2.ContactID
FROM contacts AS contacts_2
INNER JOIN contacts_link_emails AS contacts_link_emails_2 ON
contacts_2.ContactID = contacts_link_emails_2.ContactID
LEFT OUTER JOIN contacts_def_jobfunctions ON
contacts_2.JobTitle = contacts_def_jobfunctions.JobID
RIGHT OUTER JOIN newsletterremovelist ON
contacts_link_emails_2.Email = newsletterremovelist.EmailAddress
WHERE (contacts_def_jobfunctions.ParentJobID <> 1841)
GROUP BY contacts_2.ContactID
UNION
SELECT contacts_1.ContactID
FROM contacts_link_jobfunctions
INNER JOIN contacts_def_jobfunctions AS contacts_def_jobfunctions_1 ON
contacts_link_jobfunctions.JobID = contacts_def_jobfunctions_1.JobID
AND contacts_def_jobfunctions_1.ParentJobID <> 1841
INNER JOIN contacts AS contacts_1 ON
contacts_link_jobfunctions.ContactID = contacts_1.ContactID
INNER JOIN contacts_link_emails AS contacts_link_emails_1 ON
contacts_link_emails_1.ContactID = contacts_1.ContactID
LEFT OUTER JOIN newsletterremovelist AS newsletterremovelist_1 ON
contacts_link_emails_1.Email = newsletterremovelist_1.EmailAddress
GROUP BY contacts_1.ContactID))
While this query is now super fast (about 3 seconds), I've blown part of the logic somewhere - it only returns 14,863 rows (instead of the 18,215 rows that I believe is accurate).
The results seem near correct. I'm working to discover what data might be missing in the result set.
Can you please coach me through whatever I've done wrong here?
Thanks,
Russell Schutte
The main problem with your original query was that you had two extra joins just to introduce duplicates and then a DISTINCT to get rid of them.
Use this:
SELECT cle.Email,
c.ContactID,
c.First AS ContactFirstName,
c.Last AS ContactLastName,
c.InstitutionID,
izip.CountyID,
izip.StateID,
izip.DistrictID
FROM dbo.contacts c
INNER JOIN
dbo.institutionswithzipcodesadditional izip
ON izip.InstitutionID = c.InstitutionID
INNER JOIN
dbo.contacts_link_emails cle
ON cle.ContactID = c.ContactID
WHERE cle.Email NOT IN
(
SELECT EmailAddress
FROM dbo.newsletterremovelist
)
AND EXISTS
(
SELECT NULL
FROM dbo.contacts_def_jobfunctions cdj
WHERE cdj.JobId = c.JobTitle
AND cdj.ParentJobId <> '1841'
UNION ALL
SELECT NULL
FROM dbo.contacts_link_jobfunctions clj
JOIN dbo.contacts_def_jobfunctions cdj
ON cdj.JobID = clj.JobID
WHERE clj.ContactID = c.ContactID
AND cdj.ParentJobId <> '1841'
)
ORDER BY
email
Create the following indexes:
newsletterremovelist (EmailAddress)
contacts_link_jobfunctions (ContactID, JobID)
contacts_def_jobfunctions (JobID)
Do you get the same results when you do:
SELECT count(*)
FROM
dbo.contacts_def_jobfunctions AS contacts_def_jobfunctions_3
INNER JOIN
dbo.contacts
INNER JOIN
dbo.contacts_link_emails
ON dbo.contacts.ContactID = dbo.contacts_link_emails.ContactID
ON contacts_def_jobfunctions_3.JobID = dbo.contacts.JobTitle
SELECT COUNT(*)
FROM
contacts
INNER JOIN contacts_link_jobfunctions
ON contacts.ContactID = contacts_link_jobfunctions.ContactID
INNER JOIN contacts_link_emails
ON contacts.ContactID = contacts_link_emails.ContactID
If so keep adding each join conditon on until you don't get the same results and you will see where your mistake was. If all the joins are the same, then look at the where clauses. But I will be surprised if it isn't in the first join because the syntax you have orginally won't even work on SQL Server and it is pretty nonstandard SQL and may have been incorrect all along but no one knew.
Alternatively, pick a few of the records that are returned in the orginal but not the revised. Track them through the tables one at a time to see if you can find why the second query filters them out.
I'm not directly sure what is wrong, but when I run in to this situation, the first thing I do is start removing variables.
So, comment out the where clause. How many rows are returned?
If you get back the 11,604 rows then you've isolated the problems to the joins. Work though the joins, commenting each one out (remove the associated columns too) and figure out how many rows are eliminated.
As you do this, aim to find what is causing the desired rows to be eliminated. Once isolated, consider the join differences between the first query and the second query.
In looking at the first query, you could probably just modify that to eliminate any INs and instead do a EXISTS instead.
Consider your indexes as well. Any thing in the where or join clauses should probably be indexed.