Slow query with many joins - expanding the magic pill? - sql

I have a query that takes 20 minutes to run, even though I have an index for every column in the where clauses, and every column being joined:
SELECT DISTINCT skt.VCDRAWING_REG_NO, skb.NDRAWING_ORG_NO, skb.NDRAWING_ORG_REV_NO, skb.CAPPLY_START_DATE, skb.CAPPLY_END_DATE, skto.*
FROM SPM_ABS_TRANBASE skt
JOIN SPM_ABS_BASE skb
ON skt.NDRAWING_ORG_REV_NO = skb.NDRAWING_ORG_REV_NO
AND skt.NDRAWING_ORG_NO = skb.NDRAWING_ORG_NO
JOIN SPM_ABS_MODEL skm
ON skb.NDRAWING_ORG_REV_NO = skm.NDRAWING_ORG_REV_NO
AND skb.NDRAWING_ORG_NO = skm.NDRAWING_ORG_NO
JOIN SPM_ABS_TRANOPT skto
ON skt.NDRAWING_SYSTEM_NO = skto.NDRAWING_SYSTEM_NO
JOIN ModelImport mi
ON skm.CMODEL = mi.ModelCode
WHERE (skb.CAPPLY_START_DATE <= DATEADD(day, 2, GETDATE()) OR skb.CAPPLY_START_DATE IS NULL)
AND (skb.CAPPLY_END_DATE >= DATEADD(day, -2, GETDATE()) OR skb.CAPPLY_END_DATE IS NULL)
Here is my query plan.
One thing that puzzles me is this: If I add the following WHERE clause, the query returns in about 0.5 seconds:
AND mi.ModelCode = '3FBK5'
Now you're saying, well, duh, of course it gets much faster with that - the thing is, the ModelImport table contains only 351 records. Which means if I were to split up the query above into 351 queries, each with its own where clause for a distinct ModelCode - then I can get 100% of my query results in about 175 seconds, or 2.9 minutes. This is dramatically faster. Which tells me that something in the wide-open query is grossly inefficient, and the query plan is bad.
Here is my query plan with AND mi.ModelCode = '3FBK5' added.
After viewing my query plan, any ideas how I can speed this up?

Is it possible to eliminate some of the joins? You are not selecting anything from skm or mi, nor applying any WHERE conditions to those tables.
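Because skm and mi contribute no output columns, one way to act on this is to turn those two joins into an EXISTS test, so that they still filter rows but no longer multiply them before the DISTINCT. This is only a sketch built from the table and column names in the question; whether it actually improves the plan depends on indexes and row counts that are not shown.

```sql
-- Sketch: the joins to skm and mi rewritten as a semi-join (EXISTS).
-- The DISTINCT result should be unchanged, since no skm/mi columns are selected.
SELECT DISTINCT skt.VCDRAWING_REG_NO, skb.NDRAWING_ORG_NO, skb.NDRAWING_ORG_REV_NO,
       skb.CAPPLY_START_DATE, skb.CAPPLY_END_DATE, skto.*
FROM SPM_ABS_TRANBASE skt
JOIN SPM_ABS_BASE skb
  ON skt.NDRAWING_ORG_REV_NO = skb.NDRAWING_ORG_REV_NO
 AND skt.NDRAWING_ORG_NO = skb.NDRAWING_ORG_NO
JOIN SPM_ABS_TRANOPT skto
  ON skt.NDRAWING_SYSTEM_NO = skto.NDRAWING_SYSTEM_NO
WHERE (skb.CAPPLY_START_DATE <= DATEADD(day, 2, GETDATE()) OR skb.CAPPLY_START_DATE IS NULL)
  AND (skb.CAPPLY_END_DATE >= DATEADD(day, -2, GETDATE()) OR skb.CAPPLY_END_DATE IS NULL)
  AND EXISTS (
        SELECT 1
        FROM SPM_ABS_MODEL skm
        JOIN ModelImport mi ON skm.CMODEL = mi.ModelCode
        WHERE skm.NDRAWING_ORG_REV_NO = skb.NDRAWING_ORG_REV_NO
          AND skm.NDRAWING_ORG_NO = skb.NDRAWING_ORG_NO
      );
```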

Without having the table schema and sizes it's a little hard to give an exact answer, but here are some updates to try.
Use group by instead of distinct
Don't use * in the select results (particularly with distinct) instead give specific list of columns to return
Avoid "or" statements in where clause (maybe use ISNULL instead)
Here is what the query might look like with these updates (though there are probably some other columns from skto you would want to add):
SELECT skt.VCDRAWING_REG_NO,
skb.NDRAWING_ORG_NO,
skb.NDRAWING_ORG_REV_NO,
skb.CAPPLY_START_DATE,
skb.CAPPLY_END_DATE,
skto.NDRAWING_SYSTEM_NO
FROM SPM_ABS_TRANBASE skt
JOIN SPM_ABS_BASE skb ON
skt.NDRAWING_ORG_REV_NO = skb.NDRAWING_ORG_REV_NO
AND skt.NDRAWING_ORG_NO = skb.NDRAWING_ORG_NO
JOIN SPM_ABS_MODEL skm ON
skb.NDRAWING_ORG_REV_NO = skm.NDRAWING_ORG_REV_NO
AND skb.NDRAWING_ORG_NO = skm.NDRAWING_ORG_NO
JOIN SPM_ABS_TRANOPT skto ON
skt.NDRAWING_SYSTEM_NO = skto.NDRAWING_SYSTEM_NO
JOIN ModelImport mi ON
skm.CMODEL = mi.ModelCode
WHERE ISNULL(skb.CAPPLY_START_DATE, DATEADD(day, 2, GETDATE())) <= DATEADD(day, 2, GETDATE())
AND ISNULL(skb.CAPPLY_END_DATE,DATEADD(day, -2, GETDATE())) >= DATEADD(day, -2, GETDATE())
GROUP BY skt.VCDRAWING_REG_NO,
skb.NDRAWING_ORG_NO,
skb.NDRAWING_ORG_REV_NO,
skb.CAPPLY_START_DATE,
skb.CAPPLY_END_DATE,
skto.NDRAWING_SYSTEM_NO

Related

SQL Query to Find if (Count of date > X) = 0 For Group of ID

I apologize if the title is not correct; I'm not sure what I need to ask for, since I don't know how to build the query.
I have the following query built to return a list of chemicals and other related fields.
SELECT DISTINCT
RDB.Chemical_Record.[Chemical_ID],
RDB.Chemical_Record.[Expires_Date],
RDB.Assay_Group.[Assay_Group_Name] AS [Assay Group],
RDB.Chemical.[Chemical_Name],
RDB.Chemical.[Product_Number],
RDB.Chemical_Record.[Lot_Number],
RDB.Storage_Location.[Location_Name]
FROM RDB.Chemical_Record
LEFT JOIN RDB.Chemical ON Chemical_Record.[Chemical_ID] = Chemical.[ID_Chemical]
LEFT JOIN RDB.Storage_Location ON Storage_Location.[ID_Storage_Location] = Chemical_Record.[Storage_Location_ID]
LEFT JOIN RDB.Chemical_To_AGroup ON Chemical_To_AGroup.[Chemical_ID] = Chemical_Record.[Chemical_ID]
LEFT JOIN RDB.Assay_Group ON Assay_Group.[ID_Assay_Group] = Chemical_To_AGroup.[Assay_Group_ID]
WHERE RDB.Chemical_Record.[Expires_Date] >= DATEADD(day,-60, GETDATE())
ORDER BY RDB.Chemical_Record.[Chemical_ID], RDB.Chemical_Record.[Expires_Date], RDB.Assay_Group.[Assay_Group_Name]
I am using this query in a VB.Net application where it exports the results to an Excel worksheet and then performs additional actions to delete the rows I don't need. The process to query is quick, but working with Excel from .Net is painful and slow.
Instead I'd like to build the query to return the exact results I want, which I think is possible, I just can't figure out how. I have tried using a combination of Count, Group and Having, but since I've never worked with those I can't get them to work for me.
Example:
SELECT
COUNT(RDB.Chemical_Record.[Chemical_ID]) Count_ID,
RDB.Chemical_Record.[Chemical_ID],
RDB.Chemical_Record.[Expires_Date]
FROM RDB.Chemical_Record
WHERE RDB.Chemical_Record.[Expires_Date] > DATEADD(day,30,GETDATE())
GROUP BY RDB.Chemical_Record.[Chemical_ID], RDB.Chemical_Record.[Expires_Date]
ORDER BY RDB.Chemical_Record.[Chemical_ID]
As you can see from this example, it doesn't return the count of ID's where Expiration Date > DATEADD(day,30,GETDATE()) nor does it return the ID's that I actually wanted.
What I need to return is all chemicals (ID) that DO NOT have an expiration date > Today + 30 for that specific ID. The screenshot below shows an example of the data that gets pulled. The yellow highlighted rows are the only two in that set that should get returned as there are no other chemicals of those two ID's with an expiration date > Today + 30. All the other ID's should not show up since they DO have ID's of COUNT(Expiration Date > Today + 30) > 0.
If someone could help me build the query using the appropriate Aggregate functions, it would be MUCH appreciated.
What I need to return is all chemicals (ID) that DO NOT have an expiration date > Today + 30 for that specific ID.
For this question, you can use a HAVING clause. No WHERE is needed:
SELECT COUNT(*) as Count_ID, cr.[Chemical_ID]
FROM RDB.Chemical_Record cr
GROUP BY cr.[Chemical_ID]
HAVING MAX(cr.Expires_Date) <= DATEADD(day, 30, GETDATE())
ORDER BY cr.[Chemical_ID]
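To see why HAVING MAX(...) does the job, here is a small self-contained sketch using a table variable with made-up rows (the data is hypothetical, only the column names come from the question). An ID survives only if its latest expiration date is within 30 days, which is the same as saying it has no row expiring later than that.

```sql
-- Hypothetical sample data: ID 1 has one lot expiring in 90 days, ID 2 has none beyond 30 days.
DECLARE @Chemical_Record TABLE (Chemical_ID int, Expires_Date date);

INSERT INTO @Chemical_Record (Chemical_ID, Expires_Date) VALUES
    (1, DATEADD(day, 10, GETDATE())),
    (1, DATEADD(day, 90, GETDATE())),
    (2, DATEADD(day, -5, GETDATE())),
    (2, DATEADD(day, 20, GETDATE()));

SELECT COUNT(*) AS Count_ID, cr.Chemical_ID
FROM @Chemical_Record cr
GROUP BY cr.Chemical_ID
HAVING MAX(cr.Expires_Date) <= DATEADD(day, 30, GETDATE())
ORDER BY cr.Chemical_ID;
-- Returns one row: Count_ID = 2, Chemical_ID = 2
```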
Using the HAVING MAX solved my problem and I was then able to work out exactly what I needed. I had to do some more research to figure out how to bring all my columns back, but that wasn't as difficult.
Here is my final solution:
WITH CHEM AS (
SELECT RDB.Chemical_Record.[Chemical_ID]
FROM RDB.Chemical_Record
GROUP BY RDB.Chemical_Record.[Chemical_ID]
HAVING MAX(RDB.Chemical_Record.Expires_Date) <= DATEADD(day, 60, GETDATE())
)
SELECT DISTINCT
RDB.Chemical_Record.[Chemical_ID],
RDB.Chemical_Record.[Expires_Date],
RDB.Assay_Group.[Assay_Group_Name] AS [Assay Group],
RDB.Chemical.[Chemical_Name],
RDB.Chemical.[Product_Number],
RDB.Chemical_Record.[Lot_Number],
RDB.Storage_Location.[Location_Name]
FROM RDB.Chemical_Record
INNER JOIN CHEM ON CHEM.Chemical_ID = RDB.Chemical_Record.Chemical_ID
LEFT JOIN RDB.Chemical ON Chemical_Record.[Chemical_ID] = Chemical.[ID_Chemical]
LEFT JOIN RDB.Storage_Location ON Storage_Location.[ID_Storage_Location] = Chemical_Record.[Storage_Location_ID]
LEFT JOIN RDB.Chemical_To_AGroup ON Chemical_To_AGroup.[Chemical_ID] = Chemical_Record.[Chemical_ID]
LEFT JOIN RDB.Assay_Group ON Assay_Group.[ID_Assay_Group] = Chemical_To_AGroup.[Assay_Group_ID]
WHERE Expires_Date >= DATEADD(day, -60, GETDATE())
ORDER BY RDB.Chemical_Record.[Chemical_ID], RDB.Chemical_Record.Expires_Date
And a screenshot showing the resulting search:

Improve Query Performance, Adding where clause grinds query to a halt

The following SQL runs in around 0.338s.
Adding one more condition to the WHERE clause makes the query time out. All I want is a list of test results for a particular Test_Code.
Result_Set will have many Test_Results on the index Result_Set_Row_ID
Date_Received_Index will have many Result_Sets on the index Result_Set_Row_ID
I have tried altering the order of JOINS, adding clauses to the join statements.
SELECT
Date_Received_Index.Registration_Number,
Date_Received_Index.Specimen_Number,
Result,
Result_Comment,
Result_Comment_Exp ,
Result_Exp,
Short_Exp,
Test_Code,
Test_Exp,
Test_Row_ID,
Units,
Result_Set.Set_Code ,
Result_Set.Date_Time_Authorised,
Result_Set.Date_Booked_In ,
Date_Received_Index.Discipline,
Date_Received_Index.Namespace
FROM
Result_Set
INNER JOIN Test_Result ON Result_Set.Result_Set_Row_ID = Test_Result.Result_Set_Row_ID
INNER JOIN Date_Received_Index ON (Date_Received_Index.Request_Row_ID = Result_Set.Request_Row_ID)
WHERE
DATEDIFF('D', Date_Received_Index.Date_Received, current_timestamp) < 1 AND
Date_Received_Index.Namespace = 'CHM'
adding a WHERE clause e.g.
DATEDIFF('D', Date_Received_Index.Date_Received, current_timestamp) < 1 AND
Date_Received_Index.Namespace = 'CHM'
AND Test_Code = 'K'
results in the query timing out
I would like to be able to construct an SQL statement that is performant and just selects the test_code specified in the where clause.
This comes down to the Query Plan. Can you share the query plan?
My suspicion is that the column Test_Code is not indexed and the addition to the WHERE clause is causing the optimizer to select the wrong query plan.
I think the SQL optimizer is not able to optimize the portion
DATEDIFF('D', Date_Received_Index.Date_Received, current_timestamp) < 1
Without knowing your schema, my question would be: of the three columns used in the WHERE clause (Date_Received_Index.Date_Received, Date_Received_Index.Namespace, and Test_Code), are any indexed? You already indicated Test_Code is not.
Depending on what version of Caché you are using, you might try:
SELECT
Date_Received_Index.Registration_Number,
Date_Received_Index.Specimen_Number,
Result,
Result_Comment,
Result_Comment_Exp ,
Result_Exp,
Short_Exp,
Test_Code,
Test_Exp,
Test_Row_ID,
Units,
Result_Set.Set_Code ,
Result_Set.Date_Time_Authorised,
Result_Set.Date_Booked_In ,
Date_Received_Index.Discipline,
Date_Received_Index.Namespace
FROM %PARALLEL
Result_Set
INNER JOIN Test_Result ON Result_Set.Result_Set_Row_ID = Test_Result.Result_Set_Row_ID
INNER JOIN Date_Received_Index ON (Date_Received_Index.Request_Row_ID = Result_Set.Request_Row_ID)
WHERE
DATEDIFF('D', Date_Received_Index.Date_Received, current_timestamp) < 1
AND Date_Received_Index.Namespace = 'CHM'
AND Test_Code = 'K'
Use of %PARALLEL can cause the query to be run using multiple threads. If the server has a large number of CPUs it may run faster even if it's not optimized.
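Another thing to try for the DATEDIFF concern raised above is to compare the column directly against a date, so that an index on Date_Received (if one exists) can be used, and to index Test_Code. This is a sketch under several assumptions: that Date_Received holds a date rather than a timestamp, that Test_Code is a column of Test_Result, that CURRENT_DATE behaves this way in your Caché version, and that you are allowed to add an index. The index name is made up, and the column list is shortened.

```sql
-- Hypothetical index name; only useful if Test_Code is not already indexed.
CREATE INDEX Test_Code_Idx ON Test_Result (Test_Code);

-- DATEDIFF('D', Date_Received, current_timestamp) < 1 keeps rows received today or later.
-- For a date column (assumption) that is the same as Date_Received >= CURRENT_DATE,
-- and the column is no longer wrapped in a function.
SELECT
    Date_Received_Index.Registration_Number,
    Date_Received_Index.Specimen_Number,
    Test_Result.Test_Code,
    Result_Set.Set_Code
FROM Result_Set
INNER JOIN Test_Result ON Result_Set.Result_Set_Row_ID = Test_Result.Result_Set_Row_ID
INNER JOIN Date_Received_Index ON Date_Received_Index.Request_Row_ID = Result_Set.Request_Row_ID
WHERE Date_Received_Index.Date_Received >= CURRENT_DATE
  AND Date_Received_Index.Namespace = 'CHM'
  AND Test_Result.Test_Code = 'K'
```

Comparing the query plans before and after is the only reliable way to tell whether either change helps.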

Sybase DB - How to store a SELECT query result in a variable?

I need to store the result of the SELECT query below in a variable to cut down on computation time. The results are of the form 'X', 'Y', 'Z', ...
WHERE
PEL.kuerzel in (SELECT KL.kuerzel from ictq.KLE KL WHERE FachgruppeKuerzel=526)
Right now the SELECT query gets executed 3 times for each of ~2000 entries. If I was able to store the result locally, I would have to run it only once.
I'm working on a Sybase 11 database. How can I achieve this or anything similar?
The subquery the snippet was taken from, out of a 150-line query altogether:
SELECT list(PEL.kuerzel) from ictq.PEpisode PE
INNER JOIN ictq.PEpisodeLeistung PEL ON(PE.IDPATIENTEPISODE = PEL.IDPATIENTEPISODE)
WHERE
PE.IDPATIENTKLINIK = P.IDPATIENTKLINIK and
PEL.Datum between dateadd(month, -12, @startdatum) and @startdatum and
PEL.kuerzel in (SELECT kl.kuerzel from ictq.KLE kl where FachgruppeKuerzel=526)
I have no control over the structure and cannot add anything. The query in itself is legacy work and I'm happy it works as it is now. The slow computation, however, needs an overhaul.
It is often more efficient to use exists rather than in, or to move the in subquery to the from clause:
FROM PEL JOIN
(SELECT DISTINCT KL.kuerzel
FROM ictq.KLE KL
WHERE FachgruppeKuerzel = 526
) KL
ON PEL.kuerzel = KL.kuerzel;
For performance for this query, you want an index on ictq.KLE(FachgruppeKuerzel, kuerzel) and PEL(kuerzel).
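The answer mentions EXISTS but only shows the join form. A sketch of the EXISTS variant follows, reduced to the two tables involved in the IN test. Table and column names are taken from the question, it assumes FachgruppeKuerzel is a column of ictq.KLE, and the rest of the 150-line query is left out.

```sql
-- Sketch: the IN (subquery) test rewritten as a correlated EXISTS.
SELECT PEL.kuerzel
FROM ictq.PEpisodeLeistung PEL
WHERE EXISTS (
        SELECT 1
        FROM ictq.KLE KL
        WHERE KL.FachgruppeKuerzel = 526
          AND KL.kuerzel = PEL.kuerzel
      );
```

In the original subquery the EXISTS predicate would simply take the place of the PEL.kuerzel in (...) line.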

Is there any performance or functional difference between these two SQL statements?

I'm maintaining someone else's SQL at the moment, and I came across this in a Stored Procedure:
SELECT
Location.ID,
Location.Location,
COUNT(Person.ID) As CountAdultMales
FROM
Transactions INNER JOIN
Location ON Transactions.FKLocationID = Location.ID INNER JOIN
Person ON Transactions.FKPersonID = Person.ID
AND DATEDIFF(YEAR, Person.DateOfBirth, GETDATE()) >= 18 AND Person.Gender = 1
WHERE
((Transactions.Deleted = 0) AND
(Person.Deleted = 0) AND
(Location.Deleted = 0))
GROUP BY
Location.ID,
Location.Location
Is there any difference between the above and this (which is how I would write it)
SELECT
Location.ID,
Location.Location,
COUNT(Person.ID) As CountAdultMales
FROM
Transactions INNER JOIN
Location ON Transactions.FKLocationID = Location.ID INNER JOIN
Person ON Transactions.FKPersonID = Person.ID
WHERE
((Transactions.Deleted = 0) AND
(Person.Deleted = 0) AND
(Location.Deleted = 0) AND
(DATEDIFF(YEAR, Person.DateOfBirth, GETDATE()) >= 18) AND
(Person.Gender = 1))
GROUP BY
Location.ID,
Location.Location
Personally, I find putting the conditions in the WHERE clause most readable, but I wondered if there were performance or other reasons to "conditionalise" (if there is such a word) the JOIN
Thanks
With an inner join this won't really make much of a difference, as SQL has a query optimiser which will do its best to execute the query in the most efficient way (though it is not perfect).
If this were an outer join it could make a difference, though, so it's something to be aware of.
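To illustrate the outer-join caveat, here is a small self-contained sketch with table variables and made-up rows (the tables are simplified stand-ins, not the real schema). With a LEFT JOIN, a condition in the ON clause only decides which Person rows match, while the same condition in the WHERE clause removes rows after the join.

```sql
-- Hypothetical, simplified data: person 10 has Gender = 1, person 20 has Gender = 2.
DECLARE @Transactions TABLE (ID int, FKPersonID int);
DECLARE @Person TABLE (ID int, Gender int);

INSERT INTO @Transactions (ID, FKPersonID) VALUES (1, 10), (2, 20);
INSERT INTO @Person (ID, Gender) VALUES (10, 1), (20, 2);

-- Condition in ON: both transactions are kept; the non-matching one gets NULL person columns.
SELECT t.ID, p.ID AS PersonID
FROM @Transactions t
LEFT JOIN @Person p ON t.FKPersonID = p.ID AND p.Gender = 1
ORDER BY t.ID;
-- Returns 2 rows: (1, 10) and (2, NULL)

-- Condition in WHERE: the NULL-extended row fails the test, so the join behaves like an inner join.
SELECT t.ID, p.ID AS PersonID
FROM @Transactions t
LEFT JOIN @Person p ON t.FKPersonID = p.ID
WHERE p.Gender = 1
ORDER BY t.ID;
-- Returns 1 row: (1, 10)
```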
Both queries perform the same. There is no difference for an inner join.
Performance is the same in your examples, however you can tune it this way:
SELECT
Location.ID,
Location.Location,
COUNT(Person.ID) As CountAdultMales
FROM
Transactions
INNER JOIN Location
ON Transactions.FKLocationID = Location.ID
INNER JOIN Person
ON Transactions.FKPersonID = Person.ID
WHERE
Transactions.Deleted = 0 AND
Person.Deleted = 0 AND
Location.Deleted = 0 AND
Person.DateOfBirth < dateadd(year, datediff(year, 0, getdate())-17, 0) AND
Person.Gender = 1
GROUP BY
Location.ID,
Location.Location
This way you are not performing a calculation on the column for every row to get the year. Instead you simply compare the column with a value that is computed once, which is much faster.
This query selects people who turn 18 (or are already older) before the current year runs out.
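Note that both the original DATEDIFF(YEAR, ...) test and the rewrite above work on calendar years. If the intent were an exact 18th birthday instead (an assumption, the question does not say so), the same "compare the bare column to a value computed once" idea could be sketched like this, using a table variable with made-up rows:

```sql
-- Hypothetical sample data: one person turning 18 today, one turning 18 tomorrow.
DECLARE @Person TABLE (ID int, DateOfBirth date, Gender int);

INSERT INTO @Person (ID, DateOfBirth, Gender) VALUES
    (1, DATEADD(year, -18, CAST(GETDATE() AS date)), 1),
    (2, DATEADD(day, 1, DATEADD(year, -18, CAST(GETDATE() AS date))), 1);

-- The cutoff is computed once; DateOfBirth itself is not wrapped in a function.
SELECT ID
FROM @Person
WHERE DateOfBirth <= DATEADD(year, -18, CAST(GETDATE() AS date))
  AND Gender = 1;
-- Returns ID 1 only
```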
Don't worry about the execution plan here; as other SO members have already stated, both queries should yield an identical plan. So much for the question itself.
When it is too early to optimize, or there is little to no optimization possible, worry about code readability and maintainability instead.
Should that be really a join criteria or a filter? From just looking at the code itself I think it's part of the filter.
It depends which DB you are working with. In general the DB internal optimizer will sort this out.
Please have a look here: INNER JOIN ON vs WHERE clause

SQL Query optimization

I have some questions about my query. I call this stored procedure on my first page, so it is important to me that it is optimized well enough.
I do a select with some basic WHERE expressions, then filter the results with some expressions I pass to the stored procedure.
It also matters to me that it selects the top n rows, since it is going to search through millions of items (I only have hundreds of items so far), and then does some paging for my website.
Select top (@NumberOfRows)
...
from(
SELECT
row_number() OVER (ORDER BY tblEventOpen.TicketAt, tblEvent.EventName, tblEventDetail.TimeStart) as RowNumber
, ...
FROM --[...some inner join logic...]
WHERE
(tblEventOpen.isValid = 1) AND (tblEvent.isValid = 1) and
(tblCondition_ResellerDetail.ResellerID = 1) AND
(tblEventOpen.TicketAt >= GETDATE()) AND
(GETDATE() BETWEEN
DATEADD(minute, (tblEventDetail.TimeStart - 60 * tblCondition_ResellerDetail.StartTime) , tblEventOpen.TicketAt)
AND DATEADD(minute, (tblEventDetail.TimeStart - 60 * tblCondition_ResellerDetail.EndTime) , tblEventOpen.TicketAt))
) as t1
where RowNumber >= (@PageNumber -1) * @NumberOfRows and
(@city='' or @city is null or city like @city) and
(@At is null or @At=At) and
(@TimeStartInMinute=-1 or @TimeStartInMinute=TimeStartInMinute) and
(@EventName='' or EventName like @EventName) and
(@CategoryID=-1 or @CategoryID = CategoryID) and
(@EventID is null or @EventID = EventID) and
(@DetailID is null or @DetailID = DetailID)
ORDER BY RowNumber
I'm worried about this part:
(GETDATE() BETWEEN
DATEADD(minute, (tblEventDetail.TimeStart - 60 * tblCondition_ResellerDetail.StartTime) , tblEventOpen.TicketAt)
AND DATEADD(minute, (tblEventDetail.TimeStart - 60 * tblCondition_ResellerDetail.EndTime) , tblEventOpen.TicketAt))
How does table t1 execute? I mean, when I put some WHERE expressions after t1 (line 17 and further), are the items filtered after t1 has executed? For example, if I filter the result down to 10 row numbers, does the inner (...) as t1 select return only 10 items, or does it select all items and then my outer select takes 10 of them?
I want to filter my result by some optional parameters, so I put something like @DetailID is null or @DetailID = DetailID. Is that a good way?
Is there anything else I should consider to make it faster (more optimized)?
My comment on your query:
You're correct, you should worry about the condition "GETDATE() BETWEEN ...". Comparing a value with a function that involves more than one table will most likely scan the entire search space. Simplify your condition or, if possible, add a computed column for that function.
Put all conditions except "RowNumber >= ..." in the inner query.
It's okay to write optional conditions the way you do. I do it too :-)
Make sure each column used in the WHERE clause is the first column of at least one index, followed by the primary key. It would be better if your primary key is clustered.
Well, these are based on my own experience. They may or may not be applicable to your situation.
[UPDATE] Here's the complete query
Select top (@NumberOfRows)
...
from(
SELECT
row_number() OVER (ORDER BY tblEventOpen.TicketAt, tblEvent.EventName, tblEventDetail.TimeStart) as RowNumber
, ...
FROM --[...some inner join logic...]
WHERE
(tblEventOpen.isValid = 1) AND (tblEvent.isValid = 1) and
(tblCondition_ResellerDetail.ResellerID = 1) AND
(tblEventOpen.TicketAt >= GETDATE()) AND
(GETDATE() BETWEEN
DATEADD(minute, (tblEventDetail.TimeStart - 60 * tblCondition_ResellerDetail.StartTime) , tblEventOpen.TicketAt)
AND DATEADD(minute, (tblEventDetail.TimeStart - 60 * tblCondition_ResellerDetail.EndTime) , tblEventOpen.TicketAt)) and
(@city='' or @city is null or city like @city) and
(@At is null or @At=At) and
(@TimeStartInMinute=-1 or @TimeStartInMinute=TimeStartInMinute) and
(@EventName='' or EventName like @EventName) and
(@CategoryID=-1 or @CategoryID = CategoryID) and
(@EventID is null or @EventID = EventID) and
(@DetailID is null or @DetailID = DetailID)
) as t1
where RowNumber >= (@PageNumber -1) * @NumberOfRows
ORDER BY RowNumber
ORDER BY RowNumber
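One detail worth checking in the paging arithmetic: ROW_NUMBER() starts at 1, so with 10 rows per page the filter RowNumber >= (2 - 1) * 10 makes page 2 start at row 10, which page 1 already returned. A strict > avoids the overlap. The sketch below is self-contained and not taken from the question: it numbers 25 rows from sys.all_objects purely as stand-in data and uses T-SQL @ variables with hypothetical values.

```sql
-- Hypothetical parameter values for illustration.
DECLARE @PageNumber int = 2, @NumberOfRows int = 10;

WITH Items AS (
    -- Stand-in data: 25 rows numbered 1..25.
    SELECT TOP (25) ROW_NUMBER() OVER (ORDER BY object_id) AS RowNumber
    FROM sys.all_objects
    ORDER BY object_id
)
SELECT TOP (@NumberOfRows) RowNumber
FROM Items
WHERE RowNumber > (@PageNumber - 1) * @NumberOfRows   -- strict >, so page 2 starts at row 11
ORDER BY RowNumber;
-- Returns RowNumber 11 through 20
```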
Whilst you can seek advice on your query, it is better to learn how to optimise it yourself.
You need to view the execution plan, identify the bottlenecks and then see if there is anything that can be done to make an improvement.
In SSMS you can click "Query" ---> "Include Actual Execution Plan" before you run your query. Ctrl+M is the keyboard shortcut.
Then execute your query. SSMS will create a new tab in the results pane, which shows how the SQL engine executes your query; you can hover over each node for more information. The cost % will be particularly interesting, as it lets you see the most expensive part of your query.
It's difficult to advise you any further without that execution plan, which is why a number of people commented on your question. Your schema and indexes change how the query is executed, so it's not something that someone can accurately replicate in their own environment without scripts for the tables, indexes and so on. Even then, statistics could be out of date and other problems could arise.
You can also execute SET STATISTICS PROFILE ON to get a textual view of the plan (which may be useful when seeking help).
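A minimal sketch of the textual plan option: the SELECT below is just a placeholder against a system view, so substitute your own query. The setting stays on for the session until it is switched off again.

```sql
SET STATISTICS PROFILE ON;

-- Placeholder query; replace it with the query you want to analyse.
SELECT TOP (5) name, object_id
FROM sys.objects
ORDER BY name;

SET STATISTICS PROFILE OFF;
```

Each statement then returns an extra result set describing the plan operators, which can be pasted into a question as text.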
There are a number of articles that can help you fix the bottlenecks, or post another question for more advice.
http://msdn.microsoft.com/en-us/library/ms178071.aspx
SQL Server Query Plan Analysis
Execution Plan Basics