Can this SQL Query be optimized to run faster? - sql

I have an SQL Query (For SQL Server 2008 R2) that takes a very long time to complete. I was wondering if there was a better way of doing it?
SELECT #count = COUNT(Name)
FROM Table1 t
WHERE t.Name = #name AND t.Code NOT IN (SELECT Code FROM ExcludedCodes)
Table1 has around 90Million rows in it and is indexed by Name and Code.
ExcludedCodes only has around 30 rows in it.
This query is in a stored procedure and gets called around 40k times, the total time it takes the procedure to finish is 27 minutes.. I believe this is my biggest bottleneck because of the massive amount of rows it queries against and the number of times it does it.
So if you know of a good way to optimize this it would be greatly appreciated! If it cannot be optimized then I guess im stuck with 27 min...
EDIT
I changed the NOT IN to NOT EXISTS and it cut the time down to 10:59, so that alone is a massive gain on my part. I am still going to attempt to do the group by statement as suggested below but that will require a complete rewrite of the stored procedure and might take some time... (as I said before, im not the best at SQL but it is starting to grow on me. ^^)

In addition to workarounds to get the query itself to respond faster, have you considered maintaining a column in the table that tells whether it is in this set or not? It requires a lot of maintenance but if the ExcludedCodes table does not change often, it might be better to do that maintenance. For example you could add a BIT column:
ALTER TABLE dbo.Table1 ADD IsExcluded BIT;
Make it NOT NULL and default to 0. Then you could create a filtered index:
CREATE INDEX n ON dbo.Table1(name)
WHERE IsExcluded = 0;
Now you just have to update the table once:
UPDATE t
SET IsExcluded = 1
FROM dbo.Table1 AS t
INNER JOIN dbo.ExcludedCodes AS x
ON t.Code = x.Code;
And ongoing you'd have to maintain this with triggers on both tables. With this in place, your query becomes:
SELECT #Count = COUNT(Name)
FROM dbo.Table1 WHERE IsExcluded = 0;
EDIT
As for "NOT IN being slower than LEFT JOIN" here is a simple test I performed on only a few thousand rows:
EDIT 2
I'm not sure why this query wouldn't do what you're after, and be far more efficient than your 40K loop:
SELECT src.Name, COUNT(src.*)
FROM dbo.Table1 AS src
INNER JOIN #temptable AS t
ON src.Name = t.Name
WHERE src.Code NOT IN (SELECT Code FROM dbo.ExcludedCodes)
GROUP BY src.Name;
Or the LEFT JOIN equivalent:
SELECT src.Name, COUNT(src.*)
FROM dbo.Table1 AS src
INNER JOIN #temptable AS t
ON src.Name = t.Name
LEFT OUTER JOIN dbo.ExcludedCodes AS x
ON src.Code = x.Code
WHERE x.Code IS NULL
GROUP BY src.Name;
I would put money on either of those queries taking less than 27 minutes. I would even suggest that running both queries sequentially will be far faster than your one query that takes 27 minutes.
Finally, you might consider an indexed view. I don't know your table structure and whether your violate any of the restrictions but it is worth investigating IMHO.

You say this gets called around 40K times. WHy? Is it in a cursor? If so do you really need a cursor. Couldn't you put the values you want for #name in a temp table and index it and then join to it?
select t.name, count(t.name)
from table t
join #name n on t.name = n.name
where NOT EXISTS (SELECT Code FROM ExcludedCodes WHERE Code = t.code)
group by t.name
That might get you all your results in one query and is almost certainly faster than 40K separate queries. Of course if you need the count of all the names, it's even simpleer
select t.name, count(t.name)
from table t
NOT EXISTS (SELECT Code FROM ExcludedCodes WHERE Code = t
group by t.name

NOT EXISTS typically performs better than NOT IN, but you should test it on your system.
SELECT #count = COUNT(Name)
FROM Table1 t
WHERE t.Name = #name AND NOT EXISTS (SELECT 1 FROM ExcludedCodes e WHERE e.Code = t.Code)
Without knowing more about your query it's tough to supply concrete optimization suggestions (i.e. code suitable for copy/paste). Does it really need to run 40,000 times? Sounds like your stored procedure needs reworking, if that's feasible. You could exec the above once at the start of the proc and insert the results in a temp table, which can keep the indexes from Table1, and then join on that instead of running this query.
This particular bit might not even be the bottleneck that makes your query run 27 minutes. For example, are you using a cursor over those 90 million rows, or scalar valued UDFs in your WHERE clauses?

Have you thought about doing the query once and populating the data in a table variable or temp table? Something like
insert into #temp (name, Namecount)
values Name, Count(name)
from table1
where name not in(select code from excludedcodes)
group by name
And don't forget that you could possibly use a filtered index as long as the excluded codes table is somewhat static.

Start evaluating the execution plan. Which is the heaviest part to compute?
Regarding the relation between the two tables, use a JOIN on indexed columns: indexes will optimize query execution.

Related

SQL best practice/performance when inserting into a table. To use a temp table or not

I have a select query that's come about from me trying to remove while loops from an existing query that was far too slow. As it stands I first select into a temp table.
Then from that temp table I insert into the final table using the values from the temp table.
Below is a simplified example of the flow of my query
select
b.BookId,
b.BookDescription,
a.Name,
a.BirthDate,
a.CountryOfOrigin,
into #tempTable
from library.Book b
left join authors.Authors a
on a.AuthorId = b.AuthorId
insert into bookStore.BookStore
([BookStoreEntryId]
[BookId],
[BookDescription],
[Author],
[AuthorBirthdate],
[AuthorCountryOfOrigin])
select
NEWID(),
t.BookId,
t.BookDescription,
t.Name,
t.Birthdate,
t.CountryOfOrigin
from #tempTable t
drop table #tempTable
Would it be better to move the select statement at the start, to below so that its incorporated into the insert statement, removing the need for the temp table?
There is no advantage at all to having a temporary table in this case. Just use the select query directly.
Sometimes, temporary tables can improve performance. One method is that a real table has real statistics (notably the number of rows). The optimizer can use that information for better execution plans.
Temporary tables can also improve performance if they explicit have an index on them.
However, they incur overhead of writing the table.
In this case, you just get all the overhead and there should be no benefit.
Actually, I could imagine one benefit under one circumstance. If the query took a long time to run -- say because the join required a nested loops join with no indexes -- then the destination table would be saved from locking and contention until all the rows are available for insert. That would be an unusual case, though.
Do in 1 step
insert into bookStore.BookStore
( /* [BookStoreEntryId] <-- assuming this is auto id*/
[BookId],
[BookDescription],
[Author],
[AuthorBirthdate],
[AuthorCountryOfOrigin])
SELECT distinct
b.BookId,
b.BookDescription,
a.Name,
a.BirthDate,
a.CountryOfOrigin,
from library.Book b
left join authors.Authors a
on a.AuthorId = b.AuthorId
your performance will depend on number of indexes in the target table. More indexes - slower insert. May be worth to disable them during insert and then rebuild them after the insert is completed

Why does IF (query) take over an hour, when query takes less than a second?

I have the following SQL:
IF EXISTS
(
SELECT
1
FROM
SomeTable T1
WHERE
SomeField = 1
AND SomeOtherField = 1
AND NOT EXISTS(SELECT 1 FROM SomeOtherTable T2 WHERE T2.KeyField = T1.KeyField)
)
RAISERROR ('Blech.', 16, 1)
The SomeTable table has around 200,000 rows, and the SomeOtherTable table has about the same.
If I execute the inner SQL (the SELECT), it executes in sub-second time, returning no rows. But, if I execute the entire script (IF...RAISERROR) then it takes well over an hour. Why?
Now, obviously, the execution plan is different - I can see that in Enterprise Manager - but again, why?
I could probably do something like SELECT #num = COUNT(*) WHERE ... and then IF #num > 0 RAISERROR but... I think that's missing the point somewhat. You can only code around a bug (and it sure looks like a bug to me) if you know that it exists.
EDIT:
I should mention that I already tried re-jigging the query into an OUTER JOIN as per #Bohemian's answer, but this made no difference to the execution time.
EDIT 2:
I've attached the query plan for the inner SELECT statement:
... and the query plan for the whole IF...RAISERROR block:
Obviously these show the real table/field names, but apart from that the query is exactly as shown above.
The IF does not magically turn off optimizations or damage the plan. The optimizer just noticed that EXISTS only needs one row at most (like a TOP 1). This is called a "row goal" and it normally happens when you do paging. But also with EXISTS, IN, NOT IN and such things.
My guess: if you write TOP 1 to the original query you get the same (bad) plan.
The optimizer tries to be smart here and only produce the first row using much cheaper operations. Unfortunately, it misestimates cardinality. It guesses that the query will produce lots of rows although in reality it produces none. If it estimated correctly you'd just get a more efficient plan, or it would not do the transformation at all.
I suggest the following steps:
fix the plan by reviewing indexes and statistics
if this didn't help, change the query to IF (SELECT COUNT(*) FROM ...) > 0 which will give the original plan because the optimizer does not have a row goal.
It's probably because the optimizer can figure out how to turn your query into a more efficient query, but somehow the IF prevents that. Only an EXPLAIN will tell you why the query is taking so long, but I can tell you how to make this whole thing more efficient... Indtead of using a correlated subquery, which is incredibly inefficient - you get "n" subqueries run for "n" rows in the main table - use a JOIN.
Try this:
IF EXISTS (
SELECT 1
FROM SomeTable T1
LEFT JOIN SomeOtherTable T2 ON T2.KeyField = T1.KeyField
WHERE SomeField = 1
AND SomeOtherField = 1
AND T2.KeyField IS NULL
) RAISERROR ('Blech.', 16, 1)
The "trick" here is to use s LEFT JOIN and filter out all joined rows by testing for a null in the WHERE clause, which is executed after the join is made.
Please try SELECT TOP 1 KeyField. Using primary key will work faster in my guess.
NOTE: I posted this as answer as I couldn't comment.

SubQuery vs TempTable before Merge

I have a complex query that I want to use as the Source of a Merge into a table. This will be executed over millions of rows. Currently I am trying to apply constraints to the data by inserting it into a temp table before the merge.
The operations are:
Filter out duplicate data.
Join some tables to pull in additional data
Insert into the temp table.
Here is the query.
-- Get all Orders that aren't in the system
WITH Orders AS
(
SELECT *
FROM [Staging].Orders o
WHERE NOT EXISTS
(
SELECT 1
FROM Maps.VendorBOrders vbo
JOIN OrderFact of
ON of.Id = vbo.OrderFactId
AND InternalOrderId = o.InternalOrderId
AND of.DataSetId = o.DataSetId
AND of.IsDelete = 0
)
)
INSERT INTO #VendorBOrders
(
CustomerId
,OrderId
,OrderTypeId
,TypeCode
,LineNumber
,FromDate
,ThruDate
,LineFromDate
,LineThruDate
,PlaceOfService
,RevenueCode
,BillingProviderId
,Cost
,AdjustmentTypeCode
,PaymentDenialCode
,EffectiveDate
,IDRLoadDate
,RelatedOrderId
,DataSetId
)
SELECT
vc.CustomerId
,OrderId
,OrderTypeId
,TypeCode
,LineNumber
,FromDate
,ThruDate
,LineFromDate
,LineThruDate
,PlaceOfService
,RevenueCode
,bp.Id
,Cost
,AdjustmentTypeCode
,PaymentDenialCode
,EffectiveDate
,IDRLoadDate
,ro.Id
,o.DataSetId
FROM
Orders o
-- Join related orders to match orders sharing same instance
JOIN Maps.VendorBRelatedOrder ro
ON ro.OrderControlNumber = o.OrderControlNumber
AND ro.EquitableCustomerId = o.EquitableCustomerId
AND ro.DataSetId = o.DataSetId
JOIN BillingProvider bp
ON bp.ProviderNPI = o.ProviderNPI
-- Join on customers and fail if the customer doesn't exist
LEFT OUTER JOIN [Maps].VendorBCustomer vc
ON vc.ExtenalCustomerId = o.ExtenalCustomerId
AND vc.VendorId = o.VendorId;
I am wondering if there is anything I can do to optimize it for time. I have tried using the DB Engine Tuner, but this query takes 100x more CPU Time than the other queries I am running. Is there anything else that I can look into or can the query not be improved further?
CTE is just syntax
That CTE is evaluated (run) on that join
First just run it as a select statement (no insert)
If the select is slow then:
Move that CTE to a #TEMP so it is evaluated once and materialized
Put an index (PK if applicable) on the three join columns
If the select is not slow then it is insert time on #VendorBOrders
Fist only create PK and sort the insert on the PK so as not to fragment that clustered index
Then AFTER the insert is complete build any other necessary indexes
Generally when I do speed testing I perform checks on the parts of SQL to see where the problem lies. Turn on the 'Execution plan' and see where a lot of the time is going. Also if you want to just do the quick and dirty highlight your CTE and run just that. Is that fast, yes, move on.
I have at times found a single index being off throws off a whole complex logic of joins by merely having the database do one part of something large and then finding that piece.
Another idea is that if you have a fast tempdb on a production environment or the like, dump your CTE to a temp table as well. Index on that and see if that speeds things up. Sometimes CTE's, table variables, and temp tables lose some performance at joins. I have found that creating an index on a partial object will improve performance at times but you are also putting more load on the tempdb to do this, so keep that in mind.

SQL: Optimization problem, has rows?

I got a query with five joins on some rather large tables (largest table is 10 mil. records), and I want to know if rows exists. So far I've done this to check if rows exists:
SELECT TOP 1 tbl.Id
FROM table tbl
INNER JOIN ... ON ... = ... (x5)
WHERE tbl.xxx = ...
Using this query, in a stored procedure takes 22 seconds and I would like it to be close to "instant". Is this even possible? What can I do to speed it up?
I got indexes on the fields that I'm joining on and the fields in the WHERE clause.
Any ideas?
switch to EXISTS predicate. In general I have found it to be faster than selecting top 1 etc.
So you could write like this IF EXISTS (SELECT * FROM table tbl INNER JOIN table tbl2 .. do your stuff
Depending on your RDBMS you can check what parts of the query are taking a long time and which indexes are being used (so you can know they're being used properly).
In MSSQL, you can use see a diagram of the execution path of any query you submit.
In Oracle and MySQL you can use the EXPLAIN keyword to get details about how the query is working.
But it might just be that 22 seconds is the best you can do with your query. We can't answer that, only the execution details provided by your RDBMS can. If you tell us which RDBMS you're using we can tell you how to find the information you need to see what the bottleneck is.
4 options
Try COUNT(*) in place of TOP 1 tbl.id
An index per column may not be good enough: you may need to use composite indexes
Are you on SQL Server 2005? If som, you can find missing indexes. Or try the database tuning advisor
Also, it's possible that you don't need 5 joins.
Assuming parent-child-grandchild etc, then grandchild rows can't exist without the parent rows (assuming you have foreign keys)
So your query could become
SELECT TOP 1
tbl.Id --or count(*)
FROM
grandchildtable tbl
INNER JOIN
anothertable ON ... = ...
WHERE
tbl.xxx = ...
Try EXISTS.
For either for 5 tables or for assumed heirarchy
SELECT TOP 1 --or count(*)
tbl.Id
FROM
grandchildtable tbl
WHERE
tbl.xxx = ...
AND
EXISTS (SELECT *
FROM
anothertable T2
WHERE
tbl.key = T2.key /* AND T2 condition*/)
-- or
SELECT TOP 1 --or count(*)
tbl.Id
FROM
mytable tbl
WHERE
tbl.xxx = ...
AND
EXISTS (SELECT *
FROM
anothertable T2
WHERE
tbl.key = T2.key /* AND T2 condition*/)
AND
EXISTS (SELECT *
FROM
yetanothertable T3
WHERE
tbl.key = T3.key /* AND T3 condition*/)
Doing a filter early on your first select will help if you can do it; as you filter the data in the first instance all the joins will join on reduced data.
Select top 1 tbl.id
From
(
Select top 1 * from
table tbl1
Where Key = Key
) tbl1
inner join ...
After that you will likely need to provide more of the query to understand how it works.
Maybe you could offload/cache this fact-finding mission. Like if it doesn't need to be done dynamically or at runtime, just cache the result into a much smaller table and then query that. Also, make sure all the tables you're querying to have the appropriate clustered index. Granted you may be using these tables for other types of queries, but for the absolute fastest way to go, you can tune all your clustered indexes for this one query.
Edit: Yes, what other people said. Measure, measure, measure! Your query plan estimate can show you what your bottleneck is.
Use the maximun row table first in every join and if more than one condition use
in where then sequence of the where is condition is important use the condition
which give you maximum rows.
use filters very carefully for optimizing Query.

T-SQL: Why is my query faster, if I use a table variable?

can anybody explain me why this query takes 13 seconds:
SELECT Table1.Location, Table2.SID, Table2.CID, Table1.VID, COUNT(*)
FROM Table1 INNER JOIN
Table2 AS ON Table1.TID = Table2.TID
WHERE Table1.Last = Table2.Last
GROUP BY Table1.Location, Table2.SID, Table2.CID, Table1.VID
And this one only 1 second:
DECLARE #Test TABLE (Location INT, SID INT, CID INT, VID INT)
INSERT INTO #Test
SELECT Table1.Location, Table2.SID, Table2.CID, Table1.VID
FROM Table1 INNER JOIN
Table2 AS ON Table1.TID = Table2.TID
WHERE Table1.Last = Table2.Last
SELECT Location, SID, CID, VID, COUNT(*)
FROM #Test
GROUP BY Location, SID, CID, VID
When I remove the GROUP BY from the first query it needs only 1 second too. I also try to write a subselect and group the result, but it takes 13 seconds too. I don't unterstand this.
Compare the execution plans for the two queries.
The GROUP BY may perform better if you have an INDEX on each of the columns in the GROUP BY (either one each, or a combined single index of all the columns)
The reason your temporary version works better is probably because the GROUP BY is being performed on a much smaller subset of the data and therefore it is fast even without the index.
Your temp table method is by no means the wrong way of doing it. It is one of those situations where you weigh up the pro's and con's of each method. An index on the main table may slow up your inserts / updates and increase your database size. The temp table however may not perform adequately once the data size increases over time.
Could be the that in first query you group and count on larger query result than in second query where you work with already smaller dataset. Could be indexing problem. Once you fix it don't forget to re-check with larger result set because "worser" query can perform better with larger datasets.