SQL Server Query time out depending on Where Clause - sql

I have a query that uses 3 functions and a few different views beneath it, which are too complex to post here. The odd thing I am experiencing is when running the top level query, having more than 1 search key is causing the query to take around an hour to run, where splitting the query in two takes about 5 seconds per query.
Here is the top level query:
Select *
from dbo.vwSimpleInvoice i
inner join dbo.vwRPTInvoiceLineItemDetail d on i.InvoiceID = d.InvoiceID
When I add this where clause:
Where i.InvoiceID = 109581
The query takes about 3 seconds to run. Similarly when I add this where clause:
Where i.InvoiceID = 109582
it takes about 3 seconds.
When I add this where clause though:
Where i.InvoiceID in (109581, 109582)
I have had to kill the query after about 50 minutes, and it never returns any results.
This is occurring on a remote client's server running SQL Server 2008 R2 Express. When I run it locally (also on SQL Server 2008 R2 Express), I don't get the massive delay, the last where clause takes about 30 seconds to return. The client has a lot more data than me though.
Any idea where to start troubleshooting this?
Edit:
After the comments below I rebuilt indexes and stats, which improved performance of the first 2 where clauses, but had no effect on the third. I then played around with the query, and discovered that if I rewrote it as:
Select *
from dbo.vwSimpleInvoice i
inner join
(Select * from dbo.vwRPTInvoiceLineItemDetail) d on i.InvoiceID = d.InvoiceID
Where i.InvoiceID in (109581, 109582)
Performance returns to expected levels, around 200 ms. I am now more mystified than ever as to what is occurring...
Edit 2:
Actually, I am wrong. It wasn't rewriting the query like that, I accidentally changed the Where Clause during the rewrite to:
Where d.InvoiceID in (109581, 109582)
(Changed i to d).
Still at a bit of a loss as to why this makes such as massive difference on an Inner Join?
Further edit:
Playing around with this even further, I still cannot understand it.
Select InvoiceId from tblInvoice Where CustomerID = 2000
returns:
80442, 4988, 98497, 102483, 102484, 107958, 127063, 168444, 168531, 173382, 173487, 173633, 174013, 174160, 174240, 175389
Select * from dbo.vwRPTInvoiceLineItemDetail
Where InvoiceID in
(80442, 4988, 98497, 102483, 102484, 107958, 127063, 168444, 168531, 173382, 173487, 173633, 174013, 174160, 174240, 175389)
Runs: 31 Rows returned 110 ms
Select * from dbo.vwRPTInvoiceLineItemDetail
Where InvoiceID in
(Select InvoiceId from tblInvoice Where CustomerID = 2000)
Runs: 31 rows returned 65 minutes

The Problem you are experiencing is (almost certainly) due to a cached query plan, which is appropriate for some version of parameters passed to the query, but not for others (aka Parameter Sniffing).
This is a common occurance, and is often made worse by out of date statistics and/or badly fragmented indexes.
First step: ensure you have rebuilt all your indexes and that statistics on non-indexed columns are up to date. (Also, make sure your client has a regularly scheduled index maintenance job)
exec sp_msforeachtable "DBCC DBREINDEX('?')"
go
exec sp_msforeachtable "UPDATE STATISTICS ? WITH FULLSCAN, COLUMNS"
go
This is the canonical reference: Slow in the Application, Fast in SSMS?
If the problem still exists after rebuilding indexes and updating statistics, then you have a few options:
Use dynamic SQL (but read this first: The Curse and Blessings of
Dynamic SQL)
Use OPTIMIZE FOR
Use WITH(RECOMPILE)

Related

SQL - Why does adding CASE to an ORDER BY clause drastically cut performance? And how can I avoid this?

I'm looking to improve the performance of one of our stored procedures and have come across something that I'm struggling to find information on. I'm by no means a DBA, so my knowledge of SQL is not brilliant.
Here's a simplified version of the problem:
If I use the following query -
SELECT * FROM Product
ORDER BY Name
OFFSET 100 ROWS
FETCH NEXT 28 ROWS ONLY
I get the results in around 20ms
However if I apply a conditional ordering -
DECLARE #so int = 1
SELECT * FROM Product
ORDER BY
CASE WHEN #so = 1 THEN Name END,
CASE WHEN #so = 2 THEN Name END DESC
OFFSET 100 ROWS
FETCH NEXT 28 ROWS ONLY
The overall request in my mind is the same, but the results take 600ms, 30x longer.
The execution plans are drastically different, but being a novice I've no idea how to bring the execution path for the second case into line with the the first case.
Is this even possible, or should I look at creating separate procedures for the order by cases and move choosing the order logic to the code?
NB. This is using MS SQL Server
The reason is because SQL Server can no longer use an index. One solution is dynamic SQL. Another is a simple IF:
IF (#so = 1)
BEGIN
SELECT p.*
FROM Product p
ORDER BY Name
OFFSET 100 ROWS
FETCH NEXT 28 ROWS ONLY
END;
ELSE
BEGIN
SELECT p.*
FROM Product p
ORDER BY Name DESC
OFFSET 100 ROWS
FETCH NEXT 28 ROWS ONLY;
END:
Gordon Linoff is right that this prevents an index from being used, but to expand a bit on that:
When SQL Server prepares execution of a query, it generates an execution plan. This is a query being compiled to steps that the database engine can execute. It's at this point, generally, that it looks at which indices are available for use, but at this point, parameter values are not yet known, so the query optimiser cannot see whether an index on name is useful.
The workarounds in his answer are valid, but I'd like to offer one more:
Add OPTION (RECOMPILE) to your query. This forces your query execution plan to be recompiled each time, and each time, the parameter values are known, and allows the optimiser to optimise for those specific parameter values. It will generally be a bit less efficient than fully dynamic SQL, since dynamic SQL allows each possible statement's execution plan to be cached, but it will likely be better than what you have now, and more maintainable than the other options.

How to speed up this simple query

With SourceTable having > 15MM records and Bad_Phrase having > 3K records, the following query takes almost 10 hours to run on SQL Server 2005 SP4.
Update [SourceTable]
Set Bad_Count = (Select count(*)
from Bad_Phrase
where [SourceTable].Name like '%'+Bad_Phrase.PHRASE+'%')
In English, this query is counting the number of times that any phrases listed in Bad_Phrase are a substring of the column [Name] in the SourceTable and then placing that result in the column Bad_Count.
I would like some suggestions on how to have this query run considerably faster.
For a lack of a better idea, here is one:
I don't know if SQL Server natively supports parallelizing an UPDATE statement, but you can try to do it yourself manually by partitioning the work that needs to be done.
For instance, just as an example, if you can run the following 2 update statements in parallel manually or by writing a small app, I'd be curious to see if you can bring down your total processing time.
Update [SourceTable]
Set Bad_Count=(
Select count(*)
from Bad_Phrase
where [SourceTable].Name like '%'+Bad_Phrase.PHRASE+'%'
)
where Name < 'm'
Update [SourceTable]
Set Bad_Count=(
Select count(*)
from Bad_Phrase
where [SourceTable].Name like '%'+Bad_Phrase.PHRASE+'%'
)
where Name >= 'm'
So the 1st update statement takes care of updating all the rows whose names start with the letters a-l, and the 2nd query takes care of o-z.
It's just an idea, and you can try splitting this into smaller chunks and more parallel update statements, depending on the capacity of your SQL Server machine.
Sounds like your query is scanning the whole table. Does your tables have proper indexes on them. Putting an index on columns that appear in a where clause is a good place to start. You can also try and get the cost of the query in the Sql server management studio (display estimated execution cost) or if your willing to wait (display actual execution cost) are both buttons in the query window. The cost will provide insights as to what is taking forever and possibly steer you to wright faster queries.
You are updating the table using sub query with the same table, every row update will scan the whole table and that may cause too much execution time. I think is better if you will insert first all data in the #temp table and then use the #temp table in your update statement. Or you can JOIN the Source table and Temp table as well.

Inserting into temp table from view is very slow

I am using different temp tables in my query. When I execute the query below
select * from myView
It takes only 5 seconds to execute.
but when I execute
select * into #temp from myView
It takes 50 seconds (10 times more than above query).
We migrated from SQL Server 2000 to SQL Server 2008 R2. Before in SQL 2000 both of the query takes same time but in SQL Server 2008 it takes 10 times more to execute.
Old question, but as I had a similar issue (though on SQL Server 2014) and resolved it in a way which I have not seen on any readily available resource, thought I would share in hopes of it being helpful to someone else.
I had a similar situation: a view I had created was taking 21 seconds to return its complete result set, but would take 10+ minutes (at which point I stopped the query) when I converted it into a SELECT..INTO The SELECT was a simple one, with no joins and no predicates. My hunch was that the optimizer was altering the original plan based on the additional INTO statement which did not simply pull the data set as in the first instance, then perform the INSERT, but instead altered it in a way to run very sub-optimally.
I first tried an OPENQUERY, attempting to force the result set to be generated first, then inserted into the temp table. Total running time for this method was 23 seconds, obviously much closer to the original SELECT time. Following this, I returned to my original SELECT..INTO query and added an OPTION (FORCE ORDER) hint to try to replicate the OPENQUERY behavior. This seemed to have done the trick and the time was on par with the OPENQUERY method, 23 seconds.
I don't have enough time at the moment to compare the query plans, but as a quick and dirty option if you run into this issue, you can try:
select * into #temp from myView option (force order);
Yeah, I would check the Execution Plan for your command. there may be an overhead on a sort or something.
I think, your tempdb database is in trouble. May be slow I/O, fragmentation, broken RAID etc.
Do you have an order by clause in your select statement such as select * from myView order by col1 before inserting into temp table? If there is an order by, that slows down insertion into temp table heavily. If that is the case, remove order by while the insertion happens and order after the insertion happened like
select *
into #temp
from myView
then apply order by
select * from #temp order by col1

Subquery v/s inner join in sql server

I have following queries
First one using inner join
SELECT item_ID,item_Code,item_Name
FROM [Pharmacy].[tblitemHdr] I
INNER JOIN EMR.tblFavourites F ON I.item_ID=F.itemID
WHERE F.doctorID = #doctorId AND F.favType = 'I'
second one using sub query like
SELECT item_ID,item_Code,item_Name from [Pharmacy].[tblitemHdr]
WHERE item_ID IN
(SELECT itemID FROM EMR.tblFavourites
WHERE doctorID = #doctorId AND favType = 'I'
)
In this item table [Pharmacy].[tblitemHdr] Contains 15 columns and 2000 records. And [Pharmacy].[tblitemHdr] contains 5 columns and around 100 records. in this scenario which query gives me better performance?
Usually joins will work faster than inner queries, but in reality it will depend on the execution plan generated by SQL Server. No matter how you write your query, SQL Server will always transform it on an execution plan. If it is "smart" enough to generate the same plan from both queries, you will get the same result.
Here and here some links to help.
In Sql Server Management Studio you can enable "Client Statistics" and also Include Actual Execution Plan. This will give you the ability to know precisely the execution time and load of each request.
Also between each request clean the cache to avoid cache side effect on performance
USE <YOURDATABASENAME>;
GO
CHECKPOINT;
GO
DBCC DROPCLEANBUFFERS;
GO
I think it's always best to see with our own eyes than relying on theory !
Sub-query Vs Join
Table one 20 rows,2 cols
Table two 20 rows,2 cols
sub-query 20*20
join 20*2
logical, rectify
Detailed
The scan count indicates multiplication effect as the system will have to go through again and again to fetch data, for your performance measure, just look at the time
join is faster than subquery.
subquery makes for busy disk access, think of hard disk's read-write needle(head?) that goes back and forth when it access: User, SearchExpression, PageSize, DrilldownPageSize, User, SearchExpression, PageSize, DrilldownPageSize, User... and so on.
join works by concentrating the operation on the result of the first two tables, any subsequent joins would concentrate joining on the in-memory(or cached to disk) result of the first joined tables, and so on. less read-write needle movement, thus faster
Source: Here
First query is better than second query.. because first query we are joining both table.
and also check the explain plan for both queries...

Why select Top clause could lead to long time cost

The following query takes forever to finish. But if I remove the top 10 clause, it finishs rather quickly. big_table_1 and big_table_2 are 2 tables with 10^5 records.
I used to believe that top clause will reduce the time cost, but it's apparently not here. Why???
select top 10 ServiceRequestID
from
(
(select *
from big_table_1
where big_table_1.StatusId=2
) cap1
inner join
big_table_2 cap2
on cap1.ServiceRequestID = cap2.CustomerReferenceNumber
)
There are other stackoverflow discussions on this same topic (links at bottom). As noted in the comments above it might have something to do with indexes and the optimizer getting confused and using the wrong one.
My first thought is that you are doing a select top serviceid from (select *....) and the optimizer may have difficulty pushing the query down to the inner queries and making using of the index.
Consider rewriting it as
select top 10 ServiceRequestID
from big_table_1
inner join big_table_2 cap2
on cap1.servicerequestid = cap2.customerreferencenumber
and big_table_1.statusid = 2
In your query, the database is probably trying to merge the results and return them and THEN limit it to the top 10 in the outer query. In the above query the database will only have to gather the first 10 results as results are being merged, saving loads of time. And if servicerequestID is indexed, it will be sure to use it. In your example, the query is looking for the servicerequestid column in a result set that has already been returned in a virtual, unindexed format.
Hope that makes sense. While hypothetically the optimizer is supposed to take whatever format we put SQL in and figure out the best way to return values every time, the truth is that the way we put our SQL together can really impact the order in which certain steps are done on the DB.
SELECT TOP is slow, regardless of ORDER BY
Why is doing a top(1) on an indexed column in SQL Server slow?
I had a similar problem with a query like yours. The query ordered but without the top clause took 1 sec, same query with top 3 took 1 minute.
I saw that using a variable for the top it worked as expected.
The code for your case:
declare #top int = 10;
select top (#top) ServiceRequestID
from
(
(select *
from big_table_1
where big_table_1.StatusId=2
) cap1
inner join
big_table_2 cap2
on cap1.ServiceRequestID = cap2.CustomerReferenceNumber
)
I cant explain why but I can give an idea:
try adding SET ROWCOUNT 10 before your query. It helped me in some cases. Bear in mind that this is a scope setting so you have to set it back to its original value after running your query.
Explanation:
SET ROWCOUNT: Causes SQL Server to stop processing the query after the specified number of rows are returned.
This can also depend on what you mean by "finished". If "finished" means you start seeing some display on a gui, that does not necessarily mean the query has completed executing. It can mean that the results are beginning to stream in, not that the streaming is complete. When you wrap this into a subquery, the outer query can't really do it's processing until all the results of the inner query are available:
the outer query is dependent on the length of time it takes to return the last row of the inner query before it can "finish"
running the inner query independently may only requires waiting until the first row is returned before seeing any results
In Oracle, there were "first_rows" and "all_rows" hints that were somewhat related to manipulating this kind of behaviour. AskTom discussion.
If the inner query takes a long time between generating the first row and generating the last row, then this could be an indicator of what is going on. As part of the investigation, I would take the inner query and modify it to have a grouping function (or an ordering) to force processing all rows before a result can be returned. I would use this as a measure of how long the inner query really takes for comparison to the time in the outer query takes.
Drifting off topic a bit, it might be interesting to try simulating something like this in Oracle: create a Pipelined function to stream back numbers; stream back a few (say 15), then spin for a while before streaming back more.
Used a jdbc client to executeQuery against the pipelined function. The Oracle Statement fetchSize is 10 by default. Loop and print the results with a timestamp. See if the results stagger. I could not test this with Postgresql (RETURN NEXT), since Postgres does not stream the results from the function.
Oracle Pipelined Function
A pipelined table function returns a row to its invoker immediately
after processing that row and continues to process rows. Response time
improves because the entire collection need not be constructed and
returned to the server before the query can return a single result
row. (Also, the function needs less memory, because the object cache
need not materialize the entire collection.)
Postgresql RETURN NEXT
Note: The current implementation of RETURN NEXT and RETURN QUERY
stores the entire result set before returning from the function, as
discussed above. That means that if a PL/pgSQL function produces a
very large result set, performance might be poor: data will be written
to disk to avoid memory exhaustion, but the function itself will not
return until the entire result set has been generated. A future
version of PL/pgSQL might allow users to define set-returning
functions that do not have this limitation.
JDBC Default Fetch Sizes
statement.setFetchSize(100);
When debugging things like this I find that the quickest way to figure out how SQL Server "sees" the two queries is to look at their query plans. Hit CTRL-L in SSMS in the query view and the results will show what logic it will use to build your results when the query is actually executed.
SQL Server maintains statistics about the data your tables, e.g. histograms of the number of rows with data in certain ranges. It gathers and uses these statistics to try to predict the "best" way to run queries against those tables. For example, it might have data that suggests for some inputs a particular subquery might be expected to return 1M rows, while for other inputs the same subquery might return 1000 rows. This can lead it to choose different strategies for building the results, say using a table scan (exhaustively search the table) instead of an index seek (jump right to the desired data). If the statistics don't adequately represent the data, the "wrong" strategy can be chosen, with results similar to what you're experiencing. I don't know if that's the problem here, but that's the kind of thing I would look for.
If you want to compare performances of your two queries, you have to run these two queries in the same situation ( with clean memory buffers ) and have mumeric statistics
Run this batch for each query to compare execution time and statistics results
(Do not run it on a production environment) :
DBCC FREEPROCCACHE
GO
CHECKPOINT
GO
DBCC DROPCLEANBUFFERS
GO
SET STATISTICS IO ON
GO
SET STATISTICS TIME ON
GO
-- your query here
GO
SET STATISTICS TIME OFF
GO
SET STATISTICS IO OFF
GO
I've just had to investigate a very similar issue.
SELECT TOP 5 *
FROM t1 JOIN t2 ON t2.t1id = t1.id
WHERE t1.Code = 'MyCode'
ORDER BY t2.id DESC
t1 has 100K rows, t2 20M rows, The average number of rows from the joined tables for a t1.Code is about 35K. The actual resultset is only 3 rows because t1.Code = 'MyCode' only matches 2 rows which only have 3 corresponding rows in t2. Stats are up-to-date.
With the TOP 5 as above the query takes minutes, with the TOP 5 removed the query returns immediately.
The plans with and without the TOP are completely different.
The plan without the TOP uses an index seek on t1.Code, finds 2 rows, then nested loop joins 3 rows via an index seek on t2. Very quick.
The plan with the TOP uses an index scan on t2 giving 20M rows, then nested loop joins 2 rows via an index seek on t1.Code, then applies the top operator.
What I think makes my TOP plan so bad is that the rows being picked from t1 and t2 are some of the newest rows (largest values for t1.id and t2.id). The query optimiser has assumed that picking the first 5 rows from an evenly distributed average resultset will be quicker than the non-TOP approach. I tested this theory by using a t1.code from the very earliest rows and the response is sub-second using the same plan.
So the conclusion, in my case at least, is that the problem is a result of uneven data distribution that is not reflected in the stats.
TOP does not sort the results to my knowledge unless you use order by.
So my guess would be, as someone had already suggested, that the query isn't taking longer to execute. You simply start seeing the results faster when you don't have TOP in the query.
Try using #sql_mommy query, but make sure you have the following:
To get your query to run faster, you could create an index on servicerequestid and statusid in big_table_1 and an index on customerreferencenumber in big_table_2. If you create unclustered indexes, you should get an index only plan with very fast results.
If I remember correctly, the TOP results will be in the same order as the index you us on big_table_1, but I'm not sure.
GĂ­sli
It might be a good idea to compare the execution plans between the two. Your statistics might be out of date. If you see a difference between the actual execution plans, there is your difference in performance.
In most cases, you would expect better performance in the top 10. In your case, performance is worse. If this is the case you will not only see a difference between the execution plans, but you will also see a difference in the number of returned rows in the estimated execution plan and the actual execution plan, leading to the poor decission by the SQL engine.
Try again after recomputing your statistics (and while you're at it, rebuilding indices)
Also check if it helps to take out the where big_table_1.StatusId=2 and instead go for
select top 10 ServiceRequestID
from big_table_1 as cap1 INNER JOIN
big_table_2 as cap2
ON cap1.ServiceRequestID = cap2.CustomerReferenceNumber
WHERE cap1.StatusId=2
I find this format much more readable, though it should (though remotely possibly it doesn't) optimise to the same execution plan. The returned endresult will be identical regardless