Why can a SELECT TOP clause lead to a long execution time? - sql

The following query takes forever to finish, but if I remove the TOP 10 clause it finishes rather quickly. big_table_1 and big_table_2 are two tables with 10^5 records each.
I used to believe that a TOP clause would reduce the time cost, but apparently that's not the case here. Why?
select top 10 ServiceRequestID
from
(
(select *
from big_table_1
where big_table_1.StatusId=2
) cap1
inner join
big_table_2 cap2
on cap1.ServiceRequestID = cap2.CustomerReferenceNumber
)

There are other Stack Overflow discussions on this same topic (links at the bottom). As noted in the comments above, it might have something to do with indexes and the optimizer getting confused and using the wrong one.
My first thought is that you are doing a select top ServiceRequestID from (select * ...), and the optimizer may have difficulty pushing the query down to the inner query and making use of the index.
Consider rewriting it as
select top 10 cap1.ServiceRequestID
from big_table_1 cap1
inner join big_table_2 cap2
on cap1.ServiceRequestID = cap2.CustomerReferenceNumber
and cap1.StatusId = 2
In your query, the database is probably trying to merge all the results, return them, and only THEN limit them to the top 10 in the outer query. In the query above, the database only has to gather the first 10 results as the rows are being merged, saving loads of time. And if ServiceRequestID is indexed, it will be sure to use the index. In your example, the query is looking for the ServiceRequestID column in a result set that has already been returned in a virtual, unindexed format.
Hope that makes sense. While hypothetically the optimizer is supposed to take whatever form of SQL we write and figure out the best way to return values every time, the truth is that the way we put our SQL together can really impact the order in which certain steps are done on the DB.
SELECT TOP is slow, regardless of ORDER BY
Why is doing a top(1) on an indexed column in SQL Server slow?

I had a similar problem with a query like yours. The query with an ORDER BY but without the TOP clause took 1 second; the same query with TOP 3 took 1 minute.
I saw that using a variable for the TOP made it work as expected.
The code for your case:
declare @top int = 10;
select top (@top) ServiceRequestID
from
(
(select *
from big_table_1
where big_table_1.StatusId=2
) cap1
inner join
big_table_2 cap2
on cap1.ServiceRequestID = cap2.CustomerReferenceNumber
)

I can't explain why, but I can give an idea:
try adding SET ROWCOUNT 10 before your query. It helped me in some cases. Bear in mind that this is a session-level setting, so you have to set it back to its original value (SET ROWCOUNT 0 removes the limit) after running your query.
Explanation:
SET ROWCOUNT: Causes SQL Server to stop processing the query after the specified number of rows are returned.
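For reference, a minimal sketch of that approach applied to the original query (the aliases and the flattened join are mine, assumed equivalent to the derived-table version):
SET ROWCOUNT 10;   -- stop producing rows after the first 10

select cap1.ServiceRequestID
from big_table_1 cap1
inner join big_table_2 cap2
on cap1.ServiceRequestID = cap2.CustomerReferenceNumber
where cap1.StatusId = 2;

SET ROWCOUNT 0;    -- reset to the default (no limit) for the rest of the session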

This can also depend on what you mean by "finished". If "finished" means you start seeing some output in a GUI, that does not necessarily mean the query has completed executing; it can mean that the results are beginning to stream in, not that the streaming is complete. When you wrap this into a subquery, the outer query can't really do its processing until all the results of the inner query are available:
the outer query is dependent on the length of time it takes to return the last row of the inner query before it can "finish"
running the inner query independently may only require waiting until the first row is returned before seeing any results
In Oracle, there were "first_rows" and "all_rows" hints that were somewhat related to manipulating this kind of behaviour. AskTom discussion.
If the inner query takes a long time between generating the first row and generating the last row, then this could be an indicator of what is going on. As part of the investigation, I would take the inner query and modify it to have a grouping function (or an ordering) to force processing all rows before a result can be returned. I would use this as a measure of how long the inner query really takes, for comparison with the time the outer query takes.
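A rough sketch of that timing probe (names taken from the question; the aggregate is only there to force the whole inner result to be produced before anything is returned):
SELECT COUNT(*)   -- cannot return until every joined row has been produced
FROM (select * from big_table_1 where big_table_1.StatusId = 2) cap1
INNER JOIN big_table_2 cap2
ON cap1.ServiceRequestID = cap2.CustomerReferenceNumber;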
Drifting off topic a bit, it might be interesting to try simulating something like this in Oracle: create a Pipelined function to stream back numbers; stream back a few (say 15), then spin for a while before streaming back more.
Use a JDBC client to call executeQuery against the pipelined function. The Oracle Statement fetchSize is 10 by default. Loop and print the results with a timestamp, and see if the results stagger. I could not test this with PostgreSQL (RETURN NEXT), since Postgres does not stream the results from the function.
Oracle Pipelined Function
A pipelined table function returns a row to its invoker immediately
after processing that row and continues to process rows. Response time
improves because the entire collection need not be constructed and
returned to the server before the query can return a single result
row. (Also, the function needs less memory, because the object cache
need not materialize the entire collection.)
Postgresql RETURN NEXT
Note: The current implementation of RETURN NEXT and RETURN QUERY
stores the entire result set before returning from the function, as
discussed above. That means that if a PL/pgSQL function produces a
very large result set, performance might be poor: data will be written
to disk to avoid memory exhaustion, but the function itself will not
return until the entire result set has been generated. A future
version of PL/pgSQL might allow users to define set-returning
functions that do not have this limitation.
JDBC Default Fetch Sizes
statement.setFetchSize(100);

When debugging things like this, I find that the quickest way to figure out how SQL Server "sees" the two queries is to look at their query plans. Hit Ctrl+L in SSMS in the query window and the results will show what logic it will use to build your results when the query is actually executed.
SQL Server maintains statistics about the data in your tables, e.g. histograms of the number of rows with values in certain ranges. It gathers and uses these statistics to try to predict the "best" way to run queries against those tables. For example, it might have data that suggests that for some inputs a particular subquery would be expected to return 1M rows, while for other inputs the same subquery might return 1000 rows. This can lead it to choose different strategies for building the results, say using a table scan (exhaustively searching the table) instead of an index seek (jumping right to the desired data). If the statistics don't adequately represent the data, the "wrong" strategy can be chosen, with results similar to what you're experiencing. I don't know if that's the problem here, but that's the kind of thing I would look for.
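If you want to see how fresh the statistics on the question's tables are, a rough sketch (the dbo schema is assumed):
SELECT s.name, sp.last_updated, sp.[rows], sp.modification_counter
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE s.object_id = OBJECT_ID('dbo.big_table_1');
A large modification_counter relative to [rows], or an old last_updated date, suggests the histograms no longer reflect the data.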

If you want to compare the performance of your two queries, you have to run them in the same situation (with clean memory buffers) and gather numeric statistics.
Run this batch for each query to compare execution time and statistics results
(do not run it in a production environment):
DBCC FREEPROCCACHE
GO
CHECKPOINT
GO
DBCC DROPCLEANBUFFERS
GO
SET STATISTICS IO ON
GO
SET STATISTICS TIME ON
GO
-- your query here
GO
SET STATISTICS TIME OFF
GO
SET STATISTICS IO OFF
GO

I've just had to investigate a very similar issue.
SELECT TOP 5 *
FROM t1 JOIN t2 ON t2.t1id = t1.id
WHERE t1.Code = 'MyCode'
ORDER BY t2.id DESC
t1 has 100K rows, t2 20M rows. The average number of rows from the joined tables for a given t1.Code is about 35K. The actual result set is only 3 rows, because t1.Code = 'MyCode' matches only 2 rows, which have only 3 corresponding rows in t2. Stats are up to date.
With the TOP 5 as above, the query takes minutes; with the TOP 5 removed, the query returns immediately.
The plans with and without the TOP are completely different.
The plan without the TOP uses an index seek on t1.Code, finds 2 rows, then nested loop joins 3 rows via an index seek on t2. Very quick.
The plan with the TOP uses an index scan on t2 giving 20M rows, then nested loop joins 2 rows via an index seek on t1.Code, then applies the top operator.
What I think makes my TOP plan so bad is that the rows being picked from t1 and t2 are some of the newest rows (largest values for t1.id and t2.id). The query optimiser has assumed that picking the first 5 rows from an evenly distributed average resultset will be quicker than the non-TOP approach. I tested this theory by using a t1.code from the very earliest rows and the response is sub-second using the same plan.
So the conclusion, in my case at least, is that the problem is a result of uneven data distribution that is not reflected in the stats.
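Not part of the original finding, but one mitigation sometimes used for exactly this kind of row-goal misestimate (it assumes SQL Server 2016 SP1 or later) is to tell the optimizer to ignore the row goal that TOP introduces:
SELECT TOP 5 *
FROM t1 JOIN t2 ON t2.t1id = t1.id
WHERE t1.Code = 'MyCode'
ORDER BY t2.id DESC
OPTION (USE HINT('DISABLE_OPTIMIZER_ROWGOAL'));  -- cost the plan as if all rows were needed
With the row goal disabled, the optimizer should fall back to the seek-driven plan that the non-TOP query used.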

TOP does not sort the results to my knowledge unless you use order by.
So my guess would be, as someone had already suggested, that the query isn't taking longer to execute. You simply start seeing the results faster when you don't have TOP in the query.
Try using @sql_mommy's query, but make sure you have the following:
To get your query to run faster, you could create an index on ServiceRequestID and StatusId in big_table_1 and an index on CustomerReferenceNumber in big_table_2. If you create nonclustered indexes, you should get an index-only plan with very fast results.
If I remember correctly, the TOP results will be in the same order as the index you use on big_table_1, but I'm not sure.
Gísli
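A minimal sketch of those indexes (the index names are made up; the columns come from the question):
CREATE NONCLUSTERED INDEX IX_big_table_1_StatusId_ServiceRequestID
ON big_table_1 (StatusId, ServiceRequestID);

CREATE NONCLUSTERED INDEX IX_big_table_2_CustomerReferenceNumber
ON big_table_2 (CustomerReferenceNumber);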

It might be a good idea to compare the execution plans between the two. Your statistics might be out of date. If you see a difference between the actual execution plans, there is your difference in performance.
In most cases you would expect better performance with the TOP 10; in your case, performance is worse. If that is so, you will not only see a difference between the execution plans, but you will also see a difference between the number of rows in the estimated execution plan and in the actual execution plan, which is what leads the SQL engine to the poor decision.
Try again after recomputing your statistics (and, while you're at it, rebuilding the indices).
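For example (a sketch; rebuilding indexes is expensive, so do it in a quiet period):
UPDATE STATISTICS big_table_1 WITH FULLSCAN;
UPDATE STATISTICS big_table_2 WITH FULLSCAN;
ALTER INDEX ALL ON big_table_1 REBUILD;  -- rebuilding also refreshes the index statistics
ALTER INDEX ALL ON big_table_2 REBUILD;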
Also check if it helps to take out the where big_table_1.StatusId=2 and instead go for
select top 10 ServiceRequestID
from big_table_1 as cap1 INNER JOIN
big_table_2 as cap2
ON cap1.ServiceRequestID = cap2.CustomerReferenceNumber
WHERE cap1.StatusId=2
I find this format much more readable, and it should (though it's remotely possible that it doesn't) optimise to the same execution plan. The returned end result will be identical regardless.

Related

SQL - Why does adding CASE to an ORDER BY clause drastically cut performance? And how can I avoid this?

I'm looking to improve the performance of one of our stored procedures and have come across something that I'm struggling to find information on. I'm by no means a DBA, so my knowledge of SQL is not brilliant.
Here's a simplified version of the problem:
If I use the following query -
SELECT * FROM Product
ORDER BY Name
OFFSET 100 ROWS
FETCH NEXT 28 ROWS ONLY
I get the results in around 20ms
However if I apply a conditional ordering -
DECLARE @so int = 1
SELECT * FROM Product
ORDER BY
CASE WHEN @so = 1 THEN Name END,
CASE WHEN @so = 2 THEN Name END DESC
OFFSET 100 ROWS
FETCH NEXT 28 ROWS ONLY
The overall request in my mind is the same, but the results take 600ms, 30x longer.
The execution plans are drastically different, but being a novice I've no idea how to bring the execution path for the second case into line with the first case.
Is this even possible, or should I look at creating separate procedures for the order by cases and move choosing the order logic to the code?
NB. This is using MS SQL Server
The reason is because SQL Server can no longer use an index. One solution is dynamic SQL. Another is a simple IF:
IF (@so = 1)
BEGIN
SELECT p.*
FROM Product p
ORDER BY Name
OFFSET 100 ROWS
FETCH NEXT 28 ROWS ONLY
END;
ELSE
BEGIN
SELECT p.*
FROM Product p
ORDER BY Name DESC
OFFSET 100 ROWS
FETCH NEXT 28 ROWS ONLY;
END;
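Not from the answer itself, but a sketch of the dynamic SQL option mentioned above (it assumes the same Product table and an @so parameter supplied by the surrounding procedure):
DECLARE @sql nvarchar(max) = N'
SELECT p.*
FROM Product p
ORDER BY p.Name ' + CASE WHEN @so = 2 THEN N'DESC' ELSE N'ASC' END + N'
OFFSET 100 ROWS
FETCH NEXT 28 ROWS ONLY;';

EXEC sys.sp_executesql @sql;  -- each generated statement gets its own cached plan, and the index on Name stays usable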
Gordon Linoff is right that this prevents an index from being used, but to expand a bit on that:
When SQL Server prepares the execution of a query, it generates an execution plan: the query is compiled into steps that the database engine can execute. It's at this point, generally, that it looks at which indices are available for use, but at this point the parameter values are not yet known, so the query optimiser cannot see whether an index on Name would be useful.
The workarounds in his answer are valid, but I'd like to offer one more:
Add OPTION (RECOMPILE) to your query. This forces your query execution plan to be recompiled each time, and each time, the parameter values are known, and allows the optimiser to optimise for those specific parameter values. It will generally be a bit less efficient than fully dynamic SQL, since dynamic SQL allows each possible statement's execution plan to be cached, but it will likely be better than what you have now, and more maintainable than the other options.
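Applied to the query from the question, that would look roughly like:
DECLARE @so int = 1;

SELECT * FROM Product
ORDER BY
CASE WHEN @so = 1 THEN Name END,
CASE WHEN @so = 2 THEN Name END DESC
OFFSET 100 ROWS
FETCH NEXT 28 ROWS ONLY
OPTION (RECOMPILE);  -- the plan is rebuilt each execution, with the actual @so value known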

Select top 4 with order by, but only if actually required?

I have part of a stored proc that is called thousands and thousands of times and as a result takes up the bulk of the whole thing. Having run it through execution plan it looks like the TOP 4 and Order By part is taking up a lot of that. The order by uses a function that although streamlined, will still be being used a fair bit.
This is an odd situation in that for 99.5% of the data there will be 4 or less results returned anyway, it's only for the 0.5% of times that we need the TOP 4. This is a requirement of the data algorithm so eliminating the TOP 4 entirely is not an option.
So lets say my syntax is
SELECT SomeField * SomeOtherField as MainField, SomeOtherField
FROM
(
SELECT TOP 4
SomeField, 1/dbo.[Myfunction](Param1, Param2, 34892) as SomeOtherField
FROM #MytempTable
WHERE
Param1 > @NextMargin1 AND Param1 < @NextMargin1End
AND Param2 > @NextMargin2 AND Param2 < @NextMargin2End
ORDER BY dbo.[MyFunction](Param1, Param2, 34892)
) d
Is there a way I can tell SQL server to do the order by if and only if there are more than 4 results returned after the where takes place? I don't need the order otherwise. Perhaps a table variable and count of the table in an if?
--- Update based on David's answer, to try to work out why it was slower:
I did a check and can confirm that 96.5% of the time there are 4 or fewer results, so it's not a case of more data than expected.
Here is the execution plan for the insert into the #FunctionResults
And the breakdowns of the Insert and spool:
And then the execution plan for the selection of the top4 and orderby:
Please let me know if any further information or breakdowns are required. The size of #MytempTable could typically be 28,000 rows, and it has the index:
CREATE INDEX MyIndex on #MyTempTable (Param1, Param2) INCLUDE ([SomeField])
This answer has been updated based on continued feedback from the question asker. The original suggestion was to attempt to use a table variable to store pre-calculations and select the top 4 from the results. However, in practice it appears that the optimizer was over-estimating the number of rows and choosing a bad execution plan.
In addition to the previous recommendations, I would also recommend updating statistics periodically after any change to this process to provide the query optimizer with updated information to make more informed decisions.
As this is a performance-tuning process without direct access to the source environment, this answer is expected to change based on user feedback. Per the recommendation of @SteveFord above, the sample query below uses a CROSS APPLY to attempt to avoid multiple unnecessary function calls.
SELECT TOP 4
M.SomeField,
M.SomeField * 1/F.FunctionResults [SomeOtherField]
FROM #MytempTable M
CROSS APPLY (SELECT dbo.Myfunction(M.Param1, M.Param2, 34892)) F(FunctionResults)
ORDER BY F.FunctionResults

Subquery vs. inner join in SQL Server

I have following queries
First one using inner join
SELECT item_ID,item_Code,item_Name
FROM [Pharmacy].[tblitemHdr] I
INNER JOIN EMR.tblFavourites F ON I.item_ID=F.itemID
WHERE F.doctorID = @doctorId AND F.favType = 'I'
second one using sub query like
SELECT item_ID,item_Code,item_Name from [Pharmacy].[tblitemHdr]
WHERE item_ID IN
(SELECT itemID FROM EMR.tblFavourites
WHERE doctorID = @doctorId AND favType = 'I'
)
The item table [Pharmacy].[tblitemHdr] contains 15 columns and 2000 records, and EMR.tblFavourites contains 5 columns and around 100 records. In this scenario, which query gives me better performance?
Usually joins will work faster than inner queries, but in reality it will depend on the execution plan generated by SQL Server. No matter how you write your query, SQL Server will always transform it into an execution plan. If it is "smart" enough to generate the same plan from both queries, you will get the same result.
Here and here are some links to help.
In SQL Server Management Studio you can enable "Client Statistics" and also include the actual execution plan. This will give you the ability to know precisely the execution time and load of each request.
Also, between each request, clean the cache to avoid cache side effects on performance:
USE <YOURDATABASENAME>;
GO
CHECKPOINT;
GO
DBCC DROPCLEANBUFFERS;
GO
I think it's always better to see with our own eyes than to rely on theory!
Sub-query vs. join, in rough numbers:
Table one: 20 rows, 2 cols
Table two: 20 rows, 2 cols
Sub-query: on the order of 20*20 row visits
Join: on the order of 20*2
The scan count shows the multiplication effect, as the system has to go through the data again and again to fetch it. For your performance measure, just look at the time.
Join is faster than subquery.
A subquery makes for busy disk access: think of the hard disk's read-write head going back and forth as it accesses User, SearchExpression, PageSize, DrilldownPageSize, User, SearchExpression, PageSize, DrilldownPageSize, User... and so on.
A join works by concentrating the operation on the result of the first two tables; any subsequent join works on the in-memory (or cached-to-disk) result of the already-joined tables, and so on. Less read-write head movement, thus faster.
Source: Here
The first query is better than the second because in the first query we join both tables directly.
Also check the execution plan for both queries...

Does SQL Server TOP stop processing once it finds enough rows?

When you use the SQL Server TOP clause in a query, does the SQL Server engine stop searching for rows once it has enough to satisfy the TOP X needed to be returned?
Consider the following queries (assume some_text_field is unique and not set for full-text indexing):
SELECT
pk_id
FROM
some_table
WHERE
some_text_field = 'some_value';
and
SELECT TOP 1
pk_id
FROM
some_table
WHERE
some_text_field = 'some_value';
The first query would need to search the entire table and return all of the results it found. The way we have it set up, though, that query would only ever really return one value. So, would using TOP 1 prevent SQL Server from scanning the rest of the table once it has found a match?
Yes, the query stops once it has found enough rows, and doesn't query the rest of the table(s).
Note however that you would probably want to have an index that the database can use for the query. In that case there isn't really any performance difference between getting the first match and getting all of the (in this case, single) matching rows.
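A sketch of such an index for the example tables (the name is made up; UNIQUE matches the stated assumption that some_text_field is unique):
CREATE UNIQUE NONCLUSTERED INDEX IX_some_table_some_text_field
ON some_table (some_text_field)
INCLUDE (pk_id);  -- lets both example queries be answered from the index alone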
Yes.
In this case you would get 1 undefined row (as TOP without ORDER BY doesn't guarantee any particular result) then it would stop processing (The TOP iterator in the plan won't request any more rows from child iterators).
If there is a blocking operator (such as SORT) in the plan before the TOP operator or parallel operators before the TOP it may end up doing a lot of work for rows not returned in the final result anyway though.

SQL massive performance difference using SELECT TOP x even when x is much higher than selected rows

I'm selecting some rows from a table valued function but have found an inexplicable massive performance difference by putting SELECT TOP in the query.
SELECT col1, col2, col3 etc
FROM dbo.some_table_function
WHERE col1 = @parameter
--ORDER BY col1
is taking upwards of 5 or 6 mins to complete.
However
SELECT TOP 6000 col1, col2, col3 etc
FROM dbo.some_table_function
WHERE col1 = @parameter
--ORDER BY col1
completes in about 4 or 5 seconds.
This wouldn't surprise me if the returned set of data were huge, but the particular query involved returns ~5000 rows out of 200,000.
So in both cases, the whole of the table is processed, as SQL Server continues to the end in search of 6000 rows which it will never get to. Why the massive difference then? Is this something to do with the way SQL Server allocates space in anticipation of the result set size (the TOP 6000 thereby giving it a low requirement which is more easily allocated in memory)?
Has anyone else witnessed something like this?
Thanks
Table-valued functions can have a non-linear execution time.
Let's consider a function equivalent of this query:
SELECT (
SELECT SUM(mi.value)
FROM mytable mi
WHERE mi.id <= mo.id
)
FROM mytable mo
ORDER BY
mo.value
This query (that calculates the running SUM) is fast at the beginning and slow at the end, since on each row from mo it should sum all the preceding values which requires rewinding the rowsource.
Time taken to calculate SUM for each row increases as the row numbers increase.
If you make mytable large enough (say, 100,000 rows, as in your example) and run this query you will see that it takes considerable time.
However, if you apply TOP 5000 to this query, you will see that it completes in much less than 1/20 of the time required for the full table.
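That is, the TOP version of the same (hypothetical) query would be something like:
SELECT TOP 5000
    (
    SELECT SUM(mi.value)
    FROM mytable mi
    WHERE mi.id <= mo.id
    )
FROM mytable mo
ORDER BY
    mo.value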
Most probably, something similar happens in your case too.
To say something more definitely, I need to see the function definition.
Update:
SQL Server can push predicates into the function.
For instance, I just created this TVF:
CREATE FUNCTION fn_test()
RETURNS TABLE
AS
RETURN (
SELECT *
FROM master
);
These queries:
SELECT *
FROM fn_test()
WHERE name = @name
SELECT TOP 1000 *
FROM fn_test()
WHERE name = @name
yield different execution plans (the first one uses clustered scan, the second one uses an index seek with a TOP)
I had the same problem: a simple query joining five tables and returning 1000 rows took two minutes to complete. When I added "TOP 10000" to it, it completed in less than one second. It turned out that the clustered index on one of the tables was heavily fragmented.
After rebuilding the index the query now completes in less than a second.
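If you suspect the same thing, a sketch for checking fragmentation (the object name here is a placeholder for your own table):
SELECT i.name AS index_name,
       ips.avg_fragmentation_in_percent,
       ips.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.your_table'), NULL, NULL, 'LIMITED') AS ips
JOIN sys.indexes AS i
  ON i.object_id = ips.object_id AND i.index_id = ips.index_id;
High avg_fragmentation_in_percent on a large index is the kind of thing an index rebuild, as described above, would fix.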
Your TOP has no ORDER BY, so it's simply the same as issuing SET ROWCOUNT 6000 first. An ORDER BY would require all rows to be evaluated first, and that would take a lot longer.
If dbo.some_table_function is an inline table-valued UDF, then it's simply a macro that gets expanded, so it returns the first 6000 rows, as mentioned, in no particular order.
If the UDF is multi-statement, then it's a black box and will always pull in the full dataset before filtering. I don't think this is happening.
Not directly related, but another SO question on TVFs
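For reference, a rough sketch of the two flavours being contrasted (the object names are invented):
-- Inline TVF: a single RETURN (SELECT ...); the optimizer expands it into
-- the calling query like a view, so TOP and WHERE can be pushed inside.
CREATE FUNCTION dbo.fn_inline_example (@p int)
RETURNS TABLE
AS
RETURN (SELECT col1, col2 FROM dbo.some_table WHERE col1 = @p);
GO

-- Multi-statement TVF: fills a table variable first, so the whole result
-- is materialized before the outer query gets to filter it.
CREATE FUNCTION dbo.fn_multi_example (@p int)
RETURNS @result TABLE (col1 int, col2 int)
AS
BEGIN
    INSERT INTO @result (col1, col2)
    SELECT col1, col2 FROM dbo.some_table WHERE col1 = @p;
    RETURN;
END;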
You may be running into something as simple as caching here - perhaps (for whatever reason) the "TOP" query is cached? Using an index that the other isn't?
In any case, the best way to quench your curiosity is to examine the full execution plan for both queries. You can do this right in SQL Server Management Studio, and it'll tell you EXACTLY what operations are being completed and how long each is predicted to take.
All SQL implementations are quirky in their own way - SQL Server's no exception. These kind of "whaaaaaa?!" moments are pretty common. ;^)
It's not necessarily true that the whole table is processed if col1 has an index.
The SQL optimization will choose whether or not to use an index. Perhaps your "TOP" is forcing it to use the index.
If you are using the MSSQL Query Analyzer (The name escapes me) hit Ctrl-K. This will show the execution plan for the query instead of executing it. Mousing over the icons will show the IO/CPU usage, I believe.
I bet one is using an index seek, while the other isn't.
If you have a generic client:
SET SHOWPLAN_ALL ON;
GO
select ...;
go
see http://msdn.microsoft.com/en-us/library/ms187735.aspx for details.
I think Quassnoi's suggestion seems very plausible. By adding TOP 6000 you are implicitly giving the optimizer a hint that a fairly small subset of the 200,000 rows is going to be returned. The optimizer then uses an index seek instead of a clustered index scan or table scan.
Another possible explanation could be caching, as Jim Davis suggests. This is fairly easy to rule out by running the queries again; try running the one with TOP 6000 first.