Should I use Query Hint Fast number_rows / FASTFIRSTROW? - sql

I was reading over the documentation for query hints:
http://msdn.microsoft.com/en-us/library/ms181714(SQL.90).aspx
And noticed this:
FAST number_rows
Specifies that the query is optimized for fast retrieval of the first number_rows. This is a nonnegative integer. After the first number_rows are returned, the query continues execution and produces its full result set.
So when I'm doing a query like:
Select Name from Students where ID = 444
Should I bother with a hint like this? Assuming SQL Server 2005, when should I?
-- edit --
Also should one bother when limiting results:
Select top 10 * from Students OPTION (FAST 10)

The FAST hint only makes sense on complex queries where there are multiple alternatives the optimizer could choose from. For a simple query like your example it doesn't help with anything, the query optimizer will immediately determine that there is a trivial plan (seek in ID index, lookup Name if not covering) to satisfy the query and go for it. Even if no index exists on ID, the plan is still trivial (probably clustered scan).
To give an example where FAST would be useful, consider a join between A and B with an ORDER BY constraint. Say evaluating B first and nested-looping into A honors the ORDER BY constraint, so it produces the first results quickly (no SORT necessary), but is more costly overall because of cardinality (B has many records that match the WHERE, while A has few). On the other hand, evaluating A first and nested-looping into B would produce a plan that does less IO and hence is faster overall, but the result would have to be sorted first, and the SORT can only start after the join is evaluated, so the first result will come very late. The optimizer would normally pick the second plan because it is more efficient overall. The FAST hint would cause the optimizer to pick the first plan, because it produces the first rows faster.

When using TOP x, there's no benefit of also using OPTION FAST x. The query optimizer already makes its decisions based on how many rows you are retrieving. Same goes for trivial queries, such as querying for a particular value from a unique index.
Other than that, OPTION FAST x could help when you know the number of results is likely below x, but the query optimizer does not. Of course, if the query optimizer is choosing poor paths for complex queries with few results, your statistics may need to be updated. And if you guess wrong on x, the query may end up taking longer--almost always a risk when giving hints.
The above statement has not been tested--it may be that all queries take just as long to fully execute, if not longer. Getting the first 10 rows fast is great if there are only 8 rows, but theoretically the query still has to execute fully before finishing. The benefit I'm thinking may be there because the query execution takes a different path expecting fewer total records, when in fact it's really trying to get the first x faster. Those two types of optimizations may not be in alignment.

For that particular query, certainly not! It's only going to return one row — the row with ID = 444. SQL Server will select that row as efficiently as it can.
FAST 10 might be used in a situation where you could make use of the first 10 rows immediately, even as you continue to wait for further results.

Related

What is the efficiency of a query + subquery that finds the minimum parameter of a table in SQL?

I'm currently taking an SQL course and trying to understand efficiency of queries.
Given this query, what's the efficiency of it:
SELECT *
FROM Customers
WHERE Age = (SELECT MIN(Age)
FROM Customers)
What I'm trying to understand is whether the subquery runs once at the beginning, so that the query is O(n+n)?
Or does the subquery run every time you go through a customer's age, which makes it O(n^2)?
Thank you!
If you want to understand how the query optimizer interprets a query, you have to review the execution / explain plan, which almost every RDBMS makes available.
As noted in the comments you tell the RDBMS what you want, not how to get it.
Very often it helps to have a deeper understanding of the particular database engine being used in order to write a query in the most performant way, i.e., to be able to think like the query processor.
Like any language, there's more than one way to skin a cat, so to speak, and with SQL there is usually more than one way to write a query that results in the same output - very often many ways, depending on the complexity.
How a query execution plan gets built and executed is determined by the query optimizer at compile time and depends on many factors, depending on the RDBMS, such as data cardinality, table size, row size, estimated number of rows, sargability, indexes, available resources, current load, concurrency, isolation level - just to name a few.
It often helps to write queries in the most performant way by thinking what you would have to do to accomplish the same task.
In your example, you are looking for all the rows in a table where a particular value equals another value. You have chosen to find that value by first looking for the minimum age - you would only have to do this once as it's a single scalar value, so it's reasonable to assume (but not guaranteed) the database engine would do the same.
You could also approach the problem by aggregating and limiting to the top qualifying row and including ties, if the syntax is supported by the RDBMS, and joining the results.
Ultimately there is no black and white answer.
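To make the two formulations discussed in this answer concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in engine. The sample names and ages are made up for illustration, and SQLite's behavior says nothing definitive about how another RDBMS will plan the same statements; the point is only that the scalar-subquery form and a join against a one-row aggregate return the same rows, ties included.

```python
import sqlite3

# In-memory stand-in for the Customers table; the sample rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customers (Name TEXT, Age INTEGER)")
conn.executemany(
    "INSERT INTO Customers VALUES (?, ?)",
    [("Ann", 31), ("Bob", 19), ("Cy", 45), ("Dee", 19), ("Eve", 27)],
)

# Formulation from the question: an uncorrelated scalar subquery.
subquery_rows = conn.execute(
    """
    SELECT Name, Age
    FROM Customers
    WHERE Age = (SELECT MIN(Age) FROM Customers)
    ORDER BY Name
    """
).fetchall()

# Alternative: compute the minimum in a one-row derived table and join to it.
join_rows = conn.execute(
    """
    SELECT c.Name, c.Age
    FROM Customers AS c
    JOIN (SELECT MIN(Age) AS MinAge FROM Customers) AS m
      ON c.Age = m.MinAge
    ORDER BY c.Name
    """
).fetchall()

print(subquery_rows)
print(join_rows)
```

Whether either form costs one pass or two, or something worse, is still something only the execution plan of your own engine can tell you.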

Pre-fetching row counts before query - performance

I recently answered this question based on my experience:
Counting rows before proceeding to actual searching
but I'm not 100% satisfied with the answer I gave.
The question is basically this: Can I get a performance improvement by running a COUNT over a particular query before deciding to run the query that brings back the actual rows?
My intuition is this: you will only save the I/O and wire time associated with retrieving the data instead of the count because to count the data, you need to actually find the rows. The possible exception to this is when the query is a simple function of the indexes.
My question then is this: Is this always true? What other exception cases are there? From a pure performance perspective, in what cases would one want to do a COUNT before running the full query?
First, the answer to your question is highly dependent on the database.
I cannot think of a situation when doing a COUNT() before a query will shorten the overall time for both the query and the count().
In general, doing a count will pre-load tables and indexes into the page cache. Assuming the data fits in memory, this will make the subsequent query run faster (although not much faster if you have fast I/O and the database does read-ahead page reading). However, you have just shifted the time frame to the COUNT(), rather than reducing overall time.
To shorten the overall time (including the run time of the COUNT()) would require changing the execution plan. Here are two ways this could theoretically happen:
A database could update statistics as a table is read in, and these statistics, in turn, change the query plan for the main query.
A database could change the execution plan based on whether tables/indexes are already in the page cache.
Although theoretically possible, I am not aware of any database that does either of these.
You could imagine that intermediate results could be stored, but this would violate the dynamic nature of SQL databases. That is, updates/inserts could occur on the tables between the COUNT() and the query. A database engine could not maintain integrity and maintain such intermediate results.
Doing a COUNT() has disadvantages, relative to speeding up the subsequent query. The query plan for the COUNT() might be quite different from the query plan for the main query. Your example with indexes is one case. Another case would be in a columnar database, where different vertical partitions of the data do not need to be read.
Yet another case would be a query such as:
select t.*, r.val
from table t left outer join
ref r
on t.refID = r.refID
and refID is a unique index on the ref table. This join can be eliminated for a count, since there are no duplicates in ref and all records in t are used. However, the join is clearly needed for the query itself. Once again, whether a SQL optimizer recognizes and acts on this situation is entirely the decision of the writers of the database. However, the join could theoretically be optimized away for the COUNT().
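The reasoning behind that join elimination can be checked with a small sketch. It uses Python's sqlite3 module and invented sample rows; it does not show that any particular optimizer actually removes the join, only that the count is the same with and without it when refID is unique in ref.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- refID is unique in ref, so each row of t matches at most one ref row.
    CREATE TABLE ref (refID INTEGER PRIMARY KEY, val TEXT);
    CREATE TABLE t (id INTEGER PRIMARY KEY, refID INTEGER);
    INSERT INTO ref VALUES (1, 'a'), (2, 'b');
    -- Includes a NULL refID and a refID with no match in ref.
    INSERT INTO t VALUES (10, 1), (11, 2), (12, 2), (13, NULL), (14, 99);
    """
)

count_with_join = conn.execute(
    "SELECT COUNT(*) FROM t LEFT OUTER JOIN ref r ON t.refID = r.refID"
).fetchone()[0]
count_without_join = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]

print(count_with_join, count_without_join)
```

Because the outer join keeps every row of t and the unique key prevents any row from being duplicated, an optimizer is free to answer the COUNT() from t alone, while the query that returns r.val cannot skip the join.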

Does using the TOP X * format in SQL speed up queries significantly?

So lately when I run queries on huge tables I'll use the top 10 * notation like so:
select top 10 * from BI_Sessions (nolock)
where SessionSID like 'b6d%'
and CreateDate between '03-15-2012' AND '05-18-2012'
I thought that it lets the query run faster, but it doesn't seem so. This one took 4 minutes (or is that an OK time)?
I guess I'm curious about whether the top functionality happens after it pulls all the data anyway(which would seem like it's inefficient).
thanks
It entirely depends on the query, with the exception of "Top 0". "Top 0" does return much faster.
In your case, the query has to look through the rows in a huge table to find rows that match the WHERE clause. If no rows are found, the number of rows being returned doesn't help. If the rows are at the end of the table scan, then the number of rows being returned doesn't help.
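A rough way to see this effect is to count how many rows a scan has to examine before a row limit can stop it. The sketch below is an illustration under assumptions: it uses SQLite's LIMIT as a stand-in for SQL Server's TOP, an unindexed one-column copy of BI_Sessions with invented data, and a hypothetical pass-through function `counted` that exists only to count the rows the scan touches.

```python
import sqlite3

def rows_examined(pattern, total_rows=1000, matching_rows=20, limit=10):
    """Return (rows_returned, rows_examined) for a LIMIT query on an unindexed table."""
    examined = [0]

    def counted(value):
        # Pass-through function: called once for each row the scan evaluates.
        examined[0] += 1
        return value

    conn = sqlite3.connect(":memory:")
    conn.create_function("counted", 1, counted)
    conn.execute("CREATE TABLE BI_Sessions (SessionSID TEXT)")
    # The first `matching_rows` rows start with 'b6d'; the rest never match.
    conn.executemany(
        "INSERT INTO BI_Sessions VALUES (?)",
        [(("b6d" if i < matching_rows else "zzz") + str(i),) for i in range(total_rows)],
    )
    rows = conn.execute(
        "SELECT SessionSID FROM BI_Sessions "
        "WHERE counted(SessionSID) LIKE ? LIMIT ?",
        (pattern, limit),
    ).fetchall()
    conn.close()
    return len(rows), examined[0]

early = rows_examined("b6d%")        # matches sit at the start of the scan
nothing = rows_examined("nomatch%")  # no row matches at all
print(early, nothing)
```

When the matching rows are found early, the scan can stop well short of the end of the table; when nothing matches, every row is examined and the limit saves nothing.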
There are certain cases with more complicated queries where the "top" could affect performance. There is a difference between optimizing overall and for the first row returned. I'm not sure if SQL Server's optimizer recognizes this difference.
Well, it depends. If you do not have a covering index on BI_Sessions and it's a large database, then the answer is probably. A good covering index may be something like: CreateDate, SessionSID, and all the columns you actually need to return. If you do have a covering index, then SQL Server will not even read the table; it will get all the data it needs from the covering index. If you also specify only the columns you actually need to return, 10 rows should possibly come back in a fraction of a second.
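As a hedged illustration of why the column list matters, the sketch below asks SQLite (not SQL Server) for its query plan with and without `SELECT *`. The index name `IX_Sessions_Date_SID`, the extra `Payload` column, and the ISO date strings are assumptions made for the example; SQL Server reports covering behavior differently, through its own execution plans.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE BI_Sessions (SessionSID TEXT, CreateDate TEXT, Payload TEXT);
    -- Hypothetical index containing every column the narrow query needs.
    CREATE INDEX IX_Sessions_Date_SID ON BI_Sessions (CreateDate, SessionSID);
    """
)

WHERE_CLAUSE = (
    " FROM BI_Sessions"
    " WHERE CreateDate BETWEEN '2012-03-15' AND '2012-05-18'"
    " AND SessionSID LIKE 'b6d%' LIMIT 10"
)

def plan_details(select_list):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    sql = "EXPLAIN QUERY PLAN SELECT " + select_list + WHERE_CLAUSE
    return [row[-1] for row in conn.execute(sql)]

plan_narrow = plan_details("SessionSID, CreateDate")  # satisfiable from the index
plan_star = plan_details("*")                         # also needs Payload
print(plan_narrow)
print(plan_star)
```

The narrow query can be answered from the index alone, while `SELECT *` forces a visit to the table for the column the index does not contain.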
for more useful info
http://www.mssqltips.com/sqlservertip/1078/improve-sql-server-performance-with-covering-index-enhancements/
and a bit more technical:
http://www.simple-talk.com/sql/learn-sql-server/using-covering-indexes-to-improve-query-performance/
also
http://www.sqlserverinternals.com/
and
http://www.insidesqlserver.com/thebooks.html

Speed of paged queries in Oracle

This is a never-ending topic for me and I'm wondering if I might be overlooking something. Essentially I use two types of SQL statements in an application:
Regular queries with a "fallback" limit
Sorted and paged queries
Now, we're talking about some queries against tables with several million records, joined to 5 more tables with several million records. Clearly, we hardly want to fetch all of them, that's why we have the above two methods to limit user queries.
Case 1 is really simple. We just add an additional ROWNUM filter:
WHERE ...
AND ROWNUM < ?
That's quite fast, as Oracle's CBO will take this filter into consideration for its execution plan and probably apply a FIRST_ROWS operation (similar to the one enforced by the /*+FIRST_ROWS*/ hint).
Case 2, however is a bit more tricky with Oracle, as there is no LIMIT ... OFFSET clause as in other RDBMS. So we nest our "business" query in a technical wrapper as such:
SELECT outer.* FROM (
SELECT * FROM (
SELECT inner.*, ROWNUM as RNUM, MAX(ROWNUM) OVER(PARTITION BY 1) as TOTAL_ROWS
FROM (
[... USER SORTED business query ...]
) inner
)
WHERE ROWNUM < ?
) outer
WHERE outer.RNUM > ?
Note that the TOTAL_ROWS field is calculated to know how many pages we will have even without fetching all data. Now this paging query is usually quite satisfying. But every now and then (as I said, when querying 5M+ records, possibly including non-indexed searches), this runs for 2-3 minutes.
EDIT: Please note, that a potential bottleneck is not so easy to circumvent, because of sorting that has to be applied before paging!
I'm wondering, is that state-of-the-art simulation of LIMIT ... OFFSET, including TOTAL_ROWS in Oracle, or is there a better solution that will be faster by design, e.g. by using the ROW_NUMBER() window function instead of the ROWNUM pseudo-column?
The main problem with Case 2 is that in many cases the whole query result set has to be obtained and then sorted before the first N rows can be returned - unless the ORDER BY columns are indexed and Oracle can use the index to avoid a sort. For a complex query and a large set of data this can take some time. However there may be some things you can do to improve the speed:
Try to ensure that no functions are called in the inner SQL - these may get called 5 million times just to return the first 20 rows. If you can move these function calls to the outer query they will be called less.
Use a FIRST_ROWS_n hint to nudge Oracle into optimising for the fact that you will never return all the data.
EDIT:
Another thought: you are currently presenting the user with a report that could return thousands or millions of rows, but the user is never realistically going to page through them all. Can you not force them to select a smaller amount of data e.g. by limiting the date range selected to 3 months (or whatever)?
You might want to trace the query that takes a lot of time and look at its explain plan. Most likely the performance bottleneck comes from the TOTAL_ROWS calculation. Oracle has to read all the data, even if you only fetch one row, this is a common problem that all RDBMS face with this type of query. No implementation of TOTAL_ROWS will get around that.
The radical way to speed up this type of query is to forgo the TOTAL_ROWS calculation. Just display that there are additional pages. Do your users really need to know that they can page through 52486 pages? An estimate may be sufficient. That's another solution, implemented by Google search for example: estimate the number of pages instead of actually counting them.
Designing an accurate and efficient estimation algorithm might not be trivial.
A "LIMIT ... OFFSET" is pretty much syntactic sugar. It might make the query look prettier, but if you still need to read the whole of a data set and sort it and get rows "50-60", then that's the work that has to be done.
If you have an index in the right order, then that can help.
It may perform better to run two queries instead of trying to count() and return the results in the same query. Oracle may be able to answer the count() without any sorting or joining to all the tables (join table elimination based on declared foreign key constraints). This is what we generally do in our application. For performance important statements, we write a separate query that we know will return the correct count as we can sometimes do better than Oracle.
Alternatively, you can make a tradeoff between performance and recency of the data. Bringing back the first 5 pages is going to be nearly as quick as bringing back the first page. So you could consider storing the results from 5 pages in a temporary table along with an expiry date for the information. Take the result from the temporary table if valid. Put a background task in to delete the expired data periodically.
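The two-query approach described in this answer can be sketched as follows. This is an illustration only: it runs against SQLite through Python's sqlite3 module, uses LIMIT ... OFFSET in place of the ROWNUM wrapper, and the `orders` table, its columns, and the `fetch_page` helper are hypothetical names invented for the example.

```python
import sqlite3

def fetch_page(conn, page, page_size):
    """Return (total_rows, rows_for_page); `page` is zero-based."""
    # Query 1: the count needs no ORDER BY and none of the display columns.
    total = conn.execute(
        "SELECT COUNT(*) FROM orders WHERE status = 'open'"
    ).fetchone()[0]
    # Query 2: the sorted page itself, without any total-rows window function.
    rows = conn.execute(
        "SELECT id, created FROM orders WHERE status = 'open' "
        "ORDER BY created, id LIMIT ? OFFSET ?",
        (page_size, page * page_size),
    ).fetchall()
    return total, rows

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, created TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (i, "2012-01-%02d" % (i % 28 + 1), "open" if i % 6 else "closed")
        for i in range(1, 31)
    ],
)

total, first_page = fetch_page(conn, page=0, page_size=10)
_, last_page = fetch_page(conn, page=2, page_size=10)
print(total, len(first_page), len(last_page))
```

Keeping the count separate lets each statement be tuned on its own, at the cost of the two results possibly disagreeing if rows change between the calls.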

Which conditional statement is faster in SQL?

SELECT a, b FROM products WHERE (a = 1 OR b = 2)
or...
SELECT a, b FROM products WHERE NOT (a != 1 AND b != 2)
Both statements should achieve the same results. However, the second one avoids the infamously slow "OR" operand in SQL. Does that make the 2nd statement faster?
Traditionally the latter was easier for the optimiser to deal with, in that it could easily resolve an AND to an s-arg, which (loosely speaking) is a predicate that can be resolved using an index.
Historically, query optimisers could not resolve OR statements to s-args and queries using OR predicates could not make effective use of indexes. Thus, the recommendation was to avoid it and re-cast the query in terms like the latter example. More recent optimisers are better at recognising OR statements that are amenable to this transform, but complex OR statements may still confuse them, resulting in unnecessary table scans.
This is the origin of the 'OR is slow' meme. The performance is nothing to do with the efficiency of processing the expression but rather the ability of the optimiser to recognise opportunities to make use of indexes.
No, NOT (a != 1 AND b != 2) is logically identical to a = 1 OR b = 2 (De Morgan's law).
The query optimizer will run the same query plan for both, at least in any marginally sophisticated implementation of SQL.
There are no inherently slow or fast operators in SQL. When you issue a query, you describe the results you want. If two semantically identical queries (especially simple ones like this) yield very different run times, your SQL implementation is not very clever.
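The logical equivalence of `a = 1 OR b = 2` and `NOT (a != 1 AND b != 2)` can be checked directly, including under SQL's three-valued logic. The sketch below uses Python's sqlite3 module with invented rows that cover matching, non-matching, and NULL values; it demonstrates that the results are equal, not that any engine produces the same plan for both.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (a INTEGER, b INTEGER)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [
        (1, 2), (1, 3), (0, 2), (0, 3),
        (None, 2), (None, 3), (1, None), (0, None), (None, None),
    ],
)

# rowid is selected only to give both result sets the same stable order.
or_rows = conn.execute(
    "SELECT rowid, a, b FROM products WHERE (a = 1 OR b = 2) ORDER BY rowid"
).fetchall()
not_and_rows = conn.execute(
    "SELECT rowid, a, b FROM products WHERE NOT (a != 1 AND b != 2) ORDER BY rowid"
).fetchall()

print(or_rows)
print(not_and_rows)
```

Rows where the predicate evaluates to unknown, such as a = 0 with b NULL, are excluded by both forms, so the NULL cases do not break the equivalence.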
SQL Server rewrites all queries before optimizing, and most likely both queries will be the same after rewriting.
You can examine their execution plans in SSMS: just hit Ctrl+L. Most likely they will be the same.
Also run the following:
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
and rerun your queries - you should see identical real execution costs.
Ideally OR should be faster in this case, because over n rows, if a=1 is already found to be true then the second condition need not be tested. Also, there is no inverse operator (NOT) involved.
For AND to be true, however, SQL has to test both conditions, so over n rows there are 2n conditions evaluated, whereas with OR the number of conditions evaluated will always be fewer than 2n. Plus the second form has an additional operator to evaluate.
However, if one of a or b is indexed, the query execution plan may differ, because indexed column comparisons involve intersect and union join operations over the individual comparison result sets.
It would also be wrong to consider OR a slow operator in general. In complex queries with joins over multiple tables, OR could be a big problem, as mentioned by another contributor in this question, but for a smaller query OR should be fine. In fact, every query has its own challenges: performance depends not only on what's documented in the help file, but also on how your data is distributed, how often values repeat, and how much they vary.