SELECT a, b FROM products WHERE (a = 1 OR b = 2)
or...
SELECT a, b FROM products WHERE NOT (a != 1 AND b != 2)
Both statements should return the same results. However, the second one avoids the infamously slow OR operator in SQL. Does that make the second statement faster?
Traditionally the latter was easier for the optimiser to deal with, in that it could readily resolve an AND into a sarg (a "search argument"), which (loosely speaking) is a predicate that can be resolved using an index.
Historically, query optimisers could not resolve OR statements to sargs, so queries using OR predicates could not make effective use of indexes. Thus the recommendation was to avoid OR and re-cast the query in terms like the latter example. More recent optimisers are better at recognising OR statements that are amenable to this transform, but complex OR statements may still confuse them, resulting in unnecessary table scans.
This is the origin of the "OR is slow" meme. The performance difference has nothing to do with the efficiency of processing the expression itself, but rather with the optimiser's ability to recognise opportunities to make use of indexes.
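A quick way to see this for yourself, as a sketch (assuming you are free to add indexes on a and b to the products table from the question), is to compare the actual plans and IO statistics of the two forms:
CREATE INDEX IX_products_a ON products (a);
CREATE INDEX IX_products_b ON products (b);
SET STATISTICS IO ON;
SELECT a, b FROM products WHERE (a = 1 OR b = 2);
SELECT a, b FROM products WHERE NOT (a != 1 AND b != 2);
On a modern optimiser both will usually produce the same plan; if the OR version falls back to a scan while the indexes exist, you are seeing the sargability issue described above.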
No; NOT (a != 1 AND b != 2) is identical to a = 1 OR b = 2 (it is just De Morgan's law applied to the predicate).
The query optimizer will use the same query plan for both, at least in any marginally sophisticated implementation of SQL.
There are no inherently slow or fast operators in SQL. When you issue a query, you describe the results you want. If two semantically identical queries (especially simple ones like this) yield very different run times, your SQL implementation is not very clever.
SQL Server rewrites all queries before optimizing, and most likely both queries will be the same after rewriting.
You can examine their execution plans in SSMS; just hit Ctrl+L, and most likely they will be the same.
Also run the following:
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
and rerun your queries - you should see identical real execution costs.
In principle OR should be cheaper in this case, because for each row, once a = 1 is found to be true the second condition need not be tested. There is also no inverse operator (NOT) involved.
For the AND form to come out true, however, SQL has to test both conditions, so over n rows up to 2n conditions are evaluated, whereas with OR the number of conditions evaluated will always be at most 2n and often fewer. Plus there is an additional NOT operator to evaluate.
However, if either a or b is indexed, the execution plans may differ, because comparisons on indexed columns can be resolved as index operations whose individual result sets are then combined with intersect and union operations.
Also, it would be wrong to consider OR a slow operator in general. In complex queries with joins over multiple tables, OR can become a big problem, as mentioned by another contributor to this question, but for a smaller query like this OR should be fine. In fact every query has its own challenges; performance depends not only on what is documented in the help file, but also on how your data is distributed, its repetition and its variance.
So SQL Server does not short-circuit in the explicit, guaranteed way that, for example, if-statements do in general-purpose programming languages.
So consider the following mock-up query:
SELECT * FROM someTable
WHERE name = 'someValue' OR name in (*some extremely expensive nested sub-query only needed to cover 0.01% of cases*)
Let's say there are only 3 rows in the table and all of them match name = 'someValue'. Will the expensive sub-query ever run?
Let's say there are 3 million rows, all but one of which match name = 'someValue', while the remaining row can only be fetched with the sub-query. Will the sub-query be evaluated for the rows where it is not needed?
If one has a similar real case, one might be ok with letting the 0.01% wait for the expensive sub-query to run before getting the results as long as the results are fetched quickly without the sub-query for the 99.99% of cases.
(I know that my specific example above could be handled explicitly with IF-statements in an SP, as suggested in this related thread:
Sql short circuit OR or conditional exists in where clause
but let's assume that is not an option.)
As the comments point out, the optimizer in SQL Server is pretty smart.
You could attempt the short-circuiting by using case. As the documentation states:
The CASE expression evaluates its conditions sequentially and stops with the first condition whose condition is satisfied.
Note that there are some exceptions involving aggregation. So, you could do:
SELECT t.*
FROM someTable t
WHERE 'true' = (CASE WHEN t.name = 'someValue' THEN 'true'
                     WHEN t.name IN (*some extremely expensive nested sub-query only needed to cover 0.01% of cases*) THEN 'true'
                END)
This type of enforced ordering is generally considered a bad idea. One exception is when one of the paths might involve an error (such as a type conversion error); however, that is generally fixed nowadays with the TRY_ functions.
In your case, I suspect that replacing the IN with EXISTS and using appropriate indexes might eliminate almost all the performance penalty of the subquery. However, that is a different matter.
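For what it's worth, a sketch of the EXISTS form (expensiveTable and someKey are placeholder names standing in for the elided sub-query):
SELECT t.*
FROM someTable t
WHERE t.name = 'someValue'
   OR EXISTS (SELECT 1
              FROM expensiveTable e
              WHERE e.someKey = t.name);
With an index on the correlated column, the probe should be cheap for the rows that actually need it.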
I have a query that looks something like this:
select xmlelement("rootNode",
         (case
            when XH.ID is not null then
              xmlelement("xhID", XH.ID)
            else
              xmlelement("xhID", xmlattributes('true' AS "xsi:nil"), XH.ID)
          end),
         (case
            when XH.SER_NUM is not null then
              xmlelement("serialNumber", XH.SER_NUM)
            else
              xmlelement("serialNumber", xmlattributes('true' AS "xsi:nil"), XH.SER_NUM)
          end)
         /*repeat this pattern for many more columns from the same table...*/
       )
FROM XH
WHERE XH.ID = 'SOMETHINGOROTHER'
It's ugly and I don't like it, and it is also the slowest executing query (there are others of similar form, but much smaller and they aren't causing any major problems - yet). Maintenance is relatively easy as this is mostly a generated query, but my concern now is for performance. I am wondering how much of an overhead there is for all of these case expressions.
To see if there was any difference, I wrote another version of this query as:
select xmlelement("rootNode",
xmlforest(XH.ID, XH.SER_NUM,...
(I know that this query does not produce exactly the same thing; my plan was to move the logic for handling the renaming and the xsi:nil attribute into XSL, or maybe into PL/SQL.)
I tried to get execution plans for both versions, but they are the same. I'm guessing that the logic does not get factored into the execution plan. My gut tells me the second version should execute faster, but I'd like some way to prove that (other than writing a PL/SQL test function with timing statements before and after the query and running that code over and over again to get a test sample).
Is it possible to get a good idea of how much the case-when will cost?
Also, I could write the CASE expressions using the DECODE function instead. Would that perform better?
Just about anything in your SELECT list, unless it is a user-defined function which reads a table or view, or a nested subselect, can usually be neglected for the purpose of analyzing your query's performance.
Open your connection properties and turn SET STATISTICS IO on. Check how many reads are happening. View the query plan. Are your indexes being used properly? Do you know how to analyse the plan to see?
For the purposes of performance tuning you are dealing with this statement:
SELECT *
FROM XH
WHERE XH.ID = 'SOMETHINGOROTHER'
How does that query perform? If it returns in markedly less time than the XML version then you need to consider the performance of the functions, but I would be astonished if that were the case (oh ho!).
Does this return one row or several? If one row then you have only two things to work with:
is XH.ID indexed and, if so, is the index being used?
does the "many more columns from the same table" indicate a problem with chained rows?
If the query returns several rows then ... Well, actually you have the same two things to work with. It's just the emphasis is different with regards to indexes. If the index has a very poor clustering factor then it could be faster to avoid using the index in favour of a full table scan.
Beyond that you would need to look at physical problems - I/O bottlenecks, poor interconnects, a dodgy disk. The reason why your scope for tuning the query is so restricted is because - as presented - it is a single table, single column read. Most tuning is about efficient joining. Now if XH transpires to be a view over a complex query then it is a different matter.
You can use good old tkprof to analyse the statistics; enable stats gathering with one of the many forms of ALTER SESSION. The DBMS_PROFILER package also gathers statistics if your cursor is inside a PL/SQL code block.
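As a minimal sketch (the trace-file identifier and file names are illustrative): trace the session, run the query under test, then format the trace with tkprof from the OS shell:
ALTER SESSION SET TRACEFILE_IDENTIFIER = 'xml_case_test';
ALTER SESSION SET SQL_TRACE = TRUE;
SELECT XH.ID, XH.SER_NUM FROM XH WHERE XH.ID = 'SOMETHINGOROTHER';
ALTER SESSION SET SQL_TRACE = FALSE;
-- then, from the OS shell:
-- tkprof <your_trace_file>.trc report.txt sys=no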
I was reading over the documentation for query hints:
http://msdn.microsoft.com/en-us/library/ms181714(SQL.90).aspx
And noticed this:
FAST number_rows
Specifies that the query is optimized for fast retrieval of the first number_rows. This is a nonnegative integer. After the first number_rows are returned, the query continues execution and produces its full result set.
So when I'm doing a query like:
Select Name from Students where ID = 444
Should I bother with a hint like this? Assuming SQL Server 2005, when should I?
-- edit --
Also should one bother when limiting results:
Select top 10 * from Students OPTION (FAST 10)
The FAST hint only makes sense on complex queries where there are multiple alternatives the optimizer could choose from. For a simple query like your example it doesn't help with anything, the query optimizer will immediately determine that there is a trivial plan (seek in ID index, lookup Name if not covering) to satisfy the query and go for it. Even if no index exists on ID, the plan is still trivial (probably clustered scan).
To give an example where FAST would be useful, consider a join between A and B with an ORDER BY constraint. Say that evaluating B first and nested-looping into A honours the ORDER BY constraint, so it produces the first results quickly (no SORT necessary), but is more costly overall because of cardinality (B has many records that match the WHERE, while A has few). On the other hand, evaluating A first and nested-looping into B would produce a query that does less IO and is therefore faster overall, but the result has to be sorted, and the SORT can only start after the join is evaluated, so the first result arrives very late. The optimizer would normally pick the second plan because it is more efficient overall. The FAST hint would cause the optimizer to pick the first plan, because it produces results faster.
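For illustration, a sketch of how the hint is applied to such a join (table and column names are hypothetical); the hint asks the optimizer to prefer whichever plan streams the first 10 ordered rows soonest, even if its total cost is higher:
SELECT a.ID, a.OrderDate, b.Amount
FROM A AS a
INNER JOIN B AS b
    ON b.A_ID = a.ID
WHERE b.Amount > 0
ORDER BY a.OrderDate
OPTION (FAST 10);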
When using TOP x, there's no benefit of also using OPTION FAST x. The query optimizer already makes its decisions based on how many rows you are retrieving. Same goes for trivial queries, such as querying for a particular value from a unique index.
Other than that, OPTION FAST x could help when you know the number of results is likely below x, but the query optimizer does not. Of course, if the query optimizer is choosing poor paths for complex queries with few results, your statistics may need to be updated. And if you guess wrong on x, the query may end up taking longer--almost always a risk when giving hints.
The above statement has not been tested--it may be that all queries take just as long to fully execute, if not longer. Getting the first 10 rows fast is great if there are only 8 rows, but theoretically the query still has to execute fully before finishing. The benefit I'm thinking may be there because the query execution takes a different path expecting fewer total records, when in fact it's really trying to get the first x faster. Those two types of optimizations may not be in alignment.
For that particular query, certainly not! It's only going to return one row — the row with ID = 444. SQL Server will select that row as efficiently as it can.
FAST 10 might be used in a situation where you could make use of the first 10 rows immediately, even as you continue to wait for further results.
I’ve just found out that the execution plan performance between the following two select statements are massively different:
select * from your_large_table
where LEFT(some_string_field, 4) = '2505'
select * from your_large_table
where some_string_field like '2505%'
The relative costs shown in the execution plans are 98% and 2% respectively. Bit of a difference in speed, then. I was actually shocked when I saw it.
I've always done LEFT(xxx) = 'yyy' as it reads well.
I actually found this out by checking the LINQ-generated SQL against my hand-crafted SQL. I assumed LIKE would be slower, but it is in fact much, much faster.
My question is: why is LEFT() slower than LIKE '2505%'? They are, after all, logically identical?
Also, is there a CPU hit by using LEFT()?
More generally speaking, you should never apply a function to a column on the left-hand side of a comparison in a WHERE clause. If you do, SQL won't use an index--it has to evaluate the function for every row of the table. The goal is to make sure that your WHERE clause is "sargable".
Some other examples:
Bad: Select ... WHERE isNull(FullName,'') = 'Ed Jones'
Fixed: Select ... WHERE ((FullName = 'Ed Jones') OR (FullName IS NULL))
Bad: Select ... WHERE SUBSTRING(DealerName,1,4) = 'Ford'
Fixed: Select ... WHERE DealerName Like 'Ford%'
Bad: Select ... WHERE DateDiff(mm,OrderDate,GetDate()) >= 30
Fixed: Select ... WHERE OrderDate < DateAdd(mm,-30,GetDate())
Bad: Select ... WHERE Year(OrderDate) = 2003
Fixed: Select ... WHERE OrderDate >= '2003-1-1' AND OrderDate < '2004-1-1'
It looks like the expression LEFT(some_string_field, 4) is evaluated for every row of a full table scan, while the "like" expression will use the index.
Optimizing "like" to use an index if it is a front-anchored pattern is a much easier optimization than analyzing arbitrary expressions involving string functions.
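To see this for yourself, a sketch (assuming you are free to add a plain index on the column; compare the two plans side by side):
CREATE INDEX IX_large_table_str ON your_large_table (some_string_field);
SELECT * FROM your_large_table WHERE LEFT(some_string_field, 4) = '2505';   -- typically a scan
SELECT * FROM your_large_table WHERE some_string_field LIKE '2505%';        -- can seek on the prefix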
There's a huge impact from using function calls in WHERE clauses, as SQL Server must calculate the result for each row. LIKE, on the other hand, is a built-in language feature which is highly optimized.
If you use a function on a column with an index then the db no longer uses the index (at least with Oracle anyway)
So I am guessing that your example field 'some_string_field' has an index on it which doesn't get used for the query with 'LEFT'
Why do you say they are identical? They might solve the same problem, but their approach is different. At least it seems like that...
The query using LEFT could, in principle, optimise the test, since it already knows the length of the prefix; in a C/C++ program, or without an index, an algorithm using LEFT to implement this particular LIKE behaviour would be the fastest. But in contrast to most non-declarative languages, on a SQL database a lot of optimisations are done for you. For example, LIKE is probably implemented by first looking for the % sign, and if the % turns out to be the last character in the string, the query can be optimised in much the same way as you did with LEFT, but directly using an index.
So, indeed, I think you were right after all; they probably are identical in their approach. The only difference is that the db server can use an index for the query using LIKE, because there is no function transforming the column value into something unknown in the WHERE clause.
What happened here is either that the RDBMS is not capable of using an index on the LEFT() predicate while it is capable of using one on the LIKE, or it simply made the wrong call about which would be the more appropriate access method.
Firstly, it may be true for some RDBMSs that applying a function to a column prevents an index-based access method from being used, but that is not a universal truth, nor is there any logical reason why it needs to be. An index-based access method (such as Oracle's full index scan or fast full index scan) might be beneficial but in some cases the RDBMS is not capable of the operation in the context of a function-based predicate.
Secondly, the optimiser may simply get the arithmetic wrong in estimating the benefits of the different available access methods. Assuming that the system can perform an index-based access method, it first has to estimate the number of rows that will match the predicate, either from statistics on the table, statistics on the column, by sampling the data at parse time, or by using a heuristic rule (e.g. "assume 5% of rows will match"). Then it has to assess the relative costs of a full table scan and the available index-based methods. Sometimes it will get the arithmetic wrong, sometimes the statistics will be misleading or inaccurate, and sometimes the heuristic rules will not be appropriate for the data set.
The key point is to be aware of a number of issues:
What operations can your RDBMS support?
What would be the most appropriate operation in the case you are working with?
Is the system's choice correct?
What can be done to either allow the system to perform a more efficient operation (eg. add a missing not null constraint, update the statistics etc)?
In my experience this is not a trivial task, and is often best left to experts. Or on the other hand, just post the problem to Stackoverflow -- some of us find this stuff fascinating, dog help us.
As #BradC mentioned, you shouldn't use functions in a WHERE clause if you have indexes and want to take advantage of them.
If you read the section entitled "Use LIKE instead of LEFT() or SUBSTRING() in WHERE clauses when Indexes are present" from these SQL Performance Tips, there are more examples.
It also hints at questions you'll encounter on the MCSE SQL Server 2012 exams if you're interested in taking those too. :-)
My system does some pretty heavy processing, and I've been attacking the performance in order to give me the ability to run more test runs in shorter times.
I have quite a few cases where a UDF has to get called on say, 5 million rows (and I pretty much thought there was no way around it).
Well, it turns out, there is a way to work around it and it gives huge performance improvements when UDFs are called over a set of distinct parameters somewhat smaller than the total set of rows.
Consider a UDF that takes a set of inputs and returns a result based on complex logic. If, across the 5 million rows, there are only, say, 100,000 distinct input combinations, it will only ever produce 100,000 distinct result tuples (my particular cases vary from interest rates to complex code assignments, but they are all discrete; the fundamental point with this technique is that you can determine whether the trick will work simply by running the SELECT DISTINCT).
I found that by doing something like this:
INSERT INTO PreCalcs
SELECT p.param1
      ,p.param2
      ,dbo.udf_result(p.param1, p.param2) AS result
FROM (
    SELECT DISTINCT param1, param2 FROM big_table
) AS p  -- the derived table needs an alias
When PreCalcs is suitably indexed, the combination of that with:
SELECT big_table.param1
      ,big_table.param2
      ,PreCalcs.result
FROM big_table
INNER JOIN PreCalcs
    ON PreCalcs.param1 = big_table.param1
   AND PreCalcs.param2 = big_table.param2
You get a HUGE boost in performance. Apparently, just because something is deterministic, it doesn't mean SQL Server is caching the past calls and re-using them, as one might think.
The only thing you have to watch out for is where NULLs are allowed; then you need to construct your joins carefully:
SELECT big_table.param1
      ,big_table.param2
      ,PreCalcs.result
FROM big_table
INNER JOIN PreCalcs
    ON (PreCalcs.param1 = big_table.param1
        OR COALESCE(PreCalcs.param1, big_table.param1) IS NULL)
   AND (PreCalcs.param2 = big_table.param2
        OR COALESCE(PreCalcs.param2, big_table.param2) IS NULL)
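Not part of the original trick, but as an alternative spelling some prefer in SQL Server, the NULL-safe equality can also be written with the EXISTS/INTERSECT idiom, which the optimizer generally handles well:
SELECT big_table.param1
      ,big_table.param2
      ,PreCalcs.result
FROM big_table
INNER JOIN PreCalcs
    ON EXISTS (SELECT big_table.param1, big_table.param2
               INTERSECT
               SELECT PreCalcs.param1, PreCalcs.param2)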
Hope this helps; any similar tricks with UDFs, or for refactoring queries for performance, are welcome.
I guess the question is, why is manual caching like this necessary - isn't that the point of the server knowing that the function is deterministic? And if it makes such a big difference, and if UDFs are so expensive, why doesn't the optimizer just do it in the execution plan?
Yes, the optimizer will not memoize UDF calls for you. Your trick is very nice in the cases where you can collapse the output set down in this way.
Another technique that can improve performance if your UDF's parameters are indices into other tables, and the UDF selects values from those tables to calculate the scalar result, is to rewrite your scalar UDF as a table-valued UDF that selects the result value over all your potential parameters.
I've used this approach when the tables the UDF query was based on were subject to a lot of inserts and updates, the query involved was relatively complex, and the number of rows the original UDF had to be applied to was large. You can achieve a great improvement in performance in this case, as the table-valued UDF only needs to be run once and can run as an optimised set-oriented query.
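A hedged sketch of that idea (all names hypothetical, and the arithmetic is just a stand-in for the real logic): an inline table-valued function computes the result for every distinct parameter pair in one set-oriented pass, and is then joined back to the main table:
CREATE FUNCTION dbo.tvf_results ()
RETURNS TABLE
AS
RETURN
    SELECT p.param1
          ,p.param2
          ,p.param1 * 0.01 + p.param2 AS result   -- stand-in for the real complex logic
    FROM (SELECT DISTINCT param1, param2 FROM dbo.big_table) AS p;
GO
SELECT b.param1, b.param2, r.result
FROM dbo.big_table AS b
INNER JOIN dbo.tvf_results() AS r
    ON r.param1 = b.param1
   AND r.param2 = b.param2;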
How would SQL Server know that you have 100,000 discrete combinations within 5 million rows?
By using the PreCalcs table, you are simply running the UDF over 100k rows rather than 5 million rows, before expanding back out again.
No optimiser in existence would be able to divine this useful information.
The scalar udf is a black box.
For a more practical solution, I'd use a computed, persisted column that does the UDF call.
That way it's available in all queries and can be indexed/included.
This suits OLTP more, maybe... I query a table to get trading cash and positions in real time in many different ways, so this approach suits me, avoiding the UDF math overhead every time.
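A minimal sketch of that approach (names hypothetical; the UDF body is a stand-in). For the column to be PERSISTED, the UDF must be deterministic and created WITH SCHEMABINDING:
CREATE FUNCTION dbo.udf_result (@param1 int, @param2 int)
RETURNS decimal(18, 6)
WITH SCHEMABINDING
AS
BEGIN
    RETURN @param1 * 0.01 + @param2;   -- stand-in for the real complex logic
END;
GO
ALTER TABLE dbo.big_table
    ADD result AS dbo.udf_result(param1, param2) PERSISTED;
-- The persisted value is computed on write rather than on every read, and can be indexed:
CREATE INDEX IX_big_table_result ON dbo.big_table (result);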