Surprising SQL speed increase

I've just found out that the execution plan performance of the following two select statements is massively different:
select * from your_large_table
where LEFT(some_string_field, 4) = '2505'
select * from your_large_table
where some_string_field like '2505%'
The execution plan costs are 98% and 2% of the batch respectively. Bit of a difference in speed, then. I was actually shocked when I saw it.
I've always done LEFT(xxx) = 'yyy' as it reads well.
I actually found this out by checking the LINQ-generated SQL against my hand-crafted SQL. I assumed the LIKE would be slower, but it is in fact much, much faster.
My question is: why is LEFT() slower than LIKE '2505%'? They are, after all, identical?
Also, is there a CPU hit by using LEFT()?

More generally speaking, you should never apply a function to the column on the left side of a comparison in your WHERE clause. If you do, SQL won't use an index--it has to evaluate the function for every row of the table. The goal is to make sure that your WHERE clause is "sargable" (search-argument-able).
Some other examples:
Bad: Select ... WHERE isNull(FullName,'') = 'Ed Jones'
Fixed: Select ... WHERE ((FullName = 'Ed Jones') OR (FullName IS NULL))
Bad: Select ... WHERE SUBSTRING(DealerName, 1, 4) = 'Ford'
Fixed: Select ... WHERE DealerName Like 'Ford%'
Bad: Select ... WHERE DateDiff(mm,OrderDate,GetDate()) >= 30
Fixed: Select ... WHERE OrderDate < DateAdd(mm,-30,GetDate())
Bad: Select ... WHERE Year(OrderDate) = 2003
Fixed: Select ... WHERE OrderDate >= '2003-1-1' AND OrderDate < '2004-1-1'
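To see the difference on the question's own table, here's a minimal sketch (SQL Server syntax; the index name is hypothetical, and whether a seek actually appears depends on your data and existing indexes):

CREATE INDEX IX_some_string_field ON your_large_table (some_string_field);

SET STATISTICS IO ON;

SELECT * FROM your_large_table
WHERE LEFT(some_string_field, 4) = '2505';  -- non-sargable: the function hides the prefix, forcing a scan

SELECT * FROM your_large_table
WHERE some_string_field LIKE '2505%';       -- sargable: the optimizer can seek on the index for the prefix range

Comparing the logical reads reported for the two statements (and the actual execution plans) makes the scan-versus-seek difference concrete.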

It looks like the expression LEFT(some_string_field, 4) is evaluated for every row of a full table scan, while the "like" expression will use the index.
Optimizing "like" to use an index if it is a front-anchored pattern is a much easier optimization than analyzing arbitrary expressions involving string functions.

There's a huge impact from using function calls in WHERE clauses, as SQL Server must calculate the result for each row. LIKE, on the other hand, is a built-in language feature that is highly optimized.

If you use a function on an indexed column, then the database no longer uses the index (at least with Oracle, anyway).
So I am guessing that your example field 'some_string_field' has an index on it which doesn't get used for the query with 'LEFT'.

Why do you say they are identical? They might solve the same problem, but their approach is different. At least it seems like that...
The query using LEFT optimizes the test, since it already knows about the length of the prefix and so on, so in a C/C++ program, or without an index, an algorithm using LEFT to implement a certain LIKE behavior would be the fastest. But in contrast to most non-declarative languages, on a SQL database a lot of optimizations are done for you. For example, LIKE is probably implemented by first looking for the % sign; if the % turns out to be the last character in the string, the query can be optimized in much the same way as you did using LEFT, but directly using an index.
So, indeed, I think you were right after all: they probably are identical in their approach. The only difference is that the DB server can use an index in the query using LIKE, because there is no function transforming the column value into something unknown in the WHERE clause.

What happened here is that either the RDBMS is not capable of using an index for the LEFT() predicate while it can use one for the LIKE, or it simply made the wrong call about which would be the more appropriate access method.
Firstly, it may be true for some RDBMSs that applying a function to a column prevents an index-based access method from being used, but that is not a universal truth, nor is there any logical reason why it needs to be. An index-based access method (such as Oracle's full index scan or fast full index scan) might be beneficial but in some cases the RDBMS is not capable of the operation in the context of a function-based predicate.
Secondly, the optimiser may simply get the arithmetic wrong in estimating the benefits of the different available access methods. Assuming that the system can perform an index-based access method, it first has to estimate the number of rows that will match the predicate, either from statistics on the table, statistics on the column, by sampling the data at parse time, or by using a heuristic rule (eg. "assume 5% of rows will match"). Then it has to assess the relative costs of a full table scan and the available index-based methods. Sometimes it will get the arithmetic wrong, sometimes the statistics will be misleading or inaccurate, and sometimes the heuristic rules will not be appropriate for the data set.
The key point is to be aware of a number of issues:
What operations can your RDBMS support?
What would be the most appropriate operation in the case you are working with?
Is the system's choice correct?
What can be done to either allow the system to perform a more efficient operation (eg. add a missing not null constraint, update the statistics etc)?
In my experience this is not a trivial task, and it is often best left to experts. Or, on the other hand, just post the problem to Stack Overflow -- some of us find this stuff fascinating, dog help us.

As @BradC mentioned, you shouldn't use functions in a WHERE clause if you have indexes and want to take advantage of them.
If you read the section entitled "Use LIKE instead of LEFT() or SUBSTRING() in WHERE clauses when Indexes are present" from these SQL Performance Tips, you'll find more examples.
It also hints at questions you'll encounter on the MCSE SQL Server 2012 exams if you're interested in taking those too. :-)

SQL Server uses scan instead of seek when using a window function and predicate contains a variable

Take a look at this fiddle: http://sqlfiddle.com/#!6/18324/2
Expand the very first execution plan, for the queries against view B.
Notice that the first query executes using an index seek, while the second one uses an index scan. In my real setup, with thousands of rows, this produces a performance hit that is quite considerable.
WTF???
The queries are equivalent, aren't they? Why does a literal produce a seek and a variable a scan?
But more importantly: how can I work around this?
This post comes closest to the problem, and the solution that works from there is using option(recompile) (thank you, Martin Smith). However, that does not work for me, because my queries are being generated by my ORM library (which is Entity Framework) and I cannot amend them manually.
Rather, what I'm looking for is a way to reformulate the B view so that the problem does not occur.
While fiddling with this problem, I have noticed that it is always the "Segment" block in the execution plan that loses the predicate. To verify this, I reformulated the query in terms of a subquery with min function (see view D). And voila! - both queries against the D view produce identical plans.
The bad news, however, is that I cannot use this min-powered trick, because in my real setup, the column Y is actually several columns, so that I can order by them, but I cannot take a min() of them.
So the second question would be: can anyone come up with a trick that is similar to min-powered subquery, but works for several columns?
NOTE 1: this is definitely not related to the tipping point, because there are just 2 records in the table.
NOTE 2: it also doesn't have to do with the presence of a view. See an example with view C: the server is happily using seek in that case.
Maybe this one will work:
select a.X, a.Y from A a
cross apply
(select top 1 * from A t where t.X = a.X order by t.Y asc) as idx
SQLFiddle http://sqlfiddle.com/#!6/a3362/2
The queries do produce equivalent output, but in the eyes of the SQL optimizer they are different. The article recommends looking at the OPTION clause (unfortunately not included before SQL 2005).
You can Brew Your Own Query on top of the Entity Framework, which may be your best bet to achieve the desired performance.
Here is my own answer.
Ultimately, I used the min-powered trick, and I got around the fact that Y is actually several columns by converting those columns into constant-length string representations (carefully tuned for sorting) and joining those strings together into one string. After that is done, I am able to use that joined string as the argument for min().
I would still like to know the proper way of doing this. If anyone happens to know it, I'd appreciate it.
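For reference, a rough sketch of that workaround, using the table A from the cross apply answer above but with two hypothetical ordering columns Y1 and Y2 (assumed here to be non-negative integers) standing in for Y:

SELECT a.X,
       MIN(RIGHT(REPLICATE('0', 10) + CAST(a.Y1 AS varchar(10)), 10)
         + RIGHT(REPLICATE('0', 10) + CAST(a.Y2 AS varchar(10)), 10)) AS first_sort_key
FROM A a
GROUP BY a.X;
-- each column is zero-padded to a fixed width so that string ordering matches numeric ordering,
-- then the padded pieces are concatenated into a single value that MIN() can operate on

The padding widths and the handling of NULLs or negative values would need to be tuned to the real columns, as noted above.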

Measuring the complexity of SQL statements

The complexity of methods in most programming languages can be measured in cyclomatic complexity with static source code analyzers. Is there a similar metric for measuring the complexity of a SQL query?
It is simple enough to measure the time it takes a query to return, but what if I just want to be able to quantify how complicated a query is?
[Edit/Note]
While getting the execution plan is useful, that is not necessarily what I am trying to identify in this case. I am not looking for how difficult it is for the server to execute the query, I am looking for a metric that identifies how difficult it was for the developer to write the query, and how likely it is to contain a defect.
[Edit/Note 2]
Admittedly, there are times when measuring complexity is not useful, but there are also times when it is. For a further discussion on that topic, see this question.
Common measures of software complexity include Cyclomatic Complexity (a measure of how complicated the control flow is) and Halstead complexity (a measure of how complex the arithmetic is).
The "control flow" in a SQL query is best related to the "and" and "or" operators in the query.
The "computational complexity" is best related to operators such as SUM or implicit JOINS.
Once you've decided how to categorize each unit of syntax of a SQL query as to whether it is "control flow" or "computation", you can straightforwardly compute Cyclomatic or Halstead measures.
What the SQL optimizer does to queries is, I think, absolutely irrelevant. The purpose of complexity measures is to characterize how hard it is for a person to understand the query, not how efficiently it can be evaluated.
Similarly, what the DDL says, or whether views are involved or not, shouldn't be included in such complexity measures. The assumption behind these metrics is that the complexity of the machinery inside a used abstraction isn't interesting when you simply invoke it, because presumably that abstraction does something well understood by the coder. This is why Halstead and Cyclomatic measures don't include called subroutines in their counting, and I think you can make a good case that views and DDL information are those "invoked" abstractions.
Finally, how perfectly right or how perfectly wrong these complexity numbers are doesn't matter much, as long as they reflect some truth about complexity and you can compare them relative to one another. That way you can choose which SQL fragments are the most complex, sort them accordingly, and focus your testing attention on the most complicated ones.
I'm not sure the retrieval of the query plans will answer the question: the query plans hide a part of the complexity of the computation performed on the data before it is returned (or used in a filter), and they require a sizeable database to be relevant. In fact, complexity and length of execution are somewhat opposite; something like "Good, Fast, Cheap - Pick any two".
Ultimately it's about the chances of making a mistake, or not understanding the code I've written?
Something like:
number of tables times (1
+1 per join expression (+1 per outer join?)
+1 per predicate after WHERE or HAVING
+1 per GROUP BY expression
+1 per UNION or INTERSECT
+1 per function call
+1 per CASE expression
)
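As a rough illustration of counting such elements, here is a hedged T-SQL sketch that scores a query string by keyword occurrences (the query text and weights are hypothetical, and the counting is crude: it ignores comments, string literals and the collation's case sensitivity):

DECLARE @q nvarchar(max) = N'
SELECT c.name, COUNT(*)
FROM customers c
JOIN orders o ON o.customer_id = c.id
WHERE o.total > 100
GROUP BY c.name';

-- occurrences of a token = (length - length with token removed) / token length
SELECT
    (LEN(@q) - LEN(REPLACE(@q, 'JOIN', '')))     / LEN('JOIN')     AS join_count,
    (LEN(@q) - LEN(REPLACE(@q, 'WHERE', '')))    / LEN('WHERE')    AS where_count,
    (LEN(@q) - LEN(REPLACE(@q, 'GROUP BY', ''))) / LEN('GROUP BY') AS group_by_count,
    (LEN(@q) - LEN(REPLACE(@q, 'CASE', '')))     / LEN('CASE')     AS case_count;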
Please feel free to try my script, which gives an overview of the stored procedure size, the number of object dependencies and the number of parameters:
Calculate TSQL Stored Procedure Complexity
SQL queries are declarative rather than procedural: they don't specify how to accomplish their goal. The SQL engine will create a procedural plan of attack, and that might be a good place to look for complexity. Try examining the output of the EXPLAIN (or EXPLAIN PLAN) statement, it will be a crude description of the steps the engine will use to execute your query.
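For instance, a hedged sketch of what that looks like (the table and column names are invented; EXPLAIN is PostgreSQL/MySQL-style syntax, with an Oracle-style equivalent shown in comments):

EXPLAIN
SELECT c.name, SUM(o.total)
FROM customers c
JOIN orders o ON o.customer_id = c.id
GROUP BY c.name;

-- Oracle-style equivalent:
-- EXPLAIN PLAN FOR <the query above>;
-- SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);

The number and kind of steps in the resulting plan (scans, joins, sorts) give one crude notion of how much work the engine thinks the query involves.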
Well, I don't know of any tool that does such a thing, but it seems to me that what would make a query more complicated would be measured by:
the number of joins
the number of where conditions
the number of functions
the number of subqueries
the number of casts to different datatypes
the number of case statements
the number of loops or cursors
the number of steps in a transaction
However, while it is true that the more complex queries might appear to be the ones with the most possible defects, I find that the simple ones are very likely to contain defects as well, as they are more likely to be written by someone who doesn't understand the data model, and thus they may appear to work correctly but in fact return the wrong data. So I'm not sure such a metric would tell you much.
In the absence of any tools that will do this, a pragmatic approach would be to ensure that the queries being analysed are consistently formatted and to then count the lines of code.
Alternatively use the size of the queries in bytes when saved to file (being careful that all queries are saved using the same character encoding).
Not brilliant but a reasonable proxy for complexity in the absence of anything else I think.
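If the queries live in stored procedures, a hedged SQL Server sketch of that line-count proxy could query the catalog directly (which modules you include and what thresholds you apply are up to you):

SELECT OBJECT_NAME(object_id) AS module_name,
       LEN(definition) AS size_in_chars,
       LEN(definition) - LEN(REPLACE(definition, CHAR(10), '')) + 1 AS approx_lines
FROM sys.sql_modules
WHERE definition IS NOT NULL
ORDER BY approx_lines DESC;
-- approx_lines counts line-feed characters, so consistent formatting matters, as noted above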
In programming languages we have several methods to compute time complexity or space complexity.
We could do something similar with SQL: in a procedure you can count lines and loops much as in a programming language, but unlike a program, whose cost usually depends only on its input, in SQL the cost also depends on the data in the tables/views being operated on, plus the overhead of the query itself.
For example, a simple row-by-row query:
Select * from table;
-- this depends entirely on the number of records, say n, hence O(n)
Select max(input) from table;
-- here max adds an extra overhead to each row, so roughly t * O(n), where t is the cost of evaluating max
Here is an idea for a simple algorithm to compute a complexity score related to readability of the query:
Apply a simple lexer to the query (like the ones used for syntax coloring in text editors or here on SO) to split the query into tokens and give each token a class:
SQL keywords
SQL function names
string literals with character escapes
string literals without character escape
string literals which are dates or date+time
numeric literals
comma
parenthesis
SQL comments (--, /* ... */)
quoted user words
non quoted user words: everything else
Give a score to each token, using different weights for each class (and different weights for individual SQL keywords).
Add the scores of each token.
Done.
This should work quite well since, for example, counting subqueries amounts to counting the number of SELECT and FROM keywords.
By using this algorithm with different weight tables you can even measure complexity along different dimensions: for example, to make a nuanced comparison between queries, or to score higher those queries which use keywords or functions specific to a particular SQL engine (e.g. GROUP_CONCAT on MySQL).
The algorithm can also be tweaked to take into account the case of SQL keywords (increase complexity if they are not consistently upper case), or to account for indentation (carriage returns, position of keywords on a line).
Note: I have been inspired by @redcalx's answer that suggested applying a standard formatter and counting lines of code. My solution is simpler, however, as it doesn't need to build a full AST (abstract syntax tree).
Toad has a built-in feature for measuring McCabe cyclomatic complexity on SQL:
https://blog.toadworld.com/what-is-mccabe-cyclomatic-complexity
Well, if you're using SQL Server, I would say that you should look at the cost of the query in the execution plan (specifically the subtree cost).
Here is a link that goes over some of the things you should look at in the execution plan.
Depending on your RDBMS, there might be query plan tools that can help you analyze the steps the RDBMS will take in fetching your query.
SQL Server Management Studio Express has a built-in query execution plan. Pervasive PSQL has its Query Plan Finder. DB2 has similar tools (forgot what they're called).
A good question. The problem is that for a SQL query like:
SELECT * FROM foo;
the complexity may depend on what "foo" is and on the database implementation. For a function like:
int f( int n ) {
    if ( n == 42 ) {
        return 0;
    }
    else {
        return n;
    }
}
there is no such dependency.
However, I think it should be possible to come up with some useful metrics for a SELECT, even if they are not very exact, and I'll be interested to see what answers this gets.
It's reasonable enough to consider the complexity to be what it would be if you coded the query yourself.
If the table has N rows then,
A simple SELECT would be O(N)
An ORDER BY is O(N log N)
A JOIN is O(N*M)
A DROP TABLE is O(1)
A SELECT DISTINCT is O(N^2)
A Query1 NOT IN/IN Query2 would be O( O1(N) * O2(N) )

SQL efficiency - [=] vs [in] vs [like] vs [matches]

Just out of curiosity, I was wondering if there are any speed/efficiency differences between using [=] versus [in] versus [like] versus [matches] (for only 1 value) in SQL.
select field from table where field = value;
versus
select field from table where field in (value);
versus
select field from table where field like value;
versus
select field from table where field matches value;
I will also add EXISTS and subqueries to that list.
But the performance depends on the optimizer of the given SQL engine.
In Oracle you have a lot of differences between IN and EXISTS, but not necessarily in SQL Server.
The other thing that you have to consider is the selectivity of the column that you use. Some cases show that IN is better.
But you have to remember that IN is non-sargable (not search-argument-able), so it will not use the index to resolve the query, while LIKE and = are sargable and can use the index.
The best? You should spend some time testing them in your environment.
It depends on the underlying SQL engine. In MS SQL, for example (according to the query planner output), IN clauses are converted to =, so there would be no difference.
normally the "in" statement is used when there are several values to be compared. The engine walks the list for each value to see if one matches. if there is only one element then there is no difference in time vs the "=" statement.
the "like" expression is different in that is uses pattern matching to find the correct values, and as such requires a bit more work in the back end. For a single value, there wouldn't be a significant time difference because you only have one possible match, and the comparison would be the same type of comparison that would occur for "=" and "in".
basically, no, or at least the difference is so insignificant that you wouldn't notice.
The best practice for any question about what would be faster, is to measure. SQL engines are notoriously difficult to predict. You can look at the output of EXPLAIN PLAN to get a sense of it, but in the end, only measuring the performance on real data will tell you what you need to know.
In theory, a SQL engine could implement all three of these exactly the same, but they may not.
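For example, a hedged SQL Server sketch of such a measurement (the table and column names are placeholders; MATCHES is omitted because it is not T-SQL):

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT field FROM some_table WHERE field = 'value';
SELECT field FROM some_table WHERE field IN ('value');
SELECT field FROM some_table WHERE field LIKE 'value';

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;

Comparing the reads and elapsed times reported for each statement (in the Messages tab in SSMS), together with the actual plans, shows whether the engine really treats the forms differently on your data.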

Cost of logic in a query

I have a query that looks something like this:
select xmlelement("rootNode",
(case
when XH.ID is not null then
xmlelement("xhID", XH.ID)
else
xmlelement("xhID", xmlattributes('true' AS "xsi:nil"), XH.ID)
end),
(case
when XH.SER_NUM is not null then
xmlelement("serialNumber", XH.SER_NUM)
else
xmlelement("serialNumber", xmlattributes('true' AS "xsi:nil"), XH.SER_NUM)
end),
/*repeat this pattern for many more columns from the same table...*/
FROM XH
WHERE XH.ID = 'SOMETHINGOROTHER'
It's ugly and I don't like it, and it is also the slowest executing query (there are others of similar form, but much smaller and they aren't causing any major problems - yet). Maintenance is relatively easy as this is mostly a generated query, but my concern now is for performance. I am wondering how much of an overhead there is for all of these case expressions.
To see if there was any difference, I wrote another version of this query as:
select xmlelement("rootNode",
xmlforest(XH.ID, XH.SER_NUM,...
(I know that this query does not produce exactly the same thing; my plan was to move the logic for handling the renaming and the xsi:nil attribute to XSL, or maybe to PL/SQL.)
I tried to get execution plans for both versions, but they are the same. I'm guessing that the logic does not get factored into the execution plan. My gut tells me the second version should execute faster, but I'd like some way to prove that (other than writing a PL/SQL test function with timing statements before and after the query and running that code over and over again to get a test sample).
Is it possible to get a good idea of how much the case-when will cost?
Also, I could write the case-when using the decode function instead. Would that perform better (than case-statements)?
Just about anything in your SELECT list, unless it is a user-defined function which reads a table or view, or a nested subselect, can usually be neglected for the purpose of analyzing your query's performance.
Open your connection properties and set the value SET STATISTICS IO on. Check out how many reads are happening. View the query plan. Are your indexes being used properly? Do you know how to analyze the plan to see?
For the purposes of performance tuning you are dealing with this statement:
SELECT *
FROM XH
WHERE XH.ID = 'SOMETHINGOROTHER'
How does that query perform? If it returns in markedly less time than the XML version, then you need to consider the performance of the functions, but I would be astonished if that were the case (oh ho!).
Does this return one row or several? If one row then you have only two things to work with:
is XH.ID indexed and, if so, is the index being used?
does the "many more columns from the same table" indicate a problem with chained rows?
If the query returns several rows then ... Well, actually you have the same two things to work with. It's just the emphasis is different with regards to indexes. If the index has a very poor clustering factor then it could be faster to avoid using the index in favour of a full table scan.
Beyond that you would need to look at physical problems - I/O bottlenecks, poor interconnects, a dodgy disk. The reason why your scope for tuning the query is so restricted is because - as presented - it is a single table, single column read. Most tuning is about efficient joining. Now if XH transpires to be a view over a complex query then it is a different matter.
You can use good old tkprof to analyze the statistics gathered via one of the many forms of ALTER SESSION that turn on stats gathering. The DBMS_PROFILER package also gathers statistics if your cursor is in a PL/SQL code block.
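A rough sketch of the tkprof route (Oracle; trace file names and locations depend on your server configuration):

ALTER SESSION SET tracefile_identifier = 'xml_query_test';
ALTER SESSION SET sql_trace = TRUE;

-- run the XML query (and the xmlforest variant) here

ALTER SESSION SET sql_trace = FALSE;
-- then, on the database server: tkprof <your_trace_file>.trc report.txt sys=no

The tkprof report breaks down parse/execute/fetch times and row counts per statement, which is a more reliable comparison than eyeballing identical execution plans.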

Does the way you write SQL queries affect performance?

Say I have a table:
Id int
Region int
Name nvarchar
select * from table1 where region = 1 and name = 'test'
select * from table1 where name = 'test' and region = 1
Will there be a difference in performance?
Assume no indexes.
Is it the same with LINQ?
Because your qualifiers are, in essence, the same (it doesn't matter what order the WHERE predicates are in), no, there's no difference between those.
As for LINQ, you will need to know what query LINQ to SQL actually emits (you can use a SQL Profiler to find out). Sometimes the query will be the simplest query you can think of, sometimes it will be a convoluted variety of such without you realizing it, because of things like dependencies on FKs or other such constraints. LINQ also wouldn't use * for the select.
The only real way to know is to find out the SQL Server Query Execution plan of both queries. To read more on the topic, go here:
SQL Server Query Execution Plan Analysis
Should it? No. SQL is a relational algebra and the DBMS should optimize irrespective of order within the statement.
Does it? Possibly. Some DBMS' may store data in a certain order (e.g., maintain a key of some sort) despite what they've been told. But, and here's the crux: you cannot rely on it.
You may need to switch DBMS' at some point in the future. Even a later version of the same DBMS may change its behavior. The only thing you should be relying on is what's in the SQL standard.
Regarding the query given: with no indexes or primary key on the two fields in question, you should assume that you'll need a full table scan for both cases. Hence they should run at the same speed.
I don't recommend SELECT *, because the engine has to look up the table schema before executing the query. Instead, list only the fields you want, to avoid unnecessary overhead.
And yes, the engine optimizes your queries, but help it along with that. :)
Best regards!
For simple queries, likely there is little or no difference, but yes indeed the way you write a query can have a huge impact on performance.
In SQL Server (performance issues are very database specific), a correlated subquery will usually have poor performance compared to doing the same thing in a join to a derived table.
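A hypothetical illustration of that rewrite (the table and column names are invented for the example):

-- correlated subquery: the inner query is evaluated per outer row
SELECT c.CustomerID,
       (SELECT COUNT(*) FROM Orders o WHERE o.CustomerID = c.CustomerID) AS OrderCount
FROM Customers c;

-- the same result via a join to a derived table, which often performs better in SQL Server
SELECT c.CustomerID, ISNULL(o.OrderCount, 0) AS OrderCount
FROM Customers c
LEFT JOIN (SELECT CustomerID, COUNT(*) AS OrderCount
           FROM Orders
           GROUP BY CustomerID) o ON o.CustomerID = c.CustomerID;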
Other things in a query that can affect performance include using SARGable1 WHERE clauses instead of non-SARGable ones, selecting only the fields you need and never using select * (especially not when doing a join, as at least one field is repeated), using a set-based query instead of a cursor, avoiding a wildcard as the first character in a LIKE clause, and on and on. There are very large books that devote chapters to more efficient ways of writing queries.
1 "SARGable", for those that don't know, are stage 1 predicates in DB2 parlance (and possibly other DBMS'). Stage 1 predicates are more efficient since they're parts of indexes and DB2 uses those first.