Indexing does not work with LIKE and NOT operator in Sql Server, Is it a MYTH - sql

I have seen in many articles on Sql Server which states that when we are writing a query we should avoid using NOT and LIKE operator in our query because indexing is not applied on it.
eg.
SELECT [CityId]
,[CityName]
,[StateId]
FROM [LIMS].[dbo].[City]
WHERE CityId NOT IN(1, 2)
I executed above query and found that indexing was getting used to filter the records.
Following is the execution plan, which clearly shows Clustured Index seek. This clearly violates what I used to think and read.
Was my previous understanding incorrect ?

That indexing does not work with LIKE and NOT operators is just a rule of thumb. SQL Server (or any competent RDBMS, for that matter) will use the best algorithm it can in most cases. So, if you could manually seek an index, so could SQL Server.
In the particular example you provided, it is unclear whether a seek is any more efficient than a scan because most of the records are going to be returned anyhow. So, I wouldn't read much into that particular execution plan.
Bottom line: Learn and understand how your database system internally organizes data and indices so you won't have to rely on rules of thumb.

Related

Does SQLite have an eval command?

I have a query referenced in Why is SQLite refusing to use available indexes when adding a JOIN? that is a compound query. When the segments of the query are evaluated individually, the query plan generated applies the relevant indicies and runs smoothly. However, when run together (via a JOIN) it fails to do so. Therefore, I was wondering if there was a way to create a query that runs 'eval' on the subquery and passes that to the outer query to force SQLite to use the query plans that would have been generated had they been done individually.
The answer to your other question tells you why already: indexes are not used when they're not useful.
In essence:
If it's cheapest to hop back and forth on disk pages to fetch a handful of rows that match a query, an index gets used.
If it's cheapest to just read the entire mess and filter out uneeded rows, an index is not used.
Some databases (e.g. Postgres) offer an intermediary level between the two in the form of a bitmap index scan: it amounts to the second with a pre-flight check based on the index, to avoid visiting disk pages that contain no matching rows.
That's all there is to it, really: a few rows, index; lots of rows, no index.
Naturally, poorly written queries don't use indexes either, but that's for different reasons: they just confuse the query planner, and while smart the latter is not all-knowing. Joining on a union or an aggregate, in particular, are a prime recipe for not using indexes. (And that is what you are doing.)
Per usual you should write your queries and indexes that way so that Sqlite's query optimizer recognizes the optimal indexes and just uses them.
But as your question in this case is more specific it seems you look for an equivalent of SQL Server's FORCE(INDEX) clause.
As I have read about it in Sqlite there is the clause INDEXED BY, though it seems Sqlite's community's opinions about it are split (probably because of what I mentioned in my first sentence)
link 1 sqlite.org's documentation about it
link 2 for a tutorial on that

How to rewrite SQL queries who give better performance in MySQL

I have a project where all SQL queries was already written. Most of them can be improved, and I want to do that.
But how I can check and analyze the both queries [pre-written and I write same queries using different way].
I want to analyze the queries if I rewrite them then I can check if I did right or if old query is better than mine.
I need a queries analyzer tool for Mysql based RDBMS.
[if any mistake goes then edit it]
MySQL has an EXPLAIN statement, which tells you what execution path the engine will follow. You can use that to determine what changes to make:
EXPLAIN SELECT foo,bar from glurch WHERE baz > 1 ORDER BY foo;
See: http://dev.mysql.com/doc/refman/5.0/en/explain-output.html
Pay special attention to the rows column, which shows how many table rows need to be examined to execute that part of the query, as well as the Extra column, which show more importantly, how the query gets executed.
For example, if you see "Using temporary" in the Extra column, that usually means that the DB will need to write the query results out to a temporary table (possibly on disk), sort them, and then re-read them (or at least some of them).
You can avoid temporary tables and other nasty performance killers by providing appropriate indexes. And not just enough indexes, but rather the correct indexes. Just because you've indexed a column doesn't mean that index can be used in your query.
When you have a good RDMBS engine, the query parser should determine the most performant execution plan for your query, no matter how it is written.
I mean: the order of the tables in your from and join clauses, or the order of your filter-criteria should not matter.
The first thing to do, is look at the execution plan and analyze it. Check wether you've to add indexes, or modify indexes for instance.
Your question is also a bit vague. Can you give a sample of a query which you'd like to refactor ?

Cost of logic in a query

I have a query that looks something like this:
select xmlelement("rootNode",
(case
when XH.ID is not null then
xmlelement("xhID", XH.ID)
else
xmlelement("xhID", xmlattributes('true' AS "xsi:nil"), XH.ID)
end),
(case
when XH.SER_NUM is not null then
xmlelement("serialNumber", XH.SER_NUM)
else
xmlelement("serialNumber", xmlattributes('true' AS "xsi:nil"), XH.SER_NUM)
end),
/*repeat this pattern for many more columns from the same table...*/
FROM XH
WHERE XH.ID = 'SOMETHINGOROTHER'
It's ugly and I don't like it, and it is also the slowest executing query (there are others of similar form, but much smaller and they aren't causing any major problems - yet). Maintenance is relatively easy as this is mostly a generated query, but my concern now is for performance. I am wondering how much of an overhead there is for all of these case expressions.
To see if there was any difference, I wrote another version of this query as:
select xmlelement("rootNode",
xmlforest(XH.ID, XH.SER_NUM,...
(I know that this query does not produce exactly the same, thing, my plan was to move the logic for handling the renaming and xsi:nil attribute to XSL or maybe to PL/SQL)
I tried to get execution plans for both versions, but they are the same. I'm guessing that the logic does not get factored into the execution plan. My gut tells me the second version should execute faster, but I'd like some way to prove that (other than writing a PL/SQL test function with timing statements before and after the query and running that code over and over again to get a test sample).
Is it possible to get a good idea of how much the case-when will cost?
Also, I could write the case-when using the decode function instead. Would that perform better (than case-statements)?
Just about anything in your SELECT list, unless it is a user-defined function which reads a table or view, or a nested subselect, can usually be neglected for the purpose of analyzing your query's performance.
Open your connection properties and set the value SET STATISTICS IO on. Check out how many reads are happening. View the query plan. Are your indexes being used properly? Do you know how to analyze the plan to see?
For the purposes of performance tuning you are dealing with this statement:
SELECT *
FROM XH
WHERE XH.ID = 'SOMETHINGOROTHER'
How does that query perform? If it returns in markedly less time than the XML version then you need to consider the performance of the functions, but I would astonished if that were the case (oh ho!).
Does this return one row or several? If one row then you have only two things to work with:
is XH.ID indexed and, if so, is the index being used?
does the "many more columns from the same table" indicate a problem with chained rows?
If the query returns several rows then ... Well, actually you have the same two things to work with. It's just the emphasis is different with regards to indexes. If the index has a very poor clustering factor then it could be faster to avoid using the index in favour of a full table scan.
Beyond that you would need to look at physical problems - I/O bottlenecks, poor interconnects, a dodgy disk. The reason why your scope for tuning the query is so restricted is because - as presented - it is a single table, single column read. Most tuning is about efficient joining. Now if XH transpires to be a view over a complex query then it is a different matter.
You can use good old tkprof to analyze statistics. One of the many forms of ALTER SESSION that turn on stats gathering. The DBMS_PROFILER package also gathers statistics if your cursor is in a PL/SQL code block.

Surprising SQL speed increase

I’ve just found out that the execution plan performance between the following two select statements are massively different:
select * from your_large_table
where LEFT(some_string_field, 4) = '2505'
select * from your_large_table
where some_string_field like '2505%'
The execution plans are 98% and 2% respectively. Bit of a difference in speed then. I was actually shocked when I saw it.
I've always done LEFT(xxx) = 'yyy' as it reads well.
I actually found this out by checking the LINQ generated SQL against my hand crafted SQL. I assumed the LIKE command would be slower, but is in fact much much faster.
My question is why is the LEFT() slower than the LIKE '%..'. They are afterall identical?
Also, is there a CPU hit by using LEFT()?
More generally speaking, you should never use a function on the LEFT side of a WHERE clause in a query. If you do, SQL won't use an index--it has to evaluate the function for every row of the table. The goal is to make sure that your where clause is "Sargable"
Some other examples:
Bad: Select ... WHERE isNull(FullName,'') = 'Ed Jones'
Fixed: Select ... WHERE ((FullName = 'Ed Jones') OR (FullName IS NULL))
Bad: Select ... WHERE SUBSTRING(DealerName,4) = 'Ford'
Fixed: Select ... WHERE DealerName Like 'Ford%'
Bad: Select ... WHERE DateDiff(mm,OrderDate,GetDate()) >= 30
Fixed: Select ... WHERE OrderDate < DateAdd(mm,-30,GetDate())
Bad: Select ... WHERE Year(OrderDate) = 2003
Fixed: Select ... WHERE OrderDate >= '2003-1-1' AND OrderDate < '2004-1-1'
It looks like the expression LEFT(some_string_field, 4) is evaluated for every row of a full table scan, while the "like" expression will use the index.
Optimizing "like" to use an index if it is a front-anchored pattern is a much easier optimization than analyzing arbitrary expressions involving string functions.
There's a huge impact on using function calls in where clauses as SQL Server must calculate the result for each row. On the other hand, like is a built in language feature which is highly optimized.
If you use a function on a column with an index then the db no longer uses the index (at least with Oracle anyway)
So I am guessing that your example field 'some_string_field' has an index on it which doesn't get used for the query with 'LEFT'
Why do you say they are identical? They might solve the same problem, but their approach is different. At least it seems like that...
The query using LEFT optimizes the test, since it already knows about the length of the prefix and etc., so in a C/C++/... program or without an index, an algorithm using LEFT to implement a certain LIKE behavior would be the fastest. But contrasted to most non-declarative languages, on a SQL database, a lot op optimizations are done for you. For example LIKE is probably implemented by first looking for the % sign and if it is noticed that the % is the last char in the string, the query can be optimized much in the same way as you did using LEFT, but directly using an index.
So, indeed I think you were right after all, they probably are identical in their approach. The only difference being that the db server can use an index in the query using LIKE because there is not a function transforming the column value to something unknown in the WHERE clause.
What happened here is either that the RDBMS is not capable of using an index on the LEFT() predicate and is capable of using it on the LIKE, or it simply made the wrong call in which would be the more appropriate access method.
Firstly, it may be true for some RDBMSs that applying a function to a column prevents an index-based access method from being used, but that is not a universal truth, nor is there any logical reason why it needs to be. An index-based access method (such as Oracle's full index scan or fast full index scan) might be beneficial but in some cases the RDBMS is not capable of the operation in the context of a function-based predicate.
Secondly, the optimiser may simply get the arithmetic wrong in estimating the benefits of the different available access methods. Assuming that the system can perform an index-based access method it has first to make an estimate of the number of rows that will match the predicate, either from statistics on the table, statistics on the column, by sampling the data at parse time, or be using a heuristic rule (eg. "assume 5% of rows will match"). Then it has to assess the relative costs of a full table scan or the available index-based methods. Sometimes it will get the arithmetic wrong, sometimes the statistics will be misleading or innaccurate, and sometimes the heuristic rules will not be appropriate for the data set.
The key point is to be aware of a number of issues:
What operations can your RDBMS support?
What would be the most appropriate operation in the
case you are working with?
Is the system's choice correct?
What can be done to either allow the system to perform a more efficient operation (eg. add a missing not null constraint, update the statistics etc)?
In my experience this is not a trivial task, and is often best left to experts. Or on the other hand, just post the problem to Stackoverflow -- some of us find this stuff fascinating, dog help us.
As #BradC mentioned, you shouldn't use functions in a WHERE clause if you have indexes and want to take advantage of them.
If you read the section entitled "Use LIKE instead of LEFT() or SUBSTRING() in WHERE clauses when Indexes are present" from these SQL Performance Tips, there are more examples.
It also hints at questions you'll encounter on the MCSE SQL Server 2012 exams if you're interested in taking those too. :-)

How do you interpret a query's explain plan?

When attempting to understand how a SQL statement is executing, it is sometimes recommended to look at the explain plan. What is the process one should go through in interpreting (making sense) of an explain plan? What should stand out as, "Oh, this is working splendidly?" versus "Oh no, that's not right."
I shudder whenever I see comments that full tablescans are bad and index access is good. Full table scans, index range scans, fast full index scans, nested loops, merge join, hash joins etc. are simply access mechanisms that must be understood by the analyst and combined with a knowledge of the database structure and the purpose of a query in order to reach any meaningful conclusion.
A full scan is simply the most efficient way of reading a large proportion of the blocks of a data segment (a table or a table (sub)partition), and, while it often can indicate a performance problem, that is only in the context of whether it is an efficient mechanism for achieving the goals of the query. Speaking as a data warehouse and BI guy, my number one warning flag for performance is an index based access method and a nested loop.
So, for the mechanism of how to read an explain plan the Oracle documentation is a good guide: http://download.oracle.com/docs/cd/B28359_01/server.111/b28274/ex_plan.htm#PFGRF009
Have a good read through the Performance Tuning Guide also.
Also have a google for "cardinality feedback", a technique in which an explain plan can be used to compare the estimations of cardinality at various stages in a query with the actual cardinalities experienced during the execution. Wolfgang Breitling is the author of the method, I believe.
So, bottom line: understand the access mechanisms. Understand the database. Understand the intention of the query. Avoid rules of thumb.
This subject is too big to answer in a question like this. You should take some time to read Oracle's Performance Tuning Guide
The two examples below show a FULL scan and a FAST scan using an INDEX.
It's best to concentrate on your Cost and Cardinality. Looking at the examples the use of the index reduces the Cost of running the query.
It's a bit more complicated (and i don't have a 100% handle on it) but basically the Cost is a function of CPU and IO cost, and the Cardinality is the number of rows Oracle expects to parse. Reducing both of these is a good thing.
Don't forget that the Cost of a query can be influenced by your query and the Oracle optimiser model (eg: COST, CHOOSE etc) and how often you run your statistics.
Example 1:
SCAN http://docs.google.com/a/shanghainetwork.org/File?id=dd8xj6nh_7fj3cr8dx_b
Example 2 using Indexes:
INDEX http://docs.google.com/a/fukuoka-now.com/File?id=dd8xj6nh_9fhsqvxcp_b
And as already suggested, watch out for TABLE SCAN. You can generally avoid these.
Looking for things like sequential scans can be somewhat useful, but the reality is in the numbers... except when the numbers are just estimates! What is usually far more useful than looking at a query plan is looking at the actual execution. In Postgres, this is the difference between EXPLAIN and EXPLAIN ANALYZE. EXPLAIN ANALYZE actually executes the query, and gets real timing information for every node. That lets you see what's actually happening, instead of what the planner thinks will happen. Many times you'll find that a sequential scan isn't an issue at all, instead it's something else in the query.
The other key is identifying what the actual expensive step is. Many graphical tools will use different sized arrows to indicate how much different parts of the plan cost. In that case, just look for steps that have thin arrows coming in and a thick arrow leaving. If you're not using a GUI you'll need to eyeball the numbers and look for where they suddenly get much larger. With a little practice it becomes fairly easy to pick out the problem areas.
Really for issues like these, the best thing to do is ASKTOM. In particular his answer to that question contains links to the online Oracle doc, where a lot of the those sorts of rules are explained.
One thing to keep in mind, is that explain plans are really best guesses.
It would be a good idea to learn to use sqlplus, and experiment with the AUTOTRACE command. With some hard numbers, you can generally make better decisions.
But you should ASKTOM. He knows all about it :)
The output of the explain tells you how long each step has taken. The first thing is to find the steps that have taken a long time and understand what they mean. Things like a sequential scan tell you that you need better indexes - it is mostly a matter of research into your particular database and experience.
One "Oh no, that's not right" is often in the form of a table scan. Table scans don't utilize any special indexes and can contribute to purging of every useful in memory caches. In postgreSQL, for example, you will find it looks like this.
Seq Scan on my_table (cost=0.00..15558.92 rows=620092 width=78)
Sometimes table scans are ideal over, say, using an index to query the rows. However, this is one of those red-flag patterns that you seem to be looking for.
Basically, you take a look at each operation and see if the operations "make sense" given your knowledge of how it should be able to work.
For example, if you're joining two tables, A and B on their respective columns C and D (A.C=B.D), and your plan shows a clustered index scan (SQL Server term -- not sure of the oracle term) on table A, then a nested loop join to a series of clustered index seeks on table B, you might think there was a problem. In that scenario, you might expect the engine to do a pair of index scans (over the indexes on the joined columns) followed by a merge join. Further investigation might reveal bad statistics making the optimizer choose that join pattern, or an index that doesn't actually exist.
look at the percentage of time spent in each subsection of the plan, and consider what the engine is doing. for example, if it is scanning a table, consider putting an index on the field(s) that is is scanning for
I mainly look for index or table scans. This usually tells me I'm missing an index on an important column that's in the where statement or join statement.
From http://www.sql-server-performance.com/tips/query_execution_plan_analysis_p1.aspx:
If you see any of the following in an
execution plan, you should consider
them warning signs and investigate
them for potential performance
problems. Each of them are less than
ideal from a performance perspective.
* Index or table scans: May indicate a need for better or additional indexes.
* Bookmark Lookups: Consider changing the current clustered index,
consider using a covering index, limit
the number of columns in the SELECT
statement.
* Filter: Remove any functions in the WHERE clause, don't include wiews
in your Transact-SQL code, may need
additional indexes.
* Sort: Does the data really need to be sorted? Can an index be used to
avoid sorting? Can sorting be done at
the client more efficiently?
It is not always possible to avoid
these, but the more you can avoid
them, the faster query performance
will be.
Rules of Thumb
(you probably want to read up on the details too:
Oracle Docs
ASKTOM
SQL Server Docs
)
Bad
Table Scans of Several Large Tables
Good
Using a unique index
Index includes all required fields
Most Common Win
In about 90% of performance problems I have seen, the easiest win is to break up a query with lots (4 or more) of tables into 2 smaller queries and a temporary table.