During an interview I was asked a weird question. I could not find the correct answer for it, so I am posting it below:
I have an index on a column Stud_Name, and I am searching for names using a wildcard. My queries are:
a) select * from Stud_Details where Stud_Name like 'A%'
b) select * from Stud_Details where Stud_Name like '%A'.
c) select * from Stud_Details where Stud_Name not like 'A%'
In which case would SQL Server use the index that I have created on Stud_Name?
PS: If this question seems idiotic, don't get mad at me; get mad at the interviewer who asked it.
Also, I don't have any info regarding how the index was created; the info above is all I have.
In what cases can SQL Server use an Index on Stud_Name?
Option (a) is the only one that can use an index seek: LIKE 'A%' can be converted to a range seek on >= 'A' AND < 'B'.
Option (b) can't use an index seek as the leading wildcard prevents this. It could still scan an index though.
Option (c) could in theory be converted to two range seeks (< 'A' OR >= 'B'), but I've just checked and SQL Server does not do that (even in cases where this would eliminate 100% of the table, and even with a FORCESEEK hint). Again, it can scan an index though.
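For illustration, here is roughly the rewrite the engine performs for option (a); the equivalence holds under typical collations (this is a sketch, not the optimizer's literal output):

SELECT * FROM Stud_Details WHERE Stud_Name LIKE 'A%'
-- is seekable because, for matching purposes, it behaves like:
SELECT * FROM Stud_Details WHERE Stud_Name >= 'A' AND Stud_Name < 'B'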
In what cases will SQL Server use an Index on Stud_Name?
This depends on cardinality estimates, on whether the index is covering or not, and on the relative width of the index rows vs. the base table rows.
If the index is not covering, then any rows found that match the WHERE clause will need lookups to retrieve the remaining column values. The greater the number of estimated lookups, the less likely the non-covering index is to be used.
For (b) and (c), the choice is index scan + lookups vs. table scan with no lookups. An index scan is more favourable if the index is much narrower than the table; if they are of similar size, there is not much IO benefit in reading the index rather than the table in the first place.
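To make "covering" concrete, a covering index for the queries above might look like the following sketch; the INCLUDE list is an assumption, since SELECT * needs every column of the (unknown) table:

-- Hypothetical covering index: Stud_Name is the key, and the remaining
-- columns the query selects are carried as included (leaf-only) columns,
-- so no lookups into the base table are needed.
CREATE NONCLUSTERED INDEX IX_Stud_Details_Name
    ON Stud_Details (Stud_Name)
    INCLUDE (Stud_Id, Stud_Dept)  -- assumed column names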
This question skirts around what I'm wondering, but the answers don't exactly address it.
It would seem that in general '=' is faster than 'like' when using wildcards. This appears to be the conventional wisdom. However, let's suppose I have a column containing a limited number of different fixed, hardcoded, varchar identifiers, and I want to select all rows matching one of them:
select * from table where value like 'abc%'
and
select * from table where value = 'abcdefghijklmn'
'Like' should only need to test the first three chars to find a match, whereas '=' must compare the entire string. In this case it would seem to me that 'like' would have an advantage, all other things being equal.
This is intended as a general, academic question, so it should not matter which DB, but it arose using SQL Server 2005.
See https://web.archive.org/web/20150209022016/http://myitforum.com/cs2/blogs/jnelson/archive/2007/11/16/108354.aspx
Quote from there:
the rules for index usage with LIKE are loosely like this:
1. If your filter criteria uses equals (=) and the field is indexed, then most likely it will use an INDEX/CLUSTERED INDEX SEEK.
2. If your filter criteria uses LIKE with no wildcards (as when a web report parameter COULD have a % but you instead pass the full string), it is about as likely as #1 to use the index. The increased cost is almost nothing.
3. If your filter criteria uses LIKE with a wildcard at the beginning (as in Name0 LIKE '%UTER'), it's much less likely to use the index, but it may still at least perform an INDEX SCAN on a full or partial range of the index.
4. HOWEVER, if your filter criteria uses LIKE but starts with a STRING FIRST and has wildcards somewhere AFTER that (as in Name0 LIKE 'COMP%ER'), then SQL may just use an INDEX SEEK to quickly find rows that have the same first starting characters, and then look through those rows for an exact match.
(Also keep in mind that the SQL engine still might not use an index the way you're expecting, depending on what else is going on in your query and what tables you're joining to. The SQL engine reserves the right to rewrite your query a little to get the data in the way it thinks is most efficient, and that may include an INDEX SCAN instead of an INDEX SEEK.)
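To make the quoted rules concrete, here is a sketch against the first question's table (the literal values are made up, and the plan shapes describe the typical outcome, not a guarantee):

-- 1. Equality: almost always an index seek.
SELECT * FROM Stud_Details WHERE Stud_Name = 'COMPUTER'

-- 2. LIKE with no wildcard: behaves almost exactly like equality.
SELECT * FROM Stud_Details WHERE Stud_Name LIKE 'COMPUTER'

-- 3. Leading wildcard: at best an index scan, never a seek.
SELECT * FROM Stud_Details WHERE Stud_Name LIKE '%UTER'

-- 4. Prefix plus embedded wildcard: a seek on the 'COMP' prefix,
--    with a residual predicate checking the full pattern.
SELECT * FROM Stud_Details WHERE Stud_Name LIKE 'COMP%ER'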
It's a measurable difference.
Run the following:
Create Table #TempTester (id int, col1 varchar(20), value varchar(20))
go
INSERT INTO #TempTester (id, col1, value)
VALUES
(1, 'this is #1', 'abcdefghij')
GO
INSERT INTO #TempTester (id, col1, value)
VALUES
(2, 'this is #2', 'foob'),
(3, 'this is #3', 'abdefghic'),
(4, 'this is #4', 'other'),
(5, 'this is #5', 'zyx'),
(6, 'this is #6', 'zyx'),
(7, 'this is #7', 'zyx'),
(8, 'this is #8', 'klm'),
(9, 'this is #9', 'klm'),
(10, 'this is #10', 'zyx')
GO 10000
CREATE CLUSTERED INDEX ixId ON #TempTester(id)
CREATE NONCLUSTERED INDEX ixTesting ON #TempTester(value)
Then:
SET SHOWPLAN_XML ON
GO
Then:
SELECT * FROM #TempTester WHERE value LIKE 'abc%'
SELECT * FROM #TempTester WHERE value = 'abcdefghij'
The resulting execution plan shows that the first operation, the LIKE comparison, is about 10 times more expensive than the = comparison.
If you can use an = comparison, please do so.
You should also keep in mind that when using LIKE, some SQL flavors will ignore indexes, and that will kill performance. This is especially true if you don't use a "starts with" pattern like the one in your example.
You should really look at the execution plan for the query and see what it's doing; guess as little as possible.
That being said, the "starts with" pattern can be and is optimized in SQL Server: it will use the table's index. EF 4.0 switched to LIKE for StartsWith for this very reason.
If value is unindexed, both result in a table scan. The performance difference in this scenario will be negligible.
If value is indexed, as Daniel points out in his comment, the = will result in an index lookup, which is O(log N). The LIKE will (most likely, depending on how selective it is) result in a partial scan of the index range >= 'abc' and < 'abd', which will require more effort than the =.
Note that I'm talking about SQL Server here; not all DBMSs treat LIKE this kindly.
You are asking the wrong question. In databases it is not the operator's performance that matters; it is always the SARGability of the expression, and the coverability of the overall query. The performance of the operator itself is largely irrelevant.
So, how do LIKE and = compare in terms of SARGability? LIKE, when used with an expression that does not start with a constant (e.g. LIKE '%something'), is by definition non-SARGable. But does that make = or LIKE 'something%' SARGable? No. As with any question about SQL performance, the answer does not lie in the text of the query, but in the schema deployed. These expressions may be SARGable if an index exists to satisfy them.
So, truth be told, there are small differences between = and LIKE. But asking whether one operator or the other is 'faster' in SQL is like asking 'What goes faster, a red car or a blue car?'. You should be asking questions about engine size and vehicle weight, not about the color... To approach questions about optimizing relational tables, the place to look is your indexes and the expressions in your WHERE clause (and other clauses, but it usually starts with the WHERE).
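A few illustrative predicates (the table and column names here are invented for the example):

-- SARGable: the indexed column stands alone on one side of the
-- comparison, so an index on OrderDate or CustomerName can be seeked.
SELECT OrderId FROM Orders
WHERE OrderDate >= '20110101' AND OrderDate < '20110201'

SELECT OrderId FROM Orders
WHERE CustomerName LIKE 'Smi%'

-- Non-SARGable: a function call (or a leading wildcard) hides the
-- stored value, so the engine must scan and evaluate every row.
SELECT OrderId FROM Orders
WHERE YEAR(OrderDate) = 2011

SELECT OrderId FROM Orders
WHERE CustomerName LIKE '%mith'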
A personal example using MySQL 5.5: I had an inner join between two tables, one of 3 million rows and one of 10 thousand rows.
When using LIKE on an indexed column as below (no wildcards), it took about 30 seconds:
where login like '12345678'
Using 'explain' I get:
When using an '=' on the same query, it took about 0.1 seconds:
where login ='12345678'
Using 'explain' I get:
As you can see, the LIKE completely prevented the index seek, so the query took 300 times longer.
= is much faster than LIKE, even without a wildcard. I tested on MySQL with 11GB of data and more than 100 million records; the f_time column is indexed.
SELECT * FROM XXXXX WHERE f_time = '1621442261'
#took 0.00 sec and returned 330 records
SELECT * FROM XXXXX WHERE f_time LIKE '1621442261'
#took 44.71 sec and returned 330 records
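One plausible explanation for this gap (an assumption, since the schema is not shown): if f_time is a numeric column, = can compare natively and seek the index, while LIKE is a string operator, so MySQL must convert every stored value to a string before pattern matching, defeating the index:

-- Assuming f_time is an indexed integer column (hypothetical schema):
-- '=' casts the constant once and seeks the index.
SELECT * FROM XXXXX WHERE f_time = 1621442261

-- LIKE forces a per-row conversion of f_time to a string before
-- matching, so the index on the numeric value cannot be used.
SELECT * FROM XXXXX WHERE f_time LIKE '1621442261'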
Besides all the answers, there is this to consider: LIKE matching is case-insensitive (under the usual case-insensitive collations), so every character may need to be compared twice, whereas = only compares once for identical characters.
This issue arises with or without indexes.
Maybe you are looking for Full-Text Search.
In contrast to full-text search, the LIKE Transact-SQL predicate works on character patterns only. Also, you cannot use the LIKE predicate to query formatted binary data. Furthermore, a LIKE query against a large amount of unstructured text data is much slower than an equivalent full-text query against the same data. A LIKE query against millions of rows of text data can take minutes to return; whereas a full-text query can take only seconds or less against the same data, depending on the number of rows that are returned.
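For reference, a minimal SQL Server full-text setup might look like this sketch (the catalog, table, and index names are hypothetical; the table needs a unique key index to pass to KEY INDEX):

-- Hypothetical full-text setup for a dbo.Documents table whose unique
-- key index is PK_Documents.
CREATE FULLTEXT CATALOG DocsCatalog

CREATE FULLTEXT INDEX ON dbo.Documents (Body)
    KEY INDEX PK_Documents
    ON DocsCatalog

-- Word-based search: uses the full-text index instead of scanning
-- every row the way LIKE '%foo%' would.
SELECT DocId FROM dbo.Documents WHERE CONTAINS(Body, 'foo')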
I was working with a huge database that has more than 400M records, and I put LIKE in the search query. Here are the final results.
There were three tables: tb1, tb2 and tb3. When I used EQUAL (=) in the queries on all tables, the response time was 193 ms. When I put LIKE on one of the tables, the response time was 19.22 sec, and with LIKE on all tables it was 112 sec.
I have a few tables where I need to get the data related to foo. The size of the tables is about 10^8 rows.
So I need to get all rows from these tables where the column includes the substring 'foo'.
select * from bar where my_col like '%foo%';
I know this is slow, so I check the possible values:
select distinct my_col from bar where my_col like '%foo%';
-- => ('xx_foo', 'yy_foo', 'xx_foo_xx', 'foo' ... 'xx_foo_yy')
The number of possible values varies between 3 and 20.
Now how slow is '%foo%' really?
select * from bar where my_col like '%foo%';
-- or
select * from bar where my_col in('foo', 'xx_foo' ... 'foo_yy'); -- list_size = 20
Is there any general rule on when to use which, or is testing the speed for the different cases the only way to go?
Edit: I do not own the table, and no index exists on the column my_col, so it needs to do a full table scan no matter what.
If you use %foo%, you will get a full table scan, which is slow.
If you use IN with a list of values, then an index can be used, if one exists on the column on which you have the condition.
So, if you are able, you should avoid using %foo%. Depending on how often new values may appear in the table, you might consider keeping an extra table holding the distinct values, using it when querying your main table, and updating that extra table whenever a new distinct value comes into play (if it is possible in your design). A sketch of this idea follows.
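Here is that extra-table idea, using the names from the question (the lookup table is hypothetical, and the CREATE TABLE ... AS syntax varies by database, e.g. SELECT ... INTO on SQL Server):

-- One-time (or periodically refreshed) list of the distinct matches.
CREATE TABLE bar_foo_values AS
SELECT DISTINCT my_col FROM bar WHERE my_col LIKE '%foo%';

-- Later queries join against the small lookup table; with an index on
-- bar.my_col this can seek instead of scanning the 10^8-row table.
SELECT b.*
FROM bar b
JOIN bar_foo_values v ON v.my_col = b.my_col;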
A search using the LIKE operator will surely lead to a table scan when the pattern starts with a %. When using the IN operator and the values are no more than a few percent of the values in the table, an index can be used, if it exists. Check the cardinality concept:
http://en.wikipedia.org/wiki/Cardinality_%28SQL_statements%29
The DBMS knows about cardinalities by keeping statistics about the tables. If your column has high cardinality and an index on it, then the index is likely to be used with the IN operator. To update the statistics, issue an ANALYZE command.
I am running the following query.
SELECT Table_1.Field_1,
Table_1.Field_2,
SUM(Table_1.Field_5) BALANCE_AMOUNT
FROM Table_1, Table_2
WHERE Table_1.Field_3 NOT IN (1, 3)
AND Table_2.Field_2 <> 2
AND Table_2.Field_3 = 'Y'
AND Table_1.Field_1 = Table_2.Field_1
AND Table_1.Field_4 = '31-oct-2011'
GROUP BY Table_1.Field_1, Table_1.Field_2;
I have created an index on columns (Field_1, Field_2, Field_3, Field_4) of Table_1, but the index is not getting used.
If I remove SUM(Table_1.Field_5) from the SELECT clause, then the index is used.
I am confused whether the optimizer is ignoring this index on its own, or because of the SUM() function I have used in the query.
Please share your explanation of this.
When you remove the SUM, you also remove Field_5 from the query. All the data needed to answer the query can then be found in the index, which may be quicker to read than the table. If you added Field_5 to the index, the query with SUM might use the index as well.
If your query returns a large percentage of the table's rows, Oracle may decide that doing a full table scan is cheaper than "hopping" between the index and the table's heap (to get the values of Table_1.Field_5).
Try adding Table_1.Field_5 to the index (thus covering the whole query with the index) and see if this helps.
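For example (a sketch; the index name is made up, and the existing four-column index would typically be replaced):

-- Extends the original (Field_1, Field_2, Field_3, Field_4) index so
-- Field_5 can be read from the index itself, avoiding table access.
CREATE INDEX ix_table1_covering
    ON Table_1 (Field_1, Field_2, Field_3, Field_4, Field_5);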
See Index-Only Scan: Avoiding Table Access at Use The Index Luke for a conceptual explanation of what is going on.
As you mentioned, the presence of the summation function results in the index being overlooked.
There are function-based indexes:
A function-based index includes columns that are either transformed by a function, such as the UPPER function, or included in an expression, such as col1 + col2.
Defining a function-based index on the transformed column or expression allows that data to be returned using the index when that function or expression is used in a WHERE clause or an ORDER BY clause. Therefore, a function-based index can be beneficial when frequently-executed SQL statements include transformed columns, or columns in expressions, in a WHERE or ORDER BY clause.
However, as with everything, function-based indexes have their restrictions:
Expressions in a function-based index cannot contain any aggregate functions. The expressions must reference only columns in a row in the table.
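A quick Oracle-style illustration of both points (the emp table and its columns are hypothetical):

-- Valid: a function-based index on a per-row expression.
CREATE INDEX ix_emp_upper_name ON emp (UPPER(ename));

-- The index can then serve a WHERE clause using the same expression:
SELECT * FROM emp WHERE UPPER(ename) = 'SMITH';

-- Invalid: aggregate functions are not allowed in function-based
-- indexes, so the following raises an error:
-- CREATE INDEX ix_emp_sum_sal ON emp (SUM(sal));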
Though I see some good answers here, a couple of important points are being missed.
SELECT Table_1.Field_1,
Table_1.Field_2,
SUM(Table_1.Field_5) BALANCE_AMOUNT
FROM Table_1, Table_2
WHERE Table_1.Field_3 NOT IN (1, 3)
AND Table_2.Field_2 <> 2
AND Table_2.Field_3 = 'Y'
AND Table_1.Field_1 = Table_2.Field_1
AND Table_1.Field_4 = '31-oct-2011'
GROUP BY Table_1.Field_1, Table_1.Field_2;
Saying that having SUM(Table_1.Field_5) in the SELECT clause causes the index not to be used is not correct. Your index on (Field_1, Field_2, Field_3, Field_4) can still be used. But there are problems with your index and SQL query.
Since your index is only on (Field_1, Field_2, Field_3, Field_4), even if the index gets used, the DB will have to access the actual table row to fetch Field_5 for applying the filter. It then depends entirely on the execution plan charted out by the SQL optimizer which option is cost-effective; if the optimizer figures out that a full table scan costs less than using the index, it will ignore the index. That said, here are the probable problems with your index; a candidate index combining both points is sketched below.
As others have stated, you could simply add Field_5 to the index so that there is no need for a separate table access.
The order of the index columns matters a great deal for performance. For example, in your case, ordering the index as (Field_4, Field_1, Field_2, Field_3) would be quicker, since you have an equality predicate on Field_4 (Table_1.Field_4 = '31-oct-2011'). Think of it this way: Table_1.Field_4 = '31-oct-2011' leaves you fewer candidate rows to choose the final result from than Table_1.Field_3 NOT IN (1, 3) does. Things might change since you are doing a join; it's always best to look at the execution plan and design your index/SQL accordingly.
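Putting both points together, a candidate index for this query might be the following sketch (whether the optimizer picks it still depends on its cost estimates):

-- Equality column first, then the join/grouping columns, with Field_5
-- appended so the index covers the query and no table access is needed.
CREATE INDEX ix_table1_f4_cover
    ON Table_1 (Field_4, Field_1, Field_2, Field_3, Field_5);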
In terms of performance, how does the LIKE operator behave when applied to patterns with multiple % placeholders?
For example, does
select A from table_A where A like 'A%'
take the same time to select as
select A from table_A where A like 'A%%'
?
Your queries:
select A from table_A where A like 'A%'
and
select A from table_A where A like 'A%%'
^ optimizer will remove second redundant %
are equivalent: the optimizer will remove the second, redundant % from the second query,
just like it would remove the 1=1 from:
select A from table_A where A like 'A%%' and 1=1
However, this query is very different:
select A from table_A where A like '%A%'
When using 'A%', it will use the index to find everything starting with an A, the way a person using a phone book would quickly look for the start of a name. However, when using '%A%', it will scan the entire table looking for anything containing an A, which is slower and uses no index. It is as if you had to find every name in the phone book that contained an A anywhere; that would take a while!
It will treat them the same. If there is an index on column A, it will use that index just as it would with a single wildcard. However, if you were to add a leading wildcard, that would force a table scan regardless of whether an index exists.
For the most part, the pattern that you're using will not affect the performance of the query. The key to performance here is the appropriate use of indexes. In your example, an index on the column will work well because the engine can seek to the values that start with 'A' and then match the full pattern. There may be some more challenging patterns around, but the performance difference between them is negligible.
There is one important case where the wildcard character will hurt performance: when it is at the beginning of the pattern. For example, '%A' gains no benefit from an index, because a leading wildcard means a matching value can begin with any character, so all rows must be evaluated to meet the criterion.