I have this question about SQL Server indexes that has been bugging me of late.
Imagine a table like this:
CREATE TABLE TelephoneBook (
FirstName nvarchar(50),
LastName nvarchar(50),
PhoneNumber nvarchar(50)
)
with an index like this:
CREATE NONCLUSTERED INDEX IX_LastName ON TelephoneBook (
LastName,
FirstName,
PhoneNumber
)
and imagine that this table has hundreds of thousands of rows.
Let's say I want to select everyone whose last name starts with a B and whose first name is 'John'. I would write the following query:
SELECT
*
FROM TelephoneBook
WHERE LastName like 'B%'
AND FirstName='John'
Since the index can help to reduce the number of rows we need to scan because it groups all of the LastNames that start with a B anyway, does it also do this for the FirstName? Or does the database scan every row that starts with a B to find the ones with the first name 'John'?
In other words, how are the second, third, fourth, ... columns sorted in an index? Are they alphabetical in this case as well, so it's pretty easy to find John? Or are they in some sort of a random or different order?
EDIT: the reason I ask is that I have just read that in the above SELECT statement, the index will only be used to narrow down the search to the records where the last name starts with a B, but that the index will NOT be used to find all of the rows with 'John' in them (and will resort to scanning all of the 'B' rows). And I'm wondering why that is. What am I not getting?
As a convenient shorthand: the keys of an index are used for the WHERE clause up to the first inequality, and LIKE with a wildcard is considered an inequality.
So, the index will only be used for looking up the first value. However, the entries will probably be scanned to match on the first name, so you will still get index usage.
Of course, the optimizer may decide not to use the index at all, if it decides that a full-table scan is more appropriate.
Gordon's answer is correct in this instance with the specified query. In general, you should be aware that an index is not so much grouping records together in "buckets" based on the values of the columns as it is ordering them according to the index's key columns. In other words, the records in this index are ordered by LastName; records that share the same LastName value are further ordered by FirstName, and then by PhoneNumber. You didn't specify a sort order for the columns of this index, but SQL Server defaults unspecified sort orders to ASC(ending), so those columns are indeed lexically sorted in the index.
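For illustration, and assuming the defaults just described, the index from the question is equivalent to spelling the sort order out explicitly (each key column could independently be marked DESC instead):
-- Same index as in the question, with the default ASC order written out per column.
CREATE NONCLUSTERED INDEX IX_LastName ON TelephoneBook (
LastName ASC,
FirstName ASC,
PhoneNumber ASC
)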
In your particular case, the query optimizer has decided to look at the index for the first column to determine which records to grab, as Gordon's answer mentions, but SQL Server will reorder predicates if the optimizer decides that would be better, and may use more columns of the index or none at all, depending on the query itself and statistics on the records you are querying.
Logically speaking, the index is sorted by key values in the order of the key. So in this case, LastName (sorted as text), FirstName (sorted as text) and then PhoneNumber (sorted as text)... Any included columns are not sorted at all.
In your case, we know that trailing wildcards are still SARGable, so we'd expect to see an index seek narrowing the data down to all rows with a LastName starting with 'B'; from that data pool, it is further filtered to include only the rows that have FirstName = 'John'. You can think of it as an index seek followed by a range seek.
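If you want to verify that yourself, a minimal sketch (assuming the TelephoneBook table and IX_LastName index from the question exist and contain data) is to run the query with I/O statistics and the actual execution plan enabled:
SET STATISTICS IO ON
SELECT LastName, FirstName, PhoneNumber
FROM TelephoneBook
WHERE LastName LIKE 'B%'
AND FirstName = 'John'
-- Expect an Index Seek on IX_LastName: a seek predicate on the LastName range,
-- with FirstName = 'John' checked against the rows inside that range.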
This question skirts around what I'm wondering, but the answers don't exactly address it.
It would seem that in general '=' is faster than 'like' when using wildcards. This appears to be the conventional wisdom. However, let's suppose I have a column containing a limited number of different fixed, hardcoded, varchar identifiers, and I want to select all rows matching one of them:
select * from table where value like 'abc%'
and
select * from table where value = 'abcdefghijklmn'
'Like' should only need to test the first three chars to find a match, whereas '=' must compare the entire string. In this case it would seem to me that 'like' would have an advantage, all other things being equal.
This is intended as a general, academic question, and so should not matter which DB, but it arose using SQL Server 2005.
See https://web.archive.org/web/20150209022016/http://myitforum.com/cs2/blogs/jnelson/archive/2007/11/16/108354.aspx
Quote from there:
the rules for index usage with LIKE are loosely like this:
1. If your filter criteria uses equals = and the field is indexed, then most likely it will use an INDEX/CLUSTERED INDEX SEEK.
2. If your filter criteria uses LIKE, with no wildcards (like if you had a parameter in a web report that COULD have a % but you instead use the full string), it is about as likely as #1 to use the index. The increased cost is almost nothing.
3. If your filter criteria uses LIKE, but with a wildcard at the beginning (as in Name0 LIKE '%UTER'), it's much less likely to use the index, but it still may at least perform an INDEX SCAN on a full or partial range of the index.
4. HOWEVER, if your filter criteria uses LIKE, but starts with a STRING FIRST and has wildcards somewhere AFTER that (as in Name0 LIKE 'COMP%ER'), then SQL may just use an INDEX SEEK to quickly find rows that have the same first starting characters, and then look through those rows for an exact match.
(Also keep in mind, the SQL engine still might not use an index the way you're expecting, depending on what else is going on in your query and what tables you're joining to. The SQL engine reserves the right to rewrite your query a little to get the data in a way that it thinks is most efficient, and that may include an INDEX SCAN instead of an INDEX SEEK.)
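To make those four cases concrete, here is a hypothetical sketch against an indexed Name0 column (the Computers table is made up for illustration):
SELECT * FROM Computers WHERE Name0 = 'COMPUTER01'     -- case 1: index seek
SELECT * FROM Computers WHERE Name0 LIKE 'COMPUTER01'  -- case 2: no wildcard, about as likely to seek
SELECT * FROM Computers WHERE Name0 LIKE '%UTER'       -- case 3: leading wildcard, at best an index scan
SELECT * FROM Computers WHERE Name0 LIKE 'COMP%ER'     -- case 4: seek on the 'COMP' prefix, then match the rest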
It's a measurable difference.
Run the following:
Create Table #TempTester (id int, col1 varchar(20), value varchar(20))
go
INSERT INTO #TempTester (id, col1, value)
VALUES
(1, 'this is #1', 'abcdefghij')
GO
INSERT INTO #TempTester (id, col1, value)
VALUES
(2, 'this is #2', 'foob'),
(3, 'this is #3', 'abdefghic'),
(4, 'this is #4', 'other'),
(5, 'this is #5', 'zyx'),
(6, 'this is #6', 'zyx'),
(7, 'this is #7', 'zyx'),
(8, 'this is #8', 'klm'),
(9, 'this is #9', 'klm'),
(10, 'this is #10', 'zyx')
GO 10000
CREATE CLUSTERED INDEX ixId ON #TempTester(id)
CREATE NONCLUSTERED INDEX ixTesting ON #TempTester(value)
Then:
SET SHOWPLAN_XML ON
Then:
SELECT * FROM #TempTester WHERE value LIKE 'abc%'
SELECT * FROM #TempTester WHERE value = 'abcdefghij'
The resulting execution plans show that the cost of the first operation, the LIKE comparison, is about 10 times higher than that of the = comparison.
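When you're done, switch the plan display back off and clean up (a small housekeeping sketch):
SET SHOWPLAN_XML OFF
GO
DROP TABLE #TempTester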
If you can use an = comparison, please do so.
You should also keep in mind that when using LIKE, some SQL flavors will ignore indexes, and that will kill performance. This is especially true if you don't use a "starts with" pattern like the one in your example.
You should really look at the execution plan for the query and see what it's doing; guess as little as possible.
That being said, the "starts with" pattern can be and is optimized in SQL Server: it will use the table's index. EF 4.0 switched to LIKE for StartsWith for this very reason.
If value is unindexed, both result in a table-scan. The performance difference in this scenario will be negligible.
If value is indexed, as Daniel points out in his comment, the = will result in an index lookup which is O(log N) performance. The LIKE will (most likely - depending on how selective it is) result in a partial scan of the index >= 'abc' and < 'abd' which will require more effort than the =.
Note that I'm talking SQL Server here - not all DBMSs will be nice with LIKE.
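In other words, for this pattern the LIKE behaves roughly like the following range sketch (the optimizer's actual internal rewrite may differ), which is why it can still seek but touches more of the index than the single = lookup:
SELECT * FROM [table] WHERE value >= 'abc' AND value < 'abd'
-- [table] and value mirror the placeholder names from the question; the brackets are
-- only needed because "table" is a reserved word.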
You are asking the wrong question. In databases it is not the performance of the operator that matters; it is always the SARGability of the expression and the coverability of the overall query. The performance of the operator itself is largely irrelevant.
So, how do LIKE and = compare in terms of SARGability? LIKE, when used with an expression that does not start with a constant (eg. when used as LIKE '%something'), is by definition non-SARGable. But does that make = or LIKE 'something%' SARGable? No. As with any question about SQL performance, the answer does not lie in the text of the query, but in the schema deployed. These expressions may be SARGable if an index exists to satisfy them.
So, truth be told, there are small differences between = and LIKE. But asking whether one operator or the other is 'faster' in SQL is like asking 'What goes faster, a red car or a blue car?'. You should be asking questions about the engine size and vehicle weight, not about the color... To approach questions about optimizing relational tables, the place to look is your indexes and your expressions in the WHERE clause (and other clauses, but it usually starts with the WHERE).
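A hypothetical illustration of that point, assuming a table t with an index on t.value:
SELECT * FROM t WHERE value = 'something'       -- SARGable: the index can be sought
SELECT * FROM t WHERE value LIKE 'something%'   -- SARGable: seek on the 'something' prefix
SELECT * FROM t WHERE value LIKE '%something'   -- non-SARGable: the leading wildcard forces a scan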
A personal example using MySQL 5.5: I had an inner join between two tables, one with 3 million rows and one with 10 thousand rows.
When using a LIKE on an indexed column as below (no wildcards), it took about 30 seconds:
where login like '12345678'
Using EXPLAIN, the plan showed that the index was not being used for the lookup.
When using an '=' on the same query, it took about 0.1 seconds:
where login ='12345678'
Using EXPLAIN, the plan showed the index being used.
As you can see, the LIKE completely cancelled the index seek, so the query took about 300 times longer.
= is much faster than LIKE, even without a wildcard. I tested on MySQL with 11 GB of data and more than 100 million records; the f_time column is indexed.
SELECT * FROM XXXXX WHERE f_time = '1621442261'
#took 0.00 sec and returned 330 records
SELECT * FROM XXXXX WHERE f_time LIKE '1621442261'
#took 44.71 sec and returned 330 records
Besides all the other answers, there is this to consider:
'like' is case insensitive, so every character needs to be compared twice, whereas the '=' only compares once for identical characters.
This issue arises with or without indexes.
Maybe you are looking for Full Text Search.
In contrast to full-text search, the LIKE Transact-SQL predicate works on character patterns only. Also, you cannot use the LIKE predicate to query formatted binary data. Furthermore, a LIKE query against a large amount of unstructured text data is much slower than an equivalent full-text query against the same data. A LIKE query against millions of rows of text data can take minutes to return; whereas a full-text query can take only seconds or less against the same data, depending on the number of rows that are returned.
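As a hedged sketch of that difference, assuming a hypothetical Documents table whose Body column has a full-text index:
SELECT DocId FROM Documents WHERE Body LIKE '%performance%'      -- LIKE: pattern match over every row
SELECT DocId FROM Documents WHERE CONTAINS(Body, 'performance')  -- full-text predicate: uses the full-text index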
I was working with a huge database that has more than 400M records, and I put LIKE in the search query. Here are the final results.
There were three tables, tb1, tb2 and tb3. When I used equals (=) in the query for all tables, the response time was 193 ms. When I put LIKE in one of the tables, the response time was 19.22 seconds, and with LIKE on all tables the response time was 112 seconds.
In terms of performance, how does the LIKE operator behave when applied to strings with multiple % placeholders?
For example, does
select A from table_A where A like 'A%'
take the same time to select as
select A from table_A where A like 'A%%'
?
Your queries:
select A from table_A where A like 'A%'
and
select A from table_A where A like 'A%%'
are equivalent; the optimizer will remove the redundant second % from the second query,
just like it would remove the 1=1 from:
select A from table_A where A like 'A%%' and 1=1
However, this query is very different:
select A from table_A where A like '%A%'
When using 'A%' it will use the index to find everything starting with an A, just as a person using a phone book would quickly look up the start of a name. However, when using '%A%' it will scan the entire table looking for anything containing an A, which is slower and gets no index usage. It's as if you had to find every name in the phone book that contained an A anywhere in it; that would take a while!
It will treat them the same. If there is an index on column A, it will use that index just as it would with a single wildcard. However, if you were to add a leading wildcard, that would force a table scan regardless of whether an index exists or not.
For the most part, the pattern that you're using will not affect the performance of the query. The key to performance here is the appropriate use of indexes. In your example, an index on the column will work well because it can seek to the values that start with 'A' and then match the full pattern. There may be some more challenging patterns around, but the performance difference between them is negligible.
There is one important case where the wildcard character will hurt performance, and that is when it appears at the beginning of the pattern. For example, '%A' gains no benefit from an index, because the leading wildcard means a matching value could start with any character, so all rows must be evaluated to meet this criterion.
The WHERE clause of one of my queries looks like this:
and tbl0.Type = 'Alert'
AND (tbl0.AccessRights like '%'+TblCUG0.userGroup+'%'
or tbl0.AccessRights like 'All' )
AND (tbl0.ExpiryDate > CONVERT(varchar(8), GETDATE(), 1)
or tbl0.ExpiryDate is null)
order by tbl0.Priority,tbl0.PublishedDate desc, tbl0.Title asc
I would like to know which columns I can create indexes on and which type of index will suit them best. Also, I have heard that indexes don't work with LIKE and wildcards at the start of the pattern. So what should the approach be to optimize these queries?
1 and tbl0.Type = 'Alert'
2 AND (tbl0.AccessRights like '%'+TblCUG0.userGroup+'%'
3 or tbl0.AccessRights like 'All' )
4 AND (tbl0.ExpiryDate > CONVERT(varchar(8), GETDATE(), 1)
5 or tbl0.ExpiryDate is null)
Most likely, you will not be able to use an index with a WHERE clause like this.
Line 1: You could create an index on tbl0.Type, but if you have many rows and few distinct values, SQL Server will just skip the index and table scan anyway. Also, having nothing to do with the index issue, a code/flag column like this is better as a fixed-width value (char(1), tinyint, etc.) where 'A' = alert or 1 = alert. I would name the column XyzType, where Xyz is what the type describes (DoctorType, CarType, etc.), and create a new table XyzType with a foreign key back to this column in tbl0. That new table would have two columns, XyzType (the primary key) and XyzDescription, where you expand out the name.
Line 2: Are you combining multiple values into tbl0.AccessRights and trying to use LIKE to find values within it? If so, split this out into a different table; then you can remove the LIKE and possibly add an index there.
Line 3: OR kills index usage. Imagine looking through the phone book for all names that are "Smith" or start with "G"; you can't just use the index. You may try splitting the query into a UNION or UNION ALL around the OR so an index can be used (one part looks for "Smith" and the other part looks for "G"); see the sketch after these points. You have not provided enough of the query to determine whether this is possible in your case. You may need to use a derived table that contains this UNION so you can join it to the rest of your query.
Line 4: tbl0.ExpiryDate could benefit from an index, but the OR will kill its usage; see the Line 3 comment.
Line 5: You may try the OR/UNION trick discussed above, or just not use NULL and put in a default like '01/01/3000' so you don't need the OR.
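As a rough sketch of the Line 3 UNION rewrite, using the phone book analogy (the PhoneBook table and its columns are hypothetical):
SELECT LastName, PhoneNumber FROM PhoneBook WHERE LastName = 'Smith'
UNION ALL
SELECT LastName, PhoneNumber FROM PhoneBook WHERE LastName LIKE 'G%'
-- Each branch can use an index seek on LastName; the single OR'd predicate often cannot.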
SQL Server's Database Tuning Advisor can suggest which indexes will optimize your query, including covering indexes that also contain the selected columns you do not filter on. Just because you add an index doesn't mean the query optimizer will use it: some indexes may cost more to use than others, so the optimizer will choose the best indexes using the underlying tables' statistics.
Offhand, you could add all of the ordering and criteria columns to an index, but that would be useless if, for example, there are too few distinct Priority values to make it worth the storage.
You are right about LIKE and wildcards. An index is a B-tree, which means it can speed up searches for specific values or range queries. A wildcard at the beginning means that the query will have to touch all records to check whether they match the pattern. A wildcard at the end means that the query only has to touch items that start with the substring up to the wildcard, partially turning it into a range query that can benefit from an index.
I don't know much about database optimization, but I'm trying to understand this case.
Say I have the following table:
cities
===========
state_id integer
name varchar(32)
slug varchar(32)
Now, say I want to perform queries like this:
SELECT * FROM cities WHERE state_id = 123 AND slug = 'some_city'
SELECT * FROM cities WHERE state_id = 123
If I want the "slug" for a city to be unique within its particular state, I'd add a unique index on state_id and slug.
Is that index enough? Or should I also add another on state_id so the second query is optimized? Or does the second query automatically use the unique index?
I'm working on PostgreSQL, but I feel this case is so simple that most DBMS work similarly.
Also, I know this surely doesn't make a difference on small tables, but my example is a simple one. Think of tables with 200k+ rows.
Thanks!
A single unique index on (state_id, slug) should be sufficient. To be sure, of course, you'll need to run EXPLAIN and/or ANALYZE (perhaps with the help of something like http://explain.depesz.com/), but ultimately what indexes are appropriate depends very closely on what kind of queries you will be running. Remember, indexes make SELECTs faster and INSERTs, UPDATEs, and DELETEs slower, so you ideally want only as many indexes as are actually necessary.
Also, PostgreSQL has a smart query optimizer: it will use radically different search plans for queries on small tables and huge tables. If the table is small, it will just do a sequential scan and not even bother with any indexes, since the overhead of working with them is higher than just brute-force sifting through the table. This changes to a different plan once the table size passes a threshold, and may change again if the table gets larger again, or if you change your SELECT, or....
Summary: you can't trust the results of EXPLAIN and ANALYZE on datasets much smaller or different than your actual data. Make it work, then make it fast later (if you need to).
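For example, a minimal way to check the plan that Postgres actually chooses for the question's first query:
-- EXPLAIN ANALYZE runs the query and reports the plan and timings actually used.
EXPLAIN ANALYZE
SELECT * FROM cities WHERE state_id = 123 AND slug = 'some_city';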
[EDIT: Misread the question... Hopefully my answer is more relevant now!]
In your case, I'd suggest 1 index on (state_id, slug). If you ever need to search just by slug, add an index on just that column. If you have those, then adding another index on state_id is unnecessary as the first index already covers it.
An index can be used whenever an initial segment of its columns are used in a WHERE clause. So e.g. an index on columns A, B and C will optimise queries containing WHERE clauses involving A, B and C, WHERE clauses with just A and B, or WHERE clauses with just A. Note that the order that columns appear in the index definition is very important -- this example index cannot be used for WHERE clauses involving just B and/or C.
(Of course it's up to the query optimiser whether or not a particular index actually gets used, but in your case with 200k rows, you can guarantee that a simple search by state_id or slug or both will use one of the indices.)
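A minimal sketch of that setup using the question's table (the index name is made up):
CREATE UNIQUE INDEX cities_state_id_slug_idx ON cities (state_id, slug);
-- Both of the question's queries can use this one index: the first matches both
-- key columns, the second matches the leading column state_id on its own.
SELECT * FROM cities WHERE state_id = 123 AND slug = 'some_city';
SELECT * FROM cities WHERE state_id = 123;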
Any decent optimizer will see an index on three columns - say:
CREATE INDEX idx_1 ON SomeTable(Col1, Col2, Col3);
and will use that index for any of the following conditions:
WHERE Col1 = ...something...
WHERE Col1 = ...something... AND Col2 = ...otherthing...
WHERE Col3 = ....whatnot....
AND Col1 = ...something....
AND Col2 = ...otherthing...
That is, it will use the index if there are conditions applied to any contiguous leading subset of the columns of the index. Although I used equality, it can also apply to ranges (open - just greater than, for example) or closed (between two values).
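For instance, a range condition on a trailing key column still combines with equality on the leading column (the values here are made up):
SELECT * FROM SomeTable
WHERE Col1 = 'something'
AND Col2 BETWEEN 10 AND 20
-- idx_1 can seek on Col1 = 'something' and then scan the Col2 range within that group.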
To do optimization, use EXPLAIN (http://www.postgresql.org/docs/7.4/static/sql-explain.html) and see for yourself.
But optimization is not the most important reason to create those indexes; first of all, the unique index is a constraint that keeps the database logically consistent.