SQL: like v. equals performance comparison - sql

I have a large table (100 million rows) which is properly indexed in a traditional RDBMS system (Oracle, MySQL, Postgres, SQL Server, etc.). I would like to perform a SELECT query which can be formulated with either of the following criteria options:
One that can be represented by a single criterion:
LIKE "T40%"
which only looks for matches at the beginning of the string field due to the wildcard
or
One that requires a list of say 200 exact criteria:
WHERE IN("T40.x21","T40.x32","T40.x43")
etc.
All other things being equal, which should I expect to be more performant?

Assuming that both queries return the same set of rows (i.e. the list of items that you supply in the IN expression is exhaustive) you should expect almost identical performance, perhaps with some advantage for the LIKE query.
RDBMS engines have long used index searches for begins-with LIKE queries, so LIKE 'T40%' will produce its records from an index search.
Your IN query would be optimized for index search as well, perhaps giving the RDBMS tighter lower and upper bounds. However, there would be an additional filtering step to eliminate records outside your IN list, which is a waste of CPU cycles under the assumption that all rows would be returned anyway.
If you parameterize your query, the second form also becomes harder to pass to the RDBMS from your host program, because every item in the list needs its own parameter. All other things being equal, I would use LIKE.
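To see the two formulations side by side, here is a minimal sketch using Python's built-in sqlite3 module. SQLite is not one of the engines named in the question, and the table name parts, the column code and the index name are made up for illustration. The column is declared COLLATE NOCASE because SQLite only applies its prefix-LIKE index optimization to its default case-insensitive LIKE when the index uses that collation; other engines have their own rules.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table; COLLATE NOCASE lets SQLite's default LIKE use the index.
conn.execute("CREATE TABLE parts (code TEXT COLLATE NOCASE)")
conn.execute("CREATE INDEX idx_parts_code ON parts (code)")
codes = ["T39.x99", "T40.x21", "T40.x32", "T40.x43", "T41.x01"]
conn.executemany("INSERT INTO parts (code) VALUES (?)", [(c,) for c in codes])

like_sql = "SELECT code FROM parts WHERE code LIKE 'T40%' ORDER BY code"
in_sql = ("SELECT code FROM parts "
          "WHERE code IN ('T40.x21', 'T40.x32', 'T40.x43') ORDER BY code")


def plan(sql):
    """Return the detail column of EXPLAIN QUERY PLAN as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql)
    return " | ".join(row[-1] for row in rows)


like_rows = [r[0] for r in conn.execute(like_sql)]
in_rows = [r[0] for r in conn.execute(in_sql)]

print("same rows:", like_rows == in_rows)
print("LIKE plan:", plan(like_sql))
print("IN plan:  ", plan(in_sql))
```

Both plans should mention the index. On a real 100-million-row table you would compare the actual plans and timings in your own RDBMS rather than rely on this toy.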

I would suggest going with the LIKE operator. If the search string itself can contain wildcard characters, use the ESCAPE option with the '\' symbol so that those characters are matched literally.

Related

Which performs better in a query, 'LIKE' or '='?

I am using asp.net mvc3.
Which of the two, '=' or 'LIKE', will perform faster in my SQL query?
= will be faster than LIKE because testing against exact matches is faster in SQL. The LIKE expression needs to scan the string and search for occurrences of the given expression for each row. Of course that's not a reason not to use LIKE. Databases are pretty optimized and unless you discover that this is a performance bottleneck for your application you should not be afraid of using it. Doing premature optimizations like this is not good.
As Darin says, searching for equality is likely to be faster - it allows better use of indexes etc. It partly depends on the kind of LIKE operation - leading substring LIKE queries (WHERE Name LIKE 'Fred%') can be optimized in the database with cunning indexes (you'd need to check whether any special work is needed to enable that in your database). Trailing substring matches could potentially be optimized in the same sort of way, but I don't know whether most databases handle this. Arbitrary matches (e.g. WHERE Name LIKE '%Fred%Bill%') would be very hard to optimize.
However, you should really be driven by functionality - do you need pattern-based matching, or exact equality? Given that they don't do the same thing, which results do you want? If you have a LIKE pattern which doesn't specify any wildcards, I would hope that the query optimizer could notice that and give you the appropriate optimization anyway - although you'd want to test that.
If you're wondering whether or not to include pattern-matching functionality, you'll need to work out whether your users are happy to have that for occasional "power searches" at the cost of speed - without knowing much about your use case, it's hard to say...
Equal and like are different operators so are not comparable
Equal is exact match
LIKE is pattern matching
That said, LIKE without wildcards should run the same as equal. But you wouldn't run that.
And it depends on indexes. Without an index, every row will need to be examined for any operator.
Note: LIKE '%something' can never be optimised by an index (edit: see comments)
Equal is fastest
Then LIKE 'something%'
LIKE '%something' is slowest.
The last one has to go through the entire column to find a match. Hence it's the slowest one.
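The ordering above can be checked against a query planner. The following is a rough sketch with Python's sqlite3 module, assuming a made-up table people with an indexed name column. It only shows what SQLite decides; the plans of other engines will be worded differently and may differ in substance.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema; NOCASE collation is what lets SQLite's default
# (case-insensitive) LIKE use the index for prefix patterns.
conn.execute("CREATE TABLE people (name TEXT COLLATE NOCASE)")
conn.execute("CREATE INDEX idx_people_name ON people (name)")


def plan(where):
    """Return SQLite's plan description for a given WHERE clause."""
    sql = "EXPLAIN QUERY PLAN SELECT name FROM people WHERE " + where
    return " | ".join(row[-1] for row in conn.execute(sql))


eq_plan = plan("name = 'something'")          # exact match
prefix_plan = plan("name LIKE 'something%'")  # wildcard at the end
suffix_plan = plan("name LIKE '%something'")  # wildcard at the start

for label, p in [("=", eq_plan), ("prefix", prefix_plan), ("suffix", suffix_plan)]:
    print(label, "->", p)
```

In SQLite's output a plan that starts with SEARCH seeks into the index, while SCAN visits every entry. Here the first two predicates should produce a SEARCH and the leading-wildcard pattern a SCAN.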
As you are talking about using them interchangeably I assume the desired semantics are equality.
i.e. You are talking about queries such as
WHERE bar = 'foo' vs WHERE bar LIKE 'foo'
This contains no wild cards so the queries are semantically equivalent. However you should probably use the former as
It is clearer what the expression of the query is
In this case the search term does not contain any characters of particular significance to the LIKE operator but if you wanted to search for bar = '10% off' you would need to escape these characters when using LIKE.
Trailing space is significant in LIKE queries but not = (tested on SQL Server and MySQL not sure what the standard says here)
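The escaping point is easy to demonstrate. This is a small sketch with Python's sqlite3 module and a made-up table offers. The ESCAPE clause is standard SQL, but which escape character is assumed by default differs between engines, so it is spelled out explicitly here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE offers (bar TEXT)")  # hypothetical table
conn.executemany(
    "INSERT INTO offers (bar) VALUES (?)",
    [("10% off",), ("10 dollars off",), ("100 off",)],
)


def query(where):
    sql = "SELECT bar FROM offers WHERE " + where + " ORDER BY bar"
    return [row[0] for row in conn.execute(sql)]


equals = query("bar = '10% off'")
# Unescaped, the % acts as a wildcard and matches any run of characters.
unescaped = query("bar LIKE '10% off'")
# With ESCAPE, the % after the backslash is treated as a literal percent sign.
escaped = query(r"bar LIKE '10\% off' ESCAPE '\'")

print(equals)
print(unescaped)
print(escaped)
```

Only the escaped LIKE returns the same single row as the equality test; the unescaped pattern also matches the other two rows.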
You don't specify the RDBMS, so I'll take SQL Server as an example and discuss a few possible scenarios.
1. The bar column is not indexed.
In that case both queries will involve a full scan of all rows. There might be some minor difference in CPU time because of the different semantics around how trailing space should be treated.
2. The bar column has a non unique index.
In that case the = query will seek into the index where bar = 'foo' and then follow the index along until it finds the first row where bar <> 'foo'. The LIKE query will seek into the index where bar >= 'foo' following the index along until it finds the first row where bar > 'foo'
3. The bar column has a unique index.
In that case the = query will seek into the index where bar = 'foo' and return that row if it exists and not scan any more. The LIKE query will still do the range seek on Start: bar >= 'foo' End: bar <= 'foo' so will still examine the next row.

Multiple indexes on one column

Using Oracle, there is a table called User.
Columns: Id, FirstName, LastName
Indexes: 1. PK(Id), 2. UPPER(FirstName), 3. LOWER(FirstName), 4. Index(FirstName)
As you can see index 2, 3, 4 are indexes on the same column - FirstName.
I know this creates overhead, but my question is on selecting how will the database react/optimize?
For instance:
SELECT Id FROM User u WHERE
u.FirstName LIKE 'MIKE%'
Will Oracle hit the right index or will it not?
The problem is that via Hibernate this slows down the query VERY much (so it uses prepared statements).
Thanks.
UPDATE: Just to clarify indexes 2 and 3 are functional indexes.
In addition to Mat's point that either index 2 or 3 should be redundant because you should choose one approach to doing case-insensitive searches and to Richard's point that it will depend on the selectivity of the index, be aware that there are additional concerns when you are using the LIKE clause.
Assuming you are using bind variables (which it sounds like you are based on your use of prepared statements), the optimizer has to guess at how selective the actual bind value is going to be. Something short like 'S%' is going to be very non-selective, causing the optimizer to generally prefer a table scan. A longer string like 'Smithfield-Manning%', on the other hand, is likely to be very selective and would likely use index 4. How Oracle handles this variability will depend on the version.
In Oracle 10, Oracle introduced bind variable peeking. This meant that the first time Oracle parsed a query after a reboot (or after the query plan being aged out of the shared pool), Oracle looked at the bind value and decided what plan to use based on that value. Assuming that most of your queries would benefit from the index scan because users are generally searching on relatively selective values, this was great if the first query after a reboot had a selective condition. But if you got unlucky and someone did a WHERE firstname LIKE 'S%' immediately after a reboot, you'd be stuck with the table scan query plan until the query plan was removed from the shared pool.
Starting in Oracle 11, however, the optimizer has the ability to do adaptive cursor sharing. That means that the optimizer will try to figure out that WHERE firstname LIKE 'S%' should do a table scan and WHERE firstname LIKE 'Smithfield-Manning%' should do an index scan and will maintain multiple query plans for the same statement in the shared pool. That solves most of the problems that we had with bind variable peeking in earlier versions.
But even here, the accuracy of the optimizer's selectivity estimates is generally going to be problematic for medium-length strings. It's generally going to know that a single-character string is very weakly selective and that a 20 character string is highly selective but even with a 256 bucket histogram, it's not going to have a whole lot of information about how selective something like WHERE firstname LIKE 'Smit%' really is. It may know roughly how selective 'Sm%' is based on the column histogram but it's guessing rather blindly at how selective the next two characters are. So it's not uncommon to end up in a situation where most of the queries work efficiently but the optimizer is convinced that WHERE firstname LIKE 'Cave%' isn't selective enough to use an index.
Assuming that this is a common query, you may want to consider using Oracle's plan stability features to force Oracle to use a particular plan regardless of the value of a bind variable. This may mean that users that enter a single character have to wait even longer than they would otherwise have waited because the index scan is substantially less efficient than doing a table scan. But that may be worth it for other users that are searching for short but reasonably distinctive last names. And you may do things like add a ROWNUM limiter to the query or add logic to the front end that requires a minimum number of characters in the search box to avoid situations where a table scan would be more efficient.
It's a bit strange to have both the upper and lower function-based indexes on the same field. And I don't think the optimizer will use either in your query as it is.
You should pick one or the other (and probably drop the last one too), and only ever query on the upper (or lower)-case with something like:
select id from user u where upper(u.firstname) like 'MIKE%'
Edit: look at this post too, it has some interesting info: How to use a function-based index on a column that contains NULLs in Oracle 10+?
It may not hit any of your indexes, because you are returning ID in the SELECT clause, which is not covered by the indexes.
If the index is very selective, and Oracle decides it is still worthwhile using it to find 'MIKE%' then perform a lookup on the data to get the ID column, then it may use 4. Index(FirstName). 2 and 3 will only be used if the column searched uses the exact function defined in the index.
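The point that a function-based index is only considered when the query uses the same expression can be illustrated outside Oracle. The sketch below uses Python's sqlite3 module, which supports indexes on expressions. The table is called users here (not User), the index name is made up, and an equality predicate is used instead of LIKE to keep the example simple. It says nothing about how Oracle's optimizer costs these plans.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema mirroring the question: an index on UPPER(first_name) only.
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, first_name TEXT)")
conn.execute("CREATE INDEX idx_users_upper_first ON users (upper(first_name))")


def plan(where):
    """Return SQLite's plan description for a given WHERE clause."""
    sql = "EXPLAIN QUERY PLAN SELECT id FROM users WHERE " + where
    return " | ".join(row[-1] for row in conn.execute(sql))


# The predicate repeats the indexed expression, so the index is usable.
with_function = plan("upper(first_name) = 'MIKE'")
# The predicate is on the bare column, so the expression index does not apply.
without_function = plan("first_name = 'MIKE'")

print(with_function)
print(without_function)
```

If case-insensitive search is the goal, this is the argument for keeping a single function-based index and always querying through that same function.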

SQL `LIKE` complexity

Does anyone know what the complexity is for the SQL LIKE operator for the most popular databases?
Let's consider the three core cases separately. This discussion is MySQL-specific, but might also apply to other DBMS due to the fact that indexes are typically implemented in a similar manner.
LIKE 'foo%' is quick if run on an indexed column. MySQL indexes are a variation of B-trees, so when performing this query it can simply descend the tree to the node corresponding to foo, or the first node with that prefix, and traverse the tree forward. All of this is very efficient.
LIKE '%foo' can't be accelerated by indexes and will result in a full table scan. If you have other criteria that can be executed using indexes, it will only scan the rows that remain after the initial filtering.
There's a trick though: If you need to do suffix matching - searching for file names with extension .foo, for instance - you can achieve the same performance by adding a column with the same contents as the original one but with the characters in reverse order.
ALTER TABLE my_table ADD COLUMN col_reverse VARCHAR (256) NOT NULL;
ALTER TABLE my_table ADD INDEX idx_col_reverse (col_reverse);
UPDATE my_table SET col_reverse = REVERSE(col);
Searching for rows with col ending in .foo then becomes:
SELECT * FROM my_table WHERE col_reverse LIKE 'oof.%'
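Here is the same trick as a runnable sketch, using Python's sqlite3 module instead of MySQL. SQLite has no built-in REVERSE() function, so the reversed value is computed in Python at insert time. The sample file names are invented, and the NOCASE collation is only there so that SQLite's default LIKE can use the index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_table (col TEXT, col_reverse TEXT COLLATE NOCASE)"
)
conn.execute("CREATE INDEX idx_col_reverse ON my_table (col_reverse)")

names = ["report.foo", "notes.txt", "archive.foo", "foo.bar"]
# Store each value together with its reversed copy.
conn.executemany(
    "INSERT INTO my_table (col, col_reverse) VALUES (?, ?)",
    [(n, n[::-1]) for n in names],
)

suffix = ".foo"
pattern = suffix[::-1] + "%"  # the suffix search becomes a prefix search
matches = sorted(
    row[0]
    for row in conn.execute(
        "SELECT col FROM my_table WHERE col_reverse LIKE ?", (pattern,)
    )
)

# Check the plan with the literal pattern to see whether the index is used.
plan = " | ".join(
    row[-1]
    for row in conn.execute(
        "EXPLAIN QUERY PLAN "
        "SELECT col FROM my_table WHERE col_reverse LIKE 'oof.%'"
    )
)
print(matches)
print(plan)
```

A suffix that itself contains % or _ would need escaping before being turned into a pattern, and the reversed column has to be kept in sync on every insert and update.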
Finally, there's LIKE '%foo%', for which there are no shortcuts. If there are no other limiting criteria that reduce the number of rows to a feasible amount, it'll cause a hard performance hit. You might want to consider a full text search solution instead, or some other specialized solution.
If you are asking about the performance impact:
The problem of like is that it keeps the database from using an index. On Oracle I think it doesn't use indexes anymore (but I'm still on Oracle 9). SqlServer uses indexes if the wildcard is only at the end. I don't know about other databases.
Depends on the RDBMS, the data (and possibly size of data), indexes and how the LIKE is used (with or without prefix wildcard)!
You are asking too general a question.

Surprising SQL speed increase

I’ve just found out that the execution plan performance of the following two select statements is massively different:
select * from your_large_table
where LEFT(some_string_field, 4) = '2505'
select * from your_large_table
where some_string_field like '2505%'
The execution plans are 98% and 2% respectively. Bit of a difference in speed then. I was actually shocked when I saw it.
I've always done LEFT(xxx) = 'yyy' as it reads well.
I actually found this out by checking the LINQ generated SQL against my hand crafted SQL. I assumed the LIKE command would be slower, but is in fact much much faster.
My question is: why is LEFT() slower than LIKE '..%'? They are, after all, identical.
Also, is there a CPU hit by using LEFT()?
More generally speaking, you should never use a function on the LEFT side of a WHERE clause in a query. If you do, SQL won't use an index--it has to evaluate the function for every row of the table. The goal is to make sure that your where clause is "Sargable"
Some other examples:
Bad: Select ... WHERE isNull(FullName,'') = 'Ed Jones'
Fixed: Select ... WHERE FullName = 'Ed Jones'
Bad: Select ... WHERE SUBSTRING(DealerName,1,4) = 'Ford'
Fixed: Select ... WHERE DealerName Like 'Ford%'
Bad: Select ... WHERE DateDiff(mm,OrderDate,GetDate()) >= 30
Fixed: Select ... WHERE OrderDate < DateAdd(mm,-30,GetDate())
Bad: Select ... WHERE Year(OrderDate) = 2003
Fixed: Select ... WHERE OrderDate >= '2003-1-1' AND OrderDate < '2004-1-1'
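The difference between a wrapped column and a sargable predicate shows up directly in a query plan. The sketch below uses Python's sqlite3 module rather than SQL Server, so substr() stands in for LEFT(), and the index name and sample rows are made up. It illustrates the principle, not SQL Server's actual plans.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE your_large_table (some_string_field TEXT COLLATE NOCASE)"
)
conn.execute("CREATE INDEX idx_some_string ON your_large_table (some_string_field)")
conn.executemany(
    "INSERT INTO your_large_table (some_string_field) VALUES (?)",
    [("2505-A",), ("2505-B",), ("2506-A",), ("1250-5",)],
)

base = "SELECT some_string_field FROM your_large_table WHERE "
wrapped_where = "substr(some_string_field, 1, 4) = '2505'"  # function on the column
sargable_where = "some_string_field LIKE '2505%'"           # bare column, prefix pattern


def plan(where):
    rows = conn.execute("EXPLAIN QUERY PLAN " + base + where)
    return " | ".join(row[-1] for row in rows)


def result(where):
    return sorted(row[0] for row in conn.execute(base + where))


wrapped_plan, sargable_plan = plan(wrapped_where), plan(sargable_where)
print(result(wrapped_where) == result(sargable_where))
print("wrapped: ", wrapped_plan)
print("sargable:", sargable_plan)
```

Both predicates return the same rows, but only the LIKE form lets the engine seek into the index; the wrapped form has to evaluate the function against every entry.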
It looks like the expression LEFT(some_string_field, 4) is evaluated for every row of a full table scan, while the "like" expression will use the index.
Optimizing "like" to use an index if it is a front-anchored pattern is a much easier optimization than analyzing arbitrary expressions involving string functions.
There's a huge impact on using function calls in where clauses as SQL Server must calculate the result for each row. On the other hand, like is a built in language feature which is highly optimized.
If you use a function on a column with an index then the db no longer uses the index (at least with Oracle anyway)
So I am guessing that your example field 'some_string_field' has an index on it which doesn't get used for the query with 'LEFT'
Why do you say they are identical? They might solve the same problem, but their approach is different. At least it seems like that...
The query using LEFT optimizes the test, since it already knows the length of the prefix and so on, so in a C/C++/... program or without an index, an algorithm using LEFT to implement a certain LIKE behavior would be the fastest. But in contrast to most non-declarative languages, on a SQL database a lot of optimizations are done for you. For example, LIKE is probably implemented by first looking for the % sign, and if it is noticed that the % is the last char in the string, the query can be optimized in much the same way as you did using LEFT, but directly using an index.
So, indeed I think you were right after all, they probably are identical in their approach. The only difference being that the db server can use an index in the query using LIKE because there is not a function transforming the column value to something unknown in the WHERE clause.
What happened here is either that the RDBMS is not capable of using an index on the LEFT() predicate and is capable of using it on the LIKE, or it simply made the wrong call in which would be the more appropriate access method.
Firstly, it may be true for some RDBMSs that applying a function to a column prevents an index-based access method from being used, but that is not a universal truth, nor is there any logical reason why it needs to be. An index-based access method (such as Oracle's full index scan or fast full index scan) might be beneficial but in some cases the RDBMS is not capable of the operation in the context of a function-based predicate.
Secondly, the optimiser may simply get the arithmetic wrong in estimating the benefits of the different available access methods. Assuming that the system can perform an index-based access method, it first has to make an estimate of the number of rows that will match the predicate, either from statistics on the table, statistics on the column, by sampling the data at parse time, or by using a heuristic rule (e.g. "assume 5% of rows will match"). Then it has to assess the relative costs of a full table scan or the available index-based methods. Sometimes it will get the arithmetic wrong, sometimes the statistics will be misleading or inaccurate, and sometimes the heuristic rules will not be appropriate for the data set.
The key point is to be aware of a number of issues:
What operations can your RDBMS support?
What would be the most appropriate operation in the case you are working with?
Is the system's choice correct?
What can be done to either allow the system to perform a more efficient operation (eg. add a missing not null constraint, update the statistics etc)?
In my experience this is not a trivial task, and is often best left to experts. Or on the other hand, just post the problem to Stackoverflow -- some of us find this stuff fascinating, dog help us.
As #BradC mentioned, you shouldn't use functions in a WHERE clause if you have indexes and want to take advantage of them.
If you read the section entitled "Use LIKE instead of LEFT() or SUBSTRING() in WHERE clauses when Indexes are present" from these SQL Performance Tips, there are more examples.
It also hints at questions you'll encounter on the MCSE SQL Server 2012 exams if you're interested in taking those too. :-)

Using IN or a text search

I want to search a table to find all rows where one particular field is one of two values. I know exactly what the values would be, but I'm wondering which is the most efficient way to search for them:
for the sake of example, the two values are "xpoints" and "ypoints". I know for certain that there will be no other values in that field which has "points" at the end, so the two queries I'm considering are:
WHERE `myField` IN ('xpoints', 'ypoints')
--- or...
WHERE `myField` LIKE '_points'
which would give the best results in this case?
As always with SQL queries, run it through the profiler to find out. However, my gut instinct says that the IN search would be quicker. Especially in the example you gave, if the field is indexed, it would only have to do 2 lookups. If you did a LIKE search, it may have to do a scan, because you are looking for records that end with a certain value. IN would also be more accurate, as LIKE '_points' could also return 'gpoints', or any other similar string.
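Both halves of that argument (accuracy and index use) can be tried out quickly. This is a sketch with Python's sqlite3 module and a made-up table named t, not MySQL, so treat the plans as indicative only. The row 'gpoints' is added on purpose to show the accuracy difference.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (myField TEXT COLLATE NOCASE)")  # hypothetical table
conn.execute("CREATE INDEX idx_t_myfield ON t (myField)")
conn.executemany(
    "INSERT INTO t (myField) VALUES (?)",
    [("xpoints",), ("ypoints",), ("gpoints",), ("points",)],
)

base = "SELECT myField FROM t WHERE "
in_where = "myField IN ('xpoints', 'ypoints')"
like_where = "myField LIKE '_points'"  # _ matches exactly one character


def rows(where):
    return sorted(r[0] for r in conn.execute(base + where))


def plan(where):
    result = conn.execute("EXPLAIN QUERY PLAN " + base + where)
    return " | ".join(r[-1] for r in result)


in_rows, like_rows = rows(in_where), rows(like_where)
in_plan, like_plan = plan(in_where), plan(like_where)

print(in_rows, "<-", in_plan)
print(like_rows, "<-", like_plan)
```

The LIKE version picks up the extra 'gpoints' row, and because its pattern starts with a wildcard the planner has to scan instead of seeking.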
Unless all of the data items in the column in question start with 'x' or 'y', I believe IN will always give you a better query. If it is indexed, as #Kibbee points out, you will only have to perform 2 lookups to get both. Alternatively, if it is not indexed, a table scan using IN will only have to check the first letter most of the time whereas with LIKE it will have to check two characters every time (assuming all items are at least 2 characters) -- since the first character is allowed to be anything.
Try it and see. Create a large amount of test data. Also, try it with and without an index on myfield. While you are at it, see if there's a noticeable difference between LIKE '_points' and LIKE 'xpoint_'.
It depends on what the optimizer does with each query.
For small amounts of data, the difference will be negligible. Do whichever one makes more sense. For large amounts of data the amount of disk I/O matters much more than the amount of CPU time.
I'm betting that IN will get you better results than LIKE, if there is an index on myfield. I'm also betting that 'xpoint_' runs faster than '_points'. But there's nothing like trying it yourself.
MySQL can't use an index when using string comparisons such as LIKE '%foo' or '_foo', but can use an index for comparisons like 'foo%' and 'foo_'.
So in your case, IN will be much faster assuming that the field is indexed.
If you're working with a limited set of possible values, it's worth specifying the field as an ENUM - MySQL will then store it internally as an integer and make this sort of lookup much faster, and save disk space.
It will be faster to do the IN-version than the LIKE-version. Especially when your wildcard isn't at the end of the comparison, but even under ideal conditions IN would still be ideal up until your query nears the size of your max-query insert.
