SQL `LIKE` complexity

Does anyone know what the complexity is for the SQL LIKE operator for the most popular databases?

Let's consider the three core cases separately. This discussion is MySQL-specific, but may also apply to other DBMSs, since indexes are typically implemented in a similar manner.
LIKE 'foo%' is quick if run on an indexed column. MySQL indexes are a variation of B-trees, so when performing this query it can simply descend the tree to the node corresponding to foo, or the first node with that prefix, and traverse the tree forward. All of this is very efficient.
LIKE '%foo' can't be accelerated by indexes and will result in a full table scan. If you have other criteria that can be executed using indexes, it will only scan the rows that remain after the initial filtering.
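As a quick illustration (hypothetical table/index names; exact EXPLAIN output varies with version and data), the difference shows up directly in EXPLAIN:
-- assuming my_table with a B-tree index idx_col on col
EXPLAIN SELECT * FROM my_table WHERE col LIKE 'foo%';
-- typically type: range, key: idx_col  (index range scan)
EXPLAIN SELECT * FROM my_table WHERE col LIKE '%foo';
-- typically type: ALL, key: NULL       (full table scan)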
There's a trick though: If you need to do suffix matching - searching for file names with extension .foo, for instance - you can achieve the same performance by adding a column with the same contents as the original one but with the characters in reverse order.
ALTER TABLE my_table ADD COLUMN col_reverse VARCHAR(256) NOT NULL;
ALTER TABLE my_table ADD INDEX idx_col_reverse (col_reverse);
UPDATE my_table SET col_reverse = REVERSE(col);
Searching for rows with col ending in .foo then becomes:
SELECT * FROM my_table WHERE col_reverse LIKE 'oof.%'
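The one-off UPDATE above only backfills existing rows. To keep col_reverse in sync afterwards, one option is a pair of triggers (a sketch; the trigger names are made up, and this assumes nothing else maintains the column):
CREATE TRIGGER my_table_rev_ins BEFORE INSERT ON my_table
FOR EACH ROW SET NEW.col_reverse = REVERSE(NEW.col);

CREATE TRIGGER my_table_rev_upd BEFORE UPDATE ON my_table
FOR EACH ROW SET NEW.col_reverse = REVERSE(NEW.col);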
Finally, there's LIKE '%foo%', for which there are no shortcuts. If there are no other limiting criteria that reduce the number of rows to a feasible amount, it'll cause a hard performance hit. You might want to consider a full-text search solution instead, or some other specialized solution.
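For reference, a minimal MySQL full-text sketch (index name made up). Note that MATCH ... AGAINST does word-based matching rather than arbitrary substring matching, so it replaces the feature rather than being a drop-in for LIKE '%foo%':
ALTER TABLE my_table ADD FULLTEXT INDEX ft_col (col);
SELECT * FROM my_table WHERE MATCH(col) AGAINST('foo');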

If you are asking about the performance impact:
The problem with LIKE is that it can keep the database from using an index. On Oracle I think it doesn't use indexes anymore (but I'm still on Oracle 9). SQL Server uses indexes if the wildcard is only at the end. I don't know about other databases.

It depends on the RDBMS, the data (and possibly the size of the data), the indexes, and how the LIKE is used (with or without a leading wildcard)!
You are asking too general a question.

Related

Using temp table for sorting data in SQL Server

Recently, I came across a pattern (not sure, it could be an anti-pattern) for sorting data in a SELECT query. The pattern is a verbose, non-declarative way of ordering data: dump the relevant data from the actual table into a temporary table, then apply ORDER BY on a field of the temporary table. I guess the only reason someone would do that is to improve performance (which I doubt); I see no other benefit.
For example, let's say there is a user table. The table might contain millions of rows. We want to retrieve all users whose first name starts with 'G', sorted by first name. The natural and more declarative way to implement a SQL query for this scenario is:
More natural and declarative way
SELECT * FROM Users
WHERE NAME LIKE 'G%'
ORDER BY Name
Verbose way
SELECT * INTO TempTable
FROM Users
WHERE NAME LIKE 'G%'
SELECT * FROM TempTable
ORDER BY Name
With that context, I have a few questions:
Will there be any performance difference between the two ways if there is no index on the first-name field? If yes, which one would be better?
Will there be any performance difference between the two ways if there is an index on the first-name field? If yes, which one would be better?
Shouldn't the SQL Server optimizer generate the same execution plan for both ways?
Is there any benefit to the verbose way from any other perspective, like locking/blocking?
Thanks in advance.
Regularly: an anti-pattern used by people who have no idea what they're doing.
Sometimes: OK, because SQL Server has a problem that is not resolvable otherwise - I haven't seen one of those in years, though.
It makes things slower because it forces the tempdb table to be fully populated FIRST, while otherwise the query could POSSIBLY be resolved more efficiently.
The last time I saw this was about 3 years ago. We made it 3 times as fast by not being "smart" with a tempdb table ;)
Answers:
1: No, it still needs a table scan, obviously.
2: Possibly - it depends on the amount of data, but an index seek would return the data in order already (as the index is ordered by its content).
3: No. Obviously. Query plan optimization is statement by statement. By cutting the execution in two, the query optimizer CANNOT merge the two statements into one plan.
4: Only if you run into a query optimizer issue or a limitation on how many tables you can join - not in this degenerate case (degenerate in a technical sense, i.e. very simplistic). But if you need to join MANY, MANY tables, it may be better to go with an interim step.
If the field you want to ORDER BY is not indexed, you could put everything into a temp table, index the temp table, and then do the ordering - it might be faster. You would have to test to make sure.
There is never any benefit to the second approach that I can think of.
It means that if the data is available pre-ordered, SQL Server can't take advantage of this, and an unnecessary blocking operator and an additional sort are added to the plan.
In the case that the data is not available pre-ordered SQL Server will sort it in a work table either in memory or tempdb anyway and adding an explicit #temp table just adds an unnecessary additional step.
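To make that concrete, a minimal sketch (hypothetical index name): with an index on Name, the natural query can seek straight to the 'G%' range and emit rows already in Name order, so no sort operator is needed - exactly the pre-ordered case the #temp variant throws away (though with SELECT * and a large match count, the optimizer may still prefer a scan plus sort):
CREATE INDEX IX_Users_Name ON Users (Name);

SELECT * FROM Users
WHERE Name LIKE 'G%'
ORDER BY Name;
-- plan: index seek on the 'G%' range, no Sort operator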
Edit
I suppose one case where the second approach could give an apparent benefit might be if the presence of the ORDER BY caused SQL Server to choose a different plan that turned out to be sub optimal. In which case I would resolve that in a different way by either improving statistics or by using hints/query rewrite to avoid the undesired plan.

Index on multiple bit fields in SQL Server

We currently have a scenario where one table effectively has several (10 to 15) boolean flags (non-nullable bit fields). Unfortunately, it is not really possible to simplify this much at the logical level, because any combination of the boolean values is permissible.
The table in question is a transactional table which may end up having tens of millions of rows, and both insert and select performance are fairly critical. Although we are not quite sure of the distribution of the data at this time, the combination of all flags should provide relatively good cardinality, i.e. make it a "worthwhile" index for SQL Server to use.
Typical select query scenarios might be to select records based on 3 or 4 of the flags only, e.g. WHERE FLAG3=1 AND FLAG7=0 AND FLAG9=1. It would not be practical to create separate indexes for all combinations of the flags used by these select queries, as there will be many of them.
Given this situation, what would be the recommended approach to effectively index these fields? The table is new, so there is no existing data to worry about yet, and we have a fair amount of flexibility in the actual implementation of the table.
There are two main options that we are considering at the moment:
Create a single index which includes all the bit fields (this would probably also include 1 or 2 other int fields which would always be used). My concern is that, given the typical usage of only a few of the fields, this approach would skip the index and resort to a table scan. Let's call this Option A. (Having read some of the replies, it seems this approach would not work well, since the order of the fields in the index makes a difference, making it impossible to index effectively on ALL the fields.)
Effectively do what I believe SQL Server is doing internally, and encode the bit fields into a single int field using binary operators (AND-ing and OR-ing numbers together: 1, 2, 4, 8, etc.). My concern here is that we'd need some kind of calculation to query on this encoded field, which would again skip the index. Maintenance and the complexity of this solution are also concerns. Let's call this Option B. Additional info: the argument for this approach is that we could have a relatively simple and short index which includes one or two other fields from the table plus this field. The other fields would narrow down the number of records needing to be evaluated, and since the encoded field contains all of our bit fields, SQL Server would be able to perform the calculation using the data retrieved from the index directly (i.e. an index scan) as opposed to the table (i.e. a table scan).
At the moment, we are heavily leaning towards Option B. For completeness, this would be running on SQL Server 2008.
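For what it's worth, here is a sketch of what an Option B query might look like (column and table names made up; flag N stored in bit N-1, so FLAG3 = 4, FLAG7 = 64, FLAG9 = 256):
-- FLAG3 = 1 AND FLAG7 = 0 AND FLAG9 = 1
SELECT *
FROM MyTransactions
WHERE (Flags & (4 | 64 | 256)) = (4 | 256);
The bitwise predicate is not sargable - SQL Server cannot seek on it - so at best it scans the narrow index containing Flags rather than the whole table, which is exactly the trade-off described above.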
Any advice would be greatly appreciated.
Edit: Spelling, clarity, query example, additional info on Option B.
A single BIT column typically is not selective enough to be even considered for use in an index. So an index on a single BIT column really doesn't make sense - on average, you'd always have to search about half the entries in the table (50% selectiveness) and so the SQL Server query optimizer will instead use a table scan.
If you create a single index on all 15 bit columns, then you don't have that problem - since you have 15 yes/no options, your index will become quite selective.
Trouble is: the sequence of the bit columns is important. Your index will only ever be considered if your SQL statement filters on a left-most prefix of the indexed columns.
So if your index is on
Col1,Col2,Col3,....,Col14,Col15
then it might be used for a query that uses
Col1
Col1 and Col2
Col1 and Col2 and Col3
....
and so on. But it cannot be used for a query that specifies Col6,Col9 and Col14.
Because of that, I don't really think an index on your collection of BIT columns really makes a lot of sense.
Are those 15 BIT columns the only columns you use for querying? If not, I would try to combine those BIT columns that you use most for selection with other columns, e.g. have an index on Name and Col7 or something (then your BIT columns can add some additional selectivity to another index)
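For instance (hypothetical names), where Name does the heavy lifting for selectivity and the bit column merely refines it:
CREATE INDEX IX_MyTable_Name_Col7 ON MyTable (Name, Col7);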
Whilst there are probably ways to solve your indexing problem against your existing table schema, I would reduce this to a normalisation problem:
e.g. I would highly recommend creating a series of new tables:
A lookup table for the names of these bit flags, e.g. CREATE TABLE Flags (id int IDENTITY(1,1), Name varchar(256)) (you don't have to make id an identity column if you want to control the ids manually - e.g. 2, 4, 8, 16, 32, 64, 128 as binary flags).
Create a new link table that contains the ids of the original data table and of the Flags table, e.g. CREATE TABLE DataFlags_Link (id int IDENTITY(1,1), MyFlagId int, DataId int)
You could then create an index on the DataFlags_Link table and write queries like:
SELECT Data.*
FROM Data
INNER JOIN DataFlags_Link ON Data.id = DataFlags_Link.DataId
WHERE DataFlags_Link.MyFlagId IN (4,7,2,8)
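Note that IN matches rows that carry any of the listed flags; if a query needs rows that carry all of them, one sketch (same hypothetical tables) is to count the distinct matching flags:
SELECT d.*
FROM Data d
WHERE (SELECT COUNT(DISTINCT l.MyFlagId)
       FROM DataFlags_Link l
       WHERE l.DataId = d.id
         AND l.MyFlagId IN (2, 4, 7, 8)) = 4;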
As for performance, that's where good DBA maintenance comes in. You'll want to set the INDEX fill-factor and padding on your tables appropriately and run regular index defragmentation or rebuild your indexes on a schedule.
Performance and maintenance go hand-in-hand with databases. You can't have one without the other.
Whilst I think Neil Fenwick's answer is probably right, I think the real answer is to try out the different options and see which one is fast enough.
Option 1 is probably the most straightforward solution, and therefore likely the most maintainable - and it may well be fast enough.
I would build a prototype database, with the "option 1" schema, and use something like http://www.red-gate.com/products/sql-development/sql-data-generator/ or http://sourceforge.net/projects/dbmonster/ to create twice as much data as you expect to need, and then build the queries you expect to need. Agree an acceptable response time, and only consider a "faster" schema if you exceed those response times (and you can't throw hardware at the problem).
Neil's solution is probably just as obvious and maintainable as "option 1" - and it should be easy to index. However, I'd still test it by creating a prototype schema and generating a lot of test data...

Multiple indexes on one column

Using Oracle, there is a table called User.
Columns: Id, FirstName, LastName
Indexes: 1. PK(Id), 2. UPPER(FirstName), 3. LOWER(FirstName), 4. Index(FirstName)
As you can see index 2, 3, 4 are indexes on the same column - FirstName.
I know this creates overhead, but my question is: when selecting, how will the database react/optimize?
For instance:
SELECT Id FROM User u WHERE
u.FirstName LIKE 'MIKE%'
Will Oracle hit the right index or will it not?
The problem is that through Hibernate (which uses prepared statements) this query becomes VERY slow.
Thanks.
UPDATE: Just to clarify indexes 2 and 3 are functional indexes.
In addition to Mat's point that either index 2 or 3 should be redundant (because you should choose one approach to case-insensitive searches) and Richard's point that it will depend on the selectivity of the index, be aware that there are additional concerns when you are using the LIKE clause.
Assuming you are using bind variables (which it sounds like you are based on your use of prepared statements), the optimizer has to guess at how selective the actual bind value is going to be. Something short like 'S%' is going to be very non-selective, causing the optimizer to generally prefer a table scan. A longer string like 'Smithfield-Manning%', on the other hand, is likely to be very selective and would likely use index 4. How Oracle handles this variability will depend on the version.
In Oracle 9i, Oracle introduced bind variable peeking. This meant that the first time Oracle parsed a query after a reboot (or after the query plan had aged out of the shared pool), Oracle looked at the bind value and decided what plan to use based on that value. Assuming that most of your queries would benefit from the index scan because users are generally searching on relatively selective values, this was great if the first query after a reboot had a selective condition. But if you got unlucky and someone ran a WHERE firstname LIKE 'S%' immediately after a reboot, you'd be stuck with the table scan query plan until the query plan was removed from the shared pool.
Starting in Oracle 11, however, the optimizer has the ability to do adaptive cursor sharing. That means that the optimizer will try to figure out that WHERE firstname LIKE 'S%' should do a table scan and WHERE firstname LIKE 'Smithfield-Manning%' should do an index scan and will maintain multiple query plans for the same statement in the shared pool. That solves most of the problems that we had with bind variable peeking in earlier versions.
But even here, the accuracy of the optimizer's selectivity estimates is generally going to be problematic for medium-length strings. It's generally going to know that a single-character string is very weakly selective and that a 20-character string is highly selective, but even with a 254-bucket histogram it's not going to have a whole lot of information about how selective something like WHERE firstname LIKE 'Smit%' really is. It may know roughly how selective 'Sm%' is based on the column histogram, but it's guessing rather blindly at how selective the next two characters are. So it's not uncommon to end up in a situation where most of the queries work efficiently but the optimizer is convinced that WHERE firstname LIKE 'Cave%' isn't selective enough to use an index.
Assuming that this is a common query, you may want to consider using Oracle's plan stability features to force Oracle to use a particular plan regardless of the value of a bind variable. This may mean that users that enter a single character have to wait even longer than they would otherwise have waited because the index scan is substantially less efficient than doing a table scan. But that may be worth it for other users that are searching for short but reasonably distinctive last names. And you may do things like add a ROWNUM limiter to the query or add logic to the front end that requires a minimum number of characters in the search box to avoid situations where a table scan would be more efficient.
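For example, a ROWNUM limiter along these lines (the 200-row cap is arbitrary, and this assumes the bind value is already upper-cased to match a function-based index):
SELECT id
FROM (SELECT id
      FROM User u
      WHERE UPPER(u.firstname) LIKE :name || '%'
      ORDER BY u.firstname)
WHERE ROWNUM <= 200;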
It's a bit strange to have both the upper and lower function-based indexes on the same field. And I don't think the optimizer will use either in your query as it is.
You should pick one or the other (and probably drop the last one too), and only ever query on the upper- (or lower-) cased value with something like:
select id from user u where upper(u.firstname) like 'MIKE%'
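where the matching function-based index (index 2 from the question) would look something like:
create index idx_upper_firstname on user (upper(firstname));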
Edit: look at this post too, it has some interesting info: How to use a function-based index on a column that contains NULLs in Oracle 10+?
It may not hit any of your indexes, because you are returning ID in the SELECT clause, which is not covered by the indexes.
If the index is very selective, and Oracle decides it is still worthwhile using it to find 'MIKE%' then perform a lookup on the data to get the ID column, then it may use 4. Index(FirstName). 2 and 3 will only be used if the column searched uses the exact function defined in the index.

Optimizing MySQL Queries: Is it always possible to optimize a query so that it doesn't use "ALL"

According to the MySQL documentation regarding Optimizing Queries With Explain:
* ALL: A full table scan is done for each combination of rows from the previous tables. This is normally not good if the table is the first table not marked const, and usually very bad in all other cases. Normally, you can avoid ALL by adding indexes that allow row retrieval from the table based on constant values or column values from earlier tables.
Does this mean that any query that uses ALL can be optimized so that it is no longer doing a full table scan?
In other words, by adding the correct indexes to the table, is it possible to always avoid using ALL? Or are there some cases where ALL is unavoidable, no matter what indexes you add?
It's almost always possible (there are cases where doing a full scan is actually cheaper) to optimize ONE query to avoid a full scan by creating appropriate indexes. However, if you run multiple queries against the same table, there are scenarios where either some of them will end up doing full scans or you'll end up with more indexes than you have columns in your table :-)
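A minimal before/after illustration (table and column names made up):
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;
-- type: ALL (no usable index, full table scan)

ALTER TABLE orders ADD INDEX idx_customer (customer_id);

EXPLAIN SELECT * FROM orders WHERE customer_id = 42;
-- type: ref, key: idx_customer (row retrieval via the index)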
Yes, there are some queries where you'd be hard-pressed to produce an appropriate index. For example:
SELECT * FROM mytable WHERE colA * arg0 - colB > arg1
I'm not entirely sure why you'd want to make such a query, though :)
That said, too many indexes will use up more cache memory and disk space, and slow down updates and inserts.

SQL Server Index - Any improvement for LIKE queries?

We have a query that runs off a fairly large table that unfortunately needs to use LIKE '%ABC%' on a couple of varchar fields so the user can search on partial names, etc. (SQL Server 2005)
Would adding an index on these varchar fields help any in terms of select query performance when using LIKE or does it basically ignore the indexes and do a full scan in those cases?
Any other possible ways to improve performance when using LIKE?
Only if you add full-text searching to those columns, and use the full-text query capabilities of SQL Server.
Otherwise, no, an index will not help.
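For completeness, a minimal full-text setup sketch (object names made up; note that CONTAINS matches words and word prefixes, not arbitrary mid-word substrings, so the search semantics change):
CREATE FULLTEXT CATALOG ftCatalog AS DEFAULT;

-- KEY INDEX must name the table's unique key index
CREATE FULLTEXT INDEX ON dbo.MyTable (NameColumn)
    KEY INDEX PK_MyTable;

SELECT * FROM dbo.MyTable
WHERE CONTAINS(NameColumn, '"ABC*"');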
You can potentially see performance improvements by adding index(es), it depends a lot on the specifics :)
How much of the total size of the row are your predicated columns? How many rows do you expect to match? Do you need to return all rows that match the predicate, or just top 1 or top n rows?
If you are searching for values with high selectivity/uniqueness (so few rows to return), and the predicated columns are a smallish portion of the entire row size, an index could be quite useful. It will still be a scan, but your index will fit more rows per page than the source table.
Here is an example where the total row size is much greater than the column size to search across:
create table t1 (v1 varchar(100), b1 varbinary(8000))
go
--add 10k rows of filler
insert t1 values ('abc123def', cast(replicate('a', 8000) as varbinary(8000)))
go 10000
--add 1 row to find
insert t1 values ('abc456def', cast(replicate('a', 8000) as varbinary(8000)))
go
set statistics io on
go
select * from t1 where v1 like '%456%'
--shows 10001 logical reads
--create index that only contains the column(s) to search across
create index t1i1 on t1(v1)
go
select * from t1 where v1 like '%456%'
--shows 37 logical reads
If you look at the actual execution plan, you can see the engine scanned the index and did a bookmark lookup on the matching row. Or you can tell the optimizer directly to use the index, if it hadn't decided to use this plan on its own:
select * from t1 with (index(t1i1)) where v1 like '%456%'
If you have a bunch of columns to search across but only a few that are highly selective, you could create multiple indexes and use a reduction approach: first determine a set of IDs (or whatever your PK is) from your highly selective index, then search your less selective columns with a filter against that small set of PKs.
If you always need to return a large set of rows you would almost certainly be better off with a table scan.
So the possible optimizations depend a lot on the specifics of your table definition and the selectivity of your data.
HTH!
-Adrian
The only other way (other than using full-text indexing) to improve performance is to use LIKE 'ABC%' - don't add the wildcard at both ends of your search term - in that case, an index can work.
If your requirements are such that you have to have wildcards on both ends of your search term, you're out of luck...
Marc
Like '%ABC%' will always perform a full table scan. There is no way around that.
You do have a couple of alternative approaches. Firstly full text searching, it's really designed for this sort of problem so I'd look at that first.
Alternatively in some circumstances it might be appropriate to denormalize the data and pre-process the target fields into appropriate tokens, then add these possible search terms into a separate one to many search table. For example, if my data always consisted of a field containing the pattern 'AAA/BBB/CCC' and my users were searching on BBB then I'd tokenize that out at insert/update (and remove on delete). This would also be one of those cases where using triggers, rather than application code, would be much preferred.
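A rough sketch of that search table (all names hypothetical), which would be populated by the insert/update triggers mentioned above:
CREATE TABLE SearchTokens (
    Token varchar(100) NOT NULL,
    RowId int NOT NULL  -- PK of the row in the base table
);
CREATE INDEX IX_SearchTokens_Token ON SearchTokens (Token, RowId);

-- seek on the token, then join back for the full rows
SELECT t.*
FROM MyTable t
JOIN SearchTokens s ON s.RowId = t.Id
WHERE s.Token = 'BBB';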
I must emphasize that this is not really an optimal technique and should only be used if the data is a good match for the approach and, for some reason, you do not want to use full-text search (and the database's performance on the LIKE scan really is unacceptable). It's also likely to produce maintenance headaches further down the line.
Create statistics on that column. SQL Server 2005 has optimized in-string searches, so you might benefit from that.