Multiple indexes on one column - sql

Using Oracle, there is a table called User.
Columns: Id, FirstName, LastName
Indexes: 1. PK(Id), 2. UPPER(FirstName), 3. LOWER(FirstName), 4. Index(FirstName)
As you can see, indexes 2, 3, and 4 are indexes on the same column - FirstName.
I know this creates overhead, but my question is: when selecting, how will the database react/optimize?
For instance:
SELECT Id FROM User u WHERE
u.FirstName LIKE 'MIKE%'
Will Oracle hit the right index or will it not?
The problem is that via Hibernate (which uses prepared statements) this query becomes VERY slow.
Thanks.
UPDATE: Just to clarify indexes 2 and 3 are functional indexes.

In addition to Mat's point that either index 2 or 3 should be redundant because you should choose one approach to doing case-insensitive searches and to Richard's point that it will depend on the selectivity of the index, be aware that there are additional concerns when you are using the LIKE clause.
Assuming you are using bind variables (which it sounds like you are based on your use of prepared statements), the optimizer has to guess at how selective the actual bind value is going to be. Something short like 'S%' is going to be very non-selective, causing the optimizer to generally prefer a table scan. A longer string like 'Smithfield-Manning%', on the other hand, is likely to be very selective and would likely use index 4. How Oracle handles this variability will depend on the version.
In Oracle 9i, Oracle introduced bind variable peeking. This meant that the first time Oracle parsed a query after a reboot (or after the query plan was aged out of the shared pool), Oracle looked at the bind value and decided what plan to use based on that value. Assuming that most of your queries would benefit from the index scan because users are generally searching on relatively selective values, this was great if the first query after a reboot had a selective condition. But if you got unlucky and someone did a WHERE firstname LIKE 'S%' immediately after a reboot, you'd be stuck with the table scan plan until the query plan was removed from the shared pool.
Starting in Oracle 11, however, the optimizer has the ability to do adaptive cursor sharing. That means that the optimizer will try to figure out that WHERE firstname LIKE 'S%' should do a table scan and WHERE firstname LIKE 'Smithfield-Manning%' should do an index scan and will maintain multiple query plans for the same statement in the shared pool. That solves most of the problems that we had with bind variable peeking in earlier versions.
But even here, the accuracy of the optimizer's selectivity estimates is generally going to be problematic for medium-length strings. It's generally going to know that a single-character string is very weakly selective and that a 20-character string is highly selective, but even with a 256-bucket histogram, it's not going to have a whole lot of information about how selective something like WHERE firstname LIKE 'Smit%' really is. It may know roughly how selective 'Sm%' is based on the column histogram, but it's guessing rather blindly at how selective the next two characters are. So it's not uncommon to end up in a situation where most of the queries work efficiently but the optimizer is convinced that WHERE firstname LIKE 'Cave%' isn't selective enough to use an index.
Assuming that this is a common query, you may want to consider using Oracle's plan stability features to force Oracle to use a particular plan regardless of the value of a bind variable. This may mean that users who enter a single character have to wait even longer than they would otherwise have waited, because the index scan is substantially less efficient than doing a table scan. But that may be worth it for other users who are searching for short but reasonably distinctive first names. And you may do things like add a ROWNUM limiter to the query or add logic to the front end that requires a minimum number of characters in the search box to avoid situations where a table scan would be more efficient.
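For instance, a rough sketch of the ROWNUM limiter idea (the 100-row cap and the bind name are my own choices; the table is quoted because USER is a reserved word in Oracle):
SELECT id
  FROM (SELECT u.id
          FROM "USER" u
         WHERE u.firstname LIKE :name_prefix
         ORDER BY u.firstname)
 WHERE ROWNUM <= 100;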

It's a bit strange to have both the upper and lower function-based indexes on the same field. And I don't think the optimizer will use either in your query as it is.
You should pick one or the other (and probably drop the last one too), and only ever query using UPPER (or LOWER) with something like:
select id from user u where upper(u.firstname) like 'MIKE%'
Edit: take a look at this post too; it has some interesting related info: How to use a function-based index on a column that contains NULLs in Oracle 10+?
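For completeness, a sketch of keeping just the upper-case variant (the index name is mine; the table is quoted because USER is a reserved word in Oracle):
CREATE INDEX idx_user_upper_firstname ON "USER" (UPPER(firstname));
SELECT id FROM "USER" u WHERE UPPER(u.firstname) LIKE 'MIKE%';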

It may not hit any of your indexes, because you are returning ID in the SELECT clause, which is not covered by the indexes.
If the index is very selective, and Oracle decides it is still worthwhile to use it to find 'MIKE%' and then perform a lookup on the table to get the ID column, then it may use 4. Index(FirstName). Indexes 2 and 3 will only be used if the query applies the exact function defined in the index to the searched column.
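If that extra table lookup proves expensive, one option (a sketch; the index name is mine) is to make the index covering by adding Id to it, so Oracle can answer the query from the index alone:
CREATE INDEX idx_user_firstname_id ON "USER" (firstname, id);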

Related

Why "IN " query tag is so costly in sql stored procedures?

How can I fix my performance issue? I have a SQL query with IN, and I suspect IN is causing the costly performance problem. Do I need to index something for my query?
My sql query:
SELECT [p].[ReferencedxxxId]
FROM [Common].[xxxReference] AS [p]
WHERE ([p].[IsDeleted] = 0)
AND (([p].[ReferencedxyzType] = @__refxyzType_0)
AND [p].[ReferencedxxxId] IN ('42342','ffsdfd','5345345345'))
My solution (but I need your help for better advice): which one is correct, a clustered or a nonclustered index?
USE [xxx]
GO
CREATE NONCLUSTERED INDEX IX_NonClusteredIndexDemo_xxxId
ON [Common].[xxxReference](xxxId)
INCLUDE ([ID],[ReferencedxxxId])
WITH (DROP_EXISTING=ON, ONLINE=ON, FILLFACTOR=90)
GO
Second:
CREATE INDEX xxxReference_ReferencedxxxId_index
ON [Common].[xxxReference] (ReferencedxxxId)
Which one is correct, or do you have a better solution?
The performance problem of this query is not the result of using the IN operator.
This operator performs very well with small lists (say, less than 1000 members).
The performance bottleneck here is the fact that SQL Server performs an index scan instead of an index seek (which is very costly), plus the key lookup, which is 20% of the query cost.
To avoid both problems, you can add an index on IsDeleted, ReferencedxyzType and ReferencedxxxId - probably in this exact order.
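A sketch of that suggested index (the name is mine; column order as proposed):
CREATE NONCLUSTERED INDEX IX_xxxReference_IsDeleted_Type_RefId
ON [Common].[xxxReference] (IsDeleted, ReferencedxyzType, ReferencedxxxId);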
SQL performance tuning is a science that tends to look a little like art or magic - either way you look at it, it requires a good knowledge of both the theory and practice of index settings and the relevant system's requirements.
Therefore, my suggestion is this: do not attempt to solve it yourself with the help of strangers on the internet. Get an expert for a consulting job for a couple of hours/days to analyze the system and help you fine-tune it.
Learn whatever you can during this process. Ask questions about everything that is not trivial. This will be money well spent.
A couple of things:
If you have a SELECT statement inside the IN, that should be avoided and replaced with an EXISTS clause. In your example above that is not relevant, as you have literal values inside the IN.
Using EXISTS and NOT EXISTS instead of IN and NOT IN helps SQL Server avoid scanning each value of the column for every value inside the IN / NOT IN; instead it can short-circuit the search once a match or non-match is found, as in the sketch below.
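A sketch of the rewrite, with hypothetical table names (your actual query uses literal values, so this is only to illustrate the point):
-- instead of:
SELECT e.EmployeeId FROM Employee AS e
WHERE e.EmployeeId IN (SELECT b.EmployeeId FROM BannedEmployee AS b);
-- prefer:
SELECT e.EmployeeId FROM Employee AS e
WHERE EXISTS (SELECT 1 FROM BannedEmployee AS b WHERE b.EmployeeId = e.EmployeeId);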
Avoid implicit conversions. They degrade performance for many reasons, including: (i) SQL Server cannot find proper statistics on an index and hence cannot leverage it, and will instead use the clustered index available on the table (which may not cover your query); (ii) the storage engine does not assign the proper amount of RAM during the memory-allocation phase of the query; (iii) cardinality estimation goes wrong, because SQL Server has statistics on the column itself rather than on the computed value of that column.
If you look at the execution plan you posted, you will see a yellow mark on the SELECT operator. If you hover over it, you will see one or more warning messages. If the warning is related to implicit conversion, try to use proper datatypes during comparison.
E.g. what is the datatype of the column [ReferencedxxxId]? If it is not NVARCHAR but rather VARCHAR, then I would suggest:
Make the values inside the IN VARCHAR (currently you are making them NVARCHAR). This way you will still be able to take full advantage of the rowstore index created on the [ReferencedxxxId] column.
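A sketch of the difference (assuming [ReferencedxxxId] is VARCHAR):
SELECT [p].[ReferencedxxxId]
FROM [Common].[xxxReference] AS [p]
WHERE [p].[ReferencedxxxId] IN ('42342', 'ffsdfd', '5345345345');  -- VARCHAR literals, index-friendly
-- versus NVARCHAR literals, which force an implicit conversion on a VARCHAR column:
-- WHERE [p].[ReferencedxxxId] IN (N'42342', N'ffsdfd', N'5345345345')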
If you must have the values as NVARCHAR inside the IN clause, then you should CONVERT/CAST the column [ReferencedxxxId] in your IN clause. This gets rid of the implicit conversion, but you will no longer be able to take full advantage of the rowstore index on the [ReferencedxxxId] column. In that case, consider instead creating a clustered/nonclustered columnstore index on the table covering the columns used in the query; this should significantly enhance the performance of your SELECT query.
If you decide to go the route of using the rowstore index by correcting the values inside the IN, make sure that you create a clustered/nonclustered index which covers the query, meaning the index covers the columns on which you are searching ([ReferencedxxxId], [ReferencedxyzType], [IsDeleted]) and then includes the columns used in the SELECT statement under an INCLUDE clause (if it is a nonclustered index).
Also, when you create a composite rowstore index, try to order the columns inside the index from high cardinality to low cardinality, left to right, to make the best use of that index.
Assuming an OLTP-based system and not OLAP, my first pass would be an NC index on ReferencedxyzType, ReferencedxxxId, IsDeleted - given IsDeleted is likely to have the least selectivity, I would place it last.
In a higher-volume scenario I might even be tempted to move IsDeleted out of the index key and into an INCLUDE instead, since it provides so little selectivity to the index itself - see the sketch below.
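A sketch of that variant (the index name is mine):
CREATE NONCLUSTERED INDEX IX_xxxReference_Type_RefId
ON [Common].[xxxReference] (ReferencedxyzType, ReferencedxxxId)
INCLUDE (IsDeleted);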
There is clearly already a clustered index in place on the table (we can see it in the query plan), but we don't have the details of what is in it.
The question around clustered vs non-clustered is more complex and requires a lot more knowledge of the system and usage.

Query with LIKE, increasingly slow with a smaller resultset

Say I have a Person table with 200000 records, with a clustered index on its GUID primary key. This GUID is generated using the NEWSEQUENTIALID() construct provided by SQL Server (2008 R2). Furthermore, there is a regular index on the LastName (varchar(256)) column.
For every record I've generated a unique name (Lastname_1 through Lastname_200000). Now I'm playing around with some queries and have found that the more restrictive my criteria are, the slower SQL Server returns results. And this performance implication is quite severe.
E.g.:
SELECT * FROM Person WHERE Lastname LIKE '%Lastname_123456%'
Is much slower than
SELECT * FROM Person WHERE Lastname LIKE '%Lastname_123%'
Response times are measured by setting statistics on:
SET STATISTICS TIME ON
I can imagine this being caused:
1) by the LIKE clause itself - since it starts with %, it isn't possible to use the index on that particular column;
2) by SQL having to think more about my 'bigger question'.
Is there any truth in this? Is there some way to avoid this?
Edit:
To add some context to this question, this is part of a use case for a 'free search'. I would very much like the system to be fast when a user enters a full lastname.
How should I make these cases perform? Should I avoid the '%xxx%' construction and go for an 'xxx%'-style construction? That adds a lot of speed, but at the cost of some flexibility for the user...
You are right on with number 2: the more restrictive LIKE must match more characters in the string, and since SQL stops comparing as soon as it finds a character that doesn't match, it takes fewer string-matching iterations to evaluate a shorter search string - even though you get more results back.
As for #1 - SQL will use an index if possible for a LIKE, but will probably do an index scan (probably of the clustered index), since a seek is not possible with a leading wildcard. It also depends on what's included in the index - since you are selecting all columns, it's likely that a table scan is happening instead, because the index you 'could' use is not covering your query (unless it's the clustered index).
Check your execution plan - you will likely see a table scan
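To see the covering effect, a sketch (this recreates the LastName index from the question under a name of my choosing): a narrow index at least lets SQL Server scan the index rather than the whole table once the query only references covered columns, even though the leading wildcard still rules out a seek:
CREATE NONCLUSTERED INDEX IX_Person_LastName ON Person (LastName);
-- the index covers this narrower query, so the scan reads the index, not the table:
SELECT LastName FROM Person WHERE LastName LIKE '%Lastname_123%';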
Usually, SQL Server does not use indexes for a LIKE with a leading wildcard.
This article can help guide you

Tips and Tricks to speed up an SQL [duplicate]

Possible Duplicate:
Does the order of columns in a WHERE clause matter?
These are basic SQL functions and keywords.
Are there any tips or tricks to speed up your SQL?
For example; I have a query with a lot of keywords. (AND, GROUP BY, ORDER BY, IN, BETWEEN, LIKE... etc.)
Which keyword should come first in my query?
How can I decide?
Example;
Where NUMBER IN (156, 646)
AND DATE BETWEEN '01/01/2011' AND '01/02/2011'
OR
Where DATE BETWEEN '01/01/2011' AND '01/02/2011'
AND NUMBER IN (156, 646)
Which one is faster? What does it depend on?
Don't use functions in the WHERE clause, because the query engine must execute the function for every single row.
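For example (a sketch; the table and column names are made up):
-- forces the function to run per row and prevents an index seek on OrderDate:
SELECT * FROM Orders WHERE YEAR(OrderDate) = 2011;
-- sargable equivalent that can use an index on OrderDate:
SELECT * FROM Orders WHERE OrderDate >= '2011-01-01' AND OrderDate < '2012-01-01';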
There are no "tricks".
Given the competition between the database vendors about which one is "faster", any "trick" that is always true would be implemented in the database itself. (The tricks are implemented in the part of the database called "optimizer").
There are only things to be aware of, but they typically can't be reduced into:
Use feature X
Avoid feature Y
Model like this
Never model like that
Look at all the raging questions/discussions about indexes, index types, index strategies, clustering, single column keys, compound keys, referential integrity, access paths, joins, join mechanisms, storage engines, optimizer behaviour, datatypes, normalization, query transformations, denormalization, procedures, buffer cache, resultset cache, application cache, modeling, aggregation, functions, views, indexed views, set processing, procedural processing and the list goes on.
All of them were invented to attack a specific problem area. Variations on that problem make the "trick" more or less suitable. Very often the tricks have zero effect, and sometimes they are flat-out horrible. Why? Because when we don't understand why something works, we are basically just throwing features at the problem until it goes away.
The key point here is that there is a reason why something makes a query go faster, and the understanding of what that something is, is crucial to the process of understanding why a different unrelated query is slow, and how to deal with it. And it is never a trick, nor magic.
We (humans) are lazy, and we want to be thrown that fish when what we really need is to learn how to catch it.
Now, what specific fish do YOU want to catch?
Edited for comments:
The placement of your predicates in the where clause makes no difference since the order in which they are processed is determined by the database. Some of the things which will affect that order (for your example) are :
Whether or not the query can be rewritten against an indexed view
What indexes are available that cover one or both of the columns NUMBER and DATE, and in what order they exist in that index
The estimated selectivity of your predicates, which basically means the estimated percentage of rows matched by your predicate. The lower the %, the more likely the optimizer is to use your index efficiently.
The clustering factor (or whatever the name is in SQL Server), if SQL Server factors that into the query cost. This has to do with how the order of the index entries aligns with the physical order of the table rows. Better alignment = reduced cost for a higher % of rows fetched via that index.
Now, if the only values you have in column NUMBER are 156, 646 and they are pretty much evenly spread, an index would be useless. A full scan would be a better alternative.
On the other hand, if those are unique order numbers (backed by a unique index), the optimizer will pick that index and drive the query from there. Similarly, if the rows having a DATE between the first and second of January 2011 make up a small enough % of the rows, an index leading with DATE will be considered.
Or if you include ORDER BY NUMBER, DATE, another parameter comes into the equation: the cost of sorting. An index on (NUMBER, DATE) will now seem more attractive to the optimizer, because even though it might not be the most efficient way of acquiring the rows, the sorting (which is expensive) can be skipped.
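A sketch of that case (the table name is made up; NUMBER and DATE are the column names from the example, which in a real schema would need quoting since both are reserved words):
CREATE INDEX ix_orders_number_date ON Orders (NUMBER, DATE);
-- rows come back already ordered by the index, so the sort step can be skipped:
SELECT * FROM Orders
WHERE NUMBER IN (156, 646) AND DATE BETWEEN '2011-01-01' AND '2011-02-01'
ORDER BY NUMBER, DATE;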
Or, if your query included a join to another table (say customer) on customer_id and you also had a filter on customer.ssn, the equation changes again, because (since you did a good job with foreign keys and a backing index) you will now have a very efficient access path into your first table, without using the indexes on NUMBER or DATE. Unless you only have one customer and all of the 10 million orders were his...
Read about sargable queries (ones which can use the index versus ones which can't). Avoid correlated subqueries, functions in WHERE clauses, cursors and while loops. Don't use SELECT *, especially if you have joins; never return more data than you need.
Actually, there are whole books written on performance tuning; get one and read it for the database you are using, as the techniques vary from database to database.
Learn to use indexes properly.
http://Use-The-Index-Luke.com/

SQL `LIKE` complexity

Does anyone know what the complexity is for the SQL LIKE operator for the most popular databases?
Let's consider the three core cases separately. This discussion is MySQL-specific, but might also apply to other DBMS due to the fact that indexes are typically implemented in a similar manner.
LIKE 'foo%' is quick if run on an indexed column. MySQL indexes are a variation of B-trees, so when performing this query it can simply descend the tree to the node corresponding to foo, or the first node with that prefix, and traverse the tree forward. All of this is very efficient.
LIKE '%foo' can't be accelerated by indexes and will result in a full table scan. If you have other criteria that can be evaluated using indexes, it will only scan the rows that remain after the initial filtering.
There's a trick though: If you need to do suffix matching - searching for file names with extension .foo, for instance - you can achieve the same performance by adding a column with the same contents as the original one but with the characters in reverse order.
ALTER TABLE my_table ADD COLUMN col_reverse VARCHAR (256) NOT NULL;
ALTER TABLE my_table ADD INDEX idx_col_reverse (col_reverse);
UPDATE my_table SET col_reverse = REVERSE(col);
Searching for rows with col ending in .foo then becomes:
SELECT * FROM my_table WHERE col_reverse LIKE 'oof.%'
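If you'd rather not hand-reverse the literal, you can let MySQL do it for you (equivalent to the query above):
SELECT * FROM my_table WHERE col_reverse LIKE CONCAT(REVERSE('.foo'), '%');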
Finally, there's LIKE '%foo%', for which there are no shortcuts. If there are no other limiting criteria which reduce the number of rows to a feasible amount, it'll cause a hard performance hit. You might want to consider a full-text search solution instead, or some other specialized solution.
If you are asking about the performance impact:
The problem with LIKE is that it can keep the database from using an index. On Oracle I think it doesn't use indexes anymore (but I'm still on Oracle 9). SQL Server uses indexes if the wildcard is only at the end. I don't know about other databases.
Depends on the RDBMS, the data (and possibly size of data), indexes and how the LIKE is used (with or without prefix wildcard)!
You are asking too general a question.

Do indexes work with "IN" clause

If I have a query like:
Select EmployeeId
From Employee
Where EmployeeTypeId IN (1,2,3)
and I have an index on the EmployeeTypeId field, does SQL Server still use that index?
Yeah, that's right. If your Employee table has 10,000 records, and only 5 records have EmployeeTypeId in (1,2,3), then it will most likely use the index to fetch the records. However, if it finds that 9,000 records have the EmployeeTypeId in (1,2,3), then it would most likely just do a table scan to get the corresponding EmployeeIds, as it's faster just to run through the whole table than to go to each branch of the index tree and look at the records individually.
SQL Server does a lot of stuff to try and optimize how the queries run. However, sometimes it doesn't get the right answer. If you know that SQL Server isn't using the index, by looking at the execution plan in query analyzer, you can tell the query engine to use a specific index with the following change to your query.
SELECT EmployeeId FROM Employee WITH (INDEX(Index_EmployeeTypeId)) WHERE EmployeeTypeId IN (1,2,3)
Assuming the index you have on the EmployeeTypeId field is named Index_EmployeeTypeId.
Usually it would, unless the IN clause covers too much of the table, and then it will do a table scan. Best way to find out in your specific case would be to run it in the query analyzer, and check out the execution plan.
Unless technology has improved in ways I can't imagine of late, the "IN" query shown will produce a result that's effectively the OR-ing of three result sets, one for each of the values in the "IN" list. The IN clause becomes an equality condition for each of the list and will use an index if appropriate. In the case of unique IDs and a large enough table then I'd expect the optimiser to use an index.
If the items in the list were to be non-unique however, and I guess in the example that a "TypeId" is a foreign key, then I'm more interested in the distribution. I'm wondering if the optimiser will check the stats for each value in the list? Say it checks the first value and finds it's in 20% of the rows (of a large enough table to matter). It'll probably table scan. But will the same query plan be used for the other two, even if they're unique?
It's probably moot - something like an Employee table is likely to be small enough that it will stay cached in memory and you probably wouldn't notice a difference between that and indexed retrieval anyway.
And lastly, while I'm preaching, beware the query in the IN clause: it's often a quick way to get something working and (for me at least) can be a good way to express the requirement, but it's almost always better restated as a join. Your optimiser may be smart enough to spot this, but then again it may not. If you don't currently performance-check against production data volumes, do so - in these days of cost-based optimisation you can't be certain of the query plan until you have a full load and representative statistics. If you can't, then be prepared for surprises in production...
So there's the potential for an "IN" clause to run a table scan, but the optimizer will try and work out the best way to deal with it?
Whether an index is used doesn't vary so much with the type of query as with the type and distribution of data in the table(s), how up to date your table statistics are, and the actual datatype of the column.
The other posters are correct that an index will be used over a table scan if:
The query won't access more than a certain percentage of the rows indexed (say ~10%, but this varies between DBMSs).
Conversely, if there are a lot of rows but relatively few unique values in the column, it may also be faster to do a table scan.
The other variable that might not be that obvious is making sure that the datatypes of the values being compared are the same. In PostgreSQL, I don't think that indexes will be used if you're filtering on a float but your column is made up of ints. There are also some operators that don't support index use (again, in PostgreSQL, the ILIKE operator is like this).
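A sketch of that datatype pitfall (hypothetical table; exact behavior may vary by PostgreSQL version):
-- given a b-tree index on int_col (an integer column):
SELECT * FROM t WHERE int_col = 1.0;  -- numeric comparison; the index may be skipped
SELECT * FROM t WHERE int_col = 1;    -- integer comparison; the index can be used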
As noted though, always check the query analyser when in doubt and your DBMS's documentation is your friend.
@Mike: Thanks for the detailed analysis. There are definitely some interesting points you make there. The example I posted is somewhat trivial, but the basis of the question came from using NHibernate.
With NHibernate, you can write a clause like this:
int[] employeeIds = new int[] { 1, 5, 23463, 32523 };
NHibernateSession.CreateCriteria(typeof(Employee))
    .Add(Restrictions.InG("EmployeeId", employeeIds));
NHibernate then generates a query which looks like
select * from employee where employeeid in (1, 5, 23463, 32523)
So as you and others have pointed out, it looks like there are going to be times where an index will be used or a table scan will happen, but you can't really determine that until runtime.
SELECT EmployeeId FROM Employee WITH (INDEX(Index_EmployeeTypeId)) WHERE EmployeeTypeId IN (1,2,3)
This query will search using the index you have created (the WITH (INDEX(...)) hint above is SQL Server syntax; in MySQL the equivalent is USE INDEX). It works for me. Please do give it a try.