Query with LIKE, increasingly slow with a smaller resultset - sql

Say I have a Person table with 200000 records, with a clustered index on its GUID primary key. This GUID is generated using the NEWSEQUENTIALID() construct provided by SQL Server (2008 R2). Furthermore, there is a regular index on the LastName (varchar(256)) column.
For every record I've generated a unique name (Lastname_1 through Lastname_200000). Now I'm playing around with some queries and have found that the more restrictive my criteria are, the slower SQL Server returns actual results. And this performance implication is quite severe.
E.g.:
SELECT * FROM Person WHERE Lastname LIKE '%Lastname_123456%'
Is much slower than
SELECT * FROM Person WHERE Lastname LIKE '%Lastname_123%'
Response times are measured by setting statistics on:
SET STATISTICS TIME ON
I can imagine this being caused
1) Because of the LIKE clause itself: since it starts with %, it isn't possible to use the index on that particular column,
2) SQL having to think more about my 'bigger question'.
Is there any truth in this? Is there some way to avoid this?
Edit:
To add some context to this question, this is part of a use case for a 'free search'. I would very much like the system to be fast when a user enters a full lastname.
How should I make these cases perform? Should I avoid the '%xxx%' construction and go for an 'xxx%'-like construction? That does add a lot of speed, but at the cost of some flexibility for the user...

You are right on with number 2. The first LIKE must match more characters in the string, and SQL Server stops comparing as soon as it finds a character that doesn't match, so a shorter search string takes fewer string-matching iterations per row, even though you get more results back.
As for #1 - SQL Server will use an index if possible for a LIKE, but will probably do an index scan (probably of the clustered index) since a seek is not possible with a leading wildcard. It also depends on what's included in the index: since you are selecting all columns, it's likely that a table scan is happening instead, because the index you 'could' use is not covering your query (unless it's using the clustered index).
Check your execution plan - you will likely see a table scan
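The seek-versus-scan difference is easy to reproduce on a small scale. The sketch below uses SQLite through Python's sqlite3 module as a stand-in, not SQL Server, so the plan wording and optimizer rules differ. The table and index names are made up, and the underscore is dropped from the generated names because _ is itself a single-character wildcard in LIKE. SQLite only applies its prefix optimization here because the column is declared COLLATE NOCASE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# NOCASE collation lets SQLite use the index for its case-insensitive LIKE.
conn.execute(
    "CREATE TABLE person (id INTEGER PRIMARY KEY, last_name TEXT COLLATE NOCASE)"
)
conn.executemany(
    "INSERT INTO person (last_name) VALUES (?)",
    [(f"Lastname{i}",) for i in range(1, 2001)],
)
conn.execute("CREATE INDEX idx_person_last_name ON person (last_name)")


def plan(sql):
    """Return the detail text of the first EXPLAIN QUERY PLAN row."""
    return conn.execute("EXPLAIN QUERY PLAN " + sql).fetchone()[3]


prefix_sql = "SELECT id, last_name FROM person WHERE last_name LIKE 'Lastname123%'"
contains_sql = "SELECT id, last_name FROM person WHERE last_name LIKE '%Lastname123%'"

prefix_plan = plan(prefix_sql)      # range search on the index
contains_plan = plan(contains_sql)  # every entry has to be examined

prefix_rows = conn.execute(prefix_sql).fetchall()
contains_rows = conn.execute(contains_sql).fetchall()

print(prefix_plan)
print(contains_plan)
```

Both queries return the same rows for this data; only the access path differs. On SQL Server the equivalent check is the execution plan mentioned above.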

Usually, SQL Server cannot seek on an index for a LIKE pattern that starts with a wildcard; at best it scans the index.
This article can help guide you

Related

Optimize SQL Query on SQLite3 by using indexes

I'm trying to optimize a SQL Query by creating indexes to have the best performances.
Table definition
CREATE TABLE Mots (
numero INTEGER NOT NULL,
fk_dictionnaires integer(5) NOT NULL,
mot varchar(50) NOT NULL,
ponderation integer(20) NOT NULL,
drapeau varchar(1) NOT NULL,
CONSTRAINT pk_mots PRIMARY KEY(numero),
CONSTRAINT uk_dico_mot_mots UNIQUE(fk_dictionnaires, mot),
CONSTRAINT fk_mots_dictionnaires FOREIGN KEY(fk_dictionnaires) REFERENCES Dictionnaires(numero)
);
Indexes definition
CREATE INDEX idx_dictionnaires ON mots(fk_dictionnaires DESC);
CREATE INDEX idx_mots_ponderation ON mots(ponderation);
CREATE UNIQUE INDEX idx_mots_unique ON mots(fk_dictionnaires, mot);
SQL Query :
SELECT numero, mot, ponderation, drapeau
FROM mots
WHERE mot LIKE 'ar%'
AND fk_dictionnaires=1
AND LENGTH(mot)>=4
ORDER BY ponderation DESC
LIMIT 5;
Query Plan
0|0|0|SEARCH TABLE mots USING INDEX idx_dictionnaires (fk_dictionnaires=?) (~2 rows)
0|0|0|USE TEMP B-TREE FOR ORDER BY
The defined indexes don't seem to be used, and the query takes (according to .timer):
CPU Time: user 0.078001 sys 0.015600
However, when I remove fk_dictionnaires=1, my indexes are used correctly and the timings are around 0.000000-0.01XXXXXX sec
0|0|0|SCAN TABLE mots USING INDEX idx_mots_ponderation (~250000 rows)
I found some similar questions on Stack Overflow, but none of the answers helped me.
Removing a Temporary B Tree Sort from a SQLite Query
Similar issue
How can I improve the performances by using indexes or/and by changing the SQL Query?
Thanks in advance.
SQLite seems to think that the idx_dictionnaires index is very sparse and concludes that if it scans using idx_dictionnaires, it will only have to examine a couple of rows. However, the performance results you quote suggest that it must be examining more than just a couple rows. First, why don't you try ANALYZE mots, so SQLite will have up-to-date information on the cardinality of each index available?
Here is something else which might help, from the SQLite documentation:
Terms of the WHERE clause can be manually disqualified for use with indices by prepending a unary + operator to the column name. The unary + is a no-op and will not slow down the evaluation of the test specified by the term. But it will prevent the term from constraining an index. So, in the example above, if the query were rewritten as:
SELECT z FROM ex2 WHERE +x=5 AND y=6;
The + operator on the x column will prevent that term from constraining an index. This would force the use of the ex2i2 index.
Note that the unary + operator also removes type affinity from an expression, and in some cases this can cause subtle changes in the meaning of an expression. In the example above, if column x has TEXT affinity then the comparison "x=5" will be done as text. But the + operator removes the affinity. So the comparison "+x=5" will compare the text in column x with the numeric value 5 and will always be false.
If ANALYZE mots isn't enough to help SQLite choose the best index to use, you can use this feature to force it to use the index you want.
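To see the unary + trick in action, here is a minimal sketch using Python's sqlite3 module. It reuses the ex2 / ex2i1 / ex2i2 names from the quoted documentation; the column types are an assumption made for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE ex2 (x INTEGER, y INTEGER, z INTEGER);
    CREATE INDEX ex2i1 ON ex2 (x);
    CREATE INDEX ex2i2 ON ex2 (y);
    """
)


def plan(sql):
    """Join the detail column of every EXPLAIN QUERY PLAN row."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)


# The planner is free to pick either index here.
plain_plan = plan("SELECT z FROM ex2 WHERE x=5 AND y=6")
# The + disqualifies the x term, so only the index on y can be used.
forced_plan = plan("SELECT z FROM ex2 WHERE +x=5 AND y=6")

print(plain_plan)
print(forced_plan)
```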
You could also try compound indexes -- it looks like you already defined one on fk_dictionnaires,mot, but SQLite isn't using it. For the "fast" query, SQLite seemed to prefer using the index on ponderation, to avoid sorting the rows at the end of the query. If you add an index on fk_dictionnaires,ponderation DESC, and SQLite actually uses it, it could pick out the rows which match fk_dictionnaires=1 without a table scan and avoid sorting at the end.
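A rough way to check whether the compound index removes the sort is to compare the plans before and after creating it. This sketch rebuilds the question's schema in an in-memory database (the foreign key is left out, the table is empty, and idx_dico_ponderation is a name I made up), so the plans you get on real data with real statistics may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE mots (
        numero INTEGER NOT NULL PRIMARY KEY,
        fk_dictionnaires INTEGER NOT NULL,
        mot VARCHAR(50) NOT NULL,
        ponderation INTEGER NOT NULL,
        drapeau VARCHAR(1) NOT NULL,
        UNIQUE (fk_dictionnaires, mot)
    );
    CREATE INDEX idx_dictionnaires ON mots (fk_dictionnaires DESC);
    CREATE INDEX idx_mots_ponderation ON mots (ponderation);
    """
)

QUERY = """
    SELECT numero, mot, ponderation, drapeau
    FROM mots
    WHERE mot LIKE 'ar%' AND fk_dictionnaires = 1 AND LENGTH(mot) >= 4
    ORDER BY ponderation DESC
    LIMIT 5
"""


def plan():
    """Join the detail column of every EXPLAIN QUERY PLAN row."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + QUERY).fetchall()
    return " | ".join(row[3] for row in rows)


plan_before = plan()
# Equality column first, then the ORDER BY column.
conn.execute(
    "CREATE INDEX idx_dico_ponderation ON mots (fk_dictionnaires, ponderation DESC)"
)
plan_after = plan()

print("before:", plan_before)
print("after: ", plan_after)
```

If the new index is picked, the "USE TEMP B-TREE FOR ORDER BY" step should be gone from the second plan, because the rows for one dictionary are already stored in ponderation order.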
POSTSCRIPT: The compound index I suggested above "fixed" the OP's performance problem, but he also asked how and why it works. @AGeiser, I'll use a brief illustration to try to help you understand DB indexes intuitively:
Imagine you need to find all the people in your town whose surnames start with "A". You have a directory of all the names, but they are in random order. What do you do? You have no choice but to read through the whole directory, and pick out the ones which start with "A". Sounds like a lot of work, right? (This is like a DB table with no indexes.)
But what if somebody gives you a phone book, with all the names in alphabetical order? Now you can just find the first and last entries which start with "A" (using something like a binary search), and take all the entries in that range. You don't have to even look at all the other names in the book. This will be way faster. (This is like a DB table with an index; in this case, call it an index on last_name,first_name.)
Now what if you want all the people whose names start with "A", but in the case that 2 people have the same name, you want them to be ordered by postal code? Even if you get the needed names quickly using the "phone book" (ie. index on last_name,first_name), you will still have to sort them all manually... so it starts sounding like a lot of work again. What could make this job really easy?
It would take another "phone book" -- but one in which the entries are ordered first by name, and then by postal code. With a "phone book" like that, you could quickly select the range of entries which you need, and you wouldn't even need to sort them -- they would already be in the desired order. (This is an index on last_name,first_name,postal_code.)
I think this illustration should make it clear how indexes can help SELECT queries, not just by reducing the number of rows which must be examined, but also by (potentially) eliminating the need for a separate "sort" phase after the needed rows are found. Hopefully it also makes it clear that a compound index on a,b is completely different from one on b,a. I could go on giving more "phone book" examples, but this answer would become so long that it would be more like a blog post. To build your intuition on which indexes are likely to benefit a query, I recommend the book from O'Reilly on "SQL Antipatterns" (especially chapter 13, "Index Shotgun").
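The phone book idea can be sketched in a few lines of plain Python, with a sorted list standing in for the index and a binary search standing in for the seek. The names and postal codes are invented sample data, and a real B-tree index is of course more involved than a sorted list.

```python
from bisect import bisect_left

# Hypothetical directory rows: (last_name, first_name, postal_code).
people = [
    ("Baker", "Ann", "30301"),
    ("Adams", "Zoe", "94110"),
    ("Avery", "Bob", "10001"),
    ("Adams", "Zoe", "02139"),
    ("Clark", "Dan", "60601"),
    ("Adams", "Amy", "73301"),
]

# "Phone book": sorted once, like an index on
# (last_name, first_name, postal_code).
phone_book = sorted(people)


def surnames_starting_with(book, letter):
    """Binary-search both ends of the range instead of reading every entry."""
    lo = bisect_left(book, (letter,))
    hi = bisect_left(book, (chr(ord(letter) + 1),))
    return book[lo:hi]


# The slice is already ordered by name, then postal code: no extra sort.
a_names = surnames_starting_with(phone_book, "A")

# A book ordered by postal code first, like an index on
# (postal_code, last_name, first_name), cannot answer this with one range:
# the "A" surnames are scattered through it.
by_postal_code = sorted(people, key=lambda p: (p[2], p[0], p[1]))

print(a_names)
print([p[0] for p in by_postal_code])
```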

Multiple indexes on one column

Using Oracle, there is a table called User.
Columns: Id, FirstName, LastName
Indexes: 1. PK(Id), 2. UPPER(FirstName), 3. LOWER(FirstName), 4. Index(FirstName)
As you can see index 2, 3, 4 are indexes on the same column - FirstName.
I know this creates overhead, but my question is on selecting how will the database react/optimize?
For instance:
SELECT Id FROM User u WHERE
u.FirstName LIKE 'MIKE%'
Will Oracle hit the right index or will it not?
The problem is that via Hibernate this slows down the query VERY much (so it uses prepared statements).
Thanks.
UPDATE: Just to clarify indexes 2 and 3 are functional indexes.
In addition to Mat's point that either index 2 or 3 should be redundant because you should choose one approach to doing case-insensitive searches and to Richard's point that it will depend on the selectivity of the index, be aware that there are additional concerns when you are using the LIKE clause.
Assuming you are using bind variables (which it sounds like you are based on your use of prepared statements), the optimizer has to guess at how selective the actual bind value is going to be. Something short like 'S%' is going to be very non-selective, causing the optimizer to generally prefer a table scan. A longer string like 'Smithfield-Manning%', on the other hand, is likely to be very selective and would likely use index 4. How Oracle handles this variability will depend on the version.
Oracle introduced bind variable peeking in 9i, and it still governs this behaviour in Oracle 10. This meant that the first time Oracle parsed a query after a reboot (or after the query plan had aged out of the shared pool), Oracle looked at the bind value and decided what plan to use based on that value. Assuming that most of your queries would benefit from the index scan because users are generally searching on relatively selective values, this was great if the first query after a reboot had a selective condition. But if you got unlucky and someone did a WHERE firstname LIKE 'S%' immediately after a reboot, you'd be stuck with the table scan query plan until the query plan was removed from the shared pool.
Starting in Oracle 11, however, the optimizer has the ability to do adaptive cursor sharing. That means that the optimizer will try to figure out that WHERE firstname LIKE 'S%' should do a table scan and WHERE firstname LIKE 'Smithfield-Manning%' should do an index scan and will maintain multiple query plans for the same statement in the shared pool. That solves most of the problems that we had with bind variable peeking in earlier versions.
But even here, the accuracy of the optimizer's selectivity estimates is generally going to be problematic for medium-length strings. It's generally going to know that a single-character string is very weakly selective and that a 20-character string is highly selective, but even with a 256-bucket histogram, it's not going to have a whole lot of information about how selective something like WHERE firstname LIKE 'Smit%' really is. It may know roughly how selective 'Sm%' is based on the column histogram, but it's guessing rather blindly at how selective the next two characters are. So it's not uncommon to end up in a situation where most of the queries work efficiently but the optimizer is convinced that WHERE firstname LIKE 'Cave%' isn't selective enough to use an index.
Assuming that this is a common query, you may want to consider using Oracle's plan stability features to force Oracle to use a particular plan regardless of the value of a bind variable. This may mean that users that enter a single character have to wait even longer than they would otherwise have waited because the index scan is substantially less efficient than doing a table scan. But that may be worth it for other users that are searching for short but reasonably distinctive last names. And you may do things like add a ROWNUM limiter to the query or add logic to the front end that requires a minimum number of characters in the search box to avoid situations where a table scan would be more efficient.
It's a bit strange to have both the upper and lower function-based indexes on the same field. And I don't think the optimizer will use either in your query as it is.
You should pick one or the other (and probably drop the last one too), and only ever query on the upper (or lower)-case with something like:
select id from user u where upper(u.firstname) like 'MIKE%'
Edit: look at this post too, has some interesting info How to use a function-based index on a column that contains NULLs in Oracle 10+?
It may not hit any of your indexes, because you are returning ID in the SELECT clause, which is not covered by the indexes.
If the index is very selective, and Oracle decides it is still worthwhile using it to find 'MIKE%' then perform a lookup on the data to get the ID column, then it may use 4. Index(FirstName). 2 and 3 will only be used if the column searched uses the exact function defined in the index.
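The "exact function" rule can be tried out with any engine that supports indexes on expressions. The sketch below uses SQLite (3.9 or later) from Python rather than Oracle, so it only illustrates the general principle; the table, data and index name are invented, and it uses an equality test instead of LIKE to stay clear of SQLite's own LIKE rules.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT);
    -- Expression index, comparable in spirit to a function-based index.
    CREATE INDEX idx_users_upper_first ON users (upper(first_name));
    INSERT INTO users (first_name, last_name) VALUES
        ('Mike', 'Stone'), ('MIKE', 'Reed'), ('Sara', 'Cole');
    """
)


def plan(sql):
    """Join the detail column of every EXPLAIN QUERY PLAN row."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)


matching_sql = "SELECT id FROM users WHERE upper(first_name) = 'MIKE'"
plain_sql = "SELECT id FROM users WHERE first_name = 'MIKE'"

matching_plan = plan(matching_sql)  # same expression as the index
plain_plan = plan(plain_sql)        # bare column: the index does not apply
matching_ids = [row[0] for row in conn.execute(matching_sql + " ORDER BY id")]

print(matching_plan)
print(plain_plan)
```

Only the query that repeats upper(first_name) can use the expression index; the query on the bare column falls back to scanning the table.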

SQL Execution Plan Shows Different Results for Same Inputs

declare @name varchar(156)
set @name ='sara'
--Query 1:
SELECT [PNAME] FROM [tbltest] where [PNAME] like '%'+@name+'%'
--Query 2:
SELECT [PNAME] FROM [tbltest] where [PNAME] like '%sara%'
Suppose that there is a nonclustered index on the [PNAME] column of [tbltest].
When running the queries, the execution plan shows an Index Seek for Query 1 and an Index Scan for Query 2.
I expected the execution plan to show an Index Scan for both queries, but because the first query uses a parameter, it shows an Index Seek.
So what is the matter?
Both queries use '%' on both sides, and I know that in this case SQL Server does not consider the index.
So why does the execution plan for the first query show an Index Seek?
Thanks
Query one uses a parameter, query 2 a constant.
The plan for query 2 will not be reused if you change the constant value.
The plan for query 1 can be. In this case, SQL Server (simply) leaves its options open for reusing the plan.
AKA: the queries are not the same.
If you force parameterisation, then you should make both queries run like query 1. But I haven't tried...
If you do DBCC SHOW_STATISTICS on your table and the index that is being used, look for "String Index = YES" in the first row of the output. SQL Server maintains some sort of additional stats for satisfying queries like '%x'
In the first query, you'll probably see computed scalar values - look in the query plan for LikeRangeStart('%'+@name+'%'). The Index Seek is against those values, as opposed to the Index Scan against '%sara%'.
How this works I don't know. Why SQL Server would not be smart enough to convert 'sara' to a constant and do the query the same way I don't know either. But I think that's what's going on.
Against '%sara%' it does an index scan, reading the entire index. Against '%'+@name+'%' it creates RangeStart/RangeEnd/RangeInfo computed values and uses them to do an index seek, somehow taking advantage of the additional string statistics.
I think that Mike is on the right track about whether you are hitting the index or not. Your follow-up regarding cost might need more of an understanding about how your data is distributed within the table. I've seen instances when hitting an index is more costly due to the need for two disk reads. To understand why, you'll have to know how your data is distributed across the index, how many records will fit into a page, and what your caching scheme is.
I will say that it may be difficult to tune a query with a leading %. The database will need to fully traverse your index (or table) and hit every node looking for a value that contains "sara". Depending on your needs, you might want to consider full-text search (i.e., is the parameter value in this query used because it's provided as input from a user of your application).

Performance effect of using TOP 1 in a SELECT query

I have a User table where there are a Username and Application columns. Username may repeat but combination of Username + Application is unique, but I don't have the unique constraint set on the table (for performance)
Question: will there be any difference (performance-wise) between :
SELECT * FROM User where UserName='myuser' AND Application='myapp'
AND -
SELECT TOP 1 * FROM User where UserName='myuser' AND Application='myapp'
As combination of Username + Application is unique, both queries will always return no more than one record, so TOP 1 doesn't affect the result. I always thought that adding TOP 1 will really speed things up as sql server would stop looking after it found one match, but I recently read in an article that using TOP will actually slow things down and it's recommended to avoid, though they haven't explained why.
Any comments?
Thank you!
Andrey
You may get some performance difference from just using top, but the real performance you get by using indexes.
If you have an index for the UserName and Application fields, the database doesn't even have to touch the table until it has isolated the single record. Also, it will already know from the table statistics that the values are unique, so using top makes no difference.
If there's more than one row in the results and no ORDER BY clause, the "TOP 1" saves a ton of work for the server. If there's an order by clause the server still has to materialize the entire result set anyway, and if there's only one row it doesn't really change anything.
I think it depends on the query execution plan that SQL generates ... In the past on previous versions of SQL Server I have seen the use of a superfluous 'TOP' deliver definite performance benefits with complex queries with many joins. But definitely not in all cases.
I guess the best advice I can give is to try it out on a case by case basis.
You say you do not enforce the constraint, which means there is no unique index on (UserName, Application) or (Application, UserName). Can the query use an access path that seeks either on UserName or Application? In other words, is either of these two columns indexed? If yes, then the plan will pick the most selective one which is indexed and do a range scan, possibly a nested loop with a bookmark lookup if the index is non-clustered, then a filter. Top 1 will stop the query after the first filter is matched, but whether this makes a difference depends on the cardinality of the data (how many records the range scan finds and how many satisfy the filter).
If there is no index then it will do a full clustered scan no matter what. Top 1 will stop the scan on the first match; whether this is after processing 1 record or after processing 999 million records depends on the actual user name and application...
The only thing that will make a real difference is to allow the query to do a seek for both values, i.e. have a covering index. The constraint would be enforced through exactly such a covering index. In other words: by turning off the constraint, presumably for write performance, be prepared to pay the price at reads. Is this read important? Did you do any measurement to confirm that the extra index write of the constraint would be critically dampening the performance?
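As a small-scale illustration of seeking on both values, here is a sketch with a unique index on (username, application). It uses SQLite from Python, where LIMIT 1 plays the role of TOP 1; the table layout, sample rows and index name are assumptions, and SQL Server's plans will be worded differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (
        id INTEGER PRIMARY KEY,
        username TEXT NOT NULL,
        application TEXT NOT NULL
    );
    -- The unique constraint the question left out, expressed as an index.
    CREATE UNIQUE INDEX ux_users_name_app ON users (username, application);
    INSERT INTO users (username, application) VALUES
        ('myuser', 'myapp'), ('myuser', 'otherapp'), ('someone', 'myapp');
    """
)

BASE = "SELECT * FROM users WHERE username = 'myuser' AND application = 'myapp'"

# With the index in place the lookup is a seek on both columns.
seek_plan = conn.execute("EXPLAIN QUERY PLAN " + BASE).fetchone()[3]

all_rows = conn.execute(BASE).fetchall()
first_row = conn.execute(BASE + " LIMIT 1").fetchall()

print(seek_plan)
print(all_rows, first_row)
```

Once the seek can isolate the single matching row, the LIMIT 1 variant has nothing left to cut short and both queries return the same row.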

Do indexes work with "IN" clause

If I have a query like:
Select EmployeeId
From Employee
Where EmployeeTypeId IN (1,2,3)
and I have an index on the EmployeeTypeId field, does SQL server still use that index?
Yeah, that's right. If your Employee table has 10,000 records, and only 5 records have EmployeeTypeId in (1,2,3), then it will most likely use the index to fetch the records. However, if it finds that 9,000 records have the EmployeeTypeId in (1,2,3), then it would most likely just do a table scan to get the corresponding EmployeeIds, as it's faster just to run through the whole table than to go to each branch of the index tree and look at the records individually.
SQL Server does a lot of stuff to try and optimize how the queries run. However, sometimes it doesn't get the right answer. If you know that SQL Server isn't using the index, by looking at the execution plan in query analyzer, you can tell the query engine to use a specific index with the following change to your query.
SELECT EmployeeId FROM Employee WITH (Index(Index_EmployeeTypeId )) WHERE EmployeeTypeId IN (1,2,3)
Assuming the index you have on the EmployeeTypeId field is named Index_EmployeeTypeId.
Usually it would, unless the IN clause covers too much of the table, and then it will do a table scan. Best way to find out in your specific case would be to run it in the query analyzer, and check out the execution plan.
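If you want to see what an optimizer does with an IN list, inspecting the plan on test data is quick. This sketch uses SQLite from Python rather than SQL Server, with made-up table and index names and 50 evenly distributed type ids, so it only shows the selective case described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee (employee_id INTEGER PRIMARY KEY, employee_type_id INTEGER)"
)
# 10,000 rows spread evenly over 50 type ids (200 rows per type).
conn.executemany(
    "INSERT INTO employee (employee_type_id) VALUES (?)",
    [(i % 50,) for i in range(10000)],
)
conn.execute("CREATE INDEX idx_employee_type ON employee (employee_type_id)")
conn.execute("ANALYZE")  # gather statistics for the planner

sql = "SELECT employee_id FROM employee WHERE employee_type_id IN (1, 2, 3)"
in_plan = " | ".join(
    row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
)
match_count = len(conn.execute(sql).fetchall())

print(in_plan)
print(match_count)
```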
Unless technology has improved in ways I can't imagine of late, the "IN" query shown will produce a result that's effectively the OR-ing of three result sets, one for each of the values in the "IN" list. The IN clause becomes an equality condition for each of the list and will use an index if appropriate. In the case of unique IDs and a large enough table then I'd expect the optimiser to use an index.
If the items in the list were to be non-unique however, and I guess in the example that a "TypeId" is a foreign key, then I'm more interested in the distribution. I'm wondering if the optimiser will check the stats for each value in the list? Say it checks the first value and finds it's in 20% of the rows (of a large enough table to matter). It'll probably table scan. But will the same query plan be used for the other two, even if they're unique?
It's probably moot - something like an Employee table is likely to be small enough that it will stay cached in memory and you probably wouldn't notice a difference between that and indexed retrieval anyway.
And lastly, while I'm preaching, beware the query in the IN clause: it's often a quick way to get something working and (for me at least) can be a good way to express the requirement, but it's almost always better restated as a join. Your optimiser may be smart enough to spot this, but then again it may not. If you don't currently performance-check against production data volumes, do so - in these days of cost-based optimisation you can't be certain of the query plan until you have a full load and representative statistics. If you can't, then be prepared for surprises in production...
So there's the potential for an "IN" clause to run a table scan, but the optimizer will try and work out the best way to deal with it?
Whether an index is used doesn't so much vary on the type of query as much of the type and distribution of data in the table(s), how up-to-date your table statistics are, and the actual datatype of the column.
The other posters are correct that an index will be used over a table scan if:
The query won't access more than a certain percent of the rows indexed (say ~10% but should vary between DBMS's).
Alternatively, if there are a lot of rows, but relatively few unique values in the column, it also may be faster to do a table scan.
The other variable that might not be that obvious is making sure that the datatypes of the values being compared are the same. In PostgreSQL, I don't think that indexes will be used if you're filtering on a float but your column is made up of ints. There are also some operators that don't support index use (again, in PostgreSQL, the ILIKE operator is like this).
As noted though, always check the query analyser when in doubt and your DBMS's documentation is your friend.
@Mike: Thanks for the detailed analysis. There are definitely some interesting points you make there. The example I posted is somewhat trivial but the basis of the question came from using NHibernate.
With NHibernate, you can write a clause like this:
int[] employeeIds = new int[]{1, 5, 23463, 32523};
NHibernateSession.CreateCriteria(typeof(Employee))
.Add(Restrictions.InG("EmployeeId",employeeIds))
NHibernate then generates a query which looks like
select * from employee where employeeid in (1, 5, 23463, 32523)
So as you and others have pointed out, it looks like there are going to be times where an index will be used or a table scan will happen, but you can't really determine that until runtime.
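For comparison, this is roughly what building such an IN query at runtime looks like outside NHibernate. It is a sketch in Python against SQLite, not what NHibernate does internally; build_in_query is a hypothetical helper and the sample rows are invented. Note that the statement text changes with the number of values in the list.

```python
import sqlite3


def build_in_query(ids):
    """Return (sql, params) with one '?' placeholder per id."""
    placeholders = ", ".join("?" for _ in ids)
    sql = f"SELECT employeeid FROM employee WHERE employeeid IN ({placeholders})"
    return sql, tuple(ids)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (employeeid INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO employee (employeeid, name) VALUES (?, ?)",
    [(1, "a"), (5, "b"), (7, "c"), (23463, "d")],
)

employee_ids = [1, 5, 23463, 32523]
sql, params = build_in_query(employee_ids)
found = [row[0] for row in conn.execute(sql + " ORDER BY employeeid", params)]

print(sql)
print(found)
```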
Select EmployeeId From Employee USE(INDEX(EmployeeTypeId))
This query will search using the index you have created. It works for me. Please do a try..