Performance effect of using TOP 1 in a SELECT query - sql

I have a User table with UserName and Application columns. UserName may repeat, but the combination of UserName + Application is unique. I don't have a unique constraint set on the table (for performance reasons).
Question: will there be any difference (performance-wise) between:
SELECT * FROM User where UserName='myuser' AND Application='myapp'
and
SELECT TOP 1 * FROM User where UserName='myuser' AND Application='myapp'
As the combination of UserName + Application is unique, both queries will always return at most one record, so TOP 1 doesn't affect the result. I always thought that adding TOP 1 would really speed things up, since SQL Server could stop looking after it found one match, but I recently read in an article that using TOP will actually slow things down and should be avoided, though the article didn't explain why.
Any comments?
Thank you!
Andrey

You may get some performance difference just from using TOP, but the real performance gain comes from using indexes.
If you have an index on the UserName and Application fields, the database doesn't even have to touch the table until it has isolated the single record. Also, it will already know from the table statistics that the values are unique, so using TOP makes no difference.
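For example, a composite index along these lines (a sketch; the index name is made up) would let the engine seek straight to the matching row:
CREATE NONCLUSTERED INDEX IX_User_UserName_Application
    ON [User] (UserName, Application);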

If there's more than one row in the result and no ORDER BY clause, TOP 1 saves the server a ton of work. If there's an ORDER BY clause, the server still has to materialize the entire result set anyway, and if there's only one row, it doesn't really change anything.

I think it depends on the query execution plan that SQL Server generates. In the past, on previous versions of SQL Server, I have seen a superfluous TOP deliver definite performance benefits with complex queries with many joins. But definitely not in all cases.
I guess the best advice I can give is to try it out on a case-by-case basis.

You say you do not enforce the constraint; that means there is no unique index on (UserName, Application) or (Application, UserName). Can the query use an access path that seeks on either UserName or Application? In other words, is either of these two columns indexed? If yes, the plan will pick the more selective column that is indexed and do a range scan, possibly a nested loop with a bookmark lookup if the index is non-clustered, then a filter. TOP 1 will stop the query after the first filter match, but whether this makes a difference depends on the cardinality of the data (how many records the range scan finds and how many satisfy the filter).
If there is no index, then it will do a full clustered scan no matter what. TOP 1 will stop the scan on the first match; whether that happens after processing 1 record or 999 million records depends on the actual user name and application...
The only thing that will make a real difference is to allow the query to do a seek on both values, i.e. to have a covering index. The constraint would be enforced through exactly such a covering index. In other words: by turning off the constraint, presumably for write performance, be prepared to pay the price at read time. Is this read important? Did you do any measurement to confirm that the extra index write for the constraint would critically dampen performance?
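For reference, enforcing the constraint would look something like this sketch (the constraint name is made up); the index it creates is exactly the covering index described above:
ALTER TABLE [User]
    ADD CONSTRAINT UQ_User_UserName_Application UNIQUE (UserName, Application);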

Query with LIKE, increasingly slow with a smaller resultset

Say I have a Person table with 200,000 records and a clustered index on its GUID primary key. The GUID is generated using the NEWSEQUENTIALID() construct provided by SQL Server (2008 R2). Furthermore, there is a regular index on the LastName (varchar(256)) column.
For every record I've generated a unique name (Lastname_1 through Lastname_200000). Now I'm playing around with some queries and have found that the more restrictive my criteria are, the slower SQL Server returns actual results. And this performance implication is quite severe.
E.g.:
SELECT * FROM Person WHERE Lastname LIKE '%Lastname_123456%'
Is much slower than
SELECT * FROM Person WHERE Lastname LIKE '%Lastname_123%'
Response times are measured by setting statistics on:
SET STATISTICS TIME ON
I can imagine this being caused:
1) by the LIKE clause itself: since it starts with %, it isn't possible to use the index on that particular column;
2) by SQL having to think more about my 'bigger question'.
Is there any truth in this? Is there some way to avoid this?
Edit:
To add some context to this question, this is part of a use case for a 'free search'. I would very much like the system to be fast when a user enters a full lastname.
How should I make these cases perform? Should I avoid the '%xxx%' construction and go for an 'xxx%'-style construction? That does add a lot of speed, but at the cost of some flexibility for the user...
You are right on with number 2: the more restrictive LIKE must match more characters in the string. SQL stops comparing a candidate as soon as it finds a character that doesn't match, so a smaller search string takes fewer string-matching iterations, even though you get more results back.
As for #1: SQL will use an index if possible for a LIKE, but will probably do an index scan (probably of the clustered index), since a seek is not possible with a leading wildcard. It also depends on what's included in the index: since you are selecting all columns, it's likely that a table scan is happening instead, because the index you 'could' use does not cover your query (unless it's the clustered index).
Check your execution plan; you will likely see a table scan.
Usually, SQL Server does not use indexes on a LIKE.
This article can help guide you.
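As a sketch of the difference (using the question's hypothetical data): a leading wildcard forces a scan, while a trailing-only wildcard allows a seek on the LastName index:
SELECT * FROM Person WHERE Lastname LIKE '%Lastname_123456%' -- leading %: no seek possible, expect a scan
SELECT * FROM Person WHERE Lastname LIKE 'Lastname_123456%'  -- no leading %: an index seek on LastName is possible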

Multiple indexes on one column

Using Oracle, there is a table called User.
Columns: Id, FirstName, LastName
Indexes: 1. PK(Id), 2. UPPER(FirstName), 3. LOWER(FirstName), 4. Index(FirstName)
As you can see index 2, 3, 4 are indexes on the same column - FirstName.
I know this creates overhead, but my question is: on a SELECT, how will the database react/optimize?
For instance:
SELECT Id FROM User u WHERE
u.FirstName LIKE 'MIKE%'
Will Oracle hit the right index or will it not?
The problem is that via Hibernate (which uses prepared statements) this query slows down VERY much.
Thanks.
UPDATE: Just to clarify indexes 2 and 3 are functional indexes.
In addition to Mat's point that either index 2 or 3 should be redundant (because you should choose one approach to doing case-insensitive searches) and Richard's point that it will depend on the selectivity of the index, be aware that there are additional concerns when you are using the LIKE clause.
Assuming you are using bind variables (which it sounds like you are based on your use of prepared statements), the optimizer has to guess at how selective the actual bind value is going to be. Something short like 'S%' is going to be very non-selective, causing the optimizer to generally prefer a table scan. A longer string like 'Smithfield-Manning%', on the other hand, is likely to be very selective and would likely use index 4. How Oracle handles this variability will depend on the version.
In Oracle 10, Oracle introduced bind variable peeking. This meant that the first time Oracle parsed a query after a reboot (or after the query plan had been aged out of the shared pool), Oracle looked at the bind value and decided what plan to use based on that value. Assuming that most of your queries would benefit from the index scan because users are generally searching on relatively selective values, this was great if the first query after a reboot had a selective condition. But if you got unlucky and someone did a WHERE firstname LIKE 'S%' immediately after a reboot, you'd be stuck with the table scan query plan until the query plan was removed from the shared pool.
Starting in Oracle 11, however, the optimizer has the ability to do adaptive cursor sharing. That means that the optimizer will try to figure out that WHERE firstname LIKE 'S%' should do a table scan and WHERE firstname LIKE 'Smithfield-Manning%' should do an index scan and will maintain multiple query plans for the same statement in the shared pool. That solves most of the problems that we had with bind variable peeking in earlier versions.
But even here, the accuracy of the optimizer's selectivity estimates is generally going to be problematic for medium-length strings. It's generally going to know that a single-character string is very weakly selective and that a 20-character string is highly selective, but even with a 256-bucket histogram it's not going to have a whole lot of information about how selective something like WHERE firstname LIKE 'Smit%' really is. It may know roughly how selective 'Sm%' is based on the column histogram, but it's guessing rather blindly at how selective the next two characters are. So it's not uncommon to end up in a situation where most of the queries work efficiently but the optimizer is convinced that WHERE firstname LIKE 'Cave%' isn't selective enough to use an index.
Assuming that this is a common query, you may want to consider using Oracle's plan stability features to force Oracle to use a particular plan regardless of the value of a bind variable. This may mean that users that enter a single character have to wait even longer than they would otherwise have waited because the index scan is substantially less efficient than doing a table scan. But that may be worth it for other users that are searching for short but reasonably distinctive last names. And you may do things like add a ROWNUM limiter to the query or add logic to the front end that requires a minimum number of characters in the search box to avoid situations where a table scan would be more efficient.
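For instance, the ROWNUM limiter mentioned above might look like this sketch (the cap of 50 is an arbitrary, made-up value):
SELECT id
  FROM user u
 WHERE UPPER(u.firstname) LIKE :name
   AND ROWNUM <= 50; -- arbitrary cap: stop fetching once 50 rows are found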
It's a bit strange to have both the upper and lower function-based indexes on the same field. And I don't think the optimizer will use either in your query as it is.
You should pick one or the other (and probably drop the last one too), and only ever query on the upper (or lower)-case with something like:
select id from user u where upper(u.firstname) like 'MIKE%'
Edit: look at this post too, it has some interesting info: How to use a function-based index on a column that contains NULLs in Oracle 10+?
It may not hit any of your indexes, because you are returning ID in the SELECT clause, which is not covered by the indexes.
If the index is very selective, and Oracle decides it is still worthwhile to use it to find 'MIKE%' and then perform a lookup on the data to get the Id column, then it may use 4. Index(FirstName). Indexes 2 and 3 will only be used if the search uses the exact function defined in the index.
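If covering the query matters, one option (a sketch; the index name is made up) is a composite function-based index that also contains Id, so the SELECT can be answered from the index alone:
CREATE INDEX ix_user_upper_firstname_id
    ON user (UPPER(FirstName), Id); -- covers both the WHERE clause and the SELECT list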

Is there a SQL ANSI way of starting a search at the end of a table?

In a certain app I must constantly query data that are likely to be amongst the last inserted rows. Since this table is going to grow a lot, I wonder if there's a standard way of optimizing the queries by making them start the lookup at the table's end. I think I would get the same optimization if the database stored the table's data in a stack-like structure, so the last inserted rows would be searched first.
The SQL spec doesn't mention anything about maintaining insertion order, and in practice most decent DBs don't maintain it either. So it stops there. Sorting the table first ain't going to make it faster. Just index the column(s) of interest (at least the ones you use in the WHERE).
One of the "tenets" of a proper RDBMS is that this kind of matters shouldn't concern you or anyone else using the DB.
The DB engine is "free" to use whatever method it wants to store/retrieve records, so if you want to enforce a "top" behaviour, do what others suggested: add a timestamp field to the table (or tables), add an index on it, and query using it as a sort and/or filter criterion (e.g., you poll the table each minute and ask for records with timestamp >= systime - 1 minute).
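A sketch of that polling approach (table and column names are made up; exact timestamp syntax varies by dialect):
ALTER TABLE events ADD inserted_at TIMESTAMP;                   -- record when each row arrives
CREATE INDEX ix_events_inserted_at ON events (inserted_at);
SELECT * FROM events
 WHERE inserted_at >= CURRENT_TIMESTAMP - INTERVAL '1' MINUTE;  -- rows from the last minute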
There is no standard way.
In some databases you can specify the sort order on an index.
SQL Server allows you to write ASC or DESC on an index:
[ ASC | DESC ]
Determines the ascending or descending sort direction for the particular index column. The default is ASC.
In MySQL you can also write ASC or DESC when you create the index, but currently this is ignored. It might be implemented in a future version.
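In SQL Server, for example, a descending index might look like this sketch (names are made up):
CREATE NONCLUSTERED INDEX IX_Events_Id_Desc
    ON Events (Id DESC); -- newest ids come first in index order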
Add a counter or a time field in your table, sort on it and get top rows.
In other words: you should forget the idea that SQL tables are accessed in any particular order by default. A seqscan does not mean the oldest rows will be searched first, only that all rows will be checked. If you want to optimize a search, you add indexes on some fields. What you are looking for is probably indexes.
If your data is indexed, it won't matter. The index is doing a binary search, not a sequential scan.
Unless you're doing TOP 1 (or something like it), the SELECT will have to scan the whole table or index anyway.
According to Data Independence you shouldn't care. That said, a clustered index would probably suit your needs if you typically look for a date range. (Sorting asc/desc shouldn't matter, but you should try it out.)
If you find that you really need it you can also shard your database to increase perf on the most recently added data.
If you have enough rows that it's actually becoming a problem, and you know how many "most recently inserted rows" you care about, you could try a round-about method.
Note: even for pretty big tables, this is less efficient, but once your main table gets big enough, I've seen this work wonders for user-facing performance.
Create a "staging" table that exactly mimics your table's structure. Whenever you insert into your main table, also insert into your "staging" area. Limit your "staging" area to n rows by using a trigger that deletes the lowest-id row when a new row pushes you over your arbitrary maximum (say, 10,000 or whatever your limit is).
Then queries can hit that smaller table first when looking for the information. Since the table is arbitrarily limited to the last n rows, it only looks at the most recent data. Only if that fails to find a match would your query (actually, at this point a stored procedure, because of the decision making) hit your main table. There's a sketch of the trigger after the gotchas below.
Some Gotchas:
1) Make sure your trigger(s) is (are) set up properly to maintain the correct concurrency between your "main" and "staging" tables.
2) This can quickly become a maintenance nightmare if not handled properly, and depending on your scenario it can be a little finicky.
3) I cannot stress enough that this is only efficient/useful in very specific scenarios. If yours doesn't match it, use one of the other answers.
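A minimal sketch of that trigger, assuming SQL Server and made-up table and column names:
CREATE TRIGGER trg_Events_Staging ON Events
AFTER INSERT
AS
BEGIN
    -- mirror every new row into the staging copy
    INSERT INTO Events_Staging (Id, Payload, CreatedAt)
    SELECT Id, Payload, CreatedAt FROM inserted;
    -- trim staging back to the most recent 10,000 rows
    DELETE FROM Events_Staging
    WHERE Id NOT IN (SELECT TOP 10000 Id FROM Events_Staging ORDER BY Id DESC);
END;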
ISO/ANSI Standard SQL does not consider optimization at all. For example the widely recognized CREATE INDEX SQL DDL does not appear in the Standard. This is because the Standard makes no assumptions about the underlying storage medium and nor should it. I regularly use SQL to query data in text files and Excel spreadsheets, neither of which have any concept of database indexes.
You can't do this.
However, there is a way to do something that might be even better. Depending on the design of your table, you should be able to create an index that keeps things in almost the order of entry. For example, if you adopt the common practice of creating an id field that autoincrements, then that index is just about in chronological order.
Some RDBMSes permit you to declare a backwards index, that is, one that descends instead of ascending. If you create a backwards index on the ID field, and if the optimizer uses that index, it will look at the most recent entries first. This will give you a rapid response for the first row.
The next step is to get the optimizer to use the index. You need to use EXPLAIN PLAN to see if the index is being used. If you ask for the rows in order of id descending, the optimizer will almost certainly use the backwards index. If not, you may be able to use hints to guide the optimizer.
If you still need to avoid reading all the rows in order to avoid wasting time, you may be able to use the LIMIT feature to declare that you only want, say 10 rows, and no more, or 1 row and no more. That should do it.
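Putting those pieces together, a sketch (made-up names, MySQL-style LIMIT):
CREATE INDEX ix_events_id_desc ON events (id DESC); -- the backwards index on the autoincrementing id
SELECT * FROM events ORDER BY id DESC LIMIT 10;     -- most recent entries first, stop after 10 rows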
Good luck.
If your table has a create date, then I'd reverse sort by that and take the top 1.

Does an index on a unique field in a table allow a select count(*) to happen instantly? If not, why not?

I know just enough about SQL tuning to get myself in trouble. Today I was running EXPLAIN PLAN on a query and I noticed it was not using indexes when I thought it probably should. Well, I kept doing EXPLAIN on simpler and simpler (and, in my mind, more indexable) queries, until I did EXPLAIN on
select count(*) from table_name
I thought for sure this would return instantly and that the explain would show use of an index, as we have many indexes on this table, including an index on the row_id column, which is unique. Yet the explain plan showed a FULL table scan, and it took several seconds to complete. (We have 3 million rows in this table.)
Why would Oracle be doing a full table scan to count the rows in this table? I would like to think that since Oracle is indexing unique fields already, and has to track every insert and update on that table, it would be caching the row count somewhere. Even if it's not, wouldn't it be faster to scan the entire index than to scan the entire table?
I have two theories. Theory one is that I am imagining how indexes work incorrectly. Theory two is that some setting or parameter somewhere in our Oracle setup is messing with Oracle's ability to optimize queries (we are on Oracle 9i). Can anyone enlighten me?
Oracle does not cache COUNT(*).
MySQL with MyISAM does (it can afford to), because MyISAM is transactionless and the same COUNT(*) is visible to everyone.
Oracle is transactional, and a row deleted in another transaction is still visible to your transaction.
Oracle has to scan it, see that it's deleted, visit the UNDO, make sure it's still in place from your transaction's point of view, and add it to the count.
Indexing a UNIQUE value differs from indexing a non-UNIQUE one only logically.
In fact, you can create a UNIQUE constraint over a column with a non-unique index defined, and the index will be used to enforce the constraint.
If a column is marked as NOT NULL, then an INDEX FAST FULL SCAN over this column can be used for COUNT.
It's a special access method, used for cases when the index order is not important. It does not traverse the B-Tree, but instead just reads the pages sequentially.
Since an index has fewer pages than the table itself, the COUNT can be faster with an INDEX_FFS than with a FULL table scan.
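As a sketch (the index name is made up): mark the column NOT NULL and, if the optimizer still won't pick the index, hint the fast full scan:
ALTER TABLE table_name MODIFY row_id NOT NULL;
SELECT /*+ INDEX_FFS(t row_id_idx) */ COUNT(*) FROM table_name t;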
It is certainly possible for Oracle to satisfy such a query with an index (specifically with an INDEX FAST FULL SCAN).
In order for the optimizer to choose that path, at least two things have to be true:
Oracle has to be certain that every row in the table is represented in the index -- basically, that there are no NULL entries that would be missing from the index. If you have a primary key this should be guaranteed.
Oracle has to calculate the cost of the index scan as lower than the cost of a table scan. I don't think it is necessarily true to assume that an index scan is always cheaper.
Possibly, gathering statistics on the table would change the behavior.
Expanding a little on the "transactions" reason: when a database supports transactions, at any point in time there might be records in different states, even in a "deleted" state. If a transaction fails, the states are rolled back.
A full table scan is done so that the current "version" of each record can be accessed for that point in time.
MySQL's MyISAM doesn't have this problem, since it uses table locking instead of the record locking required for transactions, and it caches the record count. So the count is always returned instantly. InnoDB under MySQL works the same way as Oracle, but returns an "estimate".
You may be able to get a quicker query by counting the distinct values of the primary key; then only the index would be accessed.
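That suggestion as a sketch, using the row_id column from the question:
SELECT COUNT(DISTINCT row_id) FROM table_name;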

Does limiting a query to one record improve performance

Will limiting a query to one result record improve performance in a large(ish) MySQL table if the table only has one matching result?
For example:
select * from people where name = "Re0sless" limit 1
if there is only one record with that name? And what if name were the primary key / set to unique? And is it worth updating the query, or will the gain be minimal?
If the column has
a unique index: no, it's no faster
a non-unique index: maybe, because it will prevent sending any additional rows beyond the first one matched, if any exist
no index: sometimes
if 1 or more rows match the query, yes, because the full table scan will be halted after the first row is matched.
if no rows match the query, no, because it will need to complete a full table scan
If you have a slightly more complicated query, with one or more joins, the LIMIT clause gives the optimizer extra information. If it expects to match two tables and return all rows, a hash join is typically optimal. A hash join is a type of join optimized for large numbers of matching rows.
Now if the optimizer knows you've passed LIMIT 1, it knows that it won't be processing large amounts of data. It can revert to a loop join.
Depending on the database (and even the database version), this can have a huge impact on performance.
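A sketch of the join case (the accounts table is made up):
SELECT p.*
  FROM people p
  JOIN accounts a ON a.person_id = p.id
 WHERE p.name = 'Re0sless'
 LIMIT 1; -- tells the optimizer it can stop after one match, favoring a loop join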
To answer your questions in order:
1) Yes, if there is no index on name. The query will end as soon as it finds the first record. Take off the limit and it has to do a full table scan every time.
2) No. Primary/unique keys are guaranteed to be unique. The query will stop running as soon as it finds the row.
I believe the LIMIT is applied after the data set is found and the result set is being built up, so I wouldn't expect it to make any difference at all. Making name the primary key will have a significant positive effect though, as it will result in an index being made for the column.
If "name" is unique in the table, then there may still be a (very very minimal) gain in performance by putting the limit constraint on your query. If name is the primary key, there will likely be none.
Yes, you will notice a performance difference when dealing with the data. One record takes up less space than multiple records. Unless you are dealing with many rows, this won't be much of a difference, but once you run the query, the data has to be displayed back to you or dealt with programmatically, which is costly. Either way, one record is easier than multiple.