What does a SQL SELECT statement actually do during execution?

In a SELECT statement:
SELECT name
FROM users
WHERE address IN (addr_a, addr_b, addr_c, ...);
We know that it will select the names of all people whose address is in (addr_a, addr_b, addr_c, ...). But I want to know what it actually does when executing this statement.
For example, does it check every row in the table to see whether its address is in (addr_a, ...)?
If addr_a, addr_b, etc. are very long, does that slow down the search?
Is there any material on this topic you can recommend?
Edit: I didn't specify an RDBMS because I would like to know about as many SQL implementations as possible.
Edit again: I got answers here about MySQL and SQL Server, and I accepted the SQL Server one as it's a detailed answer. More answers about other RDBMSs are welcome.

Since you haven't specified which RDBMS your question is about, I am going to describe how it works on SQL Server, trying to simplify it a bit and avoid too many technicalities. It might be the same or very similar on other systems, but it also might be completely different.
What SQL Server is going to do with your query
`SELECT name FROM users WHERE address IN (addr_a, addr_b, addr_c, ...);`
depends almost entirely on what kind of indexes you have on the table. Here are 3 basic scenarios:
Scenario 1 (good index)
If you have what is called a covering index, which here would mean either a PK or clustered index on column address, or a non-clustered index on address which includes name, SQL Server will do something called an Index Seek. That means it will walk the index's tree structure and quickly pinpoint the exact row you need (or find that it doesn't exist). Since the name column is also included in the index, it will read it and return it right from there.
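For illustration, a covering index for this query might be created like this (the index name is made up; SQL Server syntax):
CREATE NONCLUSTERED INDEX IX_users_address
ON users (address)
INCLUDE (name);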
Scenario 2 (not-so-good index)
This is the case when you have an index on column address which does not include column name. You will find this kind of index, on only one column, very often, but as you'll soon see, they are pretty useless most of the time. What you are hoping for here is that SQL Server walks your index structure (a seek) and quickly finds the rows matching your address. However, as column name is not in the index, it can only get the rowID (or PK) of where the row actually is, so for each row returned it has to do an additional read of another index or the table to retrieve name. Since that takes roughly 3 times more reading than scenario 1, SQL Server will more often than not decide that it's cheaper to just go through all rows of the table rather than use your index. And that is explained in scenario 3.
Scenario 3 (no usable index)
This will happen if you have no indexes at all, or no index on column address. Simply speaking, SQL Server goes through all the rows and checks every row against your condition. This is called an Index Scan (or Table Scan if there are no indexes at all). It is usually the worst-case scenario and the slowest of all.
Hope that helps to clarify things a bit.
As for the other sub-question about long strings slowing down the search: the answer in this case would be 'probably not much'. When SQL Server compares two strings, it goes character by character, so if the first letters of the two strings differ, it will not check any further. However, if you put a wildcard % at the beginning of your string, i.e. WHERE address LIKE '%addr_a', SQL Server will have to check every character of every string in the column and will therefore be much slower.
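A sketch of the difference, assuming an index on address exists:
-- Sargable: SQL Server can seek on an index on address
SELECT name FROM users WHERE address LIKE 'addr_a%';
-- Not sargable: the leading % forces checking every value
SELECT name FROM users WHERE address LIKE '%addr_a';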

The MySQL documentation explains exactly what it does:
If all values are constants, they are evaluated according to the type of expr and sorted. The search for the item then is done using a binary search.
Therefore the order of the arguments actually doesn't matter as MySQL sorts them for comparison anyway.
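So, for example, these two queries should behave identically (a minimal sketch based on the original query):
-- MySQL sorts the constant list once, then binary-searches it for each row
SELECT name FROM users WHERE address IN ('addr_c', 'addr_a', 'addr_b');
SELECT name FROM users WHERE address IN ('addr_a', 'addr_b', 'addr_c');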

@Xu: An execution plan is created for the SELECT query, and the final execution is done based on that plan. Please check the basic documentation on execution plans for more details.

Related

Two indexes for same column and change the order

I have a large table in Microsoft SQL Server 2008. It has two indexes: one index with column A in descending order, and another with column A in ascending order along with some other columns.
My application does the following:
Select for the record.
If there is no record, then insert.
If found, then update the record.
Note that this table has millions of records.
The question is: do these indexes affect select/insert/update performance?
Any suggestions?
Having the exact same two indexes with the only difference being the sort order makes practically no difference to the SQL engine; it will just pick either one.
Imagine you have 2 dictionaries of the English language, one sorting words from A to Z and the other from Z to A. The effort you need to search for a particular word will be roughly the same in both cases.
A different case would be if you had 2 lists of people's data, one ordered by first name then last name, and the other by last name, then first name. If you have to look for "John Doe", the one ordered first by last name will be practically useless compared to the other one.
These examples are very simplified representations of indexes on SQL Server. Indexes store their data in a structure called a B-Tree, but for searching purposes these examples are enough to understand when an index will be useful or not.
Summing up: you can drop the first index and keep the one with the additional columns, since it can be used in more scenarios, including all the cases that would require the other one. Keeping a useless index brings additional maintenance work, like keeping the index updated on every insert, update and delete and refreshing statistics, so you might want to drop it.
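For example, dropping the redundant index would look something like this (table and index names are hypothetical):
DROP INDEX IX_MyTable_A_Desc ON dbo.MyTable;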
PS: As for whether the 2nd index is used, this greatly depends on the query you use to access that particular table, its joins and WHERE clauses. You can use "Display Estimated Execution Plan" with the query highlighted in SSMS to see how the engine will access each object to perform the operation. If the index is used, you will see it there.
Thanks for all the answers. I explored SQL Server Profiler and the SQL Tuning Advisor for SQL Server and ran them to get recommended indexes. They recommended a single index with included columns. I used that index and performance improved.

Why do functions on columns prevent the use of indexes?

On this question that I asked the other day I got the following comment.
In almost any database, almost any function on a column prevents the use of indexes. There are exceptions here and there, but in general, functions prevent the use of indexes
I googled around and found more mentions of this same behavior, but I had trouble finding something more in depth than what the comment already told me.
Could someone elaborate on why this occurs, and perhaps strategies for avoiding it?
An index in its most basic form is just the sorted column data, making it easy to look up by some value. For example, a textbook can have the pages in some order, but then have an index in the back for all the terms. As you can see, the data is precomputed/sorted and stored in a separate area.
When you apply a function to the column and try to match/filter based on the output, the index is no longer useful. Let's take our book example again, and say the function we're applying is the reverse of the term (so reverse('integral') becomes 'largetni'). You won't find this value in the index, so you have to take all the terms, put them through the function, and only then compare. All at query time. Originally we could narrow the search step by step, first i, then in, then int and so on, making it easy to find the term, so the function made everything much slower.
If you query using this function often, you could build an index on reverse(term) ahead of time to speed up lookups. But without doing so explicitly, it will always be slow.
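On SQL Server, one way to do this is to index a computed column (a sketch; the table and column names are made up):
-- Precompute the function's result and index it
ALTER TABLE terms ADD term_reversed AS REVERSE(term);
CREATE INDEX IX_terms_reversed ON terms (term_reversed);
-- This predicate can now use the index instead of reversing every term at query time
SELECT term FROM terms WHERE term_reversed = REVERSE('integral');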
Indexes are stored separately from the data itself on the SQL server. So when you run a query, the B-tree index that ought to be referenced to provide the speed can no longer be referenced, because there is an operation (the function) on the column, and the query optimiser will opt not to use the index any more.
Here is a good explanation of why this occurs (this is a SQL Server specific article, but probably applies to other SQL RDBMS systems):
https://www.mssqltips.com/sqlservertip/1236/avoid-sql-server-functions-in-the-where-clause-for-performance/
The line from the article that really stands out is: "The reason for this is that the function value has to be evaluated for each row of data to determine if it matches your criteria."
Let's consider an extreme example. Let's say that you're looking up a row using a cryptographic hash function, like HASH(email_address) = 0x123456. The database has an index built on email_address, but now you're asking it to look up data on HASH(email_address) which it doesn't have. It could still use the index, but it would end up having to look at every single index entry for email_address and see if HASH(email_address) matches. If it's going to have to scan the full index, it may as well just scan the full table instead so it doesn't have to bounce back and forth fetching individual row locations.
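A common strategy for avoiding this is to move the function to the constant side of the comparison, so the bare column can still be matched against the index. A sketch (it assumes email addresses are stored lower-cased):
-- Function on the column: an index on email_address cannot be seeked
SELECT * FROM users WHERE LOWER(email_address) = 'alice@example.com';
-- Function on the constant: the index on email_address can be used
SELECT * FROM users WHERE email_address = LOWER('Alice@Example.com');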

Using temp table for sorting data in SQL Server

Recently, I came across a pattern (not sure, it could be an anti-pattern) for sorting data in a SELECT query. The pattern is a verbose and non-declarative way of ordering data: dump the relevant data from the actual table into a temporary table and then apply ORDER BY on a field of the temporary table. I guess the only reason someone would do that is to improve performance (which I doubt); there is no other benefit.
For example, let's say there is a user table. The table might contain millions of rows. We want to retrieve all users whose first name starts with 'G', sorted by first name. The natural and more declarative way to implement a SQL query for this scenario is:
More natural and declarative way
SELECT * FROM Users
WHERE NAME LIKE 'G%'
ORDER BY Name
Verbose way
SELECT * INTO TempTable
FROM Users
WHERE NAME LIKE 'G%'
SELECT * FROM TempTable
ORDER BY Name
With that context, I have a few questions:
1. Will there be any performance difference between the two ways if there is no index on the first name field? If yes, which one is better?
2. Will there be any performance difference between the two ways if there is an index on the first name field? If yes, which one is better?
3. Shouldn't the SQL Server optimizer generate the same execution plan for both ways?
4. Is there any benefit to the verbose way from any other perspective, like locking/blocking?
Thanks in advance.
Regularly: an anti-pattern used by people without an idea of what they are doing.
SOMETIMES: OK, because SQL Server has a problem that is not resolvable otherwise - though I haven't seen such a case in years.
It makes things slower because it forces the tempdb table to be fully populated FIRST, while otherwise the query could POSSIBLY be resolved more efficiently.
The last time I saw this was about 3 years ago. We made it 3 times as fast by not being "smart" and not using a tempdb table ;)
Answers:
1: No, it still needs a table scan, obviously.
2: Possibly - it depends on the amount of data, but an index seek would already return the data in order (as the index is ordered by content).
3: No, obviously. Query plan optimization is statement by statement. By cutting the execution in 2, the query optimizer CANNOT merge the two statements into one plan.
4: Only if you run into a query optimizer issue or a limitation on how many tables you can join - not in this degenerate case (degenerate in a technical sense, i.e. very simplistic). But if you need to join very many tables, it may be better to go with an interim step.
If the field you want to order by is not indexed, you could put everything into a temp table, index it, and then do the ordering; that might be faster. You would have to test to make sure.
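A sketch of that variant, building on the queries from the question (the index name is made up):
SELECT * INTO #TempUsers FROM Users WHERE Name LIKE 'G%'
CREATE CLUSTERED INDEX IX_TempUsers_Name ON #TempUsers (Name)
SELECT * FROM #TempUsers ORDER BY Name -- can now be read back in index order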
There is never any benefit to the second approach that I can think of.
It means that if the data is available pre-ordered, SQL Server can't take advantage of this, and it adds an unnecessary blocking operator and an additional sort to the plan.
In the case that the data is not available pre-ordered, SQL Server will sort it in a work table, either in memory or in tempdb, anyway, and adding an explicit #temp table just adds an unnecessary additional step.
Edit
I suppose one case where the second approach could give an apparent benefit is if the presence of the ORDER BY causes SQL Server to choose a different plan that turns out to be suboptimal. In that case I would resolve it in a different way, by either improving statistics or by using hints/query rewrites to avoid the undesired plan.
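For example, one hint-based rewrite along those lines might force a particular index rather than split the query in two (the index name is hypothetical):
SELECT * FROM Users WITH (INDEX (IX_Users_Name)) -- hypothetical index on Name
WHERE Name LIKE 'G%'
ORDER BY Name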

Query with LIKE, increasingly slow with a smaller resultset

Say I have a Person table with 200,000 records and a clustered index on its GUID primary key. This GUID is generated using the NEWSEQUENTIALID() construct provided by SQL Server (2008 R2). Furthermore, there is a regular index on the LastName (varchar(256)) column.
For every record I've generated a unique name (Lastname_1 through Lastname_200000). Now, playing around with some queries, I have found that the more restrictive my criteria are, the slower SQL Server returns actual results. And this performance impact is quite severe.
E.g.:
SELECT * FROM Person WHERE Lastname LIKE '%Lastname_123456%'
Is much slower than
SELECT * FROM Person WHERE Lastname LIKE '%Lastname_123%'
Response times are measured with statistics turned on:
SET STATISTICS TIME ON
I can imagine this being caused:
1) by the LIKE clause itself, since it starts with %, making it impossible to use the index on that particular column;
2) by SQL having to think harder about my 'bigger question'.
Is there any truth in this? Is there some way to avoid this?
Edit:
To add some context to this question: this is part of a use case for a 'free search'. I would very much like the system to be fast when a user enters a full last name.
How should I make these cases perform? Should I avoid the '%xxx%' construction and go for an 'xxx%'-style construction? That does add a lot of speed, but at the cost of some flexibility for the user...
You are right on with number 2: the longer LIKE pattern must match more characters in the string. SQL stops comparing when it finds a character that doesn't match, so a shorter search string takes fewer string-matching iterations, even though you get more results back.
As for #1: SQL will use an index for a LIKE if possible, but it will probably do an index scan (probably of the clustered index), since a seek is not possible with a leading wildcard. It also depends on what's included in the index; since you are selecting all columns, it's likely that a table scan is happening instead, because the index you 'could' use does not cover your query (unless it's the clustered index).
Check your execution plan - you will likely see a table scan
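One way to compare the two patterns yourself, a sketch using the question's table:
SET STATISTICS TIME ON
SET STATISTICS IO ON
-- Sargable: an index seek on Lastname is possible
SELECT * FROM Person WHERE Lastname LIKE 'Lastname_123456%'
-- Leading %: expect a scan in the plan
SELECT * FROM Person WHERE Lastname LIKE '%Lastname_123456%'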
Usually, SQL Server does not use indexes for a LIKE with a leading wildcard.
This article can help guide you

Multiple indexes on one column

Using Oracle, there is a table called User.
Columns: Id, FirstName, LastName
Indexes: 1. PK(Id), 2. UPPER(FirstName), 3. LOWER(FirstName), 4. Index(FirstName)
As you can see index 2, 3, 4 are indexes on the same column - FirstName.
I know this creates overhead, but my question is: on a SELECT, how will the database react/optimize?
For instance:
SELECT Id FROM User u WHERE
u.FirstName LIKE 'MIKE%'
Will Oracle hit the right index or will it not?
The problem is that via Hibernate (which uses prepared statements) this query becomes VERY slow.
Thanks.
UPDATE: Just to clarify indexes 2 and 3 are functional indexes.
In addition to Mat's point that either index 2 or 3 should be redundant because you should choose one approach to doing case-insensitive searches and to Richard's point that it will depend on the selectivity of the index, be aware that there are additional concerns when you are using the LIKE clause.
Assuming you are using bind variables (which it sounds like you are based on your use of prepared statements), the optimizer has to guess at how selective the actual bind value is going to be. Something short like 'S%' is going to be very non-selective, causing the optimizer to generally prefer a table scan. A longer string like 'Smithfield-Manning%', on the other hand, is likely to be very selective and would likely use index 4. How Oracle handles this variability will depend on the version.
In Oracle 10, Oracle introduced bind variable peeking. This meant that the first time Oracle parsed a query after a reboot (or after the query plan being aged out of the shared pool), Oracle looked at the bind value and decided what plan to use based on that value. Assuming that most of your queries would benefit from the index scan because users are generally searching on relatively selective values, this was great if the first query after a reboot had a selective condition. But if you got unlucky and someone did a WHERE firstname LIKE 'S%' immediately after a reboot, you'd be stuck with the table scan query plan until the query plan was removed from the shared pool.
Starting in Oracle 11, however, the optimizer has the ability to do adaptive cursor sharing. That means that the optimizer will try to figure out that WHERE firstname LIKE 'S%' should do a table scan and WHERE firstname LIKE 'Smithfield-Manning%' should do an index scan and will maintain multiple query plans for the same statement in the shared pool. That solves most of the problems that we had with bind variable peeking in earlier versions.
But even here, the accuracy of the optimizer's selectivity estimates is generally going to be problematic for medium-length strings. It's generally going to know that a single-character string is very weakly selective and that a 20-character string is highly selective, but even with a 256-bucket histogram, it's not going to have a whole lot of information about how selective something like WHERE firstname LIKE 'Smit%' really is. It may know roughly how selective 'Sm%' is based on the column histogram, but it's guessing rather blindly at how selective the next two characters are. So it's not uncommon to end up in a situation where most of the queries work efficiently but the optimizer is convinced that WHERE firstname LIKE 'Cave%' isn't selective enough to use an index.
Assuming that this is a common query, you may want to consider using Oracle's plan stability features to force Oracle to use a particular plan regardless of the value of a bind variable. This may mean that users that enter a single character have to wait even longer than they would otherwise have waited because the index scan is substantially less efficient than doing a table scan. But that may be worth it for other users that are searching for short but reasonably distinctive last names. And you may do things like add a ROWNUM limiter to the query or add logic to the front end that requires a minimum number of characters in the search box to avoid situations where a table scan would be more efficient.
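A sketch of the ROWNUM limiter idea mentioned above (Oracle syntax; the bind variable is hypothetical, and the table name is adapted since USER is a reserved word):
SELECT id FROM (
  SELECT id FROM users u WHERE upper(u.firstname) LIKE :search_prefix || '%'
) WHERE ROWNUM <= 100  -- caps the cost when the pattern is very unselective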
It's a bit strange to have both the upper and lower function-based indexes on the same field, and I don't think the optimizer will use either in your query as it is.
You should pick one or the other (and probably drop the last one too), and only ever query on the upper- (or lower-) cased value with something like:
select id from user u where upper(u.firstname) like 'MIKE%'
Edit: look at this post too, it has some interesting info: How to use a function-based index on a column that contains NULLs in Oracle 10+?
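Putting that together, a minimal sketch might look like this (Oracle syntax; the index name is made up, and the table name is adapted since USER is a reserved word):
CREATE INDEX idx_users_upper_fname ON users (upper(firstname));
-- The predicate must use the exact same expression as the index:
select id from users u where upper(u.firstname) like 'MIKE%'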
It may not hit any of your indexes, because you are returning ID in the SELECT clause, which is not covered by the indexes.
If the index is very selective, and Oracle decides it is still worthwhile to use it to find 'MIKE%' and then perform a lookup on the data to get the ID column, then it may use 4. Index(FirstName). Indexes 2 and 3 will only be used if the column searched uses the exact function defined in the index.