Why do functions on columns prevent the use of indexes?

On this question that I asked the other day I got the following comment.
In almost any database, almost any function on a column prevents the use of indexes. There are exceptions here and there, but in general, functions prevent the use of indexes
I googled around and found more mentions of this same behavior, but I had trouble finding something more in depth than what the comment already told me.
Could someone elaborate on why this occurs, and perhaps strategies for avoiding it?

An index in its most basic form is just the sorted column data, making it easy to look up by some value. For example, a textbook can have the pages in some order, but then have an index in the back for all the terms. As you can see, the data is precomputed/sorted and stored in a separate area.
When you apply a function to the column and try to match/filter based on the output, the index is no longer useful. Let's take a look at our book example again, and say that the function we're applying is the reverse of the term (so reverse('integral') becomes 'largetni'). You won't find this value in the index, so you have to take all the terms, put them through the function, and only then compare. All at query time. Originally we could narrow the search with i, then in, then int and so on, making it easy to find the term; the function takes that shortcut away and makes everything much slower.
If you query using this function often, you could make an index with reverse(term) ahead of time to speed up lookups. But without doing so explicitly, it will always be slow.
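Several databases support exactly that (Oracle calls them function-based indexes, PostgreSQL expression indexes). A minimal sketch, with illustrative table and column names:
-- Precompute and sort REVERSE(term), like adding a second index to the book.
CREATE INDEX idx_terms_reverse ON terms (REVERSE(term));
-- This filter can now seek the new index instead of scanning every term:
SELECT term FROM terms WHERE REVERSE(term) = 'largetni';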

Indexes are stored separately from the data itself on the SQL server. So when a query applies an operation (the function) to a column, the B-tree index that would normally be referenced for speed can no longer be used, and the query optimiser will opt not to use the index any more.

Here is a good explanation of why this occurs (this is a SQL Server specific article, but probably applies to other SQL RDBMS systems):
https://www.mssqltips.com/sqlservertip/1236/avoid-sql-server-functions-in-the-where-clause-for-performance/
The line from the article that really stands out is "The reason for this is that the function value has to be evaluated for each row of data to determine if it matches your criteria."

Let's consider an extreme example. Let's say that you're looking up a row using a cryptographic hash function, like HASH(email_address) = 0x123456. The database has an index built on email_address, but now you're asking it to look up data on HASH(email_address) which it doesn't have. It could still use the index, but it would end up having to look at every single index entry for email_address and see if HASH(email_address) matches. If it's going to have to scan the full index, it may as well just scan the full table instead so it doesn't have to bounce back and forth fetching individual row locations.
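The standard way to avoid the problem is to rewrite the predicate so the bare column stands alone and the work happens on the constant side. A sketch with illustrative names:
-- Non-sargable: YEAR() must be evaluated for every row.
SELECT * FROM orders WHERE YEAR(order_date) = 2023;
-- Sargable rewrite: a plain range on the column can use its index.
SELECT * FROM orders
WHERE order_date >= '2023-01-01'
  AND order_date <  '2024-01-01';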

Related

Oracle Index query is not working

I want to improve the performance of a simple query, with a typical structure like this:
SELECT title,datetime
FROM LICENSE_MOVIES
WHERE client='Alex'
As you can read on different websites, like this one, you should create an index like this:
CREATE INDEX INDEX_LICENSE_MOVIES
ON LICENSE_MOVIES(client);
But there is no performance improvement in the query; it is as if it were "ignoring" the index.
I have tried to use hints as this webpage says.
And the resulting query looks like this:
SELECT /*+ INDEX(LICENSE_MOVIES INDEX_LICENSE_MOVIES) */ title, datetime
FROM LICENSE_MOVIES
WHERE client='Alex'
Is there any error in this syntax? Why can't I see any improvement?
Oracle has a smart optimizer. It does not always use indexes -- in fact, you might be surprised to learn that sometimes using an index is exactly the wrong thing to do.
In your case, your data fits on a handful of data pages (well, dozens). The question is: how many "Alex"s are in the data? If there is just one, then Oracle should use the index, as follows:
1) Oracle looks up the row containing "Alex" in the index.
2) Oracle identifies the data page where the row is located.
3) Oracle loads the data page.
4) Oracle processes the query and returns the results.
If lots of rows (say more than a few dozen) are for "Alex", then the optimizer is going to "think" . . . "Gosh, I need to read every data page anyway. Let me avoid using the index and just scan all the data."
Of course, this decision is based on the available statistics (which might be inaccurate or out-of-date). But there are definitely circumstances where a full table scan is the right approach, even when an index is available.
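You can check which plan the optimizer actually picked, and refresh the statistics that decision is based on. A sketch in Oracle syntax:
-- Refresh the optimizer statistics for the table.
EXEC DBMS_STATS.GATHER_TABLE_STATS(USER, 'LICENSE_MOVIES');
-- Capture and display the chosen execution plan.
EXPLAIN PLAN FOR
SELECT title, datetime FROM LICENSE_MOVIES WHERE client = 'Alex';
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);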

What does a SQL SELECT statement actually do during execution?

In a SELECT statement:
SELECT name
FROM users
WHERE address IN (addr_a, addr_b, addr_c, ...);
We know that it will select the names of all people whose address is in (addr_a, addr_b, addr_c, ...). But I want to know what it actually does when executing this statement.
For example, does it search every element in the table to check whether its address is in (addr_a, ...)?
If addr_a, addr_b are very long, does that slow down the search process?
Is there any material about this that you could recommend?
Edit: I didn't specify an RDBMS, because I would like to know about as many SQL implementations as possible.
Edit again: Here I got answers about MySQL and SQL Server, and I accepted the "SQL Server" one as it's a detailed answer. More answers about other RDBMSs are welcome.
Since you haven't specified which RDBMS your question is about, I am going to describe how it works on SQL Server, trying to simplify it a bit and avoid most of the technicalities. It might be the same or very similar on different systems, but it also might be completely different.
What SQL Server is going to do with your query
`SELECT name FROM users WHERE address IN (addr_a, addr_b, addr_c, ...);`
depends almost entirely on what kind of indexes you have on the table. Here are 3 basic scenarios:
Scenario 1 (good index)
If you have what is called a covering index, which would mean either a PK or clustered index on the address column, or a non-clustered index on address which includes name, SQL Server will do something called an Index Seek. It means it will go through the index's tree structure and quickly pinpoint the exact row you need (or find that it doesn't exist). Since the name column is also included in the index, it will read it and return right from there.
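For the query above, such a covering index might look like this (SQL Server syntax; the index name is illustrative):
-- Seekable on address; name rides along in the leaf pages ("covering"),
-- so the base table never needs to be touched.
CREATE NONCLUSTERED INDEX IX_users_address
ON users (address)
INCLUDE (name);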
Scenario 2 (not-so-good index)
This is the case when you have an index on the address column which does not include the name column. You will find these kinds of indexes - on only one column - very often, but as you'll soon see they are pretty useless most of the time. What you are hoping for here is that SQL Server goes through your index structure (a seek) and quickly finds the row with your address. However, as the name column is not in the index, the seek can only produce the rowID (or PK) of where the row actually is, so for each row returned SQL Server must do an additional read of another index or the table to retrieve name. Since that takes about 3 times more reading than scenario 1, SQL Server will more often than not decide that it's cheaper to just go through all rows of the table rather than to use your index. And that is explained in scenario 3.
Scenario 3 (no usable index)
This will happen if you don't have indexes at all, or no index on the address column. Simply speaking, SQL Server goes through all the rows and checks every row against your condition. This is called an Index Scan (or Table Scan if there are no indexes at all). Usually the worst-case scenario and the slowest of all.
Hope that helps to clarify things a bit.
As for the other sub-question about long strings slowing things down - the answer in this case would be 'probably not much'. When SQL Server compares two strings, it goes character by character, so if the first letters of the two strings differ, it will not check further. However, if you put a wildcard % at the beginning of your string, i.e. WHERE address LIKE '%addr_a', SQL Server will have to check every character of every string in the column and therefore work much slower.
The documentation explains exactly what it does.
If all values are constants, they are evaluated according to the type of expr and sorted. The search for the item then is done using a binary search.
Therefore the order of the arguments actually doesn't matter as MySQL sorts them for comparison anyway.
@Xu: An execution plan is created for the select query, and based on that plan the final execution is done. Please check this basic documentation related to Execution Plans for more details.
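For example, MySQL will show you the plan it intends to use before the query runs (a minimal sketch reusing the question's table):
-- Reports the access type, the chosen index (if any), and estimated rows.
EXPLAIN SELECT name FROM users WHERE address IN ('addr_a', 'addr_b', 'addr_c');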

Is there a SQL ANSI way of starting a search at the end of a table?

In a certain app I must constantly query data that are likely to be amongst the last inserted rows. Since this table is going to grow a lot, I wonder if there's a standard way of optimizing the queries by making them start the lookup at the table's end. I think I would get the same optimization if the database stored the table's data in a stack-like structure, so the last inserted rows would be searched first.
The SQL spec doesn't mention anything about maintaining the insertion order. In practice, most decent DBs don't maintain it either. So it stops there. Sorting the table first ain't going to make it faster. Just index the column(s) of interest (at least the ones you use in the WHERE).
One of the "tenets" of a proper RDBMS is that this kind of matters shouldn't concern you or anyone else using the DB.
The DB engine is "free" to use whatever method it wants to store/retrieve records, so if you want to enforce a "top" behaviour, do what others suggested: add a timestamp field to the table (or tables), add an index on it, and query using it as a sort and/or filter criterion (e.g. you poll the table each minute and ask for records with timestamp >= systime - 1 minute).
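A minimal sketch of that suggestion (PostgreSQL-flavored; table and column names are illustrative):
-- An indexed timestamp makes "newest rows" an index range scan.
ALTER TABLE events ADD COLUMN created_at TIMESTAMP DEFAULT now();
CREATE INDEX idx_events_created_at ON events (created_at);
-- Poll for the last minute's records, newest first.
SELECT * FROM events
WHERE created_at >= now() - INTERVAL '1 minute'
ORDER BY created_at DESC;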
There is no standard way.
In some databases you can specify the sort order on an index.
SQL Server allows you to write ASC or DESC on an index:
[ ASC | DESC ]
Determines the ascending or descending sort direction for the particular index column. The default is ASC.
In MySQL you can also write ASC or DESC when you create the index but currently this is ignored. It might be implemented in a future version.
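As a sketch, in SQL Server a descending index on an autoincrementing key keeps the newest entries at the front of the index order (names are illustrative):
-- Newest ids come first when this index is read in order.
CREATE INDEX IX_events_id_desc ON events (id DESC);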
Add a counter or a time field in your table, sort on it and get top rows.
In other words: You should forget the idea that SQL tables are accessed in any particular order by default. A seqscan does not mean the oldest rows will be searched first, only that all rows will be checked. If you want to optimize some search you add indexes on some fields. What you are looking for is probably indexes.
If your data is indexed, it won't matter. The index is doing a binary search, not a sequential scan.
Unless you're doing TOP 1 (or something like it), the SELECT will have to scan the whole table or index anyway.
According to Data Independence you shouldn't care. That said, a clustered index would probably suit your needs if you typically look for a date range. (Sorting asc/desc shouldn't matter, but you should try it out.)
If you find that you really need it you can also shard your database to increase perf on the most recently added data.
If you have enough rows that it's actually becoming a problem, and you know how many of "the most recently inserted rows" there should be, you could try a round-about method.
Note: Even for pretty big tables, this is less efficient, but once your main table gets big enough, I've seen this work wonders for user-facing performance.
Create a "staging" table that exactly mimics your table's structure. Whenever you insert into your main table, also insert into your "staging" area. Limit your "staging" area to n rows by using a trigger to delete the lowest id row in the table when a new row over your arbitrary maximum is reached (say, 10,000 or whatever your limit is).
Then, queries can hit that smaller table first looking for the information. Since the table is arbitrarilly limited to the last n rows, it's only looking in the most recent data. Only if that fails to find a match would your query (actually, at this point a stored procedure because of the decision making) hit your main table.
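A hypothetical sketch of the pattern (MySQL-flavored; all table, column, and trigger names are illustrative, and it assumes an AUTO_INCREMENT id so "the newest n rows" can be approximated by id):
-- Staging table mirrors the main table's structure.
CREATE TABLE main_staging LIKE main;

DELIMITER //
CREATE TRIGGER trg_mirror_to_staging
AFTER INSERT ON main
FOR EACH ROW
BEGIN
  -- Mirror every insert into the staging table.
  INSERT INTO main_staging (id, payload) VALUES (NEW.id, NEW.payload);
  -- Prune anything older than the newest ~10,000 ids.
  DELETE FROM main_staging WHERE id <= NEW.id - 10000;
END//
DELIMITER ;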
Some Gotchas:
1) Make sure your trigger(s) is(are) set up properly to maintain the correct concurrency between your "main" and "staging" tables.
2) This can quickly become a maintenance nightmare if not handled properly - and depending on your scenario it can be a little finicky.
3) I cannot stress enough that this is only efficient/useful in very specific scenarios. If yours doesn't match it, use one of the other answers.
ISO/ANSI Standard SQL does not consider optimization at all. For example, the widely recognized CREATE INDEX SQL DDL does not appear in the Standard. This is because the Standard makes no assumptions about the underlying storage medium, and nor should it. I regularly use SQL to query data in text files and Excel spreadsheets, neither of which has any concept of database indexes.
You can't do this.
However, there is a way to do something that might be even better. Depending on the design of your table, you should be able to create an index that keeps things in almost the order of entry. For example, if you adopt the common practice of creating an id field that autoincrements, then that index is just about in chronological order.
Some RDBMSes permit you to declare a backwards index, that is, one that is sorted descending instead of ascending. If you create a backwards index on the ID field, and if the optimizer uses that index, it will look at the most recent entries first. This will give you a rapid response for the first row.
The next step is to get the optimizer to use the index. You need to use explain plan to see if the index is being used. If you ask for the rows in order of id descending, the optimizer will almost certainly use the backwards index. If not you may be able to use hints to guide the optimizer.
If you still need to avoid reading all the rows in order to avoid wasting time, you may be able to use the LIMIT feature to declare that you only want, say, 10 rows and no more, or just 1 row. That should do it.
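Putting those steps together (syntax varies by RDBMS; names are illustrative):
-- With a usable index on id, the engine can stop after the 10 newest rows.
SELECT * FROM events ORDER BY id DESC LIMIT 10;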
Good luck.
If your table has a create date, then I'd reverse sort by that and take the top 1.

Use of MD5(URL) instead of URL in DB for WHERE

I have a big MySQL InnoDB table (about 1 million records, increasing by 300K weekly), let's say with blog posts. This table has a url field with an index.
When adding new records to it, I check for existing records with the same url. Here is what the query looks like:
SELECT COUNT(*) FROM `tablename` WHERE url='http://www.google.com/';
Currently the system produces about 10-20 queries per second, and this amount will increase. I'm thinking about improving performance by adding an additional field which is an MD5 hash of the URL.
SELECT COUNT(*) FROM `tablename` WHERE md5url=MD5('http://www.google.com/');
So it will be shorter and of constant length, which is better for an index compared to the URL field. What do you guys think about it? Does it make sense?
Another suggestion by a friend of mine is to use CRC32 instead of MD5, but I'm not sure how unique the result of CRC32 will be. Let me know what you think about CRC32 for this role.
UPDATE: the URL column is unique for each row.
Create a non-clustered index on URL. That will let your SQL engine do all the optimization internally and will produce the best results!
If you create an index on a VARCHAR column, SQL will create a hash internally anyway and using the index can give better performance by an order of magnitude or even more!
Also, something to keep in mind if you're only checking whether a URL exists, is that certain SQL products will produce faster results with a query like this:
IF NOT EXISTS(SELECT * FROM `tablename` WHERE url='')
-- return TRUE or do your logic here
I think CRC32 would actually be better for this role, as it's shorter and saves more space. If you're receiving that many queries, the objective is to save space anyway, right? If it does the job, I'd say go for it.
Although, since it's only 32 bits and shorter in length, it's not as unique as MD5, of course. You will have to decide whether you want uniqueness or whether you want to save space.
I still think I'd choose CRC32.
My system generates roughly 4k queries per second, and I use CRC32 for links.
Using the built-in indexing is always best, or you should volunteer to add to their codebase anyway ;)
When using a hash, create a 2-column index on the hash and the URL. If you only include the first couple of letters of the URL in the index, it still does a complete match, but it doesn't index more than the first few letters.
Something like this:
INDEX(CRC32_col, URL_col(5))
Either hash would work in that case. It's a trade-off of space vs speed.
Also, this query will be much faster:
SELECT * FROM table WHERE hash_col = 'hashvalue' AND url_col = 'urlvalue' LIMIT 1;
This will find the first value and stop. Much faster than finding many matches for the COUNT(*) calculation.
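Putting those pieces together, a MySQL-flavored sketch (column and index names are illustrative):
-- Store the CRC32 alongside the URL; index the hash plus a short URL prefix.
ALTER TABLE tablename ADD COLUMN url_crc INT UNSIGNED NOT NULL DEFAULT 0;
UPDATE tablename SET url_crc = CRC32(url);
CREATE INDEX idx_url_crc ON tablename (url_crc, url(5));
-- The hash narrows the search; the full URL comparison resolves collisions.
SELECT 1 FROM tablename
WHERE url_crc = CRC32('http://www.google.com/')
  AND url = 'http://www.google.com/'
LIMIT 1;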
Ultimately the best choice is to make test cases for each variant and benchmark.
Don't most SQL engines use hash functions internally for text column searches?
If you're going to use hashed keys and you're concerned about collisions, use two different hash functions and concatenate the two hashed values.
But even if you do this, you should always store the original key value in the row as well.
If the tendency is for the result of that select statement to be rather high, an alternative solution would be to have a separate table which keeps track of the counts. Obviously there are high penalties for using that technique, but if this specific query is a common one and is too slow, this might be a solution.
There are obvious trade-offs involved in this solution, and you probably do not want to update this 2nd table after every individual insertion of a new record, as that would slow down your insertions.
If you choose a hash you need to take collisions into account. Even with a large hash like MD5 you have to account for the meet-in-the-middle probability, better known as the birthday problem. For a smaller hash like CRC-32 the collision probability will be quite large, so your WHERE has to specify both the hash and the full URL.
But I gotta ask, is this the best way to spend your efforts? Is there nothing else left to optimize? You may well be doing premature optimization unless you have clear metrics and measurements indicating that this problem is the bottleneck of the system. After all, this kind of seek is what databases are optimized for (all of them), and by doing something like a hash you may actually decrease performance (e.g. your index may become fragmented because hashes have a different distribution than URLs).

Using IN or a text search

I want to search a table to find all rows where one particular field is one of two values. I know exactly what the values would be, but I'm wondering which is the most efficient way to search for them:
For the sake of example, the two values are "xpoints" and "ypoints". I know for certain that there will be no other values in that field that end in "points", so the two queries I'm considering are:
WHERE `myField` IN ('xpoints', 'ypoints')
--- or...
WHERE `myField` LIKE '_points'
which would give the best results in this case?
As always with SQL queries, run it through the profiler to find out. However, my gut instinct would have to say that the IN search would be quicker. Especially in the example you gave: if the field were indexed, it would only have to do 2 lookups. If you did a LIKE search, it may have to do a scan, because you are looking for records that end with a certain value. It would also be more accurate, as LIKE '_points' could also return 'gpoints', or any other similar string.
Unless all of the data items in the column in question start with 'x' or 'y', I believe IN will always give you a better query. If it is indexed, as @Kibbee points out, you will only have to perform 2 lookups to get both. Alternatively, if it is not indexed, a table scan using IN will only have to check the first letter most of the time, whereas with LIKE it will have to check two characters every time (assuming all items are at least 2 characters), since the first character is allowed to be anything.
Try it and see. Create a large amount of test data. Also, try it with and without an index on myField. While you are at it, see if there's a noticeable difference between LIKE 'points' and LIKE 'xpoint'.
It depends on what the optimizer does with each query.
For small amounts of data, the difference will be negligible. Do whichever one makes more sense. For large amounts of data the amount of disk I/O matters much more than the amount of CPU time.
I'm betting that IN will get you better results than LIKE, if there is an index on myfield. I'm also betting that 'xpoint_' runs faster than '_points'. But there's nothing like trying it yourself.
MySQL can't use an index when using string comparisons such as LIKE '%foo' or '_foo', but can use an index for comparisons like 'foo%' and 'foo_'.
So in your case, IN will be much faster assuming that the field is indexed.
If you're working with a limited set of possible values, it's worth specifying the field as an ENUM - MySQL will then store it internally as an integer and make this sort of lookup much faster, and save disk space.
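A sketch of that change (MySQL syntax; the table name is illustrative):
-- Stored internally as an integer, so comparisons and the index stay compact.
ALTER TABLE mytable
  MODIFY myField ENUM('xpoints', 'ypoints') NOT NULL;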
It will be faster to do the IN version than the LIKE version, especially when your wildcard isn't at the end of the comparison; even under ideal conditions IN would still come out ahead, at least until your query nears the maximum query size.