Avoid applying a function to an index column - sql

I need to filter out data that exceeds a certain length but the column that contains the data is an indexed column. If I apply a function to the column I lose the benefit of the index.
I cannot create a new index or alter the column as I am not an admin to the database.
I would prefer not to drop the data after the fact.
I know of a few ways to filter the column but all would use some kind of function.
select
table.name
from
table
where
length(table.name)>12
;
The field table.name is not nullable.

If I apply a function to the column I lose the benefit of the index.
Ah, but what is the benefit of an index?
Consider these two values:
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
Are they both longer than 12 characters? Yes. Are they likely to be adjacent in the index? Of course not. Therefore the only way for Oracle to use an index to find those values is to execute a Full Fast Scan over the index and evaluate the length of each entry. Now Oracle can do that, but is it worthwhile?
Your posted query is selecting just name. In a comment you say name is not nullable. In that case it would be efficient for Oracle to use the index, because there is no need to read the table records: the index has sufficient information to satisfy the query.
However.
In that comment you also say:
the query is not that simple
If your actual query includes other columns in the projection then the database does have to visit the table to get those values. At which point the rule of thumb for indexed reads kicks in: if the result set of the query is greater than 1-2% of all the rows in the table it's more efficient to do a Full Table Scan than use an index. So the number of records in the table becomes pertinent, and especially the proportion of records where length(name) > 12. If 99% of the records have short names then it is probably still more efficient to Full Fast Scan the index. But if it's only 90% using the index would probably be deadly to performance.
Likewise, if your actual query applies additional criteria in the WHERE clause it may be more efficient to do a Full Table Scan (because the database needs to read the records to evaluate those filters) to to use a different index, if there is an appropriate one.
So, while the index would be useful for the toy query you posted in your question it may not help with your actual query, and indeed could lead to a sub-optimal access path.
is it a case by case situation depending on query complexity?
Yes. The answer is always, it depends. That's why database tuning professionals can charge the fat consultancy fees they do. If you don't provide the whole query the best we can do is point you at this post which explains to ask performance tuning questions and wish you good luck.

If the column is NOT NULL, then Oracle can answer the query using a full index scan. It will need to read every row in the index in order to find only those rows with the length greater than 12. If the index is smaller than the table this is faster than a full scan.
You are only selecting the indexed column so Oracle would not need to visit the table but can get the result entirely from the index. If you were to select other columns there were not in that index Oracle would also need to read the table row having first located the row in the index.
There is no way around this without adding a more suitable index or otherwise changing the database schema.

Related

OR many columns efficiently

Let's say I have a table with 50 columns. I want to do something like:
SELECT * FROM table WHERE column1=value1 OR column2=value2 OR ...
How can I do this efficiently?
I could add a bunch of indexes or an index across many/all columns. Would this help?
I could create a secondary table with columns (id, field_name, field_value) and then index on each column and now my ORs apply to just 2 columns which are indexed.
What else can I do?
For a bit more info:
Rows are added pretty frequently, but rarely edited after that.
Rows are selected several times after being added (maybe dozens but probably not hundreds).
Table has 100,000+ rows and table scan is too slow.
Specifically, given a row, I'll want to look up all rows that match that row on ANY column.
When you run into scenarios like this, it typically indicates that there may be room for normalizing the table (your scenario B). It would be hard to know without further information on what data your columns actually hold and what your overall table access pattern is.
That being said, without any sort of table structure change, you would just need to have an index on each column that you might want to query against, so as to prevent a full table scan.
Setting aside a discussion of changing your database design...
A combined index (an index on many or most of the columns referenced in your query) isn't going to be of help for your query, which has a bunch of OR'd colN = 'foo' predicates. MySQL is not going use that index to satisfy your query. Even if it were to use the index, there would still be other columns in the underlying table that need to be checked on essentially every row, so MySQL is very likely just to visit all the data pages and not use an index at all. (If you happen to have a GROUP BY or ORDER BY in your query, MySQL might be able to use the index to optimize those operations, especially if it was a "covering" index that included EVERY column referenced by your query.
On the other hand, IF you had a separate, individual index on EVERY column (as a leading column in the index) that was checked with an OR colN = 'foo' OR colN = 'bar', it is possible that MySQL would consider using an "index merge" plan for your query.
But it would have to be an index on EVERY column. If your query is checking even just ONE column that is not a leading column in ANY index, then MySQL would have no choice but to examine every row in the table. So having separate indexes on "many" columns will not help your query, because it's very likely that NONE of the indexes will be used.
Even if you did have a separate index for every single one of the boatload of columns being reference, it's likely that MySQL's estimate of the total number of rows being returned (combined from each index) is too large, and MySQL is likely to decide that an "index merge" is too expensive, and opt for a full table scan instead.
In summary, your only two choices for indexes to help your query (and neither of them is a really good choice) would be:
1) a "covering index" that has leading columns that can be used to satisfy a GROUP BY or an ORDER BY clause (avoiding a "Using filesort" operation"
2) separate, individual indexes on EVERY column (as a leading column) that is checked by an OR colN = 'literal' predicate in your query
But again, neither of those is likely to be a good choice.

Do queries make use of more than one index at a time?

If I have a table with an index each on a different column, does the database ever make use of both indexes when executing a query? Additionally, if I have an index on 4 columns, and an additional index on one other column, could a query against all 5 columns make use of this 2nd index, or would it just be a region scan after matching the first index?
If I have a table with an index each on a different column, does the database ever make use of both indexes when executing a query?
If the cost-based query optimizer determines that it's more efficient to use more than one index, yes, it will. If it's more efficient to do a scan (and often it is), then it may not use an index, even if you think it should.
Additionally, if I have an index on 4 columns, and an additional index on one other column, could a query against all 5 columns make use of this 2nd index, or would it just be a region scan after matching the first index?
Again, if the optimizer thinks it's efficient to do so, yes it'll use that other index for the same query. If it determines the cost is higher with the index...it'll ignore it. It all depends on how selective (or rather, how selective the optimizer thinks it is, based off the latest statistics) as to whether it'll use the index. If it's not selective (won't narrow down the results much), it'll likely ignore it.
It depends on the optimizer and the query, but optimizers relatively seldom use two separate indexes on a single table in a single query. It is perfectly feasible to construct examples where they could, possibly even should - and some may actually do so. Consider:
A UNION query where the separate terms have filters on different columns (but a table scan may be as effective)
A self-join where the separate sides of the self-join have the different filters.
However, be wary of accusing the optimizer of not being efficient - there may still be advantages to resolving the query by other methods.
To answer your 'index on 4 columns' questions: it is rather unlikely. In this scenario, it is likely that the 4-column index provides good selectivity and the query is most easily resolved by applying the extra filter condition to the rows retrieved by the index scan. (Note that the answer might be different depending on whether the extra condition is connected to the other by AND (as I assumed) or OR (where using the second index might be useful).
It depends upon the queries emitted against those tables, the size of the tables and the selectivity of the data in the columns indexed.
The optimizer uses statistics to determine whether using an index will be beneficial.
1.IF I have a table with an index each on a different column, does the database ever make use of both indexes when executing a query?
It certainly can, for example if you have the table
EMPLOYEE(
id (index1)
name
address
date (index2) )
and the table
TASKS(
id
employee_id (index3)
date (index 4)
category
description)
If you do the query:
select
employee_id,date,category,description
from EMPLOYEE, TASKS where
EMPLOYEE.id=employee_id and
EMPLOYEE.date=TASKS.date
this will list all the tasks of each employee in each day and user index1 and index2 along with index4 and index3. Which will take much more time if I where lacking either index1 or index2.
2.if I have an index on 4 columns, and an additional index on one other column, could a query against all 5 columns make use of this 2nd index, or would it just be a region scan after matching the first index?
Of course it can be done, but the query should include joins on both the 4 column index and also the single column index.

Tuning table select SQL having a RAW column in Oracle 10g

I have a table with several columns and a unique RAW column. I created an unique index on the RAW column.
My query selects all columns from the table (6 million rows).
when i see the cost of the query its too high (51K). and its still using INDEX FULL scan. The query do not have any filter conditions, its a plain select * from.
Please suggest how can i tune the query operation.
Thanks in advance.
Why are you hinting it to use the index if you're retrieving all columns from all rows? The index would only help if you were filtering on the indexed column. If you were only retrieving the indexed column then an INDEX_FFS hint might help. But if you have to go back to the data for any non-indexed columns then using the index at all becomes counterproductive beyond a certain proportion of returned data as you're having to access both the index data blocks and the table data blocks repeatedly.
So, your query is:
select /*+ index (rawdata idx_test) */
rawdata.*
from v_wis_cds_cp_rawdata_test rawdata
and you want to know why Oracle is choosing an INDEX FULL scan?
Well, as Alex said, the reason is the "index (raw data idx_text)" hint. This is a directive that tells the Oracle optimizer, "when you access rawdata, use an index access on the idx_text index", which means that's what Oracle will do if at all possible - even if that's not the best plan.
Hints don't make queries faster automatically. They are a way of telling the optimizer what not to do.
I've seen queries like this before - sometimes a hint like this is added in order to return the rows in sorted order, without actually doing a sort. However, if this was the requirement, I'd strongly recommend adding an ORDER BY clause in anyway, because if the hint becomes invalid for some reason (e.g. the index gets dropped or renamed), the sorting would no longer happen and no error would be reported.
If you don't need the rows returned in any particular order, I suggest you remove the hint and see if the performance improves.

Indexing nulls for fast searching on DB2

It's my understanding that nulls are not indexable in DB2, so assuming we have a huge table (Sales) with a date column (sold_on) which is normally a date, but is occasionally (10% of the time) null.
Furthermore, let's assume that it's a legacy application that we can't change, so those nulls are staying there and mean something (let's say sales that were returned).
We can make the following query fast by putting an index on the sold_on and total columns
Select * from Sales
where
Sales.sold_on between date1 and date2
and Sales.total = 9.99
But an index won't make this query any faster:
Select * from Sales
where
Sales.sold_on is null
and Sales.total = 9.99
Because the indexing is done on the value.
Can I index nulls? Maybe by changing the index type? Indexing the indicator column?
From where did you get the impression that DB2 doesn't index NULLs? I can't find anything in documentation or articles supporting the claim. And I just performed a query in a large table using a IS NULL restriction involving an indexed column containing a small fraction of NULLs; in this case, DB2 certainly used the index (verified by an EXPLAIN, and by observing that the database responded instantly instead of spending time to perform a table scan).
So: I claim that DB2 has no problem with NULLs in non-primary key indexes.
But as others have written: Your data may be composed in a way where DB2 thinks that using an index will not be quicker. Or the database's statistics aren't up-to-date for the involved table(s).
I'm no DB2 expert, but if 10% of your values are null, I don't think an index on that column alone will ever help your query. 10% is too many to bother using an index for -- it'll just do a table scan. If you were talking about 2-3%, I think it would actually use your index.
Think about how many records are on a page/block -- say 20. The reason to use an index is to avoid fetching pages you don't need. The odds that a given page will contain 0 records that are null is (90%)^20, or 12%. Those aren't good odds -- you're going to need 88% of your pages to be fetched anyway, using the index isn't very helpful.
If, however, your select clause only included a few columns (and not *) -- say just salesid, you could probably get it to use an index on (sold_on,salesid), as the read of the data page wouldn't be needed -- all the data would be in the index.
The rule of thumb is that an index is useful for values up on to 15% of the records. ... so an index might be useful here.
If DB2 won't index nulls, then I would suggest adding a boolean field, IsSold, and set it to true whenever the sold_on date gets set (this could be done in a trigger).
That's not the nicest solution, but it might be what you need.
Troels is correct; even rows with a SOLD_ON value of NULL will benefit from an index on that column. If you're doing ranged searches on SOLD_ON, you may benefit even more by creating a clustered index that begins with SOLD_ON. In this particular example, it may not require much additional overhead to maintain the clustering order based on SOLD_ON, since newer rows added will most likely have a newer SOLD_ON date.

Do indexes work with "IN" clause

If I have a query like:
Select EmployeeId
From Employee
Where EmployeeTypeId IN (1,2,3)
and I have an index on the EmployeeTypeId field, does SQL server still use that index?
Yeah, that's right. If your Employee table has 10,000 records, and only 5 records have EmployeeTypeId in (1,2,3), then it will most likely use the index to fetch the records. However, if it finds that 9,000 records have the EmployeeTypeId in (1,2,3), then it would most likely just do a table scan to get the corresponding EmployeeIds, as it's faster just to run through the whole table than to go to each branch of the index tree and look at the records individually.
SQL Server does a lot of stuff to try and optimize how the queries run. However, sometimes it doesn't get the right answer. If you know that SQL Server isn't using the index, by looking at the execution plan in query analyzer, you can tell the query engine to use a specific index with the following change to your query.
SELECT EmployeeId FROM Employee WITH (Index(Index_EmployeeTypeId )) WHERE EmployeeTypeId IN (1,2,3)
Assuming the index you have on the EmployeeTypeId field is named Index_EmployeeTypeId.
Usually it would, unless the IN clause covers too much of the table, and then it will do a table scan. Best way to find out in your specific case would be to run it in the query analyzer, and check out the execution plan.
Unless technology has improved in ways I can't imagine of late, the "IN" query shown will produce a result that's effectively the OR-ing of three result sets, one for each of the values in the "IN" list. The IN clause becomes an equality condition for each of the list and will use an index if appropriate. In the case of unique IDs and a large enough table then I'd expect the optimiser to use an index.
If the items in the list were to be non-unique however, and I guess in the example that a "TypeId" is a foreign key, then I'm more interested in the distribution. I'm wondering if the optimiser will check the stats for each value in the list? Say it checks the first value and finds it's in 20% of the rows (of a large enough table to matter). It'll probably table scan. But will the same query plan be used for the other two, even if they're unique?
It's probably moot - something like an Employee table is likely to be small enough that it will stay cached in memory and you probably wouldn't notice a difference between that and indexed retrieval anyway.
And lastly, while I'm preaching, beware the query in the IN clause: it's often a quick way to get something working and (for me at least) can be a good way to express the requirement, but it's almost always better restated as a join. Your optimiser may be smart enough to spot this, but then again it may not. If you don't currently performance-check against production data volumes, do so - in these days of cost-based optimisation you can't be certain of the query plan until you have a full load and representative statistics. If you can't, then be prepared for surprises in production...
So there's the potential for an "IN" clause to run a table scan, but the optimizer will
try and work out the best way to deal with it?
Whether an index is used doesn't so much vary on the type of query as much of the type and distribution of data in the table(s), how up-to-date your table statistics are, and the actual datatype of the column.
The other posters are correct that an index will be used over a table scan if:
The query won't access more than a certain percent of the rows indexed (say ~10% but should vary between DBMS's).
Alternatively, if there are a lot of rows, but relatively few unique values in the column, it also may be faster to do a table scan.
The other variable that might not be that obvious is making sure that the datatypes of the values being compared are the same. In PostgreSQL, I don't think that indexes will be used if you're filtering on a float but your column is made up of ints. There are also some operators that don't support index use (again, in PostgreSQL, the ILIKE operator is like this).
As noted though, always check the query analyser when in doubt and your DBMS's documentation is your friend.
#Mike: Thanks for the detailed analysis. There are definately some interesting points you make there. The example I posted is somewhat trivial but the basis of the question came from using NHibernate.
With NHibernate, you can write a clause like this:
int[] employeeIds = new int[]{1, 5, 23463, 32523};
NHibernateSession.CreateCriteria(typeof(Employee))
.Add(Restrictions.InG("EmployeeId",employeeIds))
NHibernate then generates a query which looks like
select * from employee where employeeid in (1, 5, 23463, 32523)
So as you and others have pointed out, it looks like there are going to be times where an index will be used or a table scan will happen, but you can't really determine that until runtime.
Select EmployeeId From Employee USE(INDEX(EmployeeTypeId))
This query will search using the index you have created. It works for me. Please do a try..