OR many columns efficiently - sql

Let's say I have a table with 50 columns. I want to do something like:
SELECT * FROM table WHERE column1=value1 OR column2=value2 OR ...
How can I do this efficiently?
I could add a bunch of indexes or an index across many/all columns. Would this help?
I could create a secondary table with columns (id, field_name, field_value) and then index on each column and now my ORs apply to just 2 columns which are indexed.
What else can I do?
For a bit more info:
Rows are added pretty frequently, but rarely edited after that.
Rows are selected several times after being added (maybe dozens but probably not hundreds).
Table has 100,000+ rows and table scan is too slow.
Specifically, given a row, I'll want to look up all rows that match that row on ANY column.

When you run into scenarios like this, it typically indicates that there may be room for normalizing the table (your scenario B). It would be hard to know without further information on what data your columns actually hold and what your overall table access pattern is.
That being said, without any sort of table structure change, you would just need to have an index on each column that you might want to query against, so as to prevent a full table scan.
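As a rough sketch of what the normalized layout (scenario B in the question) might look like: the names `row_values`, `row_id`, `field_name` and `field_value`, the column types, and the literal id 42 are all assumptions for illustration, not something taken from the original schema.

```sql
-- Hypothetical side table: one row per (original row, original column) pair.
CREATE TABLE row_values (
    row_id      INT          NOT NULL,  -- id of the row in the original wide table
    field_name  VARCHAR(64)  NOT NULL,  -- name of the original column
    field_value VARCHAR(255) NOT NULL,  -- that column's value, stored as text
    PRIMARY KEY (row_id, field_name)
);

-- One composite index serves every "column = value" lookup.
CREATE INDEX ix_row_values_name_value
    ON row_values (field_name, field_value);

-- Ids of all rows that share at least one column value with row 42.
SELECT DISTINCT other.row_id
FROM row_values AS probe
JOIN row_values AS other
  ON  other.field_name  = probe.field_name
  AND other.field_value = probe.field_value
WHERE probe.row_id = 42
  AND other.row_id <> 42;
```

The trade-off is that every insert into the wide table now also writes up to 50 rows here, which may or may not be acceptable for a table that is written frequently.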

Setting aside a discussion of changing your database design...
A combined index (an index on many or most of the columns referenced in your query) isn't going to be of help for your query, which has a bunch of OR'd colN = 'foo' predicates. MySQL is not going to use that index to satisfy your query. Even if it were to use the index, there would still be other columns in the underlying table that need to be checked on essentially every row, so MySQL is very likely just to visit all the data pages and not use an index at all. (If you happen to have a GROUP BY or ORDER BY in your query, MySQL might be able to use the index to optimize those operations, especially if it was a "covering" index that included EVERY column referenced by your query.)
On the other hand, IF you had a separate, individual index on EVERY column (as a leading column in the index) that was checked with an OR colN = 'foo' OR colN = 'bar', it is possible that MySQL would consider using an "index merge" plan for your query.
But it would have to be an index on EVERY column. If your query is checking even just ONE column that is not a leading column in ANY index, then MySQL would have no choice but to examine every row in the table. So having separate indexes on "many" columns will not help your query, because it's very likely that NONE of the indexes will be used.
Even if you did have a separate index for every single one of the boatload of columns being referenced, it's likely that MySQL's estimate of the total number of rows being returned (combined from each index) would be too large, and MySQL would decide that an "index merge" is too expensive and opt for a full table scan instead.
In summary, your only two choices for indexes to help your query (and neither of them is a really good choice) would be:
1) a "covering index" that has leading columns that can be used to satisfy a GROUP BY or an ORDER BY clause (avoiding a "Using filesort" operation)
2) separate, individual indexes on EVERY column (as a leading column) that is checked by an OR colN = 'literal' predicate in your query
But again, neither of those is likely to be a good choice.
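To check whether MySQL actually chooses an index merge for the second option, something along these lines could be tried. The table name `t` and the index names are hypothetical, and only three of the fifty columns are shown.

```sql
-- One single-column index per OR'd column (hypothetical names).
CREATE INDEX ix_t_column1 ON t (column1);
CREATE INDEX ix_t_column2 ON t (column2);
CREATE INDEX ix_t_column3 ON t (column3);

-- Ask the optimizer which plan it would use.
EXPLAIN
SELECT *
FROM t
WHERE column1 = 'value1'
   OR column2 = 'value2'
   OR column3 = 'value3';
-- An index merge shows up as type = index_merge, with Extra mentioning
-- something like "Using union(...)". type = ALL means a full table scan.
```

Whether the merge is chosen depends on the optimizer's row estimates, so the result can differ between a small test table and the real data.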

Related

How to speed up mariadb join tables

I have 2 tables from which I'm joining certain columns. They are joined on a VARCHAR column (indexed in both tables). Table A has a bit over 800.000 records and Table B has 20.000 records.
Table A has an auto_inc primary key. Table B does not have a primary key, only the index on the mentioned VARCHAR column.
The query takes about 48 seconds which is too slow. What can I do to increase the speed? Would it help to create a primary key auto_incr in table B? Even if this is not the column on which the join takes place?
Beginning user in SQL. Both tables are InnoDB and I use Mariadb.
QUERY:
select distinct
`pr`.`ProductIdentifier` AS `ProductIdentifier`,
`pr`.`Datum` AS `Datum`,
`pr`.`Retailer` AS `Retailer`,
`pr`.`Prijs` AS `Prijs`,
`pm`.`Merk` AS `Merk`,
`pm`.`Product` AS `Product`,
`pm`.`Formaat` AS `Formaat`
from
(`prices`.`prices_table` `pr`
join `prices`.`product_match_table` `pm`
on(`pr`.`ProductIdentifier` = `pm`.`ProductIdentifier`))
EXPLAIN SELECT:
Explain table
This answer is based on my knowledge of indexing in general; MariaDB may have some more specialised options I am not aware of.
However, indexes broadly speed up queries in two ways:
By only having the columns needed, meaning less data to read and process
By being sorted in an appropriate manner to help processing
For the first, you typically need a covering index.
For the second, this includes:
Being sorted the same way (e.g., indexed on the same fields) as tables it is being JOINed to in the query
Being sorted so that WHERE clauses and other types of filtering can directly use the sort to go to the appropriate spot in the index/table
In practice, often the best improvement in performance is that last one - however you do not have WHERE clauses in your code there. If (as is typical) the users filter the results (e.g., only show me results where ProductName = 'Handbag') then you may need to adjust the indexes for those (more on that a bit later though).
Covering indexes for query above
I think with the current query (and no filtering etc) the fastest you can get is with two indexes:
CREATE INDEX `IX_prices_ProductIdentifier` ON `prices`.`prices_table`
(`ProductIdentifier`,
`Datum`,
`Retailer`,
`Prijs`);
CREATE INDEX `IX_productmatch_ProductIdentifier` ON `prices`.`product_match_table`
(`ProductIdentifier`,
`Merk`,
`Product`,
`Formaat`);
These provide covering indexes on the query shown, and are both sorted the same (by productIdentifier) to make the join easier.
Searching/filtering (not specified in initial example)
However, if people often search by a specific field first, then it makes sense to re-order the fields in the relevant table (so the searched field is first), or have multiple indexes with the search field at the front.
For example, people may be able to search for specific values in pr.Retailer, pm.Merk, or pm.Product. You may therefore add these additional indexes:
CREATE INDEX `IX_prices_Retailer` ON `prices`.`prices_table`
(`Retailer`,
`ProductIdentifier`,
`Datum`,
`Prijs`);
CREATE INDEX `IX_productmatch_Merk` ON `prices`.`product_match_table`
(`Merk`,
`ProductIdentifier`,
`Product`,
`Formaat`);
CREATE INDEX `IX_productmatch_Product` ON `prices`.`product_match_table`
(`Product`,
`ProductIdentifier`,
`Merk`,
`Formaat`);
Notice with the above that the field orders matter. The data (index) is sorted by the first field, then the second field, then the third field etc. To use the index effectively, your filtering/WHERE clause needs to include at least the first field, if not more.
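To illustrate the point about field order, here is a sketch of which filters can and cannot seek on IX_prices_Retailer (Retailer, ProductIdentifier, Datum, Prijs). The literal values are made up, and Datum is assumed to be a date column.

```sql
-- Can seek: the filter includes the leading column (Retailer).
SELECT ProductIdentifier, Datum, Prijs
FROM prices.prices_table
WHERE Retailer = 'SomeRetailer';

-- Can seek further: leading column plus the second column.
SELECT Datum, Prijs
FROM prices.prices_table
WHERE Retailer = 'SomeRetailer'
  AND ProductIdentifier = 'ABC123';

-- Cannot seek on this index: Datum is the third column and the
-- leading columns are not constrained. The optimizer may still scan
-- the whole index (it covers the query), but it cannot jump straight
-- to the matching entries.
SELECT Retailer, ProductIdentifier, Prijs
FROM prices.prices_table
WHERE Datum = '2021-01-01';
```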
An alternative to these filtering indexes is to keep the original two indexes as above, and then put a separate index on each of the fields users can search on. For example, if the users can filter on the retailer, merk and product, then create:
one index on pr.Retailer
one on pm.Merk, and
one on pm.Product
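Spelled out, that alternative might look like the following; the index names are only suggestions.

```sql
-- One narrow index per searchable field, in addition to the two
-- covering indexes on ProductIdentifier.
CREATE INDEX `IX_prices_Retailer_only` ON `prices`.`prices_table` (`Retailer`);
CREATE INDEX `IX_productmatch_Merk_only` ON `prices`.`product_match_table` (`Merk`);
CREATE INDEX `IX_productmatch_Product_only` ON `prices`.`product_match_table` (`Product`);
```

These are smaller and cheaper to maintain than the wide filtering indexes, at the cost of an extra lookup into the table for the remaining columns.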
Caveats
Adding indexes makes data inserts on the relevant table (and often deletes/updates) slower than if the indexes weren't there. The reason is that the database doesn't just need to update the data in the table; it also needs to update the index(es).
Typically this is not much of a problem unless you are adding and deleting lots of data from the tables frequently. But it is worth checking your 'product maintenance' interface (e.g., adding products, updating prices etc) after adding indexes to confirm they still run well.

Avoid applying a function to an index column

I need to filter out data that exceeds a certain length but the column that contains the data is an indexed column. If I apply a function to the column I lose the benefit of the index.
I cannot create a new index or alter the column as I am not an admin to the database.
I would prefer not to drop the data after the fact.
I know of a few ways to filter the column but all would use some kind of function.
select
table.name
from
table
where
length(table.name)>12
;
The field table.name is not nullable.
If I apply a function to the column I lose the benefit of the index.
Ah, but what is the benefit of an index?
Consider these two values:
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
Are they both longer than 12 characters? Yes. Are they likely to be adjacent in the index? Of course not. Therefore the only way for Oracle to use an index to find those values is to execute a Fast Full Scan over the index and evaluate the length of each entry. Now Oracle can do that, but is it worthwhile?
Your posted query is selecting just name. In a comment you say name is not nullable. In that case it would be efficient for Oracle to use the index, because there is no need to read the table records: the index has sufficient information to satisfy the query.
However.
In that comment you also say:
the query is not that simple
If your actual query includes other columns in the projection then the database does have to visit the table to get those values. At which point the rule of thumb for indexed reads kicks in: if the result set of the query is greater than 1-2% of all the rows in the table, it's more efficient to do a Full Table Scan than use an index. So the number of records in the table becomes pertinent, and especially the proportion of records where length(name) > 12. If 99% of the records have short names then it is probably still more efficient to Fast Full Scan the index. But if it's only 90%, using the index would probably be deadly to performance.
Likewise, if your actual query applies additional criteria in the WHERE clause it may be more efficient to do a Full Table Scan (because the database needs to read the records to evaluate those filters) or to use a different index, if there is an appropriate one.
So, while the index would be useful for the toy query you posted in your question it may not help with your actual query, and indeed could lead to a sub-optimal access path.
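One way to see which access path Oracle actually picks is to ask for the plan. This is only a sketch: `my_table` is a stand-in for the real table name, and the real query would replace the toy one.

```sql
-- Record the plan for the query without running it.
EXPLAIN PLAN FOR
SELECT t.name
FROM my_table t
WHERE LENGTH(t.name) > 12;

-- Display the recorded plan. Look for INDEX FAST FULL SCAN (index only)
-- versus TABLE ACCESS FULL (full table scan) in the Operation column.
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```

This needs no extra privileges beyond being able to run the query and write to a plan table, so it should be usable without admin rights in most setups.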
is it a case by case situation depending on query complexity?
Yes. The answer is always, it depends. That's why database tuning professionals can charge the fat consultancy fees they do. If you don't provide the whole query the best we can do is point you at this post which explains how to ask performance tuning questions and wish you good luck.
If the column is NOT NULL, then Oracle can answer the query using a full index scan. It will need to read every row in the index in order to find only those rows with the length greater than 12. If the index is smaller than the table this is faster than a full scan.
You are only selecting the indexed column, so Oracle would not need to visit the table but can get the result entirely from the index. If you were to select other columns that were not in that index, Oracle would also need to read the table row after first locating the row in the index.
There is no way around this without adding a more suitable index or otherwise changing the database schema.

Table index design

I would like to add index(s) to my table.
I am looking for general ideas how to add more indexes to a table.
Other than the PK clustered.
I would like to know what to look for when I am doing this.
So, my example:
This table (let's call it the TASK table) is going to be the biggest table of the whole application. Expecting millions of records.
IMPORTANT: massive bulk-insert is adding data in this table
table has 27 columns: (so far, and counting :D )
int x 9 columns = id-s
varchar x 10 columns
bit x 2 columns
datetime x 5 columns
INT COLUMNS
all of these are INT ID-s but from tables that are usually smaller than Task table (10-50 records max), example: Status table (with values like "open", "closed") or Priority table (with values like "important", "not so important", "normal")
there is also a column like "parent-ID" (self - ID)
join: all the "small" tables have PK, the usual way ... clustered
STRING COLUMNS
there is a (Company) column (string!) that is something like "5 characters long all the time" and every user will be restricted using this one. If in Task there are 15 different "Companies" the logged in user would only see one. So there's always a filter on this one. Might be a good idea to add an index to this column?
DATE COLUMNS
I think they don't index these ... right? Or can / should be?
I wouldn't add any indices - unless you have specific reasons to do so, e.g. performance issues.
In order to figure out what kind of indices to add, you need to know:
what kind of queries are being used against your table - what are the WHERE clauses, what kind of ORDER BY are you doing?
how is your data distributed? Which columns are selective enough (< 2% of the data) to be useful for indexing
what kind of (negative) impact do additional indices have on your INSERTs and UPDATEs on the table
any foreign key columns should be part of an index - preferably as the first column of the index - to speed up JOINs to other tables
And sure you can index a DATETIME column - what made you think you cannot?? If you have a lot of queries that will restrict their result set by means of a date range, it can make total sense to index a DATETIME column - maybe not by itself, but in a compound index together with other elements of your table.
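As a sketch of such a compound index (SQL Server syntax, since the question mentions a clustered PK): the table name `dbo.Task`, the columns `Id` and `CreatedDate`, and the literal values are assumptions for illustration. Only `Company` comes from the question.

```sql
-- Company first (every query filters on it), then the date for range filters.
CREATE NONCLUSTERED INDEX IX_Task_Company_CreatedDate
    ON dbo.Task (Company, CreatedDate);

-- A query shaped to use it: equality on the leading column,
-- a half-open range on the second.
SELECT Id, Company, CreatedDate
FROM dbo.Task
WHERE Company = 'ACME1'
  AND CreatedDate >= '20240101'
  AND CreatedDate <  '20240201';
```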
What you cannot use as index keys are columns whose values exceed 900 bytes, so anything like VARCHAR(1000) is a poor candidate.
For great in-depth and very knowledgeable background on indexing, consult the blog by Kimberly Tripp, Queen of Indexing.
In general, an index will speed up a JOIN, a sort operation and a filter.
So if the columns are in the JOIN, the ORDER BY or the WHERE clause then an index will help in terms of performance...but there is always a but...with every index that you add, UPDATE, DELETE and INSERT operations will be slowed down because the indexes have to be maintained.
So the answer is...it depends.
I would say start hitting the table with queries and look at the execution plans for scans, then try to turn those into seeks by either writing SARGable queries or adding indexes if needed...don't just add indexes for the sake of adding indexes.
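For readers unfamiliar with the term, here is a small illustration of a non-SARGable filter and a SARGable rewrite. The table `dbo.Task` and the columns `Id` and `CreatedDate` are hypothetical.

```sql
-- Not SARGable: the function wrapped around the column hides it
-- from the index, so every row has to be evaluated.
SELECT Id
FROM dbo.Task
WHERE YEAR(CreatedDate) = 2024;

-- SARGable rewrite: the bare column is compared to a range,
-- so an index on CreatedDate can be used for a seek.
SELECT Id
FROM dbo.Task
WHERE CreatedDate >= '20240101'
  AND CreatedDate <  '20250101';
```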
Step one is to understand how the data in the table will be used: how will it be inserted, selected, updated, deleted. Without knowing your usage patterns, you're shooting in the dark. (Note also that whatever you come up with now, you may be wrong. Be sure to compare your decisions with actual usage patterns once you're up and running.) Some ideas:
If users will often be looking up individual items in the table, an index on the primary key is critical.
If data will be inserted with great frequency and you have multiple indexes, over time you will have to deal with index fragmentation. Read up on and understand clustered and non-clustered indexes and fragmentation (ALTER INDEX...REBUILD).
But, if performance is key in situations when you need to retrieve a lot of rows, you might consider using your clustered index to support that.
If you often want a set of data based on Status, indexing on that column can be good--particularly if 1% of your rows are "Active" vs. 99% "Not Active", and all you want are the active ones.
Conversely, if your "PriorityId" is only used to get the "label" stating what PriorityId 42 is (i.e. join into the lookup table), you probably don't need an index on it in your main table.
A last idea, if everyone will always retrieve data for only one Company at a time, then (a) you'll definitely want to index on that, and (b) you might want to consider partitioning the table on that value, as it can act as a "built in filter" above and beyond conventional indexing. (This is perhaps a bit extreme and it's only available in Enterprise edition, but it may be worth it in your case.)
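On the fragmentation point above, a minimal sketch of checking and rebuilding in SQL Server could look like this. The table name `dbo.Task` is an assumption; the catalog views and the ALTER INDEX statement are standard.

```sql
-- Report fragmentation per index on the table.
SELECT i.name AS index_name,
       ps.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID(N'dbo.Task'), NULL, NULL, 'LIMITED') AS ps
JOIN sys.indexes AS i
  ON i.object_id = ps.object_id
 AND i.index_id  = ps.index_id;

-- Rebuild every index on the table (heavy on a large table,
-- so typically done in a maintenance window).
ALTER INDEX ALL ON dbo.Task REBUILD;
```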

Do queries make use of more than one index at a time?

If I have a table with an index each on a different column, does the database ever make use of both indexes when executing a query? Additionally, if I have an index on 4 columns, and an additional index on one other column, could a query against all 5 columns make use of this 2nd index, or would it just be a region scan after matching the first index?
If I have a table with an index each on a different column, does the database ever make use of both indexes when executing a query?
If the cost-based query optimizer determines that it's more efficient to use more than one index, yes, it will. If it's more efficient to do a scan (and often it is), then it may not use an index, even if you think it should.
Additionally, if I have an index on 4 columns, and an additional index on one other column, could a query against all 5 columns make use of this 2nd index, or would it just be a region scan after matching the first index?
Again, if the optimizer thinks it's efficient to do so, yes it'll use that other index for the same query. If it determines the cost is higher with the index...it'll ignore it. Whether it uses the index all depends on how selective the index is (or rather, how selective the optimizer thinks it is, based on the latest statistics). If it's not selective (won't narrow down the results much), it'll likely ignore it.
It depends on the optimizer and the query, but optimizers relatively seldom use two separate indexes on a single table in a single query. It is perfectly feasible to construct examples where they could, possibly even should - and some may actually do so. Consider:
A UNION query where the separate terms have filters on different columns (but a table scan may be as effective)
A self-join where the separate sides of the self-join have the different filters.
However, be wary of accusing the optimizer of not being efficient - there may still be advantages to resolving the query by other methods.
To answer your 'index on 4 columns' question: it is rather unlikely. In this scenario, it is likely that the 4-column index provides good selectivity and the query is most easily resolved by applying the extra filter condition to the rows retrieved by the index scan. (Note that the answer might be different depending on whether the extra condition is connected to the others by AND (as I assumed) or OR (where using the second index might be useful).)
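The UNION case mentioned above might look like the following sketch. The table `orders`, its columns and both index names are hypothetical.

```sql
-- Two single-column indexes on the same table.
CREATE INDEX ix_orders_customer ON orders (customer_id);
CREATE INDEX ix_orders_status   ON orders (status);

-- Each branch filters on a different column, so each branch can be
-- resolved through a different index. UNION removes rows matched by both.
SELECT id, customer_id, status
FROM orders
WHERE customer_id = 42
UNION
SELECT id, customer_id, status
FROM orders
WHERE status = 'open';
```

As noted, if either filter matches a large share of the table, a single scan may still be the cheaper plan.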
It depends upon the queries emitted against those tables, the size of the tables and the selectivity of the data in the columns indexed.
The optimizer uses statistics to determine whether using an index will be beneficial.
1. If I have a table with an index each on a different column, does the database ever make use of both indexes when executing a query?
It certainly can, for example if you have the table
EMPLOYEE(
id (index1)
name
address
date (index2) )
and the table
TASKS(
id
employee_id (index3)
date (index 4)
category
description)
If you do the query:
select
TASKS.employee_id, TASKS.date, TASKS.category, TASKS.description
from EMPLOYEE, TASKS where
EMPLOYEE.id = TASKS.employee_id and
EMPLOYEE.date = TASKS.date
This will list all the tasks of each employee on each day and use index1 and index2 along with index4 and index3. It would take much more time if I were lacking either index1 or index2.
2. If I have an index on 4 columns, and an additional index on one other column, could a query against all 5 columns make use of this 2nd index, or would it just be a region scan after matching the first index?
Of course it can be done, but the query should include joins on both the 4 column index and also the single column index.

How do I optimize this query?

I have a very specific query. I tried lots of ways but I couldn't reach the performance I want.
SELECT *
FROM
items
WHERE
user_id=1
AND
(item_start < 20000 AND item_end > 30000)
I created an index on user_id, item_start, item_end.
This didn't work, so I dropped all indexes and created new indexes:
user_id, (item_start, item_end)
This also didn't work.
(user_id, item_start and item_end are int)
edit: database is MySQL 5.1.44, engine is InnoDB
UPDATE: per your comment below, you need all the columns in the query (hence your SELECT *). If that's the case, you have a few options to maximize query performance:
create (or change) your clustered index to be on item_user_id, item_start, item_end. This will ensure that as few rows as possible are examined for each query. Per my original answer below, this approach may speed up this particular query but may slow down others, so you'll need to be careful.
if it's not practical to change your clustered index, you can create a non-clustered index on item_user_id, item_start, item_end and any other columns your query needs. This will slow down inserts somewhat, and will double the storage required for your table, but will speed up this particular query.
There are always other ways to increase performance (e.g. by reducing the size of each row) but the primary way is to decrease the number of rows which must be accessed and to increase the % of rows which are accessed sequentially rather than randomly. The indexing suggestions above do both.
ORIGINAL ANSWER BELOW:
Without knowing the exact schema or query plan, the main performance problem with this query is that SELECT * forces a lookup back to your clustered index for every row. If there are large numbers of matching rows for a particular user ID and if your clustered index's first column is not item_user_id, then this will likely be a very inefficient operation because your disk will be trying to fetch lots of randomly distributed rows from the clustered index.
In other words, even though filtering the rows you want is fast (because of your index), actually fetching the data is slower.
If, however, your clustered index is ordered by item_user_id, item_start, item_end then that should speed things up. Note that this is not a panacea, since if you have other queries which depend on different ordering, or if you're inserting rows in a different order, you could end up slowing down other queries.
A less impactful solution would be to create a covering index which contains only the columns you want (also ordered by item_user_id, item_start, item_end, and then add the other cols you need). Then change your query to only pull back the cols you need, instead of using SELECT *.
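For the MySQL/InnoDB setup described in the question, a covering index along these lines could be tried. The index name is hypothetical and the column names follow the question (user_id rather than item_user_id); a real query would list whatever columns it actually needs, and those would have to be appended to the index.

```sql
-- Covering index: filter columns first, in the order discussed above.
CREATE INDEX ix_items_user_start_end
    ON items (user_id, item_start, item_end);

-- Select only indexed columns so the table rows need not be fetched.
EXPLAIN
SELECT user_id, item_start, item_end
FROM items
WHERE user_id = 1
  AND item_start < 20000
  AND item_end   > 30000;
-- "Using index" in the Extra column indicates the query was
-- answered from the index alone.
```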
If you post more info about the DBMS brand and version, and the schema of your table, we can help with more details.
Do you need to SELECT *?
If not, you can create an index on user_id, item_start, item_end with the fields you need in the SELECT part as included columns. This all assumes you're using Microsoft SQL Server 2005+.
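In SQL Server syntax, that suggestion might be sketched as follows. The included columns `item_name` and `item_price` are made up, since the question does not list the table's other columns. Note that MySQL, which the question's edit says is actually in use, has no INCLUDE clause.

```sql
-- Key columns drive the seek; INCLUDE columns are stored only at the
-- leaf level so the query can be answered without touching the table.
CREATE NONCLUSTERED INDEX IX_items_user_start_end
    ON items (user_id, item_start, item_end)
    INCLUDE (item_name, item_price);  -- hypothetical columns

SELECT user_id, item_start, item_end, item_name, item_price
FROM items
WHERE user_id = 1
  AND item_start < 20000
  AND item_end   > 30000;
```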