Designing SQL indexes with join clauses and ORDER BY - sql

I've been working on building some indexes for a DB2 LUW database. We've implemented some new queries for a landing page and I'm trying to get performance up. I've found a few indexes on some tables that do not appear to be optimal in their ordering, ie, columns with very low selectivity are coming earlier than those with high selectivity. I'm looking to replace them with better versions, but I'm having a bit of confusion on join indexes.
For a bit of background, the queries aren't anything complicated, although they can be a bit large:
SELECT
--About a dozen fields from TABLE A--
--A few fields from joined tables--
FROM
TABLE A
--A few inner join/left joins, mostly on A.ID1 and A.ID2, BIGINT generated keys--
WHERE
A.ONE = :x
AND A.TWO IN (:y)
AND A.THREE IN (--uncorrelated subquery--)
AND A.FOUR IS NULL
AND (A.FIVE BETWEEN :date1 AND :date2
OR
A.SIX = 'STUFF')
ORDER BY A.SEVEN
You get the idea. The cardinality on most of these columns is pretty apparent and it's easy to structure the index in terms of selectivity. Indexing on all of the fields used in the WHERE clause with the proper order has been quite successful in speeding things up. However, the join columns are a bit confusing.
A number of columns have already been indexed by themselves, including A.ID1 and A.ID2, which together form the primary key of the table. I presume that this is a clustered index. There are also some foreign key ID pairs indexed by themselves as well. What I'm wondering is if it is necessary or even useful to include these columns used in the joins within the index that covers the WHERE clause fields. I've heard it said plenty that joined columns should be indexed, WHERE clause columns should be indexed, and they are, but separately. I haven't really been able to find anything definitive (or "usually a good idea, but not always") on the subject. What is the general practice for this sort of thing? Separate indexes or put them all together if the query is important?
In addition, A.SEVEN is a column with unique values, but we're only using it in ORDER BY. Again, I haven't exactly found anything definitive, but does the fact that it's only being used in ORDER BY (well, and in the SELECT statement) affect its placement within the index regardless of cardinality (ie, it is placed at the end of the index as it will not be used for filtering, only sorting, or place it at the beginning due to uniqueness), or should it also be left in a separate index?
And as an afterthought, the column A.FOUR is only checked for null. Would this mean that the cardinality of any non-null data is irrelevant and it should be placed late in the index as we're only looking for null values? A.FOUR is likely to be mostly nulls, but will be largely unique when it isn't null.

In general, database indexes are like book indexes: when you want to find something, you start looking from the left-hand side of the search term, not in the middle. So if you have a compound index on (last name, first name), for example, you can expect it to work for searches on last name only, but not on first name only. If you want to search or join on first name only, you would need to index first name separately.
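The left-to-right rule can be checked by asking a database for its query plan. The sketch below uses Python's built-in sqlite3 module as a stand-in (an assumption made so the example runs anywhere; DB2 and other engines report plans differently and may apply extra tricks). The people table and the ix_people_name index are hypothetical names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people ("
    "id INTEGER PRIMARY KEY, last_name TEXT, first_name TEXT, phone TEXT)"
)
# Compound index: sorted by last_name first, then first_name.
conn.execute("CREATE INDEX ix_people_name ON people (last_name, first_name)")


def plan(sql):
    """Return SQLite's query plan steps joined into one readable string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)


# Filtering on the leading column can seek into the index.
by_last = plan("SELECT * FROM people WHERE last_name = 'Smith'")
# Filtering only on the second column cannot seek; the table is scanned.
by_first = plan("SELECT * FROM people WHERE first_name = 'John'")

print(by_last)
print(by_first)
```

In SQLite's plan text, a SEARCH step means the engine seeks into an index, while a SCAN step means it reads every row.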
See Also
https://stackoverflow.com/a/2228233

Related

How to speed up mariadb join tables

I have 2 tables from which I'm joining certain columns. They are joined on a VARCHAR column (indexed in both tables). Table A has a bit over 800.000 records and Table B has 20.000 records.
Table A has an auto_inc primary key. Table B does not have a primary key, only the index on the mentioned VARCHAR column.
The query takes about 48 seconds which is too slow. What can I do to increase the speed? Would it help to create a primary key auto_incr in table B? Even if this is not the column on which the join takes place?
Beginning user in SQL. Both tables are InnoDB and I use Mariadb.
QUERY:
select distinct
`pr`.`ProductIdentifier` AS `ProductIdentifier`,
`pr`.`Datum` AS `Datum`,
`pr`.`Retailer` AS `Retailer`,
`pr`.`Prijs` AS `Prijs`,
`pm`.`Merk` AS `Merk`,
`pm`.`Product` AS `Product`,
`pm`.`Formaat` AS `Formaat`
from
(`prices`.`prices_table` `pr`
join `prices`.`product_match_table` `pm`
on(`pr`.`ProductIdentifier` = `pm`.`ProductIdentifier`))
EXPLAIN SELECT: [image: Explain table]
This answer is based on my knowledge of indexing in general; MariaDB may have some more specialised options I am not aware of.
However, indexes broadly speed up queries in two ways:
By only having the columns needed, meaning less data to read and process
By being sorted in an appropriate manner to help processing
For the first, you typically need a covering index.
For the second, this includes:
Being sorted the same way (e.g., indexed on the same fields) as tables it is being JOINed to in the query
Being sorted so that WHERE clauses and other types of filtering can directly use the sort to go to the appropriate spot in the index/table
In practice, often the best improvement in performance is that last one - however you do not have WHERE clauses in your code there. If (as is typical) the users filter the results (e.g., only show me results where ProductName = 'Handbag') then you may need to adjust the indexes for those (more on that a bit later though).
Covering indexes for query above
I think with the current query (and no filtering etc) the fastest you can get is with two indexes
CREATE INDEX `IX_prices_ProductIdentifier` ON `prices`.`prices_table`
(`ProductIdentifier`,
`Datum`,
`Retailer`,
`Prijs`);
CREATE INDEX `IX_productmatch_ProductIdentifier` ON `prices`.`product_match_table`
(`ProductIdentifier`,
`Merk`,
`Product`,
`Formaat`);
These provide covering indexes on the query shown, and are both sorted the same (by productIdentifier) to make the join easier.
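A rough way to check whether an index is covering is to look at the query plan. The sketch below uses Python's built-in sqlite3 module rather than MariaDB (an assumption made purely so the example is runnable; plan output and optimizer choices differ between engines). The schema prefix is dropped, and the id and Notes columns are hypothetical, added so that the indexes are narrower than the tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE prices_table (
    id INTEGER PRIMARY KEY, ProductIdentifier TEXT, Datum TEXT,
    Retailer TEXT, Prijs REAL, Notes TEXT);
CREATE TABLE product_match_table (
    ProductIdentifier TEXT, Merk TEXT, Product TEXT, Formaat TEXT, Notes TEXT);
CREATE INDEX IX_prices_ProductIdentifier
    ON prices_table (ProductIdentifier, Datum, Retailer, Prijs);
CREATE INDEX IX_productmatch_ProductIdentifier
    ON product_match_table (ProductIdentifier, Merk, Product, Formaat);
""")

query = """
SELECT DISTINCT pr.ProductIdentifier, pr.Datum, pr.Retailer, pr.Prijs,
       pm.Merk, pm.Product, pm.Formaat
FROM prices_table pr
JOIN product_match_table pm ON pr.ProductIdentifier = pm.ProductIdentifier
"""

# Each plan row ends with a human-readable description of one step.
steps = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query)]
for step in steps:
    print(step)

# Steps served entirely from an index are labelled COVERING INDEX by SQLite.
covering_steps = [s for s in steps if "COVERING INDEX" in s]
```

In MariaDB the comparable check would be running EXPLAIN on the query and looking for "Using index" in the Extra column.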
Searching/filtering (not specified in initial example)
However, if people often search by a specific field first, then it makes sense to re-order the fields in the relevant table (so the searched field is first), or have multiple indexes with the search field at the front.
For example, people may be able to search for specific values in pr.Retailer, pm.Merk, or pm.Product. You may therefore add these additional indexes
CREATE INDEX `IX_prices_Retailer` ON `prices`.`prices_table`
(`Retailer`,
`ProductIdentifier`,
`Datum`,
`Prijs`);
CREATE INDEX `IX_productmatch_Merk` ON `prices`.`product_match_table`
(`Merk`,
`ProductIdentifier`,
`Product`,
`Formaat`);
CREATE INDEX `IX_productmatch_Product` ON `prices`.`product_match_table`
(`Product`,
`ProductIdentifier`,
`Merk`,
`Formaat`);
Notice with the above that the field orders matter. The data (index) is sorted by the first field, then the second field, then the third field etc. To use the index effectively, your filtering/WHERE clause needs to include at least the first field, if not more.
An alternative to these indexes (the ones for filtering) is to keep the original two indexes as above, but then put a separate index on each of the fields users can search on, e.g., if the users can filter on the retailer, merk and product, then create
one index on pr.Retailer
one on pm.Merk, and
one on pm.Product
Caveats
Adding indexes makes inserts into the relevant table (and often deletes/updates) slower than if the indexes weren't there. The reason is that the database doesn't just need to update the data in the table; it also needs to update the index(es).
Typically this is not much of a problem unless you are adding and deleting lots of data from the tables frequently. But it is worth checking your 'product maintenance' interface (e.g., adding products, updating prices etc) after adding indexes to confirm they still run well.

How to set the right indexes on a sql table?

How can I identify the indexes that are worth setting on a SQL table?
Take the following as an example:
select *
from products
where name = 'car'
and type = 'vehicle'
and availability > 3
and insertion_date > '2015-10-10'
order by price asc
limit 1
Imagine a database with a few million entries.
Would there be benefits if I set an index on the combination of all attributes that occur in the WHERE and ORDER BY clause?
For the example:
create index i_my_idx on products
(name, type, availability, insertion_date, price)
There are a few rules of thumb that can be useful when deciding which columns to index:
1. Make sure there's a unique index on the primary key - this is done automatically when you specify a PK in most RDBMSs including postgresql.
2. Add indexes for each foreign key. These are created automatically in some RDBMSs when you specify a FK but not in postgresql.
3. If a PK is a compound key, consider adding indexes on each FK making up the PK (except for the first, which is covered by the PK index). As in 2, some RDBMSs (e.g. MySQL with InnoDB) add these indexes automatically when the FKs are specified.
Usually, but not always, table joins in queries will be PK to FK, and by having indexes on both keys, the query optimizer of the RDBMS has flexibility in determining the optimum plan for maximum performance. This won't always be the best though, and experienced programmers will often format the SQL for a database query to influence the execution plan for best performance, or decide to omit indexes they know are not needed. It's worth noting that an SQL query that is optimal on one RDBMS is not necessarily optimal on another, or on future versions of the DB server, or as the database grows. The latter is important as in some RDBMSs such as postgres and Oracle, the query execution plans are dependent on the data in the tables (this is known as cost-based optimisation).
Once you've got these out of the way a lot comes down to experience and a knowledge of your data, and importantly, how the data is going to be accessed.
Generally you will be looking to index those columns which are best at filtering the data. In your query above, the obvious one is name. This might be enough to make that query run fast enough (unless all your products are cars).
Other than that it's worth making a list of the common ways the data is likely to be accessed e.g.
Get a list of products that are in a category - an index on category will probably help
However, get a list of products that are currently available - an index on availability will probably not help because a large proportion of products are likely to satisfy this condition.
Unless you are dealing with large amounts of data this can often be all you need to do, and it's not generally a good idea to add indexes "just in case" as there are overheads in maintaining them. But if your system does have performance issues, then it's worth considering how combinations of columns are being used in queries, reading up about the postgres query optimizer etc.
And to answer your last question - possibly, but it's far from the first thing to consider.
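Whether the trailing price column in such an index helps the ORDER BY can be checked from the query plan. The sketch below is an illustration only: it uses Python's built-in sqlite3 module instead of PostgreSQL (an assumption made so it runs without a server), and the alternative index on (name, type, price) is a hypothetical variant that is not part of the question.

```python
import sqlite3

QUERY = """
SELECT * FROM products
WHERE name = 'car' AND type = 'vehicle'
  AND availability > 3 AND insertion_date > '2015-10-10'
ORDER BY price ASC LIMIT 1
"""


def plan_with_index(index_columns):
    """Build a fresh products table with a single index; return plan steps."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, type TEXT, "
        "availability INTEGER, insertion_date TEXT, price REAL)"
    )
    conn.execute(f"CREATE INDEX i_my_idx ON products ({index_columns})")
    steps = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + QUERY)]
    conn.close()
    return steps


# Index from the question: the range columns sit in front of price, so the
# matching rows are not in price order and a separate sort step is needed.
wide = plan_with_index("name, type, availability, insertion_date, price")

# Hypothetical variant: equality columns first, then the sort column.
sorted_first = plan_with_index("name, type, price")

print(wide)
print(sorted_first)
```

With the variant index the rows arrive already ordered by price, but the availability and insertion_date conditions must then be checked row by row, so which index wins depends on the data.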
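Well, the way you are setting up the index is reasonable. In this index, though, the trailing price column does not help the ORDER BY clause, because the range conditions on availability and insertion_date come before it.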
Some important points when designing a SQL query:
Always put the condition that filters out the most rows first in the WHERE clause, e.g. in the query above, name = 'car' will filter out the most records in products.
Do not use ">="; use ">" only, because "greater than or equal to" always checks "greater" first and, if that fails, "equals" as well, which reduces query performance.
Create a single index with its columns in the same order as the conditions in your WHERE clause.
Try to minimize use of the IN clause; use ANY instead.
Thanks
Anant

Performance of SQL query with condition vs. without where clause

Which SQL-query will be executed with less time — query with WHERE-clause or without, when:
WHERE-clause deals with indexed field (e.g. primary key field)
WHERE-clause deals with non-indexed field
I suppose when we're working with indexed fields, thus query with WHERE will be faster. Am I right?
As has been mentioned there is no fixed answer to this. It all depends on the particular context. But just for the sake of an answer. Take this simple query:
SELECT first_name FROM people WHERE last_name = 'Smith';
To process this query without an index, the last_name column must be checked for every row in the table (a full table scan).
With an index, you could just follow a B-tree data structure until 'Smith' was found.
Without an index the worst case is linear, O(n), whereas with a B-tree it is O(log n), hence computationally less expensive.
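The difference between linear and logarithmic lookup cost can be made concrete by counting comparisons. The sketch below uses binary search over a sorted Python list as a simplified stand-in for a B-tree descent (an assumption: a real B-tree has a much higher branching factor and needs even fewer steps); the function names are hypothetical.

```python
def linear_steps(keys, target):
    """Count comparisons made by a full scan that stops at the first match."""
    steps = 0
    for key in keys:
        steps += 1
        if key == target:
            break
    return steps


def binary_steps(sorted_keys, target):
    """Count loop iterations of a leftmost binary search over sorted keys."""
    lo, hi = 0, len(sorted_keys)
    steps = 0
    while lo < hi:
        steps += 1
        mid = (lo + hi) // 2
        if sorted_keys[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    return steps


keys = list(range(1_000_000))
target = keys[-1]  # worst case for the scan: the match is the last row

scan_cost = linear_steps(keys, target)    # one comparison per row
index_cost = binary_steps(keys, target)   # roughly log2(n) comparisons

print(scan_cost, index_cost)
```

For a million keys the scan touches every element, while the sorted lookup needs at most about twenty steps.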
Not sure what you mean by 'query with WHERE-clause or without', but you're correct that most of the time a query with a WHERE clause on an indexed field will outperform a query whose WHERE clause is on a non-indexed field.
One instance where the performance can be the same (i.e. indexing doesn't matter) is when a range-based condition in your WHERE clause (e.g. WHERE col1 > x) matches most of the table. The optimizer will then typically scan the table, which is the same speed as a range query on a non-indexed column.
Really, it depends on the columns you reference in the where clause, the types of data in the columns, the types of queries you're running, etc.
It may depend on the type of where clause you are writing. In a simple where clause, it is generally better to have an index on the field you are using (and indexes can and should be built on more than the PK). However, you have to write a sargable where clause for the index to make any difference. See this question for some guidelines on sargability:
What makes a SQL statement sargable?
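As a small illustration of sargability, the sketch below compares the plan for a plain comparison on an indexed column with the plan for the same column wrapped in a function. It uses Python's built-in sqlite3 module as a stand-in engine (an assumption; other databases show the same effect with different plan output), and the people table and ix_people_last_name index are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT)"
)
conn.execute("CREATE INDEX ix_people_last_name ON people (last_name)")


def plan(sql):
    """Return SQLite's query plan steps joined into one readable string."""
    return " | ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))


# Sargable: the bare column is compared, so the index can be searched.
sargable = plan("SELECT first_name FROM people WHERE last_name = 'Smith'")

# Not sargable: the column is wrapped in a function, so the index on the raw
# values cannot be searched and every row has to be evaluated.
non_sargable = plan(
    "SELECT first_name FROM people WHERE upper(last_name) = 'SMITH'"
)

print(sargable)
print(non_sargable)
```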
There are cases where a where clause on the primary key will be slower.
The simplest is a table with one row. Using the index requires loading both the index and the data page -- two reads. No index cuts the work in half.
That is a degenerate case, but it points to the issue -- the proportion of the rows selected. Or, more accurately, the proportion of pages needed to resolve the query.
When the desired data is on all pages, then using an index slows things down. For a non-primary key, this can be disastrous when the table is bigger than the page cache and the accesses are random.
Since pages are ordered by a primary key, the worst case is an additional index scan -- not too bad.
Some databases use statistics on tables to decide when to use an index and when to do a full table scan. Some don't.
In short, for low selectivity queries, an index will improve performance. For high selectivity queries, using an index can result in marginally worse performance or dire performance, depending on various factors.
Some of my queries are quite complex, and applying a WHERE clause was degrading the performance. As a workaround, I loaded the data into temp tables and then applied the WHERE clause to them. This significantly improved the performance. It also helped where I had joins, especially LEFT OUTER JOINs.

Table index design

I would like to add index(s) to my table.
I am looking for general ideas how to add more indexes to a table.
Other than the PK clustered.
I would like to know what to look for when I am doing this.
So, my example:
This table (let's call it TASK table) is going to be the biggest table of the whole application. Expecting millions of records.
IMPORTANT: massive bulk-insert is adding data in this table
table has 27 columns: (so far, and counting :D )
int x 9 columns = id-s
varchar x 10 columns
bit x 2 columns
datetime x 5 columns
INT COLUMNS
all of these are INT ID-s but from tables that are usually smaller than Task table (10-50 records max), example: Status table (with values like "open", "closed") or Priority table (with values like "important", "not so important", "normal")
there is also a column like "parent-ID" (self - ID)
join: all the "small" tables have PK, the usual way ... clustered
STRING COLUMNS
there is a (Company) column (string!) that is something like "5 characters long all the time" and every user will be restricted using this one. If in Task there are 15 different "Companies" the logged in user would only see one. So there's always a filter on this one. Might be a good idea to add an index to this column?
DATE COLUMNS
I think they don't index these ... right? Or can / should be?
I wouldn't add any indices - unless you have specific reasons to do so, e.g. performance issues.
In order to figure out what kind of indices to add, you need to know:
what kind of queries are being used against your table - what are the WHERE clauses, what kind of ORDER BY are you doing?
how is your data distributed? Which columns are selective enough (< 2% of the data) to be useful for indexing
what kind of (negative) impact do additional indices have on your INSERTs and UPDATEs on the table
any foreign key columns should be part of an index - preferably as the first column of the index - to speed up JOINs to other tables
And sure you can index a DATETIME column - what made you think you cannot?? If you have a lot of queries that will restrict their result set by means of a date range, it can make total sense to index a DATETIME column - maybe not by itself, but in a compound index together with other elements of your table.
What you cannot index are columns that hold more than 900 bytes of data - anything like VARCHAR(1000) or such.
For great in-depth and very knowledgeable background on indexing, consult the blog by Kimberly Tripp, Queen of Indexing.
in general an index will speed up a JOIN, a sort operation and a filter
So if the columns are in the JOIN, the ORDER BY or the WHERE clause then an index will help in terms of performance...but there is always a but...with every index that you add, UPDATE, DELETE and INSERT operations will be slowed down because the indexes have to be maintained
so the answer is...it depends
I would say start hitting the table with queries and look at the execution plans for scans, try to make those seeks by either writing SARGable queries or adding indexes if needed...don't just add indexes for the sake of adding indexes
Step one is to understand how the data in the table will be used: how will it be inserted, selected, updated, deleted. Without knowing your usage patterns, you're shooting in the dark. (Note also that whatever you come up with now, you may be wrong. Be sure to compare your decisions with actual usage patterns once you're up and running.) Some ideas:
If users will often be looking up individual items in the table, an index on the primary key is critical.
If data will be inserted with great frequency and you have multiple indexes, over time you will have to deal with index fragmentation. Read up on and understand clustered and non-clustered indexes and fragmentation (ALTER INDEX...REBUILD).
But, if performance is key in situations when you need to retrieve a lot of rows, you might consider using your clustered index to support that.
If you often want a set of data based on Status, indexing on that column can be good--particularly if 1% of your rows are "Active" vs. 99% "Not Active", and all you want are the active ones.
Conversely, if your "PriorityId" is only used to get the "label" stating what PriorityId 42 is (i.e. join into the lookup table), you probably don't need an index on it in your main table.
A last idea, if everyone will always retrieve data for only one Company at a time, then (a) you'll definitely want to index on that, and (b) you might want to consider partitioning the table on that value, as it can act as a "built in filter" above and beyond conventional indexing. (This is perhaps a bit extreme and it's only available in Enterprise edition, but it may be worth it in your case.)

Two single-column indexes vs one two-column index in MySQL?

I'm faced with the following and I'm not sure what's best practice.
Consider the following table (which will get large):
id PK | giver_id FK | recipient_id FK | date
I'm using InnoDB and from what I understand, it creates indices automatically for the two foreign key columns. However, I'll also be doing lots of queries where I need to match a particular combination of:
SELECT...WHERE giver_id = x AND recipient_id = t.
Each such combination will be unique in the table.
Is there any benefit from adding a two-column index over these columns, or would the two individual indexes in theory be sufficient / the same?
If you have two single column indexes, only one of them will be used in your example.
If you have an index with two columns, the query might be faster (you should measure). A two column index can also be used as a single column index, but only for the column listed first.
Sometimes it can be useful to have an index on (A,B) and another index on (B). This makes queries using either or both of the columns fast, but of course uses also more disk space.
When choosing the indexes, you also need to consider the effect on inserting, deleting and updating. More indexes = slower updates.
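The (A,B) plus (B) arrangement can be tried out by inspecting query plans. The sketch below uses Python's built-in sqlite3 module instead of MySQL (an assumption made so it runs without a server; MySQL's EXPLAIN output and index choices will differ). The gifts table name and both index names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE gifts (
    id INTEGER PRIMARY KEY, giver_id INTEGER, recipient_id INTEGER, date TEXT);
CREATE INDEX ix_pair ON gifts (giver_id, recipient_id);  -- index on (A, B)
CREATE INDEX ix_recipient ON gifts (recipient_id);       -- index on (B)
""")


def plan(sql):
    """Return SQLite's query plan steps joined into one readable string."""
    return " | ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))


# Both columns: an index search is possible (typically via the two-column index).
both = plan("SELECT * FROM gifts WHERE giver_id = 1 AND recipient_id = 2")
# Leading column only: the two-column index is still usable.
giver_only = plan("SELECT * FROM gifts WHERE giver_id = 1")
# Second column only: needs the separate single-column index.
recipient_only = plan("SELECT * FROM gifts WHERE recipient_id = 2")

for text in (both, giver_only, recipient_only):
    print(text)
```

Dropping ix_recipient in this sketch and re-running the last query is a quick way to see the leftmost-column limitation in action.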
A composite index like:
ALTER TABLE your_table ADD INDEX (giver_id, recipient_id);
...would mean that the index could be used if a query referred to giver_id, or to a combination of giver_id and recipient_id. Mind that index matching is leftmost-based: a query referring only to recipient_id would not be able to use the composite index in the statement I provided.
Please note that some older MySQL versions can only use one index per table in a SELECT, so a composite index would be the best means of optimizing your queries.
If one of the foreign key indexes is already very selective, then the database engine should use that one for the query you specified. Most database engines use some kind of heuristic to be able to choose the optimal index in that situation. If neither index is highly selective by itself, it probably does make sense to add the index built on both keys since you say you will use that type of query a lot.
Another thing to consider is if you can eliminate the PK field in this table and define the primary key index on the giver_id and recipient_id fields. You said that the combination is unique, so that would possibly work (given a lot of other conditions that only you can answer). Typically, though, I think the added complexity that adds is not worth the hassle.
Another thing to consider is that the performance characteristics of both approaches will depend on the size and cardinality of the dataset. You may find that the 2-column index only becomes noticeably more performant at a certain dataset size threshold, or the exact opposite. Nothing can substitute for performance metrics for your exact scenario.