How to speed up mariadb join tables

How to speed up mariadb join tables - sql

I have 2 tables from which I'm joining certain columns. They are joined on a VARCHAR column (indexed in both tables). Table A has a bit over 800.000 records and Table B has 20.000 records.
Table A has an auto_inc primary key. Table B does not have a primary key, only the index on the mentioned VARCHAR column.
The query takes about 48 seconds which is too slow. What can I do to increase the speed? Would it help to create a primary key auto_incr in table B? Even if this is not the column on which the join takes place?
Beginning user in SQL. Both tables are InnoDB and I use Mariadb.
QUERY:
select distinct
`pr`.`ProductIdentifier` AS `ProductIdentifier`,
`pr`.`Datum` AS `Datum`,
`pr`.`Retailer` AS `Retailer`,
`pr`.`Prijs` AS `Prijs`,
`pm`.`Merk` AS `Merk`,
`pm`.`Product` AS `Product`,
`pm`.`Formaat` AS `Formaat`
from
(`prices`.`prices_table` `pr`
join `prices`.`product_match_table` `pm`
on(`pr`.`ProductIdentifier` = `pm`.`ProductIdentifier`))
EXPLAIN SELECT:
Explain table

This answer is based on my knowledge of indexing in general; MariaDB may have some more specialised options I am not aware of.
However, indexes broadly speed up queries in two ways
By only having the columns needed, meaning less data to read and process
By being sorted in an appropriate manner to help processing
For the first, you typically need a covering index.
For the second, this includes
Being sorted the same way (e.g., indexed on the same fields) as tables it is being JOINed to in the query
Being sorted so that WHERE clauses and other types of filtering can directly use the sort to go to the appropriate spot in the index/table
In practice, often the best improvement in performance is that last one - however you do not have WHERE clauses in your code there. If (as is typical) the users filter the results (e.g., only show me results where ProductName = 'Handbag') then you may need to adjust the indexes for those (more on that a bit later though).
Covering indexes for query above
I think with the current query (and no filtering etc) the fastest you can get is with two indexes
CREATE INDEX `IX_prices_ProductIdentifier` ON `prices`.`prices_table`
(`ProductIdentifier`,
`Datum`,
`Retailer`,
`Prijs`);
CREATE INDEX `IX_productmatch_ProductIdentifier` ON `prices`.`product_match_table`
(`ProductIdentifier`,
`Merk`,
`Product`,
`Formaat`);
These provide covering indexes on the query shown, and are both sorted the same (by productIdentifier) to make the join easier.
Searching/filtering (not specified in initial example)
However, if people often search by a specific field first, then it makes sense to re-order the fields in the relevant table (so the searched field is first), or have multiple indexes with the search field at the front.
For example, people may be able to search for specific values in pr.Retailer, pm.Merk, or pm.Product. You may therefore add these additional indexes
CREATE INDEX `IX_prices_Retailer` ON `prices`.`prices_table`
(`Retailer`,
`ProductIdentifier`,
`Datum`,
`Prijs`);
CREATE INDEX `IX_productmatch_Merk` ON `prices`.`product_match_table`
(`Merk`,
`ProductIdentifier`,
`Product`,
`Formaat`);
CREATE INDEX `IX_productmatch_Product` ON `prices`.`product_match_table`
(`Product`,
`ProductIdentifier`,
`Merk`,
`Formaat`);
Notice with the above that the field orders matter. The data (index) is sorted by the first field, then the second field, then the third field etc. To use the index effectively, your filtering/WHERE clause needs to include at least the first field, if not more.
An alternate to these indexes (the ones for filtering) is to have the original two indexes as above, but then put a separate index onto each of the fields they can search on e.g., if the users can filter on the retailer, merk and product, then create
one index on pr.Retailer
one on pm.Merk, and
one on pm.Product
Caveats
Adding indexes makes data inserts onto the relevant table (and often deletes/updates), slower than if the indexes weren't there. The reason is that it doesn't just need to update the data in the table, but it also needs to update the index(es).
Typically this is not much of a problem unless you are adding and deleting lots of data from the tables frequently. But it is worth checking your 'product maintenance' interface (e.g., adding products, updating prices etc) after adding indexes to confirm they still run well.

Related

Simple Inner join suggesting an Include index

I have this simple inner join query and its execution plan master table has around 34K records and detail table has around 51K records. But this simple query is suggesting to add an index with include (containing all master columns that I included in the select). I wasn't expecting this what could be the reason and remedy.
DECLARE
#StartDrInvDate Date ='2017-06-01',
#EndDrInvDate Date='2017-08-31'
SELECT
Mastertbl.DrInvoiceID,
Mastertbl.DrInvoiceNo,
Mastertbl.DistributorInvNo,
PreparedBy,
detailtbl.BatchNo, detailtbl.Discount,
detailtbl.TradePrice, detailtbl.IssuedUnits,
detailtbl.FreeUnits
FROM
scmDrInvoices Mastertbl
INNER JOIN
scmDrInvoiceDetails detailtbl ON Mastertbl.DrInvoiceID = detailtbl.DrInvoiceID
WHERE
(Mastertbl.DrInvDate BETWEEN #StartDrInvDate AND #EndDrInvDate)
My real curiosity is why it is suggesting this index - I normally not see this behavior with larger tables

For this query:
SELECT m.DrInvoiceID, m.DrInvoiceNo, m.DistributorInvNo,
PreparedBy,
d.BatchNo, d.Discount, d.TradePrice, d.IssuedUnits, d.FreeUnits
FROM scmDrInvoices m INNER JOIN
scmDrInvoiceDetails d
ON m.DrInvoiceID = d.DrInvoiceID
WHERE m.DrInvDate BETWEEN #StartDrInvDate AND #EndDrInvDate;
I would expect the basic indexes to be: scmDrInvoices(DrInvDate, DrInvoiceID) and scmDrInvoiceDetails(DrInvoiceID). This index would allow the query engine to quickly identify the rows that match the WHERE in the master table and then look up the corresponding values in scmDrInvoiceDetails.
The rest of the columns could then be included in either index so the indexes would cover the query. "Cover" means that all the columns are in the index, so the query plan does not need to refer to the original data pages.
The above strategy is what SQL Server is suggesting.

You can perhaps see the logic of why it's suggesting to index the invoice date; it's done some calculation on the number of rows you want out of the number of rows it thinks there are currently, and it appears that the selectivity of an index on that column makes it worth indexing. If you want 3 rows out of 55,000, and you want it every 5 minutes forever, it makes sense to index. Especially if the growth rate of that table means that next year it'll be 3 rows out of 5.5 million.
The include recommendation is perhaps more naively recommending associating sufficient additional data with the indexed values such that the entire dataset demanded from the master table can be answered from the index, without hitting the table - indexes are essentially pointers to rows in a table; when the query engine has used the index to locate all the rows it will need, it then still needs to bash the table to actually get the data you want. By including data in an index you remove the need to go to the table and it's sensible sometimes, but not others (creating many indexes that essentially replicate most/all of a table data for seldom run queries is a waste of disk space).
Consider too, that the frequency with which you're running this query now, in a debug tool, is affecting SQLServer's opinion of how often the query is used. I routinely find my SQLAzure portal making index recommendations thanks to the devs running a query over and over, debugging it, when I actually know that in prod, that query will be used once a month, so I discard the recommendation to make an index that includes most the table, when the straight "index only the columns searched" will do fine, no include necessary
These recommendations thus shouldn't be blindly heeded as SQLServer cannot know what you intend to use this, or similar queries for in the real world applications. Index creation and maintenance should be done carefully and thoughtfully; for example it may be that this query is asking for this index, another query would want an index on a different column but it might make sense to create an index that keys on both columns (in a particular order) and then in whichever query searches on the column that is indexed second, include a predicate that hits the first indexed column regardless of whether the query needs it
Example, in your invoices table you have a column indicating whether it's paid or not, and somewhere else in your app you have another query that counts the number of unpaid invoices. You can either have 2 indexes - one on invoice date (for this query) and one on status (for that query) or one on both columns (status, date) and in this query have predicates of WHERE status = 'unpaid' AND date between... even though the status predicate is redundant. Why might it be redundant? Suppose you know you'll only ever be choosing invoices from last week that have not been sent out yet, so can only ever be unpaid.. This is what I mean by "be thoughtful about indexing" - you know lots about your app that SQLServer can never figure out.. By including the redundant status column in the "get invoices from last week" query (even though status is logically redundant) you allow the query engine to use an index that is ordered first by status, then by date. This means you can get away with having to only maintain one index, and it can be used by two queries
Index maintenance and logic of creation can be a full time job.. ;)

Designing SQL indexes with join clauses and ORDER BY

I've been working on building some indexes for a DB2 LUW database. We've implemented some new queries for a landing page and I'm trying to get performance up. I've found a few indexes on some tables that do not appear to be optimal in their ordering, ie, columns with very low selectivity are coming earlier than those with high selectivity. I'm looking to replace them with better versions, but I'm having a bit of confusion on join indexes.
For a bit of background, the queries aren't anything complicated, although they can be a bit large:
SELECT
--About a dozen fields from TABLE A--
--A few fields from joined tables--
FROM
TABLE A
--A few inner join/left joins, mostly on A.ID1 and A.ID2, BIGINT generated keys--
WHERE
A.ONE = :x
AND A.TWO IN (:y)
AND A.THREE IN (--uncorrelated suquery--)
AND A.FOUR IS NULL
AND (A.FIVE BETWEEN :date1 AND :date2
OR
A.SIX = 'STUFF')
ORDER BY A.SEVEN
You get the idea. The cardinality on most of these columns is pretty apparent and it's easy to structure the index in terms of selectivity. Indexing on all of the fields used in the WHERE clause with the proper order has been quite successful in speeding things up. However, the join columns are a bit confusing.
A number of columns have already been indexed by themselves, including A.ID1 and A.ID2, which together form the primary key of the table. I presume that this is a clustered index. There are also some foreign key ID pairs indexed by themselves as well. What I'm wondering is if it is necessary or even useful to include these columns used in the joins within the index that covers the WHERE clause fields. I've heard it said plenty that joined columns should be indexed, WHERE clause columns should be indexed, and they are, but separately. I haven't really been able to find anything definitive (or "usually a good idea, but not always") on the subject. What is the general practice for this sort of thing? Separate indexes or put them all together if the query is important?
In addition, A.SEVEN is a column with unique values, but we're only using it in ORDER BY. Again, I haven't exactly found anything definitive, but does the fact that it's only being used in ORDER BY (well, and in the SELECT statement) affect its placement within the index regardless of cardinality (ie, it is placed at the end of the index as it will not be used for filtering, only sorting, or place it at the beginning due to uniqueness), or should it also be left in a separate index?
And as an afterthought, the column A.FOUR is only checked for null. Would this mean that the cardinality of any non-null data is irrelevant and it should be placed late in the index as we're only looking for null values? A.FOUR is likely to be mostly nulls, but will be largely unique when it isn't null.

In general, database indexes are like book indexes: when you want to find something, you start looking from the left-hand side of the search term, not in the middle. So if you have a compound Index (last name, first name), for example, you can expect the compound index to work properly on last name only, but not first name only. If you want to join first name only, you would need to index first name separately.
See Also
https://stackoverflow.com/a/2228233

OR many columns efficiently

Let's say I have a table with 50 columns. I want to do something like:
SELECT * FROM table WHERE column1=value1 OR column2=value2 OR ...
How can I do this efficiently?
I could add a bunch of indexes or an index across many/all columns. Would this help?
I could create a secondary table with columns (id, field_name, field_value) and then index on each column and now my ORs apply to just 2 columns which are indexed.
What else can I do?
For a bit more info:
Rows are added pretty frequently, but rarely edited after that.
Rows are selected several times after being added (maybe dozens but probably not hundreds).
Table has 100,000+ rows and table scan is too slow.
Specifically, given a row, I'll want to look up all rows that match that row on ANY column.

When you run into scenarios like this, it typically indicates that there may be room for normalizing the table (your scenario B). It would be hard to know without further information on what data your columns actually hold and what your overall table access pattern is.
That being said, without any sort of table structure change, you would just need to have an index on each column that you might want to query against, so as to prevent a full table scan.

Setting aside a discussion of changing your database design...
A combined index (an index on many or most of the columns referenced in your query) isn't going to be of help for your query, which has a bunch of OR'd colN = 'foo' predicates. MySQL is not going use that index to satisfy your query. Even if it were to use the index, there would still be other columns in the underlying table that need to be checked on essentially every row, so MySQL is very likely just to visit all the data pages and not use an index at all. (If you happen to have a GROUP BY or ORDER BY in your query, MySQL might be able to use the index to optimize those operations, especially if it was a "covering" index that included EVERY column referenced by your query.
On the other hand, IF you had a separate, individual index on EVERY column (as a leading column in the index) that was checked with an OR colN = 'foo' OR colN = 'bar', it is possible that MySQL would consider using an "index merge" plan for your query.
But it would have to be an index on EVERY column. If your query is checking even just ONE column that is not a leading column in ANY index, then MySQL would have no choice but to examine every row in the table. So having separate indexes on "many" columns will not help your query, because it's very likely that NONE of the indexes will be used.
Even if you did have a separate index for every single one of the boatload of columns being reference, it's likely that MySQL's estimate of the total number of rows being returned (combined from each index) is too large, and MySQL is likely to decide that an "index merge" is too expensive, and opt for a full table scan instead.
In summary, your only two choices for indexes to help your query (and neither of them is a really good choice) would be:
1) a "covering index" that has leading columns that can be used to satisfy a GROUP BY or an ORDER BY clause (avoiding a "Using filesort" operation"
2) separate, individual indexes on EVERY column (as a leading column) that is checked by an OR colN = 'literal' predicate in your query
But again, neither of those is likely to be a good choice.

Issue with the big tables ( no primary key available)

Tabe1 has around 10 Lack records (1 Million) and does not contain any primary key. Retrieving the data by using SELECT command ( With a specific WHERE condition) is taking large amount of time. Can we reduce the time of retrieval by adding a primary key to the table or do we need to follow any other ways to do the same. Kindly help me.

A primary key does not have a direct affect on performance. But indirectly, it does. This is because when you add a primary key to a table, SQL Server creates a unique index (clustered by default) that is used to enforce entity integrity. But you can create your own unique indexes on a table. So, strictly speaking, a primary index does not affect performance, but the index used by the primary key does.
WHEN SHOULD PRIMARY KEY BE USED?

Primary key is needed for referring to a specific record.
To make your SELECTs run fast you should consider adding an index on an appropriate columns you're using in your WHERE.
E.g. to speed-up SELECT * FROM "Customers" WHERE "State" = 'CA' one should create an index on State column.

Primarykey will not help if you don't have Primarykey in where cause.
If you would like to make you quesry faster, you can create non-cluster index on columns in where cause. You may want include columns on top of your index(it depend on your select cause)
The SQL optimizer will seek on your indexs that will make your query faster.
(but you should think about when data adding in your table. Insert operation might takes time if you create index on many columns.)

It depends on the SELECT statement, and the size of each row in the table, the number of rows in the table, and whether you are retrieving all the data in each row or only a small subset of the data (and if a subset, whether the data columns that are needed are all present in a single index), and on whether the rows must be sorted.
If all the columns of all the rows in the table must be returned, then you can't speed things up by adding an index. If, on the other hand, you are only trying to retrieve a tiny fraction of the rows, then providing appropriate indexes on the columns involved in the filter conditions will greatly improve the performance of the query. If you are selecting all, or most, of the rows but only selecting a few of the columns, then if all those columns are present in a single index and there are no conditions on columns not in the index, an index can help.
Without a lot more information, it is hard to be more specific. There are whole books written on the subject, including:
Relational Database Index Design and the Optimizers

One way you can do it is to create indexes on your table. It's always better to create a primary key, which creates a unique index that by default will reduce the retrieval time .........
The optimizer chooses an index scan if the index columns are referenced in the SELECT statement and if the optimizer estimates that an index scan will be faster than a table scan. Index files generally are smaller and require less time to read than an entire table, particularly as tables grow larger. In addition, the entire index may not need to be scanned. The predicates that are applied to the index reduce the number of rows to be read from the data pages.
Read more: Advantages of using indexes in database?

Table index design

I would like to add index(s) to my table.
I am looking for general ideas how to add more indexes to a table.
Other than the PK clustered.
I would like to know what to look for when I am doing this.
So, my example:
This table (let's call it TASK table) is going to be the biggest table of the whole application. Expecting millions records.
IMPORTANT: massive bulk-insert is adding data in this table
table has 27 columns: (so far, and counting :D )
int x 9 columns = id-s
varchar x 10 columns
bit x 2 columns
datetime x 5 columns
INT COLUMNS
all of these are INT ID-s but from tables that are usually smaller than Task table (10-50 records max), example: Status table (with values like "open", "closed") or Priority table (with values like "important", "not so important", "normal")
there is also a column like "parent-ID" (self - ID)
join: all the "small" tables have PK, the usual way ... clustered
STRING COLUMNS
there is a (Company) column (string!) that is something like "5 characters long all the time" and every user will be restricted using this one. If in Task there are 15 different "Companies" the logged in user would only see one. So there's always a filter on this one. Might be a good idea to add an index to this column?
DATE COLUMNS
I think they don't index these ... right? Or can / should be?

I wouldn't add any indices - unless you have specific reasons to do so, e.g. performance issues.
In order to figure out what kind of indices to add, you need to know:
what kind of queries are being used against your table - what are the WHERE clauses, what kind of ORDER BY are you doing?
how is your data distributed? Which columns are selective enough (< 2% of the data) to be useful for indexing
what kind of (negative) impact do additional indices have on your INSERTs and UPDATEs on the table
any foreign key columns should be part of an index - preferably as the first column of the index - to speed up JOINs to other tables
And sure you can index a DATETIME column - what made you think you cannot?? If you have a lot of queries that will restrict their result set by means of a date range, it can make total sense to index a DATETIME column - maybe not by itself, but in a compound index together with other elements of your table.
What you cannot index are columns that hold more than 900 bytes of data - anything like VARCHAR(1000) or such.
For great in-depth and very knowledgeable background on indexing, consult the blog by Kimberly Tripp, Queen of Indexing.

in general an index will speed up a JOIN, a sort operation and a filter
SO if the columns are in the JOIN, the ORDER BY or the WHERE clause then an index will help in terms of performance...but there is always a but...with every index that you add UPDATE, DELETE and INSERT operations will be slowed down because the indexes have to be maintained
so the answer is...it depends
I would say start hitting the table with queries and look at the execution plans for scans, try to make those seeks by either writing SARGable queries or adding indexes if needed...don't just add indexes for the sake of adding indexes

Step one is to understand how the data in the table will be used: how will it be inserted, selected, updated, deleted. Without knowing your usage patterns, you're shooting in the dark. (Note also that whatever you come up with now, you may be wrong. Be sure to compare your decisions with actual usage patterns once you're up and running.) Some ideas:
If users will often be looking up individual items in the table, an index on the primary key is critical.
If data will be inserted with great frequency and you have multiple indexes, over time you well have to deal with index fragmentation. Read up on and understand clustered and non-clustered indexes and fragmentation (ALTER INDEX...REBUILD).
But, if performance is key in situations when you need to retrieve a lot of rows, you might consider using your clustered indexe to support that.
If you often want a set of data based on Status, indexing on that column can be good--particularly if 1% of your rows are "Active" vs. 99% "Not Active", and all you want are the active ones.
Conversely, if your "PriorityId" is only used to get the "label" stating what PriorityId 42 is (i.e. join into the lookup table), you probably don't need an index on it in your main table.
A last idea, if everyone will always retrieve data for only one Company at a time, then (a) you'll definitely want to index on that, and (b) you might want to consider partitioning the table on that value, as it can act as a "built in filter" above and beyond conventional indexing. (This is perhaps a bit extreme and it's only available in Enterprise edition, but it may be worth it in your case.)

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas