Simple inner join suggesting an INCLUDE index - SQL

I have this simple inner join query and its execution plan. The master table has around 34K records and the detail table has around 51K records, yet this simple query is suggesting that I add an index with INCLUDE (containing all the master columns that I listed in the SELECT). I wasn't expecting this; what could be the reason, and what is the remedy?
DECLARE
@StartDrInvDate Date = '2017-06-01',
@EndDrInvDate Date = '2017-08-31'
SELECT
Mastertbl.DrInvoiceID,
Mastertbl.DrInvoiceNo,
Mastertbl.DistributorInvNo,
PreparedBy,
detailtbl.BatchNo, detailtbl.Discount,
detailtbl.TradePrice, detailtbl.IssuedUnits,
detailtbl.FreeUnits
FROM
scmDrInvoices Mastertbl
INNER JOIN
scmDrInvoiceDetails detailtbl ON Mastertbl.DrInvoiceID = detailtbl.DrInvoiceID
WHERE
(Mastertbl.DrInvDate BETWEEN @StartDrInvDate AND @EndDrInvDate)
My real curiosity is why it is suggesting this index - I do not normally see this behavior with larger tables.

For this query:
SELECT m.DrInvoiceID, m.DrInvoiceNo, m.DistributorInvNo,
PreparedBy,
d.BatchNo, d.Discount, d.TradePrice, d.IssuedUnits, d.FreeUnits
FROM scmDrInvoices m INNER JOIN
scmDrInvoiceDetails d
ON m.DrInvoiceID = d.DrInvoiceID
WHERE m.DrInvDate BETWEEN @StartDrInvDate AND @EndDrInvDate;
I would expect the basic indexes to be: scmDrInvoices(DrInvDate, DrInvoiceID) and scmDrInvoiceDetails(DrInvoiceID). These indexes would allow the query engine to quickly identify the rows that match the WHERE in the master table and then look up the corresponding values in scmDrInvoiceDetails.
The rest of the columns could then be included in either index so the indexes would cover the query. "Cover" means that all the columns are in the index, so the query plan does not need to refer to the original data pages.
The above strategy is what SQL Server is suggesting.
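As a hedged sketch of that strategy (the index names are mine, and I am assuming PreparedBy lives in the master table, as the query implies), the covering pair might look like this:
-- Hypothetical names; key columns first, remaining SELECT columns in INCLUDE.
CREATE NONCLUSTERED INDEX IX_scmDrInvoices_DrInvDate
    ON scmDrInvoices (DrInvDate, DrInvoiceID)
    INCLUDE (DrInvoiceNo, DistributorInvNo, PreparedBy);
CREATE NONCLUSTERED INDEX IX_scmDrInvoiceDetails_DrInvoiceID
    ON scmDrInvoiceDetails (DrInvoiceID)
    INCLUDE (BatchNo, Discount, TradePrice, IssuedUnits, FreeUnits);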

You can perhaps see the logic of why it's suggesting to index the invoice date; it's done some calculation on the number of rows you want out of the number of rows it thinks there are currently, and it appears that the selectivity of an index on that column makes it worth indexing. If you want 3 rows out of 55,000, and you want it every 5 minutes forever, it makes sense to index. Especially if the growth rate of that table means that next year it'll be 3 rows out of 5.5 million.
The INCLUDE recommendation is perhaps more naive: it recommends associating enough additional data with the indexed values that everything the query demands from the master table can be answered from the index, without hitting the table at all. Indexes are essentially pointers to rows in a table; once the query engine has used the index to locate all the rows it needs, it still has to visit the table to actually fetch the data you want. By including data in an index you remove the need to go to the table. That is sensible sometimes, but not always - creating many indexes that essentially replicate most or all of a table's data for seldom-run queries is a waste of disk space.
Consider too that the frequency with which you're running this query now, in a debug tool, is affecting SQL Server's opinion of how often the query is used. I routinely find my SQL Azure portal making index recommendations because the devs have been running a query over and over while debugging it, when I actually know that in prod that query will be used once a month. So I discard the recommendation to create an index that includes most of the table, when a plain "index only the columns searched" will do fine, no INCLUDE necessary.
These recommendations thus shouldn't be blindly heeded, as SQL Server cannot know what you intend to use this query, or similar queries, for in a real-world application. Index creation and maintenance should be done carefully and thoughtfully. For example, this query may be asking for one index while another query wants an index on a different column; it might make sense to create a single index that keys on both columns (in a particular order), and then, in whichever query searches on the column that is indexed second, include a predicate that hits the first indexed column regardless of whether the query strictly needs it.
Example: in your invoices table you have a column indicating whether an invoice is paid or not, and somewhere else in your app you have another query that counts the number of unpaid invoices. You can either have two indexes - one on invoice date (for this query) and one on status (for that query) - or one index on both columns (status, date), and in this query use predicates of WHERE status = 'unpaid' AND date BETWEEN ... even though the status predicate is redundant. Why might it be redundant? Suppose you know you'll only ever be choosing invoices from last week that have not been sent out yet, so they can only ever be unpaid. This is what I mean by "be thoughtful about indexing" - you know lots about your app that SQL Server can never figure out. By including the logically redundant status predicate in the "get invoices from last week" query, you allow the query engine to use an index that is ordered first by status, then by date. That means you only have to maintain one index, and it can be used by two queries; see the sketch below.
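A minimal sketch of that idea (the Invoices table and its columns are invented for illustration, not taken from the question):
-- One combined index serving two different queries; all names hypothetical.
CREATE NONCLUSTERED INDEX IX_Invoices_Status_Date
    ON Invoices (Status, InvoiceDate);
DECLARE @WeekStart date = '2017-08-01', @WeekEnd date = '2017-08-07';
-- Query 1: count unpaid invoices (seeks on the leading Status key).
SELECT COUNT(*) FROM Invoices WHERE Status = 'unpaid';
-- Query 2: last week's invoices, with the logically redundant Status
-- predicate added so the same index can be used for the seek.
SELECT InvoiceID, InvoiceDate
FROM Invoices
WHERE Status = 'unpaid'
  AND InvoiceDate BETWEEN @WeekStart AND @WeekEnd;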
Index maintenance and creation logic can be a full-time job.. ;)


How to speed up a MariaDB join between tables

I have 2 tables from which I'm joining certain columns. They are joined on a VARCHAR column (indexed in both tables). Table A has a bit over 800,000 records and Table B has 20,000 records.
Table A has an auto-increment primary key. Table B does not have a primary key, only the index on the mentioned VARCHAR column.
The query takes about 48 seconds, which is too slow. What can I do to increase the speed? Would it help to create an auto-increment primary key in Table B, even if this is not the column on which the join takes place?
I'm a beginner in SQL. Both tables are InnoDB and I use MariaDB.
QUERY:
select distinct
`pr`.`ProductIdentifier` AS `ProductIdentifier`,
`pr`.`Datum` AS `Datum`,
`pr`.`Retailer` AS `Retailer`,
`pr`.`Prijs` AS `Prijs`,
`pm`.`Merk` AS `Merk`,
`pm`.`Product` AS `Product`,
`pm`.`Formaat` AS `Formaat`
from
(`prices`.`prices_table` `pr`
join `prices`.`product_match_table` `pm`
on(`pr`.`ProductIdentifier` = `pm`.`ProductIdentifier`))
EXPLAIN SELECT:
[The EXPLAIN output was posted as an image.]
This answer is based on my knowledge of indexing in general; MariaDB may have some more specialised options I am not aware of.
However, indexes broadly speed up queries in two ways:
By only having the columns needed, meaning less data to read and process
By being sorted in an appropriate manner to help processing
For the first, you typically need a covering index.
For the second, this includes:
Being sorted the same way (e.g., indexed on the same fields) as tables it is being JOINed to in the query
Being sorted so that WHERE clauses and other types of filtering can directly use the sort to go to the appropriate spot in the index/table
In practice, the best improvement in performance often comes from that last one - however, you do not have any WHERE clauses in the code shown. If (as is typical) the users filter the results (e.g., only show me results where ProductName = 'Handbag'), then you may need to adjust the indexes for those (more on that a bit later though).
Covering indexes for query above
I think with the current query (and no filtering etc.) the fastest you can get is with two indexes:
CREATE INDEX `IX_prices_ProductIdentifier` ON `prices`.`prices_table`
(`ProductIdentifier`,
`Datum`,
`Retailer`,
`Prijs`);
CREATE INDEX `IX_productmatch_ProductIdentifier` ON `prices`.`product_match_table`
(`ProductIdentifier`,
`Merk`,
`Product`,
`Formaat`);
These provide covering indexes for the query shown, and both are sorted the same way (by ProductIdentifier) to make the join easier.
Searching/filtering (not specified in initial example)
However, if people often search by a specific field first, then it makes sense to re-order the fields in the relevant table (so the searched field is first), or have multiple indexes with the search field at the front.
For example, people may be able to search for specific values in pr.Retailer, pm.Merk, or pm.Product. You may therefore add these additional indexes:
CREATE INDEX `IX_prices_Retailer` ON `prices`.`prices_table`
(`Retailer`,
`ProductIdentifier`,
`Datum`,
`Prijs`);
CREATE INDEX `IX_productmatch_Merk` ON `prices`.`product_match_table`
(`Merk`,
`ProductIdentifier`,
`Product`,
`Formaat`);
CREATE INDEX `IX_productmatch_Product` ON `prices`.`product_match_table`
(`Product`,
`ProductIdentifier`,
`Merk`,
`Formaat`);
Notice with the above that field order matters. The data (index) is sorted by the first field, then the second field, then the third field, etc. To use the index effectively, your filtering/WHERE clause needs to include at least the first field, if not more; see the example below.
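For example (the filter value is invented for illustration), this query can seek on IX_prices_Retailer because Retailer is its leading field, but it could not use IX_prices_ProductIdentifier efficiently, since that index is sorted by ProductIdentifier first:
-- Can use IX_prices_Retailer: Retailer is the first (leftmost) key field.
SELECT `ProductIdentifier`, `Datum`, `Prijs`
FROM `prices`.`prices_table`
WHERE `Retailer` = 'SomeRetailer';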
An alternative to these indexes (the ones for filtering) is to keep the original two indexes as above, but then put a separate index onto each of the fields users can search on, e.g., if the users can filter on the retailer, merk and product, then create
one index on pr.Retailer
one on pm.Merk, and
one on pm.Product
Caveats
Adding indexes makes data inserts into the relevant table (and often deletes/updates) slower than if the indexes weren't there. The reason is that the database doesn't just need to update the data in the table; it also needs to update the index(es).
Typically this is not much of a problem unless you are adding and deleting lots of data from the tables frequently. But it is worth checking your 'product maintenance' interface (e.g., adding products, updating prices, etc.) after adding indexes to confirm it still runs well.

Two indexes on the same column with different sort orders

I have a large table in Microsoft SQL Server 2008. It has two indexes: one index with column A in descending order, and another index with column A ascending along with some other columns.
My application does the following:
Select for the record.
If there is no record, then insert.
If the record is found, then update it.
Note that this table has millions of records.
The question is: do these indexes affect select/insert/update performance?
Any suggestions?
Having the exact same two indexes, with the only difference being the ordering, will make no difference to the SQL engine; it will just pick either (practically speaking).
Imagine you have 2 dictionaries of the English language, one sorting words from A to Z and the other from Z to A. The effort you need to search for a particular word will be roughly the same in both cases.
A different case would be if you had 2 lists of people's data, one ordered by first name then last name and the other by last name first, then first name. If you have to look for "John Doe", the one that's ordered first by last name will be practically useless compared to the other one.
These examples are very simplified representations of indexes in SQL Server. Indexes store their data in a structure called a B-tree, but for searching purposes these examples are enough to understand when an index will or will not be useful.
Summing up: you can drop the first index and keep the one that has additional columns, since it can be used in more scenarios and also covers all the cases that would require the other one. Keeping a useless index brings additional maintenance work, such as updating the index on every insert, update and delete and refreshing statistics, so you might want to drop it. A sketch of the situation follows.
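A minimal sketch of the scenario described (table, column and index names are hypothetical):
-- Two indexes whose leading key differs only in sort direction.
CREATE INDEX IX_BigTable_A_Desc ON BigTable (A DESC);
CREATE INDEX IX_BigTable_A_Asc ON BigTable (A ASC, B, C);
-- The single-column descending index is redundant here; drop it.
DROP INDEX IX_BigTable_A_Desc ON BigTable;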
P.S.: As for the second index being used, this greatly depends on the query you use to access that particular table, its joins and WHERE clauses. You can use "Display Estimated Execution Plan" with the query highlighted in SSMS to see how the engine accesses each object to perform the operation. If the index is used, you will see it there.
Thanks for all the answers. I explored SQL Server Profiler and the Database Engine Tuning Advisor and ran them to get recommended indexes. The advisor recommended a single index with INCLUDE options. I created that index and the performance has improved.

Oracle SQL: What is the best way to select a subset of a very large table

I have been roaming these forums for a few years and I've always found my questions had already been asked, and a fitting answer was already present.
I have a pretty generic (and maybe easy) question now though, but I haven't been able to find a thread asking the same one yet.
The situation:
I have a payment table with 10-50M records per day, a history of 10 days and hundreds of columns. About 10-20 columns are indexed. One of the indices is batch_id.
I have a batch table with considerably fewer records and columns, say 10k a day and 30 columns.
If I want to select all payments from one specific sender, I could just do this:
Select * from payments p
where p.sender_id = 'SenderA'
This runs for a while, even though sender_id is also indexed. So I figure it's better to select the batches first, then go into the payments table with the batch_id:
select * from payments p
where p.batch_id in
(select b.batch_id from batches b where b.sender_id = 'SenderA')
--and p.sender_id = 'SenderA'
Now, my questions are:
In the second script, should I uncomment the sender_id predicate in the WHERE clause on the payments table? It doesn't feel very efficient to filter on sender_id twice, even though it's in different tables.
Is it better if I make it an inner join instead of a nested query?
Is it better if I make it a common table expression instead of a nested query or inner join?
I suppose it could all fit into one question: What is the best way to query this?
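For reference (these rewrites are sketches of the alternatives being asked about, not a recommendation; they assume batch_id is unique in batches):
-- Inner join instead of the nested subquery.
select p.*
from payments p
join batches b on b.batch_id = p.batch_id
where b.sender_id = 'SenderA';
-- Common table expression instead of the nested subquery.
with sender_batches as (
  select b.batch_id from batches b where b.sender_id = 'SenderA'
)
select p.*
from payments p
where p.batch_id in (select batch_id from sender_batches);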
In the worst case the two queries should run in the same time and in the best case I would expect the first query to run quicker. If it is running slower, there is some problem elsewhere. You don't need the additional condition in the second query.
The first query will retrieve index entries for a single value, so that is going to access less blocks than the second query which has to find index entries for multiple batches (as well as executing the subquery, but that is probably not significant).
But the danger as always with Oracle is that there are a lot of factors determining which query plan the optimizer chooses. I would immediately verify that the statistics on your indexed columns are up-to-date. If they are not, this might be your problem and you don't need to read any further.
The next step is to obtain a query execution plan. My guess is that this will tell you that your query is running a full-table-scan.
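In Oracle you can obtain one like this (standard commands; the statement inside is the first query from the question):
EXPLAIN PLAN FOR
select * from payments p where p.sender_id = 'SenderA';
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);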
Whether or not Oracle chooses to perform a full table scan on a query such as this depends on the number of rows returned and on whether Oracle thinks it is more efficient to use the index or to simply read the whole table. The threshold for flipping between the two is not a fixed number: it depends on a lot of things, one of them being a parameter called DB_FILE_MULTIBLOCK_READ_COUNT.
This is set up by Oracle, and in theory it should be configured such that the transition between indexed and full-table-scan queries is smooth. In other words, at the transition point where your query is returning just about enough rows to make a full table scan more efficient, the index scan and the table scan should take roughly the same time.
Unfortunately, I have seen systems where this is way out and Oracle flips to doing full table scans far too quickly, resulting in a long query time once the number of rows gets over a certain threshold.
As I said before, first check your statistics. If that doesn't work, get a QEP and start tuning your Oracle instance.
Tuning Oracle is a very complex subject that can't be answered in full here, so I am forced to recommend links. Here is a useful page on the parameter (reducing it might help): Why Change the Oracle DB_FILE_MULTIBLOCK_READ_COUNT.
Other than that, the general Oracle performance tuning guide is here: (Oracle) Configuring a Database for Performance.
If you are still having problems, you need to progress your investigation further and then come up with a more specific question.
EDIT:
Based on your comment, your query is returning 4M rows out of the 10M-50M in the table. If it is 4M out of 10M, there is no way an index will be of any use. Even at 4M out of 50M, it is still pretty certain that a full table scan would be the most efficient approach.
You say that you have a lot of columns, so probably this 4M row fetch is returning a huge amount of data.
You could perhaps consider splitting off some of the columns that are not required and putting them into a child table. In particular, if you have columns containing a lot of data (e.g., some text comments or whatever) they might be better being kept outside the main table.
Remember - small is fast, not only in terms of number of rows, but also in terms of the size of each row.
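A hedged sketch of that split (the child table and its columns are invented for illustration; it assumes payments has a payment_id primary key):
-- Move rarely used bulky columns into a child table keyed by payment_id.
CREATE TABLE payment_details (
  payment_id NUMBER PRIMARY KEY REFERENCES payments (payment_id),
  comments   CLOB,
  extra_info CLOB
);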
SQL is a declarative language. This means that you specify what you want, not how to get it.
Check your indexes, both primary and "normal" ones...

Multiple single-field indexes vs. multi-field indexes

I know there are similar questions on StackOverflow, but after testing different indexes on my tables, I think I don't quite understand how indexes work and I'd like it if someone could explain the behavior I'm experiencing on my queries' performance.
I'm using this query as an example, I'm going to try to explain it in detail:
SELECT ss1.PlayerID, ss1.Name, ss1.Series, ss1.LanesNum, ss1.Date, ss1.LeagueName, ss1.Season FROM SeriesScores ss1
JOIN (SELECT Series, Gender, LanesNum, Bowlout, Season FROM SeriesScores
WHERE Gender = ? AND LanesNum = ? AND Series > -1 AND Bowlout = 'No' AND Season = '2011-2012'
ORDER BY Series DESC LIMIT 0,?) as ss2
USING(series, gender, lanesNum, bowlout, season)
ORDER BY ss1.Series DESC
This query is used to get the highest series bowled in a given season for each pair of lanes in a bowling center for both male and female players.
I'm joining the table on itself instead of using the MAX aggregate function because if there's a tie on a given pair of lanes, I want all the names to come up.
Basically, I join all the fields that match what the inner SELECT returns. That inner SELECT returns the top X players for a given gender and a given pair of lanes.
The USING part makes sure only the players that haven't bowled out, with the same gender, series, lanesNum and season as I'm looking for get selected. I then order them by highest series to lowest series.
This query is in a for loop, which gets run 12 times for men and 12 times for women (there are 12 pairs of lanes in the bowling center), with only the lanesNum and gender parameters changing.
I then put all the results in two different vectors in Java to display the results in an application (one vector for men, one for women).
Without any indexes whatsoever, it takes around 11 seconds to run everything, including putting the results in the vectors (5.5 seconds for the 12 queries for men, the same for women).
With an index on (gender, lanesNum, series), it takes 0.04 seconds for the whole thing, which is amazing, since that's a more than acceptable speed for my needs.
I used that index because those are all the most important fields I'm using in my WHERE clause, but I don't get why it speeds things up that much, because I tried other things and using some other indexes actually made my queries SLOWER by more than 100%. Also, I'm wondering if I would get an even faster query if I added "bowlout" and "season" to that index.
I wanted to try a single column index on series first and test performance. That's the index that made all of those queries take a total of 22 seconds.
I came to the conclusion that I don't understand where I should be using my indexes and when I should be using them on multiple fields, or using multiple indexes on single fields, etc. Also, I don't understand how using (the wrong) indexes can actually make performance worse.
Optimizing an index too aggressively for just one query runs the risk of slowing down other queries (and thus a real-world application, or the next version of it). However, let us do exactly that as an exercise in analysing index performance.
Indexes influence query performance in multiple ways; their existence can actually completely change the algorithm that the database server will use to get to the data. A nice overview is here, but as your query is simple, and you actually have very few relevant indexes in your database (the one you see, and also automatically created indexes to support the primary keys of your tables) we can simplify the story greatly.
A good index makes it faster to cross reference the data between the tables. Ideally it contains columns in your USING and WHERE clauses, and enough of them to reference a unique row in its table most of the time. If it contains less, it may still be used by the database server, but the remaining rows will have to be visited one by one.
An great index does not only all that, but it also contains all data that you will be selecting from the table (yes, this makes sense when the two tables are actually the same physical table due to the self-join; the database server still processes as if it was two different tables, incidentally with the same data). The benefit of such a "fully covering index" is that the database server does not have to visit its table at all; all the columns are available in the index.
Order of columns in the index matters. It is especially important that the leftmost column in the index appears in the USING or WHERE clause; otherwise the index is pretty much unusable, as matching data for a single lookup can appear in many locations in that index. The leftmost column should also be highly selective (have many different values in the table). Do a few experiments to see this first hand.
For this reason, the first choice index I'd suggest to you would be series, gender, lanesNum, bowlout; but yours is also a very good one for this query.
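As a sketch in SQLite syntax (the table and columns come from the question; the index name is mine):
-- The suggested first-choice index; (gender, lanesNum, series) is also good.
CREATE INDEX idx_seriesscores_series
ON SeriesScores (Series, Gender, LanesNum, Bowlout);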
There is not much use in creating more than one index explicitly. There is basically no use for more than one of them during query execution, because your query is so simple. So the most useful one will supposedly win and all the others will be ignored.
To your last question: some people believe that superfluous indexes only slow down UPDATE, INSERT and DELETE statements (because these carry the overhead to update the indexes), but it is not that simple. As the database server considers multiple algorithms to compute your query (there are two logical tables to start from and automatic and explicit indexes to use, or not to use), it may choose the wrong plan: an index may look seductive without knowing the data distribution in the table, but be very counterproductive given the distribution.
There is actually a way to let the database server analyze the data and record some statistics that greatly help it optimize your subsequent queries reasonably, and probably avoid any 22-second executions of your query (until you change your data so much that the statistics no longer hold true). That is the ANALYZE command. Issue it every time after you change your indexes to see the subsequent SQLite performance at its best. In a production database, schedule ANALYZE to execute every night, so that your database does not gradually slow down over time, or abruptly after adding a harmless but useless index.
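A minimal usage sketch:
-- Refresh optimizer statistics after changing indexes or bulk-loading data.
ANALYZE;
-- Or restrict it to one table:
ANALYZE SeriesScores;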

When do SQL optimizations become overkill?

I'm updating tables with millions of records and I need to be as efficient as possible. Is there a point at which adding more criteria to the where clause will actually hurt rather than help?
For example, if I know I want to set a column to 3, I could use this query:
update mytable set col = 3
Or I could update the record only if it's different
update mytable set col = 3 where col <> 3
I could also filter it so it only updates records added since the last time I ran this process
update mytable set col = 3 where col <> 3 and createDate > @lastRunDate
And perhaps I could look for more things in additional columns.
I guess my question is whether there is a point where the cost of evaluating additional columns outweighs the cost of the update itself, and whether there's a principle you can use to decide where to draw the line.
Update
So here's the principle I'm trying to piece together based on what was said. Feel free to argue with this and I'll update it accordingly:
If there are no indexed columns to filter on, add as much criteria as possible to limit the records being updated, since a full table scan is going to happen anyway.
If the difference in records between filtering on only indexed columns and filtering on all possible columns is marginal, only use the indexed columns and avoid the full table scan.
If you have a mix of indexed and non-indexed columns, definitely use the indexed columns if you can and only use non-index columns if... [[I'm still struggling with this part. What's the threshold for introducing the non-indexed columns in the where clause?]]
Update #2
Sounds like I have my answer.
If you have an index on "col", then running your first query will update millions of rows regardless; your second query would potentially update only a few and would find those quickly if there's an index available. If you don't have an index on that column, the effect will be marginal, since a full table or index scan must occur to check all rows in your table (you'll just have fewer actual updates, but that's it).
The whole point of restricting your queries using WHERE clauses is to reduce the scope of the query, i.e. the number of rows SQL Server has to look at. Less data to process is always faster than doing the work on all the millions of rows.
In response to your update: the main goal of using a WHERE clause is to reduce the number of rows you need to inspect / touch. If you have a means (typically an index) to reduce that number from 100% to a few percent, then it's definitely worth it. That's the whole point of having indices (mostly for SELECTs, but applies to other operations, too, of course).
If you have a suitable index, and thus you can pluck out a few hundred rows to check against a criteria versus having to inspect millions of rows, you'll always be faster. If you have a good book index in a bookstore that guides you easily to the two shelves where the books that interest you are located, you'll find what you're looking for more quickly than when you have to criss-cross the whole bookstore since there's no index available.
There obviously is a point where yet another criterion or index doesn't help anymore. If that's the case, typically yet another WHERE clause won't really help much, or at all. But the SQL query optimizer will find those cases and filter them out (possibly even just ignoring them when deciding on the best query execution plan). A sketch of a supporting index for the incremental variant follows.
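A hedged sketch of such a supporting index (names taken from the question's example; whether it pays off depends on how few rows actually qualify):
-- Lets SQL Server seek to recently added rows instead of scanning everything;
-- INCLUDE (col) allows the residual col <> 3 check without a table lookup.
CREATE NONCLUSTERED INDEX IX_mytable_createDate
    ON mytable (createDate)
    INCLUDE (col);
DECLARE @lastRunDate datetime = '2024-01-01'; -- example value
UPDATE mytable SET col = 3
WHERE col <> 3 AND createDate > @lastRunDate;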
This really comes down to index usage and query optimization. I would suggest looking at the query plan before making any decisions.
Adding indexed fields to the WHERE clause will often improve query time; however, adding non-indexed fields can result in table scans, which will slow your query down.
My suggestion is: write a query that works, look at the execution time, and work to reduce it to an acceptable level by studying the query plan. Don't over-optimize; go for the acceptable solution.