Making a more efficient join - SQL

Here's my query, it is fairly straightforward:
SELECT
INVOICE_ITEMS.II_IVNUM, INVOICE_ITEMS.IIQSHP
FROM
INVOICE_ITEMS
LEFT JOIN
INVOICES
ON
INVOICES.INNUM = INVOICE_ITEMS.II_INNUM
WHERE
INVOICES.IN_DATE
BETWEEN
'2010-08-29' AND '2010-08-30'
;
I have very limited knowledge of SQL, but I'm trying to understand some of the concepts like subqueries and the like. I'm not looking for a redesign of this code, but rather an explanation of why it is so slow (600+ seconds on my test database) and how I can make it faster.
From my understanding, the left join is creating a virtual table and populating it with every result row from the join, meaning that it is processing every row. How would I stop the query from reading the table completely and just finding the WHERE/BETWEEN clause first, then creating a virtual table after that (if it is possible)?
How is my logic? Are there any consistently recommended resources to get me to SQL ninja status?
Edit: Thanks everyone for the quick and polite responses. Currently, I'm connecting over ODBC to a proprietary database that is used in the rapid application development framework called OMNIS. Therefore, I really have no idea what sort of optimization is being run, but I believe it is based loosely on MSSQL.

I would rewrite it like this, and make sure you have an index on i.INNUM, ii.II_INNUM, and i.IN_DATE. The LEFT JOIN is being turned into an INNER JOIN by your WHERE clause, so I rewrote it as such:
SELECT ii.II_IVNUM, ii.IIQSHP
FROM INVOICE_ITEMS ii
INNER JOIN INVOICES i ON i.INNUM = ii.II_INNUM
WHERE i.IN_DATE BETWEEN '2010-08-29' AND '2010-08-30'
Depending on what database you are using, what may be happening is that all of the records from INVOICE_ITEMS are being joined (due to the LEFT JOIN), regardless of whether there is a match with INVOICES or not, and then the WHERE clause is filtering down to the matched rows with a date within range. By switching to an INNER JOIN, you may make the query more efficient, since the WHERE clause only needs to be applied to INVOICES records that have a matching INVOICE_ITEMS record.
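For example, the supporting indexes might look like this (a sketch in generic SQL; the index names are made up and the exact syntax may vary on your database):
-- Supports the join from INVOICE_ITEMS into INVOICES:
CREATE INDEX IX_InvoiceItems_IINNum ON INVOICE_ITEMS (II_INNUM);
-- Supports the date-range filter:
CREATE INDEX IX_Invoices_InDate ON INVOICES (IN_DATE);
-- INVOICES.INNUM likely already has an index if it is the primary key.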

Since that is a very basic query, the optimizer should do fine with it; the problem is more likely incorrect indexing. Do you have indexes on the IN_DATE field and the INVOICE_ITEMS.II_INNUM field? If you have properly set up PK/FK relationships, INVOICES.INNUM should already be indexed, but FKs are not indexed automatically.

Your query is fine, it's the indexes you have to look at.
Are INVOICES.INNUM and INVOICE_ITEMS.II_INNUM indexed?
If not, SQL has to do something called a 'scan': it searches every single record.
You can think of indexes as like the tabs on the side of a phone book - you know where to start looking for people based on the first letters of their surname. Without an index (say you want to look for names that end in '...son') you have to search the entire book.
There are different types of index - they can be ordered (like the phone book index - all ordered by surname) or not (like the index at the back of a book - there's an overhead in finding the index and then the actual page).
You should also be able to view the query plan - this is how the server executes the SQL statement. That can tell you all sorts of more advanced stuff - for instance, there are multiple ways to do the job: a merge join is possible if both tables are sorted by the join field, or a nested loop join will loop through the smaller table for every record in the larger table.
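For example, if the engine is SQL Server-like, you can ask for a text plan before running the statement (a sketch; SSMS also has a graphical 'Display Estimated Execution Plan' option):
SET SHOWPLAN_TEXT ON;
GO
SELECT ii.II_IVNUM, ii.IIQSHP
FROM INVOICE_ITEMS ii
INNER JOIN INVOICES i ON i.INNUM = ii.II_INNUM
WHERE i.IN_DATE BETWEEN '2010-08-29' AND '2010-08-30';
GO
SET SHOWPLAN_TEXT OFF;
GO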

Well, there is no obvious reason why this query should be slow... the only thing that comes to mind is: do you have indexes on INVOICES.INNUM and INVOICE_ITEMS.II_INNUM? Adding them could speed up the SELECT, but it would slow down updates/inserts...

A join doesn't create a "virtual table" on anything more than just a conceptual level.
The performance issue with your query most likely lies in poor or insufficient indexing. You should have indexes on:
INVOICE_ITEMS.II_INNUM
INVOICES.IN_DATE
You should also have an index on INVOICES.INNUM, but if that's the primary key of the table then it already has one.
Also, don't use a left join here. If there's a foreign key between INVOICE_ITEMS.II_INNUM and INVOICES.INNUM (and INVOICE_ITEMS.II_INNUM is not nullable), then you'll never encounter a record in INVOICE_ITEMS that won't match up to a record in INVOICES. Even if there were, your WHERE condition is using a value from INVOICES, so you'd eliminate any unmatched rows anyway. Just use a regular JOIN.

Related

Simple Inner join suggesting an Include index

I have this simple inner join query and its execution plan. The master table has around 34K records and the detail table has around 51K records. But this simple query is suggesting I add an index with an INCLUDE clause (containing all the master columns that I included in the SELECT). I wasn't expecting this - what could be the reason, and what is the remedy?
DECLARE
@StartDrInvDate Date = '2017-06-01',
@EndDrInvDate Date = '2017-08-31'
SELECT
Mastertbl.DrInvoiceID,
Mastertbl.DrInvoiceNo,
Mastertbl.DistributorInvNo,
PreparedBy,
detailtbl.BatchNo, detailtbl.Discount,
detailtbl.TradePrice, detailtbl.IssuedUnits,
detailtbl.FreeUnits
FROM
scmDrInvoices Mastertbl
INNER JOIN
scmDrInvoiceDetails detailtbl ON Mastertbl.DrInvoiceID = detailtbl.DrInvoiceID
WHERE
(Mastertbl.DrInvDate BETWEEN @StartDrInvDate AND @EndDrInvDate)
My real curiosity is why it is suggesting this index - I do not normally see this behavior with larger tables.
For this query:
SELECT m.DrInvoiceID, m.DrInvoiceNo, m.DistributorInvNo,
PreparedBy,
d.BatchNo, d.Discount, d.TradePrice, d.IssuedUnits, d.FreeUnits
FROM scmDrInvoices m INNER JOIN
scmDrInvoiceDetails d
ON m.DrInvoiceID = d.DrInvoiceID
WHERE m.DrInvDate BETWEEN @StartDrInvDate AND @EndDrInvDate;
I would expect the basic indexes to be: scmDrInvoices(DrInvDate, DrInvoiceID) and scmDrInvoiceDetails(DrInvoiceID). These indexes would allow the query engine to quickly identify the rows that match the WHERE in the master table and then look up the corresponding values in scmDrInvoiceDetails.
The rest of the columns could then be included in either index so the indexes would cover the query. "Cover" means that all the columns are in the index, so the query plan does not need to refer to the original data pages.
The above strategy is what SQL Server is suggesting.
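Expressed as DDL, that strategy might look like this (a sketch; the index names and the choice of INCLUDE columns are illustrative):
CREATE INDEX IX_scmDrInvoices_DrInvDate
ON scmDrInvoices (DrInvDate, DrInvoiceID)
INCLUDE (DrInvoiceNo, DistributorInvNo, PreparedBy);

CREATE INDEX IX_scmDrInvoiceDetails_DrInvoiceID
ON scmDrInvoiceDetails (DrInvoiceID)
INCLUDE (BatchNo, Discount, TradePrice, IssuedUnits, FreeUnits);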
You can perhaps see the logic of why it's suggesting to index the invoice date; it's done some calculation on the number of rows you want out of the number of rows it thinks there are currently, and it appears that the selectivity of an index on that column makes it worth indexing. If you want 3 rows out of 55,000, and you want it every 5 minutes forever, it makes sense to index. Especially if the growth rate of that table means that next year it'll be 3 rows out of 5.5 million.
The INCLUDE recommendation is perhaps more naively recommending that you associate enough additional data with the indexed values that the entire dataset demanded from the master table can be answered from the index, without hitting the table. Indexes are essentially pointers to rows in a table; when the query engine has used the index to locate all the rows it will need, it still has to hit the table to actually get the data you want. By including data in an index you remove the need to go to the table. That is sensible sometimes, but not others - creating many indexes that essentially replicate most or all of a table's data for seldom-run queries is a waste of disk space.
Consider, too, that the frequency with which you're running this query now, in a debug tool, is affecting SQL Server's opinion of how often the query is used. I routinely find my SQL Azure portal making index recommendations thanks to the devs running a query over and over while debugging it, when I actually know that in prod that query will be used once a month. So I discard the recommendation to make an index that includes most of the table, when a straight "index only the columns searched" will do fine, with no INCLUDE necessary.
These recommendations thus shouldn't be blindly heeded, as SQL Server cannot know what you intend to use this query (or similar queries) for in real-world applications. Index creation and maintenance should be done carefully and thoughtfully. For example, it may be that this query asks for an index on one column while another query wants an index on a different column, but it might make sense to create a single index that keys on both columns (in a particular order), and then, in whichever query searches on the column that is indexed second, include a predicate that hits the first indexed column regardless of whether the query needs it.
Example: in your invoices table you have a column indicating whether an invoice is paid or not, and somewhere else in your app you have another query that counts the number of unpaid invoices. You can either have two indexes - one on invoice date (for this query) and one on status (for that query) - or one index on both columns (status, date), and in this query use predicates of WHERE status = 'unpaid' AND date BETWEEN ..., even though the status predicate is redundant. Why might it be redundant? Suppose you know you'll only ever be choosing invoices from last week that have not been sent out yet, and so can only ever be unpaid. This is what I mean by "be thoughtful about indexing" - you know lots about your app that SQL Server can never figure out. By including the logically redundant status column in the "get invoices from last week" query, you allow the query engine to use an index that is ordered first by status, then by date. This means you only have to maintain one index, and it can be used by both queries.
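A sketch of that trade-off (the table and column names here are hypothetical, not taken from the question):
-- Two single-purpose indexes:
CREATE INDEX IX_Invoices_Date ON Invoices (InvoiceDate);
CREATE INDEX IX_Invoices_Status ON Invoices (Status);

-- versus one composite index that can serve both queries:
CREATE INDEX IX_Invoices_Status_Date ON Invoices (Status, InvoiceDate);

-- Query A seeks on the leading column:
SELECT COUNT(*) FROM Invoices WHERE Status = 'unpaid';

-- Query B adds the logically redundant Status predicate so the
-- composite index can seek on (Status, InvoiceDate):
SELECT InvoiceID, InvoiceDate
FROM Invoices
WHERE Status = 'unpaid'
AND InvoiceDate BETWEEN '2017-08-21' AND '2017-08-28';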
Index maintenance and the logic of index creation can be a full-time job... ;)

MSSQL Server Not Using NonClustered Composite Key Index (PK + FK) on InnerJoin

Having the following structure:
Table Auction (Id_Auction (Pk), DateTime_Auction)
Table Auction_Item (Id_Auction_Item (Pk), Id_Auction (Fk), Id_Winning_Bid (Fk), Item_Description)
Table Bid (Id_Bid (Pk), Id_Auction_Item (Fk), Id_Bidder (Fk), Lowest_Value, Highest_Value)
Table Bidder (Id_Bidder (Pk), Name)
Indexes for Auction are not relevant.
Indexes for Auction_Item:
Clustered Index PK_Auction_Item (Id_Auction_Item)
NonClustered Index IX_Auction_Item_IdWinningBid (Id_Winning_Bid)
Indexes for Bid:
Clustered Index PK_Bid (Id_Bid)
NonClustered Index IX_Bid_IdBidder (Id_Bidder)
NonClustered Unique Index IX_Bid_IdBid_IdBidder (Id_Bid, Id_Bidder) INCLUDE (Id_Auction_Item, Lowest_Value, Highest_Value)
Indexes for Bidder are not relevant.
I'll ask you to bear with me a little... This structure is only meant to show you the relationships between the tables/data and is not intended to follow best practices. The actual database is really more complex (the Bid table has around 54 million rows). And yes, each Auction_Item will have only one Bid per Bidder, holding his highest and lowest bid.
So, when I execute the following query:
Select
Auc.Id_Auction,
Itm.Id_Auction_Item,
Itm.Item_Description,
B.Id_Bid,
B.Lowest_Value,
B.Highest_Value
From
Auction Auc
Inner Join Auction_Item Itm on Itm.Id_Auction = Auc.Id_Auction
Inner Join Bid B on B.Id_Bid = Itm.Id_Winning_Bid
And B.Id_Bidder = 27
Where Auc.DateTime_Auction > '2014-01-01';
Why does SQL Server prefer NOT to use "IX_Bid_IdBid_IdBidder", and instead uses this execution plan for Bid:
If I disable IX_Bid_IdBidder and force it to use "IX_Bid_IdBid_IdBidder", everything messes up:
I can't understand why MSSQL prefers to use two indexes instead of only one that completely covers the query. My only guess is that it's faster to use the clustered index, but I can't believe that it's faster than just using the unique composite key of the other nonclustered index.
Why?
Update:
As proposed by @Arvo, I changed the order of the key columns of "IX_Bid_IdBid_IdBidder", making Id_Bidder first and Id_Bid second. It then became the preferred index. So, once again, why is MSSQL using the less selective index key instead of the most selective one? Id_Bid is explicitly related in the inner join...
Old update:
I updated the query, making it even more selective.
Also, I updated the index "IX_Bid_IdBid_IdBidder" to include Id_Auction_Item.
Apologies:
The index IX_Bid_IdAuctionItem_IdBidder is in fact IX_Bid_IdBid_IdBidder, which INCLUDES Id_Bid IN THE INDEX UNIQUE KEY!
A covering, correctly-sorted index is rarely not used by SQL Server. Only pathological cases come to mind such as extremely low page fullness or huge unneeded additional columns.
Your index is simply not covering. Look at the columns that are output. You'll discover one that you have not indexed.
That column is Id_Auction_Item.
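In DDL terms, a covering variant might look like this (a sketch assembled from the indexes described in the question and the update; whether Id_Bid belongs in the key or in the INCLUDE list is a design choice):
CREATE UNIQUE NONCLUSTERED INDEX IX_Bid_IdBidder_IdBid
ON Bid (Id_Bidder, Id_Bid)
INCLUDE (Id_Auction_Item, Lowest_Value, Highest_Value);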
OK, I think that after a lot of research (and learning a bit more about how joins really work behind the scenes) I figured it out.
For now I'll post it only as a theory, until some SQL master says it's wrong and shows me the light, or until I'm really sure I'm right.
The point is that MSSQL chooses what is fastest for the whole query, not only for the Bid table. So the analyzer has to choose whether to start from the Auction table or the Bid table (because of the conditions I specified: DateTime_Auction and Id_Bidder).
In my (frivolous) mind, the best execution plan would start from the Auction table:
Get Auctions that match the specified date >> Get Auction_Items matching the inner join with Auctions >> Get the Bids matching the inner join with Auction_Item AND that have Id_Bidder matching the specified id
This would select a lot of rows at each "level"/nested loop, and only at the end use the specified index to exclude 90% of the data.
Instead, MSSQL wants to start with the smallest possible data set. In this case that is only the Bids of the specified bidder, since there are a lot of Auction Items the bidder simply didn't participate in. Doing this, each nested loop has its outer table shrunk compared with "my plan":
Get Bids of the specified bidder >> inner join with Auction_Item >> exclude Auctions not matching the date.
If you look at the rightmost nested loop, which I presume is the first one, the outer table of the loop is the preselected list of Bids of a Bidder obtained using the appropriate index (IX_Bid_IdBidder); it then executes a scan on the clustered index, and so on...
To make it even better, I included the columns that were in "IX_Bid_IdBid_IdBidder" in "IX_Bid_IdBidder", so MSSQL doesn't need to execute a key lookup on PK_Bid.
There are a lot of Auction Items for each Auction, but only one Bid from the specified Bidder for each Auction Item, so the first nested loop selects the minimum of valid Auction Items we will need, which in turn limits the Auctions we have to consider when matching the date. Since we are starting from Bids, there is no "list" of Id_Bids to filter on, and so MSSQL cannot use the index "IX_Bid_IdBid_IdBidder" EVEN though it covers all the fields of the query. Thinking about it now, it seems a little obvious.
Anyway, thanks to everybody who helped me!
My research:
http://sqlmag.com/database-performance-tuning/advanced-join-techniques (a little outdated...)
https://technet.microsoft.com/en-us/library/ms191426%28v=sql.105%29.aspx
https://technet.microsoft.com/en-us/library/ms191318%28v=sql.105%29.aspx
http://blogs.msdn.com/b/craigfr/archive/2006/07/26/679319.aspx
http://blogs.msdn.com/b/craigfr/archive/2009/03/18/optimized-nested-loops-joins.aspx
There are a lot of people out there who know a lot more about SQL Server than I do, but this sounds a lot like one of two possible problems:
First, it could be that SQL Server is using outdated statistics to determine what's "most efficient", and because the statistics are wrong, it's picking the wrong index.
The second is a lot less likely, but bears mentioning. You've not mentioned stored procedures in your text, but if this is in a stored proc, SQL could be using a cached (and very wrong) execution plan - look up 'parameter sniffing' for more explanation on this topic.
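Hedged sketches of the corresponding fixes in T-SQL (assuming the Bid table from the question; @Id_Bidder stands in for a stored procedure parameter):
-- 1) Refresh statistics so the optimizer's row estimates improve:
UPDATE STATISTICS dbo.Bid WITH FULLSCAN;

-- 2) If the query lives in a stored procedure, OPTION (RECOMPILE)
--    sidesteps a cached plan built for unrepresentative parameters:
SELECT B.Id_Bid, B.Lowest_Value, B.Highest_Value
FROM dbo.Bid B
WHERE B.Id_Bidder = @Id_Bidder  -- procedure parameter
OPTION (RECOMPILE);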

JOIN on concatenated column performance

I have a view that needs to join on a concatenated column. For example;
dbo.View1 INNER JOIN
dbo.table2 ON dbo.View1.combinedcode = dbo.table2.code
Inside 'View1' there is a column which is composed like so:
dbo.tableA.details + dbo.tableB.code AS combinedcode
Performing a join on this column is extremely slow. However the actual 'View1' runs extremely quickly. The poor performance comes with the join, and there aren't even many rows in any of the tables or views. Does anyone know why this might be?
Thanks for any insight!
Since there's no index on combinedcode, the JOIN will most likely result in a full "table scan" of the view to calculate the code for every row.
If you want to speed things up, try making the view into an indexed view with an index on combinedcode to help the join.
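A sketch of that route in SQL Server syntax (the base tables, join key, and uniqueness of id are assumptions, since the real view definition isn't shown; indexed views also carry a long list of restrictions such as SCHEMABINDING and two-part names):
CREATE VIEW dbo.View1_Indexed
WITH SCHEMABINDING
AS
SELECT a.id, a.details + b.code AS combinedcode
FROM dbo.tableA a
INNER JOIN dbo.tableB b ON b.id = a.id;
GO
-- The first index on a view must be unique and clustered:
CREATE UNIQUE CLUSTERED INDEX IX_View1Indexed_Id ON dbo.View1_Indexed (id);
-- Then the concatenated join column can be indexed directly:
CREATE INDEX IX_View1Indexed_CombinedCode ON dbo.View1_Indexed (combinedcode);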
Another alternative, depending on your SQL server version, is to (as Parado answers) create a temporary table for the join, although it's usually less performant, at least for single shot queries.
Try this way:
select *
into #TemTap
from View1
/*where conditions on view1*/
After that you could create an index on #TemTap.combinedcode, and then:
#TemTap AS View1 INNER JOIN dbo.table2 ON View1.combinedcode = dbo.table2.code
It often works for me.
The reason is because the optimizer has no information about the concatenated column, so it cannot choose a reasonable join path. My guess, if you look at the execution plan, is that the join is using a "nested loop" join. (I'm tempted to add "dreaded" to that.)
You might be able to fix this by putting an index on table2(code). The optimizer should decide to use this index, getting around the bad join optimization.
You can also use query hints to force the use of a "hash join" or "merge join". I am finding myself doing this more often for complex queries, where changes to the data might affect the query plan. (Such hints go in when a query that has been taking 2 minutes for a year decides to take hours, fill the temporary database, and die when it runs out of space.) You can do this by adding OPTION (merge join, hash join) to the end of the query. You can also explicitly choose the type of join in the join clause itself.
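For instance (T-SQL, reusing the names from the question):
-- Query-level hint restricting the optimizer to these join types:
SELECT v.combinedcode, t.code
FROM dbo.View1 v
INNER JOIN dbo.table2 t ON v.combinedcode = t.code
OPTION (HASH JOIN, MERGE JOIN);

-- Or pick the algorithm on the join itself:
SELECT v.combinedcode, t.code
FROM dbo.View1 v
INNER HASH JOIN dbo.table2 t ON v.combinedcode = t.code;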
Finally, storing the intermediate results in a temporary table (as proposed by Parado) should give the optimizer enough information to choose the best join algorithm.
Using SQL functions in a WHERE condition is not advised, and here you are (indirectly) using concatenation in the join condition, so it performs the concatenation for every row and then compares the result with the other table.
The solution would be to use an intermediate table, rather than this view, to hold the concatenated value.
If that's not possible, try using an indexed view - I know it's a hell of a task. I would still prefer creating an intermediate table.
see the link for indexed views
http://msdn.microsoft.com/en-us/library/ms191432.aspx#Restrictions

If you join two tables in the SELECT statement, can the indexes on the table columns no longer be used?

Let's say we have:
SELECT *
FROM Pictures
JOIN Categories ON Categories.CategoryId = Pictures.CategoryId
WHERE Pictures.UserId = @UserId
ORDER BY Pictures.UploadDate DESC
In this case, the database first joins the two tables and then works on the derived table, which I think would mean the indexes on the individual tables would be of no use, unless you can come up with an index that is bound to some column in the derived table?
You have a fundamental misunderstanding of how SQL works. The SQL language specifies what result set should be returned. It says nothing about how the database should achieve those results.
It is up to the database engine to parse the statement and come up with an execution plan (hopefully an efficient one) that will produce the correct results. Many modern relational databases have sophisticated query optimizers that completely pull apart the statement and derive execution plans that seem to have no relationship with the original query. (At least not to the untrained eye)
The execution plan for the same query can even change over time if the engine uses a cost based optimizer. A cost based optimizer makes decisions based on statistics that have been gathered about data and indexes. As the statistics change, the execution plan can also change.
With your simple query you assume that the database has to join the tables and create a temporary result set before it applies the where clause. That might be how you think about the problem, but the database is free to implement it entirely differently. I doubt there are many (if any) databases that would create a temporary result set for your simple query.
This is not to say that you cannot ever predict when an index may or may not be used. But it takes practice and experience to get a feel for how a database might execute a query.
This will join the tables, giving you all the category information where a picture's CategoryId is in the Categories table's CategoryId field (and no result for a particular Picture if there is no such category).
This query will likely return several rows of data. The indexes of either table will be useful no matter which table you would like to access.
Normally your program would loop through the result set.
CategoryId will give you the row in Categories with all the relevant fields in that category, and Picture.Id (assuming there is such a field) will give you a reference to that exact picture row in the database.
You can then manipulate either table by using the relevant index:
"UPDATE Categories SET .... WHERE CategoryId = " +
"UPDATE Pictures ..... WHERE PictureId =" +
or some such depending on your programming environment.
Whether indexes are used is up to the optimizer, and that depends on what is occurring in the query. For the query posted, there's nothing obvious to stop an index from being used. However, not all databases operate the same - MySQL only allows one index to be used per SELECT (check the query plan, because the optimizer might interpret the JOIN so that another index may be used).
The stuff that is likely to ensure that an index cannot be used is any function/operation that alters the column data, e.g. extracting the month out of a date, or wildcarding the left side of a LIKE clause...
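Illustrative examples (hypothetical tables, not from the question):
-- Non-sargable: the function and the leading wildcard defeat
-- indexes on OrderDate and LastName:
SELECT * FROM Orders WHERE MONTH(OrderDate) = 8;
SELECT * FROM Customers WHERE LastName LIKE '%son';

-- Sargable rewrites that leave the indexed column untouched:
SELECT * FROM Orders
WHERE OrderDate >= '2010-08-01' AND OrderDate < '2010-09-01';
SELECT * FROM Customers WHERE LastName LIKE 'John%';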

Is a JOIN more/less efficient than EXISTS IN when no data is needed from the second table?

I need to look up all households with orders. I don't care about the data of the order at all, just that it exists. (Using SQL Server)
Is it more efficient to say something like this:
SELECT HouseholdID, LastName, FirstName, Phone
FROM Households
INNER JOIN Orders ON Orders.HouseholdID = Households.HouseholdID
or this:
SELECT HouseholdID, LastName, FirstName, Phone
FROM Households
WHERE EXISTS
(SELECT HouseholdID
FROM Orders
WHERE Orders.HouseholdID = Households.HouseholdID)
Unless this is a fairly rigid 1:1 relationship (which doesn't seem to make much sense given the wider meaning of households and orders), your queries will return different results (if there are multiple matching rows in the Orders table).
On Oracle (and most DBMSs), I would expect the EXISTS version to run significantly faster, since it only needs to find one row in Orders for the Households record to qualify.
Regardless of the DBMS, I would expect the explain plan to show the difference (provided the tables are sufficiently large that the query would not be resolved by full table scans).
Have you tried testing it? Allowing for caching?
C.
The two queries are not equivalent. The first one will return multiple results if there are multiple joining records. The EXISTS will likely be more efficient though, especially if there is not a trusted FK constraint that the optimiser can use.
For further details on this last point see point 9 here http://www.simple-talk.com/sql/t-sql-programming/13-things-you-should-know-about-statistics-and-the-query-optimizer/
It depends on the database engine and how efficient it is at optimizing queries. A good, mature database optimizer will make EXISTS faster; others will not. I know that SQL Server can make the query faster; I'm not sure about others.
As was said earlier, your queries will return different result sets if at least one household has more than one order.
You could work around this by using DISTINCT, but EXISTS (or IN) is more efficient.
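With the tables from the question, the JOIN version would need a DISTINCT to collapse households that have several orders (a sketch; the aliases are mine):
SELECT DISTINCT h.HouseholdID, h.LastName, h.FirstName, h.Phone
FROM Households h
INNER JOIN Orders o ON o.HouseholdID = h.HouseholdID;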
See this article:
IN vs. JOIN vs. EXISTS
For such a trivial query, it'll be no surprise if the execution of both variants boils down to a single form that the system deems most performant. Check out the query execution plan to find out.
In Postgres, EXISTS would be faster than an inner join.