What columns to index for a JOIN with WHERE - SQL

Assume you have a JOIN with a WHERE:
SELECT *
FROM partners
JOIN orders
ON partners.partner_id = orders.partner_id
WHERE orders.date
BETWEEN 20140401 AND 20140501
1) An index on partner_id in both tables will speed up the JOIN, right?
2) An index on orders.date will speed up the WHERE clause?
3) But as far as I know, one SELECT cannot use more than one index. So which one will be used?

This is your query, with the quoting fixed (and assuming orders.date is really a date type):
SELECT *
FROM partners JOIN orders
ON partners.partner_id = orders.partner_id
WHERE orders.date BETWEEN '2014-04-01' AND '2014-05-01';
For an inner join, there are basically two execution strategies. The engine can start with the partners table and find all matches in orders. Or it can start with orders and find all matches in partners. (There are then different algorithms that can be used.)
For the first approach, the only index that would help is orders(partner_id, orderdate). For the second approach, the best index is orders(orderdate, partner_id). Note that these are not equivalent.
In most scenarios like this, I would expect the orders table to be larger and the filtering to be important. That would suggest that the best execution plan is to start with the orders table and filter it first, using the second option.
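To make that concrete, here is a sketch of the two index definitions (the index names are hypothetical, and orderdate stands in for the date column in the question):
-- Supports starting from partners and probing orders by partner
CREATE INDEX idx_orders_partner_date ON orders (partner_id, orderdate);
-- Supports filtering orders by date first, then joining to partners
CREATE INDEX idx_orders_date_partner ON orders (orderdate, partner_id);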

To start, an index is used for an operator, not for the SELECT statement as a whole. Therefore one index can be used for reading data from the partners table and another index can be used to get data from the orders table.
I think the best strategy in this case would be a clustered index on partners.partner_id and a non-clustered index on orders.partner_id and orders.date.
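A sketch of that suggestion in SQL Server syntax (index names are hypothetical):
CREATE CLUSTERED INDEX cix_partners_partner_id ON partners (partner_id);
CREATE NONCLUSTERED INDEX ix_orders_partner_date ON orders (partner_id, date);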

Consider this sample case:
SELECT *
FROM [dbo].[LUEducation] E
JOIN LUCitizen C On C.skCitizen = E.skCitizen
WHERE C.skCitizen <= 100
AND E.skSchool = 26069
Execution plan (screenshot omitted):
As the plan shows, the SQL engine uses more than one index at a time.

Without knowing which DBMS you are using it's difficult to know what execution plan the optimizer is going to choose.
Here's a typical one:
1) Do a range scan on orders.date, using a sorted index for that purpose.
2) Do a loop join on the results, doing one lookup on partners.partner_id for each entry, using the index on that field.
In this plan, an index on orders.partner_id will not be used.
However, if the WHERE clause were not there, you might see an execution plan that
does a merge join using the indexes on partners.partner_id and
orders.partner_id.
This terminology may be confusing, because the documentation for your DBMS may use different terms.
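For illustration, the indexes those two plans rely on might be declared like this (names are hypothetical):
-- Sorted index for the range scan on the WHERE clause
CREATE INDEX idx_orders_date ON orders (date);
-- Lookup index for the loop join (and one side of a potential merge join)
CREATE INDEX idx_partners_partner_id ON partners (partner_id);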

One SELECT can generally use only one index per table (index merge is an exception).
You pointed out the right indexes in your question.
You don't really need an index on orders.partner_id for this query,
but it is necessary for foreign key constraints and for joins in the other direction.
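For example, in MySQL/InnoDB a foreign key column must be indexed (InnoDB creates an index automatically if none exists); a minimal sketch, with hypothetical constraint and index names:
CREATE INDEX idx_orders_partner_id ON orders (partner_id);
ALTER TABLE orders
  ADD CONSTRAINT fk_orders_partner
  FOREIGN KEY (partner_id) REFERENCES partners (partner_id);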

Related

Do I need to filter all subqueries before joining very large tables for optimization

I have three tables that I'm trying to join with over a billion rows per table. Each table has an index on the billing date column. If I just filter the left table and do the joins, will the query run efficiently or do I need to put the same date filter in each subquery?
I.e., will the first query run much slower than the second?
select item, billing_dollars, IC_billing_dollars
from billing
left join IC_billing on billing.invoice_number = IC_billing.invoice_number
where billing.date = '2019-09-24'
select item, billing_dollars, IC_billing_dollars
from billing
left join (select * from IC_billing where date = '2019-09-24') IC_billing on billing.invoice_number = IC_billing.invoice_number
where billing.date = '2019-09-24'
I don't want to run this without knowing whether the query will perform well, as there aren't many safeguards against poorly performing queries. Also, if I need to write the query the second way, is there a way to have the date filter in only one location rather than having it show up multiple times in the query?
That depends.
Consider your query:
select b.item, b.billing_dollars, icb.IC_billing_dollars
from billing b left join
IC_billing icb
on b.invoice_number = icb.invoice_number
where b.date = '2019-09-24';
(Assuming I have the columns coming from the correct tables.)
The optimal strategy is an index on billing(date, invoice_number) -- perhaps also adding item and billing_dollars to the index -- and on ic_billing(invoice_number) -- perhaps with IC_billing_dollars.
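A sketch of those indexes (names are hypothetical; the trailing covering columns are optional):
create index ix_billing_date_invoice on billing (date, invoice_number, item, billing_dollars);
create index ix_ic_billing_invoice on IC_billing (invoice_number, IC_billing_dollars);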
I can think of two situations where filtering on date in ic_billing would be useful.
First, if there is an index on (invoice_date, invoice_number), particularly a primary key definition. Then using this index is usually preferred, even if another index is available.
Second, if ic_billing is partitioned by invoice_date. In that case, you will want to specify the partition for performance.
Generally, though, the additional restriction on the invoice date does not help. In some databases, it might even hurt performance (particularly if the subquery is materialized and the outer query does not use an appropriate index).
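If you do add the date restriction to ic_billing (say, to get partition pruning), note that with a LEFT JOIN it belongs in the ON clause rather than the WHERE clause, so unmatched billing rows are preserved; a sketch:
select b.item, b.billing_dollars, icb.IC_billing_dollars
from billing b left join
     IC_billing icb
     on b.invoice_number = icb.invoice_number and
        icb.date = '2019-09-24'
where b.date = '2019-09-24';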

Query not using indexes when using a table function

I have following query:
select i.pkey as instrument_pkey,
p.asof,
p.price,
p.lastprice as lastprice,
p.settlementprice as settlement_price,
p.snaptime,
p.owner as source_id,
i.type as instrument_type
from quotes_mxsequities p,
instruments i,
(select instrument, maxbackdays
from TABLE(cast (:H_ARRAY as R_TAB))) lbd
where p.asof between :ASOF - lbd.maxbackdays and :ASOF
and p.instrument = lbd.instrument
and p.owner = :SOURCE_ID
and p.instrument = i.pkey
Since I started using the table function, the query has been doing a full table scan on quotes_mxsequities, which is a large table.
Earlier, when I used an IN clause instead of the table function, the index was being used.
Any suggestion on how to enforce index usage?
EDIT:
I will try to get an explain plan, but just to add: H_ARRAY is expected to have around 10k entries. quotes_mxsequities is a large table with millions of rows. Instruments is also a large table but has fewer rows than quotes_mxsequities.
The full table scan is happening on quotes_mxsequities, while instruments is using an index.
It is quite difficult to answer with no explain plan and no information about table structure, number of rows, etc.
As a general, simplified approach, you could try to force the use of an index with the INDEX hint.
Your problem could also be due to a wrong join order; you can try to make Oracle follow the right order (I suppose LBD first) with the LEADING hint.
Another point could be the full access where you probably need a NESTED LOOP; in this case you can try the USE_NL hint.
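A sketch of how those hints might be combined (the index name here is hypothetical, and the right order and index depend on your actual plan):
select /*+ LEADING(lbd p) USE_NL(p) INDEX(p ix_quotes_instrument) */ i.pkey as instrument_pkey,
Verify the effect with an explain plan before relying on it.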
It's hard to be sure from the limited information provided, but it looks like this is an issue with the optimiser not being able to establish the cardinality of the table collection expression, since its contents aren't known at parse time. With a stored nested table the statistics would be available, but here there are none for it to use.
Without that information the optimiser defaults to guessing your table collection will have 8K entries, and uses that as the cardinality estimate; if that is a significant proportion of the number of rows in quotes_mxsequities then it will decide the index isn't going to be efficient, and will use a full table scan.
You can use the undocumented cardinality hint to tell the optimiser roughly how many elements you actually expect in the collection; you presumably won't know exactly, but you might know you usually expect around 10. So you could add a hint:
select /*+ CARDINALITY(lbd, 10) */ i.pkey as instrument_pkey,
You may also find the dynamic sampling hint useful here, but without your real data to look at it's hard to say; the cardinality hint is simple, applies to the basic execution plan, and makes it easy to see its effect.
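For reference, the dynamic sampling hint takes a table alias and a level (the level here is illustrative; higher levels sample more):
select /*+ DYNAMIC_SAMPLING(lbd 5) */ i.pkey as instrument_pkey,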
Incidentally, you don't need the subquery around the table expression; you can simplify slightly to:
from TABLE(cast (:H_ARRAY as R_TAB)) lbd,
quotes_mxsequities p,
instruments i
or, even better, use modern join syntax:
select /*+ CARDINALITY(lbd, 10) */ i.pkey as instrument_pkey,
p.asof,
p.price,
p.lastprice as lastprice,
p.settlementprice as settlement_price,
p.snaptime,
p.owner as source_id,
i.type as instrument_type
from TABLE(cast (:H_ARRAY as R_TAB)) lbd
join quotes_mxsequities p
on p.asof between :ASOF - lbd.maxbackdays and :ASOF
and p.instrument = lbd.instrument
join instruments i
on i.pkey = p.instrument
where p.owner = :SOURCE_ID;

How to avoid a full table scan in SQL

SELECT
DISTINCT P.IDENTIFICATIONNUM IDNUMBER,
P.NAME NAME,
P.NATIONALITY NATIONALITY,
O.NAME COMPANY
FROM APPLICANT_TB P
LEFT JOIN APP_TB A ON A.APPLICANTID=P.APPLICANTID
LEFT JOIN ORGANISATION_TB O ON O.ORGID = A.ORGID
As the SQL code shows, I am using IBM DB2, and according to the explain plan all three tables get full table scans. Can someone tell me how to avoid this? (All the PKs used are indexed.)
Be more selective with the records you want. Include a WHERE clause.
Since you are selecting all the rows, the most efficient way to return them is to do table scans.
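For example, a filter on the indexed key turns the scan of APPLICANT_TB into an index-assisted lookup (the value 12345 is made up for illustration):
SELECT DISTINCT P.IDENTIFICATIONNUM IDNUMBER, P.NAME NAME, P.NATIONALITY NATIONALITY, O.NAME COMPANY
FROM APPLICANT_TB P
LEFT JOIN APP_TB A ON A.APPLICANTID = P.APPLICANTID
LEFT JOIN ORGANISATION_TB O ON O.ORGID = A.ORGID
WHERE P.APPLICANTID = 12345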
Even if you add filters, you may still get table scans; that will depend on how well your indexes match the columns you are filtering on, and on how up-to-date the database statistics are.
In general the query optimizer will estimate the percentage of the table it needs based on your filter. Once that percentage goes over a certain portion of the table (surprisingly small, often around 20%), it will choose a table scan as the "best" way to get the data you are asking for.

Join on columns with different indexes

I am using a query that joins columns where one has a clustered index and the other has a non-clustered index. The query is taking a long time. Is the reason that I am using different types of indexes?
SELECT @NoOfOldBills = COUNT(*)
FROM Billing_Detail D, Meter_Info m, Meter_Reading mr
WHERE D.Meter_Reading_ID = mr.id
AND m.id = mr.Meter_Info_ID
AND m.id = @Meter_Info_ID
AND BillType = 'Meter'
IF (@NoOfOldBills > 0) BEGIN
SELECT TOP 1 @PReadingDate = Bill_Date
FROM Billing_Detail D, Meter_Info m, Meter_Reading mr
WHERE D.Meter_Reading_ID = mr.id
AND m.id = mr.Meter_Info_ID
AND m.id = @Meter_Info_ID
AND BillType = 'Meter'
ORDER BY Bill_Date DESC
END
Without knowing more details and the context, it's tricky to advise - but it looks like you are trying to find out the date of the oldest bill. You could probably rewrite those two queries as one, which would improve the performance significantly (assuming that there are some old bills).
I would suggest something like this - which in addition to probably performing better, is a little easier to read!
SELECT COUNT(*) NoOfOldBills, MAX(d.Bill_Date) OldestBillDate
FROM Billing_Detail d
INNER JOIN Meter_Reading mr ON mr.id = d.Meter_Reading_ID
INNER JOIN Meter_Info m ON m.Id = mr.Meter_Info_ID
WHERE m.id = @Meter_Info_ID AND BillType = 'Meter'
The reason is not because you have different types of indices. Since you said you have clustered indices on all primary keys, you should be fine there. To support this query, you would also need an index with two columns on BillType and Bill_Date to cut down the time.
Hopefully they are in the same table; otherwise a single two-column index won't be possible and you may need separate indexes instead.
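A sketch of the suggested index, assuming both columns live in Billing_Detail (the index name is hypothetical):
CREATE NONCLUSTERED INDEX IX_Billing_Detail_BillType_Bill_Date
ON Billing_Detail (BillType, Bill_Date);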
OK, there are a number of things here. First, are the indexes you have relevant (or as relevant as they can be)? There should be a clustered index on each table - a non-clustered index does not work effectively if the table does not have a clustered index, which non-clustered indexes use to identify rows.
Does your index cover only one column? With SQL Server indexes, the order of the columns is very important. The leftmost column of the index should (in general) be the column with the highest selectivity (the one that divides the data into the smallest groups). Does either of the indexes cover all the columns referred to in the query? This is known as a covering index (Google this for more info).
In general indexes should be wide: if SQL Server has an index on col1, col2, col3, col4 and another on col1, col2, the latter is redundant, as the information in the second index is fully contained in the first and SQL Server understands this.
Are your statistics up to date? SQL Server can and will choose a bad execution plan if the statistics are not up to date. What does the query analyser show for the execution plan (SSMS: Query | Include Actual Execution Plan)?
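If the statistics turn out to be stale, refreshing them is cheap to try (standard SQL Server commands):
UPDATE STATISTICS Billing_Detail;
-- or refresh statistics for every table in the database:
EXEC sp_updatestats;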

SQL Server search filter and order by performance issues

We have a table-valued function that returns a list of people you may access, and we have a relation between a search and a person called a search result.
What we want to do is select all the people from the search and present them.
The query looks like this
SELECT qm.PersonID, p.FullName
FROM QueryMembership qm
INNER JOIN dbo.GetPersonAccess(1) ON GetPersonAccess.PersonID = qm.PersonID
INNER JOIN Person p ON p.PersonID = qm.PersonID
WHERE qm.QueryID = 1234
There are only 25 rows with QueryID=1234 but there are almost 5 million rows total in the QueryMembership table. The person table has about 40K people in it.
QueryID is not a PK, but it is indexed. The query plan tells me 97% of the total cost is spent doing "Key Lookup" with the seek predicate
QueryMembershipID = Scalar Operator (QueryMembership.QueryMembershipID as QM.QueryMembershipID)
Why is the PK in there when it's not used in the query at all? And why is it taking so long?
The total number of people is 25. With the index, this should be a scan of the QueryMembership rows that have QueryID = 1234 and then a JOIN against the 25 people that exist in the table-valued function's result - which, by the way, only has to be evaluated once and completes in less than 1 second.
If you want to avoid the "key lookup", use a covering index:
create index ix_QueryMembership_NameHere on QueryMembership (QueryID)
include (PersonID);
Add any other columns you select to the INCLUDE list.
As for why the PK's "key lookup" is working so slowly, try DBCC FREEPROCCACHE, ALTER INDEX ALL ON QueryMembership REBUILD, and ALTER INDEX ALL ON QueryMembership REORGANIZE.
This may help if your PK's index is disabled, or the cache keeps a wrong plan.
You should define indexes on the tables you query. In particular on columns referenced in the WHERE and ORDER BY clauses.
Use the Database Tuning Advisor to see what SQL Server recommends.
For specifics, of course you would need to post your query and table design.
But I have to make a couple of points here:
You've already jumped to the conclusion that the slowness is a result of the ORDER BY clause. I doubt it. The real test is whether or not removing the ORDER BY speeds up the query, which you haven't done. Dollars to donuts, it won't make a difference.
You only get the "log n" in your big-O claim when the optimizer actually chooses to use the index you defined. That may not be happening because your index may not be selective enough. The thing that makes your temp-table solution faster than the optimizer's solution is that you know something about the subset of data being returned that the optimizer does not (specifically, that it is a really small subset of data). If your indexes are not selective enough for your query, the optimizer can't always reasonably assume this, and it will choose a plan that avoids what it thinks could be a worst-case scenario of tons of index lookups, followed by tons of seeks and then a big sort. Oftentimes it chooses to scan and hash instead. So what you did with the temp table is often a way to solve this problem. Often you can narrow down your indexes or create an indexed view on the subset of data you want to work against. It all depends on the specifics of your query.
You need indexes on your WHERE and ORDER BY clauses. I am not an expert, but I would bet it is doing a table scan for each row. Since your speed issue is resolved by removing the INNER JOIN or the ORDER BY, I bet the issue is specifically with the join. I bet it is doing the table scan on your joined table because of the sort. By putting an index on the columns in your WHERE clause first, you will be able to see if that is in fact the case.
Have you tried restructuring the query into a CTE to separate the TVF call? So, something like:
With QueryMembershipPerson As
(
Select QM.PersonId, P.Fullname
From QueryMembership As qm
Join Person As P
On P.PersonId = QM.PersonId
Where QM.QueryId = 1234
)
Select PersonId, Fullname
From QueryMembershipPerson As QMP
Join dbo.GetPersonAccess(1) As PA
On PA.PersonId = QMP.PersonId
EDIT: Btw, I'm assuming that there is an index on PersonId in both the QueryMembership and the Person table.
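Those assumed indexes might look like this (names are hypothetical; on Person, PersonId is likely already the clustered primary key):
Create Index IX_QueryMembership_PersonId On QueryMembership (PersonId);
Create Index IX_Person_PersonId On Person (PersonId);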
EDIT: What about two common table expressions, like so:
With
QueryMembershipPerson As
(
Select QM.PersonId, P.Fullname
From QueryMembership As qm
Join Person As P
On P.PersonId = QM.PersonId
Where QM.QueryId = 1234
)
, PersonAccess As
(
Select PersonId
From dbo.GetPersonAccess(1)
)
Select PersonId, Fullname
From QueryMembershipPerson As QMP
Join PersonAccess As PA
On PA.PersonId = QMP.PersonId
Yet another solution would be a derived table like so:
Select ...
From (
Select QM.PersonId, P.Fullname
From QueryMembership As qm
Join Person As P
On P.PersonId = QM.PersonId
Where QM.QueryId = 1234
) As QueryMembershipPerson
Join dbo.GetPersonAccess(1) As PA
On PA.PersonId = QueryMembershipPerson.PersonId
If pushing some of the query into a temp table and then joining on that works, I'd be surprised that you couldn't combine that concept into a CTE or a query with a derived table.