How to avoid a full table scan in SQL

SELECT DISTINCT
    P.IDENTIFICATIONNUM IDNUMBER,
    P.NAME NAME,
    P.NATIONALITY NATIONALITY,
    O.NAME COMPANY
FROM APPLICANT_TB P
LEFT JOIN APP_TB A ON A.APPLICANTID = P.APPLICANTID
LEFT JOIN ORGANISATION_TB O ON O.ORGID = A.ORGID
As the SQL code shows, I am using IBM DB2, and according to the explain plan all three tables are read with full table scans. Can someone tell me how to avoid this? (All the PKs used are indexed.)

Be more selective with the records you want. Include a WHERE clause.

Since you are selecting all the rows, the most efficient way to return them to you is to do table scans.
Even if you add filters you may still get table scans; that will depend on how well your indexes match the columns you are filtering on, and how up to date the database statistics are.
In general the query optimizer estimates the percentage of the table it needs based on your filter. Once that percentage goes over a certain portion of the table (surprisingly small, around 20%), it will choose a table scan as the "best" way to get the data you are asking for.
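For illustration, here is a sketch of the original query with a selective filter added. The NATIONALITY predicate, its value, and the assumption that the column is indexed are hypothetical, not from the question; any selective, indexed predicate would serve:
SELECT DISTINCT
    P.IDENTIFICATIONNUM IDNUMBER,
    P.NAME NAME,
    P.NATIONALITY NATIONALITY,
    O.NAME COMPANY
FROM APPLICANT_TB P
LEFT JOIN APP_TB A ON A.APPLICANTID = P.APPLICANTID
LEFT JOIN ORGANISATION_TB O ON O.ORGID = A.ORGID
WHERE P.NATIONALITY = 'SG'   -- hypothetical selective filter on the left table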

Related

Do I need to filter all subqueries before joining very large tables for optimization

I have three tables that I'm trying to join with over a billion rows per table. Each table has an index on the billing date column. If I just filter the left table and do the joins, will the query run efficiently or do I need to put the same date filter in each subquery?
That is, will the first query run much slower than the second one?
select item, billing_dollars, IC_billing_dollars
from billing
left join IC_billing on billing.invoice_number = IC_billing.invoice_number
where billing.date = '2019-09-24'
select item, billing_dollars, IC_billing_dollars
from billing
left join (select * from IC_billing where IC_billing.date = '2019-09-24') IC_billing
    on billing.invoice_number = IC_billing.invoice_number
where billing.date = '2019-09-24'
I don't want to run this without knowing whether the query will perform well as there aren't many safeguards for poor performing queries. Also, if I need to write the query in the second way, is there a way to only have the date filter in one location rather than having it show up multiple times in the query?
That depends.
Consider your query:
select b.item, b.billing_dollars, icb.IC_billing_dollars
from billing b left join
IC_billing icb
on b.invoice_number = icb.invoice_number
where b.date = '2019-09-24';
(Assuming I have the columns coming from the correct tables.)
The optimal strategy is an index on billing(date, invoice_number) -- perhaps also adding item and billing_dollars to the index -- and on ic_billing(invoice_number) -- perhaps with IC_billing_dollars.
I can think of two situations where filtering on date in ic_billing would be useful.
First, if there is an index on (invoice_date, invoice_number), particularly a primary key definition. Then using this index is usually preferred, even if another index is available.
Second, if ic_billing is partitioned by invoice_date. In that case, you will want to specify the partition for performance.
Generally, though, the additional restriction on the invoice date does not help. In some databases, it might even hurt performance (particularly if the subquery is materialized and the outer query does not use an appropriate index).
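As for the asker's side question about writing the date filter in only one location, one sketch is to define it once in a single-row CTE. The params name is invented here, and the CAST syntax may need adjusting for your DBMS:
with params as (
    select cast('2019-09-24' as date) as billing_date
)
select b.item, b.billing_dollars, icb.IC_billing_dollars
from params p
join billing b
    on b.date = p.billing_date
left join IC_billing icb
    on icb.invoice_number = b.invoice_number
   and icb.date = p.billing_date;
Putting the icb.date test in the ON clause keeps the left join's semantics (it filters only the right side) while still stating the date just once.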

Left join or Select in select (SQL - Speed of query)

I have something like this:
SELECT CompanyId
FROM Company
WHERE CompanyId NOT IN
    (SELECT CompanyId
     FROM Company
     WHERE (IsPublic = 0) AND CompanyId NOT IN
         (SELECT ShoppingLike.WhichId
          FROM Company
          INNER JOIN ShoppingLike ON Company.CompanyId = ShoppingLike.UserId
          WHERE (ShoppingLike.IsWaiting = 0) AND
                (ShoppingLike.ShoppingScoreTypeId = 2) AND
                (ShoppingLike.UserId = 75)
         )
    )
It has three SELECTs. I want to know how I could write it without three SELECTs, and which form is faster for a million records: "select in select" or "left join"?
My experience is from Oracle. There is never a single correct answer to optimising tricky queries; it's a collaboration between you and the optimiser. You need to check explain plans and sometimes traces, often at each stage of writing the query, to find out what the optimiser is thinking. Having said that:
You could remove the outer SELECT by putting the entire contents of its subquery's WHERE clause inside a NOT(...). On the face of it, this will prevent the outer full scan of Company (or of its index on CompanyId). Try it, check that the output is the same and get timings, then remove it temporarily before trying the suggestions below. The NOT() may well cause the optimiser to stop considering an ANTI-JOIN against the ShoppingLike subquery, due to an implicit OR being created.
Ensure that CompanyId and WhichId are defined as NOT NULL columns. Without this (or the likes of an explicit CompanyId IS NOT NULL), ANTI-JOIN options are often discarded.
The innermost subquery is not correlated (it does not reference anything from its outer query), so it can be extracted and tuned separately. As a matter of style I'd swap the table names around the INNER JOIN, as you want ShoppingLike scanned first since it has all the filters against it. It won't make any difference to the result, but it reads more easily and makes it possible to use a hint to scan the tables in the order specified. I would even question the need for the Company table in this subquery.
You've used NOT IN when sometimes the very similar NOT EXISTS gives the optimiser more/alternative options.
All the above is just trial and error unless you start examining the explain plans. Oracle can, with a following wind, convert between LEFT JOIN and IN SELECT on its own. With 1M+ rows, the time invested will pay off.
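Putting the NOT(...) and NOT EXISTS suggestions together, here is a sketch of the rewritten query. It is untested; it assumes CompanyId and WhichId are NOT NULL (NOT IN and NOT EXISTS differ when NULLs are present), and it drops the Company table from the innermost subquery, as questioned above. Verify it returns the same rows before trusting it:
SELECT c.CompanyId
FROM Company c
WHERE NOT (c.IsPublic = 0
           AND NOT EXISTS (SELECT 1
                           FROM ShoppingLike sl
                           WHERE sl.UserId = 75
                             AND sl.IsWaiting = 0
                             AND sl.ShoppingScoreTypeId = 2
                             AND sl.WhichId = c.CompanyId))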

What columns to index for a JOIN with WHERE

Assume you have a JOIN with a WHERE:
SELECT *
FROM partners
JOIN orders
ON partners.partner_id = orders.partner_id
WHERE orders.date
BETWEEN 20140401 AND 20140501
1) An index on partner_id in both tables will speed up the JOIN, right?
2) An index on orders.date will speed up the WHERE clause?
3) But as far as I know, one SELECT can not use more than one index. So which one will be used?
This is your query, with the quoting fixed (and assuming orders.date is really a date type):
SELECT *
FROM partners JOIN
orders
ON partners.partner_id = orders.partner_id
WHERE orders.date BETWEEN '2014-04-01' AND '2014-05-01';
For an inner join, there are basically two execution strategies. The engine can start with the partners table and find all matches in orders, or it can start with orders and find all matches in partners. (There are then different algorithms that can be used.)
For the first approach, the only index that would help is orders(partner_id, date). For the second approach, the best index is orders(date, partner_id). Note that these are not equivalent.
In most scenarios like this, I would expect the orders table to be larger and the filtering to be important. That would suggest that the best execution plan is to start with the orders table and filter it first, using the second option.
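In generic DDL, the two alternatives described above would look something like this; the index names are invented for illustration:
-- for the strategy that starts with partners and probes orders
CREATE INDEX ix_orders_partner_date ON orders (partner_id, date);
-- for the strategy that range-scans orders by date first
CREATE INDEX ix_orders_date_partner ON orders (date, partner_id);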
To start, an index is used for an operator, not for the SELECT statement as a whole. Therefore one index can be used for reading data from the partners table and another index for getting data from the orders table.
I think that the best strategy in this case would be to have a clustered index on partners.partner_id and one non-clustered index on orders(partner_id, date).
Consider this sample case:
SELECT *
FROM [dbo].[LUEducation] E
JOIN LUCitizen C On C.skCitizen = E.skCitizen
WHERE C.skCitizen <= 100
AND E.skSchool = 26069
The execution plan (screenshot not reproduced here) shows that the SQL engine uses more than one index at a time.
Without knowing which DBMS you are using it's difficult to know what execution plan the optimizer is going to choose.
Here's a typical one:
1) Do a range scan on orders.date, using a sorted index for that purpose.
2) Do a loop join on the results, doing one lookup on partners.partner_id for each entry, using the index on that field.
In this plan, an index on orders.partner_id will not be used.
However, if the WHERE clause were not there, you might see an execution plan that does a merge join using the indexes on partners.partner_id and orders.partner_id.
This terminology may be confusing, because the documentation for your DBMS may use different terms.
One select can only use one index per table (index-merge is an exception).
You pointed out the right indexes in your question.
You don't really need an index on orders.partner_id for this query, but it is necessary for the foreign key constraint and for joins in the other direction.
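A minimal sketch of that split in generic DDL; the index names are invented:
CREATE INDEX ix_orders_date ON orders (date);           -- drives the WHERE range scan
CREATE INDEX ix_partners_id ON partners (partner_id);   -- one lookup per matching order row (skip if it is already the PK)
CREATE INDEX ix_orders_partner ON orders (partner_id);  -- unused here, but supports the FK and reverse joins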

Oracle optimization -- weird execution plan to left join an uncorrelated subquery

I wrote this query to tie product forecast data to historical shipment data in an Oracle star schema database, and the optimizer did not behave in the way that I expected, so I am kind of curious as to what is going on.
Essentially, I have a bunch of dimension tables that will be consistent for both the forecast and the sales fact tables but the fact tables are aggregated at a different level, so I set them up as two subqueries and roll them up so I can tie them together (query example below.) In this case, I want all of the forecast data but only the sales data that matches.
The odd thing is that if I use either of the subqueries by themselves, they each seem to behave the way I would expect and each returns in less than a second (using the same filters -- I tested by just removing one or the other subquery and changing the alias).
Here is an example of the query structure -- I kept it as generic as I could, so there may be a few typos from changing it:
SELECT
TIME_DIMENSION.GREGORIAN_DATE,
SOURCE_DIMENSION.LOCATION_CODE,
DESTINATION_DIMENSION.REGION,
PRODUCT_DIMENSION.PRODUCT_CODE,
SUM(NVL(FIRST_SUBQUERY.VALUE,0)) VALUE1,
SUM(NVL(SECOND_SUBQUERY.VALUE,0)) VALUE2
FROM
TIME_DIMENSION,
LOCATION_DIMENSION SOURCE_DIMENSION,
LOCATION_DIMENSION DESTINATION_DIMENSION,
PRODUCT_DIMENSION,
(SELECT
FORECAST_FACT.TIME_KEY,
FORECAST_FACT.SOURCE_KEY,
FORECAST_FACT.DESTINATION_KEY,
FORECAST_FACT.PRODUCT_KEY,
SUM(FORECAST_FACT.VALUE) AS VALUE
FROM FORECAST_FACT
WHERE [FORECAST_FACT FILTERS HERE]
GROUP BY
FORECAST_FACT.TIME_KEY,
FORECAST_FACT.SOURCE_KEY,
FORECAST_FACT.DESTINATION_KEY,
FORECAST_FACT.PRODUCT_KEY) FIRST_SUBQUERY
LEFT JOIN
(SELECT
--This is just as an example offset
(LAST_YEAR_FACT.TIME_KEY + 52) TIME_KEY,
LAST_YEAR_FACT.SOURCE_KEY,
LAST_YEAR_FACT.DESTINATION_KEY,
LAST_YEAR_FACT.PRODUCT_KEY,
SUM(LAST_YEAR_FACT.VALUE) AS VALUE
FROM LAST_YEAR_FACT
WHERE [LAST_YEAR_FACT FILTERS HERE]
GROUP BY
LAST_YEAR_FACT.TIME_KEY,
LAST_YEAR_FACT.SOURCE_KEY,
LAST_YEAR_FACT.DESTINATION_KEY,
LAST_YEAR_FACT.PRODUCT_KEY) SECOND_SUBQUERY
ON
FIRST_SUBQUERY.TIME_KEY = SECOND_SUBQUERY.TIME_KEY
AND FIRST_SUBQUERY.SOURCE_KEY = SECOND_SUBQUERY.SOURCE_KEY
AND FIRST_SUBQUERY.DESTINATION_KEY = SECOND_SUBQUERY.DESTINATION_KEY
--I also tried to tie the last_year subquery to the dimension tables here
WHERE
FIRST_SUBQUERY.TIME_KEY = TIME_DIMENSION.TIME_KEY
AND FIRST_SUBQUERY.SOURCE_KEY = SOURCE_DIMENSION.LOCATION_KEY
AND FIRST_SUBQUERY.DESTINATION_KEY = DESTINATION_DIMENSION.LOCATION_KEY
AND FIRST_SUBQUERY.PRODUCT_KEY = PRODUCT_DIMENSION.PRODUCT_KEY
--I also tried, separately, to tie the last_year subquery to the dimension tables here
AND TIME_DIMENSION.WEEK = 'VALUE'
AND SOURCE_DIMENSION.SOURCE_CODE = 'VALUE'
AND DESTINATION_DIMENSION.REGION IN ('VALUE', 'VALUE')
AND PRODUCT_DIMENSION.CLASS_CODE = 'VALUE'
GROUP BY
TIME_DIMENSION.GREGORIAN_DATE,
SOURCE_DIMENSION.LOCATION_CODE,
DESTINATION_DIMENSION.REGION,
PRODUCT_DIMENSION.PRODUCT_CODE
Essentially, when I run either subquery independently, it utilizes the indexes and searches only a specific range of a specific partition, whereas with the left join it always does a full table scan on one of the fact tables. What seems to be happening is that Oracle applies the dimension table filters only to the first subquery, and thus, to do the left join, it first needs to scan the entire sales table, even if I explicitly tie and filter the values twice instead of relying on the implicit filtering (I tried that). Am I thinking about this wrong? To me, the optimizer should use the indexes on both fact tables to filter each by the values in the WHERE clause and then left join the resulting subsets.
I realize that I could simply add the filters to each of the subqueries, or set this up as a union of two independent queries, but I am curious as to what exactly is going on in terms of the optimization engine -- I can post the execution plan, if that would help.
Thanks!
Be sure that the tables have all been analysed; then do it again. The optimizer uses those statistics when calculating its execution plan. In cases where Oracle really chooses a wrong plan, your workaround is to force the optimizer with hints /*+ ... */, specifying the use of indexes, the join order, etc.
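A minimal sketch of both suggestions; this is SQL*Plus syntax, and the schema, table, and index names are examples rather than values from the post:
-- refresh optimizer statistics on the fact tables
EXEC DBMS_STATS.GATHER_TABLE_STATS(ownname => 'STAR', tabname => 'FORECAST_FACT')
EXEC DBMS_STATS.GATHER_TABLE_STATS(ownname => 'STAR', tabname => 'LAST_YEAR_FACT')
-- if the plan is still wrong, force the access path with hints, e.g.
SELECT /*+ INDEX(f FORECAST_FACT_IDX) LEADING(f) */ ...
FROM FORECAST_FACT f ...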

SQL Server search filter and order by performance issues

We have a table value function that returns a list of people you may access, and we have a relation between a search and a person called search result.
What we want to do is select all the people from the search and present them.
The query looks like this:
SELECT qm.PersonID, p.FullName
FROM QueryMembership qm
INNER JOIN dbo.GetPersonAccess(1) ON GetPersonAccess.PersonID = qm.PersonID
INNER JOIN Person p ON p.PersonID = qm.PersonID
WHERE qm.QueryID = 1234
There are only 25 rows with QueryID=1234 but there are almost 5 million rows total in the QueryMembership table. The person table has about 40K people in it.
QueryID is not a PK, but it is indexed. The query plan tells me 97% of the total cost is spent doing a "Key Lookup" with the seek predicate
QueryMembershipID = Scalar Operator (QueryMembership.QueryMembershipID as QM.QueryMembershipID)
Why is the PK in there when it's not used in the query at all? And why is it taking so long?
There are only 25 people in total. With the index, this should be a scan over just the QueryMembership rows that have QueryID = 1234, followed by a JOIN on the 25 people that exist in the table-valued function's result. Which, by the way, only has to be evaluated once and completes in less than 1 second.
If you want to avoid the "key lookup", use a covering index:
create index ix_QueryMembership_NameHere on QueryMembership (QueryID)
include (PersonID);
Add any further columns that you are going to select to the INCLUDE list.
As for why the PK's "key lookup" is working so slowly, try DBCC FREEPROCCACHE, then ALTER INDEX ALL ON QueryMembership REBUILD or ALTER INDEX ALL ON QueryMembership REORGANIZE.
This may help if your PK's index is disabled, or if the cache is keeping a wrong plan.
You should define indexes on the tables you query. In particular on columns referenced in the WHERE and ORDER BY clauses.
Use the Database Tuning Advisor to see what SQL Server recommends.
For specifics, of course you would need to post your query and table design.
But I have to make a couple of points here:
You've already jumped to the conclusion that the slowness is a result of the ORDER BY clause. I doubt it. The real test is whether or not removing the ORDER BY speeds up the query, which you haven't done. Dollars to donuts, it won't make a difference.
You only get the "log n" in your big-O claim when the optimizer actually chooses to use the index you defined. That may not be happening because your index may not be selective enough. The thing that makes your temp table solution faster than the optimizer's solution is that you know something about the subset of data being returned that the optimizer does not (specifically, that it is a really small subset of data). If your indexes are not selective enough for your query, the optimizer can't always reasonably assume this, and it will choose a plan that avoids what it thinks could be a worst-case scenario of tons of index lookups, followed by tons of seeks and then a big sort. Oftentimes, it chooses to scan and hash instead. So what you did with the temp table is often a way to solve this problem. Often you can narrow down your indexes or create an indexed view on the subset of data you want to work against. It all depends on the specifics of your query.
You need indexes on the columns in your WHERE and ORDER BY clauses. I am not an expert, but I would bet it is doing a table scan for each row. Since your speed issue is resolved by removing the INNER JOIN or the ORDER BY, I bet the issue is specifically with the join: it is probably doing the table scan on your joined table because of the sort. By putting an index on the columns in your WHERE clause first, you will be able to see if that is in fact the case.
Have you tried restructuring the query into a CTE to separate the TVF call? So, something like:
With QueryMembershipPerson As
(
Select QM.PersonId, P.Fullname
From QueryMembership As qm
Join Person As P
On P.PersonId = QM.PersonId
Where QM.QueryId = 1234
)
Select PersonId, Fullname
From QueryMembershipPerson As QMP
Join dbo.GetPersonAccess(1) As PA
On PA.PersonId = QMP.PersonId
EDIT: Btw, I'm assuming that there is an index on PersonId in both the QueryMembership and the Person table.
EDIT: What about two common table expressions, like so:
With
QueryMembershipPerson As
(
Select QM.PersonId, P.Fullname
From QueryMembership As qm
Join Person As P
On P.PersonId = QM.PersonId
Where QM.QueryId = 1234
)
, PersonAccess As
(
Select PersonId
From dbo.GetPersonAccess(1)
)
Select PersonId, Fullname
From QueryMembershipPerson As QMP
Join PersonAccess As PA
On PA.PersonId = QMP.PersonId
Yet another solution would be a derived table like so:
Select ...
From (
Select QM.PersonId, P.Fullname
From QueryMembership As qm
Join Person As P
On P.PersonId = QM.PersonId
Where QM.QueryId = 1234
) As QueryMembershipPerson
Join dbo.GetPersonAccess(1) As PA
On PA.PersonId = QueryMembershipPerson.PersonId
If pushing some of the query into a temp table and then joining on that works, I'd be surprised that you couldn't combine that concept into a CTE or a query with a derived table.
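If pushing part of the query into a temp table is indeed what works, a minimal T-SQL sketch of that variant might look like the following; the table and function names follow the question, but the sketch itself is untested:
-- materialize the small, filtered subset first
SELECT qm.PersonID, p.FullName
INTO #QueryMembershipPerson
FROM QueryMembership qm
INNER JOIN Person p ON p.PersonID = qm.PersonID
WHERE qm.QueryID = 1234;

-- then join the 25-row temp table against the TVF
SELECT qmp.PersonID, qmp.FullName
FROM #QueryMembershipPerson qmp
INNER JOIN dbo.GetPersonAccess(1) pa ON pa.PersonID = qmp.PersonID;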