DB2 tablescan using join between primary key - indexing

In a Db2 database I'm running this query:
select * from mytable t1 left join mytable t2 on t1.id = t2.id
where "id" is the only primary key in "mytable".
The explain shows a table scan:
RETURN
MSJOIN
TBSCAN
SORT
mytable TBSCAN
FILTER
TBSCAN
SORT
mytable TBSCAN
If I run the same query against othertable, I get what I'm expecting: the primary key (PK) is used:
RETURN
MSJOIN
othertable FETCH
PK_othertable IXSCAN
FILTER
othertable FETCH
PK_othertable IXSCAN
Why does Db2 not use the PK for the join in one case, but use it as I expect in the other?

Db2 uses a cost-based optimizer.
It probably decides that TBSCANs are cheaper for certain queries.
You may try to get the desired access plan with the corresponding hint (called optimization guidelines in Db2), like below:
select * from mytable t1
left join mytable t2 on t1.id = t2.id
/*
<OPTGUIDELINES>
...
</OPTGUIDELINES>
*/
;
where optimization guidelines may look like:
(let Db2 choose some JOIN request like MSJOIN, NLJOIN, etc. using IXSCAN with particular index name):
<OPTGUIDELINES>
<JOIN>
<IXSCAN TABLE="T1" INDEX="UNQUALIFIED_PK_INDEX_NAME"/>
<IXSCAN TABLE="T2" INDEX="UNQUALIFIED_PK_INDEX_NAME"/>
</JOIN>
</OPTGUIDELINES>
(NLJOIN with IXSCANs using whatever appropriate indexes):
<OPTGUIDELINES>
<NLJOIN>
<IXSCAN TABLE="T1"/>
<IXSCAN TABLE="T2"/>
</NLJOIN>
</OPTGUIDELINES>
Or use any other acceptable guidelines to get the desired access plan.
Refer to the Optimization profiles and guidelines topic for more details.
Compare the total cost of each access plan to understand why Db2 prefers a particular access plan (likely the one with the smallest total cost).
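
To compare access plans you first have to read them back from the engine. The sketch below is only a stand-in: it uses Python's built-in sqlite3 module, not Db2, and the `payload` column and row count are made up for illustration. SQLite's planner and plan format differ from Db2's explain output, so treat it as a way to see the idea of "ask for the plan, then look for scan versus index steps", not as a prediction of what Db2 will do.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical stand-in for the question's table: "id" is the only primary key.
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO mytable (id, payload) VALUES (?, ?)",
    [(i, f"row {i}") for i in range(1, 1001)],
)

query = "SELECT * FROM mytable t1 LEFT JOIN mytable t2 ON t1.id = t2.id"

# EXPLAIN QUERY PLAN returns one row per plan step; the last column of each
# row is the human-readable description of that step.
plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query)]
for step in plan:
    print(step)

# The self-join on the primary key returns each row exactly once.
row_count = len(conn.execute(query).fetchall())
print(row_count)
```

On Db2 the equivalent step is the explain facility already used in the question; the total cost it reports per plan is what the answer suggests comparing.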

Related

SQL Server 2016 : query performance with join and without join

I have 2 tables TABLE1 AND TABLE2.
TABLE1 has columns masterId, Id, col1, col2, category
TABLE2 has columns Id, col1, col2
TABLE2.Id is primary key and TABLE1.Id is foreign key.
TABLE1.masterId is primary key of TABLE1.
TABLE1 has 10 million rows with Id 1 to 10 million and first 10 rows having category = 1
TABLE2 has only 10 rows with Id 1 to 10.
Now I want col1 and col2 values with category=1 (either from TABLE1 OR TABLE2 because the values are same in both tables)
Which among below 2 queries gives output faster?
Solution1:
SELECT T1.col1, T1.col2
FROM TABLE1 T1
WHERE T1.category = 1
Solution2:
SELECT T2.col1, T2.col2
FROM TABLE2 T2
INNER JOIN TABLE1 T1 ON T1.Id = T2.Id
WHERE T1.category = 1
Does Solution2 save Table Scan time on millions of rows of TABLE1.
Limitation is:
In my real db scenario, I can make Table1.Id as non clustered index and Table1.category also non clustered index. I cannot make Table1.Id as clustered index because I actually have another auto increment column as primary key in my Table1 in real scenario. So please share your thoughts with this limitation.
Please confirm and share thoughts on this.
It depends on the existing indexes. With a nonclustered index on Id in T1, solution 2 might perform better than solution 1, which would require a complete table scan to select the rows with category = 1. If instead there is also a nonclustered index on category, then solution 1 will be faster, since it only has to seek the nonclustered index to find the rows.
Without any index on Id in T1, a full scan would be required to find the row for each T2.Id, so solution 2 might need 10 full scans of T1 versus 1 full scan of T1 on category for solution 1, and solution 1 might be faster. But this depends on the query optimizer, and testing the real case to see the actual execution plans would be the best way to answer.
But the way to go is to implement the right model and then create the indexes needed to make the query run fast.
Edit: adapted the answer according to the query edits.
Edit 2: a covering index would be expensive, and 10 index seeks on the PK of table 1 would not cost much.
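
The effect of an index on category can be checked on a toy copy of the table. This is a minimal sketch using Python's built-in sqlite3 module as a stand-in for SQL Server, with 10,000 rows instead of 10 million; the index name `ix_table1_category` is an assumption, and SQLite has no clustered/nonclustered distinction, so only the scan-versus-seek difference carries over.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table1 ("
    " masterId INTEGER PRIMARY KEY, Id INTEGER,"
    " col1 TEXT, col2 TEXT, category INTEGER)"
)
# 10,000 rows instead of 10 million; only the 10 lowest Ids get category = 1.
conn.executemany(
    "INSERT INTO table1 (Id, col1, col2, category) VALUES (?, ?, ?, ?)",
    [(i, f"a{i}", f"b{i}", 1 if i <= 10 else 2) for i in range(1, 10001)],
)

solution1 = "SELECT T1.col1, T1.col2 FROM table1 T1 WHERE T1.category = 1"


def plan_of(sql):
    """Return the plan step descriptions SQLite reports for a statement."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


plan_without_index = plan_of(solution1)
conn.execute("CREATE INDEX ix_table1_category ON table1 (category)")
plan_with_index = plan_of(solution1)

print("without index:", plan_without_index)
print("with index:   ", plan_with_index)
```

In SQL Server the same comparison is done by looking at the actual execution plan before and after creating the index.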
[Notice]
This answer was given for an older version of the question, https://stackoverflow.com/revisions/65263530/7
The scenario back then was:
T2 also had a category column, and,
the second query was:
SELECT T2.col1, T2.col2
FROM TABLE2 T2
INNER JOIN TABLE1 T1 ON T1.categoryId = T2.category Id
WHERE T2.category = 1
Assuming the only indices are the PKs, nope, Solution 2 will NOT avoid the table scan. Worse:
Solution 1
Full table scan
Solution 2
Full table scan on T2 (T2.category) and then nested loops (T2.category = T1.category)
Please, what are your goals here?
To begin with, this statement shows a lack of understanding of databases:
first 10 rows having category = 1
SQL tables represent unordered sets. There is no such thing as "first 10 rows". In the context of your question, I think you mean "the 10 rows with the lowest values of the id". However, the ordering of the table is still arbitrary from the perspective of the engine. There are situations where a clustered index could reasonably be assumed to be a "table ordering", but there is never a guarantee that:
select *
from t;
returns data in a particular ordering even with a clustered index.
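
If "the first 10 rows" is meant as "the rows with the lowest ids", the query has to say so with ORDER BY. A small sketch with Python's built-in sqlite3 module (a stand-in engine; the table contents are made up, and SQLite spells TOP as LIMIT):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, category INTEGER)")
# Rows are inserted in a deliberately shuffled order.
conn.executemany(
    "INSERT INTO t (id, category) VALUES (?, ?)",
    [(3, 1), (1, 1), (5, 2), (2, 1), (4, 2)],
)

# Without ORDER BY the engine is free to return the rows in any order,
# so nothing should be assumed about the order of this result.
unordered = [row[0] for row in conn.execute("SELECT id FROM t")]

# With ORDER BY, "the three rows with the lowest ids" is well defined.
lowest_three = [
    row[0] for row in conn.execute("SELECT id FROM t ORDER BY id LIMIT 3")
]
print(lowest_three)
```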
Two possible execution plans for the first query -- depending on the indexing -- are:
Scanning the table (i.e. reading millions of rows) and doing the test for each row.
Scanning an index on category and just fetching the rows that are needed.
In general, (1) would be much, much slower than (2) when the scanned rows number in the millions and the returned rows are just a few. However, this may not be true if a significant proportion of all records were returned.
I interpret your question as asking whether the second query could ever be faster than the first:
SELECT T2.col1, T2.col2
FROM TABLE2 T2 INNER JOIN
TABLE1 T1
ON T1.Id = T2.Id
WHERE T1.category = 1;
The answer is "definitely faster than the scan". This is possible if you have an index on Table1(id, category). However, the query would be better written using EXISTS:
select t2.*
from table2 t2
where exists (select 1
from table1 t1
where t1.id = t2.id and t1.category = 1
);
I would expect this to be faster than the indexed version of the first query as well. Even with an index on (category), the database still has to fetch the data for the select. If the data is on one page (as the "first" statement might suggest), then the two might be quite comparable. However, it would be hard to measure the difference in performance with the correct indexing on table1.
A note about clustered indexes in SQL Server. If the id is an identity primary key and there is no other clustered index, then it is automatically used as the clustered index.
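
Before comparing speed it is worth confirming that the variants return the same rows. The sketch below does that on a scaled-down copy of the tables, using Python's built-in sqlite3 module as a stand-in for SQL Server. It assumes category lives on table1 (as in the current version of the question) and that col1/col2 hold the same values in both tables; the index name is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE table2 (Id INTEGER PRIMARY KEY, col1 TEXT, col2 TEXT);
    CREATE TABLE table1 (
        masterId INTEGER PRIMARY KEY,
        Id INTEGER, col1 TEXT, col2 TEXT, category INTEGER
    );
    CREATE INDEX ix_table1_id_category ON table1 (Id, category);
    """
)
conn.executemany(
    "INSERT INTO table2 (Id, col1, col2) VALUES (?, ?, ?)",
    [(i, f"a{i}", f"b{i}") for i in range(1, 11)],
)
# 1,000 rows instead of 10 million; the 10 lowest Ids have category = 1.
conn.executemany(
    "INSERT INTO table1 (Id, col1, col2, category) VALUES (?, ?, ?, ?)",
    [(i, f"a{i}", f"b{i}", 1 if i <= 10 else 2) for i in range(1, 1001)],
)

queries = {
    "solution1": "SELECT T1.col1, T1.col2 FROM table1 T1 WHERE T1.category = 1",
    "solution2": (
        "SELECT T2.col1, T2.col2 FROM table2 T2 "
        "INNER JOIN table1 T1 ON T1.Id = T2.Id WHERE T1.category = 1"
    ),
    "exists": (
        "SELECT T2.col1, T2.col2 FROM table2 T2 WHERE EXISTS ("
        "SELECT 1 FROM table1 T1 WHERE T1.Id = T2.Id AND T1.category = 1)"
    ),
}
# Sort each result so the comparison does not depend on row order.
results = {name: sorted(conn.execute(sql).fetchall()) for name, sql in queries.items()}
print({name: len(rows) for name, rows in results.items()})
```

Which variant is fastest still has to be measured on the real server with the real indexes.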

How to redesign a database to find distinct values more effectively?

I often need to select a set of distinct values from a column with low selectivity in a big table while joining it to some other table, where I can't really filter the entries in the resulting set down to some reasonable amount.
For example, I have a table with 20M rows, with column someID which has 200 unique values. I join this table with some other result set on another column and filter 20M rows down to, say, 10M rows (still a lot), and then need to find distinct someID. So I end up with a 10M rows scan no matter what, which is a pain.
In this join, there is no way to filter the results more; 10M records is really the set I need to find distinct someID in.
Is there any standard approach to redesign the tables or create some additional table to make this work better?
Your basic query is:
select distinct t1.someID
from table1 t1 join
table2 t2
on t1.col1 = t2.col1;
The optimal indexes for this query are table1(col1, someId) and table2(col1).
Here is another version of the query:
select distinct t1.someId
from table1 t1
where exists (select 1 from table2 t2 where t1.col1 = t2.col1);
In this case, the optimal index would be table1(someid, col1). It is possible that SQL Server will be intelligent in this case and stop looking for an exists value when it encounters a match (although I am a bit skeptical). You would have to investigate the execution plans generated on your data.
Another idea extends this even further:
select s.someId
from someIdtable s
where exists (select 1
from table1 t1 join
table2 t2
on t1.col1 = t2.col1 and t1.someId = s.someId);
This removes the outer distinct, depending only on the semi-join in the exists clause. The optimal index would be table1(someid, col1).
Under some circumstances, this version would probably have the best performance -- for instance, if all the someIds were in the result set. On the other hand, if very few are, this might have poor performance.
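
The three formulations above can be checked against each other on toy data. This is a sketch under assumptions: it uses Python's built-in sqlite3 module instead of SQL Server, the data is synthetic, and `someIdtable` is taken to be a lookup table holding every possible someID (the answer does not define it).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE table1 (col1 INTEGER, someID INTEGER);
    CREATE TABLE table2 (col1 INTEGER);
    CREATE TABLE someIdtable (someId INTEGER PRIMARY KEY);
    CREATE INDEX ix_table1_someid_col1 ON table1 (someID, col1);
    CREATE INDEX ix_table2_col1 ON table2 (col1);
    """
)
# 2,000 rows with 20 distinct someID values; table2 matches even col1 < 1000.
conn.executemany("INSERT INTO table1 VALUES (?, ?)", [(i, i % 20) for i in range(2000)])
conn.executemany("INSERT INTO table2 VALUES (?)", [(i,) for i in range(0, 1000, 2)])
conn.executemany("INSERT INTO someIdtable VALUES (?)", [(i,) for i in range(20)])

queries = {
    "join_distinct": (
        "SELECT DISTINCT t1.someID FROM table1 t1 "
        "JOIN table2 t2 ON t1.col1 = t2.col1"
    ),
    "exists_distinct": (
        "SELECT DISTINCT t1.someID FROM table1 t1 "
        "WHERE EXISTS (SELECT 1 FROM table2 t2 WHERE t1.col1 = t2.col1)"
    ),
    "lookup_exists": (
        "SELECT s.someId FROM someIdtable s WHERE EXISTS ("
        "SELECT 1 FROM table1 t1 JOIN table2 t2 "
        "ON t1.col1 = t2.col1 AND t1.someID = s.someId)"
    ),
}
# Compare as sets: DISTINCT results carry no guaranteed order.
results = {name: {row[0] for row in conn.execute(sql)} for name, sql in queries.items()}
print(results["lookup_exists"])
```

Only the third form can stop at the first match per someID, which is why it can win when most ids are present in the result.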
I'm stealing the "basic query" from Gordon's answer:
select t1.someID
from table1 t1
join table2 t2 on t1.col1 = t2.col1
group by t1.someID
This query fits the requirements of indexed views. You can index this query. Running it will result in a simple clustered index scan which is as cheap as it gets.

Performance of join vs pre-select on MsSQL

I can do the same query in two ways as following, will #1 be more efficient as we don't have join?
1
select table1.* from table1
inner join table2 on table1.key = table2.key
where table2.id = 1
2
select * from table1
where key = (select key from table2 where id=1)
These are doing two different things. The second will return an error if more than one row is returned by the subquery.
In practice, my guess is that you have an index on table2(id) or table2(id, key), and that id is unique in table2. In that case, both should be doing index lookups and the performance should be very comparable.
And the general answer to performance questions is: try them on your servers with your data. That is really the only way to know if the performance difference matters in your environment.
I executed these two statements after running set statistics io on (on SQL Server 2008 R2 Enterprise - which supposedly has the best optimization compared to Standard).
select top 5 * from x2 inner join ##prices on
x2.LIST_PRICE = ##prices.i1
and
select top 5 * from x2 where LIST_PRICE in (select i1 from ##prices)
and the statistics matched exactly. I have always preferred the first type of join but the second allows me to select just that part and see what rows are being returned.
I was taught that joins vs subqueries are mostly equivalent when it comes to performance. I would also look at the resulting query plans to see if one is better than the other. The query plans matched exactly.
MS SQL Server is smart enough to understand that it is the same action in such a simple query.
However, if the subquery returns more than one record, you'll probably use IN. IN is a slow operation, and it will never be faster than a JOIN: it can be the same, but never faster.
The best option for your case is to use EXISTS. It will always be as fast as or faster than a JOIN or IN. Example:
select * from table1 t1
where EXISTS (select * from table2 t2 where id=1 AND t1.key = t2.key)
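
Apart from speed, JOIN and EXISTS/IN are not always interchangeable: if table2 has several matching rows, the JOIN repeats the table1 row, while EXISTS and IN return it once. A minimal sketch with Python's built-in sqlite3 module (a stand-in for SQL Server). The data is made up, `id` is deliberately not unique in table2, and the column is named `key_col` here because `key` is a keyword.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE table1 (key_col INTEGER, label TEXT);
    CREATE TABLE table2 (id INTEGER, key_col INTEGER);
    INSERT INTO table1 VALUES (10, 'x'), (20, 'y');
    -- Two table2 rows with id = 1 point at the same key.
    INSERT INTO table2 VALUES (1, 10), (1, 10), (2, 20);
    """
)

join_rows = conn.execute(
    "SELECT table1.* FROM table1 "
    "INNER JOIN table2 ON table1.key_col = table2.key_col "
    "WHERE table2.id = 1"
).fetchall()

exists_rows = conn.execute(
    "SELECT * FROM table1 t1 WHERE EXISTS ("
    "SELECT * FROM table2 t2 WHERE t2.id = 1 AND t1.key_col = t2.key_col)"
).fetchall()

in_rows = conn.execute(
    "SELECT * FROM table1 "
    "WHERE key_col IN (SELECT key_col FROM table2 WHERE id = 1)"
).fetchall()

print(len(join_rows), len(exists_rows), len(in_rows))
```

When id is unique in table2, as the first answer guesses, all three return the same single row and the choice is purely about the plan.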

SQL: how do I speed up this query

Here is the situation. I have one table that contains records based on records in many different tables (t1 below). t2 is one of the tables that has information pulled from it into t1.
t1
table_oid --which table id is a FK to
id --fk to other table
store_num --field
t2
t2_id
Here is what I need to find: I need the largest t2_id where the store_num is not null in the corresponding record of t1. Here is the query I wrote:
select max(id) from t1
join t2 on t2.t2_id = t1.id
where store_num is not null
and table_oid = 1234;
However, this takes fairly long. I think this should be a fast query. All _ids have indexes on them (t1.id/t1.table_oid, t2.t2_id). The vast majority of entries in t1 have a store_num.
Mentally, I would get the t2_ids in descending order, then, one by one, try them against t1 until I found the first one that had a store_num.
select t2_id from t2 order by t2_id desc;
has an explain cost of 25612
select t1.* from t1 where table_oid = 1234
and id in (select max(t2_id) from t2);
has an explain cost of 8.
So why wouldn't the above query be a cost of at most 25612*8 = 204896? When I explain it, it comes back as more than 3 times that.
Really, my question is how do I re-write that query to run faster.
NOTE: I am using Oracle.
EDIT:
t2 has 11,895,731 rows
t1 has 473,235,192 rows
EDIT 2:
As I've tried different things, the part of the query that is taking the longest is the full scan on t1 looking for the store_num. Is there a way to keep this from doing a full scan, since I only need the biggest entry?
You say:
all _ids have indexes for them
But your query is:
...
where store_num is not null
and table_oid = 1234;
All of your _id indexes are useless for this query unless store_num and table_oid are also indexed, and are the first columns in said index.
So of course it has to do a full scan; it can give you back max(id) instantly without any filter conditions, but as soon as you put in the filter, it can't use the id index anymore because it doesn't know which part of the index matches those store_num is not null entries - not without a scan.
To speed the query up, you need to create an index on (store_num, table_oid, id). Standard disclaimers about creating indexes for a single ad-hoc query apply; having too many indexes will hurt insert/update performance.
It really doesn't matter how you "rewrite" your query - this isn't like application code, the optimizer is going to rearrange all of the pieces of your query anyway. Unless you have sufficiently-selective indexes on your seek columns or the entire query is completely covered by a single index, it's going to be slow.
Not sure if these apply to Oracle. Do you have an index on the FK id column for the join? Also, avoid 'NOT IN' if you can: it's non-sargable in SQL, which slows down a query.
Another option that might be slower is doing an outer join and then checking for null on that column (not sure whether that applies only to SQL Server, too).
select max(id) from t1
left outer join t2 on t2.t2_id = t1.id
where t1... IS NULL
and table_oid = 1234;
The best way I can think of to have this run fast is to:
Create an index on (TABLE_OID, ID DESC, COVERED_ENTITY_ID) in that order. Why?
table_oid -- this is your primary access condition
id -- so you don't have to access a data block to read it,
-- and you get higher ID values first
covered_entity_id -- you're filtering the data based on this, null vs not null
That should prevent the need to access the 473m rows in T1 at all.
Ensure that there's an index on T2_ID.
If all that's in place, a query like:
select max(id)
from t1
inner join t2
on t2.t2_id = t1.id
where covered_entity_id is not null
and table_oid = 1234;
should be (the optimizer is a finicky beast) able to do a semi-join driven by a fast full scan against the index on T1, never scanning the data blocks. Also consider writing it manually as:
select max(id)
from t1
where covered_entity_id is not null
and table_oid = 1234
and exists (select null
from t2
where t1.id = t2.t2_id);
select max(id)
from t1
where covered_entity_id is not null
and table_oid = 1234
and id in (select t2_id from t2);
As the optimizer may write those plans slightly differently.
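
The composite-index idea can be tried out on a small scale. The sketch below uses Python's built-in sqlite3 module as a stand-in for Oracle, so the plan text and optimizer behaviour will differ; the index name `ix_t1_oid_id_store` and the data are assumptions, and the filter column is called store_num as in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE t2 (t2_id INTEGER PRIMARY KEY);
    CREATE TABLE t1 (table_oid INTEGER, id INTEGER, store_num INTEGER);
    -- Access column first, then id (descending), then the filtered column.
    CREATE INDEX ix_t1_oid_id_store ON t1 (table_oid, id DESC, store_num);
    """
)
conn.executemany("INSERT INTO t2 VALUES (?)", [(i,) for i in range(1, 101)])
# For table_oid 1234, ids 91..100 have no store_num, so the answer is 90.
rows = [(1234, i, None if i > 90 else 7) for i in range(1, 101)]
rows += [(999, i, 7) for i in range(1, 101)]
conn.executemany("INSERT INTO t1 VALUES (?, ?, ?)", rows)

query = """
    SELECT max(id) FROM t1
    WHERE store_num IS NOT NULL
      AND table_oid = 1234
      AND EXISTS (SELECT NULL FROM t2 WHERE t1.id = t2.t2_id)
"""
max_id = conn.execute(query).fetchone()[0]
plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query)]

print(max_id)
for step in plan:
    print(step)
```

Because every column the query touches on t1 is in the index, the engine can answer from the index alone, which is the point of the suggested column order.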
In the following I assume covered_entity_id is the same as store_num - it would really make things easier for us if you were consistent in your naming.
The vast majority of entries in t1
have a store_num.
Given that this is the case, the following clause shouldn't have any impact on the performance of your query ...
where covered_entity_id is not null
However, you go on to say
the part of the query that is taking
the longest is the full scan on t1
looking for the store_num
This suggests the query is looking for covered_entity_id is not null first rather than the presumably far more selective table_oid = 1234. The solution might be as simple as re-writing the query like this ...
where table_oid = 1234
and covered_entity_id is not null;
... although I suspect not. You could try hinting to get the query to use the index on table_oid.
The other thing is, how fresh are the statistics? When the optimizer chooses a radically bad execution plan it is often because the stats are out of date.
Incidentally, why are you joining to T2 at all? Your requirements could be met by selecting max(id) from T1 (unless you don't have a foreign key enforcing T1.ID references T2.T2_ID, and hence need to be sure).
edit
To check your statistics run this query:
select table_name
, num_rows
, last_analyzed
from user_tables
where table_name in ('T1', 'T2')
/
If the results show num_rows is widely divergent from the values you gave in your first edit, then you should re-gather statistics. If last_analyzed is something like the day you went live, then you definitely should re-gather. You may want to export your statistics first; refreshing the statistics can affect the execution plans (that is the object of the exercise), usually for good, but sometimes things can get worse.

SQL: Optimization problem, has rows?

I got a query with five joins on some rather large tables (largest table is 10 mil. records), and I want to know if rows exists. So far I've done this to check if rows exists:
SELECT TOP 1 tbl.Id
FROM table tbl
INNER JOIN ... ON ... = ... (x5)
WHERE tbl.xxx = ...
Using this query, in a stored procedure takes 22 seconds and I would like it to be close to "instant". Is this even possible? What can I do to speed it up?
I got indexes on the fields that I'm joining on and the fields in the WHERE clause.
Any ideas?
Switch to the EXISTS predicate. In general I have found it to be faster than selecting TOP 1, etc.
So you could write like this IF EXISTS (SELECT * FROM table tbl INNER JOIN table tbl2 .. do your stuff
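
As a rough illustration of the EXISTS form, here is a sketch using Python's built-in sqlite3 module as a stand-in for SQL Server (T-SQL's IF EXISTS (...) becomes SELECT EXISTS (...) in SQLite). The two tables, their columns and the helper `rows_exist` are hypothetical; the real query has five joins that are not shown in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE tbl (Id INTEGER PRIMARY KEY, xxx TEXT);
    CREATE TABLE tbl2 (Id INTEGER PRIMARY KEY, tbl_id INTEGER);
    CREATE INDEX ix_tbl_xxx ON tbl (xxx);
    CREATE INDEX ix_tbl2_tbl_id ON tbl2 (tbl_id);
    INSERT INTO tbl VALUES (1, 'red'), (2, 'blue');
    INSERT INTO tbl2 VALUES (10, 1);
    """
)


def rows_exist(xxx_value):
    """Return True if at least one joined row matches; no rows are fetched."""
    sql = """
        SELECT EXISTS (
            SELECT 1
            FROM tbl
            INNER JOIN tbl2 ON tbl2.tbl_id = tbl.Id
            WHERE tbl.xxx = ?
        )
    """
    return bool(conn.execute(sql, (xxx_value,)).fetchone()[0])


print(rows_exist("red"), rows_exist("blue"), rows_exist("green"))
```

EXISTS lets the engine stop at the first qualifying row, but whether that beats TOP 1 on the real tables has to be checked in the execution plan.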
Depending on your RDBMS you can check what parts of the query are taking a long time and which indexes are being used (so you can know they're being used properly).
In MSSQL, you can see a diagram of the execution path of any query you submit.
In Oracle and MySQL you can use the EXPLAIN keyword to get details about how the query is working.
But it might just be that 22 seconds is the best you can do with your query. We can't answer that, only the execution details provided by your RDBMS can. If you tell us which RDBMS you're using we can tell you how to find the information you need to see what the bottleneck is.
4 options
Try COUNT(*) in place of TOP 1 tbl.id
An index per column may not be good enough: you may need to use composite indexes
Are you on SQL Server 2005? If so, you can find missing indexes. Or try the database tuning advisor.
Also, it's possible that you don't need 5 joins.
Assuming parent-child-grandchild etc, then grandchild rows can't exist without the parent rows (assuming you have foreign keys)
So your query could become
SELECT TOP 1
tbl.Id --or count(*)
FROM
grandchildtable tbl
INNER JOIN
anothertable ON ... = ...
WHERE
tbl.xxx = ...
Try EXISTS.
Either for the 5 tables or for the assumed hierarchy:
SELECT TOP 1 --or count(*)
tbl.Id
FROM
grandchildtable tbl
WHERE
tbl.xxx = ...
AND
EXISTS (SELECT *
FROM
anothertable T2
WHERE
tbl.key = T2.key /* AND T2 condition*/)
-- or
SELECT TOP 1 --or count(*)
tbl.Id
FROM
mytable tbl
WHERE
tbl.xxx = ...
AND
EXISTS (SELECT *
FROM
anothertable T2
WHERE
tbl.key = T2.key /* AND T2 condition*/)
AND
EXISTS (SELECT *
FROM
yetanothertable T3
WHERE
tbl.key = T3.key /* AND T3 condition*/)
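
The parent-child-grandchild argument above relies on enforced, non-nullable foreign keys. The sketch below illustrates it with Python's built-in sqlite3 module as a stand-in for SQL Server; the three tables and their columns are hypothetical, since the real schema is not shown in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.executescript(
    """
    CREATE TABLE parent (id INTEGER PRIMARY KEY);
    CREATE TABLE child (
        id INTEGER PRIMARY KEY,
        parent_id INTEGER NOT NULL REFERENCES parent(id)
    );
    CREATE TABLE grandchild (
        id INTEGER PRIMARY KEY,
        child_id INTEGER NOT NULL REFERENCES child(id),
        xxx TEXT
    );
    INSERT INTO parent VALUES (1);
    INSERT INTO child VALUES (10, 1);
    INSERT INTO grandchild VALUES (100, 10, 'red');
    """
)


def exists_full(value):
    """Existence check that joins all the way up the hierarchy."""
    sql = (
        "SELECT EXISTS (SELECT 1 FROM grandchild g "
        "JOIN child c ON c.id = g.child_id "
        "JOIN parent p ON p.id = c.parent_id WHERE g.xxx = ?)"
    )
    return bool(conn.execute(sql, (value,)).fetchone()[0])


def exists_short(value):
    """Same check on the grandchild alone; valid because FKs forbid orphans."""
    sql = "SELECT EXISTS (SELECT 1 FROM grandchild g WHERE g.xxx = ?)"
    return bool(conn.execute(sql, (value,)).fetchone()[0])


# An orphan grandchild is rejected, which is what makes the shortcut safe.
try:
    conn.execute("INSERT INTO grandchild VALUES (101, 999, 'red')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

print(exists_full("red"), exists_short("red"), orphan_rejected)
```

If the joins also filter (extra conditions on the parent tables), they cannot be dropped this way.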
Doing a filter early on your first select will help if you can do it; as you filter the data in the first instance all the joins will join on reduced data.
Select top 1 tbl.id
From
(
Select top 1 * from
table tbl1
Where Key = Key
) tbl1
inner join ...
After that you will likely need to provide more of the query to understand how it works.
Maybe you could offload/cache this fact-finding mission. Like if it doesn't need to be done dynamically or at runtime, just cache the result into a much smaller table and then query that. Also, make sure all the tables you're querying to have the appropriate clustered index. Granted you may be using these tables for other types of queries, but for the absolute fastest way to go, you can tune all your clustered indexes for this one query.
Edit: Yes, what other people said. Measure, measure, measure! Your query plan estimate can show you what your bottleneck is.
Put the table with the most rows first in every join. If the WHERE clause
has more than one condition, the sequence of the conditions is important:
put first the condition that gives you the most rows.
Use filters very carefully when optimizing a query.