Teradata optimizer wrongly estimates row number then accessing to table through view with union - sql

Let's say I have three tables: t1(it has about 1 billion rows, fact table) and t2(empty table, 0 rows). and t0 (dimension table), all of them have properly collected statistics. In addition there is view v0:
REPLACE VIEW v0
AS SELECT * from t1
union
SELECT * from t2;
Let's look to these three queries:
1) Select * from t1 inner t0 join on t1.id = t0.id; -- Optimizer correctly estimates 1 bln rows
2) Select * from t2 inner t0 join on t1.id = t0.id; -- Optimizer correctly estimates 0 row
3) Select * from v0 inner t0 join on v0.id = t0.id; -- Optimizer locks t1 and t2 for read, the correctly estimated, that it will get 1 bln rows from t1, but for no clear reasons estimated same number 1 bln from table t2.
What is going on here? Is is it the bug or a feature?
PS. Original query, that pretty big to show here, didn't finished in 35 minutes. After leaving just t1 - successfully finished in 15 minutes.
TD Release: 15.10.03.07
TD Version: 15.10.03.09

It's not the same number for the 2nd Select, it's the overall number of rows in spool after the 2nd Select, which is 1 billion plus 0.
And you query was running slowly because you used a UNION which defaults to DISTINCT, running this on a billion rows is really expensive.
Better switch to UNION ALL instead.

Related

Is there a better way to prioritize a sub query instead of using TOP?

Currently we are using SQL Server 2019 and from time to time we tend to use TOP (max INT) to prioritize the execution of a sub-query.
The main reason to do this is to make the starting result set as small as possible and thus avoid excessive reads when joining with other tables.
most common scenario in which it helps:
t1: is the main table we are querying has about 200k rows
t2,3: just some other tables with max 5k rows
pres: is a view with basically all the fields we use for presentation of e.g. product with about 30 JOINS and also containing table t1 + LanguageID
SELECT t1.Id, "+30 Fields from tables t1,t2,t3, pres"
FROM t1
INNER JOIN pres ON pres.LanguageId=1 AND t1.Id=pres.Id
INNER JOIN t2 ON t1.vtype=t2.Id
LEFT JOIN t3 ON t1.color=t3.Id
WHERE 1=1
AND t1.f1=0
AND t1.f2<>76
AND t1.f3=2
we only expect about 300 rows, but it takes about 12 seconds to run
SELECT t.Id, "10 Fields from tables t1,t2,t3 + 20 fields from pres"
FROM (
SELECT TOP 9223372036854775807 t1.Id, "about 10 fields from table t1,t2,t3"
FROM t1
INNER JOIN t2 ON t1.vtype=t2.Id
LEFT JOIN t3 ON t1.color=t3.Id
WHERE 1=1
AND t1.f1=0
AND t1.f2<>76
AND t1.f3=2
) t
INNER JOIN pres ON pres.LanguageId=1 AND t.Id=pres.Id
we only expect about 300 rows, but it takes about 2 seconds to run

Outer join and using Oracle indexes

SELECT *
FROM t1, t2 , t3
WHERE t1.row_id = t2.invoice_id(+)
and t2.voi_id = t3.row_id(+)
and type = 'Dec'
order by 1
I have 3 indexes, one for each column in the join, but it seems that the explain plan uses a full table scan on the tables without using the indexes:
Plan
1 Every row in the table t1 is read.
2 The rows were sorted to support the join at step 5.
3 Every row in the table t2 is read.
4 The rows were sorted to support the join at step 5.
5 Join the sorted results sets provided from steps 2, 4.
6 Rows were returned by the SELECT statement.
It is depend on rowcounts and size of tables. In your query all rows t1 will be fetched (because used left join and all rows from t1 with type='Dec' will be shown). That's why TABLE ACCESS FULL to table t1 is normal.
If rowcount in t1 is more than 20-30% rowcount in t2 (% depend on t2 size) also TABLE ACCESS FULL to t2 and their hash join is normal scenario.

Oracle 11g Performance of Join

I was doing some random testing on oracle 11g, and realized a strange performance difference between different JOIN using SQL developer.
I inserted 200,000 records into RANDOM_TABLE, and about 300 in unrelated_table, then run the two query for 10 times, the time stated is an average.
The two tables are totally unrelated, so the first one should give the same result as the second one, and indeed the row counts are the same.
1. SELECT * FROM some_random_table t1 LEFT JOIN unrelated_table t2 ON 1=1;
~0.005 seconds to fetch the first 50 row.
2. SELECT * FROM some_random_table t1 RIGHT JOIN unrelated_table t2 ON 1=1;
>0.05 seconds to fetch the first 50 row.
1. SELECT * FROM some_random_table t1 FULL JOIN unrelated_table t2 ON 1=1;
~0.005 seconds to fetch the first 50 row.
2. SELECT * FROM some_random_table t1 CROSS JOIN unrelated_table t2;
>0.05 seconds to fetch the first 50 row.
Can anyone explain the difference between these queries? Why are some faster and some slower by an order of magnitude?

Do volatile tables truncate results by default?

I have two queries that supposed to bring equivalent results. However the second query gives only partial results (less than 10 % of the total).
First query gives more than 4 million rows
SELECT id, amount
FROM table1 t1 LEFT OUTER JOIN table2 t2 ON t1.id = t2.id;
Second give only 18 thousand records
CREATE VOLATILE TABLE vt AS
(
SELECT id, amount
FROM table1 t1 LEFT OUTER JOIN table2 t2 ON t1.id = t2.id;
)
WITH DATA
NO PRIMARY INDEX
ON COMMIT PRESERVE ROWS;
SELECT *
FROM vt ;
Why does the second query give less records ???
When you do a SHOW TABLE vt; you'll notice that it's created as a SET table, which doesn't store duplicate rows. There are only 18 thousand distinct (id,amount) combinations.
Either add DISTINCT to your first Select or use CREATE MULTISET VOLATILE TABLE.

Should I avoid IN() because slower than EXISTS() [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
SQL Server IN vs. EXISTS Performance
Should I avoid IN() because slower than EXISTS()?
SELECT * FROM TABLE1 t1 WHERE EXISTS (SELECT 1 FROM TABLE2 t2 WHERE t1.ID = t2.ID)
VS
SELECT * FROM TABLE1 t1 WHERE t1.ID IN(SELECT t2.ID FROM TABLE2 t2)
From my investigation, I set SHOWPLAN_ALL. I get the same execution plan and estimation cost. The index(pk) is used, seek on both query. No difference.
What are other scenarios or other cases to make big difference result from both query? Is optimizer so optimization for me to get same execution plan?
Do neither. Do this:
SELECT DISTINCT T1.*
FROM TABLE1 t1
JOIN TABLE2 t2 ON t1.ID = t2.ID;
This will out perform anything else by orders of magnitude.
Both queries will produce the same execution plan (assuming no indexes were created): two table scans and one nested loop (join).
The join, suggested by Bohemian, will do a Hash Match instead of the loop, which I've always heard (and here is a proof: Link) is the worst kind of join.
Among IN and EXIST (your actuall question), EXISTS returs better performance (take a lok at: Link)
If your table T2 has a lot of records, EXISTS is the better approach hands down, because when your database find a record that match your requirement, the condition will be evaluated to true and it stopped the scan from T2. However, in the IN clause, you're scanning your Table2 for every row in table1.
IN is better than Exists when you have a bunch of values, or few values in the subquery.
Expandad a little my answer, based on Ask Tom answer:
In a Select with in, for example:
Select * from T1 where x in ( select y from T2 )
is usually processed as:
select *
from t1, ( select distinct y from t2 ) t2
where t1.x = t2.y;
The subquery is evaluated, distinct'ed, indexed (or hashed or sorted) and then joined to the original table (typically).
In an exist query like:
select * from t1 where exists ( select null from t2 where y = x )
That is processed more like:
for x in ( select * from t1 )
loop
if ( exists ( select null from t2 where y = x.x )
then
OUTPUT THE RECORD
end if
end loop
It always results in a full scan of T1 whereas the first query can make use of an index on T1(x).
When is where exists appropriate and in appropriate?
Use EXISTS when... Subquery T2 is huge and takes a long time and T1 is relatively small and executing (select null from t2 where y = x.x ) is very very fast
Use IN when... The result of the subquery is small -- then IN is typicaly more appropriate.
If both the subquery and the outer table are huge -- either might work as well as the other -- depends on the indexes and other factors.