Teradata SQL tuning with SUM and other aggregate functions

I have a query like this:
SELECT
tb1.col1,
tb4.col2,
CASE WHEN tb4.col4 IN (<IN list with more than 1000 values!>) THEN tb4.col7
     ELSE 'Flag' END AS "Dcol1",
SUM(tb3.col1),
SUM(tb3.col2),
SUM(tb2.col4)
etc.
FROM
tb1 LEFT OUTER JOIN tb2 <condition> LEFT OUTER JOIN tb3 <conditions>
WHERE <tb1 condition> AND <tb2 condition> AND <tb3 condition>
GROUP BY CASE <condition> END, tb2.colx, tb1.coly
The problem is that TB3 and TB4 are HUGE fact tables, and the PI of the fact tables is NOT included in the joins or queries here.
What I have done so far
is create a volatile table VT1 with the same PI as TB4, with the IN list in its WHERE clause, and materialize it. I then LEFT OUTER JOIN to it using this approach:
SELECT
....
CASE WHEN Dtb1.c1 IS NOT NULL
     THEN ft."CustomColumName"
     ELSE 'ALL OTHER'
END AS "CustomColumName"
FROM "Db"."FACTTablew5MillionRows" AS ft
LEFT JOIN VolatileTable Dtb1
ON Dtb1.c1 = ft.C1
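For reference, here is a sketch of how such a volatile table can be built; the source table name is an assumption, and the IN-list values remain placeholders:
CREATE VOLATILE TABLE VT1 AS (
    SELECT c1                          -- the join key the fact table is hashed on
    FROM "Db".SomeSourceTable          -- assumed source of the IN-list values
    WHERE c1 IN ( /* the 1000+ IN-list values */ )
) WITH DATA
PRIMARY INDEX (c1)                     -- same PI column as the fact table
ON COMMIT PRESERVE ROWS;

COLLECT STATISTICS COLUMN (c1) ON VT1;
Because VT1 hashes on the same column as the fact table, the join can stay AMP-local instead of redistributing the fact rows, and the collected statistics help the optimizer see that.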
HOW can I optimize these kinds of queries?
Assume TB2 and TB3 are huge fact tables.
The PI of TB3 is (C4, C6) and the PI of TB2 is (C6, C7).
TB3 has a partitioning column Cp, but it is NOT used in any WHERE clause; it only gets used in one of the joins.
The two tables do NOT have the same PI, but they MAY have a column in common within their PIs.
Row counts are some 80 million rows for TB3 and 60 million for TB2.
The original query simply would not run without spooling out. Luckily, at night it could complete, at about 80K impact CPU. I COULD get the query to run at < 200 impact CPU ONLY after creating a VT2 with the same PI as TB2 and then using it in the join.
I DO NOT want to create a bunch of volatile tables to be used by BO users who know squat about Teradata. WHAT can I do to make this query better?

Related

What's the purpose of a JOIN where no column from 2nd table is being used?

I am looking through some Hive queries we are running as part of analytics on our Hadoop cluster, but I am having trouble understanding one. This is the HiveQL query:
SELECT
c_id, v_id, COUNT(DISTINCT(m_id)) AS participants,
cast(date_sub(current_date, ${window}) as string) as event_date
from (
select
a.c_id, a.v_id, a.user_id,
case
when c.id1 is not null and a.timestamp <= c.stitching_ts then c.id2 else a.m_id
end as m_id
from (
select * from first
where event_date <= cast(date_sub(current_date, ${window}) as string)
) a
join (
select * from second
) b on a.c_id = b.c_id
left join third c
on a.user_id = c.id1
) dx
group by c_id, v_id;
I have changed the names, but otherwise this is the SELECT statement being used to INSERT OVERWRITE into another table.
Regarding the join:
join (
select * from second
) b on a.c_id = b.c_id
b is not used anywhere except in the join condition, so is this join serving any purpose at all?
Is it there to make sure that the result only has entries whose c_id is present in the second table? Would a WHERE ... IN condition be better, if that's all this is doing?
Or can I just remove this join without it making any difference at all?
Thanks.
A join (any of inner, left or right) can duplicate rows if the join key in the joined dataset is not unique. For example, if a contains a single row with c_id=1 and b contains two rows with c_id=1, the result will be two rows with a.c_id=1.
An inner join can also filter rows whose join key is absent from the joined dataset. I believe this is what it is meant to do here.
If the goal is to keep only rows with keys present in both datasets (a filter), you do not want duplication, and you do not use columns from the joined dataset, then it is better to use LEFT SEMI JOIN instead of JOIN. It works as a filter only, even if there are duplicated keys in the joined dataset:
left semi join (
select c_id from second
) b on a.c_id = b.c_id
This is a much safer way to keep only the rows that exist in both a and b while avoiding unintended duplication.
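As a toy illustration of both behaviors (a sketch; it assumes a Hive version with CTE and FROM-less SELECT support, and the one-column datasets are made up):
WITH a AS (SELECT 1 AS c_id),
     b AS (SELECT 1 AS c_id UNION ALL SELECT 1 AS c_id)
SELECT a.c_id
FROM a JOIN b ON a.c_id = b.c_id;
-- inner join returns 2 rows: b's duplicated key duplicates a's single row

WITH a AS (SELECT 1 AS c_id),
     b AS (SELECT 1 AS c_id UNION ALL SELECT 1 AS c_id)
SELECT a.c_id
FROM a LEFT SEMI JOIN b ON a.c_id = b.c_id;
-- left semi join returns 1 row: it acts purely as an existence filter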
You can replace the join with WHERE IN/EXISTS, but it makes no difference: it is implemented as the same JOIN. Check the EXPLAIN output and you will see the same query plan. Better to use LEFT SEMI JOIN, which implements uncorrelated IN/EXISTS efficiently.
If you prefer to move it to the WHERE:
WHERE a.c_id IN (select c_id from second)
or correlated EXISTS:
WHERE EXISTS (select 1 from second b where a.c_id=b.c_id)
But as I said, all of them are implemented internally using the JOIN operator.

Is there a better way to prioritize a subquery instead of using TOP?

Currently we are using SQL Server 2019, and from time to time we use TOP with the maximum BIGINT value to prioritize the execution of a sub-query.
The main reason for doing this is to make the starting result set as small as possible and thus avoid excessive reads when joining with other tables.
The most common scenario in which it helps:
t1: the main table we are querying; it has about 200k rows
t2, t3: just some other tables, with at most 5k rows
pres: a view with basically all the fields we use for presentation (of e.g. a product), built from about 30 JOINs, which also contains table t1 plus a LanguageID
SELECT t1.Id, "+30 Fields from tables t1,t2,t3, pres"
FROM t1
INNER JOIN pres ON pres.LanguageId=1 AND t1.Id=pres.Id
INNER JOIN t2 ON t1.vtype=t2.Id
LEFT JOIN t3 ON t1.color=t3.Id
WHERE 1=1
AND t1.f1=0
AND t1.f2<>76
AND t1.f3=2
We only expect about 300 rows, but it takes about 12 seconds to run.
SELECT t.Id, "10 Fields from tables t1,t2,t3 + 20 fields from pres"
FROM (
SELECT TOP 9223372036854775807 t1.Id, "about 10 fields from table t1,t2,t3"
FROM t1
INNER JOIN t2 ON t1.vtype=t2.Id
LEFT JOIN t3 ON t1.color=t3.Id
WHERE 1=1
AND t1.f1=0
AND t1.f2<>76
AND t1.f3=2
) t
INNER JOIN pres ON pres.LanguageId=1 AND t.Id=pres.Id
We only expect about 300 rows, and this version takes about 2 seconds to run.
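An alternative that guarantees the narrow set is materialized before the join to pres, without relying on TOP, is an explicit temp table (a sketch, not from the question; #narrow is an invented name and the field lists stay as placeholders):
SELECT t1.Id /* , about 10 fields from t1, t2, t3 */
INTO #narrow
FROM t1
INNER JOIN t2 ON t1.vtype = t2.Id
LEFT JOIN t3 ON t1.color = t3.Id
WHERE t1.f1 = 0 AND t1.f2 <> 76 AND t1.f3 = 2;

-- Join the ~300 materialized rows to the wide presentation view.
SELECT t.Id /* , 20 fields from pres */
FROM #narrow AS t
INNER JOIN pres ON pres.LanguageId = 1 AND t.Id = pres.Id;
Unlike the TOP trick, the optimizer cannot merge the two steps back together, at the cost of an extra write to tempdb.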

(Oracle) Does adding a filter on the master table improve the performance of a left join between master and detail?

I would like to know whether adding a condition to the left join clause that filters the records of the master table improves the performance of the left join between master and detail tables.
E.g.
I have a master table MT(ID, TYPE) and a detail table DT(ID, FK, NAME). The left join would be written like:
select MT.ID, DT.NAME
from MT
left join DT
on MT.ID = DT.FK
If, among the results of the left join, I only need the information for records of a certain type, let's say MT.TYPE='01', does adding this condition to the left join clause improve the performance of the query?
select MT.ID, DT.NAME
from MT
left join DT
on MT.TYPE = '01' and MT.ID = DT.FK
If you have no indices set up on the MT and DT tables, then in general both queries would be executed using full table scans and would have similar performance. The situation where the second query might evaluate faster than the first is where you have proper indices set up, e.g.
(TYPE, ID) on the MT table
(FK, NAME) on the DT table
In this case, if MT.TYPE = '01' were very restrictive, it could greatly reduce the amount of work the database has to do. This set of indices would also speed up the join operation.
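Those indices could be created like so (a sketch; the index names are invented):
CREATE INDEX mt_type_id_ix ON MT (TYPE, ID);
CREATE INDEX dt_fk_name_ix ON DT (FK, NAME);
Note that the (FK, NAME) index covers everything the query needs from DT, so Oracle can satisfy both the join and the select list from the index alone without touching the table.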

Joining large volumes of data in Oracle

I want to join three large tables in Oracle. TableA has 370 million rows, TableB has 370 million rows, and the master table TableM has 600,000 rows. TableM is the master table of the other two, TableA and TableB.
My query was like this:
SELECT A.MasterId, B.Date1
FROM TableA A
INNER JOIN TableB B ON B.MasterId = A.MasterId
INNER JOIN TableM M ON M.MasterId = A.MasterId
When I execute the above query, it takes a long time. I wanted to split the execution with a WHERE clause, taking five years of data at a time. We have a total of 25 years of data, so I can execute the query below five times and insert the values into a temp table.
My approaches:
Approach 1:
Using the UNION ALL operator, I can combine the result sets and insert the values into a temp table. It took too long.
SELECT A.MasterId, B.Date1
FROM TableA A
INNER JOIN TableB B ON B.MasterId = A.MasterId
INNER JOIN TableM M ON M.MasterId = A.MasterId
WHERE M.Date > '01-JAN-1985' AND M.Date < '01-JAN-1990'
UNION ALL
SELECT A.MasterId, B.Date1
FROM TableA A
INNER JOIN TableB B ON B.MasterId = A.MasterId
INNER JOIN TableM M ON M.MasterId = A.MasterId
WHERE M.Date > '01-JAN-1990' AND M.Date < '01-JAN-1995'
.....
Approach 2:
I tried to insert the five years of data into a temp table using BULK COLLECT, but it failed.
Is there any other way to handle this problem?
A full Cartesian join over these 3 tables would result in 370M × 370M × 600K ≈ 8.2140E+22 records, which is an unwieldy large dataset, and that is also why it takes so long.
What would be the use of such a select?
For the insert, use a simple INSERT INTO ... SELECT ... FROM ...
Performance should be much better than using PL/SQL with BULK COLLECT.
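A sketch of that set-based insert for one five-year slice; the target table TempTable(MasterId, Date1) and the switch to ANSI date literals (which avoid the implicit string conversion in the original WHERE clause) are assumptions:
INSERT INTO TempTable (MasterId, Date1)
SELECT A.MasterId, B.Date1
FROM TableA A
INNER JOIN TableB B ON B.MasterId = A.MasterId
INNER JOIN TableM M ON M.MasterId = A.MasterId
WHERE M.Date >= DATE '1985-01-01'
  AND M.Date <  DATE '1990-01-01';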

Optimization of a DB2 query which uses joins and takes 1.5 hours to execute

When I run a SELECT statement on my view, it takes around 1.5 hours to run. What can I do to optimize it?
Below is a sample of the structure of my view:
CREATE VIEW SCHEMANAME.VIEWNAME
(COL, COL1, COL2, COL3)
AS SELECT
COST.ETA,
CASE
WHEN VOL.CURR IS NOT NULL
THEN COALESCE(VOL.COMM, 0)
END,
CASE
WHEN ...
END
FROM TABLE1 t1 INNER JOIN TABLE2 t2 ON t1.ETA = t2.ETA
INNER JOIN TABLE3 t3 ON t2.ETA = t3.ETA
LEFT OUTER JOIN TABLE4 t4 ON t2.ETA = t4.ETA
This is your query:
SELECT COST.ETA,
       (CASE WHEN VOL.CURR IS NOT NULL THEN COALESCE(VOL.COMM, 0)
        END) as ??,
       . . .
FROM TABLE1 t1 INNER JOIN
     TABLE2 t2
     ON t1.ETA = t2.ETA INNER JOIN
     TABLE3 t3
     ON t2.ETA = t3.ETA LEFT OUTER JOIN
     TABLE4 t4
     ON t2.ETA = t4.ETA;
First, I will note that the SELECT clause references tables (COST and VOL) that are not in the FROM clause. I assume this is a typo.
Second, you should be able to use indexes to improve this query: table1(eta), table2(eta), table3(eta), and table4(eta).
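For reference, those indexes could be created like this (a sketch; the index names are invented):
CREATE INDEX t1_eta_ix ON TABLE1 (ETA);
CREATE INDEX t2_eta_ix ON TABLE2 (ETA);
CREATE INDEX t3_eta_ix ON TABLE3 (ETA);
CREATE INDEX t4_eta_ix ON TABLE4 (ETA);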
Third, I am highly suspicious upon seeing the same column used for joining so many tables. I suspect that you might have Cartesian products occurring, because there are multiple rows for any given eta in several tables. If that is the case, you need to fix the query to better reflect what you really need. If so, ask another question with sample data and desired results, because your query is probably not correct.