Tuning/Rewriting sql query with many left outer joins and heavy tables

I have four or five very large tables that are left outer joined by the query below. Is there any way to rewrite it so that it performs better?
SELECT t1.id,
MIN(t5.date) AS first_pri_date,
MIN(t3.date) AS first_pub_date,
MAX(t3.date) AS last_publ_date,
MIN(t2.date) AS first_exp_date
FROM t1
LEFT JOIN t2 ON (t1.id = t2.id)
LEFT JOIN t3 ON (t3.id = t1.id)
LEFT JOIN t4 ON (t1.id = t4.id)
LEFT JOIN t5 ON (t5.p_id =t4.p_id)
GROUP BY t1.id
ORDER BY t1.id;
Record counts are:
t1: 6434323
t2: 6934562
t3: 9141420
t4: 11515192
t5: 3797768
There are indexes on most of the columns used for the joins. The most time-consuming part of the explain plan is the outer join with t4, which happens at the end. I just want to know whether there is any way to rewrite this to improve performance.

Assuming that id is the primary key of t1, your query might (or might not, depending on how your Oracle PGA is set up) run better when written as follows:
SELECT --+ leading(t1) use_hash(t2x,t3x,t45x) full(t1) no_push_pred(t2x) no_push_pred(t3x) no_push_pred(t45x) all_rows
t1.id,
t45x.first_pri_date,
t3x.first_pub_date,
t3x.last_publ_date,
t2x.first_exp_date
FROM t1
LEFT JOIN (
SELECT t2.id,
MIN(t2.date) AS first_exp_date
FROM t2
GROUP BY t2.id
) t2x
ON t2x.id = t1.id
LEFT JOIN (
SELECT t3.id,
MIN(t3.date) AS first_pub_date,
MAX(t3.date) AS last_publ_date
FROM t3
GROUP BY t3.id
) t3x
ON t3x.id = t1.id
LEFT JOIN (
SELECT --+ leading(t5) use_hash(t4)
t4.id,
MIN(t5.date) AS first_pri_date
FROM t4
JOIN t5 ON t5.p_id = t4.p_id
GROUP BY t4.id
) t45x
ON t45x.id = t1.id
ORDER BY t1.id;
This rewrite does not require creating any additional indexes that would otherwise be useless.
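The idea behind this rewrite (aggregate each child table first, then join) can be tried on toy data. The following is only a sketch: it uses SQLite through Python's sqlite3 module as a stand-in for Oracle, leaves out the hints and the t4/t5 branch, and renames the date column to dt. The sample rows are made up for illustration.

```python
import sqlite3

# Tiny in-memory stand-in for the original tables ("dt" replaces "date").
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE t1 (id INTEGER PRIMARY KEY);
CREATE TABLE t2 (id INTEGER, dt TEXT);
CREATE TABLE t3 (id INTEGER, dt TEXT);
INSERT INTO t1 VALUES (1), (2);
INSERT INTO t2 VALUES (1, '2020-01-05'), (1, '2020-01-02'), (1, '2020-01-09');
INSERT INTO t3 VALUES (1, '2021-03-01'), (1, '2021-04-01');
""")

# Join first: every t2 row for an id is paired with every t3 row for that id.
joined_rows = con.execute("""
SELECT COUNT(*)
FROM t1
LEFT JOIN t2 ON t2.id = t1.id
LEFT JOIN t3 ON t3.id = t1.id
""").fetchone()[0]

# Original shape: join everything, then aggregate.
join_then_group = con.execute("""
SELECT t1.id, MIN(t2.dt), MIN(t3.dt), MAX(t3.dt)
FROM t1
LEFT JOIN t2 ON t2.id = t1.id
LEFT JOIN t3 ON t3.id = t1.id
GROUP BY t1.id
ORDER BY t1.id
""").fetchall()

# Rewritten shape: aggregate each child table once, then join one row per id.
group_then_join = con.execute("""
SELECT t1.id, t2x.first_exp_date, t3x.first_pub_date, t3x.last_publ_date
FROM t1
LEFT JOIN (SELECT id, MIN(dt) AS first_exp_date
           FROM t2 GROUP BY id) t2x ON t2x.id = t1.id
LEFT JOIN (SELECT id, MIN(dt) AS first_pub_date, MAX(dt) AS last_publ_date
           FROM t3 GROUP BY id) t3x ON t3x.id = t1.id
ORDER BY t1.id
""").fetchall()

print(joined_rows)
print(join_then_group == group_then_join)
```

On this data the plain join produces more intermediate rows than t1 has ids, while both query shapes return the same final rows. Whether the rewritten shape is actually faster on the real tables depends on the optimizer and memory settings, as noted above.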

I would say that your problem is that you are doing many LEFT JOINs and the intermediate result set gets too big after all of them have been applied. In that form, indexes also cannot be used to calculate MIN or MAX in the fastest possible way. With good use of indexes you should be able to calculate MIN or MAX very quickly.
I would rather write the query like this:
SELECT t1.id,
(SELECT MIN(t5.date) FROM t5 JOIN t4 ON t5.p_id = t4.p_id WHERE t4.id = t1.id) AS first_pri_date,
(SELECT MIN(date) FROM t3 WHERE t3.id = t1.id) AS first_pub_date,
(SELECT MAX(date) FROM t3 WHERE t3.id = t1.id) AS last_publ_date,
(SELECT MIN(date) FROM t2 WHERE t2.id = t1.id) AS first_exp_date
FROM t1
ORDER BY t1.id;
For better performance, create indexes on (id, date) or (p_id, date).
So your indexes would be like this:
CREATE INDEX ix2 ON T2 (id,date);
CREATE INDEX ix3 ON T3 (id,date);
CREATE INDEX ix5 ON T5 (p_id,date);
CREATE INDEX ix4 ON T4 (id);
But there still remains a problem with the join between t4 and t5.
In case there is 1:1 relation between t1 and t4, it could be even better to write something like this on the second line:
(SELECT MIN(t5.date) FROM t5 WHERE t5.p_id = (SELECT p_id FROM t4 WHERE t4.id=t1.id)) AS first_pri_date,
If it is 1:N and also if CROSS APPLY and OUTER APPLY work on your Oracle version, you can rewrite the second line like this:
(SELECT MIN(t5min.PartialMinimum)
FROM t4
CROSS APPLY
(
SELECT MIN(t5.date) AS PartialMinimum
FROM t5
WHERE t5.p_id = t4.p_id
) t5min
WHERE t4.id = t1.id)
AS first_pri_date
All this is aimed at the best possible use of indexes during calculation of MIN or MAX.
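So the whole SELECT could be rewritten like this (TOP 1 is SQL Server syntax; on Oracle 12c the equivalent is FETCH FIRST 1 ROW ONLY placed after the ORDER BY):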
SELECT t1.id,
(SELECT MIN(t5min.PartialMinimum)
FROM t4
CROSS APPLY
(
SELECT TOP 1 date AS PartialMinimum
FROM t5
WHERE t5.p_id = t4.p_id
ORDER BY 1 ASC
) t5min
WHERE t4.id = t1.id) AS first_pri_date,
(SELECT TOP 1 date FROM t2 WHERE t2.id = t1.id ORDER BY 1 ASC) AS first_exp_date,
(SELECT TOP 1 date FROM t3 WHERE t3.id = t1.id ORDER BY 1 ASC) AS first_pub_date,
(SELECT TOP 1 date FROM t3 WHERE t3.id = t1.id ORDER BY 1 DESC) AS last_publ_date
FROM t1
ORDER BY 1;
This is, I believe, the most efficient way to get MIN or MAX from a historical data table.
The point is that computing MIN over a large number of non-indexed values forces the server to read all of the data into memory and then compute MIN or MAX from it, which takes a long time because of the heavy I/O. Poor use of indexes with MIN or MAX can leave the whole historical table cached in memory even though it is needed for nothing but the MIN or MAX calculation.
Without the CROSS APPLY part of the query, the server would need to load all individual dates from t5 into memory and calculate MIN from the whole loaded result set.
Note that MIN on a properly indexed table behaves like TOP 1 with ORDER BY, which is very fast, so the results can come back almost instantly.
CROSS APPLY is available from Oracle 12c; on older versions you can use pipelined functions instead.
Check this SQL Fiddle, especially the differences in execution plans.
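The effect of the composite (id, date) index suggested in this answer can be checked on a small scale. The sketch below uses SQLite through Python's sqlite3 module, not Oracle, so the plan text differs from a real Oracle explain plan; it only illustrates that a per-id MIN can be answered from such an index. The column is named dt instead of date, and the payload column and sample rows are made up.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE t2 (id INTEGER, dt TEXT, payload TEXT);
CREATE INDEX ix2 ON t2 (id, dt);
INSERT INTO t2 VALUES
  (1, '2020-01-05', 'a'),
  (1, '2020-01-02', 'b'),
  (2, '2019-07-01', 'c');
""")

query = "SELECT MIN(dt) FROM t2 WHERE id = ?"

# The per-id minimum, as in the scalar subqueries of the answer.
first_exp_date = con.execute(query, (1,)).fetchone()[0]

# The last column of each EXPLAIN QUERY PLAN row is a human-readable step.
plan = [row[3] for row in con.execute("EXPLAIN QUERY PLAN " + query, (1,))]
uses_ix2 = any("ix2" in step for step in plan)

print(first_exp_date)
for step in plan:
    print(step)
```

If the plan mentions ix2, the engine is seeking into the index for the given id rather than scanning the table. On Oracle the corresponding check would be done with its own explain plan tooling.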

Related

Sql CE Calculate rows from same column when rows are filtered

I get wrong data because the ids are not being filtered. Is it possible to apply t1.id < t2.id for only one vehicle (regoznaka=OS 428-EF)?
SELECT
t2.id,
t2.Regoznaka,
t2.tocenolit,
max(t2.stanjekm) as km2,
max(t1.stanjekm) as km1,
max(t2.stanjekm) - max(t1.stanjekm) as Kilometara,
max(t1.id) as id1
FROM Gorivo AS t1
Right JOIN
Gorivo AS t2 ON t1.id < t2.id
Where t2.regoznaka='OS 428-EF'
Group by t2.id, t2.Regoznaka, t2.tocenolit, t2.stanjekm

Your problem is that you are not limiting the rows from the left table.
You need to apply the condition to the left table as well:
and t1.regoznaka='OS 428-EF'
OR:
and t1.regoznaka = t2.regoznaka

Performance of two left joins versus union

I have searched but have not found a definitive answer. Which of these is better for performance in SQL Server:
SELECT T.*
FROM dbo.Table1 T
LEFT JOIN Table2 T2 ON T.ID = T2.Table1ID
LEFT JOIN Table3 T3 ON T.ID = T3.Table1ID
WHERE T2.Table1ID IS NOT NULL
OR T3.Table1ID IS NOT NULL
or...
SELECT T.*
FROM dbo.Table1 T
JOIN Table2 T2 ON T.ID = T2.Table1ID
UNION
SELECT T.*
FROM dbo.Table1 T
JOIN Table3 T3 ON T.ID = T3.Table1ID
I have tried running both but it's hard to tell for sure. I'd appreciate an explanation of why one is faster than the other, or if it depends on the situation.

Your two queries do not do the same things. In particular, the first will return duplicate rows if values are duplicated in either table.
If you are looking for rows in Table1 that are in either of the other two tables, I would suggest using exists:
select t1.*
from Table1 t1
where exists (select 1 from Table2 t2 where t2.Table1Id = t1.id) or
exists (select 1 from Table3 t3 where t3.Table1Id = t1.id);
Also create indexes on Table1Id in both Table2 and Table3.
Which of your original queries is faster depends a lot on the data. The second has an extra step to remove duplicates (union versus union all). On the other hand, the first might end up creating many duplicate rows.
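The difference in results between the three forms can be seen on a few rows. This is a sketch using SQLite through Python's sqlite3 module rather than SQL Server, with only the ID column selected and made-up sample data; it says nothing about relative speed.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Table1 (ID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Table2 (Table1ID INTEGER);
CREATE TABLE Table3 (Table1ID INTEGER);
INSERT INTO Table1 VALUES (1, 'a'), (2, 'b'), (3, 'c');
INSERT INTO Table2 VALUES (1), (1);   -- ID 1 appears twice in Table2
INSERT INTO Table3 VALUES (1), (2);
""")

# Two LEFT JOINs with an OR filter: matching child rows multiply the parent row.
left_join_ids = [r[0] for r in con.execute("""
SELECT T.ID
FROM Table1 T
LEFT JOIN Table2 T2 ON T.ID = T2.Table1ID
LEFT JOIN Table3 T3 ON T.ID = T3.Table1ID
WHERE T2.Table1ID IS NOT NULL OR T3.Table1ID IS NOT NULL
ORDER BY T.ID
""")]

# UNION of two inner joins: duplicates are removed by UNION.
union_ids = [r[0] for r in con.execute("""
SELECT T.ID FROM Table1 T JOIN Table2 T2 ON T.ID = T2.Table1ID
UNION
SELECT T.ID FROM Table1 T JOIN Table3 T3 ON T.ID = T3.Table1ID
ORDER BY 1
""")]

# EXISTS: each Table1 row is tested once and never multiplied.
exists_ids = [r[0] for r in con.execute("""
SELECT t1.ID
FROM Table1 t1
WHERE EXISTS (SELECT 1 FROM Table2 t2 WHERE t2.Table1ID = t1.ID)
   OR EXISTS (SELECT 1 FROM Table3 t3 WHERE t3.Table1ID = t1.ID)
ORDER BY t1.ID
""")]

print(left_join_ids, union_ids, exists_ids)
```

Here the LEFT JOIN form returns ID 1 twice, while the UNION and EXISTS forms return each qualifying ID once.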

Joining selected column to a table

I am trying to run this query, and it takes a long time because of the join I am using:
SELECT T1.Id,T2.T2Id,T2.Col2
FROM Table1 T1
LEFT OUTER JOIN (SELECT TOP 1 Id, TT.T2Id,TT.Col2
FROM Table2 TT
WHERE TT.TypeId=3
ORDER BY TT.OrderId
)AS T2 ON T2 .Id=T1.Id
The thing is, it doesn't let me do something like TT.Id=T1.Id within the join query.
Is there any other way I can do this?

Try it with outer apply:
SELECT T1.Id, T2.T2Id, T2.Col2
FROM Table1 T1
OUTER APPLY (SELECT TOP 1 TT.T2Id, TT.Col2
FROM Table2 TT
WHERE TT.TypeId = 3 AND TT.Id = T1.Id
ORDER BY TT.OrderId) T2

SELECT T1.Id, T2.T2Id, T2.Col2
FROM Table1 T1
OUTER APPLY (SELECT TOP 1 TT.T2Id, TT.Col2
FROM Table2 TT
WHERE TT.TypeId = 3 AND T1.Id = TT.Id
ORDER BY TT.T2Id DESC) T2
I would use OUTER APPLY with T1.Id = TT.Id in the WHERE condition, since T1 is the parent table, and add an ORDER BY if an ordered result set is needed.
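For readers without SQL Server at hand, the "first matching child row per parent" behaviour of OUTER APPLY with TOP 1 can be approximated in other engines. The sketch below is not OUTER APPLY: it swaps in a LEFT JOIN on a correlated LIMIT 1 subquery, run in SQLite through Python's sqlite3 module. It assumes T2Id uniquely identifies a Table2 row, which the question does not state, and the sample rows are made up.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Table1 (Id INTEGER PRIMARY KEY);
-- Assumption: T2Id is a unique key of Table2.
CREATE TABLE Table2 (T2Id INTEGER PRIMARY KEY, Id INTEGER,
                     TypeId INTEGER, OrderId INTEGER, Col2 TEXT);
INSERT INTO Table1 VALUES (1), (2), (3);
INSERT INTO Table2 VALUES
  (10, 1, 3, 2, 'second'),
  (11, 1, 3, 1, 'first'),
  (12, 1, 4, 0, 'wrong type'),
  (13, 2, 3, 5, 'only');
""")

# For each Table1 row, pick the TypeId = 3 child with the lowest OrderId.
# Parents without such a child keep NULLs, as they would with OUTER APPLY.
rows = con.execute("""
SELECT T1.Id, T2.T2Id, T2.Col2
FROM Table1 T1
LEFT JOIN Table2 T2
  ON T2.T2Id = (SELECT TT.T2Id
                FROM Table2 TT
                WHERE TT.TypeId = 3 AND TT.Id = T1.Id
                ORDER BY TT.OrderId
                LIMIT 1)
ORDER BY T1.Id
""").fetchall()

for row in rows:
    print(row)
```

The correlated condition TT.Id = T1.Id is what a plain derived table cannot express, which is why the original LEFT JOIN attempt was rejected.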

Well, first of all, your derived table will produce non-deterministic results, as the top 1 row you return may differ each time you run it, even if the data in the table remains the same. You could put an ORDER BY clause in the derived table to prevent that.
Is there an index on Table1.id? What exactly are you trying to achieve though, is it to return all rows from Table1, with just one row of many from Table2 that has the same ID?
If so I would look into using Cross Apply instead. Or maybe in this case Outer Apply. If I get a chance later I'll write up an example if needed, but in the mean time just Google Outer Apply for SQL Server.
Dan

Join query where table references itself

I'm using Oracle 10, but the best way to ask this question is with an example.
select *
from t1, t2
where t1.id = t2.id
and t1.otherID = (select max(otherID)
from t2
where id = THE ID FROM THE OUTER QUERY T1
)
I think you see where I'm trying to go with this. I need to reference t1 in the subquery to join it to the max of t2.
I need to know how to create a query like this.
"THE ID FROM THE OUTER QUERY T1" is where my confusion is.
I tried using t1.id, but did not get results.

Try the following
select t1.*, t2.*
from t1
join t2 on t1.id = t2.id
join (select id, max(otherID) as max_otherID
from t2
group by id
) a ON a.id = t1.id and a.max_otherID = t1.otherID
Using a sub-query on the join often gives better performance than using it in the where clause.
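Both forms can be compared on toy data. The sketch below runs in SQLite through Python's sqlite3 module rather than Oracle 10, with made-up rows; the inner table is given the alias m (a name chosen here for illustration) so that the correlated reference to t1.id is unambiguous.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE t1 (id INTEGER, otherID INTEGER);
CREATE TABLE t2 (id INTEGER, otherID INTEGER);
INSERT INTO t1 VALUES (1, 5), (1, 9), (2, 7);
INSERT INTO t2 VALUES (1, 5), (1, 9), (2, 3), (2, 8);
""")

# Correlated form: the subquery is evaluated against the current t1 row.
correlated = con.execute("""
SELECT t1.id, t1.otherID, t2.otherID
FROM t1, t2
WHERE t1.id = t2.id
  AND t1.otherID = (SELECT MAX(m.otherID) FROM t2 m WHERE m.id = t1.id)
ORDER BY t1.id, t2.otherID
""").fetchall()

# Derived-table form from the answer: compute every per-id maximum once.
derived = con.execute("""
SELECT t1.id, t1.otherID, t2.otherID
FROM t1
JOIN t2 ON t1.id = t2.id
JOIN (SELECT id, MAX(otherID) AS max_otherID
      FROM t2
      GROUP BY id) a ON a.id = t1.id AND a.max_otherID = t1.otherID
ORDER BY t1.id, t2.otherID
""").fetchall()

print(correlated)
print(derived)
```

Only the t1 row whose otherID equals the maximum otherID in t2 for the same id survives, so id 2 drops out here because no t1 row matches its maximum. An empty result from the correlated form can therefore simply mean that no t1.otherID equals the per-id maximum.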

Rewrite SQL code SELECT block to simplify logic

I am trying to rewrite this block with simpler logic if this can be done. I am using it within a larger SELECT statement and I think IF I can simplify this block, I might be able to improve performance of my query.
proj_catg_type_id, proj_catg_id and proj_id are all PKs in their tables.
select t1.proj_catg_name
from table1 t1, table2 t2, table3 t3
where t2.proj_catg_type_id = t1.proj_catg_type_id
and t2.proj_catg_type_id = 213
and t3.proj_id = t2.proj_id

Without knowing the referential integrity rules and the logic behind the tables, it is difficult to give a 100% correct answer. But just by looking at this statement, the most simplified logic would be
select t1.proj_catg_name
from table1 t1
where t1.proj_catg_type_id = 213;

select t1.proj_catg_name
from table1 t1 inner join table2 t2
on t2.proj_catg_type_id=t1.proj_catg_type_id
where t2.proj_catg_type_id=213
and t3.proj_id=t2.proj_id
maybe? is t3 used outside this subselect?

If t3 is a table outside the select you showed, then this is a correlated subquery, which you should not be using at all, ever! That turns your query into a row-by-agonizing-row cursor.
Use derived tables or joins to get the results.
You don't give me enough code to write a specific solution for your problem, but let me give you an example:
-- Correlated subquery in the select list:
SELECT
field1
, field2
, (SELECT t3.field3
FROM table2 t2
JOIN table3 t3 ON t2.id = t3.id
WHERE t4.somefield = t2.somefield)
FROM table1 t1
JOIN table4 t4 ON t1.id = t4.id;
-- The same lookup as a join to a derived table:
SELECT
field1
, field2
, a.field3
FROM table1 t1
JOIN table4 t4
ON t1.id = t4.id
JOIN (SELECT t3.field3, t2.somefield
FROM table2 t2
JOIN table3 t3 ON t2.id = t3.id) a
ON t4.somefield = a.somefield;
The first query runs one row at a time, which can be extremely slow. The second should give the same results when every row has exactly one match in the derived table, but it runs in a set-based fashion, which is much faster. It is important to make sure the derived table has an alias and exposes the column used in the join condition. You could also use a CTE.
