Improving performance of a full outer join in Redshift

I need to complete a full outer join using two columns, date_local and hour_local, but am concerned about query performance when using an outer join as opposed to another type of join.
Below is the query using outer join:
SELECT *
FROM TABLE_A
FULL OUTER JOIN TABLE_B
USING (DATE_LOCAL, HOUR_LOCAL)
Would the following query perform better than the query above:
WITH JOIN_VALS AS
(SELECT DATE_LOCAL
, HOUR_LOCAL
FROM TABLE_A
UNION
SELECT
DATE_LOCAL
, HOUR_LOCAL
FROM TABLE_B
)
SELECT
JV.DATE_LOCAL
, JV.HOUR_LOCAL
, TA.PLANNED
, TB.ACTUAL
FROM JOIN_VALS JV
LEFT JOIN TABLE_A TA
ON JV.DATE_LOCAL = TA.DATE_LOCAL
AND JV.HOUR_LOCAL = TA.HOUR_LOCAL
LEFT JOIN TABLE_B TB
ON JV.DATE_LOCAL = TB.DATE_LOCAL
AND JV.HOUR_LOCAL = TB.HOUR_LOCAL;
I am wondering if I get any performance improvement by isolating the unique join values first, rather than finding them during the outer join.

UNION can be expensive, and I don't think you will see any benefit from this construct in Redshift; more likely a performance loss. Redshift is a columnar database and sees no benefit from peeling off just these two columns first.
The big cost will come if the matches between the two tables on these two columns are many-to-many: that leads to additional row creation, which can cause slow performance.
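If you suspect a many-to-many match, one quick check (a sketch, using the tables from the question) is to count how many rows share each join key in each table:
SELECT DATE_LOCAL, HOUR_LOCAL, COUNT(*) AS ROWS_PER_KEY
FROM TABLE_A
GROUP BY DATE_LOCAL, HOUR_LOCAL
HAVING COUNT(*) > 1; -- any rows returned mean the key is not unique in TABLE_A
Running the same check against TABLE_B tells you whether the join can multiply rows from both sides.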

Related

What is the difference between WHERE and JOIN in Hive SQL when joining two tables?

For example,
-- use WHERE
SELECT
a.id
FROM
table_a as a,
table_b as b
WHERE
a.id = b.id;
-- use JOIN
SELECT
t1.id
FROM
(
SELECT
id
FROM
table_a as a
) t1
JOIN (
SELECT
id
FROM
table_b as b
) t2 ON t1.id = t2.id
Join like this
FROM
table_a as a,
table_b as b
WHERE
a.id = b.id;
is bad practice because, in general, WHERE is applied after the join; it is up to the optimizer to push the predicates down and convert this into a proper join instead of a CROSS join (a join without an ON condition).
Always use an explicit JOIN with an ON condition when possible. That way the optimizer knows for sure it is a join condition; it is also ANSI syntax and easier to understand.
Non-equi join conditions like a.date between b.start and b.end cannot be used in the ON condition (in Hive), so they have to be moved to the WHERE clause. In that case, if there are no other conditions in the ON clause, a cross join is used and the WHERE filter is applied afterwards; such a join can massively multiply the data before the WHERE filter and cause performance degradation. So: always use explicit ANSI JOINs with ON conditions when possible, put all equality conditions in the ON, and move non-equi conditions to the WHERE only when they cannot go in the ON. Keep join conditions in the ON and only filters in the WHERE. The optimizer will push filters into or before the join when possible, but it is better not to rely on the optimizer alone; write good ANSI SQL that is easy to understand and to port to another database if needed.
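For example, a sketch of the recommended pattern (the tables events and ranges and their columns are hypothetical):
SELECT
e.id,
r.label
FROM
events AS e
JOIN
ranges AS r ON e.region_id = r.region_id -- equality condition stays in the ON
WHERE
e.event_date BETWEEN r.start_dt AND r.end_dt; -- non-equi condition goes in the WHERE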
You can check the difference in plans using the EXPLAIN command.

Converting Nested Subqueries into Mini Queries

I have a lot of trouble reading nested subqueries - I personally prefer to write several mini queries and work from there. I understand that more advanced SQL users find it more efficient to write nested subqueries.
For instance, in the following query:
select distinct b.table_b, a.*
from table_a a
inner join table_c b
on a.id_1 = b.id_1
inner join ( select a.id_1, max(a.var_1) as max_var_1 from table_a a
group by a.id_1) c
on a.id_1 = c.id_1 and a.var_1 = c.max_var_1
Problem: I am trying to turn this into several different queries:
#PART 1:
create table table_1 as select distinct b.table_b, a.*
from table_a a
inner join table_c b
on a.id_1 = b.id_1
#PART 2:
create table table_2 as select a.id_1, max(a.var_1) as max_var_1 from table_a a
group by a.id_1
#PART 3 (final result: final_table)
create table final_table as select a.*, b.*
from table_1 a
inner join table_2 b
on a.id_1 = b.id_1 and a.var_1 = b.max_var_1
My Question: Can someone please tell me if this is correct? Is this how the above nested subquery can be converted into 3 mini queries?
Subqueries are usually only materialized into separate tables when you use them multiple times. Even then, if the subquery returns many records, materializing them is not recommended: a plain SELECT only reads data from disk, while an INSERT also writes to disk, so inserting many records can take much longer than selecting them.
P.S. When a subquery is materialized, CREATE TEMPORARY TABLE is mostly used.
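For example, a sketch of that pattern applied to #PART 2 (the exact temporary-table syntax varies by database):
create temporary table table_2 as
select a.id_1, max(a.var_1) as max_var_1
from table_a a
group by a.id_1; -- a temporary table is dropped automatically at the end of the session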
Another good way is to use a CTE (Common Table Expression). With a CTE, the database can store the result of the SELECT in RAM, executing the subquery only once; if the subquery is then used multiple times, the database reuses the result from RAM instead of re-executing it.
For the performance of your query, only #PART 2 is needed; the other parts are unnecessary. But for better performance, I recommend writing your query without inserting at all, using a CTE. For example:
with sub_query as (
select
id_1,
max(var_1) as max_var_1
from
table_a
group by id_1
)
select distinct b.table_b, a.*
from table_a a
inner join table_c b
on a.id_1 = b.id_1
inner join sub_query c
on a.id_1 = c.id_1 and a.var_1 = c.max_var_1;

Are Cartesian (cross) joins with a WHERE statement still slower than inner joins?

Compare these two queries:
select b.prod_name, b.prod_category, a.transaction_amt, a.transaction_dt
from transactions a, prod_xref b
where a.prod_id = b.id
VS.
select b.prod_name, b.prod_category, a.transaction_amt, a.transaction_dt
from transactions a
inner join prod_xref b on a.prod_id = b.id
Is the first query still slower than the second?
What are the benefits / disadvantages of using a cartesian join vs an explicit join statement?
Answering your question: a true Cartesian (CROSS) join is much slower than almost any other join.
The reason is that a CROSS join multiplies each row of t1 by each row of t2.
However, the example you provided is not a CROSS join. It is the old implicit syntax for inner joins, where two or more tables are listed in the FROM clause, comma separated, and the join condition is placed in the WHERE clause; the optimizer treats it as an inner join, so the two queries should generally produce the same plan.
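For contrast, a true cross join on the same tables (a sketch) has no join condition at all:
select b.prod_name, b.prod_category, a.transaction_amt, a.transaction_dt
from transactions a
cross join prod_xref b; -- every transactions row pairs with every prod_xref row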

(Oracle) Does adding a filter on the master table improve the performance of a left join between master and detail?

I would like to know if adding a condition in the left join clause that filters the records on the master table improves the performance of the left join between master and detail tables.
E.g.
I have a master table MT(ID, TYPE) and a detail table DT(ID, FK, NAME). The left join would be written like:
select MT.ID, DT.NAME
from MT left join
DT
on MT.ID = DT.FK
If, among the results of the left join, I only need the information for records of a certain type, let's say MT.TYPE = '01', does adding this condition in the left join clause improve the performance of the query?
select MT.ID, DT.NAME
from MT left join
DT
on MT.TYPE = '01' and MT.ID = DT.FK
If you have no indices set up on the MT and DT tables, then in general both queries would be executed using full table scans and both would have similar performance. The situation where the second query might evaluate faster than the first is where you have proper indices set up, e.g.
(TYPE, ID) on the MT table
(FK, NAME) on the DT table
In this case, if MT.TYPE = '01' were very restrictive, it could greatly reduce the amount of work the database would have to do. Also, this set of indices would speed up the join operation.
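A sketch of those indices (the index names are made up):
CREATE INDEX MT_TYPE_ID ON MT (TYPE, ID); -- lets the MT.TYPE = '01' filter use an index range scan
CREATE INDEX DT_FK_NAME ON DT (FK, NAME); -- covers both the join column FK and the selected column NAME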

Which is better for performance: selecting all the columns, or selecting only the required columns, while performing a join?

I have been asked to do performance tuning of a SQL Server query which has many joins in it.
For example
LEFT JOIN
vw_BILLABLE_CENSUS_R CEN ON DE.Client = CEN.Client
AND CAL.REPORTING_MONTH = CEN.REPORTING_MONTH
There are almost 25 columns present in vw_BILLABLE_CENSUS_R, but we want to use only 3 of them. So I wanted to know: instead of selecting all the columns from the view or table, what if I select only the columns that are required and then perform the join, like this
LEFT JOIN (SELECT [Column_1], [Column_2], [Column_3]
FROM vw_BILLABLE_CENSUS_R) CEN ON DE.Client = CEN.Client
AND CAL.REPORTING_MONTH = CEN.REPORTING_MONTH
So will this improve the performance or not?
The important part is the columns you are actually using in the outermost SELECT, not the ones you are selecting in the join. The SQL Server engine is smart enough to realize that it does not need to retrieve all columns from the referenced table (or view) if they are not used.
So the following 2 queries should yield the exact same query execution plan:
SELECT
A.SomeColumn
FROM
MyTable AS A
LEFT JOIN (
SELECT
*
FROM
OtherTable AS B) AS X ON A.SomeColumn = X.SomeColumn
SELECT
A.SomeColumn
FROM
MyTable AS A
LEFT JOIN (
SELECT
B.SomeColumn
FROM
OtherTable AS B) AS X ON A.SomeColumn = X.SomeColumn
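One way to verify this yourself (a sketch for SQL Server; SET SHOWPLAN_XML must be alone in its batch) is to capture the estimated plans for both queries and compare them:
SET SHOWPLAN_XML ON;
GO
-- run both queries here: the server returns the estimated plan XML instead of executing them
SET SHOWPLAN_XML OFF;
GO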
The difference would be if you actually used the selected columns (in a WHERE condition or by actually retrieving the values), as here:
SELECT
A.SomeColumn,
X.* -- * has all X columns
FROM
MyTable AS A
LEFT JOIN (
SELECT
B.*
FROM
OtherTable AS B) AS X ON A.SomeColumn = X.SomeColumn
SELECT
A.SomeColumn,
X.* -- * has only X's SomeColumn
FROM
MyTable AS A
LEFT JOIN (
SELECT
B.SomeColumn
FROM
OtherTable AS B) AS X ON A.SomeColumn = X.SomeColumn
I would rather use this approach:
LEFT JOIN
vw_BILLABLE_CENSUS_R CEN ON DE.Client = CEN.Client
AND CAL.REPORTING_MONTH = CEN.REPORTING_MONTH
than this
LEFT JOIN (SELECT [Column_1], [Column_2], [Column_3]
FROM vw_BILLABLE_CENSUS_R) CEN ON DE.Client = CEN.Client
AND CAL.REPORTING_MONTH = CEN.REPORTING_MONTH
Since in this case:
you make your query simpler,
you do not have to rely on the query optimizer's smartness and expect that it will eliminate unnecessary columns and rows,
finally, you can select as many columns in the outer SELECT as necessary without using derived table techniques.
In some cases derived tables are welcome, when you want to eliminate duplicates in a table you want to join on the fly (see the sketch below), but, imho, not in your case.
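For example, a sketch of that duplicate-elimination case (assuming Client and REPORTING_MONTH are the only columns needed from the view):
LEFT JOIN (SELECT DISTINCT Client, REPORTING_MONTH
FROM vw_BILLABLE_CENSUS_R) CEN ON DE.Client = CEN.Client
AND CAL.REPORTING_MONTH = CEN.REPORTING_MONTH
-- SELECT DISTINCT collapses duplicate (Client, REPORTING_MONTH) pairs, so the left join cannot multiply rows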
It depends on how many records are stored, but generally it will improve performance.
In this case, read @LukStorms' comments; I think he is right.