FULL OUTER JOIN with an OR condition

I have two tables, each with two different ids. I want to join on both ids plus a few other columns, but the ids don't always match: sometimes id 1 has a match, sometimes id 2 matches nothing, and sometimes both match.
Using a full outer join with an OR condition in the JOIN clause really slows down my query. Is there a more efficient way of doing this?
I know you can use unions instead in the case of inner joins, but I am not sure how to optimize outer joins.
SELECT A.*, B.*
FROM A
FULL OUTER JOIN B
ON (A.id_1 = B.id_1 or A.id_2 = B.id_2)
AND A.pay_month = B.pay_month
AND A.plan = B.plan

Hmmm . . . This might be sufficient:
select A.*, B.*
from A full outer join
B
on A.id_1 = B.id_1 and A.pay_month = B.pay_month and A.plan = B.plan
union -- intentionally to remove duplicates
select A.*, B.*
from A full outer join
B
on A.id_2 = B.id_2 and A.id_1 <> B.id_1 and A.pay_month = B.pay_month and A.plan = B.plan;
This is not 100% equivalent -- for instance, this removes duplicates even within a table. Also, the union adds the overhead of removing duplicates. But the results may be good for your purposes.
Also, is a full outer join really necessary? I rarely use it in my code.
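One concrete way the union above differs from the original: a row of A that matches B only on id_2 also comes out of the first branch as an unmatched row, so it appears twice. A minimal sketch of a variant that avoids this, assuming the column types and sample rows below (they are made up for illustration), is to union the matched pairs from two inner joins and then append the unmatched rows from each side:

```sql
-- Hypothetical sample data; column types are assumptions.
CREATE TABLE A (id_1 INT, id_2 INT, pay_month INT, plan VARCHAR(10));
CREATE TABLE B (id_1 INT, id_2 INT, pay_month INT, plan VARCHAR(10));

INSERT INTO A VALUES (1, 10, 202401, 'gold');   -- matches B on id_1
INSERT INTO A VALUES (2, 20, 202401, 'gold');   -- matches B on id_2 only
INSERT INTO A VALUES (3, 30, 202401, 'gold');   -- no match in B
INSERT INTO B VALUES (1, 99, 202401, 'gold');
INSERT INTO B VALUES (98, 20, 202401, 'gold');
INSERT INTO B VALUES (4, 40, 202401, 'gold');   -- no match in A

-- Matched pairs: one equality join per id, deduplicated by UNION.
SELECT A.*, B.*
FROM A JOIN B
  ON A.id_1 = B.id_1 AND A.pay_month = B.pay_month AND A.plan = B.plan
UNION
SELECT A.*, B.*
FROM A JOIN B
  ON A.id_2 = B.id_2 AND A.pay_month = B.pay_month AND A.plan = B.plan
UNION ALL
-- Rows of A with no partner on either id (ON 1 = 0 only supplies NULLs for B.*).
SELECT A.*, B.*
FROM A LEFT JOIN B ON 1 = 0
WHERE NOT EXISTS (SELECT 1 FROM B b2
                  WHERE b2.id_1 = A.id_1
                    AND b2.pay_month = A.pay_month AND b2.plan = A.plan)
  AND NOT EXISTS (SELECT 1 FROM B b2
                  WHERE b2.id_2 = A.id_2
                    AND b2.pay_month = A.pay_month AND b2.plan = A.plan)
UNION ALL
-- Rows of B with no partner on either id (NULLs for A.*).
SELECT A.*, B.*
FROM B LEFT JOIN A ON 1 = 0
WHERE NOT EXISTS (SELECT 1 FROM A a2
                  WHERE a2.id_1 = B.id_1
                    AND a2.pay_month = B.pay_month AND a2.plan = B.plan)
  AND NOT EXISTS (SELECT 1 FROM A a2
                  WHERE a2.id_2 = B.id_2
                    AND a2.pay_month = B.pay_month AND a2.plan = B.plan);
```

With the sample rows this returns four rows: two matched pairs, one A-only row and one B-only row. The UNION between the first two branches still collapses identical rows, as noted above, and whether this is actually faster depends on your indexes and database, so compare the execution plans.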

Related

In JPQL, in case of left joins, is there any difference when we give a particular condition in ON clause or in where clause?

Is there any difference between these two queries in terms of processing and the data returned?
SELECT * FROM TableA A INNER JOIN TableB B ON A.id = B.id
LEFT JOIN TableC C ON A.id = C.id AND (C.status IS NULL OR C.status <> 3)
WHERE A.status = 1 AND B.status = 1
OR
SELECT * FROM TableA A INNER JOIN TableB B ON A.id = B.id
LEFT JOIN TableC C ON A.id = C.id
WHERE A.status = 1 AND B.status = 1 AND (C.status IS NULL OR C.status <> 3)

After a deep dive into left joins and the different clauses, here is what I came up with:
1) When to use the WHERE and ON clauses with joins:
For inner joins, the result is the same whether you put a condition in the WHERE clause or the ON clause. Conceptually, conditions in the WHERE clause filter the data and conditions in the ON clause join the tables, so for readability it is good to keep filtering conditions in the WHERE clause to convey your intent. But here is the catch: depending on the database, the query plan may change with where you put your conditions, which can affect performance.
For outer joins (left, right, full), placing conditions blindly in the WHERE or ON clause can give you incorrect data.
2) When a left join (or any outer join) can start behaving like an inner join:
Be careful when a WHERE condition touches columns that the outer join can fill with NULLs. Suppose you have two tables, users and orders, and you want all users irrespective of whether they have placed an order, so you choose a left join. Then you add a filter such as noOfOrders > 3 in the WHERE clause. For users without orders, the left join fills the columns from the orders table with NULL, and comparing NULL with anything is never true, so those rows are filtered out. The left join then behaves like an inner join.
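To make point 2 concrete, here is a minimal sketch. The users and orders tables, their columns, and the amount > 10 filter are hypothetical stand-ins for the noOfOrders example above:

```sql
-- Hypothetical tables and data, for illustration only.
CREATE TABLE users  (user_id INT, name VARCHAR(20));
CREATE TABLE orders (order_id INT, user_id INT, amount INT);

INSERT INTO users  VALUES (1, 'ann');
INSERT INTO users  VALUES (2, 'bob');       -- bob has no orders
INSERT INTO orders VALUES (100, 1, 50);
INSERT INTO orders VALUES (101, 1, 5);

-- Filter in ON: it only restricts which orders are joined; every user is kept.
-- bob comes back with NULLs in the order columns.
SELECT u.name, o.order_id, o.amount
FROM users u
LEFT JOIN orders o
  ON o.user_id = u.user_id AND o.amount > 10;

-- Filter in WHERE: applied after the join. For bob, o.amount is NULL, the
-- comparison is not true, and the row is dropped, as with an inner join.
SELECT u.name, o.order_id, o.amount
FROM users u
LEFT JOIN orders o
  ON o.user_id = u.user_id
WHERE o.amount > 10;
```

With this data the first query returns two rows (ann with order 100, bob with NULLs) and the second returns only ann's row.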

What's the purpose of a JOIN where no column from 2nd table is being used?

I am looking through some hive queries we are running as part of analytics on our hadoop cluster, but I am having trouble understanding one. This is the Hive QL query
SELECT
c_id, v_id, COUNT(DISTINCT(m_id)) AS participants,
cast(date_sub(current_date, ${window}) as string) as event_date
from (
select
a.c_id, a.v_id, a.user_id,
case
when c.id1 is not null and a.timestamp <= c.stitching_ts then c.id2 else a.m_id
end as m_id
from (
select * from first
where event_date <= cast(date_sub(current_date, ${window}) as string)
) a
join (
select * from second
) b on a.c_id = b.c_id
left join third c
on a.user_id = c.id1
) dx
group by c_id, v_id;
I have changed the names but otherwise this is the select statement being used to insert overwrite to another table.
Regarding the join
join (
select * from second
) b on a.c_id = b.c_id
b is not used anywhere except for join condition, so is this join serving any purpose at all?
Is it there to make sure the result only has entries whose c_id is present in the second table? Would a WHERE ... IN condition be better if that's all this is doing?
Or can I just remove this join without it making any difference at all?
Thanks.

A join (inner, left or right) can duplicate rows if the join key is not unique in the joined dataset. For example, if a contains a single row with c_id=1 and b contains two rows with c_id=1, the result will be two rows with a.c_id=1.
An inner join can also filter out rows if the join key is absent from the joined dataset. I believe this is what it is meant to do here.
If the goal is to keep only rows with keys present in both datasets (a filter), you do not want duplication, and you do not use columns from the joined dataset, then it is better to use LEFT SEMI JOIN instead of JOIN. It works as a filter only, even if there are duplicated keys in the joined dataset:
left semi join (
select c_id from second
) b on a.c_id = b.c_id
This is a much safer way to keep only the rows which exist in both a and b while avoiding unintended duplication.
You can replace the join with WHERE IN/EXISTS, but it makes no difference: it is implemented as the same JOIN. Check the EXPLAIN output and you will see the same query plan. Better to use LEFT SEMI JOIN, which implements uncorrelated IN/EXISTS in an efficient way.
If you prefer to move it to the WHERE:
WHERE a.c_id IN (select c_id from second)
or correlated EXISTS:
WHERE EXISTS (select 1 from second b where a.c_id=b.c_id)
But as I said, all of them are implemented internally using JOIN operator.
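The duplication described above is easy to reproduce. A minimal sketch in plain SQL, using hypothetical a_demo and b_demo tables as stand-ins for a and b:

```sql
-- Hypothetical stand-ins for the a and b datasets.
CREATE TABLE a_demo (c_id INT, m_id INT);
CREATE TABLE b_demo (c_id INT);

INSERT INTO a_demo VALUES (1, 7);
INSERT INTO a_demo VALUES (2, 8);   -- c_id 2 is absent from b_demo
INSERT INTO b_demo VALUES (1);
INSERT INTO b_demo VALUES (1);      -- duplicate key

-- Plain join: filters out c_id 2, but the a_demo row for c_id 1 comes back twice.
SELECT a.c_id, a.m_id
FROM a_demo a
JOIN b_demo b ON a.c_id = b.c_id;

-- EXISTS as a pure filter: same filtering, one row per a_demo row.
SELECT a.c_id, a.m_id
FROM a_demo a
WHERE EXISTS (SELECT 1 FROM b_demo b WHERE b.c_id = a.c_id);
```

As far as I can tell, the query in the question aggregates with COUNT(DISTINCT m_id), so duplicated rows would not change its counts; the filtering effect is what matters there. The duplication would start to matter as soon as a plain COUNT or SUM is added.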

Microsoft SQL Server performance: OR in the ON clause

I need to join 2 tables using a field 'cdi' from the 2nd table with cdi or cd_cliente from the 1st table. I mean that it might match the same field or cd_cliente from the 1st table.
My original query was
select
a.cd_cliente, a.cdi as cdi_cli,b.*
from
clientes a
left join
rightTable b on a.cdi = b.cdi or a.cd_cliente = b.cdi
But since it took too much time, I changed it to:
Select a.cd_cliente, a.cdi, b.*
from clientes a
left join
(select
a.cd_cliente, a.cdi as cdi_cli, b.*
from
clientes a
inner join
rightTable b on a.cdi = b.cdi
union
select
a.cd_cliente, a.cdi as cdi_cli, b.*
from
clientes a
inner join
rightTable b on a.cd_cliente = b.cdi) b
on a.cd_cliente=b.cd_cliente
And it took less time. I'm not sure if the results would be the same. And if so, why the time taken by the 2nd query is considerably less?

I'm not sure if the results would be the same. Most likely not.
Consider a row in clientes that matched a row in rightTable on cdi but did not match any row on cd_cliente. The first query will return one row for the match. The second query will return two rows. Once for the match, and once for the not match, but with nulls filled in the rightTable columns because of the left outer join.
Also, if the first query returns any legitimate duplicates those will be removed by the union operator in the second query.

SQL Server isn't good with OR and indexes. Not sure why. Your second query is getting around that by (most likely) seeking via indexes twice and then merging them somehow.
There are simpler queries you could try, such as this one:
SELECT
a.cd_cliente,
cdi_cli = a.cdi,
b.*
FROM
dbo.clientes a
OUTER APPLY (
SELECT *
FROM dbo.rightTable b
WHERE a.cdi = b.cdi
UNION
SELECT *
FROM dbo.rightTable b
WHERE a.cd_cliente = b.cdi
) b
;
And here's a weird one that could actually work, though I'm not sure:
SELECT
a.cd_cliente,
cdi_cli = a.cdi,
b.*
FROM
dbo.clientes a
OUTER APPLY (
SELECT *
FROM dbo.rightTable b
WHERE EXISTS (
SELECT 1 WHERE a.cdi = b.cdi
UNION
SELECT 1 WHERE a.cd_cliente = b.cdi
)
) b
;
Told you it was weird! And here's an even weirder (and probably inadvisable) one.
SELECT
a.cd_cliente,
cdi_cli = a.cdi,
BColumn1 = Max(BColumn1),
BColumn2 = Max(BColumn2),
BColumn3 = Max(BColumn3),
BColumn4 = Max(BColumn4)
-- all columns of B
FROM
dbo.clientes a
CROSS APPLY (VALUES
(a.cdi),
(a.cd_cliente)
) c (cdi)
LEFT JOIN dbo.rightTable b
ON c.cdi = b.cdi
GROUP BY
a.cd_cliente,
a.cdi
-- plus all other columns of A
;
Given some time to play with your data and indexes and work with execution plans, I'm sure we could come up with something that would really sizzle.

This is your original query:
select a.cd_cliente, a.cdi as cdi_cli,b.*
from clientes a left join
rightTable b
on a.cdi = b.cdi or a.cd_cliente = b.cdi;
The performance problem is due to the or in the on condition. This generally interferes with using indexes.
If you only cared about one column from b, you could do:
select a.cd_cliente, a.cdi as cdi_cli, coalesce(b1.col, b2.col)
from clientes a left join
rightTable b1
on a.cdi = b1.cdi left join
rightTable b2
on a.cd_cliente = b2.cdi;
This easily generalizes to a small handful of columns, but is cumbersome if b is wide.
Another way of writing the query would be much more cumbersome. It would start with the b table, double left join to a and then union in the remaining values from a:
select coalesce(a1.cd_cliente, a2.cd_cliente) as cd_cliente,
coalesce(a1.cdi, a2.cdi) as cdi_cli,
b.*
from rightTable b left join
clientes a1
on a1.cdi = b.cdi left join
clientes a2
on a2.cd_cliente = b.cdi
where a1.cdi is not null or a2.cd_cliente is not null
union all
select a.cd_cliente, a.cdi, b.*
from clientes a left join
righttable b
on 1 = 0
where not exists (select 1 from righttable b where a.cdi = b.cdi) and
not exists (select 1 from righttable b where a.cd_cliente = b.cdi)
The first part of the query gets all the matching rows, on one column or the other. The second adds in the unmatched rows. Note the strange use of left join with a condition that always evaluates to FALSE. That makes it easier to bring in the columns from b (as NULLs).
Although this looks complicated, the joins and not exists subqueries can all take advantage of appropriate indexes on the tables. That means that it should have more reasonable performance.
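As a rough sketch of what "appropriate indexes" could look like here: the table definitions, column types, and index names below are assumptions, not taken from the question.

```sql
-- Hypothetical definitions; column types and index names are assumptions.
CREATE TABLE clientes   (cd_cliente INT, cdi INT);
CREATE TABLE rightTable (cdi INT, col VARCHAR(50));

-- One single-column index per equality predicate used in the joins
-- and in the NOT EXISTS subqueries.
CREATE INDEX ix_righttable_cdi      ON rightTable (cdi);
CREATE INDEX ix_clientes_cdi        ON clientes (cdi);
CREATE INDEX ix_clientes_cd_cliente ON clientes (cd_cliente);
```

Each join condition and each NOT EXISTS subquery then compares a single column for equality, which is the shape an index lookup handles well. Whether the optimizer actually uses these indexes depends on the data, so check the execution plan.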

Changing the ON condition order results in different query results?

Will there be any difference if I change the order from this to the next one in the last line, especially when I use a left join or left outer join? Some people tell me the result might differ when we change the order, but I reckon they aren't sure about this themselves.
Or, if we change the order, in which situations (right, right outer, left, or left outer joins) does the query result differ?

It makes no difference which side you put criteria on when an = is being used.
Table order matters in the case of LEFT JOIN and RIGHT JOIN, but criteria order does not.
For example:
SELECT *
FROM Table1 a
LEFT JOIN Table2 b
ON a.ID = b.ID
Is equivalent to:
SELECT *
FROM Table2 a
RIGHT JOIN Table1 b
ON a.ID = b.ID
But not equivalent to:
SELECT *
FROM Table2 a
LEFT JOIN Table1 b
ON a.ID = b.ID
Demo: SQL Fiddle
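For the criteria order itself, a minimal sketch with made-up data: the Kind column and the rows are hypothetical, added only so the ON clause has two conditions to reorder.

```sql
-- Hypothetical tables and rows, for illustration only.
CREATE TABLE Table1 (ID INT, Kind VARCHAR(10));
CREATE TABLE Table2 (ID INT, Kind VARCHAR(10));

INSERT INTO Table1 VALUES (1, 'x');
INSERT INTO Table1 VALUES (2, 'y');
INSERT INTO Table2 VALUES (1, 'x');
INSERT INTO Table2 VALUES (3, 'z');

-- Original order of conditions and operands.
SELECT *
FROM Table1 a
LEFT JOIN Table2 b
  ON a.ID = b.ID AND a.Kind = b.Kind;

-- Conditions swapped, and the operands of each = swapped: same rows.
SELECT *
FROM Table1 a
LEFT JOIN Table2 b
  ON b.Kind = a.Kind AND b.ID = a.ID;
```

Both queries return the row for ID 1 with its match and the row for ID 2 with NULLs on the Table2 side, because Table1 stays on the left in both.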

SQL trying to do a JOIN to include results from multiple Tables

I'm a complete novice teaching myself SQL by writing and modifying a few queries and reports at work.
I've got something of a handle on the various types of JOINs and I've used INNER JOIN a few times with decent success.
What I'm stuck on should be a simple task, but my Google-Fu must be weak. Here's what I'm trying to do.
Say I have 3 tables, Table_A, Table_B, and Table_C, and each table has a column called [Serial_Number].
What I'm wanting to select is 3 of the other columns if A.Serial_Number = B.Serial_Number OR C.Serial_Number.
I've tried doing:
SELECT
*
FROM
Table_A AS A
INNER JOIN Table_B AS B ON A.Serial_Number = B.Serial_Number
INNER JOIN Table_C AS C ON A.Serial_Number = C.Serial_Number
But this always yields 0 results as the nature of the data dictates that if A matches B, it will never match C and vice versa. I also tried a LEFT OUTER JOIN as the second clause, but this just includes NULLs from Table_C that have already matched on Table_B.
All the searches I have done relating to JOINs on multiple tables seem to be about using JOINS to further exclude records, where I'm actually wanting to INCLUDE more records.
Like I said, I'm sure this is really simple, just needing a nudge in right direction.
Thanks!

The use of two inner joins here is akin to saying
If A.Serial_Number = B.Serial_Number AND
A.Serial_Number = C.Serial_Number
Using a left outer join in the second clause (by which I presume you mean the second join) would perform a left join on a result set already filtered to A.Serial_Number = B.Serial_Number by the first inner join. Given that B.Serial_Number doesn't relate to C.Serial_Number, you wouldn't expect an equijoin to return any result from tablec.
What you want is a left outer join like the one you tried, but for both tableb and tablec.
Select *
From tablea
Left join tableb on tableb.Serial_Number = tablea.Serial_Number
Left join tablec on tablec.Serial_Number = tablea.Serial_Number
This way, regardless of whether tablea.Serial_Number is in tableb, the row will still be returned and thus be available to be joined to tablec.
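Note that two left joins keep every row of tablea, including serial numbers found in neither table. If you only want rows that matched B or C, as the question describes, one possible sketch is to filter afterwards. The Detail columns and sample rows here are hypothetical:

```sql
-- Hypothetical tables; the Detail columns stand in for "the other columns".
CREATE TABLE Table_A (Serial_Number VARCHAR(10));
CREATE TABLE Table_B (Serial_Number VARCHAR(10), Detail VARCHAR(20));
CREATE TABLE Table_C (Serial_Number VARCHAR(10), Detail VARCHAR(20));

INSERT INTO Table_A VALUES ('S1');
INSERT INTO Table_A VALUES ('S2');
INSERT INTO Table_A VALUES ('S3');            -- in neither B nor C
INSERT INTO Table_B VALUES ('S1', 'from B');
INSERT INTO Table_C VALUES ('S2', 'from C');

SELECT A.Serial_Number,
       COALESCE(B.Detail, C.Detail) AS Detail -- whichever side matched
FROM Table_A AS A
LEFT JOIN Table_B AS B ON B.Serial_Number = A.Serial_Number
LEFT JOIN Table_C AS C ON C.Serial_Number = A.Serial_Number
WHERE B.Serial_Number IS NOT NULL
   OR C.Serial_Number IS NOT NULL;            -- drop rows that matched neither
```

With this data the query returns S1 with 'from B' and S2 with 'from C', and leaves out S3.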

Agreed. Your output for your inner joins is producing NULLs which is why it is resulting in 0. I would suggest modifying your INNER JOIN.
