Most efficient way to join two tables on multiple fields? - sql

I'm working with an Oracle SQL DB and attempting to join 2 tables together. My issue is that there are 3 different dimensions (4 total fields) upon which the two tables may be joined and I'm looking to identify all records where any one of those methods delivers a match and then pull in a certain field from that 2nd table in those instances.
My current plan is as follows:
SELECT a.*,
CASE
WHEN b.field_1 IS NOT NULL THEN b.field_5
WHEN c.field_2 IS NOT NULL THEN c.field_5
WHEN d.field_3 IS NOT NULL THEN c.field_5
END AS match
FROM table_1 a
LEFT JOIN table_2 b ON a.field_1 = b.field_1
LEFT JOIN table_3 c ON a.field_2 = c.field_2
LEFT JOIN table_4 d ON a.field_3 = d.field_3 AND a.field_4 = d.field_4
I believe this will give me the results I'm looking for, but I imagine this isn't the most efficient way to accomplish that. Any thoughts on a better approach?

[TL;DR] Your query is fine.
You need JOINs to correlate the rows of the four tables.
If you want to include rows from the driving table even when there are no matching rows in the related tables, then the join needs to be an OUTER JOIN.
If you put the driving table first, it will be a LEFT OUTER JOIN (or just LEFT JOIN).
You do not have much choice in this.
If you want to get the field_5 values then you either want:
SELECT a.*,
b.field_5 AS b_match,
c.field_5 AS c_match,
d.field_5 AS d_match
FROM table_1 a
LEFT JOIN table_2 b ON a.field_1 = b.field_1
LEFT JOIN table_3 c ON a.field_2 = c.field_2
LEFT JOIN table_4 d ON a.field_3 = d.field_3 AND a.field_4 = d.field_4
If you want all the matches.
Or, you want to use your query:
SELECT a.*,
CASE
WHEN b.field_1 IS NOT NULL THEN b.field_5
WHEN c.field_2 IS NOT NULL THEN c.field_5
WHEN d.field_3 IS NOT NULL THEN c.field_5 -- Should this be d.field_5?
END AS match
FROM table_1 a
LEFT JOIN table_2 b ON a.field_1 = b.field_1
LEFT JOIN table_3 c ON a.field_2 = c.field_2
LEFT JOIN table_4 d ON a.field_3 = d.field_3 AND a.field_4 = d.field_4
If you want to get a single match in preference order of tables b, c and then d.
If you are using Oracle 12 or later, a third alternative could be to use UNION ALL in a LATERAL join:
SELECT a.*, l.field_5
FROM table_1 a
LEFT OUTER JOIN LATERAL (
SELECT 1 AS priority, b.field_5
FROM table_2 b
WHERE a.field_1 = b.field_1
UNION ALL
SELECT 2 AS priority, c.field_5
FROM table_3 c
WHERE a.field_2 = c.field_2
UNION ALL
SELECT 3 AS priority, d.field_5
FROM table_4 d
WHERE a.field_3 = d.field_3
AND a.field_4 = d.field_4
ORDER BY priority ASC
FETCH FIRST ROW WITH TIES
) l
ON (1 = 1)
This may reduce the number of duplicate rows, because it avoids the multiple JOINs (whose extra matches you are potentially ignoring with your CASE expression), but you should test whether it returns your desired results and whether it is more or less performant.
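As a rough illustration of the CASE-based preference order (b, then c, then d), here is a minimal sketch that uses Python's built-in sqlite3 module instead of Oracle. The id column, the sample rows and the matched_value alias are assumptions made up for the example; only the table and field names come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema and data, invented for this sketch.
    CREATE TABLE table_1 (id INTEGER, field_1 TEXT, field_2 TEXT,
                          field_3 TEXT, field_4 TEXT);
    CREATE TABLE table_2 (field_1 TEXT, field_5 TEXT);
    CREATE TABLE table_3 (field_2 TEXT, field_5 TEXT);
    CREATE TABLE table_4 (field_3 TEXT, field_4 TEXT, field_5 TEXT);

    INSERT INTO table_1 VALUES (1, 'x1', 'y1', 'z1', 'w1'),
                               (2, 'x9', 'y2', 'z2', 'w2'),
                               (3, 'x9', 'y9', 'z3', 'w3'),
                               (4, 'x9', 'y9', 'z9', 'w9');
    INSERT INTO table_2 VALUES ('x1', 'from_b');
    INSERT INTO table_3 VALUES ('y1', 'from_c'), ('y2', 'from_c');
    INSERT INTO table_4 VALUES ('z1', 'w1', 'from_d'), ('z3', 'w3', 'from_d');
""")

# The first WHEN branch that finds a joined row decides the value,
# so table_2 wins over table_3, which wins over table_4.
rows = conn.execute("""
    SELECT a.id,
           CASE
             WHEN b.field_1 IS NOT NULL THEN b.field_5
             WHEN c.field_2 IS NOT NULL THEN c.field_5
             WHEN d.field_3 IS NOT NULL THEN d.field_5
           END AS matched_value
    FROM table_1 a
    LEFT JOIN table_2 b ON a.field_1 = b.field_1
    LEFT JOIN table_3 c ON a.field_2 = c.field_2
    LEFT JOIN table_4 d ON a.field_3 = d.field_3 AND a.field_4 = d.field_4
    ORDER BY a.id
""").fetchall()

for row in rows:
    print(row)
```

Row 1 matches all three tables but takes its value from table_2, while row 4 matches nothing and gets NULL. Each sample row matches at most one row per joined table, so no duplicates appear here; with real data that may not hold.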

Related

SQL Server: how to write a DELETE statement with a GROUP BY

I am using SQL Server 2008.
I have a SELECT query as follows:
SELECT
Apples.ID, COUNT(p.Apples_ID)
FROM
Apples
LEFT JOIN
Table_B tb ON Apples.ID = tb.Apples_ID
LEFT JOIN
Table_C tc ON tb.xID = tc.xID
LEFT JOIN
Pips p ON tb.Apples_ID = p.Apples_ID
WHERE
tc.X IS NULL
GROUP BY
Apples.ID
The tables are:
Apples which has a unique entry (ID) for each Apple.
Pips which can have dozens of pips belonging to 1 Apple
Table_B and Table_C are mapping tables to refine the search
I need to group the results because I do not want an Apples result for each and every Pip that apples can have. The SELECT statement works and returns a list of unique Apple IDs
I now want to DELETE these Apples. I changed my statement to:
DELETE Apples
FROM Apples
LEFT JOIN Table_B tb ON Apples.ID = tb.Apples_ID
LEFT JOIN Table_C tc ON tb.xID = tc.xID
LEFT JOIN Pips p ON tb.Apples_ID = p.Apples_ID
WHERE tc.X IS NULL
GROUP BY Apples.ID
but got a syntax error on the GROUP BY.
I tried:
DELETE x
FROM
(SELECT Apples.ID
FROM Apples
LEFT JOIN Table_B tb ON Apples.ID = tb.Apples_ID
LEFT JOIN Table_C tc ON tb.xID = tc.xID
LEFT JOIN Pips p ON tb.Apples_ID = p.Apples_ID
WHERE tc.X IS NULL
GROUP BY Apples.ID) x;
But I got an error:
View or function not updatable because the modification affects multiple base tables
How can I delete these rows I have identified in the SELECT, without using a temporary table or script?
As others have pointed out, the sub-query approach can be adapted to work by using an IN ( ... ) clause on a normal single-table delete. This is the simplest way of adapting any select statement to a delete:
DELETE FROM Apples
WHERE ID IN (
-- Sub-query selecting a single column of ID values
)
The sub-query can then be as complex as you like, using GROUP BY, HAVING, etc, as long as it only has one column in the SELECT list.
In your specific case, however, there is no need:
You have no HAVING clause, so the COUNT() doesn't change the rows to delete
The LEFT JOIN to the Pips table has no effect on the result other than the COUNT()
Mentioning the same row twice in a DELETE has no effect, so eliminating duplicates is unnecessary
You can therefore simplify this particular case without using the sub-query:
DELETE Apples
FROM Apples
LEFT JOIN Table_B tb ON Apples.ID = tb.Apples_ID
LEFT JOIN Table_C tc ON tb.xID = tc.xID
WHERE tc.X IS NULL
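As a rough check of the IN ( ... ) approach described above, here is a minimal sketch using Python's built-in sqlite3 module rather than SQL Server (SQLite does not accept the DELETE ... FROM ... JOIN form, so only the sub-query form is shown). The sample rows are assumptions invented for the example; the table and column names follow the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Apples  (ID INTEGER PRIMARY KEY);
    CREATE TABLE Table_B (Apples_ID INTEGER, xID INTEGER);
    CREATE TABLE Table_C (xID INTEGER, X TEXT);

    INSERT INTO Apples VALUES (1), (2), (3);
    -- Apple 1 maps to a Table_C row with X set.
    -- Apple 2 maps (twice) to xID 20, which has no Table_C row.
    -- Apple 3 has no mapping rows at all.
    INSERT INTO Table_B VALUES (1, 10), (2, 20), (2, 20);
    INSERT INTO Table_C VALUES (10, 'keep');
""")

# The sub-query returns ID 2 twice and ID 3 once; the duplicate is harmless,
# so neither GROUP BY nor DISTINCT is needed.
conn.execute("""
    DELETE FROM Apples
    WHERE ID IN (
        SELECT a.ID
        FROM Apples a
        LEFT JOIN Table_B tb ON a.ID = tb.Apples_ID
        LEFT JOIN Table_C tc ON tb.xID = tc.xID
        WHERE tc.X IS NULL
    )
""")

remaining = [row[0] for row in conn.execute("SELECT ID FROM Apples ORDER BY ID")]
print(remaining)
```

Only apple 1 survives, because it is the only one whose joined Table_C row has a non-NULL X.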
DELETE FROM Apples WHERE ID IN
(
SELECT a.ID FROM Apples a
LEFT JOIN Table_B tb ON a.ID = tb.Apples_ID
LEFT JOIN Table_C tc ON tb.xID = tc.xID
LEFT JOIN Pips p ON tb.Apples_ID = p.Apples_ID
WHERE tc.X IS NULL
GROUP BY a.ID
);
Are you trying to achieve this?
DELETE FROM Apples WHERE ID IN
(
SELECT Apples.ID FROM Apples
LEFT JOIN Table_B tb ON Apples.ID = tb.Apples_ID
LEFT JOIN Table_C tc ON tb.xID = tc.xID
LEFT JOIN Pips p ON tb.Apples_ID = p.Apples_ID
WHERE tc.X IS NULL
GROUP BY Apples.ID
);
The only thing that has a role in the query is tc.X is null. It can be null if there is no match or there is a match but the field X is null:
delete from Apples
where ID in
(
SELECT Apples.ID FROM Apples
LEFT JOIN Table_B tb ON Apples.ID = tb.Apples_ID
LEFT JOIN Table_C tc ON tb.xID = tc.xID
WHERE tc.X IS NULL
);

Is there a way to print all of the rows from two tables using full outer join?

I have two tables, Table A and Table B. I tried joining them with a full outer join to get all of the rows from both tables (the resultant_table), but it isn't working for some reason; the screenshot at the end shows the error I get when I run the query. I want the output as shown in the resultant table.
Here is the script that I used:
SELECT table_b.date,
table_b.student,
table_b.location,
table_b.sub_division,
table_a.part_time_pay,
table_b.days_worked
FROM table_a
FULL OUTER JOIN table_b
ON table_a.date = table_b.date
AND table_a.student = table_b.student;
It is doing exactly what you specify. Use coalesce() to combine values from the two tables:
SELECT COALESCE(a.date, b.date) as date,
COALESCE(a.student, b.student) as student,
b.location, b.sub_division,
a.part_time_pay, b.days_worked
FROM table_a a FULL JOIN
table_b b
ON a.date = b.date AND
a.student = b.student;
I'm not sure how you want to handle LOCATION and SUB_DIVISION. What if they have different values in the two tables? I think you might want to put them in the JOIN conditions, and then:
SELECT COALESCE(a.date, b.date) as date,
COALESCE(a.student, b.student) as student,
COALESCE(a.location, b.location) as location,
COALESCE(a.sub_division, b.sub_division) as sub_division,
a.part_time_pay, b.days_worked
FROM table_a a FULL JOIN
table_b b
ON a.date = b.date AND
a.student = b.student AND
a.location = b.location AND
a.sub_division = b.sub_division;

Microsoft SQL Server performance: OR in ON clause [duplicate]

I need to join 2 tables using a field 'cdi' from the 2nd table with cdi or cd_cliente from the 1st table. I mean that it might match the same field or cd_cliente from the 1st table.
My original query was
select
a.cd_cliente, a.cdi as cdi_cli,b.*
from
clientes a
left join
rightTable b on a.cdi = b.cdi or a.cd_cliente = b.cdi
But since it took too much time, I changed it to:
Select a.cd_cliente, a.cdi, b.*
from clientes a
left join
(select
a.cd_cliente, a.cdi as cdi_cli, b.*
from
clientes a
inner join
rightTable b on a.cdi = b.cdi
union
select
a.cd_cliente, a.cdi as cdi_cli, b.*
from
clientes a
inner join
rightTable b on a.cd_cliente = b.cdi) b
on a.cd_cliente=b.cd_cliente
And it took less time. I'm not sure if the results would be the same. And if so, why is the time taken by the 2nd query considerably less?
I'm not sure if the results would be the same. Most likely not.
Consider a row in clientes that matched a row in rightTable on cdi but did not match any row on cd_cliente. The first query will return one row for the match. The second query will return two rows. Once for the match, and once for the not match, but with nulls filled in the rightTable columns because of the left outer join.
Also, if the first query returns any legitimate duplicates those will be removed by the union operator in the second query.
SQL Server isn't good with OR and indexes. Not sure why. Your second query is getting around that by (most likely) seeking via indexes twice and then merging them somehow.
There are simpler queries you could try, such as this one:
SELECT
a.cd_cliente,
cdi_cli = a.cdi,
b.*
FROM
dbo.clientes a
OUTER APPLY (
SELECT *
FROM dbo.rightTable b
WHERE a.cdi = b.cdi
UNION
SELECT *
FROM dbo.rightTable b
WHERE a.cd_cliente = b.cdi
) b
;
And here's a weird one that could actually work, though I'm not sure:
SELECT
a.cd_cliente,
cdi_cli = a.cdi,
b.*
FROM
dbo.clientes a
OUTER APPLY (
SELECT *
FROM dbo.rightTable b
WHERE EXISTS (
SELECT 1 WHERE a.cdi = b.cdi
UNION
SELECT 1 WHERE a.cd_cliente = b.cdi
)
) b
;
Told you it was weird! And here's an even weirder (and probably inadvisable) one.
SELECT
a.cd_cliente,
cdi_cli = a.cdi,
BColumn1 = Max(BColumn1),
BColumn2 = Max(BColumn2),
BColumn3 = Max(BColumn3),
BColumn4 = Max(BColumn4)
-- all columns of B
FROM
dbo.clientes a
CROSS APPLY (VALUES
(a.cdi),
(a.cd_cliente)
) c (cdi)
LEFT JOIN dbo.rightTable b
ON c.cdi = b.cdi
GROUP BY
a.cd_cliente,
a.cdi
-- plus all other columns of A
;
Given some time to play with your data and indexes and work with execution plans, I'm sure we could come up with something that would really sizzle.
This is your original query:
select a.cd_cliente, a.cdi as cdi_cli,b.*
from clientes a left join
rightTable b
on a.cdi = b.cdi or a.cd_cliente = b.cdi;
The performance problem is due to the or in the on condition. This generally interferes with using indexes.
If you only cared about one column from b, you could do:
select a.cd_cliente, a.cdi as cdi_cli, coalesce(b1.col, b2.col)
from clientes a left join
rightTable b1
on a.cdi = b1.cdi left join
rightTable b2
on a.cd_cliente = b2.cdi;
This easily generalizes to a small handful of columns, but is cumbersome if b is wide.
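To make the two-join rewrite concrete, here is a minimal sketch using Python's built-in sqlite3 module rather than SQL Server. The sample rows and the column name col are assumptions invented for the example; it compares the OR form with the double LEFT JOIN plus coalesce form on data where each clientes row matches at most one rightTable row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical data: client 1 matches on cdi, client 2 matches
    -- on cd_cliente, client 3 matches nothing.
    CREATE TABLE clientes   (cd_cliente INTEGER, cdi INTEGER);
    CREATE TABLE rightTable (cdi INTEGER, col TEXT);
    INSERT INTO clientes   VALUES (1, 100), (2, 200), (3, 300);
    INSERT INTO rightTable VALUES (100, 'by_cdi'), (2, 'by_cd_cliente');
""")

# Original form: a single join with OR in the ON clause.
with_or = conn.execute("""
    SELECT a.cd_cliente, a.cdi, b.col
    FROM clientes a
    LEFT JOIN rightTable b ON a.cdi = b.cdi OR a.cd_cliente = b.cdi
    ORDER BY a.cd_cliente
""").fetchall()

# Rewrite: one join per condition, then take the first non-NULL value.
two_joins = conn.execute("""
    SELECT a.cd_cliente, a.cdi, coalesce(b1.col, b2.col)
    FROM clientes a
    LEFT JOIN rightTable b1 ON a.cdi = b1.cdi
    LEFT JOIN rightTable b2 ON a.cd_cliente = b2.cdi
    ORDER BY a.cd_cliente
""").fetchall()

print(with_or)
print(two_joins)
```

The two forms agree on this data. They can differ when one clientes row matches different rightTable rows through both conditions: the OR form then returns two rows, while the coalesce form returns a single row carrying the b1 value.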
Another way of writing the query would be much more cumbersome. It would start with the b table, double left join to a and then union in the remaining values from a:
select coalesce(a1.cd_cliente, a2.cd_cliente) as cd_cliente,
coalesce(a1.cdi, a2.cdi) as cdi_cli,
b.*
from rightTable b left join
clientes a1
on a1.cdi = b.cdi left join
clientes a2
on a2.cd_cliente = b.cdi
where a1.cdi is not null or a2.cdi is not null
union all
select a.cd_cliente, a.cdi, b.*
from clientes a left join
righttable b
on 1 = 0
where not exists (select 1 from righttable b where a.cdi = b.cdi) and
not exists (select 1 from righttable b where a.cd_cliente = b.cdi)
The first part of the query gets all the matching rows, matched through either of the two joins. The second adds in the non-matching rows. Note the strange use of left join with a condition that always evaluates to FALSE. That makes it easier to bring in the columns from b.
Although this looks complicated, the joins and not exists subqueries can all take advantage of appropriate indexes on the tables. That means that it should have more reasonable performance.

Left outer join on multiple tables

I have the following sql statement:
select
a.desc
,sum(bdd.amount)
from t_main c
left outer join t_direct bds on (bds.repid=c.id)
left outer join tm_defination def a on (a.id =bds.sId)
where c.repId=1000000134
group by a.desc;
When I run it I get the following result:
desc amount
NW 12.00
SW 10
When I try to add another left outer join to get another set of values:
select
a.desc
,sum(bdd.amount)
,sum(i.amt)
from t_main c
left outer join t_direct bds on (bds.repid=c.id)
left outer join tm_defination def a on (a.id =bdd.sId)
left outer join t_ind i on (i.id=c.id)
where c.repId=1000000134
group by a.desc;
It basically doubles the amount field like:
desc amount amt
NW 24.00 234.00
SE 20.00 234.00
While result should be:
desc amount amt
NW 12.00 234.00
SE 10.00 NULL
How do I fix this?
If you really need to receive the data as you mentioned, you can use sub-queries to perform the needed calculations. In this case your code may look like the following:
select x.[desc], x.amount, y.amt
from
(
select
c.[desc]
, sum (bdd.amount) as amount
, c.id
from t_main c
left outer join t_direct bds on (bds.repid=c.id)
left outer join tm_defination_def bdd on (bdd.id = bds.sId)
where c.repId=1000000134
group by c.id, c.[desc]
) x
left join
(
select t.id, sum (t.amt) as amt
from t_ind t
inner join t_main c
on t.id = c.id
where c.repID = 1000000134
group by t.id
) y
on x.id = y.id
The first sub-select returns the aggregated data for the first two columns, desc and amount, grouped as you need.
The second sub-select returns the needed amt value for each id of the first set.
A left join between those results gives the needed result. The t_main table was added to the second sub-select for performance reasons.
Another solution can be the following:
select
c.[desc]
, sum (bdd.amount) as amount
, amt = (select sum (amt) from t_ind where id = c.id)
from t_main c
left outer join t_direct bds on (bds.repid=c.id)
left outer join tm_defination_def bdd on (bdd.id = bds.sId)
where c.repId = 1000000134
group by c.id, c.[desc]
The result will be the same. Instead of using nested selects, the amt sum is calculated inline by a correlated sub-query for each row of the joined result. For large tables, the performance of the second solution will be worse than that of the first.
Your new left outer join is most likely forcing some rows to be returned in the result set several times because of multiple relations. Remove your SUM, review the returned rows and work out exactly which ones you require (maybe restrict it to a certain type of t_ind record, if that is applicable?), then adjust your query accordingly.
Left Outer Join - Driving Table Row Count
A left outer join may return more rows than there are in the driving table if there are multiple matches on the join clause.
Using MS SQL-Server:
DECLARE @t1 TABLE ( id INT )
INSERT INTO @t1 VALUES ( 1 ),( 2 ),( 3 ),( 4 ),( 5 );
DECLARE @t2 TABLE ( id INT )
INSERT INTO @t2 VALUES ( 2 ),( 2 ),( 3 ),( 10 ),( 11 ),( 12 );
SELECT * FROM @t1 t1
LEFT OUTER JOIN @t2 t2 ON t2.id = t1.id
This gives:
1 NULL
2 2
2 2
3 3
4 NULL
5 NULL
There are 5 rows in the driving table (t1), but 6 rows are returned because there are multiple matches for id 2.
So if an aggregate function is used, eg SUM() etc, grouped by the driving table column(s), this will give the wrong results.
To fix this, use derived tables or sub-queries to calculate the aggregate values, as already stated.
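The inflation and the derived-table fix can be sketched as follows, using Python's built-in sqlite3 module rather than SQL Server. The table names main_t, direct_t and ind_t, their columns and the sample amounts are hypothetical, chosen only to mirror the doubled 12.00 and the 234.00 from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables: one driving row, one direct amount,
    -- and two indirect amounts for the same driving row.
    CREATE TABLE main_t   (id INTEGER, label TEXT);
    CREATE TABLE direct_t (main_id INTEGER, amount REAL);
    CREATE TABLE ind_t    (main_id INTEGER, amt REAL);

    INSERT INTO main_t   VALUES (1, 'NW');
    INSERT INTO direct_t VALUES (1, 12.0);
    INSERT INTO ind_t    VALUES (1, 100.0), (1, 134.0);
""")

# Naive version: the two ind_t rows repeat the direct_t row,
# so SUM(d.amount) counts 12.0 twice.
naive = conn.execute("""
    SELECT m.label, SUM(d.amount), SUM(i.amt)
    FROM main_t m
    LEFT JOIN direct_t d ON d.main_id = m.id
    LEFT JOIN ind_t    i ON i.main_id = m.id
    GROUP BY m.label
""").fetchall()

# Fix: aggregate each child table in its own derived table first,
# so each join contributes at most one row per driving row.
fixed = conn.execute("""
    SELECT m.label, d.amount, i.amt
    FROM main_t m
    LEFT JOIN (SELECT main_id, SUM(amount) AS amount
               FROM direct_t GROUP BY main_id) d ON d.main_id = m.id
    LEFT JOIN (SELECT main_id, SUM(amt) AS amt
               FROM ind_t GROUP BY main_id) i ON i.main_id = m.id
""").fetchall()

print(naive)
print(fixed)
```

The naive query reports 24.0 for the direct amount, while the pre-aggregated version reports the intended 12.0 alongside 234.0.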
Left Outer Join - Multiple Tables
Where there are left outer joins over multiple tables, or any join for that matter, the query generates a series of derived tables in the order of joins.
SELECT * FROM t1
LEFT OUTER JOIN t2 ON t2.col2 = <...>
LEFT OUTER JOIN t3 ON t3.col3 = <...>
This is equivalent to:
SELECT * FROM
(
SELECT * FROM t1
LEFT OUTER JOIN t2 ON t2.col2 = <...>
) dt1
LEFT OUTER JOIN t3 ON t3.col3 = <...>
Here, for both queries, the results of the 1st left outer join are put into a derived table (dt1) which is then left outer joined to the 3rd table (t3).
For left outer joins over multiple tables, the order of the tables in the join clauses is critical.

Sql NOT IN optimization

I'm having trouble optimizing a query. Here are two example tables I am working with:
Table 1:
UID
A
B
Table 2:
UID Parent
A 2
B 2
C 3
D 2
E 3
F 2
Here is what I am doing now:
Select R.UID
FROM Table1 R
INNER JOIN Table2 T ON
R.UID = T.UID
INNER JOIN Table2 E ON
T.PARENT = E.PARENT
AND E.UID NOT IN (SELECT UID FROM Table1)
I'm trying to avoid using the NOT IN clause because of obvious hindrances in performance for large numbers of records.
I know the typical ways to avoid NOT IN clauses like the LEFT JOIN where the other table is null, but can't seem to get what I want with all of the other Joins going on.
I will continue working and post if I find a solution.
EDIT: Here is what I am trying to end up with
After the first Inner Join I would have
A
B
After the second Inner Join I would have:
A D
A F
B D
B F
The second column above is just to represent that it is matching to the other UIDs with the same parent, but I still need the As and Bs as the UID.
EDIT: RDBMS is SQL server 2005, 2008r2, 2012
Table1 is declared in the query with no index
DECLARE @Table1 TABLE ( [UNIQUE_ID] INT PRIMARY KEY )
Table2 has a clustered index on Unique ID
The general approach to this is to use a LEFT JOIN with a WHERE clause that only selects the non-matching rows:
Select R.UID
FROM Table1 R
JOIN Table2 T ON R.UID = T.UID
JOIN Table2 E ON T.PARENT = E.PARENT
LEFT JOIN Table1 E2 ON E.UID = E2.UID
WHERE E2.UID IS NULL
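A minimal sketch of this anti-join, run on the sample data from the question with Python's built-in sqlite3 module rather than SQL Server, can confirm that it returns the same rows as the NOT IN version. The alias X for the second copy of Table1 and the extra E.UID output column are assumptions added for readability.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (UID TEXT PRIMARY KEY);
    CREATE TABLE Table2 (UID TEXT PRIMARY KEY, Parent INTEGER);
    INSERT INTO Table1 VALUES ('A'), ('B');
    INSERT INTO Table2 VALUES ('A', 2), ('B', 2), ('C', 3),
                              ('D', 2), ('E', 3), ('F', 2);
""")

# Original formulation with NOT IN.
not_in = conn.execute("""
    SELECT R.UID, E.UID
    FROM Table1 R
    JOIN Table2 T ON R.UID = T.UID
    JOIN Table2 E ON T.Parent = E.Parent
    WHERE E.UID NOT IN (SELECT UID FROM Table1)
    ORDER BY R.UID, E.UID
""").fetchall()

# Anti-join: LEFT JOIN back to Table1 and keep only rows with no partner.
anti_join = conn.execute("""
    SELECT R.UID, E.UID
    FROM Table1 R
    JOIN Table2 T ON R.UID = T.UID
    JOIN Table2 E ON T.Parent = E.Parent
    LEFT JOIN Table1 X ON E.UID = X.UID
    WHERE X.UID IS NULL
    ORDER BY R.UID, E.UID
""").fetchall()

print(not_in)
print(anti_join)
```

Both queries pair A and B with the siblings D and F, which matches the expected output listed in the question. Whether the anti-join is actually faster depends on the engine and the indexes, so it is worth comparing execution plans on real data.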
SELECT Table2.*
FROM Table2
INNER JOIN (
SELECT UID FROM Table2
EXCEPT
SELECT UID FROM Table1
) AS Filter ON (Table2.UID = Filter.UID)