Full outer join acts like inner join with multiple conditions on the two tables


I am trying to do a full outer join between two tables, Table1 and Table2, on ID with a query like the following in Teradata. The problem is that it acts like an inner join.
SELECT *
FROM Table1 AS a
FULL OUTER JOIN Table2 AS b
ON a.ID = b.ID
WHERE a.country in ('US','FR')
AND a.create_date = '2021-01-01'
AND b.country IN ('US','DE','BE')
AND b.create_date = '2021-01-01';
What I want is something like this:
SELECT * FROM
(
SELECT * FROM Table1 as a
WHERE a.country in ('US','FR')
AND a.create_date = '2021-01-01'
) as ax
FULL OUTER JOIN
(
SELECT * FROM Table2 as b
WHERE b.country IN ('US','DE','BE')
AND b.create_date = '2021-01-01'
) as bx
ON ax.ID=bx.ID;
I feel like the second query is not best practice, maybe inefficient and/or hard to read in complicated cases. How can I modify the first query to get the desired output?
I know that this is a fundamental problem and probably there are many other ways to do it (e.g. with USING, HAVING etc) but could not find a basic explanation. Would appreciate a comprehensive answer on alternative solutions as a guide for future reference.
EDIT
The difference in my question to Left Join With Where Clause is that I require a condition in both tables. I cannot figure out where to put the second WHERE condition.

The short answer: Both sets of predicates belong in the ON clause.
SELECT *
FROM Table1 AS a
FULL OUTER JOIN Table2 AS b
ON a.ID = b.ID
AND a.country in ('US','FR')
AND a.create_date = '2021-01-01'
AND b.country IN ('US','DE','BE')
AND b.create_date = '2021-01-01';
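The ON clause specifies how rows are matched (join criteria), and only rows that satisfy it are paired up. The WHERE clause filters results after the join, which is why the NULL-extended rows fail the WHERE tests in the first query. Note that in an outer join, a row from a preserved table that fails an ON predicate is not discarded: it is returned with NULLs in the other table's columns. The query above therefore gives exactly the same result as the derived-table version only when no such non-qualifying rows exist; to exclude them entirely, filter before the join as in the derived-table query.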
A generally less desirable alternative is to modify the WHERE predicates so that they do not filter out the non-matching rows, e.g. assuming ID is NOT NULL in both tables:
SELECT *
FROM Table1 AS a
FULL OUTER JOIN Table2 AS b
ON a.ID = b.ID
WHERE (a.country in ('US','FR')
AND a.create_date = '2021-01-01'
OR a.ID IS NULL)
AND (b.country IN ('US','DE','BE')
AND b.create_date = '2021-01-01'
OR b.ID IS NULL);
Logically the ON and WHERE work the same way for INNER JOIN but in that case the net result is the same (and many databases including Teradata will generate the same query plan for INNER JOIN regardless of where you put the filter predicates).
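One way to check how predicate placement changes an outer join is to run the variants on a handful of rows. The sketch below uses Python's built-in sqlite3 module with made-up rows (the data is an assumption, not from the question). It uses LEFT JOIN rather than FULL OUTER JOIN because older SQLite builds do not support the latter; the same reasoning applies to each preserved side of a full outer join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (ID INTEGER, country TEXT);
    CREATE TABLE Table2 (ID INTEGER, country TEXT);
    -- Hypothetical sample rows: ID 3 fails Table1's filter, ID 4 has no match.
    INSERT INTO Table1 VALUES (1, 'US'), (2, 'FR'), (3, 'JP'), (4, 'US');
    INSERT INTO Table2 VALUES (1, 'US'), (2, 'CN'), (3, 'DE');
""")

def run(sql):
    return conn.execute(sql).fetchall()

# Predicate on the optional table in WHERE: NULL-extended rows fail the test,
# so the result looks like an inner join.
where_rows = run("""
    SELECT a.ID, b.ID FROM Table1 a
    LEFT JOIN Table2 b ON a.ID = b.ID
    WHERE b.country IN ('US', 'DE', 'BE')
    ORDER BY a.ID
""")

# Same predicate in ON: it only decides which Table2 rows count as a match.
on_rows = run("""
    SELECT a.ID, b.ID FROM Table1 a
    LEFT JOIN Table2 b ON a.ID = b.ID AND b.country IN ('US', 'DE', 'BE')
    ORDER BY a.ID
""")

# Predicate on the preserved table in ON: ID 3 (JP) is still returned,
# just without a match.
on_preserved = run("""
    SELECT a.ID, b.ID FROM Table1 a
    LEFT JOIN Table2 b ON a.ID = b.ID AND a.country IN ('US', 'FR')
    ORDER BY a.ID
""")

# Filtering in a derived table removes ID 3 before the join.
derived = run("""
    SELECT ax.ID, b.ID
    FROM (SELECT * FROM Table1 WHERE country IN ('US', 'FR')) AS ax
    LEFT JOIN Table2 b ON ax.ID = b.ID
    ORDER BY ax.ID
""")

for name, rows in [("WHERE", where_rows), ("ON (optional side)", on_rows),
                   ("ON (preserved side)", on_preserved), ("derived", derived)]:
    print(name, rows)
```

On this sample data the ON and derived-table forms differ only in the row for ID 3, which is worth keeping in mind when choosing between them.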

Related

Is there a way to print all of the rows from two tables using full outer join?

There are two tables, Table A and Table B. I tried joining them with a full outer join to get all of the rows from both tables (the resultant_table), but it isn't working for some reason; the screenshot at the end shows the error I get when I run the query. I wanted the output shown in the resultant table.
Here is the script that I used:
SELECT table_b.date,
table_b.student,
table_b.location,
table_b.sub_division,
table_a.part_time_pay,
table_b.days_worked
FROM table_a
FULL OUTER JOIN table_b
ON table_a.date = table_b.date
AND table_a.student = table_b.student;

It is doing exactly what you specify. Use coalesce() to combine values from the two tables:
SELECT COALESCE(a.date, b.date) as date,
COALESCE(a.student, b.student) as student,
b.location, b.sub_division,
a.part_time_pay, b.days_worked
FROM table_a a FULL JOIN
table_b b
ON a.date = b.date AND
a.student = b.student;
I'm not sure how you want to handle LOCATION, and SUBDIVISION. What if they have different values? I might think you want to put them in the JOIN conditions and then:
SELECT COALESCE(a.date, b.date) as date,
COALESCE(a.student, b.student) as student,
COALESCE(a.location, b.location) as location,
COALESCE(a.sub_division, b.sub_division) as sub_division,
a.part_time_pay, b.days_worked
FROM table_a a FULL JOIN
table_b b
ON a.date = b.date AND
a.student = b.student AND
a.location = b.location AND
a.sub_division = b.sub_division;
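To see why selecting the key columns from only one table leaves gaps, here is a small sketch using Python's built-in sqlite3 module. The rows are invented for illustration, and the full outer join is emulated with a LEFT JOIN plus a UNION ALL of the unmatched table_b rows (an assumption made so the example also runs on older SQLite builds that lack FULL OUTER JOIN).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_a (date TEXT, student TEXT, part_time_pay INTEGER);
    CREATE TABLE table_b (date TEXT, student TEXT, days_worked INTEGER);
    -- Hypothetical rows: ann is in both tables, bob only in A, cat only in B.
    INSERT INTO table_a VALUES ('2021-01-01', 'ann', 100), ('2021-01-01', 'bob', 80);
    INSERT INTO table_b VALUES ('2021-01-01', 'ann', 5), ('2021-01-02', 'cat', 3);

    -- Stand-in for: table_a a FULL JOIN table_b b ON date and student match.
    CREATE VIEW joined AS
        SELECT a.date AS a_date, a.student AS a_student, a.part_time_pay,
               b.date AS b_date, b.student AS b_student, b.days_worked
        FROM table_a a
        LEFT JOIN table_b b ON a.date = b.date AND a.student = b.student
        UNION ALL
        SELECT NULL, NULL, NULL, b.date, b.student, b.days_worked
        FROM table_b b
        WHERE NOT EXISTS (SELECT 1 FROM table_a a
                          WHERE a.date = b.date AND a.student = b.student);
""")

# Keys taken from table_b only: the row that exists only in table_a has no key.
keys_from_b = conn.execute("""
    SELECT b_date, b_student, part_time_pay, days_worked
    FROM joined
    ORDER BY COALESCE(a_student, b_student)
""").fetchall()

# Keys taken from whichever side is present.
keys_coalesced = conn.execute("""
    SELECT COALESCE(a_date, b_date), COALESCE(a_student, b_student),
           part_time_pay, days_worked
    FROM joined
    ORDER BY 2
""").fetchall()

print(keys_from_b)
print(keys_coalesced)
```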

BigQuery Full outer join producing "left join" results

I have 2 tables, both of which contain distinct id values. Some of the id values might occur in both tables and some are unique to each table. Table1 has 10,910 rows and Table2 has 11,304 rows
When running a plain (inner) join query:
SELECT COUNT(DISTINCT a.id)
FROM table1 a
JOIN table2 b on a.id = b.id
I get a total of 10,896 rows or 10,896 ids shared across both tables.
However, when I run a FULL OUTER JOIN on the 2 tables like this:
SELECT COUNT(DISTINCT a.id)
FROM table1 a
FULL OUTER JOIN EACH table2 b on a.id = b.id
I get a total of 10,896 rows, but I was expecting all 10,910 rows from table1.
I am wondering if there is an issue with my query syntax.

As you are using EACH, it looks like you are running your queries in Legacy SQL mode.
In BigQuery Legacy SQL, the COUNT(DISTINCT) function is probabilistic: it returns a statistical approximation and is not guaranteed to be exact.
You can use the EXACT_COUNT_DISTINCT() function instead. It returns the exact number but is a little more expensive on the back end.
An even better option is to use Standard SQL.
For your specific query you only need to remove the EACH keyword and it should work like a charm:
#standardSQL
SELECT COUNT(DISTINCT a.id)
FROM table1 a
JOIN table2 b on a.id = b.id
and
#standardSQL
SELECT COUNT(DISTINCT a.id)
FROM table1 a
FULL OUTER JOIN table2 b on a.id = b.id

I added the original query as a subquery and counted ids and produced the expected results. Still a little strange, but it works.
SELECT EXACT_COUNT_DISTINCT(a_id)
FROM
(SELECT a.id AS a_id,
b.id AS b_id
FROM table1 a FULL OUTER JOIN EACH table2 b on a.id = b.id)

That is because in both cases you count the non-null ids from table a by using count(distinct a.id).
Use count(*) and it should work.

You will have to add COALESCE. BigQuery, unlike traditional SQL, does not recognize fields unless they are used explicitly.
SELECT COUNT(DISTINCT coalesce(a.id,b.id))
FROM table1 a
FULL OUTER JOIN EACH table2 b on a.id = b.id
This query will now take full effect of full outer join :)
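The difference between these counts can be checked on a tiny data set. This is a sketch using Python's built-in sqlite3 module, not BigQuery; the ids are made up, and the full outer join is emulated with LEFT JOIN plus UNION ALL (an assumption so that it also runs on older SQLite builds without FULL OUTER JOIN). It relies only on the standard rule that COUNT(DISTINCT x) ignores NULLs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER);
    CREATE TABLE table2 (id INTEGER);
    -- Hypothetical ids: 2 and 3 are shared, 1 is only in table1, 4 and 5 only in table2.
    INSERT INTO table1 VALUES (1), (2), (3);
    INSERT INTO table2 VALUES (2), (3), (4), (5);

    -- Stand-in for: table1 a FULL OUTER JOIN table2 b ON a.id = b.id
    CREATE VIEW full_join AS
        SELECT a.id AS a_id, b.id AS b_id
        FROM table1 a LEFT JOIN table2 b ON a.id = b.id
        UNION ALL
        SELECT NULL, b.id
        FROM table2 b
        WHERE NOT EXISTS (SELECT 1 FROM table1 a WHERE a.id = b.id);
""")

# Inner join: only ids present in both tables.
inner_count = conn.execute("""
    SELECT COUNT(DISTINCT a.id)
    FROM table1 a JOIN table2 b ON a.id = b.id
""").fetchone()[0]

# On the full join: a_id is NULL for table2-only rows, so it counts table1's ids;
# COALESCE counts ids from either side.
only_a, either_side, all_rows = conn.execute("""
    SELECT COUNT(DISTINCT a_id),
           COUNT(DISTINCT COALESCE(a_id, b_id)),
           COUNT(*)
    FROM full_join
""").fetchone()

print(inner_count, only_a, either_side, all_rows)
```

With exact counting, COUNT(DISTINCT a.id) over the full outer join equals the number of distinct ids in table1, which is what the question expected.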

Changing the ON condition order results in different query results?

Will there be any difference if I change the order of the operands in the ON condition on the last line, especially when I use LEFT JOIN or LEFT OUTER JOIN? Some people have told me that the result might differ when the order changes, but I suspect they aren't sure about this themselves.
Or, if we change the order, in which situations (right outer, right, left, left outer joins) does the query result differ?

It makes no difference which side you put criteria on when an = is being used.
Table order matters in the case of LEFT JOIN and RIGHT JOIN, but criteria order does not.
For example:
SELECT *
FROM Table1 a
LEFT JOIN Table2 b
ON a.ID = b.ID
Is equivalent to:
SELECT *
FROM Table2 a
RIGHT JOIN Table1 b
ON a.ID = b.ID
But not equivalent to:
SELECT *
FROM Table2 a
LEFT JOIN Table1 b
ON a.ID = b.ID
Demo: SQL Fiddle
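The same point can be checked locally. Below is a minimal sketch using Python's built-in sqlite3 module with invented ids; it compares swapping the operands of the ON condition with swapping the tables around a LEFT JOIN. RIGHT JOIN is left out because older SQLite builds do not support it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (ID INTEGER);
    CREATE TABLE Table2 (ID INTEGER);
    -- Hypothetical ids: only 2 exists in both tables.
    INSERT INTO Table1 VALUES (1), (2);
    INSERT INTO Table2 VALUES (2), (3);
""")

def run(sql):
    return conn.execute(sql).fetchall()

# Operand order inside ON: a.ID = b.ID versus b.ID = a.ID.
a_equals_b = run("""
    SELECT t1.ID, t2.ID FROM Table1 t1
    LEFT JOIN Table2 t2 ON t1.ID = t2.ID
    ORDER BY t1.ID
""")
b_equals_a = run("""
    SELECT t1.ID, t2.ID FROM Table1 t1
    LEFT JOIN Table2 t2 ON t2.ID = t1.ID
    ORDER BY t1.ID
""")

# Table order around LEFT JOIN: Table2 is now the preserved table.
tables_swapped = run("""
    SELECT t1.ID, t2.ID FROM Table2 t2
    LEFT JOIN Table1 t1 ON t1.ID = t2.ID
    ORDER BY t2.ID
""")

print(a_equals_b)      # same rows as b_equals_a
print(b_equals_a)
print(tables_swapped)  # different rows: Table2's unmatched id is kept instead
```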

SQL style question: INNER JOIN in FROM clause or WHERE clause?

If you are going to join multiple tables in a SQL query, where do you think is a better place to put the join statement: in the FROM clause or the WHERE clause?
If you are going to do it in the FROM clause, how do you format it so that it is clear and readable? (I'm talking about indents, newlines, whitespace in general.)
Are there any advantages/disadvantages to each?

I tend to use the FROM clause, or rather the JOIN clause itself, indenting like this (and using aliases):
SELECT t1.field1, t2.field2, t3.field3
FROM table1 t1
INNER JOIN table2 t2
ON t1.id1 = t2.id1
INNER JOIN table3 t3
ON t1.id1 = t3.id3
This keeps the join condition close to where the join is made. I find it easier to understand this way than trying to look through the WHERE clause to figure out what exactly is joined how.

When making OUTER JOINs (ANSI-89 or ANSI-92), filtration location matters because criteria specified in the ON clause is applied before the JOIN is made. Criteria against an OUTER JOINed table provided in the WHERE clause is applied after the JOIN is made. This can produce very different result sets.
In comparison, it doesn't matter for INNER JOINs if the criteria is provided in the ON or WHERE clauses -- the result will be the same. That said, I strive to keep the WHERE clause clean -- anything related to JOINed tables will be in their respective ON clause. Saves hunting through the WHERE clause, which is why ANSI-92 syntax is more readable.
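A quick sketch of this ON versus WHERE behaviour, using Python's built-in sqlite3 module. The parent and child tables and their rows are hypothetical names chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE parent (id INTEGER);
    CREATE TABLE child (id INTEGER, parent_id INTEGER, kind TEXT);
    -- Hypothetical rows: parent 3 has no child, parent 2's child has kind 'y'.
    INSERT INTO parent VALUES (1), (2), (3);
    INSERT INTO child VALUES (10, 1, 'x'), (11, 2, 'y');
""")

def run(join_type, placement):
    # placement is "AND" (filter inside ON) or "WHERE" (filter after the join).
    sql = f"""
        SELECT p.id, c.id
        FROM parent p
        {join_type} JOIN child c ON c.parent_id = p.id
        {placement} c.kind = 'x'
        ORDER BY p.id
    """
    return conn.execute(sql).fetchall()

inner_on = run("INNER", "AND")
inner_where = run("INNER", "WHERE")
left_on = run("LEFT", "AND")
left_where = run("LEFT", "WHERE")

print(inner_on, inner_where)  # identical for the inner join
print(left_on)                # every parent kept, unmatched ones with None
print(left_where)             # NULL-extended rows removed by WHERE
```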

I prefer the FROM clause, if for no other reason than that it distinguishes conditions that merely express foreign key relationships from conditions that are logical restrictions on the results (of a Cartesian product). For example:
SELECT * FROM Products P JOIN ProductPricing PP ON P.Id = PP.ProductId
WHERE PP.Price > 10
As opposed to
SELECT * FROM Products P, ProductPricing PP
WHERE P.Id = PP.ProductID AND Price > 10
I can look at the first one and instantly know that the only logical restriction I'm placing is the price, as opposed to the implicit machinery of joining tables together on the relationship key.

I almost always use the ANSI 92 joins because it makes it clear that these conditions are for JOINING.
Typically I write it this way
FROM
foo f
INNER JOIN bar b
ON f.id = b.id
Sometimes I write it this way when it is trivial:
FROM
foo f
INNER JOIN bar b ON f.id = b.id
INNER JOIN baz b2 ON b.id = b2.id
When it's not trivial I do it the first way,
e.g.
FROM
foo f
INNER JOIN bar b
ON f.id = b.id
and b.type = 1
or
FROM
foo f
INNER JOIN (
SELECT max(date) date, id
FROM foo
GROUP BY
id) lastF
ON f.id = lastF.id
and f.date = lastF.Date
Or the really weird case (not sure if I got the parens right, but it is supposed to be a LEFT JOIN to table bar, where bar needs an inner join to baz):
FROM
foo f
LEFT JOIN (bar b
INNER JOIN baz b2
ON b.id = b2.id
)ON f.id = b.id

You should put joins in Join clauses which means the From clause. A different question could be had about where to put filtering statements.
With respect to indenting, there are many styles. My preference is to indent related joins and keep main clauses like Select, From, Where, Group By, Having and Order By indented at the same level. In addition, I put each of these main attributes and the first line of an On clause on its own line.
Select ..
From Table1
Join Table2
On Table2.FK = Table1.PK
And Table2.OtherCol = '12345'
And Table2.OtherCol2 = 9876
Left Join (Table3
Join Table4
On Table4.FK = Table3.PK)
On Table3.FK = Table2.PK
Where ...
Group By ...
Having ...
Order By ...

Use the FROM clause to be compliant with ANSI-92 standards.
This:
select *
from a
inner join b
on a.id = b.id
where a.SomeColumn = 'x'
Not this:
select *
from a, b
where a.id = b.id
and a.SomeColumn = 'x'

I definitely always do my JOINS (of whatever type) in my FROM clause.
The way I indent them is this:
SELECT fields
FROM table1 t1
INNER JOIN table2 t2 ON t1.id = t2.t1_id
INNER JOIN table3 t3 ON t1.id = t3.t1_id
AND
t2.id = t3.t2_id
In fact, I'll generally go a step farther and move as much of my constraining logic from the WHERE clause to the FROM clause, because this (at least in MS SQL) front-loads the constraint, meaning that it reduces the size of the recordset sooner in the query construction (I've seen documentation that contradicts this, but my execution plans are invariably more efficient when I do it this way).
For example, if I wanted to only select things in the above query where t3.id = 3, I could put that in the WHERE clause, or I could do it this way:
SELECT fields
FROM table1 t1
INNER JOIN table2 t2 ON t1.id = t2.t1_id
INNER JOIN table3 t3 ON t1.id = t3.t1_id
AND
t2.id = t3.t2_id
AND
t3.id = 3
I personally find queries laid out in this way to be very readable and maintainable, but this is certainly a matter of personal preference, so YMMV.
Regardless, I hope this helps.

ANSI joins. I omit any optional keywords from the SQL as they only add noise to the equation. There's no such thing as a left inner join, is there? And by default, a simple join is an inner join, so there's no particular point to saying 'inner join'.
Then I column align things as much as possible.
The point is that a large, complex SQL query can be very difficult to comprehend, so the more order that is imposed on it to make it more readable, the better. Anybody looking at the query to fix, modify or tune it needs to be able to answer a few things right off the bat:
what tables/views are involved in the query?
what are the criteria for each join? What's the cardinality of each join?
what/how many columns are returned by the query
I like to write my queries so they look something like this:
select PatientID = rpt.ipatientid ,
EventDate = d.dEvent ,
Side = d.cSide ,
OutsideHistoryDate = convert(nchar, d.devent,112) ,
Outcome = p.cOvrClass ,
ProcedureType = cat.ctype ,
ProcedureCategoryMajor = cat.cmajor ,
ProcedureCategoryMinor = cat.cminor
from dbo.procrpt rpt
join dbo.procd d on d.iprocrptid = rpt.iprocrptid
join dbo.proclu lu on lu.iprocluid = d.iprocluid
join dbo.pathlgy p on p.iProcID = d.iprocid
left join dbo.proccat cat on cat.iproccatid = lu.iproccatid
where rpt.ipatientid = @iPatientID

Need to create an expression in an outer join that only returns one row

I'm creating a really complex dynamic sql, it's got to return one row per user, but now I have to join against a one to many table. I do an outer join to make sure I get at least one row back (and can check for null to see if there's data in that table) but I have to make sure I only get one row back from this outer join part if there's multiple rows in this second table for this user.
So far I've come up with this: (sybase)
SELECT a.user_id
FROM table1 a
,table2 b
WHERE a.user_id = b.user_id
AND a.sub_id = (
SELECT min(c.sub_id)
FROM table2 c
WHERE b.sub_id = c.sub_id
)
The subquery finds the min value in the one to many table for that particular user.
This works but I fear nastiness from doing correlated subqueries when table 1 and 2 get very large.
Is there a better way? I'm trying to dream up a way to get joins to do it, but I'm not seeing it.
Also saying "where rowcount=1" or "top 1" doesn't help me, because I'm not trying to fix the above query, I'm ADDING the above to an already complex query.

In MySQL you can ensure that any query returns at most X rows using
select *
from foo
where bar = 1
limit X;
Unfortunately, I'm fairly sure this is a MySQL-specific extension to SQL. However, a Google search for something like "mysql sybase limit" might turn up an equivalent for Sybase.

A few quick points:
You need to have definitive business rules. If the query returns more than one row then you need to think about why (beyond just "it's a 1:many relationship": WHY is it a 1:many relationship?). You should come up with the business solution rather than just use "min" because it gives you 1 row. The business solution might simply be "take the first one", in which case min might be the answer, but you need to make sure that's a conscious decision.
You should really try to use the ANSI syntax for joins. Not just because it's standard, but because the syntax that you have isn't really doing what you think it's doing (it's not an outer join) and some things are simply impossible to do with the syntax that you have.
Assuming that you end up using the MIN solution, here's one possible solution without the subquery. You should test it with various other solutions to make sure that they are equivalent in outcome and to see which performs the best.
SELECT
a.user_id, b.*
FROM
dbo.Table_1 a
LEFT OUTER JOIN dbo.Table_2 b ON b.user_id = a.user_id AND b.sub_id = a.sub_id
LEFT OUTER JOIN dbo.Table_2 c ON c.user_id = a.user_id AND c.sub_id < b.sub_id
WHERE
c.user_id IS NULL
You'll need to test this to see if it's really giving what you want and you might need to tweak it, but the basic idea is to use the second LEFT OUTER JOIN to ensure that there are no rows that exist with a lower sub_id than the one found in the first LEFT OUTER JOIN (if any is found). You can adjust the criteria in the second LEFT OUTER JOIN depending on the final business rules.
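Here is a minimal sketch of that self-anti-join idea, run through Python's built-in sqlite3 module. The tables and rows are hypothetical, and because this sample table1 has no sub_id column, the first join matches on user_id only (an assumption that differs from the query above).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (user_id INTEGER);
    CREATE TABLE table2 (user_id INTEGER, sub_id INTEGER);
    -- Hypothetical rows: user 1 has two sub rows, user 2 has one, user 3 has none.
    INSERT INTO table1 VALUES (1), (2), (3);
    INSERT INTO table2 VALUES (1, 5), (1, 7), (2, 9);
""")

# b is a candidate row; c looks for any row of the same user with a lower sub_id.
# Keeping only rows where no such c exists leaves the minimum sub_id per user,
# and users without any table2 row survive with NULL.
rows = conn.execute("""
    SELECT a.user_id, b.sub_id
    FROM table1 a
    LEFT OUTER JOIN table2 b ON b.user_id = a.user_id
    LEFT OUTER JOIN table2 c ON c.user_id = a.user_id AND c.sub_id < b.sub_id
    WHERE c.user_id IS NULL
    ORDER BY a.user_id
""").fetchall()

print(rows)
```

If two table2 rows tie on the lowest sub_id for a user, both survive, so the one-row guarantee depends on sub_id being unique per user.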

How about:
select a.user_id
from table1 a
where exists (select null from table2 b
where a.user_id = b.user_id
)

Maybe your example is too simplified, but I'd use a group by:
SELECT
a.user_id
FROM
table1 a
LEFT OUTER JOIN table2 b ON (a.user_id = b.user_id)
GROUP BY
a.user_id
I fear the only other way would be using nested queries:
The difference between this query and your example is a 'sub table' is only generated once, however in your example you generate a 'sub table' for each row in table1 (but may depend on the compiler, so you might want to use query analyser to check performance).
SELECT
a.user_id,
b.sub_id
FROM
table1 a
LEFT OUTER JOIN (
SELECT
user_id,
min(sub_id) as sub_id
FROM
table2
GROUP BY
user_id
) b ON (a.user_id = b.user_id)
Also, if your query is getting quite complex I'd use temporary tables to simplify the code, it might cost a little more in processing time, but will make your queries much easier to maintain.
A Temp Table example would be:
SELECT
user_id
INTO
#table1
FROM
table1
WHERE
.....
SELECT
a.user_id,
min(b.sub_id) as sub_id
INTO
#table2
FROM
#table1 a
INNER JOIN table2 b ON (a.user_id = b.user_id)
GROUP BY
a.user_id
SELECT
a.*,
b.sub_id
from
#table1 a
LEFT OUTER JOIN #table2 b ON (a.user_id = b.user_id)
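The grouped derived-table variant from this answer can be tried out with Python's built-in sqlite3 module. This is only a sketch on invented rows; it compares the plain outer join, which fans out, with the join against the grouped subquery.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (user_id INTEGER);
    CREATE TABLE table2 (user_id INTEGER, sub_id INTEGER);
    -- Hypothetical rows: user 1 has two sub rows, user 2 has one, user 3 has none.
    INSERT INTO table1 VALUES (1), (2), (3);
    INSERT INTO table2 VALUES (1, 5), (1, 7), (2, 9);
""")

# Plain outer join: user 1 appears once per matching table2 row.
fan_out = conn.execute("""
    SELECT a.user_id, b.sub_id
    FROM table1 a
    LEFT OUTER JOIN table2 b ON a.user_id = b.user_id
    ORDER BY a.user_id, b.sub_id
""").fetchall()

# Outer join against one pre-aggregated row per user.
one_per_user = conn.execute("""
    SELECT a.user_id, b.sub_id
    FROM table1 a
    LEFT OUTER JOIN (
        SELECT user_id, MIN(sub_id) AS sub_id
        FROM table2
        GROUP BY user_id
    ) b ON a.user_id = b.user_id
    ORDER BY a.user_id
""").fetchall()

print(fan_out)
print(one_per_user)
```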

First of all, I believe the query you are trying to write as your example is:
select a.user_id
from table1 a, table2 b
where a.user_id = b.user_id
and b.sub_id = (select min(c.sub_id)
from table2 c
where b.user_id = c.user_id)
Except that you wanted an outer join (I think someone edited out the Oracle outer-join syntax):
select a.user_id
from table1 a
left outer join table2 b on a.user_id = b.user_id
where b.sub_id = (select min(c.sub_id)
from table2 c
where b.user_id = c.user_id)

Well, you already have a query that works. If you are concerned about the speed you could:
Add a field to table2 which identifies which sub_id is the 'first one', or
Keep track of table2's primary key in table1, or in another table
