Terminology for when a join propagates out additional rows - sql

When joining between two tables/queries:
with
cte1 (id) as (
select 1 from dual),
cte2 (id) as (
select 1 from dual union all
select 1 from dual)
select
cte1.id as cte1_id,
cte2.id as cte2_id
from
cte1
left join
cte2
on cte1.id = cte2.id
CTE1_ID CTE2_ID
1 1
1 1
Unsurprisingly, that join propagates out additional rows. The query on the left side of the join only had one row. But the resultset has two rows due to the join.
I suspect “propagate” isn’t quite the right word for describing that scenario.
What’s the proper term?
For example, when talking to people who are new to SQL, I often say, “Be careful with that join. It looks like you’re accidentally propagating out additional rows, since the join is 1:many.”

In this example you are not propagating rows (at least in my understanding anyway). You have two rows in the table on the right side of the join and you have two rows in the result.
However, if you had this:
WITH cte AS
(
SELECT 1 AS id FROM dual
UNION ALL
SELECT 1 FROM dual
)
SELECT x.id, y.id
FROM cte x
INNER JOIN cte y ON x.id = y.id;
You would start with 2 rows and the query would return 4, because the join is partial. To me, this is propagating data.
When every row from one side of the join is joined with every row on the other, the term you are looking for is a "cartesian product", which is achieved in SQL using a "cross join" or, in cases where the join is not unique but is limited partially, you could use "partial cartesian product" (though I don't recommend it) or more commonly a "partial cross-join". I think the latter is more likely to be readily appreciated by SQL developers.
In either case, there are times where both can be appropriate but a lot of the time they are the result of an error in a join clause.

What’s the proper term?
"Cartesian Product" could be one term you can use.
I.e. "Be careful of that join. It looks like you are accidentally returning the cartesian product of the two tables."
A CROSS JOIN will return the cartesian product of the two joined tables; it is also called a "Cartesian Join".
An INNER JOIN will return the cartesian product of the two joined tables that is filtered by some relationship (the join condition(s)) between columns of the two tables; it is also called an "Equi Join".
An OUTER JOIN is similar to the INNER JOIN but will also return the non-matched rows on one (for LEFT or RIGHT joins) or both (for FULL joins) sides of the join condition.

why a LEFT JOIN an INNER JOIN would do the same!
And no it doesn't propagate, you have in cte2 2 id's with 1 that is what UNION ALL actually amkes so when you join both tables, with the same id you will receive 2 rows as joined result set.
A Left Join also takes all rows of the left tables and troes to join in your case by the id and if it didn't find any companion, it adds the row with the right table columns as NULL.
So no wonders and no miracles, simple SQL

Related

Oracle: Use only few tables in WHERE clause but mentioned more tables in 'FROM' in a jon SQL

What will happen in an Oracle SQL join if I don't use all the tables in the WHERE clause that were mentioned in the FROM clause?
Example:
SELECT A.*
FROM A, B, C, D
WHERE A.col1 = B.col1;
Here I didn't use the C and D tables in the WHERE clause, even though I mentioned them in FROM. Is this OK? Are there any adverse performance issues?
It is poor practice to use that syntax at all. The FROM A,B,C,D syntax has been obsolete since 1992... more than 30 YEARS now. There's no excuse anymore. Instead, every join should always use the JOIN keyword, and specify any join conditions in the ON clause. The better way to write the query looks like this:
SELECT A.*
FROM A
INNER JOIN B ON A.col1 = B.col1
CROSS JOIN C
CROSS JOIN D;
Now we can also see what happens in the question. The query will still run if you fail to specify any conditions for certain tables, but it has the effect of using a CROSS JOIN: the results will include every possible combination of rows from every included relation (where the "A,B" part counts as one relation). If each of the three parts of those joins (A&B, C, D) have just 100 rows, the result set will have 1,000,000 rows (100 * 100 * 100). This is rarely going to give the results you expect or intend, and it's especially suspect when the SELECT clause isn't looking at any of the fields from the uncorrelated tables.
Any table lacking join definition will result in a Cartesian product - every row in the intermediate rowset before the join will match every row in the target table. So if you have 10,000 rows and it joins without any join predicate to a table of 10,000 rows, you will get 100,000,000 rows as a result. There are only a few rare circumstances where this is what you want. At very large volumes it can cause havoc for the database, and DBAs are likely to lock your account.
If you don't want to use a table, exclude it entirely from your SQL. If you can't for reason due to some constraint we don't know about, then include the proper join predicates to every table in your WHERE clause and simply don't list any of their columns in your SELECT clause. If there's a cost to the join and you don't need anything from it and again for some very strange reason can't leave the table out completely from your SQL (this does occasionally happen in reusable code), then you can disable the joins by making the predicates always false. Remember to use outer joins if you do this.
Native Oracle method:
WITH data AS (SELECT ROWNUM col FROM dual CONNECT BY LEVEL < 10) -- test data
SELECT A.*
FROM data a,
data b,
data c,
data d
WHERE a.col = b.col
AND DECODE('Y','Y',NULL,a.col) = c.col(+)
AND DECODE('Y','Y',NULL,a.col) = d.col(+)
ANSI style:
WITH data AS (SELECT ROWNUM col FROM dual CONNECT BY LEVEL < 10)
SELECT A.*
FROM data a
INNER JOIN data b ON a.col = b.col
LEFT OUTER JOIN data c ON DECODE('Y','Y',NULL,a.col) = b.col
LEFT OUTER JOIN data d ON DECODE('Y','Y',NULL,a.col) = d.col
You can plug in a variable for the first Y that you set to Y or N (e.g. var_disable_join). This will bypass the join and avoid both the associated performance penalty and the Cartesian product effect. But again, I want to reiterate, this is an advanced hack and is probably NOT what you need. Simply leaving out the unwanted tables it the right approach 95% of the time.

There are a few questions about the select statement

I have a few questions about the select statement.
First of all, I have normalized 15 tables for this select query.
The problem is invisible because there is not much data right now.
However, since I try to process many tables in one select query, it seems to cause problems later.
So I want to add a few more select statements to divide the tables to search, but I want to know how different it is from doing it at once.
Secondly, if I use join, I will use outer join. If I join multiple tables with outer join, I'm not sure how to use left outer join and right outer join.
The currently created select query refers to 8 tables and one join is linked.
That is, the remaining rest of the tables have obtained data in subqueries and the remaining eight tables are likely to use join.
I would appreciate it if you could let me know the direction of the multiple outer joins.
Let me briefly show you some of the current select queries.
select
a.cal1,a.cal2,a.cal3,...,
(select b.cal1 from b
where a.cal4=b.cal2)
as "bcals",
(select c.cal1 from c
where a.cal5=c.cal2)
as "ccals",
....,
(select e.cal1 from e
where a.caln=e.cal2)
as "ecals",
(select sum(extract(year from age(f.endday,f.startday))
from f
where e.cal1=a.cal1)
as "fcals",
g.cal1,g.cal2,g.cal3,...,
(select h.cal1 from h
where g.cal4=h.cal2)
as "hcals"
from a left outer join g on a.cal1=g.cal5
where a.cal1=?;
Result:
a.cal1|a.cal2|a.cal3|...|hcals
var1 |var2 |var3 |...|varn
After this, I wonder how to join the rest of the tables.
To sum up
If there are many tables that need to be included in a select query statement, what is the difference between performance and performance when this complex query is divided into multiple select statements?
If we write inside a select statement, how should outer join be?
Is there a problem with the query?
Actually your code is correct, but it looks very complex. People will find it difficult to understand it. Using joins you can minimize the lines of code and also make it more readable.
SELECT
TBL1.AMOUNT T1,
TBL2.AMOUNT T2,
TBL3.AMOUNT T3
FROM TBL1
LEFT JOIN TBL2 ON TBL2.ID = TBL1.ID
LEFT JOIN TBL3 ON TBL3.ID = TBL1.ID
In the above code , there are three tables, and two joins. One can easily understand and debug/make changes. Please try this for your code.

What's the difference between Join and UNION in SQL

I'm just wondering what's the difference between Join and Union in SQL and in Join, what's the difference between Join and Cross Join? THANKS!
Join: Joins the table based on certain conditions. Lets say Table A has 2 rows rowA1 and rowA2. And you join this with Table B which has 3 rows rowB1 rowB2 rowB3. So the result will be:
rowA1.data RowB1.data
rowA1.data RowB2.data
rowA1.data RowB3.data
rowA2.data RowB1.data
rowA2.data RowB2.data
rowA2.data RowB3.data
But in a union, the result will be:
rowA1.data
rowA2.data
rowB1.data
rowB2.data
rowB3.data
Union will also check for duplicate. The data types should be consistent. The data types of columns returned should be in same order and number as the datatype and no. of columns returned by second table.
Join is a concept. It can be various types liek Inner Join, Outer Join, Cross Join. Cross Join means, there is a missing condition that would uniquely join the table data.

Performance comparison with postgresql : left join vs union all

I want to know what is the best and the fastest solution between a "left outer join" and an "union all".
The database is a PostgreSQL.
Query with an UNION ALL :
SELECT * FROM element, user WHERE elm_usr_id = usr_id
UNION ALL
SELECT * FROM element WHERE elm_usr_id ISNULL;
Query with a LEFT OUTER JOIN :
SELECT * FROM element LEFT OUTER JOIN user ON elm_usr_id = usr_id;
Your two queries may not produce the same result.
Your query with UNION ALL returns rows that matches plus rows that not matches because of a null value in elm_usr_id.
While the query with LEFT JOIN (same as LEFT OUTER JOIN) returns rows that matches plus rows that not matches because of any not corresponding value.
Regarding to this, the query with LEFT JOIN is more secure if you expect to see all rows.
Back to your original question, the query with LEFT JOIN is the best on for taking advantage of indexes. For example, if you'd like to have a sorted result, then the UNION query will be far slowest. Or if your query is a subquery in a main query, then the UNION will prevent any possible exploitation of table [element] indexes. So it will be slow to perform a JOIN or WHERE of such a subquery.
I would suggest LEFT OUTER JOIN over union all in this particular scenario,
as in union all you have to read the tables twice, whereas in LEFT OUTER JOIN only once
Probably the LEFT JOIN, but you can see the query plan by running EXPLAIN ANALYSE SELECT.... The UNION ALL form might be clearer if you were modifying columns based on the null-ness of elm_usr_id but you could always use CASE to do column modifications with a LEFT JOIN.

Left join or select from multiple table using comma (,) [duplicate]

This question already has answers here:
SQL left join vs multiple tables on FROM line?
(12 answers)
Closed 8 years ago.
I'm curious as to why we need to use LEFT JOIN since we can use commas to select multiple tables.
What are the differences between LEFT JOIN and using commas to select multiple tables.
Which one is faster?
Here is my code:
SELECT mw.*,
nvs.*
FROM mst_words mw
LEFT JOIN (SELECT no as nonvs,
owner,
owner_no,
vocab_no,
correct
FROM vocab_stats
WHERE owner = 1111) AS nvs ON mw.no = nvs.vocab_no
WHERE (nvs.correct > 0 )
AND mw.level = 1
...and:
SELECT *
FROM vocab_stats vs,
mst_words mw
WHERE mw.no = vs.vocab_no
AND vs.correct > 0
AND mw.level = 1
AND vs.owner = 1111
First of all, to be completely equivalent, the first query should have been written
SELECT mw.*,
nvs.*
FROM mst_words mw
LEFT JOIN (SELECT *
FROM vocab_stats
WHERE owner = 1111) AS nvs ON mw.no = nvs.vocab_no
WHERE (nvs.correct > 0 )
AND mw.level = 1
So that mw.* and nvs.* together produce the same set as the 2nd query's singular *. The query as you have written can use an INNER JOIN, since it includes a filter on nvs.correct.
The general form
TABLEA LEFT JOIN TABLEB ON <CONDITION>
attempts to find TableB records based on the condition. If the fails, the results from TABLEA are kept, with all the columns from TableB set to NULL. In contrast
TABLEA INNER JOIN TABLEB ON <CONDITION>
also attempts to find TableB records based on the condition. However, when fails, the particular record from TableA is removed from the output result set.
The ANSI standard for CROSS JOIN produces a Cartesian product between the two tables.
TABLEA CROSS JOIN TABLEB
-- # or in older syntax, simply using commas
TABLEA, TABLEB
The intention of the syntax is that EACH row in TABLEA is joined to EACH row in TABLEB. So 4 rows in A and 3 rows in B produces 12 rows of output. When paired with conditions in the WHERE clause, it sometimes produces the same behaviour of the INNER JOIN, since they express the same thing (condition between A and B => keep or not). However, it is a lot clearer when reading as to the intention when you use INNER JOIN instead of commas.
Performance-wise, most DBMS will process a LEFT join faster than an INNER JOIN. The comma notation can cause database systems to misinterpret the intention and produce a bad query plan - so another plus for SQL92 notation.
Why do we need LEFT JOIN? If the explanation of LEFT JOIN above is still not enough (keep records in A without matches in B), then consider that to achieve the same, you would need a complex UNION between two sets using the old comma-notation to achieve the same effect. But as previously stated, this doesn't apply to your example, which is really an INNER JOIN hiding behind a LEFT JOIN.
Notes:
The RIGHT JOIN is the same as LEFT, except that it starts with TABLEB (right side) instead of A.
RIGHT and LEFT JOINS are both OUTER joins. The word OUTER is optional, i.e. it can be written as LEFT OUTER JOIN.
The third type of OUTER join is FULL OUTER join, but that is not discussed here.
Separating the JOIN from the WHERE makes it easy to read, as the join logic cannot be confused with the WHERE conditions. It will also generally be faster as the server will not need to conduct two separate queries and combine the results.
The two examples you've given are not really equivalent, as you have included a sub-query in the first example. This is a better example:
SELECT vs.*, mw.*
FROM vocab_stats vs, mst_words mw
LEFT JOIN vocab_stats vs ON mw.no = vs.vocab_no
WHERE vs.correct > 0
AND mw.level = 1
AND vs.owner = 1111