Join conditions, intermediate SQL - sql

What's the difference between the join conditions "on" and "using", if both are used to specify the column(s) to join on?

The main difference with using is that the columns for the join have to have the same names. This is generally a good practice anyway in the data model.
Another important difference is that the columns come from different tables -- the join condition doesn't specify the tables (some people view this as a weakness, but you'll see it is quite useful).
A handy feature is that the common columns used for the join are removed when you use select *. So
select *
from a join
b
on a.x = b.x
will result in x appearing twice in the result set. This is not allowed for subqueries or views. On the other hand, this query only has x once in the result set.
select *
from a join
b
using (x)
Of course, other columns could be duplicated.
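The column behaviour described above can be checked with a small sketch. It assumes SQLite (through Python's sqlite3 module) and two made-up tables a and b that share a column x; other databases may label columns differently, so treat it as an illustration rather than a guarantee.

```python
import sqlite3

# Hypothetical tables a and b sharing a column x; names and data are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (x INTEGER, a_val TEXT);
    CREATE TABLE b (x INTEGER, b_val TEXT);
    INSERT INTO a VALUES (1, 'a1'), (2, 'a2');
    INSERT INTO b VALUES (1, 'b1'), (3, 'b3');
""")


def column_names(sql):
    """Return the result-set column names of a query."""
    cur = conn.execute(sql)
    return [d[0] for d in cur.description]


on_cols = column_names("SELECT * FROM a JOIN b ON a.x = b.x")
using_cols = column_names("SELECT * FROM a JOIN b USING (x)")

print(on_cols)     # x is listed once per table
print(using_cols)  # x is listed a single time
```

The duplicated x in the first result is what makes the ON form unusable as-is inside a view or subquery that needs unique column names.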
For an outer join, the value is the non-NULL value, if any. This becomes quite handy for full joins:
select *
from a full join
b
using (x) full join
c
using (x);
Because of the null values, expressing this without using is rather cumbersome:
select *
from a full join
b
on b.x = a.x full join
c
on c.x = coalesce(a.x, b.x);

using is just a shorthand to express the join condition when the related columns have the same name.
Consider the following example:
select ...
from orders o
inner join order_items oi on oi.order_id = o.order_id
This can be shortened with using, as follows:
select ...
from orders o
inner join order_items oi using(order_id)
Notes:
this also works when joining on several columns having identical names
parentheses are mandatory with using
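As a rough sketch of the multi-column note, assuming SQLite and a hypothetical extra key column shop_id that is not part of the original example, the two spellings below should return the same rows:

```python
import sqlite3

# shop_id is an invented second key column, added only to show a two-column USING.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (shop_id INTEGER, order_id INTEGER, customer TEXT);
    CREATE TABLE order_items (shop_id INTEGER, order_id INTEGER, sku TEXT);
    INSERT INTO orders VALUES (1, 100, 'alice'), (2, 100, 'bob');
    INSERT INTO order_items VALUES (1, 100, 'pen'), (2, 100, 'ink'), (2, 101, 'cap');
""")

# Both listed columns must match for two rows to be joined.
using_rows = conn.execute("""
    SELECT o.customer, oi.sku
    FROM orders o
    INNER JOIN order_items oi USING (shop_id, order_id)
    ORDER BY o.customer
""").fetchall()

# The same condition written out with ON.
on_rows = conn.execute("""
    SELECT o.customer, oi.sku
    FROM orders o
    INNER JOIN order_items oi
        ON oi.shop_id = o.shop_id AND oi.order_id = o.order_id
    ORDER BY o.customer
""").fetchall()

print(using_rows)
```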

Related

Joining two tables, customers and orders to get list of all customers and the order number IF they have an order

SELECT a.org,
a.id,
a.Name,
b.ordNum
FROM customers A,
orders B
WHERE a.org = 'JJJ'
AND a.org = b.org (+)
AND b.addr_type (+) = 'ST' -- <<<<<<<<<<<<<<<<< why do i need to add (+) here
AND a.cust_id = b.cust_id (+)
ORDER BY 2
I have a table with a list of customers (A) and a table called orders (B) that holds the orders the customers may have placed.
The query I have above is supposed to give me the names of all customers and the order number IF there is an order linked to that customer.
My question is: why do I need to add the (+) after b.addr_type to get all the customers, even if they have not placed an order?
That is the old-style JOIN syntax, wherein (+) denotes an OUTER JOIN. This means that every row in the left table will be returned whether it has a match on the right table or not. To get only the customers with order, use an INNER JOIN. Additionally, you should use explicit JOIN and not the old-style syntax:
SELECT
c.ORG, c.ID, c.NAME, o.ordNum
FROM customers c -- Use meaningful aliases to improve readability
LEFT JOIN orders o
ON c.org = o.org
AND c.cust_id = o.cust_id
AND o.addr_type = 'ST'
WHERE
c.org = 'JJJ'
ORDER BY c.ID
The (+) is an "outer join" in old style syntax. This means EVERY row on the left side of the join is returned, with a "null" in the right hand side table's columns if no match is made.
An INNER join (regular equals in old style SQL) would not return a record if there was no match on the right side.
Modern syntax is
SELECT A.ORG, A.ID, A.NAME, b.ordNum FROM
customers A
LEFT OUTER JOIN orders b on a.org = b.org
AND a.cust_id = b.cust_id
AND b.addr_type = 'ST'
WHERE a.org = 'JJJ'
ORDER BY 2
The "OUTER" part is optional, and indeed implicit if you're using the word "LEFT". Your other options are RIGHT and FULL for outer joins.
Why use this new syntax? Because it's ANSI SQL compliant, the (+) is deprecated and won't port over to some modern RDBMS implementations. Plus, as per the comment on this post, it's as ugly as sin and hard to maintain.
The (+) syntax tells Oracle to execute a left join instead of an inner join.
The result is a list of records with all columns from the customers table populated and, where there is no matching order, NULL in the columns from the orders table.
If the columns from the orders table are NULL, the where condition b.addr_type = 'ST' can never be true for these records (a comparison with NULL is not true), so you will not obtain the desired result.
If instead you write b.addr_type(+) = 'ST', you get all rows matching the condition plus the rows with NULL values produced by the left join, which is what you want.
In order to avoid such questions,
switch to LEFT JOIN syntax which is more readable
SELECT a.org,
a.id,
a.Name,
b.ordNum
FROM customers a LEFT JOIN
orders b ON (a.org = b.org)
AND (b.addr_type = 'ST')
AND (a.cust_id = b.cust_id)
WHERE a.org = 'JJJ'
ORDER BY a.id -- better put it direct, not field's index
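The effect of where the addr_type test sits can be reproduced with the LEFT JOIN form. This is only a sketch: it assumes SQLite instead of Oracle (so the (+) operator itself is not shown), and the three customers and two orders are invented sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (org TEXT, cust_id INTEGER, name TEXT);
    CREATE TABLE orders (org TEXT, cust_id INTEGER, addr_type TEXT, ordNum INTEGER);
    INSERT INTO customers VALUES ('JJJ', 1, 'Ann'), ('JJJ', 2, 'Bob'), ('JJJ', 3, 'Cy');
    -- Invented data: Ann has an 'ST' order, Bob only a 'BT' order, Cy has none.
    INSERT INTO orders VALUES ('JJJ', 1, 'ST', 501), ('JJJ', 2, 'BT', 502);
""")

# addr_type test inside ON: it is part of the outer join, like b.addr_type (+) = 'ST'.
filter_in_on = conn.execute("""
    SELECT c.name, o.ordNum
    FROM customers c
    LEFT JOIN orders o
        ON c.org = o.org AND c.cust_id = o.cust_id AND o.addr_type = 'ST'
    WHERE c.org = 'JJJ'
    ORDER BY c.cust_id
""").fetchall()

# addr_type test in WHERE: it runs after the join and rejects the NULL-extended rows.
filter_in_where = conn.execute("""
    SELECT c.name, o.ordNum
    FROM customers c
    LEFT JOIN orders o
        ON c.org = o.org AND c.cust_id = o.cust_id
    WHERE c.org = 'JJJ' AND o.addr_type = 'ST'
    ORDER BY c.cust_id
""").fetchall()

print(filter_in_on)     # every customer, with None where no 'ST' order exists
print(filter_in_where)  # only customers that have an 'ST' order
```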

What is actually a self-join?

I have several questions about self joins; could anyone help answer them?
Is there a strict format for a self join? There are samples like this:
SELECT a.column_name, b.column_name...
FROM table1 a, table1 b
WHERE a.common_field = b.common_field;
But there are also samples like:
SELECT a.ID, b.NAME, a.SALARY
FROM CUSTOMERS a, CUSTOMERS b
WHERE a.SALARY < b.SALARY;
I wonder whether the connection (a.common_field = b.common_field) is necessary, since both formats are self joins.
How will the self join be optimized? Will they be treated as INNER JOIN or CROSS JOIN? In particular, is the second format a self CROSS JOIN? Are they treated the same way in SQLite and PostgreSQL?
My problem is that I want to extract a structure from a bunch of graph-like data, and my query looks like
SELECT A.colum, B.colum,....N.colum
FROM
table1 as A, table1 as B, table1 as C .... table2 as M, table2 as N ....
where
A.colum1<B.colum1 and
C.colum1=D.colum1 and
....
In the query, table1, table2, ... are single-column tables; they are the components of the final structure. Is my problem best expressed in this kind of self-join format? I find it is very slow in PostgreSQL but fast in SQLite, which confuses me.
A self join is no different than any other join as far as structure/behavior goes, but they are typically used in different ways.
You should ditch the deprecated syntax of comma separated lists of tables and use ANSI joins:
SELECT a.column_name, b.column_name...
FROM table1 a
JOIN table1 b
ON a.common_field = b.common_field;
You can specify what type of JOIN you want it to be (JOIN, LEFT JOIN, RIGHT JOIN, CROSS JOIN, ...) and how you want to relate the tables to each other, just like any other join. Equality is not required, as you've noted in your a.Salary < b.Salary example.
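A minimal sketch of a self join on an inequality, assuming SQLite and an invented three-row CUSTOMERS table. Each row is paired with every row that has a higher salary:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT, salary INTEGER);
    INSERT INTO customers VALUES (1, 'Ann', 1000), (2, 'Bob', 2000), (3, 'Cy', 3000);
""")

# The same table appears twice under two aliases; the join condition is a "<" test.
pairs = conn.execute("""
    SELECT a.name, b.name
    FROM customers a
    JOIN customers b ON a.salary < b.salary
    ORDER BY a.id, b.id
""").fetchall()

print(pairs)
```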
No, there's no strict format.
A self join is just the special case of joining a table with itself. Think of it as joining two instances of the same thing (in fact not two instances, but two references to the same table).
In general you will use an inner self join, but you can also cross join or outer join a table with itself.
Example:
select * from tbPeople p0
join tbPeople p1 on p1.id = p0.parentId
where p0.id = you
that returns you and your parents
select * from tbPeople p0
left join tbPeople p1 on p1.parentId = p0.id
where p0.id = you
that returns your kids, or just you in case you don't have offspring yet
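The two queries can be tried out roughly like this. The sketch assumes SQLite, selects names instead of *, and uses a made-up tbPeople table in which id 2 plays the role of "you":

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbPeople (id INTEGER, name TEXT, parentId INTEGER);
    -- Invented family: 2 is "you", 1 is your parent, 3 and 4 are your kids.
    INSERT INTO tbPeople VALUES
        (1, 'Grandma', NULL), (2, 'You', 1),
        (3, 'Kid A', 2), (4, 'Kid B', 2), (5, 'Neighbor', NULL);
""")

PARENTS = """
    SELECT p0.name, p1.name
    FROM tbPeople p0
    JOIN tbPeople p1 ON p1.id = p0.parentId
    WHERE p0.id = ?
"""
KIDS = """
    SELECT p0.name, p1.name
    FROM tbPeople p0
    LEFT JOIN tbPeople p1 ON p1.parentId = p0.id
    WHERE p0.id = ?
    ORDER BY p1.id
"""

your_parents = conn.execute(PARENTS, (2,)).fetchall()
your_kids = conn.execute(KIDS, (2,)).fetchall()
# Someone without offspring still gets one row, thanks to the LEFT JOIN.
no_kids = conn.execute(KIDS, (5,)).fetchall()

print(your_parents, your_kids, no_kids)
```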

What's the difference between filtering in the WHERE clause compared to the ON clause?

I would like to know if there is any difference in using the WHERE clause or using the matching in the ON of the inner join.
The result in this case is the same.
First query:
with Catmin as
(
select categoryid, MIN(unitprice) as mn
from production.Products
group by categoryid
)
select p.productname, mn
from Catmin
inner join Production.Products p
on p.categoryid = Catmin.categoryid
and p.unitprice = Catmin.mn;
Second query:
with Catmin as
(
select categoryid, MIN(unitprice) as mn
from production.Products
group by categoryid
)
select p.productname, mn
from Catmin
inner join Production.Products p
on p.categoryid = Catmin.categoryid
where p.unitprice = Catmin.mn; -- this is changed
My answer may be a bit off-topic, but I would like to highlight a problem that may occur when you turn your INNER JOIN into an OUTER JOIN.
In this case, the most important difference between putting predicates (test conditions) on the ON or WHERE clauses is that you can turn LEFT or RIGHT OUTER JOINS into INNER JOINS without noticing it, if you put fields of the table to be left out in the WHERE clause.
For example, in a LEFT JOIN between tables A and B, if you include a condition that involves fields of B on the WHERE clause, there's a good chance that there will be no null rows returned from B in the result set. Effectively, and implicitly, you turned your LEFT JOIN into an INNER JOIN.
On the other hand, if you include the same test in the ON clause, null rows will continue to be returned.
For example, take the query below:
SELECT * FROM A
LEFT JOIN B
ON A.ID=B.ID
The query will also return rows from A that do not match any of B.
Take this second query:
SELECT * FROM A
LEFT JOIN B
ON 1=1 -- placeholder: most databases require an ON clause for LEFT JOIN
WHERE A.ID=B.ID
This second query won't return any rows from A that don't match B, even though you think it will because you specified a LEFT JOIN. That's because the test A.ID=B.ID will leave out of the result set any rows with B.ID that are null.
That's why I favor putting predicates in the ON clause rather than in the WHERE clause.
The results are exactly the same.
Putting the condition in the ON clause is often suggested on performance grounds.
The reasoning is that, instead of joining the tables and then filtering, the ON clause filters the first data set before it is joined to the other tables, so there is less data to match. For an inner join, though, most query optimizers treat both forms the same way, so a measurable difference should not be assumed.
There is no difference between the outputs of the two queries above; both return the same result.
With the ON clause, the join operation conceptually joins only those rows that match the condition specified in the ON clause.
With the WHERE clause, the join operation conceptually joins all the rows and then filters them based on the WHERE condition.
So the ON clause is often preferred over the WHERE condition, although for an inner join the optimizer is free to execute both forms identically.
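For the inner join in the question, the claim that both placements return the same rows can be checked with a sketch like the one below. It assumes SQLite, drops the production schema prefix, and uses invented product data; it compares results only and says nothing about timing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Products (productname TEXT, categoryid INTEGER, unitprice REAL);
    INSERT INTO Products VALUES
        ('tea', 1, 4.5), ('coffee', 1, 9.0),
        ('jam', 2, 3.0), ('honey', 2, 3.0), ('syrup', 2, 7.5);
""")

# Shared part of both queries, up to and including the first ON condition.
CATMIN = """
    WITH Catmin AS (
        SELECT categoryid, MIN(unitprice) AS mn
        FROM Products
        GROUP BY categoryid
    )
    SELECT p.productname, mn
    FROM Catmin
    INNER JOIN Products p
        ON p.categoryid = Catmin.categoryid
"""

# Price test as a second ON condition.
on_rows = conn.execute(
    CATMIN + " AND p.unitprice = Catmin.mn ORDER BY p.productname"
).fetchall()

# Price test moved to the WHERE clause.
where_rows = conn.execute(
    CATMIN + " WHERE p.unitprice = Catmin.mn ORDER BY p.productname"
).fetchall()

print(on_rows == where_rows)
```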

Adding more conditions in the join or in the WHERE clause: which is better?

SELECT C.*
FROM Content C
INNER JOIN ContentPack CP ON C.ContentPackId = CP.ContentPackId
AND CP.DomainId = #DomainId
...and:
SELECT C.*
FROM Content C
INNER JOIN ContentPack CP ON C.ContentPackId = CP.ContentPackId
WHERE CP.DomainId = #DomainId
Is there any performance difference between these two queries?
Because both queries use an INNER JOIN, there is no difference -- they're equivalent.
That wouldn't be the case if dealing with an OUTER JOIN -- criteria in the ON clause is applied before the join; criteria in the WHERE is applied after the join.
But your query would likely run better as:
SELECT c.*
FROM CONTENT c
WHERE EXISTS (SELECT NULL
FROM CONTENTPACK cp
WHERE cp.contentpackid = c.contentpackid
AND cp.domainid = #DomainId)
Using a JOIN risks duplicates if there's more than one CONTENTPACK record related to a CONTENT record. And it's pointless to JOIN if your query is not using columns from the table being JOINed to... JOINs are not always the fastest way.
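The duplicate risk can be illustrated with a small sketch, assuming SQLite, a bound parameter in place of #DomainId, and invented rows in which two ContentPack records share a ContentPackId (the Locale column is hypothetical and exists only to make those rows distinct):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Content (ContentId INTEGER, ContentPackId INTEGER);
    CREATE TABLE ContentPack (ContentPackId INTEGER, DomainId INTEGER, Locale TEXT);
    INSERT INTO Content VALUES (1, 10), (2, 20);
    -- Assumed data: two ContentPack rows relate to Content 1.
    INSERT INTO ContentPack VALUES (10, 7, 'en'), (10, 7, 'fr'), (20, 8, 'en');
""")

domain_id = 7

# JOIN: one output row per matching ContentPack row.
join_rows = conn.execute("""
    SELECT c.ContentId
    FROM Content c
    INNER JOIN ContentPack cp ON c.ContentPackId = cp.ContentPackId
    WHERE cp.DomainId = ?
""", (domain_id,)).fetchall()

# EXISTS: one output row per Content row that has at least one match.
exists_rows = conn.execute("""
    SELECT c.ContentId
    FROM Content c
    WHERE EXISTS (SELECT NULL
                  FROM ContentPack cp
                  WHERE cp.ContentPackId = c.ContentPackId
                    AND cp.DomainId = ?)
""", (domain_id,)).fetchall()

print(join_rows, exists_rows)
```

If ContentPackId is unique in ContentPack, both forms return the same rows and the choice is mostly about intent.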
There's no performance difference, but I would prefer putting the condition in the inner join because I think it makes very clear what you are trying to join on in both tables.

What is the difference between using a cross join and putting a comma between the two tables?

What is the difference between
select * from A, B
and
select * from A cross join B
? They seem to return the same results.
Is the second version preferred over the first? Is the first version completely syntactically wrong?
They return the same results because they are semantically identical. This:
select *
from A, B
...is (wince) ANSI-89 syntax. Without a WHERE clause to link the tables together, the result is a Cartesian product, which is exactly what the alternative provides as well:
select *
from A
cross join B
...but the CROSS JOIN is ANSI-92 syntax.
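A quick way to convince yourself, sketched here under the assumption of SQLite and two tiny invented tables, is to run both spellings and compare the rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (x INTEGER);
    CREATE TABLE B (y TEXT);
    INSERT INTO A VALUES (1), (2);
    INSERT INTO B VALUES ('p'), ('q'), ('r');
""")

comma_rows = conn.execute("SELECT * FROM A, B ORDER BY x, y").fetchall()
cross_rows = conn.execute("SELECT * FROM A CROSS JOIN B ORDER BY x, y").fetchall()

# Every row of A is paired with every row of B: 2 * 3 combinations.
print(len(comma_rows), comma_rows == cross_rows)
```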
About Performance
There's no performance difference between them.
Why Use ANSI-92?
The reason to use ANSI-92 syntax is OUTER JOIN support (i.e. LEFT, FULL, RIGHT). ANSI-89 syntax doesn't have any, so many databases implemented their own, which doesn't port to other databases, e.g. Oracle's (+) and SQL Server's =*
Stumbled upon this post from another SO question, but a big difference is the linkage a cross join creates. For example, when using cross apply or another join after B in the first ('comma') variant, the cross apply or join can only refer to the table(s) after the comma. E.g., the following:
select * from A, B join C on C.SomeField = A.SomeField and C.SomeField = B.SomeField
would create an error:
The multi-part identifier "A.SomeField" could not be bound.
because the join on C only scopes to B, whereas the same with cross join...
select * from A cross join B join C on C.SomeField = A.SomeField and C.SomeField = B.SomeField
..is deemed ok. The same would apply if cross apply is used. For example placing a cross apply on a function after B, the function could only use fields of B, where the same query with cross join, could use fields from both A and B.
Of course, this also means the reverse can be used as well. If you want to add a join solely for one of the tables, you can achieve that by going 'comma' on the tables.
They are the same and should (almost) never be used.
Besides brevity (favoring ,) and consistency (favoring CROSS JOIN), the sole difference is precedence.
The comma is lower precedence than other joins.
For example, the explicit form of
SELECT *
FROM a
CROSS JOIN b
JOIN c ON a.id = c.id
is
SELECT *
FROM (
a
CROSS JOIN b
)
INNER JOIN c ON a.id = c.id
which is valid.
Whereas the explicit form of
SELECT *
FROM a,
b
JOIN c ON a.id = c.id
is
SELECT *
FROM a
CROSS JOIN (
b
INNER JOIN c ON a.id = c.id
)
which is invalid (the join clause references inaccessible a).
In your example, there are only two tables, so the two queries are exactly equivalent.
The first version was originally the only way to join two tables. But it has a number of problems so the JOIN keyword was added in the ANSI-92 standard. They give the same results but the second is more explicit and is to be preferred.
To add to the answers already given:
select * from A, B
This was the only way of joining prior to the 1992 SQL standard. So if you wanted an inner join, you'd have to use the WHERE clause for the criteria:
select * from A, B
where A.x = B.y;
One problem with this syntax was that there was no standard for outer joins. Another was that this gets unreadable with many tables and is hence prone to errors and less maintainable.
select * from A, B, C, D
where B.id = C.id_b
and C.id_d = D.id;
Here we have a cross join of A with B/C/D. On purpose or not? Maybe the programmer just forgot the and B.id = A.id_b (or whatever), or maybe this line was deleted by mistake, and maybe still it was really meant to be a cross join. Who could say?
Here is the same with explicit joins
select *
from A
cross join B
inner join C on C.id_b = B.id
inner join D on D.id = C.id_d;
No doubt about the programmer's intentions anymore.
The old comma-separated syntax was replaced for good reasons and should not be used anymore.
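To make the "forgotten condition" scenario concrete, here is a sketch assuming SQLite and invented two-row tables (the column A.id_b is taken from the hypothetical missing condition above). Dropping the link to A silently multiplies the row count:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (id_b INTEGER);
    CREATE TABLE B (id INTEGER);
    CREATE TABLE C (id_b INTEGER, id_d INTEGER);
    CREATE TABLE D (id INTEGER);
    INSERT INTO A VALUES (1), (2);
    INSERT INTO B VALUES (1), (2);
    INSERT INTO C VALUES (1, 10), (2, 20);
    INSERT INTO D VALUES (10), (20);
""")


def count(sql):
    """Run a SELECT COUNT(*) query and return the number."""
    return conn.execute(sql).fetchone()[0]


# Comma syntax with no condition on A: every B/C/D row is repeated for each A row.
forgot_link = count("""
    SELECT COUNT(*) FROM A, B, C, D
    WHERE B.id = C.id_b AND C.id_d = D.id
""")

# Comma syntax with the link to A restored.
with_link = count("""
    SELECT COUNT(*) FROM A, B, C, D
    WHERE B.id = A.id_b AND B.id = C.id_b AND C.id_d = D.id
""")

# Explicit syntax: the cross join is spelled out, so it is clearly intended.
explicit_cross = count("""
    SELECT COUNT(*)
    FROM A
    CROSS JOIN B
    INNER JOIN C ON C.id_b = B.id
    INNER JOIN D ON D.id = C.id_d
""")

print(forgot_link, with_link, explicit_cross)
```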
These are the examples of implicit and explicit cross joins. See http://en.wikipedia.org/wiki/Join_%28SQL%29#Cross_join.
To the comments as to the utility of cross joins, there is one very useful and valid example of using cross joins or commas in the admittedly somewhat obscure world of Postgres generate_series and Postgis spatial sql where you can use a cross join against generate_series to extract the nth geometry out of a Geometry Collection or Multi-(Polygon/Point/Linestring), see: http://postgis.refractions.net/documentation/manual-1.4/ST_GeometryN.html
SELECT n, ST_AsEWKT(ST_GeometryN(the_geom, n)) As geomewkt
FROM (
VALUES (ST_GeomFromEWKT('MULTIPOINT(1 2 7, 3 4 7, 5 6 7, 8 9 10)') ),
( ST_GeomFromEWKT('MULTICURVE(CIRCULARSTRING(2.5 2.5,4.5 2.5, 3.5 3.5), (10 11, 12 11))') )
) As foo(the_geom)
CROSS JOIN generate_series(1,100) n
WHERE n <= ST_NumGeometries(the_geom);
This can be very useful if you want to get the area, centroid, bounding box or many of the other operations you can perform on a single geometry, when they are contained within a larger one.
I have always written such queries using a comma before generate_series, until one day when I wondered if this really meant cross join, which brought me to this post. Obscure, but definitely useful.