Ignoring version, what are the best practices for formatting SQL code?
I prefer this way (method A):
select col from a inner join b on a.id = b.id inner join c on b.id = c.id
a colleague prefers another (method B):
select col from a inner join (b inner join c on b.id=c.id) on a.id = b.id
I'd like to know if there is any difference - the query optimiser appears to generate the same execution plan for both. So maybe it is just readability?
This is the first time I've seen SQL written using method B, does anyone else write SQL like this? Personally I find it really difficult to read method B.
EDIT: Please note the code is on one line and in upper case to make both more comparable for the purpose of this question.
I think A is more readable, and most sample code out there uses that style. Both parse the same and produce the same query plan, so as far as SQL Server is concerned, there is no difference.
I normally also uppercase keywords and indent for readability:
SELECT col
FROM a
INNER JOIN b
ON a.id = b.id
INNER JOIN c
ON b.id = c.id
Method B is a subselect-like syntax, but it is parsed the same way as method A. There's no harm in using it. I personally prefer method A too, because it can be read in a linear fashion.
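For anyone who wants to check the "same result" claim on their own machine, here is a minimal sketch using Python's built-in sqlite3 module. SQLite is only a stand-in here (the thread is about SQL Server), and the table contents are made up for illustration. It compares rows only, not execution plans.

```python
import sqlite3

# In-memory SQLite database; the sample rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER, col TEXT);
    CREATE TABLE b (id INTEGER);
    CREATE TABLE c (id INTEGER);
    INSERT INTO a VALUES (1, 'x'), (2, 'y'), (3, 'z');
    INSERT INTO b VALUES (1), (2);
    INSERT INTO c VALUES (2), (3);
""")

# Method A: joins written one after another.
method_a = """
    SELECT a.col
    FROM a
    INNER JOIN b ON a.id = b.id
    INNER JOIN c ON b.id = c.id
"""

# Method B: the b/c join is nested in parentheses.
method_b = """
    SELECT a.col
    FROM a
    INNER JOIN (b INNER JOIN c ON b.id = c.id) ON a.id = b.id
"""

rows_a = sorted(conn.execute(method_a).fetchall())
rows_b = sorted(conn.execute(method_b).fetchall())
print(rows_a, rows_b)  # only id 2 exists in all three tables
```

With inner joins only, both spellings return the same rows; whether the plans also match is something to confirm with your own engine's plan viewer.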
My personal preference is
SELECT col1, col2, col3,
col4, col5
FROM a
INNER JOIN b ON a.id = b.id
INNER JOIN c ON b.id = c.id
WHERE a.col1 = 1
I think consistency is key. I prefer your way over your colleague's for readability.
Related
Vertica has an interesting update syntax when updating a table based on a join value. Instead of using a join to find the update rows, it mandates a syntax like this:
UPDATE a
SET col = b.val
WHERE a.id = b.id
(Note that this syntax is indeed mandated in this case, because Vertica prohibits us from using a where clause that includes a "self-join", that is a join referencing the table being updated, in this case a.)
This syntax is nice, but it's less explicit about the join being used than other SQL dialects. For example, what happens in this case?
UPDATE a
SET col = CASE WHEN b.id IS NULL THEN 0 ELSE b.val END
WHERE a.id = b.id
What happens when a.id has no match in b.id? Does a.col not get updated, as though the condition a.id = b.id represented an inner join of a and b? Or does it get updated to zero, as if the condition were a left outer join?
I think Vertica uses the Postgres standard for this syntax:
UPDATE a
SET col = b.val
FROM b
WHERE a.id = b.id;
This is an INNER JOIN. I agree that it would be nice if Postgres and the derived databases supported explicit JOINs to the update table (as some other databases do). But the answer to your question is that this is an INNER JOIN.
I should note that if you want a LEFT JOIN, you have two options. One is a correlated subquery:
UPDATE a
SET col = (SELECT b.val FROM b WHERE a.id = b.id);
The other is an additional level of JOIN (assuming that id is unique in a):
UPDATE a
SET col = b.val
FROM a a2 LEFT JOIN
b
ON a2.id = b.id
WHERE a.id = a2.id;
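To see what the correlated-subquery variant does to unmatched rows, here is a small sketch run against SQLite through Python's sqlite3 module. SQLite is an assumption on my part (I have not run this on Vertica), and the sample rows and the COALESCE wrapper are illustrative additions, not part of the answer above.

```python
import sqlite3

# In-memory SQLite database; sample rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY, col INTEGER);
    CREATE TABLE b (id INTEGER PRIMARY KEY, val INTEGER);
    INSERT INTO a VALUES (1, -1), (2, -1);  -- id 2 has no match in b
    INSERT INTO b VALUES (1, 10);
""")

# Correlated subquery: every row of a is updated, so it behaves like a
# LEFT JOIN. Rows without a match get NULL because the subquery is empty.
conn.execute("UPDATE a SET col = (SELECT b.val FROM b WHERE a.id = b.id)")
after_subquery = conn.execute("SELECT id, col FROM a ORDER BY id").fetchall()
print(after_subquery)  # [(1, 10), (2, None)]

# Wrapping the subquery in COALESCE gives unmatched rows a default of 0.
conn.execute(
    "UPDATE a SET col = COALESCE((SELECT b.val FROM b WHERE a.id = b.id), 0)"
)
after_coalesce = conn.execute("SELECT id, col FROM a ORDER BY id").fetchall()
print(after_coalesce)  # [(1, 10), (2, 0)]
```

This is the "updated to zero, as if it were a left outer join" behaviour the question asks about, obtained explicitly rather than through the implicit join.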
While preparing some requests, I was writing this:
SELECT *
FROM ta A
JOIN tb B
ON A.col1 = B.col1
JOIN tc C
ON B.col2 = C.col2
WHERE B.col3 = 'whatever'
AND C.col4 = 'whatever2'
And I began to think about the following:
SELECT *
FROM ta A
JOIN (SELECT * FROM tb WHERE col3 = 'whatever') B
ON A.col1 = B.col1
JOIN (SELECT * FROM tc WHERE col4 = 'whatever2') C
ON B.col2 = C.col2
(If I'm not mistaken, the result would be the same.) I'm wondering whether it would be significantly faster. My guess is that it would, but I'd be interested in knowing why or why not.
(Because our server is down at the moment, I can't test it myself right now, so I'm asking here, I hope you won't mind.)
(In case it matters, the engine is Vertica, but my question isn't really specific to Vertica)
Your second query is a little off, it should be:
SELECT *
FROM ta A
JOIN (SELECT * FROM tb WHERE tb.col3 = 'whatever') B
ON A.col1 = B.col1
JOIN (SELECT * FROM tc WHERE tc.col4 = 'whatever2') C
ON B.col2 = C.col2
Notice that the WHERE clauses inside the inline views need to reference the table in scope, not the alias for the view. B and C are out of scope within the inline views.
In any case, because you are doing an inner join, it won't matter from a results perspective because the condition is the same whether it occurs pre-join or post-join.
You can reasonably rely on the optimizer to do the following:
Only materialize the columns required when needed.
Push predicates down where it makes sense
That said, there should be no difference between the two statements. Most likely it is pushing down predicates for the first one to make it more like the second one. If you have statistics gathered, the optimizer should be smart enough to query these the same way (or really close).
That isn't to say I haven't seen what you have in your second query "fix" query issues for me in Vertica... but usually it's only when I am using multiple COUNT(DISTINCT ...) expressions or theta joins, etc.
Now if this were an outer join, then the statements would be different: the first would apply the filter after the join, the second before the join.
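The inner-versus-outer distinction above can be checked with a small sketch. It uses SQLite via Python's sqlite3 module as a stand-in (not Vertica), trims the example to two tables, and uses made-up rows; it compares results only, not speed.

```python
import sqlite3

# In-memory SQLite database; sample rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ta (col1 INTEGER);
    CREATE TABLE tb (col1 INTEGER, col3 TEXT);
    INSERT INTO ta VALUES (1), (2);
    INSERT INTO tb VALUES (1, 'whatever'), (2, 'other');
""")

def query(join_kind, filter_in_subquery):
    """Build and run the join with the filter applied before or after it."""
    if filter_in_subquery:
        sql = f"""
            SELECT A.col1, B.col3
            FROM ta A
            {join_kind} (SELECT * FROM tb WHERE col3 = 'whatever') B
              ON A.col1 = B.col1
            ORDER BY A.col1
        """
    else:
        sql = f"""
            SELECT A.col1, B.col3
            FROM ta A
            {join_kind} tb B ON A.col1 = B.col1
            WHERE B.col3 = 'whatever'
            ORDER BY A.col1
        """
    return conn.execute(sql).fetchall()

inner_post = query("JOIN", filter_in_subquery=False)
inner_pre = query("JOIN", filter_in_subquery=True)
left_post = query("LEFT JOIN", filter_in_subquery=False)
left_pre = query("LEFT JOIN", filter_in_subquery=True)

print(inner_post, inner_pre)  # identical for the inner join
print(left_post)              # WHERE discards the NULL-extended row
print(left_pre)               # the unmatched row of ta survives
```

For the inner join both forms return the same row. For the left join, filtering in WHERE removes the row for col1 = 2, while filtering inside the inline view keeps it with a NULL col3.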
Of course, I'll mention that you really just need to do an explain of both methods. Just make sure statistics are gathered.
Hope it helps.
Your first query will work fine, but the second query will not execute and will raise an error. The reason is that you are using JOIN (SELECT * FROM tb WHERE B.col3 = 'whatever') B ON A.col1 = B.col1.
In this condition you are matching the columns with A.col1 = B.col1. Here you will get A.col1 from the ta table, but you will not get B.col1. When specifying a subquery in a join, you should not use the ' * ' operator. Joins will not recognize this operator in a subquery. You need to specify the required column names, as in the query below:
SELECT *
FROM ta A
JOIN (SELECT col1, col2 FROM tb WHERE col3 = 'whatever') B
ON A.col1 = B.col1
JOIN (SELECT col2 FROM tc WHERE col4 = 'whatever2') C
ON B.col2 = C.col2
This will execute and return a result. Two columns, col1 and col2, are selected in the first join subquery because B.col2 is used in the second join condition. In the outer SELECT clause you can use the ' * ' operator, which returns all the columns from all three tables. But you are not supposed to use the operator in a subquery of a join, as joins are coded in such a way.
The two queries do not differ much, but your first one will execute faster than the second. The second uses two subqueries, which make multiple searches in the database and return the result a little slower than the first.
When joining tables with either the ANSI-89 (old) or the ANSI-92 ("new") method, does it matter on which side you place the fields from the two joined tables?
For example, is it better to do:
From
TABLE_1 A
Join
TABLE_2 B
on A.ID = B.ID
Or is the following better?
on B.ID = A.ID
Is it simply aesthetics? Or does it affect how the joins work?
EDIT: For further clarification, what about Left Joins? For example:
From
TABLE_1 A
Left Join
TABLE_2 B
on A.ID = B.ID
Is this the same as
on B.ID = A.ID
However, if using ANSI-89 Where A.ID = B.ID (+) is NOT the same as Where B.ID = A.ID (+) since the second joins A ONTO B?
It makes no difference. The only time the order matters is when you are doing LEFT and RIGHT OUTER joins, but those keywords all fall before the ON keyword.
The = operator is symmetric, so a.id = b.id is exactly the same as b.id = a.id. Personally, I prefer having the fields from the driving table (the one in the FROM clause) on the left hand side of the operator, but that's purely an aesthetic preference.
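A quick way to convince yourself: the sketch below runs the LEFT JOIN with both spellings of the ON clause and, for contrast, with the tables themselves swapped. It uses SQLite through Python's sqlite3 module as a stand-in for whichever engine you use, and the rows are made up.

```python
import sqlite3

# In-memory SQLite database; sample rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TABLE_1 (ID INTEGER);
    CREATE TABLE TABLE_2 (ID INTEGER);
    INSERT INTO TABLE_1 VALUES (1), (2);
    INSERT INTO TABLE_2 VALUES (2), (3);
""")

def left_join(on_clause):
    """LEFT JOIN TABLE_1 to TABLE_2 using the given ON condition."""
    sql = f"""
        SELECT A.ID, B.ID
        FROM TABLE_1 A
        LEFT JOIN TABLE_2 B ON {on_clause}
        ORDER BY A.ID
    """
    return conn.execute(sql).fetchall()

a_first = left_join("A.ID = B.ID")
b_first = left_join("B.ID = A.ID")
print(a_first, b_first)  # same rows either way

# What does change the result is which table sits on the left of LEFT JOIN.
swapped_tables = conn.execute("""
    SELECT A.ID, B.ID
    FROM TABLE_2 B
    LEFT JOIN TABLE_1 A ON A.ID = B.ID
    ORDER BY B.ID
""").fetchall()
print(swapped_tables)
```

The order of the operands around = has no effect; the position of the tables relative to the LEFT JOIN keyword decides which side keeps its unmatched rows.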
I am trying to left join two tables using a query like this
SELECT * FROM table1 a, table2 b WHERE (a.ID = b.ID OR b.ID IS NULL)
In Oracle, this is equivalent to a LEFT JOIN (and in other databases as well, afaik).
Doing the same thing in DB2 (z/OS) produces an inner join: the b.ID IS NULL clause has no effect on the result, and removing it does not change anything.
Is there a way to make this work in DB2? Is this something that should work according to ANSI SQL?
PS: I am aware that I can use the JOIN syntax, I'm just interested in why this doesn't work and if there is a way around this.
You can use
SELECT a.*, b.*
FROM tbl1 a LEFT JOIN tbl2 b ON a.id=b.id;
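To illustrate why the comma-join version behaves like an inner join, here is a minimal sketch in SQLite via Python's sqlite3 module. SQLite is an assumption used for convenience (this was not run on DB2 or Oracle), and the rows are made up. The point is that b.ID IS NULL is tested against rows that actually exist in table2, so it only matches if table2 itself contains a NULL ID.

```python
import sqlite3

# In-memory SQLite database; sample rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (ID INTEGER);
    CREATE TABLE table2 (ID INTEGER);
    INSERT INTO table1 VALUES (1), (2);  -- ID 2 has no partner in table2
    INSERT INTO table2 VALUES (1);
""")

# Comma join: a cross product filtered by WHERE. No NULL-extended rows are
# ever generated, so the IS NULL branch never fires for this data.
comma_join = conn.execute("""
    SELECT a.ID, b.ID
    FROM table1 a, table2 b
    WHERE (a.ID = b.ID OR b.ID IS NULL)
    ORDER BY a.ID
""").fetchall()
print(comma_join)  # [(1, 1)]

# Explicit LEFT JOIN: unmatched rows of table1 are kept with NULLs.
left_join = conn.execute("""
    SELECT a.ID, b.ID
    FROM table1 a
    LEFT JOIN table2 b ON a.ID = b.ID
    ORDER BY a.ID
""").fetchall()
print(left_join)  # [(1, 1), (2, None)]
```

In this sketch the comma join drops the unmatched row exactly as described for DB2, while the LEFT JOIN keeps it.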
I have several statements which access very large PostgreSQL tables, e.g. with:
SELECT a.id FROM a WHERE a.id IN ( SELECT b.id FROM b );
SELECT a.id FROM a WHERE a.id NOT IN ( SELECT b.id FROM b );
Some of them access even more tables in that way. What is the best approach to increase the performance? Should I switch to joins, for example?
Many thanks!
JOIN will be far more efficient, or you can use EXISTS:
SELECT a.id FROM a WHERE EXISTS (SELECT 1 FROM b WHERE b.id = a.id)
The subquery will return at most 1 row.
Here's a way to filter rows with an INNER JOIN:
SELECT a.id
FROM a
INNER JOIN b ON a.id = b.id
Note that each version can perform differently; sometimes IN is faster, sometimes EXISTS, and sometimes the INNER JOIN.
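Besides speed, the variants are not always interchangeable in what they return. The sketch below shows two standard pitfalls using SQLite through Python's sqlite3 module (a stand-in for PostgreSQL, with made-up rows): a join repeats rows when b.id is not unique, and NOT IN returns nothing when the subquery yields a NULL.

```python
import sqlite3

# In-memory SQLite database; sample rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER);
    CREATE TABLE b (id INTEGER);
    INSERT INTO a VALUES (1), (2), (3);
    INSERT INTO b VALUES (1), (1), (NULL);  -- duplicate and NULL on purpose
""")

def ids(sql):
    """Run a query and return the ids as a flat list."""
    return [row[0] for row in conn.execute(sql)]

in_rows = ids("SELECT a.id FROM a WHERE a.id IN (SELECT b.id FROM b) ORDER BY a.id")
exists_rows = ids(
    "SELECT a.id FROM a WHERE EXISTS (SELECT 1 FROM b WHERE b.id = a.id) ORDER BY a.id"
)
join_rows = ids("SELECT a.id FROM a INNER JOIN b ON a.id = b.id ORDER BY a.id")
print(in_rows, exists_rows, join_rows)  # the join repeats id 1

not_in_rows = ids("SELECT a.id FROM a WHERE a.id NOT IN (SELECT b.id FROM b) ORDER BY a.id")
not_exists_rows = ids(
    "SELECT a.id FROM a WHERE NOT EXISTS (SELECT 1 FROM b WHERE b.id = a.id) ORDER BY a.id"
)
print(not_in_rows, not_exists_rows)  # NOT IN is empty because of the NULL
```

So when rewriting for performance, it is worth checking that b.id is unique before switching IN to a join, and that it is NOT NULL before treating NOT IN and NOT EXISTS as equivalent.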
Yes, I would recommend switching to joins. It will speed up the SELECT statements.