Joins, conditions and speed in SQL - sql

While preparing some queries, I was writing this:
SELECT *
FROM ta A
JOIN tb B
ON A.col1 = B.col1
JOIN tc C
ON B.col2 = C.col2
WHERE B.col3 = 'whatever'
AND C.col4 = 'whatever2'
And I began to think about the following:
SELECT *
FROM ta A
JOIN (SELECT * FROM tb WHERE col3 = 'whatever') B
ON A.col1 = B.col1
JOIN (SELECT * FROM tc WHERE col4 = 'whatever2') C
ON B.col2 = C.col2
(If I'm not mistaken, the result would be the same.) Would it be significantly faster? My guess is that it would, but I'd be interested in knowing why or why not.
(Our server is down at the moment, so I can't test it myself right now; I hope you won't mind me asking here.)
(In case it matters, the engine is Vertica, but my question isn't really specific to Vertica.)

Your second query is a little off, it should be:
SELECT *
FROM ta A
JOIN (SELECT * FROM tb WHERE tb.col3 = 'whatever') B
ON A.col1 = B.col1
JOIN (SELECT * FROM tc WHERE tc.col4 = 'whatever2') C
ON B.col2 = C.col2
Notice that the WHERE clauses inside the inline views need to reference the table in scope, not the alias for the view: B and C are out of scope within the inline views.
In any case, because you are doing an inner join, it won't matter from a results perspective because the condition is the same whether it occurs pre-join or post-join.
You can reasonably rely on the optimizer to do the following:
Only materialize the columns required when needed.
Push predicates down where it makes sense
That said, there should be no difference between the two statements. Most likely it is pushing down predicates for the first one to make it more like the second one. If you have statistics gathered, the optimizer should be smart enough to query these the same way (or really close).
That isn't to say I haven't seen what you have in your second query "fix" query issues for me in Vertica... but usually it's only when I am using multiple COUNT(DISTINCT ...) expressions or theta joins, etc.
Now if this were an outer join, then the statements would be different. The first one would apply the filter after the join, the second would be before the join.
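To make the inner versus outer join point concrete, here is a minimal sketch using Python's built-in sqlite3 module rather than Vertica. The two-table data set is made up for illustration, and only the tb filter from the question is reproduced. It only compares result sets; it says nothing about what plan Vertica would pick.

```python
import sqlite3

# Hypothetical sample data, not taken from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ta (col1 INTEGER);
    CREATE TABLE tb (col1 INTEGER, col3 TEXT);
    INSERT INTO ta VALUES (1), (2);
    INSERT INTO tb VALUES (1, 'whatever'), (2, 'other');
""")

def rows(join_type, filter_in_subquery):
    """Run the join with the filter either inside an inline view or in WHERE."""
    if filter_in_subquery:
        sql = f"""SELECT A.col1, B.col3 FROM ta A
                  {join_type} JOIN (SELECT * FROM tb WHERE col3 = 'whatever') B
                  ON A.col1 = B.col1
                  ORDER BY A.col1"""
    else:
        sql = f"""SELECT A.col1, B.col3 FROM ta A
                  {join_type} JOIN tb B ON A.col1 = B.col1
                  WHERE B.col3 = 'whatever'
                  ORDER BY A.col1"""
    return conn.execute(sql).fetchall()

inner_where = rows("INNER", False)  # filter applied after the join
inner_sub = rows("INNER", True)     # filter applied before the join
left_where = rows("LEFT", False)    # WHERE discards the unmatched ta row
left_sub = rows("LEFT", True)       # unmatched ta row survives with NULL

print(inner_where, inner_sub)
print(left_where, left_sub)
```

With the inner join both forms return the same single row. With the left join the WHERE form drops the row for col1 = 2, while the inline-view form keeps it with a NULL in B.col3.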
Of course, I'll mention that you really just need to do an explain of both methods. Just make sure statistics are gathered.
Hope it helps.

Your first query will work fine, but the second query will not execute and causes an error when it is written as JOIN (SELECT * FROM tb WHERE B.col3 = 'whatever') B ON A.col1 = B.col1. The alias B names the derived table as a whole, so it is not in scope inside the subquery's own WHERE clause.
The join condition A.col1 = B.col1 takes A.col1 from the ta table and B.col1 from the subquery, so the subquery has to expose every column that the joins use. When specifying a subquery in a join, I would avoid the ' * ' operator and list the required column names instead, as in the query below:
SELECT *
FROM ta A
JOIN (SELECT col1, col2 FROM tb WHERE col3 = 'whatever') B
ON A.col1 = B.col1
JOIN (SELECT col2 FROM tc WHERE col4 = 'whatever2') C
ON B.col2 = C.col2
This will execute and return a result. The first join subquery selects two columns, col1 and col2, because the second join condition uses B.col2. In the outer SELECT clause you can use the ' * ' operator, which returns all the columns exposed by ta and the two subqueries.
The two queries do not differ much, but I would expect the first to execute faster than the second. The second uses two subqueries, which can mean extra searches in the database and a slightly slower result.

Related

Is it necessary to use a GROUP BY statement to reduce the number of updates?

There are two tables
Table A col1,col2,col3
100,200,aaa;
101,200,bbb;
102,200,ccc;
Table B col1,col2,col3
aaa,1,ok;
aaa,2,ok;
aaa,3,ok;
bbb,1,fine;
bbb,3,fine;
Assume table A is a very large table and table B is a small table. In table B, each col1 value has only one col3 value; e.g. if col1 is 'aaa', col3 must be 'ok'.
case 1:
update a set a.col2 = b.col3
from A a, B b
where a.col3 = b.col1
case 2:
update a set a.col2 = b.col3
from A a, (select col1, col3 from B group by col1,col3) b
where a.col3 = b.col1
The results of case 1 and case 2 are the same, but which statement is better? Will case 1 update table A 5 times? Will the GROUP BY statement in case 2 cost more computation?
You should run EXPLAIN on both these queries to see how your database is actually handling things. That being said, one thing does stand out in terms of performance. In your first query:
update a set a.col2 = b.col3
from A a, B b
where a.col3 = b.col1
you are joining table A with B via the col3 and col1 columns. If there were an index on B.col1 then the join could proceed much faster than if the database were forced to do a full table scan of B. But an index on B.col1 probably would not help in your second query:
update a set a.col2 = b.col3
from A a, (select col1, col3 from B group by col1,col3) b
where a.col3 = b.col1
Here you are joining A to a table derived from B and as such no index is likely available. So I would opt for your first query.
By the way, you are using the old pre ANSI-92 syntax for joining in your first query and you might want to update it.
Since these 2 statements are logically equal (result wise) they might have the same execution plan and therefore have the same performance.
Different execution plans might give an advantage to each of the statements.
I would like to emphasize one thing: nested loops are not the only way to implement a JOIN. In databases that support HASH JOIN, nested loops are rarely used for equality joins, so the whole way you are thinking about what is going on here needs to be revised.
Thank you all. According to the SQL Server execution plan, the rows going into the update are deduplicated in the background (the plan shows an automatic sort/distinct step), so there is no need to apply DISTINCT manually.
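For anyone who wants to see the row counts behind the "5 times" question, here is a small sketch using Python's sqlite3 module with the sample rows from the question. It deliberately runs SELECTs rather than the UPDATE, so it only shows how many candidate rows each join feeds into the update; how an engine then applies them (SQL Server deduplicated them in the plan mentioned above) is not modelled here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (col1 INTEGER, col2 INTEGER, col3 TEXT);
    CREATE TABLE B (col1 TEXT, col2 INTEGER, col3 TEXT);
    INSERT INTO A VALUES (100, 200, 'aaa'), (101, 200, 'bbb'), (102, 200, 'ccc');
    INSERT INTO B VALUES ('aaa', 1, 'ok'), ('aaa', 2, 'ok'), ('aaa', 3, 'ok'),
                         ('bbb', 1, 'fine'), ('bbb', 3, 'fine');
""")

# Case 1: join straight to B, one candidate row per matching B row.
case1 = conn.execute("""
    SELECT a.col1, b.col3
    FROM A a JOIN B b ON a.col3 = b.col1
    ORDER BY a.col1
""").fetchall()

# Case 2: join to B deduplicated with GROUP BY, one candidate per A row.
case2 = conn.execute("""
    SELECT a.col1, b.col3
    FROM A a
    JOIN (SELECT col1, col3 FROM B GROUP BY col1, col3) b ON a.col3 = b.col1
    ORDER BY a.col1
""").fetchall()

print(len(case1), len(case2))
print(set(case1) == set(case2))
```

Case 1 produces five candidate rows and case 2 produces two, but the distinct (row, new value) pairs are identical, which is why both updates end in the same state.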

Change join condition based on a condition - Oracle

assume we have two tables.
TabA(Acol1,Acol2,Acol3)
TabB(Bcol1,Bcol2,Bcol3)
The requirement is to join the two tables on Acol1 = Bcol1 and, if Acol3 = 'C', to also join on Acol2 = Bcol2 in addition to the join above. Can we do this in a single SQL query? Is a join evaluated row by row or for the table as a whole?
One solution I can think of is using UNION, but I don't think it will be an optimized one. Are there any other solutions?
Another solution I came up with:
SELECT A.*,B.* FROM TabA A
INNER JOIN TabB B ON A.Acol1 = B.BCol1
and case when A.Acol3='C' then A.ACol2 else '1' end =
case when A.Acol3='C' then B.BCol2 else '1' end ;
Is there any other solution without CASE and UNION?
Thanks in advance
If you want to join on TabA.col2 only when it is 'C', then in those cases TabB.col2 will also be 'C', as you are already joining on col1. So your output will be the same as what you get from the first join alone.
select a.*, b.* from tabA a join tabB b
on a.col1=b.col1
This should give you the same output anytime. Try creating a different scenario on values of 'C'. The result will always be a subset of your first join result.
Hmmm, after thinking about the question, I think you might just want a complicated on clause:
select a.*, b.*
from tabA a join
tabB b
on a.col1 = b.col1 and
(a.col3 <> 'C' or a.col2 = b.col2);
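Note: the above assumes that a.col3 is not null (that condition is easily included if needed). With a null a.col3, the CASE version falls through to its ELSE branch and matches, while a.col3 <> 'C' evaluates to unknown.
You may need to work out some examples by hand to see that the OR method is equivalent to the CASE expression.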
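Rather than working the examples by hand, here is a quick sketch that runs both forms side by side. It uses Python's sqlite3 module instead of Oracle, the column names from the question, and made-up rows with no nulls, so treat it as a sanity check under those assumptions rather than proof for every data set.

```python
import sqlite3

# Hypothetical rows: one 'C' row whose col2 matches, one non-'C' row,
# and one 'C' row whose col2 does not match.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TabA (Acol1 INTEGER, Acol2 TEXT, Acol3 TEXT);
    CREATE TABLE TabB (Bcol1 INTEGER, Bcol2 TEXT, Bcol3 TEXT);
    INSERT INTO TabA VALUES (1, 'x', 'C'), (2, 'x', 'D'), (3, 'y', 'C');
    INSERT INTO TabB VALUES (1, 'x', 'b1'), (2, 'z', 'b2'), (3, 'x', 'b3');
""")

case_rows = conn.execute("""
    SELECT A.Acol1, B.Bcol3
    FROM TabA A
    INNER JOIN TabB B ON A.Acol1 = B.Bcol1
     AND CASE WHEN A.Acol3 = 'C' THEN A.Acol2 ELSE '1' END
       = CASE WHEN A.Acol3 = 'C' THEN B.Bcol2 ELSE '1' END
    ORDER BY A.Acol1
""").fetchall()

or_rows = conn.execute("""
    SELECT A.Acol1, B.Bcol3
    FROM TabA A
    INNER JOIN TabB B ON A.Acol1 = B.Bcol1
     AND (A.Acol3 <> 'C' OR A.Acol2 = B.Bcol2)
    ORDER BY A.Acol1
""").fetchall()

print(case_rows)
print(or_rows)
```

Both queries keep rows 1 and 2 and drop row 3, where Acol3 is 'C' but the col2 values differ.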

join clause, match or null

So I have some procs I inherited that I am trying to clean up. One of the things I see over and over in them is the following:
Update Table_A
Set A.ColX = B.Colx
From Table_A A
Join Table_B B on B.col1 =A.col1
and B.col2 = A.col2
Update Table_A
Set A.ColX = B.Colx
From Table_A A
Join Table_B B on a.col1 =b.col1
and B.col2 is null
Now, I have tried to combine these into a single query using the following different final lines (not at the same time!):
1) and (B.col2 = A.col2 or B.col2 is null)
2) and (isnull(B.col2,'') = COALESCE(a.col2, ''))
However, it always seems to do one of the updates, not both. I feel like I am missing something rather obvious. Is there a good way to combine these two queries?
thanks
This query should work:
Update Table_A
Set A.ColX = B.Colx
From Table_A A
Join Table_B B on B.col1 = A.col1
and (B.col2 = A.col2 or B.col2 is null)
which you said you tried - but you may try it as a SELECT first and see what the results are. That may shed some light on why you're not getting the results you expect.
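A sketch of that SELECT preview, using Python's sqlite3 module with invented rows (the real tables will differ). One possible explanation it can surface, offered as a guess: if a Table_A row matches both an exact Table_B row and a null-col2 row, the combined join yields two candidates for that row, and a joined update can only keep one of them.

```python
import sqlite3

# Hypothetical data: A row 1 has both an exact match and a NULL fallback in B.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table_A (col1 INTEGER, col2 TEXT, ColX TEXT);
    CREATE TABLE Table_B (col1 INTEGER, col2 TEXT, ColX TEXT);
    INSERT INTO Table_A VALUES (1, 'a', NULL), (2, 'b', NULL), (3, 'c', NULL);
    INSERT INTO Table_B VALUES (1, 'a', 'exact'), (1, NULL, 'fallback1'),
                               (2, NULL, 'fallback2'), (3, 'zzz', 'nomatch');
""")

# Preview of the rows the combined UPDATE would draw its values from.
preview = conn.execute("""
    SELECT A.col1, B.col2, B.ColX
    FROM Table_A A
    JOIN Table_B B ON B.col1 = A.col1
                  AND (B.col2 = A.col2 OR B.col2 IS NULL)
    ORDER BY A.col1, B.ColX
""").fetchall()

# Table_A keys that have more than one candidate value.
ambiguous = conn.execute("""
    SELECT A.col1
    FROM Table_A A
    JOIN Table_B B ON B.col1 = A.col1
                  AND (B.col2 = A.col2 OR B.col2 IS NULL)
    GROUP BY A.col1
    HAVING COUNT(*) > 1
""").fetchall()

print(preview)
print(ambiguous)
```

Here row 1 gets two candidates ('exact' and 'fallback1'), row 2 gets one, and row 3 gets none. If the preview on the real data shows duplicates like row 1, that would be consistent with seeing only "one of the updates".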
I would expect the following query to work in SQL Server:
Update A
Set ColX = B.Colx
From Table_A A Join
Table_B B
on a.col1 = b.col1 and
(B.col2 = A.col2 or B.col2 is null);
Notes:
You should use the alias defined in the from clause after the update. My understanding is that if you use the table name and the table is not in the from clause without an alias, then all rows will be updated.
Although I was pretty sure that SQL Server does not support table aliases in the set, I appear to be wrong about that, as this simple SQL Fiddle shows. Perhaps this was not allowed in some ancient version of SQL Server, and the limitation just stuck with me.

Effect of style/format on SQL

Ignoring version, what are the best practices for formatting SQL code?
I prefer this way (method A):
select col from a inner join b on a.id = b.id inner join c on b.id = c.id
a colleague prefers another (method B):
select col from a inner join (b inner join c on b.id=c.id) on a.id = b.id
I'd like to know if there is any difference - the query optimiser appears to generate the same execution plan for both. So maybe it is just readability?
This is the first time I've seen SQL written using method B, does anyone else write SQL like this? Personally I find it really difficult to read method B.
EDIT: Please note the code is on one line and in upper case to make both more comparable for the purpose of this question.
I think A is more readable, and most sample code out there uses that style. Both parse the same and produce the same query plan, so as far as SQL Server is concerned, there is no difference.
I normally also uppercase keywords and indent for readability:
SELECT col
FROM a
    INNER JOIN b
        ON a.id = b.id
    INNER JOIN c
        ON b.id = c.id
Method B is a subselect-like syntax, but it is parsed the same way as method A. There's no harm in using it. I personally prefer method A too, because it can be read in a linear fashion.
My personal preference is
SELECT col1, col2, col3,
col4, col5
FROM a
INNER JOIN b ON a.id = b.id
INNER JOIN c ON b.id = c.id
WHERE a.col1 = 1
I think consistency is key. I prefer your way over your colleague's for readability.

Restricting a LEFT JOIN

I have a table, let's call it "a" that is used in a left join in a view that involves a lot of tables. However, I only want to return rows of "a" if they also join with another table "b". So the existing code looks like
SELECT ....
FROM main ...
...
LEFT JOIN a ON (main.col2 = a.col2)
but it's returning too many rows, specifically ones where a doesn't have a match in b. I tried
SELECT ...
FROM main ...
...
LEFT JOIN (
SELECT a.col1, a.col2
FROM a
JOIN b ON (a.col3 = b.col3)) ON (a.col2 = main.col2)
which gives me the correct results, but unfortunately "EXPLAIN PLAN" tells me that doing it this way ends up forcing a full table scan of both a and b, which makes things quite slow. One of my co-workers suggested another LEFT JOIN on b, but that doesn't work: it gives me the b row when it's present, but doesn't stop returning the rows from a that don't have a match in b.
Is there any way to put the main.col2 condition in the sub-SELECT, which would get rid of the full table scans? Or some other way to do what I want?
SELECT ...
FROM ....
LEFT JOIN ( a INNER JOIN b ON .... ) ON ....
Add a WHERE (main.col2 = a.col2).
Just do a join instead of a left join.
What if you created a view that gets you the "a" to "b" join, then do your left joins to that view?
Select ...
From Main
Left Join a on main.col2 = a.col2
where a.col3 in (select b.col3 from b) or a.col3 is null
you may also need to do some indexing on a.col3 and b.col3
First define your query between table "a" and "b" to make sure it is returning the rows you want:
Select
a.field1,
a.field2,
b.field3
from
table_a a
JOIN table_b b
on (b.someid = a.someid)
then put that in as a sub-query of your main query:
select
m.field1,
m.field2,
m.field3,
a.field1 as a_field1,
b.field1 as b_field1
from
Table_main m
LEFT OUTER JOIN
(
Select
a.field1,
a.field2,
b.field3
from
table_a a
JOIN table_b b
on (b.someid = a.someid)
) sq
on (sq.field1 = m.field1)
that should do it.
Ahh, I missed the note about the performance problem. What I usually end up doing is putting the query from the view in a stored procedure, so I can generate the sub-queries into temp tables and put indexes on them. It is surprisingly faster than you would expect. :-)
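As a small check of the sub-query approach, here is a sketch in Python's sqlite3 module. The table main_tbl stands in for "main", and all rows are invented for illustration. It only compares which rows come back; it does not reproduce the full-table-scan behaviour reported by EXPLAIN PLAN.

```python
import sqlite3

# Hypothetical data: a2 has no matching row in b.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE main_tbl (col1 INTEGER, col2 INTEGER);
    CREATE TABLE a (col1 TEXT, col2 INTEGER, col3 INTEGER);
    CREATE TABLE b (col3 INTEGER);
    INSERT INTO main_tbl VALUES (1, 10), (2, 20), (3, 30);
    INSERT INTO a VALUES ('a1', 10, 100), ('a2', 20, 200);
    INSERT INTO b VALUES (100);
""")

# Plain LEFT JOIN: returns every matching row of a, whether or not b matches.
plain = conn.execute("""
    SELECT m.col1, a.col1
    FROM main_tbl m
    LEFT JOIN a ON m.col2 = a.col2
    ORDER BY m.col1
""").fetchall()

# LEFT JOIN to a sub-query that already restricts a to rows with a match in b.
restricted = conn.execute("""
    SELECT m.col1, sq.col1
    FROM main_tbl m
    LEFT JOIN (
        SELECT a.col1, a.col2
        FROM a
        JOIN b ON a.col3 = b.col3
    ) sq ON sq.col2 = m.col2
    ORDER BY m.col1
""").fetchall()

print(plain)
print(restricted)
```

All three main_tbl rows are kept in both queries, but the restricted form returns NULL instead of 'a2' for the row whose a match has no counterpart in b.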