Nested sql joins process explanation needed - sql

I want to understand the process of nested join clauses in sql queries. Can you explain this example with pseudo codes? (What is the order of joining tables?)
FROM
table1 AS t1 (nolock)
INNER JOIN table2 AS t2 (nolock)
INNER JOIN table3 as t3 (nolock)
ON t2.id = t3.id
ON t1.mainId = t2.mainId

In SQl basically we have 3 ways to join two tables.
Nested Loop ( Good if one table has small number of rows),
Hash Join (Good if both table has very large rows, it does expensive hash formation in memory)
Merge Join (Good when we have sorted data to join).
From your question it seems that you want for Nested Loop.
Let us say t1 has 20 rows, t2 has 500 rows.
Now it will be like
For each row in t1
Find rows in t2 where t1.MainId = t2.MainId
Now out put of that will be joined to t3.
Order of Joining depends on Optimizer, Expected Row count etc.

Try EXPLAIN query.
It tells you exactly what's going on. :)
Of course that doesn't work in SQL Server. For that you can try Razor SQLServer Explain Plan
Or even SET SHOWPLAN_ALL

If you're using SQL Server Query Analyzer, look for "Show Execution Plan" under the "Query" menu, and enable it.

Related

INNER JOIN with complex condition dramatically increases the execution time

I have 2 tables with several identical fields needed to be linked in JOIN condition. E.g. in each table there are fields: P1, P2. I want to write the following join query:
SELECT ... FROM Table1
INNER JOIN
Table2
ON Table1.P1 = Table2.P1
OR Table1.P2 = Table2.P2
OR Table1.P1 = Table2.P2
OR Table1.P2 = Table2.P1
In the case I have huge tables this request is executing a lot of time.
I tried to test how long will be the request of a query with one condition only. First, I have modified the tables in such way all data from P2 & P1 where copied as new rows into Table1 & Table2. So my query is simple:
SELECT ... FROM Table1 INNER JOIN Table2 ON Table1.P = Table2.P
The result was more then surprised: the execution time from many hours (the 1st case) was reduced to 2-3 seconds!
Why is it so different? Does it mean the complex conditions are always reduce performance? How can I improve the issue? May be P1,P2 indexing will help? I want to remain the 1st DB schema and not to move to one field P.
The reason the queries are different is because of the join strategies being used by the optimizer. There are basically four ways that two tables can be joined:
"Hash join": Creates a hash table on one of the tables which it uses to look up the values in the second.
"Merge join": Sorts both tables on the key and then readsthe results sequentially for the join.
"Index lookup": Uses an index to look up values in one table.
"Nested Loop": Compars each value in each table to all the values in the other table.
(And there are variations on these, such as using an index instead of a table, working with partitions, and handling multiple processors.) Unfortunately, in SQL Server Management Studio both (3) and (4) are shown as nested loop joins. If you look more closely, you can tell the difference from the parameters in the node.
In any case, your original join is one of the first three -- and it goes fast. These joins can basically only be used on "equi-joins". That is, when the condition joining the two tables includes an equality operator.
When you switch from a single equality to an "in" or set of "or" conditions, the join condition has changed from an equijoin to a non-equijoin. My observation is that SQL Server does a lousy job of optimization in this case (and, to be fair, I think other databases do pretty much the same thing). Your performance hit is the hit of going from a good join algorithm to the nested loops algorithm.
Without testing, I might suggest some of the following strategies.
Build an index on P1 and P2 in both tables. SQL Server might use the index even for a non-equijoin.
Use the union query suggested in another solution. Each query should be correctly optimized.
Assuming these are 1-1 joins, you can also do this as a set of multiple joins:
from table1 t1 left outer join
table2 t2_11
on t1.p1 = t2_11.p1 left outer join
table2 t2_12
on t1.p1 = t2_12.p2 left outer join
table2 t2_21
on t1.p2 = t2_21.p2 left outer join
table2 t2_22
on t1.p2 = t2_22.p2
And then use case/coalesce logic in the SELECT to get the value that you actually want. Although this may look more complicated, it should be quite efficient.
you can use 4 query and Union there result
SELECT ... FROM Table1
INNER JOIN
Table2
ON Table1.P1 = Table2.P1
UNION
SELECT ... FROM Table1
INNER JOIN
Table2
ON Table1.P1 = Table2.P2
UNION
SELECT ... FROM Table1
INNER JOIN
Table2
ON Table1.P2 = Table2.P1
UNION
SELECT ... FROM Table1
INNER JOIN
Table2
ON Table1.P2 = Table2.P2
Does using CTEs help performance?
;WITH Table1_cte
AS
(
SELECT
...
[P] = P1
FROM Table1
UNION
SELECT
...
[P] = P2
FROM Table1
)
, Table2_cte
AS
(
SELECT
...
[P] = P1
FROM Table2
UNION
SELECT
...
[P] = P2
FROM Table2
)
SELECT ... FROM Table1_cte x
INNER JOIN
Table2_cte y
ON x.P = y.P
I suspect, as far as the processor is concerned, the above is just different syntax for the same complex conditions.

Most optimal order (of joins) for left join

I have 3 tables Table1 (with 1020690 records), Table2(with 289425 records), Table 3(with 83692 records).I have something like this
SELECT * FROM Table1 T1 /* OK fine select * is bad when not all columns are needed, this is just an example*/
LEFT JOIN Table2 T2 ON T1.id=T2.id
LEFT JOIN Table3 T3 ON T1.id=T3.id
and a query like this
SELECT * FROM Table1 T1
LEFT JOIN Table3 T3 ON T1.id=T3.id
LEFT JOIN Table2 T2 ON T1.id=T2.id
The query plan shows me that it uses 2 Merge Join for both the joins. For the first query, the first merge is with T1 and T2 and then with T3. For the second query, the first merge is with T1 and T3 and then with T2.
Both these queries take about the same time(40 seconds approx.) or sometimes Query1 takes couple of seconds longer.
So my question is, does the join order matter ?
The join order for a simple query like this should not matter. If there's a way to reorder the joins to improve performance, that's the job of the query optimizer.
In theory, you shouldn't worry about it -- that's the point of SQL. Trying to outthink the query optimizer is generally not going to give better results. Especially in MS SQL Server, which has a very good query optimizer.
I wouldn't expect this query to take 40 seconds. You might not have the right indexes defined. You should use tools like SQL Server Profiler or SQL Server Database Engine Tuning Advisor to see if it can recommend any new indexes.
The query optimizer will use a combination of the constraints, indexes, and statistics collected on the table to build an execution plan. In most cases this works well. However, I do occasionally encounter scenarios where the execution plan is chosen poorly. Often times tweaking the query can effectively coerce the optimizer into a choosing a better plan. I can offer no general rules for doing this though. When all else fails you could resort to the FORCE ORDER query hint.
And yes, the join order can have a significant impact on execution time of your query. The idea is that by joining the tables that yield the smallest results first will cause the next join to be computed more quickly. Edit: It is important to note, however, that in the abscense of FORCE ORDER and in all other things being equal the order you specify in the query may have no correlation with the way the optimizer builds the execution plan.
In general, SQL Server is smart enough to pick out the best way to join and it will not only use the order you wrote in the query. That said, I find it easier to understand a complex query if all the inner joins are first and then the left joins.

TABLE1 T1, TABLE2 T2 WHERE T1.Blah = T2.Blah - VS - INNER JOIN

Provided that the tables could essentially be inner joined, since the where clause excludes all records that don't match, just exactly how bad is it to use the first of the following 2 query statement syntax styles:
SELECT {COLUMN LIST}
FROM TABLE1 t1, TABLE2 t2, TABLE3 t3, TABLE4 t4 (etc)
WHERE t1.uid = t2.foreignid
AND t2.uid = t3.foreignid
AND t3.uid = t4.foreignid
etc
instead of
SELECT {COLUMN LIST}
FROM TABLE1 t1
INNER JOIN TABLE2 t2 ON t1.uid = t2.foreignid
INNER JOIN TABLE3 t3 ON t2.uid = t3.foreignid
INNER JOIN TABLE4 t4 ON t3.uid = t4.foreignid
I'm not sure if this is limited to microsoft SQL, or even a particular version, but my understanding is that the first scenario does a full outer join to make all possible correlations accessible.
I've used the first approach in the past to optimise queries that access two significantly large stores of data that each have peripheral table joined to them, with the product of those joins coming together late in the query. By allowing each of the "larger" table to join to their respective lookup tables, and only combining a specific subset of each of the larger tables, I found that there were notable speed improvements over introducing the large tables to each other prior to specific filtering.
Under normal (simple joins) circumstance, would it not be far better to use the second scenario? I find it to be more easily readable and it seems like it'll be much faster.
INNER JOIN ON vs WHERE clause
Maybe the best way to answer this is to take a look at how the database handles the query internally. If you're on SQL Server, use Profiler to see how many reads etc. each query takes and the query plan to see what route is being taken through the data. Statistics, skewing etc. will also most likely play a role.
The first query doesn't produce a full OUTER join (which is the union of both LEFT and RIGHT joins). Essentially unless there are some [internal] SQL parser - specific optimizations, both queries are equal.
Personally I would never use the first syntax. It may be the same performancewise but it is harder to maintain and far more subject to accidental cross joins when things get complex. If you miss an ON condition, it will fail the syntax check , if you miss one of the WHERE conditions that is the equivalent of an ON condition, it will happily do a cross join. It is also a syntax that is 17 years out of date for goodness sakes!
Further, the left and right join syntax in the old syntax are broken in SQL Server and do NOT always return the correct results (it can sometimes interpet the results as a corss join instead of an outerjoin) and they have been deprecated and will not be useable at all in the next version. If you need to change one of the queries to use an outer join, then you can be looikng at a major rewrite as it is especially bad to try to mix the two kinds of syntax.

Execute MySQL update query on 750k rows

I've added a field to a MySQL table. I need to populate the new column with the value from another table. Here is the query that I'd like to run:
UPDATE table1 t1
SET t1.user_id =
(
SELECT t2.user_id
FROM table2 t2
WHERE t2.usr_id = t1.usr_id
)
I ran that query locally on 239K rows and it took about 10 minutes. Before I do that on the live environment I wanted to ask if what I am doing looks ok i.e. does 10 minutes sound reasonable. Or should I do it another way, a php loop? a better query?
Use an UPDATE JOIN! This will provide you a native inner join to update from, rather than run the subquery for every bloody row. It tends to be much faster.
update table1 t1
inner join table2 t2 on
t1.usr_id = t2.usr_id
set t1.user_id = t2.user_id
Ensure that you have an index on each of the usr_id columns, too. That will speed things up quite a bit.
If you have some rows that don't match up, and you want to set t1.user_id = null, you will need to do a left join in lieu of an inner join. If the column is null already, and you're just looking to update it to the values in t2, use an inner join, since it's faster.
I should make mention, for posterity, that this is MySQL syntax only. The other RDBMS's have different ways of doing an update join.
There are two rather important pieces of information missing:
What type of tables are they?
What indexes exist on them?
If table2 has an index that contains user_id and usr_id as the first two columns and table1 is indexed on user_id, it shouldn't be that bad.
You don't have an index on t2.usr_id.
Create this index and run your query again, or a multiple-table UPDATE proposed by #Eric (with LEFT JOIN, of course).
Note that MySQL lacks other JOIN methods than NESTED LOOPS, so it's index that matters, not the UPDATE syntax.
However, the multiple table UPDATE is more readable.

is it better to put more logic in your ON clause or should it only have the minimum necessary?

Given these two queries:
Select t1.id, t2.companyName
from table1 t1
INNER JOIN table2 t2 on t2.id = t1.fkId
WHERE t2.aField <> 'C'
OR:
Select t1.id, t2.companyName
from table1 t1
INNER JOIN table2 t2 on t2.id = t1.fkId and t2.aField <> 'C'
Is there a demonstrable difference between the two? Seems to me that the clause "t2.aField <> 'C'" will run on every row in t2 that meets the join criteria regardless. Am I incorrect?
Update: I did an "Include Actual Execution Plan" in SQL Server. The two queries were identical.
I prefer to use the Join criteria for explaining how the tables are joined together.
So I would place the additional clause in the where section.
I hope (although I have no stats), that SQL Server would be clever enough to find the optimal query plan regardless of the syntax you use.
HOWEVER, if you have indexes which also have id, and aField in them, I would suggest placing them together in the inner join criteria.
It would be interesting to see the query plan's in these 2 (or 3) scenarios, and see what happens. Nice question.
There is a difference. You should do an EXPLAIN PLAN for both of the selects and see it in detail.
As for a simplier explanation:
The WHERE clause gets executed only after the joining of the two tables, so it executes for each row returned from the join and not nececerally every one from table2.
Performance wise its best to eliminate unwanted results early on so there should be less rows for joins, where clauses or other operations to deal with later on.
In the second example, there are 2 columns that have to be same for the rows to be joined together so it usually will give different results than the first one.
It depends.
SELECT
t1.foo,
t2.bar
FROM
table1 t1
LEFT JOIN table2 t2 ON t1.SomeId = t2.SomeId
WHERE
t2.SomeValue IS NULL
is different from
SELECT
t1.foo,
t2.bar
FROM
table1 t1
LEFT JOIN table2 t2 ON t1.SomeId = t2.SomeId AND t2.SomeValue IS NULL
It is different because the former crosses out all records from t2 that have NULL in t2.SomeValue and those from t1 that are not referenced in t2. The latter crosses out only the t2 records that have NULL in t2.SomeValue.
Just use the ON clause for the join condition and the WHERE clause for the filter.
Unless moving the join condition to the where clause changes the meaning of the query (like in the left join example above), then it doesn't matter where you put them. SQL will re-arrange them, and as long as they are provably equivalent, you'll get the same query.
That being said, I think it's more of a logical / readability thing. I usually put anything that relates two tables in the join, and anything that filters in the where.
I'd prefer first query. SQL server will use the best join type for your query based on indexes you have, after that will apply WHERE clause. But you can run both queries at the same time, look at execution plans, compare and choose the fastest (optimize adding indexes also).
unless you are working on a single-user app or something similarly small that creates trivial load, the only considerations that mean anything is how the server will process your query.
The answers that mention query plans give good advice.
In addition, set io statistics on to get an idea of how many reads your query will generate (I especially love Azder's post).
Think of every DB server as a pump of data from disk to client. That pump goes faster if it performs only the IO needed to get the job done. If the data is in cache it will be even faster. But you don't want to be reading more than you need from disk - that will result in crowding out of your cache useful data for no good reason.