Performance impacts on specifying multiple columns in inner join

Performance impacts on specifying multiple columns in inner join - sql

I want to select some records from two tables based on matching the values of two columns.
I have got two queries for the same, out of these one contains join on two columns as:
SELECT
*
FROM
USER_MASTER UM
INNER JOIN
USER_LOCATION UL
ON
UM.CUSTOMER_ID=UL.CUSTOMER_ID AND UM.CREATED_BY=UL.USER_ID
and the same results can be achieved by following query having single column join as:
SELECT
*
FROM
USER_MASTER UM
INNER JOIN
USER_LOCATION UL
ON
UM.CREATED_BY=UL.USER_ID
WHERE
UM.CUSTOMER_ID=UL.CUSTOMER_ID
Is there any difference in performance of above queries?

As everything concerning performance the answer is: It Depends.
In general the engine is smart enough to optimize both queries, I'm not surprised if both produce the same execution plan.
In fact you must run both queries a few times and study the execution plan to actually determine if both run about the same time AND using the same amount of CPU, IO and memory. (Remember performance is not only about running fast, is about smart use of all resources).
For a "semantic" vision, your data is using two keys to be "determined". In that case you can let both expression at the JOIN predicate. Let only filters at the WHERE clause.
The advantage of explicit joins over implicit ones is for create this logic (and visual) separation

Related

Adding a join condition in the from clause and where clause makes query faster. Why?

I'm tuning a query for a large transactional financial system. I've noticed that including a join condition in the where clause as well as the from clause makes the query run significantly faster than either of the two individually. I note that the join in the from clause has more than one condition; I mention this in case it is significant. Here's a simplified example:
SELECT *
FROM employee e
INNER JOIN car c ON c.id = e.car_id AND -- some other join
-- Adding the join above again, in the where clause makes the query faster
WHERE c.id = e.car_id;
I thought ANSI vs old-school was purely syntactic. What's going on?
Update
Having analysed the two execution plans, it's clear that adding the same join in the where clause as the from clause, produces a very different execution plan than having the join in either of the two.
Comparing the plans, I could see what the plan with the additional where clause condition was doing better, and wondered why the one without, was joining in the way that it was. Knowing the optimal plan, a quick tweak to the join conditions resolved matters, although I'm still surprised that both queries didn't compile into the same thing. Black magic.

could be that the WHERE c.id = e.car_id addition is a way for control the order in which the tables are used to perform the proper search ..
this could a way for forcing the query optimizer to use as main table the table in where condition and the table related beacause the the sequence of table joins could not so valid for searching as is usefull for understand the query logic

Why the planner does not execute joins participating in WHERE clause first?

I'm experimenting with PostgreSQL (v9.3). I have a quite large database, and often I need to execute queries with 8-10 joined tables (as source of large data grids). I'm using Devexpress XPO as the ORM above PostgreSQL, so unfortunately I don't have any control over how joins are generated.
The following example is a fairly simplified one, the real scenario is more complex, but as far as my examination the main problem can be seen on this too.
Consider the following variants of the (semantically) same query:
SELECT o.*, c.*, od.*
FROM orders o
LEFT JOIN orderdetails od ON o.details = od.oid
LEFT JOIN customers c ON o.customer = c.oid
WHERE c.code = 32435 and o.date > '2012-01-01';
SELECT o.*, c.*, od.*
FROM orders o
LEFT JOIN customers c ON o.customer = c.oid
LEFT JOIN orderdetails od ON o.details = od.oid
WHERE c.code = 32435 and o.date > '2012-01-01';
The orders table contains about 1 million rows, and the customers about 30 thousand. The order details contains the same amount as orders due to a one-to-one relation.
UPDATE:
It seems like the example is too simplified to reproduce the issue, because I checked again and in this case the two execution plain is identical. However in my real query where there are much more joins, the problem occures: if I put customers as the first join, the execution is 100x faster. I'll add my real query, but due to the hungarian language and the fact that it's been generated by XPO and Npgsql makes it less readable.
The first query is significantly slower (about 100x) than the second, and when I output the plans with EXPLAIN ANALYZE I can see that the order of the joins reflects to their position in the query string. So firstly the two "giant" tables are joined together, and then after the filtered customer table is joined (where the filter selects only one row).
The second query is faster because the join starts with that one customer row, and after that it joins the 20-30 order details rows.
Unfortunately in my case XPO generates the first version so I'm suffering with performance.
Why PostgreSQL query planner not noticing that the join on customers has a condition in the WHERE clauuse? IMO the correct optimization would be to take those joins first which has any kind of filter, and then take those joins which participate only in selection.
Any kind of help or advice is appreciated.

Join orders only matters, if your query's joins not collapsed. This is done internally by the query planner, but you can manipulate the process with the join_collapse_limit runtime option.
Note however, the query planner will not find every time the best join order by default:
Constraining the planner's search in this way is a useful technique both for reducing planning time and for directing the planner to a good query plan. If the planner chooses a bad join order by default, you can force it to choose a better order via JOIN syntax — assuming that you know of a better order, that is. Experimentation is recommended.
For the best performance, I recommend to use some kind of native querying, if available. Raising the join_collapse_limit can be a good-enough solution though, if you ensure, this hasn't caused other problems.
Also worth to mention, that raising join_collapse_limit will most likely increase the planning time.

Filter table before inner join condition

There's a similar question here, but my doubt is slight different:
select *
from process a inner join subprocess b on a.id=b.id and a.field=true
and b.field=true
So, when using inner join, which operation comes first: the join or the a.field=true condition?
As the two tables are very big, my goal is to filter table process first and after that join only the rows filtered with table subprocess.
Which is the best approach?

First things first:
which operation comes first: the join or the a.field=true condition?
Your INNER JOIN includes this (a.field=true) as part of the condition for the join. So it will prevent rows from being added during the JOIN process.
A part of an RDBMS is the "query optimizer" which will typically find the most efficient way to execute the query - there is no guarantee on the order of evaluation for the INNER JOIN conditions.
Lastly, I would recommend rewriting your query this way:
SELECT *
FROM process AS a
INNER JOIN subprocess AS b ON a.id = b.id
WHERE a.field = true AND b.field = true
This will effectively do the same thing as your original query, but it is widely seen as much more readable by SQL programmers. The optimizer can rearrange INNER JOIN and WHERE predicates as it sees fit to do so.

You are thinking about SQL in terms of a procedural language which it is not. SQL is a declarative language, and the engine is free to pick the execution plan that works best for a given situation. So, there is no way to predict if a join or a where will be executed first.
A better way to think about SQL is in terms of optimizing queries. Things like assuring that your joins and wheres are covered by indexes. Also, at least in MS Sql Server, you can preview an estimated or actual execution plan. There is nothing stopping you from doing that and seeing for yourself.

What are the differences between these?

What are the differences between the two queries?
SELECT CountryMaster.Id
FROM Districts INNER JOIN
CountryMaster ON Districts.CountryId = CountryMaster.Id
SELECT CountryMaster.Id
FROM CountryMaster INNER JOIN
Districts ON Districts.CountryId = CountryMaster.Id
I know the output will be same, but I want to know is there any drastic effects of the same if I neglect positions of tables and columns in complex queries or tables having tons of data like hundreds of thousands of rows.

No difference whatsoever. The order of the joins is irrelevant. The query optimizer inside the database engine will decide on a merge plan to actually process the records from the two tables based on the stored statistics for the data in those tables.
In fact, in many cases, the query optimizer's will generate exactly the same plan for both a query phrased using joins as it would for a query phrased with a correlated sub-query.
The lesson here I have learned is:
Always start with the syntax, or representation, that most clearly represents the meaning of the process you are trying to create, and trust the query optimizer to do its job. Having said that, the query optimizer is not perfect, so if there is a performance issue, use the query show plan with alternate constructions and see if it improves...
One quick comment on performance of inner vs. outer joins. It is simply not true that inner joins are intrinsically faster than outer joins. The relative performance depends entirely on which of the three types of processing joins are used by the query engine;
1. Nested Loop Join, 2., Merge Join, or 3. Hash Join.
The Nested Loop join, for example, is used when the set of records on one side of the join is very much smaller than on the other side, and the larger set is indexed on the join column[s]. In this case, if the smaller set is the "outer" side, then an outer join will be faster. The reason is that the nested loop join takes the entire set of records from that smaller set, and iterates through each one, finding the records from the larger set that match. An inner join has to perform a second step of removing rows from the smaller side when no matches were found in the larger set. The outer join does not do this second step.
Each of the three possible types of join processes has its own characterisitic behavior patterns... See Nested Loop Joins, Merge Joins and Hash Joins for the details.

As written they are identical. Excellent answer from Charles.
If you want to know if they will have different execution plans then simply display the execution plan in SSMS.
As for speed have the columns used in the join indexed.
Maintain the indexes - a fragmented index is not nearly as effective.
The query plan will not always be the same.
The query optimizer keeps statistics and as the profile of the data changes the optimal plan may change.
Thousands of rows is not a lot.
Once you get into millions then tune indexes and syntax (with hints).
Some times you have to get into millions before you have enough data to tune.
There is also a UNION operator that is equivalent and sometimes faster.
The join hint Loop is not symmetric so in that case the query plan is different for the following but they are still that same results.
If one is a PK table I always put it first.
In this case the first is twice as fast as the second.
select top 10 docSVsys.sID, docMVtext.fieldID
from docSVsys
inner loop join docMVtext
on docMVtext.sID = docSVsys.sID
where docSVsys.sID < 100
order by docSVsys.sID, docMVtext.fieldID
select top 10 docSVsys.sID, docMVtext.fieldID
from docMVtext
inner loop join docSVsys
on docMVtext.sID = docSVsys.sID
where docSVsys.sID < 100
order by docSVsys.sID, docMVtext.fieldID
Advanced Query Tuning Concepts

Question about SQL Server Optmization Sub Query vs. Join

Which of these queries is more efficient, and would a modern DBMS (like SQL Server) make the changes under the hood to make them equal?
SELECT DISTINCT S#
FROM shipments
WHERE P# IN (SELECT P#
FROM parts
WHERE color = ‘Red’)
vs.
SELECT DISTINCT S#
FROM shipments, parts
WHERE shipments.P# = parts.P#
AND parts.color = ‘Red’

The best way to satiate your curiosity about this kind of thing is to fire up Management Studio and look at the Execution Plan. You'll also want to look at SQL Profiler as well. As one of my professors said: "the compiler is the final authority." A similar ethos holds when you want to know the performance profile of your queries in SQL Server - just look.
Starting here, this answer has been updated
The actual comparison might be very revealing. For example, in testing that I just did, I found that either approach might yield the fastest time depending on the nature of the query. For example, a query of the form:
Select F1, F2, F3 From Table1 Where F4='X' And UID in (Select UID From Table2)
yielded a table scan on Table1 and a mere index scan on table 2 followed by a right semi join.
A query of the form:
Select A.F1, A.F2, A.F3 From Table1 A inner join Table2 B on (A.UID=B.UID)
Where A.Gender='M'
yielded the same execution plan with one caveat: the hash match was a simple right join this time. So that is the first thing to note: the execution plans were not dramatically different.
These are not duplicate queries though since the second one may return multiple, identical records (one for each record in table 2). The surprising thing here was the performance: the subquery was far faster than the inner join. With datasets in the low thousands (thank you Red Gate SQL Data Generator) the inner join was 40 times slower. I was fairly stunned.
Ok, how about a real apples to apples? This is the matching inner join - note the extra step to winnow out the duplicates:
Select Distinct A.F1, A.F2, A.F3 From Table1 A inner join Table2 B
on (A.UID=B.UID)
Where A.Gender='M'
The execution plan does change in that there is an extra step - a sort after the inner join. Oddly enough, though, the time drops dramatically such that the two queries are almost identical (on two out of five trials the inner join is very slightly faster). Now, I can imagine the first inner join (without the "distinct") being somewhat longer just due to the fact that more data is being forwarded to the query window - but it was only twice as much (two Table2 records for every Table1 record). I have no good explanation why the first inner join was so much slower.
When you add a predicate to the search on table 2 using a subquery:
Select F1, F2, F3 From Table1 Where F4='X' And UID in
(Select UID From Table2 Where F1='Y')
then the Index Scan is changed to a Clustered Index Scan (which makes sense since the UID field has its own index in the tables I am using) and the percentage of time it takes goes up. A Stream Aggregate operation is also added. Sure enough, this does slow the query down. However, plan caching obviously kicks in as the first run of the query shows a much greater effect than subsequent runs.
When you add a predicate using the inner join, the entire plan changes pretty dramatically (left as an exercise to the reader - this post is long enough). The performance, again, is pretty much the same as that of the subquery - as long as the "Distinct" is included. Similar to the first example, omitting distinct led to a significant increase in time to completion.
One last thing: someone suggested (and your question now includes) a query of the form:
Select Distinct F1, F2, F3 From table1, table2
Where (table1.UID=table2.UID) AND table1.F4='X' And table2.F1='Y'
The execution plan for this query is similar to that of the inner join (there is a sort after the original table scan on table2 and a merge join rather than a hash join of the two tables). The performance of the two is comparable as well. I may need a larger dataset to tease out difference but, so far, I'm not seeing any advantage to this construct or the "Exists" construct.
With all of this being said - your results may vary. I came nowhere near covering the full range of queries that you may run into when I was doing the above tests. As I said at the beginning, the tools included with SQL Server are your friends: use them.
So: why choose one over the other? It really comes down to your personal preferences since there appears to be no advantage for an inner join to a subquery in terms of time complexity across the range of examples I tests.
In most classic query cases I use an inner join just because I "grew up" with them. I do use subqueries, however, in two situations. First, some queries are simply easier to understand using a subquery: the relationship between the tables is manifest. The second and most important reason, though, is that I am often in a position of dynamically generating SQL from within my application and subqueries are almost always easier to generate automatically from within code.
So, the takeaway is simply that the best solution is the one that makes your development the most efficient.

Using IN is more readable, and I recommend using ANSI-92 over ANSI-89 join syntax:
SELECT DISTINCT S#
FROM SHIPMENTS s
JOIN PARTS p ON p.p# = s.p#
AND p.color = 'Red'
Check your explain plans to see which is better, because it depends on data and table setup.

If you aren't selecting anything from the table I would use an EXISTS clause.
SELECT DISTINCT S#
FROM shipments a
WHERE EXISTS (SELECT 1
FROM parts b
WHERE b.color = ‘Red’
AND a.P# = b.P#)
This will optimize out to be the same as the second one you posted.

SELECT DISTINCT S#
FROM shipments,parts
WHERE shipments.P# = parts.P# and parts.color = ‘Red’;
Using IN forces SQL Server to not use indexing on that column, and subqueries are usually slower

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas