Normal Join vs Join with Subqueries - sql

What is the better way to write a query with joins:
first join the tables and then add the where conditions, or
first apply the where conditions in a subquery and then join?
For example, which of the following two queries has better performance?
select * from person persons
inner join role roles on roles.person_id_fk = persons.id_pk
where roles.deleted is null
or
select * from person persons
inner join (select * from role roles where roles.deleted is null) as roles
on roles.person_id_fk = persons.id_pk

In a decent database, there should be no difference between the two queries. Remember, SQL is a descriptive language, not a procedural language. That is, a SQL SELECT statement describes the result set that should be returned. It does not specify the steps for creating it.
Your two queries are semantically equivalent and the SQL optimizer should be able to recognize that.
Of course, SQL optimizers are not omniscient. So, sometimes how you write a query does affect the execution plan. However, the queries that you are describing are turned into execution plans that have no concept of "subquery", so it is reasonable that they would produce the same execution plan.
Note: Some databases -- such as MySQL and MS Access -- do not have very good optimizers and such queries do produce different execution plans. Alas.
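This is easy to check empirically. Below is a minimal sketch using SQLite's EXPLAIN QUERY PLAN (SQLite chosen only because it is easy to script; the schema is assumed, reconstructed from the column names in the question):

```python
import sqlite3

# Assumed schema, reconstructed from the column names in the question
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id_pk INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE role (id INTEGER PRIMARY KEY,
                       person_id_fk INTEGER REFERENCES person(id_pk),
                       deleted TEXT);
""")

join_then_filter = """
    SELECT * FROM person persons
    INNER JOIN role roles ON roles.person_id_fk = persons.id_pk
    WHERE roles.deleted IS NULL
"""
filter_then_join = """
    SELECT * FROM person persons
    INNER JOIN (SELECT * FROM role WHERE deleted IS NULL) AS roles
        ON roles.person_id_fk = persons.id_pk
"""

def plan_steps(sql):
    # The last column of each EXPLAIN QUERY PLAN row is a readable plan step
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

print(plan_steps(join_then_filter))
print(plan_steps(filter_then_join))
# SQLite flattens the subquery into the outer query, so the second plan
# contains no MATERIALIZE or CO-ROUTINE step for an intermediate result.
```

The same experiment carries over to other engines by swapping in their EXPLAIN syntax.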

Related

SQL Performance comparison for 2 sqls

Which query is better from a performance perspective?
A
select users.email
from (select * from purchases where id = ***) A
left join users on A.user_id = users.id;
B
select users.email
from purchases A
left join users on A.user_id = users.id where A.id = ***;
Basically, I'm thinking A is better (unless the SQL server optimizes the query significantly).
Please explain to me which query is better and why. Thanks.
Almost any database optimizer is going to ignore the subquery. Why? SQL queries describe the result set being produced, not the steps for processing it. The SQL optimizer produces the underlying code that is run.
And most optimizers are smart enough to ignore subqueries and to choose optimal indexes, partitions, and algorithms regardless of them. One exception is that some versions of MySQL/MariaDB tend to materialize subqueries -- and that is a performance killer. I think even that has improved in more recent versions.
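On MySQL specifically, you can check for this with EXPLAIN: a materialized derived table shows up as a row with select_type = DERIVED, while a merged one does not. A sketch (the literal 1 is just a placeholder for the elided id value):

```sql
-- MySQL: if the optimizer materializes the derived table, the EXPLAIN output
-- contains a row with select_type = DERIVED; if it merges the subquery into
-- the outer query (the derived_merge optimization), no such row appears.
EXPLAIN
SELECT users.email
FROM (SELECT * FROM purchases WHERE id = 1) A
LEFT JOIN users ON A.user_id = users.id;
```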

Adding a join condition in the from clause *and* where clause makes query faster. Why?

I'm tuning a query for a large transactional financial system. I've noticed that including a join condition in the where clause as well as the from clause makes the query run significantly faster than either of the two individually. I note that the join in the from clause has more than one condition; I mention this in case it is significant. Here's a simplified example:
SELECT *
FROM employee e
INNER JOIN car c ON c.id = e.car_id -- AND some other join condition (elided)
-- Adding the join condition above again, in the where clause, makes the query faster
WHERE c.id = e.car_id;
I thought ANSI vs old-school was purely syntactic. What's going on?
Update
Having analysed the two execution plans, it's clear that adding the same join condition in the where clause as well as the from clause produces a very different execution plan from having it in either one alone.
Comparing the plans, I could see what the plan with the additional where clause condition was doing better, and wondered why the one without it was joining in the way that it was. Knowing the optimal plan, a quick tweak to the join conditions resolved matters, although I'm still surprised that both queries didn't compile into the same thing. Black magic.
It could be that the additional WHERE c.id = e.car_id predicate is a way to control the order in which the tables are used to perform the search.
That is, it may force the query optimizer to use the table named in the WHERE condition as the driving table: the sequence of joins that best expresses the query's logic is not necessarily the sequence that searches most efficiently.

Performance impacts on specifying multiple columns in inner join

I want to select some records from two tables based on matching the values of two columns.
I have two queries that return the same result. The first joins on two columns:
SELECT *
FROM USER_MASTER UM
INNER JOIN USER_LOCATION UL
    ON UM.CUSTOMER_ID = UL.CUSTOMER_ID
   AND UM.CREATED_BY = UL.USER_ID
and the same result can be achieved by the following query, which joins on a single column:
SELECT *
FROM USER_MASTER UM
INNER JOIN USER_LOCATION UL
    ON UM.CREATED_BY = UL.USER_ID
WHERE UM.CUSTOMER_ID = UL.CUSTOMER_ID
Is there any difference in performance of above queries?
As with everything concerning performance, the answer is: it depends.
In general the engine is smart enough to optimize both queries; I would not be surprised if both produce the same execution plan.
In fact you should run both queries a few times and study the execution plans to determine whether both take about the same time AND use the same amount of CPU, IO, and memory. (Remember, performance is not only about running fast; it is about smart use of all resources.)
From a "semantic" point of view, both columns are needed to identify the matching rows, so keep both expressions in the JOIN predicate and leave only filters in the WHERE clause.
The advantage of explicit joins over implicit ones is precisely that they create this logical (and visual) separation.
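As a sketch of that convention (the ACTIVE column in the WHERE clause is hypothetical, added only to show where a plain filter would go):

```sql
SELECT *
FROM USER_MASTER UM
INNER JOIN USER_LOCATION UL
    ON  UM.CUSTOMER_ID = UL.CUSTOMER_ID  -- both predicates define the relationship,
    AND UM.CREATED_BY  = UL.USER_ID      -- so both belong in the JOIN
WHERE UM.ACTIVE = 1;                     -- hypothetical filter: plain filters go in WHERE
```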

Filter table before inner join condition

There's a similar question here, but my question is slightly different:
select *
from process a inner join subprocess b on a.id=b.id and a.field=true
and b.field=true
So, when using inner join, which operation comes first: the join or the a.field=true condition?
As the two tables are very big, my goal is to filter table process first and after that join only the rows filtered with table subprocess.
Which is the best approach?
First things first:
which operation comes first: the join or the a.field=true condition?
Your INNER JOIN includes this (a.field=true) as part of the condition for the join. So it will prevent rows from being added during the JOIN process.
A part of an RDBMS is the "query optimizer" which will typically find the most efficient way to execute the query - there is no guarantee on the order of evaluation for the INNER JOIN conditions.
Lastly, I would recommend rewriting your query this way:
SELECT *
FROM process AS a
INNER JOIN subprocess AS b ON a.id = b.id
WHERE a.field = true AND b.field = true
This will effectively do the same thing as your original query, but it is widely seen as much more readable by SQL programmers. The optimizer can rearrange INNER JOIN and WHERE predicates as it sees fit to do so.
You are thinking about SQL in terms of a procedural language which it is not. SQL is a declarative language, and the engine is free to pick the execution plan that works best for a given situation. So, there is no way to predict if a join or a where will be executed first.
A better way to think about SQL is in terms of optimizing queries. Things like assuring that your joins and wheres are covered by indexes. Also, at least in MS Sql Server, you can preview an estimated or actual execution plan. There is nothing stopping you from doing that and seeing for yourself.
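In SQL Server that can be done from plain T-SQL as well as from the Management Studio plan viewer; SET SHOWPLAN_ALL is the relevant session option. A sketch using the question's tables (assuming field is a bit column, since T-SQL writes the boolean literal as 1):

```sql
-- Ask SQL Server to return the estimated execution plan instead of running the query
SET SHOWPLAN_ALL ON;
GO
SELECT *
FROM process AS a
INNER JOIN subprocess AS b ON a.id = b.id
WHERE a.field = 1 AND b.field = 1;
GO
SET SHOWPLAN_ALL OFF;
GO
```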

Which of these select statements is "better," and why?

I have 2 tables, person and role.
I have to get all the persons based on role.
select person.* from person inner join role on
person.roleid = role.id Where role.id = #Roleid
or
select person.* from person inner join role on
person.roleid = role.id AND role.id = #Roleid
Which one of the above two solutions is better, and why?
The first is better because it's logically coherent. The scoping condition isn't relevant to the join, so making it a part of the join is a kludge, and in this case an unhelpful one.
There is no difference in the relational algebra. Criteria from the where and inner joins like this are interchangeable. I use both depending on the readability and situation.
In this particular case, you could also use:
select person.* from person WHERE person.roleid = #Roleid
The only difference being that it does not require that a row exist in the role table (but I assume you have referential integrity for that) and it will not return multiple rows if roleid is not unique (which it almost certainly is in most scenarios I could foresee).
Your best bet is to try these queries out and run them through MS Sql's execution plan. I did this and the results look like this:
(Screenshot: execution plans showing identical performance for the two queries)
As you can see, the performance is the same (granted, running it on your db may produce different results.) So, the best query is the one that follows the consistent convention you use for writing queries.
SQL Server should evaluate those queries identically. Personally, I would use the AND. I like to keep all of the criteria for a joined table in the join itself so that it's all together and easy to find.
Both queries are identical. During query processing, SQL server applies the WHERE filter immediately after applying the join condition filter, so you'll wind up with the same things filtered either way.
I prefer #1; I believe it expresses the intent of the statement better: you are joining the two tables on person.roleid = role.id, and you are filtering on #Roleid.
SELECT person.*
FROM person INNER JOIN role ON person.roleid = role.id
Where role.id = #Roleid
SQL Server probably has an equivalent of the "explain plan" statement that Oracle, PostgreSQL, and MySQL all support in one form or another. It can be very useful in telling you how the query parser and optimizer are going to treat your query.
I'd go with the first one, but remember that when you're done testing, you should explicitly name each column instead of doing select *.
As you are not fetching columns from role you'd better not include it in the FROM clause at all. Use this:
SELECT *
FROM person
WHERE person.roleid IN (SELECT id FROM role WHERE id = #Roleid)
This way the optimizer sees only one table in the FROM clause and can quickly figure out the cardinality of the resultset (that is the number of rows in the resultset is <= the number of rows in table person).
When you throw two tables together with a JOIN, the optimizer has to look at the ON clause to figure out whether these tables are equi-joined and whether unique indexes exist on the joined columns. If the predicate in the ON clause is a complicated one (multiple ANDs and ORs) or simply wrong (sometimes very wrong), the optimizer might choose a sub-optimal join strategy.
Obviously this particular sample is very contrived, because you can filter persons by roleid = #Roleid directly (no join or sub-query) but the considerations above are valid if you had to filter on other columns in role (#Rolename for instance).
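To see the equivalence concretely, here is a SQLite sketch with made-up sample data (the value 1 stands in for the #Roleid parameter):

```python
import sqlite3

# Made-up schema and rows, just to compare the two query forms
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE role (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE person (id INTEGER PRIMARY KEY, roleid INTEGER, name TEXT);
    INSERT INTO role VALUES (1, 'admin'), (2, 'user');
    INSERT INTO person VALUES (10, 1, 'Ann'), (11, 2, 'Bob'), (12, 1, 'Cid');
""")

role_id = 1  # stands in for the #Roleid parameter

join_form = conn.execute(
    "SELECT person.* FROM person INNER JOIN role ON person.roleid = role.id "
    "WHERE role.id = ?", (role_id,)).fetchall()
in_form = conn.execute(
    "SELECT * FROM person "
    "WHERE person.roleid IN (SELECT id FROM role WHERE id = ?)",
    (role_id,)).fetchall()

print(join_form)                 # the persons with roleid = 1
print(join_form == in_form)      # equal, as long as role.id is unique
```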
Would there be any performance hit/gain by using this query?
SELECT person.* FROM person,role WHERE person.roleid=role.id AND role.id=#RoleID