SQL Query Performance: Join with condition

Calling all SQL experts. I have the following select statement:
SELECT 1
FROM table1 t1
JOIN table2 t2 ON t1.id = t2.id
WHERE t1.field = xyz
I'm a little bit worried about the performance here. Is the WHERE clause evaluated before or after the JOIN? If it's evaluated after, is there a way to have the WHERE clause evaluated first?
The whole table could easily contain more than a million entries, but after the WHERE clause only 1-10 entries may be left, so in my opinion there is a big performance difference depending on when the WHERE clause is evaluated.
Thanks in advance.
Dimi

You could rewrite your query like this:
SELECT 1
FROM (SELECT * FROM table1 WHERE field = xyz) t1
JOIN table2 t2 ON t1.id = t2.id
But depending on the database product, the optimiser might still decide that the best way to do this is to JOIN table1 to table2 and then apply the constraint.
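If you want to see which strategy the optimiser actually picked, ask the database for its execution plan. A minimal sketch, assuming MySQL-style EXPLAIN (other products use EXPLAIN PLAN FOR, SET SHOWPLAN_TEXT ON, and so on):
-- Show the chosen plan without executing the query (MySQL syntax);
-- xyz is the placeholder value from the question.
EXPLAIN
SELECT 1
FROM table1 t1
JOIN table2 t2 ON t1.id = t2.id
WHERE t1.field = xyz;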

For this query:
SELECT 1
FROM table1 t1 JOIN
table2 t2
ON t1.id = t2.id
WHERE t1.field = xyz;
The optimal indexes are table1(field, id), table2(id).
How the query is executed depends on the optimizer. It is tasked with choosing the best execution plan, given the table statistics and environment.
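A sketch of those two indexes in standard CREATE INDEX syntax (the index names are illustrative):
-- Filter column first, then the join column, so the index covers both steps.
CREATE INDEX ix_table1_field_id ON table1 (field, id);
-- Supports the join lookup on t2.id.
CREATE INDEX ix_table2_id ON table2 (id);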

Each DBMS has its own query optimizer, so the behavior is product-specific; logically, though, in a case like yours the WHERE filter is applied first and then the JOIN part of the query.

As mentioned in the comments and other answers, with performance the answer is always "it depends". Depending on your DBMS and the indexing of the base tables, the query may be fine as is and the optimizer may evaluate the WHERE first; or the join may be efficient anyway if the indexes cover the join requirements.
Alternatively, you can force the behavior you require by reducing the dataset of t1 before you do the join, either by using a nested select as Richard suggested, or by adding the t1.field = xyz condition to the join, for example:
ON t1.field = xyz AND t1.id = t2.id
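Written out in full, that variant would read as follows (a sketch based on the statement from the question; for an INNER JOIN it is logically equivalent to filtering in the WHERE clause):
SELECT 1
FROM table1 t1
JOIN table2 t2
    ON t1.field = xyz
   AND t1.id = t2.id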
Personally, if I needed to reduce the dataset before the join, I would use a CTE:
WITH T1 AS
(
    SELECT * FROM table1
    WHERE Field = 'xyz'
)
SELECT 1
FROM T1
JOIN Table2 T2
ON T1.Id = T2.Id

Related

In joining 2 tables, do the WHERE clauses reduce the table sizes before or after the join occurs?

For example, does the first query get processed differently from the second query?
Query 1
SELECT t1.var1, t2.var2 FROM table1 t1
INNER JOIN table2 t2
ON t1.key = t2.key
WHERE t2.ID = 'ABCD'
Query 2
SELECT t1.var1, t2.var2 FROM table1 t1
INNER JOIN (
    SELECT var2, key FROM table2
    WHERE ID = 'ABCD'
) t2
ON t1.key = t2.key
At a glance, it seems as if the second query would be more efficient - table2 is reduced before the join begins, whereas the first query appears to join the tables first and reduce afterwards. I'm using Teradata, if it matters.
It depends on vendor, version and configuration.
An older Teradata version or legacy configuration might spool the sub-query as a first stage for Query 2, leading to reduced performance compared with Query 1, depending on the tables' primary indexes and the join algorithm.
I would suggest avoiding this kind of "optimization".
P.S.
Check whether you get the same execution plan for both queries or different execution plans.
Check the query log for AMPCPUTime (for a start).
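A minimal sketch of both checks, assuming a standard Teradata setup (the query-log lookup requires DBQL logging to be enabled, and the exact DBC columns can vary by version):
-- Show the execution plan without running the query.
EXPLAIN
SELECT t1.var1, t2.var2
FROM table1 t1
INNER JOIN table2 t2 ON t1.key = t2.key
WHERE t2.ID = 'ABCD';

-- Then compare recent CPU usage in the query log.
SELECT QueryText, AMPCPUTime
FROM DBC.DBQLogTbl
ORDER BY StartTime DESC;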

Best way to use a WHERE clause in a SQL Server query (for best performance)?

I have written two queries in SQL Server, using a subquery in the join together with a WHERE clause. Please let me know which of the queries below is the better approach.
Query 1:
SELECT column1, column2, column3, JOINTAB.column5
FROM Table1
INNER JOIN (SELECT column1, column4, column5
            FROM Table2
            WHERE column4 = 'xxxx') AS JOINTAB
    ON Table1.column1 = JOINTAB.column1
Query 2:
SELECT column1, column2, column3, JOINTAB.column5
FROM Table1
INNER JOIN (SELECT column1, column4, column5
            FROM Table2) AS JOINTAB
    ON Table1.column1 = JOINTAB.column1
WHERE JOINTAB.column4 = 'xxxx'
For the best performance Query 1 or 2?
Your best option here is not to use a subquery at all. It's not needed. Either of these will give you the same results:
SELECT t1.column1, t1.column2, t1.column3, t2.column5
FROM Table1 t1
INNER JOIN Table2 t2 ON t2.column1 = t1.column1 AND t2.column4 = 'xxxx'
--or
SELECT t1.column1, t1.column2, t1.column3, t2.column5
FROM Table1 t1
INNER JOIN Table2 t2 ON t2.column1 = t1.column1
WHERE t2.column4 = 'xxxx'
Query performance is a great concern for all developers.
To assess performance and optimization, you can simply view the query execution plan.
You can use Query Analyzer to see which execution plan is chosen by the optimizer: type an SQL statement in the query window and press Ctrl+L, and the plan is displayed graphically.
SQL Server offers the ability to see query execution plans this way.
Run the query and measure its performance.
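You can also measure each candidate directly in a query window; a sketch using standard SQL Server SET options, with Query 1 from the question as the statement under test:
-- Report I/O and CPU/elapsed time for every statement that follows.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT column1, column2, column3, JOINTAB.column5
FROM Table1
INNER JOIN (SELECT column1, column4, column5
            FROM Table2
            WHERE column4 = 'xxxx') AS JOINTAB
    ON Table1.column1 = JOINTAB.column1;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;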

Most Efficiently Written Query in SQL

I have two tables, Table1 and Table2 and am trying to select values from Table1 based on values in Table2. I am currently writing my query as follows:
SELECT Value FROM Table1
WHERE Key1 IN (SELECT KEY1 FROM Table2 WHERE Foo = Bar)
  AND Key2 IN (SELECT KEY2 FROM Table2 WHERE Foo = Bar)
This seems a very inefficient way to code the query; is there a better way to write this?
It depends on how the table(s) are indexed. And it depends on which SQL implementation you're using (SQL Server? MySQL? Oracle? MS Access? something else?). It also depends on table size (if the table(s) are small, a table scan may be faster than something more advanced). It matters, too, whether or not the indices are covering indices (meaning that the test can be satisfied with data in the index itself, rather than requiring an additional look-aside to fetch the corresponding data page). Unless you look at the execution plan, you can't really say that technique X is "better" than technique Y.
However, in general, for this case, you're better off using correlated subqueries, thus:
select *
from table1 t1
where exists (select *
              from table2 t2
              where t2.key1 = t1.key1)
  and exists (select *
              from table2 t2
              where t2.key2 = t1.key2)
A join is a possibility, too:
select t1.*
from table1 t1
join table2 t2a on t2a.key1 = t1.key1 ...
join table2 t2b on t2b.key2 = t1.key2 ...
though that will give you one row for every matching combination; that can be alleviated by using the distinct keyword. It should be noted that a join is not necessarily more efficient than other techniques, especially if you have to use distinct, as that requires additional work to ensure distinctness.

Is there a logical difference between putting a condition in the ON clause of an inner join versus the where clause of the main query?

Consider these two similar SQLs
(condition in ON clause)
select t1.field1, t2.field1
from
table1 t1 inner join table2 t2 on t1.id = t2.id and t1.boolfield = 1
(condition in WHERE clause)
select t1.field1, t2.field1
from
table1 t1 inner join table2 t2 on t1.id = t2.id
where t1.boolfield = 1
I have tested this out a bit and I can see the difference between putting a condition in the two different places for an outer join.
But in the case of an inner join can the result sets ever be different?
For INNER JOIN, there is no effective difference, although I think the second option is cleaner.
For LEFT JOIN, there is a huge difference. The ON clause specifies which records will be selected from the tables for comparison and the WHERE clause filters the results.
Example 1: returns all the rows from tbl1 and matches them up with the appropriate rows from tbl2 that have boolfield=1
Select *
From tbl1
LEFT JOIN tbl2 on tbl1.id=tbl2.id and tbl2.boolfield=1
Example 2: will only include rows from tbl1 that have a matching row in tbl2 with boolfield=1. It joins the tables, and then filters out the rows that don't meet the condition.
Select *
From tbl1
LEFT JOIN tbl2 on tbl1.id=tbl2.id
WHERE tbl2.boolfield=1
In your specific case, the t1.boolfield specifies an additional selection condition, not a condition for matching records between the two tables, so the second example is more correct.
If you're speaking about the cases when a condition for matching records is put in the ON clause vs. in the WHERE clause, see this question.
Both versions return the same data.
Although this is true for an inner join, it is not true for outer joins.
Stylistically, there is a third possibility. In addition to your two, there is also:
select t1.field1, t2.field1
from (select t1.*
      from table1 t1
      where t1.boolfield = 1
     ) t1 inner join
     table2 t2
     on t1.id = t2.id
Which is preferable all depends on what you want to highlight, so you (or someone else) can later understand and modify the query. I often prefer the third version, because it emphasizes that the query is only using certain rows from the table -- the boolean condition is very close to where the table is specified.
In the other two cases, if you have a long query, it can be problematic to figure out what "t1" really means. I think this is why some people prefer to put the condition in the ON clause. Others prefer the WHERE clause.

How can I speed up MySQL query with multiple joins

Here is my issue: I am selecting and doing multiple joins to get the correct items... it pulls in a fair number of rows, above 100,000. This query takes more than 5 minutes when the date range is set to 1 year.
I don't know if it's possible but I am afraid that the user might extend the date range to like ten years and crash it.
Anyone know how I can speed this up? Here is the query.
SELECT DISTINCT t1.first_name, t1.last_name, t1.email
FROM table1 AS t1
INNER JOIN table2 AS t2 ON t1.CU_id = t2.O_cid
INNER JOIN table3 AS t3 ON t2.O_ref = t3.I_oref
INNER JOIN table4 AS t4 ON t3.I_pid = t4.P_id
INNER JOIN table5 AS t5 ON t4.P_cat = t5.C_id
WHERE t1.subscribe = 1
AND t1.Cdate >= $startDate
AND t1.Cdate <= $endDate
AND t5.store = 2
I am not the greatest with mysql so any help would be appreciated!
Thanks in advance!
UPDATE
Here is the explain you asked for
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE t5 ref PRIMARY,C_store_type,C_id,C_store_type_2 C_store_type_2 1 const 101 Using temporary
1 SIMPLE t4 ref PRIMARY,P_cat P_cat 5 alphacom.t5.C_id 326 Using where
1 SIMPLE t3 ref I_pid,I_oref I_pid 4 alphacom.t4.P_id 31
1 SIMPLE t2 eq_ref O_ref,O_cid O_ref 28 alphacom.t3.I_oref 1
1 SIMPLE t1 eq_ref PRIMARY PRIMARY 4 alphacom.t2.O_cid 1 Using where
Also, I added an index to table5 and table4 because they don't really change; however, the other tables get around 500-1000 entries a month... I heard you should add an index to a table that has that many new entries... is this true?
I'd try the following:
First, ensure there are indexes on the following tables and columns (each set of columns in parentheses should be a separate index); a sketch of the corresponding CREATE INDEX statements follows the list:
table1 : (subscribe, CDate)
(CU_id)
table2 : (O_cid)
(O_ref)
table3 : (I_oref)
(I_pid)
table4 : (P_id)
(P_cat)
table5 : (C_id, store)
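A sketch of those indexes in MySQL CREATE INDEX syntax; the index names are illustrative, and the column names are taken from the query above:
CREATE INDEX ix_t1_subscribe_cdate ON table1 (subscribe, CDate);
CREATE INDEX ix_t1_cu_id           ON table1 (CU_id);
CREATE INDEX ix_t2_o_cid           ON table2 (O_cid);
CREATE INDEX ix_t2_o_ref           ON table2 (O_ref);
CREATE INDEX ix_t3_i_oref          ON table3 (I_oref);
CREATE INDEX ix_t3_i_pid           ON table3 (I_pid);
CREATE INDEX ix_t4_p_id            ON table4 (P_id);
CREATE INDEX ix_t4_p_cat           ON table4 (P_cat);
CREATE INDEX ix_t5_c_id_store      ON table5 (C_id, store);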
Second, if adding the above indexes didn't improve things as much as you'd like, try rewriting the query as
SELECT DISTINCT t1.first_name, t1.last_name, t1.email FROM
    (SELECT CU_id, first_name, last_name, email
     FROM table1
     WHERE subscribe = 1 AND
           CDate >= $startDate AND
           CDate <= $endDate) AS t1
    INNER JOIN table2 AS t2
        ON t1.CU_id = t2.O_cid
    INNER JOIN table3 AS t3
        ON t2.O_ref = t3.I_oref
    INNER JOIN table4 AS t4
        ON t3.I_pid = t4.P_id
    INNER JOIN (SELECT C_id FROM table5 WHERE store = 2) AS t5
        ON t4.P_cat = t5.C_id
I'm hoping here that the first sub-select would cut down significantly on the number of rows to be considered for joining, hopefully making the subsequent joins do less work. Ditto the reasoning behind the second sub-select on table5.
In any case, mess with it. I mean, ultimately it's just a SELECT - you can't really hurt anything with it. Examine the plans that are generated by each different permutation and try to figure out what's good or bad about each.
Share and enjoy.
Make sure your date columns and all the columns you are joining on are indexed.
Using range (non-equality) comparisons on your dates means every candidate row has to be checked, which is inherently slower than an equality comparison.
Also, using DISTINCT adds an extra comparison step to the logic that your optimizer runs behind the scenes. Eliminate it if possible.
Well, first, make a subquery to decimate table1 down to just the records you actually want to go to all the trouble of joining...
SELECT DISTINCT t1.first_name, t1.last_name, t1.email
FROM (
    SELECT first_name, last_name, email, CU_id
    FROM table1
    WHERE table1.subscribe = 1
      AND table1.Cdate >= $startDate
      AND table1.Cdate <= $endDate
) AS t1
INNER JOIN table2 AS t2 ON t1.CU_id = t2.O_cid
INNER JOIN table3 AS t3 ON t2.O_ref = t3.I_oref
INNER JOIN table4 AS t4 ON t3.I_pid = t4.P_id
INNER JOIN table5 AS t5 ON t4.P_cat = t5.C_id
WHERE t5.store = 2
Then start looking at modifying the directionality of the joins.
Additionally, if t5.store is only very rarely 2, then flip this idea around: construct the t5 subquery, then join it back and back and back.
At present, your query is returning all matching rows on table2-table5, just to establish whether t5.store = 2. If any of table2-table5 have a significantly higher row count than table1, this may be greatly increasing the number of rows processed - consequently, the following query may perform significantly better:
SELECT DISTINCT t1.first_name, t1.last_name, t1.email
FROM table1 AS t1
WHERE t1.subscribe = 1
AND t1.Cdate >= $startDate
AND t1.Cdate <= $endDate
AND EXISTS
    (SELECT NULL FROM table2 AS t2
     INNER JOIN table3 AS t3 ON t2.O_ref = t3.I_oref
     INNER JOIN table4 AS t4 ON t3.I_pid = t4.P_id
     INNER JOIN table5 AS t5 ON t4.P_cat = t5.C_id AND t5.store = 2
     WHERE t1.CU_id = t2.O_cid);
Try adding indexes on the fields that you join on. It may or may not improve performance.
Moreover, it also depends on the engine that you are using. If you are using InnoDB, check your configuration parameters. I faced a similar problem, as the default InnoDB configuration doesn't scale as well as MyISAM's default configuration.
As everyone says, make sure you have indexes.
You can also check whether your server is set up properly, so that it can hold more of, or maybe the entire, dataset in memory.
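For InnoDB the main knob is the buffer pool; a sketch of checking and resizing it (the SET GLOBAL resize assumes MySQL 5.7 or later, and the 1 GB value is purely illustrative):
-- See how much memory the buffer pool may currently use.
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';

-- Enlarge it at runtime; on older versions, set this in my.cnf and restart.
SET GLOBAL innodb_buffer_pool_size = 1073741824;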
Without an EXPLAIN, there's not much to go on. Also keep in mind that MySQL looks at your JOINs and iterates through all possible join orders before executing the query, which can take time. Once you have the optimal JOIN order from the EXPLAIN, you can try to force this order in your query, eliminating that step for the optimizer.
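In MySQL, one way to pin the order is STRAIGHT_JOIN, which forces the tables to be joined in the order listed; a sketch applied to the query from the question (whether it helps depends on what EXPLAIN showed):
SELECT DISTINCT t1.first_name, t1.last_name, t1.email
FROM table1 AS t1
STRAIGHT_JOIN table2 AS t2 ON t1.CU_id = t2.O_cid
STRAIGHT_JOIN table3 AS t3 ON t2.O_ref = t3.I_oref
STRAIGHT_JOIN table4 AS t4 ON t3.I_pid = t4.P_id
STRAIGHT_JOIN table5 AS t5 ON t4.P_cat = t5.C_id
WHERE t1.subscribe = 1
  AND t1.Cdate >= $startDate
  AND t1.Cdate <= $endDate
  AND t5.store = 2;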
It sounds like you should think about delivering subsets (paging) or limiting the results some other way, unless there is a reason the users need every possible row all at once. Typically, 100K rows is more than the average person can digest.
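A sketch of paging with MySQL's LIMIT clause (the remaining joins from the query are omitted for brevity; the page size of 100 is illustrative, and an ORDER BY is needed for stable pages):
SELECT DISTINCT t1.first_name, t1.last_name, t1.email
FROM table1 AS t1
INNER JOIN table2 AS t2 ON t1.CU_id = t2.O_cid
WHERE t1.subscribe = 1
  AND t1.Cdate >= $startDate
  AND t1.Cdate <= $endDate
ORDER BY t1.last_name, t1.first_name
LIMIT 200, 100; -- rows 201-300, i.e. page 3 at 100 rows per page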