Rewrite SQL code SELECT block to simplify logic - sql

I am trying to rewrite this block with simpler logic, if that can be done. I am using it within a larger SELECT statement, and I think that if I can simplify this block I might be able to improve the performance of my query.
proj_catg_type_id, proj_catg_id and proj_id are all PKs in their tables.
select t1.proj_catg_name
from table1 t1, table2 t2, table3 t3
where t2.proj_catg_type_id = t1.proj_catg_type_id
and t2.proj_catg_type_id = 213
and t3.proj_id = t2.proj_id

Without knowing the referential integrity rules and the logic behind the tables, it is difficult to give a 100% correct answer. But just by looking at this statement, the most simplified logic would be
select t1.proj_catg_name
from table1 t1
where t1.proj_catg_type_id = 213;

select t1.proj_catg_name
from table1 t1 inner join table2 t2
on t2.proj_catg_type_id=t1.proj_catg_type_id
where t2.proj_catg_type_id=213
and t3.proj_id=t2.proj_i
Maybe? Is t3 used outside this subselect?

If t3 is a table outside the select you showed, then this is a correlated subquery which you should not be using at all, ever! That turns your query into a row-by-agonizing-row cursor.
Use derived tables or joins to get the results.
You don't give me enough code to write a specific solution for your problem, but let me give you an example:
-- Correlated subquery version
SELECT
field1
, field2
, (SELECT t3.field3
FROM table2 t2
JOIN table3 t3 ON t2.id = t3.id
WHERE t4.somefield = t2.somefield)
FROM table1 t1
JOIN table4 t4 ON t1.id = t4.id
-- Derived table version: the derived table exposes somefield so the
-- outer ON clause can reference it through the alias a
SELECT
field1
, field2
, a.field3
FROM table1 t1
JOIN table4 t4
ON t1.id = t4.id
JOIN (SELECT t2.somefield, t3.field3
FROM table2 t2
JOIN table3 t3 ON t2.id = t3.id) a
ON t4.somefield = a.somefield
The first query can run one row at a time, which is extremely slow. The second should give the same results when every row has a match (use a LEFT JOIN if rows without a match must be kept), but it runs in a set-based fashion, which is much faster. It is important to make sure the derived table has an alias. You could also use a CTE.
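To check that the two shapes really return the same rows, here is a minimal sketch using Python's built-in sqlite3 module. The table contents are made up for illustration, and SQLite only stands in for whatever database you are actually using, so treat it as a sanity check rather than a performance test.

```python
import sqlite3

# Hypothetical sample data; every table4 row has exactly one match in table2.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER PRIMARY KEY, field1 TEXT, field2 TEXT);
    CREATE TABLE table4 (id INTEGER PRIMARY KEY, somefield INTEGER);
    CREATE TABLE table2 (id INTEGER PRIMARY KEY, somefield INTEGER);
    CREATE TABLE table3 (id INTEGER PRIMARY KEY, field3 TEXT);
    INSERT INTO table1 VALUES (1, 'a', 'x'), (2, 'b', 'y');
    INSERT INTO table4 VALUES (1, 10), (2, 20);
    INSERT INTO table2 VALUES (1, 10), (2, 20);
    INSERT INTO table3 VALUES (1, 'first'), (2, 'second');
""")

# Correlated scalar subquery in the select list.
correlated = conn.execute("""
    SELECT t1.field1, t1.field2,
           (SELECT t3.field3
            FROM table2 t2
            JOIN table3 t3 ON t2.id = t3.id
            WHERE t4.somefield = t2.somefield) AS field3
    FROM table1 t1
    JOIN table4 t4 ON t1.id = t4.id
    ORDER BY t1.id
""").fetchall()

# Same result expressed as a join to an aliased derived table.
derived = conn.execute("""
    SELECT t1.field1, t1.field2, a.field3
    FROM table1 t1
    JOIN table4 t4 ON t1.id = t4.id
    JOIN (SELECT t2.somefield, t3.field3
          FROM table2 t2
          JOIN table3 t3 ON t2.id = t3.id) a
      ON t4.somefield = a.somefield
    ORDER BY t1.id
""").fetchall()

print(correlated)
print(derived)
```

With this data both queries return one row per table1 row. If some table4 row had no match, the subquery version would return NULL for field3 while the inner join version would drop the row.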

Related

SQL beginner question: unexpected behavior with where exists select 1

I started using SQL a week ago. I am sorry but I have a "why my code does not work" question.
Please look at the following three queries on table1 and table2.
A. Inner join (returned 2 row results)
select t1.*, t2.* from table1 t1, table2 t2
where t1.item = t2.item
and t1.something = t2.something
B. Subquery (returned 2 row results)
select t1.* from table1 t1
where exists (select 1 from table2 t2
where t1.item = t2.item
and t1.something = t2.something)
C. My code (Expected the same results as in A. "Inner join" but takes forever to return results)
select t1.*, t2.* from table1 t1, table2 t2
where exists (select 1 from table2 t2
where t1.item = t2.item
and t1.something = t2.something)
For your reference, # of rows for each table is the following.
select count(*) from table1 -- (100K)
select count(*) from table2 -- (10K)
Would somebody kindly explain why my code (C) does not work?
Thank you for your help in advance.
The problem with your (C) query is that the outer reference to table2 is completely unconstrained [1]. This means that you're effectively writing query B again but also cross joining that result to table2, meaning that you'll get not 2 results but 20000.
You should be using explicit join syntax. One of the advantages of this is that it forces you to think about the join conditions at the point of joining rather than having to remember to include them in the general where clause.
select t1.*, t2.*
from table1 t1
inner join table2 t2
on t1.item = t2.item
and t1.something = t2.something
It's an error to omit the on clause. It's never an error to forget to constrain a column in the where clause [2].
[1] Just because you refer to table2 again inside your exists subquery, and even though you assign it the same t2 alias, that doesn't mean that they are the same reference. The two references to table2 are unrelated in any way.
[2] Of course, it's often a logical error to do this, but what I mean in this paragraph is specifically about error messages that the system will raise.
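The row multiplication is easy to reproduce on a small scale. The sketch below uses Python's built-in sqlite3 module with tiny made-up tables (5 and 3 rows instead of 100K and 10K), so the numbers are illustrative only.

```python
import sqlite3

# Hypothetical miniature versions of table1 and table2; two rows match.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (item INTEGER, something TEXT);
    CREATE TABLE table2 (item INTEGER, something TEXT);
    INSERT INTO table1 VALUES (1,'a'), (2,'b'), (3,'c'), (4,'d'), (5,'e');
    INSERT INTO table2 VALUES (1,'a'), (2,'b'), (9,'z');
""")

query_a = """
    SELECT t1.*, t2.* FROM table1 t1, table2 t2
    WHERE t1.item = t2.item AND t1.something = t2.something
"""
query_b = """
    SELECT t1.* FROM table1 t1
    WHERE EXISTS (SELECT 1 FROM table2 t2
                  WHERE t1.item = t2.item AND t1.something = t2.something)
"""
# The t2 inside EXISTS shadows the outer t2, so the outer t2 is unconstrained.
query_c = """
    SELECT t1.*, t2.* FROM table1 t1, table2 t2
    WHERE EXISTS (SELECT 1 FROM table2 t2
                  WHERE t1.item = t2.item AND t1.something = t2.something)
"""

count_a = len(conn.execute(query_a).fetchall())
count_b = len(conn.execute(query_b).fetchall())
count_c = len(conn.execute(query_c).fetchall())
table2_rows = conn.execute("SELECT COUNT(*) FROM table2").fetchone()[0]

print(count_a, count_b, count_c)
```

Here query C returns the rows of query B multiplied by the number of rows in table2, which is the same effect that produces 20000 rows on the real tables.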

Exactly when to use inner join or an alternative query

SELECT .... FROM TABLE1 T1, TABLE2 T2, TABLE3 T3
WHERE T1.NAME = 'ABC' AND T1.ID = T2.COL_ID AND T2.COL1 = T3.COL2
vs
SELECT .... FROM TABLE1 T1
WHERE T1.NAME = 'ABC'
INNER JOIN TABLE2 T2 ON T1.ID = T2.COL_ID
INNER JOIN TABLE3 T3 ON T2.COL1 = T3.COL2
Two questions:
1. In terms of performance, which will perform better and why?
2. If Option 2 has the better performance, when should I be using Option 1? (Vice versa if Option 1 has the better performance.)
The second query is not correct. It should be:
SELECT .... FROM TABLE1 T1
INNER JOIN TABLE2 T2 ON T1.ID = T2.COL_ID
INNER JOIN TABLE3 T3 ON T2.COL1 = T3.COL2
WHERE T1.NAME = 'ABC'
This is the right way to write your join condition. The first one is accepted, but technically it creates a cartesian product that is then filtered by the WHERE clause. All modern databases deal perfectly well with both queries and interpret them the same way; therefore, performance should be the same. But still, you should use the second one because it is more readable and gives you a single way to write joins, whether it is an inner, left or full outer join.
The answer is easy: Don't use comma-separated joins (first query). We used these in the 1980s for the lack of something better, but then in 1992 the new syntax (second query) was introduced [1], because the old syntax was error-prone (it was easier to forget to apply join criteria) and harder to maintain (was missing join criteria intended or not in a query?) and there was no standard syntax for outer joins.
[1] Oracle was a little late though featuring the new syntax. They introduced the new ANSI joins in Oracle 9i in 2001.
In terms of performance: There should be no difference in speed, because DBMS optimizers see that this is essentially the same query.
Your second query is syntactically incorrect by the way. The query's WHERE clause belongs after the complete FROM clause, i.e. after all the joins:
SELECT ....
FROM table1 t1
INNER JOIN table2 t2 ON t1.id = t2.col_id
INNER JOIN table3 t3 ON t2.col1 = t3.col2
WHERE t1.name = 'ABC';
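To confirm that the comma-separated form and the explicit INNER JOIN form return the same rows, here is a small sketch using Python's built-in sqlite3 module. The column val and all the sample data are invented for the demonstration, and it only checks results, not speed.

```python
import sqlite3

# Hypothetical sample data; "val" is an invented payload column.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER, name TEXT);
    CREATE TABLE table2 (col_id INTEGER, col1 INTEGER);
    CREATE TABLE table3 (col2 INTEGER, val TEXT);
    INSERT INTO table1 VALUES (1, 'ABC'), (2, 'XYZ'), (3, 'ABC');
    INSERT INTO table2 VALUES (1, 100), (2, 200), (3, 300);
    INSERT INTO table3 VALUES (100, 'p'), (300, 'q'), (200, 'r');
""")

# Old-style comma join: join conditions live in the WHERE clause.
comma_rows = sorted(conn.execute("""
    SELECT t1.id, t2.col1, t3.val
    FROM table1 t1, table2 t2, table3 t3
    WHERE t1.name = 'ABC' AND t1.id = t2.col_id AND t2.col1 = t3.col2
""").fetchall())

# Explicit join syntax: WHERE comes after all the joins.
join_rows = sorted(conn.execute("""
    SELECT t1.id, t2.col1, t3.val
    FROM table1 t1
    INNER JOIN table2 t2 ON t1.id = t2.col_id
    INNER JOIN table3 t3 ON t2.col1 = t3.col2
    WHERE t1.name = 'ABC'
""").fetchall())

print(comma_rows)
print(join_rows)
```

Both lists come out identical, which matches the point that the two syntaxes describe the same query.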

Performance of two left joins versus union

I have searched but have not found a definitive answer. Which of these is better for performance in SQL Server:
SELECT T.*
FROM dbo.Table1 T
LEFT JOIN Table2 T2 ON T.ID = T2.Table1ID
LEFT JOIN Table3 T3 ON T.ID = T3.Table1ID
WHERE T2.Table1ID IS NOT NULL
OR T3.Table1ID IS NOT NULL
or...
SELECT T.*
FROM dbo.Table1 T
JOIN Table2 T2 ON T.ID = T2.Table1ID
UNION
SELECT T.*
FROM dbo.Table1 T
JOIN Table3 T3 ON T.ID = T3.Table1ID
I have tried running both but it's hard to tell for sure. I'd appreciate an explanation of why one is faster than the other, or if it depends on the situation.
Your two queries do not do the same things. In particular, the first will return duplicate rows if values are duplicated in either table.
If you are looking for rows in Table1 that are in either of the other two tables, I would suggest using exists:
select t1.*
from Table1 t1
where exists (select 1 from Table2 t2 where t2.Table1Id = t1.id) or
exists (select 1 from Table3 t3 where t3.Table1Id = t1.id);
And, create indexes on Table1Id in both Table2 and Table3.
Which of your original queries is faster depends a lot on the data. The second has an extra step to remove duplicates (union versus union all). On the other hand, the first might end up creating many duplicate rows.
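The difference in results is easy to see with a few rows. This sketch uses Python's built-in sqlite3 module rather than SQL Server, with made-up data in which Table2 holds a duplicated Table1ID, so it illustrates the semantics only.

```python
import sqlite3

# Hypothetical data: Table1ID 1 appears twice in Table2.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (ID INTEGER PRIMARY KEY);
    CREATE TABLE Table2 (Table1ID INTEGER);
    CREATE TABLE Table3 (Table1ID INTEGER);
    INSERT INTO Table1 VALUES (1), (2), (3);
    INSERT INTO Table2 VALUES (1), (1);
    INSERT INTO Table3 VALUES (1), (2);
""")

left_join_ids = sorted(r[0] for r in conn.execute("""
    SELECT T.ID
    FROM Table1 T
    LEFT JOIN Table2 T2 ON T.ID = T2.Table1ID
    LEFT JOIN Table3 T3 ON T.ID = T3.Table1ID
    WHERE T2.Table1ID IS NOT NULL OR T3.Table1ID IS NOT NULL
"""))

union_ids = sorted(r[0] for r in conn.execute("""
    SELECT T.ID FROM Table1 T JOIN Table2 T2 ON T.ID = T2.Table1ID
    UNION
    SELECT T.ID FROM Table1 T JOIN Table3 T3 ON T.ID = T3.Table1ID
"""))

exists_ids = sorted(r[0] for r in conn.execute("""
    SELECT t1.ID
    FROM Table1 t1
    WHERE EXISTS (SELECT 1 FROM Table2 t2 WHERE t2.Table1ID = t1.ID)
       OR EXISTS (SELECT 1 FROM Table3 t3 WHERE t3.Table1ID = t1.ID)
"""))

print(left_join_ids, union_ids, exists_ids)
```

The left join version repeats ID 1 because of the duplicate in Table2, while the union and exists versions each return every qualifying ID once.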

Inner join with where conditions, which will execute first? Join or where conditions?

For example1:
select T1.*, T2.*
from TABLE1 T1, TABLE2 T2
where T1.id = T2.id
and T1.name = 'foo'
and T2.name = 'bar';
Will that first join T1 and T2 together by id, then select the records that satisfy the name conditions?
Or will it select the records that satisfy the name condition in T1 or T2, then join those together?
And is there a difference in performance between example1 and example2 (DB2)?
example2:
select *
from
(
select * from TABLE1 T1 where T1.name = 'foo'
) A,
(
select * from TABLE2 T2 where T2.name = 'bar'
) B
where A.id = B.id;
How the query will be executed depends on what the query planner does with it. Depending on the available indexes and how much data is in the tables the query plan may look different. The planner tries to do the work in the order that it thinks is most efficient.
If the planner does a good job, the plan for both queries should be the same; otherwise the first query is likely to be faster, because the second would create two intermediate results that don't have any indexes.
Example 1 is more efficient because it has no embedded queries. As for how the result set is built, I have no idea - I don't know DB2.
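Whatever order the planner picks, the two forms are logically equivalent. Here is a minimal sketch that checks this with Python's built-in sqlite3 module; SQLite is only a stand-in for DB2 here and the sample rows are invented, so it says nothing about DB2's actual plans.

```python
import sqlite3

# Hypothetical sample rows; only id 1 satisfies both name conditions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TABLE1 (id INTEGER, name TEXT);
    CREATE TABLE TABLE2 (id INTEGER, name TEXT);
    INSERT INTO TABLE1 VALUES (1, 'foo'), (2, 'foo'), (3, 'other');
    INSERT INTO TABLE2 VALUES (1, 'bar'), (2, 'nope'), (3, 'bar');
""")

# example1: join and filters in a single WHERE clause.
example1 = sorted(conn.execute("""
    SELECT T1.*, T2.*
    FROM TABLE1 T1, TABLE2 T2
    WHERE T1.id = T2.id
      AND T1.name = 'foo'
      AND T2.name = 'bar'
""").fetchall())

# example2: filter each table in a derived table first, then join.
example2 = sorted(conn.execute("""
    SELECT *
    FROM (SELECT * FROM TABLE1 T1 WHERE T1.name = 'foo') A,
         (SELECT * FROM TABLE2 T2 WHERE T2.name = 'bar') B
    WHERE A.id = B.id
""").fetchall())

print(example1)
print(example2)
```

Because the results are identical, the planner is free to filter first or join first; to see what your own database actually does, inspect its execution plan for each query.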

Is there documentation on/can someone explain a nested join in TSQL?

I'm not quite sure how to describe this, and I'm not quite sure if it's just syntactical sugar. This is the first time I've seen it, and I'm having trouble finding a reference or explanation as to the why and what of it.
I have a query as follows:
select * from
table1
join table2 on field1 = field2
join (
table3
join table4 on field3 = field4
join table5 on field5 = field6
) on field3 = field2
-- notice the fields in the parens and outside the parens
-- are part of the on clause
Are the parentheses necessary? Will removing them change the join order? I'm in a SQL Server 2005 environment in this case. Thanks!
Join order should make no difference in the result set of a query using inner joins (outside of column order). The query
select *
from t1
join t2 on t2.t1_id = t1.id
produces the same result set as
select *
from t2
join t1 on t1.id = t2.t1_id
If you're using outer joins and change the order of the tables in the from clause, naturally the direction of the outer join must change:
select *
from t1
left join t2 on t2.t1_id = t1.id
is the same as
select *
from t2
right join t1 on t1.id = t2.t1_id
However, if you see a subquery used as a table, with syntax like
select *
from t1
join ( select t2.*
from t2
join t3 on t3.t2_id = t2.id
where t3.foobar = 37
) x on x.t1_id = t1.id
You'll note the table alias (x) assigned to the subquery above.
What you have is something called a derived table (though some people call it a virtual table). You can think of it as a temporary view that exists for the life of a query. It's particularly useful when you need to filter something based on something like the result of an aggregation (group by).
The T-SQL documentation on the select, under the from clause goes into the details:
http://msdn.microsoft.com/en-us/library/ms189499(v=SQL.100).aspx
http://msdn.microsoft.com/en-us/library/ms177634(v=sql.100).aspx
It's not necessary in this case.
It's necessary (or at the very least, a lot simpler) in some others, especially where you name the nested call:
select table1.fieldX, table2.fieldY, sq.field6 from
table1 join table2 on field1 = field2
join ( select
top 1 table3.field6
from table3 join table4
on field3 = field4
where table3.field7 = table2.field8
order by fieldGoshIveUsedALotOfFieldsAlready
) sq on sq.field6 = field12345
The code you had could have been:
Like the above once, and then refactored.
Machine produced.
Reflecting the thought process of the developer as he or she arrived at the query, as they thought of that part of the larger query as a unit, then worked it into the larger query.
In this case they are not necessary:
select * from table1
join table2 on field1 = field2
join table3 on field3 = field2
join table4 on field3 = field4
join table5 on field5 = field6
Produces the same result.
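A quick way to convince yourself is to run both forms against the same data. The sketch below uses Python's built-in sqlite3 module, which also accepts parenthesized joins, instead of SQL Server 2005. Which table owns each field is an assumption made for the demo (field5 is placed in table3 and field6 in table5), and the rows are invented.

```python
import sqlite3

# Assumed layout: table3(field3, field5), table4(field4), table5(field6).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (field1 INTEGER);
    CREATE TABLE table2 (field2 INTEGER);
    CREATE TABLE table3 (field3 INTEGER, field5 INTEGER);
    CREATE TABLE table4 (field4 INTEGER);
    CREATE TABLE table5 (field6 INTEGER);
    INSERT INTO table1 VALUES (1), (2), (3);
    INSERT INTO table2 VALUES (1), (2);
    INSERT INTO table3 VALUES (1, 10), (2, 20);
    INSERT INTO table4 VALUES (1), (2);
    INSERT INTO table5 VALUES (10);
""")

columns = "field1, field2, field3, field4, field5, field6"

# Nested form: table3, table4 and table5 are joined inside the parentheses.
nested = sorted(conn.execute(f"""
    SELECT {columns} FROM
    table1
    JOIN table2 ON field1 = field2
    JOIN (
        table3
        JOIN table4 ON field3 = field4
        JOIN table5 ON field5 = field6
    ) ON field3 = field2
""").fetchall())

# Flat form: the same inner joins without parentheses.
flat = sorted(conn.execute(f"""
    SELECT {columns} FROM table1
    JOIN table2 ON field1 = field2
    JOIN table3 ON field3 = field2
    JOIN table4 ON field3 = field4
    JOIN table5 ON field5 = field6
""").fetchall())

print(nested)
print(flat)
```

The two lists match here because every join is an inner join. With outer joins the grouping can change the result, so the parentheses should not be removed blindly in that case.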