SQL query: how to translate IN() into a JOIN? - sql

I have a lot of SQL queries like this:
SELECT o.Id, o.attrib1, o.attrib2
FROM table1 o
WHERE o.Id IN (
SELECT DISTINCT Id
FROM table1
, table2
, table3
WHERE ...
)
These queries have to run on different database engines (MySql, Oracle, DB2, MS-Sql, Hypersonic), so I can only use common SQL syntax.
I read that with MySQL the IN statement isn't optimized and is really slow, so I want to convert this into a JOIN.
I tried:
SELECT o.Id, o.attrib1, o.attrib2
FROM table1 o, table2, table3
WHERE ...
But this does not take into account the DISTINCT keyword.
Question: How do I get rid of the duplicate rows using the JOIN approach?

To write this with a JOIN you can use an inner select and join with that:
SELECT o.Id, o.attrib1, o.attrib2 FROM table1 o
JOIN (
SELECT DISTINCT Id FROM table1, table2, table3 WHERE ...
) T1
ON o.id = T1.Id
I'm not sure this will be much faster, but maybe... you can try it for yourself.
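A minimal sketch of the difference, using Python's built-in sqlite3 module so it can be run anywhere. The table contents are invented, and a single table2 with a simple Id match stands in for the elided WHERE ... conditions. It shows that a plain join repeats an outer row once per match, while joining against the DISTINCT derived table returns the same rows as the IN form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (Id INTEGER PRIMARY KEY, attrib1 TEXT, attrib2 TEXT);
    CREATE TABLE table2 (Id INTEGER, note TEXT);
    INSERT INTO table1 VALUES (1, 'a', 'x'), (2, 'b', 'y'), (3, 'c', 'z');
    -- Hypothetical data: Id 1 matches twice, Id 2 once, Id 3 never
    INSERT INTO table2 VALUES (1, 'first'), (1, 'second'), (2, 'only');
""")

# Plain join: one output row per matching pair, so Id 1 appears twice.
plain_join = conn.execute("""
    SELECT o.Id, o.attrib1, o.attrib2
    FROM table1 o
    JOIN table2 t2 ON t2.Id = o.Id
    ORDER BY o.Id
""").fetchall()

# Join against a DISTINCT derived table: each Id appears at most once.
derived_join = conn.execute("""
    SELECT o.Id, o.attrib1, o.attrib2
    FROM table1 o
    JOIN (SELECT DISTINCT Id FROM table2) T1 ON o.Id = T1.Id
    ORDER BY o.Id
""").fetchall()

# The original IN form, for comparison.
in_form = conn.execute("""
    SELECT o.Id, o.attrib1, o.attrib2
    FROM table1 o
    WHERE o.Id IN (SELECT Id FROM table2)
    ORDER BY o.Id
""").fetchall()

print(len(plain_join), len(derived_join), len(in_form))
```

Whether the derived-table form is actually faster on a given engine is something only a benchmark on real data can tell.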
In general restricting yourself only to SQL that will work on multiple databases is not going to result in the best performance.

"But this does not take into account the DISTINCT keyword."
You do not need the distinct in the sub-query. The in will return one row in the outer query regardless of whether it matches one row or one hundred rows in the sub-query. So, if you want to improve the performance of the query, junking that distinct would be a good start.
One way of tuning in clauses is to rewrite them using exists instead. Depending on the distribution of data this may be a lot more efficient, or it may be slower. With tuning, the benchmark is king.
SELECT o.Id, o.attrib1, o.attrib2
FROM table1 o
WHERE EXISTS (
SELECT Id FROM table1 t1, table2 t2, table3 t3 WHERE ...
AND ( t1.id = o.id
or t2.id = o.id
or t3.id = o.id
)
)
Not knowing your business logic the precise formulation of that additional filter may be wrong.
Incidentally I notice that you have table1 in both the outer query and the sub-query. If that is not a mistake in transcribing your actual SQL to here you may want to consider whether that makes sense. It would be better to avoid querying that table twice; using exists may make it easier to avoid the double hit.
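To illustrate the point about DISTINCT, here is a small sketch using Python's built-in sqlite3 module. The data is made up and a single table2 replaces the elided sub-query conditions, so treat it as an assumption-laden demo rather than the asker's real query. It checks that IN with DISTINCT, IN without DISTINCT, and EXISTS all return the same outer rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (Id INTEGER PRIMARY KEY, attrib1 TEXT);
    CREATE TABLE table2 (Id INTEGER);
    INSERT INTO table1 VALUES (1, 'a'), (2, 'b'), (3, 'c');
    -- Hypothetical data: Id 1 occurs three times in the sub-query source
    INSERT INTO table2 VALUES (1), (1), (1), (2);
""")

def run(sql):
    """Execute a query and return all rows as a list of tuples."""
    return conn.execute(sql).fetchall()

with_distinct = run("""
    SELECT o.Id, o.attrib1 FROM table1 o
    WHERE o.Id IN (SELECT DISTINCT Id FROM table2)
    ORDER BY o.Id
""")

without_distinct = run("""
    SELECT o.Id, o.attrib1 FROM table1 o
    WHERE o.Id IN (SELECT Id FROM table2)
    ORDER BY o.Id
""")

with_exists = run("""
    SELECT o.Id, o.attrib1 FROM table1 o
    WHERE EXISTS (SELECT 1 FROM table2 t2 WHERE t2.Id = o.Id)
    ORDER BY o.Id
""")

print(with_distinct, without_distinct, with_exists)
```

The results being equal says nothing about speed; the relative cost of the three forms depends on the engine and the data distribution.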

SELECT DISTINCT o.Id, o.attrib1, o.attrib2
FROM table1 o, table2, table3
WHERE ...
Though if you need to support a number of different database back ends you probably want to give each its own set of repository classes in your data layer, so you can optimize your queries for each. This also gives you the power to persist in other types of databases, or xml, or web services, or whatever should the need arise down the road.

I'm not sure I really understand your problem. Why don't you try this:
SELECT distinct o.Id, o.attrib1, o.attrib2
FROM
table1 o
, table o1
, table o2
...
where
o1.id1 = o.id
or o2.id = o.id

Related

SQL Inner Join with no WHERE clause

I was wondering, how does an inner join work when no WHERE clause is specified? For example,
SELECT table1.letter, table2.letter, table1.number, table2.number
FROM tbl AS table1, tbl AS table2;
tbl:
text, integer
a , 1
b , 2
c , 3
Tried finding some examples online but I couldn't seem to find any :-/
Thanks!
The current implicit join syntax you are using:
FROM tbl AS table1, tbl AS table2;
will result in a cross join if no restrictions are present in the WHERE clause. But really you should use modern ANSI-92 syntax when writing your queries, e.g.
SELECT
table1.letter,
table2.letter,
table1.number,
table2.number
FROM tbl AS table1
INNER JOIN tbl AS table2
-- ON <some conditions>
One obvious reason to use this syntax is that it makes it much easier to see the logic of your query. In this case, if your updated query were missing an ON clause, then we would know right away that it is doing a cross join, which is usually not what you want.
The comma operator generates a Cartesian product -- every row in the first table combined with every row of the second.
This is more properly written using the explicit cross join:
SELECT table1.letter, table2.letter, table1.number, table2.number
FROM tbl table1 CROSS JOIN
tbl table2;
If you have conditions for combining the two tables, then you would normally use JOIN with an ON clause.
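A quick way to see the Cartesian product is to run the question's query against its three-row table. The sketch below uses Python's built-in sqlite3 module; the ON condition in the last query is an arbitrary example, not something from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbl (letter TEXT, number INTEGER);
    INSERT INTO tbl VALUES ('a', 1), ('b', 2), ('c', 3);
""")

# Implicit comma join with no WHERE clause: every row paired with every row.
comma_rows = conn.execute("""
    SELECT table1.letter, table2.letter, table1.number, table2.number
    FROM tbl AS table1, tbl AS table2
""").fetchall()

# The same thing written as an explicit CROSS JOIN.
cross_rows = conn.execute("""
    SELECT table1.letter, table2.letter, table1.number, table2.number
    FROM tbl AS table1 CROSS JOIN tbl AS table2
""").fetchall()

# With a join condition (an arbitrary one, for illustration) the result shrinks.
filtered_rows = conn.execute("""
    SELECT table1.letter, table2.letter
    FROM tbl AS table1
    INNER JOIN tbl AS table2 ON table1.number < table2.number
""").fetchall()

print(len(comma_rows), len(cross_rows), len(filtered_rows))
```

Three rows joined to three rows gives nine combinations in both of the first two forms.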
You can use cross join
select * from table1 cross join table2
Here is a link to understand more about the use of cross join.
https://www.w3resource.com/sql/joins/cross-join.php

Which is better for performance, selecting all the columns or selecting only the required columns while performing a join?

I have been asked to do performance tuning of a SQL Server query which has many joins in it.
For example
LEFT JOIN
vw_BILLABLE_CENSUS_R CEN ON DE.Client = CEN.Client
AND CAL.REPORTING_MONTH = CEN.REPORTING_MONTH
There are almost 25 columns present in vw_BILLABLE_CENSUS_R but we want to use only 3 of them. So I wanted to know whether, instead of selecting all the columns from the view or table, it helps to select only the required columns and then perform the join like this:
LEFT JOIN (SELECT [Column_1], [Column_2], [Column_3]
FROM vw_BILLABLE_CENSUS_R) CEN ON DE.Client = CEN.Client
AND CAL.REPORTING_MONTH = CEN.REPORTING_MONTH
So will this improve the performance or not?
The important part is the columns you actually use in the outermost SELECT, not the ones you select in the join. The SQL Server engine is smart enough to realize that it does not need to retrieve all columns from the referenced table (or view) if it doesn't need them.
So the following 2 queries should yield the exact same query execution plan:
SELECT
A.SomeColumn
FROM
MyTable AS A
LEFT JOIN (
SELECT
*
FROM
OtherTable AS B) AS X ON A.SomeColumn = X.SomeColumn
SELECT
A.SomeColumn
FROM
MyTable AS A
LEFT JOIN (
SELECT
B.SomeColumn
FROM
OtherTable AS B) AS X ON A.SomeColumn = X.SomeColumn
The difference appears if you actually use the selected columns (in a WHERE condition or by retrieving their values), as here:
SELECT
A.SomeColumn,
X.* -- * has all X columns
FROM
MyTable AS A
LEFT JOIN (
SELECT
B.*
FROM
OtherTable AS B) AS X ON A.SomeColumn = X.SomeColumn
SELECT
A.SomeColumn,
X.* -- * has only X's SomeColumn
FROM
MyTable AS A
LEFT JOIN (
SELECT
B.SomeColumn
FROM
OtherTable AS B) AS X ON A.SomeColumn = X.SomeColumn
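The effect of X.* in the last two queries can be checked by looking at the result columns. This sketch uses Python's built-in sqlite3 module with an invented OtherTable that has two extra columns (Extra1 and Extra2 are hypothetical names). It only shows which columns come back; it does not reproduce SQL Server's execution plans.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE MyTable (SomeColumn INTEGER);
    -- Extra1 and Extra2 are made-up columns for the demonstration
    CREATE TABLE OtherTable (SomeColumn INTEGER, Extra1 TEXT, Extra2 TEXT);
    INSERT INTO MyTable VALUES (1), (2);
    INSERT INTO OtherTable VALUES (1, 'p', 'q');
""")

def column_count(sql):
    """Return how many columns the query's result set has."""
    cursor = conn.execute(sql)
    return len(cursor.description)

# Derived table exposes every column, so X.* pulls all of them out.
wide = column_count("""
    SELECT A.SomeColumn, X.*
    FROM MyTable AS A
    LEFT JOIN (SELECT B.* FROM OtherTable AS B) AS X
        ON A.SomeColumn = X.SomeColumn
""")

# Derived table exposes one column, so X.* is just that column.
narrow = column_count("""
    SELECT A.SomeColumn, X.*
    FROM MyTable AS A
    LEFT JOIN (SELECT B.SomeColumn FROM OtherTable AS B) AS X
        ON A.SomeColumn = X.SomeColumn
""")

print(wide, narrow)
```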
I would rather use this approach:
LEFT JOIN
vw_BILLABLE_CENSUS_R CEN ON DE.Client = CEN.Client
AND CAL.REPORTING_MONTH = CEN.REPORTING_MONTH
than this
LEFT JOIN (SELECT [Column_1], [Column_2], [Column_3]
FROM vw_BILLABLE_CENSUS_R) CEN ON DE.Client = CEN.Client
AND CAL.REPORTING_MONTH = CEN.REPORTING_MONTH
Since in this case:
you make your query simpler,
you do not have to rely on the query optimizer's smartness and expect that it will eliminate unnecessary columns and rows,
finally, you can select as many columns in the outer SELECT as necessary without using derived table techniques.
In some cases derived tables are welcome, for example when you want to eliminate duplicates on the fly in a table you are joining to, but, imho, not in your case.
It depends on how many records are stored, but generally it will improve performance.
In this case read @LukStorms's comments; I think he is right.

Best way to use Where Clause in SQL SERVER Query (for best performance)?

I have written 2 queries in SQL Server that use a subquery in a join together with a WHERE clause. Please let me know which of the queries below is the better approach.
Query 1. SELECT column1,column2,column3,JOINTAB.column5 FROM Table1
INNER JOIN (SELECT column1,column4,column5 FROM Table2 WHERE
column4='xxxx') AS JOINTAB ON Table1.column1=JOINTAB.column1
Query 2. SELECT column1,column2,column3,JOINTAB.column5 FROM Table1
INNER JOIN (SELECT column1,column4,column5 FROM Table2) AS JOINTAB
ON Table1.column1=JOINTAB.column1 WHERE JOINTAB.column4='xxxx'
Which gives the best performance, Query 1 or Query 2?
Your best option here is not to use a subquery at all. It's not needed. Either of these will give you the same results:
SELECT t1.column1,t1.column2,t1.column3,t2.column5
FROM Table1 t1
INNER JOIN Table2 t2 on t2.column1 = t1.column1 and t2.column4 = 'xxxx'
--or
SELECT t1.column1,t1.column2,t1.column3,t2.column5
FROM Table1 t1
INNER JOIN Table2 t2 on t2.column1 = t1.column1
WHERE t2.column4 = 'xxxx'
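As a sanity check that all four formulations return the same rows, here is a sketch using Python's built-in sqlite3 module with invented data. In Query 1 and Query 2 the first column is qualified as Table1.column1, since an unqualified column1 would be ambiguous. This verifies results only; comparing performance on SQL Server still requires looking at the execution plans.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (column1 INTEGER, column2 TEXT, column3 TEXT);
    CREATE TABLE Table2 (column1 INTEGER, column4 TEXT, column5 TEXT);
    -- Hypothetical sample rows
    INSERT INTO Table1 VALUES (1, 'a', 'b'), (2, 'c', 'd'), (3, 'e', 'f');
    INSERT INTO Table2 VALUES (1, 'xxxx', 'keep'), (2, 'yyyy', 'drop'),
                              (3, 'xxxx', 'keep too');
""")

queries = {
    "query1_filter_inside_subquery": """
        SELECT Table1.column1, column2, column3, JOINTAB.column5 FROM Table1
        INNER JOIN (SELECT column1, column4, column5 FROM Table2
                    WHERE column4 = 'xxxx') AS JOINTAB
            ON Table1.column1 = JOINTAB.column1
        ORDER BY Table1.column1""",
    "query2_filter_outside_subquery": """
        SELECT Table1.column1, column2, column3, JOINTAB.column5 FROM Table1
        INNER JOIN (SELECT column1, column4, column5 FROM Table2) AS JOINTAB
            ON Table1.column1 = JOINTAB.column1
        WHERE JOINTAB.column4 = 'xxxx'
        ORDER BY Table1.column1""",
    "plain_join_filter_in_on": """
        SELECT t1.column1, t1.column2, t1.column3, t2.column5
        FROM Table1 t1
        INNER JOIN Table2 t2 ON t2.column1 = t1.column1 AND t2.column4 = 'xxxx'
        ORDER BY t1.column1""",
    "plain_join_filter_in_where": """
        SELECT t1.column1, t1.column2, t1.column3, t2.column5
        FROM Table1 t1
        INNER JOIN Table2 t2 ON t2.column1 = t1.column1
        WHERE t2.column4 = 'xxxx'
        ORDER BY t1.column1""",
}

results = {name: conn.execute(sql).fetchall() for name, sql in queries.items()}
for name, rows in results.items():
    print(name, rows)
```

Note that moving a filter between ON and WHERE is only equivalent for an INNER JOIN; with an outer join the two placements can give different results.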
Query performance is a concern for every developer.
To compare performance, look at the query execution plans, which SQL Server lets you view.
You can use Query Analyzer to see which execution plan the optimizer chooses: type an SQL statement in the query window and press Ctrl+L, and the plan is displayed graphically.
Then run each query and compare the performance.

Optimization of DB2 query which uses joins and takes 1.5 hours to execute

When I run a SELECT statement on my view it takes around 1.5 hours to run. What can I do to optimize it?
Below is a sample of what my view looks like:
CREATE VIEW SCHEMANAME.VIEWNAME
(COL, COL1, COL2, COL3)
AS SELECT
COST.ETA,
CASE
WHEN VOL.CURR IS NOT NULL
THEN COALESCE(VOL.COMM, 0)
END,
CASE
WHEN...
END
FROM TABLE1 t1 inner join TABLE2 t2 ON t1.ETA=t2.ETA
INNER JOIN TABLE3 t3 on t2.ETA=t3.ETA
LEFT OUTER JOIN TABLE4 t4 on t2.ETA=t4.ETA
This is your query:
SELECT COST.ETA,
(CASE WHEN VOL.CURR IS NOT NULL THEN COALESCE(VOL.COMM, 0)
END) as ??,
. . .
FROM TABLE1 t1 inner join
TABLE2 t2
ON t1.ETA = t2.ETA INNER JOIN
TABLE3 t3
on t2.ETA = t3.ETA LEFT OUTER JOIN
TABLE4 t4
on t2.ETA = t4.ETA;
First, I will note that the select clause references tables that are not in the from clause. I assume this is a typo.
Second, you should be able to use indexes to improve this query: table1(eta), table2(eta), table3(eta), and table4(eta).
Third, I am highly suspicious of seeing the same column used for joining so many tables. I suspect that you might have Cartesian products occurring, because there are multiple rows for a given eta in several tables. If that is the case, you need to fix the query to better reflect what you really need: ask another question with sample data and desired results, because your query is probably not correct.
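The row multiplication described in the third point is easy to reproduce. The sketch below uses Python's built-in sqlite3 module; the payload column, the index names, and the data are all made up, and only three of the four tables are modelled. With two rows per ETA value in each table, the join returns 2 x 2 x 2 rows for that value, and a GROUP BY ... HAVING query is one way to spot such non-unique keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Three hypothetical tables, each holding two rows for the same ETA value.
for name in ("TABLE1", "TABLE2", "TABLE3"):
    conn.execute(f"CREATE TABLE {name} (ETA INTEGER, payload TEXT)")
    # Index on the join column, as suggested in the second point.
    conn.execute(f"CREATE INDEX idx_{name}_eta ON {name} (ETA)")
    conn.executemany(
        f"INSERT INTO {name} VALUES (?, ?)",
        [(1, "first"), (1, "second")],
    )

# Joining on a non-unique key multiplies the rows: 2 * 2 * 2.
joined_rows = conn.execute("""
    SELECT COUNT(*)
    FROM TABLE1 t1
    INNER JOIN TABLE2 t2 ON t1.ETA = t2.ETA
    INNER JOIN TABLE3 t3 ON t2.ETA = t3.ETA
""").fetchone()[0]

# Diagnostic: list join-key values that occur more than once in a table.
duplicate_keys = conn.execute("""
    SELECT ETA, COUNT(*) FROM TABLE1
    GROUP BY ETA
    HAVING COUNT(*) > 1
""").fetchall()

print(joined_rows, duplicate_keys)
```

Running the diagnostic query against each table in the real view would show whether this kind of fan-out is contributing to the run time.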

Use join with a table and SQL Statement

Joins are usually used to fetch data from 2 tables using a column common to both tables.
Is it possible to join a table with the results of another SQL statement, and if so, what is the syntax?
Sure, this is called a derived table
such as:
select a.column, b.column
from
table1 a
join (select statement) b
on b.column = a.column
Keep in mind that it may run the select for the derived table in its entirety, so it helps to select only the things you need.
EDIT: I've found that I rarely need to use this technique unless I am joining on some aggregated queries.... so I would carefully consider your design here.
For example, thus far most demonstrations in this thread have not required the use of a derived table.
It depends on what the other statement is, but one of the techniques you can use is common table expressions - this may not be available on your particular SQL platform.
In the case of SQL Server, if the other statement is a stored procedure, you may have to insert the results into a temporary table and join to that.
It's also possible in SQL Server (and some other platforms) to have table-valued functions which can be joined just like a view or table.
select *
from TableA a
inner join (select x from TableB) b
on a.x = b.x
Select c.CustomerCode, c.CustomerName, sq.AccountBalance
From Customers c
Join (
Select CustomerCode, AccountBalance
From Balances
)sq on c.CustomerCode = sq.CustomerCode
Sure, as an example:
SELECT *
FROM Employees E
INNER JOIN
(
SELECT EmployeeID, COUNT(EmployeeID) as ComplaintCount
FROM Complaints
GROUP BY EmployeeID
) C ON E.EmployeeID = C.EmployeeID
WHERE C.ComplaintCount > 3
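For anyone who wants to try this, here is a runnable sketch using Python's built-in sqlite3 module. The Name column and the sample data are assumptions for illustration. It runs the derived-table join from the example and, for comparison, the same query written with a common table expression, which was mentioned earlier in the thread as an alternative where the platform supports it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Name is a hypothetical column added for readability
    CREATE TABLE Employees (EmployeeID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Complaints (EmployeeID INTEGER);
    INSERT INTO Employees VALUES (1, 'Ann'), (2, 'Bob');
""")
# Employee 1 gets four complaints, employee 2 gets one.
conn.executemany(
    "INSERT INTO Complaints (EmployeeID) VALUES (?)",
    [(1,)] * 4 + [(2,)],
)

# Join a table to the result of another SELECT (a derived table).
derived_table_rows = conn.execute("""
    SELECT E.EmployeeID, E.Name, C.ComplaintCount
    FROM Employees E
    INNER JOIN (
        SELECT EmployeeID, COUNT(EmployeeID) AS ComplaintCount
        FROM Complaints
        GROUP BY EmployeeID
    ) C ON E.EmployeeID = C.EmployeeID
    WHERE C.ComplaintCount > 3
""").fetchall()

# The same query with the inner SELECT moved into a common table expression.
cte_rows = conn.execute("""
    WITH C AS (
        SELECT EmployeeID, COUNT(EmployeeID) AS ComplaintCount
        FROM Complaints
        GROUP BY EmployeeID
    )
    SELECT E.EmployeeID, E.Name, C.ComplaintCount
    FROM Employees E
    INNER JOIN C ON E.EmployeeID = C.EmployeeID
    WHERE C.ComplaintCount > 3
""").fetchall()

print(derived_table_rows, cte_rows)
```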
It is. But what specifically are you looking to do?
That can be done with either a sub-select, a view or a temp table... More information would help us answer this question better, including which SQL software, and an example of what you'd like to do.
Try this:
SELECT T1.col1, t2.col2 FROM Table1 t1 INNER JOIN
(SELECT col1, col2, col3 FROM Table2) t2 ON t1.col1 = t2.col1