What's the difference between filtering in the WHERE clause compared to the ON clause? - sql

I would like to know if there is any difference in using the WHERE clause or using the matching in the ON of the inner join.
The result in this case is the same.
First query:
with Catmin as
(
select categoryid, MIN(unitprice) as mn
from production.Products
group by categoryid
)
select p.productname, mn
from Catmin
inner join Production.Products p
on p.categoryid = Catmin.categoryid
and p.unitprice = Catmin.mn;
Second query:
with Catmin as
(
select categoryid, MIN(unitprice) as mn
from production.Products
group by categoryid
)
select p.productname, mn
from Catmin
inner join Production.Products p
on p.categoryid = Catmin.categoryid
where p.unitprice = Catmin.mn; // this is changed
Result both queries:

My answer may be a bit off-topic, but I would like to highlight a problem that may occur when you turn your INNER JOIN into an OUTER JOIN.
In this case, the most important difference between putting predicates (test conditions) on the ON or WHERE clauses is that you can turn LEFT or RIGHT OUTER JOINS into INNER JOINS without noticing it, if you put fields of the table to be left out in the WHERE clause.
For example, in a LEFT JOIN between tables A and B, if you include a condition that involves fields of B on the WHERE clause, there's a good chance that there will be no null rows returned from B in the result set. Effectively, and implicitly, you turned your LEFT JOIN into an INNER JOIN.
On the other hand, if you include the same test in the ON clause, null rows will continue to be returned.
For example, take the query below:
SELECT * FROM A
LEFT JOIN B
ON A.ID=B.ID
The query will also return rows from A that do not match any of B.
Take this second query:
SELECT * FROM A
LEFT JOIN B
WHERE A.ID=B.ID
This second query won't return any rows from A that don't match B, even though you think it will because you specified a LEFT JOIN. That's because the test A.ID=B.ID will leave out of the result set any rows with B.ID that are null.
That's why I favor putting predicates in the ON clause rather than in the WHERE clause.

The results are exactly same.
Using "ON" clause is more suggested due to increasing performance of the query.
Instead of requesting the data from tables then filtering, by using on clause, you first filter first data-set and then join the data to other tables. So, lesser data to match and faster result is given.

There is no difference between the above two queries outputs both of them result same.
When you are using On Clause the join operation joins only those rows that matches the codidtion specified on ON Clause
Where as in case of Where Clause, the join opeartion joins all the rows and then filters out based on where condidtion Specified
So, obviously On Clause is more effective and should be preferred over where condidtion

Related

Join conditions, intermediate SQL

what's the difference between the join conditions "on" and "using" if both are used to select specified column(s)?
The main difference with using is that the columns for the join have to have the same names. This is generally a good practice anyway in the data model.
Another important difference is that the columns come from different tables -- the join condition doesn't specify the tables (some people view this as a weakness, but you'll see it is quite useful).
A handy feature is that the common columns used for the join are removed when you use select *. So
select *
from a join
b
on a.x = b.x
will result in x appearing twice in the result set. This is not allowed for subqueries or views. On the other hand, this query only has x once in the result set.
select *
from a join
b
using (x)
Of course, other columns could be duplicated.
For an outer join, the value is the non-NULL value, if any. This becomes quite handy for full joins:
select *
from a full join
b
using (x) full join
c
using (x);
Because of the null values, expressing this without using is rather cumbersome:
select *
from a full join
b
on b.x = a.x full join
c
on c.x = coalesce(a.x, b.x);
using is just a short-circuit to express the join condition when the related columns have the same name.
Consider the following example:
select ...
from orders o
inner join order_items oi on oi.order_id = o.order_id
This can be shortened with using, as follows:
select ...
from orders o
inner join order_items oi using(order_id)
Notes:
this also works when joining on several columns having identical names
parentheses are mandatory with using

SQL - why is this 'where' needed to remove row duplicates, when I'm already grouping?

Why, in this query, is the final 'WHERE' clause needed to limit duplicates?
The first LEFT JOIN is linking programs to entities on a UID
The first INNER JOIN is linking programs to a subquery that gets statistics for those programs, by linking on a UID
The subquery (that gets the StatsForDistributorClubs subset) is doing a grouping on UID columns
So, I would've thought that this would all be joining unique records anyway so we shouldn't get row duplicates
So why the need to limit based on the final WHERE by ensuring the 'program' is linked to the 'entity'?
(irrelevant parts of query omitted for clarity)
SELECT LmiEntity.[DisplayName]
,StatsForDistributorClubs.*
FROM [Program]
LEFT JOIN
LMIEntityProgram
ON LMIEntityProgram.ProgramUid = Program.ProgramUid
INNER JOIN
(
SELECT e.LmiEntityUid,
sp.ProgramUid,
SUM(attendeecount) [Total attendance],
FROM LMIEntity e,
Timetable t,
TimetableOccurrence [to],
ScheduledProgramOccurrence spo,
ScheduledProgram sp
WHERE
t.LicenseeUid = e.lmientityUid
AND [to].TimetableOccurrenceUid = spo.TimetableOccurrenceUid
AND sp.ScheduledProgramUid = spo.ScheduledProgramUid
GROUP BY e.lmientityUid, sp.ProgramUid
) AS StatsForDistributorClubs
ON Program.ProgramUid = StatsForDistributorClubs.ProgramUid
INNER JOIN LmiEntity
ON LmiEntity.LmiEntityUid = StatsForDistributorClubs.LmiEntityUid
LEFT OUTER JOIN Region
ON Region.RegionId = LMIEntity.RegionId
WHERE (
[Program].LicenseeUid = LmiEntity.LmiEntityUid
OR
[LMIEntityProgram].LMIEntityUid = LmiEntity.LmiEntityUid
)
If you were grouping in your outer query, the extra criteria probably wouldn't be needed, but only your inner query is grouped. Your LEFT JOIN to a grouped inner query can still result in multiple records being returned, for that matter any of your JOINs could be the culprit.
Without seeing sample of duplication it's hard to know where the duplicates originate from, but GROUPING on the outer query would definitely remove full duplicates, or revised JOIN criteria could take care of it.
You have in result set:
SELECT LmiEntity.[DisplayName]
,StatsForDistributorClubs.*
I suppose that you dublicates comes from LMIEntityProgram.
My conjecture: LMIEntityProgram - is a bridge table with both LmiEntityId an ProgramId, but you join only by ProgramId.
If you have several LmiEntityId for single ProgramId - you must have dublicates.
And this dublicates you're filtering in WHERE:
[LMIEntityProgram].LMIEntityUid = LmiEntity.LmiEntityUid
You can do it in JOIN:
LEFT JOIN LMIEntityProgram
ON LMIEntityProgram.ProgramUid = Program.ProgramUid
AND [LMIEntityProgram].LMIEntityUid = LmiEntity.LmiEntityUid

When I add a LEFT OUTER JOIN, the query returns only a few rows

The original query returns 160k rows. When I add the LEFT OUTER JOIN:
LEFT OUTER JOIN Table_Z Z WITH (NOLOCK) ON A.Id = Z.Id
the query returns only 150 rows. I'm not sure what I'm doing wrong.
All I need to do is add a column to the query, which will bring back a code from a different table. The code could be a number or a NULL. I still have to display NULL, hence the reason for the LEFT join. They should join on the "id" columns.
SELECT <lots of stuff> + the new column that I need (called "code").
FROM
dbo.Table_A A WITH (NOLOCK)
INNER JOIN
dbo.Table_B B WITH (NOLOCK) ON A.Id = B.Id AND A.version = B.version
--this is where I added the LEFT OUTER JOIN. with it, the query returns 150 rows, without it, 160k rows.
LEFT OUTER JOIN
Table_Z Z WITH (NOLOCK) ON A.Id = Z.Id
LEFT OUTER JOIN
Table_E E WITH (NOLOCK) ON A.agent = E.agent
LEFT OUTER JOIN
Table_D D WITH (NOLOCK) ON E.location = D.location
AND E.type = 'Organization'
AND D.af_type = 'agent_location'
INNER JOIN
(SELECT X , MAX(Version) AS MaxVersion
FROM LocalTable WITH (NOLOCK)
GROUP BY agemt) P ON E.agent = P.location AND E.Version = P.MaxVersion
Does anyone have any idea what could be causing the issue?
When you perform a LEFT OUTER JOIN between tables A and E, you are maintaining your original set of data from A. That is to say, there is no data, or lack of data, in table E that can reduce the number of rows in your query.
However, when you then perform an INNER JOIN between E and P at the bottom, you are indeed opening yourself up to the possibility of reducing the number of rows returned. This will treat your subsequent LEFT OUTER JOINs like INNER JOINs.
Now, without your exact schema and a set of data to test against, this may or may not be the exact issue you are experiencing. Still, as a general rule, always put your INNER JOINs before your OUTER JOINs. It can make writing queries like this much, much easier. Your most restrictive joins come first, and then you won't have to worry about breaking any of your outer joins later on.
As a quick fix, try changing your last join to P to a LEFT OUTER JOIN, just to see if the Z join works.
You have to be very careful once you start with LEFT JOINs.
Let's suppose this model: You have tables Products, Orders and Customers. Not all products necessarily have been ordered, but every order must have customer entered.
Task: Show all products, and if the product was ordered, list the ordering customers; i.e., product without orders will be shown as one row, product with 10 orders will have 10 rows in the resultset. This calls for a query designed around FROM Products LEFT JOIN Orders.
Now someone could think "OK, Customer is always entered into orders, so I can make inner join from orders to customers". Wrong. Since the table Customers is joined through left-joined table Orders, it has to be left-joined itself... otherwise the inner join will propagate into the previous level(s) and as a result, you will lose all products that have no orders.
That is, once you join any table using LEFT JOIN, any subsequent tables that are joined through this table, need to keep LEFT JOINs. But it does not mean that once you use LEFT JOIN, all joins have to be of that type... only those that are dependent on the first performed LEFT JOIN. It would be perfectly fine to INNER JOIN the table Products with another table Category for example, if you only want to see Products which have a category set.
(Answer is based on this answer: http://www.sqlservercentral.com/Forums/Topic247971-8-1.aspx -> last entry)

Adding more condition while joining or in where which is better?

SELECT C.*
FROM Content C
INNER JOIN ContentPack CP ON C.ContentPackId = CP.ContentPackId
AND CP.DomainId = #DomainId
...and:
SELECT C.*
FROM Content C
INNER JOIN ContentPack CP ON C.ContentPackId = CP.ContentPackId
WHERE CP.DomainId = #DomainId
Is there any performance difference between this 2 queries?
Because both queries use an INNER JOIN, there is no difference -- they're equivalent.
That wouldn't be the case if dealing with an OUTER JOIN -- criteria in the ON clause is applied before the join; criteria in the WHERE is applied after the join.
But your query would likely run better as:
SELECT c.*
FROM CONTENT c
WHERE EXISTS (SELECT NULL
FROM CONTENTPACK cp
WHERE cp.contentpackid = c.contentpackid
AND cp.domainid = #DomainId)
Using a JOIN risks duplicates if there's more than one CONTENTPACK record related to a CONTENT record. And it's pointless to JOIN if your query is not using columns from the table being JOINed to... JOINs are not always the fastest way.
There's no performance difference but I would prefer the inner join because I think it makes very clear what is it that you are trying to join on both tables.

What is the most efficient way to write a select statement with a "not in" subquery?

What is the most efficient way to write a select statement similar to the below.
SELECT *
FROM Orders
WHERE Orders.Order_ID not in (Select Order_ID FROM HeldOrders)
The gist is you want the records from one table when the item is not in another table.
For starters, a link to an old article in my blog on how NOT IN predicate works in SQL Server (and in other systems too):
Counting missing rows: SQL Server
You can rewrite it as follows:
SELECT *
FROM Orders o
WHERE NOT EXISTS
(
SELECT NULL
FROM HeldOrders ho
WHERE ho.OrderID = o.OrderID
)
, however, most databases will treat these queries the same.
Both these queries will use some kind of an ANTI JOIN.
This is useful for SQL Server if you want to check two or more columns, since SQL Server does not support this syntax:
SELECT *
FROM Orders o
WHERE (col1, col2) NOT IN
(
SELECT col1, col2
FROM HeldOrders ho
)
Note, however, that NOT IN may be tricky due to the way it treats NULL values.
If Held.Orders is nullable, no records are found and the subquery returns but a single NULL, the whole query will return nothing (both IN and NOT IN will evaluate to NULL in this case).
Consider these data:
Orders:
OrderID
---
1
HeldOrders:
OrderID
---
2
NULL
This query:
SELECT *
FROM Orders o
WHERE OrderID NOT IN
(
SELECT OrderID
FROM HeldOrders ho
)
will return nothing, which is probably not what you'd expect.
However, this one:
SELECT *
FROM Orders o
WHERE NOT EXISTS
(
SELECT NULL
FROM HeldOrders ho
WHERE ho.OrderID = o.OrderID
)
will return the row with OrderID = 1.
Note that LEFT JOIN solutions proposed by others is far from being a most efficient solution.
This query:
SELECT *
FROM Orders o
LEFT JOIN
HeldOrders ho
ON ho.OrderID = o.OrderID
WHERE ho.OrderID IS NULL
will use a filter condition that will need to evaluate and filter out all matching rows which can be numerius
An ANTI JOIN method used by both IN and EXISTS will just need to make sure that a record does not exists once per each row in Orders, so it will eliminate all possible duplicates first:
NESTED LOOPS ANTI JOIN and MERGE ANTI JOIN will just skip the duplicates when evaluating HeldOrders.
A HASH ANTI JOIN will eliminate duplicates when building the hash table.
"Most efficient" is going to be different depending on tables sizes, indexes, and so on. In other words it's going to differ depending on the specific case you're using.
There are three ways I commonly use to accomplish what you want, depending on the situation.
1. Your example works fine if Orders.order_id is indexed, and HeldOrders is fairly small.
2. Another method is the "correlated subquery" which is a slight variation of what you have...
SELECT *
FROM Orders o
WHERE Orders.Order_ID not in (Select Order_ID
FROM HeldOrders h
where h.order_id = o.order_id)
Note the addition of the where clause. This tends to work better when HeldOrders has a large number of rows. Order_ID needs to be indexed in both tables.
3. Another method I use sometimes is left outer join...
SELECT *
FROM Orders o
left outer join HeldOrders h on h.order_id = o.order_id
where h.order_id is null
When using the left outer join, h.order_id will have a value in it matching o.order_id when there is a matching row. If there isn't a matching row, h.order_id will be NULL. By checking for the NULL values in the where clause you can filter on everything that doesn't have a match.
Each of these variations can work more or less efficiently in various scenarios.
You can use a LEFT OUTER JOIN and check for NULL on the right table.
SELECT O1.*
FROM Orders O1
LEFT OUTER JOIN HeldOrders O2
ON O1.Order_ID = O2.Order_Id
WHERE O2.Order_Id IS NULL
I'm not sure what is the most efficient, but other options are:
1. Use EXISTS
SELECT *
FROM ORDERS O
WHERE NOT EXISTS (SELECT 1
FROM HeldOrders HO
WHERE O.Order_ID = HO.OrderID)
2. Use EXCEPT
SELECT O.Order_ID
FROM ORDERS O
EXCEPT
SELECT HO.Order_ID
FROM HeldOrders
Try
SELECT *
FROM Orders
LEFT JOIN HeldOrders
ON HeldOrders.Order_ID = Orders.Order_ID
WHERE HeldOrders.Order_ID IS NULL