Exists sub-query with a HAVING clause

I'm trying to understand how EXISTS works.
The following query is based on this answer, and it queries for all SalesOrderIDs that have more than 1 record in the table, where at least one of those records has OrderQty > 1 and ProductID = 777:
USE AdventureWorks2012;
GO
SELECT SalesOrderID, OrderQty, ProductID
FROM Sales.SalesOrderDetail s
WHERE EXISTS
( SELECT 1
FROM Sales.SalesOrderDetail s2
WHERE s.SalesOrderID = s2.SalesOrderID
GROUP BY SalesOrderID
HAVING COUNT(*) > 1
AND COUNT(CASE WHEN OrderQty > 1 AND ProductID = 777 THEN 1 END) >= 1
);
What I don't understand is this: the sub-query returns a single-column table filled with the value 1 on each row. So the way I understand it, the WHERE in the outer query has no real condition to apply, just a bunch of 1s. Why and how, then, does the outer query return only part of Sales.SalesOrderDetail, and not its entirety?

EXISTS only checks whether the sub-query returns at least one row for the current row of the outer table. The values in those rows are never looked at, which is why we can select a constant such as 1, unlike IN, where the sub-query must select the column being compared.
So the outer WHERE does not receive a bunch of 1s to validate. As the name implies, it only tests for the existence of a row that satisfies the given conditions, and because the sub-query is correlated on SalesOrderID, that test is repeated for every outer row.
Hope this clarifies.
Note: Always qualify column names with table aliases to prevent ambiguity.

The inner SELECT 1 ... will not always return 1.
When the inner WHERE/HAVING condition is not met, no 1 is returned. Instead there is nothing: SQL Server Management Studio (if I recall correctly) will display no result at all, not even NULL, for the inner SELECT 1, so EXISTS is false and the outer WHERE fails for that particular row.
Therefore part of your outer query's result set will be cut off, and the total number of rows returned with EXISTS(...) will be fewer than if EXISTS(...) were not present.
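To see the "one row or nothing" behaviour for yourself, here is a minimal sketch. It is an assumption-laden stand-in, not the AdventureWorks query itself: it uses Python's built-in sqlite3 module instead of SQL Server, and the table name detail and its five rows are made up for illustration.

```python
import sqlite3

# Hypothetical miniature of Sales.SalesOrderDetail; the rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE detail (SalesOrderID INT, OrderQty INT, ProductID INT)")
conn.executemany(
    "INSERT INTO detail VALUES (?, ?, ?)",
    [
        (1, 2, 777),  # order 1: two rows, one with qty > 1 and product 777
        (1, 1, 800),
        (2, 5, 777),  # order 2: a single row, so COUNT(*) > 1 fails
        (3, 1, 777),  # order 3: two rows, none with qty > 1 and product 777
        (3, 4, 800),
    ],
)

# The inner query on its own, run once per outer SalesOrderID.
subquery = """
    SELECT 1
    FROM detail s2
    WHERE s2.SalesOrderID = ?
    GROUP BY s2.SalesOrderID
    HAVING COUNT(*) > 1
       AND COUNT(CASE WHEN s2.OrderQty > 1 AND s2.ProductID = 777 THEN 1 END) >= 1
"""
rows_per_order = {
    order_id: conn.execute(subquery, (order_id,)).fetchall()
    for order_id in (1, 2, 3)
}

# The full query: a row is kept only where the subquery produced a row.
kept = conn.execute("""
    SELECT s.SalesOrderID, s.OrderQty, s.ProductID
    FROM detail s
    WHERE EXISTS (
        SELECT 1
        FROM detail s2
        WHERE s.SalesOrderID = s2.SalesOrderID
        GROUP BY s2.SalesOrderID
        HAVING COUNT(*) > 1
           AND COUNT(CASE WHEN s2.OrderQty > 1 AND s2.ProductID = 777 THEN 1 END) >= 1
    )
    ORDER BY s.OrderQty
""").fetchall()

print(rows_per_order)
print(kept)
```

The dictionary shows that the sub-query yields a single row for order 1 and no rows at all for orders 2 and 3, which is exactly what EXISTS tests for each outer row.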

Related

Query with Left outer join and group by returning duplicates

To begin with, I have a table in my db that is fed with SalesForce info. When I run this example query it returns 2 rows:
select * from SalesForce_INT_Account__c where ID_SAP_BAYER__c = '3783513'
When I run this next query on the same table I obtain one of the rows, which is what I need:
SELECT MAX(ID_SAP_BAYER__c) FROM SalesForce_INT_Account__c where ID_SAP_BAYER__c = '3783513' GROUP BY ID_SAP_BAYER__c
Now, I have another table (PedidosEspecialesZarateCabeceras) which has a field (NroClienteDireccionEntrega) that I can match with the field I've been using in the SalesForce table (ID_SAP_BAYER__c). This table has a key that consists of just 1 field (NroPedido).
What I need to do is join these 2 tables to obtain a row from PedidosEspecialesZarateCabeceras with additional fields coming from the SalesForce table, and in case those additional fields are not available, they should come as NULL values, so for that I'm using a LEFT OUTER JOIN.
The problem is, since I have to match NroClienteDireccionEntrega and ID_SAP_BAYER__c and there are 2 rows in the SalesForce table with the same ID_SAP_BAYER__c, my query returns 2 duplicate rows from PedidosEspecialesZarateCabeceras (they both have the same NroPedido).
This is an example query that returns duplicates:
SELECT
cab.CUIT AS CUIT,
convert(nvarchar(4000), cab.NroPedido) AS NroPedido,
sales.BillingCity__c as Localidad,
sales.BillingState__c as IdProvincia,
sales.BillingState__c_Desc as Provincia,
sales.BillingStreet__c as Calle,
sales.Billing_Department__c as Distrito,
sales.Name as RazonSocial,
cab.NroCliente as ClienteId
FROM PedidosEspecialesZarateCabeceras AS cab WITH (NOLOCK)
LEFT OUTER JOIN
SalesForce_INT_Account__c AS sales WITH (NOLOCK) ON
cab.NroClienteDireccionEntrega = sales.ID_SAP_BAYER__c
and sales.ID_SAP_BAYER__c in
( SELECT MAX(ID_SAP_BAYER__c)
FROM SalesForce_INT_Account__c
GROUP BY ID_SAP_BAYER__c
)
WHERE cab.NroPedido ='5320'
Even though the join has MAX and GROUP BY, this returns 2 duplicate rows with different SalesForce information (because of the 2 SalesForce rows with the same ID_SAP_BAYER__c), which I expected not to be possible.
What I need is for the left outer join in my query to pick only ONE of the SalesForce rows, to prevent the duplication that is happening right now. For some reason the SELECT MAX with the GROUP BY is not working.
Maybe I should try to join these tables in a different way. Can anyone give me some other ideas on how to join the two tables to return just 1 row? It doesn't matter if the SalesForce row that gets picked out of the 2 isn't the correct one, I just need it to pick one of them.

Your IN clause is not actually doing anything, since...
SELECT MAX(ID_SAP_BAYER__c)
FROM SalesForce_INT_Account__c
GROUP BY ID_SAP_BAYER__c
... returns all possible ID_SAP_BAYER__c values. (The GROUP BY says you want to return one row per unique ID_SAP_BAYER__c and then, since your MAX is operating on exactly one unique value per group, you simply return that value.)
You will want to change your query to operate on a value that is actually different between the two rows you are trying to differentiate (probably the MAX(ID) for the relevant ID_SAP_BAYER__c). Plus, you will want to link that inner query to your outer query.
You could probably do something like:
...
LEFT OUTER JOIN
SalesForce_INT_Account__c sales
ON cab.NroClienteDireccionEntrega = sales.ID_SAP_BAYER__c
and sales.ID in
(
SELECT MAX(ID)
FROM SalesForce_INT_Account__c sales2
WHERE sales2.ID_SAP_BAYER__c = cab.NroClienteDireccionEntrega
)
WHERE cab.NroPedido ='5320'
By using sales.ID in ... SELECT MAX(ID) ... instead of sales.ID_SAP_BAYER__c in ... SELECT MAX(ID_SAP_BAYER__c) ... this ensures you only match one of the two rows for that ID_SAP_BAYER__c. The WHERE sales2.ID_SAP_BAYER__c = cab.NroClienteDireccionEntrega condition links the inner query to the outer query.
There are multiple ways of doing the above, especially if you don't care which of the relevant rows you match on. You can use the above as a starting point and make it match your preferred style.
An alternative might be to use OUTER APPLY with TOP 1. Something like:
SELECT
...
FROM PedidosEspecialesZarateCabeceras AS cab
OUTER APPLY(
SELECT TOP 1 *
FROM SalesForce_INT_Account__c s1
WHERE cab.NroClienteDireccionEntrega = s1.ID_SAP_BAYER__c
) sales
WHERE cab.NroPedido ='5320'
Without an ORDER BY the match that TOP 1 chooses will be arbitrary, but I think that's what you want anyway. (If not, you could add an ORDER BY).
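As a rough illustration of why the original IN filter does nothing while the correlated MAX(ID) version keeps one row, here is a small sketch using Python's sqlite3 module rather than SQL Server. The shortened table names cab and sales, the ID column and the sample rows are all assumptions made for the example.

```python
import sqlite3

# Hypothetical cut-down versions of the two tables; ID is an assumed unique column.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cab (NroPedido INT PRIMARY KEY, NroClienteDireccionEntrega TEXT);
    CREATE TABLE sales (ID INT PRIMARY KEY, ID_SAP_BAYER__c TEXT, Name TEXT);
    INSERT INTO cab VALUES (5320, '3783513'), (5321, '9999999');
    INSERT INTO sales VALUES (1, '3783513', 'Account A'), (2, '3783513', 'Account B');
""")

# Original approach: the IN list contains every ID_SAP_BAYER__c, so nothing is filtered.
naive = conn.execute("""
    SELECT cab.NroPedido, sales.Name
    FROM cab
    LEFT OUTER JOIN sales
      ON cab.NroClienteDireccionEntrega = sales.ID_SAP_BAYER__c
     AND sales.ID_SAP_BAYER__c IN (
         SELECT MAX(ID_SAP_BAYER__c) FROM sales GROUP BY ID_SAP_BAYER__c)
    ORDER BY cab.NroPedido, sales.ID
""").fetchall()

# Suggested approach: filter on a column that differs between the duplicates,
# and correlate the inner query with the outer row.
fixed = conn.execute("""
    SELECT cab.NroPedido, sales.Name
    FROM cab
    LEFT OUTER JOIN sales
      ON cab.NroClienteDireccionEntrega = sales.ID_SAP_BAYER__c
     AND sales.ID IN (
         SELECT MAX(s2.ID) FROM sales s2
         WHERE s2.ID_SAP_BAYER__c = cab.NroClienteDireccionEntrega)
    ORDER BY cab.NroPedido
""").fetchall()

print(naive)
print(fixed)
```

The order with no SalesForce match still comes back with NULL (None in Python), so the LEFT OUTER JOIN behaviour is preserved while the duplicate disappears.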

I am getting this error when trying to run the Count(Subscriber) line of code: Subquery returned more than 1 value. This is not permitted [duplicate]

I run the following query:
SELECT
orderdetails.sku,
orderdetails.mf_item_number,
orderdetails.qty,
orderdetails.price,
supplier.supplierid,
supplier.suppliername,
supplier.dropshipfees,
cost = (SELECT supplier_item.price
FROM supplier_item,
orderdetails,
supplier
WHERE supplier_item.sku = orderdetails.sku
AND supplier_item.supplierid = supplier.supplierid)
FROM orderdetails,
supplier,
group_master
WHERE invoiceid = '339740'
AND orderdetails.mfr_id = supplier.supplierid
AND group_master.sku = orderdetails.sku
I get the following error:
Msg 512, Level 16, State 1, Line 2
Subquery returned more than 1 value. This is not permitted when the subquery follows =, !=, <, <= , >, >= or when the subquery is used as an expression.
Any ideas?

Try this:
SELECT
od.Sku,
od.mf_item_number,
od.Qty,
od.Price,
s.SupplierId,
s.SupplierName,
s.DropShipFees,
si.Price as cost
FROM
OrderDetails od
INNER JOIN Supplier s on s.SupplierId = od.Mfr_ID
INNER JOIN Group_Master gm on gm.Sku = od.Sku
INNER JOIN Supplier_Item si on si.SKU = od.Sku and si.SupplierId = s.SupplierID
WHERE
od.invoiceid = '339740'
This will return multiple rows that are identical except for the cost column. Look at the different cost values that are returned and figure out what is causing the different values. Then ask somebody which cost value they want, and add the criteria to the query that will select that cost.

Check to see if there are any triggers on the table you are trying to execute queries against. They can sometimes throw this error as they are trying to run the update/select/insert trigger that is on the table.
You can modify your query to disable then enable the trigger if the trigger DOES NOT need to be executed for whatever query you are trying to run.
ALTER TABLE your_table DISABLE TRIGGER [the_trigger_name]
UPDATE your_table
SET Gender = 'Female'
WHERE (Gender = 'Male')
ALTER TABLE your_table ENABLE TRIGGER [the_trigger_name]

SELECT COLUMN
FROM TABLE
WHERE columns_name
IN ( SELECT COLUMN FROM TABLE WHERE columns_name = 'value');
Note: when using a sub-query, keep these points in mind:
If the sub-query returns exactly one value, you can use a comparison operator (=, !=, <>, <, >, ...).
If it can return more than one value, you need IN, ANY, ALL or SOME instead.

cost = Select Supplier_Item.Price from Supplier_Item,orderdetails,Supplier
where Supplier_Item.SKU=OrderDetails.Sku and
Supplier_Item.SupplierId=Supplier.SupplierID
This subquery returns multiple values. SQL Server is complaining because it can't assign multiple values to cost in a single record.
Some ideas:
Fix the data such that the existing subquery returns only 1 record
Fix the subquery such that it only returns one record
Add a top 1 and order by to the subquery (nasty solution that DBAs hate - but it "works")
Use a user defined function to concatenate the results of the subquery into a single string

The fix is to stop using correlated subqueries and use joins instead. Correlated subqueries can behave like cursors when they force the query to run row by row, so they are often best avoided.
You may need a derived table in the join to get the value you want if only one record should match. If you need both values, an ordinary join will do, but you will get multiple records for the same id in the result set. If you only want one, you need to decide which one and express that in the code: you could use TOP 1 with an ORDER BY, MAX(), MIN(), etc., depending on what your real requirement for the data is.

I had the same problem. I used IN instead of =. Here is an example from the Northwind database.
The query is: find the companies that placed orders in 1997.
Try this:
SELECT CompanyName
FROM Customers
WHERE CustomerID IN (
SELECT CustomerID
FROM Orders
WHERE YEAR(OrderDate) = '1997'
);
Instead of that :
SELECT CompanyName
FROM Customers
WHERE CustomerID =
(
SELECT CustomerID
FROM Orders
WHERE YEAR(OrderDate) = '1997'
);

Either your data is bad, or it's not structured the way you think it is. Possibly both.
To prove/disprove this hypothesis, run this query:
SELECT * from
(
SELECT count(*) as c, Supplier_Item.SKU
FROM Supplier_Item
INNER JOIN orderdetails
ON Supplier_Item.sku = orderdetails.sku
INNER JOIN Supplier
ON Supplier_item.supplierID = Supplier.SupplierID
GROUP BY Supplier_Item.SKU
) x
WHERE c > 1
ORDER BY c DESC
If this returns just a few rows, then your data is bad. If it returns lots of rows, then your data is not structured the way you think it is. (If it returns zero rows, I'm wrong.)
I'm guessing that you have orders containing the same SKU multiple times (two separate line items, both ordering the same SKU).

The select statement in the cost part of your select is returning more than one value. You need to add more where clauses, or use an aggregation.

The error implies that this subquery is returning more than 1 row:
(Select Supplier_Item.Price from Supplier_Item,orderdetails,Supplier where Supplier_Item.SKU=OrderDetails.Sku and Supplier_Item.SupplierId=Supplier.SupplierID )
You probably don't want to include the orderdetails and supplier tables in the subquery, because you want to reference the values selected from those tables in the outer query. So I think you want the subquery to be simply:
(Select Supplier_Item.Price from Supplier_Item where Supplier_Item.SKU=OrderDetails.Sku and Supplier_Item.SupplierId=Supplier.SupplierID )
I suggest you read up on correlated vs. non-correlated subqueries.
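Here is a small sketch of the difference, using Python's sqlite3 module with made-up rows (two order lines, two suppliers, two supplier items). Note that SQLite does not raise SQL Server's Msg 512 for a multi-row scalar subquery, so the sketch simply shows how many rows each form of the subquery produces.

```python
import sqlite3

# Hypothetical minimal schema; only the columns used by the subquery are kept.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orderdetails (sku TEXT, mfr_id INT);
    CREATE TABLE supplier (supplierid INT);
    CREATE TABLE supplier_item (sku TEXT, supplierid INT, price REAL);
    INSERT INTO orderdetails VALUES ('A', 1), ('B', 2);
    INSERT INTO supplier VALUES (1), (2);
    INSERT INTO supplier_item VALUES ('A', 1, 10.0), ('B', 2, 20.0);
""")

# Original subquery: it lists orderdetails and supplier in its own FROM,
# so it has no link to the outer row and returns a price for every match.
uncorrelated = conn.execute("""
    SELECT supplier_item.price
    FROM supplier_item, orderdetails, supplier
    WHERE supplier_item.sku = orderdetails.sku
      AND supplier_item.supplierid = supplier.supplierid
""").fetchall()

# Correlated subquery: od and s come from the outer query, so each outer row
# gets at most the price of its own SKU and supplier.
correlated = conn.execute("""
    SELECT od.sku,
           (SELECT si.price
            FROM supplier_item si
            WHERE si.sku = od.sku
              AND si.supplierid = s.supplierid) AS cost
    FROM orderdetails od
    JOIN supplier s ON od.mfr_id = s.supplierid
    ORDER BY od.sku
""").fetchall()

print(len(uncorrelated))
print(correlated)
```

If supplier_item itself held two prices for the same SKU and supplier, the correlated form would still return more than one value on SQL Server, which is the data problem other answers describe.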

As others have suggested, the best way to do this is to use a join instead of variable assignment. Re-writing your query to use a join (and using the explicit join syntax instead of the implicit join, which was also suggested--and is the best practice), you would get something like this:
select
OrderDetails.Sku,
OrderDetails.mf_item_number,
OrderDetails.Qty,
OrderDetails.Price,
Supplier.SupplierId,
Supplier.SupplierName,
Supplier.DropShipFees,
Supplier_Item.Price as cost
from
OrderDetails
join Supplier on OrderDetails.Mfr_ID = Supplier.SupplierId
join Group_Master on Group_Master.Sku = OrderDetails.Sku
join Supplier_Item on
Supplier_Item.SKU=OrderDetails.Sku and Supplier_Item.SupplierId=Supplier.SupplierID
where
invoiceid='339740'

Even 9 years after the original post, this helped me.
If you are receiving this type of error without any clue, there is probably a trigger or function related to the table, and it most likely ends in an SP or function that selects or filters data without using the primary (unique) key column. If you search or filter on the primary key column, there won't be multiple results. This matters especially when you are assigning a value to a declared variable. The SP never gives you an error when it is created, only at runtime.
"System.Data.SqlClient.SqlException (0x80131904): Subquery returned more than 1 value. This is not permitted when the subquery follows =, !=, <, <= , >, >= or when the subquery is used as an expression.
The statement has been terminated."
In my case there was no clue other than this error message. There was a trigger on the table, and the table updated by that trigger had another trigger, so it went through two triggers and finally ended in an SP. The SP had a SELECT that returned multiple rows.
SET @Variable1 = (
SELECT column_gonna_asign
FROM dbo.your_db
WHERE Non_primary_non_unique_key = @Variable2
)
If this returns multiple rows, you are in trouble.

How does EXISTS return things other than all rows or no rows?

I am a beginning SQL programmer - I am getting most things, but not EXISTS.
It looks to me, and looks by the documentation, that an entire EXISTS statement returns a boolean value.
However, I see specific examples where it can be used and returns part of a table as opposed to all or none of it.
SELECT DISTINCT PNAME
FROM P
WHERE EXISTS
(
SELECT *
FROM SP Join S ON SP.SNO = S.SNO
WHERE SP.PNO = P.PNO
AND S.STATUS > 25
)
This query returns to me one value, the one that meets the criteria (S.Status > 25).
However, with other queries, it seems to return the whole table I am selecting from if even one of the rows in the EXISTS subquery is true.
How does one control this?

Subqueries such as with EXISTS can either be correlated or non-correlated.
In your example you use a correlated subquery, which is usually the case with EXISTS. You look up records in SP for a given P.PNO, i.e. you do the lookup for each P record.
Without SP.PNO = P.PNO you would have a non-correlated subquery, i.e. the subquery would no longer depend on the P record. It would return the same result for any P record (either a Status > 25 exists at all or not), so the outer query would return all rows or none. Most often this happens by mistake (one forgot to relate the subquery to the record in question), but sometimes it is intended.
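The difference can be sketched with Python's sqlite3 module and a made-up miniature of the P, S and SP tables (two parts, two suppliers, one shipment each; all sample values are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE P (PNO TEXT, PNAME TEXT);
    CREATE TABLE S (SNO TEXT, STATUS INT);
    CREATE TABLE SP (SNO TEXT, PNO TEXT);
    INSERT INTO P VALUES ('P1', 'Nut'), ('P2', 'Bolt');
    INSERT INTO S VALUES ('S1', 30), ('S2', 10);
    INSERT INTO SP VALUES ('S1', 'P1'), ('S2', 'P2');
""")

# Correlated: the subquery is evaluated against each P row via SP.PNO = P.PNO.
correlated = conn.execute("""
    SELECT DISTINCT PNAME FROM P
    WHERE EXISTS (
        SELECT * FROM SP JOIN S ON SP.SNO = S.SNO
        WHERE SP.PNO = P.PNO AND S.STATUS > 25)
    ORDER BY PNAME
""").fetchall()

# Non-correlated: same answer for every P row, so all rows come back...
uncorrelated_true = conn.execute("""
    SELECT DISTINCT PNAME FROM P
    WHERE EXISTS (
        SELECT * FROM SP JOIN S ON SP.SNO = S.SNO
        WHERE S.STATUS > 25)
    ORDER BY PNAME
""").fetchall()

# ...or no rows at all when nothing satisfies the inner condition.
uncorrelated_false = conn.execute("""
    SELECT DISTINCT PNAME FROM P
    WHERE EXISTS (
        SELECT * FROM SP JOIN S ON SP.SNO = S.SNO
        WHERE S.STATUS > 100)
    ORDER BY PNAME
""").fetchall()

print(correlated, uncorrelated_true, uncorrelated_false)
```

So the "part of a table" result comes from the correlation condition, and the "all or nothing" result is what you get when it is missing.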

You have actually created a correlated subquery. The EXISTS predicate accepts a subquery as input and returns TRUE if the subquery returns any rows and FALSE otherwise.
The outer query against table P doesn't have any other filters, so every row of this table is considered, and the rows for which the EXISTS predicate returns TRUE are returned.
SELECT DISTINCT PNAME -- Outer Query
FROM P
Now, the EXISTS predicate returns TRUE if the current row in table P has related rows in SP Join S ON SP.SNO = S.SNO where S.STATUS > 25
SELECT *
FROM SP Join S ON SP.SNO = S.SNO
WHERE SP.PNO = P.PNO -- Inner query
AND S.STATUS > 25
One of the benefits of using the EXISTS predicate is that it allows you to intuitively phrase English like queries. For example, this query can be read just as you would say it in ordinary English: select all unique PNAME from table P where at least one row exists in which PNO equals PNO in table SP and Status in table S > 25, provided table SP and S are joined based on SNO.

Which SQL language are you using?
EXISTS itself always evaluates to true or false. The subquery may return rows, but WHERE EXISTS ... only checks whether the number of returned rows is greater than 0 (=> true).
Oracle, MySQL, PostgreSQL:
The EXISTS condition is used in combination with a subquery and is considered "to be met" if the subquery returns at least one row.
(http://www.techonthenet.com)

The condition in the WHERE clause of your main query
SELECT DISTINCT PNAME FROM P
depends on EXISTS.
If the subquery returns any rows, EXISTS returns true; otherwise it returns false.
Because your subquery is correlated, this is evaluated for each row of P. Only with a non-correlated subquery does the main query return all records in P when EXISTS is true and nothing when it is false.

How do I write an SQL query to identify duplicate values in a specific field?

This is the table I'm working with:
I would like to identify only the ReviewIDs that have duplicate deduction IDs for different parameters.
For example, in the image above, ReviewID 114 has two different parameter IDs, but both records have the same deduction ID.
For my purposes, this record (ReviewID 114) has an error. There should not be two or more unique parameter IDs that have the same deduction ID for a single ReviewID.
I would like to write a query to identify these types of records, but my SQL skills aren't there yet. Help?
Thanks!
Update 1: I'm using TSQL (SQL Server 2008) if that helps
Update 2: The output that I'm looking for would be the same as the image above, minus any records that do not match the criteria I've described.
Cheers!

SELECT * FROM table t1 INNER JOIN (
SELECT review_id, deduction_id FROM table
GROUP BY review_id, deduction_id
HAVING COUNT(parameter_id) > 1
) t2 ON t1.review_id = t2.review_id AND t1.deduction_id = t2.deduction_id;
http://www.sqlfiddle.com/#!3/d858f/3
If it is possible to have exact duplicates and that is ok, you can modify the HAVING clause to COUNT(DISTINCT parameter_id).
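To make the difference between the two HAVING variants concrete, here is a hedged sketch using Python's sqlite3 module. The table name reviews and the sample rows are invented: review 114 has two different parameters with the same deduction (the error case), review 115 is clean, and review 116 contains one exact duplicate row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE reviews (review_id INT, deduction_id INT, parameter_id INT);
    INSERT INTO reviews VALUES
        (114, 5, 1), (114, 5, 2),   -- same deduction, different parameters
        (115, 5, 1), (115, 6, 2),   -- no repeated deduction
        (116, 7, 3), (116, 7, 3);   -- exact duplicate row
""")

def flagged(having_expr):
    """Return the review_ids whose (review_id, deduction_id) group passes HAVING."""
    sql = f"""
        SELECT review_id
        FROM reviews
        GROUP BY review_id, deduction_id
        HAVING {having_expr} > 1
        ORDER BY review_id
    """
    return [row[0] for row in conn.execute(sql)]

# COUNT(parameter_id) counts rows, so the exact duplicate is flagged as well.
count_all = flagged("COUNT(parameter_id)")

# COUNT(DISTINCT parameter_id) only flags groups with different parameters.
count_distinct = flagged("COUNT(DISTINCT parameter_id)")

print(count_all, count_distinct)
```

Which variant is right depends on whether an exact duplicate row should count as an error in your data.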

Select ReviewID, deduction_ID from Table
Group By ReviewID, deduction_ID
Having count(ReviewID) > 1
http://www.sqlfiddle.com/#!3/6e113/3 has an example

If I understand the criteria: For each combination of ReviewID and deduction_id you can have only one parameter_id and you want a query that produces a result without the ReviewIDs that break those rules (rather than identifying those rows that do). This will do that:
;WITH review_errors AS (
SELECT ReviewID
FROM test
GROUP BY ReviewID,deduction_ID
HAVING COUNT(DISTINCT parameter_id) > 1
)
SELECT t.*
FROM test t
LEFT JOIN review_errors r
ON t.ReviewID = r.ReviewID
WHERE r.ReviewID IS NULL
To explain: review_errors is a common table expression (think of it as a named sub-query that doesn't clutter up the main query). It selects the ReviewIDs that break the criteria. When you left join on it, it selects all rows from the left table regardless of whether they match the right table and only the rows from the right table that match the left table. Rows that do not match will have nulls in the columns for the right-hand table. By specifying WHERE r.ReviewID IS NULL you eliminate the rows from the left hand table that match the right hand table.

NOT IN vs NOT EXISTS

Which of these queries is the faster?
NOT EXISTS:
SELECT ProductID, ProductName
FROM Northwind..Products p
WHERE NOT EXISTS (
SELECT 1
FROM Northwind..[Order Details] od
WHERE p.ProductId = od.ProductId)
Or NOT IN:
SELECT ProductID, ProductName
FROM Northwind..Products p
WHERE p.ProductID NOT IN (
SELECT ProductID
FROM Northwind..[Order Details])
The query execution plan says they both do the same thing. If that is the case, which is the recommended form?
This is based on the NorthWind database.
[Edit]
Just found this helpful article:
http://weblogs.sqlteam.com/mladenp/archive/2007/05/18/60210.aspx
I think I'll stick with NOT EXISTS.

I always default to NOT EXISTS.
The execution plans may be the same at the moment but if either column is altered in the future to allow NULLs the NOT IN version will need to do more work (even if no NULLs are actually present in the data) and the semantics of NOT IN if NULLs are present are unlikely to be the ones you want anyway.
When neither Products.ProductID nor [Order Details].ProductID allows NULLs, the NOT IN will be treated identically to the following query.
SELECT ProductID,
ProductName
FROM Products p
WHERE NOT EXISTS (SELECT *
FROM [Order Details] od
WHERE p.ProductId = od.ProductId)
The exact plan may vary but for my example data I get the following.
A reasonably common misconception seems to be that correlated sub queries are always "bad" compared to joins. They certainly can be when they force a nested loops plan (sub query evaluated row by row) but this plan includes an anti semi join logical operator. Anti semi joins are not restricted to nested loops but can use hash or merge (as in this example) joins too.
/*Not valid syntax but better reflects the plan*/
SELECT p.ProductID,
p.ProductName
FROM Products p
LEFT ANTI SEMI JOIN [Order Details] od
ON p.ProductId = od.ProductId
If [Order Details].ProductID is NULL-able the query then becomes
SELECT ProductID,
ProductName
FROM Products p
WHERE NOT EXISTS (SELECT *
FROM [Order Details] od
WHERE p.ProductId = od.ProductId)
AND NOT EXISTS (SELECT *
FROM [Order Details]
WHERE ProductId IS NULL)
The reason for this is that the correct semantics, if [Order Details] contains any NULL ProductIds, is to return no results. The extra anti semi join and row count spool added to the plan implement this check.
If Products.ProductID is also changed to become NULL-able the query then becomes
SELECT ProductID,
ProductName
FROM Products p
WHERE NOT EXISTS (SELECT *
FROM [Order Details] od
WHERE p.ProductId = od.ProductId)
AND NOT EXISTS (SELECT *
FROM [Order Details]
WHERE ProductId IS NULL)
AND NOT EXISTS (SELECT *
FROM (SELECT TOP 1 *
FROM [Order Details]) S
WHERE p.ProductID IS NULL)
The reason for that one is that a NULL Products.ProductId should not be returned in the results unless the NOT IN sub query returns no results at all (i.e. the [Order Details] table is empty), in which case it should. In the plan for my sample data this is implemented by adding another anti semi join as below.
The effect of this is shown in the blog post already linked by Buckley. In the example there the number of logical reads increases from around 400 to 500,000.
Additionally the fact that a single NULL can reduce the row count to zero makes cardinality estimation very difficult. If SQL Server assumes that this will happen but in fact there were no NULL rows in the data the rest of the execution plan may be catastrophically worse, if this is just part of a larger query, with inappropriate nested loops causing repeated execution of an expensive sub tree for example.
This is not the only possible execution plan for a NOT IN on a NULL-able column however. This article shows another one for a query against the AdventureWorks2008 database.
For the NOT IN on a NOT NULL column or the NOT EXISTS against either a nullable or non nullable column it gives the following plan.
When the column changes to NULL-able the NOT IN plan now looks like
It adds an extra inner join operator to the plan. This apparatus is explained here. It is all there to convert the previous single correlated index seek on Sales.SalesOrderDetail.ProductID = <correlated_product_id> to two seeks per outer row. The additional one is on WHERE Sales.SalesOrderDetail.ProductID IS NULL.
As this is under an anti semi join if that one returns any rows the second seek will not occur. However if Sales.SalesOrderDetail does not contain any NULL ProductIDs it will double the number of seek operations required.

Also be aware that NOT IN is not equivalent to NOT EXISTS when it comes to null.
This post explains it very well
http://sqlinthewild.co.za/index.php/2010/02/18/not-exists-vs-not-in/
When the subquery returns even one null, NOT IN will not match any rows.
The reason for this can be found by looking at the details of what the NOT IN operation actually means.
Let’s say, for illustration purposes that there are 4 rows in the table called t, there’s a column called ID with values 1..4
WHERE SomeValue NOT IN (SELECT AVal FROM t)
is equivalent to
WHERE SomeValue != (SELECT AVal FROM t WHERE ID=1)
AND SomeValue != (SELECT AVal FROM t WHERE ID=2)
AND SomeValue != (SELECT AVal FROM t WHERE ID=3)
AND SomeValue != (SELECT AVal FROM t WHERE ID=4)
Let’s further say that AVal is NULL where ID = 4. Hence that != comparison returns UNKNOWN. The logical truth table for AND states that UNKNOWN and TRUE is UNKNOWN, UNKNOWN and FALSE is FALSE. There is no value that can be AND’d with UNKNOWN to produce the result TRUE.
Hence, if any row of that subquery returns NULL, the entire NOT IN operator will evaluate to either FALSE or NULL and no records will be returned.
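The quoted behaviour is easy to reproduce. The sketch below uses Python's sqlite3 module; the table t matches the quote (IDs 1..4, AVal NULL where ID = 4), while the candidates table and its values are assumptions added so there is something to filter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE candidates (SomeValue INT);
    CREATE TABLE t (ID INT, AVal INT);
    INSERT INTO candidates VALUES (1), (2), (5);
    INSERT INTO t VALUES (1, 1), (2, 2), (3, 3), (4, NULL);
""")

not_in_sql = """
    SELECT SomeValue FROM candidates
    WHERE SomeValue NOT IN (SELECT AVal FROM t)
"""
not_exists_sql = """
    SELECT SomeValue FROM candidates c
    WHERE NOT EXISTS (SELECT 1 FROM t WHERE t.AVal = c.SomeValue)
"""

# With the NULL present, 5 <> NULL is UNKNOWN, so NOT IN keeps nothing.
with_null_not_in = conn.execute(not_in_sql).fetchall()

# NOT EXISTS only asks whether a matching row exists, so 5 is returned.
with_null_not_exists = conn.execute(not_exists_sql).fetchall()

# Once the NULL row is gone, NOT IN gives the same answer as NOT EXISTS.
conn.execute("DELETE FROM t WHERE AVal IS NULL")
without_null_not_in = conn.execute(not_in_sql).fetchall()

print(with_null_not_in, with_null_not_exists, without_null_not_in)
```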

If the execution planner says they're the same, they're the same. Use whichever one will make your intention more obvious -- in this case, the second.

Actually, I believe this would be the fastest:
SELECT p.ProductID, p.ProductName
FROM Northwind..Products p
LEFT OUTER JOIN Northwind..[Order Details] od ON p.ProductId = od.ProductId
WHERE od.ProductId IS NULL
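Whether this form is actually faster depends on the engine and the data, but it can at least be checked that it returns the same rows as NOT EXISTS. A minimal sketch with Python's sqlite3 module follows; the table OrderDetails (standing in for [Order Details]) and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Products (ProductID INT PRIMARY KEY, ProductName TEXT);
    CREATE TABLE OrderDetails (OrderID INT, ProductID INT);
    INSERT INTO Products VALUES (1, 'Chai'), (2, 'Chang'), (3, 'Aniseed Syrup');
    INSERT INTO OrderDetails VALUES (10, 1), (11, 1), (12, 2);
""")

# Anti-join written as LEFT OUTER JOIN ... IS NULL.
left_join_rows = conn.execute("""
    SELECT p.ProductID, p.ProductName
    FROM Products p
    LEFT OUTER JOIN OrderDetails od ON p.ProductID = od.ProductID
    WHERE od.ProductID IS NULL
""").fetchall()

# The same question asked with NOT EXISTS.
not_exists_rows = conn.execute("""
    SELECT p.ProductID, p.ProductName
    FROM Products p
    WHERE NOT EXISTS (
        SELECT 1 FROM OrderDetails od WHERE od.ProductID = p.ProductID)
""").fetchall()

print(left_join_rows, not_exists_rows)
```

Both queries return only the product that never appears in an order; any performance difference has to be measured on the real database.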

I have a table which has about 120,000 records and need to select only those which do not exist (matched on a varchar column) in four other tables with approximately 1500, 4000, 40000 and 200 rows. All the involved tables have a unique index on the varchar column concerned.
NOT IN took about 10 mins, NOT EXISTS took 4 secs.
I have a recursive query which might have had some untuned section that contributed to the 10 mins, but the other option taking 4 secs shows, at least to me, that NOT EXISTS is far better, or at least that IN and EXISTS are not exactly the same and it is always worth a check before going ahead with code.

I was using
SELECT * from TABLE1 WHERE Col1 NOT IN (SELECT Col1 FROM TABLE2)
and found that it was giving wrong results (by wrong I mean no results), as there was a NULL in TABLE2.Col1.
Changing the query to
SELECT * from TABLE1 T1 WHERE NOT EXISTS (SELECT Col1 FROM TABLE2 T2 WHERE T1.Col1 = T2.Col1)
gave me the correct results.
Since then I have started using NOT EXISTS everywhere.

In your specific example they are the same, because the optimizer has figured out that what you are trying to do is the same in both examples. But it is possible that in non-trivial examples the optimizer may not do this, and in that case there are reasons to prefer one to the other on occasion.
NOT IN should be preferred if you are testing multiple rows in your outer select. The subquery inside the NOT IN statement can be evaluated at the beginning of the execution, and the temporary table can be checked against each value in the outer select, rather than re-running the subselect every time as would be required with the NOT EXISTS statement.
If the subquery must be correlated with the outer select, then NOT EXISTS may be preferable, since the optimizer may discover a simplification that prevents the creation of any temporary tables to perform the same function.

Database table model
Let’s assume we have the following two tables in our database, that form a one-to-many table relationship.
The student table is the parent, and the student_grade is the child table since it has a student_id Foreign Key column referencing the id Primary Key column in the student table.
The student table contains the following two records:
id | first_name | last_name | admission_score
1 | Alice | Smith | 8.95
2 | Bob | Johnson | 8.75
And, the student_grade table stores the grades the students received:
id | class_name | grade | student_id
1 | Math | 10 | 1
2 | Math | 9.5 | 1
3 | Math | 9.75 | 1
4 | Science | 9.5 | 1
5 | Science | 9 | 1
6 | Science | 9.25 | 1
7 | Math | 8.5 | 2
8 | Math | 9.5 | 2
9 | Math | 9 | 2
10 | Science | 10 | 2
11 | Science | 9.4 | 2
SQL EXISTS
Let’s say we want to get all students that have received a 10 grade in Math class.
If we are only interested in the student identifier, then we can run a query like this one:
SELECT
student_grade.student_id
FROM
student_grade
WHERE
student_grade.grade = 10 AND
student_grade.class_name = 'Math'
ORDER BY
student_grade.student_id
But, the application is interested in displaying the full name of a student, not just the identifier, so we need info from the student table as well.
In order to filter the student records that have a 10 grade in Math, we can use the EXISTS SQL operator, like this:
SELECT
id, first_name, last_name
FROM
student
WHERE EXISTS (
SELECT 1
FROM
student_grade
WHERE
student_grade.student_id = student.id AND
student_grade.grade = 10 AND
student_grade.class_name = 'Math'
)
ORDER BY id
When running the query above, we can see that only the Alice row is selected:
id | first_name | last_name
1 | Alice | Smith
The outer query selects the student row columns we are interested in returning to the client. However, the WHERE clause is using the EXISTS operator with an associated inner subquery.
The EXISTS operator returns true if the subquery returns at least one record and false if no row is selected. The database engine does not have to run the subquery entirely. If a single record is matched, the EXISTS operator returns true, and the associated outer query row is selected.
The inner subquery is correlated because the student_id column of the student_grade table is matched against the id column of the outer student table.
SQL NOT EXISTS
Let’s consider we want to select all students that have no grade lower than 9. For this, we can use NOT EXISTS, which negates the logic of the EXISTS operator.
Therefore, the NOT EXISTS operator returns true if the underlying subquery returns no record. However, if a single record is matched by the inner subquery, the NOT EXISTS operator will return false, and the subquery execution can be stopped.
To match all student records that have no associated student_grade with a value lower than 9, we can run the following SQL query:
SELECT
id, first_name, last_name
FROM
student
WHERE NOT EXISTS (
SELECT 1
FROM
student_grade
WHERE
student_grade.student_id = student.id AND
student_grade.grade < 9
)
ORDER BY id
When running the query above, we can see that only the Alice record is matched:
id | first_name | last_name
1 | Alice | Smith
So, the advantage of using the SQL EXISTS and NOT EXISTS operators is that the inner subquery execution can be stopped as soon as a matching record is found.

They are very similar but not really the same.
In terms of efficiency, I've found the LEFT JOIN ... IS NULL form more efficient (when a large number of rows are to be selected, that is).

If the optimizer says they are the same then consider the human factor. I prefer to see NOT EXISTS :)

It depends..
SELECT x.col
FROM big_table x
WHERE x.key IN( SELECT key FROM really_big_table );
would be relatively slow, because there isn't much to limit the size of what the query has to check to see whether the key is in it. EXISTS would be preferable in this case.
But, depending on the DBMS's optimizer, this could be no different.
As an example of when EXISTS is better:
SELECT x.col
FROM big_table x
WHERE EXISTS( SELECT key FROM really_big_table WHERE key = x.key)
AND id = very_limiting_criteria;
