Difference between SQL EXISTS and IN

What is the difference between the EXISTS and IN clause in SQL?
When should we use EXISTS, and when should we use IN?

The exists keyword can be used in that way, but really it's intended as a way to avoid counting:
--this statement needs to check the entire table
select count(*) from [table] where ...
--this statement is true as soon as one match is found
exists ( select * from [table] where ... )
This is most useful where you have if conditional statements, as exists can be a lot quicker than count.
The in is best used where you have a static list to pass:
select * from [table]
where [field] in (1, 2, 3)
When you have a table in an in statement it makes more sense to use a join, but mostly it shouldn't matter. The query optimiser should return the same plan either way. In some implementations (mostly older, such as Microsoft SQL Server 2000) in queries will always get a nested join plan, while join queries will use nested, merge or hash as appropriate. More modern implementations are smarter and can adjust the plan even when in is used.

EXISTS will tell you whether a query returned any results. e.g.:
SELECT *
FROM Orders o
WHERE EXISTS (
SELECT *
FROM Products p
WHERE p.ProductNumber = o.ProductNumber)
IN is used to compare one value to several, and can use literal values, like this:
SELECT *
FROM Orders
WHERE ProductNumber IN (1, 10, 100)
You can also use query results with the IN clause, like this:
SELECT *
FROM Orders
WHERE ProductNumber IN (
SELECT ProductNumber
FROM Products
WHERE ProductInventoryQuantity > 0)

With a rule-based optimizer:
EXISTS is much faster than IN when the subquery result set is very large.
IN is faster than EXISTS when the subquery result set is very small.
With a cost-based optimizer:
There is no difference.

I'm assuming you know what they do, and thus are used differently, so I'm going to understand your question as: When would it be a good idea to rewrite the SQL to use IN instead of EXISTS, or vice versa.
Is that a fair assumption?
Edit: The reason I'm asking is that in many cases you can rewrite an SQL based on IN to use an EXISTS instead, and vice versa, and for some database engines, the query optimizer will treat the two differently.
For instance:
SELECT *
FROM Customers
WHERE EXISTS (
SELECT *
FROM Orders
WHERE Orders.CustomerID = Customers.ID
)
can be rewritten to:
SELECT *
FROM Customers
WHERE ID IN (
SELECT CustomerID
FROM Orders
)
or with a join:
SELECT Customers.*
FROM Customers
INNER JOIN Orders ON Customers.ID = Orders.CustomerID
So my question still stands: is the original poster wondering what IN and EXISTS do, and thus how to use them, or is he asking whether rewriting an SQL statement that uses IN to use EXISTS instead, or vice versa, is a good idea?
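The three rewrites above do not always return the same rows, which a small experiment makes visible. The following is a minimal sketch that assumes SQLite through Python's sqlite3 module; the Name and OrderID columns and the sample rows are invented for illustration and are not part of the original answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers (ID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, CustomerID INTEGER);
    INSERT INTO Customers VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cid');
    -- Ann has two orders, Bob has one, Cid has none.
    INSERT INTO Orders VALUES (10, 1), (11, 1), (12, 2);
""")

exists_rows = conn.execute("""
    SELECT Name FROM Customers
    WHERE EXISTS (SELECT * FROM Orders WHERE Orders.CustomerID = Customers.ID)
    ORDER BY Name
""").fetchall()

in_rows = conn.execute("""
    SELECT Name FROM Customers
    WHERE ID IN (SELECT CustomerID FROM Orders)
    ORDER BY Name
""").fetchall()

join_rows = conn.execute("""
    SELECT Customers.Name FROM Customers
    INNER JOIN Orders ON Customers.ID = Orders.CustomerID
    ORDER BY Customers.Name
""").fetchall()

print(exists_rows)  # one row per customer that has at least one order
print(in_rows)      # same rows as the EXISTS form
print(join_rows)    # one row per matching order, so Ann appears twice
```

The EXISTS and IN forms act as a filter on Customers, while the plain INNER JOIN multiplies a customer by the number of matching orders, so the join form needs DISTINCT (or a unique join key) to be a true equivalent.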

EXISTS is much faster than IN when the subquery result set is very large.
IN is faster than EXISTS when the subquery result set is very small.
CREATE TABLE t1 (id INT, title VARCHAR(20), someIntCol INT)
GO
CREATE TABLE t2 (id INT, t1Id INT, someData VARCHAR(20))
GO
INSERT INTO t1
SELECT 1, 'title 1', 5 UNION ALL
SELECT 2, 'title 2', 5 UNION ALL
SELECT 3, 'title 3', 5 UNION ALL
SELECT 4, 'title 4', 5 UNION ALL
SELECT null, 'title 5', 5 UNION ALL
SELECT null, 'title 6', 5
INSERT INTO t2
SELECT 1, 1, 'data 1' UNION ALL
SELECT 2, 1, 'data 2' UNION ALL
SELECT 3, 2, 'data 3' UNION ALL
SELECT 4, 3, 'data 4' UNION ALL
SELECT 5, 3, 'data 5' UNION ALL
SELECT 6, 3, 'data 6' UNION ALL
SELECT 7, 4, 'data 7' UNION ALL
SELECT 8, null, 'data 8' UNION ALL
SELECT 9, 6, 'data 9' UNION ALL
SELECT 10, 6, 'data 10' UNION ALL
SELECT 11, 8, 'data 11'
Query 1
SELECT
FROM t1
WHERE not EXISTS (SELECT * FROM t2 WHERE t1.id = t2.t1id)
Query 2
SELECT t1.*
FROM t1
WHERE t1.id not in (SELECT t2.t1id FROM t2 )
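If id in t1 contains NULL values, Query 1 returns those rows (here 'title 5' and 'title 6'), but Query 2 does not: with this data it returns no rows at all, because t2.t1id also contains a NULL.
A comparison with NULL evaluates to unknown rather than true or false, so NOT IN never evaluates to true for a NULL id, or for any id once the subquery returns a NULL. NOT EXISTS only checks whether the subquery produced a row, so it is unaffected.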
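The NULL behaviour of the two queries can be checked with a short script. This is a sketch under assumptions: it uses SQLite through Python's sqlite3 module instead of the SQL Server setup above (so there are no GO separators), loads the same rows with multi-row INSERT statements, and selects only the title column to keep the output short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (id INT, title VARCHAR(20), someIntCol INT);
    CREATE TABLE t2 (id INT, t1Id INT, someData VARCHAR(20));
    INSERT INTO t1 VALUES
        (1, 'title 1', 5), (2, 'title 2', 5), (3, 'title 3', 5),
        (4, 'title 4', 5), (NULL, 'title 5', 5), (NULL, 'title 6', 5);
    INSERT INTO t2 VALUES
        (1, 1, 'data 1'), (2, 1, 'data 2'), (3, 2, 'data 3'),
        (4, 3, 'data 4'), (5, 3, 'data 5'), (6, 3, 'data 6'),
        (7, 4, 'data 7'), (8, NULL, 'data 8'), (9, 6, 'data 9'),
        (10, 6, 'data 10'), (11, 8, 'data 11');
""")

# Query 1: NOT EXISTS keeps the rows for which the subquery finds nothing.
not_exists_titles = [r[0] for r in conn.execute("""
    SELECT t1.title FROM t1
    WHERE NOT EXISTS (SELECT * FROM t2 WHERE t1.id = t2.t1Id)
    ORDER BY t1.title
""")]

# Query 2: NOT IN compares against every subquery value, including NULL.
not_in_titles = [r[0] for r in conn.execute("""
    SELECT t1.title FROM t1
    WHERE t1.id NOT IN (SELECT t2.t1Id FROM t2)
    ORDER BY t1.title
""")]

print(not_exists_titles)  # the two rows whose id is NULL
print(not_in_titles)      # empty: the NULL in t2.t1Id makes NOT IN unknown
```

Adding WHERE t2.t1Id IS NOT NULL to the NOT IN subquery is the usual way to stop the subquery's NULL from emptying the result, although rows of t1 with a NULL id would still be excluded.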

If you use the IN operator, the SQL engine will scan all records fetched from the inner query. If you use EXISTS, on the other hand, the SQL engine will stop scanning as soon as it finds a match.

IN supports only equality relations (or inequality when preceded by NOT).
It is a synonym to =any / =some, e.g
select *
from t1
where x in (select x from t2)
;
EXISTS supports various kinds of relations that cannot be expressed using IN, e.g. -
select *
from t1
where exists (select null
from t2
where t2.x=t1.x
and t2.y>t1.y
and t2.z like '%' || t1.z || '%'
)
;
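To see the difference between the two forms on concrete rows, here is a minimal sketch that assumes SQLite through Python's sqlite3 module. The tables reuse the t1/t2 and x/y/z names from the queries above, but the column types and sample values are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (x INTEGER, y INTEGER, z TEXT);
    CREATE TABLE t2 (x INTEGER, y INTEGER, z TEXT);
    INSERT INTO t1 VALUES (1, 10, 'ab'), (2, 10, 'cd'), (3, 10, 'ef');
    INSERT INTO t2 VALUES (1, 20, 'xabx'), (2, 5, 'cd'), (3, 20, 'zz');
""")

# IN can only test equality on x.
in_xs = [r[0] for r in conn.execute(
    "SELECT x FROM t1 WHERE x IN (SELECT x FROM t2) ORDER BY x")]

# EXISTS can combine equality with a range test and a pattern match.
exists_xs = [r[0] for r in conn.execute("""
    SELECT x FROM t1
    WHERE EXISTS (SELECT NULL
                  FROM t2
                  WHERE t2.x = t1.x
                    AND t2.y > t1.y
                    AND t2.z LIKE '%' || t1.z || '%')
    ORDER BY x
""")]

print(in_xs)      # every x has an equal x in t2
print(exists_xs)  # only x = 1 also satisfies the y and z conditions
```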
And on a different note -
The alleged performance and technical differences between EXISTS and IN may result from a specific vendor's implementation, limitations, or bugs, but many times they are nothing but myths created by a lack of understanding of database internals.
The table definitions, the accuracy of statistics, the database configuration, and the optimizer version all have an impact on the execution plan and therefore on the performance metrics.

The EXISTS keyword evaluates to true or false, but the IN keyword compares all values in the corresponding subquery column.
Also, SELECT 1 can be used with EXISTS. Example:
SELECT * FROM Temp1 where exists(select 1 from Temp2 where conditions...)
But IN is less efficient, so EXISTS is faster.

I think,
EXISTS is when you need to match the results of query with another subquery.
Query#1 results need to be retrieved where SubQuery results match. Kind of a Join..
E.g. select customers table#1 who have placed orders table#2 too
IN is to retrieve if the value of a specific column lies IN a list (1,2,3,4,5)
E.g. Select customers who lie in the following zipcodes i.e. zip_code values lies in (....) list.
When to use one over the other... when you feel it reads appropriately (Communicates intent better).

As far as I know, when a subquery returns a NULL value, the whole comparison evaluates to NULL. In those cases we use the EXISTS keyword. If we want to compare particular values against the subquery, we use the IN keyword.

Which one is faster depends on the number of rows fetched by the inner query:
When your inner query fetches thousands of rows, EXISTS would be the better choice.
When your inner query fetches few rows, IN will be faster.
EXISTS evaluates to true or false, but IN compares multiple values. When you don't know whether the record exists, you should choose EXISTS.

Difference lies here:
select *
from abcTable
where exists (select null)
Above query will return all the records while below one would return empty.
select *
from abcTable
where abcTable_ID in (select null)
Give it a try and observe the output.
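For readers without a database at hand, the experiment can be reproduced with a few lines. This is a sketch that assumes SQLite through Python's sqlite3 module; the three sample rows in abcTable are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE abcTable (abcTable_ID INTEGER);
    INSERT INTO abcTable VALUES (1), (2), (3);
""")

# The subquery returns one row (containing NULL), so EXISTS is true.
exists_count = len(conn.execute(
    "SELECT * FROM abcTable WHERE EXISTS (SELECT NULL)").fetchall())

# Comparing any id with NULL is unknown, so no row passes the filter.
in_count = len(conn.execute(
    "SELECT * FROM abcTable WHERE abcTable_ID IN (SELECT NULL)").fetchall())

print(exists_count, in_count)
```

EXISTS only asks whether the subquery produced a row, whatever that row contains, while IN compares values and therefore inherits NULL's three-valued logic.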

The reason is that the EXISTS operator works on the “at least one found” principle. It returns true and stops scanning the table once at least one matching row is found.
On the other hand, when the IN operator is combined with a subquery, MySQL must process the subquery first, and then use the result of the subquery to process the whole query.
The general rule of thumb is that if the subquery contains a large volume of data, the EXISTS operator provides better performance.
However, the query that uses the IN operator will perform faster if the result set returned from the subquery is very small.

In certain circumstances, it is better to use IN rather than EXISTS. In general, if the selective predicate is in the subquery, then use IN. If the selective predicate is in the parent query, then use EXISTS.
https://docs.oracle.com/cd/B19306_01/server.102/b14211/sql_1016.htm#i28403

My understanding is that both should behave the same as long as we are not dealing with NULL values.
It is the same reason why a query does not return rows for = NULL but does for IS NULL.
http://sqlinthewild.co.za/index.php/2010/02/18/not-exists-vs-not-in/
As far as the boolean vs. comparator argument goes, to generate a boolean both values need to be compared, and that is how any if condition works. So I fail to understand how IN and EXISTS behave differently.

If a subquery returns more than one value, you may need the outer query to return a row when the value in the column specified in the condition matches any value in the subquery's result set. To do this, use the IN keyword.
You can use a subquery to check whether a set of records exists. For this, use the EXISTS clause with a subquery. The EXISTS keyword always returns a true or false value.

I believe this has a straightforward answer. Why don't you check it from the people who developed that function in their systems?
If you are a MS SQL developer, here is the answer directly from Microsoft.
IN:
Determines whether a specified value matches any value in a subquery or a list.
EXISTS:
Specifies a subquery to test for the existence of rows.

I found that using EXISTS keyword is often really slow (that is very true in Microsoft Access).
I instead use the join operator in this manner :
should-i-use-the-keyword-exists-in-sql

If you can use where in instead of where exists, then where in is probably faster.
Both where in and where exists go through all rows of the parent result. The difference is that where exists causes a lot of dependent subqueries. If you can avoid dependent subqueries, then where in is the better choice.
Example
Assume we have 10,000 companies, each has 10 users (thus our users table has 100,000 entries). Now assume you want to find a user by his name or his company name.
The following query using where exists has an execution time of 141 ms:
select * from `users`
where `first_name` ='gates'
or exists
(
select * from `companies`
where `users`.`company_id` = `companies`.`id`
and `name` = 'gates'
)
This happens because a dependent subquery is executed for each user.
However, if we avoid the exists query and write it using:
select * from `users`
where `first_name` ='gates'
or users.company_id in
(
select id from `companies`
where `name` = 'gates'
)
Then dependent subqueries are avoided and the query runs in 0.012 ms.

EXISTS is faster than IN.
If most of the filter criteria are in the subquery, it is better to use IN; if most of the filter criteria are in the main query, it is better to use EXISTS.

Related

Does PostgreSQL short-circuit its BOOL_OR() evaluation?

EXISTS is faster than COUNT(*) because it can be short-circuited
A lot of times, I like to check for existence of things in SQL. For instance, I do:
-- PostgreSQL syntax, SQL standard syntax:
SELECT EXISTS (SELECT .. FROM some_table WHERE some_boolean_expression)
-- Oracle syntax
SELECT CASE
WHEN EXISTS (SELECT .. FROM some_table WHERE some_boolean_expression) THEN 1
ELSE 0
END
FROM dual
In most databases, EXISTS is "short-circuited", i.e. the database can stop looking for rows in the table as soon as it has found one row. This is usually much faster than comparing COUNT(*) >= 1 as can be seen in this blog post.
Using EXISTS with GROUP BY
Sometimes, I'd like to do this for each group in a GROUP BY query, i.e. I'd like to "aggregate" the existence value. There's no EXISTS aggregate function, but PostgreSQL luckily supports the BOOL_OR() aggregate function, like in this statement:
SELECT something, bool_or (some_boolean_expression)
FROM some_table
GROUP BY something
The documentation mentions something about COUNT(*) being slow because of the obvious sequential scan needed to calculate the count. But unfortunately, it doesn't say anything about BOOL_OR() being short-circuited. Is it the case? Does BOOL_OR() stop aggregating new values as soon as it encounters the first TRUE value?
If you want to check for existence, I generally use a LIMIT/FETCH FIRST 1 ROW ONLY query:
SELECT .. FROM some_table WHERE some_boolean_expression
FETCH FIRST 1 ROW ONLY
This generally stops execution after the first hit.
The same technique can be applied using LATERAL for each row (group) from another table.
SELECT *
FROM (SELECT something
FROM some_table
GROUP BY something
) t1
LEFT JOIN LATERAL (SELECT ...
FROM ...
WHERE ...
FETCH FIRST 1 ROW ONLY) t2
ON (true)
In t2 you can use a WHERE clause that matches any row for the group. It's executed only once per group and aborted as soon as the first hit was found. However, whether this performs better or worse depends on your search predicates and indexing, of course.

SQL: how do you look for missing ids?

Suppose I have a table with lots of rows identified by a unique ID. Now I have a (rather large) user-input list of ids (not a table) that I want to check are already in the database.
So I want to output the ids that are in my list, but not in the table. How do I do that with SQL?
EDIT: I know I can do that with a temporary table, but I'd really like to avoid that if possible.
EDIT: Same comment for using an external programming language.
Try with this:
SELECT t1.id FROM your_list t1
LEFT JOIN your_table t2
ON t1.id = t2.id
WHERE t2.id IS NULL
It is hardly possible to write a single pure, general SQL query for your task, since it requires working with a list (which is not a relational concept, and the standard set of list operations is too limited). For some DBMSs it is possible to write a single SQL query, but it will rely on that DBMS's SQL dialect and will be specific to it.
You haven't mentioned:
which RDBMS will be used;
what is the source of the IDs.
So I will consider PostgreSQL is used, and IDs to be checked are loaded into a (temporary) table.
Consider the following:
CREATE TABLE test (id integer, value char(1));
INSERT INTO test VALUES (1,'1'), (2,'2'), (3,'3');
CREATE TABLE temp_table (id integer);
INSERT INTO temp_table VALUES (1),(5),(10);
You can get your results like this:
SELECT * FROM temp_table WHERE NOT EXISTS (
SELECT id FROM test WHERE id = temp_table.id);
or
SELECT * FROM temp_table WHERE id NOT IN (SELECT id FROM test);
or
SELECT * FROM temp_table LEFT JOIN test USING (id) WHERE test.id IS NULL;
You can pick any option, depending on your volumes you may have different performance.
Just a note: some RDBMS may have limitation on the number of expressions specified literally inside IN() construct, keep this in mind (I hit this several times with ORACLE).
EDIT: In order to match constraints of no temp tables and no external languages you can use the following construct:
SELECT DISTINCT b.id
FROM test a RIGHT JOIN (
SELECT 1 id UNION ALL
SELECT 5 UNION ALL
SELECT 10) b ON a.id=b.id
WHERE a.id IS NULL;
Unfortunately, you'll have to generate lots of SELECT x UNION ALL entries to make a single-column, many-row table here. I use UNION ALL to avoid an unnecessary sorting step.
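When the list comes from an application, the inline table can be generated instead of typed by hand. The following is a minimal sketch, not the answer's own method: it assumes SQLite through Python's sqlite3 module, reuses the test table from above, and introduces a hypothetical helper missing_ids that builds a VALUES list with one bound parameter per id.

```python
import sqlite3

def missing_ids(conn, candidate_ids):
    """Return the candidate ids that have no row in the test table."""
    if not candidate_ids:
        return []
    # One "(?)" per id builds an inline VALUES table; the ids themselves
    # are passed as bound parameters rather than spliced into the SQL text.
    placeholders = ", ".join("(?)" for _ in candidate_ids)
    sql = f"""
        WITH wanted(id) AS (VALUES {placeholders})
        SELECT DISTINCT wanted.id
        FROM wanted
        WHERE NOT EXISTS (SELECT 1 FROM test WHERE test.id = wanted.id)
        ORDER BY wanted.id
    """
    return [row[0] for row in conn.execute(sql, list(candidate_ids))]

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE test (id INTEGER, value CHAR(1));
    INSERT INTO test VALUES (1, '1'), (2, '2'), (3, '3');
""")

print(missing_ids(conn, [1, 5, 10]))
```

SQLite caps the number of bound parameters per statement, so a very large list would have to be sent in batches; other engines have comparable limits.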

SQL select from data in query where this data is not already in the database?

I want to check my database for records that I already have recorded before making a web service call.
Here is what I imagine the query to look like, I just can't seem to figure out the syntax.
SELECT *
FROM (1,2,3,4) as temp_table
WHERE temp_table.id
LEFT JOIN table ON id IS NULL
Is there a way to do this? What is a query like this called?
I want to pass in a list of id's to mysql and i want it to spit out the id's that are not already in the database?
Use:
SELECT x.id
FROM (SELECT @param_1 AS id
FROM DUAL
UNION ALL
SELECT @param_2
FROM DUAL
UNION ALL
SELECT @param_3
FROM DUAL
UNION ALL
SELECT @param_4
FROM DUAL) x
LEFT JOIN TABLE t ON t.id = x.id
-- keep only the parameter values that have no matching row in the table
WHERE t.id IS NULL
If you need to support a varying number of parameters, you can either use:
a temporary table to populate & join to
MySQL's Prepared Statements to dynamically construct the UNION ALL statement
To confirm I've understood correctly, you want to pass in a list of numbers and see which of those numbers isn't present in the existing table? In effect:
SELECT Item
FROM IDList I
LEFT JOIN TABLE T ON I.Item=T.ID
WHERE T.ID IS NULL
You look like you're OK with building this query on the fly, in which case you can do this with a numbers / tally table by changing the above into
SELECT Number
FROM (SELECT Number FROM Numbers WHERE Number IN (1,2,3,4)) I
LEFT JOIN TABLE T ON I.Number=T.ID
WHERE T.ID IS NULL
This is relatively prone to SQL Injection attacks though because of the way the query is being built. It'd be better if you could pass in '1,2,3,4' as a string and split it into sections to generate your numbers list to join against in a safer way - for an example of how to do that, see http://www.sqlteam.com/article/parsing-csv-values-into-multiple-rows
All of this presumes you've got a numbers / tally table in your database, but they're sufficiently useful in general that I'd strongly recommend you do.
SELECT * FROM table where id NOT IN (1,2,3,4)
I would probably just do:
SELECT id
FROM table
WHERE id IN (1,2,3,4);
And then process the list of results, removing any returned by the query from your list of "records to submit".
How about a nested query? This may work. If not, it may get you in the right direction.
SELECT * FROM table WHERE id NOT IN (
SELECT id FROM table WHERE 1
);

When to use EXCEPT as opposed to NOT EXISTS in Transact SQL?

I just recently learned of the existence of the new "EXCEPT" clause in SQL Server (a bit late, I know...) through reading code written by a co-worker. It truly amazed me!
But then I have some questions regarding its usage: when is it recommended to be employed? Is there a difference, performance-wise, between using it versus a correlated query employing "AND NOT EXISTS..."?
After reading EXCEPT's article in the BOL I thought it was just a shorthand for the second option, but was surprised when I rewrote a couple queries using it (so they had the "AND NOT EXISTS" syntax much more familiar to me) and then checked the execution plans - surprise! The EXCEPT version had a shorter execution plan, and executed faster, also. Is this always so?
So I'd like to know: what are the guidelines for using this powerful tool?
EXCEPT treats NULL values as matching.
This query:
WITH q (value) AS
(
SELECT NULL
UNION ALL
SELECT 1
),
p (value) AS
(
SELECT NULL
UNION ALL
SELECT 2
)
SELECT *
FROM q
WHERE value NOT IN
(
SELECT value
FROM p
)
will return an empty rowset.
This query:
WITH q (value) AS
(
SELECT NULL
UNION ALL
SELECT 1
),
p (value) AS
(
SELECT NULL
UNION ALL
SELECT 2
)
SELECT *
FROM q
WHERE NOT EXISTS
(
SELECT NULL
FROM p
WHERE p.value = q.value
)
will return
NULL
1
, and this one:
WITH q (value) AS
(
SELECT NULL
UNION ALL
SELECT 1
),
p (value) AS
(
SELECT NULL
UNION ALL
SELECT 2
)
SELECT *
FROM q
EXCEPT
SELECT *
FROM p
will return:
1
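The three results above can be reproduced outside SQL Server as well. This is a sketch that assumes SQLite through Python's sqlite3 module, which follows the same NULL rules for these three constructs; an ORDER BY is added to the NOT EXISTS query only to make the row order deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The same two single-column row sets as in the queries above.
ctes = """
    WITH q (value) AS (SELECT NULL UNION ALL SELECT 1),
         p (value) AS (SELECT NULL UNION ALL SELECT 2)
"""

not_in_rows = conn.execute(
    ctes + "SELECT value FROM q WHERE value NOT IN (SELECT value FROM p)"
).fetchall()

not_exists_rows = conn.execute(
    ctes + """SELECT value FROM q
              WHERE NOT EXISTS (SELECT NULL FROM p WHERE p.value = q.value)
              ORDER BY value"""
).fetchall()

except_rows = conn.execute(
    ctes + "SELECT value FROM q EXCEPT SELECT value FROM p"
).fetchall()

print(not_in_rows)      # empty: the NULL in p makes every NOT IN unknown
print(not_exists_rows)  # NULL and 1: NULL = NULL is never true, so nothing matches
print(except_rows)      # only 1: EXCEPT treats the two NULLs as the same value
```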
Recursive reference is also allowed in EXCEPT clause in a recursive CTE, though it behaves in a strange way: it returns everything except the last row of a previous set, not everything except the whole previous set:
WITH q (value) AS
(
SELECT 1
UNION ALL
SELECT 2
UNION ALL
SELECT 3
),
rec (value) AS
(
SELECT value
FROM q
UNION ALL
SELECT *
FROM (
SELECT value
FROM q
EXCEPT
SELECT value
FROM rec
) q2
)
SELECT TOP 10 *
FROM rec
---
1
2
3
-- original set
1
2
-- everything except the last row of the previous set, that is 3
1
3
-- everything except the last row of the previous set, that is 2
1
2
-- everything except the last row of the previous set, that is 3, etc.
1
SQL Server developers must just have forgotten to forbid it.
I have done a lot of analysis of except, not exists, not in and left outer join. Generally the left outer join is the fastest for finding missing rows, especially joining on a primary key. Not In can be very fast if you know it will be a small list returned in the select.
I use EXCEPT a lot to compare what is being returned when rewriting code. Run the old code saving results. Run new code saving results and then use except to capture all differences. It is a very quick and easy way to find differences, especially when needing to get all differences including null. Very good for on the fly easy coding.
But every situation is different. I say this to every developer I have ever mentored: try it. Do timings in all the different ways. Try it, time it, do it.
EXCEPT compares all (paired)columns of two full-selects.
NOT EXISTS compares two or more tables according to the conditions specified in the WHERE clause of the subquery following the NOT EXISTS keyword.
EXCEPT can be rewritten by using NOT EXISTS.
(EXCEPT ALL can be rewritten by using ROW_NUMBER and NOT EXISTS.)
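A faithful rewrite of EXCEPT needs two things that a plain NOT EXISTS lacks: duplicate removal and NULL-safe comparison of every column. The sketch below assumes SQLite through Python's sqlite3 module, where the IS operator acts as a NULL-safe equality test; other dialects spell this differently (for example IS NOT DISTINCT FROM, or an explicit OR ... IS NULL test). The tables a and b and their rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (x INTEGER, y TEXT);
    CREATE TABLE b (x INTEGER, y TEXT);
    INSERT INTO a VALUES (1, 'a'), (1, 'a'), (2, NULL), (3, 'c');
    INSERT INTO b VALUES (2, NULL), (4, 'd');
""")

except_rows = conn.execute(
    "SELECT x, y FROM a EXCEPT SELECT x, y FROM b ORDER BY x"
).fetchall()

# DISTINCT mirrors EXCEPT's duplicate removal; IS makes NULLs compare as equal.
null_safe_rows = conn.execute("""
    SELECT DISTINCT x, y FROM a
    WHERE NOT EXISTS (SELECT 1 FROM b WHERE b.x IS a.x AND b.y IS a.y)
    ORDER BY x
""").fetchall()

# With plain "=", the (2, NULL) row never matches and wrongly survives.
plain_equals_rows = conn.execute("""
    SELECT DISTINCT x, y FROM a
    WHERE NOT EXISTS (SELECT 1 FROM b WHERE b.x = a.x AND b.y = a.y)
    ORDER BY x
""").fetchall()

print(except_rows)
print(null_safe_rows)
print(plain_equals_rows)
```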
Got this from here
There is no accounting for SQL server's execution plans. I have always found when having performance issues that it was utterly arbitrary (from a user's perspective, I'm sure the algorithm writers would understand why) when one syntax made a better execution plan rather than another.
In this case, something about the query parameter comparison allows SQL to figure out a shortcut that it couldn't from a straight select statement. I'm sure that is a deficiency in the algorithm. In other words, you could logically interpolate the same thing, but the algorithm doesn't make that translation on an exists query. Sometimes that is because an algorithm that could reliably figure it out would take longer to execute than the query itself, or at least the algorithm designer thought so.
If your query is fine-tuned, there is no performance difference between the EXCEPT clause and NOT EXISTS / NOT IN. The first time I ran EXCEPT after converting my correlated query to it, I was surprised because it returned the result in just 7 seconds, while the correlated query took 22 seconds. Then I added a DISTINCT clause to my correlated query and reran it; it also returned in 7 seconds. So EXCEPT is good when you don't know how, or don't have time, to fine-tune your query; otherwise both perform the same.

What do you put in a subquery's Select part when it's preceded by Exists?

What do you put in a subquery's Select part when it's preceded by Exists?
Select *
From some_table
Where Exists (Select 1
From some_other_table
Where some_condition )
I usually use 1, I used to put * but realized it could add some useless overhead.
What do you put? is there a more efficient way than putting 1 or any other dummy value?
I think the efficiency depends on your platform.
In Oracle, SELECT * and SELECT 1 within an EXISTS clause generate identical explain plans, with identical memory costs. There is no difference. However, other platforms may vary.
As a matter of personal preference, I use
SELECT *
Because SELECTing a specific field could mislead a reader into thinking that I care about that specific field, and it also lets me copy / paste that subquery out and run it unmodified, to look at the output.
However, an EXISTS clause in a SQL statement is a bit of a code smell, IMO. There are times when they are the best and clearest way to get what you want, but they can almost always be expressed as a join, which will be a lot easier for the database engine to optimize.
SELECT *
FROM SOME_TABLE ST
WHERE EXISTS(
SELECT 1
FROM SOME_OTHER_TABLE SOT
WHERE SOT.KEY_VALUE1 = ST.KEY_VALUE1
AND SOT.KEY_VALUE2 = ST.KEY_VALUE2
)
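Is logically identical (as long as each SOME_TABLE row matches at most one SOME_OTHER_TABLE row, since otherwise the join returns duplicates, and apart from the extra SOME_OTHER_TABLE columns that SELECT * returns) to: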
SELECT *
FROM
SOME_TABLE ST
INNER JOIN
SOME_OTHER_TABLE SOT
ON ST.KEY_VALUE1 = SOT.KEY_VALUE1
AND ST.KEY_VALUE2 = SOT.KEY_VALUE2
I also use 1. I've seen some devs who use null. I think 1 is efficient compared to selecting from any field as the query won't have to get the actual value from the physical loc when it executes the select clause of the subquery.
Use:
WHERE EXISTS (SELECT NULL
FROM some_other_table
WHERE ... )
EXISTS returns true if one or more of the specified criteria match - it doesn't matter if columns are actually returned in the SELECT clause. NULL just makes it explicit that there isn't a comparison while 1/etc could be a valid value previously used in an IN clause.
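That the select list inside EXISTS does not affect the result is easy to confirm. The following is a minimal sketch that assumes SQLite through Python's sqlite3 module; it reuses the some_table and some_other_table names from the question, while the id and some_id columns, the sample rows, and the helper ids_where_exists are hypothetical. It checks results only and says nothing about execution cost on any particular engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE some_table (id INTEGER);
    CREATE TABLE some_other_table (some_id INTEGER, payload TEXT);
    INSERT INTO some_table VALUES (1), (2), (3);
    INSERT INTO some_other_table VALUES (1, 'a'), (3, 'b'), (3, 'c');
""")

def ids_where_exists(select_list):
    """Run the EXISTS query with the given subquery select list."""
    # select_list is a fixed literal chosen below, never user input.
    sql = f"""
        SELECT st.id FROM some_table st
        WHERE EXISTS (SELECT {select_list}
                      FROM some_other_table sot
                      WHERE sot.some_id = st.id)
        ORDER BY st.id
    """
    return [row[0] for row in conn.execute(sql)]

results = {s: ids_where_exists(s) for s in ("1", "NULL", "*", "sot.payload")}
print(results)
```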