MySQL doesn't support the LIMIT clause inside a subselect; how can I do that?

I have the following table on a MySQL 5.1.30:
CREATE TABLE article (
    article_id int(10) unsigned NOT NULL AUTO_INCREMENT,
    category_id int(10) unsigned NOT NULL,
    title varchar(100) NOT NULL,
    PRIMARY KEY (article_id)
);
With this information:
1, 1, 'foo'
2, 1, 'bar'
3, 1, 'baz'
4, 1, 'quox'
5, 2, 'quonom'
6, 2, 'qox'
I need to obtain the first three articles in each category for all categories that have articles. Something like this:
1, 1, 'foo'
2, 1, 'bar'
3, 1, 'baz'
5, 2, 'quonom'
6, 2, 'qox'
Of course a union would work:
select * from articles where category_id = 1 limit 3
union
select * from articles where category_id = 2 limit 3
But there are an unknown number of categories in the database. Also, the order should be specified by is_sticky and published_at columns that I left out of the examples to simplify.
Is it possible to build a query that retrieves this information?
UPDATE: I've tried the following, which seemed like it would work, except that MySQL doesn't support the LIMIT clause inside a subselect. Do you know of a way to simulate the LIMIT there?
select *
from articles a
where a.article_id in (select f.article_id
                       from articles f
                       where f.category_id = a.category_id
                       order by f.is_sticky, f.published_at
                       limit 3)
Thanks

SELECT ... LIMIT isn't supported in subqueries, I'm afraid, so it's time to break out the self-join magic:
SELECT article.*
FROM article
JOIN (
    SELECT a0.category_id AS id, MIN(a2.article_id) AS lim
    FROM article AS a0
    LEFT JOIN article AS a1 ON a1.category_id=a0.category_id AND a1.article_id>a0.article_id
    LEFT JOIN article AS a2 ON a2.category_id=a1.category_id AND a2.article_id>a1.article_id
    GROUP BY id
) AS cat ON cat.id=article.category_id
WHERE article.article_id<=cat.lim OR cat.lim IS NULL
ORDER BY article_id;
The bit in the middle is working out the ID of the third-lowest-ID article for each category by trying to join three copies of the same table in ascending ID order. If there are fewer than three articles for a category, the left joins will ensure the limit is NULL, so the outer WHERE needs to pick up that case as well.
If your “top 3” requirement might change to “top n” at some point, this begins to get unwieldy. In that case you might want to reconsider the idea of querying the list of distinct categories first then unioning the per-category queries.
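For the record, a correlated-count variant sidesteps the missing LIMIT entirely and extends to any n by changing the constant; a sketch, assuming the top three are defined by article_id alone:
SELECT a.*
FROM article a
WHERE (SELECT COUNT(*)
       FROM article b
       WHERE b.category_id = a.category_id
       AND b.article_id < a.article_id) < 3
ORDER BY a.category_id, a.article_id;
Each row survives only if fewer than three rows in its category precede it, which is exactly the top-3-per-category cut. The same multi-column-ordering caveat discussed below applies here too.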
ETA: Ordering on two columns: eek, new requirements! :-)
It depends what you mean: if you're only trying to order the final results you can bang it on the end no problem. But if you need to use this ordering to select which three articles are to be picked things are a lot harder.
We are using a self-join with ‘<’ to reproduce the effect ‘ORDER BY article_id’ would have. Unfortunately, whilst you can do ‘ORDER BY a, b’, you can't do ‘(a, b)<(c, d)’... neither can you do ‘MIN(a, b)’. Plus, you'd actually be ordering by three columns, is_sticky, published_at and article_id, because you need to ensure that each ordering value is unique, to avoid getting four or more rows returned.
Whilst you could make up your own orderable value by some crude integer or string combination of columns:
LEFT JOIN article AS a1
    ON a1.category_id=a0.category_id
    AND HEX(a1.is_sticky)+HEX(a1.published_at)+HEX(a1.article_id)>HEX(a0.is_sticky)+HEX(a0.published_at)+HEX(a0.article_id)
this is getting unfeasibly ugly, and the calculations will scupper any chance of using the indices to make the query efficient. At which point you are better off simply doing the separate per-category LIMITed queries.

You probably should add another table containing the category_id and a description of the categories. Then you can query that table for a list of category IDs, and use a subquery or additional queries to get the articles with proper sorting and limiting. I don't have time to write this out fully now, but someone else probably will (or I'll do it in the unlikely event that no one else has responded by the time I get back).

Here's something I'm not proud of (in MS SQL - not sure if it'll work in MySQL)
select a2.article_id, a2.category_id, a2.title
from
    (select distinct category_id
     from article) as a1
    inner join article a2 on a2.category_id = a1.category_id
where a2.article_id <= (
    select top 1 a4.article_id
    from (
        select top 3 a3.article_id
        from article a3
        where a3.category_id = a1.category_id
        order by a3.article_id asc
    ) a4
    order by a4.article_id desc)
It'll depend on MySQL supporting subqueries in this manner. Basically it works out the third-largest article_id for each category and joins all articles less than or equal to that per category.
SELECT TOP n * should work the same as SELECT * LIMIT n, I hope...

Related

SQL query to find a row with a specific number of associations

Using Postgres I have a schema that has conversations and conversationUsers. Each conversation has many conversationUsers. I want to be able to find the conversation that has the exactly specified number of conversationUsers. In other words, provided an array of userIds (say, [1, 4, 6]) I want to be able to find the conversation that contains only those users, and no more.
So far I've tried this:
SELECT c."conversationId"
FROM "conversationUsers" c
WHERE c."userId" IN (1, 4)
GROUP BY c."conversationId"
HAVING COUNT(c."userId") = 2;
Unfortunately, this also seems to return conversations which include these 2 users among others. (For example, it returns a result if the conversation also includes "userId" 5).
This is a case of relational-division - with the added special requirement that the same conversation shall have no additional users.
Assuming the PK of table "conversationUsers" is on ("userId", "conversationId"), with the columns of the multicolumn PK in this order, which enforces unique combinations, makes the columns NOT NULL, and also implicitly provides the essential index for performance. Ideally, you have another index on ("conversationId", "userId"). See:
Is a composite index also good for queries on the first field?
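In DDL form, the assumed setup would look something like this (a sketch, not taken from the question):
CREATE TABLE "conversationUsers" (
   "userId"         int NOT NULL,
   "conversationId" int NOT NULL,
   PRIMARY KEY ("userId", "conversationId")
);
CREATE INDEX ON "conversationUsers" ("conversationId", "userId");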
For the basic query, there is the "brute force" approach: count the number of matching users for all conversations of all given users, then filter the ones matching all given users. OK for small tables, short input arrays, and/or few conversations per user, but it doesn't scale well:
SELECT "conversationId"
FROM "conversationUsers" c
WHERE "userId" = ANY ('{1,4,6}'::int[])
GROUP BY 1
HAVING count(*) = array_length('{1,4,6}'::int[], 1)
AND NOT EXISTS (
SELECT FROM "conversationUsers"
WHERE "conversationId" = c."conversationId"
AND "userId" <> ALL('{1,4,6}'::int[])
);
Eliminating conversations with additional users with a NOT EXISTS anti-semi-join. More:
How do I (or can I) SELECT DISTINCT on multiple columns?
Alternative techniques:
Select rows which are not present in other table
There are various other, (much) faster relational-division query techniques. But the fastest ones are not well suited for a dynamic number of user IDs.
How to filter SQL results in a has-many-through relation
For a fast query that can also deal with a dynamic number of user IDs, consider a recursive CTE:
WITH RECURSIVE rcte AS (
   SELECT "conversationId", 1 AS idx
   FROM   "conversationUsers"
   WHERE  "userId" = ('{1,4,6}'::int[])[1]

   UNION ALL
   SELECT c."conversationId", r.idx + 1
   FROM   rcte r
   JOIN   "conversationUsers" c USING ("conversationId")
   WHERE  c."userId" = ('{1,4,6}'::int[])[idx + 1]
   )
SELECT "conversationId"
FROM   rcte r
WHERE  idx = array_length(('{1,4,6}'::int[]), 1)
AND    NOT EXISTS (
   SELECT FROM "conversationUsers"
   WHERE  "conversationId" = r."conversationId"
   AND    "userId" <> ALL('{1,4,6}'::int[])
   );
For ease of use wrap this in a function or prepared statement. Like:
PREPARE conversations(int[]) AS
WITH RECURSIVE rcte AS (
   SELECT "conversationId", 1 AS idx
   FROM   "conversationUsers"
   WHERE  "userId" = $1[1]

   UNION ALL
   SELECT c."conversationId", r.idx + 1
   FROM   rcte r
   JOIN   "conversationUsers" c USING ("conversationId")
   WHERE  c."userId" = $1[idx + 1]
   )
SELECT "conversationId"
FROM   rcte r
WHERE  idx = array_length($1, 1)
AND    NOT EXISTS (
   SELECT FROM "conversationUsers"
   WHERE  "conversationId" = r."conversationId"
   AND    "userId" <> ALL($1)
   );
Call:
EXECUTE conversations('{1,4,6}');
db<>fiddle here (also demonstrating a function)
There is still room for improvement: for top performance, put users with the fewest conversations first in your input array, to eliminate as many rows as possible early. You could also generate a non-dynamic, non-recursive query dynamically (using one of the fast techniques from the first link) and execute that in turn, or even wrap it in a single plpgsql function with dynamic SQL ...
More explanation:
Using same column multiple times in WHERE clause
Alternative: MV for sparsely written table
If the table "conversationUsers" is mostly read-only (old conversations are unlikely to change) you might use a MATERIALIZED VIEW with pre-aggregated users in sorted arrays and create a plain btree index on that array column.
CREATE MATERIALIZED VIEW mv_conversation_users AS
SELECT "conversationId", array_agg("userId") AS users  -- sorted array
FROM  (
   SELECT "conversationId", "userId"
   FROM   "conversationUsers"
   ORDER  BY 1, 2
   ) sub
GROUP  BY 1
ORDER  BY 1;
CREATE INDEX ON mv_conversation_users (users) INCLUDE ("conversationId");
The demonstrated covering index requires Postgres 11. See:
https://dba.stackexchange.com/a/207938/3684
About sorting rows in a subquery:
How to apply ORDER BY and LIMIT in combination with an aggregate function?
In older versions use a plain multicolumn index on (users, "conversationId"). With very long arrays, a hash index might make sense in Postgres 10 or later.
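For example, the older-version index would be (a sketch):
CREATE INDEX ON mv_conversation_users (users, "conversationId");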
Then the much faster query would simply be:
SELECT "conversationId"
FROM mv_conversation_users c
WHERE users = '{1,4,6}'::int[]; -- sorted array!
db<>fiddle here
You have to weigh added costs to storage, writes and maintenance against benefits to read performance.
Aside: consider legal identifiers without double quotes. conversation_id instead of "conversationId" etc.:
Are PostgreSQL column names case-sensitive?
You can modify your query like this, and it should work (the inner query now requires both users to be present, while the outer one rejects conversations with anybody else):
SELECT c."conversationId"
FROM "conversationUsers" c
WHERE c."conversationId" IN (
    SELECT c1."conversationId"
    FROM "conversationUsers" c1
    WHERE c1."userId" IN (1, 4)
    GROUP BY c1."conversationId"
    HAVING COUNT(DISTINCT c1."userId") = 2
)
GROUP BY c."conversationId"
HAVING COUNT(DISTINCT c."userId") = 2;
This might be easier to follow. You want the conversation ID, so group by it, and add a HAVING clause requiring that the count of matching user IDs equals both the number of distinct users in the group (no extras) and the number of IDs you are looking for (none missing). This will work, but it will take longer to process because there is no pre-qualifier.
select
      cu.ConversationId
   from
      conversationUsers cu
   group by
      cu.ConversationID
   having
          sum( case when cu.userId IN (1, 4) then 1 else 0 end ) = count( distinct cu.UserID )
      and sum( case when cu.userId IN (1, 4) then 1 else 0 end ) = 2
To simplify the list even more, have a pre-query of conversations that at least one person is in. If they are not in one to begin with, why bother considering those conversations?
select
      cu.ConversationId
   from
      ( select cu2.ConversationID
          from conversationUsers cu2
          where cu2.userID = 4 ) preQual
         JOIN conversationUsers cu
            ON preQual.ConversationId = cu.ConversationId
   group by
      cu.ConversationID
   having
          sum( case when cu.userId IN (1, 4) then 1 else 0 end ) = count( distinct cu.UserID )
      and sum( case when cu.userId IN (1, 4) then 1 else 0 end ) = 2

How can I stop joins from adding rows in my match query?

I'm having difficulty translating what I want into functional programming, since I think imperatively. Basically, I have a table of forms, and a table of expectations. In the Expectation view, I want it to look through the forms table and tell me if each one found a match. However, when I try to use joins to accomplish this, the joins are adding rows to the Expectation table when two or more forms match. I do not want this.
In an imperative fashion, I want the equivalent of this:
ForEach (row in Expectation table)
{
    if (any form in the Form table matches the criteria)
    {
        MatchID = form.ID;
        SignDate = form.SignDate;
        ...
    }
}
What I have in SQL is this:
SELECT
    e.*, match.ID, match.SignDate, ...
FROM
    POFDExpectation e LEFT OUTER JOIN
    (SELECT MIN(ID) as MatchID, MIN(SignDate) as MatchSignDate,
            COUNT(*) as MatchCount, ...
     FROM Form f
     GROUP BY (matching criteria columns)
    ) match
    ON (form.[match criteria] = expectation.[match criteria])
Which works okay, but very slowly, and every time there are TWO matches, a row is added to the Expectation results. Mathematically I understand that a join is a cross multiply and this is expected, but I'm unsure how to do this without them. Subquery perhaps?
I'm not able to give too many further details about the implementation, but I'll be happy to try any suggestion and respond with the results. I have 880 Expectation rows, and 942 results being returned. If I only allow results that match one form, I get 831 results. Neither are desirable, so if yours gets me to exactly 880, yours is the accepted answer.
Edit: I am using SQL Server 2008 R2, though a generic solution would be best.
Sample code:
--DROP VIEW ExpectationView; DROP TABLE Forms; DROP TABLE Expectations;
--Create Tables and View
CREATE TABLE Forms (ID int IDENTITY(1,1) PRIMARY KEY, ReportYear int, Name varchar(100), Complete bit, SignDate datetime)
GO
CREATE TABLE Expectations (ID int IDENTITY(1,1) PRIMARY KEY, ReportYear int, Name varchar(100))
GO
CREATE VIEW ExpectationView AS
SELECT e.*, filed.MatchID, filed.SignDate,
       ISNULL(filed.FiledCount, 0) as FiledCount,
       ISNULL(name.NameCount, 0) as NameCount
FROM Expectations e
LEFT OUTER JOIN
    (SELECT MIN(ID) as MatchID, ReportYear, Name, Complete,
            MIN(SignDate) as SignDate, COUNT(*) as FiledCount
     FROM Forms f
     GROUP BY ReportYear, Name, Complete) filed
    ON filed.ReportYear = e.ReportYear
   AND filed.Name like '%'+e.Name+'%'
   AND filed.Complete = 1
LEFT OUTER JOIN
    (SELECT MIN(ID) as MatchID, ReportYear, Name, COUNT(*) as NameCount
     FROM Forms f
     GROUP BY ReportYear, Name) name
    ON name.ReportYear = e.ReportYear
   AND name.Name like '%'+e.Name+'%'
GO
--Insert Test Data
INSERT INTO Forms (ReportYear, Name, Complete, SignDate)
SELECT 2011, 'Bob Smith', 1, '2012-03-01' UNION ALL
SELECT 2011, 'Bob Jones', 1, '2012-10-04' UNION ALL
SELECT 2011, 'Bob', 1, '2012-07-20'
GO
INSERT INTO Expectations (ReportYear, Name)
SELECT 2011, 'Bob'
GO
SELECT * FROM ExpectationView --Should only return 1 result, returns 9
The 'filed' join shows that they have completed a form; 'name' shows that they may have started one but not finished it. My view has four different 'match criteria', each a little more strict, and counts each: 'Name Only Matches', 'Loose Matches', 'Matches' (default), and 'Tight Matches' (used if there is more than one default match).
This is how I do it when I want to keep to a JOIN-type query format:
SELECT
    e.*,
    match.ID,
    match.SignDate,
    ...
FROM POFDExpectation e
OUTER APPLY (
    SELECT TOP 1
        MIN(ID) as MatchID,
        MIN(SignDate) as MatchSignDate,
        COUNT(*) as MatchCount,
        ...
    FROM Form f
    WHERE form.[match criteria] = expectation.[match criteria]
    GROUP BY (matching criteria columns)
    -- Add ORDER BY here to control which row is TOP 1
) match
It usually performs better as well.
Semantically, {CROSS|OUTER} APPLY (table-expression) specifies a table-expression that is called once for each row in the preceding table expressions of the FROM clause and then joined to them. Pragmatically, however, the compiler treats it almost identically to a JOIN.
The practical difference is that unlike a JOIN table-expression, the APPLY table-expression is dynamically re-evaluated for each row. So instead of an ON clause, it relies on its own logic and WHERE clauses to limit/match its rows to the preceding table-expressions. This also allows it to make reference to the column-values of the preceding table-expressions, inside its own internal subquery expression. (This is not possible in a JOIN)
The reason that we want this here, instead of a JOIN, is that we need a TOP 1 in the sub-query to limit its returned rows, however, that means that we need to move the ON clause conditions to the internal WHERE clause so that it will get applied before the TOP 1 is evaluated. And that means that we need an APPLY here, instead of the more usual JOIN.
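To make that concrete against the sample schema from the question (Forms and Expectations), here is a minimal sketch of the pattern; the correlation shown (ReportYear plus a name LIKE) is illustrative, not the OP's full match criteria:
SELECT e.*, m.MatchID, m.MatchSignDate
FROM Expectations e
OUTER APPLY (
    SELECT TOP 1
        f.ID AS MatchID,
        f.SignDate AS MatchSignDate
    FROM Forms f
    WHERE f.ReportYear = e.ReportYear          -- correlated: references the outer row
      AND f.Name LIKE '%' + e.Name + '%'
    ORDER BY f.SignDate                        -- controls which row is TOP 1
) m
Because the APPLY block returns at most one row per expectation, the outer result keeps exactly one row per Expectations row.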
@RBarryYoung answered the question as I asked it, but there was a second question that I didn't make very clear. What I really wanted was a combination of his answer and this question, so for the record here's what I used:
SELECT
    e.*,
    ...
    match.ID,
    match.SignDate,
    match.MatchCount
FROM
    POFDExpectation e
OUTER APPLY (
    SELECT TOP 1
        ID as MatchID,
        ReportYear,
        ...
        SignDate as MatchSignDate,
        COUNT(*) OVER () as MatchCount
    FROM
        Form f
    WHERE
        form.[match criteria] = expectation.[match criteria]
    -- Add ORDER BY here to control which row is TOP 1
) match

SQL Server: query optimization with many columns

we have "Profile" table with over 60 columns like (Id, fname, lname, gender, profilestate, city, state, degree, ...).
users search other peopel on website. query is like :
WITH TempResult as (
    select ROW_NUMBER() OVER(ORDER BY @sortColumn DESC) as RowNum, profile.id
    from Profile
    where
        (@a is null or a = @a) and
        (@b is null or b = @b) and
        ... (over 60 columns)
)
SELECT profile.* FROM TempResult join profile on TempResult.id = profile.id
WHERE
    (RowNum >= @FirstRow)
    AND
    (RowNum <= @LastRow)
SQL Server by default uses the clustered index to execute the query, but total execution time is over 300. We tested another solution, a multi-column index over all the columns in the WHERE clause, but total execution time was over 400.
Do you have any solution to bring total execution time under 100?
We are using SQL Server 2008.
Unfortunately I don't think there is a pure SQL solution to your issue. Here are a couple alternatives:
Dynamic SQL - build up a query that only includes WHERE clause statements for values that are actually provided (see the sketch after this list). Assuming the average search actually only fills in 2-3 fields, indexes could be added and utilized.
Full Text Search - go to something more like a Google keyword search. No individual options.
Lucene (or something else) - Search outside of SQL; This is a fairly significant change though.
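To illustrate the dynamic SQL option, here is a sketch assuming just two of the 60 parameters (@city and @state stand in for the real ones):
DECLARE @sql nvarchar(max) = N'SELECT id FROM Profile WHERE 1 = 1';

IF @city IS NOT NULL
    SET @sql += N' AND city = @city';
IF @state IS NOT NULL
    SET @sql += N' AND state = @state';

-- sp_executesql keeps the statement parameterized, so each distinct
-- shape of the WHERE clause gets its own reusable plan.
EXEC sp_executesql @sql,
    N'@city varchar(50), @state varchar(50)',
    @city = @city, @state = @state;
Only the columns actually searched on end up in the WHERE clause, so the optimizer can pick a suitable index for each query shape.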
One other option that I remember implementing in a system once: create a vertical table that includes all of the data you are searching on, and build up a query against it. This is easiest to do with dynamic SQL, but it could be done using table-valued parameters or a temp table in a pinch.
The idea is to make a table that looks something like this:
Profile ID
Attribute Name
Attribute Value
The table should have a unique index on (Profile ID, Attribute Name) (unique to make the search work properly, index will make it perform well).
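One possible shape for that table (the names and sizes are illustrative):
CREATE TABLE ProfileAttributes (
    ProfileID      int          NOT NULL,
    AttributeName  varchar(50)  NOT NULL,
    AttributeValue varchar(100) NOT NULL,
    CONSTRAINT UQ_ProfileAttributes UNIQUE (ProfileID, AttributeName)
);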
In this table you'd have rows of data like:
(1, 'city', 'grand rapids')
(1, 'state', 'MI')
(2, 'city', 'detroit')
(2, 'state', 'MI')
Then your SQL will be something like:
SELECT *
FROM Profile
JOIN (
    SELECT ProfileID
    FROM ProfileAttributes
    WHERE (AttributeName = 'city' AND AttributeValue = 'grand rapids')
       OR (AttributeName = 'state' AND AttributeValue = 'MI')
    GROUP BY ProfileID
    HAVING COUNT(*) = 2
) SelectedProfiles ON Profile.ProfileID = SelectedProfiles.ProfileID
... -- Add your paging here
Like I said, you could use a temp table that has attribute name/values:
SELECT *
FROM Profile
JOIN (
    SELECT ProfileID
    FROM ProfileAttributes
    JOIN PassedInAttributeTable
        ON ProfileAttributes.AttributeName = PassedInAttributeTable.AttributeName
       AND ProfileAttributes.AttributeValue = PassedInAttributeTable.AttributeValue
    GROUP BY ProfileID
    HAVING COUNT(*) = CountOfRowsInPassedInAttributeTable -- calculate or pass in
) SelectedProfiles ON Profile.ProfileID = SelectedProfiles.ProfileID
... -- Add your paging here
As I recall, this ended up performing very well, even on fairly complicated queries (though I think we only had 12 or so columns).
As a single query, I can't think of a clever way of optimising this.
Provided that each column's check is highly selective, however, the following (very long-winded) code might prove faster, assuming each individual column has its own separate index...
WITH
filter AS (
    SELECT
        [a].*
    FROM
        (SELECT * FROM Profile WHERE @a IS NULL OR a = @a) AS [a]
    INNER JOIN
        (SELECT id FROM Profile WHERE b = @b UNION ALL SELECT NULL WHERE @b IS NULL) AS [b]
            ON ([a].id = [b].id) OR ([b].id IS NULL)
    INNER JOIN
        (SELECT id FROM Profile WHERE c = @c UNION ALL SELECT NULL WHERE @c IS NULL) AS [c]
            ON ([a].id = [c].id) OR ([c].id IS NULL)
    ...
    INNER JOIN
        (SELECT id FROM Profile WHERE zz = @zz UNION ALL SELECT NULL WHERE @zz IS NULL) AS [zz]
            ON ([a].id = [zz].id) OR ([zz].id IS NULL)
)
, TempResult as (
    SELECT
        ROW_NUMBER() OVER(ORDER BY @sortColumn DESC) as RowNum,
        [filter].*
    FROM
        [filter]
)
SELECT
    *
FROM
    TempResult
WHERE
    (RowNum >= @FirstRow)
    AND (RowNum <= @LastRow)
EDIT
Also, thinking about it, you may even get the same result just by having the 60 individual indexes. SQL Server can do INDEX MERGING...
You've got several issues, imho. One is that you're going to end up with a sequential scan no matter what you do.
But I think your more crucial issue here is that you have an unnecessary join:
SELECT * FROM TempResult
WHERE
    (RowNum >= @FirstRow)
    AND
    (RowNum <= @LastRow)
This is a classic "SQL filter" query problem. I've found that the typical approach of "(@b is null or b = @b)" and its common derivatives all yield mediocre performance. The OR clause tends to be the cause.
Over the years I've done a lot of perf tuning and query optimisation. The approach I've found best is to generate dynamic SQL inside a stored proc. Most times you also need to add WITH RECOMPILE on the statement. The stored proc helps reduce the potential for SQL injection attacks. The RECOMPILE is needed to force the selection of indexes appropriate to the parameters you are searching on.
Generally it is at least an order of magnitude faster.
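A lighter-weight variant of the same idea, sketched on the assumption of SQL Server 2008 SP1 CU5 or later: OPTION (RECOMPILE) makes the optimizer embed the actual parameter values on each execution, so the (@a IS NULL OR a = @a) branches collapse at compile time and appropriate indexes can be chosen without building the SQL string yourself:
SELECT id
FROM Profile
WHERE (@a IS NULL OR a = @a)
  AND (@b IS NULL OR b = @b)
OPTION (RECOMPILE);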
I agree you should also look at the points mentioned above, like:
If you commonly refer to only a small subset of the columns, you could create non-clustered "covering" indexes.
Highly selective columns (i.e. those with many unique values) work best as the lead column in an index.
If many columns have a very small number of values, consider using the BIT datatype, or create your own bitmasked BIGINT to represent many columns, i.e. a form of enumerated datatype. But be careful, as any function in the WHERE clause (like MOD or bitwise AND/OR) will prevent the optimiser from choosing an index. It works best if you know the value for each and can combine them into an equality or range query.
While it is often good to find row IDs with a small query and then join to get all the other columns you want to retrieve (as you are doing above), this approach can sometimes backfire: if the first part of the query does a clustered index scan, it is often faster to get the other columns you need in the select list and save the second table scan.
So it is always good to try it both ways and see what works best.
Remember to run SET STATISTICS IO ON and SET STATISTICS TIME ON before running your tests. Then you can see where the IO is, and it may help you with index selection for the most frequent combinations of parameters.
I hope this makes sense without long code samples. (They are on my other machine.)

What is the most efficient way to write this SQL query?

I have two lists of ids. List A and List B. Both of these lists are actually the results of SQL queries (QUERY A and QUERY B respectively).
I want to 'filter' List A, by removing the ids in List A if they appear in list B.
So for example if list A looks like this:
1, 2, 3, 4, 7
and List B looks like this:
2,7
then the 'filtered' List A should have ids 2 and 7 removed, and so should look like this:
1, 3, 4
I want to write an SQL query like this (pseudo code of course):
SELECT id FROM (QUERYA) as temp_table where id not in (QUERYB)
Using classic SQL:
select [distinct] number
from list_a
where number not in (
select distinct number from list_b
);
I've put the first "distinct" in square brackets since I'm unsure as to whether you wanted duplicates removed (remove either the brackets or the entire word). The second "distinct" should be left in just in case your DBMS doesn't optimize IN clauses.
It may be faster (measure, don't guess) with a left join along the lines of:
select [distinct] list_a.number from list_a
left join list_b on list_a.number = list_b.number
where list_b.number is null;
Same deal with the "[distinct]".
see Doing INTERSECT and MINUS in MySQL
The query:
select id
from ListA
where id not in (
select id
from ListB)
will give you the desired result.
I am not sure which way is best. From past experience, the performance can be very different depending on the situation and the size of the tables.
1.
select id
from ListA
where id not in (
select id
from ListB)
2.
select ListA.id
from ListA
left join ListB on ListA.id=ListB.id
where ListB.id is null
3.
select id
from ListA
where not exists (
select *
from ListB where ListB.id=ListA.id)
Option 2) should usually be the fastest, as it uses a join rather than sub-queries.
Some people may suggest 3) rather than 1) because it uses EXISTS, which only tests for the existence of a row and does not need to read the data from the table.

Difference between EXISTS and IN in SQL?

What is the difference between the EXISTS and IN clause in SQL?
When should we use EXISTS, and when should we use IN?
The exists keyword can be used in that way, but really it's intended as a way to avoid counting:
--this statement needs to check the entire table
select count(*) from [table] where ...
--this statement is true as soon as one match is found
exists ( select * from [table] where ... )
This is most useful where you have if conditional statements, as exists can be a lot quicker than count.
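For example, in a T-SQL conditional (the table and column names are illustrative):
IF EXISTS (SELECT * FROM Orders WHERE CustomerID = 42)
    PRINT 'customer has orders';
The engine can stop at the first matching row, whereas a COUNT(*) in the same position would have to tally every match.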
The in is best used where you have a static list to pass:
select * from [table]
where [field] in (1, 2, 3)
When you have a table in an in statement it makes more sense to use a join, but mostly it shouldn't matter. The query optimiser should return the same plan either way. In some implementations (mostly older, such as Microsoft SQL Server 2000) in queries will always get a nested join plan, while join queries will use nested, merge or hash as appropriate. More modern implementations are smarter and can adjust the plan even when in is used.
EXISTS will tell you whether a query returned any results. e.g.:
SELECT *
FROM Orders o
WHERE EXISTS (
SELECT *
FROM Products p
WHERE p.ProductNumber = o.ProductNumber)
IN is used to compare one value to several, and can use literal values, like this:
SELECT *
FROM Orders
WHERE ProductNumber IN (1, 10, 100)
You can also use query results with the IN clause, like this:
SELECT *
FROM Orders
WHERE ProductNumber IN (
SELECT ProductNumber
FROM Products
WHERE ProductInventoryQuantity > 0)
Based on the rule-based optimizer:
EXISTS is much faster than IN when the sub-query result is very large.
IN is faster than EXISTS when the sub-query result is very small.
Based on the cost-based optimizer:
There is no difference.
I'm assuming you know what they do, and that they are thus used differently, so I'm going to understand your question as: when would it be a good idea to rewrite the SQL to use IN instead of EXISTS, or vice versa?
Is that a fair assumption?
Edit: The reason I'm asking is that in many cases you can rewrite an SQL based on IN to use an EXISTS instead, and vice versa, and for some database engines, the query optimizer will treat the two differently.
For instance:
SELECT *
FROM Customers
WHERE EXISTS (
SELECT *
FROM Orders
WHERE Orders.CustomerID = Customers.ID
)
can be rewritten to:
SELECT *
FROM Customers
WHERE ID IN (
SELECT CustomerID
FROM Orders
)
or with a join:
SELECT Customers.*
FROM Customers
INNER JOIN Orders ON Customers.ID = Orders.CustomerID
So my question still stands: is the original poster wondering about what IN and EXISTS do, and thus how to use them, or does he ask whether rewriting an SQL statement using IN to use EXISTS instead, or vice versa, would be a good idea?
EXISTS is much faster than IN when the subquery result is very large.
IN is faster than EXISTS when the subquery result is very small.
CREATE TABLE t1 (id INT, title VARCHAR(20), someIntCol INT)
GO
CREATE TABLE t2 (id INT, t1Id INT, someData VARCHAR(20))
GO
INSERT INTO t1
SELECT 1, 'title 1', 5 UNION ALL
SELECT 2, 'title 2', 5 UNION ALL
SELECT 3, 'title 3', 5 UNION ALL
SELECT 4, 'title 4', 5 UNION ALL
SELECT null, 'title 5', 5 UNION ALL
SELECT null, 'title 6', 5
INSERT INTO t2
SELECT 1, 1, 'data 1' UNION ALL
SELECT 2, 1, 'data 2' UNION ALL
SELECT 3, 2, 'data 3' UNION ALL
SELECT 4, 3, 'data 4' UNION ALL
SELECT 5, 3, 'data 5' UNION ALL
SELECT 6, 3, 'data 6' UNION ALL
SELECT 7, 4, 'data 7' UNION ALL
SELECT 8, null, 'data 8' UNION ALL
SELECT 9, 6, 'data 9' UNION ALL
SELECT 10, 6, 'data 10' UNION ALL
SELECT 11, 8, 'data 11'
Query 1
SELECT t1.*
FROM t1
WHERE not EXISTS (SELECT * FROM t2 WHERE t1.id = t2.t1id)
Query 2
SELECT t1.*
FROM t1
WHERE t1.id not in (SELECT t2.t1id FROM t2 )
If id in t1 has NULL values, then Query 1 will find them, but Query 2 can't find the NULL ids.
I mean, IN can't compare anything with NULL, so it has no result for NULL, but EXISTS can compare everything with NULL.
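Note that the sample data above also triggers the classic NOT IN pitfall: t2.t1id itself contains a NULL (the 'data 8' row), so no row's NOT IN test in Query 2 can ever evaluate to true (it is either false or unknown), and Query 2 returns no rows at all, regardless of the ids in t1. Filtering the NULLs out of the subquery fixes that part; rows whose own id is NULL still won't qualify, which is exactly the difference described above:
SELECT t1.*
FROM t1
WHERE t1.id not in (SELECT t2.t1id FROM t2 WHERE t2.t1id is not null)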
If you are using the IN operator, the SQL engine will scan all records fetched from the inner query. On the other hand, if we are using EXISTS, the SQL engine will stop the scanning process as soon as it finds a match.
IN supports only equality relations (or inequality when preceded by NOT).
It is a synonym for =any / =some, e.g.:
select *
from t1
where x in (select x from t2)
;
EXISTS supports variant types of relations, that cannot be expressed using IN, e.g. -
select *
from t1
where exists (select null
              from t2
              where t2.x = t1.x
              and t2.y > t1.y
              and t2.z like '%' || t1.z || '%'
             )
;
And on a different note -
The alleged performance and technical differences between EXISTS and IN may result from a specific vendor's implementations/limitations/bugs, but many times they are nothing but myths created due to a lack of understanding of database internals.
The tables' definition, the statistics' accuracy, the database configuration and the optimizer's version all have an impact on the execution plan, and therefore on the performance metrics.
The EXISTS keyword evaluates to true or false, but the IN keyword compares all values in the corresponding subquery column.
Also, SELECT 1 can be used with EXISTS. Example:
SELECT * FROM Temp1 where exists(select 1 from Temp2 where conditions...)
But IN is less efficient, so EXISTS is faster.
I think:
EXISTS is for when you need to match the results of a query with another subquery. Query #1 results need to be retrieved where the subquery results match; a kind of join. E.g. select customers from table #1 who have also placed orders in table #2.
IN is for retrieving rows where the value of a specific column lies IN a list (1, 2, 3, 4, 5). E.g. select customers whose zip_code value lies in the (....) list.
When to use one over the other: when you feel it reads appropriately (communicates intent better).
As per my knowledge, when a subquery returns a NULL value, the whole statement becomes NULL. In those cases we use the EXISTS keyword. If we want to compare particular values in subqueries, we use the IN keyword.
Which one is faster depends on the number of rows fetched by the inner query:
When your inner query fetches thousands of rows, EXISTS is the better choice.
When your inner query fetches only a few rows, IN is faster.
EXISTS evaluates to true or false, but IN compares multiple values. When you don't know whether the record exists or not, you should choose EXISTS.
Difference lies here:
select *
from abcTable
where exists (select null)
The query above will return all the records, while the one below returns an empty result.
select *
from abcTable
where abcTable_ID in (select null)
Give it a try and observe the output.
The reason is that the EXISTS operator works on the "at least found" principle. It returns true and stops scanning the table once at least one matching row is found.
On the other hand, when the IN operator is combined with a subquery, MySQL must process the subquery first, and then use the result of the subquery to process the whole query.
The general rule of thumb is that if the subquery contains a large volume of data, the EXISTS operator provides better performance. However, the query that uses the IN operator will perform faster if the result set returned from the subquery is very small.
In certain circumstances, it is better to use IN rather than EXISTS. In general, if the selective predicate is in the subquery, then use IN. If the selective predicate is in the parent query, then use EXISTS.
https://docs.oracle.com/cd/B19306_01/server.102/b14211/sql_1016.htm#i28403
My understanding is that both should be the same as long as we are not dealing with NULL values.
The same reason why a query does not return values for = NULL vs IS NULL.
http://sqlinthewild.co.za/index.php/2010/02/18/not-exists-vs-not-in/
As far as the boolean vs comparator argument goes, to generate a boolean both values need to be compared, and that is how any IF condition works. So I fail to understand how IN and EXISTS would behave differently.
If a subquery returns more than one value, you might need to execute the outer query when the values within the column specified in the condition match any value in the result set of the subquery. To perform this task, you need to use the IN keyword.
You can use a subquery to check if a set of records exists. For this, you need to use the EXISTS clause with a subquery. The EXISTS keyword always returns a true or false value.
I believe this has a straightforward answer. Why don't you check it from the people who developed that function in their systems?
If you are a MS SQL developer, here is the answer directly from Microsoft.
IN:
Determines whether a specified value matches any value in a subquery or a list.
EXISTS:
Specifies a subquery to test for the existence of rows.
I found that using the EXISTS keyword is often really slow (that is very true in Microsoft Access).
I instead use the join operator in this manner:
should-i-use-the-keyword-exists-in-sql
If you can use where in instead of where exists, then where in is probably faster.
Using where in or where exists will go through all results of your parent result. The difference is that where exists will cause a lot of dependent sub-queries. If you can prevent dependent sub-queries, then where in will be the better choice.
Example
Assume we have 10,000 companies, each has 10 users (thus our users table has 100,000 entries). Now assume you want to find a user by his name or his company name.
The following query using where exists has an execution time of 141 ms:
select * from `users`
where `first_name` = 'gates'
or exists
(
    select * from `companies`
    where `users`.`company_id` = `companies`.`id`
    and `name` = 'gates'
)
This happens because for each user a dependent subquery is executed.
However, if we avoid the exists query and write it using:
select * from `users`
where `first_name` = 'gates'
or users.company_id in
(
    select id from `companies`
    where `name` = 'gates'
)
then dependent subqueries are avoided and the query runs in 0.012 ms.
I did a little exercise on a query that I have recently been using. I originally created it with INNER JOINS, but I wanted to see how it looked/worked with EXISTS. I converted it. I will include both version here for comparison.
SELECT DISTINCT Category, Name, Description
FROM [CodeSets]
WHERE Category NOT IN (
SELECT def.Category
FROM [Fields] f
INNER JOIN [DataEntryFields] def ON f.DataEntryFieldId = def.Id
INNER JOIN Section s ON f.SectionId = s.Id
INNER JOIN Template t ON s.Template_Id = t.Id
WHERE t.AgencyId = (SELECT Id FROM Agencies WHERE Name = 'Some Agency')
AND def.Category NOT IN ('OFFLIST', 'AGENCYLIST', 'RELTO_UNIT', 'HOSPITALS', 'EMS', 'TOWCOMPANY', 'UIC', 'RPTAGENCY', 'REP')
AND (t.Name like '% OH %')
AND (def.Category IS NOT NULL AND def.Category <> '')
)
ORDER BY 1
Here is the converted version:
SELECT DISTINCT cs.Category, Name, Description
FROM [CodeSets] cs
WHERE NOT EXISTS (
    SELECT * FROM [Fields] f
    WHERE EXISTS (SELECT * FROM [DataEntryFields] def
                  WHERE def.Id = f.DataEntryFieldId
                  AND def.Category NOT IN ('OFFLIST', 'AGENCYLIST', 'RELTO_UNIT', 'HOSPITALS', 'EMS', 'TOWCOMPANY', 'UIC', 'RPTAGENCY', 'REP')
                  AND (def.Category IS NOT NULL AND def.Category <> '')
                  AND def.Category = cs.Category
                  AND EXISTS (SELECT * FROM Section s
                              WHERE f.SectionId = s.Id
                              AND EXISTS (SELECT * FROM Template t
                                          WHERE s.Template_Id = t.Id
                                          AND EXISTS (SELECT * FROM Agencies
                                                      WHERE Name = 'Some Agency' and t.AgencyId = Id)
                                          AND (t.Name like '% OH %')
                                         )
                             )
                 )
    )
ORDER BY 1
The results, at least to me, were unimpressive.
If I were more technically knowledgeable about how SQL works, I could give you an answer, but take this example as you may and make your own conclusion.
The INNER JOIN and IN () version is easier to read, however.
EXISTS is faster in performance than IN.
If most of the filter criteria are in the subquery, it is better to use IN; if most of the filter criteria are in the main query, it is better to use EXISTS.