Most efficient SQL Statement: Exists vs IN - sql

I have three tables that I need to JOIN to get values from two columns.
These columns are GRN_STATUS and STATUS. I have written some SQL that achieves the desired result, but I've been advised that using IN is very inefficient and that I should use EXISTS instead.
I'm just wondering: is this true in my situation? And what would a solution using EXISTS instead of IN look like?
SQL:
SELECT c.GRN_STATUS, a.STATUS
FROM
TableA a
INNER JOIN
TableB b
ON a.ORD_NO = b.ORD_NO
AND a.COMPANY_ID = b.COMPANY_ID
INNER JOIN
TableC c
ON b.GRN_NO = c.GRN_NO
AND b.COMPANY_ID = c.COMPANY_ID
AND a.STATUS IN ( 'B', 'C', 'D', 'E' )
AND c.GRN_STATUS = 'A';

In general, it depends on the implementation in the DBMS.
EXISTS mostly stops and returns at the first match so it COULD be more efficient, but it makes no sense when you have a list of constants.
Since SQL is a declarative language, you can't tell the DBMS the how, just the what. You describe the expected result and it is up to the server to try to find the most efficient way to fulfill your request.
The way the DBMS finds the efficient algorithm is based on several things including the amount and the distribution of the data, the actual statistics, the expected resources needed, etc.
So EXISTS may perform better on a huge table while making no difference on smaller ones (or vice versa).
Your best bet is to actually check the estimated query plans or try them out.
My personal view is to use EXISTS when no data is required and JOIN when data is required. IN is for constant lists.
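To make the point about constant lists concrete, here is a minimal sketch using Python's built-in sqlite3 module with an in-memory database. The table contents are invented for illustration, and SQLite stands in for whatever DBMS the asker actually uses. It runs the original query with IN and a rewrite with EXISTS over an inline list of the same constants, and both return the same rows; the EXISTS form is only more verbose.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample data; only the column names come from the question.
conn.executescript("""
    CREATE TABLE TableA (ORD_NO INTEGER, COMPANY_ID INTEGER, STATUS TEXT);
    CREATE TABLE TableB (ORD_NO INTEGER, COMPANY_ID INTEGER, GRN_NO INTEGER);
    CREATE TABLE TableC (GRN_NO INTEGER, COMPANY_ID INTEGER, GRN_STATUS TEXT);
    INSERT INTO TableA VALUES (1, 10, 'B'), (2, 10, 'Z'), (3, 10, 'E');
    INSERT INTO TableB VALUES (1, 10, 100), (2, 10, 200), (3, 10, 300);
    INSERT INTO TableC VALUES (100, 10, 'A'), (200, 10, 'A'), (300, 10, 'X');
""")

joins = """
    SELECT c.GRN_STATUS, a.STATUS
    FROM TableA a
    INNER JOIN TableB b ON a.ORD_NO = b.ORD_NO AND a.COMPANY_ID = b.COMPANY_ID
    INNER JOIN TableC c ON b.GRN_NO = c.GRN_NO AND b.COMPANY_ID = c.COMPANY_ID
    WHERE c.GRN_STATUS = 'A'
"""

# The asker's filter: IN with a list of constants.
with_in = joins + " AND a.STATUS IN ('B', 'C', 'D', 'E')"

# The same filter forced into EXISTS: the constants become an inline table.
with_exists = joins + """ AND EXISTS (
        SELECT 1
        FROM (SELECT 'B' AS s UNION ALL SELECT 'C'
              UNION ALL SELECT 'D' UNION ALL SELECT 'E') v
        WHERE v.s = a.STATUS)
"""

rows_in = conn.execute(with_in).fetchall()
rows_exists = conn.execute(with_exists).fetchall()
print(rows_in, rows_exists)
```

Whether either form is faster on a real system is something only the query plan on that system can tell you.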

Related

Sql query optimization using IN over INNER JOIN

Given:
Table y
id int clustered index
name nvarchar(25)
Table anothertable
id int clustered Index
name nvarchar(25)
Table someFunction
does some math then returns a valid ID
Compare:
SELECT y.name
FROM y
WHERE dbo.SomeFunction(y.id) IN (SELECT anotherTable.id
FROM AnotherTable)
vs:
SELECT y.name
FROM y
JOIN AnotherTable ON dbo.SomeFunction(y.id) ON anotherTable.id
Question:
While timing these two queries I found that on large data sets the first query, using IN, is much faster than the second query, using an INNER JOIN. I do not understand why. Can someone help explain, please?
Execution Plan
Generally speaking IN is different from JOIN in that a JOIN can return additional rows where a row has more than one match in the JOIN-ed table.
From your estimated execution plan, though, it can be seen that in this case the two queries are semantically the same:
SELECT
A.Col1
,dbo.Foo(A.Col1)
,MAX(A.Col2)
FROM A
WHERE dbo.Foo(A.Col1) IN (SELECT Col1 FROM B)
GROUP BY
A.Col1,
dbo.Foo(A.Col1)
versus
SELECT
A.Col1
,dbo.Foo(A.Col1)
,MAX(A.Col2)
FROM A
JOIN B ON dbo.Foo(A.Col1) = B.Col1
GROUP BY
A.Col1,
dbo.Foo(A.Col1)
Even if duplicates are introduced by the JOIN then they will be removed by the GROUP BY as it only references columns from the left hand table. Additionally these duplicate rows will not alter the result as MAX(A.Col2) will not change. This would not be the case for all aggregates however. If you were to use SUM(A.Col2) (or AVG or COUNT) then the presence of the duplicates would change the result.
It seems that SQL Server doesn't have any logic to differentiate between aggregates such as MAX and those such as SUM and so quite possibly it is expanding out all the duplicates then aggregating them later and simply doing a lot more work.
The estimated number of rows being aggregated is 2893.54 for IN vs 28271800 for JOIN but these estimates won't necessarily be very reliable as the join predicate is unsargable.
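The duplicate-row argument above can be checked on a toy data set. This is a minimal sketch using Python's sqlite3 module; the function foo is a hypothetical stand-in for dbo.Foo (registered as a Python callback), and the rows are invented so that one row of A matches three rows of B. It says nothing about SQL Server's plans, only about the results.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical scalar function standing in for dbo.Foo.
conn.create_function("foo", 1, lambda x: x * 10)
conn.executescript("""
    CREATE TABLE A (Col1 INTEGER, Col2 INTEGER);
    CREATE TABLE B (Col1 INTEGER);
    INSERT INTO A VALUES (1, 5), (2, 7);
    -- foo(1) = 10 matches three rows in B, foo(2) = 20 matches one.
    INSERT INTO B VALUES (10), (10), (10), (20);
""")

def with_in(agg):
    sql = f"""SELECT A.Col1, {agg}(A.Col2) FROM A
              WHERE foo(A.Col1) IN (SELECT Col1 FROM B)
              GROUP BY A.Col1 ORDER BY A.Col1"""
    return conn.execute(sql).fetchall()

def with_join(agg):
    sql = f"""SELECT A.Col1, {agg}(A.Col2) FROM A
              JOIN B ON foo(A.Col1) = B.Col1
              GROUP BY A.Col1 ORDER BY A.Col1"""
    return conn.execute(sql).fetchall()

max_in, max_join = with_in("MAX"), with_join("MAX")
sum_in, sum_join = with_in("SUM"), with_join("SUM")

print(max_in, max_join)  # identical: duplicates do not change MAX
print(sum_in, sum_join)  # different: the JOIN counts Col2 once per match
```

With MAX the two forms agree, which is why they are interchangeable in the question; with SUM the join triples the contribution of the row that has three matches.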
Your second query is a bit funny - can you try this one instead??
SELECT y.name
FROM dbo.y
INNER JOIN dbo.AnotherTable a ON a.id = dbo.SomeFunction(y.id)
Does that make any difference?
Otherwise: look at the execution plans! And possibly post them here. Without knowing a lot more about your tables (amount and distribution of data etc.) and your system (RAM, disk etc.), it's really hard to give a "globally" valid statement.
Well, for one thing: get rid of the scalar UDF that is implied by dbo.SomeFunction(y.id). That will kill your performance real good. Even if you replace it with a one-row inline table-valued function it will be better.
As for your actual question, I have found similar results in other situations and have been similarly perplexed. The optimizer just treats them differently; I'll be interested to see what answers others provide.

The purpose of SQL's EXISTS and NOT EXISTS

Every now and then I see these being used, but it never seems to be anything that can't be performed equally well, if not better, by using a normal join or subquery.
I see them as being misleading (they're arguably harder to accurately visualize compared to conventional joins and subqueries), often misunderstood (e.g. using SELECT * will behave the same as SELECT 1 in the EXISTS/NOT EXISTS subquery), and from my limited experience, slower to execute.
Can someone describe and/or provide me an example where they are best suited or where there is no option other than to use them? Note that since their execution and performance are likely platform dependent, I'm particularly interested in their use in MySQL.
This article (though SQL Server related):
IN vs. JOIN vs. EXISTS
may be of interest to you.
In a nutshell, JOIN is a set operation, while EXISTS is a predicate.
In other words, these queries:
SELECT *
FROM a
JOIN b
ON some_condition(a, b)
vs.
SELECT *
FROM a
WHERE EXISTS
(
SELECT NULL
FROM b
WHERE some_condition(a, b)
)
are not the same: the former can return the same record from a more than once (once per matching record in b), while the latter returns each record from a at most once.
Their counterparts, NOT EXISTS vs. LEFT JOIN / IS NULL, are the same logically but not performance-wise.
In fact, the former may be more efficient in SQL Server:
NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: SQL Server
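A small sketch of both claims, using Python's sqlite3 module with invented tables a and b (the column a_id is an assumption made for the example). It shows the JOIN repeating a row of a that has two matches, EXISTS returning it once, and NOT EXISTS agreeing with LEFT JOIN / IS NULL. It only demonstrates the logical results, not the performance difference the linked article discusses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER);
    CREATE TABLE b (a_id INTEGER);
    INSERT INTO a VALUES (1), (2), (3);
    INSERT INTO b VALUES (1), (1), (2);  -- a.id = 1 has two matches
""")

def ids(sql):
    # Flatten single-column result rows into a list of values.
    return [row[0] for row in conn.execute(sql)]

join_ids = ids("SELECT a.id FROM a JOIN b ON b.a_id = a.id ORDER BY a.id")
exists_ids = ids("""SELECT a.id FROM a
                    WHERE EXISTS (SELECT NULL FROM b WHERE b.a_id = a.id)
                    ORDER BY a.id""")

not_exists_ids = ids("""SELECT a.id FROM a
                        WHERE NOT EXISTS (SELECT NULL FROM b WHERE b.a_id = a.id)
                        ORDER BY a.id""")
left_join_ids = ids("""SELECT a.id FROM a
                       LEFT JOIN b ON b.a_id = a.id
                       WHERE b.a_id IS NULL
                       ORDER BY a.id""")

print(join_ids, exists_ids, not_exists_ids, left_join_ids)
```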
EXISTS is useful if the main query returns far fewer rows than the table in which you want to find them. Example:
SELECT st.State
FROM states st
WHERE st.State LIKE 'N%' AND EXISTS(SELECT 1 FROM addresses a WHERE a.State = st.State)
Doing this with a join will be much slower. A better example is when you want to check whether an item exists in one of multiple tables.
You can't [easily] use a join in an UPDATE statement, so WHERE EXISTS works excellently there:
UPDATE mytable t
SET columnX = 'SomeValue'
WHERE EXISTS
(SELECT 1
FROM myothertable ot
WHERE ot.columnA = t.columnY
AND ot.columnB = 'XYX'
);
Edit: Basing this on Oracle more than MySQL, and yes there are ways to do it with an inline view, but IMHO this is cleaner.
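For anyone who wants to try the UPDATE ... WHERE EXISTS pattern without an Oracle or MySQL instance, here is a minimal sketch using Python's sqlite3 module. The table and column names follow the answer; the id column and the row values are invented. The outer table is referenced by its full name rather than an alias, purely to stay compatible with older SQLite builds.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mytable (id INTEGER, columnX TEXT, columnY INTEGER);
    CREATE TABLE myothertable (columnA INTEGER, columnB TEXT);
    INSERT INTO mytable VALUES (1, 'old', 100), (2, 'old', 200), (3, 'old', 300);
    -- Two matching rows for columnY = 100: EXISTS only cares that one exists.
    INSERT INTO myothertable VALUES (100, 'XYX'), (100, 'XYX'), (200, 'ABC');
""")

cursor = conn.execute("""
    UPDATE mytable
    SET columnX = 'SomeValue'
    WHERE EXISTS (SELECT 1
                  FROM myothertable ot
                  WHERE ot.columnA = mytable.columnY
                    AND ot.columnB = 'XYX')
""")
updated = cursor.rowcount  # number of rows the UPDATE touched

rows = conn.execute("SELECT id, columnX FROM mytable ORDER BY id").fetchall()
print(updated, rows)
```

Only the row whose columnY has a matching 'XYX' record is changed, and the duplicate match does not cause it to be updated twice.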

SQL: Is a query like this OK or is there a more efficient way of doing it, like using a join?

I often find myself wanting to write an SQL query like the following:
SELECT body
FROM node_revisions
where vid = (SELECT vid
FROM node
WHERE nid = 4);
I know that there are joins and stuff you could do, but they seem to make things more complicated. Are joins a better way to do it? Is it more efficient? Easier to understand?
Joins tend to be more efficient since databases are written with set operations in mind (and joins are set operations).
However, performance will vary from database to database, how the tables are structured, the amount of data in them and how much will be returned by the query.
If the amount of data is small, I would use a subquery like yours rather than a join.
Here is what a join would look like:
SELECT body
FROM node_revisions nr
INNER JOIN node n
ON nr.vid = n.vid
WHERE n.nid = 4
I would not use the query you posted, as there is a chance of more than one node record with nid = 4, which would cause it to fail.
I would use:
SELECT body
FROM node_revisions
WHERE vid IN (SELECT vid
FROM node
WHERE nid = 4);
Is this more readable or understandable? In this case, it's a matter of personal preference.
I think joins are easier to understand and can be more efficient. Your case is pretty simple, so it is probably a toss-up. Here is how I would write it:
SELECT body
FROM node_revisions
inner join node
on (node_revisions.vid = node.vid)
WHERE node.nid = 4
The answer to any performance related questions in databases is it depends, and we're short on details in the OP. Knowing no specifics about your situation... (thus, these are general rules of thumb)
Joins are better and easier to understand
If for some reason you need multiple column keys (fishy), you can continue to use a join and simply tack on another expression to the join condition.
If in the future you really do need to join auxiliary data, the join framework is already there.
It makes it more clear exactly what you're joining on and where indexes should be implemented.
Use of joins makes you better at joins and better at thinking about joins.
Joins are clear about what tables are in play
Written queries have nothing to do with efficiency*
The queries you write and what actually gets run have little to do with one another. There are many ways to write a query but only so few ways to fetch the data, and it's up to the query engine to decide. This relates mostly to indexes. It's very possible to write four queries that look totally different but internally do the same thing.
(* It's possible to write a horrible query that is inefficient but it takes a special kind of crazy to do that.)
select
body
from node_revisions nr
join node n
on n.vid = nr.vid
where n.nid = 4
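The claim that differently written queries can amount to the same thing is easy to check at the level of results. The sketch below uses Python's sqlite3 module with invented rows and assumes nid is the primary key of node. It runs five spellings of the lookup that appear in this thread and collects their results; it does not inspect what the engine does internally.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical data; nid is unique here, which is what makes the forms agree.
conn.executescript("""
    CREATE TABLE node (nid INTEGER PRIMARY KEY, vid INTEGER);
    CREATE TABLE node_revisions (vid INTEGER PRIMARY KEY, body TEXT);
    INSERT INTO node VALUES (4, 40), (5, 50);
    INSERT INTO node_revisions VALUES (40, 'current body of node 4'),
                                      (41, 'old body of node 4'),
                                      (50, 'body of node 5');
""")

queries = {
    "scalar subquery": """SELECT body FROM node_revisions
                          WHERE vid = (SELECT vid FROM node WHERE nid = 4)""",
    "IN": """SELECT body FROM node_revisions
             WHERE vid IN (SELECT vid FROM node WHERE nid = 4)""",
    "explicit join": """SELECT body FROM node_revisions nr
                        JOIN node n ON n.vid = nr.vid
                        WHERE n.nid = 4""",
    "comma join": """SELECT body FROM node_revisions nr, node n
                     WHERE n.nid = 4 AND nr.vid = n.vid""",
    "EXISTS": """SELECT body FROM node_revisions a
                 WHERE EXISTS (SELECT 1 FROM node b
                               WHERE a.vid = b.vid AND b.nid = 4)""",
}

results = {name: conn.execute(sql).fetchall() for name, sql in queries.items()}
for name, rows in results.items():
    print(f"{name}: {rows}")
```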
A join is interesting:
select body
from node_revisions nr
join node n on nr.vid = n.vid
where n.nid = 4
But you can also express a join without a join [!]:
select body
from node_revisions nr, node n
where n.nid = 4 and nr.vid = n.vid
Interestingly enough, SQL Server gives a slightly different query plan for the two queries: where the join has a clustered index scan, the "join without a join" has a clustered index seek in its place, which indicates it's better, at least in this case!
select
body
from node_revisions A
where exists (select 'x'
from Node B
Where A.Vid = B.Vid and B.NID=4)
I don't see anything wrong with what you wrote, and a good optimizer may even change it to a join if it sees fit.
SELECT body
FROM node_revisions
WHERE vid =
(
SELECT vid
FROM node
WHERE nid = 4
)
This query is logically equivalent to a join if and only if nid is a PRIMARY KEY or is covered by a UNIQUE constraint.
Otherwise, the queries are not equivalent: a join will always succeed, while the subquery will fail if there is more than one row in node with nid = 4.
If nid is a PRIMARY KEY, then the JOIN and the subquery will have same performance.
In the case of a join, node will be made the leading table.
In the case of a subquery, the subquery will be executed once and transformed into a constant at the parsing stage.
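Here is a sketch of what happens when nid is not unique, using Python's sqlite3 module and invented rows. One caveat that is specific to the tool chosen for the demo: SQLite does not raise an error when a scalar subquery yields several rows, it silently uses the first one, whereas engines such as MySQL and SQL Server reject the statement at run time. Either way, the `=` form stops being equivalent to IN or a join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# No PRIMARY KEY or UNIQUE constraint on nid, so two rows can share nid = 4.
conn.executescript("""
    CREATE TABLE node (nid INTEGER, vid INTEGER);
    CREATE TABLE node_revisions (vid INTEGER PRIMARY KEY, body TEXT);
    INSERT INTO node VALUES (4, 40), (4, 41);
    INSERT INTO node_revisions VALUES (40, 'first revision'),
                                      (41, 'second revision'),
                                      (50, 'unrelated');
""")

rows_in = sorted(conn.execute("""
    SELECT body FROM node_revisions
    WHERE vid IN (SELECT vid FROM node WHERE nid = 4)
""").fetchall())

rows_join = sorted(conn.execute("""
    SELECT body FROM node_revisions nr
    JOIN node n ON nr.vid = n.vid
    WHERE n.nid = 4
""").fetchall())

# SQLite-specific behaviour: only the first row of the subquery is used.
rows_scalar = conn.execute("""
    SELECT body FROM node_revisions
    WHERE vid = (SELECT vid FROM node WHERE nid = 4)
""").fetchall()

print(rows_in, rows_join, rows_scalar)
```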
The latest MySQL 6.x code will automatically convert that IN expression into an INNER JOIN using a semi-join subquery optimization, making the 2 statements largely equivalent:
http://forge.mysql.com/worklog/task.php?id=3740
But actually writing it out is pretty simple, because INNER JOIN is the default join type, and doing so doesn't rely on the server optimizing it away (which it might decide not to do for some reason, and which wouldn't necessarily be portable). All things being equal, why not go with:
select body from node_revisions r, node n where r.vid = n.vid and n.nid = 4

Does the SQL spec provide for a better way to do an exclusive ORing of two sets?

I have a result set A which is 10 rows 1-10 {1,2,3,4,5,6,7,8,9,10}, and B which is 10 rows consisting of evens 1-20 {2,4,6,8,10,12,14,16,18,20}. I want to find the elements that are in one set but not both. There are no other columns in the rows.
I know a UNION will be A + B. I can find the ones in both A and B with A INTERSECT B. I can find all of the rows in A that are not in B with A EXCEPT B.
This brings me to the question: how do I find all rows that are in A or B, but not both? Is there a single-operation equivalent of ( A EXCEPT B ) UNION ( B EXCEPT A ) in the SQL spec? I want a set of {1,3,5,7,9,12,14,16,18,20}. I believe this can also be written A UNION B EXCEPT ( A INTERSECT B ).
Is there a mathy reason in set theory why this can't be done in one operation (that can be explained to someone who doesn't understand set theory)? Or, is it just not implemented because it is so simple to build yourself? Or, do I just not know of a better way to do it?
I'm thinking this must be in the SQL spec somewhere: I know the thing is humongous.
There's another way to do what you want, using a FULL OUTER JOIN with a WHERE clause to remove the rows that appear in both tables. This is probably more efficient than the constructs you suggested, but you should of course measure the performance of both to be sure. Here's a query you might be able to use:
SELECT COALESCE(A.id, B.id) AS id
FROM A
FULL OUTER JOIN B
ON A.id = B.id
WHERE A.id IS NULL OR B.id IS NULL
The "exclusive-or" type operation is also called symmetric set difference in set theory. Using this phrase in search, I found a page describing a number of techniques to implement the Symmetric Difference in SQL. It describes a couple of queries and how to optimise them. Although the details appear to be specific to Oracle, the general techniques are probably applicable to any DBMS.
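The formulations discussed in this thread can be compared on the asker's own sets. This is a minimal sketch using Python's sqlite3 module. Two SQLite-specific adjustments are assumptions of the demo rather than part of the answers: compound SELECTs cannot be parenthesised directly, so each side is wrapped in a subquery, and the FULL OUTER JOIN filter is written as two anti-joins glued with UNION ALL, because older SQLite builds lack FULL OUTER JOIN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE A (id INTEGER)")
conn.execute("CREATE TABLE B (id INTEGER)")
conn.executemany("INSERT INTO A VALUES (?)", [(i,) for i in range(1, 11)])
conn.executemany("INSERT INTO B VALUES (?)", [(i,) for i in range(2, 21, 2)])

def ids(sql):
    # Flatten single-column result rows into a list of values.
    return [row[0] for row in conn.execute(sql)]

# (A EXCEPT B) UNION (B EXCEPT A)
except_union = ids("""
    SELECT id FROM (SELECT id FROM A EXCEPT SELECT id FROM B)
    UNION
    SELECT id FROM (SELECT id FROM B EXCEPT SELECT id FROM A)
    ORDER BY 1
""")

# (A UNION B) EXCEPT (A INTERSECT B)
union_except = ids("""
    SELECT id FROM (SELECT id FROM A UNION SELECT id FROM B)
    EXCEPT
    SELECT id FROM (SELECT id FROM A INTERSECT SELECT id FROM B)
    ORDER BY 1
""")

# Join-based form: rows of A with no partner in B, plus the reverse.
anti_joins = ids("""
    SELECT A.id FROM A LEFT JOIN B ON A.id = B.id WHERE B.id IS NULL
    UNION ALL
    SELECT B.id FROM B LEFT JOIN A ON A.id = B.id WHERE A.id IS NULL
    ORDER BY 1
""")

print(except_union, union_except, anti_joins)
```

All three produce the symmetric difference the asker listed; which one is fastest on a given DBMS still has to be measured.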

How To Create a SQL Index to Improve ORDER BY performance

I have some SQL similar to the following, which joins four tables and then orders the results by the "status" column of the first:
SELECT *
FROM a, b, c, d
WHERE b.aid=a.id AND c.id=a.cid AND a.did=d.id AND a.did='XXX'
ORDER BY a.status
It works. However, it's slow. I've worked out this is because of the ORDER BY clause and the lack of any index on table "a".
All four tables have the PRIMARY KEYs set on the "id" column.
So, I know I need to add an index to table a which includes the "status" column but what else does it need to include? Should "bid", "cid" and "did" be in there too?
I've tried to ask this in a general SQL sense but, if it's important, the target is SQLite for use with Gears.
Thanks in advance,
Jake (noob)
I would say it's slow because the engine is doing scans all over the place instead of seeks. Did you mean to do SELECT a.* instead? That would be faster as well; SELECT * here is equivalent to a.*, b.*, c.*, d.*.
You will probably get better results if you put a separate index on each of these columns:
a.did (so that a.did = 'XXX' is a seek instead of a scan, also helps a.did = d.id)
a.cid (for a.cid = c.id)
b.aid (for a.id = b.aid)
You could try adding Status to the first and second indexes with ASCENDING order, for additional performance - it doesn't hurt.
I'd be curious as to how you worked out that the problem is 'the ORDER BY clause and the lack of any index on table "a".' I find this a little suspicious because there is an index on table a, on the primary key, you later say.
Looking at the nature of the query and what I can guess about the nature of the data, I would think that this query would generally produce relatively few results compared to the size of the tables it's using, and that thus the ORDER BY would be extremely cheap. Of course, this is just a guess.
Whether an index will even help at all is dependent on the data in the table. What indices your query optimizer will use when doing a query is dependent on a lot of different factors, one of the big ones being the expected number of results produced from a lookup.
One thing that would help a lot is if you would post the output of EXPLAINing your query.
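Since the target is SQLite, the effect of an index on the filter and sort columns can be inspected directly with EXPLAIN QUERY PLAN. The sketch below uses Python's sqlite3 module and is deliberately simplified: it looks only at table a with the filter and ORDER BY from the question, the column types are guesses, and the index name idx_a_did_status is hypothetical. The exact wording of the plan text varies between SQLite versions, so treat the printed output as indicative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed column types; only the column names come from the question.
conn.execute(
    "CREATE TABLE a (id INTEGER PRIMARY KEY, cid INTEGER, did TEXT, status TEXT)"
)

query = "SELECT * FROM a WHERE did = 'XXX' ORDER BY status"

def plan(sql):
    # The human-readable step description is the last column of each plan row.
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[-1] for row in rows]

plan_before = plan(query)

# Composite index: equality column first, then the ORDER BY column.
conn.execute("CREATE INDEX idx_a_did_status ON a (did, status)")
plan_after = plan(query)

print("before:", plan_before)
print("after: ", plan_after)
```

Before the index exists the plan includes a separate sorting step for the ORDER BY; afterwards the rows come out of the index already ordered by status within the matching did. Running the same check on the full four-table query is the way to confirm it helps there too.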
Have you tried joins?
select * from a inner join b on a.id = b.aid inner join c on a.cid = c.id inner join d on a.did=d.id where a.did='XXX'
ORDER BY a.status
The correct use of joins (left, right, inner, outer) depends on the structure of the tables.
Hope this helps.