When to use EXCEPT as opposed to NOT EXISTS in Transact SQL? - sql

I just recently learned of the existence of the new "EXCEPT" clause in SQL Server (a bit late, I know...) through reading code written by a co-worker. It truly amazed me!
But then I have some questions regarding its usage: when is it recommended to be employed? Is there a difference, performance-wise, between using it versus a correlated query employing "AND NOT EXISTS..."?
After reading EXCEPT's article in the BOL I thought it was just a shorthand for the second option, but was surprised when I rewrote a couple queries using it (so they had the "AND NOT EXISTS" syntax much more familiar to me) and then checked the execution plans - surprise! The EXCEPT version had a shorter execution plan, and executed faster, also. Is this always so?
So I'd like to know: what are the guidelines for using this powerful tool?

EXCEPT treats NULL values as matching.
This query:
WITH q (value) AS
(
SELECT NULL
UNION ALL
SELECT 1
),
p (value) AS
(
SELECT NULL
UNION ALL
SELECT 2
)
SELECT *
FROM q
WHERE value NOT IN
(
SELECT value
FROM p
)
will return an empty rowset.
This query:
WITH q (value) AS
(
SELECT NULL
UNION ALL
SELECT 1
),
p (value) AS
(
SELECT NULL
UNION ALL
SELECT 2
)
SELECT *
FROM q
WHERE NOT EXISTS
(
SELECT NULL
FROM p
WHERE p.value = q.value
)
will return
NULL
1
, and this one:
WITH q (value) AS
(
SELECT NULL
UNION ALL
SELECT 1
),
p (value) AS
(
SELECT NULL
UNION ALL
SELECT 2
)
SELECT *
FROM q
EXCEPT
SELECT *
FROM p
will return:
1
Recursive reference is also allowed in EXCEPT clause in a recursive CTE, though it behaves in a strange way: it returns everything except the last row of a previous set, not everything except the whole previous set:
WITH q (value) AS
(
SELECT 1
UNION ALL
SELECT 2
UNION ALL
SELECT 3
),
rec (value) AS
(
SELECT value
FROM q
UNION ALL
SELECT *
FROM (
SELECT value
FROM q
EXCEPT
SELECT value
FROM rec
) q2
)
SELECT TOP 10 *
FROM rec
---
1
2
3
-- original set
1
2
-- everything except the last row of the previous set, that is 3
1
3
-- everything except the last row of the previous set, that is 2
1
2
-- everything except the last row of the previous set, that is 3, etc.
1
SQL Server developers must just have forgotten to forbid it.

I have done a lot of analysis of except, not exists, not in and left outer join. Generally the left outer join is the fastest for finding missing rows, especially joining on a primary key. Not In can be very fast if you know it will be a small list returned in the select.
I use EXCEPT a lot to compare what is being returned when rewriting code. Run the old code saving results. Run new code saving results and then use except to capture all differences. It is a very quick and easy way to find differences, especially when needing to get all differences including null. Very good for on the fly easy coding.
But, every situation is different. I say to every developer I have ever mentored. Try it. Do timings all different ways. Try it, time it, do it.

EXCEPT compares all (paired)columns of two full-selects.
NOT EXISTS compares two or more tables accoding to the conditions specified in WHERE clause in the sub-query following NOT EXISTS keyword.
EXCEPT can be rewritten by using NOT EXISTS.
(EXCEPT ALL can be rewritten by using ROW_NUMBER and NOT EXISTS.)
Got this from here

There is no accounting for SQL server's execution plans. I have always found when having performance issues that it was utterly arbitrary (from a user's perspective, I'm sure the algorithm writers would understand why) when one syntax made a better execution plan rather than another.
In this case, something about the query parameter comparison allows SQL to figure out a shortcut that it couldn't from a straight select statement. I'm sure that is a deficiency in the algorithm. In other words, you could logically interpolate the same thing, but the algorithm doesn't make that translation on an exists query. Sometimes that is because an algorithm that could reliably figure it out would take longer to execute than the query itself, or at least the algorithm designer thought so.

If your query is fine tuned then there is no performance difference b/w using of EXCEPT clause and NOT EXIST/NOT IN.. first time when I ran EXCEPT after changing my correlated query into it.. I was surprised because it returned with the result just in 7 secs while correlated query was returning in 22 secs.. then I used distinct clause in my correlated query and reran.. it also returned in 7 secs.. so EXCEPT is good when you don't know or don't have time to fine tuned your query otherwise both are same performance wise..

Related

SQL Server - Short circuit Select Union if any returns row

I do have multiple select queries which I wanted to execute against SQL server to perform some data validation.
Select top 1 1 from tbl1 where somecondition = 1
UNION
Select top 1 1 from tbl2 where somecondition = 1
UNION
Select top 1 1 from tbl3 where somecondition = 1
And what I want is if any of these three select statement returns any record then the data validation should fail. One way is to just run them sequentially and check the results. But, to make it more performant what I am thinking is to apply UNION operator so that these queries can run in parallel and short circuit some way if any returns any row. How can I achieve that?
UNION won't short-circuit. I would suggest something more like this:
select v.*
from (values (1)) v(x)
where exists (select 1 from tbl1 where somecondition = 1) or
exists (select 1 from tbl2 where somecondition = 1) or
exists (select 1 from tbl3 where somecondition = 1) ;
exists should short-circuit evaluation if any row is found.
Trying to fine-tune how and when a DBMS does something is generally the wrong mindset. Instead, you should try to describe the result you want, in a way that the DBMS itself has a chance to optimise it.
You can then test to see if it achieves the performance you were hoping for in a specific scenario. DBMS optimisers are incredibly complex, so that you generally can't look at a query on its own and say how the DBMS will run it - the strategy used will change depending on not just the schema but the actual data involved.
So, here is my attempt to reframe your requirement:
I want to check if any one of three select queries will return at least one row. I don't need to know the result, or which query has the row.
The simplest way to express that in SQL is with EXISTS and OR; to make a standalone query, we can wrap in CASE WHEN ... END:
SELECT CASE
WHEN
EXISTS( Select 1 from tbl1 where somecondition = 1 )
OR
EXISTS( Select 1 from tbl2 where somecondition = 1 )
OR
EXISTS( Select 1 from tbl3 where somecondition = 1 )
THEN 1
ELSE 0
END AS MatchExists
Note that I've removed the TOP 1 from the queries - stopping when you find one matching row is already implied by EXISTS.
Now, I've no idea what the DBMS will do with this query. It might:
Run the three queries sequentially, in the order specified, stopping if it finds a match.
Spawn multiple threads each checking a condition, and abort if one finds a match, as you hoped.
Notice from its statistics that tbl2 has only ~10 rows, and scan that table first.
Notice that the condition on tbl3 can be read from an index, so check that one first.
Some other clever trick that hasn't occurred to either of us.
Some combination of the above.
The only way to find out is to run it with your real data and look at the query plan, and the measured performance. If it picks a bad approach, you may need to do something different - but that may just be updating statistics, or adding the right indexes, as with an query optimisation.

Assistance with SQL statement

I'm using sql-server 2005 and ASP.NET with C#.
I have Users table with
userId(int),
userGender(tinyint),
userAge(tinyint),
userCity(tinyint)
(simplified version of course)
I need to select always two fit to userID I pass to query users of opposite gender, in age range of -5 to +10 years and from the same city.
Important fact is it always must be two, so I created condition if ##rowcount<2 re-select without age and city filters.
Now the problem is that I sometimes have two returned result sets because I use first ##rowcount on a table. If I run the query.
Will it be a problem to use the DataReader object to read from always second result set? Is there any other way to check how many results were selected without performing select with results?
Can you simplify it by using SELECT TOP 2 ?
Update: I would perform both selects all the time, union the results, and then select from them based on an order (using SELECT TOP 2) as the union may have added more than two. Its important that this next select selects the rows in order of importance, ie it prefers rows from your first select.
Alternatively, have the reader logic read the next result-set if there is one and leave the SQL alone.
To avoid getting two separate result sets you can do your first SELECT into a table variable and then do your ##ROWCOUNT check. If >= 2 then just select from the table variable on its own otherwise select the results of the table variable UNION ALLed with the results of the second query.
Edit: There is a slight overhead to using table variables so you'd need to balance whether this was cheaper than Adam's suggestion just to perform the 'UNION' as a matter of routine by looking at the execution stats for both approaches
SET STATISTICS IO ON
Would something along the following lines be of use...
SELECT *
FROM (SELECT 1 AS prio, *
FROM my_table M1 JOIN my_table M2
WHERE M1.userID = supplied_user_id AND
M1.userGender <> M2.userGender AND
M1.userAge - 5 >= M2.userAge AND
M1.userAge + 15 <= M2.userAge AND
M1.userCity = M2.userCity
LIMIT TO 2 ROWS
UNION
SELECT 2 AS prio, *
FROM my_table M1 JOIN my_table M2
WHERE M1.userID = supplied_user_id AND
M1.userGender <> M2.userGender
LIMIT TO 2 ROWS)
ORDER BY prio
LIMIT TO 2 ROWS;
I haven't tried it as I have no SQL Server and there may be dialect issues.

Which of these queries is more efficient?

Which of these queries are more efficient?
select 1 as newAndClosed
from sysibm.sysdummy1
where exists (
select 1
from items
where new = 1
)
and not exists (
select 1
from status
where open = 1
)
select 1 as newAndClosed
from items
where new = 1
and not exists (
select 1
from status
where open = 1
)
Look at the explain plan and/or profiler output. Also, measure it. Measure it using bind variables and repeated runs.
I think the second one is faster because on contrary to the first one, the sysibm.sysdummy1 table does not need to parse
From a simplistic point of view I'd expect query 2 to run faster because it involves fewer queries, but as Hank points out, running it through a Profiler is the best way to be sure.
They will produce different results if items contains more than one element with new = 1. exists will only check first record which matches the condition. So I'd vote for first variant, if in actual query you don't have relation between items and status (as in your example).
P.S. Usually I use SELECT 1 WHERE 2==2 when I need only one result from nowhere. If I need more SELECT 1 UNION SELECT 2.
I would personally say the second query.
First, it queries the table Items directly and filter with the WHERE clause on a field of this table.
Second, it uses only one other subquery, instead of two.
In the end, the second example ends doing only two queries, when the first example does three.
Many queries will always be more expensive than less queries to manage for the database engine. Except if you perform EXISTS verifications instead of table joints. A joint is more expensive than an EXISTS clause.

Evaluation of CTEs in SQL Server 2005

I have a question about how MS SQL evaluates functions inside CTEs. A couple of searches didn't turn up any results related to this issue, but I apologize if this is common knowledge and I'm just behind the curve. It wouldn't be the first time :-)
This query is a simplified (and obviously less dynamic) version of what I'm actually doing, but it does exhibit the problem I'm experiencing. It looks like this:
CREATE TABLE #EmployeePool(EmployeeID int, EmployeeRank int);
INSERT INTO #EmployeePool(EmployeeID, EmployeeRank)
SELECT 42, 1
UNION ALL
SELECT 43, 2;
DECLARE #NumEmployees int;
SELECT #NumEmployees = COUNT(*) FROM #EmployeePool;
WITH RandomizedCustomers AS (
SELECT CAST(c.Criteria AS int) AS CustomerID,
dbo.fnUtil_Random(#NumEmployees) AS RandomRank
FROM dbo.fnUtil_ParseCriteria(#CustomerIDs, 'int') c)
SELECT rc.CustomerID,
ep.EmployeeID
FROM RandomizedCustomers rc
JOIN #EmployeePool ep ON ep.EmployeeRank = rc.RandomRank;
DROP TABLE #EmployeePool;
The following can be assumed about all executions of the above:
The result of dbo.fnUtil_Random() is always an int value greater than zero and less than or equal to the argument passed in. Since it's being called above with #NumEmployees which has the value 2, this function always evaluates to 1 or 2.
The result of dbo.fnUtil_ParseCriteria(#CustomerIDs, 'int') produces a one-column, one-row table that contains a sql_variant with a base type of 'int' that has the value 219935.
Given the above assumptions, it makes sense (to me, anyway) that the result of the expression above should always produce a two-column table containing one record - CustomerID and an EmployeeID. The CustomerID should always be the int value 219935, and the EmployeeID should be either 42 or 43.
However, this is not always the case. Sometimes I get the expected single record. Other times I get two records (one for each EmployeeID), and still others I get no records. However, if I replace the RandomizedCustomers CTE with a true temp table, the problem vanishes completely.
Every time I think I have an explanation for this behavior, it turns out to not make sense or be impossible, so I literally cannot explain why this would happen. Since the problem does not happen when I replace the CTE with a temp table, I can only assume it has something to do with the functions inside CTEs are evaluated during joins to that CTE. Do any of you have any theories?
SQL Server's optimizer is free to decide whether to reevaluate a CTE or not.
For instance, this query:
WITH q AS
(
SELECT NEWID() AS n
)
SELECT *
FROM q
UNION ALL
SELECT *
FROM q
will produce two different NEWID()'s, however, if you use cached XML plan to wrap the CTE into an Eager Spool operation, the records will be same.

What do you put in a subquery's Select part when it's preceded by Exists?

What do you put in a subquery's Select part when it's preceded by Exists?
Select *
From some_table
Where Exists (Select 1
From some_other_table
Where some_condition )
I usually use 1, I used to put * but realized it could add some useless overhead.
What do you put? is there a more efficient way than putting 1 or any other dummy value?
I think the efficiency depends on your platform.
In Oracle, SELECT * and SELECT 1 within an EXISTS clause generate identical explain plans, with identical memory costs. There is no difference. However, other platforms may vary.
As a matter of personal preference, I use
SELECT *
Because SELECTing a specific field could mislead a reader into thinking that I care about that specific field, and it also lets me copy / paste that subquery out and run it unmodified, to look at the output.
However, an EXISTS clause in a SQL statement is a bit of a code smell, IMO. There are times when they are the best and clearest way to get what you want, but they can almost always be expressed as a join, which will be a lot easier for the database engine to optimize.
SELECT *
FROM SOME_TABLE ST
WHERE EXISTS(
SELECT 1
FROM SOME_OTHER_TABLE SOT
WHERE SOT.KEY_VALUE1 = ST.KEY_VALUE1
AND SOT.KEY_VALUE2 = ST.KEY_VALUE2
)
Is logically identical to:
SELECT *
FROM
SOME_TABLE ST
INNER JOIN
SOME_OTHER_TABLE SOT
ON ST.KEY_VALUE1 = SOT.KEY_VALUE1
AND ST.KEY_VALUE2 = SOT.KEY_VALUE2
I also use 1. I've seen some devs who use null. I think 1 is efficient compared to selecting from any field as the query won't have to get the actual value from the physical loc when it executes the select clause of the subquery.
Use:
WHERE EXISTS (SELECT NULL
FROM some_other_table
WHERE ... )
EXISTS returns true if one or more of the specified criteria match - it doesn't matter if columns are actually returned in the SELECT clause. NULL just makes it explicit that there isn't a comparison while 1/etc could be a valid value previously used in an IN clause.