Different results from using IN and EXISTS - sql

I have a query that has been giving me fits. Basically I want a left outer join, but without using a join.
I started off using IN and got back about 13,000 rows. If I use EXISTS, I then get about 11,000 rows. Even if I use GROUP BY to make sure duplicates aren't counted, there's still a difference.
Here's some code
This one with exists
SELECT upper(EMAIL_ADDRESS)
FROM DATA.CRM_CONTACTS
WHERE EXISTS
(
SELECT upper(Email_address)
FROM DATA.MMBI
WHERE DATA.CRM_CONTACTS.Email_address = DATA.MMBI.Email_Address
)
group by 1
order by 1
And this is code that uses IN:
SELECT upper(EMAIL_ADDRESS)
FROM DATA.CRM_CONTACTS
WHERE upper(EMAIL_ADDRESS) IN
(
SELECT upper(Email_address)
FROM DATA.MMBI
)
group by 1
order by 1
Is there any reason that would explain why I'm getting different results?

Assuming that you're using SQL Server:
In your in case, you're making a case-insensitive comparison, uppercasing both values to be compared:
WHERE upper(EMAIL_ADDRESS) IN ( SELECT upper(Email_address)
FROM DATA.MMBI
)
In your exists case, your join criteria for the correlated subquery is this
WHERE DATA.CRM_CONTACTS.Email_address = DATA.MMBI.Email_Address
Which means it's going to use the collation in play to make the comparison, which might be case-sensitive.

Related

Why the processing time of this query skyrockets when adding this "where" clause?

I'm trying to get a query done without including some previous CTE's ids from previous queries. The query looks like this:
WITH
people AS(
SELECT
rand() AS prob,
[ STRUCT(name,
address,
id,
[CAST(FLOOR(19*RAND()) AS INT64),
CAST(FLOOR(19*RAND()) AS INT64),
CAST(FLOOR(19*RAND()) AS INT64)] AS answers) ] AS person
FROM
`table_id`
ORDER BY
prob DESC
LIMIT
53370 ),
SELECT
t1.* EXCEPT(col_1,
col_2),
prob,
foo.person
FROM
`table_id` t1
LEFT JOIN (
SELECT
*
FROM
people,
UNNEST(people.person) p) foo
ON
foo.id= t1.id
So, until here, all good. The query runs in about 3.4 seconds, but it includes all the id, so the ids of CTE people are counted twice.
I added this "innocent" line at the end of the query in order to try filtering the ids
WHERE
t1.id NOT IN (
SELECT
id
FROM
people,UNNEST(people.person))
And with this, the time goes skyrocket (easy 25 minutes and its not done). Why is this? The whole table has a size of 30.28 MB and 184.329 rows. How can I fix this and get the result I want? (exclude the ids of CTE people when joining the the table with the original table). Should I use other type of join or approach to get this done?
Just use:
where foo.id is null
You already have a left join in the query to the same data. Just check that there is no match.
The problem with your query is probably a poor implementation of NOT IN. I prefer NOT EXISTS or LEFT JOIN/WHERE -- which is what I suggest.

How to find different rows in two tables with same columns?

I have two tables that have exact same set of columns. I'd like to select all rows that don't exactly match. Is there a way to do that without joining by every column or typing every column's name in any other way (I have a large number of them)?
If the number, type and order of columns are exactly the same, you can use the EXCEPT (or in some DBMS MINUS) operator to remove all rows from the first table, that match a row from the second table (by every column).
SELECT *
FROM table1
EXCEPT
SELECT *
FROM table2;
(Use EXCEPT ALL, if you don't want or need duplicate elimination. If you want also the result when the operands are interchanged, you can use UNION (or UNION ALL to union the results of a second EXCEPT operation. In doubt use parenthesis to prioritize the operations as needed.)
use minus
select * from tableA
minus
select * from tableB
If the query returns no rows then the data is exactly the same.
You could use JOIN by PK and compare all other columns using:
SELECT *
FROM src s
FULL OUTER JOIN trg t
ON s.id = t.id
WHERE NOT EXISTS (SELECT s.col1, s.col2, s.col3, s.col4
INTERSECT
SELECT t.col1, t.col2, t.col3, t.col4);
Please note that this approach allows to compare data side-by-side.
DBFiddle Demo
EDIT:
That still requires to explicitly mention every column? I'd rather not to.
Yes, but you could use drag and drop from object explorer(SSMS/TOAD/Oracle Developer) and avoid manually typing them.
There is SELECT * EXCEPT(only Google Big Query):
SELECT *
FROM src s
FULL OUTER JOIN trg t
ON s.id = t.id
WHERE NOT EXISTS (SELECT s.* EXCEPT s.id
INTERSECT
SELECT t.* EXCEPT t.id);

getting same top 1 result in sql server

I have this query:
SELECT
IT_approvaldate
FROM
t_item
WHERE
IT_certID_fk_ind = (SELECT DISTINCT TOP 1 IT_certID_fk_ind
FROM t_item
WHERE IT_rfileID_fk = '4876')
ORDER BY
IT_typesort
Result when running this query:
I need get top 1 result. (2013-04-27 00:00:00) problem is when I select top 1, getting 2nd result.
I believe reason for that order by column value same in those two result.
please see below,
However I need get only IT_approvaldate column top 1 as result of my query.
How can I do this? Can anyone help me to solve this?
Hi use below query and check
SELECT IT_approvaldate FROM t_item WHERE IT_certID_fk_ind =(SELECT DISTINCT top 1 IT_certID_fk_ind FROM t_item WHERE IT_rfileID_fk ='4876' ) and IT_approvaldate is not null ORDER BY IT_typesort
This will remove null values from the result
If you want NULL to be the last value in the sorted list you can use ISNULL in ORDER BY clause to replace NULL by MAX value of DATETIME
Below code might help:
SELECT TOP 1 IT_approvaldate
FROM t_item
WHERE IT_certID_fk_ind = (SELECT DISTINCT top 1 IT_certID_fk_ind FROM t_item WHERE IT_rfileID_fk ='4876' )
ORDER BY IT_typesort ASC, ISNULL(IT_approvaldate,'12-31-9999 23:59:59') ASC;
TSQL Select queries are not inherently deterministic. You must add a tie-breaker or by another row that is not.
The theory is SQL Server will not presume that the NULL value is greater or lesser than your row, and because your select statement is not logically implemented until after your HAVING clause, the order depends on how the database is setup.
Understand that SQL Server may not necessarily choose the same path twice unless it thinks it is absolutely better. This is the reason for the ORDER BY clause, which will treat NULLs consistently (assuming there is a unique grouping).
UPDATE:
It seemed a good idea to add a link to MSDN's documentation on the ORDER BY. Truly, it is good practice to start from the Standard/MSDN. ORDER BY Clause - MSDN

When to use EXCEPT as opposed to NOT EXISTS in Transact SQL?

I just recently learned of the existence of the new "EXCEPT" clause in SQL Server (a bit late, I know...) through reading code written by a co-worker. It truly amazed me!
But then I have some questions regarding its usage: when is it recommended to be employed? Is there a difference, performance-wise, between using it versus a correlated query employing "AND NOT EXISTS..."?
After reading EXCEPT's article in the BOL I thought it was just a shorthand for the second option, but was surprised when I rewrote a couple queries using it (so they had the "AND NOT EXISTS" syntax much more familiar to me) and then checked the execution plans - surprise! The EXCEPT version had a shorter execution plan, and executed faster, also. Is this always so?
So I'd like to know: what are the guidelines for using this powerful tool?
EXCEPT treats NULL values as matching.
This query:
WITH q (value) AS
(
SELECT NULL
UNION ALL
SELECT 1
),
p (value) AS
(
SELECT NULL
UNION ALL
SELECT 2
)
SELECT *
FROM q
WHERE value NOT IN
(
SELECT value
FROM p
)
will return an empty rowset.
This query:
WITH q (value) AS
(
SELECT NULL
UNION ALL
SELECT 1
),
p (value) AS
(
SELECT NULL
UNION ALL
SELECT 2
)
SELECT *
FROM q
WHERE NOT EXISTS
(
SELECT NULL
FROM p
WHERE p.value = q.value
)
will return
NULL
1
, and this one:
WITH q (value) AS
(
SELECT NULL
UNION ALL
SELECT 1
),
p (value) AS
(
SELECT NULL
UNION ALL
SELECT 2
)
SELECT *
FROM q
EXCEPT
SELECT *
FROM p
will return:
1
Recursive reference is also allowed in EXCEPT clause in a recursive CTE, though it behaves in a strange way: it returns everything except the last row of a previous set, not everything except the whole previous set:
WITH q (value) AS
(
SELECT 1
UNION ALL
SELECT 2
UNION ALL
SELECT 3
),
rec (value) AS
(
SELECT value
FROM q
UNION ALL
SELECT *
FROM (
SELECT value
FROM q
EXCEPT
SELECT value
FROM rec
) q2
)
SELECT TOP 10 *
FROM rec
---
1
2
3
-- original set
1
2
-- everything except the last row of the previous set, that is 3
1
3
-- everything except the last row of the previous set, that is 2
1
2
-- everything except the last row of the previous set, that is 3, etc.
1
SQL Server developers must just have forgotten to forbid it.
I have done a lot of analysis of except, not exists, not in and left outer join. Generally the left outer join is the fastest for finding missing rows, especially joining on a primary key. Not In can be very fast if you know it will be a small list returned in the select.
I use EXCEPT a lot to compare what is being returned when rewriting code. Run the old code saving results. Run new code saving results and then use except to capture all differences. It is a very quick and easy way to find differences, especially when needing to get all differences including null. Very good for on the fly easy coding.
But, every situation is different. I say to every developer I have ever mentored. Try it. Do timings all different ways. Try it, time it, do it.
EXCEPT compares all (paired)columns of two full-selects.
NOT EXISTS compares two or more tables accoding to the conditions specified in WHERE clause in the sub-query following NOT EXISTS keyword.
EXCEPT can be rewritten by using NOT EXISTS.
(EXCEPT ALL can be rewritten by using ROW_NUMBER and NOT EXISTS.)
Got this from here
There is no accounting for SQL server's execution plans. I have always found when having performance issues that it was utterly arbitrary (from a user's perspective, I'm sure the algorithm writers would understand why) when one syntax made a better execution plan rather than another.
In this case, something about the query parameter comparison allows SQL to figure out a shortcut that it couldn't from a straight select statement. I'm sure that is a deficiency in the algorithm. In other words, you could logically interpolate the same thing, but the algorithm doesn't make that translation on an exists query. Sometimes that is because an algorithm that could reliably figure it out would take longer to execute than the query itself, or at least the algorithm designer thought so.
If your query is fine tuned then there is no performance difference b/w using of EXCEPT clause and NOT EXIST/NOT IN.. first time when I ran EXCEPT after changing my correlated query into it.. I was surprised because it returned with the result just in 7 secs while correlated query was returning in 22 secs.. then I used distinct clause in my correlated query and reran.. it also returned in 7 secs.. so EXCEPT is good when you don't know or don't have time to fine tuned your query otherwise both are same performance wise..

What do you put in a subquery's Select part when it's preceded by Exists?

What do you put in a subquery's Select part when it's preceded by Exists?
Select *
From some_table
Where Exists (Select 1
From some_other_table
Where some_condition )
I usually use 1, I used to put * but realized it could add some useless overhead.
What do you put? is there a more efficient way than putting 1 or any other dummy value?
I think the efficiency depends on your platform.
In Oracle, SELECT * and SELECT 1 within an EXISTS clause generate identical explain plans, with identical memory costs. There is no difference. However, other platforms may vary.
As a matter of personal preference, I use
SELECT *
Because SELECTing a specific field could mislead a reader into thinking that I care about that specific field, and it also lets me copy / paste that subquery out and run it unmodified, to look at the output.
However, an EXISTS clause in a SQL statement is a bit of a code smell, IMO. There are times when they are the best and clearest way to get what you want, but they can almost always be expressed as a join, which will be a lot easier for the database engine to optimize.
SELECT *
FROM SOME_TABLE ST
WHERE EXISTS(
SELECT 1
FROM SOME_OTHER_TABLE SOT
WHERE SOT.KEY_VALUE1 = ST.KEY_VALUE1
AND SOT.KEY_VALUE2 = ST.KEY_VALUE2
)
Is logically identical to:
SELECT *
FROM
SOME_TABLE ST
INNER JOIN
SOME_OTHER_TABLE SOT
ON ST.KEY_VALUE1 = SOT.KEY_VALUE1
AND ST.KEY_VALUE2 = SOT.KEY_VALUE2
I also use 1. I've seen some devs who use null. I think 1 is efficient compared to selecting from any field as the query won't have to get the actual value from the physical loc when it executes the select clause of the subquery.
Use:
WHERE EXISTS (SELECT NULL
FROM some_other_table
WHERE ... )
EXISTS returns true if one or more of the specified criteria match - it doesn't matter if columns are actually returned in the SELECT clause. NULL just makes it explicit that there isn't a comparison while 1/etc could be a valid value previously used in an IN clause.