SQL Count: erratic behaviour - sql

A piece of SQL that I have written is not behaving as intended. A vital piece of logic involves counting how many guests are VIPs, but the SQL seems to consistently get an incorrect answer.
The following database has 6 guests, 3 of whom are VIPs.
CREATE TABLE `guest` (
`GuestID` int(11) NOT NULL DEFAULT '0',
`fullname` varchar(255) DEFAULT NULL,
`vip` tinyint(1) DEFAULT '0'
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
--
-- Dumping data for table `guest`
--
INSERT INTO `guest` (`GuestID`, `fullname`, `vip`) VALUES
(912, 'Sam', 0),
(321, 'Sev', 0),
(629, 'Joe', 0),
(103, 'Tom', 1),
(331, 'Cao', 1),
(526, 'Conor', 1);
Initially the SQL returned a value saying that there were 5 VIPs, which is incorrect as there are only 3 VIPs. This is quite a complicated database, and in generating a minimum viable example for the sake of this question (with a reproducible error) the script now states that there are only 2 VIPs. Again, this is incorrect.
The SQL in question is
SELECT slotguest.FK_SlotNo, Count(CASE WHEN guest.vip = 1 THEN 1 END) AS guest_count
FROM guest
INNER JOIN slotguest ON guest.GuestID = slotguest.FK_guest
GROUP BY slotguest.FK_SlotNo;
The slotguest structure and content is as follows
CREATE TABLE `slotguest` (
`FK_SlotNo` int(11) NOT NULL,
`FK_guest` int(11) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
--
-- Dumping data for table `slotguest`
--
INSERT INTO `slotguest` (`FK_SlotNo`, `FK_guest`) VALUES
(396, 912),
(396, 321),
(396, 629),
(396, 103),
(396, 331),
(396, 526);
What is causing Count to come up with a consistently incorrect answer?

As indicated in the comments (see those from @Fábio Amorim and @Rajat), your query seems to work as intended. Since you set a value with the CASE WHEN, it might be better to use SUM.
It may be easier to see what is happening if you list the counts for the different VIP categories, to find where there might be a leakage of data.
SELECT guest.vip, slotguest.FK_SlotNo, COUNT(*) AS guest_per_category
FROM guest
INNER JOIN slotguest ON guest.GuestID = slotguest.FK_guest
GROUP BY guest.vip,slotguest.FK_SlotNo;
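To check this against the posted data, here is a minimal sketch that loads the question's rows into an in-memory SQLite database via Python's sqlite3 module. SQLite is my assumption here (the question uses MySQL), so column types are simplified, but the two queries are the ones shown above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE guest (GuestID INTEGER NOT NULL, fullname TEXT, vip INTEGER DEFAULT 0);
CREATE TABLE slotguest (FK_SlotNo INTEGER NOT NULL, FK_guest INTEGER NOT NULL);
INSERT INTO guest VALUES (912,'Sam',0),(321,'Sev',0),(629,'Joe',0),
                         (103,'Tom',1),(331,'Cao',1),(526,'Conor',1);
INSERT INTO slotguest VALUES (396,912),(396,321),(396,629),
                             (396,103),(396,331),(396,526);
""")

# The query from the question, unchanged apart from quoting.
vip_counts = conn.execute("""
    SELECT slotguest.FK_SlotNo, COUNT(CASE WHEN guest.vip = 1 THEN 1 END) AS guest_count
    FROM guest
    INNER JOIN slotguest ON guest.GuestID = slotguest.FK_guest
    GROUP BY slotguest.FK_SlotNo
""").fetchall()

# The per-category breakdown suggested in this answer.
per_category = conn.execute("""
    SELECT guest.vip, slotguest.FK_SlotNo, COUNT(*) AS guest_per_category
    FROM guest
    INNER JOIN slotguest ON guest.GuestID = slotguest.FK_guest
    GROUP BY guest.vip, slotguest.FK_SlotNo
    ORDER BY guest.vip
""").fetchall()

print(vip_counts)    # [(396, 3)]
print(per_category)  # [(0, 396, 3), (1, 396, 3)]
```

With exactly the rows posted, the count comes out as 3, so any different number probably comes from rows that are not in the cut-down example.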

Smells like "explode-implode". Given
SELECT ... COUNT(*)
FROM a JOIN b ...
GROUP BY ...
The query is performed thus:
JOIN the tables. Assuming the tables are not trivially 1:1, this leads to more rows than either of the tables.
Do aggregates (such as COUNT) against that temp table.
Only then does the GROUP BY shrink back to the originally desired number of rows.
The solution is to avoid doing aggregates with more than the one table that contains the data being counted/summed. Sometimes the pattern is
SELECT ...
FROM ( SELECT x, COUNT(*) AS ct FROM a GROUP BY x ) AS b
JOIN c ON ...
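A small sketch of the effect, run through Python's sqlite3 module. The guest and booking tables and their rows are hypothetical, made up only to show a 1:many join inflating an aggregate and the derived-table pattern avoiding it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE guest (guest_id INTEGER, vip INTEGER);
CREATE TABLE booking (guest_id INTEGER, slot INTEGER);  -- hypothetical 1:many child table
INSERT INTO guest VALUES (1, 1), (2, 1), (3, 0);
INSERT INTO booking VALUES (1, 10), (1, 11), (1, 12), (2, 10);
""")

# Aggregate over the joined rows: guest 1 is counted once per booking.
inflated = conn.execute("""
    SELECT SUM(g.vip)
    FROM guest g
    JOIN booking b ON b.guest_id = g.guest_id
""").fetchone()[0]

# Aggregate inside a derived table first, then join: one row per guest.
per_guest = conn.execute("""
    SELECT g.guest_id, g.vip, b.ct
    FROM guest g
    JOIN (SELECT guest_id, COUNT(*) AS ct FROM booking GROUP BY guest_id) AS b
      ON b.guest_id = g.guest_id
    ORDER BY g.guest_id
""").fetchall()
vip_total = sum(vip for _, vip, _ in per_guest)

print(inflated)   # 4: two VIPs, but one of them appears three times after the join
print(per_guest)  # [(1, 1, 3), (2, 1, 1)]
print(vip_total)  # 2
```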

To explain what's wrong, and give an answer closer to the O.P.'s query ...
(I assume the O.P. is a cut-down example of what's going wrong, and the actual query is more complex. If we knew the bigger picture, I suspect I wouldn't code it like that.)
In the O.P. query, CASE WHEN guest.vip = 1 THEN 1 END has no ELSE branch. A conditional expression should state a specific value for all rows retrieved by the query, including rows where guest.vip <> 1.
As it is, the result depends on implicit Null handling: standard SQL treats a missing ELSE as ELSE NULL, and Count( ) then ignores the Nulls. I guess that is why it produces the expected answer for the commenters. This is one of the more horrible consequences of Null in SQL.
So as per @Fábio Amorim's comment, the CASE needs an ELSE. With an ELSE, Count( ) would count every row and give an unhelpful result, so get the ELSE to return 0 and Sum( ) the 1 or 0:
SELECT slotguest.FK_SlotNo, Sum(CASE WHEN guest.vip = 1 THEN 1 ELSE 0 END) AS guest_count
FROM guest
INNER JOIN slotguest ON guest.GuestID = slotguest.FK_guest
GROUP BY slotguest.FK_SlotNo;
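To compare the three spellings side by side, here is a quick sketch using Python's sqlite3 module. SQLite is an assumption on my part (other engines may differ), and the guest table is reduced to the vip column with the question's three VIPs and three non-VIPs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE guest (vip INTEGER);
INSERT INTO guest VALUES (0), (0), (0), (1), (1), (1);
""")

count_no_else, count_else_zero, sum_else_zero = conn.execute("""
    SELECT COUNT(CASE WHEN vip = 1 THEN 1 END),         -- non-VIP rows become NULL and are skipped
           COUNT(CASE WHEN vip = 1 THEN 1 ELSE 0 END),  -- 0 is not NULL, so every row is counted
           SUM(CASE WHEN vip = 1 THEN 1 ELSE 0 END)     -- adds up the 1s
    FROM guest
""").fetchone()

print(count_no_else, count_else_zero, sum_else_zero)  # 3 6 3
```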

Related

Refactoring SELECT with two correlated subqueries

Using Oracle 10g, I have a table that looks like this (syntax shortened for brevity):
CREATE TABLE "BUZINESS"."CATALOG"
( "ID" NUMBER PRIMARY KEY,
"ENTRY_ID" VARCHAR2(40 BYTE) NOT NULL,
"MSG_ID" VARCHAR2(40 BYTE) NOT NULL,
"PUBLISH_STATUS" VARCHAR2(30 BYTE) NOT NULL, /* Can be NEW or PUBLISHED */
CONSTRAINT "CATALOG_UN1" UNIQUE (ENTRY_ID, MSG_ID)
)
One process, Process A, writes Catalog entries with a PUBLISH_STATUS of 'NEW'. A second process, Process B, then comes in, grabs all 'NEW' messages, and then changes the PUBLISH_STATUS to 'PUBLISHED'.
I need to write a query that will grab all PUBLISH_STATUS='NEW' rows, BUT
I'm trying to prevent an out of order fetch, so that if Process B marks a row as PUBLISH_STATUS='PUBLISHED' with MSG_ID '1000', and then Process A writes an out of order row as PUBLISH_STATUS='NEW' with MSG_ID '999', the query will never fetch that row when grabbing all 'NEW' rows.
So, if I start with the data:
INSERT INTO BUZINESS.CATALOG VALUES (1, '1000', '999', 'NEW');
INSERT INTO BUZINESS.CATALOG VALUES (2, '1000', '1000', 'PUBLISHED');
INSERT INTO BUZINESS.CATALOG VALUES (3, '1000', '1001', 'NEW');
INSERT INTO BUZINESS.CATALOG VALUES (4, '2000', '1999', 'NEW');
INSERT INTO BUZINESS.CATALOG VALUES (5, '2000', '2000', 'PUBLISHED');
INSERT INTO BUZINESS.CATALOG VALUES (6, '2000', '2001', 'NEW');
INSERT INTO BUZINESS.CATALOG VALUES (7, '3000', '3001', 'NEW');
Then my query should grab only rows with ID:
3, 6, 7
I then have to join these rows with other data, so the result needs to be JOINable.
So far, I have a very large, ugly query UNIONing two correlated subqueries to do this. Could someone help me write a better query?
Requiring non-presence of joinable data is best solved with an outer join that filters out matching joins (leaving just the non-matches).
In your case, the join condition is a "published" row for the same entry with a later (higher) message ID.
This query produces your desired output:
select t1.*
from buziness_catalog t1
left join buziness_catalog t2
on t2.entry_id = t1.entry_id
and to_number(t2.msg_id) > to_number(t1.msg_id)
and t2.publish_status = 'PUBLISHED'
where t1.publish_status = 'NEW'
and t2.id is null
order by t1.id
See live demo of this query working with your sample data to produce your desired output. Note that I used a table name of "buziness_catalog" rather than "buziness.catalog" so the demo would run - you'll have to change the underscores back to dots.
Being a join, and not based on an exists correlated subquery, this will perform quite well.
This query would have been a little simpler had your msg_id column been a numeric type (the conversion from character to numeric would not have been needed). If your ID data is actually numeric, consider changing the datatype of entry_id and msg_id to a numeric type.
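For anyone who wants to try the anti-join without an Oracle instance, here is a rough sketch in Python's sqlite3 module. SQLite and the plain table name catalog are my assumptions, and CAST(... AS INTEGER) stands in for Oracle's to_number. The second query drops the conversion to show why it matters.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE catalog (
    id INTEGER PRIMARY KEY,
    entry_id TEXT NOT NULL,
    msg_id TEXT NOT NULL,
    publish_status TEXT NOT NULL,
    UNIQUE (entry_id, msg_id)
);
INSERT INTO catalog VALUES
    (1, '1000', '999',  'NEW'), (2, '1000', '1000', 'PUBLISHED'),
    (3, '1000', '1001', 'NEW'), (4, '2000', '1999', 'NEW'),
    (5, '2000', '2000', 'PUBLISHED'), (6, '2000', '2001', 'NEW'),
    (7, '3000', '3001', 'NEW');
""")

ANTI_JOIN = """
    SELECT t1.id
    FROM catalog t1
    LEFT JOIN catalog t2
      ON t2.entry_id = t1.entry_id
     AND {later}
     AND t2.publish_status = 'PUBLISHED'
    WHERE t1.publish_status = 'NEW'
      AND t2.id IS NULL
    ORDER BY t1.id
"""

# Numeric comparison, as in the answer.
numeric = ANTI_JOIN.format(later="CAST(t2.msg_id AS INTEGER) > CAST(t1.msg_id AS INTEGER)")
# Text comparison: '1000' sorts before '999', so row 1 is no longer filtered out.
textual = ANTI_JOIN.format(later="t2.msg_id > t1.msg_id")

ids_numeric = [row[0] for row in conn.execute(numeric)]
ids_textual = [row[0] for row in conn.execute(textual)]

print(ids_numeric)  # [3, 6, 7]
print(ids_textual)  # [1, 3, 6, 7]
```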
Reading between the lines, I think this might work:
select
*
from
buziness.catalog b1
where
b1.publish_status = 'NEW' and
not exists (
select
'x'
from
buziness.catalog b2
where
b1.entry_id = b2.entry_id and
b2.publish_status = 'PUBLISHED' and
to_number(b2.msg_id) > to_number(b1.msg_id) -- store numbers as numbers!
);
@Laurence's query looks good, but just to satisfy my curiosity, do you mind EXPLAINing this query too?
I think that those numbers stored as varchar will kill your index usage when wrapped in TO_NUMBER(), but I'm not sure about Oracle, so you had better check that.
If they do, you can always add additional number columns that you update with a trigger when rows are edited, so that you don't break the original design.
SELECT *
FROM buziness.catalog b1
WHERE PUBLISH_STATUS = 'NEW'
AND TO_NUMBER(msg_id) > COALESCE((
SELECT MAX(TO_NUMBER(msg_id))
FROM buziness.catalog b2
WHERE PUBLISH_STATUS = 'PUBLISHED'
AND b2.entry_id = b1.entry_id
), 0)
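As a sanity check, the NOT EXISTS form and the MAX/COALESCE form can be run against the question's sample rows. This is only a sketch using Python's sqlite3 module: SQLite, the unqualified table name catalog, and CAST in place of TO_NUMBER are my substitutions, and it says nothing about how Oracle would plan either query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE catalog (id INTEGER PRIMARY KEY, entry_id TEXT, msg_id TEXT, publish_status TEXT);
INSERT INTO catalog VALUES
    (1, '1000', '999',  'NEW'), (2, '1000', '1000', 'PUBLISHED'),
    (3, '1000', '1001', 'NEW'), (4, '2000', '1999', 'NEW'),
    (5, '2000', '2000', 'PUBLISHED'), (6, '2000', '2001', 'NEW'),
    (7, '3000', '3001', 'NEW');
""")

not_exists_ids = [row[0] for row in conn.execute("""
    SELECT b1.id
    FROM catalog b1
    WHERE b1.publish_status = 'NEW'
      AND NOT EXISTS (
          SELECT 'x'
          FROM catalog b2
          WHERE b1.entry_id = b2.entry_id
            AND b2.publish_status = 'PUBLISHED'
            AND CAST(b2.msg_id AS INTEGER) > CAST(b1.msg_id AS INTEGER)
      )
    ORDER BY b1.id
""")]

max_ids = [row[0] for row in conn.execute("""
    SELECT b1.id
    FROM catalog b1
    WHERE b1.publish_status = 'NEW'
      AND CAST(b1.msg_id AS INTEGER) > COALESCE((
          SELECT MAX(CAST(b2.msg_id AS INTEGER))
          FROM catalog b2
          WHERE b2.publish_status = 'PUBLISHED'
            AND b2.entry_id = b1.entry_id
      ), 0)
    ORDER BY b1.id
""")]

print(not_exists_ids)  # [3, 6, 7]
print(max_ids)         # [3, 6, 7]
```

Entry '3000' has no published row, so the inner MAX is NULL and the COALESCE fallback of 0 lets row 7 through.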
Although this is a very old post, I still feel the need to reply here, as I suspect the question is based on a misconception. Oracle, like many other RDBMSs, holds to the principles of ACID, where the I stands for Isolation. A process x sees the result of another process y only if y committed before x started its query. So one process altering another process's view of the data mid-query is not possible.
If you are not convinced, run the update query without committing. Start another session and query the data again and again. It will not change for the other sessions until you commit your changes in the first session, and a query that started before the commit keeps reading the snapshot of the data as it was when that query started.
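The experiment can be imitated with two connections to one database file. The sketch below uses Python's sqlite3 module, which is an assumption on my part: SQLite uses file locking rather than Oracle's multi-version read consistency, so it only illustrates that uncommitted changes stay invisible to another session. The table and file names are hypothetical.

```python
import os
import sqlite3
import tempfile

QUERY = "SELECT publish_status FROM catalog WHERE id = 1"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.db")  # hypothetical scratch database
    writer = sqlite3.connect(path)       # plays the session that updates
    reader = sqlite3.connect(path)       # plays the other session

    writer.execute("CREATE TABLE catalog (id INTEGER PRIMARY KEY, publish_status TEXT)")
    writer.execute("INSERT INTO catalog VALUES (1, 'NEW')")
    writer.commit()

    # The writer changes the row but has not committed yet.
    writer.execute("UPDATE catalog SET publish_status = 'PUBLISHED' WHERE id = 1")
    seen_before_commit = reader.execute(QUERY).fetchall()[0][0]

    # Only after the commit does the other session see the new value.
    writer.commit()
    seen_after_commit = reader.execute(QUERY).fetchall()[0][0]

    writer.close()
    reader.close()

print(seen_before_commit, seen_after_commit)
```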

SQL NOT IN clause not working [duplicate]

This issue came up when I got different record counts for what I thought were identical queries, one using a NOT IN constraint in the WHERE clause and the other a LEFT JOIN. The table in the NOT IN constraint had one null value (bad data), which caused that query to return a count of 0 records. I sort of understand why, but I could use some help fully grasping the concept.
To state it simply, why does query A return a result but B doesn't?
A: select 'true' where 3 in (1, 2, 3, null)
B: select 'true' where 3 not in (1, 2, null)
This was on SQL Server 2005. I also found that calling set ansi_nulls off causes B to return a result.
Query A is the same as:
select 'true' where 3 = 1 or 3 = 2 or 3 = 3 or 3 = null
Since 3 = 3 is true, you get a result.
Query B is the same as:
select 'true' where 3 <> 1 and 3 <> 2 and 3 <> null
When ansi_nulls is on, 3 <> null is UNKNOWN, so the predicate evaluates to UNKNOWN, and you don't get any rows.
When ansi_nulls is off, 3 <> null is true, so the predicate evaluates to true, and you get a row.
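The ansi_nulls-on behaviour is easy to reproduce outside SQL Server. Here is a minimal sketch with Python's sqlite3 module; SQLite is my choice for convenience, it has no ansi_nulls switch, and it reports UNKNOWN as NULL (None in Python).

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Query A and query B from the question.
query_a = conn.execute("SELECT 'true' WHERE 3 IN (1, 2, 3, NULL)").fetchall()
query_b = conn.execute("SELECT 'true' WHERE 3 NOT IN (1, 2, NULL)").fetchall()

# The bare predicate of query B, selected as a value instead of used as a filter.
predicate_b = conn.execute("SELECT 3 NOT IN (1, 2, NULL)").fetchone()[0]

print(query_a)      # [('true',)]
print(query_b)      # []
print(predicate_b)  # None
```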
NOT IN returns 0 records when compared against an unknown value
Since NULL is an unknown, a NOT IN query containing a NULL or NULLs in the list of possible values will always return 0 records since there is no way to be sure that the NULL value is not the value being tested.
Whenever you use NULL you are really dealing with a Three-Valued logic.
Your first query returns results as the WHERE clause evaluates to:
3 = 1 or 3 = 2 or 3 = 3 or 3 = null
which is:
FALSE or FALSE or TRUE or UNKNOWN
which evaluates to
TRUE
The second one:
3 <> 1 and 3 <> 2 and 3 <> null
which evaluates to:
TRUE and TRUE and UNKNOWN
which evaluates to:
UNKNOWN
UNKNOWN is not the same as FALSE.
You can easily test this by running:
select 'true' where 3 <> null
select 'true' where not (3 <> null)
Both queries will give you no results
If the UNKNOWN was the same as FALSE then assuming that the first query would give you FALSE the second would have to evaluate to TRUE as it would have been the same as NOT(FALSE).
That is not the case.
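The same evaluation can be written out by hand. The following is a rough model of three-valued logic in plain Python, with None standing in for UNKNOWN; the helper names (tri_not, tri_and, tri_or, tri_eq, tri_ne) are hypothetical and not part of any library.

```python
from functools import reduce
from typing import Optional

Tri = Optional[bool]  # True, False, or None for UNKNOWN


def tri_not(a: Tri) -> Tri:
    return None if a is None else not a


def tri_and(a: Tri, b: Tri) -> Tri:
    if a is False or b is False:
        return False  # one FALSE decides the result
    if a is None or b is None:
        return None
    return True


def tri_or(a: Tri, b: Tri) -> Tri:
    if a is True or b is True:
        return True  # one TRUE decides the result
    if a is None or b is None:
        return None
    return False


def tri_eq(x, y) -> Tri:
    return None if x is None or y is None else x == y


def tri_ne(x, y) -> Tri:
    return tri_not(tri_eq(x, y))


# 3 IN (1, 2, 3, NULL) expands to a chain of ORs.
query_a = reduce(tri_or, (tri_eq(3, v) for v in (1, 2, 3, None)))
# 3 NOT IN (1, 2, NULL) expands to a chain of ANDs.
query_b = reduce(tri_and, (tri_ne(3, v) for v in (1, 2, None)))
# NOT (3 <> NULL) is still UNKNOWN, not TRUE.
negated = tri_not(tri_ne(3, None))

print(query_a, query_b, negated)  # True None None
```

A WHERE clause keeps a row only when the predicate is True, so both None results filter the row out.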
There is a very good article on this subject on SqlServerCentral.
The whole issue of NULLs and Three-Valued Logic can be a bit confusing at first but it is essential to understand in order to write correct queries in TSQL
Another article I would recommend is SQL Aggregate Functions and NULL.
Comparison to null is undefined unless you use IS NULL.
So, when comparing 3 to NULL (query A), it returns undefined.
I.e. SELECT 'true' where 3 in (1,2,null)
and
SELECT 'true' where 3 not in (1,2,null)
will produce the same result, as NOT (UNDEFINED) is still undefined, but not TRUE.
If you want to filter with NOT IN for a subquery containing NULLs, just check for not null:
SELECT blah FROM t WHERE blah NOT IN
(SELECT someotherBlah FROM t2 WHERE someotherBlah IS NOT NULL )
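A small illustration of the difference, run through Python's sqlite3 module. The rows in t and t2 are made up for the example; only the table and column names come from the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (blah INTEGER);
CREATE TABLE t2 (someotherBlah INTEGER);
INSERT INTO t VALUES (1), (2), (3);
INSERT INTO t2 VALUES (1), (NULL);  -- one NULL is enough to empty the NOT IN result
""")

unfiltered = [row[0] for row in conn.execute("""
    SELECT blah FROM t
    WHERE blah NOT IN (SELECT someotherBlah FROM t2)
    ORDER BY blah
""")]

filtered = [row[0] for row in conn.execute("""
    SELECT blah FROM t
    WHERE blah NOT IN (SELECT someotherBlah FROM t2 WHERE someotherBlah IS NOT NULL)
    ORDER BY blah
""")]

print(unfiltered)  # []
print(filtered)    # [2, 3]
```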
The title of this question at the time of writing is
SQL NOT IN constraint and NULL values
From the text of the question it appears that the problem was occurring in a SQL DML SELECT query, rather than a SQL DDL CONSTRAINT.
However, especially given the wording of the title, I want to point out that some statements made here are potentially misleading, namely those along the lines of (paraphrasing):
When the predicate evaluates to UNKNOWN you don't get any rows.
Although this is the case for SQL DML, when considering constraints the effect is different.
Consider this very simple table with two constraints taken directly from the predicates in the question (and addressed in an excellent answer by @Brannon):
DECLARE @T TABLE
(
true CHAR(4) DEFAULT 'true' NOT NULL,
CHECK ( 3 IN (1, 2, 3, NULL )),
CHECK ( 3 NOT IN (1, 2, NULL ))
);
INSERT INTO @T VALUES ('true');
SELECT COUNT(*) AS tally FROM @T;
As per @Brannon's answer, the first constraint (using IN) evaluates to TRUE and the second constraint (using NOT IN) evaluates to UNKNOWN. However, the insert succeeds! Therefore, in this case it is not strictly correct to say, "you don't get any rows" because we have indeed got a row inserted as a result.
The above effect is indeed the correct one as regards the SQL-92 Standard. Compare and contrast the following section from the SQL-92 spec
7.6 <where clause>
The result of the <where clause> is a table of those rows of T for
which the result of the <search condition> is true.
4.10 Integrity constraints
A table check constraint is satisfied if and only if the specified
<search condition> is not false for any row of a table.
In other words:
In SQL DML, rows are removed from the result when the WHERE evaluates to UNKNOWN because it does not satisfy the condition "is true".
In SQL DDL (i.e. constraints), rows are not removed from the result when they evaluate to UNKNOWN because it does satisfy the condition "is not false".
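The contrast can be observed in other engines too. Below is a sketch using Python's sqlite3 module, which is my assumption and not the SQL Server table variable used above. The column is renamed to flag, and the table t_false with an always-FALSE check is a hypothetical addition for comparison.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE t (
        flag TEXT NOT NULL DEFAULT 'true',
        CHECK (3 IN (1, 2, 3, NULL)),   -- TRUE
        CHECK (3 NOT IN (1, 2, NULL))   -- UNKNOWN, which does not violate the constraint
    )
""")
conn.execute("INSERT INTO t VALUES ('true')")
tally = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]

# The same UNKNOWN predicate used as a WHERE filter removes the row.
matched = conn.execute("SELECT COUNT(*) FROM t WHERE 3 NOT IN (1, 2, NULL)").fetchone()[0]

# For comparison: a check that evaluates to FALSE does reject the insert.
conn.execute("CREATE TABLE t_false (flag TEXT, CHECK (3 NOT IN (1, 2, 3)))")
try:
    conn.execute("INSERT INTO t_false VALUES ('true')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(tally, matched, rejected)  # 1 0 True
```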
Although the effects in SQL DML and SQL DDL respectively may seem contradictory, there is a practical reason for giving UNKNOWN results the 'benefit of the doubt' by allowing them to satisfy a constraint (more correctly, allowing them to not fail to satisfy a constraint): without this behaviour, every constraint would have to explicitly handle nulls and that would be very unsatisfactory from a language design perspective (not to mention, a right pain for coders!)
p.s. if you are finding it as challenging to follow such logic as "unknown does not fail to satisfy a constraint" as I am to write it, then consider you can dispense with all this simply by avoiding nullable columns in SQL DDL and anything in SQL DML that produces nulls (e.g. outer joins)!
In A, 3 is tested for equality against each member of the set, yielding (FALSE, FALSE, TRUE, UNKNOWN). Since one of the elements is TRUE, the condition is TRUE. (It's also possible that some short-circuiting takes place here, so it actually stops as soon as it hits the first TRUE and never evaluates 3=NULL.)
In B, I think it is evaluating the condition as NOT (3 in (1,2,null)). Testing 3 for equality against the set yields (FALSE, FALSE, UNKNOWN), which is aggregated to UNKNOWN. NOT ( UNKNOWN ) yields UNKNOWN. So overall the truth of the condition is unknown, which at the end is essentially treated as FALSE.
SQL uses three-valued logic for truth values. The IN query produces the expected result:
SELECT * FROM (VALUES (1), (2)) AS tbl(col) WHERE col IN (NULL, 1)
-- returns first row
But adding a NOT does not invert the results:
SELECT * FROM (VALUES (1), (2)) AS tbl(col) WHERE NOT col IN (NULL, 1)
-- returns zero rows
This is because the above query is equivalent of the following:
SELECT * FROM (VALUES (1), (2)) AS tbl(col) WHERE NOT (col = NULL OR col = 1)
Here is how the where clause is evaluated:
| col | col = NULL⁽¹⁾ | col = 1 | col = NULL OR col = 1 | NOT (col = NULL OR col = 1) |
|-----|----------------|---------|-----------------------|-----------------------------|
| 1 | UNKNOWN | TRUE | TRUE | FALSE |
| 2 | UNKNOWN | FALSE | UNKNOWN⁽²⁾ | UNKNOWN⁽³⁾ |
Notice that:
⁽¹⁾ The comparison involving NULL yields UNKNOWN.
⁽²⁾ The OR expression where none of the operands are TRUE and at least one operand is UNKNOWN yields UNKNOWN.
⁽³⁾ The NOT of UNKNOWN yields UNKNOWN.
You can extend the above example to more than two values (e.g. NULL, 1 and 2) but the result will be same: if one of the values is NULL then no row will match.
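The truth table can be reproduced by selecting each sub-expression as a column. This sketch uses Python's sqlite3 module; SQLite is my assumption, and it shows TRUE as 1, FALSE as 0 and UNKNOWN as NULL (None in Python).

```python
import sqlite3

conn = sqlite3.connect(":memory:")

truth_table = conn.execute("""
    SELECT col,
           col = NULL,
           col = 1,
           (col = NULL OR col = 1),
           NOT (col = NULL OR col = 1)
    FROM (SELECT 1 AS col UNION ALL SELECT 2)
    ORDER BY col
""").fetchall()

# Only rows whose last column is TRUE (1) would survive WHERE NOT col IN (NULL, 1).
surviving = [row[0] for row in truth_table if row[4] == 1]

for row in truth_table:
    print(row)
# (1, None, 1, 1, 0)
# (2, None, 0, None, None)
print(surviving)  # []
```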
Null signifies an absence of data, that is, it is unknown, not a data value of nothing. It's very easy for people from a programming background to confuse this because in C type languages when using pointers null is indeed nothing.
Hence in the first case 3 is indeed in the set of (1,2,3,null) so true is returned
In the second however you can reduce it to
select 'true' where 3 not in (null)
So nothing is returned because the parser knows nothing about the set to which you are comparing it - it's not an empty set but an unknown set. Using (1, 2, null) doesn't help because the (1,2) set is obviously false, but then you're and'ing that against unknown, which is unknown.
It may be concluded from answers here that NOT IN (subquery) doesn't handle nulls correctly and should be avoided in favour of NOT EXISTS. However, such a conclusion may be premature. In the following scenario, credited to Chris Date (Database Programming and Design, Vol 2 No 9, September 1989), it is NOT IN that handles nulls correctly and returns the correct result, rather than NOT EXISTS.
Consider a table sp to represent suppliers (sno) who are known to supply parts (pno) in quantity (qty). The table currently holds the following values:
VALUES ('S1', 'P1', NULL),
('S2', 'P1', 200),
('S3', 'P1', 1000)
Note that quantity is nullable i.e. to be able to record the fact a supplier is known to supply parts even if it is not known in what quantity.
The task is to find the suppliers who are known to supply part number 'P1' but not in quantities of 1000.
The following uses NOT IN to correctly identify supplier 'S2' only:
WITH sp AS
( SELECT *
FROM ( VALUES ( 'S1', 'P1', NULL ),
( 'S2', 'P1', 200 ),
( 'S3', 'P1', 1000 ) )
AS T ( sno, pno, qty )
)
SELECT DISTINCT spx.sno
FROM sp spx
WHERE spx.pno = 'P1'
AND 1000 NOT IN (
SELECT spy.qty
FROM sp spy
WHERE spy.sno = spx.sno
AND spy.pno = 'P1'
);
However, the query below uses the same general structure with NOT EXISTS and incorrectly includes supplier 'S1' in the result (i.e. the supplier for which the quantity is null):
WITH sp AS
( SELECT *
FROM ( VALUES ( 'S1', 'P1', NULL ),
( 'S2', 'P1', 200 ),
( 'S3', 'P1', 1000 ) )
AS T ( sno, pno, qty )
)
SELECT DISTINCT spx.sno
FROM sp spx
WHERE spx.pno = 'P1'
AND NOT EXISTS (
SELECT *
FROM sp spy
WHERE spy.sno = spx.sno
AND spy.pno = 'P1'
AND spy.qty = 1000
);
So NOT EXISTS is not the silver bullet it may have appeared!
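Both queries can be run side by side. Here is a sketch using Python's sqlite3 module; SQLite is my assumption, and the CTE is written with a column list (WITH sp (sno, pno, qty) AS (VALUES ...)) because that form is what SQLite accepts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

SP = """
WITH sp (sno, pno, qty) AS (
    VALUES ('S1', 'P1', NULL),
           ('S2', 'P1', 200),
           ('S3', 'P1', 1000)
)
"""

not_in = [row[0] for row in conn.execute(SP + """
    SELECT DISTINCT spx.sno
    FROM sp spx
    WHERE spx.pno = 'P1'
      AND 1000 NOT IN (
          SELECT spy.qty FROM sp spy
          WHERE spy.sno = spx.sno AND spy.pno = 'P1'
      )
    ORDER BY spx.sno
""")]

not_exists = [row[0] for row in conn.execute(SP + """
    SELECT DISTINCT spx.sno
    FROM sp spx
    WHERE spx.pno = 'P1'
      AND NOT EXISTS (
          SELECT * FROM sp spy
          WHERE spy.sno = spx.sno AND spy.pno = 'P1' AND spy.qty = 1000
      )
    ORDER BY spx.sno
""")]

print(not_in)      # ['S2']
print(not_exists)  # ['S1', 'S2']
```

For S1 the NOT IN predicate is 1000 NOT IN (NULL), which is UNKNOWN, so S1 is left out. The NOT EXISTS subquery finds no row with qty = 1000 for S1, so S1 is kept.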
Of course, the source of the problem is the presence of nulls, therefore the 'real' solution is to eliminate those nulls.
This can be achieved (among other possible designs) using two tables:
sp suppliers known to supply parts
spq suppliers known to supply parts in known quantities
noting there should probably be a foreign key constraint where spq references sp.
The result can then be obtained using the 'minus' relational operator (being the EXCEPT keyword in Standard SQL) e.g.
WITH sp AS
( SELECT *
FROM ( VALUES ( 'S1', 'P1' ),
( 'S2', 'P1' ),
( 'S3', 'P1' ) )
AS T ( sno, pno )
),
spq AS
( SELECT *
FROM ( VALUES ( 'S2', 'P1', 200 ),
( 'S3', 'P1', 1000 ) )
AS T ( sno, pno, qty )
)
SELECT sno
FROM spq
WHERE pno = 'P1'
EXCEPT
SELECT sno
FROM spq
WHERE pno = 'P1'
AND qty = 1000;
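A sketch of the two-table design with real tables instead of CTEs, run through Python's sqlite3 module. SQLite, the column types, and the exact key and NOT NULL declarations are my assumptions about how the design might be declared.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sp (
    sno TEXT NOT NULL,
    pno TEXT NOT NULL,
    PRIMARY KEY (sno, pno)
);
CREATE TABLE spq (
    sno TEXT NOT NULL,
    pno TEXT NOT NULL,
    qty INTEGER NOT NULL,  -- unknown quantities are simply not recorded here
    FOREIGN KEY (sno, pno) REFERENCES sp (sno, pno)
);
INSERT INTO sp VALUES ('S1', 'P1'), ('S2', 'P1'), ('S3', 'P1');
INSERT INTO spq VALUES ('S2', 'P1', 200), ('S3', 'P1', 1000);
""")

suppliers = [row[0] for row in conn.execute("""
    SELECT sno FROM spq WHERE pno = 'P1'
    EXCEPT
    SELECT sno FROM spq WHERE pno = 'P1' AND qty = 1000
""")]

print(suppliers)  # ['S2']
```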
this is for Boy:
select party_code
from abc as a
where party_code not in (select party_code
from xyz
where party_code = a.party_code);
this works regardless of ansi settings
also this might be of use to know the logical difference between join, exists and in
http://weblogs.sqlteam.com/mladenp/archive/2007/05/18/60210.aspx
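To see what the correlated form does with nulls, here is a sketch using Python's sqlite3 module. SQLite and the sample rows in abc and xyz are my assumptions, and other engines may treat the null row differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE abc (party_code TEXT);
CREATE TABLE xyz (party_code TEXT);
INSERT INTO abc VALUES ('A'), ('B'), (NULL);
INSERT INTO xyz VALUES ('A'), (NULL);
""")

# Plain NOT IN: the NULL in xyz makes the predicate UNKNOWN for every non-matching row.
plain = conn.execute("""
    SELECT party_code FROM abc
    WHERE party_code NOT IN (SELECT party_code FROM xyz)
""").fetchall()

# Correlated NOT IN: the subquery only returns rows equal to a.party_code,
# so it never hands a NULL to the NOT IN.
correlated = {row[0] for row in conn.execute("""
    SELECT party_code
    FROM abc AS a
    WHERE party_code NOT IN (SELECT party_code
                             FROM xyz
                             WHERE party_code = a.party_code)
""")}

print(plain)       # []
print(correlated)  # contains 'B' and None
```

Note that the row whose party_code is null is returned as well: its subquery is empty, and NOT IN over an empty set is true.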

SQL IN query produces strange result

Please see the table structure below:
CREATE TABLE Person (id int not null, PID INT NOT NULL, Name VARCHAR(50))
CREATE TABLE [Order] (OID INT NOT NULL, PID INT NOT NULL)
INSERT INTO Person VALUES (1,1,'Ian')
INSERT INTO Person VALUES (2,2,'Maria')
INSERT INTO [Order] values (1,1)
Why does the following query return two results:
select * from Person WHERE id IN (SELECT ID FROM [Order])
ID does not exist in Order. Why does the query above produce results? I would expect it to error because ID does not exist in Order.
This behavior, while unintuitive, is very well defined in Microsoft's Knowledge Base:
KB #298674 : PRB: Subquery Resolves Names of Column to Outer Tables
From that article:
To illustrate the behavior, use the following two table structures and query:
CREATE TABLE X1 (ColA INT, ColB INT)
CREATE TABLE X2 (ColC INT, ColD INT)
SELECT ColA FROM X1 WHERE ColA IN (Select ColB FROM X2)
The query returns a result where the column ColB is considered from table X1.
By qualifying the column name, the error message occurs as illustrated by the following query:
SELECT ColA FROM X1 WHERE ColA in (Select X2.ColB FROM X2)
Server: Msg 207, Level 16, State 3, Line 1
Invalid column name 'ColB'.
Folks have been complaining about this issue for years, but Microsoft isn't going to fix it. It is, after all, complying with the standard, which essentially states:
If you don't find column x in the current scope, traverse to the next outer scope, and so on, until you find a reference.
More information in the following Connect "bugs" along with multiple official confirmations that this behavior is by design and is not going to change (so you'll have to change yours - i.e. always use aliases):
Connect #338468 : CTE Column Name resolution in Sub Query is not validated
Connect #735178 : T-SQL subquery not working in some cases when IN operator used
Connect #302281 : Non-existent column causes subquery to be ignored
Connect #772612 : Alias error not being reported when within an IN operator
Connect #265772 : Bug using sub select
In your case, this "error" will probably be much less likely to occur if you use more meaningful names than ID, OID and PID. Does Order.PID point to Person.id or Person.PID? Design your tables so that people can figure out the relationships without having to ask you. A PersonID should always be a PersonID, no matter where in the schema it is; same with an OrderID. Saving a few characters of typing is not a good price to pay for a completely ambiguous schema.
You could write an EXISTS clause instead:
... FROM dbo.Person AS p WHERE EXISTS
(
SELECT 1 FROM dbo.[Order] AS o
WHERE o.PID = p.id -- or is it PID? See why it pays to be explicit?
);
The problem here is that you're not using Table.Column notation in your subquery. The table [Order] doesn't have a column ID, so ID in the subquery really means Person.ID, not [Order].ID. That's why I always insist on using aliases for tables in production code. Compare these two queries:
select * from Person WHERE id IN (SELECT ID FROM [Order]);
select * from Person as p WHERE p.id IN (SELECT o.ID FROM [Order] as o)
The first one will execute but will return incorrect results, and the second one will raise an error. It's because the outer query's columns may be referenced in a subquery, so in this case you can use Person columns inside the subquery.
Perhaps you wanted to use the query like this:
select * from Person WHERE pid IN (SELECT PID FROM [Order])
But you never know when the schema of the [Order] table changes, and if somebody drops the column PID from [Order] then your query will return all rows from the table Person. Therefore, use aliases:
select * from Person as P WHERE P.pid IN (SELECT O.PID FROM [Order] as O)
Just a quick note: this is not SQL Server specific behaviour, it's standard SQL:
SQL Server demo
PostgreSQL demo
MySQL demo
Oracle demo
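For anyone who wants to try this without a server, here is a minimal sketch using Python's built-in sqlite3 module. It assumes SQLite resolves unqualified column names the same way as described above (it did when I reasoned through it, but treat the engine choice as my assumption, not part of the original question). The table is written as "Order" with double quotes because SQLite does not need the T-SQL bracket syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Person (id INT NOT NULL, PID INT NOT NULL, Name VARCHAR(50));
    CREATE TABLE "Order" (OID INT NOT NULL, PID INT NOT NULL);
    INSERT INTO Person VALUES (1, 1, 'Ian');
    INSERT INTO Person VALUES (2, 2, 'Maria');
    INSERT INTO "Order" VALUES (1, 1);
""")

# Unqualified: "id" is not a column of "Order", so it binds to the outer
# Person.id and the predicate becomes "Person.id IN (Person.id)".
unqualified = conn.execute(
    'SELECT Name FROM Person WHERE id IN (SELECT id FROM "Order") ORDER BY id'
).fetchall()

# Qualified: o.id can only refer to "Order", so the statement is rejected.
try:
    conn.execute(
        'SELECT Name FROM Person AS p '
        'WHERE p.id IN (SELECT o.id FROM "Order" AS o)'
    )
    qualified_error = None
except sqlite3.OperationalError as exc:
    qualified_error = str(exc)

print(unqualified)       # both Person rows come back
print(qualified_error)   # the engine complains about the missing column
```

The aliased form fails loudly, which is the point of always qualifying columns inside a subquery.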
The Order table doesn't have an id column.
Try these instead:
select * from Person WHERE id IN (SELECT OID FROM [Order])
OR
select * from Person WHERE pid IN (SELECT PID FROM [Order])

NULL values inside NOT IN clause

This issue came up when I got different record counts for what I thought were identical queries, one using a NOT IN WHERE constraint and the other a LEFT JOIN. The table in the NOT IN constraint had one NULL value (bad data), which caused that query to return a count of 0 records. I sort of understand why, but I could use some help fully grasping the concept.
To state it simply, why does query A return a result but B doesn't?
A: select 'true' where 3 in (1, 2, 3, null)
B: select 'true' where 3 not in (1, 2, null)
This was on SQL Server 2005. I also found that calling set ansi_nulls off causes B to return a result.
Query A is the same as:
select 'true' where 3 = 1 or 3 = 2 or 3 = 3 or 3 = null
Since 3 = 3 is true, you get a result.
Query B is the same as:
select 'true' where 3 <> 1 and 3 <> 2 and 3 <> null
When ansi_nulls is on, 3 <> null is UNKNOWN, so the predicate evaluates to UNKNOWN, and you don't get any rows.
When ansi_nulls is off, 3 <> null is true, so the predicate evaluates to true, and you get a row.
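Queries A and B can be reproduced with a small sketch in Python's built-in sqlite3 module. This is my own choice of engine, not the SQL Server 2005 setup from the question; SQLite has no ansi_nulls switch, so it only shows the ansi_nulls-on behaviour. Selecting the predicates directly makes the UNKNOWN visible, since sqlite3 returns SQL NULL as None.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Query A and query B from the question, unchanged apart from the engine.
rows_a = conn.execute("SELECT 'true' WHERE 3 IN (1, 2, 3, NULL)").fetchall()
rows_b = conn.execute("SELECT 'true' WHERE 3 NOT IN (1, 2, NULL)").fetchall()

# The raw predicate values: 1 means TRUE, None stands for UNKNOWN (SQL NULL).
pred_a, pred_b = conn.execute(
    "SELECT 3 IN (1, 2, 3, NULL), 3 NOT IN (1, 2, NULL)"
).fetchone()

print(rows_a, rows_b)   # A yields one row, B yields none
print(pred_a, pred_b)   # A is TRUE, B is UNKNOWN rather than FALSE
```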
NOT IN returns 0 records when compared against an unknown value
Since NULL is an unknown, a NOT IN query containing a NULL or NULLs in the list of possible values will always return 0 records since there is no way to be sure that the NULL value is not the value being tested.
Whenever you use NULL you are really dealing with a Three-Valued logic.
Your first query returns results as the WHERE clause evaluates to:
3 = 1 or 3 = 2 or 3 = 3 or 3 = null
which is:
FALSE or FALSE or TRUE or UNKNOWN
which evaluates to
TRUE
The second one:
3 <> 1 and 3 <> 2 and 3 <> null
which evaluates to:
TRUE and TRUE and UNKNOWN
which evaluates to:
UNKNOWN
UNKNOWN is not the same as FALSE.
You can easily test it by calling:
select 'true' where 3 <> null
select 'true' where not (3 <> null)
Both queries will give you no results.
If UNKNOWN were the same as FALSE, then, assuming the first query gave you FALSE, the second would have to evaluate to TRUE, as it would be the same as NOT(FALSE).
That is not the case.
There is a very good article on this subject on SqlServerCentral.
The whole issue of NULLs and Three-Valued Logic can be a bit confusing at first, but it is essential to understand in order to write correct queries in T-SQL.
Another article I would recommend is SQL Aggregate Functions and NULL.
Comparing to NULL is undefined unless you use IS NULL.
So, when comparing 3 to NULL (query A), it returns undefined.
I.e. SELECT 'true' where 3 in (1,2,null)
and
SELECT 'true' where 3 not in (1,2,null)
will produce the same result, as NOT (UNDEFINED) is still undefined, not TRUE.
If you want to filter with NOT IN against a subquery containing NULLs, just check for NOT NULL:
SELECT blah FROM t WHERE blah NOT IN
(SELECT someotherBlah FROM t2 WHERE someotherBlah IS NOT NULL )
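A quick way to see the difference the IS NOT NULL filter makes is the following sketch with Python's sqlite3 module. The table and column names are taken from the query above; the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t  (blah INT);
    CREATE TABLE t2 (someotherBlah INT);
    INSERT INTO t  VALUES (1), (2), (3);   -- hypothetical sample data
    INSERT INTO t2 VALUES (1), (NULL);     -- one "bad data" NULL
""")

# Without the filter, the NULL in t2 makes every comparison UNKNOWN or FALSE.
unfiltered = conn.execute(
    "SELECT blah FROM t "
    "WHERE blah NOT IN (SELECT someotherBlah FROM t2) ORDER BY blah"
).fetchall()

# With the filter, the NULL never reaches the NOT IN list.
filtered = conn.execute(
    "SELECT blah FROM t WHERE blah NOT IN "
    "(SELECT someotherBlah FROM t2 WHERE someotherBlah IS NOT NULL) "
    "ORDER BY blah"
).fetchall()

print(unfiltered)  # no rows
print(filtered)    # the rows of t that are not in t2
```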
The title of this question at the time of writing is
SQL NOT IN constraint and NULL values
From the text of the question it appears that the problem was occurring in a SQL DML SELECT query, rather than a SQL DDL CONSTRAINT.
However, especially given the wording of the title, I want to point out that some statements made here are potentially misleading, namely those along the lines of (paraphrasing):
When the predicate evaluates to UNKNOWN you don't get any rows.
Although this is the case for SQL DML, when considering constraints the effect is different.
Consider this very simple table with two constraints taken directly from the predicates in the question (and addressed in an excellent answer by @Brannon):
DECLARE @T TABLE
(
true CHAR(4) DEFAULT 'true' NOT NULL,
CHECK ( 3 IN (1, 2, 3, NULL )),
CHECK ( 3 NOT IN (1, 2, NULL ))
);
INSERT INTO @T VALUES ('true');
SELECT COUNT(*) AS tally FROM @T;
As per @Brannon's answer, the first constraint (using IN) evaluates to TRUE and the second constraint (using NOT IN) evaluates to UNKNOWN. However, the insert succeeds! Therefore, in this case it is not strictly correct to say, "you don't get any rows" because we have indeed got a row inserted as a result.
The above effect is indeed the correct one as regards the SQL-92 Standard. Compare and contrast the following section from the SQL-92 spec
7.6 <where clause>
The result of the <where clause> is a table of those rows of T for
which the result of the <search condition> is true.
4.10 Integrity constraints
A table check constraint is satisfied if and only if the specified
<search condition> is not false for any row of a table.
In other words:
In SQL DML, rows are removed from the result when the WHERE evaluates to UNKNOWN because it does not satisfy the condition "is true".
In SQL DDL (i.e. constraints), rows are not removed from the result when they evaluate to UNKNOWN because it does satisfy the condition "is not false".
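The contrast between WHERE and CHECK can be sketched with Python's sqlite3 module. Using SQLite instead of a T-SQL table variable is my assumption, and the names T, U and flag are made up for this example (the column is called flag rather than true to stay clear of keyword issues). SQLite documents that a CHECK expression evaluating to NULL is not a violation, which matches the "is not false" wording above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE T (
        flag CHAR(4) DEFAULT 'true' NOT NULL,
        CHECK (3 IN (1, 2, 3, NULL)),    -- evaluates to TRUE
        CHECK (3 NOT IN (1, 2, NULL))    -- evaluates to UNKNOWN
    )
""")

# The UNKNOWN constraint does not block the insert.
conn.execute("INSERT INTO T VALUES ('true')")
tally = conn.execute("SELECT COUNT(*) FROM T").fetchone()[0]

# The very same predicate in a WHERE clause filters the row out.
where_rows = conn.execute(
    "SELECT flag FROM T WHERE 3 NOT IN (1, 2, NULL)"
).fetchall()

# For comparison, a constraint that is plainly FALSE does reject the row.
conn.execute("CREATE TABLE U (flag CHAR(4), CHECK (3 NOT IN (1, 2, 3)))")
try:
    conn.execute("INSERT INTO U VALUES ('true')")
    false_check_rejected = False
except sqlite3.IntegrityError:
    false_check_rejected = True

print(tally, where_rows, false_check_rejected)
```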
Although the effects in SQL DML and SQL DDL respectively may seem contradictory, there is practical reason for giving UNKNOWN results the 'benefit of the doubt' by allowing them to satisfy a constraint (more correctly, allowing them to not fail to satisfy a constraint): without this behaviour, every constraint would have to explicitly handle nulls and that would be very unsatisfactory from a language design perspective (not to mention, a right pain for coders!)
p.s. if you are finding it as challenging to follow such logic as "unknown does not fail to satisfy a constraint" as I am to write it, then consider you can dispense with all this simply by avoiding nullable columns in SQL DDL and anything in SQL DML that produces nulls (e.g. outer joins)!
In A, 3 is tested for equality against each member of the set, yielding (FALSE, FALSE, TRUE, UNKNOWN). Since one of the elements is TRUE, the condition is TRUE. (It's also possible that some short-circuiting takes place here, so it actually stops as soon as it hits the first TRUE and never evaluates 3=NULL.)
In B, I think it is evaluating the condition as NOT (3 in (1,2,null)). Testing 3 for equality against the set yields (FALSE, FALSE, UNKNOWN), which is aggregated to UNKNOWN. NOT ( UNKNOWN ) yields UNKNOWN. So overall the truth of the condition is unknown, which at the end is essentially treated as FALSE.
SQL uses three-valued logic for truth values. The IN query produces the expected result:
SELECT * FROM (VALUES (1), (2)) AS tbl(col) WHERE col IN (NULL, 1)
-- returns first row
But adding a NOT does not invert the results:
SELECT * FROM (VALUES (1), (2)) AS tbl(col) WHERE NOT col IN (NULL, 1)
-- returns zero rows
This is because the above query is equivalent to the following:
SELECT * FROM (VALUES (1), (2)) AS tbl(col) WHERE NOT (col = NULL OR col = 1)
Here is how the where clause is evaluated:
| col | col = NULL⁽¹⁾ | col = 1 | col = NULL OR col = 1 | NOT (col = NULL OR col = 1) |
|-----|----------------|---------|-----------------------|-----------------------------|
| 1 | UNKNOWN | TRUE | TRUE | FALSE |
| 2 | UNKNOWN | FALSE | UNKNOWN⁽²⁾ | UNKNOWN⁽³⁾ |
Notice that:
(1) The comparison involving NULL yields UNKNOWN.
(2) The OR expression where none of the operands are TRUE and at least one operand is UNKNOWN yields UNKNOWN.
(3) The NOT of UNKNOWN yields UNKNOWN.
You can extend the above example to more than two values (e.g. NULL, 1 and 2) but the result will be same: if one of the values is NULL then no row will match.
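The evaluation rules in the table can be modelled in a few lines of Python. This is only an illustrative sketch of three-valued logic, not how any database engine is implemented; None stands in for UNKNOWN, and the helper names (eq3, or3, not3, in3, where) are invented here.

```python
from functools import reduce
from typing import Optional

Tri = Optional[bool]  # True, False, or None standing in for UNKNOWN

def eq3(a, b) -> Tri:
    """Comparison: anything compared with NULL is UNKNOWN."""
    return None if a is None or b is None else a == b

def not3(x: Tri) -> Tri:
    """NOT of UNKNOWN stays UNKNOWN."""
    return None if x is None else not x

def or3(x: Tri, y: Tri) -> Tri:
    """OR: TRUE wins; otherwise any UNKNOWN operand makes it UNKNOWN."""
    if x is True or y is True:
        return True
    if x is None or y is None:
        return None
    return False

def in3(value, candidates) -> Tri:
    """value IN (c1, c2, ...) expands to a chain of ORed equalities."""
    return reduce(or3, (eq3(value, c) for c in candidates), False)

def where(rows, predicate):
    """A WHERE clause keeps a row only when the predicate is TRUE."""
    return [r for r in rows if predicate(r) is True]

kept_in = where([1, 2], lambda col: in3(col, [None, 1]))
kept_not_in = where([1, 2], lambda col: not3(in3(col, [None, 1])))

print(kept_in)      # only the row with col = 1
print(kept_not_in)  # no rows: FALSE for col = 1, UNKNOWN for col = 2
```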
Null signifies an absence of data, that is, it is unknown, not a data value of nothing. It's very easy for people from a programming background to confuse this because in C-type languages, when using pointers, null is indeed nothing.
Hence in the first case 3 is indeed in the set of (1,2,3,null) so true is returned
In the second however you can reduce it to
select 'true' where 3 not in (null)
So nothing is returned because the parser knows nothing about the set to which you are comparing it - it's not an empty set but an unknown set. Using (1, 2, null) doesn't help because the (1,2) set is obviously false, but then you're and'ing that against unknown, which is unknown.
It may be concluded from answers here that NOT IN (subquery) doesn't handle nulls correctly and should be avoided in favour of NOT EXISTS. However, such a conclusion may be premature. In the following scenario, credited to Chris Date (Database Programming and Design, Vol 2 No 9, September 1989), it is NOT IN that handles nulls correctly and returns the correct result, rather than NOT EXISTS.
Consider a table sp to represent suppliers (sno) who are known to supply parts (pno) in quantity (qty). The table currently holds the following values:
VALUES ('S1', 'P1', NULL),
('S2', 'P1', 200),
('S3', 'P1', 1000)
Note that quantity is nullable i.e. to be able to record the fact a supplier is known to supply parts even if it is not known in what quantity.
The task is to find the suppliers who are known to supply part number 'P1' but not in quantities of 1000.
The following uses NOT IN to correctly identify supplier 'S2' only:
WITH sp AS
( SELECT *
FROM ( VALUES ( 'S1', 'P1', NULL ),
( 'S2', 'P1', 200 ),
( 'S3', 'P1', 1000 ) )
AS T ( sno, pno, qty )
)
SELECT DISTINCT spx.sno
FROM sp spx
WHERE spx.pno = 'P1'
AND 1000 NOT IN (
SELECT spy.qty
FROM sp spy
WHERE spy.sno = spx.sno
AND spy.pno = 'P1'
);
However, the query below uses the same general structure with NOT EXISTS, yet incorrectly includes supplier 'S1' in the result (i.e. the one for which the quantity is null):
WITH sp AS
( SELECT *
FROM ( VALUES ( 'S1', 'P1', NULL ),
( 'S2', 'P1', 200 ),
( 'S3', 'P1', 1000 ) )
AS T ( sno, pno, qty )
)
SELECT DISTINCT spx.sno
FROM sp spx
WHERE spx.pno = 'P1'
AND NOT EXISTS (
SELECT *
FROM sp spy
WHERE spy.sno = spx.sno
AND spy.pno = 'P1'
AND spy.qty = 1000
);
So NOT EXISTS is not the silver bullet it may have appeared!
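The two results can be compared side by side with a sketch in Python's sqlite3 module. Loading the rows into a real table named sp instead of the CTE over VALUES is my adaptation for SQLite; the two predicates themselves are the ones from the queries above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sp (sno TEXT, pno TEXT, qty INT);
    INSERT INTO sp VALUES ('S1', 'P1', NULL),
                          ('S2', 'P1', 200),
                          ('S3', 'P1', 1000);
""")

# NOT IN: for S1 the subquery yields only NULL, so the test is UNKNOWN.
not_in = [row[0] for row in conn.execute("""
    SELECT DISTINCT spx.sno
      FROM sp AS spx
     WHERE spx.pno = 'P1'
       AND 1000 NOT IN (SELECT spy.qty
                          FROM sp AS spy
                         WHERE spy.sno = spx.sno
                           AND spy.pno = 'P1')
     ORDER BY spx.sno
""")]

# NOT EXISTS: for S1 "qty = 1000" is UNKNOWN, the subquery is empty,
# and NOT EXISTS of an empty set is TRUE.
not_exists = [row[0] for row in conn.execute("""
    SELECT DISTINCT spx.sno
      FROM sp AS spx
     WHERE spx.pno = 'P1'
       AND NOT EXISTS (SELECT *
                         FROM sp AS spy
                        WHERE spy.sno = spx.sno
                          AND spy.pno = 'P1'
                          AND spy.qty = 1000)
     ORDER BY spx.sno
""")]

print(not_in)      # S2 only
print(not_exists)  # S1 and S2
```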
Of course, the source of the problem is the presence of nulls; therefore the 'real' solution is to eliminate those nulls.
This can be achieved (among other possible designs) using two tables:
sp suppliers known to supply parts
spq suppliers known to supply parts in known quantities
noting there should probably be a foreign key constraint where spq references sp.
The result can then be obtained using the 'minus' relational operator (being the EXCEPT keyword in Standard SQL) e.g.
WITH sp AS
( SELECT *
FROM ( VALUES ( 'S1', 'P1' ),
( 'S2', 'P1' ),
( 'S3', 'P1' ) )
AS T ( sno, pno )
),
spq AS
( SELECT *
FROM ( VALUES ( 'S2', 'P1', 200 ),
( 'S3', 'P1', 1000 ) )
AS T ( sno, pno, qty )
)
SELECT sno
FROM spq
WHERE pno = 'P1'
EXCEPT
SELECT sno
FROM spq
WHERE pno = 'P1'
AND qty = 1000;
this is for Boy:
select party_code
from abc as a
where party_code not in (select party_code
from xyz
where party_code = a.party_code);
This works regardless of ANSI settings.
Also, this might be of use for understanding the logical difference between JOIN, EXISTS and IN:
http://weblogs.sqlteam.com/mladenp/archive/2007/05/18/60210.aspx