SQL Not In vs Left Join - sql

I ran into a problem today that I couldn't quite understand, so I was hoping for some outside knowledge. I was trying to find the number of items in a table where their id isn't referenced in another. I ran two different queries and seem to have conflicting results.
select count(*)
from TableA
where ID not in (select aID from TableB)
returns 0
select count(*)
from TableA a
left join TableB b on b.aID = a.ID
where b.aID is null
returns a few thousand.
All IDs in both TableA and TableB are unique. An ID from TableA never shows up in the aID column from TableB more than once. To me, it seems like I am querying the same thing but receiving different results. Where am I going wrong?

Do not use NOT IN with a subquery that can return NULL. If any value returned by the subquery is NULL, NOT IN can never evaluate to true, so all rows are filtered out. These are the rules of how NULL is defined in SQL. The LEFT JOIN is correct.
The reason is that NULL means an unknown value. Almost any comparison with NULL returns NULL, which a WHERE clause treats as false. So with a NULL in the subquery, NOT IN has only two possible outcomes: either an element matches the value you are looking for, and the expression returns false, or no element matches but one is NULL, and the expression returns NULL, which is treated as false.
I usually advise replacing the NOT IN with NOT EXISTS:
select count(*)
from TableA a
where not exists (select 1 from TableB b where b.aID = a.ID);
The LEFT JOIN performs correctly and usually has good performance.
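To see the three forms side by side, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names come from the question; the sample rows (including the single NULL in TableB.aID) are made up for illustration, and other database engines should behave the same way for this case.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TableA (ID INTEGER PRIMARY KEY);
    CREATE TABLE TableB (aID INTEGER);            -- nullable on purpose
    INSERT INTO TableA (ID) VALUES (1), (2), (3);
    INSERT INTO TableB (aID) VALUES (1), (NULL);  -- one match, one NULL
""")

def count(sql):
    """Run a COUNT(*) query and return the single number."""
    return conn.execute(sql).fetchone()[0]

# NOT IN: the NULL in the subquery stops any row from qualifying.
not_in = count(
    "SELECT COUNT(*) FROM TableA WHERE ID NOT IN (SELECT aID FROM TableB)"
)
# LEFT JOIN / IS NULL: counts the rows of TableA with no match in TableB.
left_join = count(
    "SELECT COUNT(*) FROM TableA a "
    "LEFT JOIN TableB b ON b.aID = a.ID WHERE b.aID IS NULL"
)
# NOT EXISTS: same answer as the LEFT JOIN, unaffected by the NULL.
not_exists = count(
    "SELECT COUNT(*) FROM TableA a "
    "WHERE NOT EXISTS (SELECT 1 FROM TableB b WHERE b.aID = a.ID)"
)

print(not_in, left_join, not_exists)  # prints: 0 2 2
```

Removing the NULL row from TableB makes all three counts agree.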

We should always use the EXISTS operator if the columns involved are nullable. EXISTS is also often faster than IN, although that depends on the optimizer.
Using the IN / NOT IN operators might produce an inferior plan and can also lead to misleading results if a NULL value is inserted into the table, just like in your case.

Related

how the sub select query in the case clause get the parameter value of the main query

I have two tables that both have id columns, but TableA.id is char and TableB.id is int. I want to join the two tables, but the problem is that some strings in A.id can't be converted to int. Here is the query I wrote:
SELECT
case
when Column1 is null
then (select Surname from TableB
where TableA.id = TableB.id
)
else Column1
end
FROM TableA
GO
The subselect returns a bunch of records, so my question is: is it possible to run that subquery with the current TableB.id? I am not sure if I explained this clearly; I want to know how the subquery gets the TableB.id value of the main query. Thanks.
I'm not sure I'm following you, but it sounds like the crux of the problem is that you are trying to join on ID, but they are different field types. Perhaps something like:
SELECT
COALESCE(TableA.Column1, TableB.Surname)
FROM
TableA
LEFT OUTER JOIN TableB On
TableA.ID = Cast(TableB.ID AS Char(64))
I was just taking a guess at the CHAR size, but I assume that's ample. Also, I'm not sure what DB you are working on, so the syntax may need a bit of tweaking.
There is a feature that can do that. It's called a Correlated Subquery, although I'm not sure they work inside case statements.
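As a quick check of both suggestions, the sketch below runs a correlated subquery inside a CASE expression and the COALESCE / LEFT JOIN version against the same data. It uses Python's sqlite3 module, so it only shows that this works in SQLite; the sample rows and the CAST to TEXT (rather than Char(64)) are assumptions made for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TableA (id TEXT, Column1 TEXT);
    CREATE TABLE TableB (id INTEGER, Surname TEXT);
    -- 'x9' cannot be converted to an integer
    INSERT INTO TableA VALUES ('1', NULL), ('2', 'Kept'), ('x9', NULL);
    INSERT INTO TableB VALUES (1, 'Smith'), (2, 'Jones');
""")

# Correlated subquery: a.id refers to the current row of the outer query.
case_rows = conn.execute("""
    SELECT a.id,
           CASE WHEN a.Column1 IS NULL
                THEN (SELECT b.Surname FROM TableB b
                      WHERE a.id = CAST(b.id AS TEXT))
                ELSE a.Column1
           END
    FROM TableA a
    ORDER BY a.id
""").fetchall()

# The same result expressed as a LEFT JOIN with COALESCE.
join_rows = conn.execute("""
    SELECT a.id, COALESCE(a.Column1, b.Surname)
    FROM TableA a
    LEFT OUTER JOIN TableB b ON a.id = CAST(b.id AS TEXT)
    ORDER BY a.id
""").fetchall()

print(case_rows)  # [('1', 'Smith'), ('2', 'Kept'), ('x9', None)]
```

Casting the int side to text, instead of the char side to int, is what keeps the non-numeric ids from raising a conversion error in stricter engines.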

SQL - Running Select Query with Having Clause

here is the query I want to run.
SELECT COUNT(tableA.ID)
FROM tableA
NATURAL JOIN tableB
NATURAL JOIN tableC
WHERE tableB.Time IS NULL
GROUP BY tableA.ID
HAVING COUNT(tableA.ID) < tableC.Quantity
This query will run perfectly fine without the HAVING clause, however the HAVING clause has an error which I can't pick out.
The purpose of the HAVING clause is to return only the IDs whose count is below the Quantity threshold (which is defined as tableC.Quantity).
How can I fix my current HAVING clause so that the query only returns IDs whose count is less than tableC.Quantity?
Note: if you need more clarification, I can provide more.
I am going to assume that the error is something to the effect that tableC.quantity is not in the group by clause (and that you are not using MySQL). If so, you can fix this by using an aggregation function:
SELECT COUNT(tableA.ID)
FROM tableA NATURAL JOIN
tableB NATURAL JOIN
tableC
WHERE tableB.Time IS NULL
GROUP BY tableA.ID
HAVING COUNT(tableA.ID) < max(tableC.Quantity);
By the way, I think natural join is a dangerous operation. You could add a new column to a table and invalidate all your queries, with no error message to tell you what is going wrong.
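A small sketch of the aggregated HAVING clause, run through Python's sqlite3 module. The schema is a guess (the question does not show it): each table is assumed to share only an ID column, the joins are written out explicitly instead of NATURAL JOIN, and ID is added to the select list so the output is readable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tableA (ID INTEGER PRIMARY KEY);
    CREATE TABLE tableB (ID INTEGER, Time TEXT);
    CREATE TABLE tableC (ID INTEGER, Quantity INTEGER);
    INSERT INTO tableA VALUES (1), (2);
    INSERT INTO tableB VALUES (1, NULL), (1, NULL), (2, NULL), (2, '2020-01-01');
    INSERT INTO tableC VALUES (1, 3), (2, 1);
""")

rows = conn.execute("""
    SELECT a.ID, COUNT(a.ID)
    FROM tableA a
    JOIN tableB b ON b.ID = a.ID
    JOIN tableC c ON c.ID = a.ID
    WHERE b.Time IS NULL
    GROUP BY a.ID
    -- Quantity is not in GROUP BY, so it is wrapped in an aggregate.
    HAVING COUNT(a.ID) < MAX(c.Quantity)
""").fetchall()

print(rows)  # [(1, 2)]
```

ID 1 has two open rows against a quantity of 3 and is returned; ID 2 has one open row against a quantity of 1 and is filtered out.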

What is causing these seemingly inconsistent query results?

I have a problem that's really confounding me. I half-expect one of you to point out some really dumb mistake that I'm overlooking but I'm really just not seeing it.
I have a table that our production processes have been feeding for something like a year and we just got some crazy tables from our client against which we are trying to match data. In the following queries, tableA is my table and tableB is the table we've just imported.
The basic problem is that
select *
from tableA
where convert(nvarchar(30),accountNum) not in (
select CisAC
from tableB
)
isn't returning any records when I believe it should be. I think that it should find any records in tableA where the accountNum does not match any CisAC value in tableB. Right? CisAC is an nvarchar(30) and our accountNum field is a bigint.
To point out why I think an empty return set is wrong:
select * from tableA where convert(nvarchar(30),accountNum) = '336906210032'
returns one record but
select * from tableB where CisAC = '336906210032'
does not.
So, what gives? (And thanks for your time!)
My suspicion would be null values in tableB causing the IN to fail
I would try
select *
from tableA
left join tableB
on convert(nvarchar(30),tableA.accountNum) = tableB.CisAC
where tableB.CisAc is null
Your query is correct. It's returning the expected results.
See here for the SQL Fiddle: http://sqlfiddle.com/#!6/dfb5d/1
What is probably happening is that the data you have in tableB is not matching the data in tableA.
Edit:
As @Andomar answered, if tableB has a null value, the query will return no rows. See here:
http://sqlfiddle.com/#!6/05bb1/1
This is probably the classic not in mistake. If table B contains any null value,
where convert(nvarchar(30),accountNum) not in (
select CisAC
from tableB
)
will never succeed. You can write it out like:
where convert(nvarchar(30),accountNum) <> null and convert(nvarchar(30),accountNum) <> ...
Since any comparison to null evaluates to unknown, this condition is never true.
Replacing the query with a join, as podiluska's answer suggests, should do the trick.
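The expansion can be checked directly. This sketch evaluates the individual comparisons with Python's sqlite3 module, where SQL NULL (unknown) comes back as None; the value '111' is a made-up stand-in for the other CisAC values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def scalar(sql):
    """Evaluate a single SQL expression and return its value."""
    return conn.execute(sql).fetchone()[0]

# A comparison with NULL is unknown, which arrives in Python as None.
cmp_null = scalar("SELECT '336906210032' <> NULL")

# true AND unknown is still unknown, so the WHERE clause rejects the row.
expanded = scalar(
    "SELECT ('336906210032' <> '111') AND ('336906210032' <> NULL)"
)

# NOT IN is shorthand for that chain of <> comparisons.
not_in_with_null = scalar("SELECT '336906210032' NOT IN ('111', NULL)")
not_in_without_null = scalar("SELECT '336906210032' NOT IN ('111', '222')")

print(cmp_null, expanded, not_in_with_null, not_in_without_null)
# prints: None None None 1
```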

SQL Method of checking that INNER / LEFT join doesn't duplicate rows

Is there a good or standard SQL method of asserting that a join does not duplicate any rows (produces 0 or 1 copies of the source table row)? Assert as in causes the query to fail or otherwise indicate that there are duplicate rows.
A common problem in a lot of queries is when a table is expected to be 1:1 with another table, but there might exist 2 rows that match the join criteria. This can cause errors that are hard to track down, especially for people not necessarily entirely familiar with the tables.
It seems like there should be something simple and elegant - this would be very easy for the SQL engine to detect (have I already joined this source row to a row in the other table? ok, error out) but I can't seem to find anything on this. I'm aware that there are long / intrusive solutions to this problem, but for many ad hoc queries those just aren't very fun to work out.
EDIT / CLARIFICATION: I'm looking for a one-step query-level fix. Not a verification step on the results of that query.
If you are only testing for linked rows rather than requiring output, then you'd use EXISTS.
More correctly, you need a "semi-join", but most RDBMSs only support this in the form of EXISTS:
SELECT a.*
FROM TableA a
WHERE EXISTS (SELECT * FROM TableB b WHERE a.id = b.id)
Also see:
Using 'IN' with a sub-query in SQL Statements
EXISTS vs JOIN and use of EXISTS clause
SELECT JoinField
FROM MyJoinTable
GROUP BY JoinField
HAVING COUNT(*) > 1
LIMIT 1
Is that simple enough? Don't have Postgres but I think it's valid syntax.
Something along the lines of
SELECT a.id, COUNT(b.id)
FROM TableA a
JOIN TableB b ON a.id = b.id
GROUP BY a.id
HAVING COUNT(b.id) > 1
Should return rows in TableA that have more than one associated row in TableB.
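If a pre-check is acceptable, the GROUP BY / HAVING idea can be wrapped in a helper that raises before the real join runs. This is only a sketch using Python's sqlite3 module: assert_no_duplicate_keys is a hypothetical name, and it is a separate verification step, not the one-step query-level assertion the question asks for.

```python
import sqlite3

def assert_no_duplicate_keys(conn, table, column):
    """Raise ValueError if a non-NULL value of `column` repeats in `table`."""
    # Identifiers cannot be bound as parameters; pass trusted names only.
    row = conn.execute(
        f'SELECT "{column}", COUNT(*) FROM "{table}" '
        f'WHERE "{column}" IS NOT NULL '
        f'GROUP BY "{column}" HAVING COUNT(*) > 1 LIMIT 1'
    ).fetchone()
    if row is not None:
        raise ValueError(f"{table}.{column} = {row[0]!r} appears {row[1]} times")

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TableA (id INTEGER PRIMARY KEY);
    CREATE TABLE TableB (id INTEGER);
    INSERT INTO TableA VALUES (1), (2);
    INSERT INTO TableB VALUES (1), (2), (2);  -- id 2 would duplicate rows
""")

try:
    assert_no_duplicate_keys(conn, "TableB", "id")
    duplicate_found = False
except ValueError as exc:
    duplicate_found = True
    print(exc)  # TableB.id = 2 appears 2 times
```

Where the schema can be changed, a UNIQUE constraint on the join column enforces the same guarantee permanently.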

SQL (any) Request for insight on a query optimization

I have a particularly slow query due to the vast amount of information being joined together. However I needed to add a where clause in the shape of id in (select id from table).
I want to know if there is any gain from the following, and more pressing, will it even give the desired results.
select a.* from a where a.id in (select id from b where b.id = a.id)
as an alternative to:
select a.* from a where a.id in (select id from b)
Update:
MySQL
Can't be more specific sorry
table a is effectively a join between 7 different tables.
use of * is for examples
Edit, b doesn't get selected
Your question was about the difference between these two:
select a.* from a where a.id in (select id from b where b.id = a.id)
select a.* from a where a.id in (select id from b)
The former is a correlated subquery. It may cause MySQL to execute the subquery for each row of a.
The latter is a non-correlated subquery. MySQL should be able to execute it once and cache the results for comparison against each row of a.
I would use the latter.
Provided b.id is unique, both queries you list are equivalent to:
select a.*
from a
inner join b on b.id = a.id
Almost all optimizers will execute them in the same way.
You could post a real execution plan, and someone here might give you a way to speed it up. It helps if you specify what database server you are using.
YMMV, but I've often found using EXISTS instead of IN makes queries run faster.
SELECT a.* FROM a WHERE EXISTS (SELECT 1 FROM b WHERE b.id = a.id)
Of course, without seeing the rest of the query and the context, this may not make the query any faster.
JOINing may be a more preferable option, but if a.id appears more than once in the id column of b, you would have to throw a DISTINCT in there, and you more than likely go backwards in terms of optimization.
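The duplicate-row caveat is easy to reproduce. The sketch below uses Python's sqlite3 module with made-up rows in which id 2 appears twice in b; it compares results only and says nothing about relative speed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY);
    CREATE TABLE b (id INTEGER);
    INSERT INTO a VALUES (1), (2), (3);
    INSERT INTO b VALUES (1), (2), (2);  -- id 2 appears twice in b
""")

def ids(sql):
    """Return the first column of every result row as a list."""
    return [row[0] for row in conn.execute(sql)]

in_ids = ids("SELECT a.id FROM a WHERE a.id IN (SELECT id FROM b) ORDER BY a.id")
exists_ids = ids(
    "SELECT a.id FROM a "
    "WHERE EXISTS (SELECT 1 FROM b WHERE b.id = a.id) ORDER BY a.id"
)
join_ids = ids("SELECT a.id FROM a JOIN b ON b.id = a.id ORDER BY a.id")
distinct_ids = ids(
    "SELECT DISTINCT a.id FROM a JOIN b ON b.id = a.id ORDER BY a.id"
)

print(in_ids, exists_ids, join_ids, distinct_ids)
# prints: [1, 2] [1, 2] [1, 2, 2] [1, 2]
```

IN and EXISTS return each row of a at most once, while the plain join repeats id 2 until DISTINCT is added.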
I would never use a subquery like this. A join would be much faster.
select a.*
from a
join b on a.id = b.id
Of course, don't use select * either (especially when doing a join, as at least one field is repeated); it wastes network resources to send unneeded data.
Have you looked at the execution plan?
How about
select a.*
from a
inner join b
on a.id = b.id
presumably the id fields are primary keys?
Select a.* from a
inner join (Select distinct id from b) c
on a.ID = c.id
I tried all 3 versions and they ran about the same. The execution plan was the same (inner join, IN (with and without where clause in subquery), Exists)
Since you are not selecting any other fields from b, I prefer to use WHERE ... IN (SELECT ...). Anyone looking at the query would know what you are trying to do (only show rows from a if they are in b).
Your problem is most likely in the seven tables within "a".
Make the FROM table the one that contains "a.id".
Make the next join: inner join b on a.id = b.id
Then join in the other six tables.
You really need to show the entire query, list all indexes, and give approximate row counts of each table if you want real help.