Alternative to NOT IN() - sql

I have a table with 14,028 rows from November 2012. I also have a table with 13,959 rows from March 2013. I am using a simple NOT IN() clause to see who has left:
select * from nov_2012 where id not in(select id from mar_2013)
This returned 396 rows and I never thought anything of it, until I went to analyze who left. When I pulled all the ids for the lost members and put them in a temp table (##lost), 32 of them were actually still in the mar_2013 table. I can pull them up when I search for their ids using the following:
select * from mar_2013 where id in(select id from ##lost)
I can't figure out what is going on. I will mention that the id field I created is an IDENTITY column. Could that have any effect on the matching using NOT IN? Is there a better way to check for missing rows between tables? I have also tried:
select a.* from nov_2012 a left join mar_2013 b on b.id = a.id where b.id is NULL
And received the same results.
This is how I created the identity field;
create table id_lookup( dateofcusttable date ,sin int ,sex varchar(12) ,scid int identity(777000,1))
insert into id_lookup (sin, sex) select distinct sin, sex from [Client Raw].dbo.cust20130331 where sin <> 0 order by sin, sex
This is how I added the scid into the march table:
select scid, rowno as custrowno
into scid_20130331
from [Client Raw].dbo.cust20130331 cust
left join id_lookup scid
on scid.sin = cust.sin
and scid.sex = cust.sex
update scid_20130331
set scid = custrowno where scid is NULL --for members who don't have more than one id or sin information is not available
drop table Account_Part2_Current
select a.*, scid
into Account_Part2_Current
from Account_Part1_Current a
left join scid_20130331 b
on b.custrowno = a.rowno_custdmd_cust
I then group all the information by the scid

I would prefer this form (and here's why):
SELECT a.id --, other columns
FROM dbo.nov_2012 AS a
WHERE NOT EXISTS (SELECT 1 FROM dbo.mar_2013 WHERE id = a.id);
However this should still give the same results as what you've tried, so I suspect there is something about the data model that you're not telling us - for example, is mar_2013.id nullable?

this is logically equivalent to not in and is faster than not in.
where yourfield in
(select afield
from somewhere
minus
select
thesamefield
where you want to exclude the record
)
It probably isn't as fast as using where not exists, as per Aaron's answer so you should only use it if not exists does not provide the results you want.

Related

SQL JOIN and COUNT COMMANDS

I have two different tables, we will call table 1 and table 2. Within table 1, I have a column known as featureid which is a numerical value that corresponds to a numerical value in table 2 known as the termid. Also, within this table 2, each of the different termid corresponds to a plain text description of that termid.
What I am attempting to do is join the featureid in table 1 to the termid in table 2, but have the output be a two column display of the plain text description and occurrence of each within table 1.
I know I need to use the JOIN and COUNT syntax within SQL, but not sure how to correctly write the command.
Thanks!
You actually only need one column in the GROUP BY. It's also good practice to specify which values are coming from which table, like so:
SELECT
table2.textDesc,
COUNT(table1.featureid) AS OccurrenceCount
FROM
table1
INNER JOIN
table2 ON
table1.featureid = table2.termid
GROUP BY table2.textDesc
I think this is what you are looking for:
SELECT textDesc, COUNT(featureid)
FROM table1, table2
WHERE featureid=termid
GROUP BY featureid, textDesc
Alternatively, you can use a different syntax (with the same end result) like so:
SELECT textDesc, COUNT(featureid)
FROM table1 INNER JOIN table2
ON featureid=termid
GROUP BY featureid, textDesc

SELECT query to return a row from a table with all values set to Null

I need to make a query but get the value in every field empty. Gordon Linoff give me the clue to this need here:
SQL Empty query results
which is:
select t.*
from (select 1 as val
) v left outer join
table t
on 1 = 0;
This query wors perfectly on PostgreSQL but gets an error when trying to execute it in Microsoft Access, it says that 1 = 0 expression is not admitted. How could it be fixed to work on microsoft access?
Regards,
If the table has a numeric primary key column whose values are non-negative then the following query will work in Access. The primary key field is [ID].
SELECT t2.*
FROM
myTable AS t2
RIGHT JOIN
(
SELECT TOP 1 (ID * -1) AS badID
FROM myTable AS t1
) AS rowStubs
ON t2.ID = rowStubs.badID
This was tested with Access 2010.
I am offering this answer here, even though you didn't think it worked in my edit to your original question. What is the problem?
select t.*
from (select max(col) as maxval from table as t
) as v left join
table as t
on v.val < t.col;
You can use the following query, but it would still need a little "manual coding".
EDITS:
Actually, you do not need the SWITCH function. Modified query below.
Removed the reference to Description column from one line. Still, you would need to use a Text column name (such as Description) in the last line of the query.
For example, the following query would work for the Months table:
select Months.*
from Months
RIGHT OUTER JOIN
(select "" as DummyColumn from Months) Blank_Data
ON Months.Description = Blank_Data.DummyColumn; --hardcoded Description column

SQL IN query produces strange result

Please see the table structure below:
CREATE TABLE Person (id int not null, PID INT NOT NULL, Name VARCHAR(50))
CREATE TABLE [Order] (OID INT NOT NULL, PID INT NOT NULL)
INSERT INTO Person VALUES (1,1,'Ian')
INSERT INTO Person VALUES (2,2,'Maria')
INSERT INTO [Order] values (1,1)
Why does the following query return two results:
select * from Person WHERE id IN (SELECT ID FROM [Order])
ID does not exist in Order. Why does the query above produce results? I would expect it to error because I'd does not exist in order.
This behavior, while unintuitive, is very well defined in Microsoft's Knowledge Base:
KB #298674 : PRB: Subquery Resolves Names of Column to Outer Tables
From that article:
To illustrate the behavior, use the following two table structures and query:
CREATE TABLE X1 (ColA INT, ColB INT)
CREATE TABLE X2 (ColC INT, ColD INT)
SELECT ColA FROM X1 WHERE ColA IN (Select ColB FROM X2)
The query returns a result where the column ColB is considered from table X1.
By qualifying the column name, the error message occurs as illustrated by the following query:
SELECT ColA FROM X1 WHERE ColA in (Select X2.ColB FROM X2)
Server: Msg 207, Level 16, State 3, Line 1
Invalid column name 'ColB'.
Folks have been complaining about this issue for years, but Microsoft isn't going to fix it. It is, after all, complying with the standard, which essentially states:
If you don't find column x in the current scope, traverse to the next outer scope, and so on, until you find a reference.
More information in the following Connect "bugs" along with multiple official confirmations that this behavior is by design and is not going to change (so you'll have to change yours - i.e. always use aliases):
Connect #338468 : CTE Column Name resolution in Sub Query is not validated
Connect #735178 : T-SQL subquery not working in some cases when IN operator used
Connect #302281 : Non-existent column causes subquery to be ignored
Connect #772612 : Alias error not being reported when within an IN operator
Connect #265772 : Bug using sub select
In your case, this "error" will probably be much less likely to occur if you use more meaningful names than ID, OID and PID. Does Order.PID point to Person.id or Person.PID? Design your tables so that people can figure out the relationships without having to ask you. A PersonID should always be a PersonID, no matter where in the schema it is; same with an OrderID. Saving a few characters of typing is not a good price to pay for a completely ambiguous schema.
You could write an EXISTS clause instead:
... FROM dbo.Person AS p WHERE EXISTS
(
SELECT 1 FROM dbo.[Order] AS o
WHERE o.PID = p.id -- or is it PID? See why it pays to be explicit?
);
The problem here is that you're not using Table.Column notation in your subquery, table Order doesn't have column ID and ID in subquery really means Person.ID, not [Order].ID. That's why I always insist on using aliases for tables in production code. Compare these two queries:
select * from Person WHERE id IN (SELECT ID FROM [Order]);
select * from Person as p WHERE p.id IN (SELECT o.ID FROM [Order] as o)
The first one will execute but will return incorrect results, and the second one will raise an error. It's because the outer query's columns may be referenced in a subquery, so in this case you can use Person columns inside the subquery.
Perhaps you wanted to use the query like this:
select * from Person WHERE pid IN (SELECT PID FROM [Order])
But you never know when the schema of the [Order] table changes, and if somebody drops the column PID from [Order] then your query will return all rows from the table Person. Therefore, use aliases:
select * from Person as P WHERE P.pid IN (SELECT O.PID FROM [Order] as O)
Just quick note - this is not SQL Server specific behaviour, it's standard SQL:
SQL Server demo
PostgreSQL demo
MySQL demo
Oracle demo
Order table doesnt have id column
Try these instead:
select * from Person WHERE id IN (SELECT OID FROM [Order])
OR
select * from Person WHERE pid IN (SELECT PID FROM [Order])

How can I stop joins from adding rows in my match query?

I'm having difficulty translating what I want into functional programming, since I think imperatively. Basically, I have a table of forms, and a table of expectations. In the Expectation view, I want it to look through the forms table and tell me if each one found a match. However, when I try to use joins to accomplish this, the joins are adding rows to the Expectation table when two or more forms match. I do not want this.
In an imperative fashion, I want the equivalent of this:
ForEach (row in Expectation table)
{
if (any form in the Form table matches the criteria)
{
MatchID = form.ID;
SignDate = form.SignDate;
...
}
}
What I have in SQL is this:
SELECT
e.*, match.ID, match.SignDate, ...
FROM
POFDExpectation e LEFT OUTER JOIN
(SELECT MIN(ID) as MatchID, MIN(SignDate) as MatchSignDate,
COUNT(*) as MatchCount, ...
FROM Form f
GROUP BY (matching criteria columns)
) match
ON (form.[match criteria] = expectation.[match criteria])
Which works okay, but very slowly, and every time there are TWO matches, a row is added to the Expectation results. Mathematically I understand that a join is a cross multiply and this is expected, but I'm unsure how to do this without them. Subquery perhaps?
I'm not able to give too many further details about the implementation, but I'll be happy to try any suggestion and respond with the results. I have 880 Expectation rows, and 942 results being returned. If I only allow results that match one form, I get 831 results. Neither are desirable, so if yours gets me to exactly 880, yours is the accepted answer.
Edit: I am using SQL Server 2008 R2, though a generic solution would be best.
Sample code:
--DROP VIEW ExpectationView; DROP TABLE Forms; DROP TABLE Expectations;
--Create Tables and View
CREATE TABLE Forms (ID int IDENTITY(1,1) PRIMARY KEY, ReportYear int, Name varchar(100), Complete bit, SignDate datetime)
GO
CREATE TABLE Expectations (ID int IDENTITY(1,1) PRIMARY KEY, ReportYear int, Name varchar(100))
GO
CREATE VIEW ExpectationView AS select e.*, filed.MatchID, filed.SignDate, ISNULL(filed.FiledCount, 0) as FiledCount, ISNULL(name.NameCount, 0) as NameCount from Expectations e LEFT OUTER JOIN
(select MIN(ID) as MatchID, ReportYear, Name, Complete, Min(SignDate) as SignDate, COUNT(*) as FiledCount from Forms f GROUP BY ReportYear, Name, Complete) filed
on filed.ReportYear = e.ReportYear AND filed.Name like '%'+e.Name+'%' AND filed.Complete = 1 LEFT OUTER JOIN
(select MIN(ID) as MatchID, ReportYear, Name, COUNT(*) as NameCount from Forms f GROUP BY ReportYear, Name) name
on name.ReportYear = e.ReportYear AND name.Name like '%'+e.Name+'%'
GO
--Insert Text Data
INSERT INTO Forms (ReportYear, Name, Complete, SignDate)
SELECT 2011, 'Bob Smith', 1, '2012-03-01' UNION ALL
SELECT 2011, 'Bob Jones', 1, '2012-10-04' UNION ALL
SELECT 2011, 'Bob', 1, '2012-07-20'
GO
INSERT INTO Expectations (ReportYear, Name)
SELECT 2011, 'Bob'
GO
SELECT * FROM ExpectationView --Should only return 1 result, returns 9
The 'filed' shows that they have completed a form, 'name' shows that they may have started one but not finished it. My view has four different 'match criteria' - each a little more strict, and counts each. 'Name Only Matches', 'Loose Matches', 'Matches' (default), 'Tight Matches' (used if there are more than one default match.
This is how I do it when I want to keep to a JOIN-type query format:
SELECT
e.*,
match.ID,
match.SignDate,
...
FROM POFDExpectation e
OUTER APPLY (
SELECT TOP 1
MIN(ID) as MatchID,
MIN(SignDate) as MatchSignDate,
COUNT(*) as MatchCount,
...
FROM Form f
WHERE form.[match criteria] = expectation.[match criteria]
GROUP BY ID (matching criteria columns)
-- Add ORDER BY here to control which row is TOP 1
) match
It usually performs better as well.
Semantically, {CROSS|OUTER} APPLY (table-expression) specifies a table-expression that is called once for each row in the preceding table expressions of the FROM clause and then joined to them. Pragmatically, however, the compiler treats it almost identically to a JOIN.
The practical difference is that unlike a JOIN table-expression, the APPLY table-expression is dynamically re-evaluated for each row. So instead of an ON clause, it relies on its own logic and WHERE clauses to limit/match its rows to the preceding table-expressions. This also allows it to make reference to the column-values of the preceding table-expressions, inside its own internal subquery expression. (This is not possible in a JOIN)
The reason that we want this here, instead of a JOIN, is that we need a TOP 1 in the sub-query to limit its returned rows, however, that means that we need to move the ON clause conditions to the internal WHERE clause so that it will get applied before the TOP 1 is evaluated. And that means that we need an APPLY here, instead of the more usual JOIN.
#RBarryYoung answered the question as I asked it, but there was a second question that I didn't make very clear. What I really wanted was a combination of his answer and this question, so for the record here's what I used:
SELECT
e.*,
...
match.ID,
match.SignDate,
match.MatchCount
FROM
POFDExpectation e
OUTER APPLY (
SELECT TOP 1
ID as MatchID,
ReportYear,
...
SignDate as MatchSignDate,
COUNT(*) as MatchCount OVER ()
FROM
Form f
WHERE
form.[match criteria] = expectation.[match criteria]
-- Add ORDER BY here to control which row is TOP 1
) match

How do I link two values in a nested SQL SELECT query?

I have an SQL Server table JobFiles with columns JobId and FileId. This table maps which file belongs to which job. Each job can "contain" one or more files and each file can be "contained" in one or more job. For every pair such that job M contains file N there's a row (M,N) in the table.
I start with a job id and I need to get a list of all files such that they belong to that job only. I have hard time writing a request for that. So far I've crafted the following (pseudocode):
SELECT FileId FROM JobFiles WHERE JobId=JobIdICareAbout AND NOT EXISTS
(SELECT * FROM JobFiles WHERE FileId=ThatSameFileId AND JobId<>JobIdICareAbout);
The above I believe would work but I have a problem of how to map ThatSameFileId onto the FileId returned from the outer SELECT so that the database knows that those are the same.
How do I do that? How do I tell the database that FileId in the outer SELECT must be equal to the FileId in the inner SELECT?
How about using NOT IN:
SELECT FileId FROM JobFiles WHERE JobId=JobIdICareAbout AND FileID NOT IN
(SELECT FileId FROM JobFiles WHERE JobId<>JobIdICareAbout);
And a slight variation on that:
SELECT FileId FROM JobFiles WHERE JobId=JobIdICareAbout
EXCEPT
SELECT FileId FROM JobFiles WHERE JobId<>JobIdICareAbout
I'm taking another approach here but if I understood your problem correctly, it would generate the result you need.
Gist of it goes like this
Get all FileId's from the JobId you care about
JOIN back with the JobFiles tabel
Retain only those FileId's that have no other JobId using a HAVING clause.
SQL Statement
SELECT jf1.FileId
FROM JobFiles jf1
INNER JOIN (
SELECT FileId
FROM JobFiles
WHERE JobID = JobIdICareAbout
) jf2 ON jf2.FileID = jf1.FileID
GROUP BY
jf1.FileId
HAVING COUNT(*) = 1
Another approach:
LEFT JOIN will only find rows if this file is linked to another job, checking for IS NULL removes those files:
SELECT JobFiles.FileId
FROM JobFiles
LEFT JOIN JobFiles OtherJobFiles ON ( OtherJobFiles.FileId = JobFiles.FileId
AND OtherJobFiles.JobId <> JobIdICareAbout )
WHERE JobFiles.JobId=JobIdICareAbout
AND OtherJobFiles.FileId IS NULL
It seems you have a lot of (different but) working answers/queries. Here's one more using NOT EXISTS. It's only a correction of what you tried:
SELECT jf.FileId
FROM JobFiles jf
WHERE jf.JobId = JobIdICareAbout
AND NOT EXISTS
( SELECT 1
FROM JobFiles jf2
WHERE jf2.FileId = jf.FileId
AND jf2.JobId <> JobIdICareAbout
)