SQL Left Join first match only

I have a query against a large number of big tables (rows and columns) with a number of joins; however, one of the tables has some duplicate rows of data that are causing issues for my query. Since this is a read-only realtime feed from another department I can't fix that data, but I am trying to prevent it from causing issues in my query.
Given that, I need to add this crap data as a left join to my good query. The data set looks like:
IDNo FirstName LastName ...
-------------------------------------------
uqx bob smith
abc john willis
ABC john willis
aBc john willis
WTF jeff bridges
sss bill doe
ere sally abby
wtf jeff bridges
...
(about 2 dozen columns, and 100K rows)
My first instinct was to perform a DISTINCT, which gave me about 80K rows:
SELECT DISTINCT P.IDNo
FROM people P
But when I try the following, I get all the rows back:
SELECT DISTINCT P.*
FROM people P
OR
SELECT
DISTINCT(P.IDNo) AS IDNoUnq
,P.FirstName
,P.LastName
...etc.
FROM people P
I then thought I would do a FIRST() aggregate function on all the columns, but that feels wrong too. Syntactically, am I doing something wrong here?
Update:
Just wanted to note: these records are duplicates based on a non-key / non-indexed field, the ID listed above. The ID is a text field which, although it holds the same value, differs in case from row to row, and that is what causes the issue.

distinct is not a function. It always operates on all columns of the select list.
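The parentheses in your second attempt don't do what you hope; they just wrap the column expression, so the following two statements are equivalent:
SELECT DISTINCT (IDNo), FirstName, LastName FROM people;
SELECT DISTINCT IDNo, FirstName, LastName FROM people;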
Your problem is a typical "greatest N per group" problem which can easily be solved using a window function:
select ...
from (
    select IDNo,
           FirstName,
           LastName,
           ....,
           row_number() over (partition by lower(idno) order by firstname) as rn
    from people
) t
where rn = 1;
Using the order by clause you can select which of the duplicates you want to pick.
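For example, using the entry column from the play data later in the thread, ordering by entry desc would pick the most recently loaded duplicate instead (a sketch; it assumes entry is an ever-increasing key):
row_number() over (partition by lower(idno) order by entry desc) as rn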
The above can be used in a left join, see below:
select ...
from x
left join (
    select IDNo,
           FirstName,
           LastName,
           ....,
           row_number() over (partition by lower(idno) order by firstname) as rn
    from people
) p on p.idno = x.idno and p.rn = 1
where ...

Add an identity column (PeopleID) and then use a correlated subquery to return the first row for each IDNo value.
SELECT *
FROM People p
WHERE PeopleID = (
    SELECT MIN(PeopleID)
    FROM People
    WHERE IDNo = p.IDNo
)
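Since the feed is read-only, the identity column has to go on a copy you control. A minimal sketch, assuming SQL Server and a hypothetical staging table name:
-- Copy the feed locally, then add a surrogate key to break ties between duplicates
SELECT * INTO PeopleStaging FROM people;
ALTER TABLE PeopleStaging ADD PeopleID INT IDENTITY(1,1);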

After careful consideration, this dilemma has a few different solutions:
Aggregate Everything
Use an aggregate on each column to get the biggest or smallest field value. This is what I am doing, since it takes two partially filled-out records and "merges" the data.
http://sqlfiddle.com/#!3/59cde/1
SELECT
    UPPER(IDNo) AS user_id
  , MAX(FirstName) AS name_first
  , MAX(LastName) AS name_last
  , MAX(entry) AS row_num
FROM people P
GROUP BY
    UPPER(IDNo)
Get First (or Last) Record
http://sqlfiddle.com/#!3/59cde/23
-- ------------------------------------------------------
-- Notes
-- entry: auto-number primary key; some sort of unique PK is required for this method
-- IDNo: should be the primary key in the feed, but is not, so we make an upper-case version
-- This gets the first entry; to get the last entry, change MIN() to MAX()
-- ------------------------------------------------------
SELECT
    PC.user_id
  , PData.FirstName
  , PData.LastName
  , PData.entry
FROM (
    SELECT
        P2.user_id
      , MIN(P2.entry) AS rownum
    FROM (
        SELECT
            UPPER(P.IDNo) AS user_id
          , P.entry
        FROM people P
    ) AS P2
    GROUP BY
        P2.user_id
) AS PC
LEFT JOIN people PData
    ON PData.entry = PC.rownum
ORDER BY
    PData.entry

Use CROSS APPLY or OUTER APPLY; this way you can limit the amount of data joined in from the table with the duplicates to just the first hit.
Select
    x.*,
    c.*
from
    x
Cross Apply
(
    Select Top (1)
        IDNo,
        FirstName,
        LastName,
        ....
    from
        people As p
    where
        p.idno = x.idno
    Order By
        p.idno -- unnecessary if you don't need a specific match based on order
) As c
Cross Apply behaves like an inner join, Outer Apply like a left join
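Because the Top (1) subquery can come back empty for IDs missing from the duplicate table, the OUTER APPLY form may be what you want for this thread's left-join scenario. A sketch, using the same hypothetical x table as above:
Select
    x.*,
    c.*
from
    x
Outer Apply
(
    Select Top (1)
        IDNo,
        FirstName,
        LastName
    from
        people As p
    where
        p.idno = x.idno
) As c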
SQL Server CROSS APPLY and OUTER APPLY

Turns out I was doing it wrong: I needed to perform a nested select of just the important columns first, and then do a distinct select off that to prevent trash columns of 'unique' data from corrupting my good data. The following appears to have resolved the issue... but I will try it on the full dataset later.
SELECT DISTINCT P2.*
FROM (
    SELECT
        IDNo
      , FirstName
      , LastName
    FROM people P
) P2
Here is some play data as requested: http://sqlfiddle.com/#!3/050e0d/3
CREATE TABLE people
(
[entry] int
, [IDNo] varchar(3)
, [FirstName] varchar(5)
, [LastName] varchar(7)
);
INSERT INTO people
(entry,[IDNo], [FirstName], [LastName])
VALUES
(1,'uqx', 'bob', 'smith'),
(2,'abc', 'john', 'willis'),
(3,'ABC', 'john', 'willis'),
(4,'aBc', 'john', 'willis'),
(5,'WTF', 'jeff', 'bridges'),
(6,'Sss', 'bill', 'doe'),
(7,'sSs', 'bill', 'doe'),
(8,'ssS', 'bill', 'doe'),
(9,'ere', 'sally', 'abby'),
(10,'wtf', 'jeff', 'bridges')
;

Try this
SELECT *
FROM people P
where P.IDNo in (SELECT DISTINCT IDNo
FROM people)

Depending on the nature of the duplicate rows, it looks like all you want is case-insensitive comparison on those columns. Setting the collation on these columns should be what you're after:
SELECT DISTINCT
    p.IDNO COLLATE SQL_Latin1_General_CP1_CI_AS,
    p.FirstName COLLATE SQL_Latin1_General_CP1_CI_AS,
    p.LastName COLLATE SQL_Latin1_General_CP1_CI_AS
FROM people P
http://msdn.microsoft.com/en-us/library/ms184391.aspx
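If you control the schema of a local copy of the feed, the collation could also be changed once on the column itself rather than per query. A minimal sketch, assuming SQL Server and that nothing else depends on the current collation:
ALTER TABLE people
    ALTER COLUMN IDNo varchar(3) COLLATE SQL_Latin1_General_CP1_CI_AS;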

Related

T-SQL pivot with count on pivoted results

I have the following data that I would like to pivot and get a count based on the pivoted results.
CREATE TABLE #tempMusicSchoolStudent
(school VARCHAR(50),
 studentname VARCHAR(50),
 instrumentname VARCHAR(255),
 expertise INT)
INSERT INTO #tempMusicSchoolStudent(school, studentname, instrumentname, expertise)
SELECT 'Foster','Matt','Guitar','10'
UNION
SELECT 'Foster','Jimmy','Guitar','5'
UNION
SELECT 'Foster','Jimmy','Keyboard','8'
UNION
SELECT 'Foster','Ryan','Keyboard','9'
UNION
SELECT 'Midlothean','Kyle','Keyboard','10'
UNION
SELECT 'Midlothean','Mary','Guitar','4'
UNION
SELECT 'Midlothean','Mary','Keyboard','7'
Raw data:
I'd like the results to look like the data below....
I got this data using the SQL query below. The problem with this query is that I have a dynamic number of instruments (I've only shown 2 in this example for simplicity's sake). I'd like to use pivot because it will make for cleaner dynamic SQL. Otherwise I would have to dynamically left join the table to itself for each instrument.
SELECT
t.school, t.instrumentname, t.expertise,
t1.instrumentname, t1.expertise,
COUNT(DISTINCT t.studentname) [DistinctStudentCount]
FROM
#tempMusicSchoolStudent t
LEFT JOIN
#tempMusicSchoolStudent t1 ON t1.school = t.school
AND t1.studentname = t.studentname
AND t.instrumentname <> t1.instrumentname
GROUP BY
t.school, t.instrumentname, t.expertise, t1.instrumentname, t1.expertise
ORDER BY
t.school, t.instrumentname, t.expertise, t1.instrumentname, t1.expertise
If anyone has any ideas on how I can do this in a cleaner way than dynamically left joining the table to itself it would be much appreciated. Thanks.
You just need conditional aggregation:
SELECT t.school, t.instrumentname, t.expertise, t.instrumentname,
COUNT(DISTINCT t.studentname) as DistinctStudentCount
FROM #tempMusicSchoolStudent t
GROUP BY t.school, t.instrumentname, t.expertise, t.instrumentname;
You have rows with NULL values. It is entirely unclear where those come from. Your question is focused on some notion of "pivoting" where it seems that you only need aggregation. But it doesn't explain where the NULL rows come from.
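For reference, a conditional-aggregation version of the pivot might look like this (a sketch that hard-codes the two instrument names from the sample data):
SELECT school, studentname,
       MAX(CASE WHEN instrumentname = 'Guitar' THEN expertise END) AS Guitar,
       MAX(CASE WHEN instrumentname = 'Keyboard' THEN expertise END) AS Keyboard
FROM #tempMusicSchoolStudent
GROUP BY school, studentname;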
You can try to make it dynamic for multiple instruments:
;with cte
as
(
SELECT * from
(SELECT * FROM #tempMusicSchoolStudent t) x
PIVOT
(MAX(expertise) FOR instrumentname in ([Guitar], [Keyboard])) y
)
SELECT school, studentname,
expertise = case when Guitar is not null then 'Guitar' else NULL end,
Guitar AS instrumentname,
expertise = case when Keyboard is not null then 'Keyboard' else NULL end,
Keyboard AS instrumentname,
count(distinct studentname) AS [DistinctStudentCount]
from cte
group by school,studentname, Guitar, Keyboard
OUTPUT:
Foster Jimmy Guitar 5 Keyboard 8 1
Foster Matt Guitar 10 NULL NULL 1
Foster Ryan NULL NULL Keyboard 9 1
Midlothean Kyle NULL NULL Keyboard 10 1
Midlothean Mary Guitar 4 Keyboard 7 1
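To avoid hard-coding [Guitar] and [Keyboard], the IN list can be built dynamically. A sketch, assuming SQL Server and the temp table above:
DECLARE @cols nvarchar(max), @sql nvarchar(max);
-- Build a column list such as [Guitar],[Keyboard] from the distinct instrument names
SELECT @cols = STUFF((
    SELECT DISTINCT ',' + QUOTENAME(instrumentname)
    FROM #tempMusicSchoolStudent
    FOR XML PATH('')), 1, 1, '');
SET @sql = N'SELECT school, studentname, ' + @cols + N'
FROM (SELECT school, studentname, instrumentname, expertise
      FROM #tempMusicSchoolStudent) s
PIVOT (MAX(expertise) FOR instrumentname IN (' + @cols + N')) p;';
EXEC sp_executesql @sql;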
Here's the solution I was looking for, I had to use unpivot + pivot.
The real thing that I was struggling with was selecting multiple values for the column that is being pivoted, instead of the max value.
So in this case I wanted multiple "expertise" numbers under a given "instrument expertise" column. Not just the maximum expertise for that instrument.
The first key to understanding the solution is that the pivot statement does an implicit group by on the columns being selected. So, in order to get multiple values under your pivoted column, you have to preserve the integrity of the column you are grouping on by including some kind of dense_rank/rank/row_number. That ranking marks changes in the value of the column you are pivoting on and is then included in the implicit group by that the pivot does, which results in multiple values in the pivoted column, not just the max.
So in the code below the "expertisenum" column is keeping the integrity of the expertise data.
CREATE TABLE #tempMusicSchoolStudent
(school VARCHAR(50),
 studentname VARCHAR(50),
 instrumentname VARCHAR(255),
 expertise INT)
INSERT INTO #tempMusicSchoolStudent(school, studentname, instrumentname, expertise)
SELECT 'Foster','Matt','Guitar','10'
UNION
SELECT 'Foster','Jimmy','Guitar','5'
UNION
SELECT 'Foster','Jimmy','Keyboard','8'
UNION
SELECT 'Foster','Ryan','Keyboard','9'
UNION
SELECT 'Midlothean','Kyle','Keyboard','10'
UNION
SELECT 'Midlothean','Mary','Guitar','4'
UNION
SELECT 'Midlothean','Mary','Keyboard','7'
SELECT school, [Guitar expertise], [Keyboard expertise], COUNT(*) [Count]
FROM
(
SELECT school,[expertiseNum],
CASE WHEN [Columns]='expertise' THEN instrumentname + ' expertise'
END [Columns1], [Values] AS [Values1]
FROM
(
SELECT school, studentname, instrumentname, DENSE_RANK() OVER(PARTITION BY school,instrumentname ORDER BY expertise) AS [expertiseNum],
CONVERT(VARCHAR(255),expertise) AS [expertise]
FROM #tempMusicSchoolStudent
) x
UNPIVOT (
[Values] FOR [Columns] IN ([expertise])
) unpvt
) p
PIVOT (
MAX([Values1]) FOR [Columns1] IN ([Guitar expertise], [Keyboard expertise])
) pvt
GROUP BY school,[Guitar expertise], [Keyboard expertise]

Get distinct individual column values (not distinct pairs) from two tables in single query

I have two tables like the following. One is for the sport talents of some people and the second for art talents. A person may not have a sport talent to list, and the same applies for art talents.
CREATE TABLE SPORT_TALENT(name varchar(10), TALENT varchar(10));
CREATE TABLE ART_TALENT(name varchar(10), TALENT varchar(10));
INSERT INTO SPORT_TALENT(name, TALENT) VALUES
('Steve', 'Footbal')
,('Steve', 'Golf')
,('Bob' , 'Golf')
,('Mary' , 'Tennnis');
INSERT INTO ART_TALENT(name, TALENT) VALUES
('Steve', 'Dancer')
, ('Steve', 'Singer')
, ('Bob' , 'Dancer')
, ('Bob' , 'Singer')
, ('John' , 'Dancer');
Now I want to list the sport talents and art talents of one person, and I would like to avoid duplication. But I don't mind if there is a "null" in any output. I tried the following:
select distinct sport_talent.talent as s_talent,art_talent.talent as a_talent
from sport_talent
JOIN art_talent on sport_talent.name=art_talent.name
where (sport_talent.name='Steve' or art_talent.name='Steve');
s_talent | a_talent
----------+----------
Footbal | Dancer
Golf | Singer
Footbal | Singer
Golf | Dancer
I would like to avoid redundancy and need something like the following (distinct values of sport talents + distinct values of art talents).
s_talent | a_talent
----------+----------
Footbal | Dancer
Golf | Singer
As mentioned in the subject, I am not looking for distinct combinations. But at the same time, it's OK if there are some records with a "null" value in one column. I am relatively new to SQL.
Try:
SELECT s_talent, a_talent
FROM (
SELECT distinct on (talent) talent as s_talent,
dense_rank() over (order by talent) as x
FROM SPORT_TALENT
WHERE name='Steve'
) x
FULL OUTER JOIN (
SELECT distinct on (talent) talent as a_talent,
dense_rank() over (order by talent) as x
FROM ART_TALENT
WHERE name='Steve'
) y
ON x.x = y.x
Demo: http://sqlfiddle.com/#!15/66e04/3
There are no duplicates in your query. Each of the four records your query returns is unique. This result may not be what you want, but the problem does not seem to be duplicates.
Postgres 9.4
... introduces unnest() with multiple arguments. Does exactly what you want, and should be fast, too. Per documentation:
The special table function UNNEST may be called with any number of
array parameters, and it returns a corresponding number of columns, as
if UNNEST (Section 9.18) had been called on each parameter separately
and combined using the ROWS FROM construct.
About ROWS FROM:
Compare result of two table functions using one column from each
SELECT *
FROM unnest(
ARRAY(SELECT DISTINCT talent FROM sport_talent WHERE name = 'Steve')
, ARRAY(SELECT DISTINCT talent FROM art_talent WHERE name = 'Steve')
) AS t(s_talent, a_talent);
Postgres 9.3 or older
SELECT s_talent, a_talent
FROM (
SELECT talent AS s_talent, row_number() OVER () AS rn
FROM sport_talent
WHERE name = 'Steve'
GROUP BY 1
) s
FULL JOIN (
SELECT talent AS a_talent, row_number() OVER () AS rn
FROM art_talent
WHERE name = 'Steve'
GROUP BY 1
) a USING (rn);
Similar previous answers with more explanation:
What type of JOIN to use
Sort columns independently, such that all nulls are last per column
This is similar to what @kordirko posted, but uses GROUP BY to get distinct talents, which is evaluated before window functions. So we only need a bare row_number() and not the more expensive dense_rank().
About the sequence of events in a SELECT query:
Best way to get result count before LIMIT was applied
SQL Fiddle.

SQL groupby multiple columns

Tablename: EntryTable
ID CharityName Title VoteCount
1 save the childrens save them 1
2 save the childrens saving childrens 3
3 cancer research support them 10
Tablename: ContestantTable
ID FirstName LastName EntryId
1 Neville Vyland 1
2 Abhishek Shukla 1
3 Raghu Nandan 2
Desired output
CharityName FullName
save the childrens Neville Vyland
Abhishek Shukla
cancer research Raghu Nandan
I tried
select LOWER(ET.CharityName) AS CharityName,COUNT(CT.FirstName) AS Total_No_Of_Contestant
from EntryTable ET
join ContestantTable CT
on ET.ID = CT.ID
group by LOWER(ET.CharityName)
Please advise.
Please have a look at this sqlfiddle.
Have a try with this query:
SELECT
e.CharityName,
c.FirstName,
c.LastName,
sq.my_count
FROM
EntryTable e
INNER JOIN ContestantTable c ON e.ID = c.EntryId
INNER JOIN (
SELECT EntryId, COUNT(*) AS my_count FROM ContestantTable GROUP BY EntryId
) sq ON e.ID = sq.EntryId
I assumed you actually wanted to join with ContestantTable's EntryId column. It made more sense to me. Either way (joining my way or yours) your sample data is faulty.
Apart from that, you didn't want repeating CharityNames. That's not the job of SQL. The database is just there to store and retrieve the data, not to format it nicely. You want to work with the data on the application layer anyway. Removing repeating data doesn't make this job easier; it makes it worse.
Most people do not realize that T-SQL has some cool ranking functions that can be used with grouping. Many things like reports can be done in T-SQL.
The first part of the code below creates two local temporary tables and loads them with data for testing.
The second part of the code creates the report. I use two common table expressions (CTE). I could have used two more local temporary tables or table variables. It really does not matter with this toy example.
The cte_RankData has two columns, RowNum and RankNum. If RowNum = RankNum, we are on the first instance of a charity: print out the charity name and the total number of votes. Otherwise, print out blanks.
The name of the contestant and the votes for that contestant are shown on the detail lines. This is a typical report with subtotals shown at the top.
I think this matches the report output that you wanted. I ordered the contestants by most votes descending.
Sincerely
John Miner
www.craftydba.com
--
-- Create the tables
--
-- Remove the tables
drop table #tbl_Entry;
drop table #tbl_Contestants;
-- The entries table
Create table #tbl_Entry
(
ID int,
CharityName varchar(25),
Title varchar(25),
VoteCount int
);
-- Add data
Insert Into #tbl_Entry values
(1, 'save the childrens', 'save them', 1),
(2, 'save the childrens', 'saving childrens', 3),
(3, 'cancer research', 'support them', 10)
-- The contestants table
Create table #tbl_Contestants
(
ID int,
FirstName varchar(25),
LastName varchar(25),
EntryId int
);
-- Add data
Insert Into #tbl_Contestants values
(1, 'Neville', 'Vyland', 1),
(2, 'Abhishek', 'Shukla', 1),
(3, 'Raghu', 'Nandan', 2);
--
-- Create the report
--
;
with cte_RankData
as
(
select
ROW_NUMBER() OVER (ORDER BY E.CharityName ASC, VoteCount Desc) as RowNum,
RANK() OVER (ORDER BY E.CharityName ASC) AS RankNum,
E.CharityName as CharityName,
C.FirstName + ' ' + C.LastName as FullName,
E.VoteCount
from #tbl_Entry E inner join #tbl_Contestants C on E.ID = C.ID
),
cte_SumData
as
(
select
E.CharityName,
sum(E.VoteCount) as TotalCount
from #tbl_Entry E
group by E.CharityName
)
select
case when RowNum = RankNum then
R.CharityName
else
''
end as rpt_CharityName,
case when RowNum = RankNum then
str(S.TotalCount, 5, 0)
else
''
end as rpt_TotalVotes,
FullName as rpt_ContestantName,
VoteCount as rpt_Votes4Contestant
from cte_RankData R join cte_SumData S
on R.CharityName = S.CharityName

How to insert data into ODD/EVEN rows only in SQL

I have one table with gender as one of the columns.
In the gender column, only M or F are allowed.
Now I want to sort the table so that, when displaying it, M and F appear alternately in the gender field.
I have tried....
I have tried to create one (new) table with the same structure as my existing table.
Now, using a high-level insert, I want to insert M into the odd rows and F into the even rows.
After that I want to join those two statements using the union operator.
I am able to insert only male or female into the (new) table, but not into the even or odd rows...
Can anybody help me with this?
Thanks in advance....
Don't consider a table to be "sorted". The SQL server may return the rows in any order depending on execution plan, index, joins etc. If you want a strict order you need to have an ordered column, like an identity column. Usually it is better to apply the desired sorting when selecting data.
However the interleaving of M and F is a little bit tricky, you need to use the ROW_NUMBER function.
Valid SQL Server code:
CREATE TABLE #GenderTable(
[Name] [nchar](10) NOT NULL,
[Gender] [char](1) NOT NULL
)
-- Create sample data
insert into #GenderTable (Name, Gender) values
('Adam', 'M'),
('Ben', 'M'),
('Casesar', 'M'),
('Alice', 'F'),
('Beatrice', 'F'),
('Cecilia', 'F')
SELECT * FROM #GenderTable
SELECT * FROM #GenderTable
order by ROW_NUMBER() over (partition by gender order by name), Gender
DROP TABLE #GenderTable
This gives the output
Name Gender
Adam M
Ben M
Casesar M
Alice F
Beatrice F
Cecilia F
and
Name Gender
Alice F
Adam M
Beatrice F
Ben M
Cecilia F
Casesar M
If you use another DBMS the syntax may differ.
I think the best way to do it would be to have two queries (one for M, one for F) and then join them together. The catch is that you have to calculate the "rank" within each query and then sort accordingly.
Something like the following should do what you need (using MySQL-style user variables):
select * from
  (select
     @rownum:=@rownum+1 rank,
     t.*
   from people_table t,
     (SELECT @rownum:=0) r
   where t.gender = 'M'
   union
   select
     @rownum:=@rownum+1 rank,
     t.*
   from people_table t,
     (SELECT @rownum:=0) r
   where t.gender = 'F') joined
order by joined.rank, joined.gender;
If you are using SQL Server, you can seed your two tables with an IDENTITY column as follows. Make one odd and one even and then union and sort by this column.
Note that you can only truly alternate if there are the same number of male and female records. If there are more of one than the other, you will end up with non-alternating rows at the end.
CREATE TABLE MaleTable(Id INT IDENTITY(1,2) NOT NULL, Gender CHAR(1) NOT NULL)
INSERT INTO MaleTable(Gender) SELECT 'M'
INSERT INTO MaleTable(Gender) SELECT 'M'
INSERT INTO MaleTable(Gender) SELECT 'M'
CREATE TABLE FemaleTable(Id INT IDENTITY(2,2) NOT NULL, Gender CHAR(1) NOT NULL)
INSERT INTO FemaleTable(Gender) SELECT 'F'
INSERT INTO FemaleTable(Gender) SELECT 'F'
INSERT INTO FemaleTable(Gender) SELECT 'F'
SELECT u.Id
,u.Gender
FROM (
SELECT Id, Gender
FROM FemaleTable
UNION
SELECT Id, Gender
FROM MaleTable
) u
ORDER BY u.Id ASC
See here for a working example

Help with SQL aggregate query for detecting duplicates

I have a table whose records contain a person's information and the filename that the information originated from, so the table looks like so:
|Table|
|id, first-name, last-name, ssn, filename|
I also have a stored procedure that provides some analytics for the files in the system, and I'm trying to add information to that stored procedure to shed light on the possibility of duplicates.
Here is the current stored procedure
SELECT [filename],
COUNT([filename]) as totalRecords,
COUNT(closedleads.id) as closedRecords,
ROUND(--calcs percent of records closed in a file)
FROM table
LEFT OUTER JOIN closedleads ON closedleads.leadid = table.id
GROUP BY [filename]
What I want to add is the ability to see maybe # of possible duplicates, defined as records with matching SSNs and I am at a loss as to how I could perform a count on a sub query or join and include it in the results set. Can anyone provide some pointers?
What I'm trying to do is add something like this to my procedure above
SELECT COUNT(
SELECT COUNT(*) FROM Table T1
INNER JOIN Table T2 on T1.SSN = T2.SSN
WHERE T1.id != T2.id
) as PossibleDuplicates
What I'm looking for is a way to merge this code with my procedure above so I can get all of the same data in one query, with the # of duplicates per filename; so for each filename I get the # of records, # of records closed, and # of possible duplicates.
EDIT:
I'm very close to my desired goal, but I'm failing on the last little bit: getting the number of possible duplicates BY filename. Here is my query:
select [q1].[filename], [q1].leads, [q1].closed, [q2].dups
FROM (
    SELECT [filename], count([filename]) as leads,
           count(closedleads.id) as closed
    FROM Table
    left join closedleads on closedleads.leadid = Table.id
    group by [filename]
) as [q1]
INNER JOIN (
    select count([ssn]) as dups, [filename] from Table
    group by [ssn], [filename]
    having count([ssn]) > 1
) as [q2] on [q1].[filename] = [q2].[filename]
This works, but it shows multiple results for each filename, with values of 2-5, instead of summing the total count of possible duplicates.
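One way to collapse those per-SSN rows into a single row per filename is to aggregate a second time over the subquery (a sketch of the idea; the working query below ended up using a different shortcut):
select [filename], sum(cnt - 1) as dups
from (
    select [filename], [ssn], count(*) as cnt
    from Table
    group by [filename], [ssn]
    having count(*) > 1
) as x
group by [filename]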
Working Query
Hey everyone, thanks for all the help. This is what I eventually got to, and it works exactly as I wanted:
select [q1].[filename], [q1].leads, [q1].closed, [q2].dups,
       round(cast([q1].closed as float) / [q1].leads, 3) as percentClosed
FROM (
    SELECT [filename], count([filename]) as leads,
           count(closedleads.id) as closed
    FROM Table
    left join closedleads on closedleads.leadid = Table.id
        and [filename] is not null
    group by [filename]
) as [q1]
INNER JOIN (
    select [filename], count(*) - count(distinct [ssn]) as dups
    from Table
    group by [filename]
) as [q2] on [q1].[filename] = [q2].[filename]
You'll probably want to make use of a HAVING clause somewhere, e.g.:
LEFT JOIN (
    SELECT SSN, COUNT(SSN) - 1 AS DupeCount
    FROM Table T1
    GROUP BY SSN
    HAVING COUNT(SSN) > 1
) AS PossibleDuplicates
    ON table.ssn = PossibleDuplicates.SSN
If you want to include 0 possible duplicates (rather than null) you actually don't need the HAVING clause, just the left join.
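Folded into the procedure from the question, that might look like the following sketch (table and closedleads as defined there; COALESCE turns non-matches into zeroes):
SELECT t.[filename],
       COUNT(t.[filename]) AS totalRecords,
       COUNT(closedleads.id) AS closedRecords,
       SUM(COALESCE(PossibleDuplicates.DupeCount, 0)) AS possibleDuplicates
FROM [table] t
LEFT OUTER JOIN closedleads ON closedleads.leadid = t.id
LEFT JOIN (
    SELECT ssn, COUNT(ssn) - 1 AS DupeCount
    FROM [table]
    GROUP BY ssn
) AS PossibleDuplicates ON t.ssn = PossibleDuplicates.ssn
GROUP BY t.[filename]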
Edit: updated with an example that matches your question better.
Here's an example if I understand correctly.
create table #table (id int,ssn varchar(10))
insert into #table values(1,'10')
insert into #table values(2,'10')
insert into #table values(3,'11')
insert into #table values(4,'12')
insert into #table values(5,'11')
insert into #table values(6,'13')
select sum(cnt)
from (
select count(distinct ssn) as cnt
from #table
group by ssn
having count(*)>1
) dups
You shouldn't need to self-join the table if you group by ssn and then pull back only the ssn's where you have more than one.
I think the existing answers don't quite understand your question. I think I do but it's not completely specified yet. Is it a duplicate if the same SSN appears in two different files or only within the same file? Because you group by filename, that becomes the grain.
The Output of your query is like
StateFarm1, 500, 50, 10%, <your new value goes here>
AllState2, 100, 90, 90% <your new value goes here>
So if you have the same SSN in those two files, you have 1 duplicate; on which row do you show it, the AllState row or the StateFarm row? If you say both, invariably someone will SUM that column and get double the results.
Now what if you have a Geico row with the same SSN: is that 1 duplicate or 2? And again, on which row?
I know this isn't a final answer, but these questions do highlight that the question as it stands is unanswerable... you fix this and I'll change the answer.
Please, no downvotes in the meantime.
Addendum
I believe the only thing you are missing is a DISTINCT.
select [q1].[filename], [q1].leads, [q1].closed, [q2].dups
FROM (
    SELECT [filename], count([filename]) as leads,
           count(closedleads.id) as closed
    FROM Table
    left join closedleads on closedleads.leadid = Table.id
    group by [filename]
) as [q1]
INNER JOIN (
    select count(DISTINCT [ssn]) as dups, [filename] from Table -- <---- here
    group by [ssn], [filename]
    having count([ssn]) > 1
) as [q2] on [q1].[filename] = [q2].[filename]
You don't need the outer COUNT - your inner SELECT COUNT(*)... will return just one number: a count of records with a duplicate SSN but a different id.
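In other words, something like this sketch (keeping the self-join from the question; note that each duplicate pair gets counted twice, once per direction):
SELECT COUNT(*) AS PossibleDuplicates
FROM Table T1
INNER JOIN Table T2 ON T1.SSN = T2.SSN
WHERE T1.id != T2.id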