Help with SQL aggregate query with detecting duplicates

Help with SQL aggregate query with detecting duplicates - sql

I have a table which has records that contain a persons information and a filename that the information originated from, so the table looks like so:
|Table|
|id, first-name, last-name, ssn, filename|
I also have a stored procedure that provides some analytics for the files in the system and i'm trying to add information to that stored procedure to shed light into the possibility of duplicates.
Here is the current stored procedure
SELECT [filename],
COUNT([filename]) as totalRecords,
COUNT(closedleads.id) as closedRecords,
ROUND(--calcs percent of records closed in a file)
FROM table
LEFT OUTER JOIN closedleads ON closedleads.leadid = table.id
GROUP BY [filename]
What I want to add is the ability to see maybe # of possible duplicates, defined as records with matching SSNs and I am at a loss as to how I could perform a count on a sub query or join and include it in the results set. Can anyone provide some pointers?
What I'm trying to do is add something like this to my procedure above
SELECT COUNT(
SELECT COUNT(*) FROM Table T1
INNER JOIN Table T2 on T1.SSN = T2.SSN
WHERE T1.id != T2.id
) as PossibleDuplicates
What I'm looking for is merging this code with my procedure above so I can get all of the same data in one and possible have this # of duplicates across each filename, so for each filename I get a result of # of records, # of records closed and # of possible duplicates
EDIT:
I'm very close to my desired goal but I'm failing on the last little bit--getting the number of possible duplicates BY filename, here is my query
select [q1].[filename], [q1].leads, [q1].closed, [q2].dups
FROM (
SELECT [filename], count([filename]) as leads,
count(closedleads.id) as closed
FROM Table
left join closedleads on closedleads.leadid = Table.id
group by [filename]
) as [q1]
INNER JOIN (
select count([ssn]) as dups, [filename] from Table
group by [ssn], [filename]
having count([ssn]) > 1
) as [q2] on [q1].[filename] = [q2].[filename]
This works but it showing multiple results for each filename with values of 2-5 instead of summing the total count of possible duplicates
Working Query
Hey everyone, thanks for all the help, this is eventually what I got to that worked exactly as I wanted
select [q1].[filename], [q1].leads, [q1].closed, [q2].dups,
round(([q1].closed / [q1].leads), 3) as percentClosed
FROM (
SELECT [filename], count([filename]) as leads,
count(closedleads.id) as closed
FROM Table
left join closedleads on closedleads.leadid = Table.id
and [filename] is not null
group by [filename]
) as [q1]
INNER JOIN (
select [filename], count(*) - count(distinct [ssn]) as dups
from Table
group by [filename]
) as [q2] on [q1].[filename] = [q2].[filename]

You'll probably want to make use of a HAVING clause somewhere, eg:
LEFT JOIN (
SELECT SSN, COUNT(SSN) - 1 DupeCount FROM Table T1
GROUP BY SSN
HAVING COUNT(SSN) > 1 ) AS PossibleDuplicates
ON table.ssn = PossibleDuplicates.SSN
If you want to include 0 possible duplicates (rather than null) you actually don't need the HAVING clause, just the left join.

Edit - Updated with a better example which matches your question better
Here's an example if I understand correctly.
create table #table (id int,ssn varchar(10))
insert into #table values(1,'10')
insert into #table values(2,'10')
insert into #table values(3,'11')
insert into #table values(4,'12')
insert into #table values(5,'11')
insert into #table values(6,'13')
select sum(cnt)
from (
select count(distinct ssn) as cnt
from #table
group by ssn
having count(*)>1
) dups
You shouldn't need to self join the table if you group by ssn and then pull back only ssn's where you have more then one.

I think the existing answers don't quite understand your question. I think I do but it's not completely specified yet. Is it a duplicate if the same SSN appears in two different files or only within the same file? Because you group by filename, that becomes the grain.
The Output of your query is like
StateFarm1, 500, 50, 10%, <your new value goes here>
AllState2, 100, 90, 90% <your new value goes here>
So if you have the same SSN in those two files, you have 1 duplicate, so on which row do you show 1, on the AllState row or the Statefarm row? If you say both, invariably someone will SUM that column and get a doubling of the results.
Now What if you have a Geico row with the same SSN, is that 1 duplicate or 2? and again which row?
I know this isn't a final answer but these questions do highlight the the question as it stands is unanswerable... you fix this and I'll change the answer,
please no downvotes in the meantime
Addendum
I believe the only thing you are missing is a DISTINCT.
select [q1].[filename], [q1].leads, [q1].closed, [q2].dups
FROM (
SELECT [filename], count([filename]) as leads,
count(closedleads.id) as closed
FROM tbldata
left join closedleads on closedleads.leadid = Table.id
group by [filename]
) as [q1]
INNER JOIN (
select count( DISTINCT [ssn]) as dups, [filename] from Table '<---- here'
group by [ssn], [filename]
having count([ssn]) > 1
) as [q2] on [q1].[filename] = [q2].[filename]

You don't need the outer COUNT - your inner SELECT COUNT(*)... will return you just one number, a count of records with duplicate SSN but different id.

Related

Firebird select from table distinct one field

The question I asked yesterday was simplified but I realize that I have to report the whole story.
I have to extract the data of 4 from 4 different tables into a Firebird 2.5 database and the following query works:
SELECT
PRODUZIONE_T t.CODPRODUZIONE,
PRODUZIONE_T.NUMEROCOMMESSA as numeroco,
ANGCLIENTIFORNITORI.RAGIONESOCIALE1,
PRODUZIONE_T.DATACONSEGNA,
PRODUZIONE_T.REVISIONE,
ANGUTENTI.NOMINATIVO,
ORDINI.T_DATA,
FROM PRODUZIONE_T
LEFT OUTER JOIN ORDINI_T ON PRODUZIONE_T.CODORDINE=ORDINI_T.CODORDINE
INNER JOIN ANGCLIENTIFORNITORI ON ANGCLIENTIFORNITORI.CODCLIFOR=ORDINI_T.CODCLIFOR
LEFT OUTER JOIN ANGUTENTI ON ANGUTENTI.IDUTENTE = PRODUZIONE_T.RESPONSABILEUC
ORDER BY right(numeroco,2) DESC, left(numeroco,3) desc
rows 1 to 500;
However the query returns me double (or more) due to the REVISIONE column.
How do I select only the rows of a single NUMEROCOMMESSA with the maximum REVISIONE value?

This should work:
select COD, ORDER, S.DATE, REVISION
FROM TAB1
JOIN
(
select ORDER, MAX(REVISION) as REVISION
FROM TAB1
Group By ORDER
) m on m.ORDER = TAB1.ORDER and m.REVISION = TAB1.REVISION

Here you go - http://sqlfiddle.com/#!6/ce7cf/4
Sample Data (as u set it in your original question):
create table TAB1 (
cod integer primary key,
n_order varchar(10) not null,
s_date date not null,
revision integer not null );
alter table tab1 add constraint UQ1 unique (n_order,revision);
insert into TAB1 values ( 1, '001/18', '2018-02-01', 0 );
insert into TAB1 values ( 2, '002/18', '2018-01-31', 0 );
insert into TAB1 values ( 3, '002/18', '2018-01-30', 1 );
The query:
select *
from tab1 d
join ( select n_ORDER, MAX(REVISION) as REVISION
FROM TAB1
Group By n_ORDER ) m
on m.n_ORDER = d.n_ORDER and m.REVISION = d.REVISION
Suggestions:
Google and read the classic book: "Understanding SQL" by Martin Gruber
Read Firebird SQL reference: https://www.firebirdsql.org/file/documentation/reference_manuals/fblangref25-en/html/fblangref25.html
Here is yet one more solution using Windowed Functions introduced in Firebird 3 - http://sqlfiddle.com/#!6/ce7cf/13
I do not have Firebird 3 at hand, so can not actually check if there would not be some sudden incompatibility, do it at home :-D
SELECT * FROM
(
SELECT
TAB1.*,
ROW_NUMBER() OVER (
PARTITION BY n_order
ORDER BY revision DESC
) AS rank
FROM TAB1
) d
WHERE rank = 1
Read documentation
https://community.modeanalytics.com/sql/tutorial/sql-window-functions/
https://www.firebirdsql.org/file/documentation/release_notes/html/en/3_0/rnfb30-dml-windowfuncs.html
Which of the three (including Gordon's one) solution would be faster depends upon specific database - the real data, the existing indexes, the selectivity of indexes.
While window functions can make the join-less query, I am not sure it would be faster on real data, as it maybe can just ignore indexes on order+revision cortege and do the full-scan instead, before rank=1 condition applied. While the first solution would most probably use indexes to get maximums without actually reading every row in the table.
The Firebird-support mailing list suggested a way to break out of the loop, to only use a single query: The trick is using both windows functions and CTE (common table expression): http://sqlfiddle.com/#!18/ce7cf/2
WITH TMP AS (
SELECT
*,
MAX(revision) OVER (
PARTITION BY n_order
) as max_REV
FROM TAB1
)
SELECT * FROM TMP
WHERE revision = max_REV

If you want the max revision number in Firebird:
select t.*
from tab1 t
where t.revision = (select max(t2.revision) from tab1 t2 where t2.order = t.order);
For performance, you want an index on tab1(order, revision). With such an index, performance should be competitive with any other approach.

Find entryno set from multiple set of records

I have two SQL temp tables #Temp1 and #Temp2.
I want to get entryno which contain set of temp table two.
For example: #Temp2 has 8 records. I want to search in #Temp1 which contains a set of records from #Temp1.
CREATE TABLE #Temp1 (entryNo INT, setid INT, measurid INT,measurvalueid int)
CREATE TABLE #Temp2(setid INT, measurid INT,measurvalueid int)
INSERT INTO #Temp1 (entryNo,setid,measurid,measurvalueid )
VALUES (1,400001,1,1),
(1,400001,2,110),
(1,400001,3,1001),
(1,400001,4,1100),
(2,400002,5,100),
(2,400002,6,102),
(2,400002,7,1003),
(2,400002,8,10004),
(3,400001,1,1),
(3,400001,2,110),
(3,400001,3,1001),
(3,400001,4,1200)
INSERT INTO #Temp2 (setid,measurid,measurvalueid )
VALUES (400001,1,1),
(400001,2,110),
(400001,3,1001),
(400001,4,1100),
(400002,5,100),
(400002,6,102),
(400002,7,1003),
(400002,8,10004)
I want output
EntryNo
1
2
It contains two sets.
One is:
(400001,1,1),
(400001,2,110),
(400001,3,1001),
(400001,4,1100)
The second is:
(400002,5,100),
(400002,6,102),
(400002,7,1003),
(400002,8,10004)

Try this:
WITH DataSourceInialData AS
(
SELECT *
,COUNT(*) OVER (PARTITION BY [entryNo], [setid]) AS [GroupCount]
FROM #Temp1
), DataSourceFilteringData AS
(
SELECT *
,COUNT(*) OVER (PARTITION BY [setid]) AS [GroupCount]
FROM #Temp2
)
SELECT A.[entryNo]
FROM DataSourceInialData A
INNER JOIN DataSourceFilteringData B
ON A.[setid] = B.[setid]
AND A.[measurid] = B.[measurid]
AND A.[measurvalueid] = B.[measurvalueid]
-- we are interested in groups which are passed completely by the filtering groups
AND A.[GroupCount] = B.[GroupCount]
GROUP BY A.[entryNo]
-- aftering joining the rows, the filtered rows must match the filtering rows
HAVING COUNT(A.[setid]) = MAX(B.[GroupCount]);
The algorithm is simple:
we count how many rows exists per data group
we count how many rows exists per filtering group
we join the initial data and the filtering data
after the join we count how many rows are left in the initial data and if there count is equal to the filtering count for the given group
and the result is:
Note, that I am checking for each match. For example, if in your sample data, there is one more row for entryNo = 1 it won't be included in the result. In order to change this behavior, comment this row:
-- we are interested in groups which are passed completely by the filtering groups
AND A.[GroupCount] = B.[GroupCount]

SQL Left Join first match only

I have a query against a large number of big tables (rows and columns) with a number of joins, however one of tables has some duplicate rows of data causing issues for my query. Since this is a read only realtime feed from another department I can't fix that data, however I am trying to prevent issues in my query from it.
Given that, I need to add this crap data as a left join to my good query. The data set looks like:
IDNo FirstName LastName ...
-------------------------------------------
uqx bob smith
abc john willis
ABC john willis
aBc john willis
WTF jeff bridges
sss bill doe
ere sally abby
wtf jeff bridges
...
(about 2 dozen columns, and 100K rows)
My first instinct was to perform a distinct gave me about 80K rows:
SELECT DISTINCT P.IDNo
FROM people P
But when I try the following, I get all the rows back:
SELECT DISTINCT P.*
FROM people P
OR
SELECT
DISTINCT(P.IDNo) AS IDNoUnq
,P.FirstName
,P.LastName
...etc.
FROM people P
I then thought I would do a FIRST() aggregate function on all the columns, however that feels wrong too. Syntactically am I doing something wrong here?
Update:
Just wanted to note: These records are duplicates based on a non-key / non-indexed field of ID listed above. The ID is a text field which although has the same value, it is a different case than the other data causing the issue.

distinct is not a function. It always operates on all columns of the select list.
Your problem is a typical "greatest N per group" problem which can easily be solved using a window function:
select ...
from (
select IDNo,
FirstName,
LastName,
....,
row_number() over (partition by lower(idno) order by firstname) as rn
from people
) t
where rn = 1;
Using the order by clause you can select which of the duplicates you want to pick.
The above can be used in a left join, see below:
select ...
from x
left join (
select IDNo,
FirstName,
LastName,
....,
row_number() over (partition by lower(idno) order by firstname) as rn
from people
) p on p.idno = x.idno and p.rn = 1
where ...

Add an identity column (PeopleID) and then use a correlated subquery to return the first value for each value.
SELECT *
FROM People p
WHERE PeopleID = (
SELECT MIN(PeopleID)
FROM People
WHERE IDNo = p.IDNo
)

After careful consideration this dillema has a few different solutions:
Aggregate Everything
Use an aggregate on each column to get the biggest or smallest field value. This is what I am doing since it takes 2 partially filled out records and "merges" the data.
http://sqlfiddle.com/#!3/59cde/1
SELECT
UPPER(IDNo) AS user_id
, MAX(FirstName) AS name_first
, MAX(LastName) AS name_last
, MAX(entry) AS row_num
FROM people P
GROUP BY
IDNo
Get First (or Last record)
http://sqlfiddle.com/#!3/59cde/23
-- ------------------------------------------------------
-- Notes
-- entry: Auto-Number primary key some sort of unique PK is required for this method
-- IDNo: Should be primary key in feed, but is not, we are making an upper case version
-- This gets the first entry to get last entry, change MIN() to MAX()
-- ------------------------------------------------------
SELECT
PC.user_id
,PData.FirstName
,PData.LastName
,PData.entry
FROM (
SELECT
P2.user_id
,MIN(P2.entry) AS rownum
FROM (
SELECT
UPPER(P.IDNo) AS user_id
, P.entry
FROM people P
) AS P2
GROUP BY
P2.user_id
) AS PC
LEFT JOIN people PData
ON PData.entry = PC.rownum
ORDER BY
PData.entry

Use Cross Apply or Outer Apply, this way you can limit the amount of data to be joined from the table with the duplicates to the first hit.
Select
x.*,
c.*
from
x
Cross Apply
(
Select
Top (1)
IDNo,
FirstName,
LastName,
....,
from
people As p
where
p.idno = x.idno
Order By
p.idno //unnecessary if you don't need a specific match based on order
) As c
Cross Apply behaves like an inner join, Outer Apply like a left join
SQL Server CROSS APPLY and OUTER APPLY

Turns out I was doing it wrong, I needed to perform a nested select first of just the important columns, and do a distinct select off that to prevent trash columns of 'unique' data from corrupting my good data. The following appears to have resolved the issue... but I will try on the full dataset later.
SELECT DISTINCT P2.*
FROM (
SELECT
IDNo
, FirstName
, LastName
FROM people P
) P2
Here is some play data as requested: http://sqlfiddle.com/#!3/050e0d/3
CREATE TABLE people
(
[entry] int
, [IDNo] varchar(3)
, [FirstName] varchar(5)
, [LastName] varchar(7)
);
INSERT INTO people
(entry,[IDNo], [FirstName], [LastName])
VALUES
(1,'uqx', 'bob', 'smith'),
(2,'abc', 'john', 'willis'),
(3,'ABC', 'john', 'willis'),
(4,'aBc', 'john', 'willis'),
(5,'WTF', 'jeff', 'bridges'),
(6,'Sss', 'bill', 'doe'),
(7,'sSs', 'bill', 'doe'),
(8,'ssS', 'bill', 'doe'),
(9,'ere', 'sally', 'abby'),
(10,'wtf', 'jeff', 'bridges')
;

Try this
SELECT *
FROM people P
where P.IDNo in (SELECT DISTINCT IDNo
FROM people)

Depending on the nature of the duplicate rows, it looks like all you want is to have case-sensitivity on those columns. Setting the collation on these columns should be what you're after:
SELECT DISTINCT p.IDNO COLLATE SQL_Latin1_General_CP1_CI_AS, p.FirstName COLLATE SQL_Latin1_General_CP1_CI_AS, p.LastName COLLATE SQL_Latin1_General_CP1_CI_AS
FROM people P
http://msdn.microsoft.com/en-us/library/ms184391.aspx

SQL Server: row present in one query, missing in another

Ok so I think I must be misunderstanding something about SQL queries. This is a pretty wordy question, so thanks for taking the time to read it (my problem is right at the end, everything else is just context).
I am writing an accounting system that works on the double-entry principal -- money always moves between accounts, a transaction is 2 or more TransactionParts rows decrementing one account and incrementing another.
Some TransactionParts rows may be flagged as tax related so that the system can produce a report of total VAT sales/purchases etc, so it is possible that a single Transaction may have two TransactionParts referencing the same Account -- one VAT related, and the other not. To simplify presentation to the user, I have a view to combine multiple rows for the same account and transaction:
create view Accounting.CondensedEntryView as
select p.[Transaction], p.Account, sum(p.Amount) as Amount
from Accounting.TransactionParts p
group by p.[Transaction], p.Account
I then have a view to calculate the running balance column, as follows:
create view Accounting.TransactionBalanceView as
with cte as
(
select ROW_NUMBER() over (order by t.[Date]) AS RowNumber,
t.ID as [Transaction], p.Amount, p.Account
from Accounting.Transactions t
inner join Accounting.CondensedEntryView p on p.[Transaction]=t.ID
)
select b.RowNumber, b.[Transaction], a.Account,
coalesce(sum(a.Amount), 0) as Balance
from cte a, cte b
where a.RowNumber <= b.RowNumber AND a.Account=b.Account
group by b.RowNumber, b.[Transaction], a.Account
For reasons I haven't yet worked out, a certain transaction (ID=30) doesn't appear on an account statement for the user. I confirmed this by running
select * from Accounting.TransactionBalanceView where [Transaction]=30
This gave me the following result:
RowNumber Transaction Account Balance
-------------------- ----------- ------- ---------------------
72 30 23 143.80
As I said before, there should be at least two TransactionParts for each Transaction, so one of them isn't being presented in my view. I assumed there must be an issue with the way I've written my view, and run a query to see if there's anything else missing:
select [Transaction], count(*)
from Accounting.TransactionBalanceView
group by [Transaction]
having count(*) < 2
This query returns no results -- not even for Transaction 30! Thinking I must be an idiot I run the following query:
select [Transaction]
from Accounting.TransactionBalanceView
where [Transaction]=30
It returns two rows! So select * returns only one row and select [Transaction] returns both. After much head-scratching and re-running the last two queries, I concluded I don't have the faintest idea what's happening. Any ideas?
Thanks a lot if you've stuck with me this far!
Edit:
Here are the execution plans:
select *
select [Transaction]
1000 lines each, hence finding somewhere else to host.
Edit 2:
For completeness, here are the tables I used:
create table Accounting.Accounts
(
ID smallint identity primary key,
[Name] varchar(50) not null
constraint UQ_AccountName unique,
[Type] tinyint not null
constraint FK_AccountType foreign key references Accounting.AccountTypes
);
create table Accounting.Transactions
(
ID int identity primary key,
[Date] date not null default getdate(),
[Description] varchar(50) not null,
Reference varchar(20) not null default '',
Memo varchar(1000) not null
);
create table Accounting.TransactionParts
(
ID int identity primary key,
[Transaction] int not null
constraint FK_TransactionPart foreign key references Accounting.Transactions,
Account smallint not null
constraint FK_TransactionAccount foreign key references Accounting.Accounts,
Amount money not null,
VatRelated bit not null default 0
);

Demonstration of possible explanation.
Create table Script
SELECT *
INTO #T
FROM master.dbo.spt_values
CREATE NONCLUSTERED INDEX [IX_T] ON #T ([name] DESC,[number] DESC);
Query one (Returns 35 results)
WITH cte AS
(
SELECT *, ROW_NUMBER() OVER (ORDER BY NAME) AS rn
FROM #T
)
SELECT c1.number,c1.[type]
FROM cte c1
JOIN cte c2 ON c1.rn=c2.rn AND c1.number <> c2.number
Query Two (Same as before but adding c2.[type] to the select list makes it return 0 results)
;
WITH cte AS
(
SELECT *, ROW_NUMBER() OVER (ORDER BY NAME) AS rn
FROM #T
)
SELECT c1.number,c1.[type] ,c2.[type]
FROM cte c1
JOIN cte c2 ON c1.rn=c2.rn AND c1.number <> c2.number
Why?
row_number() for duplicate NAMEs isn't specified so it just chooses whichever one fits in with the best execution plan for the required output columns. In the second query this is the same for both cte invocations, in the first one it chooses a different access path with resultant different row_numbering.
Suggested Solution
You are self joining the CTE on ROW_NUMBER() over (order by t.[Date])
Contrary to what may have been expected the CTE will likely not be materialised which would have ensured consistency for the self join and thus you assume a correlation between ROW_NUMBER() on both sides that may well not exist for records where a duplicate [Date] exists in the data.
What if you try ROW_NUMBER() over (order by t.[Date], t.[id]) to ensure that in the event of tied dates the row_numbering is in a guaranteed consistent order. (Or some other column/combination of columns that can differentiate records if id won't do it)

If the purpose of this part of the view is just to make sure that the same row isn't joined to itself
where a.RowNumber <= b.RowNumber
then how does changing this part to
where a.RowNumber <> b.RowNumber
affect the results?

It seems you read dirty entries. (Someone else deletes/insertes new data)
try SET TRANSACTION ISOLATION LEVEL READ COMMITTED.
i've tried this code (seems equal to yours)
IF object_id('tempdb..#t') IS NOT NULL DROP TABLE #t
CREATE TABLE #t(i INT, val INT, acc int)
INSERT #t
SELECT 1, 2, 70
UNION ALL SELECT 2, 3, 70
;with cte as
(
select ROW_NUMBER() over (order by t.i) AS RowNumber,
t.val as [Transaction], t.acc Account
from #t t
)
select b.RowNumber, b.[Transaction], a.Account
from cte a, cte b
where a.RowNumber <= b.RowNumber AND a.Account=b.Account
group by b.RowNumber, b.[Transaction], a.Account
and got two rows
RowNumber Transaction Account
1 2 70
2 3 70

Using the distinct function in SQL

I have a SQL query I am running. What I was wanting to know is that is there a way of selecting the rows in a table where the value in on one of those columns is distinct? When I use the distinct function, It returns all of the distinct rows so...
select distinct teacher from class etc.
This works fine, but I am selecting multiple columns, so...
select distinct teacher, student etc.
but I don't want to retrieve the distinct rows, I want the distinct rows where the teacher is distinct. So this query would probably return the same teacher's name multiple times because the student value is different but what I would like is to return rows where the teachers are distinct, even if it means returning the teacher and one student name (because I don't need all the students).
I hope what I am trying to ask is clear but is it possible to use the distinct function on a single column even when selecting multiple columns or is there any other solution to this problem? Thanks.
The above is just an example I am giving. I don't know if using 'distinct' is the solution to my problem. I am not using teacher etc. that was just an example to get the idea accross. I am selecting multiple columns (about 10) from different tables. I have a query to get the tabled result I want. Now I want to query that table to find the unique values in one particular column. So using the teacher example again, say I have wrote a query and I have all the teachers and all the pupils they teach. Now I want to go through each row in this table and email the teacher a message. But I don't want to email the teacher numerous times, just the once, so I want to return all the columns from the table I have, where only the teacher value is distinct.
Col A Col B Col C Col D
a b c d
a c d b
b a a c
b c c c
A query I have produces the above table. Now I want only those rows where Col A values are unique. How would I go about it?

You have misunderstood the DISTINCT keyword. It is not a function and it does not modify a column. You cannot SELECT a, DISTINCT(b), c, DISTINCT(d) FROM SomeTable. DISTINCT is a modifier for the query itself, i.e. you don't select a distinct column, you make a SELECT DISTINCT query.
In other words: DISTINCT tells the server to go through the whole result set and remove all duplicate rows after the query has been performed.
If you need a column to contain every value once, you need to GROUP BY that column. Once you do that, the server now needs to do which student to select with each teacher, if there are multiple, so you need to provide a so-called aggregate function like COUNT(). Example:
SELECT teacher, COUNT(student) AS amountStudents
FROM ...
GROUP BY teacher;

One option is to use a GROUP BY on Col A. Example:
SELECT * FROM table_name
GROUP BY Col A
That should return you:
abcd
baac

Based on the limited details you provided in your question (you should explain how/why your data is in different tables, what DB server you are using, etc) you can approach this from 2 different directions.
Reduce the number of columns in your query to only return the "teacher" and "email" columns but using the existing WHERE criteria. The problem you have with your current attempt is both DISTINCT and GROUP BY don't understand that you one want 1 row for each value of the column that you are trying to be distinct about. From what I understand, MySQL has support for what you are doing using GROUP BY but MSSQL does not support result columns not included in the GROUP BY statement. If you don't need the "student" columns, don't put them in your result set.
Convert your existing query to use column based sub-queries so that you only return a single result for non-grouped data.
Example:
SELECT t1.a
, (SELECT TOP 1 b FROM Table1 t2 WHERE t1.a = t2.a) AS b
, (SELECT TOP 1 c FROM Table1 t2 WHERE t1.a = t2.a) AS c
, (SELECT TOP 1 d FROM Table1 t2 WHERE t1.a = t2.a) AS d
FROM dbo.Table1 t1
WHERE (your criteria here)
GROUP BY t1.a
This query will not be fast if you have a lot of data, but it will return a single row per teacher with a somewhat random value for the remaining columns. You can also add an ORDER BY to each sub-query to further tweak the values returned for the additional columns.

I'm not sure if I am understanding this right but couldn't you do
SELECT * FROM class WHERE teacher IN (SELECT DISTINCT teacher FROM class)
This would return all of the data in each row where the teacher is distinct

distinct requires a unique result-set row. This means that whatever values you select from your table will need to be distinct together as a row from any other row in the result-set.
Using distinct can return the same value more than once from a given field as long as the other corresponding fields in the row are distinct as well.

As soulmerge and Shiraz have mentioned you'll need to use a GROUP BY and subselect. This worked for me.
DECLARE #table TABLE (
[Teacher] [NVarchar](256) NOT NULL ,
[Student] [NVarchar](256) NOT NULL
)
INSERT INTO #table VALUES ('Teacher 1', 'Student 1')
INSERT INTO #table VALUES ('Teacher 1', 'Student 2')
INSERT INTO #table VALUES ('Teacher 2', 'Student 3')
INSERT INTO #table VALUES ('Teacher 2', 'Student 4')
SELECT
T.[Teacher],
(
SELECT TOP 1 T2.[Student]
FROM #table AS T2
WHERE T2.[Teacher] = T.[Teacher]
) AS [Student]
FROM #table AS T
GROUP BY T.[Teacher]
Results
Teacher 1, Student 1
Teacher 2, Student 3

You need to do it with a sub select where you take TOP 1 of student where the teacher is the same.

You may try "GROUP BY teacher" to return what you need.

What is the question your query is trying to answer?
Do you need to know which classes have only one teacher?
select class_name, count(teacher)
from class group by class_name having count(teacher)=1
Or are you looking for teachers with only one student?
select teacher, count(student)
from class group by teacher having count(student)=1
Or is it something else? The question you've posed assumes that using DISTINCT is the correct approach to the query you're trying to construct. It seems likely this is not the case. Could you describe the question you're trying to answer with DISTINCT?

You will need to say how your data is stored in-memory for us to say how you can query it.
But you could do a separate query to just get the distinct teachers.
select distinct teacher from class

I am struggling to understand exactly what you wish to do.. but you can do something like this:
SELECT DISTINCT ColA FROM Table WHERE ...
If you only select a singular column, the distinct will only grab those.
If you could clarify a little more, I could try to help a bit more.

You could use GROUP BY to separate the return values based on a single column value.

All you have to do is select just the columns you want the first one and do a select Distinct
Select Distinct column1 -- where your criteria...

The following might help you get to your solution. The other poster did point to this but his syntax for group by was incorrect.
Get all teachers that teach any classes.
Select teacher_id, count(*)
from teacher_table inner join classes_table
on teacher_table.teacher_id = classes_table.teacher_id
group by teacher_id

Noone seems to understand what you want. I will take another guess.
Select * from tbl
Where ColA in (Select ColA from tbl Group by ColA Having Count(ColA) = 1)
This will return all data from rows where ColA is unique -i.e. there isn't another row with the same ColA value. Of course, that means zero rows from the sample data you provided.

select cola,colb,colc
from yourtable
where cola in
(
select cola from yourtable where your criteria group by cola having count(*) = 1
)

declare #temp as table (colA nchar, colB nchar, colC nchar, colD nchar, rownum int)
insert #temp (colA, colB, colC, colD, rownum)
select Test.ColA, Test.ColB, Test.ColC, Test.ColD, ROW_NUMBER() over (order by ColA) as rownum
from Test
select t1.ColA, ColB, ColC, ColD
from #temp as t1
join (
select ColA, MIN(rownum) [min]
from #temp
group by Cola)
as t2 on t1.Cola = t2.Cola and t1.rownum = t2.[min]
This will return a single row for each value of the colA.

CREATE FUNCTION dbo.DistinctList
(
#List VARCHAR(MAX),
#Delim CHAR
)
RETURNS
VARCHAR(MAX)
AS
BEGIN
DECLARE #ParsedList TABLE
(
Item VARCHAR(MAX)
)
DECLARE #list1 VARCHAR(MAX), #Pos INT, #rList VARCHAR(MAX)
SET #list = LTRIM(RTRIM(#list)) + #Delim
SET #pos = CHARINDEX(#delim, #list, 1)
WHILE #pos > 0
BEGIN
SET #list1 = LTRIM(RTRIM(LEFT(#list, #pos - 1)))
IF #list1 <> ''
INSERT INTO #ParsedList VALUES (CAST(#list1 AS VARCHAR(MAX)))
SET #list = SUBSTRING(#list, #pos+1, LEN(#list))
SET #pos = CHARINDEX(#delim, #list, 1)
END
SELECT #rlist = COALESCE(#rlist+',','') + item
FROM (SELECT DISTINCT Item FROM #ParsedList) t
RETURN #rlist
END
GO

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Help with SQL aggregate query with detecting duplicates - sql

You don't need the outer COUNT - your inner SELECT COUNT(*)... will return you just one number, a count of records with duplicate SSN but different id.

Related

Firebird select from table distinct one field

Find entryno set from multiple set of records

SQL Left Join first match only

SQL Server: row present in one query, missing in another

Using the distinct function in SQL

Categories

Resources