SQL Server Duplicate Checking - sql

What is the best way to determine duplicate records in a SQL Server table?
For instance, I want to find the last duplicate email received in a table (table has primary key, receiveddate and email fields).
Sample data:
1 01/01/2008 stuff#stuff.com
2 02/01/2008 stuff#stuff.com
3 01/12/2008 noone#stuff.com

something like this
select email ,max(receiveddate) as MaxDate
from YourTable
group by email
having count(email) > 1

Try something like:
SELECT * FROM (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY ReceivedDate, Email ORDER BY ReceivedDate, Email DESC) AS RowNumber
FROM EmailTable
) a
WHERE RowNumber = 1
See http://www.technicaloverload.com/working-with-duplicates-in-sql-server/

Couldn't you join the list on the e-mail field and then see what nulls you get in your result?
Or better yet, count the instances of each e-mail address? And only return the ones with count > 1
Or even take the email and id fields. And return the entries where the e-mail is the same, and the IDs are different. (To avoid duplicates don't use != but rather either < or >.)

SELECT [id], [receivedate], [email]
FROM [mytable]
WHERE [email] IN ( SELECT [email]
FROM [myTable]
GROUP BY [email]
HAVING COUNT([email]) > 1 )

Do you want a list of the last items? If so you could use:
SELECT [info] FROM [table] t WHERE NOT EXISTS (SELECT * FROM [table] tCheck WHERE t.date > tCheck.date)
If you want a list of all duplicate email address use GROUP BY to collect similar data, then a HAVING clause to make sure the quantity is more than 1:
SELECT [info] FROM [table] GROUP BY [email] HAVING Count(*) > 1 DESC
If you want the last duplicate e-mail (a single result) you simply add a "TOP 1" and "ORDER BY":
SELECT TOP 1 [info] FROM [table] GROUP BY [email] HAVING Count(*) > 1 ORDER BY Date DESC

If you have surrogate key, it is relatively easy to use the group by syntax mentioned in SQLMenance's post. Essentially, group by all the fields that make two or more rows 'the same'.
Example pseudo-code to delete duplicate records.
Create table people (ID(PK), Name, Address, DOB)
Delete from people where id not in (
Select min(ID) from people group by name, address, dob
)

Try this
select * from table a, table b
where a.email = b.email

Related

SQL SELECT Full Row with Duplicated Data in One Column

I am using Microsoft SQL Server 2014.
I am able to list emails which are duplicated.
But I am unable to list the entire row, which contain other fields such as EmployeeId, Username, FirstName, LastName, etc.
SELECT Email,
COUNT(Email) AS NumOccurrences
FROM EmployeeProfile
GROUP BY Email
HAVING ( COUNT(Email) > 1 )
May I know how can I list all field in the rows that contains Email appearing more than once in the table?
Thank you.
Try this:
WITH DataSource AS
(
SELECT *
,COUNT(*) OVER (PARTITION BY email) count_calc
FROM EmployeeProfile
)
SELECT *
FROM DataSource
WHERE count_calc > 1
select distinct * from EmployeeProfile where email in (SELECT
Email
FROM EmployeeProfile
GROUP BY Email
HAVING COUNT(*) > 1 )
SQL Fiddle
with cte as (
select *
, count(1) over (partition by email) noDuplicates
from Demo
)
select *
from cte
where noDuplicates > 1
order by Email, EmployeeId
Explanation:
I've used a common table expression (cte) here; but you could equally use a subquery; it makes no difference.
This cte/subquery fetches every row, and includes a new field called noDuplicates which says how many records have that same email address (including the record itself; so noDuplicates=1 actually means there are no duplicates; whilst noDuplicates=2 means the record itself and 1 duplicate, or 2 records with this email address). This field is calculated using an aggregate function over a window. You can read up on window functions here: https://learn.microsoft.com/en-us/sql/t-sql/queries/select-over-clause-transact-sql?view=sql-server-2017
In out outer query we're then selecting only those records with noDuplicates greater than 1; i.e. where there are multiple records with the same mail address.
Finally I've sorted by Email and EmployeeId; so that duplicates are listed alongside one another, and are presented in the sequence in which they were (presumably) created; just to make whoever's then dealing with these results life easy.
If EmployeeId is unique, then you can EXISTS :
SELECT ep.*
FROM EmployeeProfile ep
WHERE EXISTS (SELECT 1
FROM EmployeeProfile ep1
WHERE ep1.Email = ep.Email AND ep1.EmployeeId <> ep.EmployeeId
);

using group by operators in sql

i have two columns - email id and customer id, where an email id can be associated with multiple customer ids. Now, I need to list only those email ids (along with their corresponding customer ids) which are having a count of more than 1 customer id. I tried using grouping sets, rollup and cube operators, however, am not getting the desired result.
Any help or pointers would be appreciated.
SELECT emailid
FROM
( SELECT emailid, count(custid)
FROM table
Group by emailid
Having count(custid) > 1
)
I think this will get you what you want, if I am understanding you question correctly
select emailid, customerid from tablename where emailid in
(
select emailid from tablename group by emailid having count(emailid) > 1
)
Sounds like you would need to use HAVING
e.g
SELECT email_id, COUNT(customer_id)
From sometable
GROUP BY email_id
HAVING COUNT(customer_id) > 1
HAVING allows you to filter following the grouping of a particular column.
WITH email_ids AS (
SELECT email_id, COUNT(customer_id) customer_count
FROM Table
GROUP BY email_id
HAVING count(customer_id) > 1
)
SELECT t.email_id, t.customer_id
FROM Table t
INNER JOIN email_ids ei
ON ei.email_id = t.email_id
If you need a comma separated list of all of their customer id's returned with the single email id, you could use GROUP_CONCAT for that.
This would find all email_id's with at least 1 customer_id, and give you a comma separated list of all customer_ids for that email_id:
SELECT email_id, GROUP_CONCAT(customer_id)
FROM your_table
GROUP BY email_id
HAVING count(customer_id) > 1;
Assuming email_id #1 was assigned to customer_ids 1, 2, & 3, your output would look like:
email_id | customer_id
1 | 1,2,3
I didn't realize you were using MS SQL, there's a thread here about simulating GROUP_CONCAT in MS SQL: Simulating group_concat MySQL function in Microsoft SQL Server 2005?
SELECT t1.email, t1.customer
FROM table t1
INNER JOIN (
SELECT email, COUNT(customer)
FROM table
GROUP BY email
HAVING COUNT(customer)>1
) t2 on t1.email = t2.email
This should get you what your looking for.
Basically, as other ppl have stated, you can filter group by results with HAVING. But since you want the customerids afterwards, join the entire select back to your original table to get your results. Could probably be done prettier but this is easy to understand.
SELECT
email_id,
STUFF((SELECT ',' + CONVERT(VARCHAR,customer_id) FROM cust_email_table T1 WHERE T1.email_id = T2.email_id
FOR
XML PATH('')
),1,1,'') AS customer_ids
FROM
cust_email_table T2
GROUP BY email_id
HAVING COUNT(*) > 1
this would give you a single row per email id and comma seperated list of customer id's.

Select a NON-DISTINCT column in a query that return distincts rows

The following query returns the results that I need but I have to add the ID of the row to then update it. If I add the ID directly in the select statement it will return me more results then I need because each ID is unique so the DISTINCT statement see the line as unique.
SELECT DISTINCT ucpse.MemberID, ucpse.ProductID, ucpse.UserID
FROM UserCustomerProductSalaryExceptions as ucpse
WHERE EXISTS (SELECT NULL
FROM UserCustomerProductSalaryExceptions as upcse2
WHERE ucpse.userid = upcse2.userid AND ucpse.MemberID = upcse2.MemberID AND ucpse.ProductID = upcse2.ProductID
GROUP BY upcse2.UserID, upcse2.memberid, upcse2.productid
HAVING COUNT(UserID) >= 2
)
So basically I need to add ucpse.ID in the Select statement while keeping DISTINCT values for MemberID,ProductID and UserID.
Any Ideas ?
Thank you
According to you comment:
If the data has been duplicated 67 times for a given employee with a given product and a given client, I need to keep only one of thoses records. It's not important which one, so this is why I use DISTINC to obtain unique combinaison of given employee with a given product and a given client.
You can use MIN() or MAX() and GROUP BY instead of DISTINCT
SELECT MAX(ucpse.ID) AS ID, ucpse.MemberID, ucpse.ProductID, ucpse.UserID
FROM UserCustomerProductSalaryExceptions as ucpse
WHERE EXISTS (SELECT NULL
FROM UserCustomerProductSalaryExceptions as upcse2
WHERE ucpse.userid = upcse2.userid AND ucpse.MemberID = upcse2.MemberID AND ucpse.ProductID = upcse2.ProductID
GROUP BY upcse2.UserID, upcse2.memberid, upcse2.productid
HAVING COUNT(UserID) >= 2
)
GROUP BY ucpse.MemberID, ucpse.ProductID, ucpse.UserID
UPDATE:
From you comments I think the below query is what you need
DELETE FROM UserCustomerProductSalaryExceptions
WHERE ID NOT IN ( SELECT MAX(ucpse.ID) AS ID
FROM #UserCustomerProductSalaryExceptions
GROUP BY ucpse.MemberID, ucpse.ProductID, ucpse.UserID
HAVING COUNT(ucpse.ID) >= 2
)
If all you want is to delete the duplicates, this will do it:
WITH X AS
(SELECT ID,
ROW_NUMBER() OVER (PARTITION BY MemberID, ProductID, UserID ORDER BY ID) AS DupRowNum<br
FROM UserCustomerProductSalaryExceptions
)
DELETE X WHERE DupRowNum > 1
ID's not necessary - try:
UPDATE uu SET
<your settings here>
FROM UserCustomerProductSalaryExceptions uu
JOIN ( <paste your entire query above here>
) uc ON uc.MemberID=uu.MemberId AND uc.ProductID=uu.ProductId AND uc.UserID=uu.UserId
From the sound of your data structure (which I would STRONGLY advise normalizing as soon as possible), it sounds like you should be updating all the records. It sounds as if each duplicate is important because it contains some information about an employee's relation to a customer or product.
I would probably update all the records. Try this:
UPDATE UCPSE
SET
--Do your updates here
FROM UserCustomerProductSalaryExceptions as ucpse
JOIN
(
SELECT UserID, MemberID, ProductID
FROM UserCustomerProductSalaryExceptions
GROUP BY UserID, MemberID, ProductID
HAVING COUNT(UserID) >= 2
) T
ON ucpse.UserID = T.UserID AND ucpse.MemberID = T.MemberID AND ucpse.ProductID = T.ProductID

Find duplicated rows that are not exactly same

Can i select all rows that have same column value (for example SSN field) but display them all separably. ?
I've searched for this answer but they all have "count(*) and group by" section that demands the rows to be exactly same.
Try This:
SELECT A, B FROM MyTable
WHERE A IN
(
SELECT A FROM MyTable GROUP BY A HAVING COUNT(*)>1
)
I have done with SQL server. But hope this is what you need
Here is another approach, which only references the table once, using an analytic function instead of a subquery to get the duplicate counts It might be faster; it also might not, depending on the particular data.
SELECT * FROM (
SELECT col1, col2, col3, ssn, COUNT(*) OVER (PARTITION BY ssn) ssn_dup_count
)
WHERE ssn_dup_count > 1
ORDER BY ssn_dup_count DESC
SELECT
*
FROM
MyTable
WHERE
EXISTS
(
SELECT
NULL
FROM
MyTable MT
WHERE
MyTable.SameColumnName = MT.SameColumnName
AND MyTable.DifferentColumnName <> MT.DifferentColumnName)
This will fetch the required data and show them in order so that we can see the grouped data together.
SELECT * FROM TABLENAME
WHERE SSN IN
(
SELECT SSN FROM TABLENAMEGROUP BY SSN HAVING COUNT(SSN)>1
)
ORDER BY SSN
Here SSN is the column names fro which similar value check is done.

sql query - filtering duplicate values to create report

I am trying to list all the duplicate records in a table. This table does not have a Primary Key and has been specifically created only for creating a report to list out duplicates. It comprises of both unique and duplicate values.
The query I have so far is:
SELECT [OfficeCD]
,[NewID]
,[Year]
,[Type]
FROM [Test].[dbo].[Duplicates]
GROUP BY [OfficeCD]
,[NewID]
,[Year]
,[Type]
HAVING COUNT(*) > 1
This works right and gives me all the duplicates - that is the number of times it occurs.
But I want to display all the values in my report of all the columns. How can I do that without querying for each record separately?
For example:
Each table has 10 fields and [NewID] is the field which is occuring multiple times.I need to create a report with all the data in all the fields where newID has been duplicated.
Please help.
Thank you.
You need a subquery:
SELECT * FROM yourtable
WHERE NewID IN (
SELECT NewID FROM yourtable
GROUP BY OfficeCD,NewID,Year,Type
HAVING Count(*)>1
)
Additionally you might want to check your tags: You tagged mysql, but the Syntax lets me think you mean sql-server
Try this:
SELECT * FROM [Duplicates] WHERE NewID IN
(
SELECT [NewID] FROM [Duplicates] GROUP BY [NewID] HAVING COUNT(*) > 1
)
select d.*
from Duplicates d
inner join (
select NewID
from Duplicates
group by NewID
having COUNT(*) > 1
) dd on d.NewID = dd.NewID