Issue deleting duplicate records in SQL - sql

I seem to be having a moment and can't figure out why certain dups in a table are not getting deleted. I have a test table called QUERY that has names, addresses, DOBs, phones, etc. I want to delete the dups but keep one record where the phone is not empty, preferably the most recent one (the code below doesn't do that yet). My code below isn't working: it always affects 0 rows. Example rows:
+----------------------------------------------------------------+
| first,last,DOB,address,city,state,phonenumber,validitydate |
+----------------------------------------------------------------+
| steve,smith,19710922, 123 Here St, Miami, FL,9545551212,201902 |
| steve,smith,19710922, 123 Here St, Miami, FL,,202009 |
| steve,smith,19710922, 123 Here St, Miami, FL,9545551212,201802 |
+----------------------------------------------------------------+
WITH Records AS
(
SELECT lastname, firstname, address, state, dateofbirth, phonenumber
,ROW_NUMBER() OVER (PARTITION BY lastname, firstname, address, state, dateofbirth, phonenumber order by validitydate) AS RecordInstance
,ROW_NUMBER() OVER (PARTITION BY lastname, firstname, address, state, dateofbirth, phonenumber order by
CASE
WHEN phonenumber ='' THEN 0
WHEN phonenumber IS NOT NULL THEN 1
Else 0
END) as [ToInclude]
FROM query
)
delete
FROM records
WHERE
RecordInstance > 1
and ToInclude = 0
Anyone see anything I am doing wrong?? Thanks in advance

I think this is what you want to do:
SELECT *
FROM (
SELECT q.*
,RANK() OVER (PARTITION BY lastname, firstname, address, state, dateofbirth, phonenumber order by validitydate) AS RecordInstance
,COUNT(*) OVER (PARTITION BY lastname, firstname, address, state, dateofbirth, phonenumber) AS DUPS
FROM QUERY q
) d -- window-function aliases are not visible in the same query's WHERE, so filter in an outer query
WHERE DUPS > 2
AND RecordInstance = 1
AND phonenumber <> '' -- maybe? if you really don't want to delete duplicates with no phone number

If the goal is to delete duplicates only among records whose phone number is neither NULL nor empty, you can try the query below:
WITH Records AS
(
SELECT lastname, firstname, address, state, dateofbirth, phonenumber
,ROW_NUMBER() OVER (PARTITION BY lastname, firstname, address, state, dateofbirth, phonenumber order by validitydate) AS RecordInstance
FROM query where phonenumber<>'' and phonenumber is not null
)
delete
FROM records
WHERE
RecordInstance > 1

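Try leaving PhoneNumber out of the PARTITION BY and sorting on it instead, so that rows with a phone number come first and, among those, the most recent one gets RecordInstance 1:
RecordInstance = ROW_NUMBER() OVER(PARTITION BY LastName, FirstName, Address, City, DateOfBirth ORDER BY PhoneNumber DESC, ValidityDate DESC)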
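For what it's worth, the DELETE in the question can never match a row: ROW_NUMBER() starts at 1, so ToInclude = 0 is never true, and because phonenumber is in the PARTITION BY, the row with the empty phone sits in a partition of its own. Below is a minimal sketch of the reordered ranking, run against the three sample rows. It uses Python's built-in sqlite3 purely as a stand-in for SQL Server; the table name contacts is hypothetical, and since SQLite cannot delete through a CTE, the delete goes through rowid instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table name; columns follow the question's sample row.
conn.execute(
    "CREATE TABLE contacts (firstname, lastname, dateofbirth, address, "
    "city, state, phonenumber, validitydate)"
)
conn.executemany(
    "INSERT INTO contacts VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        ("steve", "smith", "19710922", "123 Here St", "Miami", "FL", "9545551212", "201902"),
        ("steve", "smith", "19710922", "123 Here St", "Miami", "FL", "", "202009"),
        ("steve", "smith", "19710922", "123 Here St", "Miami", "FL", "9545551212", "201802"),
    ],
)

# Rank rows per person: non-empty phone first, then newest validitydate.
# phonenumber is deliberately NOT part of the PARTITION BY.
conn.execute(
    """
    DELETE FROM contacts
    WHERE rowid IN (
        SELECT rid
        FROM (
            SELECT rowid AS rid,
                   ROW_NUMBER() OVER (
                       PARTITION BY lastname, firstname, address, city, state, dateofbirth
                       ORDER BY CASE WHEN phonenumber IS NULL OR phonenumber = '' THEN 1 ELSE 0 END,
                                validitydate DESC
                   ) AS rn
            FROM contacts
        )
        WHERE rn > 1
    )
    """
)

remaining = conn.execute("SELECT phonenumber, validitydate FROM contacts").fetchall()
print(remaining)
```

With the sample rows, only the 201902 record with a phone number should remain. The explicit CASE is a design choice: it treats NULL and empty phones the same way instead of relying on how they happen to sort.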

Related

Is there a way to equate null values to non null values in a partition in SQL?

I have a database table of people records with columns for UserID, FirstName, LastName, DOB, and Email address. FirstName, LastName, and Email are required values, but DOB can be null if the person didn't give that information, so a few rows could look like this:
FirstName  LastName  DOB         Email                    UserID
John       Doe       1990-01-01  johndoe#gmail.com        1
Jane       Doe       1990-02-01  janedoe#gmail.com        2
John       Doe       NULL        johndoe#gmail.com        3
Paul       Blart     1985-01-01  mallcop#gmail.com        4
Clark      Kent      NULL        ImNotSuperman#gmail.com  5
Paul       Blart     1985-01-01  mallcop#gmail.com        6
I am trying to write a query (part of a bigger program) to identify duplicate people records in the database. The requirements are that FirstName, LastName, and Email must be identical; if both rows have a DOB it must be identical too, but a null DOB should not prevent a row from being labeled a duplicate. So in the above table, the two John Doe rows and the two Paul Blart rows would be selected. I want to do this with a PARTITION BY clause. My initial attempt is:
SELECT COUNT(UserID) OVER (Partition BY FirstName, LastName, DOB, Email) AS Count,
DENSE_RANK() OVER (ORDER BY FirstName, LastName, DOB, Email) AS RANK,
UserID, FirstName, LastName, DOB, Email
FROM People
where COUNT(UserID) OVER (Partition BY FirstName, LastName, DOB, Email) > 1
This correctly selects the Paul Blart rows as duplicates but not the John Doe rows, because one of them has a null value for DOB. Is there any way to make it so those records are properly selected?
This might be more simply expressed with exists:
select t.*
from People t
where exists (
select 1
from People t1
where
t1.UserID <> t.UserID
and t1.firstname = t.firstname
and t1.lastname = t.lastname
and t1.email = t.email
and (t1.dob = t.dob or t1.dob is null or t.dob is null)
)
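You can also do this using window functions:
select t.*
from (select t.*,
count(*) over (partition by firstname, lastname, email, dob) as cnt,
count(*) over (partition by firstname, lastname, email) as cnt_all,
sum(case when dob is null then 1 else 0 end) over (partition by firstname, lastname, email) as cnt_null
from People t
) t
where cnt > 1 or
(dob is not null and cnt_null > 0) or
(dob is null and cnt_all > 1); -- a NULL dob matches any other row with the same name and email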
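As a sanity check, here is a small sketch that runs both ideas (EXISTS and window functions) against the six sample rows and compares the flagged UserIDs. It assumes Python's built-in sqlite3 as a stand-in for the real database. The window version keeps a per-group row count (cnt_all) so that a NULL-DOB row is flagged whenever another row shares its name and email.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE People (UserID INTEGER PRIMARY KEY, FirstName, LastName, DOB, Email)"
)
conn.executemany(
    "INSERT INTO People VALUES (?, ?, ?, ?, ?)",
    [
        (1, "John", "Doe", "1990-01-01", "johndoe#gmail.com"),
        (2, "Jane", "Doe", "1990-02-01", "janedoe#gmail.com"),
        (3, "John", "Doe", None, "johndoe#gmail.com"),
        (4, "Paul", "Blart", "1985-01-01", "mallcop#gmail.com"),
        (5, "Clark", "Kent", None, "ImNotSuperman#gmail.com"),
        (6, "Paul", "Blart", "1985-01-01", "mallcop#gmail.com"),
    ],
)

# EXISTS: another row with the same name and email whose DOB matches or is NULL.
exists_sql = """
    SELECT t.UserID
    FROM People t
    WHERE EXISTS (
        SELECT 1
        FROM People t1
        WHERE t1.UserID <> t.UserID
          AND t1.FirstName = t.FirstName
          AND t1.LastName = t.LastName
          AND t1.Email = t.Email
          AND (t1.DOB = t.DOB OR t1.DOB IS NULL OR t.DOB IS NULL)
    )
    ORDER BY t.UserID
"""

# Window functions: counts per exact match, per name/email group, and of NULL DOBs.
window_sql = """
    SELECT UserID
    FROM (
        SELECT UserID, DOB,
               COUNT(*) OVER (PARTITION BY FirstName, LastName, Email, DOB) AS cnt,
               COUNT(*) OVER (PARTITION BY FirstName, LastName, Email) AS cnt_all,
               SUM(CASE WHEN DOB IS NULL THEN 1 ELSE 0 END)
                   OVER (PARTITION BY FirstName, LastName, Email) AS cnt_null
        FROM People
    )
    WHERE cnt > 1
       OR (DOB IS NOT NULL AND cnt_null > 0)
       OR (DOB IS NULL AND cnt_all > 1)
    ORDER BY UserID
"""

exists_ids = [row[0] for row in conn.execute(exists_sql)]
window_ids = [row[0] for row in conn.execute(window_sql)]
print(exists_ids, window_ids)
```

Both queries should flag the two John Doe rows and the two Paul Blart rows, and leave Jane Doe and Clark Kent alone.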

Single grand total ROLLUP with multiple columns

I am looking to add a single grand total for salaries to my table, which is also based on a selection of multiple columns. The code I'm stuck on is below:
SELECT country, state1, city, street, ID, lastname + ', ' + firstname AS 'Name', SUM(salary) AS 'AnnualSalary'
FROM geography1 JOIN address ON street = streetname JOIN employee ON ID = PID
WHERE termdate IS NULL
GROUP BY country, state1, city, street, gender, lastname, firstname
UNION ALL
SELECT COALESCE(country,'TOTAL'), NULL AS state1, NULL AS city, NULL AS street, NULL AS gender, NULL AS lastname, NULL AS lastname, SUM(salary) AS 'AnnualSalary'
FROM geography1 JOIN address ON street = streetname JOIN employee ON ID = PID
WHERE termdate IS NULL
GROUP BY ROLLUP(country);
The query above executes to include the grand total and additional rows that group by country totals, but the other columns that follow are null. Is there a way to rewrite this so that there is only a single grand total row?
I apologize in advance for being so new to this. I've looked at other questions and this is what I've been able to piece together. Thanks!
You can control the groupings using grouping sets. If you want the groups that you have plus the total for each country and the overall total, use the query below; for a single grand total row only, drop the (country) set:
SELECT country, state1, city, street, ID, lastname + ', ' + firstname AS Name,
SUM(salary) AS AnnualSalary
FROM geography1
JOIN address ON street = streetname
JOIN employee ON ID = PID
WHERE termdate IS NULL
GROUP BY GROUPING SETS ( (country, state1, city, street, ID, gender, lastname, firstname), (country), () );
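To make the role of the empty grouping set () concrete, here is a rough sketch of what it contributes: one extra row aggregated over everything. SQLite (used here through Python's sqlite3 only so the example can run) has no GROUPING SETS, so the sketch spells the same result out with UNION ALL. The staff table and its rows are hypothetical and much simpler than the three-table join above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, flattened stand-in for the geography1/address/employee join.
conn.execute("CREATE TABLE staff (country TEXT, name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO staff VALUES (?, ?, ?)",
    [
        ("US", "Smith, Steve", 50000),
        ("US", "Doe, Jane", 60000),
        ("CA", "Kent, Clark", 70000),
    ],
)

# Detail rows, plus the single row that the () grouping set would add.
rows = conn.execute(
    """
    SELECT country, name, SUM(salary) AS AnnualSalary
    FROM staff
    GROUP BY country, name
    UNION ALL
    SELECT 'TOTAL', NULL, SUM(salary)
    FROM staff
    """
).fetchall()

total_rows = [row for row in rows if row[0] == "TOTAL"]
print(total_rows)
```

Because the second SELECT has no GROUP BY at all, it can only ever produce one row, which is the behaviour the question asks for.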

Advanced table deduping

I have a table that has over 2 billion rows. Many are duplicates, but there is a varchar column that holds a validity date in a format such as 201806.
I want to dedupe the table BUT keep the most current date.
ex.
ID,fname, lname, addrees, city, state, zip, validitydate
1,steve,smith, pob 123, miami, fl, 33081,201709
2,steve,smith, pob 123, miami, fl, 33081,201010
3,steve,smith, pob 123, miami, fl, 33081,201809
4,steve,smith, pob 123, miami, fl, 33081,201201
I only want to keep: steve,smith, pob 123, miami, fl, 33081,201809 as it is the most current. If I run the code below, it dedups, but it's a crapshoot which row is left in the table: I cannot add validitydate to the partition, because T-SQL would then see all of the rows as unique.
How can I make it so it dedups but calculates to keep the most current date as the final entry?
thanks in advance.
WITH Records AS
(
SELECT fname, lname, addrees, city,
ROW_NUMBER() OVER (
PARTITION BY fname, lname, addrees, city, state, zip,
validitydate by ID) AS RecordInstance
FROM PEOPLE where lastname like 'S%'
)
DELETE
FROM Records
WHERE
RecordInstance > 1
Order by month (descending) so the RecordInstance will be 1 for the most current one:
WITH Records AS (
SELECT fname, lname, addrees, city,
ROW_NUMBER() OVER (
PARTITION BY fname, lname, addrees, city, state, zip
ORDER BY validitydate DESC -- Add this to order correctly!
) AS RecordInstance
FROM PEOPLE where lname like 'S%'
)
DELETE FROM Records WHERE RecordInstance > 1
The delete also works with just the ROW_NUMBER in the CTE, ordered by descending validitydate (with ID as a tie-breaker), so that the most recent month gets row number 1 and you can delete the rows with rn > 1:
WITH CTE AS
(
SELECT
ROW_NUMBER() OVER (PARTITION BY fname, lname, addrees, city, state, zip ORDER BY validitydate DESC, ID DESC) AS rn
FROM PEOPLE
WHERE lname like 'S%'
)
DELETE
FROM CTE
WHERE rn > 1;
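Before running a delete like this on a table of that size, it may be worth previewing which row would survive. The sketch below does that for the four sample rows by selecting rn = 1 instead of deleting rn > 1. It assumes every validitydate is a well-formed six-digit YYYYMM string, in which case plain string ordering matches chronological ordering, and it uses Python's built-in sqlite3 only as a stand-in for SQL Server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE PEOPLE (ID INTEGER PRIMARY KEY, fname, lname, addrees, "
    "city, state, zip, validitydate TEXT)"
)
conn.executemany(
    "INSERT INTO PEOPLE VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        (1, "steve", "smith", "pob 123", "miami", "fl", "33081", "201709"),
        (2, "steve", "smith", "pob 123", "miami", "fl", "33081", "201010"),
        (3, "steve", "smith", "pob 123", "miami", "fl", "33081", "201809"),
        (4, "steve", "smith", "pob 123", "miami", "fl", "33081", "201201"),
    ],
)

# Same ranking as the delete, but keep rn = 1 to see the survivor per group.
kept = conn.execute(
    """
    SELECT ID, validitydate
    FROM (
        SELECT ID, validitydate,
               ROW_NUMBER() OVER (
                   PARTITION BY fname, lname, addrees, city, state, zip
                   ORDER BY validitydate DESC, ID DESC
               ) AS rn
        FROM PEOPLE
        WHERE lname LIKE 'S%'
    )
    WHERE rn = 1
    """
).fetchall()
print(kept)
```

For the sample data the survivor should be ID 3 with validitydate 201809. Note that SQLite's LIKE ignores ASCII case by default, whereas in SQL Server that depends on the column's collation.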
Here is a link to an article I wrote regarding this very issue.
https://sqlfundamentals.wordpress.com/delete-duplicate-rows-in-t-sql/
Hope this helps.

Deleting duplicates in a table based on a criteria only in SQL

Let's say I have a table with columns:
CustomerNumber
Lastname
Firstname
PurchaseDate
...and other columns that do not change anything in the question if they're not shown here.
In this table I could have many rows for the same customer with different purchase dates (I know, poorly designed... I'm only trying to fix an issue for reporting, not really trying to fix the root of the problem).
How, in SQL, can I keep one record per customer with the latest date, and delete the rest? A GROUP BY doesn't seem to work for my case.
;with a as
(
select row_number() over (partition by CustomerNumber, Lastname, Firstname order by PurchaseDate desc) rn
from <table>
)
delete from a where rn > 1
This worked for me (on DB2):
DELETE FROM my_table
WHERE (CustomerNumber, Lastname, Firstname, PurchaseDate)
NOT IN (
SELECT CustomerNumber, Lastname, Firstname, MAX(PurchaseDate)
FROM my_table
GROUP BY CustomerNumber, Lastname, FirstName
)
SELECT CustomerNumber, Lastname, Firstname, MAX(PurchaseDate) LatestPurchaseDate
FROM Table
GROUP BY CustomerNumber, Lastname, Firstname
The MAX will select the highest (latest) date and show that date for each unique combination of the GROUP BY columns.
EDIT: I misunderstood that you wanted to delete records for all but the latest purchase date.
WITH Keep AS
(
SELECT CustomerNumber, Lastname, Firstname, MAX(PurchaseDate) LatestPurchaseDate
FROM Table
GROUP BY CustomerNumber, Lastname, Firstname
)
DELETE FROM Table
WHERE NOT EXISTS
(
SELECT *
FROM Keep
WHERE Table.CustomerNumber = Keep.CustomerNumber
AND Table.Lastname = Keep.Lastname
AND Table.Firstname = Keep.Firstname
AND Table.PurchaseDate = Keep.LatestPurchaseDate
)
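One difference between the MAX-based deletes and the ROW_NUMBER one shows up when a customer has two rows with the same latest PurchaseDate: comparing against MAX(PurchaseDate) keeps every tied row, while ROW_NUMBER keeps exactly one per customer. A small sketch of that, assuming Python's built-in sqlite3 (a version with row-value and window-function support) as a stand-in for the real database; the sample rows are made up, and rowid replaces the delete-through-CTE form that SQLite lacks.

```python
import sqlite3

ROWS = [
    (1, "Smith", "Steve", "2020-01-01"),
    (1, "Smith", "Steve", "2020-03-01"),
    (1, "Smith", "Steve", "2020-03-01"),  # tie on the latest date
    (2, "Doe", "Jane", "2019-05-05"),
]


def fresh_table():
    """Build a new in-memory copy of the hypothetical sample table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE my_table (CustomerNumber, Lastname, Firstname, PurchaseDate)"
    )
    conn.executemany("INSERT INTO my_table VALUES (?, ?, ?, ?)", ROWS)
    return conn


# MAX-based delete: every row carrying the latest date survives.
conn_max = fresh_table()
conn_max.execute(
    """
    DELETE FROM my_table
    WHERE (CustomerNumber, Lastname, Firstname, PurchaseDate) NOT IN (
        SELECT CustomerNumber, Lastname, Firstname, MAX(PurchaseDate)
        FROM my_table
        GROUP BY CustomerNumber, Lastname, Firstname
    )
    """
)
left_by_max = conn_max.execute("SELECT COUNT(*) FROM my_table").fetchone()[0]

# ROW_NUMBER-based delete: exactly one row per customer survives.
conn_rn = fresh_table()
conn_rn.execute(
    """
    DELETE FROM my_table
    WHERE rowid IN (
        SELECT rid
        FROM (
            SELECT rowid AS rid,
                   ROW_NUMBER() OVER (
                       PARTITION BY CustomerNumber, Lastname, Firstname
                       ORDER BY PurchaseDate DESC
                   ) AS rn
            FROM my_table
        )
        WHERE rn > 1
    )
    """
)
left_by_rn = conn_rn.execute("SELECT COUNT(*) FROM my_table").fetchone()[0]
print(left_by_max, left_by_rn)
```

If strictly one row per customer is required, the ROW_NUMBER form is the safer choice; which of the tied rows it keeps is arbitrary unless the ORDER BY includes a tie-breaker.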

Get distinct name, address, max(date) while preserving ID

I have the following table structure:
ID | fname | lname | street | date
I'm trying to grab the distinct fname, lname, street and max(date) but also preserve the ID of the matching row. There might be multiple rows with matching fname, lname, street, all with different IDs. It seems like a simple thing, but evidently it has escaped me to this point.
I found some solutions that almost fit this but not quite. My apologies if this has been covered.
Thanks.
Try the following:
;WITH CTE AS
(
SELECT *, ROW_NUMBER() OVER(PARTITION BY fname, lname, street ORDER BY [Date] DESC) RN
FROM yourTable
)
SELECT ID, fname, lname, street, [date]
FROM CTE
WHERE RN = 1
Assuming the row with max(date) is also the row with max(ID):
select max(ID), fname, lname, street, max(date)
from tablename
group by fname, lname, street
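That assumption matters: max(ID) and max(date) are computed independently, so when the newest date sits on a lower ID the GROUP BY version returns an ID that belongs to a different row. A short sketch of the difference, using Python's built-in sqlite3 as a stand-in and two made-up rows in which the older ID carries the newer date:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE yourTable (ID INTEGER PRIMARY KEY, fname, lname, street, "date")'
)
conn.executemany(
    "INSERT INTO yourTable VALUES (?, ?, ?, ?, ?)",
    [
        (1, "steve", "smith", "123 Here St", "2020-05-01"),  # newest date, lower ID
        (2, "steve", "smith", "123 Here St", "2019-01-01"),
    ],
)

# GROUP BY: the two aggregates come from different rows.
grouped = conn.execute(
    """
    SELECT MAX(ID), fname, lname, street, MAX("date")
    FROM yourTable
    GROUP BY fname, lname, street
    """
).fetchall()

# ROW_NUMBER: ID and date are taken from the same row.
ranked = conn.execute(
    """
    SELECT ID, fname, lname, street, "date"
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY fname, lname, street
                   ORDER BY "date" DESC
               ) AS rn
        FROM yourTable
    )
    WHERE rn = 1
    """
).fetchall()
print(grouped, ranked)
```

Here the GROUP BY result pairs ID 2 with the 2020-05-01 date, while the ROW_NUMBER result keeps ID 1, the row that actually has that date.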