Advanced table deduping

I have a question. I have a table with over 2 billion rows. Many are duplicates, but there is a varchar column that holds a validity date in a format such as 201806.
I want to dedupe the table BUT keep the row with the most current date.
Example:
ID, fname, lname, addrees, city, state, zip, validitydate
1, steve, smith, pob 123, miami, fl, 33081, 201709
2, steve, smith, pob 123, miami, fl, 33081, 201010
3, steve, smith, pob 123, miami, fl, 33081, 201809
4, steve, smith, pob 123, miami, fl, 33081, 201201
I only want to keep: steve, smith, pob 123, miami, fl, 33081, 201809, as it is the most current. If I run the query below, it dedupes, but it's a crapshoot which row is left in the table, because I cannot add validitydate to the partition: T-SQL would then treat every row as unique.
How can I make it dedupe but keep the row with the most current date as the final entry?
Thanks in advance.
WITH Records AS
(
SELECT fname, lname, addrees, city,
ROW_NUMBER() OVER (
PARTITION BY fname, lname, addrees, city, state, zip
ORDER BY ID) AS RecordInstance
FROM PEOPLE WHERE lname LIKE 'S%'
)
DELETE
FROM Records
WHERE
RecordInstance > 1

Order by validitydate (descending) so that RecordInstance is 1 for the most current row. Because the values are fixed-width YYYYMM strings, sorting the varchar column also sorts the dates chronologically:
WITH Records AS (
SELECT fname, lname, addrees, city,
ROW_NUMBER() OVER (
PARTITION BY fname, lname, addrees, city, state, zip
ORDER BY validitydate DESC -- Add this to order correctly!
) AS RecordInstance
FROM PEOPLE WHERE lname LIKE 'S%'
)
DELETE FROM Records WHERE RecordInstance > 1

The delete also works with only the ROW_NUMBER in the CTE, ordered by descending validitydate, so the most recent month gets row number 1 and you can delete the rows with rn > 1. Adding ID DESC as a second sort key makes the choice deterministic when two rows share the same validitydate.
WITH CTE AS
(
SELECT
ROW_NUMBER() OVER (PARTITION BY fname, lname, addrees, city, state, zip ORDER BY validitydate DESC, ID DESC) AS rn
FROM PEOPLE
WHERE lname like 'S%'
)
DELETE
FROM CTE
WHERE rn > 1;
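For anyone who wants to try the ORDER BY validitydate DESC idea without a SQL Server instance, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in. It assumes an SQLite build with window functions (3.25 or later), and the lowercase people table holds the sample rows from the question. SQLite does not let you delete through a CTE the way T-SQL does, so this sketch deletes by id from a ranked subquery instead; treat it as an illustration of the ranking, not as the T-SQL answer itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE people (
        id INTEGER PRIMARY KEY,
        fname TEXT, lname TEXT, addrees TEXT,
        city TEXT, state TEXT, zip TEXT,
        validitydate TEXT  -- 'YYYYMM', so text order matches date order
    )
""")
conn.executemany(
    "INSERT INTO people VALUES (?,?,?,?,?,?,?,?)",
    [
        (1, "steve", "smith", "pob 123", "miami", "fl", "33081", "201709"),
        (2, "steve", "smith", "pob 123", "miami", "fl", "33081", "201010"),
        (3, "steve", "smith", "pob 123", "miami", "fl", "33081", "201809"),
        (4, "steve", "smith", "pob 123", "miami", "fl", "33081", "201201"),
    ],
)

# Rank each duplicate group newest-first, then delete everything but rank 1.
# (Assumption: SQLite workaround, deleting by id instead of through the CTE.)
conn.execute("""
    DELETE FROM people
    WHERE id IN (
        SELECT id FROM (
            SELECT id,
                   ROW_NUMBER() OVER (
                       PARTITION BY fname, lname, addrees, city, state, zip
                       ORDER BY validitydate DESC, id DESC
                   ) AS rn
            FROM people
            WHERE lname LIKE 's%'
        )
        WHERE rn > 1
    )
""")

remaining = conn.execute("SELECT id, validitydate FROM people").fetchall()
print(remaining)
```

On a 2-billion-row table you would probably want to run the real delete in batches, but that is outside what this sketch shows.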

Here is a link to an article I wrote regarding this very issue.
https://sqlfundamentals.wordpress.com/delete-duplicate-rows-in-t-sql/
Hope this helps.

Related

Issue deleting duplicate records in SQL

So I seem to be having a moment and can't figure out why certain dups in a table are not getting deleted. I have a test table called QUERY that has names, addresses, DOBs, phones, etc. I am looking to delete the dups but keep the record where the phone is not empty, preferably the most recent one (the code below doesn't handle the "most recent" part yet). My code below just isn't working; it always gives 0 results. An example of the rows would be:
+----------------------------------------------------------------+
| first,last,DOB,address,city,state,phonenumber,validitydate |
+----------------------------------------------------------------+
| steve,smith,19710922, 123 Here St, Miami, FL,9545551212,201902 |
| steve,smith,19710922, 123 Here St, Miami, FL,,202009 |
| steve,smith,19710922, 123 Here St, Miami, FL,9545551212,201802 |
+----------------------------------------------------------------+
WITH Records AS
(
SELECT lastname, firstname, address, state, dateofbirth, phonenumber
,ROW_NUMBER() OVER (PARTITION BY lastname, firstname, address, state, dateofbirth, phonenumber order by validitydate) AS RecordInstance
,ROW_NUMBER() OVER (PARTITION BY lastname, firstname, address, state, dateofbirth, phonenumber order by
CASE
WHEN phonenumber ='' THEN 0
WHEN phonenumber IS NOT NULL THEN 1
Else 0
END) as [ToInclude]
FROM query
)
delete
FROM records
WHERE
RecordInstance > 1
and ToInclude = 0
Anyone see anything I am doing wrong?? Thanks in advance

I think this is what you want to do. Window-function aliases cannot be referenced in the WHERE clause of the same SELECT, so the ranking goes in a derived table:
SELECT *
FROM (
SELECT q.*
,RANK() OVER (PARTITION BY lastname, firstname, address, state, dateofbirth, phonenumber order by validitydate) AS RecordInstance
,COUNT(*) OVER (PARTITION BY lastname, firstname, address, state, dateofbirth, phonenumber) AS DUPS
FROM QUERY q
) t
WHERE DUPS > 2
AND RecordInstance = 1
AND phonenumber <> '' -- maybe? if you really don't want to delete duplicates with no phone number

If the required result is to delete all the duplicate records where the phone number is not null or empty, you can try the query below:
WITH Records AS
(
SELECT lastname, firstname, address, state, dateofbirth, phonenumber
,ROW_NUMBER() OVER (PARTITION BY lastname, firstname, address, state, dateofbirth, phonenumber order by validitydate) AS RecordInstance
FROM query where phonenumber<>'' and phonenumber is not null
)
delete
FROM records
WHERE
RecordInstance > 1

Try
RecordInstance = ROW_NUMBER() OVER(PARTITION BY LastName, FirstName, Address, City, DateOfBirth ORDER BY PhoneNumber DESC, ValidityDate)
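Two things in the original query explain the 0 rows: ROW_NUMBER() starts at 1, so ToInclude = 0 can never be true, and with phonenumber in the PARTITION BY the empty-phone row lands in its own partition. Here is a rough sketch of the "prefer a row with a phone, then the most recent" ordering, run through Python's sqlite3 module (assuming SQLite 3.25+ for window functions). The contacts table name and the use of SQLite's rowid as the row identifier are my assumptions, since the question's table has no ID column; the CASE expression is one way to push empty phones to the back, not necessarily what the answer above intended.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "contacts" is a hypothetical stand-in for the QUERY test table.
conn.execute("""
    CREATE TABLE contacts (
        firstname TEXT, lastname TEXT, dateofbirth TEXT,
        address TEXT, city TEXT, state TEXT,
        phonenumber TEXT, validitydate TEXT
    )
""")
conn.executemany(
    "INSERT INTO contacts VALUES (?,?,?,?,?,?,?,?)",
    [
        ("steve", "smith", "19710922", "123 Here St", "Miami", "FL", "9545551212", "201902"),
        ("steve", "smith", "19710922", "123 Here St", "Miami", "FL", "", "202009"),
        ("steve", "smith", "19710922", "123 Here St", "Miami", "FL", "9545551212", "201802"),
    ],
)

# phonenumber is left OUT of the partition so all three rows compete in one
# group; rows with a phone sort first, then newest validitydate first.
conn.execute("""
    DELETE FROM contacts
    WHERE rowid IN (
        SELECT rid FROM (
            SELECT rowid AS rid,
                   ROW_NUMBER() OVER (
                       PARTITION BY lastname, firstname, address, city, state, dateofbirth
                       ORDER BY CASE WHEN phonenumber IS NULL OR phonenumber = ''
                                     THEN 1 ELSE 0 END,
                                validitydate DESC
                   ) AS rn
            FROM contacts
        )
        WHERE rn > 1
    )
""")

kept = conn.execute("SELECT phonenumber, validitydate FROM contacts").fetchall()
print(kept)
```

With the sample rows, the newer 202009 row is dropped because it has no phone, and the 201902 row wins over 201802.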

Oracle query: how do I limit the returned records to only those having a count > 1 but show full results?

I need to show all the users who have more than one ID, but not return the users who have only one. I tried GROUP BY with HAVING, but I need to list the IDs and not just count them, so I could not get that to work. I ended up using the code below, but it returns all the records.
select id,fname,lname,ssn,dob,
count(id) over (partition by fname,lname,ssn,dob) as cnt
from TABLE
order by cnt desc;

Use a subquery:
select id, fname, lname, ssn, dob
from (select id, fname, lname, ssn, dob,
count(id) over (partition by fname, lname, ssn, dob) as cnt
from TABLE
) t
where cnt >= 2
order by cnt;
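A small sketch of the same subquery pattern, run against SQLite through Python's sqlite3 module rather than Oracle (assuming SQLite 3.25+ for window functions). The users table and its three sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample data: ids 1 and 2 are the same person, id 3 is unique.
conn.execute(
    "CREATE TABLE users (id INTEGER, fname TEXT, lname TEXT, ssn TEXT, dob TEXT)"
)
conn.executemany(
    "INSERT INTO users VALUES (?,?,?,?,?)",
    [
        (1, "ann", "lee", "111", "1990-01-01"),
        (2, "ann", "lee", "111", "1990-01-01"),
        (3, "bob", "ray", "222", "1985-05-05"),
    ],
)

# The window count is computed in the inner query; the outer query can then
# filter on it, which a WHERE at the same level cannot do.
dup_rows = conn.execute("""
    SELECT id, fname, lname, ssn, dob
    FROM (
        SELECT id, fname, lname, ssn, dob,
               COUNT(id) OVER (PARTITION BY fname, lname, ssn, dob) AS cnt
        FROM users
    ) t
    WHERE cnt >= 2
    ORDER BY id
""").fetchall()
print(dup_rows)
```

Every ID of a duplicated person is listed, and the person with a single ID is filtered out.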

WITH CTE (FNAME, LNAME, TALLY) AS
(
SELECT FNAME, LNAME, COUNT(ID) AS TALLY
FROM TABLE
GROUP BY FNAME, LNAME -- required for the HAVING to count per person
HAVING COUNT(ID) > 1
)
SELECT T.ID, C.FNAME, C.LNAME FROM CTE C
JOIN TABLE T
ON C.FNAME = T.FNAME
AND C.LNAME = T.LNAME

Deleting duplicates in a table based on a criteria only in SQL

Let's say I have a table with columns:
CustomerNumber
Lastname
Firstname
PurchaseDate
...and other columns that do not change anything in the question if they're not shown here.
In this table I could have many rows for the same customer with different purchase dates (I know, poorly designed... I'm only trying to fix an issue for reporting, not really trying to fix the root of the problem).
How, in SQL, can I keep one record per customer with the latest date, and delete the rest? A group by doesn't seem to be working for my case

;with a as
(
select row_number() over (partition by CustomerNumber, Lastname, Firstname order by PurchaseDate desc) rn
from <table>
)
delete from a where rn > 1

This worked for me (on DB2):
DELETE FROM my_table
WHERE (CustomerNumber, Lastname, Firstname, PurchaseDate)
NOT IN (
SELECT CustomerNumber, Lastname, Firstname, MAX(PurchaseDate)
FROM my_table
GROUP BY CustomerNumber, Lastname, FirstName
)

SELECT CustomerNumber, Lastname, Firstname, MAX(PurchaseDate) LatestPurchaseDate
FROM Table
GROUP BY CustomerNumber, Lastname, Firstname
The MAX will select the highest (latest) date and show that date for each unique combination of the GROUP BY columns.
EDIT: I missed that you wanted to delete all records except the one with the latest purchase date:
WITH Keep AS
(
SELECT CustomerNumber, Lastname, Firstname, MAX(PurchaseDate) LatestPurchaseDate
FROM Table
GROUP BY CustomerNumber, Lastname, Firstname
)
DELETE FROM Table
WHERE NOT EXISTS
(
SELECT *
FROM Keep
WHERE Table.CustomerNumber = Keep.CustomerNumber
AND Table.Lastname = Keep.Lastname
AND Table.Firstname = Keep.Firstname
AND Table.PurchaseDate = Keep.LatestPurchaseDate
)
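One difference between the two styles of answer here: a MAX(PurchaseDate) delete keeps every row that ties on the latest date, while a ROW_NUMBER() delete keeps exactly one row per customer. A minimal sketch of that difference, using Python's sqlite3 module (assuming SQLite 3.25+). The purchases table, the sample rows, and the rowid-based delete are my assumptions for illustration, not part of either answer.

```python
import sqlite3

def make_table():
    """Build a fresh in-memory table with a tie on customer 1's latest date."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE purchases "
        "(CustomerNumber INTEGER, Lastname TEXT, Firstname TEXT, PurchaseDate TEXT)"
    )
    conn.executemany(
        "INSERT INTO purchases VALUES (?,?,?,?)",
        [
            (1, "Smith", "Steve", "2020-01-01"),
            (1, "Smith", "Steve", "2021-05-05"),
            (1, "Smith", "Steve", "2021-05-05"),  # same latest date twice
            (2, "Jones", "Ann", "2019-03-03"),
        ],
    )
    return conn

# MAX-based delete: removes rows that do not match the group's latest date.
max_conn = make_table()
max_conn.execute("""
    WITH Keep AS (
        SELECT CustomerNumber, Lastname, Firstname,
               MAX(PurchaseDate) AS LatestPurchaseDate
        FROM purchases
        GROUP BY CustomerNumber, Lastname, Firstname
    )
    DELETE FROM purchases
    WHERE NOT EXISTS (
        SELECT 1 FROM Keep
        WHERE purchases.CustomerNumber = Keep.CustomerNumber
          AND purchases.Lastname = Keep.Lastname
          AND purchases.Firstname = Keep.Firstname
          AND purchases.PurchaseDate = Keep.LatestPurchaseDate
    )
""")
max_left = max_conn.execute("SELECT COUNT(*) FROM purchases").fetchone()[0]

# ROW_NUMBER-based delete: removes everything ranked below the first row.
rn_conn = make_table()
rn_conn.execute("""
    DELETE FROM purchases
    WHERE rowid IN (
        SELECT rid FROM (
            SELECT rowid AS rid,
                   ROW_NUMBER() OVER (
                       PARTITION BY CustomerNumber, Lastname, Firstname
                       ORDER BY PurchaseDate DESC
                   ) AS rn
            FROM purchases
        )
        WHERE rn > 1
    )
""")
rn_left = rn_conn.execute("SELECT COUNT(*) FROM purchases").fetchone()[0]

print(max_left, rn_left)
```

If ties on the latest date are possible in your data and you need exactly one row per customer, the ROW_NUMBER() form is probably the safer choice.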

Get distinct name, address, max(date) while preserving ID

I have the following table structure:
ID | fname | lname | street | date
I'm trying to grab the distinct fname, lname, street and max(date), but also preserve the ID of the matching row. There might be multiple rows with matching fname, lname, street, all with different IDs. It seems like a simple thing, but evidently it has escaped me to this point.
I found some solutions that almost fit this but not quite. My apologies if this has been covered.
Thanks.

Try the following:
;WITH CTE AS
(
SELECT *, ROW_NUMBER() OVER(PARTITION BY fname, lname, street ORDER BY [Date] DESC) RN
FROM yourTable
)
SELECT ID, fname, lname, street, [date]
FROM CTE
WHERE RN = 1

Assuming the row with the max(date) is also the row with the max(ID):
select max(ID), fname, lname, street, max(date)
from tablename
group by fname, lname, street
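The assumption behind the GROUP BY version matters: MAX(ID) and MAX(date) are computed independently, so they can come from different rows. A quick sketch with Python's sqlite3 module (assuming SQLite 3.25+) that compares it with the ROW_NUMBER() version. The two sample rows are invented so that the newest date sits on the lower ID.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE yourTable '
    '(ID INTEGER, fname TEXT, lname TEXT, street TEXT, "date" TEXT)'
)
# Hypothetical data: the most recent date belongs to ID 1, not ID 2.
conn.executemany(
    "INSERT INTO yourTable VALUES (?,?,?,?,?)",
    [
        (1, "ann", "lee", "main st", "2021-06-01"),
        (2, "ann", "lee", "main st", "2020-01-01"),
    ],
)

# Each aggregate picks its own maximum, so ID and date may not match up.
grouped = conn.execute("""
    SELECT MAX(ID), fname, lname, street, MAX("date")
    FROM yourTable
    GROUP BY fname, lname, street
""").fetchall()

# ROW_NUMBER keeps the whole row that carries the latest date.
ranked = conn.execute("""
    SELECT ID, fname, lname, street, "date"
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY fname, lname, street
                   ORDER BY "date" DESC
               ) AS RN
        FROM yourTable
    )
    WHERE RN = 1
""").fetchall()

print(grouped)
print(ranked)
```

Here the GROUP BY query pairs ID 2 with a date that actually belongs to ID 1, while the ranked query returns ID 1 with its own date.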

SQL DISTINCT [Alternative Using]

I have a simple query on Oracle.
SELECT DISTINCT City, Name, Surname FROM Persons
Is there any alternative sql query for the same query without DISTINCT ?

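Have a look at this article.
For example, number the rows within each distinct combination of the three columns and keep only the first one:
select City, Name, Surname
from (
select City, Name, Surname,
row_number() over
(partition by City, Name, Surname
order by City) rownumber
from Persons
) t
where rownumber = 1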

SELECT City, Name, Surname FROM Persons
UNION
SELECT City, Name, Surname FROM Persons

SELECT City, Name, Surname
FROM Persons
GROUP BY City, Name, Surname
