Multi-Pass Duplication Identification with Exclusions

I have a customer table with several hundred thousand records. There are a LOT of duplicates of varying degrees. I am trying to identify duplicate records and label each one with how likely it is to be a duplicate.
My source table has 7 fields and looks like this:
I look for duplicates, and put them into an intermediate table with the level of possibility, table name, and the customer number.
Intermediate Table
CREATE TABLE DataCheck (
id int identity(1,1),
reason varchar(100) DEFAULT NULL,
tableName varchar(100) DEFAULT NULL,
tableID varchar(100) DEFAULT NULL
)
Here is my code to identify and insert:
-- Match on Company, Contact, Address, City, and Phone
-- DUPE
INSERT INTO DataCheck
SELECT 'Duplicate','CUSTOMER',tcd.uid
FROM #tmpCoreData tcd
INNER JOIN
(SELECT
company,
fname,
lname,
add1,
city,
phone1,
COUNT(*) AS count
FROM #tmpCoreData
WHERE company <> ''
GROUP BY company, fname, lname, add1, city, phone1
HAVING COUNT(*) > 1) dl
ON dl.company = tcd.company
ORDER BY tcd.company
In this example, it would insert ids 101, 102
The problem is when I perform the next pass:
-- Match on Company, Address, City, Phone (Diff Contacts)
-- LIKELY DUPE
INSERT INTO DataCheck
SELECT 'Likely Duplicate','CUSTOMER',tcd.uid
FROM #tmpCoreData tcd
INNER JOIN
(SELECT
company,
add1,
city,
phone1,
COUNT(*) AS count
FROM #tmpCoreData
WHERE company <> ''
GROUP BY company, add1, city, phone1
HAVING COUNT(*) > 1) dl
ON dl.company = tcd.company
ORDER BY tcd.company
This pass would then insert, 101, 102 & 103.
The next pass drops the phone so it would insert 101, 102, 103, 104
The next pass would look for company only which would insert all 5.
I now have 14 entries into my intermediate table for 5 records.
How can I add an exclusion so the 2nd pass groups on the same Company, Address, City, Phone but DIFFERENT fname and lname? Then it should only insert 101 and 103.
I considered adding a NOT IN (SELECT tableID FROM DataCheck) to ensure IDs aren't added multiple times, but on the 3rd or 4th pass it may find a duplicate and enter it 700 records after the row it duplicates, so you lose the context of what it's a dupe of.
My output uses:
SELECT
dc.reason,
dc.tableName,
tcd.*
FROM DataCheck dc
INNER JOIN #tmpCoreData tcd
ON tcd.uid = dc.tableID
ORDER BY dc.id
And looks something like this, which is a bit confusing:

I'm going to challenge your perception of your issue, and instead propose that you calculate a simple "confidence score", which will also help you vastly simplify your results table:
WITH FirstCompany AS (SELECT custNo, company, fname, lname, add1, city, phone1
FROM(SELECT custNo, company, fname, lname, add1, city, phone1,
ROW_NUMBER() OVER(PARTITION BY company ORDER BY custNo) AS ordering
FROM CoreData) FC
WHERE ordering = 1)
SELECT RankMapping.description, Duplicate.custNo, Duplicate.company, Duplicate.fname, Duplicate.lname, Duplicate.add1, Duplicate.city, Duplicate.phone1
FROM (SELECT FirstCompany.custNo AS originalCustNo, Duplicate.*,
CASE WHEN FirstCompany.custNo = Duplicate.custNo THEN 1 ELSE 0 END
+ CASE WHEN FirstCompany.fname = Duplicate.fname AND FirstCompany.lname = Duplicate.lname THEN 1 ELSE 0 END
+ CASE WHEN FirstCompany.add1 = Duplicate.add1 AND FirstCompany.city = Duplicate.city THEN 1 ELSE 0 END
+ CASE WHEN FirstCompany.phone1 = Duplicate.phone1 THEN 1 ELSE 0 END
AS ranking
FROM FirstCompany
JOIN CoreData Duplicate
ON Duplicate.custNo >= FirstCompany.custNo
AND Duplicate.company = FirstCompany.company) Duplicate
JOIN (VALUES (4, 'original'),
(3, 'duplicate'),
(2, 'likely dupe'),
(1, 'possible dupe'),
(0, 'not likely dupe')) RankMapping(score, description)
ON RankMapping.score = Duplicate.ranking
ORDER BY Duplicate.originalCustNo, Duplicate.ranking DESC
... which generates results that look like this:
| description | custNo | company | fname | lname | add1 | city | phone1 |
|-----------------|--------|----------|---------|--------|--------------|--------------|------------|
| original | 101 | ACME INC | JOHN | DOE | 123 ACME ST | LOONEY HILLS | 1231234567 |
| duplicate | 102 | ACME INC | JOHN | DOE | 123 ACME ST | LOONEY HILLS | 1231234567 |
| likely dupe | 103 | ACME INC | JANE | SMITH | 123 ACME ST | LOONEY HILLS | 1231234567 |
| possible dupe | 104 | ACME INC | BOB | DOLE | 123 ACME ST | LOONEY HILLS | 4564567890 |
| not likely dupe | 105 | ACME INC | JESSICA | RABBIT | 456 ROGER LN | WARNER | 4564567890 |
This code arbitrarily assumes that the smallest custNo for each company is the "original", and it only compares every other row against that one. It's entirely possible to compare against other rows as well (just unnest the subquery in the CTE, and remove the row number).
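To make the scoring easier to try out, here is a minimal sketch of the same idea run against SQLite through Python's sqlite3 module (window functions need SQLite 3.25 or newer). The table contents are copied from the results table above; the variable names and the Python-side label lookup are assumptions for illustration, not part of the original answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE CoreData (custNo INTEGER PRIMARY KEY, company TEXT, fname TEXT,
                       lname TEXT, add1 TEXT, city TEXT, phone1 TEXT);
INSERT INTO CoreData VALUES
 (101, 'ACME INC', 'JOHN',    'DOE',    '123 ACME ST',  'LOONEY HILLS', '1231234567'),
 (102, 'ACME INC', 'JOHN',    'DOE',    '123 ACME ST',  'LOONEY HILLS', '1231234567'),
 (103, 'ACME INC', 'JANE',    'SMITH',  '123 ACME ST',  'LOONEY HILLS', '1231234567'),
 (104, 'ACME INC', 'BOB',     'DOLE',   '123 ACME ST',  'LOONEY HILLS', '4564567890'),
 (105, 'ACME INC', 'JESSICA', 'RABBIT', '456 ROGER LN', 'WARNER',       '4564567890');
""")

# One point per matching attribute group, compared against the first row per company.
scores = conn.execute("""
WITH FirstCompany AS (
    SELECT * FROM (
        SELECT c.*, ROW_NUMBER() OVER (PARTITION BY company ORDER BY custNo) AS ordering
        FROM CoreData c)
    WHERE ordering = 1)
SELECT d.custNo,
       CASE WHEN f.custNo = d.custNo THEN 1 ELSE 0 END
     + CASE WHEN f.fname = d.fname AND f.lname = d.lname THEN 1 ELSE 0 END
     + CASE WHEN f.add1 = d.add1 AND f.city = d.city THEN 1 ELSE 0 END
     + CASE WHEN f.phone1 = d.phone1 THEN 1 ELSE 0 END AS ranking
FROM FirstCompany f
JOIN CoreData d ON d.company = f.company AND d.custNo >= f.custNo
ORDER BY f.custNo, ranking DESC
""").fetchall()

# Same score-to-label mapping as the VALUES list in the answer.
labels = {4: "original", 3: "duplicate", 2: "likely dupe",
          1: "possible dupe", 0: "not likely dupe"}
labelled = [(cust_no, labels[score]) for cust_no, score in scores]
print(labelled)
```

Because every row gets exactly one score, each customer appears once in the output, which avoids the 14-entries-for-5-records problem from the multi-pass approach.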

Related

SQL Query Find Exact and Near Dupes

I have a SQL table with FirstName, LastName, Add1 and other fields. I am working to get this data cleaned up. There are a few instances of likely dupes -
All 3 columns are the exact same for more than 1 record
The First and Last are the same, only 1 has an address, the other is blank
The First and Last are similar (John | Doe vs John C. | Doe) and the address is the same or one is blank
I'm wanting to generate a query I can provide to the users, so they can check these records out, compare their related records and then delete the one they don't need.
I've been looking at similarity functions, soundex, and such, but it all seems so complicated. Is there an easy way to do this?
Thanks!
Edit:
So here is some sample data:
FirstName | LastName | Add1
John | Doe | 1 Main St
John | Doe |
John A. | Doe |
Jane | Doe | 2 Union Ave
Jane B. | Doe | 2 Union Ave
Alex | Smith | 3 Broad St
Chris | Anderson | 4 South Blvd
Chris | Anderson | 4 South Blvd
I really like Critical Error's query for identifying all different types of dupes. That would give me the above sample data, with the Alex Smith result not included, because there are no dupes for that.
What I want to do is take that result set and identify which are dupes for Jane Doe. She should only have 2 dupes. John Doe has 3, and Chris Anderson has 2. Can I get at that sub-result set?
Edit:
I figured it out! I will be marking Critical Error's answer as the solution, since it totally got me where I needed to go. Here is the solution, in case it might help others. Basically, this is what we are doing.
Selecting the records from the table where there are dupes
Adding a WHERE EXISTS sub-query to look in the same table for exact dupes, where the ID from the main query and sub-query do not match
Adding a WHERE EXISTS sub-query to look in the same table for similar dupes, using a Difference factor between duplicative columns, where the ID from the main query and sub-query do not match
Adding a WHERE EXISTS sub-query to look in the same table for dupes on 2 fields where a 3rd may be null for one of the records, where the ID from the main query and sub-query do not match
Each subquery is connected with an OR, so that any kind of duplicate is found
At the end of each sub-query add a nested requirement that either the main query or sub-query be the ID of the record you are looking to identify duplicates for.
DECLARE @CID AS INT
SET ANSI_NULLS ON
SET NOCOUNT ON;
SET @CID = 12345
BEGIN
SELECT
*
FROM #Customers c
WHERE
-- Exact duplicates.
EXISTS (
SELECT * FROM #Customers x WHERE
x.FirstName = c.FirstName
AND x.LastName = c.LastName
AND x.Add1 = c.Add1
AND x.Id <> c.Id
AND (x.ID = @CID OR c.ID = @CID)
)
-- Match First/Last name are same/similar and the address is same.
OR EXISTS (
SELECT * FROM #Customers x WHERE
DIFFERENCE( x.FirstName, c.FirstName ) = 4
AND DIFFERENCE( x.LastName, c.LastName ) = 4
AND x.Add1 = c.Add1
AND x.Id <> c.Id
AND (x.ID = @CID OR c.ID = @CID)
)
-- Match First/Last name and one address exists.
OR EXISTS (
SELECT * FROM #Customers x WHERE
x.FirstName = c.FirstName
AND x.LastName = c.LastName
AND x.Id <> c.Id
AND (
x.Add1 IS NULL AND c.Add1 IS NOT NULL
OR
x.Add1 IS NOT NULL AND c.Add1 IS NULL
)
AND (x.ID = @CID OR c.ID = @CID)
);
END

Assuming you have a unique id between records, you can give this a try:
DECLARE @Customers table ( FirstName varchar(50), LastName varchar(50), Add1 varchar(50), Id int IDENTITY(1,1) );
INSERT INTO @Customers ( FirstName, LastName, Add1 ) VALUES
( 'John', 'Doe', '123 Anywhere Ln' ),
( 'John', 'Doe', '123 Anywhere Ln' ),
( 'John', 'Doe', NULL ),
( 'John C.', 'Doe', '123 Anywhere Ln' ),
( 'John C.', 'Doe', '15673 SW Liar Dr' );
SELECT
*
FROM @Customers c
WHERE
-- Exact duplicates.
EXISTS (
SELECT * FROM @Customers x WHERE
x.FirstName = c.FirstName
AND x.LastName = c.LastName
AND x.Add1 = c.Add1
AND x.Id <> c.Id
)
-- Match First/Last name are same/similar and the address is same.
OR EXISTS (
SELECT * FROM @Customers x WHERE
DIFFERENCE( x.FirstName, c.FirstName ) = 4
AND DIFFERENCE( x.LastName, c.LastName ) = 4
AND x.Add1 = c.Add1
AND x.Id <> c.Id
)
-- Match First/Last name and one address exists.
OR EXISTS (
SELECT * FROM @Customers x WHERE
x.FirstName = c.FirstName
AND x.LastName = c.LastName
AND x.Id <> c.Id
AND (
x.Add1 IS NULL AND c.Add1 IS NOT NULL
OR
x.Add1 IS NOT NULL AND c.Add1 IS NULL
)
);
Returns
+-----------+----------+-----------------+----+
| FirstName | LastName | Add1 | Id |
+-----------+----------+-----------------+----+
| John | Doe | 123 Anywhere Ln | 1 |
| John | Doe | 123 Anywhere Ln | 2 |
| John | Doe | NULL | 3 |
| John C. | Doe | 123 Anywhere Ln | 4 |
+-----------+----------+-----------------+----+
Initial resultset:
+-----------+----------+------------------+----+
| FirstName | LastName | Add1 | Id |
+-----------+----------+------------------+----+
| John | Doe | 123 Anywhere Ln | 1 |
| John | Doe | 123 Anywhere Ln | 2 |
| John | Doe | NULL | 3 |
| John C. | Doe | 123 Anywhere Ln | 4 |
| John C. | Doe | 15673 SW Liar Dr | 5 |
+-----------+----------+------------------+----+
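For anyone without a SQL Server instance at hand, the two EXISTS clauses that do not rely on DIFFERENCE can be tried in SQLite through Python's sqlite3 module. This is only a sketch under that assumption: SQLite's default build has no DIFFERENCE function, so the "similar name" clause is left out and Id 4, which only matches through DIFFERENCE, is not returned here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customers (Id INTEGER PRIMARY KEY, FirstName TEXT, LastName TEXT, Add1 TEXT);
INSERT INTO Customers (FirstName, LastName, Add1) VALUES
 ('John',    'Doe', '123 Anywhere Ln'),
 ('John',    'Doe', '123 Anywhere Ln'),
 ('John',    'Doe', NULL),
 ('John C.', 'Doe', '123 Anywhere Ln'),
 ('John C.', 'Doe', '15673 SW Liar Dr');
""")

dupe_ids = [row[0] for row in conn.execute("""
SELECT c.Id
FROM Customers c
WHERE
    -- Exact duplicates.
    EXISTS (SELECT 1 FROM Customers x
            WHERE x.FirstName = c.FirstName AND x.LastName = c.LastName
              AND x.Add1 = c.Add1 AND x.Id <> c.Id)
    -- Same first/last name, and exactly one of the two rows has an address.
    OR EXISTS (SELECT 1 FROM Customers x
               WHERE x.FirstName = c.FirstName AND x.LastName = c.LastName
                 AND x.Id <> c.Id
                 AND ((x.Add1 IS NULL AND c.Add1 IS NOT NULL)
                   OR (x.Add1 IS NOT NULL AND c.Add1 IS NULL)))
ORDER BY c.Id
""")]
print(dupe_ids)
```

The explicit parentheses around the two IS NULL combinations do not change the logic (AND already binds tighter than OR), but they make the intent easier to read.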

Questions with SQL - separating a column in two

I don't know how to answer this question, because the first name and last name are in one column and I'm not allowed to change the columns.
"Get the average spending (per customer) of all customers who share a last name with another customer"
I thought to say in sqlite3
SELECT avg_spending
FROM customer
JOIN customer on WHERE name is name;
This is how the table is defined:
CREATE TABLE customer
(
cuid INTEGER,
name STRING,
age INTEGER,
avg_spending REAL,
PRIMARY KEY(cuid)
);
So those values are having the same last name
INSERT INTO customer VALUES (4, "Henk Krom", 65, 24);
INSERT INTO customer VALUES (9, "Bob Krom", 66, 4);

From the sample data you posted I guess the format of the column name is:
FirstName LastName
so you need to extract the LastName and use group by to get the average:
select
substr(name, instr(name, ' ') + 1) lastname,
avg(avg_spending) avg_spending
from customer
group by lastname
having count(*) > 1
The having clause restricts the results to those customer names that have at least 1 other customer name with the same last name.
For the sample data:
cuid | name | age | avg_spending
:--- | :-------- | :-- | :-----------
4 | Henk Krom | 65 | 24
9 | Bob Krom | 66 | 4
5 | Jack Doe | 66 | 4
7 | Jill Doe | 66 | 6
1 | Alice No | 66 | 44
you get results:
lastname | avg_spending
:------- | :-----------
Doe | 5
Krom | 14
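The grouping query can be checked end to end with Python's sqlite3 module. The sketch below loads the sample rows shown above into an in-memory database; the column types (TEXT instead of STRING) and the added ORDER BY are small assumptions made to keep the output deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (cuid INTEGER PRIMARY KEY, name TEXT, age INTEGER, avg_spending REAL);
INSERT INTO customer VALUES
 (4, 'Henk Krom', 65, 24),
 (9, 'Bob Krom',  66, 4),
 (5, 'Jack Doe',  66, 4),
 (7, 'Jill Doe',  66, 6),
 (1, 'Alice No',  66, 44);
""")

# Everything after the first space is treated as the last name.
shared_surnames = conn.execute("""
SELECT substr(name, instr(name, ' ') + 1) AS lastname,
       avg(avg_spending) AS avg_spending
FROM customer
GROUP BY lastname
HAVING count(*) > 1
ORDER BY lastname
""").fetchall()
print(shared_surnames)
```

Alice No is dropped by the HAVING clause because no other customer shares her last name.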

As mentioned in the comments, the crux of this is to find a rule how to reliably extract the surname from the name. Apart from that you merely need an exists clause, because you want to select customers where another customer with the same surname exists.
("Get the average spending (per customer)" simply means get a row from the table, because each row contains exactly one customer and their average spending.)
If all names were in the format first name - blank - last name, that would be:
select *
from customer c
where exists
(
select *
from customer other
where other.cuid <> c.cuid
and substr(other.name, instr(other.name, ' ') + 1) = substr(c.name, instr(c.name, ' ') + 1)
);
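A small sketch of the EXISTS variant, again using Python's sqlite3 module, also shows why the extraction rule matters. The row for 'Anna Maria Krom' is a hypothetical addition that is not in the original sample data: with the "everything after the first space" rule her surname is read as 'Maria Krom', so she is not matched with the other Kroms.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (cuid INTEGER PRIMARY KEY, name TEXT, age INTEGER, avg_spending REAL);
INSERT INTO customer VALUES
 (4,  'Henk Krom', 65, 24),
 (9,  'Bob Krom',  66, 4),
 (5,  'Jack Doe',  66, 4),
 (7,  'Jill Doe',  66, 6),
 (1,  'Alice No',  66, 44),
 (12, 'Anna Maria Krom', 40, 10);  -- hypothetical row with a middle name
""")

# Keep each customer for whom a different customer has the same extracted surname.
matched_ids = [row[0] for row in conn.execute("""
SELECT c.cuid
FROM customer c
WHERE EXISTS (
    SELECT 1
    FROM customer other
    WHERE other.cuid <> c.cuid
      AND substr(other.name, instr(other.name, ' ') + 1)
        = substr(c.name, instr(c.name, ' ') + 1))
ORDER BY c.cuid
""")]
print(matched_ids)
```

Customer 12 is missing from the result even though a human reader would group her with the Kroms, which is the kind of case a more careful surname rule would need to handle.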

You were correct in joining the customer table to itself but you also need to parse out the last name to compare and remove duplicates once a match was found since if nameA equals nameB then nameB has to equal nameA.
with custs AS
(
select distinct
a.name as name_1 ,
b.name as name_2
from customer a
join customer b
on substr(a.name, instr(a.name, ' ') + 1) = substr(b.name, instr(b.name, ' ') + 1)
where a.name like '%Krom%' and a.name <> b.name
)
select * from customer where name in (select name_1 from custs)
union
select * from customer where name in (select name_2 from custs)

Sum Decode statement SQL

I am trying to sum a few Decode statements and column names, but am having difficulties.
currently it is showing as
rank | name | points
----------------------
0 | john | 0
0 | john | 40
1 | john | 30
2 | tom | 22
0 | tom | 0
I expect to have this result:
rank | name | points
----------------------
1 | john | 70
2 | tom | 22
Query:
Select Rank, Name, Code, Points
From
(select
decode(Table.name, 'condition1', Table.value) As Points,
decode(Table.name, 'Condition2', Table.value) As Rank,
Employee.name as Name,
Employee.GA1 as Code
from Table
inner Join Employee
on Empolyee.positionseq = name.positionseq
where Table.name IN ('Condition1', 'Condition2')
);

Select MAX(Rank) As Rank, Name, Code, SUM(Points) As Points
From
(select
decode(Table.name, 'Condition1', Table.value) As Points
,decode(Table.name, 'Condition2', Table.value) As Rank
,Employee.name as Name
,Employee.GA1 as Code
from Table
inner Join Employee
on Employee.positionseq = Table.positionseq
where Table.name IN( 'Condition1', 'Condition2'))
GROUP BY Name, Code;
I added the SUM, MAX (for rank) and GROUP BY statements. Also corrected some misspellings (Empolyee)

I may be misunderstanding your question, but it seems like you are trying to do the following (omitting the inner join for simplicity):
Select MAX(rank), name, SUM(points)
FROM UserRanks
GROUP BY name
Based on your data set above, you should get the following results:
rank name points
1 john 70
2 tom 22
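The aggregation can be verified quickly with Python's sqlite3 module. This sketch assumes the intermediate rows from the question have been stored in a table named UserRanks, as in the query above; the ORDER BY is added only to make the output order deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE UserRanks (rank INTEGER, name TEXT, points INTEGER);
INSERT INTO UserRanks VALUES
 (0, 'john', 0),
 (0, 'john', 40),
 (1, 'john', 30),
 (2, 'tom',  22),
 (0, 'tom',  0);
""")

# Collapse to one row per name: highest rank seen, total points.
totals = conn.execute("""
SELECT MAX(rank), name, SUM(points)
FROM UserRanks
GROUP BY name
ORDER BY name
""").fetchall()
print(totals)
```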

How to get the previous row-text

First of all: I am a SQL beginner and I use SQL Server 2008.
The query as it is now is written as:
SELECT
Transaction.description, Person.name
FROM
Transaction, Person, SystemUser
WHERE
Person.personnumber = SystemUser.personnumber
AND Transaction.art_ID = SystemUser.art_ID
ORDER BY
Transaction.description
where personnumber is PK nvarchar (could look like N0890) where the last numbers of it grows with +1 for every new person.
art_ID (Transaction) is PK smallint, art_ID (SystemUser) is smallint, description is nvarchar.
I want to get the text from the previous row, in the same column, so that I can manipulate the text to be clear and make the result-table look more simple.
Example as it is now:
|Transactions | Persons |
|-------------------|----------|
|Statistic | Ursula |
|Statistic | Peter |
|Statistic | Alan |
|Settlement | Christie |
|Settlement | Tania |
|Deptor department | Jack |
|Economy department | Rickie |
|Economy department | Annie |
|Economy department | Tom |
|Economy department | Seth |
How I want it to be:
|Transactions | Persons |
|-------------------|----------|
|Statistic | Ursula |
| | Peter |
| | Alan |
|Settlement | Christie |
| | Tania |
|Deptor department | Jack |
|Economy department | Rickie |
| | Annie |
| | Tom |
| | Seth |
as in select case when description = description - 1 row then ''
I have searched for examples, but every one of them is based on integers, not varchar/nvarchar, and I keep getting errors when I try to do it with varchars, for example with a CTE, min() and max().
Do you have any ideas of what function I can use or how to set up the select-statement to do as I want?

First use a rank function to identify just one of them:
SELECT Transaction.description, Person.name,
RANK() OVER (PARTITION BY Transaction.description ORDER BY Person.name) As R
FROM Transaction, Person, SystemUser
WHERE Person.personnumber = SystemUser.personnumber
AND Transaction.art_ID = SystemUser.art_ID
ORDER BY Transaction.description, Person.name
Notice the lines you want to see have 1 against them? Use that:
SELECT
CASE WHEN R=1 THEN Subtable.description ELSE '' END description,
Subtable.name
FROM
(
SELECT Transaction.description, Person.name,
RANK() OVER (PARTITION BY Transaction.description ORDER BY Person.name) As R
FROM Transaction, Person, SystemUser
WHERE Person.personnumber = SystemUser.personnumber
AND Transaction.art_ID = SystemUser.art_ID
) Subtable
-- Outside the derived table only its alias is visible, so sort by Subtable's columns.
ORDER BY Subtable.description, Subtable.name
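The rank-then-blank technique can be tried without the three source tables. The sketch below uses Python's sqlite3 module (SQLite 3.25 or newer for window functions) and a single hypothetical table named assignment that stands in for the joined Transaction/Person/SystemUser result; only a few of the sample rows are loaded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical flattened stand-in for the three-table join in the question.
CREATE TABLE assignment (description TEXT, name TEXT);
INSERT INTO assignment VALUES
 ('Statistic',  'Ursula'),
 ('Statistic',  'Peter'),
 ('Statistic',  'Alan'),
 ('Settlement', 'Christie'),
 ('Settlement', 'Tania');
""")

# Show the description only on the first row (rank 1) of each group.
display_rows = conn.execute("""
SELECT CASE WHEN r = 1 THEN sub.description ELSE '' END AS shown_description,
       sub.name
FROM (SELECT description, name,
             RANK() OVER (PARTITION BY description ORDER BY name) AS r
      FROM assignment) sub
ORDER BY sub.description, sub.name
""").fetchall()
for shown, name in display_rows:
    print(f"{shown:<12}| {name}")
```

Note that the rows are sorted by the real description, not the blanked-out one, otherwise all the empty strings would sort together.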

I think the following SQL should work:
CREATE TABLE #TempTable (rownum INT, description NVARCHAR(256), name NVARCHAR(256));
-- Number the rows so that each row can be joined to the one before it.
INSERT INTO #TempTable (rownum, description, name)
SELECT ROW_NUMBER() OVER (ORDER BY Transaction.description, Person.name)
,Transaction.description
,Person.name
FROM Transaction, Person, SystemUser
WHERE Person.personnumber = SystemUser.personnumber
AND Transaction.art_ID = SystemUser.art_ID;
SELECT
CASE
WHEN prev.description = TT.description
THEN ''
ELSE TT.description
END AS description,
TT.name
FROM #TempTable TT
LEFT JOIN #TempTable prev ON prev.rownum = TT.rownum - 1
ORDER BY TT.rownum

Find Min Value and value of a corresponding column for that result

I have a table of user data in my SQL Server database and I am attempting to summarize the data. Basically, I need some min, max, and sum values and to group by some columns
Here is a sample table:
Member ID | Name | DateJoined | DateQuit | PointsEarned | Address
00001 | Leyth | 1/1/2013 | 9/30/2013 | 57 | 123 FirstAddress Way
00002 | James | 2/1/2013 | 7/21/2013 | 34 | 4 street road
00001 | Leyth | 2/1/2013 | 10/15/2013| 32 | 456 LastAddress Way
00003 | Eric | 2/23/2013 | 4/14/2013 | 15 | 5 street road
I'd like the summarized table to show the results like this:
Member ID | Name | DateJoined | DateQuit | PointsEarned | Address
00001 | Leyth | 1/1/2013 | 10/15/2013 | 89 | 123 FirstAddress Way
00002 | James | 2/1/2013 | 7/21/2013 | 34 | 4 street road
00003 | Eric | 2/23/2013 | 4/14/2013 | 15 | 5 street road
Here is my query so far:
Select MemberID, Name, Min(DateJoined), Max(DateQuit), SUM(PointsEarned), Min(Address)
From Table
Group By MemberID
The Min(Address) works this time, it retrieves the address that corresponds to the earliest DateJoined. However, if we swapped the two addresses in the original table, we would retrieve "123 FirstAddress Way" which would not correspond to the 1/1/2013 date joined.

For almost everything you can use a simple GROUP BY, but needing "the address from the row that has the minimum DateJoined" is a little trickier. You can solve it in several ways; one is a correlated subquery that looks up the address for each member:
SELECT
X.*,
(select Address
from #tmp t2
where t2.MemberID = X.memberID and
t2.DateJoined = (select MIN(DateJoined)
from #tmp t3
where t3.memberID = X.MemberID)) AS Address
FROM
(select MemberID,
Name,
MIN(DateJoined) as DateJoined,
MAX(DateQuit) as DateQuit,
SUM(PointsEarned) as PointEarned
from #tmp t1
group by MemberID,Name
) AS X
Another way is to join the aggregated subquery back to the table:
SELECT
X.*,
J.Address
FROM
(select
MemberID,
Name,
MIN(DateJoined) as DateJoined,
MAX(DateQuit) as DateQuit,
SUM(PointsEarned) as PointEarned
from #tmp t1
group by MemberID,Name
) AS X
JOIN #tmp J ON J.MemberID = X.MemberID AND J.DateJoined = X.DateJoined
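To see that the join picks the address by date rather than alphabetically, here is a minimal sketch using Python's sqlite3 module. It assumes a table named members, stores dates as ISO strings so that MIN and MAX sort them correctly, and deliberately uses the swapped addresses from the question's "what if" scenario, so MIN(Address) and the joined address disagree for member 00001.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE members (MemberID TEXT, Name TEXT, DateJoined TEXT, DateQuit TEXT,
                      PointsEarned INTEGER, Address TEXT);
-- Addresses of member 00001 are swapped relative to the question's sample table.
INSERT INTO members VALUES
 ('00001', 'Leyth', '2013-01-01', '2013-09-30', 57, '456 LastAddress Way'),
 ('00002', 'James', '2013-02-01', '2013-07-21', 34, '4 street road'),
 ('00001', 'Leyth', '2013-02-01', '2013-10-15', 32, '123 FirstAddress Way'),
 ('00003', 'Eric',  '2013-02-23', '2013-04-14', 15, '5 street road');
""")

# Aggregate per member, then join back on the earliest DateJoined to fetch its address.
summary = conn.execute("""
SELECT x.MemberID, x.Name, x.DateJoined, x.DateQuit, x.PointsEarned,
       x.MinAddress, j.Address
FROM (SELECT MemberID, Name,
             MIN(DateJoined) AS DateJoined,
             MAX(DateQuit) AS DateQuit,
             SUM(PointsEarned) AS PointsEarned,
             MIN(Address) AS MinAddress
      FROM members
      GROUP BY MemberID, Name) x
JOIN members j ON j.MemberID = x.MemberID AND j.DateJoined = x.DateJoined
ORDER BY x.MemberID
""").fetchall()
for row in summary:
    print(row)
```

If a member had two rows sharing the same earliest DateJoined, this join would return both of them, so a tie-breaker would be needed in that case.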

You could rank your rows according to the date, and select the minimal one:
SELECT t.member_id,
name,
date_joined,
date_quit,
points_earned,
address AS address
FROM (SELECT member_id,
name,
MIN (date_joined) AS date_joined,
MAX (date_quit) AS date_quit,
SUM (points_earned) AS points_earned
FROM my_table
GROUP BY member_id, name) t
JOIN (SELECT member_id,
address,
RANK() OVER (PARTITION BY member_id ORDER BY date_joined) AS rk
FROM my_table) addr ON addr.member_id = t.member_id AND rk = 1

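SELECT st.memberid, st.name, m1.datejoined, m2.datequit, SUM(st.pointsearned) AS pointsearned, a.address
FROM sampletable st
-- Earliest join date per member.
JOIN ( SELECT memberid
, MIN(datejoined) AS datejoined
FROM sampletable
GROUP BY memberid
) m1 ON st.memberid = m1.memberid
-- Latest quit date per member.
JOIN ( SELECT memberid
, MAX(datequit) AS datequit
FROM sampletable
GROUP BY memberid
) m2 ON m1.memberid = m2.memberid
-- Row that carries the address for the earliest join date.
JOIN sampletable a ON a.memberid = m1.memberid AND a.datejoined = m1.datejoined
GROUP BY st.memberid, st.name, m1.datejoined, m2.datequit, a.address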
