Reduce Duplicate Records In Multi-Value Join SQL Query - sql
Backstory:
I have created a bunch of stored procedures that analyze my client's data. I am reviewing a list of vendors and trying to identify possible duplicates. It works pretty well, but each record has 2 possible addresses, and I'm getting duplicate results when matches are found in both addresses. Ideally I'd just need the records to appear in the results once.
Process:
I created a "clean" version of each address where I remove special characters and normalize to USPS standards. This helps me match West v W v W. or PO Box v P.O. Box v P O Box, etc. I then take all of the distinct address values from both address columns ([cleanAddress] and [cleanRemit_Address]) and put them into a master list. I then join that list back to the source table with a HAVING COUNT(*) > 1 to determine which addresses appear more than once. Lastly, I take that final list of addresses that appear more than once and join it to the source data for output.
Problem:
If you view the results near the bottom you'll see that I have 2 sets of dupes that are nearly identical except for some slight differences in the addresses. Both the Address and Remit_Address are essentially the same so it finds a match on BOTH the [cleanAddress] and [cleanRemit_Address] values for "SouthWestern Medical" and "NERO CO" so both sets of dupes appear twice in the list instead of once (see the desired results at the bottom).
I need to match on [cleanAddress] OR [cleanRemit_Address], but I don't know how to make each record appear only once in the results.
SSMS 18
SQL Server 2019
Queries:
--SQL (Address List): Combines all addresses for a master list of all addresses in the table
SELECT * INTO [address_list] FROM (
    SELECT DISTINCT [NewAdd] FROM
    (
        SELECT DISTINCT [cleanAddress] AS [NewAdd]
        FROM [sample_data]
        WHERE
            ( [cleanAddress] IS NOT NULL AND [cleanAddress] <> '' ) AND
            ( [Supplier_No] IS NOT NULL AND [Supplier_No] <> '' )
        GROUP BY [cleanAddress]
        UNION
        SELECT DISTINCT [cleanRemit_Address] AS [NewAdd]
        FROM [sample_data]
        WHERE
            ( [cleanRemit_Address] IS NOT NULL AND [cleanRemit_Address] <> '' ) AND
            ( [Supplier_No] IS NOT NULL AND [Supplier_No] <> '' )
        GROUP BY [cleanRemit_Address]
    ) q1
) q2
ORDER BY
    [NewAdd]
--SQL (Address Dupes): Determines which addresses appear in the data more than once
SELECT * INTO [dupe_addresses] FROM (
    SELECT [NewAdd]
    FROM [address_list] n
    LEFT JOIN [sample_data] pv ON
    (
        ( n.[NewAdd] = pv.[cleanAddress] AND ( [Address] <> '' AND [Address] IS NOT NULL ) ) OR
        ( n.[NewAdd] = pv.[cleanRemit_Address] AND ( [Remit_Address] <> '' AND [Remit_Address] IS NOT NULL ) )
    )
    WHERE
        ( [Supplier_No] IS NOT NULL AND [Supplier_No] <> '' )
    GROUP BY [NewAdd]
    HAVING COUNT(*) > 1
) q1
ORDER BY [NewAdd]
--SQL (Address Query): Outputs the information of the matched addresses
SELECT
    'Address Match' AS [Reason],
    pv.[Supplier_No],
    pv.[Name],
    pv.[Address],
    pv.[City],
    pv.[State],
    pv.[Zip],
    pv.[Country],
    pv.[Remit_Address],
    pv.[Remit_City],
    pv.[Remit_State],
    pv.[Remit_Zip],
    pv.[Remit_Country]
FROM
    [dupe_addresses] n
    LEFT JOIN [sample_data] pv
    ON (
        (n.[NewAdd] = pv.[cleanAddress] AND ( [Address] <> '' AND [Address] IS NOT NULL ) )
        OR
        (n.[NewAdd] = pv.[cleanRemit_Address] AND ( [Remit_Address] <> '' AND [Remit_Address] IS NOT NULL ) )
    )
WHERE
    ( [Supplier_No] IS NOT NULL AND [Supplier_No] <> '' )
Sample Data:
CREATE TABLE [sample_data] (
[Supplier_No] varchar(255),
[Name] varchar(255),
[Address] varchar(255),
[City] varchar(255),
[State] varchar(255),
[Zip] varchar(255),
[Country] varchar(255),
[Remit_Address] varchar(255),
[Remit_City] varchar(255),
[Remit_State] varchar(255),
[Remit_Zip] varchar(255),
[Remit_Country] varchar(255),
[cleanAddress] varchar(255),
[cleanRemit_Address] varchar(255),
CONSTRAINT [suppliers_pk] PRIMARY KEY ([Supplier_No])
)
INSERT INTO [sample_data] VALUES
('1039104','Geez Companies','100 Aero Hudson Rd','Streetsboro','OH','44241','','100 Aero Hudson Road','Streetsboro','OH','44241','USA','100 Aero Hudson Rd','100 Aero Hudson Rd'),
('1218409','SouthWestern Medical','100 West Balor Ave','Osceola','AR','72370','USA','SouthWestern Medical100 W Balor Ave','Osceola','AR','72370','USA','100 W Balor Ave','SouthWestern Medical100 W Balor Ave'),
('1243789','SouthWestern Medical','100 West Balor Ave','Osceola','AR','72370','USA','SouthWestern Medical100 West Balor Ave','Osceola','AR','72370','USA','100 W Balor Ave','SouthWestern Medical100 W Balor Ave'),
('1243636','SIRI SYSTEMS','15 BRAD ROAD','WEXFORD','PA','15090','','','','','','','15 BRAD RD',''),
('1152482','FLEETWOOD MACK','22 WINDSOCK CT','ADDISON','IL','60101','','PO BOX 951','CHICAGO','IL','60694-5124','','22 WINDSOCK CT','PO BOX 951'),
('1224483','Aerospace Junction','211500 Communicate Ave','Mingo Junction','OH','43939','USA','P O Box 99','Mingo Junction','OH','43939','USA','211500 Communicate Ave','PO Box 99'),
('1243397','Squeezy Felt','SCHREIBER DIST','NEW KENSINGTON','PA','15068','','','','','','','SCHREIBER DIST',''),
('1230895','NERO CO','28 North US State Highway 99','Osceola','AR','72370','USA','PO Box 204','Cape Girardeau','MO','63702-2045','USA','28 N US State Hwy 99','PO Box 204'),
('1243782','NERO CO','28 North US State Highway 99','Osceola','AR','72370','USA','PO Box 204','Cape Girardeau','MO','63702-2045','USA','28 N US State Hwy 99','PO Box 204'),
('1135880','RICHARD PRYOR SEMINARS','PO BOX 2194','KANSAS CITY','MO','64121-9468','USA','RICHARD PRYOR SEMINARS P O BOX 2194','KANSAS CITY','MO','64121-9468','USA','PO BOX 2194','RICHARD PRYOR SEMINARS PO BOX 2194'),
('1241328','INFINITY AND BEYOND','P.O. BOX 169','GASTONIA','NC','28053-0269','USA','','','','','','PO BOX 169',''),
('1259522','MILES STONES','PO BOX 169','GASSTONIA','NC','28053-0269','USA','','','','','','PO BOX 169',''),
('1255253','AT&T','PO Box 50221','Carol Stream','IL','60197','USA','','','','','','PO Box 50221',''),
('1135513','AT&T','PO Box 50221','Carol Stream','IL','60197-5080','USA','','','','','','PO Box 50221',''),
('1119161','Machine Co, Inc','3306 N Thorne Blvd','Chattanooga','TN','','','PO BOX 5301','CHATTANOOGA','TN','37406','USA','3306 N Thorne Blvd','PO BOX 5301'),
('1176587','Topsy Turvy','365 Welmington Road','Chicago','IL','60606','USA','','','','','','365 Welmington Rd',''),
('2156671','Topsy Turvvy, Inc.','P.O. Box 55217','Columbus','OH','43081','','365 Welmington Road','Chicago','IL','60606','USA','','365 Welmington Rd')
Current Results:
Reason Supplier_No Name Address City State Zip Country Remit_Address Remit_City Remit_State Remit_Zip Remit_Country
Address Match 1218409 SouthWestern Medical 100 West Balor Ave Osceola AR 72370 USA SouthWestern Medical100 W Balor Ave Osceola AR 72370 USA
Address Match 1243789 SouthWestern Medical 100 West Balor Ave Osceola AR 72370 USA SouthWestern Medical100 West Balor Ave Osceola AR 72370 USA
Address Match 1230895 NERO CO 28 North US State Highway 99 Osceola AR 72370 USA PO Box 204 Cape Girardeau MO 63702-2045 USA
Address Match 1243782 NERO CO 28 North US State Highway 99 Osceola AR 72370 USA PO Box 204 Cape Girardeau MO 63702-2045 USA
Address Match 1176587 Topsy Turvy 365 Welmington Road Chicago IL 60606 USA
Address Match 2156671 Topsy Turvvy, Inc. P.O. Box 55217 Columbus OH 43081 365 Welmington Road Chicago IL 60606 USA
Address Match 1241328 INFINITY AND BEYOND P.O. BOX 169 GASTONIA NC 28053-0269 USA
Address Match 1259522 MILES STONES PO BOX 169 GASSTONIA NC 28053-0269 USA
Address Match 1230895 NERO CO 28 North US State Highway 99 Osceola AR 72370 USA PO Box 204 Cape Girardeau MO 63702-2045 USA
Address Match 1243782 NERO CO 28 North US State Highway 99 Osceola AR 72370 USA PO Box 204 Cape Girardeau MO 63702-2045 USA
Address Match 1255253 AT&T PO Box 50221 Carol Stream IL 60197 USA
Address Match 1135513 AT&T PO Box 50221 Carol Stream IL 60197-5080 USA
Address Match 1218409 SouthWestern Medical 100 West Balor Ave Osceola AR 72370 USA SouthWestern Medical100 W Balor Ave Osceola AR 72370 USA
Address Match 1243789 SouthWestern Medical 100 West Balor Ave Osceola AR 72370 USA SouthWestern Medical100 West Balor Ave Osceola AR 72370 USA
Desired Results:
Reason Supplier_No Name Address City State Zip Country Remit_Address Remit_City Remit_State Remit_Zip Remit_Country
Address Match 1218409 SouthWestern Medical 100 West Balor Ave Osceola AR 72370 USA SouthWestern Medical100 W Balor Ave Osceola AR 72370 USA
Address Match 1243789 SouthWestern Medical 100 West Balor Ave Osceola AR 72370 USA SouthWestern Medical100 West Balor Ave Osceola AR 72370 USA
Address Match 1230895 NERO CO 28 North US State Highway 99 Osceola AR 72370 USA PO Box 204 Cape Girardeau MO 63702-2045 USA
Address Match 1243782 NERO CO 28 North US State Highway 99 Osceola AR 72370 USA PO Box 204 Cape Girardeau MO 63702-2045 USA
Address Match 1176587 Topsy Turvy 365 Welmington Road Chicago IL 60606 USA
Address Match 2156671 Topsy Turvvy, Inc. P.O. Box 55217 Columbus OH 43081 365 Welmington Road Chicago IL 60606 USA
Address Match 1241328 INFINITY AND BEYOND P.O. BOX 169 GASTONIA NC 28053-0269 USA
Address Match 1259522 MILES STONES PO BOX 169 GASSTONIA NC 28053-0269 USA
Address Match 1255253 AT&T PO Box 50221 Carol Stream IL 60197 USA
Address Match 1135513 AT&T PO Box 50221 Carol Stream IL 60197-5080 USA
Just add a row number per supplier to the final result set and keep only the rows where that row number is 1.
Note that the ROW_NUMBER function requires an ORDER BY clause, which determines which of the duplicate rows is kept. Change that to suit your circumstances.
WITH cte AS (
    SELECT
        'Address Match' AS [Reason],
        pv.[Supplier_No],
        pv.[Name],
        pv.[Address],
        pv.[City],
        pv.[State],
        pv.[Zip],
        pv.[Country],
        pv.[Remit_Address],
        pv.[Remit_City],
        pv.[Remit_State],
        pv.[Remit_Zip],
        pv.[Remit_Country],
        ROW_NUMBER() OVER (PARTITION BY pv.[Supplier_No] ORDER BY pv.[Name]) rn
    FROM dupe_addresses n
    LEFT JOIN sample_data pv
        ON (
            (n.[NewAdd] = pv.[cleanAddress] AND ( pv.[Address] <> '' AND pv.[Address] IS NOT NULL ))
            OR (n.[NewAdd] = pv.[cleanRemit_Address] AND ( pv.[Remit_Address] <> '' AND pv.[Remit_Address] IS NOT NULL ))
        )
    WHERE (pv.[Supplier_No] IS NOT NULL AND pv.[Supplier_No] <> '')
)
SELECT *
FROM cte
WHERE rn = 1
ORDER BY Supplier_No, [Name];
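A variation that may be worth considering, sketched here on the assumption that [dupe_addresses] has already been built by the question's second query: drive the query from [sample_data] and test for a match with EXISTS. EXISTS only asks whether at least one matching address exists, so a supplier that matches on both [cleanAddress] and [cleanRemit_Address] is still returned once and no ROW_NUMBER filter is needed. This is not the method above, only an alternative to compare against it.

```sql
-- Sketch: one row per supplier, however many duplicate addresses it matches.
-- Assumes [dupe_addresses] ([NewAdd]) already exists from the question's second query.
SELECT
    'Address Match' AS [Reason],
    pv.[Supplier_No],
    pv.[Name],
    pv.[Address],
    pv.[City],
    pv.[State],
    pv.[Zip],
    pv.[Country],
    pv.[Remit_Address],
    pv.[Remit_City],
    pv.[Remit_State],
    pv.[Remit_Zip],
    pv.[Remit_Country]
FROM [sample_data] pv
WHERE pv.[Supplier_No] <> ''   -- a NULL Supplier_No also fails this test
  AND EXISTS (
        SELECT 1
        FROM [dupe_addresses] n
        WHERE (n.[NewAdd] = pv.[cleanAddress]       AND pv.[Address] <> '')
           OR (n.[NewAdd] = pv.[cleanRemit_Address] AND pv.[Remit_Address] <> '')
      )
ORDER BY pv.[Supplier_No];
```

Ordering by [Supplier_No] will not necessarily keep matched suppliers next to each other; if the pairs need to be grouped, sort on the clean address columns instead.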
Related
SQL Select inner join with multiple value
How do I get all the values in one select statement? SELECT CLIENT.name, CLIENT.province_id, CANADA.name as province_name, CLIENT.city_id, CANADA.name as city_name FROM ((CLIENT INNER JOIN CANADA as ON CLIENT.province_id = CANADA.id) INNER JOIN CANADA as ON CLIENT.city_id = CANADA.id) WHERE CLIENT Province_name & city_name refer to the same column and identify using ID. CANADA: CANADA_id name id parent_id 1 Canada 33 0 2 (Province) Alberta 1100 33 3 (Province) British Columbia 1200 33 4 (city) Banff 1101 1100 5 (city) Calgary 1102 1100 6 (city) Victory 1201 1200 7 (city) Vancouver 1202 1200 I would like to return: name province_id province_name city_id city_name John 1100 Alberta 1102 Calgery
SELECT CLIENT.name, CLIENT.province_id, CANADA_province.name as province_name, CLIENT.city_id, CANADA_city.name as city_name FROM CLIENT INNER JOIN CANADA as CANADA_province ON CLIENT.province_id = CANADA_province.id INNER JOIN CANADA as CANADA_city ON CLIENT.city_id = CANADA_city.id WHERE CLIENT.name IS NOT NULL;
Extract suburb name from address string in Bigquery
I have a table of addresses (property) from which I need to extract just the suburb name. I have another table (suburbs) that contains all of the suburb names. I'm having a problem with the multi-word suburb names, where a match is found on one, and both words. I need it to match with the longest suburb name, eg. an address with "North Bondi" should only match to suburb "North Bondi" and not suburb "Bondi". I've found some examples online that use the MAX function in the join but Bigquery won't let me use that function in the join. Would appreciate if someone could please suggest corrections, or provide guidance on other solutions (eg. sorting the suburb table and retrieving only one result?) Thank you! Table: property address 12 Smith Street Surry Hills NSW 34 Jones Street Bondi NSW 15 Sunny Road North Bondi NSW Table: suburbs suburb state Surry Hills NSW Bondi NSW North Bondi NSW Current code used: Select * from ( SELECT p.address, s.suburb FROM `property` p JOIN `suburbs` s ON INITCAP(p.address) LIKE CONCAT('%', INITCAP(s.suburb),' ', INITCAP(s.state), '%') GROUP BY p.address, s.suburb ) x join `property` p ON p.address = x.address where p.address is not null; Actual result: address suburb 12 Smith Street Surry Hills NSW Surry Hills 34 Jones Street Bondi NSW Bondi 15 Sunny Road North Bondi NSW Bondi 15 Sunny Road North Bondi NSW North Bondi Desired result: address suburb 12 Smith Street Surry Hills NSW Surry Hills 34 Jones Street Bondi NSW Bondi 15 Sunny Road North Bondi NSW North Bondi
Try this: select v1.address, string_agg(v1.addr_part, " ") as suburb, from ( select t.address, addr_part, from property t cross join unnest(split(t.address, " ")) addr_part WITH OFFSET AS ofst where ofst > 2 qualify row_number() over(partition by t.address order by ofst desc) > 1 ) v1 group by v1.address ; But! This approach assumes that the first 3 words in every address do not belong to the suburb name, and that every state is one word.
I've found some examples online that use the MAX function in the join but Bigquery won't let me use that function in the join. Using a window function Instead of a MAX function, Select * from ( SELECT p.address, s.suburb FROM `property` p JOIN `suburbs` s ON INITCAP(p.address) LIKE CONCAT('%', INITCAP(s.suburb),' ', INITCAP(s.state), '%') ) x QUALIFY RANK() OVER (PARTITION BY address ORDER BY LENGTH(suburb) DESC) = 1 Query results: address suburb 12 Smith Street Surry Hills NSW Surry Hills 15 Sunny Road North Bondi NSW North Bondi 34 Jones Street Bondi NSW Bondi
Consider below approach SELECT p.address, STRING_AGG(s.suburb ORDER BY LENGTH(s.suburb) DESC LIMIT 1) suburb FROM `property` p JOIN `suburbs` s ON INITCAP(p.address) LIKE CONCAT('%', INITCAP(s.suburb),' ', INITCAP(s.state), '%') GROUP BY p.address if applied to sample data in your question - output is
SQL query combine rows based on common id
I have a table with the below structure: MID FromCountry FromState FromCity FromAddress FromNumber FromApartment ToCountry ToCity ToAddress ToNumber ToApartment 123 USA Texas Houston Well Street 1 Japan Tokyo 6 ET3 123 Germany Bremen Bremen Nice Street 4 Poland Warsaw 9 ET67 456 France Corsica Corsica Amz Street 3 Italy Milan 8 AEC784 456 UK UK London G Street 2 Portugal Lisbon 1 LP400 The desired outcome is: MID FromCountry FromState FromCity FromAddress FromNumber FromApartment ToCountry ToCity ToAddress ToNumber ToApartment FromCountry1 FromState1 FromCity1 FromAddress1 FromNumber1 FromApartment1 ToCountry1 ToCity1 ToAddress1 ToNumber1 ToApartment1 123 USA Texas Houston Well Street 1 Japan Tokyo 6 ET3 Germany Bremen Bremen Nice Street 4 Poland Warsaw 9 ET67 456 France Corsica Corsica Amz Street 3 Italy Milan 8 AEC784 UK UK London G Street 2 Portugal Lisbon 1 LP400 What I am trying to achieve is to bring multiple rows in 1 table, which have the same MID, under 1 row, regardless if there are columns with empty values. I think that i over complicated the solution to this as I was trying something like this (and of course the outcome is not the desired one): select [MID], STUFF( (select concat('', [FromCountry]) FROM test i where i.[MID] = o.[MID] for xml path ('')),1,1,'') as FromCountry ,stuff ( (select concat('', [FromState]) FROM test i where i.[MID] = o.[MID] for xml path ('')),1,1,'') as FromState ,stuff ( (select concat('', [FromCity]) FROM test i where i.[MID] = o.[MID] for xml path ('')),1,1,'') as FromCity ,stuff ( (select concat('', [FromAddress]) FROM test i where i.[MID] = o.[MID] for xml path ('')),1,1,'') as FromAddress FROM test o group by [MID] ... Is there any way to achieve this?
On the assumption there are no more than 2 rows per MID then you can implement a simple row_number() solution. You need to join one row for each MID to the other, so assign a unique value to each using row_number - there's nothing I can immediately see that indicates which row should be the "second" row - this is assigning row numbers based on the FromCountry - amend as necessary. I'm not reproducing all the columns here but you get the idea, rinse and repeat for each column. with m as ( select *, Row_Number() over(partition by Mid order by FromCountry) seq from t ) select m.Mid, m.fromcountry, m.fromstate, m2.fromcountry FromCountry1, m2.fromstate FromState1 from m join m m2 on m.mid = m2.mid and m2.seq = 2 where m.seq = 1; See example fiddle
SELECT with multiple PRIMARY KEY
I have 3 table: nation (name PRIMARY KEY); city (name PRIMARY KEY, nation REFERENCES nation(name)) overflight (number, city, PRIMARY KEY (number, city)) The overflight table content is something like below: AA11 city1 AA11 city2 BB22 city1 BB22 city3 etc. I need to select only overflight that doesn't have city from a certain nation in the city field. I've tried with: SELECT number FROM overflight JOIN city ON overflight.city = city.name WHERE overflight.city NOT IN ( SELECT name FROM city WHERE nation = some_nation ) GROUP BY number; but it doesn't work because it doesn't list the row of overflight that have city from some_nation but can happen that the same overflight have another row in the table that doesn't have city in some_nation. How can I display only the overflight that doesn't have city in some_nation at all? Hope that I've explained my problem as clear as possible. EDIT This is exact content of overflight table: AZ 7255 Rome AZ 7255 Milan AZ 608 Rome AZ 608 New York AA 1 New York AA 1 Los Angeles BA 2430 New York BA 2430 Los Angeles Suppose that I want to show the overflight that doesn't fly over any city in Italy. I need that the result is like this AA 1 New York AA 1 Los Angeles BB 2430 New York BB 2430 Los Angeles
Join the tables to get the overflight numbers that do have a city from the nation that you want to exclude, and use the NOT IN operator to select all the other overflights: SELECT * FROM overflight WHERE number NOT IN ( SELECT o.number FROM overflight o INNER JOIN city c ON o.city = c.name WHERE c.nation = 'Italy' ) See the demo. Results: number city AA 1 New York AA 1 Los Angeles BA 2430 New York BA 2430 Los Angeles
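The same filter can also be written with NOT EXISTS. The following is a minimal sketch using the table and column names from the question, with 'Italy' as the example nation. Since number is part of the primary key and so cannot be NULL, both forms should return the same rows here; NOT EXISTS is often preferred where the compared column might contain NULLs.

```sql
-- Sketch: overflights that have no row over any city of the excluded nation.
-- Table and column names follow the question; 'Italy' is the example nation.
SELECT o.number, o.city
FROM overflight o
WHERE NOT EXISTS (
    SELECT 1
    FROM overflight o2
    INNER JOIN city c ON c.name = o2.city
    WHERE o2.number = o.number
      AND c.nation = 'Italy'
);
```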
Remove duplicate address values where length of second column is less than the length of the greatest matching address
I'm not sure if I worded the title properly so I apologize. I feel this is best explained by showing my data. Address 1 Address 2 City State AddressInfo# -------------------------------- ------------------ ------------ ----- -------------- 1 Main St #100 Burbville, CA, 99999 1 Main St #100 Burbville CA 1001 1 Main St #100 Burbville, CA, 99999 1 Main St Burbville CA 1001 1 Main St #100 Burbville, CA, 99999 1 Main st Burbville CA 1001 ... 4 Old Ave Ste 401 Southtown, OH, 44444 4 Old Ave Ste 401 Southtown OH 1004 4 Old Ave Ste 401 Southtown, OH, 44444 4 Old Ave Ste 401 Southtown OH 1004 ... 8 New Blvd #800 NewCity, MT, 88888 8 New Blvd #800 NewCity MT 1008 8 New Blvd #800 NewCity, MT, 88888 8 New Blvd NewCity MT 1008 8 New Blvd #800 NewCity, MT, 88888 8 New Blvd NewCity MT 1008 I would like to find a way to remove all records where Address 2 is missing the full street address or simply contains an exact duplicate like AddressInfo# 1004. Expected Output: Address 1 Address 2 City State AddressInfo# -------------------------------- ------------------ ------------ ----- -------------- 1 Main St #100 Burbville, CA, 99999 1 Main St #100 Burbville CA 1001 ... 4 Old Ave Ste 401 Southtown, OH, 44444 4 Old Ave Ste 401 Southtown OH 1004 ... 8 New Blvd #800 NewCity, MT, 88888 8 New Blvd #800 NewCity MT 1008
You could rebuild your data into a new table using select address_1,max(address_2) as address_2, addressinfo from table1 group by address_1,addressinfo http://sqlfiddle.com/#!6/3d22c/2 Edit 1: To select city and state as well you need to include it as a group by expression: select address_1,max(address_2) as address_2, addressinfo, city, state from table1 group by address_1,addressinfo, city, state http://sqlfiddle.com/#!6/4527c/1 Edit 2: The max function does deliver the longest value here as needed. This works if the shorter values are true starts of the longer values. Here is an example of this: http://sqlfiddle.com/#!6/3fba8/1
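The caveat in Edit 2 can be seen in a small sketch. MAX on a character column returns the value that sorts last under the column's collation, not the longest one, so it only picks the fullest address when the shorter variants are prefixes of it. The inline values below are made up for illustration: the first pair follows the question's pattern, the second is a hypothetical counter-case.

```sql
-- Prefix case: the shorter value is a true start of the longer one,
-- so the longer value sorts last and MAX is expected to return it.
SELECT MAX(v) AS address_2
FROM (VALUES ('1 Main St'), ('1 Main St #100')) AS t(v);

-- Hypothetical counter-case: neither value is a prefix of the other.
-- '9 Elm' sorts after '10 Elm Street' because '9' > '1',
-- so MAX returns the shorter value here.
SELECT MAX(v) AS address_2
FROM (VALUES ('9 Elm'), ('10 Elm Street')) AS t(v);
```

When the prefix assumption does not hold, ranking rows by the length of the second address column is the safer choice.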
This may have syntax errors, but it is a valid approach:
-- the table name Table1 is assumed from the other answers
WITH cte AS (
    SELECT address1, address2, city, state,
        ROW_NUMBER() OVER (PARTITION BY [AddressInfo#] ORDER BY LEN(address2) DESC) AS alen
    FROM Table1
)
SELECT * FROM cte WHERE alen = 1
SELECT DISTINCT Address1 , Address2 , [AddressInfo#] , City , State -- + any other fields FROM dbo.Table1 AS t WHERE NOT EXISTS ( SELECT * FROM dbo.Table1 AS x WHERE x.Address1 = t.Address1 -- + any other criteria for "uniqueness" AND LEFT( x.Address2, LEN( t.Address2 ) ) = t.Address2 AND LEN( x.Address2 ) > LEN( t.Address2 ) ); This query will first get all the rows where there is not another row with the same Address1 and an Address2 matching the current value up to the length of the field, but at least one character longer. The DISTINCT is then applied to eliminate exact duplicates. (This assumes no NULL values.) A similar query could use the LIKE operator, but this would need to account for special characters in the data, such as "%", "_", or brackets.
Some form of: UPDATE A SET Address2 = CASE WHEN Address1 = Address2 THEN NULL ELSE CASE WHEN CHARINDEX(',',Address2,CHARINDEX(',',Address2)) = 0 THEN NULL ELSE Address2 END END FROM Address AS A