Extract suburb name from address string in Bigquery - sql

I have a table of addresses (property) from which I need to extract just the suburb name. I have another table (suburbs) that contains all of the suburb names.
I'm having a problem with multi-word suburb names: the join matches both the full name and the shorter name contained in it. I need it to match only the longest suburb name, e.g. an address with "North Bondi" should match only the suburb "North Bondi" and not the suburb "Bondi".
I've found some examples online that use the MAX function in the join, but BigQuery won't let me use that function in the join.
I would appreciate it if someone could suggest corrections or provide guidance on other solutions (e.g. sorting the suburb table and retrieving only one result?). Thank you!
Table: property
address
12 Smith Street Surry Hills NSW
34 Jones Street Bondi NSW
15 Sunny Road North Bondi NSW
Table: suburbs
suburb       state
Surry Hills  NSW
Bondi        NSW
North Bondi  NSW
Current code used:
Select * from ( SELECT p.address, s.suburb
FROM `property` p
JOIN `suburbs` s
ON INITCAP(p.address) LIKE CONCAT('%', INITCAP(s.suburb),' ', INITCAP(s.state), '%')
GROUP BY p.address, s.suburb
) x
join `property` p
ON p.address = x.address
where p.address is not null;
Actual result:
address                          suburb
12 Smith Street Surry Hills NSW  Surry Hills
34 Jones Street Bondi NSW        Bondi
15 Sunny Road North Bondi NSW    Bondi
15 Sunny Road North Bondi NSW    North Bondi
Desired result:
address                          suburb
12 Smith Street Surry Hills NSW  Surry Hills
34 Jones Street Bondi NSW        Bondi
15 Sunny Road North Bondi NSW    North Bondi

Try this:
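select v1.address,
-- order by the word offset so the suburb words are joined in their original order
string_agg(v1.addr_part, " " order by v1.ofst) as suburb,
from (
select t.address,
addr_part,
ofst,
from property t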
cross join unnest(split(t.address, " ")) addr_part WITH OFFSET AS ofst
where ofst > 2
qualify row_number() over(partition by t.address order by ofst desc) > 1
) v1
group by v1.address
;
But note that this approach assumes:
The first 3 words in every address do not belong to the suburb name;
Every state is one word.
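To make those two assumptions concrete, here is a rough Python equivalent of the positional logic (drop the first three words and the last one). The function name and the third test address are made up for illustration; they are not from the question.

```python
def suburb_by_position(address: str) -> str:
    """Mimic the query: skip word offsets 0-2 and drop the final word (the state)."""
    words = address.split(" ")
    return " ".join(words[3:-1])


# Works when the street part is exactly three words long.
print(suburb_by_position("15 Sunny Road North Bondi NSW"))

# Hypothetical address with a longer street name: the extra street words
# leak into the result, which is the first assumption being violated.
print(suburb_by_position("7 Old South Head Road Bondi NSW"))
```

If your addresses can have street names of varying length, the join against the suburbs table is likely the safer route.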

I've found some examples online that use the MAX function in the join but Bigquery won't let me use that function in the join.
You can use a window function instead of the MAX function:
Select * from ( SELECT p.address, s.suburb
FROM `property` p
JOIN `suburbs` s
ON INITCAP(p.address) LIKE CONCAT('%', INITCAP(s.suburb),' ', INITCAP(s.state), '%')
) x
QUALIFY RANK() OVER (PARTITION BY address ORDER BY LENGTH(suburb) DESC) = 1
Query results:
address                          suburb
12 Smith Street Surry Hills NSW  Surry Hills
15 Sunny Road North Bondi NSW    North Bondi
34 Jones Street Bondi NSW        Bondi
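If you want to try the longest-match idea without a BigQuery project, a small local sketch in Python with the standard-library sqlite3 module might look like the following. SQLite is only a stand-in here: it has no QUALIFY or INITCAP, so the sketch filters on ROW_NUMBER() in an outer query and relies on LIKE being case-insensitive for ASCII text. Window functions need SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE property (address TEXT);
CREATE TABLE suburbs (suburb TEXT, state TEXT);
INSERT INTO property VALUES
  ('12 Smith Street Surry Hills NSW'),
  ('34 Jones Street Bondi NSW'),
  ('15 Sunny Road North Bondi NSW');
INSERT INTO suburbs VALUES
  ('Surry Hills', 'NSW'), ('Bondi', 'NSW'), ('North Bondi', 'NSW');
""")

# Rank every matching suburb per address by name length and keep the longest.
query = """
SELECT address, suburb
FROM (
  SELECT p.address, s.suburb,
         ROW_NUMBER() OVER (
           PARTITION BY p.address
           ORDER BY LENGTH(s.suburb) DESC
         ) AS rn
  FROM property p
  JOIN suburbs s
    ON p.address LIKE '%' || s.suburb || ' ' || s.state || '%'
) AS ranked
WHERE rn = 1
ORDER BY address
"""
longest_match = conn.execute(query).fetchall()
for row in longest_match:
    print(row)
```

ROW_NUMBER() always keeps exactly one row per address, whereas RANK() would keep several if two matching suburbs had the same length.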

Consider below approach
SELECT p.address,
STRING_AGG(s.suburb ORDER BY LENGTH(s.suburb) DESC LIMIT 1) suburb
FROM `property` p
JOIN `suburbs` s
ON INITCAP(p.address) LIKE CONCAT('%', INITCAP(s.suburb),' ', INITCAP(s.state), '%')
GROUP BY p.address
if applied to sample data in your question - output is


Unique table from multiple ones having same and different columns (SQL)

I have multiple datasets having different rows and fields.
dataset1
Customer_ID  Date      Category  Address     City        School
4154124      1/2/2021  A         balboa st.  Canterbury  Middleton
2145124      1/2/2012  A         somewhere   world       St. Augustine
1621573      1/2/2012  A         my_street   somewhere   St. Augustine
dataset2
Customer_ID  Date        Category  Country  Zipcode
14123        12/12/2020  B         UK       EW
416412       14/12/2020  B         ES
dataset3
Customer_ID  Date        Category  School     University
4124123      07/12/2020  C         Middleton  Oxford
I would like a final dataset which includes all the columns (keeping only one copy of the common ones):
Customer_ID  Date        Category  Address     City        School         Country  Zipcode  University
4154124      1/2/2021    A         balboa st.  Canterbury  Middleton
2145124      1/2/2012    A         somewhere   world       St. Augustine
1621573      1/2/2012    A         my_street   somewhere   St. Augustine
14123        12/12/2020  B                                                UK       EW
416412       14/12/2020  B                                                ES
4124123      07/12/2020  C                                 Middleton                        Oxford
Would a left join be the best way to get the expected output? How can I keep Customer_ID, Date and Category, and any duplicated column (e.g., School), only once?
You can achieve this using UNION ALL.
SELECT Customer_ID, Date, Category, Address, City, School, '' AS Country, '' AS ZipCode, '' AS university FROM dataset1
UNION ALL
SELECT Customer_ID, Date, Category, '', '', '', Country, Zipcode, '' FROM dataset2
UNION ALL
SELECT Customer_ID, Date, Category, '', '', School, '', '', University FROM dataset3
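One variation you may want to consider is padding the missing columns with NULL rather than '', so that an absent value stays distinguishable from an empty string. The sketch below tries this in Python with sqlite3; the tables are trimmed to one row each and the columns are left untyped, both of which are simplifications for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dataset1 (Customer_ID, Date, Category, Address, City, School);
CREATE TABLE dataset2 (Customer_ID, Date, Category, Country, Zipcode);
CREATE TABLE dataset3 (Customer_ID, Date, Category, School, University);
INSERT INTO dataset1 VALUES
  ('4154124', '1/2/2021', 'A', 'balboa st.', 'Canterbury', 'Middleton');
INSERT INTO dataset2 VALUES ('14123', '12/12/2020', 'B', 'UK', 'EW');
INSERT INTO dataset3 VALUES ('4124123', '07/12/2020', 'C', 'Middleton', 'Oxford');
""")

# Every branch lists the same nine columns in the same order;
# columns a dataset does not have are filled with NULL.
cursor = conn.execute("""
SELECT Customer_ID, Date, Category, Address, City, School,
       NULL AS Country, NULL AS Zipcode, NULL AS University
FROM dataset1
UNION ALL
SELECT Customer_ID, Date, Category, NULL, NULL, NULL, Country, Zipcode, NULL
FROM dataset2
UNION ALL
SELECT Customer_ID, Date, Category, NULL, NULL, School, NULL, NULL, University
FROM dataset3
""")
columns = [d[0] for d in cursor.description]
rows = cursor.fetchall()
print(columns)
for row in rows:
    print(row)
```

The column names of the combined result come from the first SELECT, which is why only that branch needs aliases.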

Reduce Duplicate Records In Multi-Value Join SQL Query

Backstory:
I have created a bunch of stored procedures that analyze my client's data. I am reviewing a list of vendors and trying to identify possible duplicates. It works pretty well, but each record has 2 possible addresses, and I'm getting duplicate results when matches are found in both addresses. Ideally I'd just need the records to appear in the results once.
Process:
I created a "clean" version of the address where I remove special characters and normalize to USPS standards. This helps me match West v W v W. or PO Box v P.O. Box v P O Box etc. I then take all of the distinct address values from both addresses ([cleanAddress] and [cleanRemit_Address]) and put into a master list. I then compare to the source table with a HAVING COUNT(*) > 1 to determine which addresses appear more than once. Lastly I take that final list of addresses that appear more than once and combine it with the source data for output.
Problem:
If you view the results near the bottom you'll see that I have 2 sets of dupes that are nearly identical except for some slight differences in the addresses. Both the Address and Remit_Address are essentially the same so it finds a match on BOTH the [cleanAddress] and [cleanRemit_Address] values for "SouthWestern Medical" and "NERO CO" so both sets of dupes appear twice in the list instead of once (see the desired results at the bottom).
I need to match [cleanAddress] OR [cleanRemit_Address] but I don't know how to limit each record appearing once in the results.
SSMS 18
SQL Server 2019
Queries:
--SQL (Address List): Combines all addresses for a master list of all addresses in the table
SELECT * INTO [address_list] FROM (
SELECT DISTINCT [NewAdd] FROM
(
SELECT DISTINCT [cleanAddress] AS [NewAdd]
FROM [sample_data]
WHERE
( [cleanAddress] IS NOT NULL AND [cleanAddress] <> '' ) AND
( [Supplier_No] IS NOT NULL AND [Supplier_No] <> '' )
GROUP BY [cleanAddress]
UNION
SELECT DISTINCT [cleanRemit_Address] AS [NewAdd]
FROM [sample_data]
WHERE
( [cleanRemit_Address] IS NOT NULL AND [cleanRemit_Address] <> '' ) AND
( [Supplier_No] IS NOT NULL AND [Supplier_No] <> '' )
GROUP BY [cleanRemit_Address]
) q1
) q2
ORDER BY
[NewAdd]
--SQL (Address Dupes): Determines which addresses appear in the data more than once
SELECT * INTO [dupe_addresses] FROM (
SELECT [NewAdd]
FROM [address_list] n
LEFT JOIN [sample_data] pv ON
(
( n.[NewAdd] = pv.[cleanAddress] AND ( [Address] <> '' AND [Address] IS NOT NULL ) ) OR
( n.[NewAdd] = pv.[cleanRemit_Address] AND ( [Remit_Address] <> '' AND [Remit_Address] IS NOT NULL ) )
)
WHERE
( [Supplier_No] IS NOT NULL AND [Supplier_No] <> '' )
GROUP BY [NewAdd]
HAVING COUNT(*) > 1
) q1
ORDER BY [NewAdd]
--SQL (Address Query): Outputs the information of the matched addresses
SELECT
'Address Match' AS [Reason],
pv.[Supplier_No],
pv.[Name],
pv.[Address],
pv.[City],
pv.[State],
pv.[Zip],
pv.[Country],
pv.[Remit_Address],
pv.[Remit_City],
pv.[Remit_State],
pv.[Remit_Zip],
pv.[Remit_Country]
FROM
[dupe_addresses] n
LEFT JOIN [sample_data] pv
ON (
(n.[NewAdd] = pv.[cleanAddress] AND ( [Address] <> '' AND [Address] IS NOT NULL ) )
OR
(n.[NewAdd] = pv.[cleanRemit_Address] AND ( [Remit_Address] <> '' AND [Remit_Address] IS NOT NULL ) )
)
WHERE
( [Supplier_No] IS NOT NULL AND [Supplier_No] <> '' )
Sample Data:
CREATE TABLE [sample_data] (
[Supplier_No] varchar(255),
[Name] varchar(255),
[Address] varchar(255),
[City] varchar(255),
[State] varchar(255),
[Zip] varchar(255),
[Country] varchar(255),
[Remit_Address] varchar(255),
[Remit_City] varchar(255),
[Remit_State] varchar(255),
[Remit_Zip] varchar(255),
[Remit_Country] varchar(255),
[cleanAddress] varchar(255),
[cleanRemit_Address] varchar(255),
CONSTRAINT [suppliers_pk] PRIMARY KEY ([Supplier_No])
)
INSERT INTO [sample_data] VALUES
('1039104','Geez Companies','100 Aero Hudson Rd','Streetsboro','OH','44241','','100 Aero Hudson Road','Streetsboro','OH','44241','USA','100 Aero Hudson Rd','100 Aero Hudson Rd'),
('1218409','SouthWestern Medical','100 West Balor Ave','Osceola','AR','72370','USA','SouthWestern Medical100 W Balor Ave','Osceola','AR','72370','USA','100 W Balor Ave','SouthWestern Medical100 W Balor Ave'),
('1243789','SouthWestern Medical','100 West Balor Ave','Osceola','AR','72370','USA','SouthWestern Medical100 West Balor Ave','Osceola','AR','72370','USA','100 W Balor Ave','SouthWestern Medical100 W Balor Ave'),
('1243636','SIRI SYSTEMS','15 BRAD ROAD','WEXFORD','PA','15090','','','','','','','15 BRAD RD',''),
('1152482','FLEETWOOD MACK','22 WINDSOCK CT','ADDISON','IL','60101','','PO BOX 951','CHICAGO','IL','60694-5124','','22 WINDSOCK CT','PO BOX 951'),
('1224483','Aerospace Junction','211500 Communicate Ave','Mingo Junction','OH','43939','USA','P O Box 99','Mingo Junction','OH','43939','USA','211500 Communicate Ave','PO Box 99'),
('1243397','Squeezy Felt','SCHREIBER DIST','NEW KENSINGTON','PA','15068','','','','','','','SCHREIBER DIST',''),
('1230895','NERO CO','28 North US State Highway 99','Osceola','AR','72370','USA','PO Box 204','Cape Girardeau','MO','63702-2045','USA','28 N US State Hwy 99','PO Box 204'),
('1243782','NERO CO','28 North US State Highway 99','Osceola','AR','72370','USA','PO Box 204','Cape Girardeau','MO','63702-2045','USA','28 N US State Hwy 99','PO Box 204'),
('1135880','RICHARD PRYOR SEMINARS','PO BOX 2194','KANSAS CITY','MO','64121-9468','USA','RICHARD PRYOR SEMINARS P O BOX 2194','KANSAS CITY','MO','64121-9468','USA','PO BOX 2194','RICHARD PRYOR SEMINARS PO BOX 2194'),
('1241328','INFINITY AND BEYOND','P.O. BOX 169','GASTONIA','NC','28053-0269','USA','','','','','','PO BOX 169',''),
('1259522','MILES STONES','PO BOX 169','GASSTONIA','NC','28053-0269','USA','','','','','','PO BOX 169',''),
('1255253','AT&T','PO Box 50221','Carol Stream','IL','60197','USA','','','','','','PO Box 50221',''),
('1135513','AT&T','PO Box 50221','Carol Stream','IL','60197-5080','USA','','','','','','PO Box 50221',''),
('1119161','Machine Co, Inc','3306 N Thorne Blvd','Chattanooga','TN','','','PO BOX 5301','CHATTANOOGA','TN','37406','USA','3306 N Thorne Blvd','PO BOX 5301'),
('1176587','Topsy Turvy','365 Welmington Road','Chicago','IL','60606','USA','','','','','','365 Welmington Rd',''),
('2156671','Topsy Turvvy, Inc.','P.O. Box 55217','Columbus','OH','43081','','365 Welmington Road','Chicago','IL','60606','USA','','365 Welmington Rd')
Current Results:
Reason Supplier_No Name Address City State Zip Country Remit_Address Remit_City Remit_State Remit_Zip Remit_Country
Address Match 1218409 SouthWestern Medical 100 West Balor Ave Osceola AR 72370 USA SouthWestern Medical100 W Balor Ave Osceola AR 72370 USA
Address Match 1243789 SouthWestern Medical 100 West Balor Ave Osceola AR 72370 USA SouthWestern Medical100 West Balor Ave Osceola AR 72370 USA
Address Match 1230895 NERO CO 28 North US State Highway 99 Osceola AR 72370 USA PO Box 204 Cape Girardeau MO 63702-2045 USA
Address Match 1243782 NERO CO 28 North US State Highway 99 Osceola AR 72370 USA PO Box 204 Cape Girardeau MO 63702-2045 USA
Address Match 1176587 Topsy Turvy 365 Welmington Road Chicago IL 60606 USA
Address Match 2156671 Topsy Turvvy, Inc. P.O. Box 55217 Columbus OH 43081 365 Welmington Road Chicago IL 60606 USA
Address Match 1241328 INFINITY AND BEYOND P.O. BOX 169 GASTONIA NC 28053-0269 USA
Address Match 1259522 MILES STONES PO BOX 169 GASSTONIA NC 28053-0269 USA
Address Match 1230895 NERO CO 28 North US State Highway 99 Osceola AR 72370 USA PO Box 204 Cape Girardeau MO 63702-2045 USA
Address Match 1243782 NERO CO 28 North US State Highway 99 Osceola AR 72370 USA PO Box 204 Cape Girardeau MO 63702-2045 USA
Address Match 1255253 AT&T PO Box 50221 Carol Stream IL 60197 USA
Address Match 1135513 AT&T PO Box 50221 Carol Stream IL 60197-5080 USA
Address Match 1218409 SouthWestern Medical 100 West Balor Ave Osceola AR 72370 USA Southern Lawn Care1004 W Hale Ave Osceola AR 72370 USA
Address Match 1243789 SouthWestern Medical 100 West Balor Ave Osceola AR 72370 USA SouthWestern Medical100 West Balor Ave Osceola AR 72370 USA
Desired Results:
Reason Supplier_No Name Address City State Zip Country Remit_Address Remit_City Remit_State Remit_Zip Remit_Country
Address Match 1218409 SouthWestern Medical 100 West Balor Ave Osceola AR 72370 USA SouthWestern Medical100 W Balor Ave Osceola AR 72370 USA
Address Match 1243789 SouthWestern Medical 100 West Balor Ave Osceola AR 72370 USA SouthWestern Medical100 West Balor Ave Osceola AR 72370 USA
Address Match 1230895 NERO CO 28 North US State Highway 99 Osceola AR 72370 USA PO Box 204 Cape Girardeau MO 63702-2045 USA
Address Match 1243782 NERO CO 28 North US State Highway 99 Osceola AR 72370 USA PO Box 204 Cape Girardeau MO 63702-2045 USA
Address Match 1176587 Topsy Turvy 365 Welmington Road Chicago IL 60606 USA
Address Match 2156671 Topsy Turvvy, Inc. P.O. Box 55217 Columbus OH 43081 365 Welmington Road Chicago IL 60606 USA
Address Match 1241328 INFINITY AND BEYOND P.O. BOX 169 GASTONIA NC 28053-0269 USA
Address Match 1259522 MILES STONES PO BOX 169 GASSTONIA NC 28053-0269 USA
Address Match 1255253 AT&T PO Box 50221 Carol Stream IL 60197 USA
Address Match 1135513 AT&T PO Box 50221 Carol Stream IL 60197-5080 USA
Just add a row_number per supplier to the final result set and keep only the rows where the row number is 1.
Note the row_number function requires an order by clause which is used to determine which of the duplicate rows you wish to keep. Change that to suit your circumstances.
WITH cte AS (
SELECT
'Address Match' AS [Reason],
pv.[Supplier_No],
pv.[Name],
pv.[Address],
pv.[City],
pv.[State],
pv.[Zip],
pv.[Country],
pv.[Remit_Address],
pv.[Remit_City],
pv.[Remit_State],
pv.[Remit_Zip],
pv.[Remit_Country]
, ROW_NUMBER() OVER (PARTITION BY pv.[Supplier_No] ORDER BY pv.[Name]) rn
FROM dupe_addresses n
LEFT JOIN sample_data pv
ON (
(n.[NewAdd] = pv.[cleanAddress] AND ( [Address] <> '' AND [Address] IS NOT NULL ))
OR (n.[NewAdd] = pv.[cleanRemit_Address] AND ( [Remit_Address] <> '' AND [Remit_Address] IS NOT NULL))
)
WHERE ([Supplier_No] IS NOT NULL AND [Supplier_No] <> '')
)
SELECT *
FROM cte
WHERE rn = 1
ORDER BY Supplier_No, [Name];
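To see why the OR join doubles the rows and what the rn = 1 filter does about it, here is a cut-down reproduction in Python with sqlite3. It uses only the two NERO CO rows and four of the columns from the sample data, and SQLite rather than SQL Server, so treat it as an illustration of the mechanism only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sample_data (Supplier_No TEXT PRIMARY KEY, Name TEXT,
                          cleanAddress TEXT, cleanRemit_Address TEXT);
CREATE TABLE dupe_addresses (NewAdd TEXT);
INSERT INTO sample_data VALUES
  ('1230895', 'NERO CO', '28 N US State Hwy 99', 'PO Box 204'),
  ('1243782', 'NERO CO', '28 N US State Hwy 99', 'PO Box 204');
INSERT INTO dupe_addresses VALUES ('28 N US State Hwy 99'), ('PO Box 204');
""")

# Each supplier matches once on cleanAddress and once on cleanRemit_Address.
join_sql = """
SELECT pv.Supplier_No, pv.Name,
       ROW_NUMBER() OVER (PARTITION BY pv.Supplier_No ORDER BY n.NewAdd) AS rn
FROM dupe_addresses n
JOIN sample_data pv
  ON n.NewAdd = pv.cleanAddress OR n.NewAdd = pv.cleanRemit_Address
"""
all_rows = conn.execute(join_sql).fetchall()

# Keeping only rn = 1 leaves one row per supplier.
deduped = conn.execute(
    f"SELECT Supplier_No, Name FROM ({join_sql}) AS t "
    "WHERE rn = 1 ORDER BY Supplier_No"
).fetchall()

print(len(all_rows), "rows before filtering")
print(deduped)
```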

Count occurrences with exclude criteria

I have a Table
City ID
Austin 123
Austin 123
Austin 123
Austin 145
Austin 145
Chicago 12
Chicago 12
Houston 24
Houston 45
Houston 45
Now I want to count the occurrences for all cities that have more than one distinct id. Since Chicago has only one id (=12), I am not interested in Chicago and it should not appear in the result set, which should look like this:
city Id Occurrences
Austin 123 3
Austin 145 2
Houston 24 1
Houston 45 2
I am able to get myself an overview with
select city, Id from Table
group by city, Id
But I am not sure how to select only the ones that have different ids and how to count them.
Could anyone help me out here?
You can use window functions and aggregation:
select city, id, occurences
from (
select city, id, count(*) occurences, count(*) over(partition by city) cnt_city
from mytable
group by city, id
) t
where cnt_city > 1
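For anyone who wants to run this against the sample data, here is a small check in Python with sqlite3 (SQLite 3.25+ for window functions). The inner COUNT(*) counts rows per (city, id) group, while the windowed COUNT(*) runs over the grouped rows and therefore counts the distinct ids per city. The alias ids_per_city is my own naming.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (city TEXT, id INTEGER)")
conn.executemany("INSERT INTO mytable VALUES (?, ?)", [
    ("Austin", 123), ("Austin", 123), ("Austin", 123),
    ("Austin", 145), ("Austin", 145),
    ("Chicago", 12), ("Chicago", 12),
    ("Houston", 24), ("Houston", 45), ("Houston", 45),
])

result = conn.execute("""
SELECT city, id, occurrences
FROM (
  SELECT city, id,
         COUNT(*) AS occurrences,                          -- rows per (city, id)
         COUNT(*) OVER (PARTITION BY city) AS ids_per_city -- groups per city
  FROM mytable
  GROUP BY city, id
) AS t
WHERE ids_per_city > 1
ORDER BY city, id
""").fetchall()

for row in result:
    print(row)
```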

How to show blank when value is repeated

SELECT * FROM Cities ORDER BY Country;
This is the result.
COUNTRY CITY PLACE
Italy Milan Zone_A
Italy Rome Zone_A
Italy Rome Zone_B
USA New York Zone_Q
USA Atlanta Zone_A
I would like to create a Stored Procedure that shows "blank" when the item is repeated. The final result should be the following. (Note that this rule is applied only in the first 2 columns, not in the third).
COUNTRY  CITY      PLACE
Italy    Milan     Zone_A
         Rome      Zone_A
                   Zone_B
USA      New York  Zone_Q
         Atlanta   Zone_A
If your version of MariaDB supports window functions, you can use lag():
select
case when lag(country) over(order by country, city, place) = country
then null
else country
end country,
case when lag(city) over(order by country, city, place) = city
then null
else city
end city,
place
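from cities c
-- qualify with the table alias so the sort uses the base columns,
-- not the blanked-out aliases from the select list
order by
c.country,
c.city,
c.place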
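Here is a quick way to try the lag() idea on the sample rows, using Python with sqlite3 in place of MariaDB (so a sketch only, SQLite 3.25+). The ORDER BY uses table-qualified columns so that the sort is done on the stored values rather than on the blanked-out output columns. Note that sorting by city puts Atlanta before New York, unlike the listing in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cities (country TEXT, city TEXT, place TEXT)")
conn.executemany("INSERT INTO cities VALUES (?, ?, ?)", [
    ("Italy", "Milan", "Zone_A"), ("Italy", "Rome", "Zone_A"),
    ("Italy", "Rome", "Zone_B"), ("USA", "New York", "Zone_Q"),
    ("USA", "Atlanta", "Zone_A"),
])

# A value is replaced by NULL when it equals the value on the previous row.
blanked = conn.execute("""
SELECT
  CASE WHEN LAG(c.country) OVER (ORDER BY c.country, c.city, c.place) = c.country
       THEN NULL ELSE c.country END AS country,
  CASE WHEN LAG(c.city) OVER (ORDER BY c.country, c.city, c.place) = c.city
       THEN NULL ELSE c.city END AS city,
  c.place
FROM cities c
ORDER BY c.country, c.city, c.place
""").fetchall()

for row in blanked:
    print(row)
```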

subquery with NOT IN

I have a table of cities like:
state city
----- ----
texas houston
texas austin
texas dallas
texas san antonio
texas beaumont
texas brownsville
texas el paso
california anaheim
california san diego
california los angeles
california oakland
california simi valley
california san francisco
I need a query to find the states that don't have a city named 'houston' or 'dallas'. My first thought was this
select distinct state from cities where city not in ('houston', 'dallas');
but that won't work. I think I need a subquery and a NOT IN of some sort..
A way you can do this is with a NOT EXISTS clause:
Select Distinct State
From Cities C1
Where Not Exists
(
Select *
From Cities C2
Where C2.City In ('Houston', 'Dallas')
And C1.State = C2.State
)
select distinct state from cities where state not in (SELECT state FROM cities WHERE city in ('houston', 'dallas'));
Another method, may be slightly faster:
select distinct state from cities where state not in (select state from cities where city in ('houston', 'dallas'));
Select State
from Cities
group by State
having count(case when city in ('houston', 'dallas') then city end) = 0
This will return all states where the number of cities associated with that state and matching your criteria is 0 (i.e. there are no such cities associated with the state).
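To compare the approaches side by side, here is a small sketch in Python with sqlite3 using a subset of the rows from the question. It shows why the first attempt returns every state (the filter is applied per row, not per state) and that the NOT IN subquery and the GROUP BY / HAVING versions agree. The variable names are mine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cities (state TEXT, city TEXT)")
conn.executemany("INSERT INTO cities VALUES (?, ?)", [
    ("texas", "houston"), ("texas", "austin"), ("texas", "dallas"),
    ("california", "anaheim"), ("california", "san diego"),
])


def states(sql: str) -> list:
    """Run a query and return the sorted list of states it yields."""
    return sorted(row[0] for row in conn.execute(sql))


# Row-level filter: texas survives because of its 'austin' row.
naive = states(
    "SELECT DISTINCT state FROM cities WHERE city NOT IN ('houston', 'dallas')"
)

# State-level filter via a subquery.
not_in = states("""
SELECT DISTINCT state FROM cities
WHERE state NOT IN (SELECT state FROM cities
                    WHERE city IN ('houston', 'dallas'))
""")

# State-level filter via conditional aggregation.
having = states("""
SELECT state FROM cities
GROUP BY state
HAVING COUNT(CASE WHEN city IN ('houston', 'dallas') THEN city END) = 0
""")

print(naive, not_in, having)
```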