Extract suburb name from address string in BigQuery - sql
I have a table of addresses (property) from which I need to extract just the suburb name. I have another table (suburbs) that contains all of the suburb names.
I'm having a problem with multi-word suburb names: an address matches both the full suburb name and any shorter suburb name contained in it. I need each address to match only the longest suburb name, e.g. an address containing "North Bondi" should match the suburb "North Bondi" and not the suburb "Bondi".
I've found some examples online that use the MAX function in the join, but BigQuery won't let me use that function in the join.
I would appreciate it if someone could suggest corrections or point me to other solutions (e.g. sorting the suburbs table and retrieving only one result). Thank you!
Table: property
address
12 Smith Street Surry Hills NSW
34 Jones Street Bondi NSW
15 Sunny Road North Bondi NSW
Table: suburbs
suburb | state
Surry Hills | NSW
Bondi | NSW
North Bondi | NSW
Current code used:
Select * from ( SELECT p.address, s.suburb
FROM `property` p
JOIN `suburbs` s
ON INITCAP(p.address) LIKE CONCAT('%', INITCAP(s.suburb),' ', INITCAP(s.state), '%')
GROUP BY p.address, s.suburb
) x
join `property` p
ON p.address = x.address
where p.address is not null;
Actual result:
address | suburb
12 Smith Street Surry Hills NSW | Surry Hills
34 Jones Street Bondi NSW | Bondi
15 Sunny Road North Bondi NSW | Bondi
15 Sunny Road North Bondi NSW | North Bondi
Desired result:
address | suburb
12 Smith Street Surry Hills NSW | Surry Hills
34 Jones Street Bondi NSW | Bondi
15 Sunny Road North Bondi NSW | North Bondi
Try this:
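select v1.address,
       -- order by word position so the suburb words keep their original order
       string_agg(v1.addr_part, " " order by v1.ofst) as suburb
from (
  select t.address,
         addr_part,
         ofst
  from property t
  cross join unnest(split(t.address, " ")) addr_part WITH OFFSET AS ofst
  where ofst > 2  -- skip the first three words (number, street name, street type)
  -- drop the last word (the state)
  qualify row_number() over(partition by t.address order by ofst desc) > 1
) v1
group by v1.address
;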
Note, however, that this approach assumes:
- the first three words of every address never belong to the suburb name, i.e. the street part is always exactly three words;
- every state is a single word.
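To make those two assumptions concrete, here is a minimal Python sketch of the same positional logic. It is only an illustration, not BigQuery code; the function name `suburb_by_position` and the third address are made up for the example. It drops the first three words and the last word, so it gives a wrong answer as soon as the street part is longer than three words.

```python
def suburb_by_position(address: str, street_words: int = 3) -> str:
    """Return the words between the street part and the trailing state."""
    words = address.split(" ")
    # Skip the first `street_words` words and the final word (the state).
    return " ".join(words[street_words:-1])


print(suburb_by_position("12 Smith Street Surry Hills NSW"))
print(suburb_by_position("15 Sunny Road North Bondi NSW"))
# Hypothetical address whose street part has five words: the assumption
# breaks and part of the street name leaks into the result.
print(suburb_by_position("7 Old South Head Road Bondi NSW"))
```

The first two calls return the expected suburbs, while the last one returns "Head Road Bondi", which is why matching against the suburbs table is the more robust option.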
Regarding "I've found some examples online that use the MAX function in the join but Bigquery won't let me use that function in the join": you can use a window function instead of MAX:
Select * from ( SELECT p.address, s.suburb
FROM `property` p
JOIN `suburbs` s
ON INITCAP(p.address) LIKE CONCAT('%', INITCAP(s.suburb),' ', INITCAP(s.state), '%')
) x
QUALIFY RANK() OVER (PARTITION BY address ORDER BY LENGTH(suburb) DESC) = 1
Query results:
address | suburb
12 Smith Street Surry Hills NSW | Surry Hills
15 Sunny Road North Bondi NSW | North Bondi
34 Jones Street Bondi NSW | Bondi
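The idea behind ranking the matches by suburb length can be sketched outside SQL as well. The Python snippet below is an illustrative analogue only, not the BigQuery query itself: `longest_suburb` is a hypothetical helper, and a case-insensitive substring test stands in for the `INITCAP(...) LIKE CONCAT('%', ..., '%')` condition.

```python
def longest_suburb(address, suburbs):
    """Return the longest suburb whose "<suburb> <state>" text occurs in the address."""
    haystack = address.casefold()
    matches = [
        suburb
        for suburb, state in suburbs
        if f"{suburb} {state}".casefold() in haystack
    ]
    # Assumption: no two matching suburbs have the same length; if they did,
    # max() would keep the first one, whereas RANK() would return both.
    return max(matches, key=len, default=None)


suburbs = [("Surry Hills", "NSW"), ("Bondi", "NSW"), ("North Bondi", "NSW")]
addresses = [
    "12 Smith Street Surry Hills NSW",
    "34 Jones Street Bondi NSW",
    "15 Sunny Road North Bondi NSW",
]
for address in addresses:
    print(address, "->", longest_suburb(address, suburbs))
```

Both "Bondi" and "North Bondi" match the third address, and keeping only the longest match is what removes the unwanted "Bondi" row.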
Consider below approach
SELECT p.address,
STRING_AGG(s.suburb ORDER BY LENGTH(s.suburb) DESC LIMIT 1) suburb
FROM `property` p
JOIN `suburbs` s
ON INITCAP(p.address) LIKE CONCAT('%', INITCAP(s.suburb),' ', INITCAP(s.state), '%')
GROUP BY p.address
If applied to the sample data in your question, the output is one row per address, each with its longest matching suburb, which is the desired result.
Related
Unique table from multiple ones having same and different columns (SQL)
I have multiple datasets with different rows and fields.

dataset1
Customer_ID Date Category Address City School
4154124 1/2/2021 A balboa st. Canterbury Middleton
2145124 1/2/2012 A somewhere world St. Augustine
1621573 1/2/2012 A my_street somewhere St. Augustine

dataset2
Customer_ID Date Category Country Zipcode
14123 12/12/2020 B UK EW
416412 14/12/2020 B ES

dataset3
Customer_ID Date Category School University
4124123 07/12/2020 C Middleton Oxford

I would like a final dataset that includes all the columns (keeping only one copy of the common ones):

Customer_ID Date Category Address City School Country Zipcode University
4154124 1/2/2021 A balboa st. Canterbury Middleton
2145124 1/2/2012 A somewhere world St. Augustine
1621573 1/2/2012 A my_street somewhere St. Augustine
14123 12/12/2020 B UK EW
416412 14/12/2020 B ES
4124123 07/12/2020 C Middleton Oxford

Would a left join be the best way to get the expected output? How can I keep Customer_ID, Date and Category, as well as duplicated columns (e.g. School), only once?

You can achieve this using UNION ALL.

SELECT Customer_ID, Date, Category, Address, City, School, '' AS Country, '' AS ZipCode, '' AS university FROM dataset1
UNION ALL
SELECT Customer_ID, Date, Category, '', '', '', Country, Zipcode, '' FROM dataset2
UNION ALL
SELECT Customer_ID, Date, Category, '', '', School, '', '', University FROM dataset3
Reduce Duplicate Records In Multi-Value Join SQL Query
Backstory: I have created a bunch of stored procedures that analyze my client's data. I am reviewing a list of vendors and trying to identify possible duplicates. It works pretty well, but each record has 2 possible addresses, and I'm getting duplicate results when matches are found in both addresses. Ideally I'd just need the records to appear in the results once. Process: I created a "clean" version of the address where I remove special characters and normalize to USPS standards. This helps me match West v W v W. or PO Box v P.O. Box v P O Box etc. I then take all of the distinct address values from both addresses ([cleanAddress] and [cleanRemit_Address]) and put into a master list. I then compare to the source table with a HAVING COUNT(*) > 1 to determine which addresses appear more than once. Lastly I take that final list of addresses that appear more than once and combine it with the source data for output. Problem: If you view the results near the bottom you'll see that I have 2 sets of dupes that are nearly identical except for some slight differences in the addresses. Both the Address and Remit_Address are essentially the same so it finds a match on BOTH the [cleanAddress] and [cleanRemit_Address] values for "SouthWestern Medical" and "NERO CO" so both sets of dupes appear twice in the list instead of once (see the desired results at the bottom). I need to match [cleanAddress] OR [cleanRemit_Address] but I don't know how to limit each record appearing once in the results. 
SSMS 18 SQL Server 2019 Queries: --SQL (Address List): Combines all addresses for a master list of all addresses in the table SELECT * INTO [address_list] FROM ( SELECT DISTINCT [NewAdd] FROM ( SELECT DISTINCT [cleanAddress] AS [NewAdd] FROM [sample_data] WHERE ( [cleanAddress] IS NOT NULL AND [cleanAddress] <> '' ) AND ( [Supplier_No] IS NOT NULL AND [Supplier_No] <> '' ) GROUP BY [cleanAddress] UNION SELECT DISTINCT [cleanRemit_Address] AS [NewAdd] FROM [sample_data] WHERE ( [cleanRemit_Address] IS NOT NULL AND [cleanRemit_Address] <> '' ) AND ( [Supplier_No] IS NOT NULL AND [Supplier_No] <> '' ) GROUP BY [cleanRemit_Address] ) q1 ) q2 ORDER BY [NewAdd] --SQL (Address Dupes): Determines which addresses appear in the data more than once SELECT * INTO [dupe_addresses] FROM ( SELECT [NewAdd] FROM [address_list] n LEFT JOIN [sample_data] pv ON ( ( n.[NewAdd] = pv.[cleanAddress] AND ( [Address] <> '' AND [Address] IS NOT NULL ) ) OR ( n.[NewAdd] = pv.[cleanRemit_Address] AND ( [Remit_Address] <> '' AND [Remit_Address] IS NOT NULL ) ) ) WHERE ( [Supplier_No] IS NOT NULL AND [Supplier_No] <> '' ) GROUP BY [NewAdd] HAVING COUNT(*) > 1 ) q1 ORDER BY [NewAdd] --SQL (Address Query): Outputs the information of the matched addresses SELECT 'Address Match' AS [Reason], pv.[Supplier_No], pv.[Name], pv.[Address], pv.[City], pv.[State], pv.[Zip], pv.[Country], pv.[Remit_Address], pv.[Remit_City], pv.[Remit_State], pv.[Remit_Zip], pv.[Remit_Country] FROM [dupe_addresses] n LEFT JOIN [sample_data] pv ON ( (n.[NewAdd] = pv.[cleanAddress] AND ( [Address] <> '' AND [Address] IS NOT NULL ) ) OR (n.[NewAdd] = pv.[cleanRemit_Address] AND ( [Remit_Address] <> '' AND [Remit_Address] IS NOT NULL ) ) ) WHERE ( [Supplier_No] IS NOT NULL AND [Supplier_No] <> '' ) Sample Data: CREATE TABLE [sample_data] ( [Supplier_No] varchar(255), [Name] varchar(255), [Address] varchar(255), [City] varchar(255), [State] varchar(255), [Zip] varchar(255), [Country] varchar(255), [Remit_Address] varchar(255), 
[Remit_City] varchar(255), [Remit_State] varchar(255), [Remit_Zip] varchar(255), [Remit_Country] varchar(255), [cleanAddress] varchar(255), [cleanRemit_Address] varchar(255), CONSTRAINT [suppliers_pk] PRIMARY KEY ([Supplier_No]) ) INSERT INTO [sample_data] VALUES ('1039104','Geez Companies','100 Aero Hudson Rd','Streetsboro','OH','44241','','100 Aero Hudson Road','Streetsboro','OH','44241','USA','100 Aero Hudson Rd','100 Aero Hudson Rd'), ('1218409','SouthWestern Medical','100 West Balor Ave','Osceola','AR','72370','USA','SouthWestern Medical100 W Balor Ave','Osceola','AR','72370','USA','100 W Balor Ave','SouthWestern Medical100 W Balor Ave'), ('1243789','SouthWestern Medical','100 West Balor Ave','Osceola','AR','72370','USA','SouthWestern Medical100 West Balor Ave','Osceola','AR','72370','USA','100 W Balor Ave','SouthWestern Medical100 W Balor Ave'), ('1243636','SIRI SYSTEMS','15 BRAD ROAD','WEXFORD','PA','15090','','','','','','','15 BRAD RD',''), ('1152482','FLEETWOOD MACK','22 WINDSOCK CT','ADDISON','IL','60101','','PO BOX 951','CHICAGO','IL','60694-5124','','22 WINDSOCK CT','PO BOX 951'), ('1224483','Aerospace Junction','211500 Communicate Ave','Mingo Junction','OH','43939','USA','P O Box 99','Mingo Junction','OH','43939','USA','211500 Communicate Ave','PO Box 99'), ('1243397','Squeezy Felt','SCHREIBER DIST','NEW KENSINGTON','PA','15068','','','','','','','SCHREIBER DIST',''), ('1230895','NERO CO','28 North US State Highway 99','Osceola','AR','72370','USA','PO Box 204','Cape Girardeau','MO','63702-2045','USA','28 N US State Hwy 99','PO Box 204'), ('1243782','NERO CO','28 North US State Highway 99','Osceola','AR','72370','USA','PO Box 204','Cape Girardeau','MO','63702-2045','USA','28 N US State Hwy 99','PO Box 204'), ('1135880','RICHARD PRYOR SEMINARS','PO BOX 2194','KANSAS CITY','MO','64121-9468','USA','RICHARD PRYOR SEMINARS P O BOX 2194','KANSAS CITY','MO','64121-9468','USA','PO BOX 2194','RICHARD PRYOR SEMINARS PO BOX 2194'), ('1241328','INFINITY AND 
BEYOND','P.O. BOX 169','GASTONIA','NC','28053-0269','USA','','','','','','PO BOX 169',''), ('1259522','MILES STONES','PO BOX 169','GASSTONIA','NC','28053-0269','USA','','','','','','PO BOX 169',''), ('1255253','AT&T','PO Box 50221','Carol Stream','IL','60197','USA','','','','','','PO Box 50221',''), ('1135513','AT&T','PO Box 50221','Carol Stream','IL','60197-5080','USA','','','','','','PO Box 50221',''), ('1119161','Machine Co, Inc','3306 N Thorne Blvd','Chattanooga','TN','','','PO BOX 5301','CHATTANOOGA','TN','37406','USA','3306 N Thorne Blvd','PO BOX 5301'), ('1176587','Topsy Turvy','365 Welmington Road','Chicago','IL','60606','USA','','','','','','365 Welmington Rd',''), ('2156671','Topsy Turvvy, Inc.','P.O. Box 55217','Columbus','OH','43081','','365 Welmington Road','Chicago','IL','60606','USA','','365 Welmington Rd') Current Results: Reason Supplier_No Name Address City State Zip Country Remit_Address Remit_City Remit_State Remit_Zip Remit_Country Address Match 1218409 SouthWestern Medical 100 West Balor Ave Osceola AR 72370 USA SouthWestern Medical100 W Balor Ave Osceola AR 72370 USA Address Match 1243789 SouthWestern Medical 100 West Balor Ave Osceola AR 72370 USA SouthWestern Medical100 West Balor Ave Osceola AR 72370 USA Address Match 1230895 NERO CO 28 North US State Highway 99 Osceola AR 72370 USA PO Box 204 Cape Girardeau MO 63702-2045 USA Address Match 1243782 NERO CO 28 North US State Highway 99 Osceola AR 72370 USA PO Box 204 Cape Girardeau MO 63702-2045 USA Address Match 1176587 Topsy Turvy 365 Welmington Road Chicago IL 60606 USA Address Match 2156671 Topsy Turvvy, Inc. P.O. Box 55217 Columbus OH 43081 365 Welmington Road Chicago IL 60606 USA Address Match 1241328 INFINITY AND BEYOND P.O. 
BOX 169 GASTONIA NC 28053-0269 USA Address Match 1259522 MILES STONES PO BOX 169 GASSTONIA NC 28053-0269 USA Address Match 1230895 NERO CO 28 North US State Highway 99 Osceola AR 72370 USA PO Box 204 Cape Girardeau MO 63702-2045 USA Address Match 1243782 NERO CO 28 North US State Highway 99 Osceola AR 72370 USA PO Box 204 Cape Girardeau MO 63702-2045 USA Address Match 1255253 AT&T PO Box 50221 Carol Stream IL 60197 USA Address Match 1135513 AT&T PO Box 50221 Carol Stream IL 60197-5080 USA Address Match 1218409 SouthWestern Medical 100 West Balor Ave Osceola AR 72370 USA Southern Lawn Care1004 W Hale Ave Osceola AR 72370 USA Address Match 1243789 SouthWestern Medical 100 West Balor Ave Osceola AR 72370 USA SouthWestern Medical100 West Balor Ave Osceola AR 72370 USA Desired Results: Reason Supplier_No Name Address City State Zip Country Remit_Address Remit_City Remit_State Remit_Zip Remit_Country Address Match 1218409 SouthWestern Medical 100 West Balor Ave Osceola AR 72370 USA SouthWestern Medical100 W Balor Ave Osceola AR 72370 USA Address Match 1243789 SouthWestern Medical 100 West Balor Ave Osceola AR 72370 USA SouthWestern Medical100 West Balor Ave Osceola AR 72370 USA Address Match 1230895 NERO CO 28 North US State Highway 99 Osceola AR 72370 USA PO Box 204 Cape Girardeau MO 63702-2045 USA Address Match 1243782 NERO CO 28 North US State Highway 99 Osceola AR 72370 USA PO Box 204 Cape Girardeau MO 63702-2045 USA Address Match 1176587 Topsy Turvy 365 Welmington Road Chicago IL 60606 USA Address Match 2156671 Topsy Turvvy, Inc. P.O. Box 55217 Columbus OH 43081 365 Welmington Road Chicago IL 60606 USA Address Match 1241328 INFINITY AND BEYOND P.O. BOX 169 GASTONIA NC 28053-0269 USA Address Match 1259522 MILES STONES PO BOX 169 GASSTONIA NC 28053-0269 USA Address Match 1255253 AT&T PO Box 50221 Carol Stream IL 60197 USA Address Match 1135513 AT&T PO Box 50221 Carol Stream IL 60197-5080 USA
Just add a row number per supplier to the final result set and keep only row number 1. Note that the ROW_NUMBER function requires an ORDER BY clause, which determines which of the duplicate rows is kept. Change that to suit your circumstances.

WITH cte AS (
  SELECT 'Address Match' AS [Reason],
         pv.[Supplier_No], pv.[Name], pv.[Address], pv.[City], pv.[State], pv.[Zip], pv.[Country],
         pv.[Remit_Address], pv.[Remit_City], pv.[Remit_State], pv.[Remit_Zip], pv.[Remit_Country],
         ROW_NUMBER() OVER (PARTITION BY pv.[Supplier_No] ORDER BY pv.[Name]) rn
  FROM dupe_addresses n
  LEFT JOIN sample_data pv
    ON ( (n.[NewAdd] = pv.[cleanAddress] AND ([Address] <> '' AND [Address] IS NOT NULL))
      OR (n.[NewAdd] = pv.[cleanRemit_Address] AND ([Remit_Address] <> '' AND [Remit_Address] IS NOT NULL)) )
  WHERE ([Supplier_No] IS NOT NULL AND [Supplier_No] <> '')
)
SELECT *
FROM cte
WHERE rn = 1
ORDER BY Supplier_No, [Name];
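As a rough illustration of what "one row per supplier" means, the following Python sketch keeps the first row for each key after sorting. It is an analogue of the ROW_NUMBER filter, not T-SQL; `first_row_per_key` and the trimmed-down rows are assumptions made for the example.

```python
def first_row_per_key(rows, key, order_by):
    """Keep one row per `key`, choosing the first row when sorted by `order_by`."""
    kept = {}
    for row in sorted(rows, key=lambda r: r[order_by]):
        # setdefault only stores the first row seen for each key value.
        kept.setdefault(row[key], row)
    return list(kept.values())


# Two suppliers that each appear twice, as in the duplicated result rows.
rows = [
    {"Supplier_No": "1230895", "Name": "NERO CO"},
    {"Supplier_No": "1243782", "Name": "NERO CO"},
    {"Supplier_No": "1230895", "Name": "NERO CO"},
    {"Supplier_No": "1243782", "Name": "NERO CO"},
]
result = first_row_per_key(rows, "Supplier_No", "Name")
print(result)
```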
Count occurrences with exclude criteria
I have a table:

City | ID
Austin | 123
Austin | 123
Austin | 123
Austin | 145
Austin | 145
Chicago | 12
Chicago | 12
Houston | 24
Houston | 45
Houston | 45

Now I want to count the occurrences for all cities that have more than one distinct ID. Since Chicago has only one ID (12), I am not interested in Chicago and it should not appear in the result set, which should look like this:

city | Id | Occurrences
Austin | 123 | 3
Austin | 145 | 2
Houston | 24 | 1
Houston | 45 | 2

I am able to get an overview with

select city, Id from Table group by city, Id

but I am not sure how to select only the cities that have different IDs and count them. Could anyone help me out here?

You can use window functions and aggregation:

select city, id, occurrences
from (
  select city, id,
         count(*) occurrences,
         count(*) over(partition by city) cnt_city
  from mytable
  group by city, id
) t
where cnt_city > 1
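The two counts in that query can be hard to tell apart, so here is a small Python sketch of the same computation on the sample data (an illustration only; the function name `occurrences_for_multi_id_cities` is hypothetical). The first counter plays the role of `count(*) ... group by city, id`, and the second counts the grouped rows per city, as `count(*) over(partition by city)` does after grouping.

```python
from collections import Counter

rows = [
    ("Austin", 123), ("Austin", 123), ("Austin", 123),
    ("Austin", 145), ("Austin", 145),
    ("Chicago", 12), ("Chicago", 12),
    ("Houston", 24), ("Houston", 45), ("Houston", 45),
]


def occurrences_for_multi_id_cities(rows):
    # Occurrences of each (city, id) pair.
    per_pair = Counter(rows)
    # Number of distinct ids per city (one entry per grouped pair).
    ids_per_city = Counter(city for city, _ in per_pair)
    return [
        (city, id_, n)
        for (city, id_), n in sorted(per_pair.items())
        if ids_per_city[city] > 1
    ]


for line in occurrences_for_multi_id_cities(rows):
    print(line)
```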
How to show blank when value is repeated
SELECT * FROM Cities ORDER BY Country;

This is the result:

COUNTRY | CITY | PLACE
Italy | Milan | Zone_A
Italy | Rome | Zone_A
Italy | Rome | Zone_B
USA | New York | Zone_Q
USA | Atlanta | Zone_A

I would like to create a stored procedure that shows a blank when the item is repeated. The final result should be the following (note that this rule applies only to the first two columns, not the third):

COUNTRY | CITY | PLACE
Italy | Milan | Zone_A
 | Rome | Zone_A
 | | Zone_B
USA | New York | Zone_Q
 | Atlanta | Zone_A

If your version of MariaDB supports window functions, you can use lag():

select
  case when lag(country) over(order by country, city, place) = country then null else country end country,
  case when lag(city) over(order by country, city, place) = city then null else city end city,
  place
from cities
order by country, city, place
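A short Python sketch of the same "compare with the previous row" idea may help when checking the output (illustrative only; `blank_repeats` is a hypothetical name, and None stands in for SQL NULL). As in the query, rows are sorted by country, city and place first, so within each country the cities come out in alphabetical order.

```python
def blank_repeats(rows):
    """Blank out country/city when they repeat the previous row's value."""
    result = []
    prev = None
    for country, city, place in sorted(rows):
        same_country = prev is not None and prev[0] == country
        same_city = prev is not None and prev[1] == city
        result.append((
            None if same_country else country,
            None if same_city else city,
            place,
        ))
        prev = (country, city, place)
    return result


rows = [
    ("Italy", "Milan", "Zone_A"),
    ("Italy", "Rome", "Zone_A"),
    ("Italy", "Rome", "Zone_B"),
    ("USA", "New York", "Zone_Q"),
    ("USA", "Atlanta", "Zone_A"),
]
for line in blank_repeats(rows):
    print(line)
```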
subquery with NOT IN
I have a table of cities like:

state | city
texas | houston
texas | austin
texas | dallas
texas | san antonio
texas | beaumont
texas | brownsville
texas | el paso
california | anaheim
california | san diego
california | los angeles
california | oakland
california | simi valley
california | san francisco

I need a query to find the states that don't have a city named 'houston' or 'dallas'. My first thought was this:

select distinct state from cities where city not in ('houston', 'dallas');

but that won't work. I think I need a subquery and a NOT IN of some sort.

One way to do this is with a NOT EXISTS clause:

Select Distinct State
From Cities C1
Where Not Exists (
  Select *
  From Cities C2
  Where C2.City In ('Houston', 'Dallas')
    And C1.State = C2.State
)

select distinct state
from cities
where state not in (SELECT state FROM cities WHERE city in ('houston', 'dallas'));

Another method, which may be slightly faster:

select distinct state
from cities
where state not in (select state from cities where city in ('houston', 'dallas'));

Select State
from Cities
group by State
having count(case when City in ('houston', 'dallas') then City end) = 0

This returns all states where the number of cities that belong to the state and match your criteria is 0 (i.e. no such city is associated with the state).
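All of these answers compute the same thing: the set of all states minus the set of states that contain one of the excluded cities. A minimal Python sketch of that set difference, on a shortened version of the sample data, looks like this (the helper name `states_without` is hypothetical, and the case-insensitive comparison is an assumption about the database collation):

```python
def states_without(rows, excluded_cities):
    """Return states that have none of the excluded cities."""
    excluded = {city.casefold() for city in excluded_cities}
    all_states = {state for state, _ in rows}
    states_with_excluded = {
        state for state, city in rows if city.casefold() in excluded
    }
    return sorted(all_states - states_with_excluded)


rows = [
    ("texas", "houston"), ("texas", "austin"), ("texas", "dallas"),
    ("california", "anaheim"), ("california", "san diego"),
    ("california", "los angeles"),
]
print(states_without(rows, ["houston", "dallas"]))
```

This also shows why the first attempt in the question fails: filtering individual rows with `city not in (...)` still leaves texas rows such as austin, so texas stays in the result.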