BigQuery: grouping by similar strings for a large dataset - sql

I have a table of invoice data with over 100k unique invoices and several thousand unique company names associated with them.
I'm trying to group these company names into more general groups to understand how many invoices they're responsible for, how often they receive them, etc.
Currently, I'm using the following code to identify unique company names:
SELECT DISTINCT(company_name)
FROM invoice_data
ORDER BY company_name
The problem is that this only gives me exact matches, when it's obvious that many string values in company_name are similar. For example: McDonalds Paddington, McDonlads Oxford Square, McDonalds Peckham, etc.
How can I make my GROUP BY statement more general?
Sometimes the issue isn't as simple as the example above. Occasionally there is just an extra space or a PTY/LTD suffix that throws off a GROUP BY match.
EDIT
To give an example of what I'm looking for, I'd be looking to turn the following:
company_name
----------------------
Jim's Pizza Paddington|
Jim's Pizza Oxford |
McDonald's Peckham |
McDonald's Victoria |
-----------------------
I'd like to group these rows by company name rather than only by exact string match.

Have you tried using the Soundex function?
SELECT
SOUNDEX(name) AS code,
MAX( name) AS sample_name,
count(name) as records
FROM ((
SELECT
"Jim's Pizza Paddington" AS name)
UNION ALL (
SELECT
"Jim's Pizza Oxford" AS name)
UNION ALL (
SELECT
"McDonald's Peckham" AS name)
UNION ALL (
SELECT
"McDonald's Victoria" AS name))
GROUP BY
1
You can then group on the Soundex code. To recover a readable name for each group, either use SPLIT or a similar function to extract the part of the string that identifies the company, or use a window function to pull back one occurrence per group. It's not perfect, but it means you don't need to move the data into other tools with more advanced language processing.
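To see why the Soundex grouping works on the sample names, here is a minimal Python sketch of the classic American Soundex rules. It is an illustration only: the helper names `soundex` and `group_by_soundex` are hypothetical, and BigQuery's SOUNDEX may differ in edge cases. Note that the code is built from the first few consonant sounds, so names only group together when they start the same way.

```python
from collections import defaultdict

# Classic American Soundex digit table (letters not listed carry no code).
_CODES = {}
for _digit, _letters in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                         "4": "L", "5": "MN", "6": "R"}.items():
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(text):
    """Return a 4-character Soundex code, ignoring non A-Z characters."""
    letters = [c for c in text.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
            if len(result) == 4:
                break
        prev = code  # vowels reset prev, so a repeated code after a vowel counts
    return result.ljust(4, "0")


def group_by_soundex(names):
    """Bucket names by their Soundex code (hypothetical helper)."""
    groups = defaultdict(list)
    for name in names:
        groups[soundex(name)].append(name)
    return dict(groups)


if __name__ == "__main__":
    sample = ["Jim's Pizza Paddington", "Jim's Pizza Oxford",
              "McDonald's Peckham", "McDonald's Victoria"]
    for code, members in group_by_soundex(sample).items():
        print(code, members)
```

Because only the leading sounds are encoded, it may help to strip noise such as extra spaces or PTY/LTD suffixes before computing the code.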

Related

Strict Match Many to One on Lookup Table

This has been driving me and my team up the wall. I cannot compose a query that will strict match a single record that has a specific permutation of look ups.
We have a single lookup table
room_member_lookup:
room | member
---------------
A | Michael
A | Josh
A | Kyle
B | Kyle
B | Monica
C | Michael
I need to match a room with an exact list of members, but everything I've tried from Stack Overflow still matches room A even when I ask for a room with ONLY Josh and Kyle.
I've tried queries like
SELECT room FROM room_member_lookup
WHERE member IN ('Josh', 'Michael')
GROUP BY room
HAVING COUNT(1) = 2
However, this still returns room A even though it has 3 members. I need an exact match on the member set, so rooms that only partially match must not be returned.
SELECT room
FROM room_member_lookup a
WHERE member IN ('Monica', 'Kyle')
-- Make sure that the room 'a' has exactly two members
and (select count(*)
from room_member_lookup b
where a.room=b.room)=2
GROUP BY room
-- and both members are in that room
HAVING COUNT(1) = 2
Depending on the SQL dialect, one can build a dynamic table (CTE or select .. union all) to hold the member set (Monica and Kyle, for example), and then look for set equivalence using MINUS/EXCEPT sql operators.
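As a rough, runnable illustration of exact set matching, the sketch below uses Python's built-in sqlite3 module with the sample data from the question. It swaps in conditional aggregation (total member count and matching member count must both equal the size of the wanted set) instead of MINUS/EXCEPT, and it assumes each (room, member) pair appears only once. The function names are hypothetical.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory copy of the question's lookup table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE room_member_lookup (room TEXT, member TEXT)")
    conn.executemany(
        "INSERT INTO room_member_lookup VALUES (?, ?)",
        [("A", "Michael"), ("A", "Josh"), ("A", "Kyle"),
         ("B", "Kyle"), ("B", "Monica"), ("C", "Michael")],
    )
    return conn


def rooms_with_exact_members(conn, members):
    """Return rooms whose member set equals `members` exactly."""
    wanted = sorted(set(members))
    if not wanted:
        return []
    placeholders = ", ".join("?" for _ in wanted)
    sql = f"""
        SELECT room
        FROM room_member_lookup
        GROUP BY room
        -- the room has exactly as many members as the wanted set
        HAVING COUNT(*) = ?
        -- and every one of them is in the wanted set
           AND SUM(CASE WHEN member IN ({placeholders}) THEN 1 ELSE 0 END) = ?
        ORDER BY room
    """
    params = [len(wanted), *wanted, len(wanted)]
    return [row[0] for row in conn.execute(sql, params)]


if __name__ == "__main__":
    conn = build_demo_db()
    print(rooms_with_exact_members(conn, ["Monica", "Kyle"]))
    print(rooms_with_exact_members(conn, ["Josh", "Kyle"]))
```

Asking for Josh and Kyle returns no room here, because room A also contains Michael.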

match tables with intermediate mapping table (fuzzy joins with similar strings)

I'm using BigQuery.
I have two simple tables with "bad" data quality from our systems. One represents revenue and the other production rows for bus journeys.
I need to match every journey to a revenue transaction but I only have a set of fields and no key and I don't really know how to do this matching.
This is a sample of the data:
Revenue
Year, Agreement, Station_origin, Station_destination, Product
2020, 123123, London, Manchester, Qwerty
Journeys
Year, Agreement, Station_origin, Station_destination, Product
2020, 123123, Kings Cross, Piccadilly Gardens, Qwer
2020, 123123, Kings Cross, Victoria Station, Qwert
2020, 123123, London, Manchester, Qwerty
Every station has a maximum of 9 alternative names and these are stored in a "station" table.
Stations
Station Name, Station Name 2, Station Name 3,...
London, Kings Cross, Euston,...
Manchester, Piccadilly Gardens, Victoria Station,...
I would like to first try joining the tables on the original fields. This will generate some matches, but many journeys will remain unmatched. For the unmatched revenue rows, I would like to change the product name (shorten it to two letters, which may match many rows in the production table) and then the station names, first changing station_origin and then station_destination. With a shorter product name I could get many matches, but I want the row from the production table with the most common product.
Something like this:
1. Do a direct match. That is, I can use the fields as they are in the tables.
2. Do a match where the revenue.product is changed by shortening it to two letters. substr(product,0,2)
3. Change the rev.station_origin to the first alternative, Station Name 2, and then try a join. The product or other station are not changed.
4. Change the rev.station_origin to the first alternative, Station Name 2, and then try a join. The product is changed as above with a substr(product,0,2) but rev.station_destination is not changed.
5. Change the rev.station_destination to the first alternative, Station Name 2, and then try a join. The product or other station are not changed.
I was told that maybe I should create an intermediate table with all combinations of stations and products and let a rank column decide the order. The station names in the station's table are in order of importance so "station name" is more important than "station name 2" and so on.
I started to do a query with a subquery per rank and do a UNION ALL but there are so many combinations that there must be another way to do this.
Don't know if this makes any sense but I would appreciate any help or ideas to do this in a better way.
Cheers,
Cris
To implement a complex joining strategy with approximate matching, it might make more sense to define the strategy within JavaScript - and call the function from a BigQuery SQL query.
For example, the following query performs these steps:
Take the top 200 male names in the US.
Find if one of the top 200 female names matches.
If not, look for the most similar female name within the options.
Note that the logic to choose the closest option is encapsulated within the JS UDF fhoffa.x.fuzzy_extract_one(). See https://medium.com/@hoffa/new-in-bigquery-persistent-udfs-c9ea4100fd83 to learn more about this.
WITH data AS (
SELECT name, gender, SUM(number) c
FROM `bigquery-public-data.usa_names.usa_1910_2013`
GROUP BY 1,2
), top_men AS (
SELECT * FROM data WHERE gender='M'
ORDER BY c DESC LIMIT 200
), top_women AS (
SELECT * FROM data WHERE gender='F'
ORDER BY c DESC LIMIT 200
)
SELECT name male_name,
COALESCE(
(SELECT name FROM top_women WHERE name=a.name)
, fhoffa.x.fuzzy_extract_one(name, ARRAY(SELECT name FROM top_women))
) female_version
FROM top_men a
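The "exact match first, otherwise closest option" logic of the query can be sketched outside BigQuery as well. The Python example below is only an approximation of the idea: it uses the standard library's difflib as the similarity measure, which is an assumption and not necessarily what fhoffa.x.fuzzy_extract_one() does internally. The function name `exact_or_closest` is hypothetical.

```python
import difflib


def exact_or_closest(name, options):
    """Return `name` if it is among `options`, else the most similar option.

    Returns None when `options` is empty.
    """
    if name in options:
        return name
    # cutoff=0.0 means "always return the best candidate, however weak"
    best = difflib.get_close_matches(name, options, n=1, cutoff=0.0)
    return best[0] if best else None


if __name__ == "__main__":
    women = ["Mary", "Joan", "Michelle"]
    for man in ["John", "Michael", "Mary"]:
        print(man, "->", exact_or_closest(man, women))
```

In practice a cutoff above zero is probably worth setting, so that names with no reasonable counterpart stay unmatched instead of being paired with a poor candidate.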

How to get a single value without using group by Oracle

I have this data and I need to combine the fullname values from all rows into one row, and get a single value from the three identical order values. How can I do that without using GROUP BY?
Existing data
id order fullname
1 32 Jack Stinky Potato
2 32 Kevin Enormous Cucumber
3 32 Jerald Sad Onion
Expecting result
32 Jack Stinky Potato, Kevin Enormous Cucumber, Jerald Sad Onion
Using GROUP BY, I would write
select order, wm_concat(fullname) from EmployeeCards
group by order
or this, but it doesn't seem sensible:
select wm_concat(unique order), wm_concat(fullname) from EmployeeCards
Just using
select (unique order), wm_concat(fullname) from EmployeeCards
doesn't work. Which aggregate function should I use to get a single value? Thanks
Use LISTAGG:
SELECT
"order",
LISTAGG(fullname, ', ') WITHIN GROUP (ORDER BY id) AS fullnames
FROM EmployeeCards
GROUP BY
"order";
Also, please avoid naming your database objects (e.g. tables, columns, etc.) using reserved SQL keywords, such as ORDER.
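For readers less familiar with LISTAGG, the following Python sketch mimics what the query computes on the sample rows: values are grouped by the order column, sorted by id within each group, and joined with a separator. It is purely illustrative, and the function name `listagg_by_order` is hypothetical.

```python
from collections import defaultdict


def listagg_by_order(rows, sep=", "):
    """Group (id, order, fullname) rows by order and join names sorted by id."""
    grouped = defaultdict(list)
    for _row_id, order, fullname in sorted(rows):  # tuples sort by id first
        grouped[order].append(fullname)
    return {order: sep.join(names) for order, names in grouped.items()}


if __name__ == "__main__":
    rows = [
        (2, 32, "Kevin Enormous Cucumber"),
        (1, 32, "Jack Stinky Potato"),
        (3, 32, "Jerald Sad Onion"),
    ]
    print(listagg_by_order(rows))
```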

SQL test 21 combinations if a meal exists

I currently have a program that assigns meals to patients. A patient, however, can have a certain diet. If the assigned meal isn't allowed due to the diet, it automatically gets replaced by a replacement meal.
So database would be something like this:
Diet table: Diet_id, Diet_Name
Replacement table: meal_id, replacement_id, Diet_id
The thing I want to do is write a query in SQL Server to see if for every diet combination there's a replacement available.
For example:
A,B
A,C
A,B,C
A,B,X
A,B,C,X
A,B,...,X
So if diet with id 1 has no replacement for combination A,B,C I want the result to return the meal_id and diet combination that failed.
I currently have 21 different diets so that's fact(21) combinations. Too much to just iterate. Would there be an alternative to test all combinations?
some data:
https://dl.dropboxusercontent.com/u/3949922/diets.txt
To get just the IDs of diets that have no replacement at all:
Select diet_id from Diet
except
Select diet_id from Replacement

Fuzzy text searching in Oracle

I have a large Oracle DB table which contains street names for a whole country, which has 600000+ rows. In my application, I take an address string as input and want to check whether specific substrings of this address string matches one or many of the street names in the table, such that I can label that address substring as the name of a street.
Clearly, this should be a fuzzy text matching problem, there is only a small chance that the substring I query has an exact match with the street names in DB table. So there should be some kind of fuzzy text matching approach. I am trying to read the Oracle documentation at http://docs.oracle.com/cd/B28359_01/text.111/b28303/query.htm in which CONTAINS and CATSEARCH search operators are explained. But these seem to be used for more complex tasks like searching a match for the given string in documents. I just want to do that for a column of a table.
What do you suggest in this case? Does Oracle support this kind of fuzzy text matching query?
UTL_MATCH contains methods for matching strings and comparing their similarity. The edit distance, also known as the Levenshtein distance, might be a good place to start. Since one string is a substring, it may help to compare the edit distance relative to the size of the strings.
--Addresses that are most similar to each substring.
select substring, address, edit_ratio
from
(
--Rank edit ratios.
select substring, address, edit_ratio
,dense_rank() over (partition by substring order by edit_ratio desc) edit_ratio_rank
from
(
--Calculate edit ratio - edit distance relative to string sizes.
select
substring,
address,
(length(address) - UTL_MATCH.EDIT_DISTANCE(substring, address))/length(substring) edit_ratio
from
(
--Fake addresses (from http://names.igopaygo.com/street/north_american_address)
select '526 Burning Hill Big Beaver District of Columbia 20041' address from dual union all
select '5206 Hidden Rise Whitebead Michigan 48426' address from dual union all
select '2714 Noble Drive Milk River Michigan 48770' address from dual union all
select '8325 Grand Wagon Private Sleeping Buffalo Arkansas 72265' address from dual union all
select '968 Iron Corner Wacker Arkansas 72793' address from dual
) addresses
cross join
(
--Address substrings.
select 'Michigan' substring from dual union all
select 'Not-So-Hidden Rise' substring from dual union all
select '123 Fake Street' substring from dual
)
order by substring, edit_ratio desc
)
)
where edit_ratio_rank = 1
order by substring, address;
These results are not great but hopefully this is at least a good starting point. It should work with any language. But you'll still probably want to combine this with some language- or locale- specific comparison rules.
SUBSTRING           ADDRESS                                                 EDIT_RATIO
------------------  ------------------------------------------------------  ----------
123 Fake Street     526 Burning Hill Big Beaver District of Columbia 20041      0.5333
Michigan            2714 Noble Drive Milk River Michigan 48770                       1
Michigan            5206 Hidden Rise Whitebead Michigan 48426                        1
Not-So-Hidden Rise  5206 Hidden Rise Whitebead Michigan 48426                      0.5
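To make the edit ratio in the query concrete, here is a small Python sketch that computes the Levenshtein distance with the usual dynamic-programming recurrence and then applies the same formula as the SQL, (length(address) - edit_distance) / length(substring). The function names are hypothetical, and the sketch assumes UTL_MATCH.EDIT_DISTANCE is a plain Levenshtein distance as described above.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or keep if equal
            ))
        prev = cur
    return prev[-1]


def edit_ratio(substring, address):
    """Same formula as the SQL: close to 1 when substring occurs in address."""
    return (len(address) - edit_distance(substring, address)) / len(substring)


if __name__ == "__main__":
    address = "2714 Noble Drive Milk River Michigan 48770"
    print(edit_ratio("Michigan", address))
```

When the substring occurs verbatim in the address, the only edits needed are insertions of the remaining characters, so the ratio comes out as exactly 1, which matches the two Michigan rows in the results.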
You could make use of the SOUNDEX function available in Oracle databases. SOUNDEX computes a short phonetic code for a text string (a letter followed by three digits). This can be used to find strings which sound similar and thus reduce the number of string comparisons.
Edited:
If SOUNDEX is not suitable for your local language, you can ask Google for a phonetic signature or phonetic matching function which performs better. This function has to be evaluated once per new table entry and once for every query. Therefore, it does not need to reside in Oracle.
Example: A Turkish SOUNDEX is promoted here.
To increase the matching quality, the street name spelling should be unified in a first step. This could be done by applying a set of rules:
Simplified example rules:
Convert all characters to lowercase
Remove "str." at the end of a name
Remove "drv." at the end of a name
Remove "place" at the end of a name
Remove "ave." at the end of a name
Sort names with multiple words alphabetically
Drop auxiliary words like "of", "and", "the", ...
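The simplified rules above could be sketched in Python as follows. This is only an illustration under stated assumptions: the suffix and auxiliary-word lists contain just the examples given, and the function name `normalize_street` is hypothetical. A real rule set would need to be tailored to the language and locale.

```python
# Example suffixes and auxiliary words taken from the simplified rules.
SUFFIXES = ("str.", "drv.", "place", "ave.")
AUXILIARY_WORDS = {"of", "and", "the"}


def normalize_street(name):
    """Apply the simplified unification rules to one street name."""
    words = name.lower().split()            # lowercase, collapse whitespace
    if words and words[-1] in SUFFIXES:     # remove a trailing suffix
        words = words[:-1]
    words = [w for w in words if w not in AUXILIARY_WORDS]
    return " ".join(sorted(words))          # sort words alphabetically


if __name__ == "__main__":
    for raw in ["King George Str.", "George  King Ave.", "Street of the Kings"]:
        print(repr(raw), "->", repr(normalize_street(raw)))
```

Applying the same normalization to both the table entries and the query string before computing a phonetic code or an edit distance should make the comparison less sensitive to spelling conventions.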